The Econometrics Journal (2009), volume 12, pp. 1–25. doi: 10.1111/j.1368-423X.2008.00273.x
Identification and estimation of local average derivatives in non-separable models without monotonicity

Stefan Hoderlein† and Enno Mammen‡

†Department of Economics, Brown University, Robinson Hall #302C, Providence, RI 02912, USA. E-mail: [email protected]
‡Department of Economics, University of Mannheim, L 7, 3-5, 68131 Mannheim, Germany. E-mail: [email protected]

First version received: May 2008; final version accepted: October 2008
Summary  In many structural economic models there are no good arguments for additive separability of the error. Recently, this has motivated intensive research on non-separable structures. For instance, Hoderlein and Mammen (2007) considered a non-separable model in the single equation case and established that, in the absence of the frequently employed monotonicity assumption, local average structural derivatives (LASD) are still identified. In this paper, we introduce an estimator for the LASD. The estimator we propose is based on local polynomial fitting of conditional quantiles. We derive its large sample distribution through a Bahadur representation and give some related results, e.g. about the asymptotic behaviour of the quantile process. Moreover, we generalize the concept of LASD to include endogeneity of regressors and discuss the case of a multivariate dependent variable. We also consider identification of structured non-separable models, including single index and additive models. We discuss specification testing, as well as testing for endogeneity and for the impact of unobserved heterogeneity, and we show that fixed censoring can easily be addressed in this framework. Finally, we apply some of the concepts to demand analysis using British consumer data.

Keywords: IV, Non-parametric, Non-parametric identification, Non-separable model, Partial identification, Quantile regression, Weak axiom.
1. INTRODUCTION

1.1. The non-separable model

Models without additively separable error terms have become increasingly popular because they are natural tools for modelling general economic relationships empirically. The non-separable model takes the form

$$Y = \phi(X, A), \qquad (1.1)$$

where Y is a scalar response variable and X is an observable real-valued random d-vector, while A is an unobservable random variable. Often, the relationship between Y and some or all of
the regressors X is of key economic interest, whereas A is meant to capture omitted factors and all types of unobserved heterogeneity. There are instances in which economic theory suggests certain identifying restrictions on the functional dependence of φ on A. For instance, in the famous returns-to-schooling example, where Y denotes log wage and X individual covariates including years of schooling, one may be willing to assume that the only important omitted factor is ability, and that this factor ceteris paribus drives up wages. Therefore, in this example, monotonicity of φ in a scalar A is a reasonable assumption.¹ However, economists often think of A as capturing all kinds of unobservables and are reluctant to place any structure on its influence. A particularly important example is unobserved heterogeneity in preferences or technologies: in the textbook model of consumer choice, rationality places strong restrictions on the compensated price effects for any given utility function, but no restrictions on the impact of changes in utility functions (cf. Mas-Colell et al., 1997). In particular, φ can neither be assumed to be monotonic in utility functions, nor are utility functions scalars; in fact they are elements of an infinite-dimensional function space.²

¹ See Matzkin (2003) for further examples, mainly from the duration literature.
² Of course, as Chesher (2003) points out, there are also purely econometric examples of high-dimensional vectors of unobservables, including certain measurement error and mixture models, as well as models with nested errors.

1.2. Identification of LASD

This paper discusses estimation, application and extensions of the concept of local average structural derivatives (LASD). It makes use of the key identification result in Hoderlein and Mammen (2007), which establishes what can be learned from the quantiles about the marginal effect of one regressor, say x₁, on the dependent variable y in non-separable models of type (1.1) when no monotonicity assumption on the unobservables A is made. For fixed values x* ∈ R^d and 0 < α < 1, Hoderlein and Mammen (2007) show the following relationship between the derivative of the conditional quantile and the marginal effect of the non-separable function φ:

$$E[\partial_{x_1}\phi(X, A) \mid X = x^*, Y = k_\alpha(x^*)] = \partial_{x_1} k_\alpha(x^*), \qquad (1.2)$$
where $k_\alpha(x)$ denotes the conditional α-quantile of Y given X = x, i.e. for 0 < α < 1 the quantity $k_\alpha(x)$ is defined by $P(Y \le k_\alpha(x)|X = x) = \alpha$. Furthermore, $\partial_{x_1}$ denotes the partial derivative with respect to the first component of x. The result (1.2) holds under the technical Assumptions A2–A5 (stated in Section A.1.1) and the essential assumption that the random variables A and X₁ are conditionally independent, given X₂, . . . , X_d. For the result it is not necessary that A is a scalar. It is allowed to take values in a Borel space $\mathcal{A}$, i.e. a set that is homeomorphic to a Borel subset of the unit interval endowed with the Borel σ-field. This includes the case that A is a random element of a Polish space, e.g. a random piecewise continuous (utility) function. The result states that we can identify an average over the marginal effects $\partial_{x_1}\phi$ from the data. The derivative of the quantile is the best approximation to the underlying marginal effect, given all our information. To give an example, suppose we were given data on the expenditure for food by individuals, and some covariates, say, income, age (in decades) and gender. Then we may identify the average income effect for all women, age
40–50, at a given income and a given value of expenditure for food. But this will in general still be a heterogeneous group. Since the conditional expectation is a projection that minimizes the L²-distance, this means that LASDs are the best approximation of the true marginal effects given all the information at our disposal. It is also instructive to compare this with the mean regression E[Y|X = x*] = g(x*). In this case, $\partial_{x_1} g(x^*) = E[\partial_{x_1}\phi(X, A)|X = x^*]$ follows straightforwardly (see Altonji and Matzkin, 2005, as well as Hoderlein, 2002, 2008, for discussions). This result shows many parallels to (1.2). However, the derivative of the quantile is a conditional average that includes the information about the dependent variable in our information set as well.

1.3. Related literature

In econometrics, Roehrig (1988), extending earlier work of Brown (1983), was the first to consider identification in non-separable models formally. He considers identification in a system of equations, i.e. with multivariate Y, and gives conditions for global identification if there is continuous variation. His work was extended by Matzkin (2003). She examines the scenario where Y is a continuously distributed random scalar, X is continuously distributed and exogenous, and A is a scalar, w.l.o.g. U[0, 1] distributed. Moreover, she assumes that φ is monotonic in A and, as a consequence, achieves full identification of φ. Another closely related extension of Roehrig (1988) is Brown and Matzkin (1996), who discuss an extremum estimator for his system of equations. Chesher (2003, 2005), as well as Imbens and Newey (2003), consider identification in triangular systems of non-separable equations with, at least at some stage, monotonically entering errors and endogenous regressors. Whereas Imbens and Newey (2003) aim at global identification of φ with continuous regressors, Chesher aims at local (i.e. at a fixed position of the regressors) identification with continuous (2003), respectively discrete (2005), covariates. Moreover, Chesher (2003, 2005) gives more emphasis to identification, while Imbens and Newey (2003) consider estimation in detail. For the identification of average marginal effects, Imbens and Newey (2003), Altonji and Matzkin (2005) and Hoderlein (2002, 2008) give results without assuming monotonicity in unobservables, using derivatives of the mean regressions. We will explore the relationship between this line of research and our approach when discussing endogeneity. In the case of endogenous regressors, Florens et al. (2003) and Newey and Powell (2003) consider a non-parametric IV estimator, provided the errors enter additively. Another subclass of models with a more specific structure is considered in Florens et al. (2005), who also treat the case of endogenous regressors. Related work is Chernozhukov et al. (2007) and Chernozhukov and Hansen (2005), who assume marginal independence conditions. Finally, the main theme of the paper also shares similarities with the philosophy of partial identification (Manski, 2003), the work on heterogeneity in economic theory (Hildenbrand, 1993) and, last but not least, the work of Heckman and Vytlacil (1999, 2001), where potential outcomes are non-additive in unobservables.

1.4. Structure of the paper

This paper is structured as follows. In the next section, we discuss estimation of LASD by local polynomial quantile fitting and provide large sample theory for the estimator. In
Section 3, we provide a brief application of the core concepts and the estimator to British consumer data. In the fourth section, we provide several extensions, starting with the case of exogenous regressors. We discuss the estimation of a class of semi- and non-parametric submodels of (1.1), containing the weakly separable (WS) single index model, as well as the WS additive model. Moreover, we show how testing for unobserved heterogeneity, as well as specification testing, may be accomplished. Finally, we address the issue of censoring and conclude the paper with an outlook.
2. ESTIMATION PROCEDURES FOR LASD AND THEIR LARGE SAMPLE BEHAVIOUR

We consider model (1.1) with scalar response Y and vector X taking values in a compact subset I of R^d and propose estimates of m_α and m, defined as

$$m_\alpha(x) = E[\partial_{x_1}\phi(X, A)\,|\,X = x, Y = k_\alpha(x)], \qquad (2.1)$$

$$m(x, y) = E[\partial_{x_1}\phi(X, A)\,|\,X = x, Y = y]. \qquad (2.2)$$

Specifically, we propose to estimate the function m_α by a local polynomial smoother $\hat m_\alpha$. The function m is estimated by using that $m(x, y) = m_{\alpha(x,y)}(x)$, where $\alpha(x, y) = P(Y \le y|X = x)$. Employing a kernel smoothing estimate $\hat\alpha$ of α, we propose the following estimate of m:

$$\hat m(x, y) = \hat m_{\hat\alpha(x,y)}(x). \qquad (2.3)$$
We suppose that i.i.d. data (Y_i, X_i), 1 ≤ i ≤ n, are given. The estimator $\hat m_\alpha$ is defined as a local quadratic regression quantile estimator. For its calculation one has to minimize

$$\sum_{i=1}^{n} \tau_\alpha\big[Y_i - \mu_0 - \mu_1^T(X_i - x) - (X_i - x)^T \mu_2 (X_i - x)\big]\, K\big[h^{-1}(X_i - x)\big] \qquad (2.4)$$

over scalars μ₀, d-vectors μ₁ and d × d matrices μ₂. Here K is a product kernel function, h = (h₁, . . . , h_d) is a bandwidth vector and τ_α(u) = u[α − I(u < 0)] is the check function. The diagonal matrix with diagonal elements $h_1^{-1}, \ldots, h_d^{-1}$ is denoted by $h^{-1}$. The minimizers of (2.4) are denoted by $\hat\mu_0, \hat\mu_1, \hat\mu_2$. We define $\hat m_\alpha(x) = \hat\mu_{1,1}$, where $\hat\mu_{1,1}$ denotes the first element of the vector $\hat\mu_1$. We now develop an asymptotic theory for $\hat m_\alpha(x)$. It is based on a Bahadur representation of the estimator, which, along with the assumptions, can be found in the Appendix.
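To make the estimator concrete, the following Python sketch implements the local quadratic fit (2.4) at a point x. It is illustrative only: the function name, the Epanechnikov product kernel and the use of the QuantReg routine from statsmodels are our choices, not part of the paper. The kernel-weighted check-loss problem is converted to an unweighted one via the positive homogeneity τ_α(wu) = w τ_α(u) for w > 0.

```python
import numpy as np
import statsmodels.api as sm

def local_quadratic_quantile_fit(X, Y, x0, alpha, h):
    """Local quadratic quantile fit (2.4) at x0 (a hedged sketch).

    Returns (mu0_hat, mu1_hat): the estimated conditional quantile
    k_alpha(x0) and the estimated gradient of k_alpha at x0; the LASD
    estimate is mu1_hat[0].  All names are illustrative.
    """
    X, Y, x0, h = map(np.asarray, (X, Y, x0, h))
    n, d = X.shape
    U = (X - x0) / h                                  # h^{-1}(X_i - x)
    epan = lambda u: 0.75 * np.clip(1.0 - u ** 2, 0.0, None)
    w = np.prod(epan(U), axis=1)                      # K[h^{-1}(X_i - x)]
    keep = w > 0
    Xc, w, y = X[keep] - x0, w[keep], Y[keep]
    # design: intercept, linear terms, quadratic and cross terms
    quad = np.column_stack([Xc[:, i] * Xc[:, j]
                            for i in range(d) for j in range(i, d)])
    Z = np.column_stack([np.ones(len(y)), Xc, quad])
    # tau_alpha(w u) = w tau_alpha(u) for w > 0, so kernel weights can be
    # absorbed by scaling response and design rows before the quantile fit
    fit = sm.QuantReg(w * y, Z * w[:, None]).fit(q=alpha)
    return fit.params[0], fit.params[1:1 + d]
```

A call like `local_quadratic_quantile_fit(X, Y, x0, 0.5, h)[1][0]` then returns the median-based LASD estimate at x0.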
THEOREM 2.1. Under Assumptions C1–C5 (stated in Section A.1.1) it holds for x in the interior of I and α ∈ (0, 1) that

$$\left(\frac{f_{Y|X}(k_\alpha(x)|x)^2 f_X(x)\, n h_{\mathrm{prod}} h_1^2 \kappa_{2,1}^2}{\alpha(1-\alpha)\int u_1^2 K_1(u)^2\, du}\right)^{1/2}\left(\hat m_\alpha(x) - m_\alpha(x) - \frac{1}{6\kappa_{2,1}}\Big[\kappa_{4,1}\, k_\alpha^{111}(x)\, h_1^2 + 3\kappa_{2,1}\sum_{l=2}^{d}\kappa_{2,l}\, k_\alpha^{1ll}(x)\, h_l^2\Big]\right) \to N(0, 1)$$

in distribution, where $k_\alpha^{ijl}(x) = \partial_{x_i}\partial_{x_j}\partial_{x_l} k_\alpha(x)$ for 1 ≤ i, j, l ≤ d and where $\kappa_{4,1} = \int u^4 K_1(u)\,du$ and $\kappa_{2,l} = \int u^2 K_l(u)\,du$ for 1 ≤ l ≤ d. Furthermore, it holds that $\hat m_\alpha(x)$ and $\hat m_\beta(u)$ are asymptotically independent if x ≠ u, but they are asymptotically dependent if x = u. For x fixed, the process

$$\alpha \mapsto \left(\frac{f_{Y|X}(k_\alpha(x)|x)^2 f_X(x)\, n h_{\mathrm{prod}} h_1^2 \kappa_{2,1}^2}{\int u_1^2 K_1(u)^2\, du}\right)^{1/2}\left(\hat m_\alpha(x) - m_\alpha(x) - \frac{1}{6\kappa_{2,1}}\Big[\kappa_{4,1}\, k_\alpha^{111}(x)\, h_1^2 + 3\kappa_{2,1}\sum_{l=2}^{d}\kappa_{2,l}\, k_\alpha^{1ll}(x)\, h_l^2\Big]\right)$$

converges in distribution to a Brownian bridge B(α) for α in a closed subinterval of [0, 1].

Theorem 2.1 shows that $m(x, y) = E[Y'|X = x, Y = y]$ can be estimated with the same rate of convergence as $E[Y'|X = x]$, where we write in abuse of notation $Y' = \partial_{x_1}\phi(X, A)$. Note that $\partial_{x_1} E[Y|X] = \partial_{x_1} E[\phi(X, A)|X] = E[\partial_{x_1}\phi(X, A)|X] = E[Y'|X]$ under mild regularity assumptions. Thus $E[Y'|X = x]$ can be estimated with the same rate as a first-order partial derivative of a regression function from R^d to R. Because m goes from R^{d+1} to R, we get one additional dimension without losing speed of convergence of the estimator. This is interesting from a theoretical point of view, and it is quite natural, too: unconditional distribution functions and quantile functions can be estimated at the parametric rate, while for conditional quantile and distribution functions the (non-parametric) rate is determined by the dimension of the conditioning variables.

The expansions of Theorem 2.1 can be used to obtain asymptotic expressions for integrated squared errors. This gives formulas for asymptotically optimal bandwidths. Data-adaptive optimal bandwidths can be calculated by plugging in consistent estimators for the corresponding unknown terms. This is in line with plug-in approaches in classical non-parametric regression. The bandwidths depend on the criterion used, i.e. on whether the integrated squared error is minimized over all α and x, for fixed α, or for fixed x.
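As a hedged illustration of how such plug-in bandwidths behave, suppose all bandwidths are of the same order h. Theorem 2.1 then gives a variance of order $(n h^{d+2})^{-1}$ and a squared bias of order $h^4$, and balancing the two yields

$$h_{\mathrm{opt}} \asymp n^{-1/(d+6)}, \qquad \hat m_\alpha(x) - m_\alpha(x) = O_P\big(n^{-2/(d+6)}\big),$$

which is the usual optimal rate for a first-order derivative of a three-times differentiable regression function on R^d, in line with the discussion above. The constants suppressed here are exactly the density and kernel terms appearing in the theorem.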
We now discuss the estimator $\hat m$. This estimator is defined by (2.3), where

$$\hat\alpha(x, y) = \frac{1}{n g_{\mathrm{prod}}}\sum_{i=1}^{n} 1[Y_i \le y]\, L\big(g^{-1}(X_i - x)\big)$$

with a product kernel function L and bandwidth vector g = (g₁, . . . , g_d). As earlier, $g^{-1}$ is the diagonal matrix with diagonal elements $g_1^{-1}, \ldots, g_d^{-1}$, $g_{\mathrm{prod}} = g_1 \cdots g_d$ and $g_{\mathrm{max}} = \max_{1\le l\le d} g_l$.
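A minimal sketch of the plug-in estimator (2.3) follows, continuing the code above and reusing the hypothetical `local_quadratic_quantile_fit`. Note that we use a self-normalized (Nadaraya–Watson) variant of $\hat\alpha$, dividing by the sum of kernel weights so that the estimate lies in [0, 1] in finite samples; this is our implementation choice, not the paper's display.

```python
def alpha_hat(X, Y, x0, y0, g):
    """Kernel estimate of alpha(x, y) = P(Y <= y | X = x), self-normalized."""
    U = (np.asarray(X) - np.asarray(x0)) / np.asarray(g)
    epan = lambda u: 0.75 * np.clip(1.0 - u ** 2, 0.0, None)
    w = np.prod(epan(U), axis=1)              # L(g^{-1}(X_i - x))
    return float(np.sum(w * (np.asarray(Y) <= y0)) / np.sum(w))

def m_hat(X, Y, x0, y0, h, g):
    """Plug-in LASD estimate (2.3): m_hat(x, y) = m_hat_{alpha_hat(x,y)}(x)."""
    a = alpha_hat(X, Y, x0, y0, g)
    a = min(max(a, 0.01), 0.99)               # keep the quantile level interior
    return local_quadratic_quantile_fit(X, Y, x0, a, h)[1][0]
```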
The Bahadur expansion in the Appendix also implies asymptotic normality of $\hat m(x, y)$ if the difference between $\hat m(x, y)$ and $\hat m_{\alpha(x,y)}(x)$ is of second order. Because by definition $\hat m(x, y) = \hat m_{\hat\alpha(x,y)}(x)$, this holds if $\hat\alpha(x, y)$ converges to α(x, y) fast enough. This is guaranteed by Assumptions C6 and C7, stated in Section A.1.1.

THEOREM 2.2. Under Assumptions C1–C7 (stated in Section A.1.1) it holds for x in the interior of I and y ∈ R that
$$\left(\frac{f_{Y|X}(y|x)^2 f_X(x)\, n h_{\mathrm{prod}} h_1^2 \kappa_{2,1}^2}{\alpha(x,y)(1-\alpha(x,y))\int u_1^2 K(u)^2\, du}\right)^{1/2}\left(\hat m(x, y) - m(x, y) - \frac{1}{6\kappa_{2,1}}\Big[\kappa_{4,1}\, k_{\alpha(x,y)}^{111}(x)\, h_1^2 + 3\kappa_{2,1}\sum_{l=2}^{d}\kappa_{2,l}\, k_{\alpha(x,y)}^{1ll}(x)\, h_l^2\Big]\right) \to N(0, 1)$$

in distribution. Furthermore, it holds that $\hat m(x, y)$ and $\hat m(u, v)$ are asymptotically independent if x ≠ u, but they are asymptotically dependent if x = u. For x fixed, the process

$$y \mapsto \left(\frac{f_{Y|X}(y|x)^2 f_X(x)\, n h_{\mathrm{prod}} h_1^2 \kappa_{2,1}^2}{\int u_1^2 K(u)^2\, du}\right)^{1/2}\left(\hat m(x, y) - m(x, y) - \frac{1}{6\kappa_{2,1}}\Big[\kappa_{4,1}\, k_{\alpha(x,y)}^{111}(x)\, h_1^2 + 3\kappa_{2,1}\sum_{l=2}^{d}\kappa_{2,l}\, k_{\alpha(x,y)}^{1ll}(x)\, h_l^2\Big]\right)$$

converges in distribution to B(α(x, y)) for y in a compact set. Here B is a Brownian bridge.

Theorem 2.1 or 2.2 can be applied for the construction of confidence intervals for m_α(x) or m(x, y), respectively. This application requires consistent estimates of $f_{Y|X}$, $f_X$, $k_\alpha$ and $k_\alpha^{1ll}(x)$ for l = 1, . . . , d. The densities $f_{Y|X}$ and $f_X$ can be consistently estimated by kernel density estimates. A consistent estimate of $k_\alpha$ is given by $\hat\mu_0$, defined earlier as a minimizer of (2.4). The estimation of $k_\alpha^{1ll}(x)$ (l = 1, . . . , d) causes the usual problems of bias estimation. It can be consistently estimated because we assume that this quantity is continuous in x. This can be done by smooth differentiation of $\hat k_\alpha$ or by using local cubic polynomials with an undersmoothing bandwidth. Smooth differentiation is defined by $s_{\mathrm{prod}}^{-1}\int \partial_{x_1}\partial_{x_l}^2 K\big(s^{-1}(x - u)\big)\, \hat k_\alpha(u)\, du$ with an undersmoothing bandwidth vector s. We do not pursue this discussion here because it is in line with the usual approaches to bias estimation in non-parametric regression.
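To make the confidence-interval construction explicit, one hedged reading of Theorem 2.1 is the following pointwise interval, bias-corrected, with hatted plug-in estimates as just described and $z_{1-\gamma/2}$ the standard normal quantile:

$$\hat m_\alpha(x) - \hat b_\alpha(x) \pm z_{1-\gamma/2}\left(\frac{\alpha(1-\alpha)\int u_1^2 K_1(u)^2\,du}{\hat f_{Y|X}(\hat k_\alpha(x)|x)^2\, \hat f_X(x)\, n h_{\mathrm{prod}} h_1^2 \kappa_{2,1}^2}\right)^{1/2},\qquad \hat b_\alpha(x) = \frac{1}{6\kappa_{2,1}}\Big[\kappa_{4,1}\hat k_\alpha^{111}(x)h_1^2 + 3\kappa_{2,1}\sum_{l=2}^d \kappa_{2,l}\hat k_\alpha^{1ll}(x)h_l^2\Big].$$

Undersmoothing makes $\hat b_\alpha(x)$ asymptotically negligible, in which case the interval can be centred at $\hat m_\alpha(x)$ directly.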
3. AN EMPIRICAL APPLICATION: DEMAND ANALYSIS IN A HETEROGENEOUS POPULATION

In this section, we put some of the concepts and methods to work. We take our application from consumer demand, and start by giving an overview of the data, the methods of data cleaning and the definitions of the variables involved.
3.1. The data: FES

Every year, the British Family Expenditure Survey (FES) reports the income, expenditures, demographic composition and other characteristics of about 7000 households. The sample surveyed represents about 0.05% of all households in the United Kingdom. The information is collected partly by interview and partly by records: records are kept by each household member and include an itemized list of expenditures during 14 consecutive days. The periods of data collection are evenly spread out over the year. The information is then compiled and provides a repeated series of yearly cross-sections.

3.1.1. Grouping of goods, income definition and data cleaning. We consider demand for a single category of goods that is related to food consumption and consists of the subcategories food bought, food out (catering) and tobacco, which are self-explanatory. For brevity, we call this category food. 'Income' in demand analysis is total expenditure, under an assumption of additive separability of preferences over time and decisions. It is obtained by adding up all expenditures, with a few exceptions which are known to suffer from measurement error. This defines nominal income; real income is then obtained by dividing through the retail price index.
3.2. Issues in estimation

3.2.1. The issue of household characteristics. The role household covariates play in our approach is largely that of control variables that capture the observable part of preference heterogeneity. It is exactly for such a situation that the conditional independence assumption is ideal: for our purposes, the household covariates are nuisance directions and we are not interested in their derivatives. Hence, they are allowed to be arbitrarily correlated with the unobservables. Conditioning on their information can be done in a variety of ways. The route that we take in this paper is to stratify the population to obtain more homogeneous subpopulations. More specifically, like much of the demand literature, we focus on one subpopulation, namely two-person households, both adults, at least one of whom is working, with a head of household who is a white-collar worker. This focus is also justified because other subpopulations are much more prone to measurement problems.
3.3. Empirical results

The discussion of the empirical results will concentrate on the implementation of the main identification result. In particular, we illustrate the concepts using Figures 1 and 2. Nevertheless, we will be able to address a number of issues discussed earlier. In the figures we show the semi-elasticities of food demand with respect to income.

Figure 1. Marginal effect of income on the food budget share across quantiles of food budget share distribution.

Consider Figure 1 first. The solid line shows the semi-elasticities of demand of a household in the earlier mentioned group which is at the 10th percentile of the budget share distribution. We call this group the 'eat very little'. The x-axis displays log weekly income, while the y-axis shows the semi-elasticity. Obviously, the budget share of food of the 'eat very little' decreases strongly between 2 and
3 units of log income. From there on, the budget share continues to decrease, but at a lower rate, so that the decrease at an income level of around 4 is only half as strong. This result makes a lot of sense: it reflects the fact that food is a necessity whose relative importance diminishes. The upswing for high incomes is due to the fact that those individuals substitute 'food bought' by 'food out', which is more expensive. In spite of this, the importance of food diminishes even further, so that the food budget share falls by 0.25 over the income range displayed.

The dotted lines around the solid line are 90% bootstrap confidence bands. The bootstrap resample is chosen as $(X_i^*, Y_i^*)$ with $X_i^* = X_i$ and $Y_i^* = \hat k_{U_i}(X_i)$ (see the sketch below). Here $U_1, \ldots, U_n$ is an i.i.d. sample of random variables with uniform distribution on [0, 1] and $\hat k_\alpha(x)$ is a local quadratic quantile estimator with oversmoothed bandwidth. The bootstrap confidence bands show that the effect is significantly smaller than zero over the entire range, and among other things we see that the marginal effects for the 'eat very little' at low income and high income are significantly different as well. Moreover, a straight line through, e.g. y = −0.16 would be inside the dotted lines for incomes between 2 and 3, but it is outside for incomes bigger than 3, indicating that a single index specification would not be appropriate.
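The following sketch shows one resampling draw of this quantile bootstrap, reusing the hypothetical `local_quadratic_quantile_fit` from Section 2 with an oversmoothed bandwidth `h_over` (the oversmoothing factor is our choice; the paper does not specify one). It is computationally naive: each draw refits n local quantile regressions.

```python
def bootstrap_draw(X, Y, h_over, rng):
    """One bootstrap resample (X*, Y*): X*_i = X_i, Y*_i = k_hat_{U_i}(X_i)
    with U_i ~ U[0, 1] i.i.d. and k_hat an oversmoothed fit of (2.4)."""
    n = len(Y)
    U = rng.uniform(0.01, 0.99, size=n)   # keep quantile levels interior
    Y_star = np.array([
        local_quadratic_quantile_fit(X, Y, X[i], U[i], h_over)[0]
        for i in range(n)])
    return X, Y_star

# rng = np.random.default_rng(0); confidence bands then come from the
# empirical quantiles of m_hat recomputed on many such draws
```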
Observe in particular the dashed line, which gives the semi-elasticities of the 30th percentile people, the 'eat little'. Qualitatively, this group shows the same behaviour as the 'eat very little'. Recall that in our approach an individual household could be 'eat little' at low income and 'eat very little' at high income. Hence, the only comparison we should make between different quantiles is locally, for each value of x. For instance, from the fact that the 'eat very little' and the 'eat little' (almost) intersect at income 3, and the fact that the dashed line is within the confidence bounds at the same level, we cannot reject the null that all individuals within these two subgroups have the same preferences. The same is actually true at any point on the income scale: nowhere in the income range displayed are we able to reject the hypothesis that, within the subgroup defined by characteristics, households within the 'eat very little' and the 'eat little' subgroups have the same preferences.

Figure 2. Marginal effect of income on the food budget share across quantiles of food budget share distribution.

Contrast this with Figure 2: it shows exactly the same graphs as Figure 1, save for one difference: the dashed line represents not the 30th percentile 'eat little' group, but rather the 90th percentile 'eat a lot' group. This group is characterized by a stronger decline in their budget shares, i.e. a more negative semi-elasticity. Since this means that the conditional budget shares are getting ever closer, the result admits the following interpretation: at low income levels there are pronounced differences in the basic requirements. However, these differences in necessities disappear at higher income ranges, where the relative differences in demand behaviour for other, more luxurious, categories than food become large. More important for the theory presented in this paper is that the 'eat a lots' show a significant difference: they are outside the dotted confidence bands for most of the income range. That means that, at a certain income level (say 3), we may be able to reject the hypothesis that the individual households within the 'eat very little' group and the 'eat a lots' share the same preferences. However, note that at very high income levels we may still not be able to reject the hypothesis that the households in these groups share the same preferences. A closer discussion of these points needs the development of formal tests along the lines of (4.5) or (4.6). This issue, along with a more elaborate analysis, will be addressed elsewhere.
4. EXTENSIONS OF THIS MODELLING APPROACH

Thus far we have discussed the estimation of marginal effects of general form in the single equation case with exogenous regressors, and illustrated that this concept is straightforward to apply. However, in most microeconometric settings additional complications arise; some of them shall be discussed in this section. The first extension of our approach we discuss is towards systems of equations. Indeed, in many applications of microeconomic importance, individuals decide about more than just one issue, e.g. consumers purchase more than a single good. Unfortunately, we can show that the extension to the multivariate case is not straightforward. Since our analysis thus far is entirely non-parametric, another important issue we consider is that of semiparametric specifications. We propose both estimators with a semiparametric structure, as well as specification tests, including tests for the way heterogeneity enters the model. Finally, endogeneity is a major issue in most economic applications. We discuss what happens in our framework when the key identifying independence assumption does not hold, and show how endogeneity can be accommodated in this framework.

4.1. The multivariate case

We now discuss whether the identification theorem of Hoderlein and Mammen (2007) can be generalized to the case of a multivariate Y. As in the literature on non-separable models reviewed above, the case of a scalar dependent variable is less problematic, whereas the case of multivariate Y presents additional difficulties. Even with the monotonicity assumption, additional assumptions like triangularity have to be invoked; see Chesher (2003) as well as Imbens and Newey (2003), and references therein. Unsurprisingly, when using our more general approach, we also encounter difficulties in the multivariate case. We show by a counterexample that a generalization of the Hoderlein and Mammen (2007) theorem to the multivariate case is not possible without substantial additional assumptions. Before considering this example, we briefly mention a result that is an analogue of this theorem in the multivariate case and illustrates what may be learned from the data in this setting. For simplicity, we only consider bivariate responses Y = (Y₁, Y₂)′ and a scalar valued observable X. We assume that there exists a measurable function φ = (φ₁, φ₂)′ from R × $\mathcal{A}$ to R² with Y = φ(X, A). Our result in the multivariate case is the following:

$$\partial_x f_{Y_1,Y_2|X}(y_1, y_2|x^*) = \partial_{y_1}\int y_1'\, f_{Y_1,\partial_x\phi_1,Y_2|X}(y_1, y_1', y_2|x^*)\, dy_1' + \partial_{y_2}\int y_2'\, f_{Y_1,Y_2,\partial_x\phi_2|X}(y_1, y_2, y_2'|x^*)\, dy_2'. \qquad (4.1)$$
For a proof of (4.1), see the Appendix.

REMARK 4.1. The right-hand side of (4.1) could also be written as trace[H], where H is a matrix with elements $H_{ij} = \partial_{y_j} h_i$ and $h_i = E[\partial_x\phi_i|Y_1, Y_2, X]\cdot f_{Y_1,Y_2|X}$. It is important to note that the objects of interest, e.g. $E[\partial_x\phi_i|Y_1, Y_2, X]$, are not identified.
This result suggests that the Hoderlein and Mammen (2007) theorem may not generalize to higher dimensions. We now show this formally by giving a counterexample. This example makes a similar point in our more general setting as Benkard and Berry (2006) do for the case of a function monotonic in a scalar A.

EXAMPLE 4.1. For independent random variables U, V and X, where U and V have a standard normal distribution N(0, 1), we define

$$Y_1 = G_1\{[\cos(\rho(X))U + \sin(\rho(X))V]\, X\},$$
$$Y_2 = G_2\{[-\sin(\rho(X))U + \cos(\rho(X))V]\, X\},$$

where ρ is an arbitrary unknown function. Then the joint distribution of (X, Y₁, Y₂) does not depend on ρ! And in accordance with (1.2), both $E[\partial_x\phi_1|Y_1, X]$ and $E[\partial_x\phi_2|Y_2, X]$ do not depend on ρ. This follows from

$$\partial_x\phi_1 = \left[\frac{G_1^{-1}(Y_1)}{X} + G_2^{-1}(Y_2)\,\rho'(X)\right] G_1'\big(G_1^{-1}(Y_1)\big),$$
$$\partial_x\phi_2 = \left[\frac{G_2^{-1}(Y_2)}{X} - G_1^{-1}(Y_1)\,\rho'(X)\right] G_2'\big(G_2^{-1}(Y_2)\big).$$

These representations immediately imply

$$E[\partial_x\phi_1|Y_1, X] = \frac{G_1^{-1}(Y_1)}{X}\, G_1'\big(G_1^{-1}(Y_1)\big), \qquad E[\partial_x\phi_2|Y_2, X] = \frac{G_2^{-1}(Y_2)}{X}\, G_2'\big(G_2^{-1}(Y_2)\big),$$

because, conditionally on X, the rotated pair $(G_1^{-1}(Y_1), G_2^{-1}(Y_2))/X$ is again bivariate standard normal, so its components are independent with mean zero. Thus both expressions are independent of ρ. But $E[\partial_x\phi_1|Y_1, Y_2, X]$ depends on ρ according to

$$E[\partial_x\phi_1|Y_1, Y_2, X] = \left[\frac{G_1^{-1}(Y_1)}{X} + G_2^{-1}(Y_2)\,\rho'(X)\right] G_1'\big(G_1^{-1}(Y_1)\big).$$

Thus, because ρ is assumed to be unknown, $E[\partial_x\phi_1|Y_1, Y_2, X]$ is not identifiable. This argument could be extended to the case where ρ(X) is replaced by ρ(X, R) with a random variable R that is independent of (U, V, X).

This example suggests that the multivariate case is rather hopeless without invoking strong additional assumptions like triangularity, which may be hard to justify on economic grounds. However, we may circumvent this problem by considering implications of hypotheses in systems of equations in one-dimensional subspaces without additional assumptions.
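As a quick numerical illustration of the observational equivalence in Example 4.1, the following hedged sketch compares the distribution of Y₁ under two very different choices of ρ. The specific G₁ = tanh and X ~ U[1, 2] are our illustrative choices: since an orthogonal rotation leaves the N(0, I₂) law of (U, V) invariant, the two samples should be indistinguishable.

```python
import numpy as np
from scipy import stats

def draw_y1(rho, n=100_000, seed=0, G1=np.tanh):
    """Simulate Y1 = G1([cos(rho(X))U + sin(rho(X))V] X) from Example 4.1."""
    rng = np.random.default_rng(seed)
    U, V = rng.standard_normal((2, n))
    X = rng.uniform(1.0, 2.0, n)              # illustrative choice of X
    r = rho(X)
    return G1((np.cos(r) * U + np.sin(r) * V) * X)

# two very different rho's; rotating (U, V) leaves N(0, I_2) invariant,
# so the observable law of (X, Y1, Y2) is the same for both
y_const = draw_y1(lambda x: np.zeros_like(x), seed=1)
y_wild = draw_y1(lambda x: 5.0 * np.sin(3.0 * x), seed=2)
print(stats.ks_2samp(y_const, y_wild))        # large p-value expected
```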
4.2. Application to models and test procedures

In this section, we provide some extensions to more specific structures of the general form (1.1). First, we consider identification of several semi- and non-parametric model specifications that may lead to estimation procedures not discussed in this paper. Since it is a common problem in econometrics, and easily tractable within this framework, we also treat fixed censoring here. Second, we discuss specification analysis in this very general framework: we provide a test for the presence of unobserved heterogeneity, and we consider tests for index-type specifications of model (1.1).

4.2.1. Identification of semi- and non-parametric models. As is common in non-parametric analysis, the curse of dimensionality makes it imperative to place some structure on the function φ so that the LASDs are estimable with data sets commonly encountered in practice. Arguably the most popular semiparametric model with additive scalar errors is the single index model. By analogy, define the weakly separable single index model (WS-SIM) as follows:

$$Y = \phi(X, A) = \psi(X'\beta, A).$$

From the main identification theorem (1.2), we get that

$$\nabla_x k_\alpha(x) = \beta\, E[\partial_z \psi(X'\beta, A)\,|\,X = x, Y = k_\alpha(x)], \qquad (4.2)$$
where $\partial_z$ denotes the derivative w.r.t. the index and $\nabla_x$ denotes the gradient. In particular, with a weight function w,

$$\int \nabla_x k_\alpha(\xi)\, w(\xi)\, d\xi \qquad (4.3)$$

identifies β up to scale, for all α, and β could be estimated by an average quantile derivative estimator, as in Chaudhuri et al. (1997), imposing e.g. ‖β‖ = 1. Our approach allows us to integrate (4.2) over α and x with a weighting function depending on α and x. Consequently, β would also be identified by $\int\int \nabla_x k_\alpha(\xi)\, v(\xi, \alpha)\, d\xi\, d\alpha$, where v is a weighting function. This class of estimators includes the choice v(ξ, α) = v(ξ), which yields the weighted average mean regression derivative estimator, $\int E[\partial_x\phi(X, A)|X = \xi]\, v(\xi)\, d\xi = \int\int E[\partial_x\phi(X, A)|X = \xi, Y = k_\alpha(\xi)]\, v(\xi)\, d\alpha\, d\xi$. This may be seen as an important advantage of our approach, as it allows us to increase efficiency relative to weighted average (mean regression or quantile) derivative estimators for β (see the sketch below).
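A hedged sketch of such a weighted average quantile derivative estimator for the WS-SIM index direction follows. It reuses the hypothetical `local_quadratic_quantile_fit` from Section 2, averages the estimated quantile gradients over evaluation points and quantile levels with uniform weights (our illustrative choice of v), and normalizes β₁ = 1 (one possible scale convention).

```python
def sim_direction(X, Y, h, eval_pts, alphas):
    """Average quantile derivative estimate of beta in Y = psi(X'beta, A),
    up to scale: mean of grad_x k_hat_alpha(xi) over xi and alpha."""
    grads = [local_quadratic_quantile_fit(X, Y, xi, a, h)[1]
             for a in alphas for xi in eval_pts]
    beta = np.mean(grads, axis=0)
    return beta / beta[0]          # normalize beta_1 = 1 (scale convention)

# e.g. sim_direction(X, Y, h, eval_pts=X[:50], alphas=[0.25, 0.5, 0.75])
```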
Another very popular class of non-parametric models is the class of additive models, which also nests the partially linear model. Define the weakly separable additive model (WS-AM) as follows:

$$Y = \phi(X, A) = \psi\Big(\sum_j \gamma_j(X_j), A\Big), \qquad (4.4)$$

where the subscript j denotes the jth component. For related mean and quantile regression models with additive error terms, see also Christopeit and Hoderlein (2006), Horowitz (2001) and Horowitz and Mammen (2007a,b). Using again the main identification theorem (1.2), we obtain that

$$\nabla_x k_\alpha(x) = \beta(x)\, E\Big[\partial_z \psi\Big(\sum_j \gamma_j(X_j), A\Big)\,\Big|\, X = x, Y = k_\alpha(x)\Big] = \beta(x)\, c(x, \alpha),$$

where $\beta(x) = (\gamma_1'(x_1), \ldots, \gamma_d'(x_d))'$. After imposing the scale normalization ‖β(x)‖ = 1 for all x, we may identify c(x, α) for all x. This suggests a normalized marginal quantile integration estimator, i.e.

$$\int \partial_{x_1} k_\alpha(x_1, \xi_{-1})\, \big(c(x_1, \xi_{-1}, \alpha)\big)^{-1} w(\xi_{-1})\, d\xi_{-1} \quad \text{for all } \alpha,$$

as well as

$$\int\int \partial_{x_1} k_\alpha(x_1, \xi_{-1})\, \big(c(x_1, \xi_{-1}, \alpha)\big)^{-1} v(\xi_{-1}, \alpha)\, d\xi_{-1}\, d\alpha,$$
to estimate $\beta_1(x_1) = \gamma_1'(x_1)$. Obviously, the same arguments regarding efficiency can be made.

4.2.2. Specification testing in this framework

A test for the influence of unobserved heterogeneity. Recall our major result, $E[\partial_{x_1}\phi(X, A)|X = x, Y = k_\alpha(x)] = m_\alpha(x)$, where the function $m_\alpha$ may change with α. If unobserved heterogeneity is not allowed to have an effect on the derivative, then $m_\alpha(x) = \partial_{x_1}\lambda(x)$, where the right-hand side is a function that is independent of α. The fact that the marginal effect does not depend on α could result from a model of the type $Y = \lambda(X) + \varphi(X_{-1}, A)$. Hence, this hypothesis also has an interpretation as postulating that the model is (partially) additively separable in the error. Rewriting the null hypothesis,

$$\int\Big(m_\alpha(x) - \int m_\beta(x)\, g(\beta)\, d\beta\Big)^2 g(\alpha)\, d\alpha = 0, \qquad (4.5)$$

where g denotes a weighting function. This hypothesis is for one fixed value x only. Averaging the test statistic over a range of values of x may determine whether heterogeneity has an effect for some values of x. An integrated version of this hypothesis is given by

$$\int\int\Big(m_\alpha(x) - \int m_\beta(x)\, g(\beta)\, d\beta\Big)^2 g(\alpha)\, d\alpha\, w(x)\, dx = 0, \qquad (4.6)$$

where w is a weight function. Sample counterparts have the form

$$\int\Big(\hat m_\alpha(x) - \int \hat m_\beta(x)\, g(\beta)\, d\beta\Big)^2 g(\alpha)\, d\alpha \quad \text{and} \quad \int\int\Big(\hat m_\alpha(x) - \int \hat m_\beta(x)\, g(\beta)\, d\beta\Big)^2 g(\alpha)\, d\alpha\, w(x)\, dx.$$

An alternative test could be based on the sup norm, using

$$T_n = \sup_{\alpha \in J}\, g(\alpha)\,\Big|\hat m_\alpha(x) - \int_J \hat m_\beta(x)\, g(\beta)\, d\beta\Big|.$$
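For concreteness, a hedged sketch of a discretized sample counterpart of (4.5) at a fixed x follows; a uniform g on a grid of quantile levels is our illustrative choice, and `local_quadratic_quantile_fit` is the hypothetical helper from Section 2.

```python
def heterogeneity_stat(X, Y, x0, h, alphas):
    """Discretized sample version of (4.5): weighted dispersion of the
    LASD estimates m_hat_alpha(x0) across quantile levels alphas."""
    m = np.array([local_quadratic_quantile_fit(X, Y, x0, a, h)[1][0]
                  for a in alphas])
    g = np.full(len(alphas), 1.0 / len(alphas))   # uniform weighting g
    m_bar = np.sum(g * m)                          # int m_beta g(beta) dbeta
    return np.sum(g * (m - m_bar) ** 2)            # approx. L2 statistic

# large values indicate that unobserved heterogeneity affects the derivative;
# critical values would come from the bootstrap of Section 3.3
```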
The asymptotic behaviour of this test statistic follows immediately from Theorem 2.1. Theorem 2.2 suggests using g(β) equal to a consistent estimate of $f_{Y|X}(k_\beta(x)|x)\big[\int_J f_{Y|X}(k_\gamma(x)|x)\, d\gamma\big]^{-1}$. With such a choice,

$$\big(n h_{\mathrm{prod}} h_1^2\big)^{1/2}\Big[\int u_1^2 K_1^2(u)\, du\Big]^{-1/2} f_X(x)^{1/2}\Big[\int_J f_{Y|X}(k_\gamma(x)|x)\, d\gamma\Big]^{-1}\kappa_{2,1}\, T_n$$

converges in distribution to

$$\sup_{\alpha\in J}\Big|B(\alpha) - \frac{1}{\lambda(J)}\int_J B(\beta)\, d\beta\Big|,$$

where B is a Brownian bridge and λ is the Lebesgue measure. In the limit for J → [0, 1], this coincides with the asymptotic distribution of a two-sided Kolmogorov–Smirnov test statistic.

Specification tests and model choice. After determining the impact of unobserved heterogeneity, we may ask whether the function φ is completely amorphous, or whether it has some structure. Paralleling the section on estimation, we also provide a test for the index structure. As was already noted earlier, under the null of an index structure the ratio of two derivatives is constant and does not depend on α. This opens up the way to non-parametric specification tests, e.g. by comparing the ratio of any two derivatives with a constant. Since this has to hold for any α, we either get a large number of hypotheses, or we may increase the power of tests by using a weighted average of the tests that arise from these hypotheses. Finally, we may test for additivity as in (4.4), but we may also be able to test for another generalization of the linear index restriction, i.e.

$$Y = \phi(X, A) = \psi(\zeta(Z), P, A), \qquad (4.7)$$
where $X = (Z', P')'$. In this case, $\partial_{z_1} k_\alpha(x)/\partial_{z_2} k_\alpha(x) = \partial_{z_1}\zeta(z)/\partial_{z_2}\zeta(z)$, implying that the ratio is only a function of z. Again, the ratio should also be invariant across α, and both facts open up the way for specification tests.

4.3. Censored data

In a number of applications there is exogenous censoring of the data. More formally, consider a model with fixed censoring:

$$Y^* = \phi(X, A) \quad \text{and} \quad Y = 1[Y^* > 0]\, Y^*, \qquad (4.8)$$

where X and Y are observed. In this case, (1.2) yields $E[\partial_{x_1}\phi|X = x^*, Y = k_\alpha(x^*)] = m_\alpha(x^*)$ as long as $k_\alpha(x^*) > 0$, because for $k_\alpha(x^*) > 0$ the conditional α-quantile of Y and the conditional α-quantile of Y* coincide. Similar arguments can be applied to more complicated settings of censoring.

4.4. Endogeneity

Thus far all of our analysis requires that the error be conditionally independent of the regressors. In economics it is common to believe that this assumption is violated. In the non-parametric mean regression case with additively separable errors, endogeneity of regressors has proven to be a difficult problem. Although estimators do exist, e.g. Newey and Powell (2003), Darolles et al. (2003) and Hall and Horowitz (2003), their speed of convergence might be very slow in some cases.
In this section, we establish the following results. First, we show precisely what goes wrong when we do not invoke the conditional independence assumption required for identification. Second, we show that the LASD is well identified under a control function assumption, and estimable under general conditions with standard speed of convergence. Third, we propose a test for endogeneity. Fourth, we show the relationship to Altonji and Matzkin (2005) as well as Imbens and Newey (2003) as far as the estimation of derivatives is concerned. Finally, we give a counterexample which establishes that independence of errors and instruments alone is not sufficient to identify the LASD.

4.5. The role of the conditional independence assumption

For the identification result of the LASD in Hoderlein and Mammen (2007), the essential assumption was that the random variables A and X₁ are conditionally independent, given X₂, . . . , X_d. In this section, we discuss the importance of the conditional independence assumption by highlighting what happens if it does not hold. In the case of dependent X and A, the following theorem may be used to obtain bounds on the marginal effects.

THEOREM 4.1. For fixed values x* ∈ R^d and 0 < α < 1, assume that Assumptions A1–A5 (stated in Section A.1.1) hold. Then

$$E[\partial_{x_1}\phi(X, A)|X = x^*, Y = k_\alpha(x^*)] = m_\alpha(x^*) + l_\alpha(x^*),$$

where
$$l_\alpha(x^*) = \frac{E\big\{1[Y \le k_\alpha(x^*)]\,\partial_{x_1}\ln f_{A|X}(A|x^*)\,\big|\, X = x^*\big\}}{f_{Y|X}(k_\alpha(x^*)|x^*)}.$$
REMARK 4.2. This result illustrates the importance of the conditional independence assumption. In its absence, the derivative of the conditional quantile contains both the best projection of the underlying marginal effect $\partial_{x_1}\phi$ and a distributional effect that indicates how the composition of the unobservables changes as we vary the level of the covariates. The conditional expectation $E\{1[Y \le k_\alpha(x^*)]\,\partial_{x_1}\ln f_{A|X}(A|x^*)\,|\, X = x^*\}$ is not identified without additional assumptions. In this paper, we will not discuss such assumptions, but Theorem 4.1 could be used as a starting point for weakening the independence Assumption A1.

If the conditional independence assumption breaks down, we adopt the standard terminology and call X₁ endogenous. We now discuss possible solutions to this problem.

4.5.1. A control function approach for the unrestricted case. Let the model be given by Y = φ(X, A), but A is now not independent of X₁ conditionally on X₂, . . . , X_d. However, we have instruments Z with the following property: define U as the unobservables in the mapping X₁ = μ(Z, X₂, . . . , X_d, U), with U independent of Z conditionally on X₂, . . . , X_d. Then, assume that A is independent of Z, conditional on U and X₂, . . . , X_d. Call this assumption (ACF). It implies that A is independent of X, conditional on U and X₂, . . . , X_d. Then, due to (1.2),

$$m_\alpha(x, u) = E[\partial_{x_1}\phi(X, A)\,|\,X = x, U = u, Y = k_\alpha(x, u)]$$

for any α. Note that no monotonicity in U is required for this argument. However, since U has to be used as a regressor, it must be pre-estimated and therefore additional assumptions have to
be imposed; for instance, that μ is monotone in U and that U is uniform on [0, 1] (conditionally on Z and X₂, . . . , X_d). This can be seen as a control function approach (CF-IV) to the endogeneity problem.

4.5.2. Testing endogeneity. A test for endogeneity under (ACF) can be built on the following observation. In the following, we assume throughout that A is independent of X, conditional on U and X₂, . . . , X_d. The assumption to be tested is whether A is independent of X, conditional on X₂, . . . , X_d. Under the null of conditional independence, we have

$$m_\alpha(x) = E\big\{E[\partial_{x_1}\phi(X, A)|X, U, Y]\,\big|\, X = x, Y = k_\alpha(x)\big\},$$

but this equals

$$m_\alpha(x) = \int M_\alpha(x, u)\, f_{U|X,Y}(u|x, y_\alpha)\, du,$$

where $M_\alpha(x, u)$ denotes the derivative (with respect to x₁) of the conditional α-quantile of Y given X and U, and $y_\alpha = k_\alpha(x)$. This suggests again using the L²-distance between both sides of the equality to test for the validity of the exogeneity assumption. An empirical test statistic for the global validity of the exogeneity assumption is the following:

$$\int\int\int \big(\hat M_\alpha(x, u) - \hat m_\alpha(x)\big)^2 w(\alpha, x, u)\, d\alpha\, dx\, du,$$

where the hats denote standard non-parametric estimators as in Section 2 and w is a weighting function.

4.5.3. The relationship to Imbens and Newey (2003) and Altonji and Matzkin (2005). As far as the estimation of average structural marginal effects is concerned, both Altonji and Matzkin (2005) and Imbens and Newey (2003) use the fact that

$$E[\partial_{x_1}\phi(X, A)|X] = E\big\{E[\partial_{x_1}\phi(X, A)|X, U]\,\big|\, X\big\} = E\big[\partial_{x_1} M(X, U)\,\big|\, X\big], \qquad (4.9)$$

where $M(x, u) = E[Y|X = x, U = u]$, to obtain an estimator for the LASD. Similar arguments were frequently used in Hoderlein (2002). In (4.9), conditioning was done with respect to σ(X). In the approach put forward in this paper, we use more information, i.e.

$$E\big[\partial_{x_1}\phi(X, A)\,\big|\, X, Y = y_\alpha\big] = E\big\{E[\partial_{x_1}\phi(X, A)|X, U, Y]\,\big|\, X, Y = y_\alpha\big\} = E\big[\partial_{x_1} k_\alpha(X, U)\,\big|\, X, Y = y_\alpha\big] \quad \text{for all } \alpha, \qquad (4.10)$$

where $y_\alpha$ is shorthand for the conditional α-quantile. If we are interested in obtaining average derivatives over the entire population, then from both quantities, i.e. $E[\partial_{x_1}\phi(X, A)|X]$ and $E[\partial_{x_1}\phi(X, A)|X, Y]$, overall averages may be obtained. However, if we also consider weighted average derivatives, our approach allows us to consider a larger class of averages, as the weights may also depend on Y. This may be important for policy considerations. Last, but by no means least, note again that an estimator that uses (4.10) gives a closer approximation to the true underlying $\partial_{x_1}\phi$ than one based on the right-hand side of (4.9).

4.5.4. The limitations of traditional IV. In general, the LASD is not identified in the traditional IV setting. Assume that in the model Y = φ(X, A) the error variable A is not
independent of X₁ (conditionally on X₂, . . . , X_d). We show that for identification of the LASD it does not suffice to have an instrument Z that is independent of A. We demonstrate this with a class of counterexamples. Suppose that three scalar random variables Z, A and B are given. Moreover, assume that Z is independent of A, that Z is independent of B, but that Z is not independent of the tuple (A, B). In particular, we assume that Z is not independent of ρ(A, B), where ρ(A, B) is a function that is strictly monotone in a for fixed b. For a function φ: R² → R we suppose that we observe X, Y and Z with X = ρ(A, B) and Y = φ(X, A). Because ρ is monotone in a, there exists a function λ: R² → R with A = λ(X, B). Then Y = ψ(X, B) with ψ(x, b) = φ[x, λ(x, b)]. We now have two representations of our data, Y = φ(X, A) and Y = ψ(X, B), with error variables A or B, respectively. By construction, Z is an instrument in both specifications. The LASD is a conditional expectation of $\partial_x\phi(X, A)$ or of $\partial_x\psi(X, B) = \partial_x\phi(X, A) + \partial_a\phi(X, A)\,\partial_x\lambda(X, B)$, and it is clear that in general (conditional) expectations of these two expressions differ. This shows that for identifiability of the LASD it does not suffice that the instrument is independent of the error variable; more structural assumptions are needed. It does not help to assume that A is a scalar and that φ is monotone in A. This can be seen by a slight modification of our example: assume additionally that φ is monotone in A and that ρ(a, b) is monotone in b for fixed a. Then ψ is monotone in B, and we have two representations of our data in which the error variable enters the model monotonically. We conjecture that the additional assumptions would have to exclude a complicated non-parametric notion of weakness of the instrument.
5. CONCLUSION AND OUTLOOK

In this paper, we were concerned with the non-separable model

$$Y = \phi(X, A), \qquad (5.1)$$

where Y and X are observable real-valued random p- and d-vectors, while A is an element of a Borel space. The key innovation is that we do not place any restrictions on A or on its influence. Nevertheless, we were able to show identification of local average structural derivatives (LASD), a conditional expectation of the marginal effects that exhausts all the information given in the entire data. More specifically, our main result links derivatives of conditional quantiles to conditional average structural functions. From this perspective, the quantile regression and the mean regression are not mutually exclusive competitors, but different projections from model (5.1), which characterizes a heterogeneous population. Since the derivative of the conditional quantile can be seen as a conditional expectation w.r.t. a larger σ-algebra, it is closer in the L²-distance sense, and should therefore be preferred in the single equation case. However, the mean regression works well in the multiple equation case, whereas we established that this case is not easily tractable using quantiles.

Another area of application for the principles developed in this paper are other operators acting on the function φ. For instance, in applications where risk-taking behaviour is analysed, the second derivative of φ may be of interest. Also, differentials, integrals or other objects like the Slutsky matrix could be considered in a similar way as in this paper.

This paper has furthermore established that many econometric methods may be generalized to this very general class of models. We provided the large sample theory necessary to handle
the asymptotic distribution of all estimators, and showed how the bootstrap may be performed. Moreover, we discussed the identification of structured models, specifically weakly separable single index and additive models. In addition, we established how specification testing may be performed in model (5.1): it is possible to test for the influence of unobserved heterogeneity and for many semiparametric specifications. Starting from the asymptotic results given in this paper, the large sample theory of all test statistics may be developed, but we leave this to a companion paper. Further specification analysis may also be performed in a similar fashion as in this paper. More specifically, this issue may be combined with relaxing the exogeneity assumption (conditional independence of X₁ and A).

The issue of relaxing the exogeneity assumption played an important role in this paper. We have established that a control function approach works neatly and provides a natural extension of the exogenous case. Moreover, we were able to provide quantile analogues to the results on estimation of average marginal effects under endogeneity in Imbens and Newey (2003) and Altonji and Matzkin (2005) without assuming monotonicity. In addition, we also suggested a test for endogeneity. Finally, we established that simple independence between instruments and unobservables is not sufficient to identify the LASD. In this paper, we argued in favour of the control function assumption, because it yields identification of the LASD in the absence of any major additional assumption, and hence provides a robust and convenient route. Hoderlein (2008) discusses the economic content of the exogeneity assumption for the case of consumer demand and the mean regression. However, it remains to be established how tractable this assumption is in general. There may well be applications where researchers are only willing to assume the marginal independence condition as in Chernozhukov et al. (2007). It is clear from our analysis that additional assumptions, e.g. on the functional form or the dependence structure between all variables, are needed for identification. What type of additional assumption suits a specific type of application remains to be determined, and is a challenging question for future research.
ACKNOWLEDGMENTS

The authors are indebted to Andrew Chesher, Joel Horowitz, Oliver Linton, Rosa Matzkin, Whitney Newey, Jim Powell and seminar participants at the ESWC, EMS Oslo, Bergen, Berlin, Göttingen, Frankfurt, Madrid, Mannheim, Northwestern, Strasbourg, Tübingen and UCL/IFS for helpful comments. The usual disclaimer applies. Financial support by the Landesstiftung Baden-Württemberg 'Eliteförderungsprogramm' is gratefully acknowledged.
REFERENCES

Altonji, J. and R. Matzkin (2005). Cross section and panel data estimators for nonseparable models with endogenous regressors. Econometrica 73, 1053–103.
Benkard, L. and S. Berry (2006). Nonparametric identification of nonlinear simultaneous equation models. Econometrica 74, 1429–40.
Billingsley, P. (1968). Convergence of Probability Measures. New York: John Wiley.
Brown, D. and R. Matzkin (1996). Estimation of nonparametric functions in simultaneous equation models, with an application to consumer demand. Working paper, Northwestern University.
Chaudhuri, P., K. Doksum and A. Samarov (1997). On average derivative quantile estimation. Annals of Statistics 25, 715–44.
Chernozhukov, V. and C. Hansen (2005). An IV model of quantile treatment effects. Econometrica 73, 245–61.
Chernozhukov, V., G. Imbens and W. Newey (2007). Instrumental variables identification and estimation via quantile conditions. Journal of Econometrics 139, 4–14.
Chesher, A. (2003). Identification in nonseparable models. Econometrica 71, 1405–43.
Chesher, A. (2005). Nonparametric identification under discrete variation. Econometrica 73, 1525–50.
Christopeit, N. and S. Hoderlein (2006). Local partitioned regression. Econometrica 74, 787–817.
Darolles, S., J. P. Florens and E. Renault (2003). Nonparametric instrumental regression. Working paper, IDEI, Université de Toulouse.
Fan, J., T. Hu and Y. Truong (1994). Robust non-parametric function estimation. Scandinavian Journal of Statistics 21, 433–46.
Florens, J. P., J. Heckman, C. Meghir and E. Vytlacil (2003). Instrumental variables, local instrumental variables and control functions. Working paper, IDEI, Université de Toulouse.
Hall, P. and J. Horowitz (2003). Nonparametric methods for inference in the presence of instrumental variables. CWP 02/03, Centre for Microdata Methods and Practice, IFS and UCL.
Heckman, J. and E. Vytlacil (1999). Local instrumental variables and latent variable models for identifying and bounding treatment effects. Proceedings of the National Academy of Sciences 96, 4730–34.
Heckman, J. and E. Vytlacil (2001). Local instrumental variables. In C. Hsiao and K. Morimune (Eds.), Nonlinear Statistical Inference: Essays in Honour of Takeshi Amemiya, 1–46. Cambridge: Cambridge University Press.
Hildenbrand, W. (1993). Market Demand: Theory and Empirical Evidence. Princeton, NJ: Princeton University Press.
Hoderlein, S. (2002). Econometric modelling of heterogeneous consumer behaviour: theory, empirical evidence and aggregate implications. Ph.D. thesis, LSE.
Hoderlein, S. (2008). How many consumers are rational? Working paper, Brown University.
Hoderlein, S. and E. Mammen (2007). Identification of marginal effects in nonseparable models without monotonicity. Econometrica 75, 1513–18.
Horowitz, J. (2001). Nonparametric estimation of a generalized additive model with an unknown link function. Econometrica 69, 499–514.
Horowitz, J. and E. Mammen (2007a). Oracle-efficient nonparametric estimation of an additive model with an unknown link function. Working paper, University of Mannheim.
Horowitz, J. and E. Mammen (2007b). Rate-optimal estimation for a general class of nonparametric regression models with unknown link functions. Annals of Statistics 35, 2589–619.
Imbens, G. and W. Newey (2003). Identification and estimation of triangular simultaneous equations models without additivity. Working paper, MIT.
Manski, C. F. (2003). Partial Identification of Probability Distributions. New York: Springer.
Mas-Colell, A., M. Whinston and J. Green (1997). Microeconomic Theory. Oxford: Oxford University Press.
Matzkin, R. (2003). Nonparametric estimation of nonadditive random functions. Econometrica 71, 1339–76.
Newey, W. and J. Powell (2003). Instrumental variable estimation of nonparametric models. Econometrica 71, 1565–78.
Roehrig, C. (1988). Conditions for identification in nonparametric and parametric models. Econometrica 56, 433–47.
APPENDIX

A.1. Assumptions and Proofs

A.1.1. Assumptions. The following assumptions are needed to show a Bahadur representation for $\hat m_\alpha(x)$; see Theorem A.1 below. Theorem A.1 will be used to prove Theorems 2.1 and 2.2.

ASSUMPTION C1. The random tuples (X_i, Y_i) are i.i.d. The random variables X_i take values in a compact subset I ⊂ R^d.

ASSUMPTION C2. The conditional density $f_{Y|X}(y|x)$ of Y given X is uniformly continuous in x and y and bounded from below and from above on R × I. The density $f_X$ of X is bounded from below and continuous on I.

ASSUMPTION C3. For 0 < α < 1 all partial derivatives of $k_\alpha$ of order 3 exist in I and are bounded.

ASSUMPTION C4. There exists a β > 0 with $n^\beta h_l \to 0$ and $n^{1-\beta} h_{\mathrm{prod}} \to \infty$ for l = 1, . . . , d and $h_{\mathrm{prod}} = h_1 \cdots h_d$. The kernel K is a product kernel K(u) = K₁(u₁) · . . . · K_d(u_d). The functions K₁, . . . , K_d are symmetric probability density functions with bounded support.

ASSUMPTION C5. With $h_{\mathrm{max}} = \max_{1\le l\le d} h_l$ it holds that $n h_{\mathrm{prod}} h_1^2 h_{\mathrm{max}}^4 = O(1)$.

These assumptions are rather standard. In Assumption C3 we assume the existence of three derivatives to obtain an asymptotic expansion of the bias for our estimate of the derivative of $k_\alpha$. Assumption C5 requires that no oversmoothing is used: as can be seen from Theorem A.1, if the variance part and the bias part of $\hat m_\alpha(x)$ are of the same order, then $(n h_{\mathrm{prod}} h_1^2)^{-1}$ and $h_{\mathrm{max}}^4$ are also of the same order. Assumption C5 is used to avoid a higher-order expansion of the bias with its additional smoothness assumptions. For the result of Theorem 2.2 we need that $\hat\alpha(x, y)$ converges to α(x, y) fast enough. This is guaranteed by the following assumptions.

ASSUMPTION C6. For the bandwidth vectors h and g it holds that $h_{\mathrm{prod}} h_1^2/g_{\mathrm{prod}} \to 0$ and $n g_{\mathrm{max}}^4 h_{\mathrm{prod}} h_1^2 \to 0$. The kernel L is a product kernel L(u) = L₁(u₁) · . . . · L_d(u_d). The functions L₁, . . . , L_d are symmetric probability density functions with bounded support.
ASSUMPTION C7. All partial derivatives of α(x, y) of order two with respect to x exist and are bounded.

For the general identification result, as well as for the discussion in Section 4.5, we make use of the following assumptions.

ASSUMPTION A1. (a) For fixed a ∈ $\mathcal{A}$ the function $\phi(x_1, x_{-1}^*, a)$ is continuous in x₁ at x₁ = x*₁.
(b) $P[\phi(x_1, x_{-1}^*, A) = k_\alpha(x^*)\,|\,X = x^*] = 0$ for x₁ in a neighbourhood of x*₁.
(c) The conditional distribution of A given X = (x₁, x*₋₁) is absolutely continuous w.r.t. the conditional distribution of A given X = x* for x₁ in a neighbourhood of x*₁. It holds that

$$\left|\frac{f_{A|X}(a|x_1^* + \delta, x_{-1}^*)}{f_{A|X}(a|x^*)} - 1\right| \le |\delta|\, g(a)$$

for |δ| small enough, for a measurable function g fulfilling E[g(A)|X = x*] < +∞. The function $x_1 \mapsto f_{A|X}(a|x_1, x_{-1}^*)$ is differentiable at x₁ = x*₁ for all a ∈ $\mathcal{A}$.
ASSUMPTION A2. The conditional distribution of $Y$ given $X$ is absolutely continuous w.r.t. the Lebesgue measure for $x_1$ in a neighbourhood of $x_1^*$ and for $x_{-1} = x_{-1}^*$. Here we use the notation $x_{-1} = (x_2, \ldots, x_d)'$. The density $f_{Y|X}(y \mid x_1, x_{-1}^*)$ of $Y$ given $X$ is continuous in $(y, x_1)$ at the point $(y, x_1) = (k_\alpha(x^*), x_1^*)$. The conditional density $f_{Y|X}(y \mid x^*)$ of $Y$ given $X = x^*$ is bounded in $y \in \mathbb{R}$.

ASSUMPTION A3. $k_\alpha(x)$ is partially differentiable with respect to the first component at $x = x^*$.

ASSUMPTION A4. There exists a measurable function $\psi$ satisfying
$$P\big(|\phi(x_1^* + \delta, x_{-1}^*, A) - \phi(x^*, A) - \delta\,\psi(A)| \ge \varepsilon\delta \mid X = x^*\big) = o(\delta)$$
for $\delta \to 0$ and fixed $\varepsilon > 0$. We also write $\partial_{x_1}\phi(x^*, a)$ for $\psi(a)$ and $\partial_{x_1}\phi$ or $\partial_{x_1}\phi(x^*, A)$ for $\psi(A)$.

ASSUMPTION A5. The conditional distribution of $(Y, \partial_{x_1}\phi)$, given $X$, is absolutely continuous w.r.t. the Lebesgue measure for $x = x^*$. For the conditional density $f_{Y, \partial_{x_1}\phi \mid X}$ of $(Y, \partial_{x_1}\phi)$ given $X$ the following inequality holds with a constant $C$ and a positive density $g$ on $\mathbb{R}$ with finite mean (i.e. $\int |y'| g(y')\,dy' < \infty$):
$$f_{Y, \partial_{x_1}\phi \mid X}(y, y' \mid x^*) \le C g(y').$$

Assumptions A2–A5 are as the ones used for the theorem in Hoderlein and Mammen (2007). There, instead of Assumption A1, it has been assumed that the random variables $A$ and $X_1$ are conditionally independent, given $X_2, \ldots, X_d$.
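Before turning to the proofs, the following minimal sketch illustrates the kind of local quadratic quantile fit that underlies $\hat m_\alpha$ in the one-dimensional case; it is an illustration only, with an Epanechnikov kernel, a derivative-free optimizer and function names of our own choosing, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import minimize

def check_loss(u, alpha):
    # Quantile check function tau_alpha(u) = u * (alpha - 1{u < 0}).
    return u * (alpha - (u < 0))

def local_quadratic_quantile(x0, X, Y, alpha, h):
    """Local quadratic fit of the conditional alpha-quantile at x0 (d = 1).

    Returns local estimates of (k_alpha(x0), k_alpha'(x0), k_alpha''(x0)/2).
    """
    u = (X - x0) / h
    w = np.maximum(0.75 * (1.0 - u ** 2), 0.0)  # Epanechnikov kernel weights

    def objective(theta):
        mu0, mu1, mu2 = theta
        resid = Y - mu0 - mu1 * (X - x0) - mu2 * (X - x0) ** 2
        return np.sum(check_loss(resid, alpha) * w)

    start = np.array([np.quantile(Y, alpha), 0.0, 0.0])
    return minimize(objective, start, method="Nelder-Mead").x

# Toy usage: Y = X + heteroscedastic noise; derivative of the median is ~1.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, 500)
Y = X + (0.5 + 0.5 * X ** 2) * rng.standard_normal(500)
print(local_quadratic_quantile(0.0, X, Y, alpha=0.5, h=0.3))
```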
A.1.2. Proofs of Theorems 2.1 and 2.2. The following result states a Bahadur representation for $\hat m_\alpha(x)$. The theorem is the basic tool for the proofs of Theorems 2.1 and 2.2.

THEOREM A.1. (Bahadur representation). Under Assumptions C1–C4 the following property holds uniformly for $\alpha$ in a closed subset of $(0, 1)$ and for $x$ in a closed subset of $I$ that does not contain boundary points:
$$\hat m_\alpha(x) - m_\alpha(x) = -\kappa_{2,1}^{-1} f_{Y|X}(k_\alpha(x)|x)^{-1} f_X(x)^{-1} \frac{1}{n h_{prod} h_1^2} \sum_{i=1}^n (X_{i,1} - x_1) K\big(h^{-1}(X_i - x)\big)$$
$$\times \Big\{ I[Y_i \le k_\alpha(X_i)] - \alpha + F_{Y|X}\Big[k_\alpha(x) + D_x k_\alpha(x)^T (X_i - x) + \tfrac{1}{2}(X_i - x)^T D_{xx} k_\alpha(x)(X_i - x) \,\Big|\, X_i\Big] - F_{Y|X}[k_\alpha(X_i)|X_i] \Big\} + o_P\big((n h_{prod} h_1^2)^{-1/2}\big),$$
where $\kappa_{2,1} = \int u^2 K_1(u)\,du$ and where $D_x k_\alpha(x)$ (or $D_{xx} k_\alpha(x)$) is the vector (or matrix) of first (second) order partial derivatives.

Proof. For simplification of notation we give the proof only for $d = 1$. We start the proof similarly to the proof of Theorem 2 in Fan et al. (1994). Put
$$\hat\theta = \hat\theta(\alpha, x) = \sqrt{nh}\,\big(\hat\mu_0 - k_\alpha(x),\; h(\hat\mu_1 - k_\alpha'(x)),\; h^2(\hat\mu_2 - \tfrac{1}{2} k_\alpha''(x))\big)'.$$
The vector $\hat\theta$ minimizes
$$G_n(\theta) = G_{n,\alpha,x}(\theta) = \sum_{i=1}^n \Big[ \tau_\alpha\big(Y_i^* - \theta^T Z_i / \sqrt{nh}\big) - \tau_\alpha(Y_i^*) \Big] K\Big(\frac{X_i - x}{h}\Big),$$
where $Z_i = [1, (X_i - x)/h, (X_i - x)^2/h^2]'$ and
$$Y_i^* = Y_i^*(\alpha, x) = Y_i - k_\alpha(x) - k_\alpha'(x)(X_i - x) - \tfrac{1}{2} k_\alpha''(x)(X_i - x)^2.$$
Put
$$W_n(\theta) = W_{n,\alpha,x}(\theta) = \frac{1}{\sqrt{nh}} \sum_{i=1}^n \tau_\alpha'(Y_i^*)\, \theta^T Z_i\, K\Big(\frac{X_i - x}{h}\Big),$$
where $\tau_\alpha'(u) = \alpha - 1[u < 0]$ denotes the derivative of the check function $\tau_\alpha$.
For the proof of the theorem we will make use of the following two lemmas.
LEMMA A.1. For all $\eta > 0$ it holds for $\gamma > 0$ small enough and for a closed interval $J \subset (0, 1)$ that
$$\sup_{\|\theta\| \le n^\gamma,\, \alpha \in J,\, x \in I} \big| G_n(\theta) + W_n(\theta) - E[G_n(\theta) + W_n(\theta) \mid X_1, \ldots, X_n] \big| = o_P(1).$$

LEMMA A.2. For all $\eta > 0$ it holds for $\gamma > 0$ small enough and for a closed interval $J \subset (0, 1)$ that
$$\sup_{\|\theta\| \le n^\gamma,\, \alpha \in J,\, x \in I} \Big| E[G_n(\theta) + W_n(\theta) \mid X_1, \ldots, X_n] - \frac{1}{2nh} \sum_{i=1}^n f_{Y|X}\Big(k_\alpha(x) + k_\alpha'(x)(X_i - x) + \tfrac{1}{2} k_\alpha''(x)(X_i - x)^2 \,\Big|\, X_i\Big) (\theta^T Z_i)^2 K\Big(\frac{X_i - x}{h}\Big) \Big| = o_P(1).$$
Lemma A.1 follows by application of Bernstein's inequality. Note that
$$G_n(\theta) + W_n(\theta) = -\sum_{i=1}^n \big(Y_i^* - (nh)^{-1/2}\theta^T Z_i\big)\Big\{ 1\big[Y_i^* - (nh)^{-1/2}\theta^T Z_i < 0\big] - 1[Y_i^* < 0] \Big\} K[(X_i - x)/h]$$
is a sum of independent random variables that are absolutely bounded by $C(nh)^{-1/2}\|\theta\|$ with a positive constant $C$. For a proof of Lemma A.2 one uses a Taylor expansion of $E[G_n(\theta) + W_n(\theta) \mid X_1, \ldots, X_n]$. Lemma A.2 immediately implies that
$$\sup_{\|\theta\| \le n^\gamma,\, \alpha \in J,\, x \in I} \Big| E[G_n(\theta) + W_n(\theta) \mid X_1, \ldots, X_n] - \frac{1}{2} f_{Y|X}(k_\alpha(x)|x) f_X(x)\, \theta^T \begin{pmatrix} 1 & 0 & \kappa_{2,1} \\ 0 & \kappa_{2,1} & 0 \\ \kappa_{2,1} & 0 & \kappa_{4,1} \end{pmatrix} \theta \Big| = o_P(1).$$
We now use the fact that $G_n$ is a convex function and that it is approximated in the last equation by another convex function. This shows that the location of the minimum of $G_n$ is approximated by the location of the minimum of the approximating function. This implies that, uniformly for $\alpha \in J$ and $x \in I$,
$$\hat m_\alpha(x) - m_\alpha(x) = -\kappa_{2,1}^{-1} f_{Y|X}(k_\alpha(x)|x)^{-1} f_X(x)^{-1} \frac{1}{nh^3} \sum_{i=1}^n (X_i - x) K[(X_i - x)/h]\, \big\{ 1[Y_i^* < 0] - \alpha \big\} + o_P\big((nh^3)^{-1/2}\big).$$
The theorem now follows from
$$\frac{1}{nh^3} \sum_{i=1}^n (X_i - x) K[(X_i - x)/h]\, \Big\{ 1[Y_i^* < 0] - 1[Y_i < k_\alpha(X_i)] - E\big[ 1[Y_i^* < 0] - 1[Y_i < k_\alpha(X_i)] \,\big|\, X_i \big] \Big\} = o_P\big((nh^3)^{-1/2}\big).$$

Proof of Theorem 2.1: The theorem follows by application of Theorem A.1. The bias term can be calculated by using Taylor expansions and standard smoothing theory applied to
$$\big(n h_{prod} h_1^3\big)^{-1} \sum_{i=1}^n (X_{i,1} - x_1) K\big(h^{-1}(X_i - x)\big) \Big\{ F_{Y|X}\Big[k_\alpha(x) + D_x k_\alpha(x)^T(X_i - x) + \tfrac{1}{2}(X_i - x)^T D_{xx} k_\alpha(x)(X_i - x) \,\Big|\, X_i\Big] - F_{Y|X}[k_\alpha(X_i)|X_i] \Big\}.$$
Convergence of the process
$$B_n(\alpha) = \Big[ \frac{2 f_{Y|X}(k_\alpha(x)|x)^2 f_X(x)\, n h_{prod} h_1^2\, \kappa_{2,1}^2}{\int u_1^2 K^2(u)\,du} \Big]^{1/2} \Big\{ \hat m_\alpha(x) - m_\alpha(x) + \frac{1}{6\kappa_{2,1}} \Big( 3\kappa_{2,1} \sum_{l=2}^d \kappa_{2,l}\, k_\alpha^{1ll}(x) h_l^2 - \kappa_{4,1}\, k_\alpha^{111}(x) h_1^2 \Big) \Big\}$$
follows by application of a tightness criterion (e.g. Theorem 15.6 in Billingsley, 1968) to its Bahadur approximation
$$\tilde B_n(\alpha) = -f_X(x)^{-1/2} (n h_{prod})^{-1/2} h_1^{-2} \Big[\int u_1^2 K^2(u)\,du\Big]^{-1/2} \sum_{i=1}^n (X_{i,1} - x_1) K\big(h^{-1}(X_i - x)\big)\big\{ I[Y_i \le k_\alpha(X_i)] - \alpha \big\}.$$

Proof of Theorem 2.2: Note that $\hat\alpha(x, y) - \alpha(x, y)$ is of order $O_P\big((n g_{prod})^{-1/2} + g_{max}^2\big)$. This expansion holds for fixed $x$ and uniformly for $y$ in a compact interval $J$. We will make use of the following fact: for all sequences $c_n \to 0$ and constants $\delta > 0$ it holds that
$$\sup_{|\alpha - \beta| \le c_n,\; \delta \le \alpha, \beta \le 1 - \delta} \big|\tilde B_n(\beta) - \tilde B_n(\alpha)\big| = o_P(1).$$
The Econometrics Journal (2009), volume 12, pp. 45–61. doi: 10.1111/j.1368-423X.2008.00259.x

Determining the number of factors in a multivariate error correction–volatility factor model

QIAOLING LI† AND JIAZHU PAN‡

†School of Mathematical Sciences, Peking University, Beijing 100871, China
E-mail: [email protected]
‡Department of Statistics and Modelling Science, University of Strathclyde, Livingstone Tower, Richmond Street, Glasgow G1 1XH, UK
E-mail: [email protected]

First version received: March 2007; final version accepted: August 2008
Summary In order to describe the co-movements in both the conditional mean and the conditional variance of a high dimensional non-stationary time series by dimension reduction, we introduce conditional heteroscedasticity with a factor structure into the error correction model (ECM). The new model is called the error correction–volatility factor (EC–VF) model. Some specification and estimation approaches are developed. In particular, the determination of the number of factors is discussed. Our setting is general in the sense that we impose neither an i.i.d. assumption on the idiosyncratic components in the factor structure nor independence between factors and idiosyncratic errors. We illustrate the proposed approach with a Monte Carlo simulation and a real data example. Keywords: Co-integration, Dimension reduction, Error correction–volatility factor model, Model selection, Penalized goodness-of-fit criteria.
1. INTRODUCTION

The concept of co-integration (Granger, 1981, Granger and Weiss, 1983, and Engle and Granger, 1987) has been successfully applied to modelling multivariate non-stationary time series. The literature on co-integration is extensive. The most frequently used representations for a co-integrated system are the ECM of Engle and Granger (1987), the common trends form of Stock and Watson (1988) and the triangular model of Phillips (1991). The error correction model has been applied to various practical problems, such as determining exchange rates, capturing the relationship between expenditure and income, and modelling and forecasting inflation. From the equilibrium point of view, the term 'error correction' reflects the correction of the long-run relationship by the short-run dynamics. However, the ECM ignores time-varying volatility, which plays an important role in various financial areas such as portfolio selection, option evaluation and risk management. Kroner and Sultan (1993) argued that the neglect of either co-integration or time-varying volatility would affect the hedging performance of existing models in the literature for the futures market. A similar conclusion has been reached by Ghost (1993) and Lien (1996) through
empirical calculation and theoretical analysis, respectively. Therefore, the traditional ECM needs to be generalized to allow conditional heteroscedasticity, so as to capture both co-integration and time-varying volatility.

Univariate volatility models have been extended to the multivariate case. Extensions of the generalized autoregressive conditional heteroscedastic (GARCH) model (Bollerslev, 1986) include, e.g. the vectorized GARCH (VEC-GARCH) model of Bollerslev et al. (1988), the BEKK model of Engle and Kroner (1995), the dynamic conditional correlation (DCC) model of Engle (2002) and Engle and Sheppard (2001), and the generalized orthogonal GARCH model of van der Weide (2002); see the survey of multivariate GARCH models by Bauwens et al. (2006).¹ These models assume that a vector transformation of the covariance matrix can be written as a linear combination of its lagged values and the innovations. Andersen et al. (1999) showed that these models perform well relative to competing alternatives. But the curse of dimensionality becomes a major obstacle in applications. A useful approach to simplifying the dynamic structure of a multivariate volatility process is to use factor models. As is well known, factor models have been used for performance evaluation and risk measurement in finance. Moreover, it is now widely accepted that financial volatilities move together over time across assets and markets (Anderson et al., 2006). This makes it reasonable to impose a factor structure on the residual term of a multivariate error correction model. In this sense, an error correction–volatility factor (EC–VF) model can capture the features of co-movements in both the conditional mean (co-integration) and the conditional variance (volatility factors) of a high dimensional time series.

The contribution of this paper is the estimation of the EC–VF model. The set of parameters is divided into three subsets: the structural parameter set, including the lag order and all autoregressive coefficient vectors and matrices; the co-integration parameter set, including the co-integration vectors and the rank; and the factor parameter set, including the factor loading matrix and the number of factors. We conduct a two-step procedure to estimate the relevant parameters. First, assuming that the structural and co-integration parameters are known, we give an estimator of the factor loading matrix in the volatility factor model, and then give a method to determine the number of factors consistently. Our model specification and estimation approaches are general, because we impose neither an i.i.d. assumption on the idiosyncratic components in the factor structure nor independence between factors and idiosyncratic errors. In contrast to the innovation expansion method of Pan and Yao (2008) and Pan et al. (2007), where the algorithm for determining the number of factors cannot be proved consistent, our method is based on a penalized goodness-of-fit criterion, and we prove that our estimator of the number of factors is consistent. Secondly, the structural and co-integration parameters are consistently estimated without knowing the true factor structure. The main distinction between Bai and Ng (2002) and this paper is that their factor model concerns the unconditional mean of economic variables, while our factor structure is imposed on the conditional variance to reduce the dimension of the volatilities.

The rest of the paper is organized as follows. Section 2 defines the EC–VF model and mentions some practical background of the model. Section 3 presents an information criterion for determining the number of factors and establishes the consistency of our estimator. In Section 4, a simple Monte Carlo simulation is conducted to check the accuracy of the proposed estimation of the factor loading matrix and the number of factors. In Section 5, an application to financial risk management is discussed to show the advantages of the EC–VF model over traditional alternatives. All theoretical proofs are given in the Appendix.

1 The early version of Engle and Kroner (1995) was written by Baba, Engle, Kraft and Kroner, which led to the name BEKK for their model.
2. MODEL

2.1. Definition

Suppose that $\{Y_t\}$ is a $d \times 1$ time series. The EC–VF model is of the form
$$\Delta Y_t = \mu + \Gamma_1 \Delta Y_{t-1} + \Gamma_2 \Delta Y_{t-2} + \cdots + \Gamma_{k-1} \Delta Y_{t-k+1} + \Pi_0 Y_{t-1} + Z_t, \qquad Z_t = A F_t + e_t, \tag{2.1}$$
where $\Delta Y_t = Y_t - Y_{t-1}$, $\mu$ is a $d \times 1$ vector and $\Gamma_i$, $i = 1, \ldots, k-1$, are $d \times d$ matrices. The rank of $\Pi_0$, denoted by $m$, is called the co-integration rank. $\{Z_t\}$ is strictly stationary with $E(Z_t \mid \mathcal{F}_{t-1}) = 0$ and $Var(Z_t \mid \mathcal{F}_{t-1}) = \Sigma_z(t)$, where $\mathcal{F}_t = \sigma(Z_t, Z_{t-1}, \ldots)$. $F_t$ is an $r \times 1$ time series with $r < d$ unknown, and $A$ is a $d \times r$ unknown constant matrix. $F_t$ and $e_t$ are assumed to satisfy
$$E(F_t \mid \mathcal{F}_{t-1}) = 0, \quad E(e_t \mid \mathcal{F}_{t-1}) = 0, \quad E(F_t e_t' \mid \mathcal{F}_{t-1}) = 0, \quad E(e_t e_t' \mid \mathcal{F}_{t-1}) = \Sigma_e, \tag{2.2}$$
where $\Sigma_e$ is a positive definite matrix that does not depend on $t$. The components of $F_t$ are called 'factors', and $r$ is the number of factors. Note that $F_t$ and $e_t$ are conditionally uncorrelated. There is no loss of generality in assuming that $E(F_t F_t')$ is an $r \times r$ positive definite matrix (otherwise the model may be expressed equivalently in terms of a smaller number of factors).

REMARK 2.1. The error term $\{Z_t\}$ in an EC–VF model is conditionally heteroscedastic and follows a factor structure, while the error term in the traditional ECM developed by Engle and Granger (1987) is covariance stationary with mean 0. Here the factor structure is not the classical one, because we assume neither that the idiosyncratic components $e_t$ are i.i.d. with a diagonal covariance matrix nor that the factor component $F_t$ is independent of $e_t$. Model (2.1) assumes that the volatility dynamics of $Y_t$ is determined by the lower-dimensional volatility dynamics of $F_t$ and the static variation of $e_t$, as
$$\Sigma_y(t) = \Sigma_z(t) = A \Sigma_f(t) A' + \Sigma_e, \tag{2.3}$$
where $\Sigma_y(t) = Var(Y_t \mid \mathcal{F}_{t-1})$ and $\Sigma_f(t) = Var(F_t \mid \mathcal{F}_{t-1})$. Without loss of generality, we assume rank$(A) = r$. The lower-dimensional volatility dynamics $\Sigma_f(t)$ can be fitted by, e.g. the dynamic conditional correlation model of Engle (2002) or the conditionally uncorrelated components model of Fan et al. (2008).

2.2. Practical background

Factor analysis is an effective way of reducing dimension, which makes it a useful statistical tool for modelling multivariate volatility. Because there might exist co-integration relationships among financial asset prices, the framework given by equation (2.1) applies to many cases of financial analysis.

2.2.1. Value-at-risk. Value-at-risk (VaR) is the maximum expected loss on an investment over a specified horizon at a given confidence level, and is used by many financial institutions as a key measurement of market risk. The VaR of a portfolio of multiple assets can be obtained when the prices are described by an EC–VF model. The EC–VF model can also be used to determine
an optimal portfolio based on maximizing expected returns subject to a downside risk constraint measured by VaR.

2.2.2. Hedge ratio. The importance of incorporating the co-integration relationship into the statistical modelling of spot and futures prices is well documented in the literature on futures markets. It has been shown in Lien and Luo (1994) that, although a GARCH model may characterize the price behaviour, the co-integration relationship is the only indispensable component when comparing the ex post performance of various hedging strategies. A hedger who omits the co-integration relationship will adopt a smaller than optimal futures position, which results in relatively poor hedging performance; see Lien and Tse (2002) for a survey on hedging and references therein.

2.2.3. Multi-factor option. A multi-factor option (or multi-asset option) is an option whose payoff depends upon the performance of two or more underlying assets. Basket and rainbow options belong to this category. Duan and Pliska (2004) investigated theoretical and practical aspects of such options when the multiple underlying assets are co-integrated. In particular, they proposed an ECM with stochastic volatilities that follow a multivariate GARCH process. To avoid introducing too many parameters, they gave a parsimonious diagonal model for the volatilities, but it is rather restrictive for the cross-dynamics. In contrast, volatility factor models can be used for reducing dimension as well as for representing the dynamics of both variances and covariances. The EC–VF model, with some modification, is more suitable for valuing multi-factor options.
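To make the data-generating process in (2.1) concrete, the following minimal sketch simulates a small EC–VF system with one co-integration relation and one GARCH(1,1) volatility factor. All numerical values here are illustrative choices of ours, not parameters from the paper.

```python
import numpy as np

def simulate_ecvf(n=1000, d=4, seed=0):
    """Simulate dY_t = mu + gamma (alpha' Y_{t-1}) + Z_t, Z_t = A F_t + e_t,
    with a single GARCH(1,1) volatility factor F_t (model (2.1) with k = 1)."""
    rng = np.random.default_rng(seed)
    mu = np.zeros(d)
    gamma = np.array([-0.2, 0.2, -0.1, 0.1])   # loadings; alpha'gamma = -0.5
    alpha = np.array([1.0, -1.0, 0.5, -0.5])   # co-integration vector
    A = np.ones(d) / np.sqrt(d)                # factor loadings, A'A = 1
    b0, b1, b2 = 0.02, 0.10, 0.76              # GARCH(1,1) coefficients

    Y = np.zeros((n, d))
    F, sig2 = 0.0, b0 / (1 - b1 - b2)          # start at unconditional variance
    for t in range(1, n):
        sig2 = b0 + b1 * F ** 2 + b2 * sig2
        F = np.sqrt(sig2) * rng.standard_normal()
        e = rng.standard_normal(d)
        Z = A * F + e
        Y[t] = Y[t - 1] + mu + gamma * (alpha @ Y[t - 1]) + Z
    return Y

Y = simulate_ecvf()
print(Y.shape)  # alpha' Y_t is stationary; each component of Y_t is I(1)
```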
3. ESTIMATION OF THE NUMBER OF FACTORS

The parameter set of the EC–VF model (2.1) is $\{\Theta; \Pi_0; A\}$, in which $\Theta = \{\mu, \Gamma_1, \ldots, \Gamma_{k-1}\}$ is called the structural parameter, $\Pi_0$ the co-integration parameter and $A$ the factor parameter. In the first two subsections $\{\Theta, \Pi_0\}$ is assumed known; its determination is discussed in subsection 3.3.

3.1. Determining A

Note that the factor loading matrix $A$ and the vector of factors $F_t$ in equation (2.1) are not separately identifiable. Our goal is to determine the rank of $A$ and the space spanned by the columns of $A$. Without loss of generality, we may assume $A'A = I_r$, where $I_r$ denotes the $r \times r$ identity matrix. Let $\mathcal{M}(A)$ be the linear subspace of $\mathbb{R}^d$ spanned by the columns of $A$, which is called the factor loading space. Then we need to estimate $\mathcal{M}(A)$ or its orthogonal complement $\mathcal{M}(B)$, where $B$ is a $d \times (d-r)$ matrix for which $(A, B)$ forms a $d \times d$ orthogonal matrix, i.e. $B'A = 0$ and $B'B = I_{d-r}$. Now it follows from equation (2.1) that
$$B' Z_t = B' e_t. \tag{3.1}$$
From equation (3.1) and the assumption that $\{e_t\}$ is a conditionally homoscedastic sequence of martingale differences (see equation (2.2)), we have
$$E(B' Z_t Z_t' B \mid \mathcal{F}_{t-1}) = B' \Sigma_e B = B' \Sigma_z B,$$
where $\Sigma_z = E(Z_t Z_t')$. This implies that
$$B' E[(Z_t Z_t' - \Sigma_e) I(Z_{t-\tau} \in C)] B = 0 \quad \text{for any } \tau \ge 1 \text{ and } C \in \mathcal{B}, \tag{3.2}$$
or equivalently
$$\sup_{C \in \mathcal{B}} \big\| B' E[(Z_t Z_t' - \Sigma_e) I(Z_{t-\tau} \in C)] B \big\| = 0 \quad \text{for any } \tau \ge 1, \tag{3.3}$$
where $\mathcal{B}$ consists of some subsets of $\mathbb{R}^d$, and $\|M\| = [\mathrm{tr}(M'M)]^{1/2}$ denotes the norm of a matrix $M$. Hence, we may estimate $B$ by minimizing
$$\Psi_n(B) = \sup_{1 \le \tau \le \tau_0,\, C \in \mathcal{B}} \Big\| B' \frac{1}{n - \tau_0} \sum_{t = \tau_0 + 1}^n (Z_t Z_t' - \hat\Sigma_z) I(Z_{t-\tau} \in C)\, B \Big\| \tag{3.4}$$
subject to the condition $B'B = I_{d-r}$, where $\tau_0$ is a prescribed positive integer and $\hat\Sigma_z = \frac{1}{n - \tau_0} \sum_{t = \tau_0 + 1}^n Z_t Z_t'$. This is a high-dimensional optimization problem, but it does not explicitly address the issue of how to determine the number of factors $r$ consistently. We first assume $r$ is known and introduce some properties of the estimator of $B$ derived by Pan et al. (2007) before we present a consistent estimator of $r$.

Let $\mathcal{H}^r$ be the set of all $d \times (d-r)$ ($d \ge r$) matrices $B$ satisfying $B'B = I_{d-r}$. We partition $\mathcal{H}^r$ into equivalence classes such that $B_1, B_2 \in \mathcal{H}^r$ belong to the same class if and only if $\mathcal{M}(B_1) = \mathcal{M}(B_2)$, which is equivalent to
$$(I_d - B_1 B_1') B_2 = 0 \quad \text{and} \quad (I_d - B_2 B_2') B_1 = 0. \tag{3.5}$$
Define $D(B_1, B_2) = \|(I_d - B_1 B_1') B_2\|$. The equivalence classes can be regarded as the elements of the quotient space $\mathcal{H}_D^r = \mathcal{H}^r / D$ defined by the $D$-distance. It can be shown that $D$ is a well-defined metric on $\mathcal{H}_D^r$, and thus $(\mathcal{H}_D^r, D)$, which is our parameter space, is a metric space; see Pan and Yao (2008). Our estimator of $B$ is the minimizer of $\Psi_n(\cdot)$ in $\mathcal{H}_D^r$, i.e.
$$\hat B = \arg\min_{B \in \mathcal{H}_D^r} \Psi_n(B).$$
Under the assumptions listed below, the estimator $\hat B$ is consistent with a convergence rate $\sqrt{n}$.
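The $D$-distance between two orthonormal column spaces is simple to compute; the following is a minimal sketch (function names are ours):

```python
import numpy as np

def D_distance(B1, B2):
    """D(B1, B2) = ||(I - B1 B1') B2||_F for B1, B2 with orthonormal columns;
    it is zero if and only if span(B1) = span(B2)."""
    d = B1.shape[0]
    return np.linalg.norm((np.eye(d) - B1 @ B1.T) @ B2, "fro")

# Two different orthonormal bases of the same plane in R^3 have distance ~0.
B1 = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
Q, _ = np.linalg.qr(np.array([[1.0, 1.0], [1.0, -1.0], [0.0, 0.0]]))
print(D_distance(B1, Q))
```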
ASSUMPTION 3.1. $\{Z_t\}$ is a strictly stationary $d$-dimensional time series with $E\|Z_t\|^{2p} < \infty$ for some $p > 2$. The $\beta$-mixing coefficients
$$\beta(n) = E \sup_{B \in \mathcal{F}_n^\infty} \big| P(B) - P(B \mid \mathcal{F}_{-\infty}^0) \big|$$
satisfy $\beta_n = O(n^{-b})$ for some $b > \frac{p}{p-2}$, where $\mathcal{F}_i^j$ is the $\sigma$-algebra generated by $\{Z_t, i \le t \le j\}$.
ASSUMPTION 3.2. Denote $\Psi(B) = \sup_{1 \le \tau \le \tau_0,\, C \in \mathcal{B}} \| B' E[(Z_t Z_t' - \Sigma_e) I(Z_{t-\tau} \in C)] B \|$. There exists a matrix $B_0 \in \mathcal{H}^r$ which minimizes $\Psi(B)$, and $\Psi(B)$ reaches its minimum value at a matrix $B \in \mathcal{H}^r$ if and only if $D(B, B_0) = 0$.

ASSUMPTION 3.3. There exists a positive constant $a$ such that $\Psi(B) - \Psi(B_0) \ge a D(B, B_0)$ for any matrix $B \in \mathcal{H}^r$.

In a similar way to the proof of Theorem 2 in Pan et al. (2007), we can prove the following result, which is useful in deriving a consistent estimator of the number of factors in the next subsection.

THEOREM 3.1. If the collection $\mathcal{B}$ of subsets of $\mathbb{R}^d$ is a VC-class, and Assumptions 3.1 and 3.2 hold, then
$$\sup_{B \in \mathcal{H}_D^r} \sqrt{n}\, |\Psi_n(B) - \Psi(B)| = O_p(1). \tag{3.6}$$
If, in addition, Assumption 3.3 also holds,
$$\sqrt{n}\, D(\hat B, B_0) = O_p(1). \tag{3.7}$$
REMARK 3.1. The definition of a Vapnik–Červonenkis (VC) class can be found in van der Vaart and Wellner (1996).

3.2. Determining r

Let $r_0$ be the true number of factors and $A_0$ the true factor loading matrix with rank $r_0$. We discuss how to estimate $r_0$ based on the estimated factor loading matrix $\hat A$ (or its counterpart $\hat B$) derived in the previous subsection. The basic idea is to treat the number of factors as the 'order' of model (2.1) and to determine the order in terms of an appropriate information criterion. In the following we always assume that Assumptions 3.1–3.3 hold. Let $M_l$ denote a matrix with rank $d - l$. In particular, $B_0^{r_0}$ and $\hat B^r$ ($0 \le r \le d$) denote the matrices $B_0$ and $\hat B$ with ranks $d - r_0$ and $d - r$, respectively. Let
$$\Psi_n(r, \hat B^r) = \sup_{1 \le \tau \le \tau_0,\, C \in \mathcal{B}} \big\| \hat B^{r\prime} \hat D_{n,\tau}(C) \hat B^r \big\|, \qquad \Psi(r, B_0^r) = \sup_{1 \le \tau \le \tau_0,\, C \in \mathcal{B}} \big\| B_0^{r\prime} D_\tau(C) B_0^r \big\|, \tag{3.8}$$
where
$$\hat D_{n,\tau}(C) = \frac{1}{n - \tau_0} \sum_{t = \tau_0 + 1}^n (Z_t Z_t' - \hat\Sigma_z) I(Z_{t-\tau} \in C), \qquad D_\tau(C) = E[(Z_t Z_t' - \Sigma_e) I(Z_{t-\tau} \in C)],$$
$$\hat B^r = \arg\min_{B \in \mathcal{H}_D^r} \Psi_n(r, B), \qquad B_0^r = \arg\min_{B \in \mathcal{H}_D^r} \Psi(r, B).$$
Our penalized goodness-of-fit criterion is defined as
$$PC(r) = \Psi_n(r, \hat B^r) + r\, g(n), \tag{3.9}$$
where $g(n)$ is a penalty for 'overfitting'. We may estimate $r_0$ by minimizing $PC(r)$, i.e.
$$\hat r = \arg\min_{0 \le r \le d} PC(r).$$
We call equation (3.9) a penalized goodness-of-fit criterion because of Lemma A.1.

REMARK 3.2. $\Psi_n(\cdot)$ can be regarded as a fitting error, because a model with $r + 1$ factors can fit no worse than a model with $r$ factors; indeed, Lemma A.1 shows that $\Psi_n(\cdot)$ is a non-increasing function of $r$. But efficiency is lost as more factors are estimated. For example, in the extreme case $r = d$ there is neither fitting error nor efficiency: $\Psi_n(d, \hat B^d) = 0$ with $\hat B^d = 0$.

The following theorem shows that $\hat r$ is a consistent estimator of $r_0$ provided that the penalty function $g(n)$ satisfies some mild conditions.

THEOREM 3.2. Under Assumptions 3.1–3.3, as $n \to \infty$, $\hat r \xrightarrow{P} r_0$ provided that $g(n) \to 0$ and $\sqrt{n}\, g(n) \to \infty$.

3.3. Determining $\{\Theta, \Pi_0\}$

In this subsection we estimate the structural and co-integration parameter sets without knowledge of the true factor structure of $Z_t$. By the Granger representation theorem, if there are exactly $m$ co-integration relations among the components of $Y_t$, then $\Pi_0$ admits the decomposition $\Pi_0 = \gamma\alpha'$, where $\alpha$ is a $d \times m$ matrix with linearly independent columns and $\alpha' Y_t$ is stationary. In this sense, $\alpha$ consists of $m$ co-integration vectors. As $\alpha$ and $\gamma$ are not separately identifiable, our goal is to determine the rank of $\alpha$, i.e. the dimension of the space spanned by the columns of $\alpha$. Besides Assumptions 3.1–3.3 on $\{Z_t\}$, we need an additional assumption on $\{Y_t\}$.

ASSUMPTION 3.4. The process $Y_t$ satisfies the basic assumptions of the Granger representation theorem given by Engle and Granger (1987), and $E\|\alpha' Y_{t-1}\|^4 < \infty$.

Our estimator of the co-integration vectors is the solution to the following optimization problem:
$$\max_{\alpha' S_{11} \alpha = I_m} \mathrm{tr}(\alpha' S_{10} S_{01} \alpha), \tag{3.10}$$
where $S_{ij} = T^{-1} \sum_{t=1}^T R_{it} R_{jt}'$, $R_{0t} = \Delta Y_t - \hat\Phi_1 X_t$, $R_{1t} = Y_{t-1} - \hat\Phi_2 X_t$, $X_t = (1, \Delta Y_{t-1}', \ldots, \Delta Y_{t-k+1}')'$, $\hat\Phi_1 = \sum_{t=1}^T \Delta Y_t X_t' (\sum_{t=1}^T X_t X_t')^{-1}$ and $\hat\Phi_2 = \sum_{t=1}^T Y_{t-1} X_t' (\sum_{t=1}^T X_t X_t')^{-1}$. The solution of equation (3.10) is $\hat\alpha \equiv (\hat\alpha_1, \ldots, \hat\alpha_m)$, where $\hat\alpha_1, \ldots, \hat\alpha_m$ are the $m$ generalized eigenvectors of $S_{10} S_{01}$ with respect to $S_{11}$ corresponding to the $m$ largest generalized eigenvalues. The estimated co-integration vectors are consistent with the standard root-$n$ convergence rate. The corresponding estimator $\hat\gamma = S_{01}\hat\alpha$ of the co-integration loading matrix and the estimator $\hat\Theta = \hat\Phi_1 - \hat\gamma\hat\alpha'\hat\Phi_2$ of the structural parameter are also consistent. These conclusions were obtained by Li et al. (2006), who also give a joint estimation of the co-integration rank and the lag order of the error correction model by the penalized goodness-of-fit measure
$$M(m, k) = R(m, k, \hat\alpha) + n_{m,k}\, g_1(n), \tag{3.11}$$
where
$$R(m, k, \hat\alpha) = \mathrm{tr}\big[ S_{00} - S_{01}\hat\alpha(\hat\alpha' S_{11}\hat\alpha)^{-1}\hat\alpha' S_{10} \big], \tag{3.12}$$
$g_1(n)$ is the penalty for 'overfitting' and $n_{m,k}$ is the number of free parameters. Note that $n_{m,k} = d + d^2(k-1) + 2dm - m^2$ for model (2.1). We may estimate $m_0$ by minimizing $M(m, k)$, i.e.
$$(\hat m, \hat k) = \arg\min_{0 \le m \le d,\; 1 \le k \le K} M(m, k),$$
where $K$ is a prescribed positive integer. Let $k_0$ be the true lag order. The theorem below ensures that $(\hat m, \hat k)$ is a consistent estimator of $(m_0, k_0)$.

THEOREM 3.3. Under Assumptions 3.1–3.4, as $n \to \infty$, $(\hat m, \hat k) \xrightarrow{P} (m_0, k_0)$ provided that $g_1(n) \to 0$ and $n g_1(n) \to \infty$.

In practice, the choice of the penalty function $g(\cdot)$ is flexible, e.g. $\ln(n)/\sqrt{n}$ or $2\ln(\ln(n))/\sqrt{n}$.
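To make the selection rule (3.9) concrete, the sketch below evaluates a simplified version of $\Psi_n(r, \hat B^r)$ over a small family of balls and picks $\hat r$ by minimizing $PC(r)$. The minimization over $\mathcal{H}_D^r$ is delegated to a user-supplied routine (in practice the sequential algorithm of Section 5.1), so this is an illustrative stub under our own naming conventions, not the authors' code.

```python
import numpy as np

def psi_n(B, Z, tau0=5, radii=(0.5, 1.0, 2.0)):
    """Goodness-of-fit from (3.4): sup over lags tau <= tau0 and balls
    C = {z: ||z|| <= radius} of ||B' D_hat_{n,tau}(C) B||_F."""
    n = Z.shape[0]
    Sz = Z[tau0:].T @ Z[tau0:] / (n - tau0)    # Sigma_z hat
    worst = 0.0
    for tau in range(1, tau0 + 1):
        for rad in radii:
            D = np.zeros_like(Sz)
            for t in range(tau0, n):
                if np.linalg.norm(Z[t - tau]) <= rad:
                    D += np.outer(Z[t], Z[t]) - Sz
            D /= n - tau0
            worst = max(worst, np.linalg.norm(B.T @ D @ B, "fro"))
    return worst

def select_r(Z, candidate_B):
    """Minimize PC(r) = Psi_n(r, B_hat^r) + r g(n). candidate_B(r) must
    return a d x (d - r) orthonormal matrix (e.g. from the sequential
    eigenvector algorithm); g(n) = ln(n)/sqrt(n) is one admissible penalty."""
    n, d = Z.shape
    g = np.log(n) / np.sqrt(n)
    pc = [psi_n(candidate_B(r), Z) + r * g for r in range(d + 1)]
    return int(np.argmin(pc)), pc
```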
4. MONTE CARLO SIMULATION

In this section we present a simple Monte Carlo experiment to illustrate the proposed approach. In particular, we check the accuracy of our estimation of the factor loading matrix $A$ and the number of factors $r$. Consider a simple EC–VF model with $d = 6$, $m = 1$ and $r = 1$:
$$\Delta Y_t = \mu + \gamma\alpha' Y_{t-1} + Z_t, \quad Z_t = A F_t + e_t, \quad F_t \mid \mathcal{F}_{t-1} \sim N(0, \sigma_t^2), \quad e_t \mid \mathcal{F}_{t-1} \sim N(0, I_6), \tag{4.1}$$
where $\sigma_t^2 = \beta_0 + \beta_1 F_{t-1}^2 + \beta_2 \sigma_{t-1}^2$, $e_t$ is independent of $F_t$, and the values of the parameters are given as follows: $\mu = (0.2028, 0.1987, 0.6038, 0.2722, 0.1988, 0.0153)'$, $\gamma = (0.1, 0.2, 0.3, 0.4, 0.5, 0.6)'$, $\alpha = (1, 2, -1, -1, -2, 3)'$, $A = (\frac{\sqrt{6}}{6}, \frac{\sqrt{6}}{6}, \frac{\sqrt{6}}{6}, \frac{\sqrt{6}}{6}, \frac{\sqrt{6}}{6}, \frac{\sqrt{6}}{6})'$ and $\beta = (\beta_0, \beta_1, \beta_2)' = (0.02, 0.10, 0.76)'$. Note that $A'A = 1$. We conduct 2000 replications, and for each replication the sample sizes are $n = 500$ and 1000, respectively. We estimate the transformation matrix $B$ by minimizing $\Psi_n(B)$ defined in equation (3.4), and measure the estimation error of the factor loading space $\mathcal{M}(A)$ by
$$D_1(A, \hat A) = \big( \big[\mathrm{tr}\{\hat A'(I_d - AA')\hat A\} + \mathrm{tr}(\hat B' AA' \hat B)\big]/d \big)^{1/2}.$$
The coefficients $\beta_i$, $i = 0, 1, 2$, are estimated by quasi-maximum likelihood based on a Gaussian likelihood. The resulting estimates are summarized in Table 1. The mean of the estimation errors $D_1(A, \hat A)$ is less than 0.06, and it decreases by over 15% as the sample size increases from 500 to 1000. The negative biases indicate a slight underestimation of the heteroscedastic coefficients. The relative frequencies of the different values of $\hat r$ are listed in Table 2. They show that, when the sample size $n$ increases, the estimation of $r$ becomes more accurate.
Table 1. Simulation results: summary statistics of estimation errors.

                      D1(A, Â)    β̂0         β̂1         β̂2
  n = 500   Mean      0.0563      0.0179     0.0894     0.7414
            Median    0.0438      0.0183     0.0827     0.7521
            STD       0.0601      0.0022     0.0403     0.0935
            Bias      –          −0.0021    −0.0106    −0.0186
            RMSE      –           0.0029     0.0454     0.0958
  n = 1000  Mean      0.0477      0.0193     0.0922     0.7481
            Median    0.0390      0.0199     0.0897     0.7543
            STD       0.0426      0.0010     0.0276     0.0724
            Bias      –          −0.0007    −0.0078    −0.0119
            RMSE      –           0.0013     0.0295     0.0766

Table 2. Relative frequencies of the values of r̂, when r = 1.

             r̂ = 0    1        2        3        4        5   6
  n = 500    0.0120   0.8425   0.1310   0.0105   0.0040   0   0
  n = 1000   0.0090   0.9765   0.0100   0.0045   0        0   0
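The error measure $D_1(A, \hat A)$ reported in Table 1 is straightforward to compute; a minimal sketch, assuming $\hat A$ and $\hat B$ have orthonormal columns spanning complementary spaces (names are ours):

```python
import numpy as np
from scipy.linalg import null_space

def D1(A, A_hat, B_hat):
    # D1(A, A_hat) = ([tr{A_hat'(I - AA')A_hat} + tr(B_hat' AA' B_hat)]/d)^{1/2}
    d = A.shape[0]
    P = A @ A.T                                # projector onto M(A) (A'A = I_r)
    t1 = np.trace(A_hat.T @ (np.eye(d) - P) @ A_hat)
    t2 = np.trace(B_hat.T @ P @ B_hat)
    return np.sqrt((t1 + t2) / d)

A = np.ones((6, 1)) / np.sqrt(6)               # true loading vector, A'A = 1
B = null_space(A.T)                            # orthonormal complement of M(A)
print(D1(A, A, B))                             # exact recovery gives 0
```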
5. APPLICATION TO REAL DATA

The VaR is widely adopted by banks and other financial institutions to measure and manage market risk, as it reflects the downside risk of a given portfolio or investment. Specifically, at a given confidence level $1 - a$, the VaR of a portfolio with weights $\omega_t$ is defined as the solution to
$$P(\omega_t' \Delta Y_t < VaR_a \mid \mathcal{F}_{t-1}) = a, \tag{5.1}$$
where $\Delta Y_t$ is the vector of log returns of the assets in the portfolio. In the case when the conditional density $f(\Delta Y_t \mid \mathcal{F}_{t-1})$ is normal, equation (5.1) reduces to the well-known formula
$$VaR_a = \omega_t' \mu_y(t) + \big(\omega_t' \Sigma_y(t)\, \omega_t\big)^{1/2} z_a, \tag{5.2}$$
where $z_a$ is the $a$th quantile of the univariate standard normal distribution. In this section we compare the VaR forecasts obtained under three different models for the asset price series $\{Y_t\}$: AR-DCC, EC-DCC and EC-VF-DCC. DCC refers to dynamic conditional correlation, a volatility model proposed by Engle (2002). Focusing on the methodology, we only consider the case when the conditional multivariate density is normal; the impact of other distributions (such as Student-t and some non-parametric densities) on VaR computation is beyond our scope here.

5.1. Data set and estimation of the EC-VF-DCC model

Our data set consists of 2263 daily log prices of CSCO, DELL, INTC, MSFT and ORCL, the five most active stocks in the US market, from 19 June 1997 to 16 June 2006.
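Given one-step-ahead conditional mean and covariance forecasts, formula (5.2) is immediate to evaluate; a minimal sketch with illustrative numbers of our own:

```python
import numpy as np
from scipy.stats import norm

def var_normal(weights, mu_y, Sigma_y, a=0.05):
    """One-step VaR at level a under conditional normality (equation (5.2)):
    VaR_a = w' mu + sqrt(w' Sigma w) * z_a, z_a the a-quantile of N(0, 1)."""
    return weights @ mu_y + np.sqrt(weights @ Sigma_y @ weights) * norm.ppf(a)

# Toy example: equally weighted portfolio of five assets, 2% daily vol,
# pairwise correlation 0.5, zero conditional mean.
w = np.full(5, 0.2)
mu = np.zeros(5)
Sigma = 0.0004 * (0.5 * np.eye(5) + 0.5)
print(var_normal(w, mu, Sigma, a=0.05))  # negative number = loss threshold
```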
[Figure 1 here: five panels showing the daily log-returns in percentage of (a) CSCO, (b) DELL, (c) INTC, (d) MSFT and (e) ORCL.]

Figure 1. Plots of daily log-returns in percentage.
The plots of log returns (in percentage) are presented in Figure 1, which shows significant time-varying volatility. Descriptive statistics are listed in Table 3. All unconditional distributions of these series exhibit excess kurtosis and non-zero skewness, indicating significant departures from the normal distribution. The estimation procedure for the EC-VF-DCC model is given step by step as follows.

Step 1. Fit an ECM for $Y_t$ to determine the structural and co-integration parameters. Compute the estimate of the conditional mean vector $\hat\mu_y(t) = \hat\Theta X_t + \hat\gamma\hat\alpha' Y_{t-1}$.

Step 2. Conduct a multivariate portmanteau test on the squared residuals obtained from the previous step to detect conditional heteroscedasticity. If there is serial dependence, fit a volatility factor model to the residual series $\{\hat Z_t\}$ to determine the factor loading matrix $\hat A$; otherwise switch to Step 3 with $\hat A = I_d$ and $\hat r = d$.
Table 3. Summary statistics of the log-returns (n = 2263).

              CSCO        DELL        INTC        MSFT        ORCL
  Mean        0.000423    0.000523    1.95e-05    0.000200    0.000418
  Stdev       0.031847    0.030270    0.030313    0.023074    0.036400
  Min        −0.145000   −0.209840   −0.248680   −0.169760   −0.346150
  Max         0.218239    0.163532    0.183319    0.178983    0.270416
  Skewness    0.149215   −0.118260   −0.391560   −0.173470   −0.226370
  Kurtosis    4.558020    3.690575    5.631860    5.955046    8.519630
Denote $B = (b_1, b_2, \ldots, b_{d-r})$. The objective function (3.4) can be modified to
$$\Psi_n(B) = \sum_{\tau=1}^{\tau_0} \sum_{C \in \mathcal{B}} w(C)\, \Big\| B' \frac{1}{n - \tau_0} \sum_{t = \tau_0 + 1}^n (Z_t Z_t' - \hat\Sigma_z) I(Z_{t-\tau} \in C)\, B \Big\|^2,$$
where $w(C) \ge 0$ are weights which ensure that the sum over $C \in \mathcal{B}$ converges. In the numerical implementation we simply take $\mathcal{B}$ to be a collection of balls centred at the origin in $\mathbb{R}^d$ and $w(C) = \{\#(\mathcal{B})\}^{-1}$. An algorithm for estimating $B$ and $r$ is given as follows. Put
$$\tilde\Psi_\tau(b) = \sum_{C \in \mathcal{B}} w(C) \Big[ b' \frac{1}{n - \tau_0} \sum_{t = \tau_0 + 1}^n (Z_t Z_t' - \hat\Sigma_z) I(Z_{t-\tau} \in C)\, b \Big]^2, \qquad \Psi(b) = \sum_{\tau=1}^{\tau_0} \tilde\Psi_\tau(b),$$
$$\Psi_l(b) = \sum_{\tau=1}^{\tau_0} \Bigg\{ \sum_{i=1}^{l-1} \sum_{C \in \mathcal{B}} w(C) \Big[ \hat b_i' \frac{1}{n - \tau_0} \sum_{t = \tau_0 + 1}^n (Z_t Z_t' - \hat\Sigma_z) I(Z_{t-\tau} \in C)\, b \Big]^2 + \tilde\Psi_\tau(b) \Bigg\}.$$
Compute $\hat b_1$ by minimizing $\Psi(b)$ subject to the constraint $b'b = 1$. For $l = 2, \ldots, d$, compute $\hat b_l$, which minimizes $\Psi_l(b)$ subject to the constraints $b'b = 1$ and $b'\hat b_i = 0$ for $i = 1, 2, \ldots, l-1$. Let $\hat r = \arg\min_{0 \le r \le d} PC(r)$ with $\hat B^r = (\hat b_1, \hat b_2, \ldots, \hat b_{d-r})$, where $PC(r)$ is defined by equation (3.9). Note that $\hat B^{\hat r\prime} \hat B^{\hat r} = I_{d - \hat r}$. Let $\hat A$ consist of the $\hat r$ (orthogonal) unit eigenvectors, corresponding to the common eigenvalue 1, of the matrix $I_d - \hat B^{\hat r} \hat B^{\hat r\prime}$.

Step 3. Fit a DCC volatility model (Engle, 2002) for $\{\hat A' Z_t\}$ and compute its conditional covariance $\tilde\Sigma_z(t) = D_t^{1/2} R_t D_t^{1/2}$. To this end, we first fit each element of $D_t$ with a univariate GARCH(1,1) model using the $i$th component of $\hat A' Z_t$ only, and then model the conditional correlation matrix $R_t$ by
$$R_t = S(1 - \theta_1 - \theta_2) + \theta_1\, \varepsilon_{t-1}\varepsilon_{t-1}' + \theta_2 R_{t-1},$$
where $\varepsilon_t$ is an $\hat r \times 1$ vector of the standardized residuals obtained from the separate GARCH(1,1) fits for the $\hat r$ components of $\hat A' Z_t$, and $S$ is the sample correlation matrix of $\hat A' Z_t$. If $\hat A = I_d$, the estimate of the conditional covariance matrix $\hat\Sigma_y(t)$ of $Y_t$ is equal to $\tilde\Sigma_z(t)$ and the algorithm terminates. Otherwise, proceed to Step 4.
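The last part of Step 2, recovering $\hat A$ from $\hat B$, amounts to a small eigen-decomposition; a minimal sketch (assuming $\hat B$ has orthonormal columns):

```python
import numpy as np

def loading_matrix_from_B(B_hat):
    """Return A_hat: unit eigenvectors of I_d - B_hat B_hat' with eigenvalue 1,
    i.e. an orthonormal basis of the orthogonal complement of M(B_hat)."""
    d = B_hat.shape[0]
    eigval, eigvec = np.linalg.eigh(np.eye(d) - B_hat @ B_hat.T)
    return eigvec[:, np.isclose(eigval, 1.0)]

# With d = 3 and one factor (r = 1), B_hat spans a 2-dimensional space.
B_hat, _ = np.linalg.qr(np.array([[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]))
print(loading_matrix_from_B(B_hat))  # a 3 x 1 orthonormal basis
```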
[Figure 2 here: surface plot of M(m, k) against m and k.]

Figure 2. Plot of M(m, k) against the co-integration rank m and the lag order k.
Step 4. The factor structure in equation (2.1) and the facts $B'A = 0$, $B'e_t = B'Z_t$ and $AA' + BB' = I_d$ lead to a dynamics for $\Sigma_y(t) \equiv \Sigma_z(t)$ as follows:
$$\hat\Sigma_y(t) = \hat A \tilde\Sigma_z(t) \hat A' + \hat A \hat A' \hat\Sigma_z \hat B \hat B' + \hat B \hat B' \hat\Sigma_z, \tag{5.3}$$
where $\hat\Sigma_z = \frac{1}{n - \tau_0} \sum_{t = \tau_0 + 1}^n Z_t Z_t'$.

We determine the co-integration rank by minimizing $M(m, k)$ defined by equation (3.11). The surface of $M(m, k)$ is plotted against $m$ and $k$ in Figure 2. The minimum of the surface is attained at $(m, k) = (1, 1)$, leading to an error correction model for this data set with lag order 1 and co-integration rank 1. Applying the Ljung-Box statistics to the squared residuals, we obtain $Q_5(1) = 63.2724$, $Q_5(5) = 305.7613$ and $Q_5(10) = 633.7103$. Based on asymptotic $\chi^2$ distributions with degrees of freedom 11, 111 and 236, the p-values of these Q statistics are all close to zero.² Consequently, the portmanteau test confirms the existence of conditional heteroscedasticity. The algorithm stated in Step 2 leads to an estimator of the number of factors, and $PC(r)$ is plotted against $r$ in Figure 3. Clearly, a two-factor structure (i.e. $\hat r = 2$) is determined for the residual series $\{Z_t\}$.

2 The $Q_5(l)$ statistic has asymptotically a $\chi^2$ distribution with $d^2 l - n_{m,k}$ degrees of freedom, where $n_{m,k} = d + d^2(k-1) + 2dm - m^2$ is the number of free parameters in the ECM.

5.2. Comparison of value-at-risk forecasting results

The VaRs are computed at level 0.05 (denoted by VaR$_{0.05}$) for the last 1000 trading days of the data span. We assume three models, AR-DCC, EC-DCC and EC-VF-DCC, for the asset prices $\{Y_t\}$, and
[Figure 3 here: PC(r) decreasing to its minimum at r = 2.]

Figure 3. Plot of PC(r) against the number of factors r.
Table 4. Comparison of VaR at the 0.05 level.

                 ω1              ω2              ω3              ω4              t (min)
  AR-DCC         0.067 (0.001)   0.071 (0.000)   0.065 (0.005)   0.062 (0.032)   287.3
  EC-DCC         0.052 (0.659)   0.059 (0.061)   0.051 (0.713)   0.053 (0.268)   294.7
  EC-VF-DCC      0.049 (0.713)   0.056 (0.308)   0.053 (0.268)   0.055 (0.312)    41.5

Note: Figures in parentheses are p-values for the Kupiec likelihood ratio test, which compares the empirical failure rate with its theoretical value; see Kupiec (1995). The average computing time in minutes for each model is recorded in the last column.
four time-invariant portfolios with weights $\omega_1 = (1, 1, 1, 1, 1)'/5$, $\omega_2 = (1, 2, 3, 4, 5)'/15$, $\omega_3 = (5, 4, 3, 2, 1)'/15$ and $\omega_4 = (1, 3, 5, 4, 2)'/15$. To compare the VaR forecasting performances, we calculate failure rates for the different specifications. The failure rate is defined as the proportion of portfolio returns $r_t = \omega_t' \Delta Y_t$ smaller than the VaRs. For a correctly specified model, the empirical failure rate is supposed to be close to the true level $a$. Table 4 displays the results for the 5% level. We observe from Table 4 that the EC-VF-DCC performs reasonably well, while AR-DCC has difficulty in providing failure rates close to 0.05. The empirical failure rates for AR-DCC are high, which means that it underestimates the risk. The results for the EC-DCC and EC-VF-DCC models are comparable, but the average computing time for EC-DCC is much longer (see the last column of Table 4).
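The failure rates and the Kupiec test reported in parentheses in Table 4 can be reproduced as follows; a minimal sketch referring the Kupiec (1995) unconditional-coverage likelihood ratio to its asymptotic $\chi^2(1)$ distribution, with toy data of our own:

```python
import numpy as np
from scipy.stats import chi2

def kupiec_backtest(returns, var_forecasts, a=0.05):
    """Empirical failure rate and p-value of the Kupiec LR test:
    LR = -2[log L(a) - log L(p)], p = x/n the observed exceedance rate."""
    hits = returns < var_forecasts              # VaR exceedances
    n, x = len(returns), int(hits.sum())
    p = x / n
    ll0 = (n - x) * np.log(1 - a) + x * np.log(a)
    ll1 = (n - x) * np.log(1 - p) + x * np.log(p)
    lr = -2.0 * (ll0 - ll1)
    return p, 1.0 - chi2.cdf(lr, df=1)          # failure rate, p-value

rng = np.random.default_rng(1)
r = 0.02 * rng.standard_normal(1000)            # toy portfolio returns
var05 = np.full(1000, 0.02 * -1.645)            # constant 5% VaR under normality
print(kupiec_backtest(r, var05))
```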
58
Q. Li and J. Pan
This shows that the factor structure imposed on the residual term of an ECM can improve computational speed in high-dimensional problems. The above results show that the EC–VF model proposed in this paper is a promising tool for risk analysis. First, it incorporates the impact of co-integration, which makes the VaR computation more accurate. Second, it reduces a high-dimensional optimization problem to a much lower-dimensional one, and thus greatly accelerates the VaR computation.
ACKNOWLEDGMENTS The authors are grateful to an anonymous referee and the co-editor for their insightful comments and valuable suggestions. Qiaoling Li was partially supported by the National Natural Science Foundation of China (grant no. 10571003). Jiazhu Pan was partially supported by the starter grant from University of Strathclyde (UK) and the National Basic Research Program of China (grant no. 2007CB814902).
REFERENCES

Andersen, T. G., T. Bollerslev, F. X. Diebold and P. Labys (1999). (Understanding, optimizing, using and forecasting) realized volatility and correlation. Working paper, Northwestern University.
Anderson, H. M., J. V. Issler and F. Vahid (2006). Common features. Journal of Econometrics 132, 1–5.
Bai, J. S. and S. Ng (2002). Determining the number of factors in approximate factor models. Econometrica 70, 191–211.
Bauwens, L., S. Laurent and J. V. K. Rombouts (2006). Multivariate GARCH models: a survey. Journal of Applied Econometrics 21, 79–109.
Bollerslev, T. (1986). Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics 31, 307–27.
Bollerslev, T., R. F. Engle and J. M. Wooldridge (1988). A capital asset pricing model with time varying covariances. Journal of Political Economy 96, 116–31.
Duan, J. C. and S. R. Pliska (2004). Option valuation with co-integrated asset prices. Journal of Economic Dynamics and Control 28, 727–54.
Engle, R. F. (2002). Dynamic conditional correlation – a simple class of multivariate GARCH models. Journal of Business and Economic Statistics 20, 339–50.
Engle, R. F. and C. W. J. Granger (1987). Co-integration and error correction: representation, estimation and testing. Econometrica 55, 251–76.
Engle, R. F. and K. F. Kroner (1995). Multivariate simultaneous generalized ARCH. Econometric Theory 11, 122–50.
Engle, R. F. and K. Sheppard (2001). Theoretical and empirical properties of dynamic conditional correlation multivariate GARCH. Working paper 2001-15, Department of Economics, University of California, San Diego.
Fan, J., M. Wang and Q. Yao (2008). Modelling multivariate volatilities via conditionally uncorrelated components. Journal of the Royal Statistical Society, Series B 70, 679–702.
Ghost, A. (1993). Hedging with stock index futures: estimation and forecasting with error correction model. The Journal of Futures Markets 13, 743–52.
Granger, C. W. J. (1981). Some properties of time series data and their use in econometric model specification. Journal of Econometrics 16, 121–30.
Granger, C. W. J. and A. A. Weiss (1983). Time series analysis of error correction models. In S. Karlin, T. Amemiya and L. A. Goodman (Eds.), Studies in Econometrics, Time Series and Multivariate Statistics, 255–78. New York: Academic Press.
Kroner, K. and J. Sultan (1993). Time-varying distributions and dynamic hedging with foreign currency futures. Journal of Financial and Quantitative Analysis 28, 535–51.
Kupiec, P. (1995). Techniques for verifying the accuracy of risk measurement models. Journal of Derivatives 2, 173–84.
Li, Q., J. Pan and Q. Yao (2006). On determination of cointegration rank. Working paper, Peking University.
Lien, D. (1996). The effect of the cointegration relationship on futures hedging: a note. The Journal of Futures Markets 16, 773–80.
Lien, D. and X. Luo (1994). Multiperiod hedging in the presence of conditional heteroscedasticity. The Journal of Futures Markets 14, 927–55.
Lien, D. and Y. K. Tse (2002). Some recent developments in futures hedging. Journal of Economic Surveys 16, 357–96.
Pan, J., D. Pena, W. Polonik and Q. Yao (2007). Modelling multivariate volatilities by common factors: an innovation expansion method. Working paper, The London School of Economics and Political Science.
Pan, J. and Q. Yao (2008). Modelling multiple time series via common factors. Biometrika 95, 365–79.
Phillips, P. C. B. (1991). Optimal inference in cointegrated systems. Econometrica 59, 283–306.
Stock, J. H. and M. Watson (1988). Testing for common trends. Journal of the American Statistical Association 83, 1097–107.
van der Vaart, A. W. and J. A. Wellner (1996). Weak Convergence and Empirical Processes. New York: Springer.
van der Weide, R. (2002). GO-GARCH: a multivariate generalized orthogonal GARCH model. Journal of Applied Econometrics 17, 549–64.
APPENDIX: PROOFS OF RESULTS

The first lemma shows that $\Psi_n(r, \hat B^r)$ defined in subsection 3.2 is a non-increasing function of the number of factors $r$.

LEMMA A.1. If $0 \le r_1 < r_2 \le d$, then $\Psi_n(r_1, \hat B^{r_1}) \ge \Psi_n(r_2, \hat B^{r_2})$.

Proof: For $0 \le r_1 < r_2 \le d$, $\hat B^{r_1}$ can be written as $(\tilde B^{r_2}, \tilde B^{d-(r_2 - r_1)})$, where $\tilde B^{r_2}$ consists of the first $d - r_2$ columns of the matrix $\hat B^{r_1}$. We have
$$\Psi_n(r_1, \hat B^{r_1}) = \sup_{1 \le \tau \le \tau_0,\, C \in \mathcal{B}} \big\| (\tilde B^{r_2}, \tilde B^{d-(r_2-r_1)})' \hat D_{n,\tau}(C) (\tilde B^{r_2}, \tilde B^{d-(r_2-r_1)}) \big\|$$
$$= \sup_{1 \le \tau \le \tau_0,\, C \in \mathcal{B}} \left\| \begin{pmatrix} \tilde B^{r_2\prime} \hat D_{n,\tau}(C) \tilde B^{r_2} & \tilde B^{r_2\prime} \hat D_{n,\tau}(C) \tilde B^{d-(r_2-r_1)} \\ \tilde B^{d-(r_2-r_1)\prime} \hat D_{n,\tau}(C) \tilde B^{r_2} & \tilde B^{d-(r_2-r_1)\prime} \hat D_{n,\tau}(C) \tilde B^{d-(r_2-r_1)} \end{pmatrix} \right\|$$
$$\ge \sup_{1 \le \tau \le \tau_0,\, C \in \mathcal{B}} \big\| \tilde B^{r_2\prime} \hat D_{n,\tau}(C) \tilde B^{r_2} \big\| = \Psi_n(r_2, \tilde B^{r_2}) \ge \Psi_n(r_2, \hat B^{r_2}).$$
The last inequality holds because $\hat B^{r_2}$ is the minimizer of $\Psi_n(B)$ in the metric space $(\mathcal{H}_D^{r_2}, D)$.
The proof of Theorem 3.2 needs the following two lemmas.
LEMMA A.2. For any fixed $r$ with $r_0 \le r \le d$, there exists a $B \in \mathcal{H}_D^r$ such that $\Psi(r, B) = 0$. For $0 \le r < r_0$, $\Psi(r, B) > 0$ holds for all $B \in \mathcal{H}_D^r$.

Proof: It is clear that $B'A_0 = 0$ implies $\Psi(r, B) = 0$, from the relation between $\Psi(r, B)$ and the factor model with true loading matrix $A_0$. For $r = r_0$, there must be a matrix in $\mathcal{H}_D^{r_0}$, denoted by $B^{r_0}$, such that $B^{r_0\prime} A_0 = 0$; thus $\Psi(r_0, B^{r_0}) = 0$ and it reaches the minimum value. We have $B^{r_0} = B_0^{r_0}$ in $\mathcal{H}_D^{r_0}$ by Assumption 3.2. For $r_0 < r \le d$, let $B = B_0^{r_0} H$, where $H$ is an arbitrary $(d - r_0) \times (d - r)$ matrix such that $H'H = I_{d-r}$. Then $B \in \mathcal{H}_D^r$ and $B'A_0 = 0$; in other words, $\Psi(r, B_0^{r_0} H) = 0$. For any $B \in \mathcal{H}_D^r$ with $r < r_0$, $B'A_0 \ne 0$. If $\Psi(r, B) = 0$, which means that $\|B' D_\tau(C) B\| = 0$ for any $1 \le \tau \le \tau_0$ and any $C \in \mathcal{B}$, then by choosing $C = \mathbb{R}^d$ we have $B'A_0 E(F_t F_t') A_0' B = 0$. This is impossible because $E(F_t F_t')$ is a positive definite matrix.

LEMMA A.3. For any $0 \le r < r_0$, there exists a $\kappa_r > 0$ such that
$$\operatorname*{p\,lim}_{n \to \infty} \big[ \Psi_n(r, \hat B^r) - \Psi_n(r_0, \hat B^{r_0}) \big] \ge \kappa_r,$$
where $\operatorname{p\,lim}$ denotes the limit in probability. For any $r_0 \le r \le d$, it holds that
$$\Psi_n(r, \hat B^r) - \Psi_n(r_0, \hat B^{r_0}) = O_p\Big(\frac{1}{\sqrt n}\Big).$$

Proof: It follows from the definition of $\hat B^r$ that
$$\Psi_n(r, \hat B^r) - \Psi_n(r_0, \hat B^{r_0}) \ge \Psi_n(r, \hat B^r) - \Psi_n(r_0, B_0^{r_0}).$$
Recall that $\Psi(r_0, B_0^{r_0}) = 0$ by Lemma A.2. Hence,
$$\Psi_n(r, \hat B^r) - \Psi_n(r_0, B_0^{r_0}) = \big[\Psi_n(r, \hat B^r) - \Psi(r, \hat B^r)\big] - \big[\Psi_n(r_0, B_0^{r_0}) - \Psi(r_0, B_0^{r_0})\big] + \Psi(r, \hat B^r) = O_p\Big(\frac{1}{\sqrt n}\Big) + \Psi(r, \hat B^r) \ge O_p\Big(\frac{1}{\sqrt n}\Big) + \Psi(r, B_0^r). \tag{A.1}$$
The second equality holds in a similar way to equation (3.6), with the slight modification that $\hat B^r$ depends on $n$. The last inequality follows from the definition of $B_0^r$. These imply that, for any $0 \le r < r_0$,
$$\operatorname*{p\,lim}_{n \to \infty} \big[ \Psi_n(r, \hat B^r) - \Psi_n(r_0, \hat B^{r_0}) \big] \ge \kappa_r := \Psi(r, B_0^r),$$
and from Lemma A.2, $\kappa_r > 0$. For the second part, since
$$\Psi_n(r, \hat B^r) - \Psi_n(r_0, \hat B^{r_0}) \le \big| \Psi_n(r, \hat B^r) - \Psi_n(r_0, B_0^{r_0}) \big| + \big| \Psi_n(r_0, B_0^{r_0}) - \Psi_n(r_0, \hat B^{r_0}) \big| \le 2 \max_{r_0 \le r \le d} \big| \Psi_n(r, \hat B^r) - \Psi_n(r_0, B_0^{r_0}) \big|,$$
it is sufficient to prove that for any $r_0 \le r \le d$,
$$\Psi_n(r, \hat B^r) - \Psi_n(r_0, B_0^{r_0}) = O_p\Big(\frac{1}{\sqrt n}\Big).$$
Notice that, from equation (A.1), $\Psi_n(r, \hat B^r) - \Psi_n(r_0, B_0^{r_0}) = O_p(\frac{1}{\sqrt n}) + \Psi(r, \hat B^r)$. Thus we need to prove $\Psi(r, \hat B^r) = O_p(\frac{1}{\sqrt n})$ for any $r_0 \le r \le d$, where
$$\Psi(r, \hat B^r) = \sup_{1 \le \tau \le \tau_0,\, C \in \mathcal{B}} \big\| \hat B^{r\prime} D_\tau(C) \hat B^r \big\|.$$
For an arbitrary $(d - r_0) \times (d - r)$ matrix $H$ such that $H'H = I_{d-r}$, we have
$$\hat B^{r\prime} D_\tau(C) \hat B^r = \big(\hat B^r - B_0^{r_0} H H' B_0^{r_0\prime} \hat B^r + B_0^{r_0} H H' B_0^{r_0\prime} \hat B^r\big)' D_\tau(C) \big(\hat B^r - B_0^{r_0} H H' B_0^{r_0\prime} \hat B^r + B_0^{r_0} H H' B_0^{r_0\prime} \hat B^r\big)$$
$$= \hat B^{r\prime} \big(I_d - B_0^{r_0} H H' B_0^{r_0\prime}\big)' D_\tau(C) \hat B^r + \hat B^{r\prime} B_0^{r_0} H H' B_0^{r_0\prime} D_\tau(C) \big(I_d - B_0^{r_0} H H' B_0^{r_0\prime}\big) \hat B^r,$$
where the last equality holds because the relation $B_0^{r_0\prime} A_0 = 0$ implies that $B_0^{r_0\prime} D_\tau(C) B_0^{r_0} = 0$ for any $\tau \ge 1$ and $C \in \mathcal{B}$. Hence,
$$\big\| \hat B^{r\prime} D_\tau(C) \hat B^r \big\| \le \big\| \big(I_d - B_0^{r_0} H H' B_0^{r_0\prime}\big) \hat B^r \big\|\, \|D_\tau(C)\|\, \big\| \hat B^r + B_0^{r_0} H H' B_0^{r_0\prime} \hat B^r \big\|$$
$$= D\big(\hat B^r, B_0^{r_0} H\big)\, \|D_\tau(C)\|\, \big( \sqrt{d-r} + \big\| B_0^{r_0} H H' B_0^{r_0\prime} \hat B^r \big\| \big) \le D\big(\hat B^r, B_0^{r_0} H\big)\, \|D_\tau(C)\|\, \sqrt{d-r}\,(1 + d - r).$$
Note that $\Psi(r, B_0^{r_0} H) = 0$ by Lemma A.2, i.e. $D(B_0^{r_0} H, B_0^r) = 0$. Thus $D(\hat B^r, B_0^{r_0} H) = O_p(\frac{1}{\sqrt n})$. It is easy to see that $\sup_{1 \le \tau \le \tau_0,\, C \in \mathcal{B}} \|D_\tau(C)\| = O_p(1)$. Therefore, $\Psi(r, \hat B^r) = O_p(\frac{1}{\sqrt n})$.
Proof of Theorem 3.2: The objective is to verify that $\lim_{n \to \infty} P(PC(r) - PC(r_0) < 0) = 0$ for all $0 \le r \le d$ with $r \ne r_0$, where
$$PC(r) - PC(r_0) = \Psi_n(r, \hat B^r) - \Psi_n(r_0, \hat B^{r_0}) - (r_0 - r)\, g(n).$$
For $r < r_0$, if $g(n) \to 0$ as $n \to \infty$,
$$P(PC(r) - PC(r_0) < 0) = P\big( \Psi_n(r, \hat B^r) - \Psi_n(r_0, \hat B^{r_0}) < (r_0 - r)\, g(n) \big) \to 0,$$
because, by Lemma A.3, $\Psi_n(r, \hat B^r) - \Psi_n(r_0, \hat B^{r_0})$ has a positive limit in probability. For $r > r_0$, Lemma A.3 implies that $\Psi_n(r, \hat B^r) - \Psi_n(r_0, \hat B^{r_0}) = O_p(\frac{1}{\sqrt n})$. Thus, if $\sqrt n\, g(n) \to \infty$ as $n \to \infty$, we have
$$P(PC(r) - PC(r_0) < 0) = P\big( \Psi_n(r_0, \hat B^{r_0}) - \Psi_n(r, \hat B^r) > (r - r_0)\, g(n) \big) = P\big( \sqrt n\,[\Psi_n(r_0, \hat B^{r_0}) - \Psi_n(r, \hat B^r)] > (r - r_0)\sqrt n\, g(n) \big) \to 0.$$
The Econometrics Journal (2009), volume 12, pp. 62–81. doi: 10.1111/j.1368-423X.2008.00260.x

On the impact of error cross-sectional dependence in short dynamic panel estimation

VASILIS SARAFIDIS† AND DONALD ROBERTSON‡

†Discipline of Econometrics and Business Statistics, The University of Sydney, NSW 2006, Australia
E-mail: [email protected]
‡Faculty of Economics and Politics, The University of Cambridge, Cambridge, CB3 9DD, UK
E-mail: [email protected]

First version received: September 2006; final version accepted: July 2008
Summary This paper explores the impact of error cross-sectional dependence (modelled as a factor structure) on a number of widely used IV and generalized method of moments (GMM) estimators in the context of a linear dynamic panel data model. It is shown that, under such circumstances, the standard moment conditions used by these estimators are invalid – a result that holds for any lag length of the instruments used. Transforming the data in terms of deviations from time-specific averages helps to reduce the asymptotic bias of the estimators, unless the factor loadings have mean zero. The finite sample behaviour of IV and GMM estimators is investigated by means of Monte Carlo experiments. The results suggest that the bias of these estimators can be so severe that the standard fixed effects estimator is no longer generally inferior in terms of root median square error. Time-specific demeaning alleviates the problem, although the effectiveness of this transformation decreases when the variance of the factor loadings is large. Keywords: Asymptotic bias, Cross-sectional dependence, Dynamic panel data, Generalized method of moments, Instrumental variables, Time-specific demeaning.
1. INTRODUCTION

In a panel regression model with lagged endogenous variables, the fixed effects estimator (FE) is inconsistent for small T (the number of time series observations in the panel), as shown by Nerlove (1967, 1971) using simulated data and formalized by Nickell (1981) for the case of a simple first-order autoregressive model. Since then, a standard estimation approach has been to transform the regression model into first differences and use appropriate lagged values of the dependent variable in levels as instruments for the transformed endogenous regressor (see Anderson and Hsiao, 1981, Holtz-Eakin et al., 1988 and Arellano and Bond, 1991). However, the first-differenced generalized method of moments (GMM) estimator can have poor finite sample properties (bias and imprecision) when the series is highly persistent or when the variance of the individual time-invariant unobserved effects is large relative to the variance of the purely idiosyncratic error component (Blundell and Bond, 1998). To alleviate this problem, subsequent
On the impact of error cross-sectional dependence in short dynamic panel estimation
63
research in the field has led to the development of GMM estimators that make use of additional moment conditions, based upon certain extra assumptions about the initial conditions (see e.g. Ahn and Schmidt, 1995, Arellano and Bover, 1995, and Blundell and Bond, 1998). These methods have proved popular in empirical work; among the many examples where GMM-type estimators have been used are the empirical growth literature (Bond et al., 2001, among many others), the literature on estimating wage equations and Phillips curves (Alonso-Borrego and Arellano, 1999 and others), the literature on estimating production functions (Blundell et al., 2000), and the estimation of money demand functions (e.g. Bover and Watson, 2005).

More recently, a vigorous literature has developed on testing and dealing with error cross-sectional dependence in panel data models. Cross-sectional dependence is a situation that is often encountered in macroeconomic and financial applications, where the ever-increasing economic and financial integration of countries and financial entities implies substantial cross-sectional interactions, but also in microeconomic panel data sets, where the propensity of micro-units to behave similarly may be explained by social norms, neighbourhood effects, herd behaviour and interdependent preferences. A particular form of cross-sectional dependence that has become popular is the factor structure approach. This has been used extensively in empirical work (see e.g. Barro and Sala-i-Martin, 1992) and it has been analysed in theoretical treatments at even greater length.¹ Consequently, in this paper we use the notions of error cross-sectional dependence and factor structure dependence interchangeably.

The impact of error cross-sectional dependence on the dynamic fixed effects (FE) estimator has been studied by Phillips and Sul (2003, 2007), who showed that if there is sufficient dependence across cross-sectional units, the efficiency gains that one had hoped to achieve by pooling the data largely diminish, to the extent that the FE estimator provides little gain over estimating each individual-specific time series regression separately using OLS.²

This paper investigates the impact of error cross-sectional dependence on a number of widely used IV and GMM dynamic panel estimators. We demonstrate that estimators relying on standard instruments with respect to lagged values of the dependent variable (either in levels or in first differences) are inconsistent as N → ∞ for fixed T. We show that this result holds true for any lag length of the instruments used. This is an important outcome, given that error cross-sectional dependence is a likely empirical situation (the econometrician may not have sufficient explanatory variables to remove all correlated behaviour) and the desirable N-asymptotic properties of estimators based on instrumental variables rely crucially upon the assumption that the errors are uncorrelated across individuals. We also show that the asymptotic bias of these estimators will most likely decrease if the data are transformed in terms of deviations from time-specific averages prior to estimation, which we refer to as time-specific demeaning, provided that the factor loadings do not have mean zero. In the latter case, while time-specific demeaning does not have a positive effect, it does not worsen the properties of the estimators either. Simulation results confirm these findings and provide a formal justification for the practice of including common time effects in the context of a short dynamic panel data model with large N.

1 The literature on factor models is growing rapidly. See e.g. Robertson and Symons (2000), Coakley et al. (2002), Bai (2006), Pesaran (2006), to mention only a few.
2 Intuitively, if all cross-sectional units behave similarly, there is little gain to be obtained from looking at more than one of them.
The structure of the paper is as follows. Section 2 sets out the assumptions of our model and Section 3 provides the main results of the paper. Section 4 analyses the asymptotic bias reduction of the IV estimator achieved by time-specific demeaning of the data. Section 5 describes the Monte Carlo design and discusses the simulation results of the paper. A final section concludes.
2. ASYMPTOTIC BIAS OF INSTRUMENTAL VARIABLES AND GMM ESTIMATORS

We consider the following first-order autoregressive panel data model:³
$$y_{it} = \lambda y_{it-1} + \upsilon_{it}, \quad i = 1, \ldots, N \text{ and } t = 2, \ldots, T,$$
$$\upsilon_{it} = \alpha_i + u_{it}, \qquad u_{it} = \sum_{m=1}^{M} \phi_{mi} f_{mt} + \varepsilon_{it} = \phi_i' f_t + \varepsilon_{it}, \tag{2.1}$$
where $y_{it}$ is the observation on the dependent variable of the $i$th cross-sectional unit at time $t$ and $\lambda$ is the unknown parameter of interest with $0 < \lambda < 1$. $\alpha_i$ denotes an individual-specific time-invariant effect with zero mean and constant, finite variance $\sigma_\alpha^2$. $u_{it}$ obeys a multi-factor structure, where $f_t = (f_{1t}, \ldots, f_{Mt})'$ denotes an $M \times 1$ vector of individual-invariant time-specific unobserved effects, $\phi_i = (\phi_{1i}, \ldots, \phi_{Mi})'$ is an $M \times 1$ vector of factor loadings and $\varepsilon_{it}$ is a purely idiosyncratic component with zero mean and constant, finite variance $\sigma_\varepsilon^2$. The factor structure approach is widely used to model error cross-sectional dependence because it can approximate a wide variety of dependence forms, provided that the number of factors allowed is sufficiently large.⁴ We make the following assumptions:

ASSUMPTION 2.1. $E(\alpha_i \varepsilon_{it}) = 0$ for all $i, t$.

ASSUMPTION 2.2. $E(\varepsilon_{it}\varepsilon_{is}) = 0$ for all $i$ and $t \ne s$.

ASSUMPTION 2.3. $E(y_{i1}\varepsilon_{it}) = 0$ for $t = 2, 3, \ldots, T$.

ASSUMPTION 2.4. $E(f_t) = 0$, $E(f_t f_s') = I_M$ for $t = s$ and $E(f_t f_s') = 0$ otherwise.

ASSUMPTION 2.5. $E(\phi_i) = \mu_\phi$, $E[(\phi_i - \mu_\phi)(\phi_i - \mu_\phi)'] = \Sigma_\phi$, where $\|\mu_\phi\| < B_1$ and $\Sigma_\phi$ is an $M \times M$ positive semi-definite matrix.

ASSUMPTION 2.6. $E(\varepsilon_{it}\phi_i) = 0$, $E(\varepsilon_{it} f_t) = 0$, $E(\alpha_i \phi_i) = 0$, $E(\alpha_i f_t) = 0$, $E(\phi_i f_t') = 0$ for all $i, t$.

Assumptions 2.1–2.3 are standard in the GMM literature. Assumption 2.2 can easily be relaxed by allowing $\varepsilon_{it} \sim MA(k)$, where $k$ is a small number. Assumption 2.3 means that the initial conditions are predetermined. This ensures that sufficiently lagged values of $y_{it}$ will be uncorrelated with $\varepsilon_{it}$. Assumption 2.4 implies that the factors are serially and mutually

3 The main results of this paper naturally extend to panel autoregressive processes of higher order, as well as to autoregressive distributed lag panel models. See e.g. Sarafidis et al. (2008).
4 See footnote 1.
Assumption 2.5 ensures that the moments of the factor loadings are bounded. Assumption 2.6 implies that f_t and φ_i are mutually uncorrelated, as well as uncorrelated with α_i and ε_it for all i and t.

Define the (T − 1) × M matrix F = [f_2, f_3, . . . , f_T]′ and the N × M matrix Φ = [φ_1, φ_2, . . . , φ_N]′. The initial model given in (2.1) can be written more compactly as

Y = λY_{−1} + α + ΦF′ + ε, (2.2)

where Y = [Y_1, . . . , Y_N]′ is a N × (T − 1) matrix with Y_i = (y_i2, y_i3, . . . , y_iT)′, Y_{−1} = [Y_{1,−1}, . . . , Y_{N,−1}]′ is a N × (T − 1) matrix with Y_{i,−1} = (y_i1, y_i2, . . . , y_{iT−1})′, α = [α_1, α_2, . . . , α_N]′ with α_i = α_i i_{T−1} and i_{T−1} being a (T − 1) × 1 column vector of ones, and ε = [ε_1, . . . , ε_N]′ with ε_i = (ε_i2, ε_i3, . . . , ε_iT)′. Given that ΦF′ = ΦAA^{−1}F′, where A is an arbitrary M × M invertible matrix with M(M − 1)/2 free elements, separate identification of the factors and the factor loadings requires M(M − 1)/2 restrictions. 5 These can be obtained by requiring Σ_φ to be a diagonal matrix, which implies that factor loadings from different factors are mutually uncorrelated.
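To make the setup concrete, the following sketch simulates one draw from model (2.1) under Assumptions 2.1–2.6. It is illustrative only: the function name, its defaults and the Gaussian choices for α_i, ε_it, f_t and φ_i are my assumptions (the paper's own simulation design appears in Section 5), not the authors' code.

```python
import numpy as np

def simulate_panel(N=400, T=6, lam=0.4, M=1, mu_phi=1.3, var_phi=0.213,
                   sd_alpha=1.0, sd_eps=1.0, burn=50, seed=0):
    """One draw from y_it = lam*y_{it-1} + alpha_i + phi_i'f_t + eps_it, model (2.1).

    Gaussian loadings/factors are an illustrative assumption consistent with
    Assumptions 2.1-2.6; a burn-in removes dependence on the zero start value.
    """
    rng = np.random.default_rng(seed)
    Tall = T + burn
    f = rng.standard_normal((Tall, M))                    # E(f_t) = 0, var(f_t) = I_M
    phi = mu_phi + np.sqrt(var_phi) * rng.standard_normal((N, M))
    alpha = sd_alpha * rng.standard_normal(N)             # individual effects
    eps = sd_eps * rng.standard_normal((N, Tall))         # idiosyncratic component
    u = phi @ f.T + eps                                   # multi-factor error structure
    y = np.zeros((N, Tall))
    for t in range(1, Tall):
        y[:, t] = lam * y[:, t - 1] + alpha + u[:, t]
    return y[:, -T:]                                      # an N x T panel
```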
3. ASYMPTOTIC BIAS OF INSTRUMENTAL VARIABLES AND GMM ESTIMATORS

Under error cross-sectional dependence, the standard IV and GMM estimators are inconsistent as N → ∞ for fixed T. This is an important result given that the applied econometrician may not have sufficient explanatory variables to remove all correlated behaviour, and the desirable N-asymptotic properties of these estimators rely crucially upon the absence of such dependence. In order to illustrate this point, suppose first that φ_i = 0, so that there is no error cross-sectional dependence, in which case (2.1) may be written in first-differences as

Δy_it = λΔy_{it−1} + Δε_it. (3.1)
To overcome the induced endogeneity between the lagged dependent variable and the resulting error term, Anderson and Hsiao (1981) suggested the use of single instruments for Δy_{it−1}, the most popular of which has been y_{it−2}. 6 Holtz-Eakin et al. (1988) and Arellano and Bond (1991) pointed out that under Assumptions 2.1–2.3 the autoregressive model in (2.1) implies that the following (T − 1)(T − 2)/2 linear moment conditions are valid:

E(y_{it−s} Δε_it) = 0, for t = 3, . . . , T and 2 ≤ s ≤ t − 1. (3.2)
These moment conditions give rise to a first-differenced GMM (DIF GMM) estimator, which is consistent for fixed T and asymptotically more efficient than the standard IV estimator. On the other hand, DIF GMM has been shown to be subject to a weak instruments problem when λ → 1 or σ_α²/σ_ε² → ∞. Hence, Blundell and Bond (1998) developed an approach outlined in Arellano and Bover (1995), which uses Δy_{it−1} for t = 3, 4, . . . , T as additional instruments with respect to the equations in levels, resulting in a system GMM (SYS GMM) estimator. This approach is valid provided that the deviations of the initial observations from the long-run convergent

5 The total number of elements of A is M², but we have already imposed the standard orthonormalization, E(f_t) = 0 and var(f_t) = I_M, which yields M(M + 1)/2 restrictions.
6 See Arellano (1989).
values are uncorrelated with the individual effects – namely, E[α_i (y_i1 − α_i/(1 − λ))] = 0. Under this mean-stationarity restriction on the initial conditions of the data generating process, the following T − 2 additional linear moment conditions are valid: 7

E(Δy_{it−1} υ_it) = 0, for t = 3, 4, . . . , T. (3.3)
However, with error cross-sectional dependence, the moment conditions outlined in (3.2) and (3.3) are violated, as the following proposition demonstrates.

PROPOSITION 3.1. Under Assumptions 2.1–2.6 and model (2.1), the moment conditions used by the standard DIF GMM and SYS GMM estimators are violated as N → ∞ for fixed T. In particular, we have:

Moment conditions used by DIF GMM:

E(y_{it−s} Δu_it | {f_n}_{−∞}^t) = Δf_t′ (Σ_φ + μ_φ μ_φ′) w_{t−s} ≠ 0, for t = 3, . . . , T and 2 ≤ s ≤ t − 1, (3.4)

where w_{t−s} = Σ_{j=0}^∞ λ^j f_{t−s−j}.

Additional moment conditions used by SYS GMM:

E(Δy_{it−1} υ_it | {f_n}_{−∞}^t) = f_t′ (Σ_φ + μ_φ μ_φ′) Δw_{t−1} ≠ 0, for t = 3, 4, . . . , T, (3.5)

where Δw_{t−1} = Σ_{j=0}^∞ λ^j Δf_{t−1−j}.
Notice that these moment conditions are violated for any lag length of the instruments used. This implies that standard estimation procedures that try to exploit orthogonality conditions between lags of the dependent variable and the error process will not be consistent for fixed T as N → ∞, regardless of the lag length of the instruments used.

REMARK 3.1. The unconditional expectation in (3.4) and (3.5) is equal to zero because f_t has mean zero. The conditional expectation, of course, need not be zero. In our large N asymptotics, it is the conditional expectation that is relevant because we are never taking large T averages.

It is useful to consider y_{it−2} as a single instrument for the first-differenced endogenous regressor in (3.1), which gives rise to the standard IV estimator (Anderson and Hsiao, 1981). In particular, under Assumptions 2.1–2.6 and model (2.1), the asymptotic bias of the IV estimator λ̂_IV for λ has the following convenient representation: 8

plim_{N→∞} λ̂_IV − λ = [plim_{N→∞} (1/N) Σ_{i=1}^N Σ_{t=3}^T y_{it−2} Δu_it] / [plim_{N→∞} (1/N) Σ_{i=1}^N Σ_{t=3}^T y_{it−2} Δy_{it−1}] = η_NT [η_DT − (T − 2)σ_ε²/(1 + λ)]^{−1}, (3.6)

where

η_NT = Σ_{t=3}^T Δf_t′ (Σ_φ + μ_φ μ_φ′) w_{t−2}, η_DT = Σ_{t=3}^T Δw_{t−1}′ (Σ_φ + μ_φ μ_φ′) w_{t−2},

w_{t−2} = Σ_{j=0}^∞ λ^j f_{t−2−j} and Δw_{t−1} = Σ_{j=0}^∞ λ^j Δf_{t−1−j}.
7 Kiviet (2007) refers to this as a ‘stationary accumulated effects’ restriction.
8 See Appendix A.
For the single factor case, the asymptotic bias expression reduces to

plim_{N→∞} λ̂_IV − λ = (σ_φ² + μ_φ²) κ₁ / [(σ_φ² + μ_φ²) κ₂ − (T − 2)σ_ε²/(1 + λ)], (3.7)
where σ_φ² + μ_φ² = plim_{N→∞} (1/N) Σ_{i=1}^N φ_i² = E(φ_i²), κ₁ = Σ_{t=3}^T w_{t−2} Δf_t and κ₂ = Σ_{t=3}^T Δw_{t−1} w_{t−2} = Σ_{t=3}^T w_{t−1} w_{t−2} − Σ_{t=3}^T (w_{t−2})².

For any fixed T, the magnitude of the asymptotic bias of λ̂_IV depends on the mean and the variance of the factor loadings and the variance of ε_it. Clearly, if σ_φ² and μ_φ are negligible, or the purely idiosyncratic component ε_it dominates the error process, the asymptotic bias is relatively small. Similarly, for serially uncorrelated stochastic factors and large T, the asymptotic bias diminishes because (1/T)κ₁ = o_p(1). However, in this case the bias of the fixed effects estimator disappears too, so the IV estimator loses its relative merit anyway. 9

Notice that the sign of the asymptotic bias of λ̂_IV depends on the signs of κ₁ and κ₂. For instance, if κ₂ < [(T − 2)/(1 + λ)][σ_ε²/(σ_φ² + μ_φ²)], the asymptotic bias will be negative for κ₁ > 0 and positive for κ₁ < 0. Under Assumption 2.4 it can be shown that 10

E(κ₁) = 0, E(κ₂) = −(T − 2)/(1 + λ), cov(κ₁, κ₂) < 0. (3.8)

Hence, there are two cases that we need to consider: (i) for κ₂ < E(κ₂), κ₁ is more likely to be positive; but as in this case κ₂ is smaller than [(T − 2)/(1 + λ)][σ_ε²/(σ_φ² + μ_φ²)], the denominator in (3.7) is negative and therefore the asymptotic bias of λ̂_IV has a negative sign; (ii) similarly, for κ₂ > E(κ₂), κ₁ is more likely to be negative, so for κ₂ > [(T − 2)/(1 + λ)][σ_ε²/(σ_φ² + μ_φ²)] the denominator in (3.7) is positive and the asymptotic bias of λ̂_IV has a negative sign again. The opposite holds for −(T − 2)/(1 + λ) < κ₂ < [(T − 2)/(1 + λ)][σ_ε²/(σ_φ² + μ_φ²)]. Notice also that for f_t ∼ N(0, 1), the actual distribution of κ₂ is skewed to the left because it involves the sum of squares of T − 2 terms (with a minus sign in front). This implies some large negative values for κ₂, which are naturally associated with large positive values for κ₁ and a large negative asymptotic bias for λ̂_IV. As a result, the asymptotic bias of λ̂_IV is likely to be negative – an outcome that has been confirmed in our simulation experiments.
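The moments in (3.8) are easy to verify numerically. The sketch below simulates the single-factor κ₁ and κ₂ for i.i.d. standard normal factors; the function and variable names are mine, and the recursion w_t = λw_{t−1} + f_t (the accumulated factor in the definition of w_t) is started from zero behind a long burn-in.

```python
import numpy as np

def kappa_moments(T=6, lam=0.4, reps=50_000, burn=200, seed=1):
    """Monte Carlo check of (3.8): E(k1)=0, E(k2)=-(T-2)/(1+lam), cov(k1,k2)<0."""
    rng = np.random.default_rng(seed)
    k1 = np.empty(reps)
    k2 = np.empty(reps)
    for r in range(reps):
        L = burn + T
        f = rng.standard_normal(L)
        w = np.zeros(L)                        # w_t = sum_j lam^j f_{t-j}, built recursively
        for t in range(1, L):
            w[t] = lam * w[t - 1] + f[t]
        t = np.arange(burn + 2, L)             # the T - 2 usable periods t = 3, ..., T
        k1[r] = np.sum(w[t - 2] * (f[t] - f[t - 1]))
        k2[r] = np.sum((w[t - 1] - w[t - 2]) * w[t - 2])
    return k1.mean(), k2.mean(), np.cov(k1, k2)[0, 1]

# For T = 6 and lam = 0.4 the theoretical value is E(k2) = -4/1.4, roughly -2.86.
print(kappa_moments())
```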
4. REDUCING THE BIAS OF IV AND GMM IN SHORT DYNAMIC PANELS

One way to reduce the amount of error cross-sectional dependence, and therefore the bias of the IV and GMM estimators, is to transform the data in terms of deviations from time-specific averages. This is an appealing procedure because it is easy to implement and it does not require
9 See Phillips and Sul (2007) for an analysis of the properties of the dynamic fixed effects estimator with error cross-sectional dependence.
10 See Appendix A.
projecting out the unobserved factors or estimating the factor loadings, both of which typically require large T. 11 Transforming the data in terms of deviations from time-specific averages is equivalent to including common time-specific effects in the regression model, which is standard practice in the estimation of short dynamic panels as a way of capturing common variations in the dependent variable. 12 In order to see the impact of this transformation it is instructive to reconsider the simple IV estimator. Specifically, averaging (2.1) over i and subtracting yields

y_it − ȳ_t = (α_i − ᾱ) + λ(y_{it−1} − ȳ_{t−1}) + (φ_i − φ̄)′ f_t + (ε_it − ε̄_t), (4.1)

where ȳ_t = Σ_{i=1}^N y_it/N, φ̄ = Σ_{i=1}^N φ_i/N, and similarly for the remaining variables. As a result, the mean value of the factor loadings has been removed and the error term now has mean zero. Taking first-differences in the above equation yields

Δ(y_it − ȳ_t) = λΔ(y_{it−1} − ȳ_{t−1}) + (φ_i − φ̄)′ Δf_t + Δ(ε_it − ε̄_t)
             = λΔ(y_{it−1} − ȳ_{t−1}) + Δ(u_it − ū_t). (4.2)

Using (y_{it−2} − ȳ_{t−2}) as an instrument for Δ(y_{it−1} − ȳ_{t−1}) gives rise to an IV estimator, which we denote λ̃_IV, with the following probability limit: 13

plim_{N→∞} λ̃_IV − λ = [plim_{N→∞} (1/N) Σ_{i=1}^N Σ_{t=3}^T (y_{it−2} − ȳ_{t−2}) Δ(u_it − ū_t)] / [plim_{N→∞} (1/N) Σ_{i=1}^N Σ_{t=3}^T (y_{it−2} − ȳ_{t−2}) Δ(y_{it−1} − ȳ_{t−1})] = η̃_NT [η̃_DT − (T − 2)σ_ε²/(1 + λ)]^{−1}, (4.3)

where

η̃_NT = Σ_{t=3}^T Δf_t′ Σ_φ w_{t−2}, η̃_DT = Σ_{t=3}^T Δw_{t−1}′ Σ_φ w_{t−2},

and w_{t−2}, Δw_{t−1} are defined below (3.6). For M = 1, (4.3) reduces to

plim_{N→∞} λ̃_IV − λ = σ_φ² κ₁ / [σ_φ² κ₂ − (T − 2)σ_ε²/(1 + λ)], (4.4)

where σ_φ² = plim_{N→∞} (1/N) Σ_{i=1}^N (φ_i − φ̄)², φ̄ = (1/N) Σ_{i=1}^N φ_i, and κ₁ and κ₂ are defined below (3.7).
11 Such methods have been proposed by Robertson and Symons (2000), Coakley et al. (2002), Phillips and Sul (2003), Moon and Perron (2004), Bai (2006) and Pesaran (2006). However, these methods are justified only in a set-up where T is large. Ahn et al. (2001) developed a fixed-T consistent GMM approach that controls for a single time-varying individual effect (which is similar to a single-factor structure with no individual-specific time-invariant effects) in a model with strictly exogenous regressors. We do not consider this approach here because our focus is on a dynamic panel with a multi-factor structure and the usual individual-specific time-invariant effects.
12 See, for example, Arellano and Bond (1991) and Blundell et al. (2000).
13 See Appendix A.
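For illustration, a minimal version of the Anderson–Hsiao IV estimator and its time-specific demeaned variant can be written as follows; `ah_iv` and the panel layout (an N × T array with t = 1, ..., T in columns) are my conventions, not code from the paper.

```python
import numpy as np

def ah_iv(y):
    """Anderson-Hsiao IV: instrument dy_{it-1} with y_{it-2}, pooling t = 3, ..., T."""
    dy  = y[:, 2:] - y[:, 1:-1]     # delta y_it     for t = 3, ..., T
    dy1 = y[:, 1:-1] - y[:, :-2]    # delta y_{it-1}
    z   = y[:, :-2]                 # instrument y_{it-2}
    return np.sum(z * dy) / np.sum(z * dy1)

def ah_iv_demeaned(y):
    """The same estimator after removing time-specific averages, as in (4.2)-(4.3)."""
    return ah_iv(y - y.mean(axis=0))
```

Applied to panels drawn as in the earlier sketch, `ah_iv` is biased away from λ once the loadings have non-zero mean, while `ah_iv_demeaned` recovers most of the lost accuracy, in line with (3.7) versus (4.4).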
Comparing (3.7) and (4.4), the IV estimator applied to time-specific demeaned data, λ̃_IV, will have a smaller asymptotic bias than the IV estimator in the basic model if one can show that

RB(λ̃_IV) = |asy.bias(λ̃_IV) / asy.bias(λ̂_IV)| = |σ_φ² / [σ_φ² κ₂ − (T − 2)σ_ε²/(1 + λ)]| / |(σ_φ² + μ_φ²) / [(σ_φ² + μ_φ²) κ₂ − (T − 2)σ_ε²/(1 + λ)]| < 1, (4.5)

where RB(λ̃_IV) denotes the relative bias of λ̃_IV compared to λ̂_IV (the common factor κ₁ cancels). Intuitively, since time-specific demeaning reduces the amount of error cross-sectional dependence (by removing the mean value of φ_i), and since the latter is responsible for the non-zero asymptotic bias of the IV estimator, it is natural to expect that RB(λ̃_IV) < 1. This holds true indeed, unless κ₂ takes an unusually large positive value, in which case it is possible for the bias of λ̃_IV to be greater than that of λ̂_IV. The following proposition provides the necessary condition for (4.5) to hold true.

PROPOSITION 4.1. Under Assumptions 2.1–2.6 and (2.1), the asymptotic bias of the IV estimator will be reduced when the estimator is applied to time-specific demeaned data so long as

κ₂ < B = [(2σ_φ² + μ_φ²)(T − 2)σ_ε²] / [2(σ_φ² + μ_φ²)σ_φ²(1 + λ)] > 0. (4.6)

Using Assumption 2.4, κ₂ will most likely be smaller than this bound, unless the value of σ_φ² is unrealistically large. 14 In particular, as an indication, for f_t standard normal we have simulated the probability that κ₂ ≥ B and found that for – say – σ_φ² = 10, Pr(κ₂ ≥ B) = 0.00087. To see what this means in practice, if the factor loadings are uniformly distributed, such that φ_i ∼ i.i.d.U[a, b], σ_φ² = 10 gives a difference between a and b of 10.95, which seems an unlikely degree of heterogeneity to arise in most empirical applications; and even if it does arise, the probability that (4.6) is violated is still very small. Therefore, time-specific demeaning will most likely have favourable effects for practical purposes, at least in large samples. 15

Turning to (4.5), for a given value of κ₂ < B and T fixed, the magnitude of RB(λ̃_IV) depends on σ_φ², μ_φ and σ_ε². For example, if μ_φ is zero, the numerator and the denominator in (4.5) are exactly the same and therefore RB(λ̃_IV) equals unity; there is no gain from time-specific demeaning of the data. On the other hand, if the factor loadings are the same across all individuals, σ_φ² is equal to zero and hence RB(λ̃_IV) is also zero; time-specific demeaning is fully effective.

Figure 1 shows graphically the relative asymptotic bias of λ̃_IV, denoted RB(λ̃_IV), for a range of values of σ_φ² and μ_φ, while setting T = 6, λ = 0.4, σ_ε² = 1 and κ₂ = E(κ₂). We can see that RB(λ̃_IV) is consistently less than 1 unless μ_φ is zero, in which case time-specific demeaning has no effect. For a given non-zero value of μ_φ, RB(λ̃_IV) → 0 as σ_φ² becomes smaller. Also, for a fixed value of σ_φ², RB(λ̃_IV) decreases as μ_φ becomes larger, although the rate of decrease depends on the value of σ_φ². Similar results hold for different values of λ.
14 This result is intuitive because the effect that time-specific demeaning has on reducing the asymptotic bias of the IV estimator is small in this case.
15 This is unless μ_φ = 0, as we will see.
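Equations (4.5) and (4.6) are straightforward to evaluate. The helper below (names and arguments are mine) returns the relative bias RB and the bound B for given values of κ₂, σ_φ², μ_φ, σ_ε², T and λ.

```python
def rb_and_bound(kappa2, var_phi, mu_phi, var_eps, T, lam):
    """RB in (4.5) (the common factor k1 cancels) and the bound B in (4.6); a sketch."""
    c = (T - 2) * var_eps / (1 + lam)               # recurring term (T-2)*sigma_eps^2/(1+lam)
    s2 = var_phi + mu_phi**2
    bias_tilde = var_phi / (var_phi * kappa2 - c)   # demeaned-data IV, up to k1
    bias_hat = s2 / (s2 * kappa2 - c)               # raw-data IV, up to k1
    B = (2 * var_phi + mu_phi**2) * c / (2 * s2 * var_phi)
    return abs(bias_tilde / bias_hat), B

# Example: at kappa2 = E(kappa2) = -(T-2)/(1+lam) with T = 6, lam = 0.4,
# var_phi = 1, mu_phi = 1 and var_eps = 1, demeaning should shrink the bias.
rb, B = rb_and_bound(-4 / 1.4, 1.0, 1.0, 1.0, 6, 0.4)
assert rb < 1 and -4 / 1.4 < B
```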
Figure 1. Relative asymptotic bias of λ̃_IV.
Figure 2. Asymptotic bias of λ̂_IV and λ̃_IV and relative bias of λ̃_IV.
Figure 2 illustrates the asymptotic bias of λ̂_IV and λ̃_IV, as well as RB(λ̃_IV), for a range of values of μ_φ and σ_ε², while setting T = 6, λ = 0.4, σ_φ² = 1 and κ₂ = E(κ₂). Observe that while the asymptotic bias of both λ̂_IV and λ̃_IV falls (the surface goes up, since it is negative) with higher values of σ_ε², the rate of decrease is much lower for λ̃_IV than it is for λ̂_IV (unless μ_φ = 0). Thus, the net effect is that RB(λ̃_IV) decreases as σ_ε² becomes larger.
5. SMALL SAMPLE PROPERTIES OF ESTIMATORS

We investigate the finite sample properties of the IV, DIF GMM and SYS GMM estimators under error cross-sectional dependence. In the simulation experiments presented below, we restrict the generality of the Monte Carlo design by focusing on three specifications for the distribution of the factor loadings, chosen on the basis of the existing literature. However, in conjunction with the analytical results above, we aim to draw rather more general inferences about the properties of the estimators. Notice that we are only interested in the small-T, large-N case, i.e. samples where these estimators are routinely applied by practitioners to estimate dynamic panel data models.

5.1. Experimental design

The underlying data generating process is given by

y_it = λ y_{it−1} + α_i + u_it, u_it = φ_i f_t + ε_it, (5.1)
where α_i, ε_it and f_t are drawn in each replication from i.i.d. N(0, σ_α²), i.i.d. N(0, σ_ε²) and i.i.d. N(0, 1), respectively. To control the degree and heterogeneity of error cross-sectional dependence, we follow closely Phillips and Sul (2003) and consider three specifications for the distribution of the factor loadings – namely, φ_i = 0, φ_i ∼ i.i.d.U[0.5, 2.1] and φ_i ∼ i.i.d.U[1, 4]. The last two are chosen as examples of medium and high cross-sectional dependence, corresponding to an average error correlation coefficient that is roughly equal to 0.55 and 0.80, respectively. 16

Following Kiviet (1995) and Bun and Kiviet (2006), we choose σ_α² such that the impact of the two error components, α_i and u_it, on var(y_it) is held constant across different experiments. This is because the performance of the GMM estimators depends on the ratio σ_α²/σ_u², and therefore, as the level of cross-sectional dependence rises, the impact of α_i on var(y_it) will tend to fall, making comparisons across experiments with different levels of cross-sectional dependence invalid. Hence, noticing that

var(y_it) = σ_α²/(1 − λ)² + var(Σ_{j=0}^∞ λ^j u_{it−j}) = σ_α²/(1 − λ)² + σ_u²/(1 − λ²) = (ψ² + 1) σ_u²/(1 − λ²),

where ψ² = [σ_α²/(1 − λ)²] / [σ_u²/(1 − λ²)], we choose σ_α² by setting

σ_α² = [(1 − λ)/(1 + λ)] ψ² σ_u² = [(1 − λ)/(1 + λ)] ψ² [var(φ_i f_t) + var(ε_it)],

with var(φ_i f_t) = [E(f_t)]² σ_φ² + [E(φ_i)]² σ_f² + σ_φ² σ_f².
16 Phillips and Sul (2003) set φ_i ∼ i.i.d.U[0, 0.2] and φ_i ∼ i.i.d.U[1, 4] as examples of low and high error cross-sectional dependence, respectively. Therefore, the bounds we choose for medium cross-sectional dependence are the average values of the bounds of these two specifications.
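The variance normalization above is simple to implement. In the sketch below (function name mine), E(f_t) = 0 and σ_f² = 1 as in the design, so var(φ_i f_t) = (μ_φ² + σ_φ²)σ_f²; uniform loadings U[a, b] give μ_φ = (a + b)/2 and σ_φ² = (b − a)²/12.

```python
def sigma2_alpha(lam, psi, mu_phi, var_phi, var_eps=1.0, var_f=1.0):
    """sigma_alpha^2 that holds psi^2 = [s_a^2/(1-lam)^2] / [s_u^2/(1-lam^2)] fixed."""
    var_u = (mu_phi**2 + var_phi) * var_f + var_eps    # var(phi_i f_t) + var(eps_it)
    return (1 - lam) / (1 + lam) * psi**2 * var_u

# Medium dependence, phi_i ~ U[0.5, 2.1]: mu_phi = 1.3, var_phi = 1.6**2 / 12
print(sigma2_alpha(lam=0.4, psi=1.0, mu_phi=1.3, var_phi=1.6**2 / 12))
```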
Normalizing σ_ε² to the value of one and given that E(f_t) = 0, the following result is obtained:

σ_α² = [(1 − λ)/(1 + λ)] ψ² (μ_φ² + σ_φ² + 1).

We consider N = 100, 400 and T = 6, 10, since the focus of the analysis is T fixed, N large. λ alternates between 0.4 and 0.8 and ψ is set equal to one. The initial value of y_it is set equal to zero and the first 50 observations are discarded before choosing our sample, so as to ensure that the initial conditions do not have an impact on the results. 2000 replications are performed in each experiment.

5.2. Results

Since the IV estimator has no finite moments, Table A1 in the Appendix reports median bias and root median square error (denoted RMedSE) for all estimators. The latter is defined as

RMedSE = √(median (λ̂_r − λ)²),

where λ̂_r is an estimate of λ in the rth draw. λ̂_FE, λ̂_IV, λ̂_DIF and λ̂_SYS denote the FE, IV, DIF GMM and SYS GMM estimators, respectively, while λ̃_FE, λ̃_IV, λ̃_DIF and λ̃_SYS denote the corresponding estimators operated on time-specific demeaned data. The GMM estimators are estimated in two steps and use the second and third lags of the dependent variable (in levels) as instruments for the endogenous regressor in the first-differenced equations. 17 We can see that with zero error cross-sectional dependence, the median bias of λ̂_IV, λ̂_DIF and λ̂_SYS is small for λ = 0.4 and increases for λ = 0.8, especially for λ̂_IV and λ̂_DIF. Regardless of this, all estimators perform better than λ̂_FE with respect to both bias and RMedSE. Also, λ̂_DIF and λ̂_SYS outperform λ̂_IV; they perform similarly to each other for λ = 0.4 but not for λ = 0.8, in which case λ̂_SYS does somewhat better. 18 Notice that transforming the data in terms of deviations from time-specific averages, even when this is redundant, has no adverse effects on either median bias or RMedSE for any estimator.

With error cross-sectional dependence, the situation changes considerably. First, the results suggest that all estimators – without exception – experience a large increase in bias and RMedSE, regardless of the size of N, T and the value of λ. The direction of the bias appears to be negative, which confirms the analysis provided below (3.8). Secondly, in plenty of circumstances the downward bias in λ̂_IV and λ̂_DIF can be so severe that these estimators are consistently outperformed by λ̂_FE with respect to RMedSE.

The impact of time-specific demeaning on estimation differs, depending on the variance of the factor loadings. When σ_φ² is not large, the transformation is effective in reducing bias and RMedSE considerably for all estimators. Naturally, the performance of the estimators improves with the size of N and T, as expected. On the other hand, as σ_φ² gets larger, the effectiveness of time-specific demeaning decreases, although it still reduces noticeably the bias and RMedSE for all estimators compared to the case where the data have not been transformed. λ̃_IV and λ̃_DIF suffer from severe bias, especially when λ = 0.8 and σ_φ² is large, which tends to decrease slightly as N grows.

17 Furthermore, SYS GMM uses the optimal weighting matrix (when σ_α² = 0), as derived in Windmeijer (2000).
18 However, notice that our Monte Carlo does not consider data series where the initial conditions deviate from mean stationarity. In this case, SYS GMM is not consistent while DIF GMM remains so. Our design is also specific in that it imposes ψ = 1. See Bun and Kiviet (2006) and Kiviet (2007).
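The evaluation criteria translate directly into code; the two helpers below are a sketch of the definitions just given, with names of my choosing.

```python
import numpy as np

def median_bias(estimates, lam):
    """Median of lam_hat_r minus the true lam."""
    return np.median(np.asarray(estimates)) - lam

def rmedse(estimates, lam):
    """Root median square error: sqrt(median((lam_hat_r - lam)^2))."""
    e = np.asarray(estimates)
    return np.sqrt(np.median((e - lam) ** 2))
```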
Table 1. Comparison between RB(λ̃_IV) and the finite sample results (T = 6, λ = 0.4).

                              RB(λ̃_IV)   |λ̃_IV−λ|/|λ̂_IV−λ|   |λ̃_DIF−λ|/|λ̂_DIF−λ|   |λ̃_SYS−λ|/|λ̂_SYS−λ|
N = 100
  φ_i ∼ i.i.d.U[0.5, 2.1]       0.268          0.247                0.235                  0.113
  φ_i ∼ i.i.d.U[1, 4]           0.489          0.294                0.445                  0.110
N = 400
  φ_i ∼ i.i.d.U[0.5, 2.1]       0.268          0.214                0.127                  0.026
  φ_i ∼ i.i.d.U[1, 4]           0.489          0.265                0.363                  0.057
λ̃_SYS performs comparatively better than the other estimators in all cases. Table 1 evaluates RB(λ̃_IV), the relative asymptotic bias of λ̃_IV compared to λ̂_IV as defined in (4.5), for a subset of the parameters specified in our Monte Carlo design, and compares this with the finite sample results that we have obtained for all estimators. 19 As we can see, for IV and DIF GMM there is a common expected pattern, which is that as σ_φ² increases, the relative median bias of the estimators performed on time-specific demeaned data increases, although the actual results are somewhat better than what is implied by (4.5). 20 For the system GMM estimator, the change in relative bias is not as dramatic.
6. CONCLUDING REMARKS

This paper has analysed the impact of error cross-sectional dependence on a number of widely used IV and GMM estimators in the context of a linear dynamic panel data model with fixed T. It has been demonstrated that estimators relying on standard instruments with respect to lagged values of the dependent variable (either in levels or in first-differences) are inconsistent as N → ∞ for fixed T. This result persists for any lag length of the instruments used, and it is an important outcome given that error cross-sectional dependence is likely to arise in many empirical applications. Transforming the data in terms of deviations from time-specific averages is shown to have a favourable effect on bias and RMedSE when there is error cross-sectional dependence, whilst it has no adverse effect otherwise. This provides a formal justification for using common time effects when estimating short dynamic panels based on method of moments estimators, even in those cases where dealing with cross-sectional dependence does not seem to be a priority or particularly relevant for the empirical researcher.
ACKNOWLEDGMENTS

This paper has benefited substantially from the comments and suggestions of two anonymous referees and a Co-Editor, Frank Windmeijer. We would also like to thank Richard Gerlach, Jan Kiviet, Daniel Oron, Vladimir Smirnov and Takashi Yamagata for helpful discussions. All remaining errors are our own. The first author gratefully acknowledges full financial support from the ESRC during his Ph.D. at Cambridge University (PTA-030-2002-00328).
19 For instance, |λ̃_IV − λ|/|λ̂_IV − λ| is the modulus of the ratio between the median bias of λ̃_IV and the median bias of λ̂_IV, and similarly for the other estimators.
20 Qualitatively similar conclusions can be drawn for λ = 0.8, although this time the results are somewhat worse than implied by (4.5).
REFERENCES

Ahn, S. C., Y. H. Lee and P. Schmidt (2001). GMM estimation of linear panel data models with time-varying individual effects. Journal of Econometrics 101, 219–55.
Ahn, S. C. and P. Schmidt (1995). Efficient estimation of models for dynamic panel data. Journal of Econometrics 68, 5–28.
Alonso-Borrego, C. and M. Arellano (1999). Symmetrically normalized instrumental-variable estimation using panel data. Journal of Business and Economic Statistics 17, 36–49.
Anderson, T. W. and C. Hsiao (1981). Estimation of dynamic models with error components. Journal of the American Statistical Association 76, 598–606.
Arellano, M. (1989). A note on the Anderson–Hsiao estimator for panel data. Economics Letters 31, 337–41.
Arellano, M. and S. Bond (1991). Some tests of specification for panel data: Monte Carlo evidence and an application to employment equations. Review of Economic Studies 58, 277–97.
Arellano, M. and O. Bover (1995). Another look at the instrumental variable estimation of error-component models. Journal of Econometrics 68, 29–51.
Bai, J. (2006). Panel data models with interactive fixed effects. Working paper, New York University.
Barro, R. and X. Sala-i-Martin (1992). Convergence. Journal of Political Economy 100, 223–51.
Blundell, R. and S. Bond (1998). Initial conditions and moment restrictions in dynamic panel data models. Journal of Econometrics 87, 115–43.
Blundell, R., S. Bond and F. Windmeijer (2000). Estimation in dynamic panel data models: improving on the performance of the standard GMM estimators. In B. Baltagi (Ed.), Nonstationary Panels, Panel Cointegration, and Dynamic Panels. Advances in Econometrics, Volume 15, 53–91. New York: JAI Press, Elsevier Science.
Bond, S., A. Hoeffler and J. Temple (2001). GMM estimation of empirical growth models. Oxford Economic Papers 2001-W21, Nuffield College, University of Oxford.
Bover, O. and N. Watson (2005). Are there economies of scale in the demand for money by firms? Some panel data estimates. Journal of Monetary Economics 52, 1569–89.
Bun, M. J. G. and J. Kiviet (2006). The effects of dynamic feedbacks on LS and MM estimator accuracy in panel data models. Journal of Econometrics 132, 409–44.
Coakley, J., A. Fuertes and R. Smith (2002). A principal components approach to cross-section dependence in panels. Working paper, Birkbeck College, University of London.
Holtz-Eakin, D., W. Newey and H. Rosen (1988). Estimating vector autoregressions with panel data. Econometrica 56, 1371–96.
Kiviet, J. (1995). On bias, inconsistency, and efficiency of various estimators in dynamic panel data models. Journal of Econometrics 68, 53–78.
Kiviet, J. (2007). Judging contending estimators by simulation: tournaments in dynamic panel data models. In G. D. A. Phillips and E. Tzavalis (Eds.), The Refinement of Econometric Estimation and Test Procedures; Finite Sample and Asymptotic Analysis, 282–318. Cambridge, UK: Cambridge University Press.
Moon, H. R. and B. Perron (2004). Efficient estimation of the SUR cointegrating regression model and testing for purchasing power parity. Econometric Reviews 23, 293–323.
Nerlove, M. (1967). Experimental evidence on the estimation of dynamic economic relations from a time series of cross-sections. Economic Studies Quarterly 18, 42–74.
Nerlove, M. (1971). Further evidence on the estimation of dynamic economic relations from a time series of cross-sections. Econometrica 39, 359–87.
Nickell, S. (1981). Biases in dynamic models with fixed effects. Econometrica 49, 1417–26.
Pesaran, M. H. (2006). Estimation and inference in large heterogeneous panels with a multifactor error structure. Econometrica 74, 967–1012.
Phillips, P. C. B. and D. Sul (2003). Dynamic panel estimation and homogeneity testing under cross-sectional dependence. Econometrics Journal 6, 217–59.
Phillips, P. C. B. and D. Sul (2007). Bias in dynamic panel estimation with fixed effects, incidental trends and cross-sectional dependence. Journal of Econometrics 137, 162–88.
Robertson, D. and J. Symons (2000). Factor residuals in SUR regressions: estimating panels allowing for cross-sectional correlation. Working paper, Faculty of Economics and Politics, University of Cambridge.
Sarafidis, V., T. Yamagata and D. Robertson (2008). A test of error cross-section dependence for a linear dynamic panel model with regressors. Working paper, Faculty of Economics and Politics, University of Cambridge.
Windmeijer, F. (2000). Efficiency comparisons for a system GMM estimator in dynamic panel data models. In R. D. H. Heijmans, D. S. G. Pollock and A. Satorra (Eds.), Innovations in Multivariate Statistical Analysis. Dordrecht: Kluwer Academic Publishers.
APPENDIX A: PROOFS OF RESULTS

Proof of Proposition 3.1: For the moment conditions used by DIF GMM, we have

E(y_{it−s} Δu_it | {f_n}_{−∞}^t)
= E[(α_i/(1 − λ) + φ_i′ Σ_{j=0}^∞ λ^j f_{t−s−j} + Σ_{j=0}^∞ λ^j ε_{it−s−j}) (φ_i′ Δf_t + Δε_it) | {f_n}_{−∞}^t]
= E[Δf_t′ φ_i φ_i′ Σ_{j=0}^∞ λ^j f_{t−s−j} | {f_n}_{−∞}^t]
= Δf_t′ (Σ_φ + μ_φ μ_φ′) w_{t−s} ≠ 0, (A.1)

where Σ_φ + μ_φ μ_φ′ = plim_{N→∞} (1/N) Σ_{i=1}^N φ_i φ_i′ and w_{t−s} = Σ_{j=0}^∞ λ^j f_{t−s−j}. For the additional moment conditions of SYS GMM we have

E(Δy_{it−1} υ_it | {f_n}_{−∞}^t)
= E[(φ_i′ Σ_{j=0}^∞ λ^j Δf_{t−1−j} + Σ_{j=0}^∞ λ^j Δε_{it−1−j}) (α_i + φ_i′ f_t + ε_it) | {f_n}_{−∞}^t]
= E[f_t′ φ_i φ_i′ Σ_{j=0}^∞ λ^j Δf_{t−1−j} | {f_n}_{−∞}^t]
= f_t′ (Σ_φ + μ_φ μ_φ′) Δw_{t−1} ≠ 0, (A.2)

where Δw_{t−1} = Σ_{j=0}^∞ λ^j Δf_{t−1−j}.

Proof of (3.6): The derivation of η_NT in (3.6) follows directly from (A.1) by replacing conditional expectations with plims and setting s = 2.
Table A1. Median point estimates of λ and root median square error (in parentheses) of the FE, IV, DIF GMM and SYS GMM estimators and their time-specific demeaned counterparts, for λ = 0.4 and λ = 0.8; N = 100, 400; T = 6, 10; under error cross-sectional independence (φ_i = 0), medium cross-sectional dependence (φ_i ∼ i.i.d.U[0.5, 2.1]) and high cross-sectional dependence (φ_i ∼ i.i.d.U[1, 4]). [The individual table entries are not recoverable from this extraction.]
For η_DT we have

plim_{N→∞} (1/N) Σ_{i=1}^N Σ_{t=3}^T y_{it−2} Δy_{it−1}
= plim_{N→∞} (1/N) Σ_{i=1}^N Σ_{t=3}^T {[α_i/(1 − λ) + φ_i′ Σ_{j=0}^∞ λ^j f_{t−2−j} + Σ_{j=0}^∞ λ^j ε_{it−2−j}] · [φ_i′ Σ_{j=0}^∞ λ^j Δf_{t−1−j} + Σ_{j=0}^∞ λ^j Δε_{it−1−j}]}
= plim_{N→∞} (1/N) Σ_{i=1}^N Σ_{t=3}^T [(Σ_{j=0}^∞ λ^j Δf_{t−1−j})′ φ_i φ_i′ (Σ_{j=0}^∞ λ^j f_{t−2−j})]
  + plim_{N→∞} (1/N) Σ_{i=1}^N Σ_{t=3}^T [(Σ_{j=0}^∞ λ^j ε_{it−2−j}) (Σ_{j=0}^∞ λ^j Δε_{it−1−j})]
= Σ_{t=3}^T Δw_{t−1}′ (Σ_φ + μ_φ μ_φ′) w_{t−2} − (T − 2)σ_ε²/(1 + λ), (A.3)

where Δw_{t−1} = Σ_{j=0}^∞ λ^j Δf_{t−1−j}. Then (3.6) follows directly by multiplying η_NT with the inverse of η_DT − (T − 2)σ_ε²/(1 + λ).

Proof of (3.8): First, given Assumption 2.4, we have E(κ₁) = E[Σ_{t=3}^T w_{t−2}(f_t − f_{t−1})] = 0 and

E(κ₂) = E[Σ_{t=3}^T w_{t−1} w_{t−2}] − E[Σ_{t=3}^T (w_{t−2})²]
= (T − 2) λ/(1 − λ²) − (T − 2)/(1 − λ²) = −(T − 2)/(1 + λ) < 0. (A.4)
Therefore, the covariance between κ₁ and κ₂ equals

cov(κ₁, κ₂) = E(κ₁ κ₂) = E[Σ_{t=3}^T w_{t−2} f_t · Σ_{t=3}^T w_{t−1} w_{t−2}] − E[Σ_{t=3}^T w_{t−2} f_t · Σ_{t=3}^T (w_{t−2})²]
− E[Σ_{t=3}^T w_{t−2} f_{t−1} · Σ_{t=3}^T w_{t−1} w_{t−2}] + E[Σ_{t=3}^T w_{t−2} f_{t−1} · Σ_{t=3}^T (w_{t−2})²]. (A.5)

It is straightforward to show that the individual elements of (A.5) are equal to the following terms:

E[Σ_{t=3}^T w_{t−2} f_t · Σ_{t=3}^T w_{t−1} w_{t−2}] = [λ/(1 − λ²)²][Σ_{j=1}^{T−4} λ^{2j}(2T − 6 − 2j) + (T − 3)] for T ≥ 3, and 0 otherwise;

−E[Σ_{t=3}^T w_{t−2} f_t · Σ_{t=3}^T (w_{t−2})²] = −[2λ²/(1 − λ²)²][Σ_{j=1}^{T−5} λ^{2j}(T − 4 − j) + (T − 4)] for T ≥ 5, and 0 otherwise;

−E[Σ_{t=3}^T w_{t−2} f_{t−1} · Σ_{t=3}^T w_{t−1} w_{t−2}] = −[1/(1 − λ²)²][Σ_{j=1}^{T−3} λ^{2j}(2T − 4 − 2j) + (T − 2)] for T ≥ 3, and 0 otherwise;

E[Σ_{t=3}^T w_{t−2} f_{t−1} · Σ_{t=3}^T (w_{t−2})²] = [2λ/(1 − λ²)²][Σ_{j=1}^{T−4} λ^{2j}(T − 3 − j) + (T − 3)] for T ≥ 4, and 0 otherwise. (A.6)
Hence, the covariance between κ₁ and κ₂ will be negative provided that

F(T) = λ Σ_{j=1}^{T−4} λ^{2j}(2T − 6 − 2j) − 2λ² Σ_{j=1}^{T−5} λ^{2j}(T − 4 − j) − Σ_{j=1}^{T−3} λ^{2j}(2T − 4 − 2j)
+ 2λ Σ_{j=1}^{T−4} λ^{2j}(T − 3 − j) − 2λ²(T − 4) + 3λ(T − 3) < (T − 2), for T ≥ 4. (A.7)
(A.7) holds immediately for T = 3 and T = 4, because F(3) = 0 < (T − 2) = 1 and F(4) = −2λ² + 3λ < 2 ⇔ λ/(1 + λ²) < 2/3, which is true for all λ ∈ R. Below we demonstrate by induction that this result holds for any T ≥ 5. In particular, let us assume that for some T ≥ 5 the following is true:

F(T) < T − 2. (A.8)

We need to prove that (A.8) holds for T + 1 as well. Notice first that (A.8) can also be written as

F(T) = λ² F(T − 1) + λ³(4T − 16) + λ(3T − 9) − λ³(3T − 12) − λ²(4T − 14) < T − 2.

Therefore, we obtain

F(T + 1) = λ² F(T) + λ³(4T − 12) + λ(3T − 6) − λ³(3T − 9) − λ²(4T − 10)
= λ² F(T) + λ³(T − 3) + λ(3T − 6) − λ²(4T − 10) < T − 1. (A.9)

Since (λ − 1)³ < 0 and (2 − λ)(1 − λ) > 0, we have

(λ − 1)³(T − 3) < (2 − λ)(1 − λ)
⇒ (λ³ − 3λ² + 3λ)(T − 3) < (T − 3) + (2 − λ)(1 − λ) = T − 1 − 3λ + λ²
⇒ (λ³ − 3λ² + 3λ)(T − 3) + 3λ − λ² < T − 1
⇒ λ³(T − 3) − λ²(3T − 8) + λ(3T − 6) < T − 1
⇒ λ²(T − 2) + λ³(T − 3) + λ(3T − 6) − λ²(4T − 10) < T − 1. (A.10)
The last inequality in (A.10) is very similar to (A.9), with the only difference being that (T − 2) replaces F(T) in the first term. But since F(T) < T − 2, (A.9) must also be true. This proves that for any T ≥ 5, cov(κ₁, κ₂) < 0.

Proof of (4.3): For η̃_NT in (4.3) we get the following expression:

plim_{N→∞} (1/N) Σ_{i=1}^N Σ_{t=3}^T (y_{it−2} − ȳ_{t−2})[(φ_i − φ̄)′ Δf_t + Δ(ε_it − ε̄_t)]
= plim_{N→∞} (1/N) Σ_{i=1}^N Σ_{t=3}^T {[(α_i − ᾱ)/(1 − λ) + (φ_i − φ̄)′ Σ_{j=0}^∞ λ^j f_{t−2−j} + Σ_{j=0}^∞ λ^j (ε_{it−2−j} − ε̄_{t−2−j})] · [(φ_i − φ̄)′ Δf_t + Δ(ε_it − ε̄_t)]}
= Σ_{t=3}^T Δf_t′ Σ_φ w_{t−2}, (A.11)

where Σ_φ = plim_{N→∞} (1/N) Σ_{i=1}^N (φ_i − φ̄)(φ_i − φ̄)′ and w_{t−2} has been defined above. For η̃_DT we have

plim_{N→∞} (1/N) Σ_{i=1}^N Σ_{t=3}^T (y_{it−2} − ȳ_{t−2}) Δ(y_{it−1} − ȳ_{t−1})
= plim_{N→∞} (1/N) Σ_{i=1}^N Σ_{t=3}^T {[(α_i − ᾱ)/(1 − λ) + (φ_i − φ̄)′ Σ_{j=0}^∞ λ^j f_{t−2−j} + Σ_{j=0}^∞ λ^j (ε_{it−2−j} − ε̄_{t−2−j})] · [(φ_i − φ̄)′ Σ_{j=0}^∞ λ^j Δf_{t−1−j} + Σ_{j=0}^∞ λ^j Δ(ε_{it−1−j} − ε̄_{t−1−j})]}
= plim_{N→∞} (1/N) Σ_{i=1}^N Σ_{t=3}^T [(Σ_{j=0}^∞ λ^j Δf_{t−1−j})′ (φ_i − φ̄)(φ_i − φ̄)′ (Σ_{j=0}^∞ λ^j f_{t−2−j})]
  + plim_{N→∞} (1/N) Σ_{i=1}^N Σ_{t=3}^T [(Σ_{j=0}^∞ λ^j (ε_{it−2−j} − ε̄_{t−2−j})) (Σ_{j=0}^∞ λ^j Δ(ε_{it−1−j} − ε̄_{t−1−j}))]
= Σ_{t=3}^T Δw_{t−1}′ Σ_φ w_{t−2} − (T − 2)σ_ε²/(1 + λ). (A.12)
Then, (4.3) follows directly by multiplying (A.11) with the inverse of (A.12).

Proof of Proposition 4.1: Essentially, what we need to show is that

|σ_φ² / [σ_φ² κ₂ − (T − 2)σ_ε²/(1 + λ)]| < |(σ_φ² + μ_φ²) / [(σ_φ² + μ_φ²) κ₂ − (T − 2)σ_ε²/(1 + λ)]|. (A.13)

For this, we need to consider three cases for the value of κ₂, and we will naturally assume that T ≥ 3. Whenever κ₂ < [(T − 2)/(1 + λ)]σ_ε²/(σ_φ² + μ_φ²), both denominators in (A.13) are negative and the inequality holds for any such κ₂ (and κ₂ < B is automatically satisfied, since B exceeds this threshold). Whenever κ₂ ≥ [(T − 2)/(1 + λ)]σ_ε²/σ_φ², both denominators are positive and (A.13) fails (and indeed κ₂ > B, since B < [(T − 2)/(1 + λ)]σ_ε²/σ_φ²). The binding case is

[(T − 2)/(1 + λ)]σ_ε²/(σ_φ² + μ_φ²) < κ₂ < [(T − 2)/(1 + λ)]σ_ε²/σ_φ².

In this case, (A.13) reduces to

σ_φ² / [(T − 2)σ_ε²/(1 + λ) − σ_φ² κ₂] < (σ_φ² + μ_φ²) / [(σ_φ² + μ_φ²) κ₂ − (T − 2)σ_ε²/(1 + λ)]
⇒ (σ_φ² + μ_φ²) σ_φ² κ₂ − (T − 2)σ_ε² σ_φ²/(1 + λ) < (T − 2)σ_ε² (σ_φ² + μ_φ²)/(1 + λ) − (σ_φ² + μ_φ²) σ_φ² κ₂
⇒ 2(σ_φ² + μ_φ²) σ_φ² κ₂ < (2σ_φ² + μ_φ²)(T − 2)σ_ε²/(1 + λ)
⇒ κ₂ < [(2σ_φ² + μ_φ²)(T − 2)σ_ε²] / [2(σ_φ² + μ_φ²) σ_φ² (1 + λ)]. (A.16)
This provides the complete proof of Proposition 4.1.
The Econometrics Journal (2009), volume 12, pp. 82–104. doi: 10.1111/j.1368-423X.2008.00277.x

Value at Risk with time varying variance, skewness and kurtosis—the NIG-ACD model

ANDERS WILHELMSSON†

†Department of Economics, Lund University, P.O. Box 7082, S-220 07 Lund, Sweden
E-mail: [email protected]

First version received: October 2007; final version accepted: November 2008
Summary A new model for financial returns with time varying variance, skewness and kurtosis based on the Normal Inverse Gaussian (NIG) distribution is proposed. The new model and two previously suggested NIG models are evaluated by their Value at Risk (VaR) forecasts on a long series of daily Standard and Poor’s 500 returns. All three models perform very well compared with extant models and clearly outperform a Gaussian GARCH model. Moreover, the results show that only the new model cannot be rejected as providing correct conditional VaR forecasts. Keywords: GARCH, Normal inverse Gaussian distribution, Time varying kurtosis, Time varying skewness, Value at Risk.
1. INTRODUCTION

Realistic modelling of financial time series is of utmost importance in asset pricing and risk management. Empirical ‘facts’ for equity returns that should be accounted for include skewed leptokurtic return distributions and dependence in second moments. The second moment dependence and, to some extent, the leptokurtosis are addressed in the seminal article of Engle (1982). Among the models that account for the excess kurtosis not captured by the Gaussian GARCH (GARCH-n) model is the model of Barndorff-Nielsen (1997) based on the Normal inverse Gaussian (NIG) distribution. This distribution, in addition to having nice analytical properties, can be theoretically motivated from the mixture of distribution hypothesis of Clark (1973). Extensions of Barndorff-Nielsen’s model that allow for complex dynamics in the variance equation have been proposed by Andersson (2001), Jensen and Lunde (2001), as well as Forsberg and Bollerslev (2002). Recent studies by, for example, Harvey and Siddique (1999, 2000) indicate that there is also dependence in the conditional skewness and possibly in the kurtosis of stock returns. Alternative models for conditional skewness and/or kurtosis are proposed by, among others, Hansen (1994), Harvey and Siddique (1999), Guermat and Harris (2002), Mittnik and Paolella (2003), Brännäs and Nordman (2003a,b), Ñíguez and Perote (2004), as well as Lanne and Saikkonen (2007). These contributions motivate my extension of the Jensen and Lunde (2001) model to comprise not only time dependence in the conditional variance but also a time-dependent conditional skewness and kurtosis.

The model proposed in this study has several advantages over previous models. The parameters that govern the shape of the distribution need not be restricted as in Hansen (1994).
Both the skewness and kurtosis are time varying, whereas in Harvey and Siddique (1999) and Lanne and Saikkonen (2007) only the skewness, and in Guermat and Harris (2002) only the kurtosis, is allowed to vary over time. The model has a closed form likelihood, making estimation easy, in contrast to Mittnik and Paolella (2003) and Ñíguez and Perote (2004), where the likelihood lacks an analytical expression. The attainable combinations of skewness and kurtosis are much more flexible than in Brännäs and Nordman (2003a,b). Furthermore, the NIG distribution can be motivated from economic theory and is in that sense not an ad hoc choice, such as the Student’s t distribution. In the initial estimation sample of daily Standard and Poor’s 500 returns ranging from July 3, 1962 to July 11, 1974, the new model shows a dramatic improvement in terms of in-sample fit.

With the ongoing adoption of Basel II, which allows banks to use internal Value at Risk (VaR) models for the purpose of regulating capital requirements, there is much practical as well as academic interest in this measure. VaR is the maximum loss expected to be incurred over a certain time period (h) with a given probability α. The new model as well as the models of Jensen and Lunde (2001), Hansen (1994) and Forsberg and Bollerslev (2002) are applied to compute VaR forecasts each day from July 12, 1974 to September 20, 2005, giving 7879 forecasts, each based on the 3000 latest observations.

Since the capital requirements for a bank are directly affected by the number of VaR exceptions, i.e. the number of occasions when the actual loss is larger than predicted by the VaR model, evaluating VaR models by their ability to produce a correct number of exceptions (correct unconditional coverage) seems natural (see the sketch at the end of this section). However, Christoffersen (1998) points out that the exceptions from a correctly specified model should also be independently distributed over time. Using the terminology of Christoffersen (1998), a VaR model that has the correct number of exceptions, which are also independent, is said to have correct conditional coverage. The VaR forecasts in this study are therefore evaluated both by their conditional and unconditional coverage.

I find that the models based on the NIG distribution perform very well, with an unconditional coverage that cannot be rejected as incorrect for any of the six VaR levels evaluated. The NIG-autoregressive conditional density (NIG-ACD) model proposed in this study, as well as the models of Jensen and Lunde (2001) and Forsberg and Bollerslev (2002), are in the green zone as defined in BASEL (1996), whereas the ACD model of Hansen (1994) is in the yellow zone and a GARCH-n model is in the red zone. The green zone means that no additional capital requirements are necessary. The capital requirements given by a model in the red zone have to be scaled upwards, and measures to improve the model must be taken immediately. Further, the NIG-ACD model is the only model that cannot be rejected as providing correct conditional forecasts for any of the six VaR levels investigated.

The rest of this article is structured as follows. Section 2 presents the theoretical and empirical motivation for modelling financial returns using the NIG distribution. Section 3 presents the model, whereas Section 4 describes the data and estimation. Section 5 gives a brief introduction to VaR and backtesting of VaR models. The results are presented in Section 6. Section 7 summarizes and discusses the findings.
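As a concrete illustration of the unconditional-coverage idea mentioned above, the sketch below implements a Kupiec-style likelihood-ratio test of whether the observed number of exceptions matches n·α. This is a standard construction and an assumption on my part, not necessarily the exact test statistic used in the paper; Christoffersen's conditional-coverage test adds a Markov independence component on top of it.

```python
import numpy as np
from scipy.stats import chi2

def lr_unconditional(x, n, alpha):
    """LR test of H0: exception probability = alpha, given x exceptions in n days.

    Assumes 0 < x < n so the logs are finite; returns the statistic and p-value.
    """
    pi_hat = x / n
    ll0 = x * np.log(alpha) + (n - x) * np.log(1 - alpha)
    ll1 = x * np.log(pi_hat) + (n - x) * np.log(1 - pi_hat)
    lr = -2.0 * (ll0 - ll1)
    return lr, chi2.sf(lr, df=1)

# e.g. 95 exceptions over 7879 days at the 1% VaR level (78.8 expected):
print(lr_unconditional(95, 7879, 0.01))
```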
2. THEORETICAL AND EMPIRICAL MOTIVATIONS FOR THE NIG DISTRIBUTION

To capture the characteristics of financial returns such as non-normality, conditional heteroscedasticity (Mandelbrot, 1963, and Fama, 1965) and leverage effects for stocks (Black,
1976), a vast number of models has been proposed. Amongst the most successful are the GARCH-type models; for a review see Bollerslev, Engle and Nelson (1994) or the collection of articles in Engle (1995). Previous research shows these models to capture the persistence in volatility well. They also capture some, but not all, of the excess kurtosis in the data. To remedy this problem, alternative error distributions have been proposed. Among these are the Student’s t (Bollerslev, 1987), the generalized error distribution (Nelson, 1991) and the skewed Student’s t (Hansen, 1994). The effect of different error distributions on estimation efficiency has recently been investigated in a simulation setting by Venter and de Jongh (2004), and their results favoured the NIG distribution for most of the data generation processes used.

Since the number of possible distributions to choose from is very large, and since results are also dependent on the specification of the mean and variance equations, the number of possible combinations is daunting. This is true even if we restrict ourselves to the GARCH class of models. For example, Hansen and Lunde (2005) examine 330 different model specifications; despite this impressive number of models, their study is far from being exhaustive. An alternative to an empirical or simulation based hunt for the best distribution is needed. The current study pursues the use of a distribution that, in addition to capturing the salient features of the data, can be motivated from economic theory.

Consider the time t price of a financial asset, for example, a stock price, denoted by P_t, whose continuously compounded return over the unit time interval is given by r_t = log(P_t/P_{t−1}), assuming possible dividends are added to the price. The mixture of distribution hypothesis (MDH) of Clark (1973) states that the conditional distribution of r_t given a latent information arrival process (directing process) is normal. Traditionally the directing process has been assumed to follow a lognormal distribution, resulting in a lognormal normal mixture distribution for the returns, which unfortunately cannot be written in closed form. 1 Instead, I follow Barndorff-Nielsen (1997) and assume the conditional mixing distribution, which is the distribution of the directing process, to be the inverse Gaussian. That is, σ_t²|Ω_{t−1} ∼ IG(δ, γ), with Ω_t being the information set up to and including time t information. Forsberg (2002) tests this assumption empirically on realized variance calculated from 5 minute ECU/USD data and finds that the inverse Gaussian distribution provides an even better fit for both the conditional and unconditional variance than the lognormal distribution. The density function of an IG-distributed variable x is given by

f(x; δ, γ) = δ x^{−3/2} exp[δγ − (1/2)(δ² x^{−1} + γ² x)] / √(2π).

The results in Barndorff-Nielsen (1977, 1978) then give that the unconditional distribution of r_t must be NIG. In contrast to the lognormal normal mixture distribution, the density of the NIG distribution can be expressed in closed form using the Bessel function. Ease of estimation is thus greatly enhanced, and estimation can be done by straightforward (numerical) maximum likelihood. The density function of the NIG distribution is given by

f(x; α, β, μ, δ) = (α/π) exp(δ√(α² − β²) − βμ) q((x − μ)/δ)^{−1} K₁(δα q((x − μ)/δ)) exp(βx), (2.1)

with 0 ≤ |β| < α, δ > 0 and q(z) = √(1 + z²). K₁(·) is the modified Bessel function of the third kind with index one. α controls the kurtosis of the distribution and β the asymmetry.
The location and scale of the distribution are decided by μ and δ, respectively.

1 Clark (1973) assumed an i.i.d. lognormal distribution. Later, Taylor (1986) relaxed this assumption and let the variance, which proxies for the information arrival, follow an autoregression, resulting in the stochastic volatility model.
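Both the closed-form density (2.1) and the IG mixture construction are easy to code. The sketch below is mine, not the paper's: it uses SciPy's k1 for K₁(·) and NumPy's wald sampler, whose (mean, scale) parametrization corresponds to the IG(δ, γ) density above via mean = δ/γ and scale = δ².

```python
import numpy as np
from scipy.special import k1 as bessel_k1   # K_1, modified Bessel of the third kind

def nig_pdf(x, alpha, beta, mu, delta):
    """NIG density as in (2.1); requires 0 <= |beta| < alpha and delta > 0."""
    q = np.sqrt(1.0 + ((x - mu) / delta) ** 2)
    gam = np.sqrt(alpha**2 - beta**2)
    return (alpha / np.pi) * np.exp(delta * gam - beta * mu + beta * x) \
        * bessel_k1(delta * alpha * q) / q

def nig_sample(n, alpha, beta, mu, delta, seed=0):
    """Sample NIG via the MDH mixture: sigma2 ~ IG(delta, gam), x | sigma2 ~ N(mu + beta*sigma2, sigma2)."""
    rng = np.random.default_rng(seed)
    gam = np.sqrt(alpha**2 - beta**2)
    sigma2 = rng.wald(delta / gam, delta**2, size=n)    # inverse Gaussian mixing draw
    return mu + beta * sigma2 + np.sqrt(sigma2) * rng.standard_normal(n)
```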
Figure 1. Skewness-kurtosis bounds.
The attractive features of the NIG distribution include the ability to fit leptokurtic and skewed data combined with nice analytical properties. In particular, the NIG distribution is closed under convolution, for fixed values of α and β, meaning that if, for example, daily returns are NIG distributed then weekly returns will also be NIG distributed. Figure 1 illustrates what levels and combinations of skewness and kurtosis are attainable by the NIG distribution. The results in Jondeau and Rockinger (2003) are used to show that the skewness (μ₃) is always bounded for a given level of kurtosis (μ₄) by μ₃² < μ₄ − 1, assuming zero mean and unit variance. The results for the NIG distribution are compared with the generalized skewed t distribution of Hansen (1994). For the NIG distribution the bounds are found by setting β = 0.9999α and computing the skewness and kurtosis for a fine grid of values that corresponds to levels of kurtosis ranging from 3.01 to 30. The bounds for the generalized skewed t distribution are computed as in Jondeau and Rockinger (2003). As can be seen from Figure 1, the NIG distribution is generally more flexible in accommodating varying combinations of skewness and kurtosis than the generalized skewed t distribution, making it a strong candidate distribution for financial modelling.

The NIG distribution is a special case of the generalized hyperbolic distribution, which was introduced to the field of finance by Eberlein and Keller (1995) and Barndorff-Nielsen (1995). It has shown promising results for computing VaR in Eberlein, Keller and Prause (1998), Bauer (2000), Forsberg and Bollerslev (2002), as well as Venter and de Jongh (2002). For more details about the distribution, including the moment generating function, see Barndorff-Nielsen (1997) and references therein.
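Tracing the NIG boundary in Figure 1 amounts to evaluating the standard NIG moment formulas near the maximal-asymmetry edge. In the invariant parametrization used later (γ̄ = δ√(α² − β²), ρ = β/α), skewness equals 3ρ/√γ̄ and kurtosis equals 3 + 3(1 + 4ρ²)/γ̄; the short check below (my code, standard formulas) confirms that the universal bound μ₃² < μ₄ − 1 is respected.

```python
import numpy as np

def nig_skew_kurt(gammabar, rho):
    """Skewness and kurtosis from the invariant NIG parameters (standard formulas)."""
    skew = 3.0 * rho / np.sqrt(gammabar)
    kurt = 3.0 + 3.0 * (1.0 + 4.0 * rho**2) / gammabar
    return skew, kurt

for gb in (0.5, 1.0, 5.0, 20.0):
    s, k = nig_skew_kurt(gb, 0.9999)      # near the beta = 0.9999*alpha edge
    assert s**2 < k - 1.0                 # the bound from Jondeau and Rockinger (2003)
```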
3. PRESENTATION OF THE NIG-ACD MODEL

The discussion in the previous section showed the unconditional return distribution to be NIG according to the MDH if the directing process is inverse Gaussian. However, above I have
assumed that the mixing distribution is iid even though the conditional variance and possibly higher moments are time-varying. The contribution of the current study is to capture this effect by making the three parameters in the NIG distribution that govern the variance, skewness and kurtosis conditional on prior information. In previous models based on the NIG distribution (Andersson, 2001, Jensen and Lunde, 2001, and Forsberg and Bollerslev, 2002), only the variance is allowed to vary over time. To specify the dynamics of the variance, it is convenient to have it depend on a single parameter. This is done by using the location-scale invariant parametrization in Jensen and Lunde (2001), ᾱ = αδ, β̄ = βδ, resulting in the density
f_JL(x; ᾱ_t, β̄_t, μ, δ_t) = (ᾱ_t/(π δ_t)) exp(√(ᾱ_t² − β̄_t²) + β̄_t (x − μ)/δ_t) q((x − μ)/δ_t)^{−1} K₁(ᾱ_t q((x − μ)/δ_t)), (3.1)

with 0 ≤ |β̄_t| < ᾱ_t, δ_t > 0. Let γ̄_t = √(ᾱ_t² − β̄_t²) ∈ R₊ and ρ_t = β̄_t/ᾱ_t = β_t/α_t ∈ (−1, 1). Now specify the mean equation according to

r_t = μ + γ̄_t^{1/2} δ_t ρ_t + δ_t η_t, (3.2)

with δ_t η_t = ε_t, where the distribution of η_t is NIG(ᾱ_t, β̄_t, −γ̄_t^{1/2} ρ_t, γ̄_t^{3/2}/ᾱ_t). The t subscripts on the parameters are added to indicate parameters that can vary over time. The purpose of the above specification, used by Jensen and Lunde (2001), is that the mean and variance of η_t will equal 0 and 1, respectively. Moreover, the conditional mean and variance of the returns will be given by E(r_t|Ω_{t−1}) = δ_t γ̄_t^{1/2} ρ_t + μ and Var(r_t|Ω_{t−1}) = δ_t². The return in (3.2) can be divided into three parts: a constant mean μ, a compensation for time-varying volatility risk γ̄_t^{1/2} δ_t ρ_t, and a return innovation ε_t. The sign of the risk compensation is given by the ρ_t parameter, which, as pointed out in Lanne and Saikkonen (2007), is a limitation, since a positive compensation for time-varying volatility risk is expected, but to model negative skewness the ρ_t parameter must be negative. However, the specification in (3.2) is necessary to be able to model the conditional variance within the GARCH framework. Moreover, there is no theoretical reason why the risk compensation must be positive in an intertemporal setting, as pointed out in, e.g. Abel (1988) and Glosten, Jagannathan and Runkle (1993). In the NIG-ACD model, as in the ACD model of Hansen (1994), the rescaled innovations η_t = ε_t/δ_t are uncorrelated but not independent of each other, since higher order dependence is present. The conditional standard deviation, δ_t, is chosen to evolve according to the asymmetric power ARCH (APARCH) model of Ding, Granger, and Engle (1993):

δ_t^υ = c + b δ_{t−1}^υ + a(|ε_{t−1}| − τ ε_{t−1})^υ. (3.3)
So far the model is identical to the NIG-S&ARCH model of Jensen and Lunde (2001). The current study adds to the literature by adding time variation in skewness and kurtosis. Recent interest in conditional higher moments (see, inter alia, Harvey and Siddique, 1999, 2000, Dittmar, 2002, Guermat and Harris, 2002, as well as Christoffersen, Heston and Jacobs, 2006) motivates this extension. This is done by the steepness and asymmetry parameters, given in Barndorff-Nielsen and Prause (2001), ξ = (1 + γ̄)^{−1/2}, which is closely related to the kurtosis, and χ = ρξ, which is related to the skewness. The restriction 0 ≤ |β̄| < ᾱ makes the region for the attainable steepness and asymmetry a triangle in R² given by {(χ, ξ): 0 ≤ |χ| < ξ < 1}, which is
called the NIG shape triangle. I make the steepness and asymmetry of the distribution conditional on the data set according to

γ̄_t = exp(λ₀ + λ₁ ε²_{t−1} + λ₂ log(γ̄_{t−1})), (3.4)

ρ̃_t = θ₀ + θ₁ ε_{t−1}, (3.5)

with ρ̃_t = log((1 + ρ_t)/(1 − ρ_t)). The exponential form in (3.4) is used to guarantee that γ̄_t is positive without having to impose any restrictions on the estimated parameters λ₀, λ₁ and λ₂. Similarly, since ρ_t ∈ (−1, 1), ρ̃_t = log((1 + ρ_t)/(1 − ρ_t)) ∈ R, and so (3.5) can be estimated without any restrictions on the parameters θ₀ and θ₁. Equation (3.4) for the steepness can be seen as similar to the EGARCH model Nelson (1991) proposed for the variance, but (3.4) does not allow for different responses in steepness to positive and negative return innovations of the same magnitude. A specification for (3.5) that allows for more persistence, by adding ρ̃_{t−1} and ε³_{t−1} as explanatory variables, was tried but turned out to be insignificant. It should be mentioned that these other specifications were only tried in the in-sample estimation and not on the forecasting part of the sample.

The parameters γ̄_t and ρ̃_t are closely related, but not identical, to the kurtosis and skewness. The conditional skewness depends on both ρ̃_t and, to some degree, also on ᾱ_t. The conditional kurtosis depends on both γ̄_t and on the skewness parameter ρ̃_t. To investigate the effect of this, the conditional skewness is plotted in Figure 2 for ρ̃_t ∈ (−0.30, 0.30) × ᾱ_t ∈ {4, 11, 22}, a region that covers 98.3% of the empirical data values of ρ̃_t and ᾱ_t. As can be seen from Figure 2, the skewness is an increasing and slightly concave function of ρ̃_t. The effect of ᾱ_t on the skewness increases for small values of ᾱ_t but is still minor compared with the effect of ρ̃_t. Figure 3 shows the conditional kurtosis for γ̄_t ∈ (0.4, 40) × ρ̃_t ∈ {−0.30, −0.15, 0} (99.8% of the sample values are in this region). The kurtosis is shown to be a decreasing and concave function of γ̄_t, and the influence of the skewness parameter ρ̃_t on the kurtosis is very minor. In conclusion, γ̄_t and ρ̃_t do jointly determine kurtosis and skewness, but the influence of ρ̃_t on kurtosis and of ᾱ_t on skewness is small; in the remainder of the text, γ̄_t will be referred to as the kurtosis parameter and ρ̃_t as the skewness parameter.
Figure 2. Skewness as a function of ρ̃ and ᾱ (panels for ᾱ = 4, 11, 22).
Figure 3. Kurtosis as a function of γ̄ and ρ̃ (panels for ρ̃ = −0.30, −0.15, 0).
The influence of ρ̃_t on kurtosis and of ᾱ_t on skewness is, however, small, and in the remainder of the text γ̄_t will be referred to as the kurtosis parameter and ρ̃_t as the skewness parameter. One alternative would be to model the skewness and kurtosis directly. However, the combinations of skewness and kurtosis must then be restricted to lie within the allowed region in Figure 1. This can be achieved with numerical techniques during optimization, but there would be no guarantee that forecasted values of skewness and kurtosis would lie in the allowed region, so this alternative is not pursued further. The full NIG-ACD model proposed in this study is given by equations (3.2)–(3.5).² The NIG-S&ARCH model of Jensen and Lunde (2001) is given by (3.2) and (3.3), meaning that the steepness and asymmetry of the NIG distribution are forced to be constant in their specification. The conditional moments of the NIG distribution are given in Appendix A. In addition to the NIG-ACD and NIG-S&ARCH models, the GARCH-NIG model of Forsberg and Bollerslev (2002), a GARCH model with Gaussian errors and the ACD model of Hansen (1994) are used in the empirical part. These models are described in Appendix B.
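To make the mapping from (ρ̃_t, ᾱ_t) to skewness and kurtosis concrete, the sketch below uses the standard moment formulas for the NIG(α, β, μ, δ) distribution, namely skewness 3β/(α√(δγ)) and excess kurtosis 3(1 + 4β²/α²)/(δγ) with γ = √(α² − β²); since skewness and kurtosis are location- and scale-invariant, standardizing η_t leaves them unchanged. The translation into the bar parametrization (ρ = β̄/ᾱ, γ̄ = √(ᾱ² − β̄²)) is my reading of the notation, and the exact expressions in the paper's Appendix A may differ, so this is an illustrative sketch only.

```python
import numpy as np

def nig_skew_kurt(rhotilde, alphabar):
    """Skewness and kurtosis implied by (rhotilde, alphabar), using the
    standard NIG(alpha, beta, mu, delta) moment formulas with
    rho = beta / alpha and gammabar = delta * sqrt(alpha**2 - beta**2):
        skewness        = 3 * rho / sqrt(gammabar)
        excess kurtosis = 3 * (1 + 4 * rho**2) / gammabar,
    where gammabar = alphabar * sqrt(1 - rho**2) in the bar notation."""
    rho = np.tanh(np.asarray(rhotilde) / 2.0)   # invert rhotilde = log((1+rho)/(1-rho))
    gammabar = alphabar * np.sqrt(1.0 - rho ** 2)
    skew = 3.0 * rho / np.sqrt(gammabar)
    kurt = 3.0 + 3.0 * (1.0 + 4.0 * rho ** 2) / gammabar  # full kurtosis, not excess
    return skew, kurt

# e.g. sweeping rhotilde over the region plotted in Figure 2 for alphabar = 4:
skew, kurt = nig_skew_kurt(np.linspace(-0.30, 0.30, 7), alphabar=4.0)
```

Under these assumptions the sketch reproduces the qualitative patterns described above: the implied skewness is increasing in ρ̃ and the implied kurtosis is decreasing in γ̄.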
4. DATA AND ESTIMATION

This section describes the data and the estimation and forecasting procedures. Estimation results are presented and discussed, and the section ends with model diagnostics and density forecasting results.
² The model can also be derived from the stochastic volatility literature (see Jensen and Lunde, 2001), which explains why the name NIG-S&ARCH (stochastic volatility and ARCH) was chosen by Jensen and Lunde.
Table 1. Descriptive statistics of the Standard and Poor's 500 index.

Measure                          Initial estimation sample    Forecasting sample
No. of observations              3000                         7879
Daily mean (%)                   0.025                        0.046
Yearly standard deviation (%)    10.89                        16.08
Maximum (%)                      4.90                         8.50
Minimum (%)                      −3.20                        −21.70
Skewness                         0.081                        −1.340
Excess kurtosis                  3.233                        30.50
JB
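As a sketch of how the entries in Table 1 can be computed from a series of daily percentage returns; the annualization convention (multiplying the daily standard deviation by √252) and the function name are my assumptions, not taken from the paper.

```python
import numpy as np
from scipy import stats

def descriptive_stats(returns, trading_days=252):
    """Table 1-style descriptives for daily returns measured in percent."""
    r = np.asarray(returns)
    jb = stats.jarque_bera(r)   # Jarque-Bera normality test statistic
    return {
        "No. of observations": r.size,
        "Daily mean (%)": r.mean(),
        "Yearly standard deviation (%)": r.std(ddof=1) * np.sqrt(trading_days),
        "Maximum (%)": r.max(),
        "Minimum (%)": r.min(),
        "Skewness": stats.skew(r),
        "Excess kurtosis": stats.kurtosis(r),  # Fisher definition: normal = 0
        "JB": jb.statistic,
    }
```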
for j > 1, indicating that only (unaccounted) one-day dependence is the cause of rejection. The reason for this interpretation is the joint testing of uniformity and dependence: if the test statistic drops below the rejection level after a certain value of j, then the static properties (the marginal distribution) of the returns must be well modelled, and the high Q-statistic for lower values of j must be due to misspecification of the dynamics, since poor modelling of the static properties would show up for all values of j. The out-of-sample density forecast ability shown in Figure 7 is rather similar across the models based on the NIG distribution. The solid line is the 5% one-sided critical value (1.64) for rejecting the null of the 'predicted' density being the true density. The ACD and GARCH-n models clearly perform the worst, with Q(1) statistics of 18.08 and 23.93. The NIG-ACD is still the best model, but it can be rejected, with a Q(1) statistic of 5.68. In fact, the NIG-ACD model is the only model that fares worse out-of-sample than in-sample, which hints at possible overfitting and/or difficulty in estimating the dynamics of the higher moments.
5. VALUE AT RISK

VaR is the maximum loss expected to be incurred over a certain time period (h) with a given probability α. Equivalently, the loss will be less than VaR(α, h) dollars (1 − α) × 100% of the time. Statistically, VaR_t(α, h) = F^{−1}_{t+h}(α)|Ω_t, where F^{−1}_{t+h} is the h-step conditional forecast of the inverse cumulative distribution function (cdf) of the return r_t = log(P_t/P_{t−1}), and Ω_t is the information set up to and including time t. With the ongoing adoption of Basel II, which allows banks to use internal VaR models for the purpose of regulating capital requirements, there is much interest in the measure. For a survey see, for example, Duffie and Pan (1997) or the textbook treatment in Jorion (2000).

5.1. Backtesting VaR models

The Basel committee on banking supervision states in their 2004 'International convergence of capital measurement and capital standards' (page 39) that 'Internal models will only be accepted when a bank can prove the quality of its model to the supervisor through the backtesting of its output using one year of historical data'. The exact method for backtesting is not prescribed. However, the number of exceptions, i.e. the number of occasions when the actual loss is larger than predicted by the VaR model, is used to determine a multiplier that directly affects the capital requirements. Following the terminology of Christoffersen (1998), the Basel committee is only concerned with the unconditional coverage of the models. However, a model might have the correct average coverage even though it is misspecified at a given point in time. Christoffersen (1998) derives a test for correct conditional coverage that is presented below. Define the indicator variable I_t, with t being a time subscript, according to

\[ I_t = \begin{cases} 1, & \text{if } r_t > F_t^{-1}(\alpha) \mid \Omega_{t-1}, \\ 0, & \text{otherwise,} \end{cases} \qquad (5.1) \]

where F^{−1}_t(α)|Ω_{t−1} is the conditional VaR forecast (the inverse of the cdf evaluated at α) from the particular model being evaluated. Testing whether the number of exceptions is correct is referred to as testing for correct unconditional coverage. Under correct unconditional coverage, α percent of the returns will be lower than the VaR forecast, so under the null we have

\[ E\left[ \frac{1}{T} \sum_{t=1}^{T} I_t \right] = 1 - \alpha. \qquad (5.2) \]

The test for correct conditional coverage can be divided into two separate parts: one part tests for correct unconditional coverage and one part tests for independence in the sequence of exceptions. This is very useful, since it can then be investigated whether a model rejection is due to unconditional coverage failure, clustering of the exceptions or both. The null hypothesis of correct unconditional coverage gives I_t ∼ Bernoulli(1 − α), which can be tested by a likelihood ratio test of the form

\[ LR_{UC} = 2\left[ \log\left( \hat{\pi}_1^{T_1} (1 - \hat{\pi}_1)^{T - T_1} \right) - \log\left( (1 - \alpha)^{T_1} \alpha^{T - T_1} \right) \right]. \qquad (5.3) \]

The number of observations is given by T, the number of ones by T_1 and π̂_1 = T_1/T.
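A minimal sketch of the unconditional coverage test (5.3); the function name and the assumption 0 < T_1 < T (so that all logarithms are finite) are mine.

```python
import numpy as np

def lr_uc(I, alpha):
    """Unconditional coverage LR test, equation (5.3).

    `I` is the indicator series of (5.1): I_t = 1 when the return lies
    above the VaR forecast, so a fraction 1 - alpha of ones is expected
    under the null. Assumes 0 < sum(I) < len(I).
    """
    I = np.asarray(I)
    T, T1 = I.size, int(I.sum())
    pi1 = T1 / T
    loglik_hat = T1 * np.log(pi1) + (T - T1) * np.log(1.0 - pi1)
    loglik_null = T1 * np.log(1.0 - alpha) + (T - T1) * np.log(alpha)
    return 2.0 * (loglik_hat - loglik_null)
```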
To see whether the exceptions tend to cluster together over time, Christoffersen (1998) suggests testing for independence, with first-order Markov dependence as the alternative. The test statistic is given by

\[ LR_{IND} = 2\left[ \log\left( (1 - \hat{\pi}_{01})^{T_0 - T_{01}} \, \hat{\pi}_{01}^{T_{01}} \, (1 - \hat{\pi}_{11})^{T_1 - T_{11}} \, \hat{\pi}_{11}^{T_{11}} \right) - \log\left( \hat{\pi}_1^{T_1} (1 - \hat{\pi}_1)^{T - T_1} \right) \right], \qquad (5.4) \]

where T_{ij} is the number of observations valued i followed by an observation valued j. The maximum likelihood estimates are simply π̂_{01} = T_{01}/T_0 and π̂_{11} = T_{11}/T_1. The joint null of correct conditional coverage states that I_t ∼ i.i.d. Bernoulli(1 − α) for all t. The test statistic is simply the sum of the two individual tests in equations (5.3) and (5.4):

\[ LR_{CC} = LR_{UC} + LR_{IND}. \qquad (5.5) \]
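Continuing the sketch, the independence and conditional coverage statistics (5.4)–(5.5) can be computed along the following lines; note that the transition counts here are formed over consecutive pairs, so the null likelihood conditions on the first observation slightly differently from the printed formula, and all four transition types are assumed to occur.

```python
import numpy as np
# lr_uc as defined in the previous sketch.

def lr_ind(I):
    """Independence LR test, equation (5.4), with first-order Markov
    dependence in the hit sequence as the alternative."""
    I = np.asarray(I)
    prev, curr = I[:-1], I[1:]
    T01 = int(np.sum((prev == 0) & (curr == 1)))  # 0 followed by 1
    T11 = int(np.sum((prev == 1) & (curr == 1)))  # 1 followed by 1
    T0 = int(np.sum(prev == 0))                   # transitions out of 0
    T1 = int(np.sum(prev == 1))                   # transitions out of 1
    pi01, pi11 = T01 / T0, T11 / T1
    pi1 = (T01 + T11) / (T0 + T1)
    loglik_markov = ((T0 - T01) * np.log(1.0 - pi01) + T01 * np.log(pi01)
                     + (T1 - T11) * np.log(1.0 - pi11) + T11 * np.log(pi11))
    loglik_null = ((T01 + T11) * np.log(pi1)
                   + (T0 - T01 + T1 - T11) * np.log(1.0 - pi1))
    return 2.0 * (loglik_markov - loglik_null)

def lr_cc(I, alpha):
    """Conditional coverage LR test, equation (5.5)."""
    return lr_uc(I, alpha) + lr_ind(I)
```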
Christoffersen (1998) uses the asymptotic distribution results for the test statistics in equations (5.3)–(5.5). However, I will follow the recommendation in Christoffersen and Pelletier (2004) and simulate the distribution of the test statistics, since the effective sample sizes (the expected number of exceptions) are rather small in typical VaR settings.
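The Monte Carlo p-values can be obtained along the following lines; this sketch covers only LR_UC, for which simulation is particularly cheap because (5.3) depends on the hit sequence only through T_1, so one can draw T_1 ∼ Binomial(T, 1 − α) directly. The use of scipy's xlogy (which returns 0 for 0 · log 0) is my implementation choice.

```python
import numpy as np
from scipy.special import xlogy   # xlogy(0, 0) = 0, handling boundary draws

def simulated_pvalue_uc(stat, T, alpha, n_sim=100_000, seed=0):
    """Monte Carlo p-value for LR_UC: under the null the hit sequence is
    i.i.d. Bernoulli(1 - alpha), so the number of ones T1 is
    Binomial(T, 1 - alpha) and determines the statistic completely."""
    rng = np.random.default_rng(seed)
    T1 = rng.binomial(T, 1.0 - alpha, size=n_sim)
    pi1 = T1 / T
    ll_hat = xlogy(T1, pi1) + xlogy(T - T1, 1.0 - pi1)
    ll_null = xlogy(T1, 1.0 - alpha) + xlogy(T - T1, alpha)
    sims = 2.0 * (ll_hat - ll_null)
    return float(np.mean(sims >= stat))
```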
6. VALUE AT RISK RESULTS

All models, with the exception of the GARCH-n and ACD models, show exceptionally good results for unconditional coverage at the six VaR levels 0.5%, 1%, 2%, 3%, 4% and 5%. The P values reported in Table 3 are calculated by simulating the distribution of the test statistic under the null, as outlined by Christoffersen and Pelletier (2004). A sample size equal to the empirical sample size (7879 observations) with 100,000 replications is used. For the 0.5% VaR, the effective sample size is very small, and there are large deviations between the simulated and the asymptotic (unreported) P values. For the 1% level and higher, the simulated and asymptotic P values are very similar. Since the Basel regulations only evaluate models at the 1% VaR level, special emphasis is given to the models' performance at this level. The empirical percentage of exceptions is 0.96% for the NIG-ACD model and 0.89% for the NIG-S&ARCH model. These numbers clearly cannot be rejected as different from 1%, with P values from the LR_UC test of 0.77 and 0.33, respectively. The unconditional coverage of the GARCH-NIG model at the 1% level is almost perfect, with 80 exceptions compared with the expected 78.79, whereas the GARCH-n model clearly underestimates the VaR, with an empirical percentage of exceptions of 1.52%, which is rejected as providing correct unconditional coverage with a P value of less than 0.01. The ACD model has an empirical percentage of exceptions of 1.31% at the 1% VaR level, which is significantly higher than 1% with a P value of 0.01. The BASEL (1996) 'Supervisory framework for the use of "backtesting" in conjunction with the internal models approach to market risk capital requirements' is only concerned with unconditional coverage and divides models into three zones (green, yellow and red) depending on the number of violations. A model is in the green zone if a 95% confidence interval around the correct number of exceptions covers the realized number of exceptions, in the yellow zone if a 99.99% confidence interval covers it and in the red zone otherwise. The green zone carries no additional capital requirements, whereas the yellow zone can lead to additional capital requirements.
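A sketch of the traffic-light classification as described above, treating the number of exceptions as Binomial(T, α) under correct coverage; the two-sided interval construction follows the text's description rather than the Basel committee's own tabulation, and the function name is mine.

```python
from scipy import stats

def basel_zone(n_exceptions, T, alpha=0.01):
    """Green if the realized number of exceptions falls inside a 95%
    interval around the expected number under correct coverage, yellow
    if inside a 99.99% interval, red otherwise."""
    null = stats.binom(T, alpha)   # exceptions under correct coverage
    for zone, level in (("green", 0.95), ("yellow", 0.9999)):
        lo = null.ppf((1.0 - level) / 2.0)
        hi = null.ppf(1.0 - (1.0 - level) / 2.0)
        if lo <= n_exceptions <= hi:
            return zone
    return "red"

# e.g. the NIG-ACD model: about 0.96% of 7879 returns below the 1% VaR.
zone = basel_zone(n_exceptions=round(0.0096 * 7879), T=7879, alpha=0.01)  # 'green'
```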
Table 3. Value at Risk results.

Panel A: Percentage of returns below VaR

VaR level     0.5%    1%      2%      3%      4%      5%      Basel
NIG-ACD       0.61%   0.96%   1.92%   2.96%   4.01%   4.91%   Green
ACD           0.80%   1.31%   2.39%   3.25%   4.11%   5.06%   Yellow
NIG-S&ARCH    0.56%   0.89%   2.02%   2.92%   3.82%   4.67%   Green
GARCH-NIG     0.55%   1.02%   2.15%   3.19%   3.95%   4.89%   Green
GARCH-n       0.93%   1.52%   2.54%   3.27%   4.00%   4.66%   Red

Panel B: LR statistics and simulated P values from the unconditional test for coverage

NIG-ACD       1.77    0.10    0.28    0.05