The Econometrics Journal (2009), volume 12, pp. Si–Sv. doi: 10.1111/j.1368-423X.2009.00293.x
Tenth Anniversary Special Issue
EDITORIAL

New Year 2008 marked the tenth anniversary of The Econometrics Journal, which was established in 1998 by the Royal Economic Society with the intention of creating a high-quality refereed general journal for the publication of econometric research, with a standard of intellectual rigour and academic standing similar to those of the pre-existing top international field journals in econometrics. To celebrate this event, a Special Issue of the journal was commissioned by inviting contributions from a number of leading scholars in econometrics, whose research interests range across all aspects of the discipline. The eleven papers that appear in this special issue deal with a number of topics of current research interest. Given the breadth of the discipline and the coverage of the papers collected here, they cannot easily be gathered under any single heading. However, some papers do fall rather loosely into a number of overlapping categories, and the ordering of the papers in this special issue reflects these groupings and their intersections as far as possible.

Many economic data are generated by stochastic processes that can be modelled as occurring in continuous time, with the data treated as realizations of random functions, i.e. functional data. The particular focus of the paper by Federico Bugni, Peter Hall, Joel Horowitz and George Neumann is a scenario in which economic theory may be described by a finite-dimensional parametric stochastic process, which thereby explicitly or implicitly specifies the probability distribution of the process sample paths. A test that the theory model generated the data may be constructed by comparing the empirical and theoretical sample path distributions, i.e. a test of a finite-dimensional parametric model against a non-parametric alternative.
This paper generalizes the Cramér-von Mises approach to distributions of random functions as a particular example of functional data approaches to tests of specification in econometrics. It also develops parametric bootstrap methods that facilitate the use of techniques based on integration over function spaces. The functional data approach not only presents a novel way of conceptualizing specification testing problems but potentially provides a basis for new test methods for continuous-time models in finance as well as for the equilibrium search model considered in this paper.

The next two papers consider particular aspects of quantile regression. The paper by Elise Coudin and Jean-Marie Dufour concerns finite-sample and asymptotically valid distribution-free tests and confidence sets for the parameters of a linear median regression. The problem of interest consists in obtaining conditions under which signs are i.i.d. and follow a known distribution despite the underlying random variates neither being independent nor satisfying other regularity conditions. The setting employed allows for heteroskedasticity and nonlinear dependence of unknown form in the regression disturbances, as well as discretely distributed random variates. Tests based on residual signs constitute a system for finite-sample exact inference under very general assumptions. An advantage of the sign-based inference methods considered in this paper is that no parametric assumption is imposed on the distribution of the regression error. Moreover, they avoid estimation of the estimator asymptotic variance matrix, which can be particularly problematic for standard procedures. The procedures considered in the paper remain asymptotically valid with weakly exogenous regressors and stationary regression disturbances. Standard heteroskedasticity and autocorrelation consistent (HAC) methods permit sign-based statistics to be transformed appropriately, thereby eliminating nuisance parameters asymptotically. Consequently, the test method retains asymptotic validity, although at the expense of exactness. Furthermore, it is unnecessary to evaluate the disturbance density at zero, a major difficulty associated with asymptotic kernel-based methods used in least absolute deviations (LAD)-based techniques. The performance of the proposed procedures is illustrated through a set of simulation experiments.

The particular concern of the paper by Xiaohong Chen, Roger Koenker and Zhijie Xiao is the specification and robust estimation of quantile autoregressive models for nonlinear time series. They note that many current methods for quantile estimation and prediction rely heavily on unrealistic global distributional assumptions. The paper proposes local, quantile-specific copula-based time-series models motivated by parametric copula models, which retain some semiparametric flexibility and should thereby offer some robustness over classical global parametric approaches. Parametric copula models are used to generate nonlinear-in-parameters quantile autoregression models that, by construction, possess monotone conditional quantile functions over the entire support of the conditioning variables.
However, rather than impose this global structure explicitly, the implied conditional quantile function at a particular quantile is assumed to be correctly specified and is used as the basis for estimation and inference. This distinction between global parametric and local quantile-specific models facilitates an analysis of potential misspecification of the global structure. Consistency and asymptotic normality of the quantile estimator are obtained under mild sufficient conditions requiring only stationarity and ergodicity of the underlying copula-based Markov model, without any mixing conditions. The results are particularly relevant for estimation and inference about extreme conditional quantiles (value-at-risk) for financial time-series data, which typically display strong temporal dependence and tail dependence as well as heavy-tailed marginals.

Peter Robinson details a simple model that can explain spatial dependence in circumstances when observations may have been purged of spatial correlation. The model displays similarities to the well-known stochastic volatility model of financial econometrics, adapted for the spatial context. Parameter estimation is based on quasi-maximum likelihood using logarithms of squared observations. An asymptotic theory is described in which consistency and asymptotic normality of the model parameter estimators are established, and related asymptotically valid tests for spatial dependence are presented. The simple model may be straightforwardly extended to incorporate spatial correlation in the observables and to include explanatory variables.

The next two papers deal with certain inferential issues that arise in time-series econometric models. The paper by Xu Cheng and Peter Phillips extends earlier work by the second author for the univariate context to the multivariate setting. In particular, the paper addresses the issue of cointegrating rank choice using information criteria.
If cointegration and cointegrating rank selection is of primary concern, a complete model is unnecessary for statistical purposes, allowing a reduced rank regression with a single lag to form the basis for model choice, with no explicit account taken of the short memory component. Standard information criteria, including BIC and Hannan–Quinn, are shown to be weakly consistent in the choice of cointegrating rank provided the penalty coefficient Cn satisfies Cn → ∞ and Cn/n → 0 as n → ∞. The paper also provides the limit distribution of the Akaike information criterion (AIC), which, as in the standard setting, is inconsistent. A general limit theory for semiparametric reduced rank regressions under weakly dependent errors is presented. The finite-sample performance of the criteria is studied in some simulation experiments.

In their article, Miguel Delgado, Javier Hidalgo and Carlos Velasco extend their earlier work for observable time-series processes to dynamic regression models. In this setting the null hypothesis of interest concerns the absence of serial correlation in the regression errors. The paper proposes goodness-of-fit tests in which regressors are permitted to be only weakly exogenous and arbitrarily correlated with past shocks. The tests employ a linear transformation of Bartlett's Tp-process of the regression residuals. The linear transformation approximates the martingale component of the process, thereby ensuring that it converges weakly to standard Brownian motion under the null hypothesis. A feasible transformation may be based on a non-parametric smoothed estimator of the relevant cross-spectrum, although smoothing in the feasible martingale transformation can be avoided by using the (inconsistent) cross-periodogram directly. Nevertheless, the tests have non-trivial power against local alternatives converging to the null at the parametric root-n rate. A notable aspect of the tests is that there is no need to specify the dynamic structure of the regressors, hence avoiding restrictions on the class of local alternatives that the tests are able to detect, which contrasts with tests that employ smoothing techniques.
A Monte Carlo study illustrates the finite-sample performance of the tests.

Bertille Antoine and Eric Renault examine weak identification characterized by drifting population moment conditions. The focus of the paper is on nearly weak identification, where the limit rank deficiency obtains at a rate δT slower than the standard root-T. Consequently, generalized method of moments (GMM) estimators of all parameters remain consistent, but at a rate potentially slower than root-T. The standard GMM-based Lagrange multiplier (LM) test remains asymptotically chi-square, in contrast to the weakly identified context. A comparative study of the power of the standard LM test and its modified weak-identification version indicates that the latter statistic can be relatively deficient in power in a nearly weak identified environment. Moreover, both tests are asymptotically equivalent for rates δT slower than T^{1/4}, which the authors classify as nearly strong identification. A reparameterization obtained via a rotation in the parameter space results in the first components being estimated at the standard root-T rate, with the others estimated at the slower rate T^{1/2}/δT. Standard GMM formulae for asymptotic variance matrices are applicable only in the nearly strong identified set-up. A Monte Carlo study using the consumption-based capital asset pricing model concludes the paper.

The paper by Donald Andrews and Sukjin Han is concerned with the finite-sample and asymptotic properties of a number of sampling methods for constructing confidence interval (CI) endpoints for partially identified parameters in models defined by moment inequalities. The particular emphasis is on the bootstrap and the m out of n bootstrap applied directly to construct CI endpoints. In general, these bootstrap methods are valid neither in finite samples nor in a uniform asymptotic sense when applied directly to construct CI endpoints.
Both backward and forward forms of the bootstrap, together with the m out of n bootstrap CIs, are considered. The failure of the bootstrap arises because of the non-differentiability of the statistics of interest as functions of the underlying sample moments. Although the results described in the paper are for parametric versions of the bootstrap, the asymptotic properties of their non-parametric counterparts follow directly through their asymptotic equivalence. Moreover, asymptotic results for subsampling are identical to those for the non-parametric i.i.d. m out of n bootstrap provided the subsample size obeys certain restrictions, indicating that the m out of n bootstrap results should also apply to subsampling methods. The finite-sample and asymptotic properties of the sampling methods for CI endpoints are obtained in two simple models, and their invalidity therefore applies generally. Other methods for constructing confidence sets, e.g. inverting acceptance regions based on an Anderson-Rubin-type test statistic, based on subsampling and the m out of n bootstrap, are asymptotically valid in a uniform sense. Moreover, these confidence sets may be combined with a recentred bootstrap applied as part of a moment selection method for constructing critical values.

The final three papers address issues that may be broadly considered to arise in the area of programme evaluation. First, Charles Manski and John Pepper revisit the topic of their 2000 paper published in Econometrica. That paper introduced a monotone instrumental variable (MIV) assumption, weakening the traditional IV assumption of mean independence, i.e. that mean response is constant across subpopulations of persons with different values of an observed covariate, and thereby replacing a moment equality with a weak inequality restriction. The paper employs an explicit response model to contrast the content of MIV and traditional IV assumptions and to illustrate why MIV assumptions might reasonably be adopted in studies of the returns to schooling and production. The identifying power of MIV assumptions when combined with the homogeneous linear response assumption maintained in many studies of treatment response is examined, to provide an indication of the implications of the latter assumption.
The estimation of MIV bounds is also reconsidered, with an analysis of the finite-sample bias of analogue estimators for MIV bounds and of their tendency to be narrower than the true bounds. The paper gives some simulation-based evidence on this bias and on the performance of a bias-correction method.

Next, the classical sample selection model is analysed in the paper by Whitney Newey. Here the selection correction term is treated semiparametrically rather than parametrically as in the standard case. The functional form of the selection term is assumed to be unknown and to depend on an index known up to a finite-dimensional vector of parameters for which a root-n consistent estimator is available. Although the semiparametric efficiency lower bound is known for this form of conditional moment restriction, no efficient estimator has yet been proposed. Least squares estimation, after substituting for the unknown index parameter and approximating the unknown selection term via either power series or splines, provides a very simple and straightforward estimation method for the regression parameters and a potentially attractive alternative to fully non-parametric procedures. The paper provides root-n consistency and asymptotic normality results for the regression parameter estimator, together with a consistent estimator for the estimator's asymptotic variance matrix.

Finally, it has long been thought that, for choice-based samples, matching and selection methods require the probability of selection into treatment to be consistently estimated. James Heckman and Petra Todd demonstrate the empirically important result that these procedures remain valid even when the propensity score is estimated from unweighted choice-based samples and is thus inconsistently estimated.
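The Heckman–Todd point admits a simple numerical illustration. For a logit propensity score, unweighted estimation from a choice-based sample is classically known to distort only the intercept (by the log of the sampling odds), so the misestimated score is a strictly increasing transformation of the true score and preserves the ordering on which matching relies. The sketch below uses illustrative coefficients and an assumed oversampling factor; it is a toy calculation, not the paper's empirical analysis.

```python
# Toy illustration: choice-based sampling under a logit model shifts only the
# intercept, so the biased propensity score ranks observations identically to
# the true score. All coefficients here are illustrative assumptions.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-3.0, 3.0, 200)
true_score = sigmoid(-1.0 + 0.8 * x)          # true P(treated | x)

# Oversampling the treated by a factor r shifts the logit intercept by log(r)
# (the classic case-control result for logistic models).
r = 4.0
biased_score = sigmoid(-1.0 + np.log(r) + 0.8 * x)

# Both scores are strictly increasing in x, hence in each other: matching on
# the biased score pairs the same units as matching on the true score.
assert np.all(np.diff(true_score) > 0) and np.all(np.diff(biased_score) > 0)
rank_corr = np.corrcoef(np.argsort(true_score), np.argsort(biased_score))[0, 1]
print(rank_corr)
```

The intercept shift inflates the level of the estimated score but leaves its ordering, and therefore nearest-neighbour matches, unchanged.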
In conclusion, I hope that the papers collected in this tenth anniversary special issue go some way towards achieving the original objective of the Royal Economic Society for The Econometrics Journal. I would also like to take this opportunity to extend the gratitude of The Econometrics Journal to the contributors for their submissions. Special thanks are owed to the referees of the papers comprising the tenth anniversary special issue, listed below, without whose assistance it would not have been possible:

V. Chernozhukov
J. Pinkse
M. Jansson
P. Guggenberger
Y. Hong
P. M. D. C. Parente
A. M. R. Taylor
V. Corradi
E. Guerre
A. Patton
R. J. Smith
Richard J. Smith (Managing Editor)
University of Cambridge
© The Author(s). Journal compilation © Royal Economic Society 2009. Published by Blackwell Publishing Ltd, 9600 Garsington Road, Oxford OX4 2DQ, UK and 350 Main Street, Malden, MA 02148, USA.
The Econometrics Journal (2009), volume 12, pp. S1–S18. doi: 10.1111/j.1368-423X.2008.00266.x
Goodness-of-fit tests for functional data

Federico A. Bugni†, Peter Hall‡, Joel L. Horowitz§ and George R. Neumann¶

†Department of Economics, Northwestern University, Evanston, IL 60208–2600, USA. E-mail: [email protected]
‡Department of Mathematics and Statistics, University of Melbourne, Melbourne, VIC 3010, Australia. E-mail: [email protected]
§Department of Economics, Northwestern University, Evanston, IL 60208–2600, USA. E-mail: [email protected]
¶Department of Economics, University of Iowa, Iowa City, IA 52242–1000, USA. E-mail: [email protected]

First version received: July 2008; final version accepted: August 2008
Summary  Economic data are frequently generated by stochastic processes that can be modelled as occurring in continuous time. That is, the data are treated as realizations of a random function (functional data). Sometimes an economic theory model specifies the process up to a finite-dimensional parameter. This paper develops a test of the null hypothesis that a given functional data set was generated by a specified parametric model of a continuous-time process. The alternative hypothesis is non-parametric. A random function is a form of infinite-dimensional random variable, and the test presented here is a generalization of the familiar Cramér-von Mises test to an infinite-dimensional random variable. The test is illustrated by using it to test the hypothesis that a sample of wage paths was generated by a certain equilibrium job search model. Simulation studies show that the test has good finite-sample performance.

Keywords: Bootstrap, Cramér-von Mises test, Equilibrium search model, Functional data analysis, Hypothesis testing.
1. INTRODUCTION

Economic data are frequently generated by stochastic processes that can be modelled as occurring in continuous time. The data may then be treated as realizations of random functions (functional data). Examples include wage paths and asset prices or returns. Sometimes economic theory provides a parametric model for the data. That is, economic theory may provide a stochastic process that is known up to a finite-dimensional parameter and may be the process that generated the data. For example, certain equilibrium job search models specify the wage process up to a finite-dimensional parameter, and certain diffusion models specify an asset's price or returns process up to a finite-dimensional parameter. In such cases, it is natural to test
the theory model against the data. More specifically, it is natural to test the hypothesis that, for some value of its parameter, the theory model is a correct specification of the data-generation process. This paper describes a method for carrying out such a test.

A theory model of a stochastic process explicitly or implicitly specifies the probability distribution of the random functions (or sample paths) that are realizations of the process. If the theory model depends on an unknown finite-dimensional parameter, which we assume to be the case here, the specification is up to the value of this parameter. Functional data can be used to form an empirical analogue of the probability distribution of the random functions (the empirical distribution of the data). Therefore, a test of the hypothesis that the theory model generated the data can be made by comparing the empirical and theoretical distributions of the sample paths. This amounts to testing a finite-dimensional parametric model of a probability distribution against a non-parametric alternative. When the random variable of interest is finite-dimensional, the Cramér-von Mises and Kolmogorov–Smirnov tests, among many others, can be used for this purpose, but these tests do not apply to random functions, which are infinite-dimensional random variables.¹

The test described in this paper generalizes the Cramér-von Mises test to distributions of random functions, or infinite-dimensional random variables, that depend on an unknown finite-dimensional parameter. Novel aspects of our contribution include the introduction of functional data approaches to specification testing in econometrics, and the development of parametric bootstrap methods that facilitate the use of techniques based on integration over function spaces.
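For concreteness, the finite-dimensional Cramér-von Mises test that is being generalized is available off the shelf: SciPy's `cramervonmises` tests a univariate sample against a fully specified distribution. This is only the classical scalar case with a known null distribution, not the functional test with estimated parameters developed in the paper.

```python
# Classical (finite-dimensional) Cramér-von Mises test via SciPy: test whether
# a univariate sample was drawn from a fully specified N(0, 1) distribution.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=200)

# H0 here is true by construction, so a small p-value would be a false alarm.
result = stats.cramervonmises(sample, stats.norm.cdf)
print(result.statistic, result.pvalue)
```

The statistic is an integrated squared distance between the empirical and hypothesized CDFs; the functional test replaces the CDFs with distribution functionals on a function space.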
The functional data view offers new ways of conceptualizing specification testing problems and can lead to new approaches for testing continuous-time models, such as models of financial data that are quite different from the equilibrium search model that motivates the present work.

More specifically, suppose that the distribution of a random function Y depends on an unknown, finite-dimensional parameter θ and that we have a random sample X = {X1, . . . , Xn} of n realizations of a random function X that may be distributed as Y for some value of θ. We develop a Cramér-von Mises type test of the null hypothesis, H0, that "the distribution of X is identical to that of Y for some unspecified value of θ." A mathematically concise interpretation of the phrase in quotation marks will be given in the first paragraph of Section 2.1. The paper presents the test statistic and explains how to compute it, derives the test statistic's asymptotic distribution under H0 and local alternative hypotheses, and presents a bootstrap procedure for computing the critical value of the test.

We illustrate the use of the test by applying it to an equilibrium job search model (Mortensen, 1990; Burdett and Mortensen, 1998; Bowlus et al., 2001; Christensen et al., 2005). This model aims to explain the frequencies and durations of spells of unemployment as well as the distribution of wages among employed individuals. In particular, the model provides an explanation for why seemingly identical individuals have different wages. One of the model's outputs is a random function, Y say, that gives an individual's wage as a function of time, up to an unknown, vector-valued parameter. We also have data on the wage paths of a random sample of individuals. Our test allows us to assess whether the equilibrium search model provides a correct description of the wage process.
If the distribution specified by the theory model did not depend on an unknown parameter, so that the distribution of Y were completely known, then the permutation test of Hall and Tajvidi (2002) would be an alternative to the test presented here. However, we have found through Monte Carlo experimentation that the finite-sample performance of the Hall-Tajvidi test is poor when the null hypothesis distribution depends on an unknown parameter. In particular, the test has low power, and the probability that it rejects a correct H0 greatly exceeds the nominal rejection probability. We present Monte Carlo evidence indicating that the differences between the true and nominal rejection probabilities of our Cramér-von Mises type test are small.

A variety of other tests can be considered. One possibility is the development of adaptive methods in which the weight function of our Cramér-von Mises test (that is, the measure μ in Section 2.1 of this paper) is chosen to optimize power against a specific class of alternatives. Cuesta-Albertos et al. (2006) and Cuesta-Albertos et al. (2007) have shown that a class of tests based on random projections is consistent against certain location-scale families. The equilibrium search model that we consider here is not of this type, however, and it is unknown whether the random-projection tests are consistent under conditions more general than those considered in the two foregoing papers. In Monte Carlo experiments using the designs of Cuesta-Albertos et al. (2007), we compared the power of our Cramér-von Mises test with the power of the random-projections test. The results of the experiments are reported in Section 4.3. In every case the power of the Cramér-von Mises test is similar to or greater than the power of the random-projections test.

There is a large econometrics literature on specification testing but, in contrast to the test in this paper, it applies to data consisting of finite-dimensional vectors rather than functions. Some of the existing work addresses specification testing problems for dynamic processes and processes that are observed in continuous time. See, e.g. Cay and Hong (2003), Hong and Haito (2005), Guay and Guerre (2006) and Kim and Wang (2006).

¹ Durbin (1973a, b) and Pollard (1984) discuss the Kolmogorov-Smirnov and Cramér-von Mises tests of distributions that depend on an unknown parameter.
Many econometric specification testing problems are set in a context that is semi-parametric or non-parametric. Examples include Fan and Li (1996, 2000, 2002), Guerre and Lavergne (2002, 2005), Horowitz and Spokoiny (2001) and Miles and Mora (2003). There is also an extensive statistics literature on functional data analysis, much of which is synthesized in the books by Ramsay and Silverman (2002, 2005). Principal components analysis plays a role in methods for computing our test statistic. Recent work in that context includes Boente and Fraiman (2002), He et al. (2003), Yao et al. (2005), Hall and Hosseini-Nasab (2006) and Jank and Shmueli (2006).

Virtually all problems in functional data analysis can be reformulated to permit treatment with finite-dimensional methods. In particular, the functional testing problem dealt with in this paper can be made finite-dimensional by testing only finitely many features of the parametric model instead of the entire stochastic process. However, this approach has several drawbacks. First, depending on the features that are chosen for testing, a finite-dimensional test may be inconsistent against important deviations of the data-generation process from the theory model, and choosing an appropriate low-dimensional approximation and its associated test can be quite difficult in practice. Even if the model is finite-dimensional, the test needs to be sensitive to relatively complex, high-dimensional departures from the null hypothesis, for example as represented by the shapes of random functions. The functional approach avoids this problem. Second, the accuracy of finite-dimensional asymptotic approximations tends to deteriorate as the dimension of the object being tested increases. Specifically, the difference between the true and nominal probabilities of rejecting a correct null hypothesis tends to increase as the dimension of the distribution being tested increases. The functional approach avoids this problem by developing an asymptotic approximation that is specifically designed for infinite-dimensional data. Finally, the functional approach avoids having to model explicitly the correlation between function values at nearby points in their domain.
Section 2 of this paper describes the test statistic and methods for computing the statistic and its critical value. Section 3 presents the test’s theoretical properties. Section 4 presents the empirical application and Monte Carlo results. The proofs of theorems are given in the Appendix.
2. THE TEST PROCEDURE

This section describes the test statistic and its implementation. Section 2.1 presents the statistic. Sections 2.2 and 2.3, respectively, explain how to compute the test statistic and the critical value.

2.1. The test statistic

Assume that the random functions X and Y are defined on a bounded interval, I, which we take to be [0, 1]. Let L2[0, 1] denote the space of square-integrable functions on [0, 1] and let ‖·‖ denote the L2 norm. We assume that X and Y are both in L2[0, 1] (so that ‖X‖, ‖Y‖ < ∞) with probability 1. Note that this condition accommodates unbounded random functions such as smooth Gaussian processes defined on compact intervals; it is not equivalent to requiring P(‖X‖ ≤ C, ‖Y‖ ≤ C) = 1 for some finite constant C > 0. Define the distribution functionals of X and Y, respectively, by

FX(x) = P[X(t) ≤ x(t) for all t ∈ I]  and  FY(x|θ) = P[Y(t) ≤ x(t) for all t ∈ I],

where the non-stochastic function x(·) is the argument of the distribution functional and θ is the finite-dimensional parameter on which the distribution of Y depends. Assume that θ is contained in a parameter set Θ ⊂ R^p for some finite p > 0. The null hypothesis that we test is

H0: FX(x) = FY(x|θ) for some θ ∈ Θ and all x ∈ L2[0, 1].

The alternative hypothesis, H1, is that there is no θ ∈ Θ for which H0 holds. Basing the definition of H0 on the distribution functionals FX and FY is natural because FX = FY implies that the finite-dimensional distributions associated with FX and FY coincide, implying that X and Y correspond to the same probability measure. We note that the sets {X(t) ≤ x(t) for all t ∈ I} and {Y(t) ≤ x(t) for all t ∈ I} are measurable. For example, X(t) ≤ x(t) for all t ∈ I is equivalent to sup_{t∈I}[X(t) − x(t)] ≤ 0, and the supremum is measurable whenever the functions X and x are measurable.

Let the data be a random sample of X: {Xi : i = 1, . . . , n}. Because X is a function, each Xi is also a function, Xi(t), on the interval [0, 1].
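The supremum characterization of the event {X(t) ≤ x(t) for all t ∈ I} is also what makes the indicator computable once paths are recorded on a discrete grid: the "for all t" condition becomes a maximum over grid points. A minimal sketch, with an illustrative path and argument function (both assumptions, chosen only for the demonstration):

```python
# Sketch: on a discrete time grid, the event {X(t) <= x(t) for all t} reduces
# to sup_t [X(t) - x(t)] <= 0, so the indicator is a max over grid points.
import numpy as np

t = np.linspace(0.0, 1.0, 101)           # grid on I = [0, 1]
X = np.sin(2.0 * np.pi * t)              # one realized path X(t) (illustrative)
x = np.full_like(t, 0.5)                 # argument function x(t) (illustrative)

indicator = float(np.max(X - x) <= 0.0)  # I[X(t) <= x(t) for all t]
print(indicator)
```

Here the sine path exceeds 0.5 on part of the interval, so the indicator is zero; raising x(t) above the path's maximum would make it one.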
For example, in the empirical application presented in Section 4, each Xi is the wage path of a randomly sampled individual. The empirical distribution functional of the data is defined as

F̂X(x) = n^{-1} Σ_{i=1}^{n} I[Xi(t) ≤ x(t) for all t ∈ I],
where I(·) is the indicator function. Let θ̂ be an estimator of θ that is consistent under H0. Then H0 can be tested by comparing F̂X with FY(·|θ̂): H0 is rejected if the "distance" between F̂X and FY(·|θ̂) is too large in some metric.

In practice, FY(·|θ) may not be available in a convenient, analytic form. However, FY(x|θ) can be estimated for any x and θ and with any desired level of accuracy if sample paths of Y can be generated by simulation. Specifically, let {Y1(t), . . . , Ym(t)} be m sample paths that are generated by simulation from the Y process with a specified value of θ. Then FY(·|θ) is estimated consistently by the empirical distribution functional of the simulated paths:

F̂Y(x|θ) = m^{-1} Σ_{i=1}^{m} I[Yi(t) ≤ x(t) for all t ∈ I].
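On a discrete grid the two empirical distribution functionals are a few lines of array code. In the sketch below, everything beyond the definitions is an illustrative assumption: paths are stored as rows, the "for all t" indicator is a max over grid points, and a Gaussian random walk stands in for both the data and the Y process (the paper's model would replace it).

```python
# Hedged sketch of the empirical distribution functionals F-hat_X and F-hat_Y
# on a common time grid. The random-walk simulator is a toy stand-in process.
import numpy as np

rng = np.random.default_rng(1)
n_grid = 50
t = np.linspace(0.0, 1.0, n_grid)

def simulate_paths(count, rng, scale=1.0):
    """Toy stand-in process: Gaussian random-walk paths on the grid."""
    steps = rng.normal(0.0, scale / np.sqrt(n_grid), size=(count, n_grid))
    return np.cumsum(steps, axis=1)

def empirical_functional(paths, x):
    """F-hat(x) = (1/n) sum_i I[path_i(t) <= x(t) at every grid point t]."""
    return float(np.mean(np.max(paths - x, axis=1) <= 0.0))

X = simulate_paths(200, rng)     # observed functional data (toy)
Y = simulate_paths(1000, rng)    # paths simulated under a fitted model (toy)

x = np.full(t.shape, 0.5)        # an argument function x(t)
fx = empirical_functional(X, x)
fy = empirical_functional(Y, x)
print(fx, fy)
```

Because X and Y are drawn from the same process here, the two functionals agree up to sampling error, which shrinks in m for the simulated paths exactly as the text describes.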
The random sampling errors of FˆY can be made arbitrarily small by making m sufficiently large. ˆ is too large. Therefore, our test rejects H 0 if the distance between FˆX and FˆY (·|θ) If X and Y were finite-dimensional random variables, then the Cram´er-von Mises test would ˆ Thus, the test consist of using the L 2 metric to measure the distance between FˆX and FˆY (·|θ). statistic would be ˆ 2 dη(z), TCvM = [FˆX (z) − FY (z|θ)] where η is Lebesgue measure on the support of X and Y . A generalization to the case of random functions, which are infinite-dimensional random variables, can be obtained by replacing η with a probability measure on L 2 [0, 1]. Let μ be such a measure. The resulting test statistic is ˆ (2.1) T (X |θ) = [FˆX (x) − FˆY (x|θˆ )]2 dμ(x). ˆ is too large. Section 2.2 explains how T (X |θ) ˆ can be computed. The test rejects H 0 if T (X |θ) The measure μ is analogous to the weight function that can enter the finite-dimensional Cram´er-von Mises statistic and many other test statistics. As in finite-dimensional testing, μ or the weight function cannot be selected empirically as this would require knowing how the true data-generation process differs from the parametric model. Rather, one chooses a measure μ or a weight function that is tractable computationally and assigns relatively high probability to regions in the space of alternatives against which one wants good power. In Section 2.2, we propose using a measure based on a Gaussian process. This emphasizes deviations from the null hypothesis that are relatively stable. If, however, we were concerned with highly erratic deviations from the null hypothesis, we would choose a measure corresponding to an erratic stochastic process (e.g. a process with few finite moments). It is also possible to construct a Kolmogorov-Smirnov type test of H 0 . The test statistic is ˆ = sup |FˆX (x) − FˆY (x|θ)|. 
ˆ TKS (X |θ) x tβ∗ (X ∗ )], where P ∗ denotes the probability measure induced by sampling the distribution of ˆ > t ˆ (X ), where t ˆ (X ) ˆ For an α-level test, define βˆ to be the solution of q(β) ˆ Y with θ = θ. = α. Reject H 0 if T (X |θ) β β ∗ ∗ ˆ ˆ is the 1 − β quantile of the bootstrap distribution of T (X |θ ). Bootstrap iteration provides asymptotic refinements in many settings. We do not investigate refinements here. The Monte Carlo results reported in Section 4 indicate that the bootstrap procedure for obtaining the critical value works well without iteration.
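The construction above (simulate sample paths under the fitted model, form empirical distribution functionals, and integrate their squared difference against draws from μ) can be sketched in a few lines. Everything concrete here is an assumption chosen for illustration, not the paper's application: Y is taken to be Brownian motion with volatility θ, the time grid is coarse, and μ is approximated by a truncated sine-series Gaussian process.

```python
import math
import random

random.seed(0)

GRID = [j / 25 for j in range(26)]  # discretization of I = [0, 1] (assumed)

def sample_path(theta):
    """One path of the assumed model: Brownian motion with volatility theta."""
    x, path = 0.0, [0.0]
    for j in range(1, len(GRID)):
        x += theta * math.sqrt(GRID[j] - GRID[j - 1]) * random.gauss(0, 1)
        path.append(x)
    return path

def F_hat(paths, x):
    """Empirical distribution functional: fraction of paths lying below x everywhere."""
    return sum(all(p <= f for p, f in zip(path, x)) for path in paths) / len(paths)

def mu_draw(K=25, C=1.0, d=1.5):
    """Draw Z(t) = sum_k rho_k N_k phi_k(t) with phi_k(t) = sqrt(2) sin(k pi t)
    and rho_k = C k^{-d}: a truncated sine-series Gaussian process for mu."""
    N = [random.gauss(0, 1) for _ in range(K)]
    return [sum(C * (k + 1) ** -d * N[k] * math.sqrt(2)
                * math.sin((k + 1) * math.pi * t)
                for k in range(K)) for t in GRID]

def cvm_stat(data_paths, theta_hat, m=100, n_mu=100):
    """T(X|theta_hat): the mu-integral of [F_hat_X - F_hat_Y]^2, by Monte Carlo."""
    sim = [sample_path(theta_hat) for _ in range(m)]  # paths under the fitted model
    total = 0.0
    for _ in range(n_mu):
        x = mu_draw()
        total += (F_hat(data_paths, x) - F_hat(sim, x)) ** 2
    return total / n_mu

data = [sample_path(1.0) for _ in range(50)]  # stand-in for the observed paths
print(cvm_stat(data, theta_hat=1.0))          # small when the model fits
```

Since the squared difference of two empirical distribution functionals lies in [0, 1], the statistic is bounded by 1 regardless of the simulation sizes.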
S8
F. A. Bugni et al.
depends on unknown population parameters, so it is not useful for obtaining critical values for T(X|θ̂). The bootstrap procedure of Section 2.3 is used for that purpose. However, the asymptotic distributional results show that the test is consistent against fixed alternative hypotheses and that it has power against local alternatives whose distance from the null-hypothesis distribution, F_Y, is O(n^{-1/2}).
The asymptotic distribution of T(X|θ̂) depends on the asymptotic properties of θ̂. We assume that as n → ∞, θ̂ converges in probability to a unique, non-stochastic limit, θ_0. We also assume that n^{1/2}(θ̂ − θ_0) has the representation

   n^{1/2}(θ̂ − θ_0) = n^{-1/2} Σ_{i=1}^n Ψ(X_i) + o_p(1)   (3.1)

as n → ∞, where Ψ is a p-vector valued function that is square-integrable with respect to μ and is such that EΨ(X) = 0 and cov[Ψ(X)] is non-singular. The estimator

   θ̂ = arg min_{θ∈Θ} T(X|θ)

has these properties under mild regularity conditions. Many other estimators also have these properties.
We use the following additional notation. Define the p-vector Ḟ(·|θ) = ∂F(·|θ)/∂θ. Let ζ be the Gaussian process on L²[0,1] having the same covariance structure as the indicator process I(X ≤ x) ≡ I[X(t) ≤ x(t) for all t ∈ I]. That is, the covariance function is

   ψ(x_1, x_2) = cov[ζ(x_1), ζ(x_2)] = F_X(x_1 ∧ x_2) − F_X(x_1)F_X(x_2),

where x_1 ∧ x_2 denotes the function that equals x_1(t) ∧ x_2(t) for each t ∈ [0,1]. Let ξ be a p-variate normal random variable whose mean is 0, whose covariance matrix is V, and which satisfies E[ξζ(x)] = E{Ψ(X)[I(X ≤ x) − F_X(x)]}. We make the following assumptions.

ASSUMPTION 3.1. The functional data X ≡ {X_1(·), ..., X_n(·)} are an independent random sample from the population whose distribution functional is F_X.

ASSUMPTION 3.2. (i) θ_0 is uniquely defined. (ii) n^{1/2}(θ̂ − θ_0) has the asymptotic representation (3.1). Moreover, EΨ(X) = 0, V is finite and non-singular, and ∫ Ψ(x)'Ψ(x) dμ(x) < ∞.
ASSUMPTION 3.3. Ḟ(·|θ) exists for all θ in an open set O that contains θ_0. Moreover,

   sup_{θ∈O} ∫ Ḟ(x|θ)'Ḟ(x|θ) dμ(x) < ∞

and

   lim_{ε→0} Σ_{i,j=1}^p sup_{‖θ−θ_0‖≤ε} ∫ [Ḟ_i(x|θ) − Ḟ_i(x|θ_0)][Ḟ_j(x|θ) − Ḟ_j(x|θ_0)] dμ(x) = 0,
Goodness-of-fit tests for functional data
S9
where Ḟ_i denotes the ith component of Ḟ and ‖θ − θ_0‖ is the Euclidean distance between θ and θ_0.

ASSUMPTION 3.4. μ is the measure induced by the Gaussian process

   Z(t) = Σ_{k=1}^∞ ρ_k N_k φ_k(t),  0 ≤ t ≤ 1,

where 0 < |ρ_k| ≤ Ck^{-d} for all k and some constants C < ∞ and d > 1, the N_k's are independent standard normal random variables, and φ_k(t) = 2^{1/2} sin(kπt).

The independence requirement of Assumption 3.1 precludes applying our test to the path of prices or returns of a single financial asset. However, the test can be applied to a portfolio of assets whose prices or returns move independently after removal of any common trends. 3 Assumption 3.4 ensures that, with probability 1, functions sampled from the population with distribution μ are bounded and in L²[0,1]. Other basis functions and distributions of the N_k's could be used. For example, the basis could be cosine functions, or sines and cosines together.
Our asymptotic distributional result treats the following three cases:

1. H_0 is true. That is, θ_0 ∈ O and F_X(·) = F_Y(·|θ_0).
2. H_0 is false, and F_X constitutes a sequence of local alternatives. That is,
   F_X(·) = F_Y(·|θ_0) + n^{-1/2} D(·)
   for some θ_0 ∈ O, where D is a bounded functional on L²[0,1].
3. F_X is fixed and H_0 is false. That is, there is no θ ∈ Θ such that F_X(·) = F_Y(·|θ).

Observe that case 1 is identical to case 2 with D = 0. We now have the following theorem.
THEOREM 3.1. Let Assumptions 3.1–3.4 hold. In cases 1 and 2,

   nT(X|θ̂) →d V,   (3.2)

where V = ∫ [ζ(x) + D(x) + Ḟ_Y(x|θ_0)'ξ]² dμ(x) and D = 0 in case 1. In case 3,

   T(X|θ̂) →p ∫ [F_X(x) − F_Y(x|θ_0)]² dμ(x).   (3.3)

Result (3.3) implies that the test is consistent against fixed alternative hypotheses. Result (3.2) gives the distribution of the test statistic under the null hypothesis (D = 0) and contiguous alternatives (D ≠ 0). In particular, (3.2) implies that the test has non-trivial asymptotic power (that is, asymptotic power exceeding the probability of rejecting a correct null hypothesis) against alternatives whose distance from the null hypothesis is O(n^{-1/2}). From some points of view, cases 1 and 2 of Theorem 3.1 can be interpreted as extensions, to the setting of functional data, of work of Neuhaus (1971, 1976) and Behnen and Neuhaus (1975) on limit theory under contiguous alternatives.

3 If the prices or returns of a single asset are weakly dependent, then it may be possible to apply a version of our test to data consisting of blocks of prices or returns. However, the investigation of this extension is beyond the scope of this paper.
An alternative representation of V at (3.2) is V = Σ_{j=1}^∞ W_j², where

   W_j = ∫ [ζ(x) + D(x) + Ḟ_Y(x|θ_0)'ξ] ψ_j(x) dμ(x)

and ψ_1, ψ_2, ... is an orthonormal sequence of eigenfunctionals of the linear operator γ whose kernel is the covariance function of the limit process, that is,

   γ(ψ)(x_1) = ∫ C(x_1, x) ψ(x) dμ(x),  C(x_1, x_2) = cov[ζ(x_1) + Ḟ_Y(x_1|θ_0)'ξ, ζ(x_2) + Ḟ_Y(x_2|θ_0)'ξ].

This representation of V can be regarded as an extension of Neuhaus' (1976) result concerning the power of Cramér-von Mises tests under contiguous alternatives.

3.2. Consistency of the bootstrap

This section establishes the validity of the bootstrap procedure of Section 2.3 for estimating the critical value of T(X|θ̂). Let V_0 be the random variable

   V_0 = ∫ [ζ(x) + Ḟ_Y(x|θ_0)'ξ]² dμ(x).

It follows from Theorem 3.1 that nT(X|θ̂) is asymptotically distributed as V_0 when H_0 is true. Let s_α denote the asymptotic α-level critical value of the test based on T(X|θ̂). Then P(V_0 > s_α) = α. Let P* denote the probability measure that is induced by bootstrap sampling. The bootstrap α-level critical value, s*_α, is the solution to P*[nT(X*|θ̂*) > s*_α] = α. The Cramér-von Mises test based on the bootstrap critical value rejects H_0 at the nominal α level if nT(X|θ̂) > s*_α. The true rejection probability is P[nT(X|θ̂) > s*_α]. The following theorem shows that the true rejection probability approaches the nominal level α as n → ∞.

THEOREM 3.2. Let Assumptions 3.1–3.4 hold. Then s*_α →p s_α in each of the three cases of Theorem 3.1. Moreover, if H_0 is true, then

   lim_{n→∞} P[nT(X|θ̂) > s*_α] = α.
It follows immediately from Theorems 3.1 and 3.2 that if F_X is fixed and does not satisfy H_0, then the probability of rejecting H_0 approaches 1 as n → ∞ whenever F_X(x) − F_Y(x|θ_0) is non-zero on a set of non-vanishing μ measure. Thus, the Cramér-von Mises test based on the bootstrap critical value is consistent. Moreover, in case 2, where F_X is the sequence of local alternatives F_X(·) = F_Y(·|θ_0) + n^{-1/2}D(·), the probability that the bootstrap-based test rejects H_0 converges to P(V > s_α). If we set D = cD_0, where c is a constant and D_0 is a functional that is non-zero on a set of positive μ measure, then

   lim_{|c|→∞} lim_{n→∞} P[nT(X|θ̂) > s*_α] = 1.

Thus, as the local alternative distributions in case 2 move further from H_0, the Cramér-von Mises test with a bootstrap critical value detects them with probability approaching 1.
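The mechanics of the bootstrap critical value s*_α (simulate X* from the fitted model, re-estimate θ on X*, recompute the statistic, take the 1 − α quantile of the bootstrap draws) can be sketched in a scalar toy version. The normal location model, the scalar analogue of the statistic, and all simulation sizes below are assumptions chosen for brevity; the paper's functional-data version replaces the scalars with sample paths.

```python
import math
import random

random.seed(1)

def ecdf(sample, z):
    return sum(s <= z for s in sample) / len(sample)

def stat(data, theta_hat, m=100, n_mu=100):
    """Scalar analogue of nT(X|theta_hat): n times the mu-integral of
    [F_hat_X - F_hat_Y(.|theta_hat)]^2, with mu taken as N(0, 1) draws (assumed)."""
    sim = [theta_hat + random.gauss(0, 1) for _ in range(m)]
    total = 0.0
    for _ in range(n_mu):
        z = random.gauss(0, 1)
        total += (ecdf(data, z) - ecdf(sim, z)) ** 2
    return len(data) * total / n_mu

def bootstrap_critical_value(data, alpha=0.05, n_boot=49):
    """s*_alpha: the 1 - alpha quantile of the bootstrap distribution of the statistic."""
    n = len(data)
    theta_hat = sum(data) / n                       # estimator of theta on the data
    draws = []
    for _ in range(n_boot):
        star = [theta_hat + random.gauss(0, 1) for _ in range(n)]  # X* from fitted model
        draws.append(stat(star, sum(star) / n))     # re-estimate theta on X*
    draws.sort()
    return draws[math.ceil((1 - alpha) * n_boot) - 1]

data = [0.3 + random.gauss(0, 1) for _ in range(50)]   # H0 holds: data follow the model
s_star = bootstrap_critical_value(data)
t_obs = stat(data, sum(data) / len(data))
print(t_obs > s_star)   # here H0 is true, so rejection should occur rarely
```

Re-estimating θ on each bootstrap sample, rather than reusing θ̂, is what makes the bootstrap distribution mimic that of nT(X|θ̂) with an estimated parameter.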
3.3. Convergence of the finite-dimensional approximation to μ

The statistic T(X|θ̂) is an integral with respect to the infinite-dimensional measure, μ, that is induced by the process Z defined in (2.3). Section 2.2 proposes approximating the integral by replacing μ with the finite-dimensional measure, μ_M, that is induced by the process Z_M defined in (2.5). This section shows that integrals with respect to μ_M converge to the corresponding integrals with respect to μ as M → ∞.
Let L²(R^∞) denote the set of all infinite sequences b = {b_1, b_2, ...} of real numbers such that Σ_{i=1}^∞ b_i² < ∞. Let {φ_i : i = 1, 2, ...} be the basis functions that are used in (2.3) and (2.5). Let A be a μ-measurable subset of L²[0,1]. For each function a ∈ A there is a sequence b ∈ L²(R^∞) that is defined by

   b_i = ∫_0^1 a(t)φ_i(t) dt.   (3.4)

Define the sets

   B = {b : b_i is given by (3.4) for some a ∈ A},
   B_M = {(b_1, ..., b_M) : (b_1, b_2, ...) ∈ B for some (b_{M+1}, b_{M+2}, ...)}

and

   A_M = {Σ_{i=1}^∞ b_i φ_i : (b_1, ..., b_M) ∈ B_M and Σ_{i=M+1}^∞ b_i² < ∞}.

For each integer K > 0, define the stochastic processes

   W_{nK1}(x) = Σ_{k=1}^K c_{nk} ϕ_k(x),  W_{nK2}(x) = Σ_{k=K+1}^∞ c_{nk} ϕ_k(x),  ζ_K(x) = Σ_{k=1}^K d_k ϕ_k(x).

Define the random variables

   V_{Kζ} = ∫ [ζ_K(x) + D(x) + Ḟ_Y(x|θ_0)'ξ]² dμ(x)

and

   V_{nK1} = ∫ [W_{nK1}(x) + D(x) + Ḟ_Y(x|θ_0)'ξ]² dμ(x).

Let V be the random variable defined in (3.2). Now

   V_W = (V_W − V_{nK1}) + V_{nK1}.   (A.2)
Expanding the integrand of V_W and applying the Cauchy–Schwarz inequality yields

   |V_W − V_{nK1}| ≤ 2 V_{nK1}^{1/2} [∫ W_{nK2}² dμ(x)]^{1/2} + ∫ W_{nK2}² dμ(x).   (A.3)

Standard methods for K-variate problems may be used to show that

   V_{nK1} →d V_{Kζ}   (A.4)

for each K as n → ∞. Moreover,

   V_{Kζ} − V →p 0   (A.5)

as K → ∞, and

   lim_{K→∞} lim sup_{n→∞} E ∫ W_{nK2}² dμ(x) = 0.   (A.6)

Combining (A.2)–(A.6) yields the result that

   V_W →d V.   (A.7)

The theorem follows by combining (A.1) and (A.7).
Proof of Theorem 3.2. Let ζ_θ denote a Gaussian process having mean zero and the covariance structure of the indicator process I(Y ≤ x) − F_Y(x|θ). Define

   V_θ = ∫ [ζ_θ(x) + Ḟ_Y(x|θ)'ξ]² dμ(x).

Write P_θ for the probability measure under the assumption that X has the distribution of Y with parameter θ. Then arguments like those used to prove Theorem 3.1 show that if η > 0 is sufficiently small, then

   lim_{n→∞} sup_{θ: ‖θ−θ_0‖≤η} sup_{t: t>0} |P_θ[nT(X|θ) ≤ t] − P(V_θ ≤ t)| = 0.   (A.8)

Moreover,

   lim_{η→0} sup_{θ: ‖θ−θ_0‖≤η} sup_{t: t>0} |P(V_θ ≤ t) − P(V_{θ_0} ≤ t)| = 0.   (A.9)

Let t_{nβ}(θ) denote the β-level critical value of nT(X|θ) when the data have the distribution of Y, and let t_β(θ) denote the β-level critical value of V_θ. Note that the distribution of V_θ is continuous and has support equal to the positive half-line. Then it follows from (A.8)–(A.9) that for all sufficiently small δ > 0,

   sup_{n≥n_1} sup_{θ: ‖θ−θ_0‖≤η} sup_{β: |β−α|≤δ} |t_{nβ}(θ) − t_β(θ_0)| → 0   (A.10)

as n_1 → ∞ and δ → 0. Now s*_α and s_α are identical to t_{nα}(θ̂) and t_α(θ_0), respectively. Therefore, (A.10) with β = α and the fact that θ̂ →p θ_0 imply that s*_α →p s_α.

Proof of Theorem 3.3. The proof consists of showing that the class of sets A for which (3.5) holds is a sigma field and that it contains all balls.
The Econometrics Journal (2009), volume 12, pp. S19–S49. doi: 10.1111/j.1368-423X.2009.00285.x

Finite-sample distribution-free inference in linear median regressions under heteroscedasticity and non-linear dependence of unknown form

ELISE COUDIN† AND JEAN-MARIE DUFOUR‡,§,¶

† Centre de Recherche en Économie et Statistique, Institut National de la Statistique et des Études Économiques, 15 Boulevard Gabriel Péri, 92245 Malakoff Cedex, France
E-mail: [email protected]
‡ Department of Economics, McGill University, 855 Sherbrooke Street West, Montréal, Quebec H3A 2T7, Canada
§ Centre interuniversitaire de recherche en analyse des organisations, 2020 rue University, 25e étage, Montréal, Quebec H3A 2A5, Canada
¶ Centre interuniversitaire de recherche en économie quantitative, Université de Montréal, Quebec H3C 3J7, Canada
E-mail: [email protected]

First version received: August 2008; final version accepted: January 2009
Summary. We construct finite-sample distribution-free tests and confidence sets for the parameters of a linear median regression, where no parametric assumption is imposed on the noise distribution. The set-up studied allows for non-normality, heteroscedasticity, non-linear serial dependence of unknown form, as well as for discrete distributions. We consider a mediangale structure—the median-based analogue of a martingale difference—and show that the signs of mediangale sequences follow a nuisance-parameter-free distribution despite the presence of non-linear dependence and heterogeneity of unknown form. We point out that a simultaneous inference approach in conjunction with sign transformations yields statistics with the required pivotality features, in addition to the usual robustness properties. Monte Carlo tests and projection techniques are then exploited to produce finite-sample tests and confidence sets. Further, under weaker assumptions, which allow for weakly exogenous regressors and a wide class of linear dependence schemes in the errors, we show that the proposed procedures remain asymptotically valid. The regularity assumptions used are notably less restrictive than those required by procedures based on least absolute deviations (LAD). Simulation results illustrate the performance of the procedures. Finally, the proposed methods are applied to tests of the drift in the Standard and Poor's composite price index series (allowing for conditional heteroscedasticity of unknown form).

Keywords: Bootstrap, Discrete distribution, Distribution-free, GARCH, Heteroscedasticity, Median regression, Monte Carlo test, Non-normality, Projection methods, Quantile regression, Serial dependence, Signs, Sign test, Simultaneous inference, Stochastic volatility.
S20
E. Coudin and J.-M. Dufour
1. INTRODUCTION

Median regression (and related quantile regressions) provides an attractive bridge between parametric and non-parametric models. Distributional assumptions on the disturbance process are relaxed, but the functional form remains parametric. Associated estimators, such as the least absolute deviations (LAD) estimator, are more robust to outliers than usual least-squares (LS) methods and may be more efficient whenever the median is a better measure of location than the mean (Dodge, 1997). They are especially appropriate when unobserved heterogeneity is suspected in the data. The current expansion of such 'semiparametric' techniques reflects an intention to depart from restrictive parametric frameworks (see Powell, 1994). However, the related tests usually remain based on asymptotic normality approximations.
In this paper, we show that tests based on residual signs yield an entire system of finite-sample exact inference under very general assumptions. We study a linear median regression model where the (possibly dependent) disturbance process is assumed to have a null median, conditional on some exogenous explanatory variables and its own past. This set-up covers non-stochastic heteroscedasticity, standard conditional heteroscedasticity (such as ARCH, GARCH and stochastic volatility models) as well as other forms of non-linear dependence. We provide both finite-sample and asymptotic distributional theories. In the first set of results, we show that the level of the tests is provably equal to the nominal level, for any sample size. Exact tests and confidence regions are valid under general assumptions and allow for heteroscedasticity and non-linear dependence of unknown forms, as well as for discrete distributions.
This is done, in particular, by combining Monte Carlo tests adapted to discrete statistics—using a randomized tie-breaking procedure (Dufour, 2006)—with projection techniques, which allow inference on general parameter transformations (Dufour, 1990). We also show that the tests proposed include locally optimal tests. However, for more general processes that may involve stationary ARMA disturbances, sign-based statistics are no longer pivotal. The serial dependence parameters constitute nuisance parameters. In a second set of results, we show that the proposed procedures remain asymptotically valid when the regressors are weakly exogenous and disturbances are stationary ARMA. Transforming sign-based statistics with standard heteroscedasticity and autocorrelation-corrected (HAC) methods allows one to eliminate nuisance parameters asymptotically. We thus extend the validity of the Monte Carlo test method. In such cases, we lose exactness but retain asymptotic validity. The latter holds under much weaker assumptions on moments or the shape of the distribution (such as the existence of a density) than usual asymptotically justified inference (such as LAD-based techniques). Besides, one does not need to evaluate the disturbance density at zero, which constitutes one of the major difficulties of asymptotic kernel-based methods associated with LAD and other quantile estimators. A basic motivation for the sign-based techniques studied in this paper comes from an impossibility result due to Lehmann and Stein (1949), who proved that inference procedures that are valid under conditions of heteroscedasticity of unknown form when the number of observations is finite, must control the level of the tests conditional on the absolute values (see also Pratt and Gibbons, 1981). This result has two main consequences. First, sign-based methods constitute the only general way of producing provably valid inference for any given sample size. 
Second, all other methods, including the usual HAC methods developed by White (1980), Newey and West (1987), Andrews (1991) and others, which are not based on signs, are not provably valid for any sample size. Although this provides a compelling argument for using sign-based
procedures, the latter have barely been exploited in econometrics; for a few exceptions which focus on simple time series models, see Dufour (1981), Campbell and Dufour (1991, 1995, 1997) and Wright (2000). In a regression context, the vast majority of the statistical literature is reviewed by Boldin et al. (1997). These authors also develop sign-based inference and estimation for linear models, both exact and asymptotic with i.i.d. errors. In the same vein, the recent paper by Chernozhukov et al. (2008) considers quantile regression models and derives finite sample inference using quantile indicators when the observations are independent. The problem of interest in the present paper consists in giving conditions under which signs will be i.i.d. according to a known distribution, even though the variables to which indicator functions are applied are not independent or do not satisfy other regularity conditions (such as following an absolutely continuous distribution). An important feature of our results consists in allowing for a dynamic structure in the error distribution, providing a considerable extension of earlier results on the distribution of signs in the presence of dependent observations. Moreover, errors with discrete distribution (or mixtures of discrete and continuous distributions) are allowed, as opposed to the usual continuity assumption. This is made possible by the combination of a ternary sign operator—rather than binary—and Monte Carlo test techniques involving randomized tie-breaking. Sign-based inference methods constitute an alternative to inference derived from the asymptotic distribution of LAD estimators and their extensions (see Koenker and Bassett, 1978, Powell, 1984, Weiss, 1991, Fitzenberger, 1997b, Horowitz, 1998, Zhao, 2001, etc.). An important problem in the LAD literature consists in providing good estimates of the asymptotic covariance matrix, on which inference relies. 
Powell (1984) suggested kernel estimation, but the most widespread method of estimation is the bootstrap (Buchinsky, 1995; Fitzenberger, 1997b; Hahn, 1997). 1 Kernel techniques are sensitive to the choice of kernel function and bandwidth parameter, and the estimation of the LAD asymptotic covariance matrix needs a reliable estimator of the error term density at zero. This may be tricky especially when disturbances are heteroscedastic or simply do not possess a density with respect to the Lebesgue measure (discrete distributions). Besides, whenever the normal distribution is not a good finite-sample approximation, inference based on covariance matrix estimation may be problematic. From a finite-sample point of view, asymptotically justified methods can be arbitrarily unreliable. Test sizes can be far from their nominal levels. One can find examples of such distortions for time series in Dufour (1981) and Campbell and Dufour (1995, 1997) and for L 1 -estimation in Dielman and Pfaffenberger (1988a,b), De Angelis et al. (1993) and Buchinsky (1995). Inference based on signs constitutes an alternative that does not suffer from these shortcomings. 2 The paper is organized as follows. In Section 2, we present the model and the notations. Section 3 contains results on exact inference. In Section 4, we derive confidence intervals at any given confidence level and illustrate the method on a numerical example. Section 5 is dedicated to the asymptotic validity of the finite-sample inference method. In Section 6, we give simulation results from comparisons with usual techniques. Section 7 presents an illustrative application: testing the presence of a drift in the Standard and Poor’s composite price index series. Section 8 concludes. The Appendix contains the proofs.
1 See Buchinsky (1995, 1998) for a review and Fitzenberger (1997b) for a comparison between these methods.
2 Other notable areas of investigation in the L_1-literature concern: (1) censored quantile regressions (Powell, 1984, 1986, Fitzenberger, 1997a, Buchinsky and Hahn, 1998); (2) endogeneity (Amemiya, 1982, Powell, 1983, Hong and Tamer, 2003); (3) misspecification (Jung, 1996, Kim and White, 2002, Komunjer, 2005).
2. FRAMEWORK

We consider a stochastic process {(y_t, x_t')' : Ω → R^{p+1} : t = 1, 2, ...} defined on a probability space (Ω, F, P), such that y_t and x_t satisfy a linear model of the form

   y_t = x_t'β + u_t,  t = 1, ..., n,   (2.1)

where y_t is a dependent variable, x_t = (x_{t1}, ..., x_{tp})' is a p-vector of explanatory variables, and u_t is an error process. The x_t's may be random or fixed. In the sequel, y = (y_1, ..., y_n)' ∈ R^n will denote the dependent vector, X = [x_1, ..., x_n]' the n × p matrix of explanatory variables, and u = (u_1, ..., u_n)' ∈ R^n the disturbance vector. Moreover, F_t(·|x_1, ..., x_n) represents the distribution function of u_t conditional on X.
Inference on this model will be made possible through assumptions on the conditional medians of the errors. To do this, it will be convenient to consider adapted sequences of the form

   S(v, F) = {v_t, F_t : t = 1, 2, ...},   (2.2)
where v_t is any measurable function of W_t = (y_t, x_t')', F_t is a σ-field in Ω, F_s ⊆ F_t for s < t, σ(W_1, ..., W_t) ⊂ F_t, and σ(W_1, ..., W_t) is the σ-algebra spanned by W_1, ..., W_t. We shall depart from the usual assumption that E(u_t|F_{t−1}) = 0, ∀t ≥ 1, i.e. that u = {u_t : t = 1, 2, ...} in the adapted sequence S(u, F) = {u_t, F_t : t = 1, 2, ...} is a martingale difference with respect to F_t = σ(W_1, ..., W_t), t = 1, 2, ....
In a framework that allows for heteroscedasticity of unknown form, it is known from Bahadur and Savage (1956) that inference on the mean of i.i.d. observations of a random variable, without any further assumption on the form of the distribution, is impossible: such a test has no power. This problem of non-testability can be viewed as a form of non-identification in a wide sense. Unless relatively strong distributional assumptions are made, moments are not empirically meaningful. Thus, if one wants to relax the distributional assumptions, one must choose another measure of central tendency, such as the median. The median is especially appropriate if the distribution of the disturbance process does not possess moments. Thus, in the median regression framework, it appears that the martingale difference assumption should be replaced by an analogue in terms of the median. Such a mediangale may be defined conditional on the design matrix X or unconditionally. Here, we focus on the conditional form.

DEFINITION 2.1 (Weak conditional mediangale). Let F_t = σ(u_1, ..., u_t, X), for t ≥ 1. u in the adapted sequence S(u, F) is a weak mediangale conditional on X with respect to {F_t : t = 1, 2, ...} iff P[u_1 < 0|X] = P[u_1 > 0|X] and

   P[u_t < 0|u_1, ..., u_{t−1}, X] = P[u_t > 0|u_1, ..., u_{t−1}, X], for t > 1.

The above definition allows u_t to have a discrete distribution with a non-zero probability mass at zero. A more restrictive version, called the strict conditional mediangale, imposes a zero probability mass at zero.
Then, P[u_1 < 0|X] = P[u_1 > 0|X] = 0.5 and P[u_t < 0|u_1, ..., u_{t−1}, X] = P[u_t > 0|u_1, ..., u_{t−1}, X] = 0.5, for t > 1. With no mass at zero and no matrix X, this concept coincides with the mediangale notion defined in Linton and Whang (2007), together with other quantilegales. 3

3 Linton and Whang (2007) define u to be a mediangale if E(ψ_{1/2}(u_t)|F_{t−1}) = 0, ∀t, where F_{t−1} = σ(u_{t−1}, u_{t−2}, ...) and ψ_{1/2}(x) = 1/2 − 1_{(−∞,0)}(x). This definition is adapted to continuous distributions but does not work
Stating that u is a weak mediangale with respect to F is equivalent to assuming that its sign process s(u) = {s(u_t) : t = 1, 2, ...}, where s(a) = 1_{[0,+∞)}(a) − 1_{(−∞,0]}(a), ∀a ∈ R, is a martingale difference with respect to the same sequence of sub-σ-algebras F. The martingale difference assumption on the raw process u is replaced by a quasi-similar hypothesis on s(u), a robust transform of that process. However, the weak conditional mediangale concept differs from a martingale difference on the signs, because it requires conditioning upon the whole process X. We shall see later that asymptotic inference may be available under a classical martingale difference on signs or, more generally, mixing conditions on {s(u_t), σ(W_1, ..., W_t) : t = 1, 2, ...}.
It is relatively easy to deal with a weak mediangale by a simple transformation of the sign operator. Consider P[u_t = 0 | X, u_1, ..., u_{t−1}] = p_t(X, u_1, ..., u_{t−1}) > 0, where the p_t(·) are unknown and may vary between observations. A way out consists in modifying the sign function s(x) as

   s̃(x, V) = s(x) + [1 − s(x)²] s(V − 0.5),

where V ∼ U(0, 1). If V_t is independent of u_t then, irrespective of the distribution of u_t,

   P[s̃(u_t, V_t) = +1] = P[s̃(u_t, V_t) = −1] = 1/2.
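The randomized sign operator s̃ can be implemented directly. The two-point error distribution below, with probability mass 1/2 at zero, is an assumed example used only to check that the tie-breaking restores the symmetric two-point sign distribution.

```python
import random

random.seed(42)

def s(a):
    """Ternary sign operator: +1 for a > 0, -1 for a < 0, 0 at a tie."""
    return (a > 0) - (a < 0)

def s_tilde(x, v):
    """Randomized sign s~(x, V) = s(x) + [1 - s(x)^2] s(V - 0.5):
    equal to s(x) when x != 0, otherwise a fair coin driven by V ~ U(0, 1)."""
    return s(x) + (1 - s(x) ** 2) * s(v - 0.5)

# An error distribution with probability mass 1/2 at zero (assumed illustration).
errors = [random.choice([0.0, 0.0, -1.0, 1.0]) for _ in range(20000)]
signs = [s_tilde(u, random.random()) for u in errors]

print(sum(sg == 0 for sg in signs))                # ties have been resolved
print(sum(sg == 1 for sg in signs) / len(signs))   # close to 1/2
```

The empirical frequency of +1 is close to 1/2 even though half of the raw errors are exactly zero, which is the pivotality property the transformation is meant to deliver.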
To simplify the presentation, we shall focus on the strict mediangale concept. Therefore, our model will rely on the following assumption.

ASSUMPTION 2.1 (Strict conditional mediangale). The components of u = (u_1, ..., u_n)' satisfy a strict mediangale conditional on X.

One remark concerns exogeneity. As long as the x_t's are strongly exogenous, the conditional mediangale concept is equivalent to a martingale difference on signs with respect to F_t = σ(W_1, ..., W_t), t = 1, 2, ....

PROPOSITION 2.1 (Mediangale exogeneity). Suppose {x_t : t = 1, 2, ...} is a strongly exogenous process for β, P[u_1 > 0] = P[u_1 < 0] = 0.5 and P[u_t > 0|u_1, ..., u_{t−1}, x_1, ..., x_t] = P[u_t < 0|u_1, ..., u_{t−1}, x_1, ..., x_t] = 0.5. Then {u_t : t = 1, 2, ...} is a strict mediangale conditional on X.

Model (2.1) with Assumption 2.1 allows for very general forms of the disturbance distribution, including asymmetric, heteroscedastic or dependent ones, as long as conditional medians are 0. Neither density nor moment existence is required. Indeed, what the mediangale concept requires is a form of independence in the signs of the residuals. This extends results in Dufour (1981), Campbell and Dufour (1991, 1995, 1997) and Dufour et al. (1998). For example, Assumption 2.1 is satisfied if u_t = σ_t(x_1, ..., x_n)ε_t, t = 1, ..., n, where ε_1, ..., ε_n are i.i.d. conditional on X, which is relevant for cross-sectional data. Many dependence schemes are also covered, especially any model of the form u_1 = σ_1(x_1, ..., x_n)ε_1, u_t = σ_t(x_1, ..., x_n, u_1, ..., u_{t−1})ε_t, t = 2, ..., n, where ε_1, ..., ε_n are independent with median 0, and σ_1(x_1, ..., x_n) and σ_t(x_1, ..., x_n, u_1, ..., u_{t−1}), t = 2, ..., n, are non-zero with probability one.
In a time series context, this includes models with robustness properties to the specification of an endogenous disturbance variance (or volatility), such as ARCH, GARCH or stochastic volatility

well with discrete distributions. If u_t has a mass at zero, the condition given by Definition 2.1 can hold even if E(ψ_{1/2}(u_t)|F_{t−1}) ≠ 0.
models with non-Gaussian noises. Further, the mediangale property is more general because it does not specify explicitly the functional form of the variance, in contrast with an ARCH specification. Note again that the disturbance process does not have to be second-order stationary. Asymptotic normality of the LAD estimator, which is presented in its most general way in Fitzenberger (1997b), holds under some mixing concepts on {s(u_t), σ(W_1, ..., W_t) : t = 1, 2, ...} and an orthogonality condition between s(u_t) and x_t. Besides, it requires additional assumptions on moments. 4 With such a choice, testing is necessarily based on approximations (asymptotic or bootstrap). Here, we focus on valid finite-sample inference without any further assumption on the form of the distributions. This non-parametric set-up extends those used in Dufour (1981) and Campbell and Dufour (1991, 1995, 1997).
Assumption 2.1 can easily be extended to allow for another quantile q by setting P[u_t < 0|F_{t−1}] = q, ∀t, which would lead to P[u_t < 0|u_1, ..., u_{t−1}, x_1, ..., x_t] = q in Proposition 2.1. However, with error heterogeneity or dependence of unknown form, such an assumption can plausibly hold only for a single quantile. So little generality is lost by focusing on the median case. Further, contrary to other quantiles, the median may have an economic meaning when it coincides with the expectation, e.g. if the error density is symmetric. It can be used to state expectation-based economic conditions, such as a no-arbitrage opportunity condition on a market. A classical result in non-parametric statistics consists in using this Bernoulli distribution to build exact tests and confidence intervals on quantiles (for i.i.d. observations); see Thompson (1936), Scheffé and Tukey (1945) and the review of David (1981, ch. 2).
For a recent econometric exploitation of a quantile version of this result, which holds if the observations are X-conditionally independent, see Chernozhukov et al. (2008). Proposition 2.1 above provides general conditions under which such a result holds for non-i.i.d. observations. Finally, the set-up presented here extends those approaches to the time series context, where some kinds of Markovian serial dependence, as well as discrete distributions, are permitted.
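The classical result just mentioned, exact tests for a median built from the Bernoulli(1/2) distribution of the signs, can be sketched as follows. The ARCH(1)-type error process is an assumed illustration of a setting where the sign distribution, and hence the level of the test, is unaffected by conditional heteroscedasticity.

```python
import math
import random

random.seed(7)

def binom_tail(n, k):
    """P[Bin(n, 1/2) >= k], computed exactly."""
    return sum(math.comb(n, j) for j in range(k, n + 1)) / 2 ** n

def sign_test_pvalue(u):
    """Two-sided exact sign test of H0: med(u_t) = 0 (assumes no exact zeros)."""
    n, n_pos = len(u), sum(x > 0 for x in u)
    tail = min(binom_tail(n, n_pos), 1 - binom_tail(n, n_pos + 1))
    return min(1.0, 2 * tail)

# ARCH(1)-type errors: u_t = sigma_t * eps_t with sigma_t^2 = 0.5 + 0.5 u_{t-1}^2,
# a conditionally heteroscedastic process whose signs are still i.i.d. +/-1.
u, prev = [], 0.0
for _ in range(200):
    sigma = math.sqrt(0.5 + 0.5 * prev ** 2)
    prev = sigma * random.gauss(0, 1)
    u.append(prev)

print(sign_test_pvalue(u))                      # level is exact despite the ARCH errors
print(sign_test_pvalue([x + 2.0 for x in u]))   # a shifted median is detected
```

Because the p-value is computed from the exact binomial distribution of the number of positive signs, no moment, density, or stationarity assumption on u_t enters the calculation.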
3. EXACT FINITE-SAMPLE SIGN-BASED INFERENCE

In finite samples, first-order asymptotic approximations can be misleading: the sizes of asymptotically justified t- or χ²-statistics can be quite far from their nominal level. Examples of such distortions can be found in the literature on dynamic models (see, for example, Dufour, 1981, Mankiw and Shapiro, 1986, Campbell and Dufour, 1995, 1997) and on inference based on L_1 estimators (see Dielman and Pfaffenberger, 1988a,b; Buchinsky, 1995; De Angelis et al., 1993). These distortions usually motivate the use of bootstrap procedures. In a sense, bootstrapping (once bias-corrected) improves the approximation by introducing artificial observations. However, the bootstrap still relies on approximations, and in general there is no guarantee that the level condition is satisfied in finite samples. The unreliability of asymptotic methods leads us to turn to a fully finite-sample-based approach. Sign-based procedures provide a way to build distribution-free statistics even in finite samples; they have been used in the statistical literature to derive non-parametric sign tests. In this section, we present the general sign pivotality result and apply it in the median regression context to derive sign-based test statistics that are pivots and provide power against alternatives

4 Fitzenberger (1997b) shows that LAD and quantile estimators are consistent and asymptotically normal when E[x_t s_θ(u_t)] = 0, ∀t, where (u_t, x_t) has a density and finite second moments.

© The Author(s). Journal compilation © Royal Economic Society 2009.
Finite-sample distribution-free inference in linear median regressions
S25
of interest. This will enable us to build Monte Carlo tests relying on their exact distribution, so that the level of those tests is exactly controlled for any sample size. We first study the test problem, then build confidence sets. Finally, estimators can be derived.⁵ Hence, results on the valid finite-sample test problem will be adapted to obtain valid confidence intervals and estimators.

3.1. Distribution-free pivotal functions and non-parametric tests

When the disturbance process is a conditional mediangale, the joint distribution of the signs of the disturbances is completely determined. If there is no positive mass at zero, the signs are i.i.d. and take the values 1 and −1 with equal probability 1/2. The case with a mass at zero can be covered provided the randomized sign operator defined in the previous section is used. These results are stated more precisely in the following propositions.

PROPOSITION 3.1. (Sign distribution). Under model (2.1), suppose the errors (u_1, ..., u_n) satisfy a strict mediangale conditional on X = [x_1, ..., x_n]'. Then the variables s(u_1), ..., s(u_n) are i.i.d. conditional on X according to the distribution

P[s(u_t) = 1 | x_1, ..., x_n] = P[s(u_t) = −1 | x_1, ..., x_n] = 1/2,  t = 1, ..., n.  (3.1)

More generally, this result holds for any reordering of t = 1, ..., n: if there is a permutation π : i → j under which the mediangale property holds, then the signs are i.i.d.

From Proposition 3.1, it follows that the residual sign vector

s(y − Xβ) = [s(y_1 − x_1'β), ..., s(y_n − x_n'β)]'

has a nuisance-parameter-free distribution (conditional on X), i.e. it is a 'pivotal function'. Its distribution is easy to simulate from a combination of n independent Bernoulli variables. Furthermore, any function of the form T = T(s(y − Xβ), X) is pivotal, conditional on X. Once the form of T is specified, the distribution of the statistic T is totally determined and can also be simulated. Using Proposition 3.1, it is possible to construct tests for which the size is fully controlled in finite samples. Consider testing H_0(β_0) : β = β_0 against H_1(β_0) : β ≠ β_0. Under H_0(β_0), s(y_t − x_t'β_0) = s(u_t), t = 1, ..., n. Thus, conditional on X,

T(s(y − Xβ_0), X) ∼ T(S_n, X),  (3.2)

where S_n = (s_1, ..., s_n)' and s_1, ..., s_n are i.i.d. B(1/2). A test with level α rejects H_0(β_0) when

T(s(y − Xβ_0), X) > c_T(X, α),  (3.3)

where c_T(X, α) is the (1 − α)-quantile of the distribution of T(S_n, X). This result is generalized to distributions with a positive mass at zero in the following proposition.

PROPOSITION 3.2. (Randomized sign distribution). Suppose (2.1) holds with the assumption that u_1, ..., u_n belong to a weak mediangale conditional on X. Let V_1, ..., V_n be i.i.d. U(0, 1) random variables, independent of u_1, ..., u_n and X. Then the variables s̃_t = s̃(u_t, V_t) are i.i.d. conditional on X with the distribution P[s̃_t = 1 | X] = P[s̃_t = −1 | X] = 1/2, t = 1, ..., n.

All the procedures described in the paper can be applied by replacing s by s̃. When the error distributions possess a mass at zero, the test statistic T(s̃(y − Xβ_0), X) has to be used instead of T(s(y − Xβ_0), X).

3.2. Regression sign-based statistics

We consider test statistics of the following form:

D_S(β_0, Ω_n) = s(y − Xβ_0)' X Ω_n(s(y − Xβ_0), X) X' s(y − Xβ_0),  (3.4)

where Ω_n(s(y − Xβ_0), X) is a p × p weight matrix that depends on the constrained signs s(y − Xβ_0) under H_0(β_0). The weight matrix Ω_n(s(y − Xβ_0), X) provides a standardization that can be useful for power considerations as well as to account for dependence schemes that cannot be eliminated by the sign transformation. Further, Ω_n(s(y − Xβ_0), X) would normally be selected to be positive definite (although this is not essential to show the pivotality of the test statistic under the null hypothesis).⁶ Statistics of the form D_S(β_0, Ω_n) include as special cases the ones studied by Koenker and Bassett (1982) and Boldin et al. (1997). Namely, on taking Ω_n = I_p and Ω_n = (X'X)^{−1}, we get

SB(β_0) = s(y − Xβ_0)' X X' s(y − Xβ_0) = ‖X' s(y − Xβ_0)‖²,  (3.5)

SF(β_0) = s(y − Xβ_0)' P(X) s(y − Xβ_0) = ‖X' s(y − Xβ_0)‖²_M,  (3.6)

5 For the estimation theory, the reader is referred to Coudin and Dufour (2006).
where P(X) = X(X'X)^{−1}X'. Boldin et al. (1997) show that SB(β_0) and SF(β_0) can be associated with locally most powerful tests in the case of i.i.d. disturbances under some regularity conditions on the distribution function (especially f(0) ≠ 0).⁷ Their proof can easily be extended to disturbances that satisfy the mediangale property and for which the conditional density at zero is the same: f_t(0|X) = f(0|X), t = 1, ..., n. SF(β_0) can be interpreted as a sign analogue of the Fisher statistic: it is a monotonic transformation of the Fisher statistic for testing γ = 0 in the regression of s(y − Xβ_0) on X, namely s(y − Xβ_0) = Xγ + v. This remark also holds for a general sign-based statistic of the form (3.4), when s(y − Xβ_0) is regressed on Ω_n^{−1/2}X. Wald, Lagrange multiplier (LM) and likelihood ratio (LR) asymptotic tests for M-estimators in L_1-regression, such as the LAD estimator, are developed by Koenker and Bassett (1982). They

6 Under more restrictive assumptions, statistics that exploit other robust functions of y − Xβ_0 (such as ranks, signed ranks, and signs and ranks) can lead to more powerful tests. However, the fact that we allow for both heteroscedasticity and non-linear serial dependence of unknown forms appears to break the required pivotality result and makes the use of such statistics quite difficult, if not impossible, in the context of our set-up. For discussion of such alternative statistics (applicable under stronger assumptions), see Hallin and Puri (1991, 1992), Hallin et al. (2006, 2008), Hallin and Werker (2003) and the references therein.

7 The power function of the locally most powerful sign-based test has the fastest increase when departing from β_0. In the multiparameter case, the scalar measure required to evaluate that speed is the curvature of the power function. Restricting to unbiased tests, Boldin et al. (1997) introduced different locally most powerful tests corresponding to different definitions of curvature. SB(β_0) maximizes the mean curvature, which is proportional to the trace of the shape; see Dubrovin et al. (1984, ch. 2, pp. 76–86) or Gray (1998, ch. 21, pp. 373–80) for a discussion of various curvature notions.
assume i.i.d. errors and a fixed design matrix. In that set-up, the LM statistic for testing H_0(β_0) : β = β_0 turns out to be the SF(β_0) statistic. The same authors also remarked that this type of statistic is asymptotically nuisance-parameter-free, contrary to LR and Wald-type statistics. The Boldin et al. (1997) local optimality interpretation can be extended to heteroscedastic disturbances. In such a case, the locally optimal test statistic associated with the mean curvature, i.e. the test with the highest power near the null hypothesis according to a trace argument, is of the following form.

PROPOSITION 3.3. In model (2.1), suppose the mediangale Assumption 2.1 holds, and the disturbances are heteroscedastic with conditional densities f_t(·|X), t = 1, 2, ..., which are continuously differentiable around zero and such that f_t(0|X) ≠ 0. Then, the locally optimal sign-based statistic associated with the mean curvature is

SB̃(β_0) = s(y − Xβ_0)' X̃ X̃' s(y − Xβ_0),  (3.7)

where X̃ = diag(f_1(0|X), ..., f_n(0|X)) X.

When the f_t(0|X)'s are unknown, the optimal statistic is not feasible, and the optimal weights must be replaced by approximations, such as weights derived from the normal distribution. Sign-based statistics of the form (3.4) can also be interpreted as GMM statistics that exploit the property that {s_t ⊗ x_t, F_t} is a martingale difference sequence.⁸ However, these are quite unusual GMM statistics: the parameter of interest is not defined by moment conditions in explicit form. It is implicitly defined as the solution of a set of robust estimating equations (involving constrained signs):

Σ_{t=1}^n s(y_t − x_t'β) ⊗ x_t = 0.

For i.i.d. disturbances, Godambe (2001) showed that these estimating functions are optimal among all the linear unbiased (for the median) estimating functions Σ_{t=1}^n a_t(β) s(y_t − x_t'β). For independent heteroscedastic disturbances, the set of optimal estimating equations is Σ_{t=1}^n s(y_t − x_t'β) ⊗ x̃_t = 0. In those cases, X (resp. X̃) can be viewed as providing optimal instruments for the linear model.

We now turn to linearly dependent processes. We propose to use a weighting matrix directly derived from the asymptotic covariance matrix of (1/√n) s(y − Xβ_0) ⊗ X, which we denote J_n(s(y − Xβ_0), X). We consider Ω_n(s(y − Xβ_0), X) = (1/n) Ĵ_n(s(y − Xβ_0), X)^{−1}, where Ĵ_n(s(y − Xβ_0), X) stands for a consistent estimate of J_n(s(y − Xβ_0), X), which can be obtained using kernel estimators; see, for example, Parzen (1957), Newey and West (1987), Andrews (1991) and White (2001). This leads to

D_S(β_0, (1/n) Ĵ_n^{−1}) = (1/n) s(y − Xβ_0)' X Ĵ_n^{−1} X' s(y − Xβ_0).  (3.8)

J_n(s(y − Xβ_0), X) accounts for dependence among signs and explanatory variables. Hence, by using an estimate of its inverse as weighting matrix, we perform a HAC correction. Note that the correction depends on β_0.

8 Concerning power performance, Chernozhukov et al. (2008) also show that the class of GMM sign-based statistics contains a locally asymptotically uniformly most powerful invariant test.
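The HAC-corrected statistic (3.8) can be sketched in a few lines. The following Python/NumPy fragment is an illustrative rendering (not the authors' code): the Bartlett kernel and the fixed bandwidth are our own choices, since the paper leaves the kernel and bandwidth selection open.

```python
import numpy as np

def hac_sign_statistic(y, X, beta0, bandwidth=4):
    """D_S(beta0, (1/n) J_hat^{-1}) of equation (3.8): J_hat is a
    Newey-West (Bartlett-kernel) estimate of the covariance matrix of the
    constrained sign scores g_t = s(y_t - x_t'beta0) x_t."""
    n = X.shape[0]
    g = np.sign(y - X @ beta0)[:, None] * X      # n x p matrix of sign scores
    J = g.T @ g / n                              # lag-0 covariance term
    for lag in range(1, bandwidth + 1):
        w = 1.0 - lag / (bandwidth + 1.0)        # Bartlett weight
        Gamma = g[lag:].T @ g[:-lag] / n         # lag-'lag' autocovariance
        J += w * (Gamma + Gamma.T)
    m = g.sum(axis=0)                            # = X' s(y - X beta0)
    return m @ np.linalg.solve(J, m) / n         # (1/n) m' J_hat^{-1} m

rng = np.random.default_rng(0)
n = 100
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = rng.standard_normal(n)                       # true beta = 0
stat = hac_sign_statistic(y, X, np.zeros(2))     # approx. chi2(2) under H0
```

Because the lag-0 term equals X'X/n (the squared signs are 1), Ĵ_n is positive definite and the statistic is always well defined and non-negative.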
In all cases, H_0(β_0) is rejected when the statistic evaluated at β = β_0 is large: D_S(β_0, Ω_n) > c_n(X, α), where c_n(X, α) is a critical value that depends on the level α. Since we are dealing with pivotal functions, the critical values can be evaluated to any degree of precision by simulation. This is the strategy followed by Chernozhukov et al. (2008), who exploit the same finite-sample property of (θ-)signs in a quantile regression context with conditionally independent observations. However, as the distribution is discrete, a test based on c_n(X, α) may not exactly reach the nominal level. A more elegant solution consists in using the technique of Monte Carlo tests with a randomized tie-breaking procedure, which does not suffer from this shortcoming. Further, we will show later that the Monte Carlo procedure also enables one to build tests with an asymptotically controlled level for general processes when Assumption 2.1 fails to hold.

3.3. Monte Carlo tests

Monte Carlo tests can be viewed as a finite-sample version of the bootstrap. They were introduced by Dwass (1957) (see also Barnard, 1963) and can be adapted to any pivotal statistic whose distribution can be simulated. For a general review and for extensions to the case of a nuisance parameter, the reader is referred to Dufour (2006). In the case of discrete distributions, the method must be adapted to deal with ties; here, we use a randomized tie-breaking procedure for evaluating empirical survival functions (see Dufour, 2006). Let us consider a statistic T, whose conditional distribution given X is discrete and free of nuisance parameters, and a test that rejects the null hypothesis when T ≥ c(α). Let T^(0) be the observed value of T, and T^(1), ..., T^(N) be N independent replicates of T. Each replicate T^(j) is associated with a uniform random variable W^(j) ∼ U(0, 1) to produce the pairs (T^(j), W^(j)); the vector (W^(0), ..., W^(N)) is independent of (T^(0), ..., T^(N)). The pairs (T^(i), W^(i)) are ordered according to

(T^(i), W^(i)) ≥ (T^(j), W^(j)) ⟺ {T^(i) > T^(j) or (T^(i) = T^(j) and W^(i) ≥ W^(j))}.

This leads to the following p-value function:

p̃_N(x) = (N G̃_N(x) + 1)/(N + 1),

where the empirical survival function is

G̃_N(x) = 1 − (1/N) Σ_{i=1}^N s_+(x − T^(i)) + (1/N) Σ_{i=1}^N δ(T^(i) − x) s_+(W^(i) − W^(0)),

with s_+(x) = 1_{[0,∞)}(x) and δ(x) = 1_{{0}}(x). Then

P[p̃_N(T^(0)) ≤ α] = I[α(N + 1)]/(N + 1), for 0 ≤ α ≤ 1,

where I[z] denotes the largest integer less than or equal to z.
The randomized tie-breaking allows one to exactly control the level of the procedure. This may also increase the power of the test.
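The whole Monte Carlo sign test can be sketched compactly. The Python/NumPy fragment below is an illustrative sketch under our own naming: it uses the SF statistic of (3.6), replicates drawn from the exact null distribution of the signs, and the randomized tie-breaking p-value described above.

```python
import numpy as np

def sf_statistic(signs, X):
    """SF = s' X (X'X)^{-1} X' s, the sign analogue of the Fisher statistic."""
    Xs = X.T @ signs
    return Xs @ np.linalg.solve(X.T @ X, Xs)

def mc_sign_test(y, X, beta0, N=999, rng=None):
    """Monte Carlo p-value for H0: beta = beta0 with randomized tie-breaking.
    Replicates come from the exact null distribution: i.i.d. +/-1 signs."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    t_obs = sf_statistic(np.sign(y - X @ beta0), X)
    t_rep = np.array([sf_statistic(rng.choice([-1.0, 1.0], size=n), X)
                      for _ in range(N)])
    w = rng.uniform(size=N + 1)                  # w[0] is paired with t_obs
    G = (np.sum(t_rep > t_obs)
         + np.sum((t_rep == t_obs) & (w[1:] >= w[0]))) / N
    return (N * G + 1) / (N + 1)                 # p-value in (0, 1]

rng = np.random.default_rng(1)
n = 50
X = np.column_stack([np.ones(n), rng.standard_normal(n)])
y = rng.standard_normal(n)                       # H0: beta = 0 is true here
pval = mc_sign_test(y, X, np.zeros(2), rng=2)
# a gross violation of H0 gives a very small p-value
pval_alt = mc_sign_test(y + 100.0, X, np.zeros(2), rng=3)
```

Rejecting when the p-value is ≤ α gives a test whose level is exactly controlled for any n whenever (N + 1)α is an integer (here N = 999, so α = 0.05 works).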
4. REGRESSION SIGN-BASED CONFIDENCE SETS

In this section, we discuss how to use Monte Carlo sign-based joint tests in order to build confidence sets for β with a known level. This can be done as follows. For each value β_0 ∈ R^p, perform the Monte Carlo sign test for H_0(β_0) and keep the associated simulated p-value. The confidence set C_{1−α}(β) that contains any β_0 with p-value higher than α has, by construction, level 1 − α (see Dufour, 2006). From this simultaneous confidence set for β, it is possible,
by projection techniques, to derive confidence intervals for the individual components. More generally, we can obtain conservative confidence sets for any transformation g(β), where g can be any real function, including non-linear ones. Obviously, covering R^p with a continuous grid is not feasible; we instead rely on global optimization search algorithms.

4.1. Confidence sets and conservative confidence intervals

Projection techniques yield finite-sample valid confidence intervals and confidence sets for general functions of the parameter β. For examples of use in different settings and for further discussion, the reader is referred to Dufour (1990, 1997), Abdelkhalek and Dufour (1998), Dufour and Kiviet (1998), Dufour and Jasiak (2001) and Dufour and Taamouti (2005). The basic idea is the following. Suppose a simultaneous confidence set with level 1 − α for β, C_{1−α}(β), is available. Since

β ∈ C_{1−α}(β) ⟹ g(β) ∈ g(C_{1−α}(β)),

we have

P[β ∈ C_{1−α}(β)] ≥ 1 − α ⟹ P[g(β) ∈ g(C_{1−α}(β))] ≥ 1 − α.

Thus, g(C_{1−α}(β)) is a conservative confidence set for g(β). If g(β) is scalar, the interval (in the extended real numbers)

I_g[C_{1−α}(β)] = [ inf_{β∈C_{1−α}(β)} g(β), sup_{β∈C_{1−α}(β)} g(β) ]

has level 1 − α:

P[ inf_{β∈C_{1−α}(β)} g(β) ≤ g(β) ≤ sup_{β∈C_{1−α}(β)} g(β) ] ≥ 1 − α.
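The projection step itself is elementary once the simultaneous confidence set is represented by a collection of accepted parameter values. The schematic Python fragment below is our own sketch: a crude grid scan stands in for the global optimization used in the paper, and the toy p-value function is purely illustrative.

```python
import numpy as np

def projection_ci(pvalue, grid, k, alpha=0.05):
    """Conservative (1 - alpha) CI for beta_k: scan candidate points, keep
    those with p-value > alpha, and project the accepted set on coordinate k."""
    accepted = np.array([b for b in grid if pvalue(b) > alpha])
    if accepted.size == 0:
        return None                              # empty confidence set
    return accepted[:, k].min(), accepted[:, k].max()

# Toy p-value with a quadratic acceptance region centred at (0.3, -0.1)
centre = np.array([0.3, -0.1])
pval = lambda b: max(0.0, 1.0 - np.sum((b - centre) ** 2))
grid = [np.array([b0, b1]) for b0 in np.linspace(-2, 2, 41)
                           for b1 in np.linspace(-2, 2, 41)]
lo, hi = projection_ci(pval, grid, k=0, alpha=0.05)   # CI for the first component
```

In practice the grid scan is replaced by the min/max optimization problems of the next paragraph, since a fine grid over R^p quickly becomes infeasible as p grows.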
Hence, to obtain valid conservative confidence intervals for the individual component β_k in model (2.1) under the mediangale Assumption 2.1, it is sufficient to solve the following numerical optimization problems, where s.c. stands for 'subject to the constraint':

min_{β∈R^p} β_k  s.c. p̃_N(D_S(β)) ≥ α,    max_{β∈R^p} β_k  s.c. p̃_N(D_S(β)) ≥ α,

where p̃_N is computed using N replicates D_S^(j) of the statistic D_S under the null hypothesis. In practice, we use simulated annealing as the optimization algorithm (see Goffe et al., 1994; Press et al., 1996).⁹ In the case of multiple tests, projection techniques allow one to perform tests on an arbitrary number of hypotheses without ever losing control of the overall level: the probability of rejecting at least one true null hypothesis does not exceed the specified level α.

4.2. Numerical illustration

This part reports a numerical illustration. We generate the following normal mixture process for n = 50:

y_t = β_0 + β_1 x_t + u_t,  t = 1, ..., n,  u_t ∼ i.i.d. N[0, 1] with probability 0.95 and N[0, 100²] with probability 0.05.

We conduct an exact inference procedure with N = 999 replicates. The true process is generated with β_0 = β_1 = 0. We perform tests of H_0(β*) : β = β* on a grid for β* = (β*_0, β*_1) and retain the associated simulated p-values. As β is a two-vector, we can provide a graphical illustration: to each value of the vector β is associated the corresponding simulated p-value. Confidence

9 See Chernozhukov et al. (2008) for the use of other MCMC algorithms.
Figure 1. Confidence regions provided by SF-based inference. [Figure: nested confidence regions at levels 75%, 90%, 95% and 98% in the plane of the two components of β.]

Table 1. Confidence intervals.

             OLS             White           SF
β_0  95% CI  [−4.57, 0.82]   [−4.47, 0.72]   [−0.54, 0.23]
     98% CI  [−5.10, 1.35]   [−4.98, 1.23]   [−0.64, 0.26]
β_1  95% CI  [−2.50, 3.22]   [−1.34, 2.06]   [−0.42, 0.59]
     98% CI  [−3.07, 3.78]   [−1.67, 2.39]   [−0.57, 0.64]
region with level 1 − α contains all the values of β with p-values greater than α. Confidence intervals are obtained by projecting the simultaneous confidence region on the axis of β_0 or β_1; see Figure 1 and Table 1. The confidence regions so obtained increase with the level and cover the confidence regions with smaller levels. They are highly non-elliptical and may thus lead to different results from asymptotic inference. Concerning confidence intervals, the sign-based ones appear considerably more robust than the OLS and White confidence intervals and are less sensitive to outliers.
5. ASYMPTOTIC THEORY

This section is dedicated to asymptotic results. We point out that the mediangale Assumption 2.1 excludes some common processes on which usual asymptotic inference can still be conducted. We therefore relax Assumption 2.1 to allow a random X that may not be independent of u, and we show that the finite-sample sign-based inference remains asymptotically valid: for a fixed number of replicates, when the number of observations goes to infinity, the level of a test tends to the
nominal level. Besides, we stress the ability of our methods to cover heavy-tailed distributions, including disturbances with infinite variance.

5.1. Asymptotic distributions of test statistics

In this part, we derive asymptotic distributions of the sign-based statistics. We show that the HAC-corrected version of the sign-based statistic, D_S(β_0, (1/n)Ĵ_n^{−1}) in (3.8), allows one to obtain an asymptotically pivotal function. The set of assumptions we make to stabilize the asymptotic behaviour will be needed for further asymptotic results. We consider the linear model (2.1), with the following assumptions.

ASSUMPTION 5.1. (Mixing). {(x_t, u_t) : t = 1, 2, ...} is α-mixing of size −r/(r − 2), r > 2.¹⁰

ASSUMPTION 5.2. (Moment condition). E[s(u_t) x_t] = 0, t = 1, ..., n, ∀n ∈ N.

ASSUMPTION 5.3. (Boundedness). x_t = (x_1t, ..., x_pt)' and E[|x_ht|^r] < Δ < ∞, h = 1, ..., p, t = 1, ..., n, ∀n ∈ N.

ASSUMPTION 5.4. (Non-singularity). J_n = var[(1/√n) Σ_{t=1}^n s(u_t) x_t] is uniformly positive definite.

ASSUMPTION 5.5. (Consistent estimator of J_n). Ω_n(β_0) is symmetric positive definite uniformly over n and Ω_n − (1/n) J_n^{−1} → 0 in probability.

We can now give the following result on the asymptotic distribution of D_S(β_0, Ω_n) under H_0(β_0).

THEOREM 5.1. (Asymptotic distribution of sign-based statistics). In model (2.1), with Assumptions 5.1–5.5, we have, under H_0(β_0), D_S(β_0, Ω_n) → χ²(p).

In particular, when the mediangale condition holds, J_n reduces to E(X'X/n), and (X'X/n)^{−1} is a consistent estimator of J_n^{−1}. This yields the following corollary.

COROLLARY 5.1. In model (2.1), suppose the mediangale Assumption 2.1 and the boundedness Assumption 5.3 are fulfilled. If X'X/n is positive definite uniformly over n and converges in probability to a positive definite matrix, then, under H_0(β_0), SF(β_0) → χ²(p).

5.2. Asymptotic validity of Monte Carlo tests

We first state some general results on the asymptotic validity of Monte Carlo-based inference methods. Then, we apply these results to sign-based inference methods.

5.2.1. Generalities. Let us consider a parametric or semi-parametric model {M_β, β ∈ Θ}. Let S_n(β_0) be a test statistic for H_0(β_0), and let c_n be the rate of convergence. Under H_0(β_0), the distribution function of c_n S_n(β_0) is denoted by F_n(x), and we suppose that F_n(x) converges almost everywhere to a distribution function F(x); G(x) and G_n(x) are the corresponding survival functions. In Theorem 5.2, we show that if a sequence of conditional survival functions G̃_n(x|X_n(ω)) given X(ω) satisfies G̃_n(x|X_n(ω)) → G(x) with probability one, where G does not

10 See White (2001) for a definition of α-mixing.
depend on the realization X(ω), then G̃_n(x|X_n(ω)) can be used as an approximation of G_n(x): it can be seen as a pseudo survival function of c_n S_n(β_0).

THEOREM 5.2. (Generic asymptotic validity). Let S_n(β_0) be a test statistic for testing H_0(β_0) : β = β_0 against H_1(β_0) : β ≠ β_0 in model (2.1). Suppose that, under H_0(β_0),

P[c_n S_n(β_0) ≥ x | X_n] = G_n(x|X_n) = 1 − F_n(x|X_n) → G(x) a.e. as n → ∞,

where {c_n} is a sequence of positive constants, and suppose that G̃_n(x|X_n(ω)) is a sequence of survival functions such that G̃_n(x|X_n(ω)) → G(x) with probability one as n → ∞. Then

lim_{n→∞} P[G̃_n(c_n S_n(β_0), X_n(ω)) ≤ α] ≤ α.  (5.1)

This theorem can also be stated in a Monte Carlo version. Following Dufour (2006), we use empirical survival functions and empirical p-values adapted to discrete statistics in a randomized way, but the replicates are not drawn from the same distribution as the observed statistic. However, both distribution functions, respectively F_n and F̃_n, converge to the same limit F. Let U(N + 1) = (U^(0), U^(1), ..., U^(N)) be a vector of N + 1 i.i.d. real variables drawn from a U(0, 1) distribution, S_n^(0) the observed statistic, and S_n(N) = (S_n^(1), ..., S_n^(N)) a vector of N independent replicates drawn from F̃_n. Then, the randomized pseudo empirical survival function under H_0(β_0) is

G̃_n^(N)(x, n, S_n^(0), S_n(N), U(N + 1)) = 1 − (1/N) Σ_{j=1}^N s_+(x − c_n S_n^(j)) + (1/N) Σ_{j=1}^N δ(c_n S_n^(j) − x) s_+(U^(j) − U^(0)).

G̃_n^(N)(x, n, S_n^(0), S_n(N), U(N + 1)) is in a sense an approximation of G̃_n(x); thus, it depends on both the number of replicates, N, and the number of observations, n. The randomized pseudo empirical p-value function is defined as

p̃_n^(N)(x) = (N G̃_n^(N)(x) + 1)/(N + 1).  (5.2)

We can now state the Monte Carlo-based version of Theorem 5.2.

THEOREM 5.3. (Monte Carlo test asymptotic validity). Let S_n(β_0) be a test statistic for testing H_0(β_0) : β = β_0 against H_1(β_0) : β ≠ β_0 in model (2.1) and S_n^(0) its observed value. Suppose that, under H_0(β_0),

P[c_n S_n(β_0) ≥ x | X_n] = G_n(x|X_n) = 1 − F_n(x|X_n) → G(x) a.e. as n → ∞,

where {c_n} is a sequence of positive constants. Let S̃_n be a random variable with conditional survival function G̃_n(x|X_n), such that

P[c_n S̃_n ≥ x | X_n] = G̃_n(x|X_n) = 1 − F̃_n(x|X_n) → G(x) a.e. as n → ∞,
and (S_n^(1), ..., S_n^(N)) be a vector of N independent replicates of S̃_n, where (N + 1)α is an integer. Then, the randomized version of the Monte Carlo test with level α is asymptotically valid, i.e. lim_{n→∞} P[p̃_n^(N)(β_0) ≤ α] ≤ α.
These results can be applied to the sign-based inference method. However, Theorems 5.2 and 5.3 are much more general: they do not exclusively rely on asymptotic normality (the limiting distribution may differ from a Gaussian one), and the rate of convergence may differ from √n.

5.2.2. Asymptotic validity of sign-based inference. In model (2.1), suppose that Assumptions 5.1–5.5 hold and consider the testing problem H_0(β_0) : β = β_0 against H_1(β_0) : β ≠ β_0. Let D_S(β, Ĵ_n^{−1}) be the test statistic as defined in (3.8), and observe SF^(0) = D_S(β_0, Ĵ_n^{−1}). Draw N independent replicates of the sign vector, each one having n independent components drawn from a B(1, 0.5) distribution. Compute (SF^(1), SF^(2), ..., SF^(N)), the N pseudo replicates of D_S(β_0, (X'X)^{−1}) under H_0(β_0); we call them 'pseudo' replicates because they are drawn as if the observations were independent. Draw N + 1 independent replicates (W^(0), ..., W^(N)) from a U(0, 1) distribution and form the pairs (SF^(j), W^(j)). Compute p̃_n^(N)(β_0) using (5.2). From Theorem 5.3, the confidence region {β ∈ R^p | p̃_n^(N)(β) ≥ α} is asymptotically conservative with level at least 1 − α, and H_0(β_0) is rejected when p̃_n^(N)(β_0) ≤ α.

Contrary to usual asymptotic tests, this method requires neither the existence of moments nor a density for the process {u_t : t = 1, 2, ...}. Usual Wald-type inference is based on the asymptotic behaviour of estimators and, consequently, is more restrictive: more moment existence restrictions are needed (see Weiss, 1991, and Fitzenberger, 1997b). Besides, the asymptotic variance of the LAD estimator involves the conditional density at zero of the disturbance process {u_t : t = 1, 2, ...} as an unknown nuisance parameter. The approximation and estimation of asymptotic covariance matrices constitute a major issue in asymptotic inference and usually require kernel methods. We get around those problems by adopting the finite-sample sign-based procedure.
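Because the null distribution of the sign statistic does not depend on the error law at all, the same simulated critical value is valid under Gaussian or Cauchy errors alike. The following quick numerical check is our own illustrative sketch, not an experiment from the paper:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 50
X = np.column_stack([np.ones(n), rng.standard_normal(n)])

def sf(signs):
    """SF statistic of (3.6) for a given sign vector."""
    Xs = X.T @ signs
    return Xs @ np.linalg.solve(X.T @ X, Xs)

# Null distribution of SF, simulated once from i.i.d. +/-1 signs
null = np.sort([sf(rng.choice([-1.0, 1.0], size=n)) for _ in range(999)])
crit = null[949]                                 # approx. 95% quantile

# Under H0: beta = 0, the signs have this distribution whatever the error law,
# so the rejection rate stays near 5% even with Cauchy (no-moment) errors
rej = np.mean([sf(np.sign(rng.standard_cauchy(n))) > crit for _ in range(500)])
```

No moment or density condition on the errors is used anywhere: only the sign of each residual enters the computation.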
6. SIMULATION STUDY

In this section, we study the performance of sign-based methods compared with usual asymptotic tests based on OLS or LAD estimators, with different approximations for their asymptotic covariance matrices. We consider the sign-based statistics D_S(β, (X'X)^{−1}) and D_S(β, Ĵ_n^{−1}), the latter when a correction is needed for linear serial dependence. We consider a set of general DGPs that illustrate different classical problems one may encounter in practice; they are presented in Table 2. First, we investigate the performance of tests, then of confidence sets. We use the following linear regression model:

y_t = x_t'β_0 + u_t,  t = 1, ..., n,  (6.1)

where x_t = (1, x_2,t, x_3,t)' and β_0 are 3 × 1 vectors, and n denotes the sample size. For the first six cases, {u_t : t = 1, 2, ...} is i.i.d. or depends on the explanatory variables and its own past values in a multiplicative heteroscedastic way: u_t = h(x_t, u_{t−1}, ..., u_1) ε_t, t = 1, ..., n. In those cases, the error term constitutes a strict conditional mediangale given X (see Assumption 2.1), and the levels of sign-based tests and confidence sets are perfectly controlled. Case C1 presents i.i.d. normal observations without conditional heteroscedasticity. Case C2 involves outliers in the error term; this can be seen as an example of measurement error in the observed
Table 2. Simulated models.

C1: Normal HOM: (x_2,t, x_3,t, u_t)' ∼ i.i.d. N(0, I_3), t = 1, ..., n.

C2: Outlier: (x_2,t, x_3,t)' ∼ i.i.d. N(0, I_2); u_t ∼ i.i.d. N[0, 1] with probability 0.95 and N[0, 1000²] with probability 0.05; x_t, u_t independent, t = 1, ..., n.

C3: Stationary GARCH(1,1): (x_2,t, x_3,t)' ∼ i.i.d. N(0, I_2); u_t = σ_t ε_t with σ_t² = 0.666 u²_{t−1} + 0.333 σ²_{t−1}, where ε_t ∼ i.i.d. N(0, 1); x_t, ε_t independent, t = 1, ..., n.

C4: Stochastic volatility: (x_2,t, x_3,t)' ∼ i.i.d. N(0, I_2); u_t = exp(w_t/2) ε_t with w_t = 0.5 w_{t−1} + v_t, where ε_t ∼ i.i.d. N(0, 1), v_t ∼ i.i.d. χ²(3); x_t, u_t independent, t = 1, ..., n.

C5: Unbalanced design matrix + HET. disturbances: x_2,t ∼ i.i.d. N(0, 1), x_3,t ∼ i.i.d. χ²(1); u_t = x_3,t ε_t, ε_t ∼ i.i.d. N(0, 1); x_t, ε_t independent, t = 1, ..., n.

C6: Cauchy disturbances: (x_2,t, x_3,t)' ∼ i.i.d. N(0, I_2); u_t ∼ i.i.d. C; x_t, u_t independent, t = 1, ..., n.

C7: AR(1)-HET, ρ_u = 0.5, ρ_x = 0.5: x_j,t = ρ_x x_j,t−1 + ν_t^j, j = 2, 3; u_t = min{3, max[0.21, |x_2,t|]} × ũ_t, ũ_t = ρ_u ũ_{t−1} + ν_t^u, with (ν_t^2, ν_t^3, ν_t^u)' ∼ i.i.d. N(0, I_3), t = 2, ..., n; ν_1^2, ν_1^3 and ν_1^u chosen to ensure stationarity.

C8: Exponential variance: (x_2,t, x_3,t, ε_t)' ∼ i.i.d. N(0, I_3), u_t = exp(0.2t) ε_t.
y. Cases C3 and C4 involve other non-linear dependence schemes, with stationary GARCH and stochastic volatility disturbances. Case C5 combines a very unbalanced design matrix (on which the LAD estimator performs poorly) with highly conditionally heteroscedastic disturbances. Case C6 is an example of heavy-tailed errors (Cauchy). Next, we study the behaviour of sign-based inference (involving a HAC correction) when inference is only asymptotically valid. Case C7 illustrates the behaviour of sign-based inference when the error term involves linear dependence at a mild level (see the discussion paper for results at other levels of linear dependence, and Fitzenberger, 1997b, for a study of LAD block bootstrap performance on such DGPs). In that case, x_t and u_t are such that E(u_t x_t) = 0 and E[s(u_t) x_t] = 0 for all t. Finally, case C8 involves disturbances that are not second-order stationary (exponential variance) but for which the mediangale assumption holds. As noted previously, sign-based inference does not require stationarity assumptions, in contrast with tests derived from CLTs. In each case, the design matrix is simulated once; hence, results are conditional. More simulation results on other types of DGPs can be found in the discussion paper (Coudin and Dufour, 2007).
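A miniature version of the size experiment for case C2 can be sketched as follows. This is our own scaled-down Python/NumPy illustration (200 simulated samples and 199 sign replicates, not the settings of the paper), using a simple non-randomized Monte Carlo p-value:

```python
import numpy as np

rng = np.random.default_rng(5)

def sf(signs, X):
    """SF statistic of (3.6) for a given sign vector."""
    Xs = X.T @ signs
    return Xs @ np.linalg.solve(X.T @ X, Xs)

def empirical_size_c2(n=50, n_sim=200, n_rep=199, alpha=0.05):
    """Rejection frequency of the true H0: beta = 0 under case C2, where 5%
    of the errors are N(0, 1000^2) outliers."""
    rej = 0
    for _ in range(n_sim):
        X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
        scale = np.where(rng.uniform(size=n) < 0.05, 1000.0, 1.0)
        y = scale * rng.standard_normal(n)               # true beta = 0
        obs = sf(np.sign(y), X)
        reps = np.array([sf(rng.choice([-1.0, 1.0], size=n), X)
                         for _ in range(n_rep)])
        pval = (1 + np.sum(reps >= obs)) / (n_rep + 1)
        rej += pval <= alpha
    return rej / n_sim

size_c2 = empirical_size_c2()       # should be close to the nominal 0.05
```

The outliers leave the rejection frequency essentially untouched because only the signs of the errors enter the test, and the signs are unaffected by the error magnitudes.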
Table 3. Linear regression under mediangale errors: empirical sizes of conditional tests for H_0 : β = (1, 2, 3)'. y_t = x_t'β + u_t, t = 1, ..., 50.

[Table 3 reports empirical sizes of sign-based tests (SIGN: SF, SHAC), LAD-based tests (OS, DMB, MBB, BT) and OLS-based tests (LR, IID, WH, BT) for three groups of designs: stationary models with mediangale errors (C1–C6), non-stationary models with mediangale errors (C8) and stationary models with serial dependence (C7). The sign-based tests have empirical sizes close to the 5% nominal level in all mediangale cases (roughly 0.04–0.06), while the LAD- and OLS-based tests can be strongly distorted, with empirical sizes as large as 0.687 in case C5.]
Notes: ∗ Sizes using asymptotic critical values based on χ²(3). ∗∗ Automatic bandwidth parameters are restricted to be

Proof of Proposition 3.1: Under the mediangale Assumption 2.1,

P[u_t > 0 | X] = E(P[u_t > 0 | u_{t−1}, ..., u_1, X]) = 1/2,
P[u_t > 0 | s_{t−1}, ..., s_1, X] = P[u_t > 0 | u_{t−1}, ..., u_1, X] = 1/2, ∀t ≥ 2.

Further, the joint density of (s_1, s_2, ..., s_n) can be written

l(s_1, s_2, ..., s_n | X) = ∏_{t=1}^n l(s_t | s_{t−1}, ..., s_1, X)
  = ∏_{t=1}^n P[u_t > 0 | u_{t−1}, ..., u_1, X]^{(1+s_t)/2} {1 − P[u_t > 0 | u_{t−1}, ..., u_1, X]}^{(1−s_t)/2}
  = ∏_{t=1}^n (1/2)^{(1+s_t)/2} (1/2)^{(1−s_t)/2} = (1/2)^n.

Hence, conditional on X, s_1, s_2, ..., s_n are i.i.d. B(1/2).
E. Coudin and J.-M. Dufour
Proof of Proposition 3.2: Consider model (2.1) with {u_t : t = 1, 2, ...} satisfying a weak mediangale conditional on X. Let us show that s̃(u_1), s̃(u_2), ..., s̃(u_n) can play the same role in Proposition 3.1 as s(u_1), s(u_2), ..., s(u_n) under Assumption 2.1. The randomized signs are defined by s̃(u_t, V_t) = s(u_t) + [1 − s(u_t)²] s(V_t − 0.5); hence
\[
P[\tilde{s}(u_t, V_t) = 1 \mid u_{t-1}, \ldots, u_1, X] = P\big[s(u_t) + (1 - s(u_t)^2)\, s(V_t - 0.5) = 1 \mid u_{t-1}, \ldots, u_1, X\big].
\]
As (V_1, ..., V_n) is independent of (u_1, ..., u_n) and V_t ~ U(0, 1), it follows that
\[
P[\tilde{s}(u_t, V_t) = 1 \mid u_{t-1}, \ldots, u_1, X] = P[u_t > 0 \mid u_{t-1}, \ldots, u_1, X] + \tfrac{1}{2} P[u_t = 0 \mid u_{t-1}, \ldots, u_1, X]. \quad (A.1)
\]
The weak conditional mediangale assumption given X entails
\[
P[u_t > 0 \mid u_{t-1}, \ldots, u_1, X] = P[u_t < 0 \mid u_{t-1}, \ldots, u_1, X] = \frac{1 - p_t}{2}, \quad (A.2)
\]
where p_t = P[u_t = 0 | u_{t−1}, ..., u_1, X]. Substituting (A.2) into (A.1) yields
\[
P[\tilde{s}(u_t, V_t) = 1 \mid u_{t-1}, \ldots, u_1, X] = \frac{1 - p_t}{2} + \frac{p_t}{2} = \frac{1}{2}. \quad (A.3)
\]
In a similar way,
\[
P[\tilde{s}(u_t, V_t) = -1 \mid u_{t-1}, \ldots, u_1, X] = \frac{1}{2}. \quad (A.4)
\]
The rest is similar to the proof of Proposition 3.1.
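The randomized signs are straightforward to implement. The sketch below (the discrete error distribution, with mass 0.3 at zero, is a made-up illustration) checks numerically that P[s̃(u_t, V_t) = 1] ≈ 1/2 even when the errors have a point mass at zero:

```python
import numpy as np

rng = np.random.default_rng(1)

def randomized_sign(u, v):
    # s~(u, V) = s(u) + [1 - s(u)^2] * s(V - 0.5): zeros are replaced by
    # +1 or -1 with probability 1/2 each, using V ~ U(0, 1)
    s = np.sign(u)
    return s + (1 - s**2) * np.sign(v - 0.5)

n = 200_000
# Illustrative discrete errors: P(u = 0) = 0.3, P(u = -1) = P(u = 1) = 0.35
u = rng.choice([-1.0, 0.0, 1.0], size=n, p=[0.35, 0.3, 0.35])
v = rng.uniform(size=n)
s_tilde = randomized_sign(u, v)

print(round(np.mean(s_tilde == 1), 3))  # close to 0.5, as in (A.3)
```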
Proof of Proposition 3.3: Consider first the case of a single explanatory variable (p = 1), which contains the basic idea of the proof; the case p > 1 is an adaptation of the same ideas to multidimensional notions. Under model (2.1) with the mediangale Assumption 2.1, the locally optimal sign-based test (conditional on X) of H_0(β): β = 0 against H_1(β): β ≠ 0 is well defined. Among tests with level α, the power function of the locally optimal sign-based test has the highest slope around zero. The power function of a sign-based test conditional on X can be written P_β[s(y) ∈ W_α | X], where W_α is the critical region with level α. Hence, we should include in W_α the sign vectors for which (d/dβ) P_β[S(y) = s | X] |_{β=0} is as large as possible. An easy way to determine that derivative is to identify the terms of a Taylor expansion around zero. Under Assumption 2.1, we have
\[
P_{\beta}[S(y) = s \mid X] = \prod_{i=1}^{n} [P_{\beta}(y_i > 0 \mid X)]^{(1+s_i)/2}\,[P_{\beta}(y_i < 0 \mid X)]^{(1-s_i)/2} \quad (A.5)
\]
\[
= \prod_{i=1}^{n} [1 - F_i(-x_i \beta \mid X)]^{(1+s_i)/2}\,[F_i(-x_i \beta \mid X)]^{(1-s_i)/2}. \quad (A.6)
\]
Assuming that continuous densities at zero exist, a first-order Taylor expansion entails
\[
P_{\beta}[S(y) = s \mid X] = \frac{1}{2^n} \prod_{i=1}^{n} \big[1 + 2 f_i(0 \mid X)\, x_i s_i \beta + o(\beta)\big] \quad (A.7)
\]
\[
= \frac{1}{2^n} \Big[1 + 2 \sum_{i=1}^{n} f_i(0 \mid X)\, x_i s_i \beta + o(\beta)\Big]. \quad (A.8)
\]
All other terms of the product decomposition are negligible or of order o(β). That allows us to identify the derivative at β = 0:
\[
\frac{d}{d\beta} P_{\beta}[S(y) = s \mid X]\Big|_{\beta=0} = 2^{-n+1} \sum_{i=1}^{n} f_i(0 \mid X)\, x_i s_i . \quad (A.9)
\]
Therefore, the required test has the form
\[
W_{\alpha} = \Big\{ s = (s_1, \ldots, s_n) : \sum_{i=1}^{n} f_i(0 \mid X)\, x_i s_i > c_{\alpha} \Big\}, \quad (A.10)
\]
or, equivalently, W_α = {s : s(y)′X̃X̃′s(y) > c*_α}, where c_α and c*_α are determined by the significance level. When the disturbances have a common conditional density at zero, f(0|X), we recover the results of Boldin et al. (1997): the locally optimal sign-based test is given by W_α = {s : s(y)′XX′s(y) > c_α}, and the statistic does not depend on the conditional density evaluated at zero. When p > 1, we need an extension of the notion of slope around zero to a multidimensional parameter. Boldin et al. (1997) propose to restrict attention to the class of locally unbiased tests with given level α and to consider the maximal mean curvature. A locally unbiased sign-based test satisfies (d P_β(W_α)/dβ)|_{β=0} = 0 and, as the first-order term vanishes, the behaviour of the power function around zero is characterized by the quadratic term of its Taylor expansion:
\[
\frac{1}{2}\, \beta' \frac{d^2 P_{\beta}(W_{\alpha})}{d\beta\, d\beta'} \beta = 2^{-n+1} \sum_{1 \le i \ne j \le n} [f_i(0 \mid X)\, s_i\, \beta' x_i]\,[f_j(0 \mid X)\, s_j\, x_j' \beta]. \quad (A.11)
\]
The locally most powerful sign-based test in the sense of the mean curvature maximizes the mean curvature, which is, by definition, proportional to the trace of (d²P_β(W_α)/dβ dβ′)|_{β=0}; see Boldin et al. (1997, p. 41), Dubrovin et al. (1984, ch. 2, pp. 76–86) or Gray (1998, ch. 21, pp. 373–80). Taking the trace in expression (A.11), we find (after some computations) that
\[
\mathrm{tr}\, \frac{d^2 P_{\beta}(W_{\alpha})}{d\beta\, d\beta'}\Big|_{\beta=0} \propto \sum_{k=1}^{p} \sum_{1 \le i \ne j \le n} f_i(0 \mid X)\, f_j(0 \mid X)\, s_i s_j\, x_{ik} x_{jk} . \quad (A.12)
\]
By adding to (A.12) the quantity ∑_{k=1}^{p} ∑_{i=1}^{n} x_{ik}² f_i(0|X)², which does not depend on s, we find
\[
\sum_{k=1}^{p} \Big( \sum_{i=1}^{n} x_{ik}\, f_i(0 \mid X)\, s_i \Big)^{2} = s(y)'\tilde{X}\tilde{X}' s(y). \quad (A.13)
\]
Hence, the locally optimal sign-based test, in the sense developed by Boldin et al. (1997) for heteroskedastic signs, is W_α = {s : s(y)′X̃X̃′s(y) > c_α}. Another quadratic test statistic, convenient for large-sample evaluation, is obtained by standardizing by X̃′X̃: W_α = {s : s(y)′X̃(X̃′X̃)^{−1}X̃′s(y) > c_α}.

Proof of Theorem 5.1: This proof follows the usual steps of an asymptotic normality result for mixing processes (see White, 2001). Consider model (2.1). In the following, s_t stands for s(u_t). Under Assumption 5.4, V_n^{−1/2} exists for any n. Set Z_{nt} = λ′V_n^{−1/2} x_t s(u_t) for some λ ∈ R^p such that λ′λ = 1. The mixing property 5.1 of (x_t, u_t) is transmitted to Z_{nt}; see White (2001, Theorem 3.49). Hence, Z_{nt} = λ′V_n^{−1/2}[s(u_t) ⊗ x_t] is α-mixing of size −r/(r − 2), r > 2. Assumptions 5.2 and 5.3 imply
\[
E[\lambda' V_n^{-1/2} x_t s(u_t)] = 0, \quad t = 1, \ldots, n, \ \forall n \in \mathbb{N}, \quad (A.14)
\]
\[
E\big|\lambda' V_n^{-1/2} x_t s(u_t)\big|^{r} < \Delta < \infty, \quad t = 1, \ldots, n, \ \forall n \in \mathbb{N}. \quad (A.15)
\]
Note also that
\[
\mathrm{Var}\Big[\frac{1}{\sqrt{n}} \sum_{t=1}^{n} Z_{nt}\Big]
= \mathrm{Var}\Big[\frac{1}{\sqrt{n}} \sum_{t=1}^{n} \lambda' V_n^{-1/2}\, s(u_t) \otimes x_t\Big]
= \lambda' V_n^{-1/2} V_n V_n^{-1/2} \lambda = 1. \quad (A.16)
\]
The mixing property of Z_{nt} and equations (A.14)–(A.16) allow one to apply a central limit theorem (see White, 2001, Theorem 5.20), which yields
\[
\frac{1}{\sqrt{n}} \sum_{t=1}^{n} \lambda' V_n^{-1/2}\, s(u_t) \otimes x_t \to N(0, 1). \quad (A.17)
\]
Since λ is arbitrary with λ′λ = 1, the Cramér–Wold device entails
\[
V_n^{-1/2}\, n^{-1/2} \sum_{t=1}^{n} s(u_t) \otimes x_t \to N(0, I_p). \quad (A.18)
\]
Finally, Assumption 5.5 states that Ω̂_n is a consistent estimate of V_n^{−1}. Hence,
\[
n^{-1/2}\, \hat{\Omega}_n^{1/2} \sum_{t=1}^{n} s(u_t) \otimes x_t \to N(0, I_p), \quad (A.19)
\]
and n^{−1} s(y − Xβ_0)′ X Ω̂_n X′ s(y − Xβ_0) → χ²(p).

Proof of Corollary 5.1: Let F_t = σ(y_0, ..., y_t, x_0, ..., x_t). When the mediangale Assumption 2.1 holds, {s(u_t) ⊗ x_t, F_t : t = 1, ..., n} is a martingale difference sequence with respect to F_t. Hence, V_n = Var[n^{−1/2} s ⊗ X] = n^{−1} ∑_{t=1}^{n} E(x_t s_t s_t′ x_t′) = n^{−1} ∑_{t=1}^{n} E(x_t x_t′) = n^{−1} E(X′X), and X′X/n is a consistent estimate of E(X′X/n). Theorem 5.1 yields SF(β_0) → χ²(p).

In order to prove Theorem 5.2, we will use the following lemma on the uniform convergence of distribution functions (see Chow and Teicher, 1988, sec. 8.2, p. 265).

LEMMA 8.1. Let (F_n)_{n∈N} and F be right-continuous distribution functions. Suppose that F_n(x) → F(x) for all x ∈ R. Then sup_{−∞<x<∞} |F_n(x) − F(x)| → 0.

[... text missing in extraction: the remainder of this appendix and the opening of the next paper, "Copula-based nonlinear quantile autoregression", were lost. The surviving text resumes in that paper's discussion of the Joe–Clayton copula ...]

It is known that the lower tail dependence parameter for this family is λ_L = 2^{−1/γ} and the upper tail dependence parameter is λ_U = 2 − 2^{1/k}. When k = 1, the Joe–Clayton copula reduces to the Clayton copula, C(u, v; α) = [u^{−α} + v^{−α} − 1]^{−1/α}, where α = γ > 0. When γ → 0, the Joe–Clayton copula approaches the Joe copula, whose concordance ordering and upper tail dependence increase as k increases. For other properties of the Joe–Clayton copula, see Joe (1997) and Patton (2006). When coupled with heavy-tailed marginal distributions such as the Student's t distribution, this family of copulas can generate time series with clusters of extreme values and hence provide alternative models for economic and financial time series that exhibit such clusters. For the Joe–Clayton copula, with ū = 1 − u, one can verify that
\[
C_1(u_{t-1}, u_t; \alpha) = (1 - u_{t-1})^{k-1}\,\big(1 - \bar{u}_{t-1}^{\,k}\big)^{-(\gamma+1)}
\big[\big(1 - \bar{u}_{t-1}^{\,k}\big)^{-\gamma} + \big(1 - \bar{u}_{t}^{\,k}\big)^{-\gamma} - 1\big]^{-(\gamma^{-1}+1)}
\Big\{1 - \big[\big(1 - \bar{u}_{t-1}^{\,k}\big)^{-\gamma} + \big(1 - \bar{u}_{t}^{\,k}\big)^{-\gamma} - 1\big]^{-1/\gamma}\Big\}^{k^{-1}-1}.
\]
For any τ ∈ [0, 1], solving τ = C_1(u_{t−1}, u_t; α) for u_t, we obtain the τth conditional quantile function of U_t given u_{t−1}. Based on the Clayton copula,
\[
Q_{U_t}(\tau \mid u_{t-1}) = \big[(\tau^{-\alpha/(1+\alpha)} - 1)\, u_{t-1}^{-\alpha} + 1\big]^{-1/\alpha}.
\]
² An elliptical copula is a copula generated from an elliptically symmetric bivariate distribution.
Copula-based nonlinear quantile autoregression
Note that this expression and the similar expressions in the foregoing examples provide a convenient mechanism with which to simulate observations from the respective models. See Bouyé and Salmon (2008) for additional examples of copula-based conditional quantile functions.
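As an illustration of this simulation mechanism, the following sketch (parameter values are illustrative) iterates the Clayton conditional quantile function with i.i.d. uniform draws for τ, generating a Markov chain on (0, 1) with Clayton dependence:

```python
import numpy as np

def clayton_cond_quantile(tau, u_prev, alpha):
    """tau-th conditional quantile of U_t given U_{t-1} = u_prev
    under a Clayton copula with parameter alpha > 0."""
    return ((tau ** (-alpha / (1.0 + alpha)) - 1.0) * u_prev ** (-alpha) + 1.0) ** (-1.0 / alpha)

rng = np.random.default_rng(2)
alpha, n = 2.0, 5000      # illustrative dependence parameter and length
u = np.empty(n)
u[0] = 0.5
for t in range(1, n):
    # draw tau ~ U(0,1) and apply the conditional quantile transform
    u[t] = clayton_cond_quantile(rng.uniform(), u[t - 1], alpha)

print(u.min() > 0.0 and u.max() < 1.0)       # the chain lives on (0, 1)
print(np.corrcoef(u[:-1], u[1:])[0, 1] > 0)  # Clayton dependence is positive
```

Feeding the chain through an inverse marginal c.d.f. (e.g. a Student's t quantile function) would then produce observations Y_t with the desired marginals and copula dependence.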
3. ASYMPTOTIC PROPERTIES

In this section, we study estimation of the copula-based QAR model (2.2). The vector of parameters θ(τ), and thus the conditional quantile of Y_t, can be estimated by the following nonlinear quantile autoregression:
\[
\min_{\theta \in \Theta} \sum_{t} \rho_{\tau}\big(Y_t - H(Y_{t-1}, \theta)\big), \quad (3.1)
\]
where ρ_τ(u) ≡ u(τ − I(u < 0)) is the usual check function (Koenker and Bassett, 1978). We denote the solution as θ̂(τ) ≡ arg min_{θ∈Θ} ∑_t ρ_τ(Y_t − H(Y_{t−1}, θ)). Then the τth conditional quantile of Y_t given Y_{t−1} = x can be estimated by
\[
\hat{Q}_{Y_t}(\tau \mid Y_{t-1} = x) = H\big(x, \hat{\theta}(\tau)\big) \equiv F^{-1}\big(C_1^{-1}(\tau, F(x; \hat{\beta}(\tau)); \hat{\alpha}(\tau)); \hat{\beta}(\tau)\big).
\]

3.1. Consistency

To facilitate our analysis, we define
\[
C_1(u, v; \alpha) \equiv \frac{\partial C(u, v; \alpha)}{\partial u}, \qquad c(u, v; \alpha) \equiv \frac{\partial^2 C(u, v; \alpha)}{\partial u\, \partial v}.
\]
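To fix ideas, here is a minimal sketch of the estimator (3.1) for the Clayton example of Section 2.3, using a crude grid search in place of a proper optimizer; the sample size, grid and true parameter α = 2 are illustrative assumptions:

```python
import numpy as np

def clayton_H(u_prev, tau, a):
    # conditional quantile H(u_prev, a) under a Clayton copula, cf. Section 2.3
    return ((tau ** (-a / (1.0 + a)) - 1.0) * u_prev ** (-a) + 1.0) ** (-1.0 / a)

def check_loss(u, tau):
    # rho_tau(u) = u * (tau - I(u < 0)), the usual check function
    return u * (tau - (u < 0))

rng = np.random.default_rng(3)
alpha_true, tau, n = 2.0, 0.5, 3000

# simulate a Clayton-dependent Markov chain via the conditional quantile transform
y = np.empty(n)
y[0] = 0.5
for t in range(1, n):
    y[t] = clayton_H(y[t - 1], rng.uniform(), alpha_true)

# minimize the sum of check losses over a grid of candidate alpha values
grid = np.linspace(0.2, 5.0, 97)
losses = [np.sum(check_loss(y[1:] - clayton_H(y[:-1], tau, a), tau)) for a in grid]
alpha_hat = grid[int(np.argmin(losses))]
print(alpha_hat)  # consistency suggests a value near the true 2.0
```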
Denote C_1^{−1}(τ, u; α) as the inverse function of C_1(u, v; α) with respect to the argument v, and H(x, θ) ≡ F^{−1}(C_1^{−1}(τ, F(x; β); α); β). We first introduce some simple regularity conditions to ensure consistency of our QAR estimator θ̂(τ).

ASSUMPTION 3.1. The parameter space Θ is a compact subset of R^k.

ASSUMPTION 3.2. (i) F(·; β) and F^{−1}(·; β) (the inverse function of F(·; β)) are continuous with respect to all their arguments; (ii) the copula function C(u, v; α) is second-order differentiable with respect to u and v, and has copula density c(u, v; α); and (iii) C_1^{−1}(τ, u; α) (the inverse function of C_1(u, v; α) with respect to v) is continuous in α and u.

ASSUMPTION 3.3. (i) The true τth conditional quantile of Y_t given Y_{t−1}, Q_{Y_t}(τ|Y_{t−1}), takes the form H(Y_{t−1}, θ(τ)) ≡ F^{−1}(C_1^{−1}(τ, F(Y_{t−1}; β(τ)); α(τ)); β(τ)) for a θ(τ) = (α(τ)′, β(τ)′)′ ∈ Θ for almost all Y_{t−1}; (ii) the true unknown conditional density of Y_t given Y_{t−1}, g*(·|Y_{t−1}), is bounded and continuous, and there exist ε_1 > 0, p > 0 such that Pr[g*(Q_{Y_t}(τ|Y_{t−1})) ≥ ε_1] ≥ p.

ASSUMPTION 3.4. For any ε > 0, there exists a δ > 0 such that, for any ||θ − θ(τ)|| > ε,
\[
E\big[\Pr\big(|H(Y_{t-1}, \theta) - Q_{Y_t}(\tau \mid Y_{t-1})| > \delta \,\big|\, g^{*}(Q_{Y_t}(\tau \mid Y_{t-1})) \ge \epsilon_1\big)\big] > 0.
\]
X. Chen, R. Koenker and Z. Xiao
ASSUMPTION 3.5. (i) E(sup_{θ∈Θ} |H(Y_{t−1}, θ)|) < ∞; (ii) {Y_t} is stationary, ergodic and satisfies Assumption DGP.

Assumptions 3.1–3.4 and 3.5(i) are mild regularity conditions that are typically imposed even for parametric nonlinear quantile regression of Y_t given x_t with i.i.d. data {(Y_t, x_t)}_{t=1}^n. Thus they are natural conditions for our nonlinear Markov model (with x_t = Y_{t−1}). Assumption 3.5(ii) is a very mild condition on the temporal dependence of {Y_t}. Although we do not assume correct specification of the parametric functional forms of the copula C(·; α) and the marginal distribution F(·; β), we assume that the parametric functional form of the conditional quantile H(Y_{t−1}, θ(τ)) is correct at the τth quantile (Assumption 3.3(i)). Hence, we do not need the beta-mixing decay rate condition on {Y_t} that is assumed in Chen and Fan (2006). See Beare (2008) for temporal dependence properties of copula-based strictly stationary Markov processes.

THEOREM 3.1. (Consistency) For any fixed τ ∈ (0, 1), under Assumptions 3.1–3.5, we have θ̂(τ) = θ(τ) + o_p(1).

3.2. Normality

We introduce the following additional notation:
\[
\dot{H}_{\theta}(x; \theta) \equiv \frac{\partial H(x; \theta)}{\partial \theta}, \qquad \ddot{H}_{\theta\theta}(x; \theta) \equiv \frac{\partial^2 H(x; \theta)}{\partial \theta\, \partial \theta'}.
\]
Given the consistency of θ̂(τ), we only need to impose the following additional conditions in a shrinking neighbourhood of θ(τ). Denote Θ_0 = A_0 × B_0 = {θ = (α′, β′)′ ∈ Θ : ||θ − θ(τ)|| = o_p(1)}. We assume:

ASSUMPTION 3.6. (i) Ḣ_θ(Y_{t−1}, θ) and Ḧ_θθ(Y_{t−1}, θ) are well defined and measurable for all θ ∈ Θ_0 and for almost all Y_{t−1}; (ii) E[sup_{θ∈Θ_0} ||Ḣ_θ(Y_{t−1}, θ)||²] < ∞; (iii) E(sup_{θ∈Θ_0} ||Ḧ_θθ(Y_{t−1}, θ)||) < ∞; and (iv) V(τ) and Ω(τ) are finite and non-singular, where
\[
V(\tau) \equiv E\big[g^{*}(Q_{Y_t}(\tau \mid Y_{t-1}))\, \dot{H}_{\theta}(Y_{t-1}, \theta(\tau))\, \dot{H}_{\theta}(Y_{t-1}, \theta(\tau))'\big], \qquad
\Omega(\tau) \equiv E\big[\dot{H}_{\theta}(Y_{t-1}, \theta(\tau))\, \dot{H}_{\theta}(Y_{t-1}, \theta(\tau))'\big]. \quad (3.2)
\]
We impose Assumption 3.6(i)(iii) for simplicity.
We could replace Assumption 3.6(i)(iii) by assuming only that Ḣ_θ(Y_{t−1}, θ) exists for θ ∈ Θ_0 and satisfies milder regularity conditions, such as those imposed in Huber (1967) and Pollard (1985) for i.i.d. data and in Hansen et al. (1995) for stationary ergodic data, without requiring the existence of Ḧ_θθ(Y_{t−1}, θ) satisfying Assumption 3.6(iii). Comparing our Assumptions 3.1–3.6 to the regularity conditions imposed in earlier papers on parametric nonlinear quantile time series models (e.g. Weiss, 1991, White, 1994, Engle and Manganelli, 2004, and the references therein), we do not need any mixing or near-epoch-dependence conditions (see our Assumption 3.5(ii)), and our moment requirements are also much weaker than the existing ones (see our Assumptions 3.5(i) and 3.6(ii)(iii)). Both of these relaxations are important for financial applications, which typically exhibit persistent temporal dependence and heavy-tailed marginals.

Denote f(·; β) as the parametric density of F(·; β), and
\[
h(x, \alpha, \beta) = C_1^{-1}(\tau; u, \alpha)\big|_{u = F(x, \beta)} = C_1^{-1}(\tau; F(x, \beta), \alpha),
\]
with
\[
C_{1u}^{-1}(\tau; u, \alpha) = \frac{\partial C_1^{-1}(\tau; u, \alpha)}{\partial u}, \qquad
\dot{h}_{\alpha}(x, \alpha, \beta) = \frac{\partial h(x, \alpha, \beta)}{\partial \alpha}, \qquad
\dot{F}_{\beta}(x, \beta) = \frac{\partial F(x, \beta)}{\partial \beta}.
\]
Then V(τ) and Ω(τ) defined in (3.2) can be expressed as follows:
\[
V(\tau) = \begin{pmatrix} V_{\alpha\alpha}(\tau) & V_{\alpha\beta}(\tau) \\ V_{\beta\alpha}(\tau) & V_{\beta\beta}(\tau) \end{pmatrix}, \qquad
\Omega(\tau) = \begin{pmatrix} \Omega_{\alpha\alpha}(\tau) & \Omega_{\alpha\beta}(\tau) \\ \Omega_{\beta\alpha}(\tau) & \Omega_{\beta\beta}(\tau) \end{pmatrix}, \quad (3.3)
\]
where, writing Q_t = Q_{Y_t}(τ | Y_{t−1}) for brevity,
\[
V_{\alpha\alpha}(\tau) = E\Big[\frac{g^{*}(Q_t)}{\{f(Q_t)\}^2}\, \dot{h}_{\alpha}(Y_{t-1}; \theta(\tau))\, \dot{h}_{\alpha}(Y_{t-1}; \theta(\tau))'\Big];
\]
\[
V_{\alpha\beta}(\tau) = E\Big[\frac{g^{*}(Q_t)}{f(Q_t)}\, \dot{h}_{\alpha}(Y_{t-1}; \theta(\tau))\, \frac{\partial F^{-1}(h(Y_{t-1}; \theta(\tau)), \beta(\tau))'}{\partial \beta}\Big]
+ E\Big[\frac{g^{*}(Q_t)}{\{f(Q_t)\}^2}\, \dot{h}_{\alpha}(Y_{t-1}; \theta(\tau))\, C_{1u}^{-1}(\tau; F(Y_{t-1}, \beta(\tau)), \alpha(\tau))\, \dot{F}_{\beta}(Y_{t-1}, \beta(\tau))'\Big];
\]
\[
V_{\beta\alpha}(\tau) = V_{\alpha\beta}(\tau)';
\]
\[
V_{\beta\beta}(\tau) = E\Big[g^{*}(Q_t)\, \frac{\partial F^{-1}(h(Y_{t-1}; \theta(\tau)), \beta(\tau))}{\partial \beta}\, \frac{\partial F^{-1}(h(Y_{t-1}; \theta(\tau)), \beta(\tau))'}{\partial \beta}\Big]
+ 2E\Big[\frac{g^{*}(Q_t)}{f(Q_t)}\, \frac{\partial F^{-1}(h(Y_{t-1}; \theta(\tau)), \beta(\tau))}{\partial \beta}\, C_{1u}^{-1}(\tau; F(Y_{t-1}, \beta(\tau)), \alpha(\tau))\, \dot{F}_{\beta}(Y_{t-1}, \beta(\tau))'\Big]
+ E\Big[\frac{g^{*}(Q_t)}{\{f(Q_t)\}^2}\, \{C_{1u}^{-1}(\tau; F(Y_{t-1}, \beta(\tau)), \alpha(\tau))\}^2\, \dot{F}_{\beta}(Y_{t-1}, \beta(\tau))\, \dot{F}_{\beta}(Y_{t-1}, \beta(\tau))'\Big];
\]
\[
\Omega_{\alpha\alpha}(\tau) = E\Big[\frac{1}{\{f(Q_t)\}^2}\, \dot{h}_{\alpha}(Y_{t-1}; \theta(\tau))\, \dot{h}_{\alpha}(Y_{t-1}; \theta(\tau))'\Big];
\]
\[
\Omega_{\alpha\beta}(\tau) = E\Big[\frac{1}{f(Q_t)}\, \dot{h}_{\alpha}(Y_{t-1}; \theta(\tau))\, \frac{\partial F^{-1}(h(Y_{t-1}; \theta(\tau)), \beta(\tau))'}{\partial \beta}\Big]
+ E\Big[\frac{1}{\{f(Q_t)\}^2}\, \dot{h}_{\alpha}(Y_{t-1}; \theta(\tau))\, C_{1u}^{-1}(\tau; F(Y_{t-1}, \beta(\tau)), \alpha(\tau))\, \dot{F}_{\beta}(Y_{t-1}, \beta(\tau))'\Big];
\]
\[
\Omega_{\beta\alpha}(\tau) = \Omega_{\alpha\beta}(\tau)';
\]
\[
\Omega_{\beta\beta}(\tau) = E\Big[\frac{\partial F^{-1}(h(Y_{t-1}; \theta(\tau)), \beta(\tau))}{\partial \beta}\, \frac{\partial F^{-1}(h(Y_{t-1}; \theta(\tau)), \beta(\tau))'}{\partial \beta}\Big]
+ 2E\Big[\frac{1}{f(Q_t)}\, \frac{\partial F^{-1}(h(Y_{t-1}; \theta(\tau)), \beta(\tau))}{\partial \beta}\, C_{1u}^{-1}(\tau; F(Y_{t-1}, \beta(\tau)), \alpha(\tau))\, \dot{F}_{\beta}(Y_{t-1}, \beta(\tau))'\Big]
+ E\Big[\frac{1}{\{f(Q_t)\}^2}\, \{C_{1u}^{-1}(\tau; F(Y_{t-1}, \beta(\tau)), \alpha(\tau))\}^2\, \dot{F}_{\beta}(Y_{t-1}, \beta(\tau))\, \dot{F}_{\beta}(Y_{t-1}, \beta(\tau))'\Big].
\]
THEOREM 3.2. For any fixed τ ∈ (0, 1), under Assumptions 3.1–3.6 and θ(τ) ∈ int(Θ), we have
\[
\sqrt{n}\,\big(\hat{\theta}(\tau) - \theta(\tau)\big) \Rightarrow N\big(0,\ \tau(1-\tau)\, V(\tau)^{-1}\, \Omega(\tau)\, V(\tau)^{-1}\big),
\]
where V(τ) and Ω(τ) are given in (3.2) (or, equivalently, (3.3)).
REMARK 3.1. When the marginal distribution function of Y is completely known, F(y, β) = F(y), V(τ) and Ω(τ) reduce to the following simplified forms:
\[
V(\tau) = E\Big[\frac{g^{*}(Q_{Y_t}(\tau \mid Y_{t-1}))}{\{f(Q_{Y_t}(\tau \mid Y_{t-1}))\}^2}\, \dot{h}_{\alpha}(Y_{t-1}; \alpha(\tau))\, \dot{h}_{\alpha}(Y_{t-1}; \alpha(\tau))'\Big], \qquad
\Omega(\tau) = E\Big[\frac{1}{\{f(Q_{Y_t}(\tau \mid Y_{t-1}))\}^2}\, \dot{h}_{\alpha}(Y_{t-1}; \alpha(\tau))\, \dot{h}_{\alpha}(Y_{t-1}; \alpha(\tau))'\Big].
\]
REMARK 3.2. When both the copula function C*(u, v) = C(u, v; α) and the marginal distribution F*(y) = F(y; β) are correctly specified, the parameters θ(τ) define an explicit one-dimensional manifold in Θ, as illustrated in the examples of Section 2.3. To the extent that the estimated θ̂(τ) departs from this curve, we can infer various forms of misspecification. See, for example, Koenker and Xiao (2002).
4. INFERENCE

The asymptotic normality of the QAR estimate also facilitates inference. In order to standardize the QAR estimator and remove nuisance parameters from the limiting distribution, we need to estimate the asymptotic covariance matrix; in particular, we need to estimate Ω(τ) and V(τ). Let Q̂_{Y_t}(τ | Y_{t−1}) ≡ H(Y_{t−1}, θ̂(τ)), and let f̂ = f(·, β̂) be the plug-in estimate of the parametric marginal density function. Then Ω(τ) can be estimated by
\[
\hat{\Omega}_n(\tau) = \begin{pmatrix} \hat{\Omega}_{n,\alpha\alpha}(\tau) & \hat{\Omega}_{n,\alpha\beta}(\tau) \\ \hat{\Omega}_{n,\beta\alpha}(\tau) & \hat{\Omega}_{n,\beta\beta}(\tau) \end{pmatrix},
\]
with (writing Q̂_t = Q̂_{Y_t}(τ | Y_{t−1}))
\[
\hat{\Omega}_{n,\alpha\alpha}(\tau) = \frac{1}{n} \sum_{t=1}^{n} \frac{1}{\{\hat{f}(\hat{Q}_t)\}^2}\, \dot{h}_{\alpha}(Y_{t-1}; \hat{\theta}(\tau))\, \dot{h}_{\alpha}(Y_{t-1}; \hat{\theta}(\tau))';
\]
\[
\hat{\Omega}_{n,\alpha\beta}(\tau) = \frac{1}{n} \sum_{t=1}^{n} \frac{1}{\hat{f}(\hat{Q}_t)}\, \dot{h}_{\alpha}(Y_{t-1}; \hat{\theta}(\tau))\, \frac{\partial F^{-1}(h(Y_{t-1}; \hat{\theta}(\tau)), \hat{\beta}(\tau))'}{\partial \beta}
+ \frac{1}{n} \sum_{t=1}^{n} \frac{1}{\{\hat{f}(\hat{Q}_t)\}^2}\, \dot{h}_{\alpha}(Y_{t-1}; \hat{\theta}(\tau))\, C_{1u}^{-1}(\tau; F(Y_{t-1}, \hat{\beta}(\tau)), \hat{\alpha}(\tau))\, \dot{F}_{\beta}(Y_{t-1}, \hat{\beta}(\tau))';
\]
\[
\hat{\Omega}_{n,\beta\alpha}(\tau) = \hat{\Omega}_{n,\alpha\beta}(\tau)';
\]
\[
\hat{\Omega}_{n,\beta\beta}(\tau) = \frac{1}{n} \sum_{t=1}^{n} \frac{\partial F^{-1}(h(Y_{t-1}; \hat{\theta}(\tau)), \hat{\beta}(\tau))}{\partial \beta}\, \frac{\partial F^{-1}(h(Y_{t-1}; \hat{\theta}(\tau)), \hat{\beta}(\tau))'}{\partial \beta}
+ \frac{2}{n} \sum_{t=1}^{n} \frac{1}{\hat{f}(\hat{Q}_t)}\, \frac{\partial F^{-1}(h(Y_{t-1}; \hat{\theta}(\tau)), \hat{\beta}(\tau))}{\partial \beta}\, C_{1u}^{-1}(\tau; F(Y_{t-1}, \hat{\beta}(\tau)), \hat{\alpha}(\tau))\, \dot{F}_{\beta}(Y_{t-1}, \hat{\beta}(\tau))'
+ \frac{1}{n} \sum_{t=1}^{n} \frac{1}{\{\hat{f}(\hat{Q}_t)\}^2}\, \{C_{1u}^{-1}(\tau; F(Y_{t-1}, \hat{\beta}(\tau)), \hat{\alpha}(\tau))\}^2\, \dot{F}_{\beta}(Y_{t-1}, \hat{\beta}(\tau))\, \dot{F}_{\beta}(Y_{t-1}, \hat{\beta}(\tau))'.
\]
Next, the true (unknown) conditional density of Y_t given Y_{t−1}, g*(Q_{Y_t}(τ | Y_{t−1})), can be estimated by difference quotients,
\[
\hat{g}\big(\hat{Q}_{Y_t}(\tau \mid Y_{t-1})\big) = \frac{\tau_i - \tau_{i-1}}{\hat{Q}_{Y_t}(\tau_i \mid Y_{t-1}) - \hat{Q}_{Y_t}(\tau_{i-1} \mid Y_{t-1})},
\]
for some appropriately chosen sequence of {τ_i}'s. Then the matrix V(τ) can be estimated by
\[
\hat{V}_n(\tau) = \begin{pmatrix} \hat{V}_{n,\alpha\alpha}(\tau) & \hat{V}_{n,\alpha\beta}(\tau) \\ \hat{V}_{n,\beta\alpha}(\tau) & \hat{V}_{n,\beta\beta}(\tau) \end{pmatrix},
\]
with (writing Q̂_t = Q̂_{Y_t}(τ | Y_{t−1}))
\[
\hat{V}_{n,\alpha\alpha}(\tau) = \frac{1}{n} \sum_{t=1}^{n} \frac{\hat{g}(\hat{Q}_t)}{\{\hat{f}(\hat{Q}_t)\}^2}\, \dot{h}_{\alpha}(Y_{t-1}; \hat{\theta}(\tau))\, \dot{h}_{\alpha}(Y_{t-1}; \hat{\theta}(\tau))';
\]
\[
\hat{V}_{n,\alpha\beta}(\tau) = \frac{1}{n} \sum_{t=1}^{n} \frac{\hat{g}(\hat{Q}_t)}{\hat{f}(\hat{Q}_t)}\, \dot{h}_{\alpha}(Y_{t-1}; \hat{\theta}(\tau))\, \frac{\partial F^{-1}(h(Y_{t-1}; \hat{\theta}(\tau)), \hat{\beta}(\tau))'}{\partial \beta}
+ \frac{1}{n} \sum_{t=1}^{n} \frac{\hat{g}(\hat{Q}_t)}{\{\hat{f}(\hat{Q}_t)\}^2}\, \dot{h}_{\alpha}(Y_{t-1}; \hat{\theta}(\tau))\, C_{1u}^{-1}(\tau; F(Y_{t-1}, \hat{\beta}(\tau)), \hat{\alpha}(\tau))\, \dot{F}_{\beta}(Y_{t-1}, \hat{\beta}(\tau))';
\]
\[
\hat{V}_{n,\beta\alpha}(\tau) = \hat{V}_{n,\alpha\beta}(\tau)';
\]
\[
\hat{V}_{n,\beta\beta}(\tau) = \frac{1}{n} \sum_{t=1}^{n} \hat{g}(\hat{Q}_t)\, \frac{\partial F^{-1}(h(Y_{t-1}; \hat{\theta}(\tau)), \hat{\beta}(\tau))}{\partial \beta}\, \frac{\partial F^{-1}(h(Y_{t-1}; \hat{\theta}(\tau)), \hat{\beta}(\tau))'}{\partial \beta}
+ \frac{2}{n} \sum_{t=1}^{n} \frac{\hat{g}(\hat{Q}_t)}{\hat{f}(\hat{Q}_t)}\, \frac{\partial F^{-1}(h(Y_{t-1}; \hat{\theta}(\tau)), \hat{\beta}(\tau))}{\partial \beta}\, C_{1u}^{-1}(\tau; F(Y_{t-1}, \hat{\beta}(\tau)), \hat{\alpha}(\tau))\, \dot{F}_{\beta}(Y_{t-1}, \hat{\beta}(\tau))'
+ \frac{1}{n} \sum_{t=1}^{n} \frac{\hat{g}(\hat{Q}_t)}{\{\hat{f}(\hat{Q}_t)\}^2}\, \{C_{1u}^{-1}(\tau; F(Y_{t-1}, \hat{\beta}(\tau)), \hat{\alpha}(\tau))\}^2\, \dot{F}_{\beta}(Y_{t-1}, \hat{\beta}(\tau))\, \dot{F}_{\beta}(Y_{t-1}, \hat{\beta}(\tau))'.
\]
Wald-type tests can then be constructed immediately, based on the QAR estimators standardized using Ω̂_n(τ) and V̂_n(τ). The copula-based QAR models and the related quantile regression estimation also provide important information about specification. Specification of, say, the copula function may be investigated based on parameter constancy over quantiles, along the lines of Koenker and Xiao (2006). In addition, specification of conditional quantile models can be studied based on the quantile autoregression residuals. For example, if we want to test a hypothesis of the general form H_0: R(θ(τ)) = 0, where R(θ) is a q-dimensional vector of smooth functions of θ with derivatives to the second order, the asymptotic normality derived in the previous section facilitates the construction of a Wald statistic. Letting
\[
\dot{R}(\theta(\tau)) = \Big(\frac{\partial R_1(\theta)}{\partial \theta}, \ldots, \frac{\partial R_q(\theta)}{\partial \theta}\Big)\Big|_{\theta = \theta(\tau)}
\]
denote the p × q matrix of derivatives of R(θ), we can construct the following regression Wald statistic:
\[
W_{n,\tau} \equiv n\, R(\hat{\theta}(\tau))'\, \big[\tau(1-\tau)\, \dot{R}(\hat{\theta}(\tau))'\, \hat{V}_n(\tau)^{-1}\, \hat{\Omega}_n(\tau)\, \hat{V}_n(\tau)^{-1}\, \dot{R}(\hat{\theta}(\tau))\big]^{-1} R(\hat{\theta}(\tau)).
\]
Under the null hypothesis and our regularity conditions, we have W_{n,τ} ⇒ χ²_q, where χ²_q denotes a central chi-square distribution with q degrees of freedom.
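For a linear restriction R(θ) = Rθ the mechanics of the statistic are easy to assemble. In the sketch below every matrix is a made-up illustration (not an estimate from any model), and 5.991 is the 95% point of χ²(2):

```python
import numpy as np

tau, n = 0.5, 500
theta_hat = np.array([0.03, -0.02, 1.4])   # hypothetical estimate
R = np.array([[1.0, 0.0, 0.0],             # H0: theta_1 = theta_2 = 0 (q = 2)
              [0.0, 1.0, 0.0]])
V = np.diag([2.0, 2.0, 1.5])               # hypothetical V_n(tau)
Omega = np.diag([1.0, 1.0, 0.8])           # hypothetical Omega_n(tau)

# sandwich covariance of sqrt(n)*theta_hat: tau(1-tau) V^{-1} Omega V^{-1}
Sigma = tau * (1 - tau) * np.linalg.inv(V) @ Omega @ np.linalg.inv(V)
r = R @ theta_hat
W = n * r @ np.linalg.inv(R @ Sigma @ R.T) @ r  # Wald statistic, ~ chi2(2) under H0

print(W > 5.991)  # reject H0 at the 5% level iff True
```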
5. CONCLUSION

There are many competing approaches to broadening the scope of nonlinear time series modelling. We have argued that parametric copulas offer an attractive framework for specifying nonlinear quantile autoregression models. In contrast to fully parametric methods such as maximum likelihood, which impose a global parametric structure, estimation of distinct copula-based QAR models retains considerable semi-parametric flexibility by permitting local, quantile-specific parameters. There are many possible directions for future development. Inference and specification diagnostics are clearly a priority. Extensions to methods based on nonparametric estimation of the invariant distribution are possible. Finally, semi-parametric modelling of the copula itself as a sieve appears to be a feasible strategy for expanding the menu of currently available parametric copulas.
ACKNOWLEDGMENTS We thank Richard Smith and a referee for helpful comments on an earlier version of this paper. Chen and Koenker gratefully acknowledge financial support from National Science Foundation grants SES-0631613 and SES-0544673, respectively.
REFERENCES Beare, B. (2008). Copulas and temporal dependence. Working paper, U. C. San Diego. Bouy´e, E. and M. Salmon (2008). Dynamic copula quantile regressions and tail area dynamic dependence in forex markets, Working paper, Financial Econometrics Research Centre, Warwick Business School, UK. Chen, X. and Y. Fan (2006). Estimation of copula-based semiparametric time series models. Journal of Econometrics 130, 307–35. Engle, R. and S. Mangenelli (2004). CAViaR: conditional autoregressive value at risk by regression quantiles. Journal of Business and Economic Statistics 22, 367–81. Hansen, L. P., J. Heaton and E. Luttmer (1995). Econometric evaluation of asset pricing models. The Review of Financial Studies 8, 237–74. Hayashi, F. (2000). Econometrics. Princeton: Princeton University Press. Huber, P. (1967). The behavior of maximum likelihood estimates under nonstandard conditions. In L. Le Cam and J. Neyman (Eds.), Proceedings of Berkeley Symposium on Mathematical Statistics and Probability, Volume I, 221–233. Berkeley: University of California Press. Joe, H. (1997). Multivariate Models and Dependence Concepts. London: Chapman & Hall/CRC. Ibragimov, R. (2006). Copulas-based characterizations and higher-order Markov processes. Working paper, Harvard University. Koenker, R. (2005). Quantile Regression. Econometric Society Monographs 38. New York: Cambridge University Press. Koenker, R. and G. Bassett (1978). Regression quantiles. Econometrica 46, 33–49. C The Author(s). Journal compilation C Royal Economic Society 2009.
Koenker, R. and Z. Xiao (2002). Inference on the quantile regression process. Econometrica 70, 1583–612.
Koenker, R. and Z. Xiao (2006). Quantile autoregression. Journal of the American Statistical Association 101, 980–90.
Newey, W. K. and D. F. McFadden (1994). Large sample estimation and hypothesis testing. In R. F. Engle and D. F. McFadden (Eds.), Handbook of Econometrics, Volume 4, 2113–247. Amsterdam: North-Holland.
Patton, A. (2006). Modelling asymmetric exchange rate dependence. International Economic Review 47, 527–56.
Patton, A. (2009). Copula-based models for financial time series. Forthcoming in T. G. Andersen, R. A. Davis, J.-P. Kreiss and T. Mikosch (Eds.), Handbook of Financial Time Series. New York: Springer-Verlag.
Pollard, D. (1985). New ways to prove central limit theorems. Econometric Theory 1, 295–313.
Weiss, A. (1991). Estimating nonlinear dynamic models using least absolute errors estimation. Econometric Theory 7, 46–68.
White, H. (1994). Estimation, Inference and Specification Analysis. Econometric Society Monographs 22. New York: Cambridge University Press.
APPENDIX: MATHEMATICAL PROOFS

Proof of Theorem 3.1: We denote Y_{t−1} as x_t. Then θ̂(τ) = arg min_{θ∈Θ} ∑_t ρ_τ(Y_t − H(x_t, θ)), where ρ_τ(u) ≡ u(τ − I(u < 0)). Define
\[
\varepsilon_t \equiv Y_t - Q_{Y_t}(\tau \mid x_t) \equiv Y_t - H(x_t, \theta(\tau)).
\]
Then Q_{ε_t}(τ | x_t) = 0 and
\[
Y_t = H(x_t, \theta(\tau)) + \varepsilon_t, \qquad \Pr[\varepsilon_t \le 0 \mid x_t] = \tau.
\]
Denote H̄(x_t, θ) ≡ H(x_t, θ) − H(x_t, θ(τ)) and q_τ(Y_t, x_t, θ) ≡ ρ_τ(ε_t − H̄(x_t, θ)) − ρ_τ(ε_t), and
\[
Q_n(\theta) \equiv \frac{1}{n} \sum_{t=1}^{n} q_{\tau}(Y_t, x_t, \theta).
\]
Then it is easy to see that θ̂(τ) = arg min_{θ∈Θ} Q_n(θ) and θ(τ) = arg min_{θ∈Θ} E[Q_n(θ)].
We apply Theorem 2.1 of Newey and McFadden (1994) to establish consistency. The compactness of Θ (Assumption 3.1) and the continuity of E[Q_n(θ)] with respect to θ ∈ Θ (Assumptions 3.2 and 3.3) are directly assumed. It remains to verify uniform convergence (sup_{θ∈Θ} |Q_n(θ) − E[Q_n(θ)]| = o_p(1)) and that θ(τ) is the unique minimizer of E[Q_n(θ)]. Notice that under Assumptions 3.2 and 3.3, q_τ(Y_t, x_t, θ) is continuous in θ ∈ Θ and measurable in (Y_t, x_t). Since
\[
\sup_{\theta \in \Theta} |q_{\tau}(Y_t, x_t, \theta)| = \sup_{\theta \in \Theta} \big|\rho_{\tau}(\varepsilon_t - \bar{H}(x_t, \theta)) - \rho_{\tau}(\varepsilon_t)\big| \le \sup_{\theta \in \Theta} |H(x_t, \theta) - H(x_t, \theta(\tau))|,
\]
we have E(sup_{θ∈Θ} |q_τ(Y_t, x_t, θ)|) < ∞ under Assumption 3.5(i). These conditions, the compactness of Θ (Assumption 3.1) and the stationary ergodicity of {Y_t} (Assumption 3.5(ii)) together imply that all the conditions of Proposition 7.1 of Hayashi (2000) hold. Thus, by applying the uniform law of large numbers
for stationary ergodic processes (see, e.g. Proposition 7.1 of Hayashi, 2000), we obtain the uniform convergence: sup_{θ∈Θ} |Q_n(θ) − E[Q_n(θ)]| = o_p(1).
Next we verify that E[Q_n(θ)] is uniquely minimized at θ(τ). Recall that the true but unknown conditional density and distribution function of Y_t given x_t are g*(·|x_t) and G*(·|x_t), respectively, and use the following identity:
\[
\rho_{\tau}(u - v) - \rho_{\tau}(u) = -v\,\psi_{\tau}(u) + (u - v)\{I(0 > u > v) - I(0 < u < v)\}
= -v\,\psi_{\tau}(u) + \int_0^{v} \{I(u \le s) - I(u < 0)\}\, ds, \quad (A.1)
\]
where ψ_τ(u) ≡ τ − I(u < 0) and, by definition, E[ψ_τ(ε_t)|x_t] = 0. We have, with the simplified notation H̄_t = H̄(x_t, θ),
\[
q_{\tau}(Y_t, x_t, \theta) = \rho_{\tau}(\varepsilon_t - \bar{H}_t) - \rho_{\tau}(\varepsilon_t) = -\bar{H}_t\, \psi_{\tau}(\varepsilon_t) + \int_0^{\bar{H}_t} \{I(\varepsilon_t \le s) - I(\varepsilon_t < 0)\}\, ds.
\]
Thus E[Q_n(θ)] = E{E[q_τ(Y_t, x_t, θ)|x_t]} and
\[
E[q_{\tau}(Y_t, x_t, \theta) \mid x_t] = E\Big[\int_0^{\bar{H}_t} \{I(\varepsilon_t \le s) - I(\varepsilon_t < 0)\}\, ds \,\Big|\, x_t\Big]
= 1\{\bar{H}_t > 0\}\, E\Big[\int_0^{\bar{H}_t} I(0 \le \varepsilon_t \le s)\, ds \,\Big|\, x_t\Big]
+ 1\{\bar{H}_t < 0\}\, E\Big[\int_{\bar{H}_t}^{0} I(s \le \varepsilon_t \le 0)\, ds \,\Big|\, x_t\Big].
\]
Notice that under Assumption 3.3,
\[
1\{\bar{H}_t > 0\}\, E\Big[\int_0^{\bar{H}_t} I(0 \le \varepsilon_t \le s)\, ds \,\Big|\, x_t\Big]
= 1\{\bar{H}_t > 0\} \int_0^{\bar{H}_t} \Big( \int_{Q_{Y_t}(\tau \mid x_t)}^{s + Q_{Y_t}(\tau \mid x_t)} g^{*}(y \mid x_t)\, dy \Big)\, ds
\ge 1\{\bar{H}_t > 0\}\, 1\{g^{*}(Q_{Y_t}(\tau \mid x_t)) \ge \epsilon_1\}\, \frac{\epsilon_1}{2}\, \bar{H}_t^{\,2},
\]
and a similar result can be obtained for the case H̄_t < 0. Thus,
\[
E[Q_n(\theta)] \ge E\Big[1\{g^{*}(Q_{Y_t}(\tau \mid x_t)) \ge \epsilon_1\}\, \frac{\epsilon_1}{2}\, \bar{H}_t^{\,2}\Big],
\]
which, under Assumption 3.4, is strictly positive. Thus, for any ε > 0, E[Q_n(θ)] is bounded away from zero, uniformly in θ for ||θ − θ(τ)|| ≥ ε.

Proof of Theorem 3.2: We obtain the asymptotic normality using Pollard's (1985) approach. In particular, we apply Pollard's (1985) Theorem 2, except that we replace his i.i.d. assumption by our stationary ergodic data Assumption 3.5(ii) (note that we could also apply Theorem 7.1 of Newey and McFadden, 1994). Recall that θ̂(τ) = arg min_{θ∈Θ} (1/n) ∑_t ρ_τ(Y_t − H(x_t, θ)) and that, by our Theorem 3.1, θ̂(τ) ∈ Θ_0 with probability approaching one. Note that ψ_τ(u) ≡ τ − I(u < 0) is the right-hand derivative of ρ_τ(u) ≡ u(τ − I(u < 0)); ρ_τ(u) is everywhere differentiable with respect to u except at u = 0. Under Assumption 3.6(i),
the derivative of ρ_τ(Y_t − H(x_t, θ)) with respect to θ ∈ Θ_0 exists (except at the point Y_t = H(x_t, θ)) and is given by φ_{tτ}(θ) ≡ [τ − I(Y_t < H(x_t, θ))] Ḣ_θ(x_t, θ). By the mean value theorem,
\[
\rho_{\tau}(Y_t - H(x_t, \theta)) = \rho_{\tau}(Y_t - H(x_t, \theta(\tau))) + (\theta - \theta(\tau))'\varphi_{t\tau}(\theta(\tau)) + \|\theta - \theta(\tau)\|\, r_t(\theta),
\]
with
\[
r_t(\theta) \equiv \frac{(\theta - \theta(\tau))'[\varphi_{t\tau}(\bar{\theta}) - \varphi_{t\tau}(\theta(\tau))]}{\|\theta - \theta(\tau)\|},
\]
where θ̄ ∈ Θ_0 lies between θ and θ(τ). Likewise, E[ρ_τ(Y_t − H(x_t, θ))] = E[ρ_τ(Y_t − H(x_t, θ(τ)))] + (θ − θ(τ))′E[φ_{tτ}(θ(τ))] + ||θ − θ(τ)|| E[r_t(θ)]. Since E[τ − I(Y_t < H(x_t, θ(τ)))|x_t] = 0 under Assumption 3.3, we have, under Assumptions 3.3, 3.5 and 3.6(i)(iv), that E[ρ_τ(Y_t − H(x_t, θ))] has a second-order (i.e. E[φ_{tτ}(θ)] has a first-order) derivative at θ(τ) that is non-singular, given by
\[
-V(\tau) \equiv -E\big[g^{*}(H(x_t, \theta(\tau)))\, \dot{H}_{\theta}(x_t, \theta(\tau))\, \dot{H}_{\theta}(x_t, \theta(\tau))'\big].
\]
Thus condition (i) of Pollard's (1985) Theorem 2 is satisfied. His condition (ii) is directly assumed (θ(τ) ∈ int(Θ)), and his condition (iii) holds by our Theorem 3.1 (||θ̂(τ) − θ(τ)|| = o_P(1)). We replace his condition (iv) by a CLT for stationary ergodic martingale differences. Since
\[
E[\varphi_{t\tau}(\theta(\tau)) \mid x_t] = E\big[(\tau - I(Y_t < H(x_t, \theta(\tau)))) \mid x_t\big]\, \dot{H}_{\theta}(x_t, \theta(\tau)) = 0,
\]
\[
\mathrm{Var}[\varphi_{t\tau}(\theta(\tau)) \mid x_t] = \tau(1-\tau)\, \dot{H}_{\theta}(x_t, \theta(\tau))\, \dot{H}_{\theta}(x_t, \theta(\tau))',
\]
under Assumptions 3.3, 3.5(ii) and 3.6(iv) we can apply the CLT for strictly stationary ergodic martingale difference sequences (see, e.g. Hayashi, 2000, p. 106) and obtain
\[
\frac{1}{\sqrt{n}} \sum_{t=1}^{n} \varphi_{t\tau}(\theta(\tau)) \Rightarrow N(0, \tau(1-\tau)\,\Omega(\tau)),
\]
with Ω(τ) ≡ E[Ḣ_θ(x_t, θ(τ)) Ḣ_θ(x_t, θ(τ))′]. Thus it remains to verify that condition (v) (stochastic differentiability) of Pollard's (1985) Theorem 2 holds:
\[
\sup_{\theta \in U_n} \bigg| \frac{\frac{1}{\sqrt{n}} \sum_t \big(r_t(\theta) - E[r_t(\theta)]\big)}{1 + \sqrt{n}\, \|\theta - \theta(\tau)\|} \bigg| \to 0 \ \text{in probability}
\]
for each sequence of balls {U_n} that shrinks to θ(τ) as n → ∞. Since r_t(θ) ≡ (θ − θ(τ))′[φ_{tτ}(θ) − φ_{tτ}(θ(τ))]/||θ − θ(τ)||, Pollard's (1985) condition (v) holds provided that
\[
\sup_{\theta \in U_n} \frac{\big\| \frac{1}{n} \sum_t \big\{ [\varphi_{t\tau}(\theta) - \varphi_{t\tau}(\theta(\tau))] - E[\varphi_{t\tau}(\theta) - \varphi_{t\tau}(\theta(\tau))] \big\} \big\|}{\|\theta - \theta(\tau)\|} \to 0 \ \text{in probability}
\]
for each sequence of balls {U_n} that shrinks to θ(τ) as n → ∞.
Recall that φ_{tτ}(θ) ≡ [τ − I(Y_t < H(x_t, θ))] Ḣ_θ(x_t, θ); we have
\[
\varphi_{t\tau}(\theta) - \varphi_{t\tau}(\theta(\tau)) = \dot{H}_{\theta}(x_t, \theta)\,[I(Y_t < H(x_t, \theta(\tau))) - I(Y_t < H(x_t, \theta))]
+ [\dot{H}_{\theta}(x_t, \theta) - \dot{H}_{\theta}(x_t, \theta(\tau))]\,[\tau - I(Y_t < H(x_t, \theta(\tau)))]
\equiv R_{1t}(\theta) + R_{2t}(\theta).
\]
Under Assumption 3.6(i)(iii) we have, for all U_n ⊆ Θ_0,
\[
E\Big[\sup_{\theta \in U_n} \frac{\|R_{2t}(\theta)\|}{\|\theta - \theta(\tau)\|}\Big] \le E\Big[\sup_{\theta \in \Theta_0} \|\ddot{H}_{\theta\theta}(x_t, \theta)\|\Big] < \infty.
\]
By Assumption 3.3,
\[
E[R_{2t}(\theta)] = E\big\{ [\dot{H}_{\theta}(x_t, \theta) - \dot{H}_{\theta}(x_t, \theta(\tau))]\, E[\tau - I(Y_t < H(x_t, \theta(\tau))) \mid x_t] \big\} = 0.
\]
Thus, under Assumptions 3.5(ii) and 3.6(i)(iii), by the uniform law of large numbers for stationary ergodic processes, since U_n ⊆ Θ_0 ⊂ Θ, we obtain
\[
\sup_{\theta \in U_n} \frac{\big\| \frac{1}{n} \sum_t \{R_{2t}(\theta) - E[R_{2t}(\theta)]\} \big\|}{\|\theta - \theta(\tau)\|} = o_P(1)
\]
for each sequence of balls {U_n} that shrinks to θ(τ) as n → ∞. Consequently, Pollard's (1985) condition (v) holds provided that
\[
\sup_{\theta \in U_n} \frac{\big\| \frac{1}{n} \sum_t \{R_{1t}(\theta) - E[R_{1t}(\theta)]\} \big\|}{\|\theta - \theta(\tau)\|} = o_P(1) \quad (A.2)
\]
for each sequence of balls {U_n} that shrinks to θ(τ) as n → ∞. For any positive sequence of decreasing numbers {ε_n}, denote U_n ≡ {θ ∈ Θ_0 : θ ≠ θ(τ), ||θ − θ(τ)|| < ε_n}. Then, under Assumption 3.6(i)(ii), we have
\[
E\Big[\sup_{\theta \in U_n} \frac{\|R_{1t}(\theta)\|}{\|\theta - \theta(\tau)\|}\Big]
\le E\Big\{ \sup_{\theta \in \Theta_0} \|\dot{H}_{\theta}(x_t, \theta)\| \times E\Big[ \sup_{\theta \in U_n} \frac{|I(Y_t < H(x_t, \theta(\tau))) - I(Y_t < H(x_t, \theta))|}{\|\theta - \theta(\tau)\|} \,\Big|\, x_t \Big] \Big\}.
\]
For all θ ∈ Θ_0, under Assumption 3.6(i)(iii), we have
\[
H(x_t, \theta) = H(x_t, \theta(\tau)) + \dot{H}_{\theta}(x_t, \theta(\tau))'(\theta - \theta(\tau)) + \frac{(\theta - \theta(\tau))'\,\ddot{H}_{\theta\theta}(x_t, \bar{\theta})\,(\theta - \theta(\tau))}{2},
\]
with E(sup_{θ∈Θ_0} ||Ḧ_θθ(x_t, θ)||) < ∞. Therefore, under Assumptions 3.3 and 3.6(i)(iii), conditioning on x_t, there exists a small δ(x_t) > 0 such that for all θ ∈ Θ_0 with ||θ − θ(τ)|| < δ(x_t), Y_t − H(x_t, θ(τ)) and Y_t − H(x_t, θ) are of the same sign. Hence, under Assumptions 3.3 and 3.6(i)(ii), conditioning
on $x_t$ and for any $\varepsilon_n \le \delta(x_t)$ with $\varepsilon_n \downarrow 0$, we have
$$E\Big[\sup_{\theta\in U_n}\frac{|I(Y_t<H(x_t,\theta(\tau)))-I(Y_t<H(x_t,\theta))|}{\|\theta-\theta(\tau)\|}\,\Big|\,x_t\Big] \le E\Big[\sup_{\theta\in U_n:\,\|\theta-\theta(\tau)\|<\delta(x_t)}\frac{I(Y_t<H(x_t,\theta))-I(Y_t<H(x_t,\theta(\tau)))}{\|\theta-\theta(\tau)\|}\,1\{H_t>0\}\,\Big|\,x_t\Big]$$

$\hat\lambda_1 > \cdots > \hat\lambda_m > 0$ and corresponding eigenvectors $\hat V = [\hat v_1,\ldots,\hat v_m]$, which are normalized by $\hat V' S_{11}\hat V = I_m$. Estimates of $\beta$ and $\alpha$ are then obtained as
$$\hat\beta = [\hat v_1,\ldots,\hat v_r],$$
$$\hat\alpha = \hat\alpha(\hat\beta) = S_{01}\hat\beta, \qquad (2.6)$$
with $\hat\beta$ formed from the eigenvectors of $\hat V$ corresponding to the $r$ largest roots of (2.5). The residuals from the RRR and the corresponding moment matrix of residuals that appear in the information criterion are
$$\hat u_t = \Delta X_t - \hat\alpha\hat\beta' X_{t-1}, \qquad (2.7)$$
$$\hat\Sigma(r) = n^{-1}\sum_{t=1}^{n}\hat u_t\hat u_t' = S_{00} - S_{01}\hat\beta\hat\beta' S_{10}. \qquad (2.8)$$
Using (2.8) we have (e.g. Theorem 6.1 of Johansen, 1995)
$$|\hat\Sigma(r)| = |S_{00}|\prod_{i=1}^{r}(1-\hat\lambda_i), \qquad (2.9)$$
where $\hat\lambda_i$, $1\le i\le r$, are the $r$ largest solutions to (2.5). The criterion (1.2) is then well determined for any given value of $r$.
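The estimation and criterion steps in (2.5)–(2.9) can be sketched numerically. The following is a minimal NumPy illustration, not the authors' code: the function names are ours, the simulated design borrows $R_1$ from Section 4, and the degrees-of-freedom term $2mr - r^2$ corresponds to the criterion (1.2) discussed in the Appendix.

```python
import numpy as np

def rrr_roots(X_lag, dX):
    """Moment matrices and the ordered solutions of the generalized
    eigenproblem (2.5), |lam*S11 - S10 S00^{-1} S01| = 0, with
    eigenvectors V normalized so that V' S11 V = I_m."""
    n = dX.shape[0]
    S00, S11, S01 = dX.T @ dX / n, X_lag.T @ X_lag / n, dX.T @ X_lag / n
    L = np.linalg.cholesky(S11)
    Li = np.linalg.inv(L)
    M = Li @ S01.T @ np.linalg.solve(S00, S01) @ Li.T   # symmetric PSD
    lam, W = np.linalg.eigh(M)                          # ascending order
    lam, V = lam[::-1], (Li.T @ W)[:, ::-1]             # descending; V'S11V = I
    return S00, S01, lam, V

def ic(S00, lam, n, m, r, Cn):
    """IC(r) = log|Sigma_hat(r)| + Cn n^{-1}(2mr - r^2), via (2.9)."""
    return (np.linalg.slogdet(S00)[1] + np.log(1.0 - lam[:r]).sum()
            + Cn * (2 * m * r - r * r) / n)

# demo on data from (1.1) with m = 2 and true rank r0 = 1
rng = np.random.default_rng(0)
Pi = np.outer([1.0, 0.5], [-1.0, 1.0])      # alpha*beta' = R1 of Section 4
X = np.zeros((501, 2))
for t in range(500):
    X[t + 1] = X[t] + Pi @ X[t] + rng.standard_normal(2)
dX, X_lag = np.diff(X, axis=0), X[:-1]
S00, S01, lam, V = rrr_roots(X_lag, dX)
n = dX.shape[0]
r_hat = int(np.argmin([ic(S00, lam, n, 2, r, np.log(n)) for r in range(3)]))
```

With $C_n = \log n$ the criterion is BIC; $C_n = 2$ gives AIC and $C_n = 2\log\log n$ gives HQ, matching the criteria compared in Section 4.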
3. ASYMPTOTIC RESULTS

The following assumptions make specific the semiparametric and co-integration components of (1.1). Assumption LP is a standard linear process condition of the type that is convenient in developing partial sum limit theory. The condition can be relaxed to allow for martingale difference innovations and for some mild heterogeneity in the innovations without disturbing the limit theory in a material way (see Phillips and Solo, 1992). Assumption RR gives conditions that are standard in the study of reduced rank regressions with some unit roots (Johansen, 1988, 1995, and Phillips, 1995).

ASSUMPTION LP. Let $D(L) = \sum_{j=0}^{\infty} D_jL^j$, with $D_0 = I$ and full rank $D(1)$, and let $u_t$ have Wold representation
$$u_t = D(L)\varepsilon_t = \sum_{j=0}^{\infty} D_j\varepsilon_{t-j},\quad\text{with}\quad \sum_{j=0}^{\infty} j^{1/2}\|D_j\| < \infty, \qquad (3.1)$$
for some matrix norm $\|\cdot\|$ and where $\varepsilon_t$ is i.i.d. $(0, \Sigma_\varepsilon)$ with $\Sigma_\varepsilon = E\{\varepsilon_t\varepsilon_t'\} > 0$. We use the notation $\Sigma_{ab}(h) = E(a_tb_{t+h}')$ and $\Lambda_{ab} = \sum_{h=1}^{\infty}\Sigma_{ab}(h)$ for autocovariance matrices and one-sided long-run autocovariances, and set $\Omega = \sum_{h=-\infty}^{\infty}\Sigma_{uu}(h) = D(1)\Sigma_\varepsilon D(1)' > 0$.

ASSUMPTION RR. (a) The determinantal equation $|I_m - (I_m + \alpha\beta')L| = 0$ has roots on or outside the unit circle, i.e. $|L| \ge 1$. (b) Set $\Pi = I_m + \alpha\beta'$, where $\alpha$ and $\beta$ are $m\times r_0$ matrices of full column rank $r_0$, $0\le r_0\le m$. (If $r_0 = 0$ then $\Pi = I_m$; if $r_0 = m$ then $\beta$ has full rank $m$, and $\beta'X_t$, and hence $X_t$, are (asymptotically) stationary.) (c) The matrix $R = I_{r_0} + \beta'\alpha$ has eigenvalues within the unit circle.
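As a concrete instance of Assumption LP, take the AR(1) error scheme of Section 4, $u_t = \psi u_{t-1} + \varepsilon_t$, for which $D_j = \psi^j I_m$, condition (3.1) holds, and $\Omega = D(1)\Sigma_\varepsilon D(1)' = (1-\psi)^{-2}\Sigma_\varepsilon$. A quick numerical check (our own sketch; the parameter values mirror the $m = 2$ simulation design):

```python
import numpy as np

psi, theta = 0.4, 0.25
A = psi * np.eye(2)                         # AR(1) coefficient matrix
Sig_eps = np.diag([1 + theta, 1 - theta])   # Sigma_eps > 0, as in Section 4

D1 = np.linalg.inv(np.eye(2) - A)           # D(1) = sum_j D_j = (I - A)^{-1}
Omega = D1 @ Sig_eps @ D1.T                 # long-run variance D(1) Sig_eps D(1)'

# truncated checks: D(1) as the sum of D_j = A^j, and condition (3.1)
D_sum = sum(np.linalg.matrix_power(A, j) for j in range(200))
cond31 = sum(j ** 0.5 * np.linalg.norm(np.linalg.matrix_power(A, j))
             for j in range(200))
```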
Semiparametric cointegrating rank selection
Assumption (c) ensures that the matrix $\beta'\alpha$ has full rank. Let $\alpha_\perp$ and $\beta_\perp$ be orthogonal complements to $\alpha$ and $\beta$, so that $[\alpha, \alpha_\perp]$ and $[\beta, \beta_\perp]$ are non-singular and $\beta_\perp'\beta_\perp = I_{m-r}$. Then, non-singularity of $\beta'\alpha$ implies the non-singularity of $\alpha_\perp'\beta_\perp$. Under RR we have the Wold representation of $\beta'X_t$,
$$v_t := \beta'X_t = \sum_{i=0}^{\infty} R^i\beta'u_{t-i} = R(L)\beta'u_t = R(L)\beta'D(L)\varepsilon_t, \qquad (3.2)$$
and some further manipulations yield the following useful partial sum representation
$$X_t = C\sum_{s=1}^{t} u_s + \alpha(\beta'\alpha)^{-1}R(L)\beta'u_t + CX_0, \qquad (3.3)$$
where $C = \beta_\perp(\alpha_\perp'\beta_\perp)^{-1}\alpha_\perp'$. Expression (3.3) reduces to the Granger representation when $u_t$ is a martingale difference (e.g. Johansen, 1995). Under LP a functional law for partial sums of $u_t$ holds, so that $n^{-1/2}\sum_{s=1}^{[n\cdot]} u_s \Rightarrow B_u(\cdot)$ as $n\to\infty$, where $B_u$ is vector Brownian motion with variance matrix $\Omega$. In view of (3.2) and the fact that $R(1) = \sum_{i=0}^{\infty} R^i = (I-R)^{-1} = -(\beta'\alpha)^{-1}$, we further have
$$n^{-1/2}\sum_{s=1}^{[n\cdot]} v_s = n^{-1/2}\sum_{s=1}^{[n\cdot]}\beta'X_s \Rightarrow -(\beta'\alpha)^{-1}\beta'B_u(\cdot),\quad\text{as } n\to\infty. \qquad (3.4)$$
These limit laws involve the same Brownian motion $B_u$ and determine the asymptotic forms of the various sample moment matrices involved in the reduced rank regression estimation of (1.1). Define
$$\mathrm{Var}\begin{pmatrix} \Delta X_t \\ \beta'X_{t-1} \end{pmatrix} = \begin{pmatrix} \Sigma_{00} & \Sigma_{0\beta} \\ \Sigma_{\beta 0} & \Sigma_{\beta\beta} \end{pmatrix}. \qquad (3.5)$$
Explicit expressions for the submatrices in this expression may be worked out in terms of the autocovariance sequences of $u_t$ and $v_t$ and the parameters of (1.1). These expressions are given in (A.2)–(A.4) in the proof of Lemma A1 in the Appendix. The following result provides some asymptotic limits that are useful in deriving the asymptotic properties of the criterion function (1.2).

LEMMA 3.1. Under Assumptions LP and RR,
$$S_{00} \to_p \Sigma_{00},\qquad \beta'S_{11}\beta \to_p \Sigma_{\beta\beta},\qquad \beta'S_{10} \to_p \Sigma_{\beta 0},$$
$$n^{-1}\beta_\perp'S_{11}\beta_\perp \Rightarrow (\alpha_\perp'\beta_\perp)^{-1}\alpha_\perp'\Big(\int_0^1 B_uB_u'\Big)\alpha_\perp(\beta_\perp'\alpha_\perp)^{-1},$$
$$\beta_\perp'(S_{10} - S_{11}\beta\alpha') \Rightarrow (\alpha_\perp'\beta_\perp)^{-1}\alpha_\perp'\int_0^1 B_u\,dB_u' + \Lambda_{wu},$$
$$\beta_\perp'S_{11}\beta \Rightarrow -(\alpha_\perp'\beta_\perp)^{-1}\alpha_\perp'\int_0^1 B_u\,dB_u'\,\beta(\alpha'\beta)^{-1} + \Lambda_{wv},$$
$$\beta_\perp'S_{10} \Rightarrow (\alpha_\perp'\beta_\perp)^{-1}\alpha_\perp'\int_0^1 B_u\,dB_u'\,\alpha_\perp(\beta_\perp'\alpha_\perp)^{-1}\beta_\perp' + \Lambda_{wu} + \Lambda_{wv}\alpha',$$
Xu Cheng and Peter C. B. Phillips
where
$$\Lambda_{wu} = \sum_{h=1}^{\infty} E\{(\beta_\perp'\Delta X_t)u_{t+h}'\},\qquad \Lambda_{wv} = \sum_{h=0}^{\infty} E\{(\beta_\perp'\Delta X_t)(\beta'X_{t+h})'\},$$
and $w_t = \beta_\perp'\Delta X_t = \beta_\perp'u_t + \beta_\perp'\alpha v_{t-1}$.

REMARK 3.1. (a) When $u_t$ is weakly dependent, it is apparent that the asymptotic limits of $\beta_\perp'(S_{10}-S_{11}\beta\alpha')$, $\beta_\perp'S_{10}$ and $\beta_\perp'S_{11}\beta$ involve bias terms that depend on various one-sided long-run covariance matrices associated with the stationary components $u_t$, $v_t$ and $w_t = \beta_\perp'\Delta X_t$. Explicit values of these one-sided long-run covariance matrices are given in (A.7) and (A.8) in the Appendix.
(b) When $u_t$ is a martingale difference sequence, $\Lambda_{wu} = 0$ and $\Lambda_{wv} = \beta_\perp'E(u_tv_t')$. Simpler results, such as
$$\beta_\perp'(S_{10}-S_{11}\beta\alpha') \Rightarrow (\alpha_\perp'\beta_\perp)^{-1}\alpha_\perp'\int_0^1 B_u\,dB_u', \qquad (3.6)$$
then hold for the limits in Lemma 3.1, and these correspond to earlier results given, for example, in theorem 10.3 of Johansen (1995). From (1.1), $\Delta X_t = u_t + \alpha\beta'X_{t-1} = u_t + \alpha v_{t-1}$, so that (c.f. (A.3)–(A.4))
$$\Sigma_{0\beta} = \alpha\Sigma_{\beta\beta} + E(u_tv_{t-1}')\quad\text{and}\quad \Sigma_{00} = \alpha\Sigma_{\beta 0} + E(u_tv_{t-1}')\alpha' + E(u_tu_t'). \qquad (3.7)$$
Define
$$\tilde\alpha = \Sigma_{0\beta}\Sigma_{\beta\beta}^{-1} = \alpha + E(u_tv_{t-1}')\Sigma_{\beta\beta}^{-1}, \qquad (3.8)$$
and let $\tilde\alpha_\perp$ be an $m\times(m-r)$ orthogonal complement to $\tilde\alpha$ such that $[\tilde\alpha, \tilde\alpha_\perp]$ is non-singular.

LEMMA 3.2. Under Assumptions LP and RR, when the true co-integration rank is $r_0$, the $r_0$ largest solutions to (2.5), denoted $\hat\lambda_i$ with $1\le i\le r_0$, converge to the roots of
$$\big|\lambda\Sigma_{\beta\beta} - \Sigma_{\beta 0}\Sigma_{00}^{-1}\Sigma_{0\beta}\big| = 0. \qquad (3.9)$$
The remaining $m-r_0$ roots, denoted $\hat\lambda_i$ with $r_0+1\le i\le m$, decrease to zero at the rate $n^{-1}$, and $\{n\hat\lambda_i : i = r_0+1,\ldots,m\}$ converge weakly to the roots of
$$\Big|\,\rho\int_0^1 G_uG_u' - \Big(\int_0^1 G_u\,dG_u'\,\beta_\perp' + \Phi\Big)\tilde\alpha_\perp\big(\tilde\alpha_\perp'\Sigma_{00}\tilde\alpha_\perp\big)^{-1}\tilde\alpha_\perp'\Big(\beta_\perp\int_0^1 dG_u\,G_u' + \Phi'\Big)\Big| = 0, \qquad (3.10)$$
where $G_u(r) = (\alpha_\perp'\beta_\perp)^{-1}\alpha_\perp'B_u(r)$ is $m-r_0$ dimensional Brownian motion with variance matrix $(\alpha_\perp'\beta_\perp)^{-1}\alpha_\perp'\Omega\alpha_\perp(\beta_\perp'\alpha_\perp)^{-1}$ and $\Phi = \Lambda_{wu} + \Lambda_{wv}\alpha'$.

REMARK 3.2. (a) Comparing these results with those of the standard RRR case with martingale difference errors (Johansen, 1995, p. 158), we see that, just as in the standard case, the $r_0$ largest roots of (2.5) are all positive in the limit and the $m-r_0$ smallest roots converge to 0 at the rate $n^{-1}$, both results now holding under weakly dependent errors. However, when $u_t$
is weakly dependent, the limit distribution determined by (3.10) is more complex than in the standard case. In particular, the determinantal equation (3.10) involves the composite one-sided long-run covariance matrix $\Phi$.
(b) When $u_t$ is a martingale difference sequence, we find that $\tilde\alpha = \alpha$, $\tilde\alpha_\perp = \alpha_\perp$, $\Lambda_{wu} = 0$, $\Phi = \Lambda_{wv}\alpha'$, $\Sigma_{00} = \alpha\Sigma_{\beta\beta}\alpha' + \Omega$, and
$$\tilde\alpha_\perp\big(\tilde\alpha_\perp'\Sigma_{00}\tilde\alpha_\perp\big)^{-1}\tilde\alpha_\perp' = \alpha_\perp\big(\alpha_\perp'\Sigma_{00}\alpha_\perp\big)^{-1}\alpha_\perp' = \alpha_\perp\big(\alpha_\perp'\Omega\alpha_\perp\big)^{-1}\alpha_\perp'.$$
Then $\alpha_\perp'\beta_\perp G_u(r) = \alpha_\perp'B_u(r)$ is Brownian motion with covariance matrix $\alpha_\perp'\Omega\alpha_\perp$, $\Phi\alpha_\perp = 0$, and the determinantal equation (3.10) reduces to
$$\Big|\,\rho\int_0^1 G_uG_u' - \int_0^1 G_u\,dG_u'\,\beta_\perp'\alpha_\perp\big(\alpha_\perp'\Omega\alpha_\perp\big)^{-1}\alpha_\perp'\beta_\perp\int_0^1 dG_u\,G_u'\,\Big| = 0,$$
which is equivalent to
$$\Big|\,\rho\int_0^1 V_uV_u' - \int_0^1 V_u\,dV_u'\int_0^1 dV_u\,V_u'\,\Big| = 0,$$
where $V_u(r)$ is $m-r_0$ dimensional standard Brownian motion, thereby corresponding to the standard limit theory of a parametric reduced rank regression (Johansen, 1995).

THEOREM 3.1. (a) Under Assumptions LP and RR, the criterion IC(r) is weakly consistent for selecting the rank of co-integration provided $C_n \to \infty$ at a slower rate than $n$.
(b) The asymptotic distribution of the AIC criterion (IC(r) with coefficient $C_n = 2$) is given by
$$\lim_{n\to\infty}P(\hat r_{AIC} = r_0) = P\Bigg[\bigcap_{r=r_0+1}^{m}\Bigg\{\sum_{i=r_0+1}^{r}\xi_i < 2(r-r_0)(2m-r-r_0)\Bigg\}\Bigg],$$
$$\lim_{n\to\infty}P(\hat r_{AIC} = r \mid r > r_0) = P\Bigg[\bigcap_{r'=r+1}^{m}\Bigg\{\sum_{i=r+1}^{r'}\xi_i < 2(r'-r)(2m-r'-r)\Bigg\} \cap \bigcap_{r'=r_0}^{r-1}\Bigg\{\sum_{i=r'+1}^{r}\xi_i > 2(r-r')(2m-r-r')\Bigg\}\Bigg],$$
and
$$\lim_{n\to\infty}P(\hat r_{AIC} = r \mid r < r_0) = 0,$$
where $\xi_{r_0+1},\ldots,\xi_m$ are the ordered roots of the limiting determinantal equation (3.10).

REMARK 3.3. (a) BIC, HQ and other information criteria with $C_n \to \infty$ and $C_n/n \to 0$ are all consistent for the selection of cointegrating rank without having to specify a full parametric model.
The same is true for the criterion IC*(r), where only the cointegrating space is estimated and structural identification conditions on the cointegrating vector are not imposed or used in rank selection.
(b) AIC is inconsistent, asymptotically never underestimates cointegrating rank, and favours more liberally parametrized systems. This outcome is analogous to the well-known overestimation tendency of AIC in lag length selection in autoregression. Of course, in the present case, maximum rank is bounded above by the order of the system. Thus, the advantages to overestimation in lag length selection that arise when the autoregressive order is infinite might not be anticipated here. However, when cointegrating rank is high (and close to full dimensional), AIC typically performs exceedingly well (as simulations reported below attest), largely because the upper bound on rank restricts the tendency to overestimate.
(c) When m = 1, $r_0 = 0$ corresponds to the unit root case and $r_0 = 1$ to the stationary case. Thus, one specialization of the above result is to unit root testing. In this case, the criteria consistently discriminate between unit root and stationary series provided $C_n \to \infty$ and $C_n/n \to 0$, as shown in Phillips (2008). In this case, the limit distribution of AIC is much simpler and involves only the explicit limiting root $\xi_1 = \big(\int_0^1 B_u\,dB_u + \lambda\big)^2\big/\big\{\big(\int_0^1 B_u^2\big)\Sigma_{00}\big\}$, where $\lambda = \sum_{h=1}^{\infty}E(u_tu_{t+h})$.
(d) While Theorem 3.1 relates directly to model (1.1), it is easily shown to apply in cases where the model has intercepts and drift. Thus, the result provides a convenient basis for consistent co-integration rank selection in most empirical contexts.
4. SIMULATIONS

Simulations were conducted to evaluate the finite sample performance of the criteria under various generating mechanisms for the short memory component $u_t$, different settings for the true cointegrating rank, and various choices of the penalty coefficient $C_n$. Some illustrative findings for cases of dimension m = 2 and 4 are reported here. The data generating process follows (1.1). When m = 2, the design is as follows. For $r_0 = 0$ we have $\alpha\beta' = 0$. For $r_0 = 1$ the reduced rank coefficient structure is set so that
$$\alpha\beta' = R_1 = \begin{pmatrix}1\\0.5\end{pmatrix}(-1,\ 1).$$
For $r_0 = 2$, two different designs (A and B) were simulated, one with smaller and the other with larger stationary roots, as follows:
$$A:\ \alpha\beta' = R_2 = \begin{pmatrix}-0.5 & 0.1\\ 0.2 & -0.4\end{pmatrix},\quad\text{with stationary roots } \lambda_i[I + \beta'\alpha] = \{0.7, 0.4\};$$
$$B:\ \alpha\beta' = R_3 = \begin{pmatrix}-0.5 & 0.1\\ 0.2 & -0.15\end{pmatrix},\quad\text{with stationary roots } \lambda_i[I + \beta'\alpha] = \{0.9, 0.45\}.$$
When the dimension m = 4, the matrix $\alpha\beta'$ was constructed to have a block diagonal form reflecting the true cointegrating rank. We call the four dimensional set-up design C in what follows. For $r_0 = 0$ we have $\alpha\beta' = 0$. Let
$$R_4 = \begin{pmatrix}2\\0.5\end{pmatrix}(-1,\ 1)\quad\text{and}\quad R_5 = \begin{pmatrix}-0.7 & 0.1\\ 0.2 & -0.6\end{pmatrix}.$$
For $r_0 = 1$, the reduced rank coefficient structure is set to
$$\alpha\beta' = \mathrm{diag}\{R_4, 0, 0\},\quad\text{with stationary root } \lambda_i[I + \beta'\alpha] = -0.5.$$
For $r_0 = 2$,
$$\alpha\beta' = \mathrm{diag}\{R_5, 0, 0\},\quad\text{with stationary roots } \lambda_i[I + \beta'\alpha] = \{0.2, 0.5\}.$$
For $r_0 = 3$,
$$\alpha\beta' = \mathrm{diag}\{R_2, R_4\},\quad\text{with stationary roots } \lambda_i[I + \beta'\alpha] = \{0.4, 0.7, -0.5\}.$$
For $r_0 = 4$,
$$\alpha\beta' = \mathrm{diag}\{R_2, R_3\},\quad\text{with stationary roots } \lambda_i[I + \beta'\alpha] = \{0.4, 0.7, 0.45, 0.9\}.$$
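The stationary roots quoted above are the eigenvalues of $I_{r_0} + \beta'\alpha$, equivalently the non-unit eigenvalues of $I_m + \alpha\beta'$, and can be checked directly. A small sketch (our own verification of the design matrices; $R_4$ is taken to be $(2, 0.5)'(-1, 1)$, consistent with the stated root of $-0.5$):

```python
import numpy as np

R1 = np.outer([1.0, 0.5], [-1.0, 1.0])       # alpha*beta' for r0 = 1, m = 2
R2 = np.array([[-0.5, 0.1], [0.2, -0.4]])    # design A, r0 = 2
R3 = np.array([[-0.5, 0.1], [0.2, -0.15]])   # design B, r0 = 2
R4 = np.outer([2.0, 0.5], [-1.0, 1.0])       # design C, r0 = 1 block
R5 = np.array([[-0.7, 0.1], [0.2, -0.6]])    # design C, r0 = 2 block

def roots(Pi):
    """Eigenvalues of I + Pi, sorted ascending; the non-unit ones are the
    stationary roots lambda_i[I + beta'alpha]."""
    m = Pi.shape[0]
    return np.sort(np.linalg.eigvals(np.eye(m) + Pi).real)
```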
Simulations were conducted with AR(1), MA(1) and ARMA(1,1) errors, corresponding to the models
$$u_t = Au_{t-1} + \varepsilon_t,\qquad u_t = \varepsilon_t + B\varepsilon_{t-1}\qquad\text{and}\qquad u_t = Au_{t-1} + \varepsilon_t + B\varepsilon_{t-1}, \qquad (4.1)$$
respectively, with coefficient matrices $A = \psi I_m$, $B = \phi I_m$, where $|\psi| < 1$, $|\phi| < 1$, and with innovations $\varepsilon_t$ i.i.d. $N(0, \Sigma_\varepsilon)$, where $\Sigma_\varepsilon = \mathrm{diag}\{1+\theta, 1-\theta\} > 0$ when m = 2 and $\Sigma_\varepsilon = \mathrm{diag}\{1+\theta_1, 1-\theta_1, 1+\theta_2, 1-\theta_2\}$ when m = 4. The parameters for these models were set to $\psi = \phi = 0.4$, $\theta = 0.25$, $\theta_1 = 0.25$ and $\theta_2 = 0.4$. The performance of the criteria AIC, BIC, HQ and log(HQ) was investigated for sample sizes n = 100 in design A, n = 100, 400 in design B and n = 100, 250 and 400 in design C.¹ All cases included 50 additional observations to eliminate start-up effects from the initializations $X_0 = 0$ and $\varepsilon_0 = 0$. The results are based on 20,000 replications and are summarized in Fig. 1, which shows the results for design A, in Table 1, which shows the results for design B, and in Table 2, which shows the results for design C, with correct selections in bold type. The results displayed are for the model with AR(1) errors. Similar results were obtained for the other error generating schemes in (4.1). As is evident in Fig. 1, the BIC criterion generally performs very well when n ≥ 100. For design B, where the stationary roots of the system are closer to unity, BIC has a tendency to underestimate rank when n = 100 and $r_0 = 2$, thereby choosing more parsimoniously parameterized systems in this case, just as it does in lag length selection in autoregressions. But BIC performs well when n = 400, as seen in Table 1. The tendency of AIC to overestimate rank is also clear in Fig. 1, but this tendency is noticeably attenuated when the true rank is 1 and is naturally delimited when the true rank is 2 because of the upper bound on rank choice. For design B, AIC performs better than BIC when
¹ log(HQ) has penalty coefficient $C_n = \log(2\log\log n)$.
Figure 1. Cointegrating rank selection in design A when $u_t$ is AR(1) and n = 100. (Panels (a) AIC, (b) BIC, (c) HQ and (d) log(HQ) plot selection probabilities against the selected rank $\hat r$ for each true rank $r_0$.)
the cointegrating rank is 2 and the system is stationary, as does HQ, for which the penalty is $C_n < 2$ when n = 100, 400. Criteria with weaker penalties, such as log(HQ) with $C_n = \log(2\log\log n)$, also do better in this case, although for other cases they perform much less satisfactorily than AIC and HQ, showing a strong tendency to overestimate cointegrating rank. Design C is a more extensive set-up for the higher dimensional system with m = 4. As shown in Table 2, BIC generally performs well when n = 250 and the pattern follows that of the two dimensional set-up. When $r_0 = 4$ and the system is stationary, BIC tends to underestimate cointegrating rank because one stationary root is close to unity. BIC also performs better when some stationary roots are negative than it does when all stationary roots are positive. We found, for example, that when the cointegrating rank is 3 with the three positive stationary roots {0.4, 0.5, 0.7}, BIC can perform poorly even for n = 250. However, if the distribution of the stationary roots is more balanced (for example, if one of the roots is negative, as in {−0.5, 0.4, 0.7} in design C), performance improves significantly. Based on overall performance, it seems that BIC can be recommended for practical work in choosing cointegrating rank, and it gives generally very sharp results when n ≥ 250. The main weakness of BIC is its tendency to choose more parsimonious models (i.e. models with more unit roots) under the following conditions: (i) when the system is stationary and has a root near unity; (ii) when there are many positive stationary roots; and (iii) when the sample size is small and the system dimension is large.²

² Simulations (not reported here) showed that the tendency for BIC to select models with more unit roots is exacerbated when the criterion IC*(r), which has a stronger penalty, is used.

Wang and Bessler (2005) reported some related simulation work under the assumption that it is known that the time series are already transformed into a form where the observed variables are
Table 1. Cointegrating rank selection in design B when u_t follows an AR(1) process.

                       n = 100                   n = 400
            r = 0   r = 1   r = 2     r = 0   r = 1   r = 2
r0 = 0
  AIC        0.48    0.40    0.12      0.53    0.36    0.11
  BIC        0.88    0.11    0.01      0.95    0.05    0.00
  HQ         0.35    0.47    0.18      0.47    0.40    0.13
  Log(HQ)    0.03    0.44    0.53      0.07    0.49    0.44
r0 = 1
  AIC        0.00    0.78    0.22      0.00    0.76    0.24
  BIC        0.00    0.94    0.06      0.00    0.97    0.03
  HQ         0.00    0.71    0.29      0.00    0.74    0.26
  Log(HQ)    0.00    0.40    0.60      0.00    0.46    0.54
r0 = 2
  AIC        0.00    0.25    0.75      0.00    0.00    1.00
  BIC        0.05    0.74    0.21      0.00    0.02    0.98
  HQ         0.00    0.14    0.86      0.00    0.00    1.00
  Log(HQ)    0.00    0.02    0.98      0.00    0.00    1.00
either stationary or integrated. In the present context, this is equivalent to setting $\alpha\beta'$ to a diagonal matrix with elements of either zero or unity. The problem of cointegrating rank selection in this simpler framework is equivalent to direct unit root testing on each variable. We may therefore use the selection method of Phillips (2008) to estimate the co-integration rank by conducting a unit root test on each time series and simply counting the number of unit roots obtained. Simulations (not reported here) indicate that this procedure works well. However, since the transformation that takes the model into a canonical form where the observed variables are either stationary or integrated is seldom known, this procedure is generally not practical for estimating cointegrating rank.
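A stripped-down version of the Monte Carlo exercise behind Tables 1 and 2 is easy to reproduce. The sketch below (our own minimal re-implementation, not the authors' code) uses the m = 2, $r_0 = 1$ design with i.i.d. N(0, I) errors, n = 100 with a 50-observation burn-in, the BIC penalty $C_n = \log n$, and only 200 replications, so the frequencies will differ somewhat from the tables:

```python
import numpy as np

def select_rank(X, Cn):
    """Choose r minimizing IC(r) = log|Sigma_hat(r)| + Cn n^{-1}(2mr - r^2)."""
    dX, Xl = np.diff(X, axis=0), X[:-1]
    n, m = dX.shape
    S00, S11, S01 = dX.T @ dX / n, Xl.T @ Xl / n, dX.T @ Xl / n
    # eigenvalues of S11^{-1} S10 S00^{-1} S01, i.e. the roots of (2.5)
    lam = np.linalg.eigvals(np.linalg.solve(S11, S01.T)
                            @ np.linalg.solve(S00, S01)).real
    lam = np.sort(lam)[::-1]
    ld0 = np.linalg.slogdet(S00)[1]
    crit = [ld0 + np.log(1.0 - lam[:r]).sum() + Cn * (2*m*r - r*r) / n
            for r in range(m + 1)]
    return int(np.argmin(crit))

rng = np.random.default_rng(7)
Pi = np.outer([1.0, 0.5], [-1.0, 1.0])     # R1: true rank r0 = 1
reps = 200                                  # 20,000 replications in the paper
freq = np.zeros(3)
for _ in range(reps):
    X = np.zeros((151, 2))
    for t in range(150):                    # 50 burn-in + n = 100
        X[t + 1] = X[t] + Pi @ X[t] + rng.standard_normal(2)
    freq[select_rank(X[50:], np.log(100))] += 1   # BIC penalty Cn = log n
freq /= reps
```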
5. CONCLUSION

Model selection for cointegrating rank treats rank as an order parameter and provides the convenience of consistent estimation of this parameter under weak conditions on the expansion rate of the penalty coefficient. The approach is easy to implement in practice and is sympathetic with other semiparametric approaches to estimation and inference in cointegrating systems where the focus is on long-run behaviour. Information criteria such as (1.2) provide a useful
Table 2. Cointegrating rank selection in design C when u_t follows an AR(1) process. n = 250.

            r = 0   r = 1   r = 2   r = 3   r = 4
r0 = 0
  AIC        0.13    0.40    0.31    0.12    0.04
  BIC        0.94    0.06    0.00    0.00    0.00
  HQ         0.58    0.33    0.08    0.01    0.00
  Log(HQ)    0.00    0.09    0.31    0.40    0.20
r0 = 1
  AIC        0.00    0.34    0.43    0.18    0.05
  BIC        0.00    0.96    0.04    0.00    0.00
  HQ         0.00    0.75    0.21    0.03    0.00
  Log(HQ)    0.00    0.05    0.30    0.45    0.20
r0 = 2
  AIC        0.00    0.00    0.54    0.36    0.10
  BIC        0.00    0.02    0.93    0.05    0.00
  HQ         0.00    0.00    0.80    0.17    0.03
  Log(HQ)    0.00    0.00    0.23    0.50    0.26
r0 = 3
  AIC        0.00    0.00    0.00    0.82    0.18
  BIC        0.00    0.00    0.10    0.88    0.02
  HQ         0.00    0.00    0.00    0.93    0.07
  Log(HQ)    0.00    0.00    0.00    0.66    0.34
r0 = 4
  AIC        0.00    0.00    0.00    0.00    1.00
  BIC        0.00    0.00    0.04    0.16    0.80
  HQ         0.00    0.00    0.00    0.02    0.98
  Log(HQ)    0.00    0.00    0.00    0.00    1.00
diagnostic check on system cointegrating rank and proceed as if there were no prior information on cointegrating rank. If prior information were available and could be formulated as prior probabilities on the models of different rank, then this could, of course, be incorporated into a Bayes factor. In subsequent work, Cheng and Phillips (2008) show that consistent cointegrating rank selection by information criteria continues to hold in models where there is unconditional heterogeneity in the error variance of unknown form, including breaks in the variance or smooth transition functions in the variance over time. Such permanent changes in variance are known to invalidate both unit root tests and likelihood ratio tests for cointegrating rank because of their effects on the limit distribution theory under the null (see Cavaliere, 2004, Beare, 2007, and Cavaliere and Taylor, 2007). Since consistency of the information criteria is unaffected by the presence of this form of variance-induced non-stationarity, the approach offers an additional degree of robustness in cointegrating rank determination that is useful in empirical applications. Cheng and Phillips (2008) give an empirical application of this theory to exchange rate dynamics. Some applications of the methods outlined here are possible in other models. First, rather than work with reduced rank regression formulations within a vector autoregressive framework, it is possible to use reduced rank formulations in regressions of the time series on a fixed (or expanding) number of deterministic basis functions such as time polynomials or sinusoidal polynomials (Phillips, 2005). In a similar way to the present analysis, it can be shown that information criteria such as BIC and HQ will be consistent for cointegrating rank in such coordinate systems.
The coefficient matrix in such systems turns out to have a random limit, corresponding to the matrix of random variables that appear in the Karhunen–Loève representation (Phillips, 1998), but has a rank that is the same as the dimension of the cointegrating space, which enables consistent rank estimation by information criteria. A second application is to dynamic factor panel models with a fixed number of stochastically trending unobserved factors, as in Bai and Ng (2004). Again, these models have reduced rank structure (this time with non-random coefficients) and the number of factors may be consistently estimated using model selection criteria of the same type as those considered here, but in the presence of an increasing number of incidental loading coefficients. In such cases, the BIC penalty, as derived from the asymptotic behaviour of the Bayes factor, has a different form from usual and typically involves both cross section and time series sample sizes. Some extensions of the present methods to these models will be reported in later work.
ACKNOWLEDGMENTS Our thanks go to the Editor and a referee for helpful comments on the original version. Cheng acknowledges support from an Anderson Fellowship. Phillips acknowledges support from the NSF under Grant No. SES 06-47086.
REFERENCES

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In B. N. Petrov and F. Csaki (Eds.), Second International Symposium on Information Theory, 267–81. Budapest: Akademiai Kiado.
Akaike, H. (1977). On entropy maximization principle. In P. R. Krishnaiah (Ed.), Applications of Statistics, 27–41. Amsterdam: North-Holland.
Bai, J. and S. Ng (2004). A PANIC attack on unit roots and cointegration. Econometrica 72, 1127–77.
Beare, B. (2007). Robustifying unit root tests to permanent changes in innovation variance. Working paper, Yale University.
Cavaliere, G. (2004). Unit root tests under time-varying variance shifts. Econometric Reviews 23, 259–92.
Cavaliere, G. and A. M. R. Taylor (2007). Testing for unit roots in time series models with non-stationary volatility. Journal of Econometrics 140, 919–47.
Chao, J. and P. C. B. Phillips (1999). Model selection in partially non-stationary vector autoregressive processes with reduced rank structure. Journal of Econometrics 91, 227–71.
Cheng, X. and P. C. B. Phillips (2008). Cointegrating rank selection in models with time-varying variance. Working paper, Yale University.
Hannan, E. J. and B. G. Quinn (1979). The determination of the order of an autoregression. Journal of the Royal Statistical Society, Series B 41, 190–5.
Johansen, S. (1988). Statistical analysis of cointegration vectors. Journal of Economic Dynamics and Control 12, 231–54.
Johansen, S. (1995). Likelihood-Based Inference in Cointegrated Vector Autoregressive Models. Oxford: Oxford University Press.
Kapetanios, G. (2004). The asymptotic distribution of the cointegration rank estimator under the Akaike information criterion. Econometric Theory 20, 735–42.
Nielsen, B. (2006). Order determination in general vector autoregressions. IMS Lecture Notes–Monograph Series 52, 93–112.
Phillips, P. C. B. (1991). Optimal inference in cointegrated systems. Econometrica 59, 283–306.
Phillips, P. C. B. (1995). Fully modified least squares and vector autoregression. Econometrica 63, 1023–78.
Phillips, P. C. B. (1996). Econometric model determination. Econometrica 64, 763–812.
Phillips, P. C. B. (1998). New tools for understanding spurious regressions. Econometrica 66, 1299–326.
Phillips, P. C. B. (2005). Challenges of trending time series econometrics. Mathematics and Computers in Simulation 68, 401–16.
Phillips, P. C. B. (2008). Unit root model selection. Journal of the Japan Statistical Society 38, 65–74.
Phillips, P. C. B. and J. McFarland (1997). Forward exchange market unbiasedness: the case of the Australian dollar since 1984. Journal of International Money and Finance 16, 885–907.
Phillips, P. C. B. and W. Ploberger (1996). An asymptotic theory of Bayesian inference for time series. Econometrica 64, 381–413.
Phillips, P. C. B. and V. Solo (1992). Asymptotics for linear processes. Annals of Statistics 20, 971–1001.
Pötscher, B. M. (1989). Model selection under nonstationarity: autoregressive models and stochastic linear regression models. Annals of Statistics 17, 1257–74.
Rissanen, J. (1978). Modeling by shortest data description. Automatica 14, 465–71.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics 6, 461–4.
Tsay, R. S. (1984). Order selection in nonstationary autoregressive models. Annals of Statistics 12, 1425–33.
Wang, Z. and D. A. Bessler (2005). A Monte Carlo study on the selection of cointegration rank using information criteria. Econometric Theory 21, 593–620.
Wei, C. Z. (1992). On predictive least squares principles. Annals of Statistics 20, 1–42.
APPENDIX

A.1. Normalization restrictions and degrees of freedom

In triangular system specifications (Phillips, 1991) the cointegrating matrix β in (1.1) takes the form $\beta' = [I_r, -B]$ for some unrestricted $r\times(m-r)$ matrix B, which involves $r^2$ restrictions and leads to the degrees of freedom $2mr - r^2$ in A. Under normalization restrictions of the form (2.4) on β that are conventionally employed in empirical reduced rank regression modelling, the degrees of freedom term would be $2mr - r(r+1)/2$, leading to the alternate criterion
$$IC^*(r) = \log|\hat\Sigma(r)| + C_n n^{-1}\big(2mr - r(r+1)/2\big).$$
In this case the outer product form of the coefficient matrix in (1.1) implies that $A = \alpha\beta' = \alpha CC'\beta'$ for an arbitrary orthogonal matrix C, so that α and β are not uniquely identified even though the likelihood is well defined. In such cases, only the cointegrating rank and the cointegrating space are identified and consistently estimable. Correspondingly, under this normalization there are more degrees of freedom in the system. However, the usual justification for the BIC criterion (Schwarz, 1978, and Ploberger and Phillips, 1996) involves finding an asymptotic approximation to the Bayesian data density (and hence the posterior probability of the model), which is obtained by Laplace approximation methods using a Taylor series expansion of the log likelihood around a consistent parameter estimate. In the reduced rank regression case, $r^2$ restrictions on β are required to identify the structural parameters, as in the above formulation $\beta' = [I_r, -B]$. If only normalization restrictions such as $\beta'\beta = I_r$ are imposed, then we can write
$$A = \alpha\beta' = \alpha CC'(I_r + BB')^{-1/2}[I_r, -B],$$
with $\beta' = (I_r + BB')^{-1/2}[I_r, -B]$ and where C is an arbitrary orthogonal matrix. In this case, C is unidentified, and if C has a uniform prior distribution on the orthogonal group O(r) independent of the prior on (α, B), then C may be integrated out of the Bayesian data density or marginal likelihood. The data density then has the same form as it does for the case where
$$A = \alpha(I_r + BB')^{-1/2}[I_r, -B] = \bar\alpha\,[I_r, -B],\qquad \bar\alpha = \alpha(I_r + BB')^{-1/2},$$
where $\bar\alpha$ and B are identified. In this event, the model selection criterion is the same as that given in (1.2).
A.2. Proofs

LEMMA A1. Under (1.1) and Assumption LP,
$$\Sigma_{00}^{-1} - \Sigma_{00}^{-1}\Sigma_{0\beta}\big(\Sigma_{\beta 0}\Sigma_{00}^{-1}\Sigma_{0\beta}\big)^{-1}\Sigma_{\beta 0}\Sigma_{00}^{-1} = \Sigma_{00}^{-1/2}c_\perp(c_\perp'c_\perp)^{-1}c_\perp'\Sigma_{00}^{-1/2},$$
where $c = \Sigma_{00}^{-1/2}\Sigma_{0\beta}$ and $c_\perp$ is an orthogonal complement to $c$. Defining $\tilde\alpha = \Sigma_{0\beta}\Sigma_{\beta\beta}^{-1} = \Sigma_{00}^{1/2}c\,\Sigma_{\beta\beta}^{-1}$ and $\tilde\alpha_\perp = \Sigma_{00}^{-1/2}c_\perp$, we have the alternate form
$$\Sigma_{00}^{-1/2}c_\perp(c_\perp'c_\perp)^{-1}c_\perp'\Sigma_{00}^{-1/2} = \tilde\alpha_\perp\big(\tilde\alpha_\perp'\Sigma_{00}\tilde\alpha_\perp\big)^{-1}\tilde\alpha_\perp'. \qquad (A.1)$$
When $\Sigma_{0\beta} = \alpha\Sigma_{\beta\beta}$, (A.1) reduces to
$$\Sigma_{00}^{-1/2}c_\perp(c_\perp'c_\perp)^{-1}c_\perp'\Sigma_{00}^{-1/2} = \alpha_\perp\big(\alpha_\perp'\Sigma_{00}\alpha_\perp\big)^{-1}\alpha_\perp'.$$

Proof: Since $[c, c_\perp]$ is non-singular we have $I = c(c'c)^{-1}c' + c_\perp(c_\perp'c_\perp)^{-1}c_\perp'$, and then
$$\Sigma_{00}^{-1} - \Sigma_{00}^{-1}\Sigma_{0\beta}\big(\Sigma_{\beta 0}\Sigma_{00}^{-1}\Sigma_{0\beta}\big)^{-1}\Sigma_{\beta 0}\Sigma_{00}^{-1} = \Sigma_{00}^{-1/2}\big\{I - c(c'c)^{-1}c'\big\}\Sigma_{00}^{-1/2} = \Sigma_{00}^{-1/2}c_\perp(c_\perp'c_\perp)^{-1}c_\perp'\Sigma_{00}^{-1/2},$$
as required. Observe that when $\Sigma_{0\beta} = \alpha\Sigma_{\beta\beta}$ we have $c = \Sigma_{00}^{-1/2}\Sigma_{0\beta} = \Sigma_{00}^{-1/2}\alpha\Sigma_{\beta\beta}$, and we may choose $c_\perp = \Sigma_{00}^{1/2}\alpha_\perp$, where $\alpha_\perp$ is an orthogonal complement to $\alpha$. In that case we have
$$\Sigma_{00}^{-1/2}c_\perp(c_\perp'c_\perp)^{-1}c_\perp'\Sigma_{00}^{-1/2} = \alpha_\perp\big(\alpha_\perp'\Sigma_{00}\alpha_\perp\big)^{-1}\alpha_\perp',$$
as stated. This corresponds with the result in Johansen (1995, lemma 10.1), where $u_t$ is a martingale difference. In the present semiparametric case, $\Delta X_t = u_t + \alpha\beta'X_{t-1} = u_t + \alpha v_{t-1}$ and the covariance $\Sigma_{vu}(1) = E(v_{t-1}u_t')$ is generally non-zero, so that
$$\Sigma_{\beta\beta} = E(v_tv_t') = \Sigma_{vv}(0), \qquad (A.2)$$
$$\Sigma_{\beta 0} = E\{v_{t-1}(u_t + \alpha v_{t-1})'\} = \Sigma_{vu}(1) + \Sigma_{vv}(0)\alpha' = \Sigma_{vu}(1) + \Sigma_{\beta\beta}\alpha', \qquad (A.3)$$
$$\Sigma_{00} = \alpha\Sigma_{\beta\beta}\alpha' + \alpha\Sigma_{vu}(1) + \Sigma_{uv}(-1)\alpha' + \Sigma_{uu}(0). \qquad (A.4)$$
Note that $\tilde\alpha = \Sigma_{0\beta}\Sigma_{\beta\beta}^{-1} = \Sigma_{00}^{1/2}c\,\Sigma_{\beta\beta}^{-1}$ and we may choose $\tilde\alpha_\perp = \Sigma_{00}^{-1/2}c_\perp$. In this notation, we may write in the general case
$$\Sigma_{00}^{-1/2}c_\perp(c_\perp'c_\perp)^{-1}c_\perp'\Sigma_{00}^{-1/2} = \tilde\alpha_\perp\big(\tilde\alpha_\perp'\Sigma_{00}\tilde\alpha_\perp\big)^{-1}\tilde\alpha_\perp', \qquad (A.5)$$
as given in (A.1).
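The projection identity of Lemma A1 is a finite-dimensional matrix fact and can be verified numerically (our own check; random matrices stand in for the population moments $\Sigma_{00}$ and $\Sigma_{0\beta}$):

```python
import numpy as np

rng = np.random.default_rng(1)
m, r = 4, 2
# random stand-ins for Sigma_00 > 0 and Sigma_0beta of full column rank
T = rng.standard_normal((m, m))
S00 = T @ T.T + m * np.eye(m)
S0b = rng.standard_normal((m, r))
Sb0 = S0b.T

# left side of Lemma A1
lhs = np.linalg.inv(S00) - np.linalg.inv(S00) @ S0b @ np.linalg.inv(
    Sb0 @ np.linalg.solve(S00, S0b)) @ Sb0 @ np.linalg.inv(S00)

# right side: c = S00^{-1/2} S0b, c_perp an orthogonal complement of c
w, U = np.linalg.eigh(S00)
S00_mh = U @ np.diag(w ** -0.5) @ U.T                 # S00^{-1/2}
c = S00_mh @ S0b
c_perp = np.linalg.qr(c, mode='complete')[0][:, r:]   # orthonormal complement
rhs = S00_mh @ c_perp @ np.linalg.inv(c_perp.T @ c_perp) @ c_perp.T @ S00_mh
```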
Proof of Lemma 3.1: Since both $\Delta X_t = u_t + \alpha\beta'X_{t-1}$ and $v_t = \beta'X_t$ are stationary and satisfy Assumption LP, the law of large numbers gives
$$S_{00} = n^{-1}\sum_{t=1}^{n}\Delta X_t\Delta X_t' \to_p \Sigma_{00} = \Sigma_{uu}(0) + \alpha\Sigma_{vv}(0)\alpha' + \alpha\Sigma_{vu}(1) + \Sigma_{uv}(-1)\alpha',$$
$$\beta'S_{11}\beta = n^{-1}\sum_{t=1}^{n}\beta'X_{t-1}X_{t-1}'\beta \to_p \Sigma_{\beta\beta} = \Sigma_{vv}(0),\quad\text{and}$$
$$\beta'S_{10} = n^{-1}\sum_{t=1}^{n}\beta'X_{t-1}\Delta X_t' \to_p \Sigma_{\beta 0} = \Sigma_{vu}(1) + \Sigma_{vv}(0)\alpha'.$$
In view of (3.3) we have
$$\beta_\perp'X_t = \beta_\perp'C\sum_{s=1}^{t}u_s + \beta_\perp'\alpha(\beta'\alpha)^{-1}R(L)\beta'u_t + \beta_\perp'CX_0 = (\alpha_\perp'\beta_\perp)^{-1}\alpha_\perp'\Big(\sum_{s=1}^{t}u_s + X_0\Big) + \beta_\perp'\alpha(\beta'\alpha)^{-1}R(L)\beta'u_t,$$
so that the standardized process $n^{-1/2}\beta_\perp'X_{[n\cdot]} \Rightarrow (\alpha_\perp'\beta_\perp)^{-1}\alpha_\perp'B_u(\cdot)$, and from (3.4) we have
$$n^{-1/2}\sum_{s=1}^{[n\cdot]}\beta'X_s \Rightarrow -(\beta'\alpha)^{-1}\beta'B_u(\cdot). \qquad (A.6)$$
It follows by conventional weak convergence methods that
$$n^{-1}\beta_\perp'S_{11}\beta_\perp \Rightarrow (\alpha_\perp'\beta_\perp)^{-1}\alpha_\perp'\Big(\int_0^1 B_uB_u'\Big)\alpha_\perp(\beta_\perp'\alpha_\perp)^{-1},$$
$$\beta_\perp'(S_{10}-S_{11}\beta\alpha') = \beta_\perp'\,n^{-1}\sum_{t=1}^{n}X_{t-1}\big(\Delta X_t - \alpha\beta'X_{t-1}\big)' = \sum_{t=1}^{n}\frac{\beta_\perp'X_{t-1}}{\sqrt n}\frac{u_t'}{\sqrt n} \Rightarrow (\alpha_\perp'\beta_\perp)^{-1}\alpha_\perp'\int_0^1 B_u\,dB_u' + \Lambda_{wu},$$
$$\beta_\perp'S_{11}\beta = \sum_{t=1}^{n}\frac{\beta_\perp'X_{t-1}}{\sqrt n}\frac{(\beta'X_{t-1})'}{\sqrt n} \Rightarrow -(\alpha_\perp'\beta_\perp)^{-1}\alpha_\perp'\int_0^1 B_u\,dB_u'\,\beta(\alpha'\beta)^{-1} + \Lambda_{wv},$$
where
$$\Lambda_{wu} = \sum_{h=1}^{\infty}E\{\beta_\perp'\Delta X_t\,u_{t+h}'\}\quad\text{and}\quad \Lambda_{wv} = \sum_{h=0}^{\infty}E\{\beta_\perp'\Delta X_t(\beta'X_{t+h})'\}$$
are one-sided long-run covariance matrices involving $w_t = \beta_\perp'\Delta X_t$, $u_t$ and $v_t$. Note that $w_t := \beta_\perp'\Delta X_t = \beta_\perp'u_t + \beta_\perp'\alpha v_{t-1}$, so we may deduce the explicit forms
$$\Lambda_{wu} = \sum_{h=1}^{\infty}E\{\beta_\perp'\Delta X_t\,u_{t+h}'\} = \beta_\perp'\sum_{h=1}^{\infty}E\{u_tu_{t+h}'\} + \beta_\perp'\alpha\sum_{h=1}^{\infty}E\{v_{t-1}u_{t+h}'\} = \beta_\perp'\Lambda_{uu} + \beta_\perp'\alpha\big[\Lambda_{vu} - \Sigma_{vu}(1)\big], \qquad (A.7)$$
and
$$\Lambda_{wv} = \sum_{h=0}^{\infty}E\{\beta_\perp'\Delta X_t(\beta'X_{t+h})'\} = \beta_\perp'\sum_{h=0}^{\infty}E\{u_tv_{t+h}'\} + \beta_\perp'\alpha\sum_{h=0}^{\infty}E\{v_{t-1}v_{t+h}'\} = \beta_\perp'\big(\Lambda_{uv} + \Sigma_{uv}(0)\big) + \beta_\perp'\alpha\Lambda_{vv}. \qquad (A.8)$$
Finally, using (A.6) and standard limit theory again, we obtain
$$\beta_\perp'S_{10} = \sum_{t=1}^{n}\frac{\beta_\perp'X_{t-1}}{\sqrt n}\frac{(u_t + \alpha v_{t-1})'}{\sqrt n} \Rightarrow (\alpha_\perp'\beta_\perp)^{-1}\alpha_\perp'\int_0^1 B_u\,dB_u' - (\alpha_\perp'\beta_\perp)^{-1}\alpha_\perp'\int_0^1 B_u\,dB_u'\,\beta(\alpha'\beta)^{-1}\alpha' + \Lambda_{wu} + \Lambda_{wv}\alpha'$$
$$= (\alpha_\perp'\beta_\perp)^{-1}\alpha_\perp'\int_0^1 B_u\,dB_u'\,\big\{I - \beta(\alpha'\beta)^{-1}\alpha'\big\} + \Lambda_{wu} + \Lambda_{wv}\alpha' = (\alpha_\perp'\beta_\perp)^{-1}\alpha_\perp'\int_0^1 B_u\,dB_u'\,\alpha_\perp(\beta_\perp'\alpha_\perp)^{-1}\beta_\perp' + \Lambda_{wu} + \Lambda_{wv}\alpha',$$
since $\beta(\alpha'\beta)^{-1}\alpha' + \alpha_\perp(\beta_\perp'\alpha_\perp)^{-1}\beta_\perp' = I$ (e.g. Johansen, 1995, p. 39).
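The final step uses the oblique projection decomposition $\beta(\alpha'\beta)^{-1}\alpha' + \alpha_\perp(\beta_\perp'\alpha_\perp)^{-1}\beta_\perp' = I$ (Johansen, 1995, p. 39), which can also be checked numerically (our own sketch with random full-rank α and β):

```python
import numpy as np

rng = np.random.default_rng(3)
m, r = 5, 2
alpha = rng.standard_normal((m, r))
beta = rng.standard_normal((m, r))

def perp(A):
    """Orthonormal basis of the orthogonal complement of col(A)."""
    return np.linalg.qr(A, mode='complete')[0][:, A.shape[1]:]

a_perp, b_perp = perp(alpha), perp(beta)
# oblique projections onto col(beta) and col(alpha_perp)
P1 = beta @ np.linalg.inv(alpha.T @ beta) @ alpha.T
P2 = a_perp @ np.linalg.inv(b_perp.T @ a_perp) @ b_perp.T
```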
Proof of Lemma 3.2: Let S(λ) = λS 11 − S 10 S −1 00 S 01 , so that the determinantal equation (2.5) is |S(λ)| = 0. Defining P n = [β, n−1/2 β ⊥ ] and using Lemma 3.1, we have P (S (λ)) Pn n λβ S β λn−1/2 β S11 β⊥ 11 = λn−1/2 β⊥ S11 β λn−1 β⊥ S11 β⊥ −1 −1 β S10 S00 S01 β n−1/2 β S10 S00 S01 β⊥ − −1/2 −1 −1 n β⊥ S10 S00 S01 β n−1 β⊥ S10 S00 S01 β⊥ ⎡ ⎤ λββ 0 −1 β0 00 0β 0
1 ⎢ ⎥ ⇒ ⎣ − ⎦ −1 0 0 0 λ(α⊥ β⊥ )−1 α⊥ Bu Bu α⊥ β⊥ α⊥ 0
1 −1 −1 −1 = λββ − β0 00 0β λ(α⊥ β⊥ ) α⊥ Bu Bu α⊥ β⊥ α⊥ . (A.9) 0
The determinantal equation
$$\big|\lambda\Sigma_{\beta\beta} - \Sigma_{\beta 0}\Sigma_{00}^{-1}\Sigma_{0\beta}\big|\left|\lambda(\alpha_\perp'\beta_\perp)^{-1}\alpha_\perp'\int_0^1 B_u B_u'\,\alpha_\perp(\beta_\perp'\alpha_\perp)^{-1}\right| = 0$$
has $m - r_0$ zero roots and $r_0$ positive roots given by the solutions of
$$\big|\lambda\Sigma_{\beta\beta} - \Sigma_{\beta 0}\Sigma_{00}^{-1}\Sigma_{0\beta}\big| = 0. \qquad (A.10)$$
Thus, the $r_0$ largest roots of (2.5) converge to the roots of (A.10) and the remainder converge to zero. Defining $P = [\beta, \beta_\perp]$, we have
$$\big|P' S(\lambda) P\big| = \begin{vmatrix} \beta' S(\lambda)\beta & \beta' S(\lambda)\beta_\perp \\ \beta_\perp' S(\lambda)\beta & \beta_\perp' S(\lambda)\beta_\perp \end{vmatrix} = \big|\beta' S(\lambda)\beta\big|\,\big|\beta_\perp'\big\{S(\lambda) - S(\lambda)\beta[\beta' S(\lambda)\beta]^{-1}\beta' S(\lambda)\big\}\beta_\perp\big|. \qquad (A.11)$$
As in Johansen (1995, theorem 11.1), we let $n\to\infty$ and $\lambda\to 0$ such that $\rho = n\lambda = O_p(1)$. Using Lemma 3.1, we have
$$\beta' S(\lambda)\beta = \rho n^{-1}\beta' S_{11}\beta - \beta' S_{10}S_{00}^{-1}S_{01}\beta = -\Sigma_{\beta 0}\Sigma_{00}^{-1}\Sigma_{0\beta} + o_p(1),$$
$$\beta_\perp' S(\lambda)\beta_\perp = \rho n^{-1}\beta_\perp' S_{11}\beta_\perp - \beta_\perp' S_{10}S_{00}^{-1}S_{01}\beta_\perp, \quad\text{and}$$
$$\beta_\perp' S(\lambda)\beta = \rho n^{-1}\beta_\perp' S_{11}\beta - \beta_\perp' S_{10}S_{00}^{-1}S_{01}\beta = -\beta_\perp' S_{10}S_{00}^{-1}S_{01}\beta + o_p(1). \qquad (A.12)$$
Define
$$N_n = S_{00}^{-1} - S_{00}^{-1}S_{01}\beta\big(\beta' S_{10}S_{00}^{-1}S_{01}\beta\big)^{-1}\beta' S_{10}S_{00}^{-1}.$$
Using Lemmas 3.1 and A1, we have
$$N_n = \Sigma_{00}^{-1} - \Sigma_{00}^{-1}\Sigma_{0\beta}\big(\Sigma_{\beta 0}\Sigma_{00}^{-1}\Sigma_{0\beta}\big)^{-1}\Sigma_{\beta 0}\Sigma_{00}^{-1} + o_p(1) = \alpha_\perp\big(\alpha_\perp'\Sigma_{00}\alpha_\perp\big)^{-1}\alpha_\perp' + o_p(1). \qquad (A.13)$$
By (A.12) and (A.13), the second factor in (A.11) becomes
$$\beta_\perp'\big\{S(\lambda) - S(\lambda)\beta[\beta' S(\lambda)\beta]^{-1}\beta' S(\lambda)\big\}\beta_\perp = \rho n^{-1}\beta_\perp' S_{11}\beta_\perp - \beta_\perp' S_{10}N_n S_{01}\beta_\perp + o_p(1)$$
$$= \rho n^{-1}\beta_\perp' S_{11}\beta_\perp - \beta_\perp' S_{10}\,\alpha_\perp\big(\alpha_\perp'\Sigma_{00}\alpha_\perp\big)^{-1}\alpha_\perp'\, S_{01}\beta_\perp + o_p(1). \qquad (A.14)$$
By Lemma 3.1, we have
$$\beta_\perp'\big\{S(\lambda) - S(\lambda)\beta[\beta' S(\lambda)\beta]^{-1}\beta' S(\lambda)\big\}\beta_\perp \sim \rho\,(\alpha_\perp'\beta_\perp)^{-1}\alpha_\perp'\int_0^1 B_u B_u'\,\alpha_\perp(\beta_\perp'\alpha_\perp)^{-1}$$
$$- \left[(\alpha_\perp'\beta_\perp)^{-1}\alpha_\perp'\int_0^1 B_u\, dB_u'\,\alpha_\perp(\beta_\perp'\alpha_\perp)^{-1}\beta_\perp' + \Lambda\right]\alpha_\perp\big(\alpha_\perp'\Sigma_{00}\alpha_\perp\big)^{-1}\alpha_\perp'\left[\beta_\perp(\alpha_\perp'\beta_\perp)^{-1}\alpha_\perp'\int_0^1 dB_u\, B_u'\,\alpha_\perp(\beta_\perp'\alpha_\perp)^{-1} + \Lambda'\right]$$
$$= \rho\int_0^1 G_u G_u' - \left[\int_0^1 G_u\, dG_u'\,\beta_\perp' + \Lambda\right]\alpha_\perp\big(\alpha_\perp'\Sigma_{00}\alpha_\perp\big)^{-1}\alpha_\perp'\left[\beta_\perp\int_0^1 dG_u\, G_u' + \Lambda'\right],$$
where $\Lambda = \Lambda^1_{wu} + \Lambda_{wv}\alpha'$ and $G_u(r) = (\alpha_\perp'\beta_\perp)^{-1}\alpha_\perp' B_u(r)$ is Brownian motion with variance matrix $(\alpha_\perp'\beta_\perp)^{-1}\alpha_\perp'\,\Omega_{uu}\,\alpha_\perp(\beta_\perp'\alpha_\perp)^{-1}$, $\Omega_{uu}$ denoting the variance matrix of $B_u$. Equations (A.11), (A.14) and Lemma 3.1 reveal that the $m - r_0$ smallest solutions of (2.5), normalized by $n$, converge to the solutions of the equation
$$\left|\,\rho\int_0^1 G_u G_u' - \left[\int_0^1 G_u\, dG_u'\,\beta_\perp' + \Lambda\right]\alpha_\perp\big(\alpha_\perp'\Sigma_{00}\alpha_\perp\big)^{-1}\alpha_\perp'\left[\beta_\perp\int_0^1 dG_u\, G_u' + \Lambda'\right]\right| = 0, \qquad (A.15)$$
as stated.

Proof of Theorem 3.1:

Part (a): Let $IC_{r_0}(r)$ denote the information criterion defined in (1.2) when the true cointegration rank is $r_0$. Cointegrating rank is estimated by minimizing $IC_{r_0}(r)$ for $0 \le r \le m$. To check the consistency of this estimator, we need to compare $IC_{r_0}(r)$ with $IC_{r_0}(r_0)$ for any $r \ne r_0$. When $r > r_0$, using (1.2) and (2.9), we have
$$IC_{r_0}(r) - IC_{r_0}(r_0) = \sum_{i=r_0+1}^{r}\log\big(1-\hat\lambda_i\big) + C_n n^{-1}\big[(2mr - r^2) - (2mr_0 - r_0^2)\big] = \sum_{i=r_0+1}^{r}\log\big(1-\hat\lambda_i\big) + C_n n^{-1}(r-r_0)(2m-r-r_0). \qquad (A.16)$$
In order to consistently select $r_0$ with probability 1 as $n\to\infty$, we need
$$\sum_{i=r_0+1}^{r}\log\big(1-\hat\lambda_i\big) + C_n n^{-1}(r-r_0)(2m-r-r_0) > 0 \qquad (A.17)$$
with probability 1 as $n\to\infty$ for any $r_0 < r < m$. From (3.10), we know that $\hat\lambda_i = O_p(n^{-1})$ for all $i = r_0+1, \ldots, r$. Expanding $\log(1-\hat\lambda_i)$, we have
$$\sum_{i=r_0+1}^{r}\log\big(1-\hat\lambda_i\big) = -\sum_{i=r_0+1}^{r}\hat\lambda_i + o_p(n^{-1}) = O_p(n^{-1}). \qquad (A.18)$$
Using (A.18) and Lemma 3.2, we then have
$$n\left(\sum_{i=r_0+1}^{r}\log\big(1-\hat\lambda_i\big) + C_n n^{-1}(r-r_0)(2m-r-r_0)\right) = -\sum_{i=r_0+1}^{r} n\hat\lambda_i + C_n(r-r_0)(2m-r-r_0) + o_p(1), \qquad (A.19)$$
where $n\hat\lambda_i$, $i = r_0+1, \ldots, r$, are $O_p(1)$. As such, as long as $C_n\to\infty$ as $n\to\infty$, the second term on the right-hand side of (A.19) dominates, which delivers (A.17) as $n\to\infty$. Hence, if the penalty coefficient $C_n\to\infty$, cointegrating rank $r > r_0$ will never be selected; that is, a system with too few unit roots will never be selected in such cases. Thus, the criteria BIC and HQ will never select excessive cointegrating rank as $n\to\infty$. On the other hand, the AIC penalty is fixed at $C_n = 2$ for all $n$, so we may expect AIC to select models with excessive cointegrating rank with positive probability as $n\to\infty$. This corresponds to a more liberally parametrized system.

When $r < r_0$,
$$IC_{r_0}(r) - IC_{r_0}(r_0) = -\sum_{i=r+1}^{r_0}\log\big(1-\hat\lambda_i\big) + C_n n^{-1}\big[(2mr - r^2) - (2mr_0 - r_0^2)\big] = -\sum_{i=r+1}^{r_0}\log\big(1-\hat\lambda_i\big) + C_n n^{-1}(r-r_0)(2m-r-r_0). \qquad (A.20)$$
In order to consistently select $r_0$ with probability 1 as $n\to\infty$, we need
$$-\sum_{i=r+1}^{r_0}\log\big(1-\hat\lambda_i\big) + C_n n^{-1}(r-r_0)(2m-r-r_0) > 0 \quad\text{as } n\to\infty. \qquad (A.21)$$
From Lemma 3.2, we know that $0 < \hat\lambda_i < 1$ for $i = r+1, \ldots, r_0$. So the first term on the right-hand side of (A.20) is positive and bounded away from 0, while the second term is negative and of order $O(C_n n^{-1})$. For (A.21) to hold as $n\to\infty$, we therefore require only that $C_n/n = o(1)$, i.e. that the penalty coefficient grow more slowly than $n$. Each of the criteria AIC, BIC and HQ satisfies $C_n/n \to 0$, so all three select models with insufficient cointegrating rank (or excess unit roots) with probability zero asymptotically. Combining the conditions on $C_n$ for $r > r_0$ and $r < r_0$, it follows that the information criterion leads to consistent estimation of the cointegration rank provided the penalty coefficient satisfies $C_n\to\infty$ and $C_n/n\to 0$ as $n\to\infty$.
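In practical terms, the rule analysed here minimizes $IC(r) = \sum_{i\le r}\log(1-\hat\lambda_i) + C_n n^{-1}(2mr - r^2)$ over $0 \le r \le m$. A minimal sketch of this selection step, assuming the ordered eigenvalues from the reduced-rank regression are already available (names are illustrative, not from the paper):

```python
import numpy as np

def select_rank(lam, n, Cn):
    """Choose cointegrating rank by minimizing IC(r), where lam holds the
    ordered eigenvalues lambda_1 > ... > lambda_m from the reduced-rank
    regression and Cn is the penalty coefficient (2 for AIC, log n for BIC,
    2 log log n for HQ)."""
    m = len(lam)
    ic = [np.sum(np.log(1 - lam[:r])) + Cn / n * (2 * m * r - r * r)
          for r in range(m + 1)]
    return int(np.argmin(ic))

# One large and one near-zero eigenvalue: BIC picks rank 1.
lam = np.array([0.5, 0.01])
print(select_rank(lam, n=200, Cn=np.log(200)))  # -> 1
```

With the weaker AIC penalty $C_n = 2$, borderline eigenvalues are more likely to be counted, illustrating the over-selection tendency discussed above.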
Part (b): Under AIC, $C_n = 2$. The limiting probability that $AIC(r_0) \le AIC(r)$ for any $r < r_0$ is given by
$$\lim_{n\to\infty} P\{AIC(r_0) \le AIC(r)\} = \lim_{n\to\infty} P\left\{-\sum_{i=r+1}^{r_0}\log\big(1-\hat\lambda_i\big) + 2n^{-1}(r-r_0)(2m-r-r_0) > 0\right\}$$
$$= \lim_{n\to\infty} P\left\{\sum_{i=r+1}^{r_0}\log\big(1-\hat\lambda_i\big) < 2n^{-1}(r-r_0)(2m-r-r_0)\right\} = 1, \qquad (A.22)$$
because $0 < \lambda_i < 1$ for $i = r+1, \ldots, r_0$ are the $r_0 - r$ smallest solutions to (3.9), so that $\sum_{i=r+1}^{r_0}\log(1-\lambda_i) < 0$, giving (A.22). Hence, when $r_0$ is the true rank, AIC will not select any $r < r_0$ as $n\to\infty$, i.e.
$$\lim_{n\to\infty} P\big(\hat r_{AIC} = r \,\big|\, r < r_0\big) = 0. \qquad (A.23)$$
Let $\xi_{r_0+1} > \cdots > \xi_m$ be the ordered roots of the limiting determinantal equation (3.10). When $r' > r \ge r_0$, $AIC(r) < AIC(r')$ iff
$$\sum_{i=r+1}^{r'}\log\big(1-\hat\lambda_i\big) + C_n n^{-1}(r'-r)(2m-r'-r) > 0,$$
so that the limiting probability that $r$ is chosen over $r'$ is
$$\lim_{n\to\infty} P\big\{AIC(r) < AIC(r')\big\} = \lim_{n\to\infty} P\left\{-\sum_{i=r+1}^{r'} n\hat\lambda_i + 2(r'-r)(2m-r'-r) > 0\right\} = P\left\{\sum_{i=r+1}^{r'}\xi_i < 2(r'-r)(2m-r'-r)\right\}. \qquad (A.24)$$
Accordingly, the probability that AIC selects rank $r$ is the probability that $r$ is chosen over every other $r' \ge r_0$. This probability is
$$\lim_{n\to\infty} P\big(\hat r_{AIC} = r \,\big|\, r > r_0\big) = P\left\{\bigcap_{r'=r+1}^{m}\left\{\sum_{i=r+1}^{r'}\xi_i < 2(r'-r)(2m-r'-r)\right\} \cap \bigcap_{r'=r_0}^{r-1}\left\{\sum_{i=r'+1}^{r}\xi_i > 2(r-r')(2m-r-r')\right\}\right\}, \qquad (A.25)$$
where the first part is the limiting probability that $r$ is chosen over all $r' > r$ and the second part is the probability that $r$ is chosen over all $r_0 \le r' < r$. Ranks below $r_0$ are not taken into account here because they are always dominated in the limit by $r_0$, by (A.23). The probability that the cointegration rank $r_0$ is consistently estimated by AIC as $n\to\infty$ is
$$\lim_{n\to\infty} P\big(\hat r_{AIC} = r_0\big) = P\left\{\bigcap_{r=r_0+1}^{m}\left\{\sum_{i=r_0+1}^{r}\xi_i < 2(r-r_0)(2m-r-r_0)\right\}\right\}. \qquad (A.26)$$
This is a special case of (A.25) with $r = r_0$.
The unit root case. When the system order is $m = 1$, the procedure provides a mechanism for unit root testing. If $r_0 = 0$, i.e. the model has a unit root, we have by (A.26)
$$\lim_{n\to\infty} P\big(\hat r_{AIC} = 1 \,\big|\, r_0 = 0\big) = P\{\xi_1 > 2\} = 1 - P\{\xi_1 < 2\} \quad\text{and}\quad \lim_{n\to\infty} P\big(\hat r_{AIC} = 0 \,\big|\, r_0 = 0\big) = P\{\xi_1 < 2\}, \qquad (A.27)$$
where $\xi_1$ is the solution to (3.10) when $m = 1$ and $r_0 = 0$. In this case, we see that
$$\xi_1 = \frac{\Big(\int_0^1 G_u\, dG_u\,\beta_\perp + \lambda\Big)^2}{\Sigma_{00}\int_0^1 G_u^2} = \frac{\Big(\int_0^1 B_u\, dB_u + \lambda\Big)^2}{\Sigma_{00}\int_0^1 B_u^2},$$
since $G_u = B_u$, $\alpha_\perp = 1$, $\beta_\perp = 1$ and $\Lambda = \lambda$ in this case. If $r_0 = 1$, so that the model is stationary, we have
$$\lim_{n\to\infty} P\big(\hat r_{AIC} = 0 \,\big|\, r_0 = 1\big) = 0 \quad\text{and}\quad \lim_{n\to\infty} P\big(\hat r_{AIC} = 1 \,\big|\, r_0 = 1\big) = 1, \qquad (A.28)$$
using (A.22). These results for the scalar case $m = 1$ are consistent with those in Phillips (2008).
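The probability $P\{\xi_1 < 2\}$ in (A.27) can be approximated by simulating the discretized functional, here assuming $\Sigma_{00} = 1$ and $\lambda = 0$, i.e. no serial correlation (the simulation design is ours, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)

def xi1_draw(n=400):
    """One draw of xi_1 = (int_0^1 B dB)^2 / int_0^1 B^2 from a discretized
    standard random walk, using int B dB ~ n^{-1} sum_t S_{t-1} u_t."""
    u = rng.standard_normal(n)
    B = np.cumsum(u) / np.sqrt(n)                 # B(t/n) = S_t / sqrt(n)
    int_BdB = np.sum(B[:-1] * u[1:]) / np.sqrt(n)  # ~ (B(1)^2 - 1) / 2
    int_B2 = np.mean(B ** 2)
    return int_BdB ** 2 / int_B2

# Monte Carlo estimate of P{xi_1 < 2}: the limiting probability that AIC
# retains the unit root (selects rank 0) when r_0 = 0.
p = np.mean([xi1_draw() < 2 for _ in range(4000)])
```

The functional is the squared Dickey–Fuller-type ratio, so this is effectively simulating the limiting AIC decision in the scalar unit root case.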
The Econometrics Journal (2009), volume 12, pp. S105–S134. doi: 10.1111/j.1368-423X.2009.00280.x

Distribution-free specification tests for dynamic linear models

MIGUEL A. DELGADO†, JAVIER HIDALGO‡ AND CARLOS VELASCO†

†Universidad Carlos III, 28903 Madrid, Spain
E-mail: [email protected], [email protected]

‡London School of Economics, Houghton Street, London WC2A 2AE, UK
E-mail: [email protected]

First version received: July 2008; final version accepted: December 2008
Summary: This article proposes goodness-of-fit tests for dynamic regression models, where regressors are allowed to be only weakly exogenous and arbitrarily correlated with past shocks. The null hypothesis is stated in terms of the lack of serial correlation of the errors of the model. The tests are based on a linear transformation of a Bartlett $T_p$-process of the residuals. This transformation approximates the martingale component of the process, so that the transformed process converges weakly to the standard Brownian motion under the null hypothesis. One feature of our setup is that we do not need to specify the dynamic structure of the regressors. Because of this, the transformation employs a semi-parametric correction that does not restrict the class of local alternatives our tests can detect, in contrast with other work using smoothing techniques. A Monte Carlo study illustrates the finite sample performance of the tests.

Keywords: Dynamic models, Empirical processes, Exogeneity, Goodness-of-fit, Local alternatives, Martingale decomposition.
1. INTRODUCTION

Delgado et al. (2005) (DHV henceforth) proposed asymptotically distribution-free tests for the correct parametric specification of the autocorrelation structure of a time series process. The tests were based on a parametric transformation of Bartlett's (1954) $T_p$-process, which entails considering its martingale component, so that asymptotically the transformed process converges to a standard Brownian motion. The tests were applied to observable data, so there was no need to compute the residuals of a model, and the martingale transformation depended only on a set of unknown parameters under the null hypothesis.

This paper extends the DHV procedure to test the specification of dynamic regression models. Here, we use the empirical spectral process of the residuals of the model because, in the presence of general explanatory variables, regression models do not completely specify the dynamics of the dependent variable, unlike the linear models studied by DHV. The transformation of the corresponding $T_p$-process depends, apart from the unknown parameters, on the non-parametric cross-spectrum between the regressors and the regression error term, which is non-constant and different from zero when regressors are only assumed to be weakly exogenous. A feasible transformation might be computed via a non-parametric smoothed estimator of this cross-spectrum. However, we show that we can avoid the smoothing in the feasible martingale transformation by using the cross-periodogram directly, even though it is an inconsistent estimate of the cross-spectrum. In spite of this non-parametric aspect of our model, our tests have non-trivial power against local alternatives converging to the null at the parametric rate $n^{1/2}$.

The remainder of the paper is organized as follows. Section 2 introduces the model and describes the testing problem. Section 3 presents the transformation used to obtain asymptotically distribution-free tests, whereas Section 4 discusses the power of our tests. Section 5 describes a Monte Carlo experiment that sheds light on the finite sample performance of our test and compares it with Portmanteau tests based on non-parametric smoothing as well as with directional and smooth tests. Finally, the proofs are placed in the Appendix.
2. DYNAMIC MODELS

This section discusses tests for the correct specification of the dynamic regression model
$$X_t = \mu_0 + \alpha_{01}X_{t-1} + \cdots + \alpha_{0p}X_{t-p} + \beta_0' Z_t + \varepsilon_t, \qquad (2.1)$$
where $Z_t$ is a $q$-dimensional vector of deterministic and/or (weakly) exogenous variables and where the parameter vector $\theta_0 = (\mu_0, \alpha_0', \beta_0')'$ is identified as the solution of the $p+q+1$ moment conditions
$$E\big[W_t\big(X_t - \theta' W_t\big)\big] = 0, \qquad (2.2)$$
where $W_t = (1, X_{t-1}, \ldots, X_{t-p}, Z_t')'$ and $E(W_t W_t')$ is a positive definite matrix. The models in (2.1), also known as ARX models, are an important extension of those examined in DHV. Notice that some components of $Z_t$ may be lagged values, for example $Z_{kt} = Z_{jt-\ell}$ for some $\ell \ge 1$.

In the context of model (2.1), a natural assumption is that
$$E\big[\varepsilon_t \,\big|\, \mathcal{F}\{\varepsilon_s, Z_{s+1}, s < t\}\big] = 0, \qquad (2.3)$$
where $\mathcal{F}\{\varepsilon_s, Z_{s+1}, s < t\}$ is the $\sigma$-algebra generated by $\{\varepsilon_s, Z_{s+1}, s < t\}$. Equation (2.3) implies that $E[Z_t\varepsilon_s] = 0$ for all $s \ge t$, although it allows for feedback from $\varepsilon_t$ to $Z_{t+j}$, $j > 0$. The latter implies that the cross-autocovariance of $Z_t$ and $\varepsilon_t$ may satisfy $\gamma_{Z\varepsilon}(j) = E[Z_{t+j}\varepsilon_t] \ne 0$ for some $j > 0$. Denoting by $f_{UV}$ the cross-spectral density function between sequences $\{U_t\}_{t\in\mathbb{Z}}$ and $\{V_t\}_{t\in\mathbb{Z}}$, one consequence of the latter is that the cross-spectral density between $\{Z_t\}_{t\in\mathbb{Z}}$ and $\{\varepsilon_t\}_{t\in\mathbb{Z}}$, $f_{Z\varepsilon}$, defined by
$$\gamma_{Z\varepsilon}(j) = \int_{-\pi}^{\pi} f_{Z\varepsilon}(\lambda)\, e^{ij\lambda}\, d\lambda, \quad j = 0, \pm 1, \pm 2, \ldots,$$
is not a null function. That is, the sequence $\{Z_t\}_{t\in\mathbb{Z}}$ is only predetermined in (2.1).

The null hypothesis of interest is that the errors $\{\varepsilon_t\}_{t\in\mathbb{Z}}$ in (2.1) are not autocorrelated; in other words, that the regression model (2.1) captures the linear dynamic structure of $\{X_t\}_{t\in\mathbb{Z}}$. More specifically, for given $\theta$, define the residuals $\{\varepsilon_t(\theta)\}_{t\in\mathbb{Z}}$ by
$$\varepsilon_t(\theta) := X_t - \theta' W_t, \qquad (2.4)$$
and their autocovariance structure by $\gamma_\varepsilon(j; \theta) := E\big(\varepsilon_t(\theta)\varepsilon_{t+j}(\theta)\big)$. Then, our null hypothesis of interest is
$$H_0: \gamma_\varepsilon(j; \theta_0) = 0 \quad\text{for all } |j| \ge 1 \text{ and some } \theta_0 \in \Theta \subset \mathbb{R}^{p+q+1}.$$
We are interested in omnibus tests, where the alternative hypothesis is the negation of the null. The compact set $\Theta := A \times \mathbb{R}^{q+1}$ is chosen such that, for all $\alpha \in A$, all the roots of the polynomial
$$\alpha(z) := 1 - \alpha_1 z - \cdots - \alpha_p z^p \qquad (2.5)$$
lie outside the unit disk. Notice that the least-squares estimator of the parameters may be inconsistent if $H_0$ does not hold, even if the true value of $\alpha$ is zero.

REMARK 2.1. It is worth mentioning that we could allow so-called ARMAX models, i.e. (2.1) with $\varepsilon_t = \varrho_1\varepsilon_{t-1} + \cdots + \varrho_\ell\varepsilon_{t-\ell} + \eta_t$. In this scenario, our null hypothesis would be that $\{\eta_t\}_{t\in\mathbb{Z}}$ is a white-noise sequence. However, we consider (2.1) because of its generality and mathematical simplicity in terms of arguments and notation. Also, the extension of (2.1) to non-linear specifications is fairly straightforward and will not be pursued in this paper.

As in DHV, we can write the null hypothesis $H_0$ in the frequency domain. Indeed, let $f_\varepsilon(\lambda; \theta)$ denote the spectral density function of $\{\varepsilon_t(\theta)\}_{t\in\mathbb{Z}}$ in (2.4), that is,
$$\gamma_\varepsilon(j; \theta) = \int_{-\pi}^{\pi} f_\varepsilon(\lambda; \theta)\exp(ij\lambda)\, d\lambda, \quad j = 0, \pm 1, \ldots,$$
and denote its spectral distribution function by
$$F_\varepsilon(\lambda; \theta) := 2\int_0^\lambda f_\varepsilon(\omega; \theta)\, d\omega.$$
Under $H_0$ these are respectively the spectral density and distribution functions of $\{\varepsilon_t(\theta_0)\}_{t\in\mathbb{Z}} = \{\varepsilon_t\}_{t\in\mathbb{Z}}$. Then, we can equivalently write the null hypothesis $H_0$ as
$$H_0: \frac{F_\varepsilon(\lambda; \theta_0)}{F_\varepsilon(\pi; \theta_0)} = \frac{\lambda}{\pi} \quad\text{for all } \lambda \in [0, \pi] \text{ and some } \theta_0 \in \Theta, \qquad (2.6)$$
the alternative hypothesis $H_1$ being the negation of $H_0$. Thus, the null hypothesis $H_0$ in (2.6) states that there exists a parameter value $\theta_0 \in \Theta$ such that the sequence $\{\varepsilon_t(\theta_0)\}_{t\in\mathbb{Z}}$ has a constant spectral density function, i.e. the errors are uncorrelated.

A natural estimator of $F_\varepsilon(\lambda; \theta)$ is
$$\hat F_n(\lambda; \theta) := \frac{2\pi}{\tilde n}\sum_{j=1}^{[\tilde n\lambda/\pi]} I_{\varepsilon\varepsilon}(\lambda_j; \theta), \qquad (2.7)$$
where $\lambda_j := 2\pi j/n$ for $j = 1, \ldots, \tilde n$, $\tilde n := [n/2]$, $[\cdot]$ denoting the integer part, and
$$I_{\varepsilon\varepsilon}(\lambda; \theta) := \frac{1}{2\pi n}\left|\sum_{t=1}^{n}\varepsilon_t(\theta)\, e^{it\lambda}\right|^2$$
is the periodogram of the sequence $\{\varepsilon_t(\theta)\}_{t=1}^n$ defined in (2.4). In what follows, for a generic function $g(\cdot; \theta)$, we shall suppress any reference to $\theta$ when the function is evaluated at the true value $\theta_0$; that is, $g(\cdot; \theta_0) =: g(\cdot)$. Observe that the estimator $\hat F_n(\lambda; \theta)$ is location invariant, due to the omission of $j = 0$ in (2.7). Thus, there is no need to centre the residuals or to estimate the mean $\mu$ in (2.1). See Remark 2.2 below for a more explicit explanation and some implications.
If the true value of $\theta$, $\theta_0$, were known, or equivalently if we could observe the sequence $\{\varepsilon_t\}_{t=1}^n$, then, following Bartlett (1954), we might perform a goodness-of-fit test using the $T_p$-process
$$\hat T_n(\omega; \theta) := \tilde n^{1/2}\left(\frac{\hat F_n(\pi\omega; \theta)}{\hat F_n(\pi; \theta)} - \omega\right), \quad \omega \in [0, 1], \qquad (2.8)$$
evaluated at $\theta = \theta_0$. Recall that in this case we denote $\hat T_n(\omega; \theta_0)$ by $\hat T_n(\omega)$. Before we present the properties of $\hat T_n(\omega)$, let us introduce the following regularity assumption.

ASSUMPTION 2.1. $\{\varepsilon_t\}_{t\in\mathbb{Z}}$ is a zero mean sequence of random variables such that $E(\varepsilon_t\varepsilon_s) = \sigma_\varepsilon^2 I(t = s)$, $E[\varepsilon_t^k \,|\, \mathcal{F}_{t-1}] = \kappa_k$ for $k = 1, \ldots, 3$, and $E|\varepsilon_t|^k = \mu_k$ for $k = 3, \ldots, 8$ with $\mu_8 < \infty$, where $\mathcal{F}_{t-1}$ is the $\sigma$-algebra of events generated by $\{\varepsilon_s, Z_{s+1}, s < t\}$. Herewith, we denote the indicator function by $I(\cdot)$.

Assumption 2.1 is similar to that given in Dahlhaus (1985), who only assumed constant conditional moments up to the third order. This implies that the fourth-order spectral density function of the process $\{\varepsilon_t\}_{t\in\mathbb{Z}}$ is not necessarily constant (cf. lemma 2 in DHV). Now, denoting by $B(\omega)$ the standard Brownian bridge on $[0, 1]$, we have the following proposition.

PROPOSITION 2.1. Under Assumption 2.1, $\hat T_n(\cdot) \overset{d}{\Rightarrow} B(\cdot)$ in the Skorohod metric space $\mathcal{D}[0, 1]$.
The statistic given in (2.8) is not feasible, as it depends on the unknown vector of parameters $\theta_0$. To be able to compute (2.8), and so the test, we shall replace $\theta_0$ by, for example, the least-squares estimator, denoted $\hat\theta_n$.

ASSUMPTION 2.2. Under $H_0$, it holds that $\hat\theta_n - \theta_0 = O_p(n^{-1/2})$.

Sufficient conditions for Assumption 2.2 are the stationarity of $\{Z_t\}_{t\in\mathbb{Z}}$, condition (2.3) and $\gamma_{Z\varepsilon}(0) = 0$. Notice that, in contrast to DHV, Assumption 2.2 does not require a linear expansion of $\hat\theta_n$, only its rate of convergence. This is due to the explicit solution of the least-squares estimator. Also, we shall not give explicit conditions under which the sequence $\{Z_t\varepsilon_t\}_{t\in\mathbb{Z}}$, and so $\hat\theta_n$, satisfies the central limit theorem.

REMARK 2.2. It is worth noticing that the least-squares estimator of $(\alpha', \beta')'$ is given by the minimization of $\hat F_n(\pi; \theta)$. That is,
$$\big(\hat\alpha_n', \hat\beta_n'\big)' = \arg\min_{(\alpha', \beta')}\sum_{j=1}^{\tilde n}\big|w_X(\lambda_j) - \alpha' w_{X-}(\lambda_j) - \beta' w_Z(\lambda_j)\big|^2 = \arg\min_{(\alpha', \beta')}\hat F_n(\pi; \theta), \qquad (2.9)$$
where $w_X(\lambda_j)$, $w_{X-}(\lambda_j)$ and $w_Z(\lambda_j)$ are respectively the discrete Fourier transforms of $\{X_t\}_{t=1}^n$, $\{(X_{t-1}, \ldots, X_{t-p})'\}_{t=1}^n$ and $\{Z_t\}_{t=1}^n$. So, observing that we do not employ the frequency $\lambda_j = 0$ to compute $\hat F_n(\pi; \hat\theta_n)$, $\hat F_n(\pi; \hat\theta_n)$ is independent of the intercept estimator $\hat\mu_n$. The latter implies that the computation of $\hat T_n(\omega; \hat\theta_n)$ does not depend on the intercept $\mu$. For this reason, and to simplify notation, in what follows we shall assume that there is no intercept in (2.1) and accordingly
that $W_t = (X_{t-1}, \ldots, X_{t-p}, Z_t')'$ and $\theta = (\alpha', \beta')'$. Moreover, when we have trend regressors, such as polynomial trends, apart from a different rate of convergence of $\hat\theta_n$, the distribution of $\hat T_n(\omega; \hat\theta_n)$ is asymptotically independent of the estimation of the trend component of the regression model. Hence, in what follows we can consider the model
$$X_t = \alpha_{01}X_{t-1} + \cdots + \alpha_{0p}X_{t-p} + \beta_0' Z_t + \varepsilon_t \qquad (2.10)$$
without loss of generality. Also, notice that if we employed tapers, $\hat T_n(\omega; \hat\theta_n)$ would be invariant to the trend as well as to the intercept.

Now, once we have an estimator of the unknown parameters $\theta_0$, we can obtain the residuals as $\hat\varepsilon_t := \varepsilon_t(\hat\theta_n) = X_t - \hat\theta_n' W_t$, and, with $I_{\hat\varepsilon\hat\varepsilon}(\lambda_j) := I_{\varepsilon\varepsilon}(\lambda_j; \hat\theta_n)$, we set
$$\hat F_n\big(\omega; \hat\theta_n\big) := \frac{2\pi}{\tilde n}\sum_{j=1}^{[\tilde n\omega]} I_{\hat\varepsilon\hat\varepsilon}(\lambda_j).$$
So, the feasible $T_p$-process is defined as in (2.8) but with $\hat\theta_n$ replacing $\theta_0$. That is,
$$\hat T_n\big(\omega; \hat\theta_n\big) = \tilde n^{1/2}\left(\frac{\hat F_n\big(\pi\omega; \hat\theta_n\big)}{\hat F_n\big(\pi; \hat\theta_n\big)} - \omega\right). \qquad (2.11)$$
Before we describe the asymptotic properties of $\hat T_n(\omega; \hat\theta_n)$, we introduce the following regularity assumption.

ASSUMPTION 2.3. (i) The cross-spectrum $f_{Z\varepsilon}(\lambda)$ is differentiable at all $\lambda \in [-\pi, \pi]$. (ii) The spectral density matrix $f_{ZZ}(\lambda)$ is continuous for all $\lambda \in [-\pi, \pi]$. (iii) The higher-order (cross-) spectral densities up to the eighth order of $\{Z_t\}_{t\in\mathbb{Z}}$ and $\{\varepsilon_t\}_{t\in\mathbb{Z}}$ are bounded.

Assumption 2.3(i) could be replaced by some Lipschitz condition, but that might complicate some of the technical arguments. Nevertheless, the assumption as it stands is very mild and is satisfied by most models employed with real data.

Next, because all the roots of the polynomial $\alpha(z)$ in (2.5) are outside the unit disk, we obtain that the stationary solution of $X_t$ is given by $\alpha_0(L)^{-1}(\varepsilon_t - \beta_0' Z_t)$, where $\alpha_0(z)$ is defined in (2.5) with $\alpha = \alpha_0$. Thus, it follows that
$$f_{X-,\varepsilon}(\lambda) = \frac{L_p(e^{i\lambda})}{\alpha_0(e^{i\lambda})}\left(\frac{\sigma_\varepsilon^2}{2\pi} - \beta_0' f_{Z\varepsilon}(\lambda)\right), \qquad (2.12)$$
with $L_p(z) = (z, \ldots, z^p)'$, so that Assumption 2.3(i) implies that $f_{W\varepsilon}(\lambda)$ is differentiable everywhere in $\lambda \in [-\pi, \pi]$. One implication of (2.12) and Assumption 2.1 is that $\Phi(1) = 0$, where
$$\Phi(\omega) := \int_0^\omega \phi(v)\, dv, \quad \omega \in [0, 1],$$
and $\phi(\omega) = 4\pi\,\mathrm{Re} f_{W\varepsilon}(\pi\omega) = 4\pi\,\mathrm{Re}\big(f_{X-,\varepsilon}(\pi\omega)', f_{Z\varepsilon}(\pi\omega)'\big)'$, by orthogonality between $\{W_t\}_{t\in\mathbb{Z}}$ and $\{\varepsilon_t\}_{t\in\mathbb{Z}}$ and the evenness (oddness) of the real (imaginary) part of $f_{W\varepsilon}(\lambda)$. However, it is important to emphasize that we are not assuming that $f_{W\varepsilon}(\lambda) = 0$ for all $\lambda$. In fact, this is not the case, because $E[Z_t\varepsilon_s]$ can be different from zero for some $t > s$. This is one of the main features of our specification in (2.1)/(2.10).
On the other hand, Assumptions 2.3(i)–(ii) imply that $f_{WW}(\lambda)$ is bounded for all $\lambda \in [-\pi, \pi]$, because
$$f_{WW}(\lambda) = \begin{pmatrix} f_{X-,X-}(\lambda) & f_{X-,Z}(\lambda) \\ f_{ZX-}(\lambda) & f_{ZZ}(\lambda) \end{pmatrix},$$
with
$$f_{X-,X-}(\lambda) = \frac{L_p(e^{i\lambda})}{\alpha_0(e^{i\lambda})}\left(\frac{\sigma_\varepsilon^2}{2\pi} + 2\beta_0'\,\mathrm{Re} f_{Z\varepsilon}(\lambda) + \beta_0' f_{ZZ}(\lambda)\beta_0\right)\frac{L_p(e^{-i\lambda})'}{\alpha_0(e^{-i\lambda})}, \qquad f_{X-,Z}(\lambda) = \frac{L_p(e^{i\lambda})}{\alpha_0(e^{i\lambda})}\big(f_{\varepsilon Z}(\lambda) + \beta_0' f_{ZZ}(\lambda)\big).$$
Finally, Assumption 2.3(iii) implies eight finite moments for $Z_t$ and $X_t$, as assumed for $\varepsilon_t$ in Assumption 2.1. However, the requirement of bounded higher-order spectra of $\{W_t\}_{t\in\mathbb{Z}}$ can be relaxed as in DHV at the expense of much lengthier arguments.

In what follows, for two vector sequences $\{V_t\}_{t=1}^n$ and $\{U_t\}_{t=1}^n$, we denote their cross-periodogram by
$$I_{VU}(\lambda) := \frac{1}{2\pi n}\left(\sum_{t=1}^{n} V_t\, e^{it\lambda}\right)\left(\sum_{t=1}^{n} U_t\, e^{-it\lambda}\right)'.$$

PROPOSITION 2.2. Under Assumptions 2.1–2.3 and $H_0$, we have
$$\hat T_n\big(\omega; \hat\theta_n\big) = \hat T_n(\omega) - \frac{4\pi}{\sigma_\varepsilon^2\,\tilde n}\sum_{j=1}^{[\omega\tilde n]}\mathrm{Re}\, I_{\varepsilon W}(\lambda_j)\,\tilde n^{1/2}\big(\hat\theta_n - \theta_0\big) + o_p(1) = \hat T_n(\omega) - \Phi(\omega)'\,\tilde n^{1/2}\big(\hat\theta_n - \theta_0\big)\big/\sigma_\varepsilon^2 + o_p(1), \qquad (2.13)$$
where the $o_p(1)$ is uniform in $\omega \in [0, 1]$.

REMARK 2.3. The second equality in (2.13) follows because, under weak regularity conditions (Brillinger, 1981),
$$\sup_{\omega}\left|\frac{2\pi}{\tilde n}\sum_{j=1}^{[\omega\tilde n]}\big(I_{VU}(\lambda_j) - f_{VU}(\lambda_j)\big)\right| = o_p(1).$$

REMARK 2.4. Proceeding as in the proof of Theorem 2 of DHV, Propositions 2.1 and 2.2 imply that the asymptotic distribution of $\hat T_n(\omega; \hat\theta_n)$ depends, in general, on $\hat\theta_n$, and so on the model, as in other goodness-of-fit tests with estimated parameters. However, since the aim of the paper is to describe distribution-free (pivotal) tests, we will not explicitly examine the asymptotic distribution of $\hat T_n(\omega; \hat\theta_n)$.

REMARK 2.5. (Strong) exogeneity and predetermined regressors. When the regressors $Z_t$ are (strongly) exogenous, we have $f_{Z\varepsilon}(\lambda) = 0$ for all $\lambda \in [0, \pi]$, and hence $\phi(\omega) = (\phi_1(\omega)', 0_q')'$, where $\phi_1(\omega) := 4\pi\,\mathrm{Re} f_{X\varepsilon}(\pi\omega)$. The latter, together with (2.13), implies that
$$\hat T_n\big(\omega; \hat\theta_n\big) = \hat T_n(\omega) - \int_0^\omega \phi_1(v)'\, dv\;\tilde n^{1/2}\big(\hat\alpha_n - \alpha_0\big)\big/\sigma_\varepsilon^2 + o_p(1).$$
That is, as in the case where regressors are deterministic, the estimation of $\beta$ in (2.1) has no influence on the asymptotic distribution of $\hat T_n(\omega; \hat\theta_n)$; only the least-squares estimator of $\alpha_0$ does. Moreover, in this case the function $\Phi(\omega)$ is known up to a set of parameters which can be consistently estimated under Assumption 2.2. But this case was already covered by DHV, and hence it is not of interest in this paper. On the other hand, it is worth mentioning that the null hypothesis that one particular component of $Z_t$ is (strongly) exogenous can be tested using the methods put forward in the paper.

From Proposition 2.2 and Remark 2.4, it is obvious that tests based on continuous functionals of $\hat T_n(\omega; \hat\theta_n)$ are not pivotal, as their asymptotic distribution depends on the model specified under the null hypothesis $H_0$ and on the unknown function $\phi(\cdot)$. The latter function depends not only on $\theta_0$ but also on the joint dynamic properties of $\{Z_t\}_{t\in\mathbb{Z}}$ and $\{\varepsilon_t\}_{t\in\mathbb{Z}}$ described by $f_{Z\varepsilon}$, which is unknown to the practitioner. The next section introduces a linear transformation of $\hat T_n(\omega; \hat\theta_n)$ which converges weakly, under $H_0$, to the standard Brownian motion, denoted $B^0(\cdot)$, whose critical values are readily available.
3. DISTRIBUTION-FREE TESTS

We are looking for a linear transformation, say $\bar L$, such that $\bar L\hat T_n(\cdot; \hat\theta_n)$ converges weakly to the standard Brownian motion $B^0$ under $H_0$. This transformation must remove the effect of $\Phi(\omega)'\tilde n^{1/2}(\hat\theta_n - \theta_0)$ in the asymptotic linear expansion of $\hat T_n(\omega; \hat\theta_n)$; see Proposition 2.2. As pointed out in Remarks 2.2 and 2.4, we shall only consider the interesting case where the regressors $Z_t$ are only predetermined, but not strictly exogenous, so that the cross-spectral density $f_{Z\varepsilon}(\lambda)$ is not constant.

Abbreviating, for a generic function $h(\cdot)$, $h(\lambda_j)$ by $h_j$, and denoting $m_j = 2\pi I_{\varepsilon\varepsilon,j} - \sigma_\varepsilon^2$, we observe that, applying Proposition 2.2, we can write $\hat T_n(\omega; \hat\theta_n)$, up to terms of order $o_p(1)$, as
$$\frac{\tilde n^{-1/2}}{\hat F_n(\pi)}\sum_{j=1}^{[\omega\tilde n]} m_j - \frac{\omega\,\tilde n^{-1/2}}{\hat F_n(\pi)}\sum_{j=1}^{\tilde n} m_j - \frac{\Phi(\omega)'\,\tilde n^{1/2}}{\hat F_n(\pi)}\left(\sum_{j=1}^{\tilde n} I_{WW,j}\right)^{-1}\sum_{j=1}^{\tilde n}\mathrm{Re}\, I_{W\varepsilon,j}, \qquad (3.1)$$
which is similar to the corresponding expression in DHV but with our generic definition of $\phi(\omega)$. However, unlike in DHV, expression (3.1) cannot be directly identified as a CUSUM of least-squares residuals. Nevertheless, a similar martingale transformation based on a forward projection on the function $g(u) := (1, \phi(u)')'$ will remove the terms in (3.1) that depend on $\int_0^\omega g(u)\, du$, i.e. $\Phi(\omega)$ and $\omega$. The latter are the non-martingale components in the tied-down empirical process with estimated parameters $\hat T_n(\omega; \hat\theta_n)$.

So, following arguments similar to those in DHV, we propose as our transformation $\bar L$,
$$\bar L\hat T_n\big(\omega; \hat\theta_n\big) := \hat T_n\big(\omega; \hat\theta_n\big) - \frac{\tilde n^{-1/2}}{\hat F_n\big(\pi; \hat\theta_n\big)}\sum_{j=1}^{[\omega\bar n]} g_j'\left(\sum_{k=j+1}^{\tilde n} g_k g_k'\right)^{-1}\sum_{k=j+1}^{\tilde n} g_k \hat m_k, \qquad (3.2)$$
where $\hat m_k := 2\pi I_{\hat\varepsilon\hat\varepsilon,k} - \hat F_n(\pi; \hat\theta_n)$ and $\bar n = \tilde n - p - q - 1$. The limiting continuous version of $\bar L$ is defined, for a generic function $\xi : [0, 1] \to \mathbb{R}$, as
$$L^0\xi(\omega) := \xi(\omega) - \int_0^\omega g(v)'\,\Gamma^{-1}(v)\int_v^1 g(u)\,\xi(du)\, dv,$$
with $\Gamma(v) := \int_v^1 g(u)g(u)'\, du$. Before we examine the properties of $\bar L\hat T_n(\omega; \hat\theta_n)$ in (3.2), we need to introduce the following assumption.

ASSUMPTION 3.1. The matrix $\tilde n^{-1}\sum_{k=\bar n+1}^{\tilde n} g_k g_k'$ is non-singular.

THEOREM 3.1. Assume Assumptions 2.1–2.3 and 3.1. Then, under $H_0$,
$$\bar L\hat T_n\big(\omega; \hat\theta_n\big) \overset{d}{\Rightarrow} B^0 \quad\text{in the Skorohod metric space } \mathcal{D}[0, 1].$$

The transformation $\bar L$ is infeasible, as it depends on the unknown function $g(u)$. To construct a feasible version of $\bar L$, we need to replace $g(u)$ by some estimate. Recall from (2.12) that $\phi(\omega) = 4\pi\,\mathrm{Re}\big(f_{X-,\varepsilon}(\pi\omega)', f_{Z\varepsilon}(\pi\omega)'\big)'$ and, because $f_{Z\varepsilon}$ is an unknown function, $\phi$ is a non-parametric function. The latter is one of the main differences from DHV's paper. Because of this, we shall propose two feasible transformations. The first employs the standard averaged-periodogram estimator of the (scaled real part of the) cross-spectrum between $\{W_t\}_{t\in\mathbb{Z}}$ and $\{\varepsilon_t\}_{t\in\mathbb{Z}}$, i.e.
$$\hat\phi_{m,j} := \hat\phi_m(j/\tilde n) = \frac{4\pi}{\bar K_m}\sum_{\ell=-m;\,\ell\ne 0}^{m} K_\ell\,\mathrm{Re}\, I_{W\hat\varepsilon,j+\ell}, \qquad (3.3)$$
where $K_\ell = K(\ell/m)$ and $\bar K_m = \sum_{\ell=-m;\,\ell\ne 0}^{m} K_\ell$. The second approach replaces $f_{W\varepsilon}$ by the cross-periodogram. The latter is a much more delicate matter, as the periodogram is only an unbiased, not a consistent, estimator of $f_{W\varepsilon}$, unlike the former approach or that in DHV, where the function $\phi(\cdot)$ was known up to a finite set of parameters.

ASSUMPTION 3.2. (i) $K(x)$ is a non-negative continuous symmetric function on $[-1, 1]$. (ii) $m^{-2}n^{1+\delta} + mn^{-1} \to 0$ for some $\delta > 0$.

Non-parametric adjustment in related contexts has also been examined in Stute et al. (1998) and Stute and Zhu (2002). The estimator $\hat\phi_{m,j}$ is of the leave-one-out type, as it does not use the frequency $\lambda_j$ in its computation. This guarantees the finite-sample orthogonality of $\hat\phi_{m,j}$ with respect to $I_{\hat\varepsilon\hat\varepsilon,j}$ for all $m$, using the well-known approximate orthogonality of the discrete Fourier transform of vector time series at different Fourier frequencies. We also need to strengthen Assumption 2.3.

ASSUMPTION 2.3′. Assumption 2.3 holds and $f_{W\varepsilon}(\lambda)$ has two bounded derivatives.

Thus, in practice, we can take the discrete sample counterpart of $\bar L\hat T_n(\omega; \hat\theta_n)$,
$$\bar L_n\hat T_n\big(\omega; \hat\theta_n\big) := \hat T_n\big(\omega; \hat\theta_n\big) - \frac{\tilde n^{-3/2}}{\hat F_n\big(\pi; \hat\theta_n\big)}\sum_{j=1}^{[\bar n\omega]}\hat g_m\!\left(\frac{j}{\tilde n}\right)'\,\hat\Gamma_m^{-1}\!\left(\frac{j}{\tilde n}\right)\sum_{k=1+j}^{\tilde n}\hat g_m\!\left(\frac{k}{\tilde n}\right)\hat m_k, \qquad (3.4)$$
where $\hat g_m(\omega) := (1, \hat\phi_m(\omega)')'$ and $\hat\Gamma_m(\omega) := \tilde n^{-1}\sum_{j=1+[\tilde n\omega]}^{\tilde n}\hat g_{m,j}\hat g_{m,j}'$.

THEOREM 3.2. Under Assumptions 2.1–2.2, 2.3′ and 3.1–3.2, and under $H_0$,
$$\bar L_n\hat T_n\big(\omega; \hat\theta_n\big) \overset{d}{\Rightarrow} B^0 \quad\text{in the Skorohod metric space } \mathcal{D}[0, 1].$$
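The leave-one-out average in (3.3), which feeds $\hat g_m$ above, can be sketched as follows with a uniform kernel $K \equiv 1$, wrapping at the edge frequencies (an implementation choice the paper does not specify; names are ours):

```python
import numpy as np

def phi_hat(re_IWe, m):
    """Leave-one-out kernel average as in (3.3) for the real part of the
    cross-periodogram of one regressor (length-ntil array re_IWe), with
    uniform kernel K = 1, so K_l / Kbar_m = 1/(2m); circular wrap at edges."""
    ntil = len(re_IWe)
    out = np.empty(ntil)
    for j in range(ntil):
        idx = [(j + l) % ntil for l in range(-m, m + 1) if l != 0]
        out[j] = np.mean(re_IWe[idx])   # own frequency j is excluded
    return 4 * np.pi * out
```

Excluding the own frequency $\lambda_j$ is what delivers the finite-sample orthogonality with $I_{\hat\varepsilon\hat\varepsilon,j}$ discussed above.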
Note that the proof of this result does not show that $\sup_{\omega\in[0,1]}\big|\bar L_n\hat T_n(\omega; \hat\theta_n) - \bar L\hat T_n(\omega)\big| = o_p(1)$, as was necessary in DHV's proofs.

We now describe the unsmoothed version of the feasible transformation. Here the aim is to use the cross-periodogram instead of $g_k$ or a consistent estimate of it. We propose to employ the transformation
$$\check L_n\hat T_n\big(\omega; \hat\theta_n\big) = \hat T_n\big(\omega; \hat\theta_n\big) - \frac{\tilde n^{-1/2}}{\hat F_n\big(\pi; \hat\theta_n\big)}\sum_{j=1}^{[\bar n\omega]}\hat g_{j+1}'\left(\sum_{k=j+1}^{\tilde n}\hat g_{k+2}\hat g_{k+2}'\right)^{-1}\sum_{k=j+1}^{\tilde n}\hat g_{k+2}\hat m_{k+1}, \qquad (3.5)$$
where $\hat g_j = I_{W\hat\varepsilon,j}$, $j = 1, \ldots, \tilde n$.

The reason to employ, for example, $\sum_{k=j+1}^{\tilde n}\hat g_{k+2}\hat m_{k+1}$ instead of $\sum_{k=j+1}^{\tilde n}\hat g_k\hat m_k$ as in (3.4) is that, contrary to the latter, there is a leverage effect from $\hat g_{j+1}$ on $\sum_{k=j+1}^{\tilde n}\hat g_k\hat m_k$ which does not vanish sufficiently fast, unlike the case with the smoothed version or the case examined in DHV. At the same time, we guarantee that $\hat g_{k+2}\hat m_{k+1}$ is approximately centred, because $\hat g$ and $\hat m$ have different indices. Then, we have our next result.

THEOREM 3.3. Under Assumptions 2.1–2.2, 2.3′ and 3.1–3.2, and under $H_0$, the unsmoothed transformation in (3.5) satisfies
$$\check L_n\hat T_n\big(\omega; \hat\theta_n\big) \overset{d}{\Rightarrow} B^0 \quad\text{in the Skorohod metric space } \mathcal{D}[0, 1].$$

Theorems 3.2 and 3.3 justify asymptotically admissible tests based on continuous functionals of $\check L_n\hat T_n(\omega; \hat\theta_n)$, as stated in the following corollary.

COROLLARY 3.1. For any continuous functional $\varphi : \mathcal{D}[0, 1] \to \mathbb{R}^+$, under $H_0$ and the same conditions as in Theorem 3.3,
$$\varphi\big(\check L_n\hat T_n\big(\omega; \hat\theta_n\big)\big) \overset{d}{\to} \varphi\big(B^0\big).$$

Note that the non-parametric estimation does not affect the first-order asymptotics of the tests, which have the same limiting behaviour as if $g$ were known or parametrically modelled. However, the need to invert the $(p+q+1)\times(p+q+1)$ matrix $\hat\Gamma_m(\omega)$ on the discrete grid $\omega = j/\tilde n$ means this is only possible for $j = 1, \ldots, \bar n$, owing to the loss of degrees of freedom from estimating the parameters of the regression model (2.1). The distribution of $\varphi(B^0)$ can be tabulated by Monte Carlo. For the main goodness-of-fit proposals, Kolmogorov–Smirnov and Cramér–von Mises, $\varphi(B^0)$ is already tabulated, for instance in Shorack and Wellner (1986, pp. 34 and 748).
4. LOCAL ALTERNATIVES AND CONSISTENCY

We consider two types of local alternatives: first a parametric one, and secondly a more general non-parametric type of alternative, which may suggest or identify the origin of the possible misspecification of the model given in (2.1).

© The Author(s). Journal compilation © Royal Economic Society 2009.
S114
M. A. Delgado, J. Hidalgo and C. Velasco
4.1. Parametric alternatives

To study the power of our test, let us consider local alternatives of the type
$$ H_{an}:\ \alpha_{0,p+1}=\frac{c}{\tilde n^{1/2}}\quad\text{for some }c\neq0. \qquad(4.1) $$
Similar results are available for other forms of misspecification, including errors in the modelling of the relationship between the sequences $\{Z_t\}_{t\in\mathbb Z}$ and $\{X_t\}_{t\in\mathbb Z}$.

THEOREM 4.1. Under the same conditions as in Theorem 3.3, under $H_{an}$,
$$ \bar L_n\hat T_n\overset{d}{\Rightarrow}B^0+c\,L\Psi\ \text{in the Skorohod metric space }D[0,1], \qquad(4.2) $$
where $\Psi(\omega):=\sigma_\varepsilon^{-2}\int_0^{\omega}\phi_{p+1}(u)\,du$, with
$$ \phi_{p+1}(v):=4\pi\operatorname{Re}f_{\varepsilon X_{-p-1}}(\pi v)=\operatorname{Re}\bigg(\frac{\exp(i(p+1)\pi v)}{\alpha_0(e^{i\pi v})}\bigg)\big(2\sigma_\varepsilon^2+4\pi\beta_0'\operatorname{Re}f_{Z\varepsilon}(\pi v)\big). $$
REMARK 4.1. Under the set of assumptions of the previous section, the proposed test does not have trivial power, as stated in Theorem 4.1, provided $Z_t$ cannot explain all the information contained in $X_{t-p-1}$ at all frequencies, i.e. there is a set of positive Lebesgue measure where the spectral density matrix of $(Z_t',X_{t-p-1})'$ has full rank. This implies that, on a set of positive Lebesgue measure, the cross-spectral density $f_{X_{t-p-1}\varepsilon}(\lambda)$ is not a linear combination of the rows of $f_{Z\varepsilon}(\lambda)$, which guarantees that $L\Psi$ is not identically zero. Therefore, for a suitable continuous functional $\varphi:D[0,1]\to\mathbb R^+$, such as the Cramér–von Mises or the Kolmogorov–Smirnov one, $\Pr[\varphi(B^0+L\Psi)>\varphi(B^0)]=1$, and the test will detect local departures from the null of the type $H_{an}$ given in (4.1).

4.2. Non-parametric alternatives

We now consider the case where $\{\varepsilon_t\}_{t\in\mathbb Z}$ does not have a flat spectrum up to an $n^{-1/2}$ factor. Notice that $H_{an}$ implies that the spectral density function of $\{\varepsilon_t\}_{t\in\mathbb Z}$, where θ does not include $\alpha_{p+1}$, is
$$ f(\lambda;\theta_0)=\frac{\sigma_\varepsilon^2}{2\pi}+\frac{2c}{\tilde n^{1/2}}\operatorname{Re}f_{\varepsilon X_{-p-1}}(\lambda)+\frac{c^2}{\tilde n}f_{X_{-p-1}}(\lambda)=\frac{\sigma^2}{2\pi}\bigg(1+c\,\frac{\phi_{p+1}(\lambda/\pi)}{\sigma^2}\,\tilde n^{-1/2}+O\big(c^2\tilde n^{-1}\big)\bigg). $$
So we could consider non-parametric alternatives of the type
$$ H_{an}:\ f(\lambda;\theta_0)=\frac{\sigma^2}{2\pi}\big(1+l(\lambda)\,\tilde n^{-1/2}\big)\quad\text{for some }\theta_0\in\Theta, $$
where the function l(·) is not in the space spanned by φ(·/π). The latter implies that the correlation structure of $\{\varepsilon_t\}_{t\in\mathbb Z}$ cannot be explained either by lagged values of $X_t$ or by any of the components of the variables $Z_t$. It is worth noticing that the test has maximum power against alternatives for which l(·) belongs to the orthogonal complement of the space spanned by g. Then Theorem 4.1 holds for $H_{an}$ with $\Psi(\omega):=\int_0^{\omega}l(\pi u)\,du$ and c = 1 there.
Goodness-of-fit tests for dynamic models
S115
The test is consistent in the direction of general fixed non-parametric or parametric alternatives, such as (4.1) with fixed $\alpha_{p+1}=c$, $c\neq0$. Though a precise justification under suitable regularity conditions is possible, it is beyond the scope of this paper, so we only provide a sketch of the main arguments. Assuming certain regularity conditions (such as that α(L) has all its roots outside the unit circle), Assumption 2.1 could be replaced by a linear process specification, and Assumption 2.2 is satisfied under the alternative hypothesis $H_1$, where now $\theta_0$ denotes the pseudo-true value, defined by $\theta_0:=\arg\min_{\theta\in\Theta}F(\pi;\theta)$, which is such that the pseudo-innovations $\{\varepsilon_t(\theta_0)\}$ are autocorrelated under $H_1$. Denote by $f_\varepsilon(\lambda):=f_\varepsilon(\lambda;\theta_0)$ the (non-constant) spectral density of $\{\varepsilon_t\}_{t\in\mathbb Z}$. Indeed, proceeding as in DHV or Dahlhaus and Wefelmeyer (1996), we shall have that, for each ω ∈ [0, 1],
$$ \hat T_n\big(\omega;\hat\theta_n\big)=\hat T_n(\omega)+\Lambda(\omega)'\,\frac{\tilde n^{1/2}\big(\hat\theta_n-\theta_0\big)}{\hat F_n(\pi)}+o_p(1). $$
Now,
$$ \hat T_n(\omega)=\frac{2\pi}{\tilde n^{1/2}\hat F_n(\pi)}\sum_{j=1}^{[\tilde n\omega]}\bigg(\frac{I_{\varepsilon\varepsilon,j}}{f_{\varepsilon,j}}-1\bigg)f_{\varepsilon,j}+\tilde n^{1/2}\Bigg(\frac{2\pi}{\hat F_n(\pi)}\,\frac1{\tilde n}\sum_{j=1}^{[\tilde n\omega]}f_{\varepsilon,j}-\omega\Bigg), $$
where, under suitable regularity conditions, the first term on the right-hand side of the last display is $O_p(1)$, whereas the expression inside the parentheses of the second term converges to a constant for each ω. Thus, $|\hat T_n(\omega)|$ and $|\bar L_n\hat T_n(\omega)|$ diverge to infinity at the rate $n^{1/2}$. From here, the consistency of the test follows by standard arguments. Following the discussion in DHV, we can use Theorem 4.1 to derive optimal tests of $H_0$ against the direction l given in $H_{an}$. These test statistics are based on $\bar L_n\hat T_n(\lambda)$ and thus they are also asymptotically distribution-free under $H_0$.
5. MONTE CARLO EXPERIMENT

This section presents a small simulation exercise to shed some light on the small-sample behaviour of our tests. To that end, we have considered the ARX(1, 1) model
$$ X_t=\alpha_1X_{t-1}+\beta_1Z_{1t}+\varepsilon_t,\qquad t=1,\dots,n, \qquad(5.1) $$
where
$$ Z_{1t}=aZ_{1(t-1)}+u_t,\qquad u_t=\big(1-b^2\big)^{1/2}v_t+b\,\varepsilon_{t-1}, $$
and $\{v_t\}_{t\in\mathbb Z}$ and $\{\varepsilon_t\}_{t\in\mathbb Z}$ are mutually independent i.i.d. N(0, 1) variates. We have employed three sample sizes, n = 100, 200, 400, and the following values of the parameters:
$$ \alpha_1\in\{0.2,\,0.5,\,0.8\},\qquad \beta_1\in\{0.2,\,0.5,\,1.0\},\qquad b\in\{0,\,0.4,\,0.8\}, $$
whereas a = 0.5 for all combinations and sample sizes. The autoregressive parameters $\alpha_1$ and a partially control the dependence structure of $\{X_t\}_{t\in\mathbb Z}$ and $\{Z_t\}_{t\in\mathbb Z}$. On the other hand, b measures the 'endogeneity' of $\{Z_t\}_{t\in\mathbb Z}$ in (5.1) (so that $Z_t$ is strongly exogenous if b = 0), together with the regression coefficient $\beta_1$.
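The design above can be sketched as follows; the function name, burn-in length and seed handling are our own choices, not part of the paper.

```python
import numpy as np

def simulate_arx11(n, alpha1, beta1, b, a=0.5, burn=200, rng=None):
    """Simulate the ARX(1, 1) design (5.1):
       X_t = alpha1*X_{t-1} + beta1*Z_{1t} + eps_t,
       Z_{1t} = a*Z_{1,t-1} + u_t,  u_t = (1 - b^2)^{1/2}*v_t + b*eps_{t-1},
    with {v_t} and {eps_t} mutually independent i.i.d. N(0, 1); b controls the
    'endogeneity' of Z (Z_t is strongly exogenous when b = 0)."""
    rng = np.random.default_rng(rng)
    total = n + burn
    eps = rng.standard_normal(total)
    v = rng.standard_normal(total)
    z = np.zeros(total)
    x = np.zeros(total)
    for t in range(1, total):
        u = np.sqrt(1.0 - b**2) * v[t] + b * eps[t - 1]
        z[t] = a * z[t - 1] + u
        x[t] = alpha1 * x[t - 1] + beta1 * z[t] + eps[t]
    return x[burn:], z[burn:], eps[burn:]

x, z, eps = simulate_arx11(200, alpha1=0.5, beta1=0.5, b=0.4, rng=0)
```

The burn-in discards the start-up transient so that the retained sample is approximately stationary.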
We first estimate the parameters $\alpha_1$, $\beta_1$ and $\sigma_\varepsilon^2$ in (5.1) by (2.9), and for a given feasible transformation $L_n^m$ of $\hat T_n$ we compute the Cramér–von Mises statistic
$$ C_n^m:=\frac{1}{\tilde n-3}\sum_{j=1}^{\tilde n-3}\bigg(L_n^m\hat T_n\Big(\frac{j}{\tilde n}\Big)\bigg)^2, $$
where m indicates the type of approximation of φ employed. We have considered three alternatives for the martingale transformation. The first one uses a non-consistent estimate of φ, via the transformation $\check L_n$, and it is denoted $C_n^0$ in Tables 1–5. For the cases where we estimate φ consistently, we use the Tukey–Hanning kernel in (3.3),
$$ K_m(x)=\frac12\Big(1+\cos\frac{\pi x}{m}\Big), $$
with bandwidth parameters $m=[0.25\,n^{0.9}]$ and $m=[0.30\,n^{0.9}]$. To be able to make comparisons, we provide results for Ljung and Box's (1978) popular portmanteau test
$$ Q_p:=n(n+2)\sum_{j=1}^{p}\frac{\hat\rho_{\hat\varepsilon}(j)^2}{n-j}, $$
where
$$ \hat\rho_{\hat\varepsilon}(j):=\bigg(\sum_{t=1}^{n}\hat\varepsilon_t^2\bigg)^{-1}\sum_{t=j+1}^{n}\hat\varepsilon_t\hat\varepsilon_{t-j},\qquad j\ge1, $$
are the sample autocorrelations of the residuals $\{\hat\varepsilon_t\}_{t=1}^n$, for two choices of p. For n = 100, 200 we chose p = 10, 15, whereas for n = 400 we chose p = 15, 20. These choices are close to $n^{1/2}$, which seems a reasonable compromise in terms of size and power. As in Hong (1996), we employ a standardized version of $Q_p$, which we compare against standard normal critical values.

For power comparisons we consider two local alternatives. The first one is based on the ARX(2, 1) model
$$ X_t=\alpha_1X_{t-1}+0.5\,\frac{5}{n^{1/2}}\,X_{t-2}+\beta_1Z_{1t}+\varepsilon_t, $$
whereas the second local alternative is the ARMAX(1, 1, 1) model
$$ X_t=\alpha_1X_{t-1}+\beta_1Z_{1t}+0.5\,\frac{5}{n^{1/2}}\,\varepsilon_{t-1}+\varepsilon_t. $$
We report the percentage of rejections in 100,000 Monte Carlo replications. The empirical sizes of tests based on $C_n^0$ improve with the sample size, but they also appear to depend on the model under consideration. More specifically, the percentage of rejections under $H_0$ increases with $\alpha_1$, b and $\beta_1$ for all sample sizes. On the other hand, the sizes for $C_n^m$ are more stable, although there is some dependence on the value of b, perhaps due to some additional dependence on m. $Q_p$ provides better sizes for the smaller values of n, but similar ones for the larger values. Here the choice of p seems to be quite important, with the number of rejections increasing with p, and also with $\alpha_1$ and $\beta_1$, although it decreases with b.

For the power analysis we only report the simulations with n = 200, the picture for the other sample sizes being similar, although for n = 100 the results show some instability, perhaps due to the oversize of the tests for some parameter combinations. For the AR(2) alternatives,
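The residual autocorrelations and the portmanteau statistic $Q_p$ above can be computed as in the following sketch. The standardization used here, $(Q_p-p)/(2p)^{1/2}$ compared against standard normal critical values, is one common choice and is only an assumption on our part; Hong's (1996) exact standardization may differ.

```python
import numpy as np

def ljung_box(resid, p):
    """Ljung-Box statistic Q_p = n(n+2) * sum_{j=1}^{p} rho_hat(j)^2 / (n-j),
    with rho_hat(j) the sample autocorrelations of the residuals."""
    e = np.asarray(resid, dtype=float)
    n = e.size
    denom = np.sum(e**2)
    q = 0.0
    for j in range(1, p + 1):
        rho_j = np.sum(e[j:] * e[:-j]) / denom   # sample autocorrelation at lag j
        q += rho_j**2 / (n - j)
    return n * (n + 2) * q

def standardized_q(resid, p):
    # Centre by the chi-square mean p and scale by sqrt(2p) (assumed
    # standardization), then compare with standard normal critical values.
    return (ljung_box(resid, p) - p) / np.sqrt(2.0 * p)
```

Under the null of white-noise residuals, $Q_p$ is approximately chi-square with p degrees of freedom, which motivates the centring and scaling above.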
[Table 1. Size of 5% tests, n = 100. Percentage of rejections under H_0 of C_n^0, C_n^15, C_n^18, Q_10 and Q_15, for α_1 ∈ {0.2, 0.5, 0.8} and β_1 ∈ {0.2, 0.5, 1.0}, in panels b = 0, b = 0.4 and b = 0.8.]
[Table 2. Size of 5% tests, n = 200. Percentage of rejections under H_0 of C_n^0, C_n^29, C_n^35, Q_10 and Q_15, for α_1 ∈ {0.2, 0.5, 0.8} and β_1 ∈ {0.2, 0.5, 1.0}, in panels b = 0, b = 0.4 and b = 0.8.]
[Table 3. Size of 5% tests, n = 400. Percentage of rejections under H_0 of C_n^0, C_n^54, C_n^65, Q_15 and Q_20, for α_1 ∈ {0.2, 0.5, 0.8} and β_1 ∈ {0.2, 0.5, 1.0}, in panels b = 0, b = 0.4 and b = 0.8.]
[Table 4. Power of 5% tests: AR(2) alternative, n = 200. Percentage of rejections of C_n^0, C_n^29, C_n^35, Q_10 and Q_15, for α_1 ∈ {0.2, 0.5, 0.8} and β_1 ∈ {0.2, 0.5, 1.0}, in panels b = 0, b = 0.4 and b = 0.8.]
[Table 5. Power of 5% tests: MA(1) alternative, n = 200. Percentage of rejections of C_n^0, C_n^29, C_n^35, Q_10 and Q_15, for α_1 ∈ {0.2, 0.5, 0.8} and β_1 ∈ {0.2, 0.5, 1.0}, in panels b = 0, b = 0.4 and b = 0.8.]
$C_n^0$ shows the highest power for models with high $\alpha_1$ and b; otherwise $C_n^m$ dominates, with power decreasing in m. Tests based on $C_n^m$ are in general dominated by $C_n^0$, except for the least persistent models (with the lowest $\alpha_1$ and $\beta_1$), for which φ is rather flat and can be well estimated by kernel estimates with some oversmoothing, as with the choices of m we employ. In general, power increases with $\alpha_1$ and $\beta_1$ for small b, but the reverse situation arises for large values of b. For the MA(1) alternative, $C_n^0$ dominates in almost every case, in some situations outperforming $Q_p$ noticeably, while $C_n^m$ displays much inferior results for all m.
ACKNOWLEDGMENTS

The research of the first and third authors was funded by the Spanish 'Plan Nacional de I+D+I', reference number SEJ2007-62908.
REFERENCES

Bartlett, M. S. (1954). Problèmes de l'analyse spectrale des séries temporelles stationnaires. Publications de l'Institut de Statistique de l'Université de Paris III, 119–34.
Billingsley, P. (1968). Convergence of Probability Measures. New York: John Wiley.
Brillinger, D. R. (1981). Time Series: Data Analysis and Theory. San Francisco: Holden-Day.
Dahlhaus, R. (1985). On the asymptotic distribution of Bartlett's U_p-statistic. Journal of Time Series Analysis 6, 213–27.
Dahlhaus, R. and W. Wefelmeyer (1996). Asymptotically optimal estimation in misspecified time series models. Annals of Statistics 24, 952–74.
Delgado, M. A., J. Hidalgo and C. Velasco (2005). Distribution free goodness-of-fit tests for linear processes. Annals of Statistics 33, 2568–609.
Hong, Y. (1996). Consistent testing for serial correlation of unknown form. Econometrica 64, 837–64.
Ljung, G. M. and G. E. P. Box (1978). On a measure of lack of fit in time series models. Biometrika 65, 297–303.
Shorack, G. R. and J. A. Wellner (1986). Empirical Processes with Applications to Statistics. New York: John Wiley.
Stute, W., S. Thies and L.-X. Zhu (1998). Model checks for regression: an innovation process approach. Annals of Statistics 26, 1916–34.
Stute, W. and L.-X. Zhu (2002). Model checks for generalized linear models. Scandinavian Journal of Statistics 29, 535–45.
APPENDIX: PROOFS

We first state two general lemmas.

LEMMA A.1. Let Assumptions 2.1–2.3 hold. Set $\hat g_j=\hat g_m(j/\tilde n)$. Then, under $H_0$, as $n\to\infty$,
$$ \sup_j\big\|\hat g_j-g_j\big\|=o_p(1),\qquad \sup_j\big\|\hat\Sigma_j-\Sigma_j\big\|=o_p(1). \qquad(\text{A.1}) $$

Proof: We only prove the first part of (A.1), since the proof of $\sup_j\|\hat\Sigma_j-\Sigma_j\|=o_p(1)$ follows by identical steps. Because $g_j=(1,\phi_j')'$, we ignore the first element. By the triangle inequality, the left-hand side of (A.1) is bounded by
$$ \sup_j\big\|\hat\phi_j-\tilde\phi_j\big\|+\sup_j\big\|E\tilde\phi_j-\phi_j\big\|+\sup_j\big\|\tilde\phi_j-E\tilde\phi_j\big\|, \qquad(\text{A.2}) $$
where, using the errors $\varepsilon_t$,
$$ \tilde\phi_j:=\frac{4\pi}{\bar K_m}\sum_{\ell=-m;\ell\neq0}^{m}K_\ell\operatorname{Re}I_{W\varepsilon,j+\ell}. $$
To simplify arguments, we shall take henceforth K(u) = 1, so that
$$ \hat\phi_j=\frac{4\pi}{2m}\sum_{\ell=-m;\ell\neq0}^{m}\operatorname{Re}I_{\hat W\hat\varepsilon,j+\ell}. $$
Now,
$$ \sup_j\big\|\hat\phi_j-\tilde\phi_j\big\|\le\big\|\hat\theta_n-\theta_0\big\|\sup_j\Bigg\|\frac1m\sum_{\ell=-m}^{m}\operatorname{Re}I_{WW,j+\ell}\Bigg\|\le m^{-1}n^{1/2}\,\big\|n^{1/2}\big(\hat\theta_n-\theta_0\big)\big\|\,\frac1{\tilde n}\sum_{\ell=1}^{\tilde n}\big\|\operatorname{Re}I_{WW,\ell}\big\|=O_p\big(m^{-1}n^{1/2}\big)=o_p(1), $$
because $n^{1/2}(\hat\theta_n-\theta_0)=O_p(1)$ by Assumption 2.2 and $\tilde n^{-1}\sum_{\ell=1}^{\tilde n}\|\operatorname{Re}I_{WW,\ell}\|=O_p(1)$ by Assumption 2.3(i). The second term in (A.2) is $O(m^2n^{-2}+n^{-1}\log n)=o(1)$ because of Assumption 2.3(i), whereas
$$ E\big\|\tilde\phi_j-E\tilde\phi_j\big\|^4=\frac{1}{16m^4}\sum_{a=-m}^{m}\sum_{b=-m}^{m}\sum_{c=-m}^{m}\sum_{d=-m}^{m}E\big[h_{j+a}h_{j+b}h_{j+c}h_{j+d}\big], $$
where we have treated $h_j:=\operatorname{Re}I_{W\varepsilon,j}-E\operatorname{Re}I_{W\varepsilon,j}$ as a scalar to simplify notation. Now,
$$ E\big[h_{j+a}h_{j+b}h_{j+c}h_{j+d}\big]=E\big[h_{j+a}h_{j+b}\big]E\big[h_{j+c}h_{j+d}\big]+E\big[h_{j+a}h_{j+c}\big]E\big[h_{j+b}h_{j+d}\big]+E\big[h_{j+a}h_{j+d}\big]E\big[h_{j+b}h_{j+c}\big]+\operatorname{cum}\big[h_{j+a},h_{j+b},h_{j+c},h_{j+d}\big]. $$
But, for all a, b, $E[h_{j+a}h_{j+b}]=O\big(n^{-1}\log^3n+I(a=b)\big)$, whereas, distinguishing the contributions from higher-order and second-order cumulants (see Brillinger, 1981, p. 20 and Theorem 2.6.1),
$$ \operatorname{cum}\big[h_{j+a},h_{j+b},h_{j+c},h_{j+d}\big]=O\big(n^{-2}\log^6n+\delta^{2}_{a,b,c,d}\,n^{-1}\log^3n+\delta^{3}_{a,b,c,d}\,n^{-1}\log^2n\big), $$
where $\delta_{a,b,c,d}$ indicates a restriction among the indices a, b, c, d, so that the overall contribution of the cumulants to $E\|\tilde\phi_j-E\tilde\phi_j\|^4$ is $O(m^{-1}n^{-1}\log^2n+m^{-3})$. Thus,
$$ E\big\|\tilde\phi_j-E\tilde\phi_j\big\|^4=O\big(n^{-2}\log^6n+m^{-2}\big). $$
From here, we conclude easily that $\sup_j\|\tilde\phi_j-E\tilde\phi_j\|=o_p(1)$, using
$$ \Pr\Big(\sup_j\big\|\tilde\phi_j-E\tilde\phi_j\big\|>c\Big)\le\sum_{j=1}^{\tilde n}\Pr\big(\big\|\tilde\phi_j-E\tilde\phi_j\big\|>c\big)\le c^{-4}\sum_{j=1}^{\tilde n}E\big\|\tilde\phi_j-E\tilde\phi_j\big\|^4 $$
and the fact that $m^{-2}\tilde n=o(1)$. □

LEMMA A.2. Under the assumptions of Theorem 3.2,
$$ \sup_{\omega\in(0,\pi)}\Bigg\|\frac{1}{\tilde n^{1/2}}\sum_{j=1}^{[\tilde n\omega]}\big(\hat\phi_j-\phi_j\big)'m_j\Bigg\|=o_p(1). \qquad(\text{A.3}) $$
Proof: To simplify arguments, we assume that K(u) = I(|u| ≤ 1). Because $E\tilde\phi_j-4\pi\operatorname{Re}f_{W\varepsilon,j}$ is $O(n^{-2}m^2)$ uniformly in j, it is easy to show that
$$ \sup_{\omega\in(0,\pi)}\Bigg\|\frac{1}{\tilde n^{1/2}}\sum_{j=1}^{[\tilde n\omega]}\big(E\tilde\phi_j-4\pi\operatorname{Re}f_{W\varepsilon,j}\big)'m_j\Bigg\|=o_p(1), $$
assuming finite second derivatives of $f_{W\varepsilon,j}$ in Assumption 2.3, and that
$$ \sup_{\omega\in(0,\pi)}\Bigg\|\frac{1}{\tilde n^{1/2}}\sum_{j=1}^{[\tilde n\omega]}\big(\hat\phi_j-\tilde\phi_j\big)'m_j\Bigg\|=o_p(1), $$
using Assumptions 2.2 and 2.3 as in Lemma A.1. The lemma now follows by Propositions A.1 and A.2. □

PROPOSITION A.1. Under the assumptions of Theorem 3.2, for all ω ∈ [0, π],
$$ \frac{1}{\tilde n^{1/2}}\sum_{j=1}^{[\tilde n\omega]}\big(\tilde\phi_j-E\tilde\phi_j\big)'m_j=o_p(1). \qquad(\text{A.4}) $$
Proof: Writing $\tilde\phi_j-E\tilde\phi_j=\frac{1}{2m}\sum_{\ell=-m;\ell\neq0}^{m}h_{j+\ell}$, by Abel summation by parts we obtain that the left-hand side of (A.4) is
$$ \frac{1}{2\tilde n^{1/2}m}\sum_{j=1}^{[\tilde n\omega]}\big(h_{j-m}-h_{j+1+m}-h_j+h_{j+1}\big)\big(m_{j-m}+m_j\big) \qquad(\text{A.5}) $$
$$ +\frac{1}{2\tilde n^{1/2}m}\sum_{j=1}^{[\tilde n\omega]}\big(h_{j-m}-h_{j+1+m}-h_j+h_{j+1}\big)\sum_{\ell=1;\ell\neq j-m}^{j-1}m_\ell. \qquad(\text{A.6}) $$
Equation (A.5) is $o_p(1)$ because the Cauchy–Schwarz inequality implies that
$$ \Big(E\big|h_{j-m}-h_{j+1+m}-h_j+h_{j+1}\big|\,\big|m_{j-m}+m_j\big|\Big)^2\le E\big|h_{j-m}-h_{j+1+m}-h_j+h_{j+1}\big|^2\,E\big|m_{j-m}+m_j\big|^2<D, $$
where, in what follows, D denotes a finite and positive constant. It is worth mentioning that this is the best rate we can obtain under our general assumptions, because the lack of (strong) exogeneity implies that $E(h_jm_j)\neq0$. Next, we examine (A.6). We use the fact that $h_\bullet$ and $m_\bullet$ have no subindices in common; so, although the expectation is not zero (unless the fourth cumulant is), it is $O(n^{-1}\log^3n)$ at most. The expectation of (A.6) is $O(m^{-1}n^{-1/2}\log^3n)$ because
$$ E\Bigg(\big(h_{j-m}-h_{j+1+m}-h_j+h_{j+1}\big)\sum_{\ell=1;\ell\neq j-m}^{j-1}m_\ell\Bigg)=O\big(n^{-1}\log^3n\big). $$
Note that under Gaussianity the expectation would have been exactly zero. Next, we examine the second moment of (A.6). By the Cauchy–Schwarz inequality, it suffices to examine the second moment of each of the following four terms:
$$ \frac{1}{2\tilde n^{1/2}m}\sum_{j=1}^{[\tilde n\omega]}h_{j-m}\sum_{\ell=1;\ell\neq j-m}^{j-1}m_\ell,\qquad -\frac{1}{2\tilde n^{1/2}m}\sum_{j=1}^{[\tilde n\omega]}h_{j+1+m}\sum_{\ell=1;\ell\neq j-m}^{j-1}m_\ell, $$
$$ -\frac{1}{2\tilde n^{1/2}m}\sum_{j=1}^{[\tilde n\omega]}h_{j}\sum_{\ell=1;\ell\neq j-m}^{j-1}m_\ell,\qquad \frac{1}{2\tilde n^{1/2}m}\sum_{j=1}^{[\tilde n\omega]}h_{j+1}\sum_{\ell=1;\ell\neq j-m}^{j-1}m_\ell. \qquad(\text{A.7}) $$
We study the contribution due to the first term; the other three terms are handled similarly. The second moment of the first term of (A.7) is proportional to
$$ \frac{1}{\tilde nm^2}\sum_{j_1=1}^{[\tilde n\omega]}\sum_{j_2=1}^{j_1}\sum_{\ell_1=1;\ell_1\neq j_1-m}^{j_1-1}\sum_{\ell_2=1;\ell_2\neq j_2-m}^{j_2-1}E\big(h_{j_1-m}h_{j_2-m}m_{\ell_1}m_{\ell_2}\big) $$
$$ =\frac{1}{\tilde nm^2}\sum E\big(h_{j_1-m}h_{j_2-m}\big)E\big(m_{\ell_1}m_{\ell_2}\big)+\frac{1}{\tilde nm^2}\sum E\big(h_{j_1-m}m_{\ell_1}\big)E\big(h_{j_2-m}m_{\ell_2}\big)+\frac{1}{\tilde nm^2}\sum E\big(h_{j_2-m}m_{\ell_1}\big)E\big(h_{j_1-m}m_{\ell_2}\big)+\frac{1}{\tilde nm^2}\sum\operatorname{cum}\big(h_{j_1-m},h_{j_2-m},m_{\ell_1},m_{\ell_2}\big), \qquad(\text{A.8}) $$
where each sum runs over the same ranges of $j_1,j_2,\ell_1,\ell_2$ as in the previous display. Because $E(m_{\ell_1}m_{\ell_2})=O(n^{-1})+I(\ell_1=\ell_2)$ and $E(h_{j_1-m}h_{j_2-m})=O(n^{-1})+I(j_1=j_2)$, the first term on the right-hand side of (A.8) is
$$ O\Big(\frac{n}{m^2}\Big)+\frac{1}{\tilde nm^2}\sum_{j=1}^{[\tilde n\omega]}\sum_{\ell=1;\ell\neq j-m}^{j-1}D=O\Big(\frac{n}{m^2}\Big). $$
Similarly, the second and third terms on the right-hand side of (A.8) are $O(m^{-2}n)$. Finally, consider the fourth term on the right-hand side of (A.8). First, observe that
$$ \operatorname{cum}\big(h_{j_1-m},h_{j_2-m},m_{\ell_1},m_{\ell_2}\big)=\operatorname{cum}\big(w_{z,j_1-m}w_{\varepsilon,j_1-m}^{*},\,w_{z,j_2-m}w_{\varepsilon,j_2-m}^{*},\,w_{\varepsilon,\ell_1}w_{\varepsilon,\ell_1}^{*},\,w_{\varepsilon,\ell_2}w_{\varepsilon,\ell_2}^{*}\big)=\sum_{\upsilon}\prod_{r=1}^{q}\operatorname{cum}\big(w_{a,s_1}w_{b,s_2};(s_1,s_2)\in\upsilon_r\big), $$
with $s_1,s_2=j_1-m,\,j_2-m,\,\ell_1,\,\ell_2$, where a and b stand for Z and ε, and where the summation in υ is over all indecomposable partitions $\upsilon=\upsilon_1\cup\dots\cup\upsilon_q$, q = 1, ..., 4, of the table
$$ \begin{matrix} w_{z,j_1-m} & w_{\varepsilon,j_1-m}^{*}\\ w_{z,j_2-m} & w_{\varepsilon,j_2-m}^{*}\\ w_{\varepsilon,\ell_1} & w_{\varepsilon,\ell_1}^{*}\\ w_{\varepsilon,\ell_2} & w_{\varepsilon,\ell_2}^{*} \end{matrix} $$
(see Brillinger, 1981, p. 20 and Theorem 2.6.1). So a typical component of the fourth term on the right-hand side of (A.8) is
$$ \frac{1}{\tilde nm^2}\sum_{j_1=1}^{[\tilde n\omega]}\sum_{j_2=1}^{j_1}\sum_{\ell_1=1;\ell_1\neq j_1-m}^{j_1-1}\sum_{\ell_2=1;\ell_2\neq j_2-m}^{j_2-1}\prod_{r=1}^{q}\operatorname{cum}\big(w_{a,s_1}w_{b,s_2};(s_1,s_2)\in\upsilon_r\big)=O\big(n^{-1}m^{-1}\log^3n\big). $$
So we conclude that the second moment of (A.6) converges to zero, and (A.4) holds true by the Markov inequality. □

PROPOSITION A.2. Under the assumptions of Theorem 3.2, the process
$$ X_n(\omega)=\frac{1}{\tilde n^{1/2}}\sum_{j=1}^{[\tilde n\omega]}\big(\tilde\phi_j-E\tilde\phi_j\big)'m_j,\qquad\omega\in[0,1], $$
is tight.

Proof: Proceeding as with Proposition A.1, $X_n(\omega)$ can be written as
$$ X_{n1}(\omega)+X_{n2}(\omega):=\frac{1}{2\tilde n^{1/2}m}\sum_{j=1}^{[\tilde n\omega]}\big(h_{j-m}-h_{j+1+m}-h_j+h_{j+1}\big)\big(m_{j-m}+m_j\big)+\frac{1}{2\tilde n^{1/2}m}\sum_{j=1}^{[\tilde n\omega]}\big(h_{j-m}-h_{j+1+m}-h_j+h_{j+1}\big)\sum_{\ell=1;\ell\neq j-m}^{j-1}m_\ell. $$
Following Billingsley (1968, Theorem 15.6), a sufficient condition for the tightness of $X_n(\omega)$ is
$$ E\big|X_{nj}(\omega_2)-X_{nj}(\omega_1)\big|^{\tau}\le D\,(\omega_2-\omega_1)^{1+\delta},\qquad j=1,2, \qquad(\text{A.9}) $$
where τ, δ > 0 and $\omega_2>\omega_1$, and where, without loss of generality, we can assume that $\tilde n^{-1}\le\omega_2-\omega_1$. We begin with $X_{n1}(\omega)$. By definition,
$$ X_{n1}(\omega_2)-X_{n1}(\omega_1)=\frac{1}{2\tilde n^{1/2}m}\sum_{j=[\tilde n\omega_1]+1}^{[\tilde n\omega_2]}\big(h_{j-m}-h_{j+1+m}-h_j+h_{j+1}\big)\big(m_{j-m}+m_j\big). $$
So, by the triangle inequality and proceeding as with the estimation of the second moment of (A.5), $E|X_{n1}(\omega_2)-X_{n1}(\omega_1)|$ is bounded by
$$ D\,\frac{[\tilde n\omega_2]-[\tilde n\omega_1]}{\tilde n^{1/2}m}\le D\,\frac{n^{1/2}}{m}\,(\omega_2-\omega_1)\le D\,(\omega_2-\omega_1)^{1+\delta}, $$
because, by Assumption 3.2(ii), $m^{-1}n^{1/2}=o(n^{-2\delta})$ for some δ > 0. To complete the proof, we need to show (A.9) for $X_{n2}(\omega)$. We only examine the contribution of the first term of (A.7) to the left-hand side of (A.9), that is,
$$ E\Bigg|\frac{1}{2\tilde n^{1/2}m}\sum_{j=[\tilde n\omega_1]+1}^{[\tilde n\omega_2]}h_{j-m}\sum_{\ell=1;\ell\neq j-m}^{j-1}m_\ell\Bigg|^{\tau}. $$
Choosing τ = 2, the last displayed expression is bounded by
$$ \frac{1}{\tilde nm^2}\sum_{j_1=[\tilde n\omega_1]+1}^{[\tilde n\omega_2]}\sum_{j_2=[\tilde n\omega_1]+1}^{j_1}\sum_{\ell_1=1;\ell_1\neq j_1-m}^{j_1-1}\sum_{\ell_2=1;\ell_2\neq j_2-m}^{j_2-1}\Big\{E\big(h_{j_1-m}h_{j_2-m}\big)E\big(m_{\ell_1}m_{\ell_2}\big)+E\big(h_{j_1-m}m_{\ell_1}\big)E\big(h_{j_2-m}m_{\ell_2}\big)+E\big(h_{j_2-m}m_{\ell_1}\big)E\big(h_{j_1-m}m_{\ell_2}\big)+\operatorname{cum}\big(h_{j_1-m},h_{j_2-m},m_{\ell_1},m_{\ell_2}\big)\Big\}. $$
However, the last expression is bounded by $D(\omega_2-\omega_1)^{1+\delta}$ because, proceeding as in the proof of (A.8), these terms are bounded by
$$ D\,\frac{[\tilde n\omega_2]^2-[\tilde n\omega_1]^2}{n^3}\,\frac{n^2}{m^2}\log^3n\le D\,(\omega_2-\omega_1)\,\frac{n}{m^2}\log^3n\le D\,(\omega_2-\omega_1)^{1+2\delta}, $$
as $\tilde n^{-1}\le\omega_2-\omega_1$. This completes the proof. □
LEMMA A.3. Let $\xi(u):[0,1]\to\mathbb R^{p+q+1}$ be continuous. Under Assumption 2.1, we have that, in D[0, 1],
$$ \Bigg\{\int_0^{\omega}\xi(u)'\hat T_n(du):\omega\in[0,1]\Bigg\}\ \text{converges in distribution to}\ \Bigg\{\int_0^{\omega}\xi(u)'\,dB(u):\omega\in[0,1]\Bigg\}. $$

Proof: The proof is much simpler than that of Lemma 2 in DHV, so it is omitted. □

Proof of Proposition 2.1: The proof proceeds as that of Lemma 7 in DHV, and so it is omitted. □
Proof of Proposition 2.2: First, it can be shown that $\hat F_n(\pi)\to_p1$ by Assumption 2.1 under $H_0$. So we can write
$$ \hat F_n\big(\omega\pi;\hat\theta_n\big)=\hat F_n(\omega\pi)-\big(\hat\theta_n-\theta_0\big)'\,\frac{4\pi}{\tilde n}\sum_{j=1}^{[\omega\tilde n]}\operatorname{Re}I_{W\varepsilon,j}+\big(\hat\theta_n-\theta_0\big)'\,\frac{4\pi}{\tilde n}\sum_{j=1}^{[\omega\tilde n]}\operatorname{Re}I_{WW,j}\,\big(\hat\theta_n-\theta_0\big), $$
where
$$ \sup_{\omega\in[0,1]}\Bigg\|\frac{4\pi}{\tilde n}\sum_{j=1}^{[\omega\tilde n]}\operatorname{Re}I_{WW,j}\Bigg\|\le\frac{4\pi}{\tilde n}\sum_{j=1}^{\tilde n}\big\|\operatorname{Re}I_{WW,j}\big\|=O_p(1) \qquad(\text{A.10}) $$
because of Assumption 2.3(ii). Then $\hat F_n(\pi;\hat\theta_n)\to_p1$ by Assumption 2.2 and because, as we now show,
$$ A_n(\omega):=\frac{4\pi}{\tilde n}\sum_{j=1}^{[\omega\tilde n]}\operatorname{Re}I_{W\varepsilon,j}=\Lambda(\omega)+o_p(1) \qquad(\text{A.11}) $$
uniformly in ω, with Λ(1) = 0. By Assumption 2.3(i) we obtain that $EA_n(\omega)=\Lambda(\omega)+o(1)$ uniformly in ω, and by Assumption 2.3(iii), $A_n(\omega)-\Lambda(\omega)=o_p(1)$ for each ω. Then we have to check the tightness of $\bar A_n(\omega):=A_n(\omega)-E[A_n(\omega)]$. Following Billingsley (1968, Theorem 15.6), a sufficient condition is that, for some δ > 0 and 0 ≤ ω₁ < ω₂ ≤ 1,
$$ E\big|\bar A_n(\omega_2)-\bar A_n(\omega_1)\big|^2=E\Bigg|\frac{2}{\tilde n}\sum_{j=1+[\omega_1\tilde n]}^{[\omega_2\tilde n]}h_j\Bigg|^2\le D\,(\omega_2-\omega_1)^{1+\delta}. \qquad(\text{A.12}) $$
Without loss of generality, we consider only $\tilde n^{-1}\le\omega_2-\omega_1$. Then, using Assumptions 2.1 and 2.3(i)–(iii), the left-hand side of (A.12) is bounded by
$$ E\big|\bar A_n(\omega_2)-\bar A_n(\omega_1)\big|^2\le D\,\tilde n^{-1}\int_{\omega_1}^{\omega_2}d\lambda+D\,\tilde n^{-1}\log^3n\,\Bigg(\int_{\omega_1}^{\omega_2}d\lambda\Bigg)^2\le D\,(\omega_2-\omega_1)^2. $$
Now (A.11) follows from (A.10) and Assumption 2.2, while (2.13) follows from (A.11) and Assumption 2.2. □

Proof of Theorem 3.1: Using the arguments in the proofs of Theorems 3 and 1 in DHV, we only need to consider convergence on intervals [0, ω₀], for any ω₀ < 1. Since it is trivially satisfied that $\sup_{\omega\in[0,\omega_0]}|\bar LG(\omega)|=0$, the theorem is a consequence of
$$ \sup_{\omega\in[0,\omega_0]}\big|\bar L\big(\hat T_n\big(\omega;\hat\theta_n\big)-\hat T_n(\omega)\big)\big|=o_p(1), \qquad(\text{A.13}) $$
$$ \bar L\hat T_n(\omega)\overset{d}{\Rightarrow}B^0\ \text{in the space }D[0,\omega_0]. \qquad(\text{A.14}) $$
By definition, $\bar L(\hat T_n(\omega;\hat\theta_n)-\hat T_n(\omega))$ is
$$ \hat T_n\big(\omega;\hat\theta_n\big)-\hat T_n(\omega)-\int_0^{\omega}g(u)'\Sigma^{-1}(u)\int_u^1g(v)\big(\hat T_n\big(dv;\hat\theta_n\big)-\hat T_n(dv)\big)\,du. \qquad(\text{A.15}) $$
By Proposition 2.2, the first two terms in (A.15) are equal to $-\tilde n^{1/2}\Lambda(\omega)'(\hat\theta_n-\theta_0)+o_p(1)$ uniformly in ω, whereas the third term is
$$ \tilde n^{1/2}\int_0^{\omega}g(u)'\Sigma^{-1}(u)\int_u^1g(v)g(v)'\,dv\,du\,\big(\hat\theta_n-\theta_0\big)+o_p(1)=\tilde n^{1/2}\Lambda(\omega)'\big(\hat\theta_n-\theta_0\big)+o_p(1), $$
which shows (A.13). To complete the proof, we need to show (A.14). Fidi convergence follows as in Proposition 2.1 or Lemma A.3. Then it suffices to prove tightness. Since $\hat T_n(\omega)$ is tight, we only need to show the tightness condition for
$$ P_n(r):=\int_0^rH(u)\,\zeta_n(u)\,du, $$
where $H(u):=g(u)'\Sigma(u)^{-1}$ and $\zeta_n(u):=\tilde n^{-1/2}\sum_{j=1+[\tilde nu]}^{\tilde n}g_jm_j$. Because, by Lemma A.3, $\sup_{u\in[0,\omega_0]}\|\zeta_n(u)\|=O_p(1)$ and $E\|\zeta_n(u)\|^2<D$,
$$ E\big|P_n(r)-P_n(s)\big|^2=\int_s^r\int_s^rH(u_1)\Bigg(\frac{1}{\tilde n}\sum_{j=1+[\tilde nu_1]}^{\tilde n}\sum_{k=1+[\tilde nu_2]}^{\tilde n}g_jg_k'\,E\big(m_jm_k\big)\Bigg)H(u_2)'\,du_1\,du_2\le D\int_s^r\int_s^r\big\|H(u_1)\big\|\,\big\|H(u_2)\big\|\,du_1\,du_2=D\,\big|L(r)-L(s)\big|^2, $$
where $L(\cdot)=\int_0^{\cdot}\|H(u)\|\,du$ is a monotonic, continuous and non-decreasing function. □
Proof of Theorem 3.2: Setting $\hat\phi_j=\hat\phi_m(j/\tilde n)$ and $\hat g_j=(1,\hat\phi_j')'$, $\bar L_n\hat T_n(\omega;\hat\theta_n)$ is, up to terms $o_p(1)$ uniformly in ω,
$$ \frac{1}{\tilde n^{1/2}}\sum_{j=1}^{[\bar n\omega]}\Bigg(m_j-\hat g_j'\hat\Sigma_{j+1}^{-1}\,\frac1{\tilde n}\sum_{\ell=j+1}^{\tilde n}\hat g_\ell\,m_\ell\Bigg) \qquad(\text{A.16}) $$
$$ +\Bigg(\frac1{\tilde n}\sum_{j=1}^{[\tilde n\omega]}\phi_j'-\frac1{\tilde n}\sum_{j=1}^{[\bar n\omega]}\hat g_j'\hat\Sigma_{j+1}^{-1}\,\frac1{\tilde n}\sum_{\ell=j+1}^{\tilde n}\hat g_\ell\,\phi_\ell'\Bigg)\tilde n^{1/2}\big(\hat\theta_n-\theta_0\big), \qquad(\text{A.17}) $$
using Proposition 2.2. Since $\Sigma_j$ is assumed non-singular for all j = 1, ..., n̄, using Lemma A.1 and Assumption 2.2 we obtain that (A.17) is $o_p(1)$, which is what is required to conclude that the asymptotic behaviour of $\bar L_n\hat T_n(\omega)$ is given by that of (A.16). We now show the weak convergence of (A.16) with $\hat\phi$ replaced by φ; in Lemma A.2 we show that the difference is negligible.

First, the expectation is clearly zero because $E(I_{\varepsilon\varepsilon,j}-1)=0$, j = 1, ..., ñ. Next, we study the covariance structure. Let $\omega_1\le\omega_2$. Our aim is to show that
$$ E\big(a_n(\omega_1)a_n(\omega_2)\big)\underset{n\to\infty}{\longrightarrow}\omega_1, \qquad(\text{A.18}) $$
where
$$ a_n(\omega)=\frac{1}{\tilde n^{1/2}}\sum_{j=1}^{[\bar n\omega]}m_j-\frac{1}{\tilde n^{1/2}}\sum_{j=1}^{[\bar n\omega]}g_j'\Sigma_{j+1}^{-1}\,\frac1{\tilde n}\sum_{\ell=j+1}^{\tilde n}E(g_\ell)\,m_\ell:=a_{n1}(\omega)-a_{n2}(\omega), $$
since (A.18) implies that, if $a_n(\omega)$ converges to a Gaussian process, this would be the standard Brownian motion. Because $E(a_{n1}(\omega_1)a_{n1}(\omega_2))=\omega_1$, (A.18) holds true if
$$ E\big(a_{n2}(\omega_1)a_{n2}(\omega_2)\big)=E\big(a_{n1}(\omega_1)a_{n2}(\omega_2)\big)+E\big(a_{n1}(\omega_2)a_{n2}(\omega_1)\big). \qquad(\text{A.19}) $$
First, it is easy to check that the right-hand side of (A.19) is
$$ \frac1{\tilde n}\Bigg\{\sum_{j_1=1}^{[\bar n\omega_1]}\sum_{j_2=1}^{j_1\wedge[\bar n\omega_2]}g_{j_1}'\Sigma_{j_1+1}^{-1}g_{j_2}+\sum_{j_1=1}^{[\bar n\omega_2]}\sum_{j_2=1}^{j_1\wedge[\bar n\omega_1]}g_{j_1}'\Sigma_{j_1+1}^{-1}g_{j_2}\Bigg\}. $$
Next, we examine the left-hand side of (A.19), which is
$$ \frac1{\tilde n}\sum_{j_1=1}^{[\bar n\omega_1]}\sum_{j_2=1}^{[\bar n\omega_2]}\frac1{\tilde n}\sum_{\ell_1=1+j_1}^{\tilde n}\sum_{\ell_2=1+j_2}^{\tilde n}g_{j_1}'\Sigma_{j_1+1}^{-1}E\big(g_{\ell_1}\big)\,g_{j_2}'\Sigma_{j_2+1}^{-1}E\big(g_{\ell_2}\big)\,E\big(m_{\ell_1}m_{\ell_2}\big), $$
showing (A.19). Since the fidis of (A.16) converge to those of a Brownian motion, we only need to examine the tightness of $a_{n2}(\omega)$, as it is already known that $a_{n1}(\omega)$ is tight. But $a_{n2}(\omega_2)-a_{n2}(\omega_1)$ is
$$ \frac{1}{\tilde n^{1/2}}\sum_{j=[\bar n\omega_1]+1}^{[\bar n\omega_2]}g_j'\Sigma_{j+1}^{-1}\,\frac1{\tilde n}\sum_{\ell=[\bar n\omega_2]+1}^{\bar n}E(g_\ell)\,m_\ell+\frac{1}{\tilde n^{1/2}}\sum_{j=[\bar n\omega_1]+1}^{[\bar n\omega_2]}g_j'\Sigma_{j+1}^{-1}\,\frac1{\tilde n}\sum_{\ell=[\bar n\omega_1]+1}^{[\bar n\omega_2]}E(g_\ell)\,m_\ell, $$
from where it is easy to show that a n2 (ω) is tight. Observe that, for instance, the first term has again the structure (ζ (ω2 ) − ζ (ω1 ))Z, where Z is a random variable with at least finite second moments. ˇ assuming that Proof of Theorem 3.3: We first analyse an unfeasible version of the transformation Lˇ n , L, ˆ j by mj , j = 1, . . . , n, ˜ we observe θ 0 and hence replacing gˆ j by g j = Re I W ε,j and m ⎧ ⎫ ⎛ ⎞−1 ¯ ⎨ [ωn] n˜ n˜ ⎬ −1/2 ˜ n ˜ j +1 ⎝ gk+2 gk+2 ⎠ gk+2 mk+1 , mj − ng Lˇ Tˆn (ω; θˆn ) = ⎩ ⎭ Fˆn (π ) j =1
k=j +1
(A.20)
k=j +1
and show that under the same conditions of the theorem, d Lˇ Tˆn (ω; θˆn ) ⇒ B 0 in the Skorohod metric space D [0, ω0 ] ,
for any ω 0 < 1. Then the proof of Theorem 3.3 is standard after we notice that εˆ t = εt + (θˆn − θ0 ) Wt , Assumption 2.2 implies that θˆn − θ0 = Op (n−1/2 ) and the arguments in the proofs of Theorems 3 and 4 in DHV. We shall abbreviate g n,k by g k to simplify the notation. Now, because Fˆn (π, θˆn ) − σε2 = op (1), recall that we can assume that σ 2ε = 1 without loss of generality, we obtain that ⎛ ⎞−1 ¯ [nω] n˜ n˜ 1 ⎠ gj +1 ⎝ gk+2 gk+2 gk+2 mk+1 + op (1) . Lˇ Tˆn (ω) = Tˆn (ω) − 1/2 n˜ j =1 k=j +1 k=j +1
(A.21)
So, except the o p (1), the right-hand side of Lˇ Tˆn (ω) is ⎧ ⎫ ⎛ ⎞−1 ¯ [nω] n˜ n˜ ⎬ 1 ⎨ ⎝ ⎠ − g g g g m m . j k+2 k+2 k+1 j +1 k+2 ⎭ n˜ 1/2 j =1 ⎩ k=j +1 k=j +1
(A.22)
C The Author(s). Journal compilation C Royal Economic Society 2009.
S131
Goodness-of-fit tests for dynamic models ˜ n˜ 1 Now, we could replace Gj ,n = n1˜ nk=j +1 gk+2 gk+2 by Gj = n˜ k=j +1 E(gk+2 gk+2 ). Indeed, ⎧ ⎫ ¯ [nω] n˜ ⎬ 1 ⎨ −1 1 −1 G − G g m g k+2 k+1 j j ,n j +1 ⎭ n˜ 1/2 j =1 ⎩ n˜ k=j +1 ⎧ ⎫ ¯ [nω] n˜ ⎬ 1 ⎨ −1 (G − G )G g m = 3/2 gj +1 G−1 . j ,n j k+2 k+1 j j ,n ⎩ ⎭ n˜ j =1
(A.23)
k=j +1
However, Brillinger's (1981) Theorem 7.6.3 (see also the proof of Lemma A.1) implies that, uniformly in j,
\[
G_{j,n} - G_j = o_p\big(n^{-1/4}\big); \qquad \sum_{k=j+1}^{\tilde n} g_{k+2} m_{k+1} = O_p\big(n^{3/4}\big),
\]
so that the right-hand side of (A.23) is o_p(1), and hence the asymptotic distribution of (A.22) is given by that of
\[
\frac{1}{\tilde n^{1/2}} \sum_{j=1}^{[\bar n\omega]} \Bigg\{ m_j - g_{j+1}' G_j^{-1} \frac{1}{\tilde n} \sum_{p=j+1}^{\tilde n} g_{p+2} m_{p+1} \Bigg\}
= \frac{1}{\tilde n^{1/2}} \sum_{j=1}^{[\bar n\omega]} \Bigg\{ m_j - E(g_{j+1})' G_j^{-1} \frac{1}{\tilde n} \sum_{k=j+1}^{\tilde n} g_{k+2} m_{k+1} \Bigg\} + o_p(1),
\tag{A.24}
\]
as we now show. Writing ğ_j = g_j − E(g_j), the difference between the left-hand side and the first term on the right-hand side of (A.24) is
\[
\frac{1}{\tilde n^{3/2}} \sum_{j=1}^{[\bar n\omega]} \breve g_{j+1}' G_j^{-1} \sum_{k=j+1}^{\tilde n} g_{k+2} m_{k+1}.
\]
Next, the second moment of the right-hand side of the last displayed equality is
\[
\frac{1}{\tilde n^{3}} \sum_{1 \le j \le \ell \le [\bar n\omega]} E\Bigg\{ \Bigg( \breve g_{j+1}' G_j^{-1} \sum_{k=j+1}^{\tilde n} g_{k+2} m_{k+1} \Bigg) \Bigg( \breve g_{\ell+1}' G_\ell^{-1} \sum_{q=\ell+1}^{\tilde n} g_{q+2} m_{q+1} \Bigg) \Bigg\}.
\tag{A.25}
\]
Now, because ‖G_j^{-1}‖ < D, the expectation term in (A.25) is governed by
\[
E\Bigg( \sum_{k=j+1}^{\tilde n} g_{k+2} m_{k+1} \sum_{q=\ell+1}^{\tilde n} g_{q+2} m_{q+1} \Bigg) E\big( \breve g_{j+1}' \breve g_{\ell+1} \big)
+ E\Bigg( \breve g_{j+1}' \sum_{k=j+1}^{\tilde n} g_{k+2} m_{k+1} \Bigg) E\Bigg( \breve g_{\ell+1}' \sum_{q=\ell+1}^{\tilde n} g_{q+2} m_{q+1} \Bigg)
\]
\[
+\; E\Bigg( \breve g_{\ell+1}' \sum_{k=j+1}^{\tilde n} g_{k+2} m_{k+1} \Bigg) E\Bigg( \breve g_{j+1}' \sum_{q=\ell+1}^{\tilde n} g_{q+2} m_{q+1} \Bigg)
+ \sum_{k=j+1}^{\tilde n} \sum_{q=\ell+1}^{\tilde n} \mathrm{cum}\big( \breve g_{j+1}, g_{k+2} m_{k+1}, \breve g_{\ell+1}, g_{q+2} m_{q+1} \big).
\]
Now, because for example Cov(ğ_{j+1}, g_{k+1}) = I(j = k) + O(n^{-1}) and Cov(ğ_{j+1}, m_{k+1}) = I(j = k) + O(n^{-1}), and by Brillinger (1981, p. 20 and Theorem 4.3.2), the last displayed expression is O(1), and
M. A. Delgado, J. Hidalgo and C. Velasco
hence (A.25) is O(n^{-1}). So, we conclude that
\[
\check{L}\hat{T}_n(\omega) = \frac{1}{\tilde n^{1/2}} \sum_{j=1}^{[\bar n\omega]} \Bigg\{ m_j - \mathring G_j \frac{1}{\tilde n} \sum_{k=j+1}^{\tilde n} g_{k+2} m_{k+1} \Bigg\} + o_p(1),
\tag{A.26}
\]
where G̊_j = E(g_{j+1})' G_j^{-1}. So, it suffices to examine the asymptotic behaviour of
\[
\check{L}\hat{T}_n(\omega) = \frac{1}{\tilde n^{1/2}} \sum_{j=1}^{[\bar n\omega]} \Bigg\{ m_j - \mathring G_j \frac{1}{\tilde n} \sum_{k=j+1}^{\tilde n} g_{k+2} m_{k+1} \Bigg\},
\tag{A.27}
\]
and more specifically to show that (a) |E Ľ T̂_n(ω)| = o(1), (b) Cov(Ľ T̂_n(ω_1), Ľ T̂_n(ω_2)) = (ω_1 ∧ ω_2)π^{-1} + o(1), and (c) the process Ľ T̂_n(ω) is tight.

We begin with part (a). Now, because E m_j = 0 and ‖G̊_j‖ < D, we have that
\[
\big| E \check{L}\hat{T}_n(\omega) \big| \le \frac{D}{\tilde n^{1/2}} \sum_{j=1}^{[\bar n\omega]} \frac{1}{\tilde n} \sum_{k=j+1}^{\tilde n} \big| E(g_{k+2} m_{k+1}) \big| = O\big( n^{-1/2} \big),
\]
because E(g_{k+2} m_{k+1}) = Cov(I_{ε,k+1}, g_{k+2}) = O(n^{-1}). Now, we examine part (b). To that end it suffices to show that
\[
\text{(i)} \qquad E\Bigg( \frac{1}{\tilde n^{1/2}} \sum_{j=1}^{[\tilde n\omega_1]} m_j \; \frac{1}{\tilde n^{1/2}} \sum_{j=1}^{[\tilde n\omega_2]} m_j \Bigg) + o(1) = \frac{\omega_1 \wedge \omega_2}{\pi} + o(1);
\tag{A.28}
\]
(ii) that the contribution of the other three terms in Cov(Ľ T̂_n(ω_1), Ľ T̂_n(ω_2)) is o(1). That (A.28) holds true is standard; see for instance DHV's Lemma 7. Now, regarding part (ii), it suffices to see that
\[
- \frac{1}{\tilde n^{2}} \sum_{j \le \ell}^{[\bar n\omega]} E\Bigg\{ m_j \,\mathring G_\ell \sum_{k=\ell+1}^{\tilde n} g_{k+2} m_{k+1} \Bigg\}
- \frac{1}{\tilde n^{2}} \sum_{j \le \ell}^{[\bar n\omega]} E\Bigg\{ m_\ell \,\mathring G_j \sum_{k=j+1}^{\tilde n} g_{k+2} m_{k+1} \Bigg\}
+ \frac{1}{\tilde n^{3}} \sum_{j \le \ell}^{[\bar n\omega]} E\Bigg\{ \Bigg( \mathring G_j \sum_{k=j+1}^{\tilde n} g_{k+2} m_{k+1} \Bigg) \Bigg( \mathring G_\ell \sum_{k=\ell+1}^{\tilde n} g_{k+2} m_{k+1} \Bigg) \Bigg\}
\tag{A.29}
\]
is o(1). Observe that this is the term we obtain when ω_1 = ω_2 = ω. That (A.29) is o(1) follows because the first term of (A.29) is proportional to
\[
\frac{1}{\tilde n^{2}} \sum_{j \le \ell}^{[\bar n\omega]} \sum_{k=\ell+1}^{\tilde n} E\{ m_j g_{k+2} m_{k+1} \}
= \frac{1}{\tilde n^{2}} \sum_{j \le \ell}^{[\bar n\omega]} \sum_{k=\ell+1}^{\tilde n} \big\{ E\{m_j\} E\{g_{k+2} m_{k+1}\} + E\{m_j g_{k+2}\} E\{m_{k+1}\} + E\{m_j m_{k+1}\} E\{g_{k+2}\} \big\},
\]
which is zero because E{m_j m_{k+1}} = E{m_j} = 0. Next, the second term of (A.29) is
\[
- \frac{1}{\tilde n^{2}} \sum_{j \le \ell}^{[\bar n\omega]} \mathring G_j \sum_{k=j+1}^{\tilde n} E\{ m_\ell\, g_{k+2} m_{k+1} \}
= - \frac{1}{\tilde n^{2}} \sum_{j \le \ell}^{[\bar n\omega]} \mathring G_j E\{ g_{\ell+1} \}
\]
because E{m_j} = 0 and E{m_ℓ m_{k+1}} = I(ℓ = k + 1). And finally, the third term of (A.29) is ñ^{-2} Σ_{j≤ℓ}^{[n̄ω]} G̊_j E{g_{ℓ+1}} + O(n^{-1}). Indeed, proceeding as before, using Brillinger (1981, p. 20 and Theorem 4.3.2) as in the proof of Proposition A.1 (so that, for instance, E(m_{p+1} m_{q+1}) = I(p = q)) and then the definition of G̊_j in (A.26), the third term of (A.29) is
\[
\frac{1}{\tilde n^{3}} \sum_{j \le \ell}^{[\bar n\omega]} \mathring G_j \sum_{k=\ell+1}^{\tilde n} E\big\{ g_{k+2} g_{k+2}' \big\} \mathring G_\ell' + O\big(n^{-1}\big)
= \frac{1}{\tilde n^{2}} \sum_{j \le \ell}^{[\bar n\omega]} \mathring G_j E\{ g_{\ell+1} \} + o(1).
\]
So, we conclude for part (b) that Cov(Ľ T̂_n(ω_1), Ľ T̂_n(ω_2)) = (ω_1 ∧ ω_2)π^{-1} + o(1). To complete the proof we need to show part (c). From the definition of Ľ T̂_n(ω) in (A.27), it suffices to examine the tightness of
\[
\frac{1}{\tilde n^{1/2}} \sum_{j=1}^{[\bar n\omega]} \mathring G_j \frac{1}{\tilde n} \sum_{k=j+1}^{\tilde n} g_{k+2} m_{k+1},
\]
as ñ^{-1/2} Σ_{j=1}^{[ñω]} m_j is known to be tight; see, for instance, DHV. Now, because by Assumption 2.3, G̊_j − G̊_{j+1} = O(n^{-1}), it suffices to examine the tightness of
\[
\frac{1}{\tilde n^{1/2}} \sum_{j=1}^{[\bar n\omega]} \frac{1}{\tilde n} \sum_{k=j+1}^{\tilde n} g_{k+2} m_{k+1}
= (1 - \omega)\, \frac{1}{\tilde n^{1/2}} \sum_{\ell = 1 + [\bar n\omega]}^{\bar n} g_{\ell+2} m_{\ell+1}
+ \frac{1}{\tilde n^{1/2}} \sum_{\ell = 1}^{[\bar n\omega]} \Big( 1 - \frac{\ell}{\tilde n} \Big) g_{\ell+2} m_{\ell+1}.
\]
We shall examine the second term on the right-hand side, the first one being handled similarly. Now, by standard arguments, see Billingsley (1968), we only need to show that
\[
E \Bigg| \frac{1}{\tilde n^{1/2}} \sum_{\ell = 1 + [\bar n\omega_1]}^{[\bar n\omega_2]} g_{\ell+2} m_{\ell+1} \Bigg|^{4} \le D\, (\omega_2 - \omega_1)^{1+\delta}
\]
for some δ > 0. Now, the left-hand side of the last displayed expression is
\[
\frac{1}{\tilde n^{2}} \sum_{1 + [\bar n\omega_1] \le \ell_1, \ell_2, \ell_3, \ell_4 \le [\bar n\omega_2]} E\big( g_{\ell_1+2} m_{\ell_1+1}\; g_{\ell_2+2} m_{\ell_2+1}\; g_{\ell_3+2} m_{\ell_3+1}\; g_{\ell_4+2} m_{\ell_4+1} \big)
\]
\[
= 3\, \frac{1}{\tilde n^{2}} \Bigg( \sum_{1 + [\bar n\omega_1] \le \ell_1, \ell_2 \le [\bar n\omega_2]} E\big( g_{\ell_1+2} m_{\ell_1+1}\; g_{\ell_2+2} m_{\ell_2+1} \big) \Bigg)^{2}
+ \frac{1}{\tilde n^{2}} \sum_{1 + [\bar n\omega_1] \le \ell_1, \ell_2, \ell_3, \ell_4 \le [\bar n\omega_2]} \mathrm{cum}\big( g_{\ell_1+2} m_{\ell_1+1}, g_{\ell_2+2} m_{\ell_2+1}, g_{\ell_3+2} m_{\ell_3+1}, g_{\ell_4+2} m_{\ell_4+1} \big).
\]
Now proceed as we did in part (b) to conclude that the right-hand side of the last displayed expression is bounded by D(ω_2 − ω_1)², after noticing that we can always take ω_1 and ω_2 such that ñ^{-1} ≤ (ω_2 − ω_1). This completes the proof of the theorem.
Proof of Theorem 4.1: From the definition of F̂_n(ω, θ̂_n) in (2.7), under H_{an}, and proceeding as in Proposition 2.2, we have that
\[
\hat F_n(\omega, \hat\theta_n) = \hat F_n(\omega) - \frac{4\pi}{\tilde n^{3/2}} \sum_{j=1}^{[\omega\tilde n]} \mathrm{Re}\, I_{\varepsilon W,j}\; \tilde n^{1/2} \big(\hat\theta_n - \theta_0\big)
- c\, \frac{4\pi}{\tilde n^{3/2}} \sum_{j=1}^{[\omega\tilde n]} \mathrm{Re}\, I_{\varepsilon X(-p-1),j} + o_p(1)
= \hat F_n(\omega) - \tilde n^{-1/2} \big\{ \Phi(\omega)' \big(\hat\theta_n - \theta_0\big) + c\,\sigma^2 \Lambda(\omega) \big\} + o_p(1),
\]
uniformly in ω ∈ [0, 1]. From here, (4.2) follows repeating the same steps of Theorems 3.1 and 3.2, but noting the additional term given by Λ(ω) := ∫_0^ω l(πu) du in the general case. So, under H_{an}, B^0 + L^0 is a non-centred Gaussian process, the 'non-centrality function' being given by L^0. Now, the test will have non-trivial power under H_{an} if L^0(ω) ≠ 0 on a set, say Ω(L), with Lebesgue measure greater than zero. From the definitions of L^0 and Φ, and since ∫_0^1 φ(v) dv = 0, it is easily seen that
\[
L^0(\omega) = \int_0^{\omega} \Bigg\{ l(\pi u) - g(u)' \Phi(u)^{-1} \int_u^1 g(v)\, l(\pi v)\, dv \Bigg\}\, du.
\]
However, the expression in braces is just the residual from the least-squares projection of l(πu) on g(u) = (1, φ(u)')', which is different from zero unless l(πu) lies in the space spanned by g(u). But the latter is ruled out, which concludes the proof.
The Econometrics Journal (2009), volume 12, pp. S135–S171. doi: 10.1111/j.1368-423X.2009.00279.x

Efficient GMM with nearly-weak instruments

BERTILLE ANTOINE† AND ERIC RENAULT‡

†Department of Economics, Simon Fraser University, 8888 University Drive, Burnaby, BC V5A 1S6, Canada
E-mail: [email protected]

‡Department of Economics, University of North Carolina at Chapel Hill, CB 3305, Chapel Hill, NC 27599, USA
E-mail: [email protected]

First version received: July 2008; final version accepted: December 2008
Summary This paper is in the line of the recent literature on weak instruments which, following the seminal approach of Stock and Wright, captures weak identification by drifting population moment conditions. In contrast with most of the existing literature, we do not specify a priori which parameters are strongly or weakly identified. We rather consider that weakness should be related specifically to instruments, or more generally to the moment conditions. In addition, we focus here on the case dubbed nearly-weak identification, where the drifting DGP introduces a limit rank deficiency reached at a rate slower than root-T. This framework ensures the consistency of Generalized Method of Moments (GMM) estimators of all parameters, but at a rate possibly slower than usual. It also validates the GMM-LM test with standard formulas. We then propose a comparative study of the power of the LM test and its modified version, the K-test proposed by Kleibergen. Finally, after a well-suited rotation in the parameter space, we identify and estimate directions where root-T convergence is maintained. These results are all directly relevant for practical applications without requiring the knowledge or the estimation of the slower rate of convergence.

Keywords: GMM, Instrumental variables, Weak identification.
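The identification taxonomy that organizes the paper (and its Table 1) can be summarized schematically; the display below is our paraphrase of the rate labels used in the text, not a formula quoted from the paper.

```latex
% Drifting moment conditions with drift rate delta_T (schematic, our notation):
\[
\begin{array}{ll}
\delta_T = O(1) & \text{strong identification},\\
\delta_T \to \infty,\ \delta_T = o(T^{1/4}) & \text{nearly-strong identification},\\
\delta_T \to \infty,\ \delta_T = o(T^{1/2}) & \text{nearly-weak identification},\\
\delta_T = T^{1/2} & \text{weak identification}.
\end{array}
\]
```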
1. INTRODUCTION

In this paper, we revisit the Generalized Method of Moments (GMM) of Hansen (1982) when classical identification assumptions are only barely satisfied. Following Hansen (1982), we imagine an economic model with structural parameter of interest θ ∈ Θ ⊆ R^p. The econometrician's information about the true unknown value θ^0 of θ comes through moment conditions. For some stationary ergodic process (Y_t), φ(Y_t, θ) is a K-dimensional function, integrable for all θ ∈ Θ, and the underlying economic model states that these moment conditions are satisfied at the true unknown value of the parameter:
\[
E[\phi(Y_t, \theta^0)] = 0.
\tag{1.1}
\]
Moment conditions (1.1) strongly globally identify θ^0 if they do not admit any other solution:
\[
E[\phi(Y_t, \theta)] = 0, \quad \theta \in \Theta \iff \theta = \theta^0.
\tag{1.2}
\]
Hansen (1982) maintains (1.2) to prove consistency of a GMM estimator, defined as:
\[
\hat\theta_T = \arg\min_{\theta \in \Theta} \big[ \bar\phi_T(\theta)'\, \Omega_T\, \bar\phi_T(\theta) \big].
\tag{1.3}
\]
Here φ̄_T(θ) is the sample mean of φ(Y_t, θ), and Ω_T is a sequence of random non-negative matrices with positive definite probability limit. To address the asymptotic distribution of a GMM estimator, Hansen (1982) extends the above definition and considers any sequence θ̂_T such that:
\[
\mathrm{Plim}\big[ T^{1/2} A_T\, \bar\phi_T(\hat\theta_T) \big] = 0.
\tag{1.4}
\]
(A_T) is a sequence of (p, K) random matrices converging in probability to a constant full-row rank matrix A^0. The strong global identification condition (1.2) may be replaced by a local one: moment conditions (1.1) strongly locally identify θ^0 if θ^0 belongs to the interior of Θ, where φ(Y_t, θ) is continuously differentiable, and E[∂φ(Y_t, θ^0)/∂θ'] has full-column rank. Under this strong local identification condition, the GMM estimator θ̂_T defined by (1.4) is consistent and asymptotically normal. Moreover, the GMM estimator defined by (1.3) is a special case of (1.4) (through the first-order conditions) when strong local identification holds. Both global and local strong identification conditions have been questioned in the literature during the last ten years. Stock and Wright (2000) relax strong global identification by considering a drifting Data Generating Process (DGP) with:
\[
E[\phi(Y_t, \theta)] = \frac{m_{1T}(\theta)}{T^{1/2}} + m_2(\theta_1) \quad \text{for some given subvector } \theta_1 \text{ of } \theta.
\tag{1.5}
\]
Then, only θ_1 is possibly identified since, for the other components of θ, the relevant moment information vanishes at rate square-root T, the speed at which information is accumulated with a larger sample size. This case has been referred to as (global) weak identification. Kleibergen (2005) focuses on the GMM score-type test of a null hypothesis H_0: θ = θ^0. For such a problem, only local identification is relevant, and Kleibergen (2005) refers to as (local) weak identification the case where:
\[
E\Bigg[ \frac{\partial \phi(Y_t, \theta^0)}{\partial \theta'} \Bigg] = \frac{C}{T^{1/2}} \quad \text{with full-column rank matrix } C.
\tag{1.6}
\]
We revisit the issue of GMM estimators and score-type tests in the nearly-weak identification case, first introduced by Hahn and Kuersteiner (2002): through a drifting DGP approach, information now vanishes when the sample size T increases, but at a rate δ_T slower than T^{1/2}. In the context of Stock and Wright (2000), global nearly-weak identification would mean:
\[
E[\phi(Y_t, \theta)] = \frac{m_{1T}(\theta)}{\delta_T} + m_2(\theta_1),
\tag{1.7}
\]
while in the context of Kleibergen (2005), local nearly-weak identification would mean:
\[
E\Bigg[ \frac{\partial \phi(Y_t, \theta^0)}{\partial \theta'} \Bigg] = \frac{C}{\delta_T}.
\tag{1.8}
\]
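To see what the drift rates in (1.5)–(1.8) imply in practice, the following sketch simulates a hypothetical one-parameter linear IV model whose first-stage coefficient shrinks at rate 1/δ_T, and reports the median estimation error at two sample sizes. The design, function names and constants are ours, purely for illustration.

```python
import numpy as np

def median_iv_error(T, delta, reps, rng):
    """Median |theta_hat - theta0| when the instrument's first-stage
    coefficient is c/delta(T): delta(T) = 1 is strong identification,
    delta(T) = T**0.25 nearly-weak, delta(T) = T**0.5 weak (as in the text)."""
    errs = []
    for _ in range(reps):
        z = rng.normal(size=T)
        x = z / delta(T) + rng.normal(size=T)   # first stage, c = 1
        y = x + rng.normal(size=T)              # theta0 = 1
        errs.append(abs((z @ y) / (z @ x) - 1.0))  # just-identified IV
    return float(np.median(errs))

rng = np.random.default_rng(0)
results = {name: [median_iv_error(T, d, 200, rng) for T in (1000, 16000)]
           for name, d in [("strong", lambda T: 1.0),
                           ("nearly-weak", lambda T: T ** 0.25),
                           ("weak", lambda T: T ** 0.5)]}
for name, (e_small, e_large) in results.items():
    print(f"{name:12s} T=1000: {e_small:.3f}  T=16000: {e_large:.3f}")
```

Under the nearly-weak design the error still shrinks as T grows, at the effective rate λ_T = T^{1/2}/δ_T, while under the weak design it does not shrink at all.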
The possible case of nearly-weak identification has been quite overlooked in the literature while, after all, it makes sense to study a variety of asymptotic behaviours when δ_T may be associated with any rate between O(1) and O(T^{1/2}). Weak identification is only a limit case where identification is completely lost. So far, only Hahn and Kuersteiner (2002) in a linear context, and Caner (2007) in a non-linear one, have considered nearly-weak identification. Our contribution as concerns nearly-weak identification is to imagine that, in realistic circumstances, nearly-weak identification may occur for some moments while strong identification is still guaranteed for others. This new point of view paves the way for new results, as follows. In terms of GMM score-type testing, the partition between locally strongly- and locally nearly-weakly identifying moment conditions determines the different rates of convergence associated with specific directions in the parameter space against which the test has power.¹ As a result, the GMM score test has power even in quite weak directions, where the weakness degree δ_T may be arbitrarily close to T^{1/2}. We show that, by contrast, Kleibergen's modified score test is more likely to waste some power in such directions. It is the price to pay for robustness to weak identification (δ_T = T^{1/2}) when, as shown by Kleibergen (2005), the standard GMM score test does not work. We show that the GMM score test and Kleibergen's modified score test are actually asymptotically equivalent under relevant sequences of local alternatives, but only in cases of moderate weakness of identification: we refer to nearly-strong identification when δ_T goes to infinity slower than T^{1/4}. This equivalence is tightly related to a stronger equivalence result between the standard two-step (efficient) GMM and the continuously updated GMM of Hansen et al. (1996) for efficient estimation in all directions. Such a result can only be obtained after extending the pioneering setting introduced by Stock and Wright (2000): we now consider that some moment conditions are globally identifying, while some others are only weakly identifying. In other words, the vector φ(Y_t, θ) is partitioned into two subvectors φ(Y_t, θ)' = [φ_1(Y_t, θ)' ⋮ φ_2(Y_t, θ)'] such that:
E [φ2 (Yt , θ )] = ρ2 (θ )/δT
(1.9)
with the global nearly-weak identification condition: ρ(θ ) = 0 ⇔ θ = θ 0
where
. ρ(θ ) = [ρ1 (θ ) .. ρ2 (θ )] .
Identification is nearly-weak because δ T goes to infinity, but we rather call it nearly-strong when associated to a rate slower than T 1/4 . By contrast with Stock and Wright (2000), we have no prior knowledge on the subset of parameters that are weakly identified. Intuitively, the first set of moment conditions (respectively the second one) identifies strong (respectively weak) directions in the parameter space. Through a convenient rotation in the parameter space in the spirit of Phillips (1989), we define a reparametrization such that the first components of this new parameter are estimated at standard rate square-root T, while the others are estimated only at slower rate λ T = T 1/2 /δ T . Asymptotic covariances come with standard GMM-like formulas, but only with nearly-strong identification, i.e. rate λ T faster than T 1/4 . Interpreting this latter condition is germane to Andrews’ (1994) study of MINPIN estimators. 2 In our case, the nuisance parameter is not infinite dimensional. However, due to nearly-weak identification, it is associated to a rate of convergence slower than the standard parametric square-root T. As in Andrews (1994), the slow rate of convergence needs to be faster than T 1/4 to avoid contamination of the well-identified estimated directions by the nearly-weak ones. We also show that the nearly-strong identification condition is exactly needed to ensure that all directions are equivalently estimated 1 As far as the size properties are concerned, we know from results in Andrews and Guggenberger (2007) that nearlyweak identification does not offer additional insights. Only genuine weak identification is important in investigating size properties of testing procedures. 2 These estimators are defined as MINimizing a criterion function that might depend on a Preliminary Infinite dimensional Nuisance parameter estimator.
by efficient two-step GMM and continuously updated GMM. This explains the aforementioned partial equivalence between the GMM score and Kleibergen's modified score tests. More generally, our unified setting for mixed strong/nearly-strong identification coherently incorporates both global and local points of view. Ultimately, evidence of weak identification should not always lead to renouncing meaningful estimation and testing, as the alleged weak identification may only be nearly-weak, or even nearly-strong. Overlooking these cases could lead to wasting some relevant information. Moreover, possible weakness should be assigned to some specific instruments (or to some moment conditions) as in (1.9), rather than to specific parameters as in (1.7). It is the econometrician's duty to determine the different directions in the parameter space where she has more or less accurate information. We illustrate this new point of view with a Monte Carlo study of the well-known example of consumption-based CAPM, already extensively studied in the literature: see Stock and Wright (2000), among others. The paper is organized as follows. In Section 2, we discuss GMM-based tests of a simple null hypothesis H_0: θ = θ^0. We compare the asymptotic behaviour of the standard GMM score test (Newey and West, 1987) and the Kleibergen modified score test. When complete weak identification is precluded, both tests work. Our framework allows us to display relevant sequences of local alternatives with heterogeneous rates of convergence depending on the direction of departure in the parameter space. By contrast with Kleibergen (2005), different degrees of nearly-weak identification are considered simultaneously: this opens the door to non-equivalence, even asymptotically, between standard and modified score tests. In Section 3, the consistency and rate of convergence of any GMM estimator are analysed in a nearly-weak identification setting.
The special case of nearly-strong identification allows us in Section 4 to discuss efficient estimation with various rates of convergence in various directions, and to check equivalence between two-step efficient GMM and continuously updated GMM in all directions. These last results bridge the gap between estimation and score tests as discussed in Section 2. The practical relevance of our new asymptotic theory is checked in Section 5 in a consumption-based intertemporal asset pricing model. It validates our point of view of nearly-strong identification with different rates of convergence in different directions for realistic simulated parameter configurations. Section 6 concludes. Proofs are gathered in the Appendix; Table 1 summarizes the different concepts of identification.
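Since the two-step efficient GMM estimator of (1.3) is central to the equivalence results discussed above, here is a minimal Python sketch for a hypothetical overidentified linear IV model. The design, the grid-search shortcut, and all names are ours, purely for illustration.

```python
import numpy as np

def gmm_two_step(phi, theta_grid):
    """Two-step GMM by grid search (illustration only).

    phi(theta) returns the (T, K) matrix of moment contributions
    phi(Y_t, theta); the estimator minimizes the quadratic form
    phi_bar' W phi_bar, as in (1.3) with weighting matrix W.
    """
    def objective(theta, W):
        g = phi(theta).mean(axis=0)          # phi_bar_T(theta)
        return g @ W @ g

    # First step: identity weighting matrix.
    K = phi(theta_grid[0]).shape[1]
    theta1 = min(theta_grid, key=lambda th: objective(th, np.eye(K)))

    # Second step: efficient weighting S^{-1}, with S estimated at theta1.
    u = phi(theta1)
    S = (u - u.mean(axis=0)).T @ (u - u.mean(axis=0)) / u.shape[0]
    return min(theta_grid, key=lambda th: objective(th, np.linalg.inv(S)))

# Hypothetical linear IV design: y = x*theta0 + e, two instruments (K = 2).
rng = np.random.default_rng(0)
T, theta0 = 5000, 1.0
z = rng.normal(size=(T, 2))
x = z @ np.array([1.0, 0.5]) + rng.normal(size=T)
y = x * theta0 + rng.normal(size=T)
phi = lambda th: z * (y - x * th)[:, None]   # E[z (y - x*theta)] = 0

theta_hat = gmm_two_step(phi, np.linspace(0.5, 1.5, 201))
```

In a real application the grid search would be replaced by a numerical optimizer, and S by a HAC long-run variance estimator when the moments are serially correlated.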
2. GMM SCORE-TYPE TESTING

We want to test the null hypothesis H_0: θ = θ^0. Our information about the parameter θ comes from the following moment conditions,
\[
E[\phi(Y_t, \theta^0)] = 0,
\tag{2.1}
\]
always assumed to be fulfilled at least by the true unknown value of the parameter θ^0. Observed time series (Y_t)_{1≤t≤T} of a stationary ergodic process are available, and such that the sample counterparts of the moment conditions satisfy a Central Limit Theorem (CLT) at the true value:

ASSUMPTION 2.1. (CLT at the true value θ^0). With φ̄_T(θ^0) = (1/T) Σ_{t=1}^{T} φ(Y_t, θ^0):
(i) √T φ̄_T(θ^0) is asymptotically normally distributed with zero mean and covariance matrix S^0.
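A minimal sketch of the standard GMM score (LM) statistic for H_0: θ = θ^0, together with a small Monte Carlo check under H_0 in a hypothetical linear IV design. The helper names and the design are ours; the paper's formal treatment of this statistic follows.

```python
import numpy as np

def gmm_score_stat(phi_bar, J_hat, S_hat, T):
    """Standard GMM score (LM) statistic for H0: theta = theta0.

    phi_bar: sample moment vector at theta0, shape (K,)
    J_hat:   estimated Jacobian of the moments at theta0, shape (K, p)
    S_hat:   estimate of the long-run variance of sqrt(T)*phi_bar, shape (K, K)
    Under H0 (and strong or nearly-weak identification, as in the text),
    the statistic is asymptotically chi-squared with p degrees of freedom.
    """
    Sinv = np.linalg.inv(S_hat)
    P = Sinv @ J_hat @ np.linalg.inv(J_hat.T @ Sinv @ J_hat) @ J_hat.T @ Sinv
    return T * phi_bar @ P @ phi_bar

# Monte Carlo check under H0 in a hypothetical linear IV design (p = 1, K = 2).
rng = np.random.default_rng(2)
T, theta0, stats = 2000, 1.0, []
for _ in range(500):
    z = rng.normal(size=(T, 2))
    x = z @ np.array([1.0, 0.5]) + rng.normal(size=T)
    y = x * theta0 + rng.normal(size=T)
    u = z * (y - x * theta0)[:, None]        # phi(Y_t, theta0), shape (T, 2)
    phi_bar = u.mean(axis=0)
    J_hat = -(z * x[:, None]).mean(axis=0).reshape(2, 1)
    S_hat = u.T @ u / T
    stats.append(gmm_score_stat(phi_bar, J_hat, S_hat, T))
```

With p = 1, the simulated statistics should behave approximately like a chi-squared variable with one degree of freedom (mean near 1) under H_0.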
[Table 1 fragment: δ_T = T^{1/2} corresponds to weak identification.]