INTRODUCTION

Simulation-based methods and simulation-assisted estimators have greatly increased the reach of empirical applications in econometrics. The received literature includes a thick layer of theoretical studies, including landmark works by Gourieroux and Monfort (1996), McFadden and Ruud (1994), and Train (2003), and hundreds of applications. An early and still influential application of the method is Berry, Levinsohn, and Pakes's (1995) (BLP) study of the U.S. automobile market, in which a market equilibrium model is cleared of latent heterogeneity by integrating the heterogeneity out of the moments in a GMM setting. BLP's methodology is a baseline technique for studying market equilibrium in empirical industrial organization. Contemporary applications involving multilayered models of heterogeneity in individual behavior, such as that in Riphahn, Wambach, and Million's (2003) study of moral hazard in health insurance, are also common. Computation of multivariate probabilities by using simulation methods is now a standard technique in estimating discrete choice models. The mixed logit model for modeling preferences (McFadden & Train, 2000) is now the leading edge of research in multinomial choice modeling. Finally, perhaps the most prominent application in the entire arena of simulation-based estimation is the current generation of Bayesian econometrics based on Markov Chain Monte Carlo (MCMC) methods. In this area, heretofore intractable posterior means are routinely estimated with the assistance of simulation and the Gibbs sampler. The 10 chapters in this volume are a collection of methodological developments and applications of simulation-based methods that were presented at a workshop at Louisiana State University in November 2009.

Among the earliest applications of the principles discussed here was the development of the GHK simulator for multivariate normal probabilities. Prior analysts of multinomial choice had reluctantly focused on the multinomial logit model for estimation of choice probabilities. The substantive shortcomings of the MNL model as a description of behavior were well known. The assumptions underlying the MNL random utility model of independent, identically distributed random components in the utility functions produce unattractive restrictions on the behavioral model. A multinomial probit model with multivariate normal components relaxes
the restrictions. However, computation of choice probabilities requires computation of the multivariate cumulative probabilities. This is simple in the MNL case, for which Prob[v_1 ≤ v_m, v_2 ≤ v_m, …, v_J ≤ v_m] has a simple closed form when v_m and the v_j (where m is one of the j's) are independent extreme value random variables. However, the counterpart when v has a multivariate normal distribution with mean vector μ and covariance matrix Σ cannot be computed directly, as there is no closed form expression (and, for more than two variables, no convenient quadrature method). Manski and Lerman (1977) proposed an indirect estimator based on a set of R multivariate random draws from the normal (μ, Σ) population,

P̂ = (1/R) Σ_{r=1}^R 1[v_{1r} ≤ v*_1, v_{2r} ≤ v*_2, …, v_{Kr} ≤ v*_K | μ, Σ]
That is, based on the (easily drawn) pseudo-random sample, simply count the number of observations that fall in the specified orthant and divide by the number of draws. Two practical limitations, the extraordinarily large number of draws needed to get a satisfactory estimate and the nonzero probability of producing a one or zero estimate, inhibit the direct use of this approach. However, the principle that underlies it provides the means to proceed. The central result is that, by dint of the law of large numbers, P̂ is a consistent estimator of the expectation of a random variable that can be simulated,

P = E[1(v_1 ≤ v*_1, v_2 ≤ v*_2, …, v_K ≤ v*_K | μ, Σ)]

The GHK simulator (Geweke, 1989; Hajivassiliou, 1990; Börsch-Supan & Hajivassiliou, 1990; Keane, 1994) provides a smooth, efficient means of estimating multivariate normal cumulative probabilities. [See Greene (2008, pp. 582–584) for mechanical details, and the special issue of the Review of Economics and Statistics edited by Caves, Moffatt, and Stock (1994) for numerous studies.] Two of the chapters in our volume are extensions of the GHK simulator. Ivan Jeliazkov and Esther Hee Lee, in "MCMC Perspectives on Simulated Likelihood Estimation," reconsider the computation of the probabilities in a discrete choice model. They show that by interpreting the outcome probabilities through Bayes' theorem, the estimation can alternatively be handled by methods for marginal likelihood computation based on the output of MCMC algorithms. They then develop new methods for estimating response probabilities and propose an adaptive sampler for producing high-quality draws from multivariate truncated normal distributions. A simulation study illustrates the practical benefits and costs associated with each
approach. In "The Panel Probit Model: Adaptive Integration on Sparse Grids," Florian Heiss suggests an algorithm that is based on GHK but uses an adaptive version of sparse grids integration (SGI) instead of simulation. It is adaptive in the sense that it uses an automated change of variables to make the integration problem numerically better behaved, along the lines of efficient importance sampling (EIS) and adaptive univariate quadrature. The resulting integral is approximated using SGI, which generalizes Gaussian quadrature in such a way that the computational costs do not grow exponentially with the number of dimensions. Monte Carlo experiments show impressive performance compared to the original GHK algorithm, especially in difficult cases such as models with high intertemporal correlations.

The extension of the simulation principles to models in which unobserved heterogeneity must be integrated out of the likelihood function is ubiquitous in the contemporary literature. The template application involves a log-likelihood that is conditioned on the unobserved heterogeneity,
ln L | v = Σ_{i=1}^n ln g(y_i, X_i | v_i, θ)

Feasible estimation requires that the unobserved heterogeneity be integrated out of the log-likelihood function; the unconditional log-likelihood function is

ln L = Σ_{i=1}^n ln ∫_{v_i} g(y_i, X_i | v_i, θ) f(v_i) dv_i = Σ_{i=1}^n ln E_{v_i}[g(y_i, X_i | v_i, θ)]

Consistent with our earlier observation, the log-likelihood function can be approximated adequately by averaging a sufficient number of pseudo-draws on v_i. This approach has been used in a generation of applications of random parameters models in which θ_i = θ + v_i, where v_i is a multivariate realization of a random vector that can be simulated.
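To make the averaging concrete, a minimal Python sketch is given below for a random parameters probit specification; the probit kernel, the data layout, and the number of draws are illustrative assumptions rather than features of any particular chapter in the volume.

```python
import numpy as np
from scipy.stats import norm

def simulated_loglik(theta, Sigma_chol, y, X, R=500, seed=0):
    """Simulated log-likelihood for a random parameters probit:
    y_i = 1{x_i'(theta + v_i) + e_i > 0}, v_i ~ N(0, Sigma), e_i ~ N(0, 1).
    The integral over v_i is replaced by an average over R pseudo-draws."""
    rng = np.random.default_rng(seed)
    n, k = X.shape
    loglik = 0.0
    for i in range(n):
        v = rng.standard_normal((R, k)) @ Sigma_chol.T   # draws v_i^(r) ~ N(0, Sigma)
        index = (X[i] * (theta + v)).sum(axis=1)         # x_i'(theta + v_i^(r))
        p = norm.cdf(index)                              # Prob(y_i = 1 | v_i^(r))
        g = np.where(y[i] == 1, p, 1.0 - p)              # g(y_i, X_i | v_i^(r), theta)
        loglik += np.log(g.mean())                       # ln of the simulated expectation
    return loglik
```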
Two of our studies are focused specifically on the methodology. Chandra R. Bhat, Cristiano Varin, and Nazneen Ferdous compare the performance of the maximum simulated likelihood (MSL) approach with their proposed composite marginal likelihood (CML) approach in multivariate ordered-response situations. Overall, the simulation results demonstrate the ability of the CML approach to recover the parameters very well in a five- to six-dimensional ordered-response choice model context. In addition, the CML recovers parameters
as well as the MSL estimation approach in the simulation contexts used in the current study, while also doing so at a substantially reduced computational cost. The CML approach appears to be promising for the estimation not only of the multivariate ordered-response model considered here, but also of other analytically intractable econometric models. Tong Zeng and R. Carter Hill in "Pretest Estimation in the Random Parameters Logit Model" examine methods of testing for the presence of heterogeneity in the heterogeneity model. The models are nested; the model without heterogeneity arises if Σ_v = 0 in the template formulation. The authors use Monte Carlo sampling experiments to examine the size and power properties of pretest estimators in the random parameters logit (RPL) model. The pretests are for the presence of random parameters. They study the Lagrange multiplier, likelihood ratio, and Wald tests, using the conditional logit as the restricted model.

Importance sampling is a method of improving on the core problem of using simulation to estimate E[g(v)] with (1/R) Σ_r g(v_r) through a transformation of the random variable. The GHK simulator is an application of the technique. In "Simulated Maximum Likelihood Estimation of Continuous Time Stochastic Volatility Models," Tore Selland Kleppe, Jun Yu, and Hans J. Skaug develop and implement a method for MSL estimation of the continuous time stochastic volatility model with a constant elasticity of volatility. To integrate out latent volatility from the joint density of return and volatility, a modified EIS technique is used after the continuous time model is approximated using the Euler–Maruyama scheme.

We have five applications in the series. In "Education Savings Accounts, Parent Contributions, and Education Attainment," Michael D. S. Morris uses a dynamic structural model of household choices on savings, consumption, fertility, and education spending to perform policy experiments examining the impact of tax-free education savings accounts on parental contributions toward education and the resulting increase in the education attainment of children. The model is estimated via MSL using data from the National Longitudinal Survey of Young Women. Unlike many similarly estimated dynamic choice models, the estimation procedure incorporates a continuous variable probability distribution function. In "Estimating the Effect of Exchange Rate Flexibility on Financial Account Openness," Raul Razo-Garcia considers the estimation of the effect of exchange rate flexibility on financial account openness. Using a panel data set of advanced countries and emerging markets, a trivariate
probit model is estimated via an MSL approach. The estimated coefficients exhibit important differences when exchange rate flexibility is treated as an exogenous regressor relative to the case when it is treated as endogenous. In "Estimating a Fractional Response Model with a Count Endogenous Regressor and an Application to Female Labor Supply," Hoa B. Nguyen proposes M-estimators of parameters and average partial effects in a fractional response model for female labor supply with an endogenous count variable, the number of children, in the presence of time-constant unobserved heterogeneity. To address the endogeneity of the right-hand-side count variable, he uses instrumental variables and a two-step estimation approach. Two methods of estimation are employed: quasi-maximum likelihood (QML) and nonlinear least squares (NLS).

Greene (2003) used Monte Carlo simulation to evaluate directly an integral that appears in the normal-gamma stochastic frontier model. In "Alternative Random Effects Panel Gamma SML Estimation with Heterogeneity in Random and One-Sided Error," Saleem Shaik and Ashok K. Mishra utilize the residual concept of productivity measures, defined in the context of a normal-gamma stochastic frontier production model with heterogeneity, to differentiate productivity and inefficiency measures. Three alternative two-way random effects panel estimators of the normal-gamma stochastic frontier model are proposed using simulated maximum likelihood estimation techniques.

Finally, we have an application of Bayesian MCMC methods by Esmail Amiri, "Modeling and Forecasting Volatility in a Bayesian Approach." The author compares the forecasting performance of five classes of models: ARCH, GARCH, SV, SV-STAR, and MSSV, using daily Tehran Stock Exchange (TSE) data. The results suggest that the models in the fourth and fifth classes perform better than the models in the other classes.
ACKNOWLEDGMENTS I would like to thank Tom Fomby of Southern Methodist University and Carter Hill of Louisiana State University, editors of the Advances in Econometrics Series, for giving me the enjoyable opportunity to host the conference with them and to put together this volume. And, of course, I’d like to thank the authors for their contributions to the volume and for their invaluable help in the editorial process.
REFERENCES

Berry, S., Levinsohn, J., & Pakes, A. (1995). Automobile prices in market equilibrium. Econometrica, 63(4), 841–890.
Börsch-Supan, A., & Hajivassiliou, V. (1990). Smooth unbiased multivariate probability simulators for maximum likelihood estimation of limited dependent variable models. Journal of Econometrics, 58(3), 347–368.
Caves, R., Moffatt, R., & Stock, J. (Eds.). (1994). Symposium on simulation methods in econometrics. Review of Economics and Statistics, 76(4), 591–702.
Geweke, J. (1989). Bayesian inference in econometric models using Monte Carlo integration. Econometrica, 57, 1317–1340.
Gourieroux, C., & Monfort, A. (1996). Simulation-based econometric methods. Oxford: Oxford University Press.
Greene, W. (2003). Simulated maximum likelihood estimation of the normal-gamma stochastic frontier model. Journal of Productivity Analysis, 19, 179–190.
Greene, W. (2008). Econometric analysis (6th ed.). Englewood Cliffs, NJ: Pearson Prentice Hall.
Hajivassiliou, V. (1990). Smooth simulation estimation of panel data LDV models. New Haven, CT: Department of Economics, Yale University.
Keane, M. (1994). A computationally practical simulation estimator for panel data. Econometrica, 62(1), 95–116.
Manski, C., & Lerman, S. (1977). The estimation of choice probabilities from choice based samples. Econometrica, 45, 1977–1988.
McFadden, D., & Ruud, P. (1994). Estimation by simulation. Review of Economics and Statistics, 76, 591–608.
McFadden, D., & Train, K. (2000). Mixed multinomial logit models for discrete response. Journal of Applied Econometrics, 15, 447–470.
Riphahn, R., Wambach, A., & Million, A. (2003). Incentive effects in the demand for health care: A bivariate panel count data estimation. Journal of Applied Econometrics, 18(4), 387–405.
Train, K. (2003). Discrete choice methods with simulation. Cambridge: Cambridge University Press.
William Greene
Editor
ADVANCES IN ECONOMETRICS
Series Editors: Thomas B. Fomby, R. Carter Hill and Ivan Jeliazkov

Recent Volumes:

Volume 20A: Econometric Analysis of Financial and Economic Time Series, Edited by Dek Terrell and Thomas B. Fomby
Volume 20B: Econometric Analysis of Financial and Economic Time Series, Edited by Dek Terrell and Thomas B. Fomby
Volume 21: Modelling and Evaluating Treatment Effects in Econometrics, Edited by Daniel L. Millimet, Jeffrey A. Smith and Edward Vytlacil
Volume 22: Econometrics and Risk Management, Edited by Jean-Pierre Fouque, Thomas B. Fomby and Knut Solna
Volume 23: Bayesian Econometrics, Edited by Siddhartha Chib, Gary Koop, Bill Griffiths and Dek Terrell
Volume 24: Measurement Error: Consequences, Applications and Solutions, Edited by Jane Binner, David Edgerton and Thomas Elger
Volume 25: Nonparametric Econometric Methods, Edited by Qi Li and Jeffrey S. Racine
ADVANCES IN ECONOMETRICS VOLUME 26
MAXIMUM SIMULATED LIKELIHOOD METHODS AND APPLICATIONS

EDITED BY
WILLIAM GREENE Stern School of Business, New York University
R. CARTER HILL Department of Economics, Louisiana State University
United Kingdom – North America – Japan – India – Malaysia – China
LIST OF CONTRIBUTORS

Esmail Amiri, Department of Statistics, Imam Komeini International University, Ghazvin, Iran
Chandra R. Bhat, Department of Civil, Architectural and Environmental Engineering, University of Texas, Austin, Texas, USA
Nazneen Ferdous, Department of Civil, Architectural and Environmental Engineering, University of Texas, Austin, Texas, USA
Florian Heiss, Department of Business and Economics, University of Mainz, Germany
R. Carter Hill, Department of Economics, Louisiana State University, LA, USA
Ivan Jeliazkov, Department of Economics, University of California, Irvine, CA, USA
Tore Selland Kleppe, Department of Mathematics, University of Bergen, Norway
Esther Hee Lee, IHS, EViews, Irvine, CA, USA
Ashok K. Mishra, Department of Agricultural Economics and Agribusiness, Louisiana State University, LA, USA
Michael D. S. Morris, Department of Economics and Legal Studies in Business, Oklahoma State University, OK, USA
Hoa B. Nguyen, Department of Economics, Michigan State University, MI, USA
Raul Razo-Garcia, Department of Economics, Carleton University, Ottawa, Ontario, Canada
Saleem Shaik, Department of Agribusiness and Applied Economics, North Dakota State University, ND, USA
H. J. Skaug, Department of Mathematics, University of Bergen, Norway
Cristiano Varin, Department of Statistics, Ca' Foscari University, Venice, Italy
Jun Yu, School of Economics, Singapore Management University, Singapore
Tong Zeng, Department of Economics, Louisiana State University, LA, USA
Emerald Group Publishing Limited
Howard House, Wagon Lane, Bingley BD16 1WA, UK

First edition 2010

Copyright © 2010 Emerald Group Publishing Limited

Reprints and permission service
Contact:
[email protected] No part of this book may be reproduced, stored in a retrieval system, transmitted in any form or by any means electronic, mechanical, photocopying, recording or otherwise without either the prior written permission of the publisher or a licence permitting restricted copying issued in the UK by The Copyright Licensing Agency and in the USA by The Copyright Clearance Center. No responsibility is accepted for the accuracy of information contained in the text, illustrations or advertisements. The opinions expressed in these chapters are not necessarily those of the Editor or the publisher. British Library Cataloguing in Publication Data A catalogue record for this book is available from the British Library ISBN: 978-0-85724-149-8 ISSN: 0731-9053 (Series)
Emerald Group Publishing Limited, Howard House, Environmental Management System has been certified by ISOQAR to ISO 14001:2004 standards Awarded in recognition of Emerald’s production department’s adherence to quality systems and processes when preparing scholarly journals for print
MCMC PERSPECTIVES ON SIMULATED LIKELIHOOD ESTIMATION

Ivan Jeliazkov and Esther Hee Lee

ABSTRACT

A major stumbling block in multivariate discrete data analysis is the problem of evaluating the outcome probabilities that enter the likelihood function. Calculation of these probabilities involves high-dimensional integration, making simulation methods indispensable in both Bayesian and frequentist estimation and model choice. We review several existing probability estimators and then show that a broader perspective on the simulation problem can be afforded by interpreting the outcome probabilities through Bayes' theorem, leading to the recognition that estimation can alternatively be handled by methods for marginal likelihood computation based on the output of Markov chain Monte Carlo (MCMC) algorithms. These techniques offer stand-alone approaches to simulated likelihood estimation but can also be integrated with traditional estimators. Building on both branches in the literature, we develop new methods for estimating response probabilities and propose an adaptive sampler for producing high-quality draws from multivariate truncated normal distributions. A simulation study illustrates the practical benefits and costs associated with each approach. The methods are employed to
estimate the likelihood function of a correlated random effects panel data model of women’s labor force participation.
1. INTRODUCTION

Limited dependent variable models deal with binary, multivariate, multinomial, ordinal, or censored outcomes that can arise in cross-sectional, time-series, or longitudinal (panel data) settings. To enable inference in this class of models, however, one must address a central problem in multivariate discrete data analysis, namely, evaluation of the outcome probability for each observation. Outcome probabilities are required in constructing the likelihood function and involve multivariate integration constrained to specific regions that correspond to the observed data. To illustrate the main ideas in some detail, consider the latent variable representation

z_i = X_i β + ε_i,   ε_i ~ N(0, Ω)    (1)
where, for i = 1, …, n, z_i = (z_i1, …, z_iJ)′ is a vector of continuous latent variables underlying the discrete observations y_i = (y_i1, …, y_iJ)′, X_i is a J × k matrix of covariates with corresponding k-vector of parameters β, and Ω is a J × J covariance matrix in which the variances of any binary or ordinal variables y_ij are typically set to 1 for identification reasons. This latent variable framework is a general probabilistic construct in which different threshold-crossing mappings from z_i to the observed responses y_i can produce various classes of discrete data models, such as the multivariate probit for binary and ordinal data, multinomial probit, panels of binary, ordinal, or censored (Tobit) outcomes, models with incidental truncation or endogenous treatment indicators, and Gaussian copula models. For example, the indicator function mapping y_ij = 1{z_ij > 0} underlies binary data models, the relationship y_ij = 1{z_ij > 0} z_ij leads to a Tobit model with censoring from below at 0, the discretization y_ij = Σ_{s=1}^S 1{z_ij > γ_{j,s}} for some strictly increasing sequence of cutpoint parameters {γ_{j,s}}_{s=1}^S arises in ordinal data modeling and copula models for count data, and so on. Variations on the distributional assumptions can be used to construct mixtures or scale mixtures of normals models, including the Student's t-link ("robit") and logit models. In economics, the latent z_i are interpreted as unobserved utility differences (relative to a baseline category), and discrete data models are often referred to as discrete choice models.
A representative example that will form the basis for discussion in the remainder of this chapter is the multivariate probit model where the binary outcomes in y_i relate to the latent z_i in Eq. (1) through the indicator functions y_ij = 1{z_ij > 0} for j = 1, …, J. In this context, the object of interest is the probability of observing y_i, conditionally on β and Ω, which is given by

Pr(y_i | β, Ω) = ∫_{B_i1} ⋯ ∫_{B_iJ} f_N(z_i | X_i β, Ω) dz_i1 ⋯ dz_iJ
             = ∫ 1{z_i ∈ B_i} f_N(z_i | X_i β, Ω) dz_i    (2)

where f_N(z_i | X_i β, Ω) is the normal density with mean X_i β and covariance matrix Ω (which is in correlation form), and the region of integration is given by B_i = B_i1 × B_i2 × ⋯ × B_iJ with

B_ij = (−∞, 0] if y_ij = 0,   B_ij = (0, ∞) if y_ij = 1

The log-likelihood function is given by ln f(y | β, Ω) = Σ_{i=1}^n ln Pr(y_i | β, Ω); however, a major stumbling block in evaluating that function is that the multivariate integrals defining the likelihood contributions in Eq. (2) typically have no closed-form solution, but need to be evaluated at various values of β and Ω for the purposes of estimation (e.g., in maximization algorithms) and model comparison (e.g., in evaluating likelihood ratio statistics, information criteria, Bayes factors, and marginal likelihoods). Standard grid-based numerical approximations (e.g., Gauss–Legendre or quadrature methods) exist for univariate and bivariate problems, but the computational costs associated with these approaches rise exponentially with dimensionality, which makes them prohibitively expensive in higher dimensions. While in many instances the computational intensity of numerical integration can be moderated by sparse-grid approximations as in Heiss and Winschel (2008), the most widely used approaches for obtaining Eq. (2) in discrete data analysis have been based on simulation. Such methods exploit a number of practical advantages that make them particularly appealing. For example, simulation methods typically rely on standard distributions, which makes them conceptually and computationally straightforward and efficient, even in high dimensions. Moreover, simulation often resolves the problem of having to specify the location and size of a grid so that it corresponds to areas of high density. This is especially useful because knowledge of these features is often absent, especially in
high-dimensional problems. For these reasons, simulation methods have become a fundamental tool in multivariate integration in general and in simulated likelihood estimation in particular. One popular approach for simulation-based evaluation of the outcome probabilities in discrete choice models is the Geweke, Hajivassiliou, and Keane (GHK) method (Geweke, 1991; Börsch-Supan & Hajivassiliou, 1993; Keane, 1994; Hajivassiliou & McFadden, 1998). Another one is studied by Stern (1992). These methods have risen to prominence because they are efficient and offer continuous and differentiable choice probabilities that are strictly bounded between 0 and 1, making them very suitable for maximum likelihood estimation and other problems that require gradient or Hessian evaluation. Other methods, such as the accept–reject (AR) simulator and its variants, are appealing because of their transparency and simplicity. Many of these techniques, together with other useful alternatives, have been carefully reviewed in Hajivassiliou and Ruud (1994), Stern (1997), and Train (2003). In this chapter we pursue several objectives. Our first main goal is to show that the probability of the observed response, given the model parameters, can be estimated consistently and very efficiently by a set of alternative techniques that have been applied in a very different context. In particular, the calculation of integrals which have no closed-form solution has been a central issue in Bayesian model comparison. The marginal likelihood, which is given by the integral of the likelihood function with respect to the prior distribution of the model parameters, is an important ingredient in producing Bayes factors and posterior odds of competing models. A large number of Markov chain Monte Carlo (MCMC) methods have been introduced to calculate marginal likelihoods, Bayes factors, and posterior odds (e.g., Ritter & Tanner, 1992; Newton & Raftery, 1994; Gelfand & Dey, 1994; Chib, 1995; Meng & Wong, 1996; DiCiccio, Kass, Raftery, & Wasserman, 1997; Geweke, 1999; Chib & Jeliazkov, 2001, 2005), but these methods have not yet been employed to estimate response probabilities and construct likelihood functions for discrete data models even though MCMC data augmentation techniques have been routinely used to obtain parameter estimates without computing those probabilities (see, e.g., Koop, 2003; Greenberg, 2008, and the references therein). A recent comparison of Bayesian and classical inferences in probit models is offered in Griffiths, Hill, and O'Donnell (2006). Given the specifics of the current context, in this chapter, we first focus on MCMC estimation techniques that embody desirable characteristics such as continuity and differentiability, but mention that the other approaches can be very useful as well. Second, we design
several new estimation methods by integrating two branches of the literature and combining features of the classical and Bayesian methods. This allows for several enhancements in the resulting ‘‘hybrid’’ approaches that tend to improve the quality of the simulated latent data sample, the efficiency of the resulting estimates, and retain simplicity without sacrificing continuity and differentiability. Our third goal is to provide a comparison and document the performance of the alternative methods in a detailed simulation study that highlights the practical costs and benefits associated with each approach. Finally, we present an application to the problem of estimating the likelihood ordinate for a correlated random effects panel data model of women’s labor force participation, which illustrates the applicability of the proposed techniques. The rest of this chapter is organized as follows. In Section 2, we review several traditional simulation methods that have been used to estimate the response probabilities in simulated likelihood estimation. A number of alternative MCMC approaches are discussed in Section 3. Building on existing work, we introduce new approaches for estimating outcome probabilities that are obtained by integrating features of the Bayesian and traditional techniques. Section 4 provides evidence on the relative performance of these simulation methods, while Section 5 applies the techniques to evaluate the likelihood function of a correlated random effects panel data model using data on women’s labor force participation. Concluding remarks are presented in Section 6.
2. EXISTING METHODS

We begin with a brief review of the basic idea behind the AR, or frequency, method, which is perhaps the most straightforward approach for estimating the probability in Eq. (2). The AR method draws independent identically distributed (iid) random variables z_i^(g) ~ N(X_i β, Ω) for g = 1, …, G. Draws that satisfy z_i^(g) ∈ B_i are accepted, whereas those that do not are rejected. The probability in Eq. (2) is then calculated as the proportion of accepted draws,

P̂r(y_i | β, Ω) = (1/G) Σ_{g=1}^G 1{z_i^(g) ∈ B_i}    (3)
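In code, the frequency calculation in Eq. (3) reduces to a few lines; the sketch below uses NumPy's multivariate normal generator, and the function and variable names are our own illustrative choices.

```python
import numpy as np

def ar_probability(y_i, X_i, beta, Omega, G=10000, seed=0):
    """Accept-reject (frequency) estimate of Pr(y_i | beta, Omega) in Eq. (3):
    draw z_i^(g) ~ N(X_i beta, Omega) and count the share that falls in B_i."""
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(X_i @ beta, Omega, size=G)   # G x J draws
    in_B = np.all((z > 0) == (y_i == 1), axis=1)             # z_ij > 0 exactly when y_ij = 1
    return in_B.mean()
```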
The AR approach is very simple and intuitive and is easy to implement with a variety of distributions for the random terms. This estimator has been
applied in discrete choice problems by Lerman and Manski (1981); additional discussion and applications of AR methods are offered in Devroye (1986) and Ripley (1987). However, for a given finite number of draws, the AR approach has a number of pitfalls, especially when used in the context of likelihood estimation. One is that the estimated probability is not strictly bounded between 0 and 1, and there is a positive probability of obtaining an estimate on the boundary, which can cause numerical problems when taking the logarithm of the estimated probability. The more important problem with the AR method, however, is the lack of differentiability of the estimated probability with respect to the parameter vector. Because the AR probability in Eq. (3) has the form of a step function with respect to the parameters, the simulated probability is either constant or jumps by a discrete amount with respect to a small change in the parameter values. These features of the estimator impede its use in numerical optimization and complicate the asymptotics of estimators that rely on it.

The difficulties of the AR method can be circumvented by replacing the indicator function 1{z_i ∈ B_i} in Eq. (2) with a smooth and strictly positive function. One strategy, suggested in McFadden (1989), is to approximate the orthant probability as

Pr(y_i | β, Ω) ≈ ∫ K(z_i / b) f_N(z_i | X_i β, Ω) dz_i    (4)

where K(·) is a smooth kernel function, for example the logistic cumulative distribution function (cdf), and b ≠ 0 is a scale factor that determines the degree of smoothing. It can easily be seen that the function K(z_i / b) approaches 1{z_i ∈ B_i} as b → 0. This approach avoids the problem of nondifferentiability; however, it comes at the cost of introducing a bias in the estimate of the probability (Hajivassiliou, McFadden, & Ruud, 1996). Whereas the bias can be reduced by picking a value of b that is very close to 0, doing so can potentially revive the problem of nondifferentiability if K(·) begins to approximate the indicator function 1{·} too closely. In practice, therefore, the choice of b is not straightforward and must be done very carefully. Other complications, for example, appropriate kernel selection, could arise with this approach when B_i is bounded from both below and above, as in ordinal probit and copula models.
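A sketch of the smoothed estimator in Eq. (4) follows, using a logistic cdf for K(·); the product form across the J margins and the default scale factor are our own illustrative choices rather than prescriptions from McFadden (1989).

```python
import numpy as np

def smoothed_ar_probability(y_i, X_i, beta, Omega, b=0.1, G=10000, seed=0):
    """Kernel-smoothed frequency estimate in the spirit of Eq. (4): the indicator
    1{z_i in B_i} is replaced by a product of logistic cdfs K(s / b), which
    approaches the indicator as the scale factor b goes to 0."""
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(X_i @ beta, Omega, size=G)   # G x J draws
    sign = np.where(y_i == 1, 1.0, -1.0)                     # orient each margin toward B_ij
    smooth = 1.0 / (1.0 + np.exp(-(sign * z) / b))           # smooth per-margin 'indicator'
    return smooth.prod(axis=1).mean()
```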
Another strategy for overcoming the difficulties of estimating Eq. (2) was developed by Stern (1992) and relies on a particular decomposition of the correlation structure in Eq. (1). The basic idea underlying the Stern method is to decompose the error component ε_i in Eq. (1) into the sum of two terms – one that is correlated and another one that contains orthogonal errors. In particular, the Stern simulator is based on rewriting the model as

z_i = v_i + w_i

where v_i ~ N(X_i β, Ω − Λ) and w_i ~ N(0, Λ) with Λ = λI. Note that the mean X_i β can be incorporated either in v_i or w_i, or in the limits of integration (these representations are equivalent). Moreover, as a matter of simulation efficiency, Stern (1992) suggests that λ should be chosen as large as possible subject to leaving (Ω − Λ) positive definite. This is done by setting λ close to the smallest eigenvalue of Ω. With this decomposition, the likelihood contribution in Eq. (2) can be rewritten as

Pr(y_i | β, Ω) = ∫_{B_i} f_N(z_i | X_i β, Ω) dz_i
             = ∫_{−∞}^{+∞} ∫_{C_i} f_N(w_i | 0, Λ) f_N(v_i | X_i β, Ω − Λ) dw_i dv_i

where the change of variable implies that C_i = C_i1 × ⋯ × C_iJ with C_ij = (−∞, −v_ij) if y_ij = 0 and C_ij = [−v_ij, ∞) if y_ij = 1. Because the independent elements of w_i have a Gaussian density, which is symmetric, this probability can be expressed as

Pr(y_i | β, Ω) = ∫ [ ∏_{j=1}^J Φ( (−1)^{1−y_ij} v_ij / √λ ) ] f_N(v_i | X_i β, Ω − Λ) dv_i

where Φ(·) denotes the standard normal cdf. Estimation of this integral then proceeds by

P̂r(y_i | β, Ω) = (1/G) Σ_{g=1}^G { ∏_{j=1}^J Φ( (−1)^{1−y_ij} v_ij^(g) / √λ ) }

where v_i^(g) ~ N(v_i | X_i β, Ω − Λ) for g = 1, …, G.
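A sketch of the Stern simulator is given below; setting λ to a fraction of the smallest eigenvalue of Ω is one simple way to keep Ω − Λ positive definite, and the function names are ours.

```python
import numpy as np
from scipy.stats import norm

def stern_probability(y_i, X_i, beta, Omega, G=10000, seed=0):
    """Stern (1992) simulator: z_i = v_i + w_i with v_i ~ N(X_i beta, Omega - lam*I)
    and w_i ~ N(0, lam*I), so Pr(y_i | beta, Omega) is an average of products of
    univariate normal cdfs evaluated at the simulated v_i."""
    rng = np.random.default_rng(seed)
    lam = 0.99 * np.linalg.eigvalsh(Omega).min()            # keeps Omega - lam*I pos. definite
    v = rng.multivariate_normal(X_i @ beta, Omega - lam * np.eye(len(y_i)), size=G)
    sign = np.where(y_i == 1, 1.0, -1.0)                    # (-1)^(1 - y_ij)
    return norm.cdf(sign * v / np.sqrt(lam)).prod(axis=1).mean()
```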
Another popular method is the GHK algorithm, which builds upon simulation techniques for multivariate truncated normal distributions that were pioneered by Geweke (1991) and has been successfully implemented in a variety of problems in cross-sectional, time series, and panel data settings. The GHK algorithm has been extensively studied in Börsch-Supan and Hajivassiliou (1993), Hajivassiliou and Ruud (1994), Keane (1994), Hajivassiliou et al. (1996), and Hajivassiliou and McFadden (1998) and has been carefully reviewed in Train (2003). The insight behind the GHK algorithm is that one can design a tractable importance density that could facilitate simulation-based estimation by writing the model as

z_i = X_i β + L η_i,   η_i ~ N(0, I)    (5)
where L is a lower triangular Cholesky factor of Ω with elements l_ij such that LL′ = Ω. Because the entries in η_i are independent and L is lower triangular, a recursive relation between the elements of z_i can be established to produce the importance density used in the GHK algorithm

h(z_i | y_i, β, Ω) = h(z_i1 | y_i1, β, Ω) h(z_i2 | z_i1, y_i2, β, Ω) ⋯ h(z_iJ | z_i1, …, z_i,J−1, y_iJ, β, Ω)
                  = ∏_{j=1}^J h(z_ij | {z_ik}_{k<j}, y_ij, β, Ω)    (6)

and the terms in the product are restricted to the set B_i by letting

h(z_ij | {z_ik}_{k<j}, y_ij, β, Ω) = f_{TN_{B_ij}}( z_ij | x_ij′ β + Σ_{k=1}^{j−1} l_jk η_ik, l_jj² )
                                  = 1{z_ij ∈ B_ij} f_N( z_ij | x_ij′ β + Σ_{k=1}^{j−1} l_jk η_ik, l_jj² ) / c_ij

where c_ij = Φ( (−1)^{(1−y_ij)} ( x_ij′ β + Σ_{k=1}^{j−1} l_jk η_ik ) / l_jj ) is the normalizing constant of the truncated normal density f_{TN_{B_ij}}( z_ij | x_ij′ β + Σ_{k=1}^{j−1} l_jk η_ik, l_jj² ). As a result, taking the product in Eq. (6) produces

h(z_i | y_i, β, Ω) = ∏_{j=1}^J 1{z_ij ∈ B_ij} f_N( z_ij | x_ij′ β + Σ_{k=1}^{j−1} l_jk η_ik, l_jj² ) / ∏_{j=1}^J c_ij
                  = 1{z_i ∈ B_i} f_N(z_i | X_i β, Ω) / ∏_{j=1}^J c_ij
upon which one could write Eq. (2) as

Pr(y_i | β, Ω) = ∫_{B_i} f_N(z_i | X_i β, Ω) dz_i
             = ∫_{B_i} [ f_N(z_i | X_i β, Ω) / h(z_i | y_i, β, Ω) ] h(z_i | y_i, β, Ω) dz_i
             = ∫_{B_i} [ f_N(z_i | X_i β, Ω) / ( f_N(z_i | X_i β, Ω) / ∏_{j=1}^J c_ij ) ] h(z_i | y_i, β, Ω) dz_i
             = ∫_{B_i} { ∏_{j=1}^J c_ij } h(z_i | y_i, β, Ω) dz_i    (7)

Therefore, Pr(y_i | β, Ω) can be estimated as

P̂r(y_i | β, Ω) = (1/G) Σ_{g=1}^G ∏_{j=1}^J c_ij^(g)

with draws z_ij^(g) obtained recursively as z_ij^(g) ~ h(z_ij | {z_ik^(g)}_{k<j}, y_ij, β, Ω) for j = 1, …, J − 1, and g = 1, …, G, using techniques such as the inverse cdf method (see, e.g., Devroye, 1986) or simulation-based techniques such as those proposed in Robert (1995).
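The following sketch implements the GHK recursions in Eqs. (5)–(7), drawing the truncated normals by the inverse cdf method mentioned above; the clipping of the uniforms and all names are implementation choices of ours.

```python
import numpy as np
from scipy.stats import norm

def ghk_probability(y_i, X_i, beta, Omega, G=10000, seed=0):
    """GHK estimate of Pr(y_i | beta, Omega): recursively draw truncated normals
    from the importance density in Eq. (6) and average the products of the
    normalizing constants c_ij, as in Eq. (7)."""
    rng = np.random.default_rng(seed)
    J = len(y_i)
    L = np.linalg.cholesky(Omega)
    mu = X_i @ beta
    sign = np.where(y_i == 1, 1.0, -1.0)              # (-1)^(1 - y_ij)
    eta = np.zeros((G, J))
    prob = np.ones(G)
    for j in range(J):
        m = mu[j] + eta[:, :j] @ L[j, :j]             # x_ij'beta + sum_{k<j} l_jk eta_ik
        prob *= norm.cdf(sign[j] * m / L[j, j])       # c_ij
        # inverse-cdf draw of eta_ij so that z_ij = m + l_jj * eta_ij lands in B_ij
        u = rng.uniform(size=G)
        cut = norm.cdf(-m / L[j, j])                  # Pr(eta_ij <= -m / l_jj)
        u_trunc = cut + u * (1.0 - cut) if y_i[j] == 1 else u * cut
        eta[:, j] = norm.ppf(np.clip(u_trunc, 1e-12, 1.0 - 1e-12))
    return prob.mean()
```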
Both the Stern and GHK methods provide continuous and differentiable multivariate probability estimates. They also typically produce smaller estimation variability than the AR method because the simulated probabilities are strictly bounded between 0 and 1, whereas each draw in the AR method gives either 0 or 1. However, all three methods suffer from a common problem that can often produce difficulties. In particular, in all three approaches, the simulation draws come from proposal distributions that differ from the truncated normal distribution of interest, TN_{B_i}(X_i β, Ω). When this disparity is large, the efficiency of all methods can be adversely affected. For example, it is easy to recognize that the AR method provides a sample from the unrestricted normal distribution N(X_i β, Ω), the Stern method generates draws from the normal distribution N(X_i β, Ω − Λ), while GHK simulation relies on the recursive importance density in Eq. (6) in which draws depend only on the restrictions implied by y_ij but ignore the restrictions implied by subsequent {y_ik}_{k>j}. These mismatches between the proposal and target densities may adversely affect the efficiency of the AR, GHK, and Stern methods. We next introduce a class of simulated likelihood methods which are, in fact, based on draws from the truncated density of interest.
3. MCMC METHODS

The calculation of multivariate integrals that generally have no analytical solution has been an important research area in Bayesian statistics. In particular, a key quantity of interest in Bayesian model comparison is the marginal likelihood, which is obtained by integrating the likelihood function with respect to the prior distribution of the parameters (for a discussion, see Kass & Raftery, 1995, and the references therein). It is one of the basic goals of this chapter to link the simulated likelihood literature with that on Bayesian model choice in order to introduce MCMC methods as new and viable approaches to simulated likelihood estimation in discrete data analysis. Another goal is to develop new MCMC methods that are specifically tailored to simulated likelihood estimation. Our third goal is to provide an efficient simulation method for sampling z_i ~ TN_{B_i}(X_i β, Ω), which is particularly important in this class of models but also has broad ramifications beyond simulated likelihood estimation. These goals are pursued in the remainder of this section.
3.1. The CRB Method

To see the common fundamentals between outcome probability estimation and Bayesian model choice, and to establish the framework for the estimation methods that will be discussed subsequently, we begin by rewriting the expression for Pr(y_i | β, Ω). In particular, note that we can write the probability in Eq. (2) as

Pr(y_i | β, Ω) = ∫ 1{z_i ∈ B_i} f_N(z_i | X_i β, Ω) dz_i = 1{z_i ∈ B_i} f_N(z_i | X_i β, Ω) / f_TN(z_i | X_i β, Ω)    (8)

which can be interpreted in terms of Bayes formula based on the recognition that the indicator function 1{z_i ∈ B_i} actually gives Pr(y_i | z_i) and hence can be treated as a "likelihood," f_N(z_i | X_i β, Ω) can be treated as a "prior" because it does not respect the truncation implied by y_i, and f_TN(z_i | X_i β, Ω) can be viewed as a "posterior" that accounts for the truncation constraints reflected in y_i. Thus, we can see that Pr(y_i | β, Ω) can actually be viewed as a "marginal likelihood," that is, the normalizing constant of the "posterior" f_TN(z_i | X_i β, Ω). Even though the interpretation of Pr(y_i | β, Ω) as the normalizing constant of a truncated normal distribution is directly visible from Eq. (2), its reinterpretation in terms of the quantities in Eq. (8) is useful for developing empirical strategies for its estimation. In fact, the equivalent
of Eq. (8) was used in Chib (1995) in developing his method for marginal likelihood estimation. This identity is particularly useful because, as discussed in Chib (1995), it holds for any value of z_i ∈ B_i, and therefore the calculation is reduced to finding the estimate of the ordinate f_TN(z_i | X_i β, Ω) at a single point z_i* ∈ B_i. In our implementation, an estimate of the log-probability is obtained as

ln P̂r(y_i | β, Ω) = ln f_N(z_i* | X_i β, Ω) − ln f̂_TN(z_i* | X_i β, Ω)    (9)
where we take z_i* to be the sample mean of the MCMC draws z_i^(g) ~ TN_{B_i}(X_i β, Ω), g = 1, …, G, and make use of the fact that the numerator quantities 1{z_i* ∈ B_i} and f_N(z_i* | X_i β, Ω) in Eq. (8) are directly available. Draws z_i^(g) ~ TN_{B_i}(X_i β, Ω) can be produced by employing the Gibbs sampling algorithm of Geweke (1991) in which a new value for z_i is generated by iteratively simulating each element z_ij from its full-conditional density z_ij ~ f(z_ij | {z_ik}_{k≠j}, y_ij, β, Ω) = TN_{B_ij}(m_ij, σ_ij²) for j = 1, …, J, where m_ij and σ_ij² are the conditional mean and variance of z_ij given {z_ik}_{k≠j}, which are obtained by the usual updating formulas for a Gaussian density. Note that unlike the aforementioned importance sampling methods, a Gibbs sampler constructed in this way produces draws from the exact truncated normal distribution of interest and those draws will be used to estimate f_TN(z_i* | X_i β, Ω), thereby leading to an estimate of Pr(y_i | β, Ω). To estimate the ordinate f(z_i* | y_i, β, Ω) = f_TN(z_i* | X_i β, Ω), the joint density is decomposed by the law of total probability as

f(z_i* | y_i, β, Ω) = ∏_{j=1}^J f(z_ij* | y_i, {z_ik*}_{k<j}, β, Ω)
In the context of Gibbs sampling, when the full-conditional densities are fully known, Chib (1995) proposed finding the ordinates f(z_ij* | y_i, {z_ik*}_{k<j}, β, Ω) for 1 < j < J by Rao-Blackwellization (Tanner & Wong, 1987; Gelfand & Smith, 1990) in which the terms in the decomposition are represented by

f(z_ij* | y_i, {z_ik*}_{k<j}, β, Ω) = ∫ f(z_ij* | y_i, {z_ik*}_{k<j}, {z_ik}_{k>j}, β, Ω) f({z_ik}_{k>j} | y_i, {z_ik*}_{k<j}, β, Ω) d{z_ik}_{k>j}

and estimated as

f̂(z_ij* | y_i, {z_ik*}_{k<j}, β, Ω) = G^{−1} Σ_{g=1}^G f(z_ij* | y_i, {z_ik*}_{k<j}, {z_ik^(g)}_{k>j}, β, Ω)
where the draws {z_ik^(g)}_{k>j} come from a reduced run in which the latent variables {z_ik*}_{k<j} are fixed and sampling is over {z_ik}_{k≥j} ~ f({z_ik}_{k≥j} | y_i, {z_ik*}_{k<j}, β, Ω). Excluding z_ij^(g) from {z_ik^(g)}_{k≥j} yields draws {z_ik^(g)}_{k>j} ~ f({z_ik}_{k>j} | y_i, {z_ik*}_{k<j}, β, Ω) that are required in the average. The ordinate f(z_i1* | y_i, β, Ω) is estimated with draws from the main MCMC run, while the ordinate f(z_iJ* | y_i, {z_ik*}_{k<J}, β, Ω) is available directly, and hence the method requires (J − 2) reduced MCMC runs. An advantage of this approach is that it breaks a large-dimensional problem into a set of smaller and more manageable steps and, at the cost of additional MCMC simulation, typically leads to very efficient estimates in many practical problems. In the remainder of this chapter, we will refer to Chib's method with Rao-Blackwellization as the CRB method. This method provides a direct application of existing MCMC techniques (Chib, 1995) to simulated likelihood estimation and forms an important benchmark case against which other MCMC methods can be compared. Moreover, the CRB method provides continuous and differentiable probability estimates in the context of estimating Eq. (2), which distinguishes it from the other MCMC methods referenced in Section 1. It will also form a basis for the new estimators that will be developed in the remainder of this section.
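The engine behind the CRB calculation (and the CRT and ASK methods below) is the Geweke (1991) Gibbs sampler for z_i ~ TN_{B_i}(X_i β, Ω). A minimal sketch follows; the conditional moments are the standard Gaussian updating formulas, while the burn-in length, starting value, and function name are illustrative choices of ours.

```python
import numpy as np
from scipy.stats import norm

def gibbs_truncated_mvn(y_i, X_i, beta, Omega, G=5000, burn=500, seed=0):
    """Gibbs sampler for z_i ~ TN_{B_i}(X_i beta, Omega), where B_ij = (0, inf)
    if y_ij = 1 and (-inf, 0] if y_ij = 0.  Each z_ij is drawn from its univariate
    truncated normal full conditional by the inverse cdf method."""
    rng = np.random.default_rng(seed)
    J = len(y_i)
    mu = X_i @ beta
    Prec = np.linalg.inv(Omega)                       # precision matrix of z_i
    z = np.where(y_i == 1, 0.5, -0.5).astype(float)   # feasible starting value
    draws = np.empty((G, J))
    for g in range(G + burn):
        for j in range(J):
            mask = np.arange(J) != j
            # conditional moments of z_ij given z_{i,-j} (standard Gaussian updates)
            s2 = 1.0 / Prec[j, j]
            m = mu[j] - s2 * Prec[j, mask] @ (z[mask] - mu[mask])
            # inverse-cdf draw restricted to B_ij
            a = norm.cdf(-m / np.sqrt(s2))            # Pr(z_ij <= 0 | rest)
            u = rng.uniform()
            p = a + u * (1.0 - a) if y_i[j] == 1 else u * a
            z[j] = m + np.sqrt(s2) * norm.ppf(np.clip(p, 1e-12, 1.0 - 1e-12))
        if g >= burn:
            draws[g - burn] = z.copy()
    return draws
```

Given such draws, the CRB estimate follows Eq. (9), with f_TN(z_i* | X_i β, Ω) obtained by Rao-Blackwellization from the reduced runs described above.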
3.2. The CRT Method

Our first extension aims to address a potential drawback of Rao-Blackwellization, namely the cost of the additional reduced MCMC runs that it requires. For this reason, we examine a different way of obtaining the estimate f̂_TN(z_i* | X_i β, Ω) that is required in Eq. (8) or (9). An approach to density estimation which is based on the Gibbs transition kernel and does not entail reduced runs is discussed in Ritter and Tanner (1992). In particular, the Gibbs transition kernel for moving from z_i to z_i* is given by the product of well-known univariate truncated normal full-conditional densities

K(z_i, z_i* | y_i, β, Ω) = ∏_{j=1}^J f(z_ij* | y_i, {z_ik*}_{k<j}, {z_ik}_{k>j}, β, Ω)    (10)
Because the full-conditional densities are the fundamental building blocks of the Gibbs sampler, the additional coding involved in evaluating Eq. (10) is minimized. By virtue of the fact that the Gibbs sampler satisfies Markov chain invariance (see, e.g., Tierney, 1994; Chib & Greenberg, 1996),
we have that

f_{TN_{B_i}}(z_i* | X_i β, Ω) = ∫ K(z_i, z_i* | y_i, β, Ω) f_{TN_{B_i}}(z_i | X_i β, Ω) dz_i    (11)
which was exploited for density estimation in Ritter and Tanner (1992). Therefore, an estimate of the denominator in Eq. (8) can be obtained by invoking Eq. (11) and averaging the transition kernel K(z_i, z_i* | y_i, β, Ω) with respect to draws from the truncated normal distribution z_i^(g) ~ TN_{B_i}(X_i β, Ω), that is,

f̂_{TN_{B_i}}(z_i* | X_i β, Ω) = (1/G) Σ_{g=1}^G K(z_i^(g), z_i* | y_i, β, Ω)    (12)
As in the CRB method, the random draws z_i^(g) required in the average are generated by a Gibbs sampler that iteratively simulates each element z_ij from its full-conditional distribution z_ij ~ f(z_ij | {z_ik}_{k≠j}, β, Ω) for j = 1, …, J. Because this method combines the marginal likelihood estimation approach of Chib (1995) with the density ordinate estimation approach of Ritter and Tanner (1992), it will be referred to as the CRT method in the remainder of this chapter. Several remarks about the CRT method and its relationship to CRB can be made. First, because the CRT and CRB methods are continuous and differentiable, they are applicable in maximum likelihood estimation and other problems that require differentiation. Second, in contrast to CRB, CRT does not require reduced run simulation as all ordinates are estimated with draws from the main MCMC run. However, CRT may require storage for the latent variables {z_i^(g)}, because the point z_i*, typically taken to be the mean of f_{TN_{B_i}}(z_i | X_i β, Ω), may not be available during the main MCMC run, thus preventing concurrent evaluation of K(z_i^(g), z_i* | y_i, β, Ω). If storage is a problem, then estimation can involve some limited amount of precomputation such as a short MCMC run to determine z_i* for subsequent evaluation of the Gibbs kernel. Note, however, that such a problem rarely presents itself in Bayesian studies where z_i* may be readily available from MCMC runs conducted during the estimation of β and Ω. Third, note that in bivariate problems CRB will be more efficient than CRT because it does not involve any reduced runs and only requires estimation of f(z_i1* | y_i, β, Ω), whereas f(z_i2* | y_i, z_i1*, β, Ω) is directly available. Finally, the main ideas stemming from the CRB and CRT approaches – that response probability evaluation can be reduced to finding a density ordinate and that the Gibbs kernel can be employed in estimating this density ordinate – will form a foundation for the methods that we
discuss next. The key distinction between the alternatives that we consider has to do with the way in which the sample of latent data {z_i^(g)} is generated.
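A sketch of the CRT computation in Eqs. (9)–(12) is given below; it reuses draws from a sampler such as the gibbs_truncated_mvn sketch in Section 3.1, and that helper name, like the rest of the code, is our own illustrative construction.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal, norm

def crt_log_probability(y_i, X_i, beta, Omega, draws):
    """CRT estimate of ln Pr(y_i | beta, Omega): evaluate Eq. (9) at z*, the mean
    of the Gibbs draws, with f_TN(z* | .) estimated by averaging the Gibbs
    transition kernel K(z^(g), z* | y_i, beta, Omega) of Eq. (10) as in Eq. (12)."""
    G, J = draws.shape
    mu = X_i @ beta
    Prec = np.linalg.inv(Omega)
    z_star = draws.mean(axis=0)
    log_kernel = np.zeros(G)
    for j in range(J):
        mask = np.arange(J) != j
        s = np.sqrt(1.0 / Prec[j, j])
        # conditioning values mix z*_{k<j} (already updated) and z^(g)_{k>j}
        cond = np.where((np.arange(J) < j)[:, None], z_star[:, None], draws.T).T[:, mask]
        m = mu[j] - (1.0 / Prec[j, j]) * (cond - mu[mask]) @ Prec[j, mask]
        # truncated normal ordinate of z*_j on B_ij, given the conditioning values
        a = norm.cdf((0.0 - m) / s)                     # Pr(z_ij <= 0 | cond)
        c = 1.0 - a if y_i[j] == 1 else a               # normalizing constant c_ij
        log_kernel += norm.logpdf(z_star[j], loc=m, scale=s) - np.log(c)
    log_f_tn = logsumexp(log_kernel) - np.log(G)        # Eq. (12), in logs
    log_f_n = multivariate_normal.logpdf(z_star, mean=mu, cov=Omega)
    return log_f_n - log_f_tn                           # Eq. (9)
```

For example, draws = gibbs_truncated_mvn(y_i, X_i, beta, Omega) followed by crt_log_probability(y_i, X_i, beta, Omega, draws) returns the CRT estimate of the log outcome probability.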
3.3. The ARK Method

Our second extension deals with the AR estimator. As discussed in Section 2, the AR approach can be appealing because of its simplicity and ease of implementation, but can be problematic because of its nondifferentiability and discontinuity and the potential for numerical instability when estimating probabilities near 0 or 1. In this section, we show that the integration of MCMC theory into AR sampling can produce a method that circumvents many of the drawbacks of standard AR estimation. An important advantage of the proposed method relative to the estimator in Eq. (4) is that continuity and differentiability are introduced without sacrificing simulation consistency or requiring additional tuning parameters. Because the approach combines the AR simulator with the kernel of the Gibbs sampler, we will refer to it as the ARK method.

The derivation of the ARK method is fairly uncomplicated. It proceeds by simply rewriting the invariance condition in Eq. (11) as

f_{TN_{B_i}}(z_i* | X_i β, Ω) = ∫ K(z_i, z_i* | y_i, β, Ω) f_{TN_{B_i}}(z_i | X_i β, Ω) dz_i
                             ∝ ∫ K(z_i, z_i* | y_i, β, Ω) 1{z_i ∈ B_i} f_N(z_i | X_i β, Ω) dz_i    (13)

which suggests a straightforward way of producing an estimate f̂_{TN_{B_i}}(z_i* | X_i β, Ω) that can be used to obtain P̂r(y_i | β, Ω) by Eq. (8) or (9). Specifically, from Eq. (13) it follows that f_{TN_{B_i}}(z_i* | X_i β, Ω) can be estimated by drawing z_i ~ N(X_i β, Ω), accepting only draws that satisfy z_i ∈ B_i, and using those draws to average K(z_i, z_i* | y_i, β, Ω) as in Eq. (12). At this point, it may be helpful to review the main pros and cons of ARK estimation in some detail. First, the ARK method retains the simplicity of AR sampling, while simultaneously offering continuous, differentiable, and simulation consistent estimates of Pr(y_i | β, Ω) based on the Gibbs kernel (even though simulation of {z_i^(g)} does not involve Gibbs sampling as in CRB or CRT). Second, because ARK subsumes the traditional AR estimator, the AR estimate will also typically be available as a by-product of ARK estimation. Third, although both ARK and CRT average the kernel in Eq. (12) using latent data z_i ~ TN_{B_i}(X_i β, Ω), the fact that the latent data are
obtained by either AR or Gibbs sampling can have important implications for the relative efficiency of ARK versus CRT. To see this, consider Fig. 1.

Fig. 1. Gibbs vs. Independent Sampling from Distributions with Varying Degrees of Correlation. (Panels contrast independent sampling and Gibbs sampling in a low correlation case and a high correlation case.)

The figure shows that with low correlations, the Gibbs sampler can traverse the parameter space relatively quickly, without inducing much serial dependence in the sampled {z_i^(g)}. When the elements of z_i are highly correlated, however, iterative sampling of the full-conditional distributions produces relatively small Gibbs steps that lead to slow mixing of the Markov chain. In contrast, ARK provides an independent sample of draws whose mixing is unaffected by the extent of correlation between the elements of z_i. One should keep in mind, however, that this advantage of the ARK approach comes at the cost of a well-known problem with AR samplers: too many rejections may occur if Pr(z_i ∈ B_i | β, Ω) is relatively low, thereby adversely affecting simulation efficiency. In some cases, this problem may be remedied by estimating Pr(z_i ∈ B_i^c | β, Ω) because when the probability of B_i is small, that of its complement B_i^c must be relatively large. However, we
caution that in doing so, one must be careful to ensure that a number of technical requirements are met. In particular, while the set B_i is convex, its complement B_i^c need not be. As a result, some choices of z_i* ∈ B_i^c may potentially introduce nondifferentiability in the estimate of Pr(z_i ∈ B_i^c | β, Ω) because the kernel K(z_i, z_i* | y_i, β, Ω) may not be strictly positive for all {z_i}. Even worse, for some settings of β and Ω the nonconvexity of B_i^c may lead to near reducibility of the Markov chain on B_i^c, rendering convergence and kernel estimation altogether problematic. Therefore, ARK estimation of Pr(z_i ∈ B_i^c | β, Ω) should only be attempted after careful consideration of the aforementioned issues.
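Under the same illustrative helpers, ARK amounts to feeding accepted AR draws into the kernel average of Eq. (12); a sketch, reusing the hypothetical crt_log_probability function from the previous section, follows.

```python
import numpy as np

def ark_log_probability(y_i, X_i, beta, Omega, G=20000, seed=0):
    """ARK estimate of ln Pr(y_i | beta, Omega): accepted draws from N(X_i beta, Omega)
    restricted to B_i replace the Gibbs sample in the kernel average of Eq. (12);
    the averaging itself is identical to the CRT computation."""
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(X_i @ beta, Omega, size=G)
    accepted = z[np.all((z > 0) == (y_i == 1), axis=1)]      # iid draws from TN_{B_i}
    if accepted.shape[0] == 0:
        raise ValueError("no accepted draws: Pr(z_i in B_i) too small for ARK")
    return crt_log_probability(y_i, X_i, beta, Omega, accepted)
```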
3.4. The ASK Method

In this section, we discuss an approach which aims to improve the quality of the sample of {z_i} that is used in estimation by addressing some of the simulation difficulties discussed in Section 3.3. Another look at Fig. 1 suggests that improving the mixing properties of the Gibbs sampler in problems with high correlation would be key to reducing the serial dependence in the MCMC sample z_i ~ TN_{B_i}(X_i β, Ω), which, in turn, can reduce the sampling variability of the average in Eq. (12). Moreover, the discussion in Section 3.3 also indicates that Gibbs sampling has important advantages over AR sampling because every Gibbs draw satisfies z_i ∈ B_i, whereas meeting this requirement may lead to large rejection rates in AR simulation. In developing the method, we link a variety of approaches and introduce a new adaptive MCMC algorithm for simulating z_i ~ TN_{B_i}(X_i β, Ω) which improves the quality of the MCMC sample. We build upon Chib (1995) to relate estimation of Pr(z_i ∈ B_i | β, Ω) to that of f_{TN_{B_i}}(z_i | X_i β, Ω), rely on ideas from Ritter and Tanner (1992) to obtain the latter quantity, and use full-conditional truncated normal sampling (see Geweke, 1991), but with the key difference that our proposed Gibbs algorithm improves mixing by adaptively sampling either the latent {z_i} or a particular transformation of those variables. Specifically, we use the Mahalanobis transformation to map {z_i} into a priori independent standard normal variables {η_i} such as those used in Eq. (5) to develop the recursive conditioning importance density of the GHK estimator. Due to the particular combination of inputs that appear in this method, in the remainder of this chapter, we shall refer to it as the adaptive sampling kernel (ASK) method.
MCMC Perspectives on Simulated Likelihood Estimation
19
The ASK approach proceeds along the following lines. We write the model as z_i = X_i β + L η_i, where η_i ~ N(0, I) and L is a lower triangular Cholesky factor such that LL′ = Ω. Then, solving for η_i, we obtain η_i = L^{−1}(z_i − X_i β), which is the Mahalanobis transformation of z_i. Even though the elements of η_i are a priori independent, it is important to note that conditioning on y_i introduces dependence through the constraints on each η_ij in the full-conditional distributions f(η_ij | {η_ik}_{k≠j}, y_i, β, Ω), j = 1, …, J. To see this, note that the constraints on η_ij are obtained from those on z_i by solving the system z_i = X_i β + L η_i, and that η_ij enters all equations for which the elements in the jth column of L are not zero (L is lower triangular by construction, but it can possibly contain zero elements below the main diagonal). Let E_ijk denote the feasible region for η_ij implied by the kth equation, and let E_ij = ∩_{k=j}^J E_ijk and E_i = {η_i : η_ij ∈ E_ij, ∀j}. Readers may recall that some constraints arising from y_i are ignored in the GHK method in order to obtain a tractable importance density. However, all constraints must be incorporated in the sequence of Gibbs steps

[η_ij | {η_ik}_{k≠j}, y_i, β, Ω] ~ TN_{E_ij}(0, 1),   j = 1, …, J

leading to the Gibbs kernel

K(η_i, η_i* | y_i, β, Ω) = ∏_{j=1}^J f(η_ij* | y_i, {η_ik*}_{k<j}, {η_ik}_{k>j}, β, Ω)    (14)
so that MCMC simulation produces η_i ~ TN_{E_i}(0, I) that correspond to z_i ~ TN_{B_i}(X_i β, Ω). Some intuition about the mechanics of the ASK approach can be gleaned from Fig. 2, which relates the sets B_i and E_i implied by observing y_i = 1_2. The Mahalanobis transformation demeans, orthogonalizes, and rescales the draws z_i to produce η_i, but these operations also map B_i into E_i by shifting the vertex of the feasible set and rotating its boundaries (the axes) depending on the sign of the covariance elements in Ω. Note that because η_ij enters the equations for {z_ik}_{k≥j}, updating η_ij corresponds to simultaneously updating multiple elements of z_i; conversely, updating z_ij affects all elements {η_ik}_{k≤j} that enter the jth equation. The key feature of the transformation that will be exploited here is that it offers a trade-off between correlation (in the case of z_i) and dependence in the constraints (for the elements of η_i) and implies that important benefits can be obtained by adaptively sampling the elements of z_i or those of the Mahalanobis transformation η_i.

Fig. 2. Correspondence Between z_i ∈ B_i and the Mahalanobis Transform η_i ∈ E_i for y_i = 1_2. The Mahalanobis Transform Orthogonalizes and Standardizes z_i to Produce η_i, but Causes Dependence to be Reflected in the Boundaries of E_i.

To understand the trade-offs between Gibbs simulation of η_ij ~ f(η_ij | {η_ik}_{k≠j}, y_i, β, Ω) as a way of obtaining η_i ~ TN_{E_i}(0, I) and the resultant
z_i = X_i β + L η_i, and Gibbs sampling of z_ij ~ f(z_ij | {z_ik}_{k≠j}, β, Ω) which yields z_i ~ TN_{B_i}(X_i β, Ω) directly, consider a setting where Ω contains high correlations but the constraints implied by y_i are relatively mildly binding. In this case, it will be beneficial to simulate η_i because f_{TN_{E_i}}(η_i | 0, I) → f_N(η_i | 0, I) as Pr(η_i ∈ E_i) → 1 and drawing η_i produces a sample that will be close to iid. In contrast, a traditional Gibbs sampler defined on the elements of z_i will exhibit high serial correlation between successive MCMC draws because such a sampler must traverse large portions of the support by taking small steps (recall the discussion of Fig. 1). Note also that as the correlations in Ω increase toward 1 for similar components of y_i or decrease toward −1 for dissimilar components of y_i, the feasible sets tend to be binding on one
$\eta_{ij}$ but not the other, and the MCMC sampler is well behaved. In other cases, it may be better to sample $z_i$ directly without transforming to $\eta_i$, for example, when the constraints $E_{ij} = \cap_{k=j}^{J} E_{ijk}$ on $\eta_{ij}$ are such that they slow down the mixing of other $\{\eta_{ik}\}_{k \neq j}$. Some of these scenarios, together with measures of sampling inefficiency (to be discussed shortly) for each Gibbs kernel ($K_z(\cdot)$ and $K_\eta(\cdot)$), are presented in Fig. 3. In yet other cases, for example, when correlations are low or the probabilities to be estimated are small, the two sampling approaches will typically exhibit similar mixing and either approach will work well. However, in order to produce an MCMC sample that is as close to iid as possible, we have to be able to adaptively determine whether to simulate $\eta_i$ (and convert to $z_i$) or sample $z_i$ directly. Our proposed approach for doing so is presented next.
Fig. 3. Performance of $K_z(\cdot)$ and $K_\eta(\cdot)$ in Different Settings. Higher Inefficiencies Indicate Stronger Autocorrelations Between Successive MCMC Draws $\{z_i\}$ for Each Kernel (1 Indicates iid Sampling).
Algorithm 1. Adaptive Gibbs Sampler for Multivariate Truncated Normal Simulation
1. Initialize $p_\eta \in (0, 1)$ and let $p_z = 1 - p_\eta$;
2. Given the current $z_i$ and the corresponding $\eta_i$ in the Markov chain, with probability $p_\eta$ sample $\eta_i$ using the Gibbs kernel $K_\eta(\cdot)$ in Eq. (14) and convert to $z_i$ by Eq. (5), or, with probability $p_z$, sample $z_i$ directly using the Gibbs kernel $K_z(\cdot)$ in Eq. (10);
3. After a burn-in period, accumulate the sample $\{z_i\}$ while keeping track of the draws obtained by $K_\eta(\cdot)$ and $K_z(\cdot)$;
4. Periodically update $p_\eta$ using a rule $P_\eta : \mathbb{R}^{2J} \to [0, 1]$ that maps the autocorrelations from the two kernels to the closed interval $[0, 1]$; $P_\eta$ is an increasing function in the autocorrelations of the draws produced by $K_z(\cdot)$ and a decreasing function of those produced by $K_\eta(\cdot)$.

We now discuss Algorithm 1 in greater detail. From a theoretical point of view, the algorithm is quite transparent: it is very simple to show that the mixture of kernels, each of which converges to the target distribution, also converges to that distribution. Specifically, one only has to observe that invariance is satisfied for each kernel (see, e.g., Chib & Greenberg, 1996) and therefore for any weighted average (mixture) of those kernels. An interesting observation, based on our experience with step 1, is that good mixing in the initial stages of sampling does not require that the better mixing sampler be favored by the initial choice of $p_\eta$ and $p_z$. In fact, a ''neutral'' probability choice $p_\eta = p_z = 0.5$ typically leads to a mixed sampler whose performance is more than proportionately closer to that of the more efficient transition kernel. The goal of steps 3 and 4 is to ensure that if one kernel dominates the other on every margin (i.e., sampling is more efficient for every element of $z_i$), the mixed chain settles on that more efficient kernel; otherwise, the aim is to produce an adaptive Markov chain that strikes a balance between the two kernels in a way that reduces overall inefficiency. There are many possible ways in which $p_\eta$ and $p_z$ can be determined, depending on one's aversion (as captured by some loss function) to slow mixing for each element of $z_i$. In our examples, we considered the following $P_\eta$:
$$p_\eta = \begin{cases} 1 & \text{if } r_z \succeq r_\eta \\ 0 & \text{if } r_\eta \succeq r_z \\ \dfrac{w' r_z}{w' r_z + w' r_\eta} & \text{otherwise} \end{cases}$$
where $w$ is a vector of (loss function) weights, $r_z$ and $r_\eta$ are $J$-vectors of inefficiency measures for each element of $z_i$ under the two MCMC sampling kernels, and ''$\succeq$'' denotes element-by-element inequality. Let $\rho_{jl} = \mathrm{corr}(z_{ij}^{(g)}, z_{ij}^{(g-l)})$ be the $l$th autocorrelation for the sampled draws of $z_{ij}$. Note that in computing $\rho_{jl}$ for each of the two kernels in Algorithm 1, one should ensure that the draws $z_{ij}^{(g)}$ come from the kernel of interest, even though the draws $z_{ij}^{(g-l)}$ could have been generated by either kernel. As a quick and inexpensive (but fairly accurate) measure of the inefficiency factors $1 + \sum_{l=1}^{\infty} \rho_{jl}$ of the two samplers, we use the draws generated by $K_z(\cdot)$ to compute $r_z[j] = (1 - \rho_{j1})^{-1}$ and similarly base the computation of $r_\eta[j] = (1 - \rho_{j1})^{-1}$ on draws generated by $K_\eta(\cdot)$. These expressions require calculation of only the first-order autocorrelations and minimize computational and bookkeeping requirements, but approximate the inefficiency factors rather well. Note also that in determining the mixing probabilities, the vector $w$ could contain equal weights if the goal is to improve overall MCMC mixing. However, the weights can easily be adjusted when it may be desirable to weigh the mixing of a particular subset of $z_i$ more heavily, such as in problems when a subset of $z_i$ must be integrated out. A final remark is that it will typically suffice to update $p_\eta$ only a few times in the course of sampling and that the sampling probability tends to stabilize very rapidly, almost immediately in the case of algorithms that exhibit widely diverging MCMC mixing properties.

The definition of the ASK simulator is completed by noting that once a sample of draws $\{z_i^{(g)}\}$ is available, estimation proceeds by Eqs. (12) and (9). We emphasize that while by construction it is true that
$$\Pr(y_i \mid \beta, \Omega) = \int_{B_i} f_N(z_i \mid X_i\beta, \Omega)\, dz_i = \int_{E_i} f_N(\eta_i \mid 0, I)\, d\eta_i$$
we do not use the second representation in estimation (and only rely on it in simulation) because, after the Mahalanobis transformation, the dependence in the constraints seen in Fig. 2 implies that some values of $\eta_i$ will possibly lead to $K(\eta_i, \eta_i^{*} \mid y_i, \beta, \Omega) = 0$, which may lead to nondifferentiability of the resulting probability estimate. This, however, is not a problem when the draws $\{\eta_i^{(g)}\}$ are converted to $\{z_i^{(g)}\}$ and the kernel $K(z_i, z_i^{*} \mid y_i, \beta, \Omega)$ is used in estimation.
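As a rough sketch of how the quantities above might be computed from stored output (Python, with hypothetical array names; the first-order-autocorrelation shortcut and the element-by-element dominance check follow the description in the text):

```python
import numpy as np

def inefficiency(draws):
    """Approximate inefficiency factors r[j] = 1 / (1 - rho_{j1}) from a
    (G, J) array of draws produced by one kernel, using only the
    first-order autocorrelation of each component."""
    x = draws - draws.mean(axis=0)
    rho1 = (x[1:] * x[:-1]).sum(axis=0) / (x * x).sum(axis=0)
    return 1.0 / (1.0 - rho1)

def update_p_eta(draws_from_Kz, draws_from_Keta, w=None):
    """Mixing-probability rule P_eta: 1 (0) if the z-kernel (eta-kernel) is
    dominated element by element, and a weighted compromise otherwise."""
    r_z = inefficiency(draws_from_Kz)       # inefficiencies under K_z
    r_eta = inefficiency(draws_from_Keta)   # inefficiencies under K_eta
    w = np.ones_like(r_z) if w is None else np.asarray(w)
    if np.all(r_z >= r_eta):
        return 1.0
    if np.all(r_eta >= r_z):
        return 0.0
    return float(w @ r_z / (w @ r_z + w @ r_eta))
```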
3.5. Summary and Additional Considerations

In this section, we have presented a variety of MCMC methods for estimating response probabilities in discrete data models. We have shown that simulated likelihood estimation can proceed by adapting methods from the Bayesian literature on marginal likelihood estimation and have developed a set of new techniques designed to address features that are specific to simulated likelihood evaluation. The methods are applicable in binary, ordinal, censored, count, and other settings and can be easily extended to handle mixtures and scale-mixtures of normal distributions that include the Student's t-link and logit models, among others (see, e.g., Andrews & Mallows, 1974; Poirier, 1978; Albert & Chib, 1993; Geweke, 1993), and to models with heteroskedasticity (Gu, Fiebig, Cripps, & Kohn, 2009). Moreover, even though for most of the approaches presented here we have discussed Gibbs kernel versions of estimating the posterior ordinate (as in Ritter & Tanner, 1992), we emphasize that it is possible to use Rao-Blackwellization as in Section 3.1, which can be desirable in high-dimensional problems or in settings where natural groupings of the latent variables may be present.

An important goal of this chapter is to consider approaches for obtaining MCMC samples $\{z_i\}$ that result in better mixing of the Markov chain and improved efficiency of estimation. The improvements in simulation made possible by Algorithm 1 have ramifications not only for estimation of response probabilities, but also for problems in which high-quality samples from a truncated normal distribution are needed. For example, Chib's approach, which was discussed in Section 3.1, can be combined with the output of Algorithm 1 to further improve its efficiency. Many of the methods discussed here can also be combined with recently developed simulation techniques such as slice sampling (Neal, 2003; Damien & Walker, 2001) and antithetic draws such as those produced by reflection samplers and Halton sequences (see, e.g., Tierney, 1994; Train, 2003; Bhat, 2001, 2003). In this chapter, we focused on algorithms that provide continuous and differentiable probability estimates but have also cited a number of important MCMC approaches that lead to nondifferentiable estimates. It is useful to keep in mind that many of these latter methods can still be applied in optimization algorithms that do not require differentiation, for example, in simulated annealing (Goffe, Ferrier, & Rogers, 1994) and particle swarming (Kennedy & Eberhart, 2001), although such algorithms involve computationally intensive stochastic search that typically requires numerous evaluations of the objective function.
Finally, we remark that although our discussion has centered on discrete data models, the techniques developed in this chapter are directly applicable to the computation of p-values for multivariate directional hypothesis tests.
4. COMPARISON OF SIMULATORS

We carried out a simulation study to examine the performance of the techniques proposed in Section 3 and compare them to the methods discussed in Section 2. In particular, we report estimates of the probability
$$\Pr(y \mid \mu, \Omega) = \int_{B} f_N(z \mid \mu, \Omega)\, dz$$
under several settings of $\mu$ and $\Omega$. Because the problem of estimating any orthant probability can always be represented as an equivalent problem of estimating the probability of another orthant by simple rotation of the space, without loss of generality we let $y$ be a $J$-dimensional vector of ones, and hence $B$ is the positive orthant. We vary the dimension of integration from $J = 3$ to $J = 12$ in increments of 3. In each case, we consider three settings of $\mu$ and four settings of $\Omega$. Specifically, when $J = 3$, we let $\mu_A = (0, 0.5, 1)'$ be the value of $\mu$ that makes $y$ ''likely,'' $\mu_B = (-0.5, 0, 0.5)'$ the ''intermediate'' value of $\mu$, and $\mu_C = (-1, -0.5, 0)'$ the ''least likely'' value. For $J = 6$ the ''likely,'' ''intermediate,'' and ''least likely'' values are obtained by setting $\mu = (\mu_A', \mu_A')'$, $\mu = (\mu_B', \mu_B')'$, or $\mu = (\mu_C', \mu_C')'$, respectively. The means are similarly constructed for higher values of $J$. We use a covariance matrix $\Omega$ of the type $\Omega[k, j] = \rho^{|k - j|}$, that is,
$$\Omega = \begin{pmatrix} 1 & \rho & \rho^2 & \cdots & \rho^{J-1} \\ \rho & 1 & \rho & \cdots & \rho^{J-2} \\ \rho^2 & \rho & 1 & \ddots & \vdots \\ \vdots & \ddots & \ddots & \ddots & \rho \\ \rho^{J-1} & \rho^{J-2} & \cdots & \rho & 1 \end{pmatrix}$$
where $\rho \in \{0.7, 0.3, -0.3, -0.7\}$, which allows for high and low positive and negative correlations in the examples. Finally, the reported results for all simulators are based on simulation runs of length 10,000; for the three simulators requiring MCMC draws (CRB, CRT, and ASK), the main run is preceded by a burn-in of 1,000 cycles.
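For concreteness, a small Python sketch of this design (the array names and the stacking of $\mu_A$, $\mu_B$, $\mu_C$ to higher $J$ follow the description above; nothing here is taken from the authors' code):

```python
import numpy as np

MU_BLOCKS = {"likely": np.array([0.0, 0.5, 1.0]),
             "intermediate": np.array([-0.5, 0.0, 0.5]),
             "least likely": np.array([-1.0, -0.5, 0.0])}

def design(J, rho, setting="likely"):
    """Mean vector and covariance Omega[k, j] = rho**|k - j| for the
    orthant-probability experiments (J must be a multiple of 3)."""
    mu = np.tile(MU_BLOCKS[setting], J // 3)
    idx = np.arange(J)
    omega = rho ** np.abs(idx[:, None] - idx[None, :])
    return mu, omega

mu, omega = design(J=6, rho=-0.7, setting="least likely")
```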
Table 1. Log-Probability Estimates (J = 3) with Numerical Standard Errors (x 10^2) in Parentheses.

            rho    AR               STERN            GHK              CRB              CRT              ASK              ARK
mu = mu_A   0.7   -1.574 (1.956)   -1.557 (0.896)   -1.556 (0.386)   -1.557 (0.086)   -1.557 (0.082)   -1.558 (0.085)   -1.560 (0.171)
            0.3   -1.396 (1.743)   -1.393 (0.414)   -1.392 (0.123)   -1.393 (0.018)   -1.393 (0.017)   -1.393 (0.017)   -1.394 (0.033)
           -0.3   -1.069 (1.382)   -1.059 (0.729)   -1.067 (0.080)   -1.065 (0.033)   -1.066 (0.033)   -1.065 (0.033)   -1.066 (0.051)
           -0.7   -0.835 (1.143)   -0.840 (0.895)   -0.836 (0.094)   -0.838 (0.265)   -0.836 (0.318)   -0.841 (0.239)   -0.834 (0.395)
mu = mu_B   0.7   -3.523 (5.736)   -3.498 (1.212)   -3.498 (0.543)   -3.502 (0.029)   -3.502 (0.026)   -3.502 (0.026)   -3.503 (0.176)
            0.3   -2.658 (3.642)   -2.664 (0.518)   -2.663 (0.170)   -2.665 (0.009)   -2.665 (0.009)   -2.665 (0.009)   -2.666 (0.040)
           -0.3   -1.862 (2.332)   -1.853 (1.151)   -1.867 (0.113)   -1.865 (0.023)   -1.865 (0.024)   -1.865 (0.023)   -1.865 (0.057)
           -0.7   -1.427 (1.780)   -1.406 (1.341)   -1.424 (0.134)   -1.423 (0.203)   -1.423 (0.253)   -1.424 (0.200)   -1.423 (0.427)
mu = mu_C   0.7   -7.824 (49.990)  -7.208 (1.463)   -7.208 (0.676)   -7.213 (0.008)   -7.213 (0.008)   -7.213 (0.008)   -7.220 (1.881)
            0.3   -4.656 (10.211)  -4.645 (0.607)   -4.645 (0.212)   -4.648 (0.004)   -4.648 (0.004)   -4.648 (0.004)   -4.648 (0.097)
           -0.3   -3.006 (4.382)   -2.979 (1.851)   -3.002 (0.148)   -3.000 (0.017)   -3.000 (0.017)   -3.000 (0.017)   -3.000 (0.065)
           -0.7   -2.248 (2.910)   -2.213 (2.105)   -2.237 (0.179)   -2.234 (0.175)   -2.235 (0.199)   -2.233 (0.172)   -2.230 (0.532)
Tables 1–4 present results for different settings of $\mu$ and $\Omega$ for $J = 3$, $J = 6$, $J = 9$, and $J = 12$, respectively. We find that for low values of $J$ and settings of $\mu$ that make $y$ ''likely,'' all methods produce point estimates that agree closely. However, the variability differs widely across estimators and different settings of $J$, $\rho$, and $\mu$ (note that the entries in parentheses have to be divided by 100 to obtain the actual numerical standard errors (NSE) of the estimates). Among the traditional estimators, we see that GHK outperforms AR and Stern, regardless of the values of $J$, $\rho$, and $\mu$. AR performs worst and can also fail in high-dimensional problems or in other settings where the outcome is ''unlikely'' and no draws are accepted. These findings for the traditional estimators are consistent with earlier studies (e.g., Börsch-Supan & Hajivassiliou, 1993; Hajivassiliou et al., 1996).
Table 2. Log-Probability Estimates (J = 6) with Numerical Standard Errors (x 10^2) in Parentheses.

                 rho    AR               STERN             GHK               CRB               CRT               ASK               ARK
mu = 1_2 x mu_A  0.7   -3.032 (4.444)   -3.097 (1.700)    -3.066 (0.643)    -3.074 (0.112)    -3.074 (0.121)    -3.076 (0.129)    -3.071 (0.576)
                 0.3   -2.859 (4.056)   -2.831 (0.729)    -2.823 (0.235)    -2.828 (0.026)    -2.828 (0.027)    -2.828 (0.027)    -2.829 (0.128)
                -0.3   -2.039 (2.586)   -2.030 (1.237)    -2.041 (0.221)    -2.037 (0.049)    -2.037 (0.055)    -2.038 (0.053)    -2.036 (0.146)
                -0.7   -1.350 (1.691)   -1.355 (1.315)    -1.376 (0.433)    -1.371 (0.408)    -1.386 (0.539)    -1.372 (0.449)    -1.360 (0.818)
mu = 1_2 x mu_B  0.7   -7.131 (35.341)  -7.211 (2.846)    -7.164 (0.912)    -7.174 (0.035)    -7.175 (0.037)    -7.175 (0.036)    -7.195 (2.917)
                 0.3   -5.473 (15.398)  -5.480 (0.969)    -5.469 (0.311)    -5.475 (0.013)    -5.475 (0.014)    -5.475 (0.014)    -5.473 (0.453)
                -0.3   -3.527 (5.746)   -3.518 (2.238)    -3.534 (0.297)    -3.527 (0.037)    -3.528 (0.041)    -3.528 (0.040)    -3.529 (0.183)
                -0.7   -2.294 (2.985)   -2.246 (2.178)    -2.283 (0.555)    -2.277 (0.351)    -2.280 (0.464)    -2.278 (0.379)    -2.271 (1.090)
mu = 1_2 x mu_C  0.7   –                -15.476 (4.408)   -15.444 (1.140)   -15.458 (0.010)   -15.458 (0.010)   -15.458 (0.010)   –
                 0.3   –                -9.669 (1.227)    -9.656 (0.374)    -9.663 (0.006)    -9.663 (0.006)    -9.663 (0.007)    –
                -0.3   -5.497 (15.585)  -5.620 (4.486)    -5.629 (0.374)    -5.621 (0.027)    -5.621 (0.030)    -5.621 (0.028)    -5.624 (0.450)
                -0.7   -3.537 (5.776)   -3.518 (3.925)    -3.514 (0.730)    -3.498 (0.288)    -3.502 (0.366)    -3.503 (0.342)    -3.501 (1.745)
The interesting finding from Tables 1–4 in this study is that the MCMC-based estimation methods perform very well. In fact, CRB, CRT, and ASK outperform GHK most of the time, although the methods are roughly on par in ''likely'' cases characterized by high values of $\rho$. However, in cases with unlikely outcomes, the MCMC-based methods typically produce NSE that are much lower than those of GHK. Although it could be argued that some of the efficiency of CRB comes at the cost of additional reduced runs, neither CRT nor ASK requires reduced runs, and both are still typically more efficient than GHK. These results present a strong case in favor of the proposed MCMC-based approaches. Even ARK, which similarly to AR could also fail when no draws are accepted, seems to provide very efficient estimates that are close to those of the other estimators (provided at least some draws are accepted).
Table 3. Log-Probability Estimates (J = 9) with Numerical Standard Errors (x 10^2) in Parentheses.

                 rho    AR               STERN             GHK               CRB               CRT               ASK               ARK
mu = 1_3 x mu_A  0.7   -4.585 (9.851)   -4.599 (2.578)    -4.610 (0.864)    -4.590 (0.155)    -4.588 (0.142)    -4.589 (0.156)    -4.566 (1.719)
                 0.3   -4.200 (8.103)   -4.258 (0.957)    -4.269 (0.318)    -4.263 (0.035)    -4.263 (0.034)    -4.263 (0.032)    -4.260 (0.350)
                -0.3   -2.966 (4.292)   -3.001 (1.789)    -3.005 (0.326)    -3.008 (0.062)    -3.008 (0.071)    -3.009 (0.069)    -3.005 (0.318)
                -0.7   -1.885 (2.363)   -1.880 (1.784)    -1.890 (0.611)    -1.893 (0.563)    -1.901 (0.738)    -1.905 (0.519)    -1.906 (1.394)
mu = 1_3 x mu_B  0.7   –                -10.877 (5.093)   -10.878 (1.277)   -10.846 (0.046)   -10.845 (0.043)   -10.846 (0.040)   –
                 0.3   -7.824 (49.990)  -8.281 (1.312)    -8.293 (0.421)    -8.285 (0.017)    -8.285 (0.017)    -8.285 (0.016)    -8.449 (4.721)
                -0.3   -5.116 (12.871)  -5.157 (3.919)    -5.186 (0.440)    -5.190 (0.047)    -5.190 (0.053)    -5.190 (0.050)    -5.188 (0.657)
                -0.7   -3.128 (4.672)   -3.100 (3.294)    -3.107 (0.910)    -3.107 (0.457)    -3.113 (0.609)    -3.113 (0.520)    -3.090 (2.229)
mu = 1_3 x mu_C  0.7   –                -23.764 (8.661)   -23.743 (1.615)   -23.702 (0.012)   -23.702 (0.012)   -23.702 (0.011)   –
                 0.3   –                -14.675 (1.725)   -14.689 (0.505)   -14.678 (0.008)   -14.678 (0.008)   -14.678 (0.008)   –
                -0.3   -8.112 (57.726)  -8.141 (9.785)    -8.236 (0.557)    -8.241 (0.035)    -8.242 (0.039)    -8.242 (0.037)    -8.100 (15.765)
                -0.7   -4.804 (10.998)  -4.743 (6.980)    -4.733 (1.264)    -4.741 (0.375)    -4.743 (0.518)    -4.738 (0.473)    -4.740 (4.517)
In comparing the MCMC approaches to each other, we see that the ASK estimates, as expected, are at least as efficient as those from CRT, but that in many settings all three methods (ASK, CRT, and CRB) perform similarly. This suggests that ASK (which nests CRT as a special case) may be preferable to CRB in those cases because of its lower computational demands. The advantages of adaptive sampling by Algorithm 1 become more pronounced the higher the correlation $\rho$. An important point to note, in light of the results presented in this section and in anticipation of the application in Section 5, is that precise estimation of the log-likelihood is essential for inference. For example, it is crucial for computing likelihood ratio statistics, information criteria, marginal likelihoods, and Bayes factors for model comparisons and hypothesis testing.
Table 4. Log-Probability Estimates (J = 12) with Numerical Standard Errors (x 10^2) in Parentheses.

                 rho    AR               STERN             GHK               CRB               CRT               ASK               ARK
mu = 1_4 x mu_A  0.7   -5.914 (19.219)  -6.096 (3.775)    -6.084 (1.207)    -6.102 (0.170)    -6.101 (0.162)    -6.103 (0.180)    -6.129 (3.428)
                 0.3   -5.599 (16.409)  -5.699 (1.220)    -5.696 (0.412)    -5.697 (0.040)    -5.697 (0.037)    -5.697 (0.038)    -5.685 (1.898)
                -0.3   -3.868 (6.844)   -3.961 (2.332)    -3.979 (0.389)    -3.979 (0.074)    -3.979 (0.078)    -3.978 (0.083)    -3.987 (0.481)
                -0.7   -2.429 (3.217)   -2.397 (2.326)    -2.410 (0.836)    -2.410 (0.628)    -2.417 (0.909)    -2.404 (0.747)    -2.365 (2.305)
mu = 1_4 x mu_B  0.7   –                -14.504 (8.462)   -14.484 (1.864)   -14.516 (0.049)   -14.516 (0.046)   -14.516 (0.047)   –
                 0.3   –                -11.091 (1.734)   -11.092 (0.547)   -11.094 (0.020)   -11.094 (0.019)   -11.094 (0.019)   –
                -0.3   -6.725 (28.850)  -6.858 (5.393)    -6.850 (0.524)    -6.852 (0.055)    -6.851 (0.058)    -6.851 (0.062)    -6.818 (5.829)
                -0.7   -3.821 (6.683)   -3.923 (4.651)    -3.914 (1.213)    -3.933 (0.540)    -3.944 (0.709)    -3.937 (0.762)    -3.943 (3.553)
mu = 1_4 x mu_C  0.7   –                -31.980 (13.751)  -31.901 (2.411)   -31.945 (0.013)   -31.945 (0.012)   -31.945 (0.013)   –
                 0.3   –                -19.684 (2.327)   -19.690 (0.656)   -19.693 (0.009)   -19.693 (0.009)   -19.693 (0.009)   –
                -0.3   –                -11.090 (12.964)  -10.860 (0.667)   -10.862 (0.041)   -10.862 (0.043)   -10.861 (0.046)   –
                -0.7   -5.776 (17.933)  -6.044 (11.969)   -5.959 (1.718)    -5.972 (0.428)    -5.974 (0.642)    -5.961 (0.588)    -6.003 (7.573)
Estimation efficiency is also key to mitigating simulation biases (due to Jensen's inequality and the nonlinearity of the logarithmic transformation) in the maximum simulated likelihood estimation of parameters, standard errors, and confidence intervals (see, e.g., McFadden & Train, 2000, Section 3 and Train, 2003, Chapter 10). To summarize, the results suggest that the MCMC simulated likelihood estimation methods perform very well and dominate other estimation methods over a large set of possible scenarios. Their performance improves with the ability of the Markov chain to mix well, making Algorithm 1 an important component of the estimation process.
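The Jensen's-inequality point can be illustrated with a few lines of Python (a self-contained toy that is not part of the chapter's experiments): averaging unbiased probability estimates and then taking the log is centered on the truth, whereas averaging $\log \hat{P}$ is biased downward, and the bias shrinks only as the simulation noise does.

```python
import numpy as np

rng = np.random.default_rng(0)
p_true = 0.05
for R in (25, 100, 400):
    # stylized unbiased simulator: p_hat is lognormal with E[p_hat] = p_true
    # and a simulation standard deviation shrinking like 1/sqrt(R)
    sigma = 1.0 / np.sqrt(R)
    p_hat = p_true * np.exp(sigma * rng.standard_normal(200_000) - 0.5 * sigma**2)
    bias = np.log(p_hat).mean() - np.log(p_true)
    # Jensen's inequality: E[log p_hat] < log E[p_hat]; bias is about -sigma^2 / 2
    print(f"R = {R:4d}   E[log p_hat] - log p_true = {bias:+.5f}")
```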
4.1. Computational Caveats and Directions for Further Study

In this chapter, we have compared a number of new and existing estimators for a fixed Monte Carlo simulation size. Such comparisons are easy to perform and interpret in practically any simulation study. However, an important goal for future research would be to optimize the code, perform formal operation counts, and study the computational intensity of each estimation algorithm. This would enable comparisons based on a fixed computational budget (running time), which are less straightforward and more difficult to generalize because they depend on various nuances of the specific application. In this section, we highlight some of the subtleties that must be kept in mind. For instance, although AR and ARK are simple and fast, the computational cost to achieve a certain estimation precision is entirely dependent on the context. Importance sampling and MCMC simulators such as GHK, CRT, CRB, and ASK, on the other hand, involve more coding and more costly iterations, but they are also more reliable and statistically efficient, especially in estimating small orthant probabilities. Based on rough operation counts, GHK, CRT, and ASK involve comparable computations and simulations, while the efficiency of CRB depends on the number of reduced runs that is required. The complications of optimizing these methods for speed, while retaining their transparency and reliability, go well beyond simply removing redundant computations (e.g., inversions, multiplications, conditional moment calculations) and making efficient use of storage. Although these steps are essential in producing efficient algorithms, another difficulty arises because random number generators may have to rely on a mix of techniques in order to be reliable and general. For example, to simulate truncated normal draws close to the mean, one can use the inverse cdf method. However, it is well known that the inverse cdf method can fail in the tails. Fortunately, in those circumstances the algorithms proposed in Robert (1995) are very efficient and reliable. Because in a given application the estimation algorithms may use a different mix of these simulation approaches, the computational times across algorithms may not be directly comparable. Another caveat arises due to the specifics of algorithms that rely on MCMC samplers and has to do with determining an appropriate way to account for the dual use (a benefit) and the need for burn-in sampling (a cost) of MCMC simulation. Specifically, in many instances MCMC draws have dual use in addition to evaluation of the likelihood function (e.g., for computing marginal effects, point elasticities, etc.) or are already
available from an earlier step in the MCMC estimation (so likelihood estimation requires no simulation but only the computation and averaging of certain conditional densities and transition kernels). Similarly, the costs of burn-in simulation would typically not be an issue in Bayesian studies where a Markov chain would have already been running during the estimation stage, but could be an additional cost in hill-climbing algorithms. Of course, for well-mixing Markov chains convergence and burn-in costs are trivial, but they should otherwise be properly accounted for in the cost of MCMC simulation. These special considerations are compounded by having to examine the estimators in the context of different dimensionality, mean, and covariance matrix combinations, making a thorough computational examination of the methods an important and necessary next step in this area of research. Gauss programs for the new estimators are available on the authors' websites.
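To illustrate the earlier point about mixing simulation techniques, here is a brief sketch of how a truncated-normal draw might switch between the inverse-CDF method and a tail-robust accept-reject step of the kind analyzed by Robert (1995); the switching threshold, constants, and function name are arbitrary illustrative choices, not the authors' implementation.

```python
import numpy as np
from scipy.stats import norm

def tn_below(upper, rng):
    """One draw from N(0, 1) truncated to (-inf, upper]."""
    if upper > -4.0:
        # inverse-CDF method: fine when the truncation point is not deep in the tail
        return norm.ppf(rng.uniform() * norm.cdf(upper))
    # deep lower tail: accept-reject with a shifted-exponential proposal applied
    # to -Z >= -upper (Robert, 1995-style; alpha is the usual optimal rate)
    mu = -upper
    alpha = 0.5 * (mu + np.sqrt(mu**2 + 4.0))
    while True:
        x = mu + rng.exponential(1.0 / alpha)
        if rng.uniform() <= np.exp(-0.5 * (x - alpha) ** 2):
            return -x
```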
5. APPLICATION TO A MODEL FOR BINARY PANEL DATA

This section offers an application of the techniques to the problem of likelihood ordinate estimation in models for binary panel data. In particular, we consider data from Chib and Jeliazkov (2006) that deal with the intertemporal labor force participation decisions of 1545 married women in the age range of 17–66. The data set, obtained from the Panel Study of Income Dynamics, contains a panel of women's working status indicators (1 = working during the year, 0 = not working) over a 7-year period (1979–1985), together with a set of seven covariates given in Table 5. The sample consists of continuously married couples where the husband is a labor force participant (reporting both positive earnings and hours worked) in each of the sample years. Similar data have been analyzed by Chib and Greenberg (1998), Avery, Hansen, and Hotz (1983), and Hyslop (1999) using a variety of techniques. We considered two competing specifications that differ in their dynamics. For $i = 1, \ldots, n$ and $t = 1, \ldots, T$, the first specification, model $M_1$, is given by
$$y_{it} = 1\{\tilde{x}_{it}'\delta + w_{it}'\beta_i + g(s_{it}) + \phi_1 y_{i,t-1} + \phi_2 y_{i,t-2} + \epsilon_{it} > 0\}, \qquad \epsilon_{it} \sim N(0, 1)$$
and captures state dependence through two lags of the dependent variable but does not involve serial correlation in the errors.
Table 5. Variables in the Women's Labor Force Participation Application.

Variable   Explanation                                                     Mean      SD
WORK       Wife's labor force status (1 = working, 0 = not working)        0.7097    0.4539
INT        An intercept term (a column of ones)                            –         –
AGE        The woman's age in years                                        36.0262   9.7737
RACE       1 if black, 0 otherwise                                         0.1974    0.3981
EDU        Attained education (in years) at time of survey                 12.4858   2.1105
CH2        Number of children aged 0–2 in that year                        0.2655    0.4981
CH5        Number of children aged 3–5 in that year                        0.3120    0.5329
INC        Total annual labor income of head of household^a                31.7931   22.6417

^a Measured as nominal earnings (in thousands) adjusted by the consumer price index (base year 1987).
The second specification, model $M_2$, involves only a single lag of the dependent variable, but allows for AR(1) serial correlation in the errors:
$$y_{it} = 1\{\tilde{x}_{it}'\delta + w_{it}'\beta_i + g(s_{it}) + \phi_1 y_{i,t-1} + \epsilon_{it} > 0\}, \qquad \epsilon_{it} = \rho \epsilon_{i,t-1} + \nu_{it}, \quad \nu_{it} \sim N(0, 1)$$
Both $M_1$ and $M_2$ include mutually exclusive sets of covariates $\tilde{x}_{it}$ and $w_{it}$, where the effects of the former, $\delta$, are modeled as common across women, and the effects of the latter, $\beta_i$, are individual specific (random); the models also include a covariate $s_{it}$ whose effect is modeled through an unknown function $g(\cdot)$ which is estimated nonparametrically. In both specifications $y_{it} = \mathrm{WORK}_{it}$, $\tilde{x}_{it}' = (\mathrm{RACE}_i, \mathrm{EDU}_{it}, \ln(\mathrm{INC}_{it}))$, $s_{it} = \mathrm{AGE}_{it}$, $w_{it}' = (1, \mathrm{CH2}_{it}, \mathrm{CH5}_{it})$, and heterogeneity is modeled in a correlated random effects framework which allows $\beta_i$ to be correlated with observables through
$$\beta_i = A_i \gamma + b_i, \qquad b_i \sim N_3(0, D) \qquad (15)$$
We let all three random effects depend on the initial conditions, and the effects of CH2 and CH5 also depend on average husbands' earnings through
$$A_i = \begin{pmatrix} y_{i0} & & \\ & 1 \;\; y_{i0} \;\; \ln(\mathrm{INC}_i) & \\ & & 1 \;\; y_{i0} \;\; \ln(\mathrm{INC}_i) \end{pmatrix}$$
where neither $\tilde{x}_{it}$ nor the first row of $A_i$ involves a constant term because the unknown function $g(\cdot)$ is unrestricted and absorbs the overall intercept.
Upon substituting Eq. (15) into each of the two models, stacking the observations over time for each woman, and grouping the common and individual-specific terms, we can write models $M_1$ and $M_2$, similarly to Eq. (1), in the latent variable form
$$z_i = X_i\beta + g_i + \epsilon_i \qquad (16)$$
where $z_i = (z_{i1}, \ldots, z_{iT})'$, $X_i = (\tilde{X}_i : W_i A_i : L_i)$, $\tilde{X}_i = (\tilde{x}_{i1}, \ldots, \tilde{x}_{iT})'$, $W_i = (w_{i1}, \ldots, w_{iT})'$, $L_i$ contains the requisite lags of $y_{it}$, $\beta = (\delta', \gamma', \phi')'$, and $g_i = (g(s_{i1}), \ldots, g(s_{iT}))'$. The errors $\epsilon_i = (\epsilon_{i1}, \ldots, \epsilon_{iT})'$ follow the distribution $\epsilon_i \sim N(0, \Omega_i)$, where $\Omega_i = R + W_i D W_i'$ and $R$ is the Toeplitz matrix implied by the autoregressive process, that is, $R = I_T$ for model $M_1$ and $R[j, k] = \rho^{|j - k|}/(1 - \rho^2)$ for model $M_2$. Because $M_1$ requires two lags of the dependent variable, both models are estimated conditionally on the initial two observations in the data. Our focus in this section is on the problem of estimating the log-likelihood ordinate conditionally on the model parameters. For details on the estimation of the parameters in the two models, interested readers are referred to Chib and Jeliazkov (2006). As the model construction shows, both the cluster means and covariance matrices depend on cluster-specific characteristics, and hence the panel data setup with multidimensional heterogeneity is quite useful for examining the performance of the estimators in a variety of possible circumstances that occur in the data. Estimates of the log-likelihood function obtained by various methods are presented in Table 6.
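A brief sketch of the cluster-specific covariance $\Omega_i = R + W_i D W_i'$ defined above (Python; the inputs are placeholders for quantities produced during estimation, and the AR(1) Toeplitz form follows the definition of $R$ for model $M_2$):

```python
import numpy as np

def omega_cluster(W_i, D, rho=None):
    """Covariance of the stacked latent errors for one woman:
    Omega_i = R + W_i D W_i', with R = I_T for M1 and
    R[j, k] = rho**|j - k| / (1 - rho**2) for M2."""
    T = W_i.shape[0]
    if rho is None:                       # model M1: serially uncorrelated errors
        R = np.eye(T)
    else:                                 # model M2: AR(1) errors
        idx = np.arange(T)
        R = rho ** np.abs(idx[:, None] - idx[None, :]) / (1.0 - rho**2)
    return R + W_i @ D @ W_i.T

# toy example: T = 5 periods, random-effects design (1, CH2, CH5), 3x3 D
W_i = np.column_stack([np.ones(5), np.zeros(5), np.zeros(5)])
D = np.diag([0.5, 0.2, 0.2])
Omega_M2 = omega_cluster(W_i, D, rho=0.3)
```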
Table 6. Log-Likelihood Estimates in the Women's Labor Force Participation Application.

                              Model M1                   Model M2
Estimator                Log-Likelihood    NSE      Log-Likelihood    NSE
Traditional estimators
  Stern                    -2501.435     (0.291)     -2537.926      (0.573)
  GHK                      -2501.434     (0.100)     -2537.631      (0.137)
  AR                       -2502.005     (2.355)     -2540.702      (2.440)
MCMC-based estimators
  CRB                      -2501.425     (0.027)     -2537.593      (0.061)
  CRT                      -2501.403     (0.039)     -2537.572      (0.081)
  ASK                      -2501.411     (0.036)     -2537.563      (0.073)
  ARK                      -2501.498     (0.090)     -2537.898      (0.202)
Fig. 4. Numerical Standard Error (NSE) Estimates for Model M1. (Panels show NSEs by cluster for AR, ARK, Stern, GHK, CRB, CRT, and ASK, together with select NSE boxplots for GHK, CRB, CRT, and ASK.)
The log-likelihood and NSE estimates were obtained from runs of length 10,000 draws for each likelihood contribution (there are $n = 1545$ likelihood contributions, one for each woman in the sample). The NSEs of the log-likelihood contribution estimates are presented in Figs. 4 and 5. The results in Table 6 and Figs. 4 and 5 show that in this application, the new MCMC methods are more efficient than existing approaches. While the argument can be made that the higher efficiency of CRB is due to its reliance on additional reduced runs, the results also reveal that the remaining MCMC methods are generally more efficient even though they do not require any reduced run simulations. We can also see that the improvement in MCMC sampling due to Algorithm 1 used in the ASK method leads to lower standard errors relative to CRT. A much more striking improvement in efficiency, however, can be seen in a comparison between the AR and ARK methods. What makes the comparison impressive is that both methods are based on the same simulated draws (with the AR estimate being obtained as a by-product of ARK estimation), yet ARK is orders of magnitude more efficient.
35
MCMC Perspectives on Simulated Likelihood Estimation AR
ARK NSE
0.2 0
500
1000
3
10 5 0
500
500
1000
1500
0
500
1000
1500
1000
1500
ASK
2 0
1500
1000 CRB
× 10−3
2
0
500
× 10−3
1 0
1500
NSE
NSE
1000
0
2
CRT
× 10−3
0
0
1500
GHK
× 10−3
15 NSE
0
0.05
NSE
NSE
0.4
Stern
0
500
× 10
−3
Select NSE Boxplots
0.05
NSE
NSE
15
0
10 5 0
0
500
1000
1500
GHK
CRB
CRT
ASK
Fig. 5.
Numerical Standard Error (NSE) Estimates for Model M2 .
Comparison of the estimates for models $M_1$ and $M_2$ shows that allowing for autocorrelated errors (the estimated value of $\rho$ is 0.29), at the cost of excluding a second lag of $y_{it}$ from the mean, has a detrimental effect on the efficiency of all estimators. While the relative efficiency rankings of estimators are largely preserved as we move from $M_1$ to $M_2$ (with the exception of GHK and ARK), traditional methods appear to exhibit more high-variability outliers, whereas MCMC-based methods show a general increase in variability of estimation across all clusters (the plots for ARK, similarly to those of AR, show both features). This section has considered the application of several simulated likelihood estimation techniques to a hierarchical semiparametric model for binary panel data with state dependence, serially correlated errors, and multidimensional heterogeneity correlated with the covariates and initial conditions. Using data on women's labor force participation, we have illustrated that the proposed MCMC-based estimation methods are practical and can lead to improved efficiency of estimation in a variety of
environments occurring in a real-world data set. Comparisons of these and other simulated likelihood estimators in other model settings are an important item for future research.
6. CONCLUSIONS

This chapter considers the problem of evaluating the likelihood functions in a broad class of multivariate discrete data models. We have reviewed traditional simulation methods that produce continuous and differentiable estimates of the response probability and can be used in hill-climbing algorithms in maximum likelihood estimation. We have also shown that the problem can be handled by MCMC-based methods designed for marginal likelihood computation in Bayesian econometrics. New computationally efficient and conceptually straightforward MCMC algorithms have been developed for (i) estimating response probabilities and likelihood functions and (ii) simulating draws from multivariate truncated normal distributions. The former of these contributions aims to provide simple, efficient, and sound solutions from Markov chain theory to outstanding problems in simulated likelihood estimation; the latter is motivated by the need to provide high-quality samples from the target multivariate truncated normal density. A simulation study has shown that the methods perform well, while an application to a correlated random effects panel data model of women's labor force participation shows that they are practical and easy to implement. In addition to their simplicity and efficiency, an important advantage of the methods considered here is that they are modular and can be mixed and matched as components of composite estimation algorithms in a variety of multivariate discrete and censored data settings. Important topics for future work in this area would be to examine the effectiveness of the estimators in practical applications, to explore extensions and develop additional hybrid approaches, and to perform detailed computational efficiency studies in a range of contexts.
ACKNOWLEDGMENT

We thank the editors and two anonymous referees for their detailed comments and helpful suggestions.
REFERENCES Albert, J., & Chib, S. (1993). Bayesian analysis of binary and polychotomous response data. Journal of the American Statistical Association, 88, 669–679. Andrews, D. F., & Mallows, C. L. (1974). Scale mixtures of normal distributions. Journal of the Royal Statistical Society–Series B, 36, 99–102. Avery, R., Hansen, L., & Hotz, V. (1983). Multiperiod probit models and orthogonality condition estimation. International Economic Review, 24, 21–35. Bhat, C. R. (2001). Quasi-random maximum simulated likelihood estimation of the mixed multinomial logit model. Transportation Research Part B, 35, 677–693. Bhat, C. R. (2003). Simulation estimation of mixed discrete choice models using randomized and scrambled halton sequences. Transportation Research Part B, 37, 837–855. Bo¨rsch-Supan, A., & Hajivassiliou, V. A. (1993). Smooth unbiased multivariate probability simulators for maximum likelihood estimation of limited dependent variable models. Journal of Econometrics, 58, 347–368. Chib, S. (1995). Marginal likelihood from the Gibbs output. Journal of the American Statistical Association, 90, 1313–1321. Chib, S., & Greenberg, E. (1996). Markov Chain Monte Carlo simulation methods in econometrics. Econometric Theory, 12, 409–431. Chib, S., & Greenberg, E. (1998). Analysis of multivariate probit models. Biometrika, 85, 347–361. Chib, S., & Jeliazkov, I. (2001). Marginal likelihood from the Metropolis-Hastings output. Journal of the American Statistical Association, 96, 270–281. Chib, S., & Jeliazkov, I. (2005). Accept-reject Metropolis-Hastings sampling and marginal likelihood estimation. Statistica Neerlandica, 59, 30–44. Chib, S., & Jeliazkov, I. (2006). Inference in Semiparametric dynamic models for binary longitudinal data. Journal of the American Statistical Association, 101, 685–700. Damien, P., & Walker, S. G. (2001). Sampling truncated normal, beta, and gamma densities. Journal of Computational and Graphical Statistics, 10, 206–215. Devroye, L. (1986). Non-uniform random variate generation. New York: Springer-Verlag. DiCiccio, T. J., Kass, R. E., Raftery, A. E., & Wasserman, L. (1997). Computing bayes factors by combining simulation and asymptotic approximations. Journal of the American Statistical Association, 92, 903–915. Gelfand, A. E., & Dey, D. K. (1994). Bayesian model choice: Asymptotics and exact calculations. Journal of the Royal Statistical Society-Series B, 56, 501–514. Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398–409. Geweke, J. (1991). Efficient simulation from the multivariate normal and student-t distributions subject to linear constraints. In: E. M. Keramidas (Ed.), Computing science and statistics, Proceedings of the twenty-third symposium on the interface, Interface Foundation of North America, Inc., Fairfax (pp. 571–578). Geweke, J. (1993). Bayesian treatment of the independent student-t linear model. Journal of Applied Econometrics, 8, S19–S40. Geweke, J. (1999). Using simulation methods for Bayesian econometric models: Inference, development, and communication. Econometric Reviews, 18, 1–73. Goffe, W. L., Ferrier, G. D., & Rogers, J. (1994). Global optimization of statistical functions with simulated annealing. Journal of Econometrics, 60, 65–99.
Greenberg, E. (2008). Introduction to Bayesian econometrics. New York: Cambridge University Press. Griffiths, W. E., Hill, R. C., & O’Donnell, C. J. (2006). A comparison of Bayesian and sampling theory inferences in a probit model. In: M. Holt & J.-P. Chavas (Eds), Essays in honor of Stanley R. Johnson, article 12. Available at http://www.bepress.com/ sjohnson/ Gu, Y., Fiebig, D. G., Cripps, E., & Kohn, R. (2009). Bayesian estimation of a random effects heteroscedastic probit model. Econometrics Journal, 12, 324–339. Hajivassiliou, V. A., & McFadden, D. (1998). The method of simulated scores for the estimation of LDV models. Econometrica, 66, 863–896. Hajivassiliou, V. A., McFadden, D. L., & Ruud, P. (1996). Simulation of multivariate normal rectangle probabilities and their derivatives: Theoretical and computational results. Journal of Econometrics, 72, 85–134. Hajivassiliou, V. A., & Ruud, P. (1994). Classical estimation methods for LDV models using simulation. Handbook of Econometrics, 4, 2383–2441. Heiss, F., & Winschel, V. (2008). Likelihood approximation by numerical integration on sparse grids. Journal of Econometrics, 144, 62–80. Hyslop, D. (1999). State dependence, serial correlation and heterogeneity in intertemporal labor force participation of married women. Econometrica, 67, 1255–1294. Kass, R., & Raftery, A. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773–795. Keane, M. (1994). A computationally practical simulation estimator for panel data. Econometrica, 95–116. Kennedy, J., & Eberhart, R. C. (2001). Swarm intelligence. San Francisco, CA: Morgan Kaufmann. Koop, G. (2003). Bayesian econometrics. New York: John Wiley & Sons. Lerman, S., & Manski, C. (1981). On the use of simulated frequencies to approximate choice probabilities. Structural Analysis of Discrete Data with Econometric Applications, 305–319. McFadden, D. (1989). A method of simulated moments for estimation of discrete response models without numerical integration. Econometica, 57, 995–1026. McFadden, D., & Train, K. (2000). Mixed MNL models for discrete response. Journal of Applied Econometrics, 15, 447–470. Meng, X.-L., & Wong, W. H. (1996). Simulating ratios of normalizing constants via a simple identity: A theoretical exploration. Statistica Sinica, 6, 831–860. Neal, R. M. (2003). Slice sampling. The Annals of Statistics, 31, 705–767. Newton, M. A., & Raftery, A. E. (1994). Approximate Bayesian inference with the weighted likelihood bootstrap. Journal of the Royal Statistical Society-Series B, 56, 3–48. Poirier, D. J. (1978). A curious relationship between probit and logit models. Southern Economic Journal, 44, 640–641. Ripley, B. D. (1987). Stochastic simulation. New York: John Wiley & Sons. Ritter, C., & Tanner, M. A. (1992). Facilitating the Gibbs sampler: The Gibbs stopper and the Griddy-Gibbs sampler. Journal of the American Statistical Association, 87, 861–868. Robert, C. P. (1995). Simulation of truncated normal variables. Statistics and Computing, 5, 121–125. Stern, S. (1992). A method for smoothing simulated moments of discrete probabilities in multinomial probit models. Econometica, 60, 943–952.
Stern, S. (1997). Simulation-based estimation. Journal of Economic Literature, 35, 2006–2039. Tanner, M. A., & Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82, 528–549. Tierney, L. (1994). Markov chains for exploring posterior distributions. Annals of Statistics, 22, 1701–1761. Train, K. E. (2003). Discrete choice methods with simulation. Cambridge: Cambridge University Press.
THE PANEL PROBIT MODEL: ADAPTIVE INTEGRATION ON SPARSE GRIDS

Florian Heiss

ABSTRACT

In empirical research, panel (and multinomial) probit models are leading examples for the use of maximum simulated likelihood estimators. The Geweke–Hajivassiliou–Keane (GHK) simulator is the most widely used technique for this type of problem. This chapter suggests an algorithm that is based on GHK but uses an adaptive version of sparse-grids integration (SGI) instead of simulation. It is adaptive in the sense that it uses an automated change-of-variables to make the integration problem numerically better behaved along the lines of efficient importance sampling (EIS) and adaptive univariate quadrature. The resulting integral is approximated using SGI, which generalizes Gaussian quadrature in a way such that the computational costs do not grow exponentially with the number of dimensions. Monte Carlo experiments show an impressive performance compared to the original GHK algorithm, especially in difficult cases such as models with high intertemporal correlations.
Maximum Simulated Likelihood Methods and Applications
Advances in Econometrics, Volume 26, 41–64
Copyright © 2010 by Emerald Group Publishing Limited
All rights of reproduction in any form reserved
ISSN: 0731-9053/doi:10.1108/S0731-9053(2010)0000026006
1. INTRODUCTION Panel probit models are widely used specifications for modeling discrete (binary) dependent variables in panel data settings. Multinomial probit models are popular for modeling multinomial outcomes with flexible substitution patterns. Both types of models suffer from the fact that the implied outcome probabilities are not analytically tractable which impedes estimation. Consequently, a lot of research has been devoted to the simulation-based estimation of these types of models. For a textbook discussion, see Greene (2008) and Train (2009). Hajivassiliou, McFadden, and Ruud (1996) provide a comprehensive survey of different approaches and Monte Carlo simulations to compare their relative performance. They and other studies found the Geweke– Hajivassiliou–Keane (GHK) simulator to be most powerful among the different alternatives. As a result, it is by far the most used approach for estimating multinomial and panel probit models and is routinely implemented in software packages like LIMDEP and Stata. However, computational costs required to obtain accurate simulation estimators can be substantial depending on the nature of the problem. For example, Geweke, Keane, and Runkle (1997) and Lee (1997) find that the simulation estimators can have poor properties with strong serial correlation and long panels. One commonly used general method of improving properties of simulation estimators is the use of deterministic sequences with certain properties instead of random numbers from a (pseudo-) random number generator. An especially successful sequence of numbers are Halton sequences, see Bhat (2001). Another alternative to simulation is deterministic numerical integration. The problem of the most widely known method of multivariate numerical integration is that the computational costs grow exponentially with the number of dimensions making it infeasible for higher dimensions and impractical for an intermediate number of dimensions such as 4 or 5. A much less common approach is integration on sparse grids, see Heiss and Winschel (2008). It is easy to implement, and its computational costs only rise polynomially with the number of dimensions. In the setting of maximum simulated likelihood estimation, Heiss and Winschel (2008) apply it very successfully to problems with up to 20 dimensions. Importance sampling has long been known to provide an opportunity to reformulate simulation problems. The term ‘‘efficient importance sampling’’ (EIS) has been used for an algorithm that aims at an optimal reformulation in order to minimize simulation errors, see Richard and Zhang (2007). A related approach is called adaptive numerical integration, see
Rabe-Hesketh, Skrondal, and Pickles (2005). It reformulates the integration problem in a similar fashion as EIS and is typically used for univariate integration problems. This chapter combines these different aspects of the problem of approximating panel (and/or multivariate) probit probabilities and compares the relative performance in a simulation study. Section 2 introduces the class of models and notation. Section 3 discusses the various dimensions of the approximation problem. Section 4 presents the design and results of a simulation study and Section 5 summarizes and concludes.
2. THE PANEL PROBIT MODEL

We are concerned with a panel data model for a binary dependent variable $y_{it}$, with $i$ denoting cross-sectional units and $t = 1, \ldots, T$ a time indicator. It is driven by a set of strictly exogenous regressors $x_{it}$ and an unobserved random variable $\tilde{e}_{it}$. Dropping the individual subscripts for notational convenience, we make the parametric assumption that
$$y_t = \begin{cases} 1 & \text{if } \tilde{e}_t \leq x_t\beta \\ -1 & \text{otherwise} \end{cases} \qquad (1)$$
where $\beta$ is a vector of parameters to be estimated. Assume that the error terms $\tilde{e}_1$ through $\tilde{e}_T$ are independent of the covariates and jointly normally distributed with means of zero and a general variance–covariance matrix. Its elements are parameters to be estimated subject to identifying normalizations and potentially other constraints. Let $y = [y_1, y_2, \ldots, y_T]$ denote the vector of dependent random variables and $d = [d_1, d_2, \ldots, d_T]$ a corresponding vector of observed outcomes. For the estimation, for example, by maximum likelihood, the joint outcome probability
$$P := \Pr[y = d \mid x] \qquad (2)$$
needs to be evaluated. Define $v_t := (1/\sigma_t)(d_t x_t\beta)$ and $e_t := (1/\sigma_t)(d_t \tilde{e}_t)$, with $\sigma_t$ denoting the standard deviation of $\tilde{e}_t$. Then, $e_t$ is a normally distributed random variable with mean 0 and variance–covariance matrix $\Sigma$, where the variance of $e_t$ is equal to unity and $\mathrm{Cov}(e_t, e_s) = (d_t d_s/\sigma_t\sigma_s)\,\mathrm{Cov}(\tilde{e}_t, \tilde{e}_s)$. This allows the convenient expression for the joint outcome probability:
$$P = \Pr[e_1 \leq v_1, e_2 \leq v_2, \ldots, e_T \leq v_T \mid v] = \Pr[e \leq v \mid v] \qquad (3)$$
This is the cumulative distribution function (CDF) of the joint normal distribution, which does not have a closed-form expression and needs to be approximated. While this chapter is concerned with this type of panel data model for binary outcomes, multinomial probit models for cross-sectional and panel data lead to a very similar structure and can be expressed in the same form as Eq. (3), with an appropriately defined jointly normally distributed vector of random variables $e$ and observed values $v$. The discussion of how to approximate $P$ is therefore the same for this class of models. To see this, consider the cross-sectional case. Each individual faces the choice between $T$ outcomes. She chooses outcome $j$, denoted as $y = j$, if it delivers the highest utility, which is specified as a linear function of alternative-specific regressors plus an alternative-specific error term, $x_j\beta + \tilde{e}_j$. Then
$$y = j \quad \Leftrightarrow \quad \tilde{e}_k - \tilde{e}_j \leq (x_j - x_k)\beta \qquad \forall k = 1, \ldots, T \qquad (4)$$
With $d$ denoting the observed outcome, define $v_t := (x_d - x_t)\beta$ and $e_t := \tilde{e}_t - \tilde{e}_d$. The relevant outcome probability is
$$P := \Pr[y = d \mid x] = \Pr[e \leq v \mid v] \qquad (5)$$
where $e$ and $v$ are vectors of all random variables $e_t$ and observed values $v_t$, respectively, leaving out the $d$th values for which this inequality trivially holds. The vector of random variables $e$ is jointly normally distributed if the original error terms $\tilde{e}_j$ are. This is the same expression, the CDF of jointly normal random variables evaluated at a given vector, as Eq. (3). Similar normalization by the standard deviations can be applied.
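A small Python sketch of this reduction for the cross-sectional multinomial probit (the differencing matrix and the use of scipy's multivariate normal CDF as a brute-force reference value are illustrative choices, not part of the chapter):

```python
import numpy as np
from scipy.stats import multivariate_normal

def mnp_outcome_probability(X, beta, Sigma_util, d):
    """P = Pr[y = d] for a multinomial probit with utilities x_j'beta + e_j,
    e ~ N(0, Sigma_util), obtained by differencing against alternative d."""
    T = X.shape[0]
    others = [k for k in range(T) if k != d]
    v = (X[d] - X[others]) @ beta          # v_t = (x_d - x_t)'beta, t != d
    M = np.eye(T)[others]                  # rows pick e_t ...
    M[:, d] = -1.0                         # ... and subtract e_d
    Sigma = M @ Sigma_util @ M.T           # covariance of the differenced errors
    # reference value for Pr[e <= v]; scipy's CDF is itself an approximate,
    # simulation-based computation, so treat it only as a benchmark
    return multivariate_normal(mean=np.zeros(T - 1), cov=Sigma).cdf(v)

# toy example with T = 4 alternatives
rng = np.random.default_rng(1)
X = rng.standard_normal((4, 2)); beta = np.array([0.5, -0.25])
A = rng.standard_normal((4, 4)); Sigma_util = A @ A.T
print(mnp_outcome_probability(X, beta, Sigma_util, d=0))
```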
3. SIMULATION AND APPROXIMATION OF THE OUTCOME PROBABILITIES

3.1. (Pseudo) Monte Carlo Integration

Simulation estimators are nowadays very popular and routinely used in empirical work. For general discussions, see Gouriéroux and Monfort (1996), Hajivassiliou and Ruud (1994), and Train (2009). For the panel probit model, the conceptually simplest approach for the simulation of the joint outcome probability is the so-called crude frequency simulator (CFS). Write
$$P = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} 1[e \leq v]\, f_e(e)\, de_T \cdots de_1 = \int_{\mathbb{R}^D} g(z)\, \phi(z)\, dz \qquad (6)$$
with
$$g(z) := 1[L_\Sigma z \leq v] \qquad (7)$$
with $f_e$ denoting the probability density function (PDF) of the jointly normally distributed $e = [e_1, \ldots, e_T]$ and $1[\cdot]$ being the indicator function. The second line is the result of a simple change of variables, with $L_\Sigma$ denoting the Cholesky decomposition of the variance–covariance matrix $\Sigma$ and $\phi(z)$ being the joint PDF of $T$ i.i.d. standard normal random variables. The CFS makes a number of draws $z^1, \ldots, z^R$ from independent standard normals and approximates the joint outcome probability as $\tilde{P}^{CFS} = \frac{1}{R}\sum_{r=1}^{R} 1[L_\Sigma z^r \leq v]$. The CFS will converge in probability to $P$ as $R \to \infty$, but for a finite $R$, this intuitive simulator has unfavorable properties compared to other simulators. For example, both $\tilde{P}^{CFS}$ and the implied likelihood are step functions of the parameters, which impedes numerical maximization. A smoothed version of this simulator avoids these problems and leads to a better approximation for a given finite number of draws $R$, see McFadden (1989).

Stern (1992) suggests a partially analytic (PA) simulator for which the error term is decomposed as $e = e^I + e^D$, where $e^I$ is a vector of i.i.d. normally distributed random variables with some variance $\sigma_I$ and $e^D$ is another vector of jointly normally distributed random variables with variance–covariance matrix $\Sigma_D = (\Sigma - \sigma_I I)$. This decomposition is in general possible due to special features of the joint normal distribution. Stern suggests using the first eigenvalue of $\Sigma$ as $\sigma_I$. Now the joint outcome probability can be solved partially analytically. Write
$$P = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \Pr[e^I + e^D \leq v \mid v, e^D]\, f_{e^D}(e^D)\, de^D_T \cdots de^D_1 = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \Pr[e^I \leq v - L_{\Sigma_D} z \mid v, z]\, \phi(z)\, dz_T \cdots dz_1 \qquad (8)$$
where $e^D = L_{\Sigma_D} z$ and $L_{\Sigma_D}$ is the Cholesky matrix of the covariance matrix $\Sigma_D$. Because of the independence of $e^I$, the conditional probability inside of the integral has an analytic expression. Namely,
$$P = \int_{\mathbb{R}^D} g(z)\, \phi(z)\, dz \qquad (9)$$
with
$$g(z) = \prod_{t=1}^{T} \Phi\!\left(\frac{v_t - e^D_t}{\sigma_I}\right) \qquad (10)$$
Stern's decomposition simulator now makes draws $z^1, \ldots, z^R$ from the standard normal distribution and then evaluates
$$\tilde{P}^{PA} = \frac{1}{R}\sum_{r=1}^{R} \Pr[e^I \leq v - L_{\Sigma_D} z^r \mid v, z^r] \qquad (11)$$
A comprehensive collection of these and many other simulators for the panel and the multinomial probit models can be found in Hajivassiliou et al. (1996). Their Monte Carlo study finds that, in general, the GHK simulator works most accurately for a given number of draws and computational burden, a result that has been confirmed by other studies as well. Therefore, the majority of empirical work makes use of the GHK simulator, and it is used by boxed routines implemented in software packages like LIMDEP and Stata. It is based on work of Geweke (1989), Börsch-Supan and Hajivassiliou (1993), and Keane (1994). While the GHK algorithm is often directly motivated by sampling from appropriate joint distributions, it will be useful below to describe it in terms of an integral that can be easily approximated, for example, by simulation. Let $L$ denote the Cholesky factorization of the variance/covariance matrix $\Sigma$ and let $l_{t,s}$ denote its element in row $t$ and column $s$. Then, the joint outcome probability can be written as¹
$$P = \int_{\mathbb{R}^D} g(z)\, \phi(z)\, dz \qquad (12)$$
with
$$g(z) := \prod_{t=1}^{T} g_t(\Phi(z)) \qquad (13)$$
$$g_1(u) := \Phi\!\left(\frac{v_1}{l_{1,1}}\right) \qquad (14)$$
$$g_t(u) := \Phi\!\left(\frac{v_t - l_{t,1} h_1(u) - l_{t,2} h_2(u) - \cdots - l_{t,t-1} h_{t-1}(u)}{l_{t,t}}\right) \qquad (15)$$
and
$$h_t(u) := \Phi^{-1}(u_t\, g_t(u)) \qquad (16)$$
The GHK simulator starts from a sample $z^1, \ldots, z^R$ from the independent normal distribution, or $u^1, \ldots, u^R$ from the independent uniform distribution with $u^r_t = \Phi(z^r_t)$ for all $t$ and $r$. For each draw $r$, the values $g_t(u^r)$ can be recursively computed: $g_1(u^r)$ actually does not depend on $u^r$ and is the same for all draws. The value of $g_2(u^r)$ depends on $g_1(u^r)$ through $h_1(u)$. In the next step, $g_3(u^r)$ depends on $g_1(u^r)$ and $g_2(u^r)$, both of which have been evaluated in the steps before. This is repeated until $g_T(u^r)$ is evaluated. The GHK simulator is then simply
$$\tilde{P}^{GHK} = \frac{1}{R}\sum_{r=1}^{R} \prod_{t=1}^{T} g_t(u^r). \qquad (17)$$
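The recursion in Eqs. (13)–(17) translates almost line by line into code. A minimal Python sketch, assuming the normalized problem $\Pr[e \leq v]$ with $\mathrm{corr}(e) = \Sigma$ (variable names are mine, not the chapter's):

```python
import numpy as np
from scipy.stats import norm

def ghk(v, Sigma, R=1000, rng=None):
    """GHK estimate of Pr[e <= v] for e ~ N(0, Sigma), following Eqs. (13)-(17)."""
    rng = np.random.default_rng() if rng is None else rng
    T = len(v)
    L = np.linalg.cholesky(Sigma)
    u = rng.uniform(size=(R, T))            # u_t^r; could also be quasi-random draws
    g = np.empty((R, T))
    h = np.empty((R, T))
    for t in range(T):
        # Eq. (15): standardized upper bound for component t given earlier h's
        num = v[t] - h[:, :t] @ L[t, :t]
        g[:, t] = norm.cdf(num / L[t, t])
        # Eq. (16): truncated draw for component t, expressed through the uniform u_t
        h[:, t] = norm.ppf(u[:, t] * g[:, t])
    return g.prod(axis=1).mean()            # Eq. (17)
```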
3.2. Quasi Monte Carlo (QMC)

The integrals in Eqs. (6), (9), and (12) are all written as expectations over independent standard normal random variables. Monte Carlo integration approximates them by making random draws from this distribution.² A straightforward and often powerful way to improve the properties of simulation estimators are quasi-Monte Carlo (QMC) approaches. They can be used in combination with all simulation estimators presented above. Instead of random draws, QMC algorithms use deterministic sequences of numbers. These sequences share some properties with random numbers, but are specifically designed to provide a better performance in Monte Carlo simulations. Intuitively, QMC draws cover the support of the integral more evenly than random numbers. This leads to a more accurate approximation with a finite number of draws and can, depending on the nature of the integrand, also improve the rate of convergence. One example of such sequences which has been found to work well for related problems are Halton sequences, see Bhat (2001). For an intuitive introduction to Halton sequences, other QMC and related methods, see Train (2009). A more formal discussion is provided by Sloan and Woźniakowski (1998).
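For instance, replacing the pseudo-random uniforms in the GHK sketch above with Halton points is a one-line change (this uses scipy's quasi-Monte Carlo module as one possible implementation; scrambling, as advocated by Bhat (2003), is an option of the generator):

```python
from scipy.stats import qmc

def halton_uniforms(R, T, scramble=True, seed=0):
    """R x T matrix of Halton points in (0, 1)^T, usable in place of rng.uniform."""
    sampler = qmc.Halton(d=T, scramble=scramble, seed=seed)
    return sampler.random(R)

# u = halton_uniforms(R=1000, T=len(v))   # then proceed exactly as in the GHK sketch
```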
3.3. Multivariate Numerical Integration

For the approximation of univariate integrals for estimation purposes, Gaussian quadrature is a powerful alternative to simulation. This is well established at least since Butler and Moffit (1982). Gaussian quadrature
rules prescribe a given number R of nodes z1 ; . . . ; zR and corresponding weights w1 ; . . . ; wR for a given class of integration problems. An important example is R 1the expectation of a function of a normally distributed random the Gauss– variable 1 gðzÞfðzÞ dz which can be approximated using P r r Hermite quadrature rule. The approximated integral is simply R r¼1 gðz Þw . While Monte Carlo integration chooses the nodes randomly and quasiMonte Carlo spreads them evenly, Gaussian quadrature places them strategically. A Gaussian quadrature rule with R nodes and weights gives the exact value of the integral if gðzÞ is a polynomial of order 2R 1 or less. The result will not be exact if gðzÞ is not a polynomial, but since smooth functions can well be approximated by polynomials of sufficient order, it gives an often remarkably precise approximation even with a relatively small value R. Gaussian quadrature in several dimensions is less straightforward. There are three approaches. The first, easiest and best-known approach is the so-called (tensor) product integration rule. It sequentially nests univariate rules. For example, the expectation of Ra function of two independent 1 R1 normally distributed random variables 1 1 gðz1 ; z2 Þfðz1 Þfðz2 Þ dz2 dz1 can using the by the Gauss–Hermite nodes and weights as PR bePapproximated R r s r s gðz ; z Þw w . The problem of this rule is the curse of dimensionr¼1 s¼1 ality: if the integral is over D dimensions and the rule uses R nodes in each dimension, gðÞ has to be evaluated RD times which quickly results in prohibitive computational costs as D exceeds 4 or 5. These methods deliver the exact value of the integral if gðÞ belongs to a rather peculiar class of polynomials: If the underlying univariate rule is exact for a polynomial order K, the maximal exponent (max½j 1 ; j 2 ; . . . ; j D ) showing up in any j j j monomial z11 z22 zDD which make up the multivariate polynomial cannot exceed K. The second approach to Gaussian quadrature in the multvariate setting is based on the observation that complete multivariate polynomials of a given total order are P a more natural class of functions. They restrict the sum of exponents ( D d¼1 j d ) of the monomials instead of the maximum, see Judd (1998).3 Unlike univariate integration problems, multivariate Gaussian quadrature rules for complete polynomials are hard and often impossible to derive, very problem-specific and therefore (if mentioned at all) considered impractical for applied work, see, for example, Geweke (1996). For a compendium of different such multivariate integration rules, see Cools (2003). A third way to use Gaussian quadrature in several dimensions is sparsegrids integration (SGI). Heiss and Winschel (2008) suggest to use this
Fig. 1. Product Rule vs. Sparse Grid: Example. (a) Product Rule; (b) Sparse Grid.
Heiss and Winschel (2008) suggest using this approach in the setting of econometric estimation and provide encouraging evidence for its performance. SGI combines univariate quadrature rules just as the product rule does and is therefore just as flexible and generic. But it does so in a more careful way such that the resulting rule delivers exact results for polynomials of a given total order, and the number of nodes rises only polynomially instead of exponentially. As an example, to achieve exactness for polynomials of order 5 or less for 5 and 20 dimensions, SGI requires 51 and 801 nodes, while the product rule requires $3^5 = 243$ and $3^{20} = 3{,}486{,}784{,}401$ nodes, respectively.

Fig. 1 presents an example of how the nodes of a sparse grid are distributed compared to a full product rule. The product rule uses the full grid of all combinations of nodes in all dimensions, as demonstrated in the two-dimensional case in Fig. 1(a). Univariate nodes and weights are combined into a sparse grid in a more complicated fashion. It consists of a combination of several subgrids, each of which has a different degree of accuracy in each dimension. If one of these subgrids is very fine in one dimension, it is very coarse in the other dimensions. This is analogous to the definition of polynomials of a given total order which restricts the sum of the exponents – if one exponent in a monomial is very high, the others must be relatively low. The sparse grid now combines grids which are very fine in one and coarse in other dimensions with others which are moderately fine in each dimension. The general strategy to extend univariate operators to multivariate problems in this way is based on the work of Smolyak (1963) and has generated a lot of interest in the field of numerical mathematics, see Bungartz and Griebel (2004). Heiss and Winschel (2008) discuss the use of SGI in econometrics, and show the exact formulas and details on how the sparse grids are built. They also provide Matlab and Stata code for generating them as well as a set of readily evaluated numbers for the most important cases for download at http://sparse-grids.de.
With those, the implementation is straightforward. The difference from simulation boils down to calculating a weighted sum instead of an average. Suppose you want to evaluate the expected value of some function $g(z_1, \ldots, z_{20})$ of independent normals such as in Eqs. (6), (8), and (12), with $T = 20$ and a level of polynomial exactness of 5. You obtain the $801 \times 20$ matrix of nodes $[z_1, \ldots, z_{20}]$ and the $801 \times 1$ vector of weights $w$ as a download or from the mentioned software. The approximated value is then equal to $\sum_{r=1}^{801} g(z_1^r, \ldots, z_{20}^r)\, w^r$.
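A minimal sketch of this weighted sum is given below (in Python, not the Matlab/Stata code provided by the author); the file names and the integrand g are placeholders for the downloaded nodes and weights and for the actual problem.

import numpy as np

# hypothetical files holding the 801 x 20 node matrix and the 801 x 1 weight vector
# (e.g., exported from the material available at http://sparse-grids.de)
nodes = np.loadtxt("sgi_nodes_d20_level5.txt")      # shape (801, 20)
weights = np.loadtxt("sgi_weights_d20_level5.txt")  # shape (801,)

def g(z):
    # placeholder integrand; in the chapter g is the integrand of Eq. (6), (8), or (12)
    return np.exp(z.sum(axis=1) / 20.0)

approx = np.sum(weights * g(nodes))   # a weighted sum instead of a simple average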
3.4. Efficient Importance Sampling and Adaptive Integration

Remember that the integrals in Eqs. (6), (9), and (12) are all written in the form
$$P = \int_{\mathbb{R}^D} g(z)\,\phi(z)\,dz \qquad (18)$$
with $\phi(z)$ denoting the joint PDF of independent standard normally distributed random variables and different functions $g(z)$. In the case of the CFS in Eq. (6), $g(z)$ is an indicator function which is either zero or one depending on $z$. In the other two equations, $g(z)$ is a smooth function of $z$, which improves the properties of all approximation and simulation approaches. Define some parametric function $k(z;\theta)$ which is bounded away from zero on $\mathbb{R}^D$. Quite obviously, the general integral can be rewritten as
$$P = \int_{\mathbb{R}^D} \frac{g(z)\,\phi(z)}{k(z;\theta)}\, k(z;\theta)\,dz \qquad (19)$$
Consider a Gaussian importance sampler $k(z;\theta) = |L^{-1}|\,\phi(L^{-1}(z-a))$, where the parameters $\theta$ define a vector (of means) $a$ and a nonsingular lower triangular (Cholesky factor of a covariance) matrix $L$. Rewrite the integral as
$$P = \int_{\mathbb{R}^D} \frac{g(z)\,\phi(z)}{\phi(L^{-1}(z-a))}\, \phi(L^{-1}(z-a))\,dz \qquad (20)$$
$$\phantom{P} = \int_{\mathbb{R}^D} h(x;\theta)\,\phi(x)\,dx \qquad (21)$$
with
$$h(x;\theta) = |L|\, g(a + Lx)\, \frac{\phi(a + Lx)}{\phi(x)} \qquad (22)$$
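The sketch below (in Python, not from the chapter) implements the reweighted integrand $h(x;\theta)$ of Eqs. (20)–(22) for a Gaussian importance sampler; the choices of g, a, and L are purely illustrative. For this particular g, the chosen a makes h exactly constant, which is the "ideal case" discussed below.

import numpy as np
from scipy.stats import norm

def log_std_normal(z):
    # log of the product of independent standard normal densities
    return norm.logpdf(z).sum(axis=-1)

def h(x, a, L, log_g):
    # integrand after the change of variables z = a + L x (Eqs. 20-22):
    # h(x; theta) = |L| * g(a + Lx) * phi(a + Lx) / phi(x)
    z = a + x @ L.T
    log_h = (np.log(np.abs(np.linalg.det(L))) + log_g(z)
             + log_std_normal(z) - log_std_normal(x))
    return np.exp(log_h)

# example: g(z) = exp(z1 + z2) in D = 2; then g(z)*phi(z) is proportional to a
# normal density with mean (1, 1), so a = (1, 1), L = I gives a constant h
log_g = lambda z: z.sum(axis=-1)
a = np.array([1.0, 1.0])
L = np.eye(2)
rng = np.random.default_rng(1)
x = rng.standard_normal((5000, 2))
print(h(x, a, L, log_g).mean())   # approximates E[exp(Z1 + Z2)] = e exactly here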
Given any $\theta$, this integral conforms to the general form (18) and can be numerically approximated by any of the methods discussed above. While the actual value of the integral by definition is the same for any value of $\theta$, the quality of the numerical approximation is in general affected by the choice of $\theta$. The introduction of the parameters provides the freedom of choosing values which help to improve the approximation quality. To see this, consider the "ideal case" in which there is some $\theta^{*}$ for which $h(x;\theta^{*}) = c$ is constant over $x$, that is, $k(z;\theta^{*})$ is proportional to $g(z)\phi(z)$. In this case, the integral is solved exactly and $P = c$. While this ideal case is infeasible for nontrivial problems, a goal in choosing the parametric family of functions $k(z;\theta)$ and the parameter vector $\theta$ is to come close to this ideal in some sense.

Such a change of variables approach is quite common for univariate Gaussian quadrature and is then often called "adaptive quadrature," a term used, for example, by Rabe-Hesketh et al. (2005). In the setting of simulation, an algorithm based on this approach is called "efficient importance sampling" by Richard and Zhang (2007). There are different approaches for choosing the parameters $\theta$ such that a given parametric family of functions $k(z;\theta)$ is close to proportional to $g(z)\phi(z)$. Liu and Pierce (1994) and Pinheiro and Bates (1995) choose the parameters such that $k(z;\theta)$ and $g(z)\phi(z)$ have the same mode and the same curvature at the mode. Note that in this case, the one-point quadrature rule with $R = 1$ and the single node $z^1 = [0, \ldots, 0]'$ corresponds to a Laplace approximation. Similarly, in the context of Bayesian analysis, Naylor and Smith (1988) and Rabe-Hesketh et al. (2005) choose the parameters such that $k(z;\theta)$ has the same means and variances as the (scaled) posterior distribution $g(z)\phi(z)$. The EIS algorithm by Richard and Zhang (2007) attempts to minimize (an approximation of) the variance of the simulation noise of $P$. This approach has been successfully implemented mostly in high-dimensional time-series applications. Notably, Liesenfeld and Richard (2008) use it for a time-series probit example. Since it appears to be very powerful and at the same time computationally relatively straightforward with the Gaussian importance sampler, this strategy will be implemented for both the simulation and the numerical integration techniques below.

Adaptive integration on sparse grids (ASGI) combines SGI with an intelligent reformulation of the integration problem in the spirit of EIS. Remember that for the Gaussian $k(z;\theta)$, the goal is to choose $\theta$
such that $\phi(L^{-1}(z-a))$ is close to proportional to $g(z)\phi(z)$. Define the distance $e(z;\theta)$ as
$$e(z;\theta) = \ln\frac{g(z)\,\phi(z)}{\phi(L^{-1}(z-a))} + c \qquad (23)$$
$$\phantom{e(z;\theta)} = \ln g(z) - \tfrac{1}{2} z'z + \tfrac{1}{2}(z-a)'(LL')^{-1}(z-a) + c \qquad (24)$$
Richard and Zhang (2007) show that the IS variance can approximately be minimized by minimizing the variance of $e(z;\theta)$ over the appropriate (weighted) distribution
$$V(\theta) = \int_{\mathbb{R}^D} e^2(z;\theta)\, g(z)\, \phi(z)\,dz \qquad (25)$$
With $w^r = 1/R$ for simulation and $w^r$ equal to the integration weight for SGI, the approximated value is
$$\tilde V(\theta) = \sum_{r=1}^{R} e^2(z^r;\theta)\, g(z^r)\, w^r \qquad (26)$$
The minimization of this function is a linear weighted least squares problem with weights $g(z^r)w^r$. The symmetric matrix $B^{-1} := (LL')^{-1}$ contains elements $b_{jk} = b_{kj}$. Eq. (24) can now be expressed for the nodes $z^r$ as
$$\ln g(z^r) - \frac{1}{2} z^{r\prime} z^r = \underbrace{-c - \frac{1}{2}\, a' B^{-1} a}_{a} + \sum_{j=1}^{D}\sum_{k=1}^{D} \underbrace{\left(-\frac{1}{2}\, b_{jk}\right)}_{b_{jk}} z_j^r z_k^r + z^{r\prime} \underbrace{\left(B^{-1} a\right)}_{c} + e(z^r;\theta)$$
The optimal values for B and a can therefore be found by a weighted linear least squares regression of the "dependent variable" $(\ln g(z^r) - \tfrac{1}{2} z^{r\prime} z^r)$ on a constant (parameter a), z (parameters c), and all squared and interacted values of z (parameters b). The "observations" correspond to the set of random draws or integration nodes $\{z^1, \ldots, z^R\}$. All "estimated" elements of matrix $B^{-1}$ can be recovered directly from $\hat b$. The vector a follows as $\hat B \hat c$. Actually, Richard and Zhang (2007) propose to apply another importance sampling reformulation to Eq. (25) in order to obtain better estimates of the optimal B and a. The result is an iterative procedure. Similar to their findings, it does not seem to improve the results in the simulations reported below significantly, so the one-step estimates of B and a as described above are used.
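The following sketch (in Python, not the author's implementation) carries out this one-step weighted least squares fit for given draws or nodes and recovers $B^{-1}$ and a; the bookkeeping for the quadratic terms and the test function are illustrative assumptions.

import numpy as np

def eis_step(z, w, log_g):
    # One weighted least-squares pass of the adaptive rescaling (a sketch).
    # z: (R, D) draws or nodes, w: (R,) weights (1/R for simulation, SGI weights otherwise),
    # log_g: callable returning ln g(z) row-wise. Returns the implied mean a and Cholesky L.
    R, D = z.shape
    lg = log_g(z)
    y = lg - 0.5 * np.sum(z**2, axis=1)            # "dependent variable"
    iu, ju = np.triu_indices(D)
    quad = z[:, iu] * z[:, ju]                     # squares and cross products of z
    X = np.column_stack([np.ones(R), z, quad])     # constant, z, quadratic terms
    wls = np.exp(lg - lg.max()) * w                # LS weights g(z)*w_r (rescaling is harmless)
    sw = np.sqrt(wls)
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    gamma = coef[1:1 + D]                          # coefficients on z, i.e. B^{-1} a
    q = coef[1 + D:]                               # coefficients on the quadratic terms
    # rebuild B^{-1}: diagonal coefficients are -b_jj/2, off-diagonal ones are -b_jk
    Binv = np.zeros((D, D))
    Binv[iu, ju] = np.where(iu == ju, -2.0 * q, -q)
    Binv = Binv + np.triu(Binv, 1).T
    a = np.linalg.solve(Binv, gamma)               # a = B * (B^{-1} a)
    L = np.linalg.cholesky(np.linalg.inv(Binv))    # requires the fitted B to be positive definite
    return a, L

# quick check: for g(z) = exp(z1 + z2), g(z)*phi(z) is proportional to a normal density
# with mean (1, 1) and unit covariance, so the step should return a ~ (1, 1) and L ~ identity
rng = np.random.default_rng(0)
z = rng.standard_normal((2000, 2))
a, L = eis_step(z, np.full(2000, 1.0 / 2000), lambda z: z.sum(axis=1))
print(a, L)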
4. THE PERFORMANCE OF DIFFERENT ALGORITHMS

In order to compare the performance of the various algorithms, they are applied to artificial data sets in this section. The data generating process is a panel probit model designed to be simple while providing the flexibility to assess the impact of all relevant model parameters. For all periods $t = 1, \ldots, T$, assume that the dependent variable $y_{it}$ is generated as
$$y_{it} = 1[\beta_0 + \beta_1 x_{it} + u_{it} \ge 0] \qquad (27)$$
For all simulations, the regressor $x_{it}$ is an i.i.d. standard normal random variable and the parameters are $\beta_0 = \beta_1 = 1$. The error term is jointly normal with a structure which represents an i.i.d. standard normal error plus a stationary AR(1) process with a correlation of $\rho$ and variance $\sigma^2$. These two parameters together drive the overall autocorrelation of $u_{it}$. So
$$u_{it} = v_{it} + e_{it} \qquad (28)$$
$$v_{it} = \rho\, v_{i,t-1} + w_{it} \qquad (29)$$
$$e_{it} \sim N(0, 1), \qquad w_{it} \sim N(0, \sigma^2 (1 - \rho^2)) \qquad (30)$$
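A sketch of this data generating process (in Python; the chapter's own implementation is not shown here) could look as follows; the sample size and parameter values are placeholders.

import numpy as np

def simulate_panel_probit(N, T, sigma2, rho, beta0=1.0, beta1=1.0, seed=0):
    # Simulate one sample from the DGP in Eqs. (27)-(30) (a sketch).
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((N, T))
    eps = rng.standard_normal((N, T))            # i.i.d. N(0, 1) component e_it
    # stationary AR(1) component with variance sigma2 and correlation rho
    v = np.empty((N, T))
    v[:, 0] = rng.normal(0.0, np.sqrt(sigma2), N)
    for t in range(1, T):
        v[:, t] = rho * v[:, t - 1] + rng.normal(0.0, np.sqrt(sigma2 * (1 - rho**2)), N)
    u = v + eps
    y = (beta0 + beta1 * x + u >= 0).astype(int)
    return y, x

y, x = simulate_panel_probit(N=1000, T=10, sigma2=5.0, rho=0.5)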
The crucial parameters which drive the numerical difficulty are T, $\sigma^2$, and $\rho$ and will be varied in the simulations. From this model, 1,000 samples are generated for different sets of parameters. The approximated results are then compared to true values.⁴ The implemented algorithms differ in four aspects discussed above:
– Formulation of the integration problem: Stern's decomposition/partial analytic simulator PA (9) vs. GHK (12).
– Integration approach: pseudo-MC (PMC) vs. quasi-MC (QMC, Halton sequences) vs. integration on sparse grids (SGI).
– Approximation of the original integral (RAW) vs. adaptive rescaling (EIS/ASGI). By default, just one iteration is used for the iterative updating of parameters.
– The accuracy level, corresponding to the number of random draws/deterministic nodes.
The appendix lists pseudo-code that should help to clarify details of the implementation of the various algorithms.
The results are presented as graphs with the horizontal axis showing the number of replications (random draws/deterministic nodes) and the vertical axis showing the root mean squared approximation error. The RMSE of all algorithms should converge to zero as the number of function evaluations increases.

In terms of computational time, the algorithms are not equally expensive. In the implemented code, the run times for the PA algorithm and GHK are roughly comparable with the same number of replications. The same holds for PMC vs. QMC vs. SGI. The adaptive rescaling takes more time. The difference depends on the problem (T and the number of replications). In a representative setting with T = 10 and R = 1,000, the adaptive rescaling slows down computations by a factor of roughly 5.

Fig. 2 shows the results for Stern's PA approach without adaptive rescaling for the different integration approaches. In all cases, the PMC ("Simulation") algorithm is clearly dominated by the QMC ("Halton sequences") approach. For the relatively well-behaved models with small T and/or $\sigma$, SGI is the dominant algorithm and delivers very accurate results even with small numbers of function evaluations. If, however, the problem becomes very ill behaved with both high T and $\sigma^2$ (Fig. 2(d) and (f)), this algorithm does not perform well. The integrand is very close to zero at most of the support. The sparse grid does not seem to be able to capture the small areas with positive values well. The remaining results presented are for the more interesting case of a high variance $\sigma^2 = 5$; other results can be requested from the author.

Fig. 3 shows similar results but now for the algorithms with adaptive rescaling in the sense of EIS/ASGI. The effect of this rescaling is impressive. In the figures with the RMSE axis scaled such that the previous results can be seen well (Fig. 3(a), (c), and (e)), the RMSE of the rescaling methods is in almost all cases hard to distinguish from zero. If the RMSE axis is scaled to show differences between the integration methods, the sparse grids algorithm (ASGI) dominates all others. Its advantage relative to EIS based on QMC using Halton sequences becomes smaller in less well-behaved problems, but never disappears.

The three graphs on the left of Fig. 4 compare Stern's PA approach with GHK. As in the previous literature, such as Börsch-Supan and Hajivassiliou (1993), the results show that GHK clearly dominates Stern's PA approach for any given integration method and parameter setting. Among the integration approaches, sparse grids is again most effective in all cases. The three graphs on the right of Fig. 4 evaluate the effect of adaptive rescaling in the GHK setting. The GHK approach based on random draws and Halton sequences is much improved by adaptive rescaling/EIS. Combining
Fig. 2. Integration Approach: Stern's PA Simulator without Rescaling. (a) T = 5, σ² = 0.2; (b) T = 5, σ² = 5; (c) T = 10, σ² = 0.2; (d) T = 10, σ² = 5; (e) T = 20, σ² = 0.2; (f) T = 20, σ² = 5.
Fig. 3. Adaptive Rescaling: Stern's PA Simulator with Rescaling (σ² = 5). (a) T = 5; (b) T = 5 (Zoomed y Scale); (c) T = 10; (d) T = 10 (Zoomed y Scale); (e) T = 20; (f) T = 20 (Zoomed y Scale).
Fig. 4. GHK vs. Stern's PA Simulator (σ² = 5). (a) T = 5; (b) T = 5 (Zoomed y Scale); (c) T = 10; (d) T = 10 (Zoomed y Scale); (e) T = 20; (f) T = 20 (Zoomed y Scale).
the three potential improvements GHK, SGI, and adaptive rescaling does not help a lot compared to an algorithm which uses only two of them. Finally, Fig. 5 investigates the effect of the parameters $\rho$ and $\sigma$ on the accuracy of approximation. Higher values of $\rho$ and $\sigma$ increase the intertemporal correlation and make the integrand less well-behaved – overall, the approximation errors increase. Comparing the algorithms, in all cases SGI dominates PMC and QMC. Adaptive rescaling improves the results, but the magnitude of this improvement depends on the difficulty of the problem and the efficiency of the integration algorithm. It pays off in all cases for PMC and in most cases for QMC. For SGI, the adaptive rescaling has a sizeable effect mainly in the difficult cases of high $\rho$ and/or $\sigma$.
5. SUMMARY

The panel probit model is a leading example for the use of maximum simulated likelihood estimators in applied econometric research. In the late 1980s and early 1990s, there was a lot of research on the best algorithm for simulation-based evaluation of the likelihood function for this kind of model and the multinomial probit model, which essentially has the same structure. The bottom line from this research was that the GHK simulator delivers the best results among a long list of competitors in most settings. However, simulation studies have shown that in "difficult" cases, such as long panels with high error term persistence, even the GHK simulator can perform poorly. This chapter revisits the topic of approximating the outcome probabilities implied by this type of model. It starts from GHK and Stern's partially analytic simulator and changes them in two ways in order to improve the approximation performance: the use of deterministic SGI and an automated adaptive rescaling of the integral in the spirit of EIS. The results using simulated data show that both approaches work impressively well. Not surprisingly, the GHK approach of formulating a numerical integration problem is more efficient than Stern's partially analytic (decomposition) approach. Another result from the literature which also clearly shows up in the results is that quasi-Monte Carlo simulation based on Halton draws is much more effective than simulation based on a (pseudo-) random number generator. One new result is that SGI clearly outperforms the other integration approaches in almost all simulations. As a typical example, in the model
Fig. 5. The Role of ρ and σ: GHK, T = 10. (a) σ = 1, ρ = 0.1; (b) σ = 5, ρ = 0.1; (c) σ = 1, ρ = 0.5; (d) σ = 5, ρ = 0.5; (e) σ = 1, ρ = 0.9; (f) σ = 5, ρ = 0.9.
with 10 longitudinal observations and a relatively high variance of the dependent error component of $\sigma^2 = 5$, the sparse grids results with R = 163 function evaluations are more accurate than the Halton results with R = 871 and the simulation results with R > 10,000 function evaluations. Another result is that in almost all combinations of GHK vs. Stern's approach and the three different integration methods, adding an adaptive rescaling step in the spirit of EIS improves the results impressively with a given number of function evaluations or – put differently – allows one to save a lot of computational cost with a given level of required accuracy. In summary, common computational practice in panel and multinomial probit models can be improved a lot using the methods presented in this chapter. This decline in computational costs can be substantial in empirical work and can be used to invest in more careful specification testing, more complex models, the ability to use larger data sets, or the exploration of applications which are otherwise infeasible.
NOTES

1. It might seem odd that the integral is written in terms of standard normal instead of uniform random variables. The reason is that Eqs. (6) and (8) are also written in this way and this will make a consistent discussion below easier.
2. Since computers are inherently unable to generate true random numbers, the random numbers used in implemented simulation algorithms are often called pseudo-random numbers.
3. A more familiar problem from local approximation may provide an intuition: Consider a second-order Taylor approximation in two dimensions. It involves the terms $z_1$, $z_2$, $z_1 z_2$, $z_1^2$, and $z_2^2$, but not the terms $z_1^2 z_2$, $z_1 z_2^2$, or $z_1^2 z_2^2$. It is a complete polynomial of total order 2.
4. True joint outcome probabilities are of course unavailable. Instead, results from the GHK algorithm with 500,000 draws are taken as "true" values.
REFERENCES

Bhat, C. R. (2001). Quasi-random maximum simulated likelihood estimation of the mixed multinomial logit model. Transportation Research B, 35, 677–693.
Bhat, C. R. (2003). Simulation estimation of mixed discrete choice models using randomized and scrambled Halton sequences. Transportation Research B, 37, 837–855.
Börsch-Supan, A., & Hajivassiliou, V. (1993). Smooth unbiased multivariate probability simulators for maximum likelihood estimation of limited dependent variable models. Journal of Econometrics, 58, 347–368.
Bungartz, H. J., & Griebel, M. (2004). Sparse grids. Acta Numerica, 13, 147–269.
Butler, J. S., & Moffit, R. (1982). A computationally efficient quadrature procedure for the one-factor multinomial probit model. Econometrica, 50, 761–764.
Cools, R. (2003). An encyclopedia of cubature formulas. Journal of Complexity, 19, 445–453.
Geweke, J. (1989). Bayesian inference in econometric models using Monte Carlo integration. Econometrica, 57, 1317–1339.
Geweke, J. (1996). Monte Carlo simulation and numerical integration. In: H. M. Amman, D. A. Kendrick & J. Rust (Eds), Handbook of computational economics (Vol. 1, pp. 731–800). Amsterdam: Elsevier Science.
Geweke, J. F., Keane, M. P., & Runkle, D. E. (1997). Statistical inference in the multinomial multiperiod probit model. Journal of Econometrics, 80, 125–165.
Gouriéroux, C., & Monfort, A. (1996). Simulation-based econometric methods. Oxford: Oxford University Press.
Greene, W. H. (2008). Econometric analysis (4th ed.). London: Prentice Hall.
Hajivassiliou, V., McFadden, D., & Ruud, P. (1996). Simulation of multivariate normal rectangle probabilities and their derivatives: Theoretical and computational results. Journal of Econometrics, 72, 85–134.
Hajivassiliou, V. A., & Ruud, P. A. (1994). Classical estimation methods for LDV models using simulation. In: R. F. Engle & D. L. McFadden (Eds), Handbook of econometrics (Vol. 4, pp. 2383–2441). New York: Elsevier.
Heiss, F., & Winschel, V. (2008). Likelihood approximation by numerical integration on sparse grids. Journal of Econometrics, 144, 62–80.
Judd, K. L. (1998). Numerical methods in economics. Cambridge, MA: MIT Press.
Keane, M. P. (1994). A computationally practical simulation estimator for panel data. Econometrica, 62, 95–116.
Lee, L. F. (1997). Simulated maximum likelihood estimation of dynamic discrete choice statistical models: Some Monte Carlo results. Journal of Econometrics, 82, 1–35.
Liesenfeld, R., & Richard, J. F. (2008). Improving MCMC, using efficient importance sampling. Computational Statistics & Data Analysis, 53, 272–288.
Liu, Q., & Pierce, D. A. (1994). A note on Gauss–Hermite quadrature. Biometrika, 81, 624–629.
McFadden, D. (1989). A method of simulated moments for estimation of discrete choice models without numerical integration. Econometrica, 57, 995–1026.
Naylor, J., & Smith, A. (1988). Econometric illustrations of novel numerical integration strategies for Bayesian inference. Journal of Econometrics, 38, 103–125.
Pinheiro, J. C., & Bates, D. M. (1995). Approximations to the log-likelihood function in the nonlinear mixed-effects model. Journal of Computational and Graphical Statistics, 4, 12–35.
Rabe-Hesketh, S., Skrondal, A., & Pickles, A. (2005). Maximum likelihood estimation of limited and discrete dependent variable models with nested random effects. Journal of Econometrics, 128, 301–323.
Richard, J. F., & Zhang, W. (2007). Efficient high-dimensional importance sampling. Journal of Econometrics, 141, 1385–1411.
Sloan, I. H., & Woźniakowski, H. (1998). When are quasi-Monte Carlo algorithms efficient for high dimensional integrals? Journal of Complexity, 14, 1–33.
Smolyak, S. A. (1963). Quadrature and interpolation formulas for tensor products of certain classes of functions. Soviet Mathematics Doklady, 4, 240–243.
Stern, S. (1992). A method for smoothing simulated moments of discrete probabilities in multinomial probit models. Econometrica, 60, 943–952.
Train, K. (2009). Discrete choice methods with simulation (2nd ed.). Cambridge University Press.
PSEUDO-CODE
A COMPARISON OF THE MAXIMUM SIMULATED LIKELIHOOD AND COMPOSITE MARGINAL LIKELIHOOD ESTIMATION APPROACHES IN THE CONTEXT OF THE MULTIVARIATE ORDERED-RESPONSE MODEL

Chandra R. Bhat, Cristiano Varin and Nazneen Ferdous

ABSTRACT

This chapter compares the performance of the maximum simulated likelihood (MSL) approach with the composite marginal likelihood (CML) approach in multivariate ordered-response situations. The ability of the two approaches to recover model parameters in simulated data sets is examined, as is the efficiency of estimated parameters and computational cost. Overall, the simulation results demonstrate the ability of the CML approach to recover the parameters very well in a 5–6 dimensional ordered-response choice model context. In addition, the CML recovers
parameters as well as the MSL estimation approach in the simulation contexts used in this study, while also doing so at a substantially reduced computational cost. Further, any reduction in the efficiency of the CML approach relative to the MSL approach is in the range of nonexistent to small. When taken together with its conceptual and implementation simplicity, the CML approach appears to be a promising approach for the estimation of not only the multivariate ordered-response model considered here, but also for other analytically intractable econometric models.
1. INTRODUCTION Ordered-response model systems are used when analyzing ordinal discrete outcome data that may be considered as manifestations of an underlying scale that is endowed with a natural ordering. Examples include ratings data (of consumer products, bonds, credit evaluation, movies, etc.), or Likertscale type attitudinal/opinion data (of air pollution levels, traffic congestion levels, school academic curriculum satisfaction levels, teacher evaluations, etc.), or grouped data (such as bracketed income data in surveys or discretized rainfall data), or count data (such as the number of trips made by a household, the number of episodes of physical activity pursued by an individual, and the number of cars owned by a household). In all of these situations, the observed outcome data may be considered as censored (or coarse) measurements of an underlying latent continuous random variable. The censoring mechanism is usually characterized as a partitioning or thresholding of the latent continuous variable into mutually exclusive (nonoverlapping) intervals. The reader is referred to McKelvey and Zavoina (1971) and Winship and Mare (1984) for some early expositions of the ordered-response model formulation, and Liu and Agresti (2005) for a survey of recent developments. The reader is also referred to a forthcoming book by Greene and Hensher (2010) for a comprehensive history and treatment of the ordered-response model structure. These recent reviews indicate the abundance of applications of the ordered-response model in the sociological, biological, marketing, and transportation sciences, and the list of applications only continues to grow rapidly. While the applications of the ordered-response model are quite widespread, much of these are confined to the analysis of a single outcome, with a sprinkling of applications associated with two and three correlated
ordered-response outcomes. Some very recent studies of two correlated ordered-response outcomes include Scotti (2006), Mitchell and Weale (2007), Scott and Axhausen (2006), and LaMondia and Bhat (2009).1 The study by Scott and Kanaroglou (2002) represents an example of three correlated ordered-response outcomes. But the examination of more than two to three correlated outcomes is rare, mainly because the extension to an arbitrary number of correlated ordered-response outcomes entails, in the usual likelihood function approach, integration of dimensionality equal to the number of outcomes. On the other hand, there are many instances when interest may be centered around analyzing several ordered-response outcomes simultaneously, such as in the case of the number of episodes of each of several activities, or satisfaction levels associated with a related set of products/services, or multiple ratings measures regarding the state of health of an individual/organization (we will refer to such outcomes as crosssectional multivariate ordered-response outcomes). There are also instances when the analyst may want to analyze time-series or panel data of orderedresponse outcomes over time, and allow flexible forms of error correlations over these outcomes. For example, the focus of analysis may be to examine rainfall levels (measured in grouped categories) over time in each of several spatial regions, or individual stop-making behavior over multiple days in a week, or individual headache severity levels at different points in time (we will refer to such outcomes as panel multivariate ordered-response outcomes). In the analysis of cross-sectional and panel ordered-response systems with more than three outcomes, the norm until very recently has been to apply numerical simulation techniques based on a maximum simulated likelihood (MSL) approach or a Bayesian inference approach. However, such simulation-based approaches become impractical in terms of computational time, or even infeasible, as the number of ordered-response outcomes increases. Even if feasible, the numerical simulation methods do get imprecise as the number of outcomes increase, leading to convergence problems during estimation. As a consequence, another approach that has seen some (though very limited) use recently is the composite marginal likelihood (CML) approach. This is an estimation technique that is gaining substantial attention in the statistics field, though there has been relatively little coverage of this method in econometrics and other fields. The CML method, which belongs to the more general class of composite likelihood function approaches, is based on forming a surrogate likelihood function that compounds much easier-to-compute, lower-dimensional, marginal likelihoods. The CML method is easy to implement and has the advantage
of reproducibility of results. Under usual regularity assumptions, the CML estimator is consistent and asymptotically normally distributed. The maximum CML estimator should lose some efficiency from a theoretical perspective relative to a full likelihood estimator, but this efficiency loss appears to be empirically minimal (see Zhao & Joe, 2005; Lele, 2006; Joe & Lee, 2009).2 Besides, the simulation estimation methods for evaluating the analytically intractable likelihood function also leads to a loss in estimator efficiency. The objective of this chapter is on introducing the CML inference approach to estimate general panel models of ordered-response. We also compare the performance of the MSL approach with the CML approach in ordered-response situations when the MSL approach is feasible. We use simulated data sets with known underlying model parameters to evaluate the two estimation approaches. The ability of the two approaches to recover model parameters is examined, as is the sampling variance and the simulation variance of parameters in the MSL approach relative to the sampling variance in the CML approach. The computational costs of the two approaches are also presented. The rest of this chapter is structured as follows. In the next section, we present the structures of the cross-sectional and panel multivariate ordered-response systems. Section 3 discusses the simulation estimation methods (with an emphasis on the MSL approach) and the CML estimation approach. Section 4 presents the experimental design for the simulation experiments, while Section 5 discusses the results. Section 6 concludes the chapter by highlighting the important findings.
2. THE MULTIVARIATE ORDERED-RESPONSE SYSTEM

2.1. The Cross-Sectional Multivariate Ordered-Response Probit (CMOP) Formulation

Let q be an index for individuals (q = 1, 2, …, Q, where Q denotes the total number of individuals in the data set), and let i be an index for the ordered-response variable (i = 1, 2, …, I, where I denotes the total number of ordered-response variables for each individual). Let the observed discrete (ordinal) level for individual q and variable i be $m_{qi}$ ($m_{qi}$ may take one of $K_i$ values; i.e., $m_{qi} \in \{1, 2, \ldots, K_i\}$ for variable i). In the usual ordered-response
framework notation, we write the latent propensity ($y_{qi}^{*}$) for each ordered-response variable as a function of relevant covariates and relate this latent propensity to the observed discrete level $m_{qi}$ through threshold bounds (see McKelvey & Zavoina, 1975):
$$y_{qi}^{*} = \beta_i' x_{qi} + \epsilon_{qi}, \quad y_{qi} = m_{qi} \ \text{if} \ \theta_i^{m_{qi}-1} < y_{qi}^{*} < \theta_i^{m_{qi}}, \qquad (1)$$
where $x_{qi}$ is an (L × 1) vector of exogenous variables (not including a constant), $\beta_i$ is a corresponding (L × 1) vector of coefficients to be estimated, $\epsilon_{qi}$ is a standard normal error term, and $\theta_i^{m_{qi}}$ is the upper bound threshold for discrete level $m_{qi}$ of variable i ($\theta_i^{0} < \theta_i^{1} < \theta_i^{2} < \cdots < \theta_i^{K_i-1} < \theta_i^{K_i}$; $\theta_i^{0} = -\infty$, $\theta_i^{K_i} = +\infty$ for each variable i). The $\epsilon_{qi}$ terms are assumed independent and identical across individuals (for each and all i). For identification reasons, the variance of each $\epsilon_{qi}$ term is normalized to 1. However, we allow correlation in the $\epsilon_{qi}$ terms across variables i for each individual q. Specifically, we define $\epsilon_q = (\epsilon_{q1}, \epsilon_{q2}, \epsilon_{q3}, \ldots, \epsilon_{qI})'$. Then, $\epsilon_q$ is multivariate normal distributed with a mean vector of zeros and a correlation matrix as follows:
$$\epsilon_q \sim N\left[\begin{pmatrix}0\\ 0\\ \vdots\\ 0\end{pmatrix},\ \begin{pmatrix}1 & \rho_{12} & \rho_{13} & \cdots & \rho_{1I}\\ \rho_{21} & 1 & \rho_{23} & \cdots & \rho_{2I}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ \rho_{I1} & \rho_{I2} & \rho_{I3} & \cdots & 1\end{pmatrix}\right], \ \text{or} \ \epsilon_q \sim N[0, \Sigma] \qquad (2)$$
The off-diagonal terms of $\Sigma$ capture the error covariance across the underlying latent continuous variables; that is, they capture the effects of common unobserved factors influencing the underlying latent propensities. These are the so-called polychoric correlations between pairs of observed ordered-response variables. Of course, if all the correlation parameters (i.e., off-diagonal elements of $\Sigma$), which we will stack into a vertical vector $\Omega$, are identically zero, the model system in Eq. (1) collapses to independent ordered-response probit models for each variable. Note that the diagonal elements of $\Sigma$ are normalized to one for identification purposes. The parameter vector (to be estimated) of the cross-sectional multivariate probit model is $\delta = (\beta_1', \beta_2', \ldots, \beta_I', \theta_1', \theta_2', \ldots, \theta_I', \Omega')'$, where $\theta_i = (\theta_i^{1}, \theta_i^{2}, \ldots, \theta_i^{K_i-1})'$ for $i = 1, 2, \ldots, I$. The likelihood function for individual
q may be written as follows:
$$L_q(\delta) = \Pr(y_{q1} = m_{q1}, y_{q2} = m_{q2}, \ldots, y_{qI} = m_{qI})$$
$$L_q(\delta) = \int_{v_1 = \theta_1^{m_{q1}-1} - \beta_1' x_{q1}}^{\theta_1^{m_{q1}} - \beta_1' x_{q1}} \int_{v_2 = \theta_2^{m_{q2}-1} - \beta_2' x_{q2}}^{\theta_2^{m_{q2}} - \beta_2' x_{q2}} \cdots \int_{v_I = \theta_I^{m_{qI}-1} - \beta_I' x_{qI}}^{\theta_I^{m_{qI}} - \beta_I' x_{qI}} \phi_I(v_1, v_2, \ldots, v_I \mid \Omega)\, dv_1\, dv_2 \ldots dv_I, \qquad (3)$$
where $\phi_I$ is the standard multivariate normal density function of dimension I. The likelihood function above involves an I-dimensional integral for each individual q.

2.2. The Panel Multivariate Ordered-Response Probit (PMOP) Formulation

Let q be an index for individuals as earlier (q = 1, 2, …, Q), but let j now be an index for the jth observation (say at time $t_{qj}$) on individual q (j = 1, 2, …, J, where J denotes the total number of observations on individual q).³ Let the observed discrete (ordinal) level for individual q at the jth observation be $m_{qj}$ ($m_{qj}$ may take one of K values; i.e., $m_{qj} \in \{1, 2, \ldots, K\}$). In the usual random-effects ordered-response framework notation, we write the latent variable ($y_{qj}^{*}$) as a function of relevant covariates as:
$$y_{qj}^{*} = \beta' x_{qj} + u_q + \epsilon_{qj}; \quad y_{qj} = m_{qj} \ \text{if} \ \theta^{m_{qj}-1} < y_{qj}^{*} < \theta^{m_{qj}}, \qquad (4)$$
where $x_{qj}$ is an (L × 1) vector of exogenous variables (not including a constant), $\beta$ is a corresponding (L × 1) vector of coefficients to be estimated, $\epsilon_{qj}$ is a standard normal error term uncorrelated across observations j for individual q and also uncorrelated across individuals q, and $\theta^{m_{qj}}$ is the upper bound threshold for discrete level $m_{qj}$ ($\theta^{0} < \theta^{1} < \theta^{2} < \cdots < \theta^{K-1} < \theta^{K}$; $\theta^{0} = -\infty$, $\theta^{K} = +\infty$). The term $u_q$ represents an individual-specific random term, assumed to be normally distributed with mean zero and variance $\sigma^2$. The term $u_q$ is independent of $u_{q'}$ for $q \neq q'$. The net result of the specification above is that the joint distribution of the latent variables ($y_{q1}^{*}, y_{q2}^{*}, \ldots, y_{qJ}^{*}$) for the qth subject is multivariate normal with mean vector $(\beta' x_{q1}/m, \beta' x_{q2}/m, \ldots, \beta' x_{qJ}/m)$ and a standardized correlation matrix with constant nondiagonal entries $\sigma^2/m^2$, where $m = \sqrt{1 + \sigma^2}$. The standard random-effects ordered-response model of Eq. (4) allows easy estimation, since one can write the probability of the sequence of
observed ordinal responses across the multiple observations on the same individual, conditional on $u_q$, as the product of standard ordered-response model probabilities, and then integrate the resulting probability over the range of normally distributed $u_q$ values for each individual. This results in only a one-dimensional integral for each individual, which can be easily computed using numerical quadrature methods. However, the assumption of equal correlation across the multiple observations on the same individual is questionable, especially for medium-to-long individual-specific series. An alternative would be to allow serial correlation within each subject-specific series of observations, as proposed by Varin and Czado (2010). For instance, one may adopt an autoregressive structure of order one for the error terms of the same individual so that $\text{corr}(\epsilon_{qj}, \epsilon_{qk}) = \rho^{|t_{qj} - t_{qk}|}$, where $t_{qj}$ is the measurement time of observation $y_{qj}$.⁴ The autoregressive error structure specification results in a joint multivariate distribution of the latent variables ($y_{q1}^{*}, y_{q2}^{*}, \ldots, y_{qJ}^{*}$) for the qth individual with standardized mean vector $(\beta' x_{q1}/m, \beta' x_{q2}/m, \ldots, \beta' x_{qJ}/m)$ and a correlation matrix $\Sigma_q$ with entries such that $\text{corr}(y_{qj}^{*}, y_{qg}^{*}) = (\sigma^2 + \rho^{|t_{qj} - t_{qg}|})/m^2$, where $m = \sqrt{1 + \sigma^2}$. The cost of the flexibility is paid dearly, though, in terms of computational difficulty in the likelihood estimator. Specifically, rather than a single dimension of integration, we now have an integral of dimension J for individual q. The parameter vector (to be estimated) of the panel multivariate probit model is $\delta = (\beta', \theta^{1}, \theta^{2}, \ldots, \theta^{K-1}, \sigma, \rho)'$, and the likelihood for individual q becomes:
$$L_q(\delta) = \Pr(y_{q1} = m_{q1}, y_{q2} = m_{q2}, \ldots, y_{qJ} = m_{qJ}) = \int_{v_1 = a^{m_{q1}-1}}^{a^{m_{q1}}} \int_{v_2 = a^{m_{q2}-1}}^{a^{m_{q2}}} \cdots \int_{v_J = a^{m_{qJ}-1}}^{a^{m_{qJ}}} \phi_J(v_1, v_2, \ldots, v_J \mid \Sigma_q)\, dv_1\, dv_2 \ldots dv_J \qquad (5)$$
where $a^{m_{qj}} = (\theta^{m_{qj}} - \beta' x_{qj})/m$. The likelihood function above entails a J-dimensional integral for each individual q. The above model is labeled as a mixed autoregressive ordinal probit model by Varin and Czado (2010).
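As an illustration (not part of the chapter, and in Python rather than the authors' code), the sketch below builds the correlation matrix Σ_q implied by the random-effect-plus-AR(1) structure and checks a rectangle probability of the form in Eq. (5) by a crude frequency simulation; the thresholds and parameter values are placeholders.

import numpy as np

def pmop_corr(times, sigma2, rho):
    # correlation of (y*_q1, ..., y*_qJ): (sigma2 + rho^|t_j - t_g|) / (1 + sigma2);
    # the diagonal automatically equals one
    tt = np.abs(np.subtract.outer(times, times))
    return (sigma2 + rho**tt) / (1.0 + sigma2)

def rectangle_prob_mc(lower, upper, corr, draws=200_000, seed=0):
    # brute-force frequency simulator, used here only as a transparent check
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal(np.zeros(len(lower)), corr, size=draws)
    return np.mean(np.all((z > lower) & (z < upper), axis=1))

# illustrative values: J = 4 equally spaced observations, sigma^2 = 1, rho = 0.5
corr = pmop_corr(np.arange(4), sigma2=1.0, rho=0.5)
lower = np.full(4, -0.5)    # placeholder a^{m_qj - 1} values
upper = np.full(4, 0.8)     # placeholder a^{m_qj} values
print(rectangle_prob_mc(lower, upper, corr))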
3. OVERVIEW OF ESTIMATION APPROACHES As indicated in Section 1, models that require integration of more than three dimensions in a multivariate ordered-response model are typically estimated using simulation approaches, though some recent studies have considered a CML approach. Sections 3.1 and 3.2 provide an overview of each of these two approaches in turn.
3.1. Simulation Approaches Two broad simulation approaches may be identified in the literature for multivariate ordered-response modeling. One is based on a frequentist approach, while the other is based on a Bayesian approach. We provide an overview of these two approaches in the next two sections (Sections 3.1.1 and 3.1.2), and then (in Section 3.1.3) discuss the specific simulation approaches used in this chapter for estimation of the multivariate orderedresponse model systems. 3.1.1. The Frequentist Approach In the context of a frequentist approach, Bhat and Srinivasan (2005) suggested a MSL method for evaluating the multi-dimensional integral in a cross-sectional multivariate ordered-response model system, using quasiMonte Carlo simulation methods proposed by Bhat (2001, 2003). In their approach, Bhat and Srinivasan (BS) partition the overall error term into one component that is independent across dimensions and another mixing component that generates the correlation across dimensions. The estimation proceeds by conditioning on the error components that cause correlation effects, writing the resulting conditional joint probability of the observed ordinal levels across the many dimensions for each individual, and then integrating out the mixing correlated error components. An important issue is to ensure that the covariance matrix of the mixing error terms remains in a correlation form (for identification reasons) and is positive-definite, which BS maintain by writing the likelihood function in terms of the elements of the Cholesky-decomposed matrix of the correlation matrix of the mixing normally distributed elements and parameterizing the diagonal elements of the Cholesky matrix to guarantee unit values along the diagonal. Another alternative and related MSL method would be to consider the correlation across error terms directly without partitioning the error terms into two components. This corresponds to the formulation in Eqs. (1) and (2) of this chapter. Balia and Jones (2008) adopt such a formulation in their eight-dimensional multivariate probit model of lifestyles, morbidity, and mortality. They estimate their model using a Geweke–Hajivassiliou– Keane (GHK) simulator (the GHK simulator is discussed in more detail later in this chapter). However, it is not clear how they accommodated the identification sufficiency condition that the covariance matrix be a correlation matrix and be positive-definite. But one can use the GHK simulator combined with BS’s approach to ensure unit elements along the diagonal of the covariance matrix. Yet another MSL method to
approximate the multivariate rectangular (i.e., truncated) normal probabilities in the likelihood functions of Eq. (3) and (5) is based on the Genz–Bretz (GB) algorithm (also discussed in more detail later). In concept, all these MSL methods can be extended to any number of correlated orderedresponse outcomes, but numerical stability, convergence, and precision problems start surfacing as the number of dimensions increase. 3.1.2. The Bayesian Approach Chen and Dey (2000), Herriges, Phaneuf, and Tobias (2008), Jeliazkov, Graves, and Kutzbach (2008), and Hasegawa (2010) have considered an alternate estimation approach for the multivariate ordered-response system based on the posterior mode in an objective Bayesian approach. As in the frequentist case, a particular challenge in the Bayesian approach is to ensure that the covariance matrix of the parameters is in a correlation form, which is a sufficient condition for identification. Chen and Dey proposed a reparametization technique that involves a rescaling of the latent variables for each ordered-response variable by the reciprocal of the largest unknown threshold. Such an approach leads to an unrestricted covariance matrix of the rescaled latent variables, allowing for the use of standard Markov Chain Monte Carlo (MCMC) techniques for estimation. In particular, the Bayesian approach is based on assuming prior distributions on the nonthreshold parameters, reparameterizing the threshold parameters, imposing a standard conjugate prior on the reparameterized version of the error covariance matrix and a flat prior on the transformed threshold, obtaining an augmented posterior density using Baye’s Theorem for the reparameterized model, and fitting the model using a MCMC method. Unfortunately, the method remains cumbersome, requires extensive simulation, and is time-consuming. Further, convergence assessment becomes difficult as the number of dimensions increase. For example, Muller and Czado (2005) used a Bayesian approach for their panel multivariate ordered-response model, and found that the standard MCMC method exhibits bad convergence properties. They proposed a more sophisticated group move multigrid MCMC technique, but this only adds to the already cumbersome nature of the simulation approach. In this regard, both the MSL and the Bayesian approach are ‘‘brute force’’ simulation techniques that are not very straightforward to implement and can create convergence assessment problems. 3.1.3. Simulators Used in this Chapter In this chapter, we use the frequentist approach to compare simulation approaches with the CML approach. Frequentist approaches are widely
used in the literature, and are included in several software programs that are readily available. Within the frequentist approach, we test two MSL methods with the CML approach, just to have a comparison of more than one MSL method with the CML approach. The two MSL methods we select are among the most effective simulators for evaluating multivariate normal probabilities. Specifically, we consider the GHK simulator for the CMOP model estimation in Eq. (3), and the GB simulator for the PMOP model estimation in Eq. (5).

3.1.3.1. The Geweke–Hajivassiliou–Keane Probability Simulator for the CMOP Model. The GHK is perhaps the most widely used probability simulator for integration of the multivariate normal density function, and is particularly well known in the context of the estimation of the multivariate unordered probit model. It is named after Geweke (1991), Hajivassiliou (Hajivassiliou & McFadden, 1998), and Keane (1990, 1994). Train (2003) provides an excellent and concise description of the GHK simulator in the context of the multivariate unordered probit model. In this chapter, we adapt the GHK simulator to the case of the multivariate ordered-response probit model. The GHK simulator is based on directly approximating the probability of a multivariate rectangular region of the multivariate normal density distribution. To apply the simulator, we first write the likelihood function in Eq. (3) as follows:
$$L_q(\delta) = \Pr(y_{q1} = m_{q1})\, \Pr(y_{q2} = m_{q2} \mid y_{q1} = m_{q1})\, \Pr(y_{q3} = m_{q3} \mid y_{q1} = m_{q1}, y_{q2} = m_{q2}) \cdots \Pr(y_{qI} = m_{qI} \mid y_{q1} = m_{q1}, y_{q2} = m_{q2}, \ldots, y_{q,I-1} = m_{q,I-1}) \qquad (6)$$
Also, write the error terms in Eq. (2) as:
$$\begin{pmatrix}\epsilon_{q1}\\ \epsilon_{q2}\\ \vdots\\ \epsilon_{qI}\end{pmatrix} = \begin{pmatrix}l_{11} & 0 & \cdots & 0\\ l_{21} & l_{22} & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ l_{I1} & l_{I2} & \cdots & l_{II}\end{pmatrix}\begin{pmatrix}v_{q1}\\ v_{q2}\\ \vdots\\ v_{qI}\end{pmatrix}, \quad \text{i.e.,} \ \epsilon_q = L v_q \qquad (7)$$
where L is the lower triangular Cholesky decomposition of the correlation matrix $\Sigma$, and the $v_q$ terms are independent and identically distributed standard normal deviates (i.e., $v_q \sim N[0, I_I]$). Each (unconditional/conditional)
probability term in Eq. (6) can be written as follows:
$$\Pr(y_{q1} = m_{q1}) = \Pr\!\left(\frac{\theta_1^{m_{q1}-1} - \beta_1' x_{q1}}{l_{11}} < v_{q1} < \frac{\theta_1^{m_{q1}} - \beta_1' x_{q1}}{l_{11}}\right)$$
$$\Pr(y_{q2} = m_{q2} \mid y_{q1} = m_{q1}) = \Pr\!\left(\frac{\theta_2^{m_{q2}-1} - \beta_2' x_{q2} - l_{21} v_{q1}}{l_{22}} < v_{q2} < \frac{\theta_2^{m_{q2}} - \beta_2' x_{q2} - l_{21} v_{q1}}{l_{22}} \,\middle|\, \frac{\theta_1^{m_{q1}-1} - \beta_1' x_{q1}}{l_{11}} < v_{q1} < \frac{\theta_1^{m_{q1}} - \beta_1' x_{q1}}{l_{11}}\right)$$
$$\Pr(y_{q3} = m_{q3} \mid y_{q1} = m_{q1}, y_{q2} = m_{q2}) = \Pr\!\left(\frac{\theta_3^{m_{q3}-1} - \beta_3' x_{q3} - l_{31} v_{q1} - l_{32} v_{q2}}{l_{33}} < v_{q3} < \frac{\theta_3^{m_{q3}} - \beta_3' x_{q3} - l_{31} v_{q1} - l_{32} v_{q2}}{l_{33}} \,\middle|\, \text{the ranges of } v_{q1} \text{ and } v_{q2} \text{ above}\right)$$
$$\vdots$$
$$\Pr(y_{qI} = m_{qI} \mid y_{q1} = m_{q1}, \ldots, y_{q,I-1} = m_{q,I-1}) = \Pr\!\left(\frac{\theta_I^{m_{qI}-1} - \beta_I' x_{qI} - l_{I1} v_{q1} - \cdots - l_{I(I-1)} v_{q(I-1)}}{l_{II}} < v_{qI} < \frac{\theta_I^{m_{qI}} - \beta_I' x_{qI} - l_{I1} v_{q1} - \cdots - l_{I(I-1)} v_{q(I-1)}}{l_{II}} \,\middle|\, \text{the ranges of } v_{q1}, \ldots, v_{q(I-1)} \text{ above}\right) \qquad (8)$$
The error terms $v_{qi}$ are drawn D times (d = 1, 2, …, D) from the univariate standard normal distribution with the lower and upper bounds as above. To be precise, we use a randomized Halton draw procedure to generate the D realizations of $v_{qi}$, where we first generate standard Halton draw sequences
of size D × 1 for each individual for each dimension i (i = 1, 2, …, I), and then randomly shift the D × 1 integration nodes using a random draw from the uniform distribution (see Bhat, 2001, 2003 for a detailed discussion of the use of Halton sequences for discrete choice models). These random shifts are employed because we generate 10 different randomized Halton sequences of size D × 1 to compute simulation error. Gauss code implementing the Halton draw procedure is available for download from the home page of Chandra Bhat at http://www.caee.utexas.edu/prof/bhat/halton.html. For each randomized Halton sequence, the uniform deviates are translated to truncated draws from the normal distribution for $v_{qi}$ that respect the lower and upper truncation points (see, e.g., Train, 2003, p. 210). An unbiased estimator of the likelihood function for individual q is obtained as:
$$L_{GHK,q}(\delta) = \frac{1}{D}\sum_{d=1}^{D} L_q^{d}(\delta) \qquad (9)$$
where $L_q^{d}(\delta)$ is an estimate of Eq. (6) for simulation draw d. A consistent and asymptotically normal distributed GHK estimator $\hat\delta_{GHK}$ is obtained by maximizing the logarithm of the simulated likelihood function $L_{GHK}(\delta) = \prod_q L_{GHK,q}(\delta)$. The covariance matrix of parameters is estimated using the inverse of the sandwich information matrix (i.e., using the robust asymptotic covariance matrix estimator associated with quasi-maximum likelihood (ML); see McFadden & Train, 2000). The likelihood function (and hence, the log-likelihood function) mentioned above is parameterized with respect to the parameters of the Cholesky decomposition matrix L rather than the parameters of the original covariance matrix $\Sigma$. This ensures the positive-definiteness of $\Sigma$, but also raises two new issues: (1) the parameters of the Cholesky matrix L should be such that $\Sigma$ is a correlation matrix, and (2) the estimated parameter values (and asymptotic covariance matrix) do not correspond to $\Sigma$, but to L. The first issue is overcome by parameterizing the diagonal terms of L as shown below (see Bhat & Srinivasan, 2005):
$$L = \begin{pmatrix}1 & 0 & \cdots & 0\\ l_{21} & \sqrt{1 - l_{21}^2} & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ l_{I1} & l_{I2} & \cdots & \sqrt{1 - l_{I1}^2 - l_{I2}^2 - \cdots - l_{I(I-1)}^2}\end{pmatrix} \qquad (10)$$
The second issue is easily resolved by estimating $\Sigma$ from the convergent values of the Cholesky decomposition parameters ($\Sigma = LL'$), and then running the parameter estimation procedure one more time with the likelihood function parameterized with the terms of $\Sigma$.
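A compact sketch of the GHK recursion described above is given below (in Python; it uses plain uniform draws rather than the randomized Halton draws employed in the chapter, and the correlation matrix and bounds are placeholders).

import numpy as np
from scipy.stats import norm

def ghk_probability(lower, upper, L, draws, rng):
    # GHK estimate of P(lower < L v < upper) for v ~ N(0, I) (a sketch).
    # lower, upper: (I,) bounds of the form theta^{m-1} - b'x and theta^{m} - b'x;
    # L: (I, I) lower-triangular Cholesky factor of the correlation matrix.
    I = len(lower)
    prob = np.ones(draws)
    v = np.zeros((draws, I))
    for i in range(I):
        partial = v[:, :i] @ L[i, :i]                 # l_i1 v_1 + ... + l_i,i-1 v_{i-1}
        lo = norm.cdf((lower[i] - partial) / L[i, i])
        hi = norm.cdf((upper[i] - partial) / L[i, i])
        prob *= (hi - lo)                             # conditional interval probability
        u = rng.random(draws)                         # truncated-normal draw via inverse CDF
        v[:, i] = norm.ppf(lo + u * (hi - lo))
    return prob.mean()

# illustration with a 3-variate correlation matrix
Sigma = np.array([[1.0, 0.5, 0.3],
                  [0.5, 1.0, 0.4],
                  [0.3, 0.4, 1.0]])
L = np.linalg.cholesky(Sigma)
rng = np.random.default_rng(0)
print(ghk_probability(np.array([-1.0, -1.0, -1.0]),
                      np.array([0.5, 0.5, 0.5]), L, 10_000, rng))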
3.1.3.2. The GB Simulator for the PMOP Model. An alternative simulation-based approximation of multivariate normal probabilities is provided by the GB algorithm (Genz & Bretz, 1999). At the first step, this method transforms the original hyper-rectangle integral region to an integral over a unit hypercube, as described in Genz (1992). The transformed integral region is filled in by randomized lattice rules using a number of points depending on the integral dimension and the desired precision. Robust integration error bounds are then derived by means of additional shifts of the integration nodes in random directions (this is similar to the generation of randomized Halton sequences, as described in Bhat (2003), but with randomized lattice points rather than Halton points). The additional random shifts are employed to compute simulation errors using 10 sets of randomized lattice points for each individual. The interested reader is referred to Genz (2003) for details. More recently, Genz's algorithm has been further developed by Genz and Bretz (2002). Fortran and Matlab code implementing the GB algorithm is available for download from the home page of Alan Genz at http://www.math.wsu.edu/faculty/genz/homepage. Furthermore, the Fortran code has been included in an R (R Development Core Team, 2009) package called mvtnorm, freely available from the repository http://cran.r-project.org/. For a brief description of the package mvtnorm, see Hothorn, Bretz, and Genz (2001) and Mi, Miwa, and Hothorn (2009). Technically, the algorithm allows for the computation of integrals of up to 1,000 dimensions. However, the computational cost for reliable integral approximations explodes as the integral dimension rises, making the use of this algorithm impractical for likelihood inference except in low dimensions. In the PMOP model, a positive-definite correlation matrix $\Sigma_q$ should result as long as $\sigma > 0$ and $0 < \rho < 1$. The GB approach implemented in the R routine is based on a check to ensure these conditions hold. If they do not hold (i.e., the Berndt, Hall, Hall, and Hausman (BHHH) algorithm implemented in the R routine is trying to go outside the allowed parameter space), the algorithm reduces the "Newton–Raphson step" by half to return the search direction within the parameter space.
3.2. The Composite Marginal Likelihood Technique – The Pairwise Marginal Likelihood Inference Approach The CML estimation approach is a relatively simple approach that can be used when the full likelihood function is near impossible or plain infeasible to evaluate due to the underlying complex dependencies. For instance, in a recent application, Varin and Czado (2010) examined the headache pain intensity of patients over several consecutive days. In this study, a full information likelihood estimator would have entailed as many as 815 dimensions of integration to obtain individual-specific likelihood contributions, an infeasible proposition using the computer-intensive simulation techniques. As importantly, the accuracy of simulation techniques is known to degrade rapidly at medium-to-high dimensions, and the simulation noise increases substantially. This leads to convergence problems during estimation. In contrast, the CML method, which belongs to the more general class of composite likelihood function approaches (see Lindsay, 1988), is based on forming a surrogate likelihood function that compounds much easier-to-compute, lower-dimensional, marginal likelihoods. The CML approach can be applied using simple optimization software for likelihood estimation. It also represents a conceptually and pedagogically simpler simulation-free procedure relative to simulation techniques, and has the advantage of reproducibility of the results. Finally, as indicated by Varin and Vidoni (2009), it is possible that the ‘‘maximum CML estimator can be consistent when the ordinary full likelihood estimator is not’’. This is because the CML procedures are typically more robust and can represent the underlying low-dimensional process of interest more accurately than the low-dimensional process implied by an assumed (and imperfect) high-dimensional multivariate model. The simplest CML, formed by assuming independence across the latent variables underlying the ordinal outcome variables (in the context of our chapter), entails the product of univariate probabilities for each variable. However, this approach does not provide estimates of correlation that are of interest in a multivariate context. Another approach is the pairwise likelihood function formed by the product of likelihood contributions of all or a selected subset of couplets (i.e., pairs of variables or pairs of observations). Almost all earlier research efforts employing the CML technique have used the pairwise approach, including
Apanasovich et al. (2008), Bellio and Varin (2005), de Leon (2005), Varin and Vidoni (2009), Varin, Host, and Skare (2005), and Engle, Shephard, and Sheppard (2007). Alternatively, the analyst can also consider larger subsets of observations, such as triplets or quadruplets or even higherdimensional subsets (see Engler, Mohapatra, Louis, & Betensky, 2006; Caragea & Smith, 2007). In general, the issue of whether to use pairwise likelihoods or higher-dimensional likelihoods remains an open, and under-researched, area of research. However, it is generally agreed that the pairwise approach is a good balance between statistical and computation efficiency. The properties of the CML estimator may be derived using the theory of estimating equations (see Cox & Reid, 2004). Specifically, under usual regularity assumptions (Molenberghs & Verbeke, 2005, p. 191), the CML estimator is consistent and asymptotically normal distributed (this is because of the unbiasedness of the CML score function, which is a linear combination of proper score functions associated with the marginal event probabilities forming the composite likelihood).5 Of course, the maximum CML estimator loses some asymptotic efficiency from a theoretical perspective relative to a full likelihood estimator (Lindsay, 1988; Zhao & Joe, 2005). On the other hand, there is also a loss in asymptotic efficiency in the MSL estimator relative to a full likelihood estimator (see McFadden & Train, 2000). Given the full likelihood estimator has to be approximated using simulation techniques in a multivariate ordered-response system of dimensionality more than 3, it is of interest to compare the MSL and CML estimators in terms of asymptotic efficiency. Earlier applications of the CML approach (and specifically the pairwise likelihood approach) to multivariate ordered-response systems include de Leon (2005) and Ferdous, Eluru, Bhat, and Meloni (2010) in the context of CMOP systems, and Varin and Vidoni (2006) and Varin and Czado (2010) in the context of panel multivariate ordered-response probit (PMOP) systems. Bhat, Sener, and Eluru (2010) also use a CML approach to estimate their multivariate ordered-response probit system in the context of a spatially dependent ordered-response outcome variable. In this study, we do not use the high multivariate dimensionality of most of these earlier studies. Rather, we consider relatively lower multivariate dimensionality simulation situations, so that we are able to estimate the models using MSL techniques too.
3.2.1. Pairwise Likelihood Approach for the CMOP Model

The pairwise marginal likelihood function for individual q may be written for the CMOP model as follows:

$$
L_{CML,q}^{CMOP}(\delta) = \prod_{i=1}^{I-1}\prod_{g=i+1}^{I} \Pr(y_{qi}=m_{qi},\, y_{qg}=m_{qg})
= \prod_{i=1}^{I-1}\prod_{g=i+1}^{I}
\left[
\begin{aligned}
&\Phi_2\big(\theta_i^{m_{qi}}-\beta_i'x_{qi},\; \theta_g^{m_{qg}}-\beta_g'x_{qg},\; \rho_{ig}\big)\\
&-\Phi_2\big(\theta_i^{m_{qi}}-\beta_i'x_{qi},\; \theta_g^{m_{qg}-1}-\beta_g'x_{qg},\; \rho_{ig}\big)\\
&-\Phi_2\big(\theta_i^{m_{qi}-1}-\beta_i'x_{qi},\; \theta_g^{m_{qg}}-\beta_g'x_{qg},\; \rho_{ig}\big)\\
&+\Phi_2\big(\theta_i^{m_{qi}-1}-\beta_i'x_{qi},\; \theta_g^{m_{qg}-1}-\beta_g'x_{qg},\; \rho_{ig}\big)
\end{aligned}
\right] \qquad (11)
$$
where Φ₂(·, ·; ρ_ig) is the standard bivariate normal cumulative distribution function with correlation ρ_ig. The pairwise marginal likelihood function is L_CML^CMOP(δ) = ∏_q L_CML,q^CMOP(δ). The pairwise estimator δ̂_CML, obtained by maximizing the logarithm of the pairwise marginal likelihood function with respect to the vector δ, is consistent and asymptotically normally distributed with asymptotic mean δ and covariance matrix given by the inverse of Godambe's (1960) sandwich information matrix G(δ) (see Zhao & Joe, 2005):

$$
V_{CML}(\delta) = [G(\delta)]^{-1} = [H(\delta)]^{-1} J(\delta)\, [H(\delta)]^{-1}, \quad \text{where}
$$
$$
H(\delta) = E\!\left[-\frac{\partial^2 \log L_{CML}^{CMOP}(\delta)}{\partial\delta\,\partial\delta'}\right] \quad \text{and} \quad
J(\delta) = E\!\left[\frac{\partial \log L_{CML}^{CMOP}(\delta)}{\partial\delta}\,
\frac{\partial \log L_{CML}^{CMOP}(\delta)}{\partial\delta'}\right] \qquad (12)
$$
H(δ) and J(δ) can be estimated in a straightforward manner at the CML estimate δ̂_CML:

$$
\hat{H}(\hat{\delta}) = -\sum_{q=1}^{Q}\left[\frac{\partial^2 \log L_{CML,q}^{CMOP}(\delta)}{\partial\delta\,\partial\delta'}\right]_{\hat{\delta}}
= -\sum_{q=1}^{Q}\sum_{i=1}^{I-1}\sum_{g=i+1}^{I}\left[\frac{\partial^2 \log \Pr(y_{qi}=m_{qi},\, y_{qg}=m_{qg})}{\partial\delta\,\partial\delta'}\right]_{\hat{\delta}}, \;\text{and}
$$
$$
\hat{J}(\hat{\delta}) = \sum_{q=1}^{Q}\left[\left(\frac{\partial \log L_{CML,q}^{CMOP}(\delta)}{\partial\delta}\right)
\left(\frac{\partial \log L_{CML,q}^{CMOP}(\delta)}{\partial\delta'}\right)\right]_{\hat{\delta}} \qquad (13)
$$
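To make the building blocks of Eq. (11) concrete, the sketch below is our own illustration (not the authors' code) and assumes the R package mvtnorm. It evaluates one pairwise term Pr(y_qi = m_qi, y_qg = m_qg) as a rectangle probability of the standard bivariate normal, which is numerically identical to the four-term Φ₂ expression in Eq. (11). Categories are coded 1, …, M here, and the function and argument names are hypothetical.

```r
# Sketch (assumed helper): probability that two ordinal outcomes fall in categories
# m.i and m.g, given the systematic parts, the thresholds, and the error correlation.
library(mvtnorm)

pair.prob <- function(m.i, m.g, xb.i, xb.g, theta.i, theta.g, rho) {
  # theta.i, theta.g are the interior thresholds; augment with -Inf and +Inf
  cut.i <- c(-Inf, theta.i, Inf)
  cut.g <- c(-Inf, theta.g, Inf)
  lower <- c(cut.i[m.i]     - xb.i, cut.g[m.g]     - xb.g)
  upper <- c(cut.i[m.i + 1] - xb.i, cut.g[m.g + 1] - xb.g)
  R2 <- matrix(c(1, rho, rho, 1), 2, 2)
  pmvnorm(lower = lower, upper = upper, corr = R2)[1]
}

# Example call with illustrative values: category 2 for variable i, category 1 for g
pair.prob(m.i = 2, m.g = 1, xb.i = 0.3, xb.g = -0.1,
          theta.i = c(-1, 0, 1), theta.g = c(0, 1.5), rho = 0.25)
# Equivalent to F2(up.i, up.g) - F2(up.i, lo.g) - F2(lo.i, up.g) + F2(lo.i, lo.g)
```

Summing the logarithms of such pairwise probabilities over all pairs and all individuals yields the pairwise log-likelihood that is maximized in the CML approach.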
In general, and as confirmed later in the simulation study, we expect that the ability to recover and pin down the parameters will be a little more difficult for the correlation parameters in R (when the correlations are low) than for the slope and threshold parameters, because the correlation parameters enter more nonlinearly in the likelihood function.

3.2.2. Pairwise Likelihood Approach for the PMOP Model

The pairwise marginal likelihood function for individual q may be written for the PMOP model as follows:

$$
L_{CML,q}^{PMOP}(\delta) = \prod_{j=1}^{J-1}\prod_{g=j+1}^{J} \Pr(y_{qj}=m_{qj},\, y_{qg}=m_{qg})
= \prod_{j=1}^{J-1}\prod_{g=j+1}^{J}
\left[
\begin{aligned}
&\Phi_2\big(a^{m_{qj}},\, a^{m_{qg}},\, \rho_{jg}\big)
-\Phi_2\big(a^{m_{qj}},\, a^{m_{qg}-1},\, \rho_{jg}\big)\\
&-\Phi_2\big(a^{m_{qj}-1},\, a^{m_{qg}},\, \rho_{jg}\big)
+\Phi_2\big(a^{m_{qj}-1},\, a^{m_{qg}-1},\, \rho_{jg}\big)
\end{aligned}
\right] \qquad (14)
$$
where a^{m_qj} = (θ^{m_qj} − β′x_qj)/m, m = √(1 + σ²), and ρ_jg = (σ² + ρ^{|t_qj − t_qg|})/m². The pairwise marginal likelihood function is L_CML^PMOP(δ) = ∏_q L_CML,q^PMOP(δ). The pairwise estimator δ̂_CML, obtained by maximizing the logarithm of the pairwise marginal likelihood function with respect to the vector δ, is consistent and asymptotically normally distributed with asymptotic mean δ. The covariance matrix of the estimator may be computed in a fashion similar to that for the CMOP case, with L_CML,q^CMOP(δ) being replaced by L_CML,q^PMOP(δ). As in the CMOP case, we expect that the ability to recover and pin down the parameters will be a little more difficult for the correlation parameter ρ (when ρ is low) than for the slope and threshold parameters.

3.2.3. Positive-Definiteness of the Implied Multivariate Correlation Matrix

A point that we have not discussed thus far in the CML approach is how to ensure the positive-definiteness of the symmetric correlation matrix R (in the CMOP model) and R_q (in the PMOP model). This is particularly an issue for R in the CMOP model, so we discuss it mainly in the context of the CMOP model. Maintaining a positive-definite matrix for R_q in the PMOP model is relatively easy, so we only briefly discuss the PMOP case toward the end of this section. There are three ways that one can ensure the positive-definiteness of the R matrix. The first technique is to use Bhat and Srinivasan's (2005) technique of reparameterizing R through the Cholesky matrix, and then using these
Cholesky-decomposed parameters as the ones to be estimated. Within the optimization procedure, one would then reconstruct the R matrix, and then "pick off" the appropriate elements of this matrix for the ρ_ig estimates at each iteration. This is probably the most straightforward and clean technique. The second technique is to undertake the estimation with a constrained optimization routine, requiring that the implied multivariate correlation matrix for any set of pairwise correlation estimates be positive-definite. However, such a constrained routine can be extremely cumbersome. The third technique is to use an unconstrained optimization routine, but check for positive-definiteness of the implied multivariate correlation matrix. The easiest method within this third technique is to allow the estimation to proceed without checking for positive-definiteness at intermediate iterations, but to check that the implied multivariate correlation matrix at the final converged pairwise marginal likelihood estimates is positive-definite. This will typically work for the case of a multivariate ordered-response model if one specifies exclusion restrictions (i.e., zero correlations between some error terms) or correlation patterns that involve a lower dimension of effective parameters (such as in the PMOP model in this chapter). Also, the number of correlation parameters in the full multivariate matrix explodes quickly as the dimensionality of the matrix increases, and estimating all these parameters becomes almost impossible (with any estimation technique) with the usual sample sizes available in practice. So, imposing exclusion restrictions is good econometric practice. However, if the above simple method of allowing the pairwise marginal estimation approach to proceed without checking for positive-definiteness at intermediate iterations does not work, one can check the implied multivariate correlation matrix for positive-definiteness at each and every iteration. If the matrix is not positive-definite during a direction search at a given iteration, one can construct a "nearest" valid correlation matrix (see Ferdous et al., 2010 for a discussion). In the CMOP CML analysis of this chapter, we used an unconstrained optimization routine and ensured that the implied multivariate correlation matrix at convergence was positive-definite. In the PMOP CML analysis of this chapter, we again employed an unconstrained optimization routine in combination with the following reparameterizations: ρ = 1/[1 + exp(c)] and σ = exp(p). These reparameterizations guarantee σ > 0 and 0 < ρ < 1, and therefore the positive-definiteness of the R_q multivariate correlation matrix. Once estimated, the c and p estimates were translated back to estimates of ρ and σ.
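As an illustration of the devices just described, the following R sketch is ours and not the authors' code. It shows (a) a Cholesky-based reparameterization that yields a valid correlation matrix by construction, (b) a post-convergence check that falls back on a "nearest" valid correlation matrix (here via Matrix::nearPD, one possible implementation of the idea discussed in Ferdous et al., 2010), and (c) the logistic/exponential reparameterization used for the PMOP model. All function names are hypothetical.

```r
# Sketch (our own illustration) of positive-definiteness devices.
library(Matrix)

# (a) Cholesky reparameterization: map an unconstrained vector of lower-triangular
# entries into a correlation matrix that is positive-definite by construction.
chol.to.corr <- function(l, K) {
  L <- diag(K)
  L[lower.tri(L)] <- l                  # unconstrained parameters to be estimated
  S <- L %*% t(L)                       # positive-definite since L is nonsingular
  D <- diag(1 / sqrt(diag(S)))
  D %*% S %*% D                         # rescale to unit diagonal
}

# (b) Check an implied correlation matrix at convergence; if it fails the check,
# replace it by a "nearest" valid correlation matrix.
fix.if.needed <- function(R) {
  if (all(eigen(R, symmetric = TRUE, only.values = TRUE)$values > 0)) return(R)
  as.matrix(nearPD(R, corr = TRUE)$mat)
}

# (c) PMOP-style reparameterizations guaranteeing 0 < rho < 1 and sigma > 0
rho.of   <- function(c) 1 / (1 + exp(c))
sigma.of <- function(p) exp(p)
```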
4. EXPERIMENTAL DESIGN

4.1. The CMOP Model

To compare and evaluate the performance of the GHK and the CML estimation techniques, we undertake a simulation exercise for a multivariate ordered-response system with five ordinal variables. Further, to examine the potential impact of different correlation structures, we undertake the simulation exercise for a correlation structure with low correlations and another with high correlations. For each correlation structure, the experiment is carried out for 20 independent data sets with 1,000 data points. Prespecified values for the δ vector are used to generate samples in each data set. In the set-up, we use three exogenous variables in the latent equation for the first, third, and fifth ordered-response variables, and four exogenous variables for the second and fourth ordered-response variables. The values for each of the exogenous variables are drawn from a standard univariate normal distribution. A fixed coefficient vector β_i (i = 1, 2, 3, 4, 5) is assumed on the variables, and the linear combination β_i′x_qi (q = 1, 2, …, Q; Q = 1,000; i = 1, 2, 3, 4, 5) is computed for each individual q and category i. Next, we generate Q five-variate realizations of the error term vector (ε_q1, ε_q2, ε_q3, ε_q4, ε_q5) with a predefined positive-definite low error correlation structure (R_low) and high error correlation structure (R_high) as follows:

$$
R_{low} = \begin{bmatrix}
1 & .30 & .20 & .22 & .15\\
.30 & 1 & .25 & .30 & .12\\
.20 & .25 & 1 & .27 & .20\\
.22 & .30 & .27 & 1 & .25\\
.15 & .12 & .20 & .25 & 1
\end{bmatrix}
\quad\text{and}\quad
R_{high} = \begin{bmatrix}
1 & .90 & .80 & .82 & .75\\
.90 & 1 & .85 & .90 & .72\\
.80 & .85 & 1 & .87 & .80\\
.82 & .90 & .87 & 1 & .85\\
.75 & .72 & .80 & .85 & 1
\end{bmatrix} \qquad (15)
$$

The error term realization for each observation and each ordinal variable is then added to the systematic component (β_i′x_qi) as in Eq. (1) and then translated to "observed" values of y_qi (0, 1, 2, …) based on prespecified threshold values. We assume four outcome levels for the first and the fifth ordered-response variables, three for the second and the fourth ordered-response variables, and five for the third ordered-response variable. Correspondingly, we prespecify a vector of three threshold values [θ_i = (θ_i^1, θ_i^2, θ_i^3), where i = 1 and 5] for the first and the
fifth ordered-response equations, two for the second and the fourth equations [θ_i = (θ_i^1, θ_i^2), where i = 2 and 4], and four for the third ordered-response equation [θ_i = (θ_i^1, θ_i^2, θ_i^3, θ_i^4), where i = 3]. As mentioned earlier, the above data generation process is undertaken 20 times with different realizations of the random error term to generate 20 different data sets. The CML estimation procedure is applied to each data set to estimate data-specific values of the δ vector. The GHK simulator is applied to each data set using 100 draws per individual of the randomized Halton sequence.6 In addition, to assess and quantify simulation variance, the GHK simulator is applied to each data set 10 times with different (independent) randomized Halton draw sequences. This allows us to estimate simulation error by computing the standard deviation of estimated parameters among the 10 different GHK estimates on the same data set. A few notes are in order here. We chose to use a setting with five ordinal variables so as to keep the computation time manageable for the MSL estimations (e.g., 10 ordinal variables would increase computation time substantially, especially since more draws per individual may have to be used; note also that we have a total of 400 MSL estimation runs just for the five ordinal variable case in our experimental design). At the same time, a system of five ordinal variables leads to a large enough dimensionality of integration in the likelihood function that simulation estimation has to be used. Of course, one can examine the effect of varying the number of ordinal variables on the performance of the MSL and CML estimation approaches. In this chapter, we have chosen to focus on five dimensions, and to examine the effects of varying correlation patterns and different model formulations (corresponding to cross-sectional and panel settings). A comparison with higher numbers of ordinal variables is left as a future exercise. However, in general, it is well known that MSL estimation gets more imprecise as the dimensionality of integration increases. On the other hand, our experience with CML estimation is that the performance does not degrade very much as the number of ordinal variables increases (see Ferdous et al., 2010). Similarly, one can examine the effect of varying numbers of draws for MSL estimation. Our choice of 100 draws per individual was based on experimentation with different numbers of draws for the first data set. We found little improvement in the ability to recover parameters or in simulation variance beyond 100 draws per individual for this data set, and thus settled on 100 draws per individual for all data sets (as will be noted in Section 5, the CMOP MSL estimation with 100 draws per individual indeed leads to negligible simulation variance). Finally, we chose to use three to four exogenous variables in our experimental design
(rather than use a single exogenous variable) so that the resulting simulation data sets would be closer to realistic ones where multiple exogenous variables are employed.
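For concreteness, a compact sketch of this data-generating process is given below. It is our own R code, and the numeric slope and threshold values are illustrative placeholders (the design's true values appear in Tables 1a and 1b); only the variable counts and category counts per equation are reproduced.

```r
# Sketch of the CMOP data-generating process: draw covariates, correlated normal
# errors, and cut the latent variables at fixed thresholds to get ordinal outcomes.
library(mvtnorm)

set.seed(123)
Q <- 1000                                   # individuals per data set
R.low <- matrix(c(1,.30,.20,.22,.15,
                  .30,1,.25,.30,.12,
                  .20,.25,1,.27,.20,
                  .22,.30,.27,1,.25,
                  .15,.12,.20,.25,1), 5, 5)

# Illustrative slopes (3 or 4 per equation) and thresholds (3, 2, 4, 2, 3 per equation)
beta  <- list(c(.5, 1, .25), c(.75, 1, .5, .25), c(.25, .5, .75),
              c(.75, .25, 1, .3), c(.4, 1, .6))
theta <- list(c(-1, 0, 1), c(0, 1.5), c(-1, -.5, .5, 1.5), c(0, 2), c(-1, 0, 1))

eps <- rmvnorm(Q, sigma = R.low)            # correlated error terms across equations
y <- matrix(NA, Q, 5)
for (i in 1:5) {
  x <- matrix(rnorm(Q * length(beta[[i]])), Q)     # standard normal covariates
  ystar <- x %*% beta[[i]] + eps[, i]              # latent propensity
  y[, i] <- findInterval(ystar, theta[[i]])        # observed ordinal outcome 0, 1, 2, ...
}
table(y[, 1])
```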
4.2. The PMOP Model

For the panel case, we consider six observations (J = 6) per individual, leading to a six-dimensional integral per individual for the full likelihood function. Note that the correlation matrix R_q has entries such that corr(y*_qj, y*_qg) = (σ² + ρ^{|t_qj − t_qg|})/m², where m = √(1 + σ²). Thus, in the PMOP case, R_q is completely determined by the variance σ² of the individual-specific time-invariant random term u_q and the single autoregressive correlation parameter ρ determining the correlation between the ε_qj and ε_qk terms: corr(ε_qj, ε_qk) = ρ^{|t_qj − t_qk|}. To examine the impact of different magnitudes of the autoregressive correlation parameter, we undertake the simulation exercise for two different values of ρ: 0.3 and 0.7. For each correlation parameter, the experiment is carried out for 100 independent data sets with 200 data points (i.e., individuals).7 Prespecified values for the δ vector are used to generate samples in each data set. In the set-up, we use two exogenous variables in the latent equation. One is a binary time-constant variable (x_q1) simulated from a Bernoulli variable with probability equal to 0.7, and the other (x_qj2) is a continuous time-varying variable generated from the autoregressive model shown below:
$$
(x_{qj2} - 1) = 0.6\,(x_{q,j-1,2} - 1) + g_{qj}, \qquad g_{qj} \overset{iid}{\sim} N(0,\, 0.2^2) \qquad (16)
$$
A fixed coefficient vector β is assumed, with β₁ = 1 (the coefficient on x_q1) and β₂ = 1 (the coefficient on x_qj2). The linear combination β′x_qj (x_qj = (x_q1, x_qj2)′; q = 1, 2, …, 200) is computed for each individual q's jth observation. Next, we generate independent time-invariant values of u_q for each individual from a standard normal distribution (i.e., we assume σ² = 1), and latent serially correlated errors for each individual q as follows:

$$
\varepsilon_{qj} =
\begin{cases}
\eta_{q1}, & \eta_{q1} \overset{iid}{\sim} N(0,1), \quad \text{for } j = 1\\[4pt]
\rho\,\varepsilon_{q,j-1} + \sqrt{1-\rho^2}\,\eta_{qj}, & \eta_{qj} \overset{iid}{\sim} N(0,1), \quad \text{for } j \ge 2
\end{cases} \qquad (17)
$$

The error term realizations for each individual's observations are then added to the systematic component (β′x_qj) as in Eq. (4) and then translated to "observed" values of y_qj based on the following prespecified threshold
values: θ¹ = 1.5, θ² = 2.5, and θ³ = 3.0. The above data generation process is undertaken 100 times with different realizations of the random error terms u_q and ε_qj to generate 100 different data sets. The CML estimation procedure is applied to each data set to estimate data-specific values of the δ vector. The GB simulator is applied to each data set 10 times with different (independent) random draw sequences. This allows us to estimate simulation error by computing the standard deviation of estimated parameters among the 10 different GB estimates on the same data set. The algorithm is tuned to an absolute error tolerance of 0.001 for each six-dimensional integral forming the likelihood. The algorithm is adaptive in that it starts with a few points and then increases the number of points per individual until the desired precision is obtained, subject to the constraint that the maximal number of draws is 25,000.
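A minimal sketch of this set-up follows; it is our own R code, not the authors' implementation. It generates the composite error of Eq. (17) plus the random term u_q for one individual, builds the implied correlation matrix R_q, and evaluates one six-dimensional rectangle probability with mvtnorm's Genz–Bretz algorithm at the 0.001 absolute error tolerance used here (mvtnorm is one implementation of the GB approach; the integration bounds shown are illustrative).

```r
# Sketch: AR(1) errors as in Eq. (17), the time-invariant term u_q, the implied
# correlation matrix R_q, and a Genz-Bretz evaluation of a 6-dimensional probability.
library(mvtnorm)

set.seed(42)
J <- 6; rho <- 0.3; sigma2 <- 1

eps <- numeric(J)
eps[1] <- rnorm(1)                               # eta_q1 ~ N(0, 1)
for (j in 2:J) eps[j] <- rho * eps[j - 1] + sqrt(1 - rho^2) * rnorm(1)
u <- rnorm(1, sd = sqrt(sigma2))                 # individual-specific term u_q
err <- u + eps                                   # composite error entering Eq. (4)

# Implied J x J correlation matrix of the composite error, with m^2 = 1 + sigma2
m2 <- 1 + sigma2
Rq <- (sigma2 + rho^abs(outer(1:J, 1:J, "-"))) / m2

# One individual's likelihood contribution is a J-dimensional rectangle probability;
# the lower/upper bounds below are purely illustrative.
lower <- rep(-1, J); upper <- rep(0.5, J)
pmvnorm(lower, upper, corr = Rq,
        algorithm = GenzBretz(abseps = 0.001, maxpts = 25000))
```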
5. PERFORMANCE COMPARISON BETWEEN THE MSL AND CML APPROACHES

In this section, we first identify a number of performance measures and discuss how these are computed for the MSL approach (GHK for CMOP and GB for PMOP) and the CML approach. The subsequent sections present the simulation and computational results.
5.1. Performance Measures

The steps discussed below for computing performance measures are for a specific correlation matrix pattern. For the CMOP model, we consider two correlation matrix patterns, one with low correlations and another with high correlations. For the PMOP model, we consider two correlation patterns, corresponding to the autoregressive correlation parameter values of 0.3 and 0.7.

5.1.1. MSL Approach

(1) Estimate the MSL parameters for each data set s (s = 1, 2, …, 20 for CMOP and s = 1, 2, …, 100 for PMOP; i.e., S = 20 for CMOP and S = 100 for PMOP) and for each of 10 independent draws, and obtain the time to get the convergent values and the standard errors (SE).
Note combinations for which convergence is not achieved. Everything below refers to cases when convergence is achieved. Obtain the mean time for convergence (TMSL) and the standard deviation of convergence time across the converged runs and across all data sets (the time to convergence includes the time to compute the covariance matrix of parameters and the corresponding parameter standard errors).

(2) For each data set s and draw combination, estimate the standard errors of parameters (using the sandwich estimator).

(3) For each data set s, compute the mean estimate for each model parameter across the draws. Label this as MED, and then take the mean of the MED values across the data sets to obtain a mean estimate. Compute the absolute percentage bias (APB) as:

$$
APB = 100 \times \left|\frac{\text{mean estimate} - \text{true value}}{\text{true value}}\right|
$$

(4) Compute the standard deviation of the MED values across the data sets and label this as the finite sample standard error (essentially, this is the empirical standard error).

(5) For each data set s, compute the median standard error for each model parameter across the draws. Call this MSED, and then take the mean of the MSED values across the S data sets and label this as the asymptotic standard error (essentially, this is the standard error of the distribution of the estimator as the sample size gets large). Note that we compute the median standard error for each model parameter across the draws and label it as MSED, rather than computing the mean standard error for each model parameter across the draws. This is because, for some draws, the estimated standard errors turned out to be rather large relative to other independent standard error estimates for the same data set. On closer inspection, this could be traced to the unreliability of the numeric Hessian used in the sandwich estimator computation. This is another bothersome issue with MSL – it is important to compute the covariance matrix using the sandwich estimator rather than the inverse of the cross-product of the first derivatives (due to the simulation noise introduced when using a finite number of draws per individual in the MSL procedure; see McFadden & Train, 2000). Specifically, using the inverse of the cross-product of the first derivatives can substantially underestimate the covariance matrix. But coding the analytic Hessian (as part of computing the sandwich estimator) is extremely difficult,
while using the numeric Hessian is very unreliable. Craig (2008) also alludes to this problem when he states that "(…) the randomness that is inherent in such methods [referring here to the GB algorithm, but applicable in general to MSL methods] is sometimes more than a minor nuisance." In particular, even when the log-likelihood function is computed with good precision so that the simulation error in estimated parameters is very small, this is not always adequate to reliably compute the numerical Hessian. To do so, one will generally need to compute the log-likelihood with a substantial level of precision, which, however, would imply very high computational times even in low dimensionality situations. Finally, note that the mean asymptotic standard error is a theoretical approximation to the finite sample standard error, since, in practice, one would estimate a model on only one data set from the field.

(6) Next, for each data set s, compute the simulation standard deviation for each parameter as the standard deviation in the estimated values across the independent draws (about the MED value). Call this standard deviation SIMMED. For each parameter, take the mean of SIMMED across the different data sets. Label this as the simulation standard error for each parameter.

(7) For each parameter, compute a simulation-adjusted standard error as:

$$
\text{simulation-adjusted standard error} = \sqrt{(\text{asymptotic standard error})^2 + (\text{simulation standard error})^2}
$$
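The bookkeeping in steps (3)–(7) reduces to a few array operations. The sketch below is our own illustration; it assumes est and se are S × 10 × P arrays holding, for each data set, draw sequence, and parameter, the MSL estimates and their estimated standard errors (array names and the function name are ours).

```r
# Sketch of the MSL performance measures: est and se are S x R x P arrays of
# parameter estimates and estimated standard errors (data set x draw sequence x parameter).
performance.msl <- function(est, se, true.value) {
  MED  <- apply(est, c(1, 3), mean)           # mean estimate per data set and parameter
  MSED <- apply(se,  c(1, 3), median)         # median standard error per data set
  mean.estimate    <- colMeans(MED)
  APB              <- 100 * abs((mean.estimate - true.value) / true.value)
  finite.sample.se <- apply(MED, 2, sd)       # empirical (finite sample) standard error
  asymptotic.se    <- colMeans(MSED)          # mean asymptotic standard error (MASE)
  SIMMED           <- apply(est, c(1, 3), sd) # simulation std. dev. across draws
  simulation.se    <- colMeans(SIMMED)
  sim.adjusted.se  <- sqrt(asymptotic.se^2 + simulation.se^2)
  data.frame(mean.estimate, APB, finite.sample.se,
             asymptotic.se, simulation.se, sim.adjusted.se)
}
```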
5.1.2. CML Approach

(1) Estimate the CML parameters for each data set s and obtain the time to get the convergent values (including the time to obtain the Godambe-matrix-based covariance matrix and corresponding standard errors). Determine the mean time for convergence (TCML) across the S data sets.8

(2) For each data set s, estimate the standard errors (using the Godambe estimator).

(3) Compute the mean estimate for each model parameter across the S data sets. Compute the APB as in the MSL case.

(4) Compute the standard deviation of the CML parameter estimates across the data sets and label this as the finite sample standard error (essentially, this is the empirical standard error).
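For completeness, the Godambe covariance of Eqs. (12) and (13) can be assembled numerically at the converged CML estimate. The sketch below is our own illustration; it assumes a user-supplied function cml.loglik.q() returning individual q's pairwise log-likelihood contribution, and uses numerical derivatives from the numDeriv package (the authors' implementation may differ, for example by using analytic derivatives).

```r
# Sketch of the Godambe (sandwich) covariance V = H^{-1} J H^{-1} at delta.hat,
# built from per-individual numerical gradients and Hessians (numDeriv package).
library(numDeriv)

godambe.cov <- function(delta.hat, cml.loglik.q, Q) {
  P <- length(delta.hat)
  H <- matrix(0, P, P)      # estimate of E[-d2 log L / d delta d delta']
  J <- matrix(0, P, P)      # estimate of E[score %*% t(score)]
  for (q in 1:Q) {
    g <- grad(cml.loglik.q, delta.hat, q = q)
    H <- H - hessian(cml.loglik.q, delta.hat, q = q)
    J <- J + tcrossprod(g)
  }
  Hinv <- solve(H)
  Hinv %*% J %*% Hinv       # asymptotic covariance of the CML estimator
}
# Standard errors: sqrt(diag(godambe.cov(delta.hat, cml.loglik.q, Q)))
```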
5.2. Simulation Results 5.2.1. THE CMOP Model Table 1a presents the results for the CMOP model with low correlations, and Table 1b presents the corresponding results for the CMOP model with high correlations. The results indicate that both the MSL and CML approaches recover the parameters extremely well, as can be observed by comparing the mean estimate of the parameters with the true values (see the column titled ‘‘parameter estimates’’). In the low correlation case, the APB ranges from 0.03 to 15.95% (overall mean value of 2.21% – see last row of table under the column titled ‘‘absolute percentage bias’’) across parameters for the MSL approach, and from 0.00 to 12.34% (overall mean value of 1.92%) across parameters for the CML approach. In the high correlation case, the APB ranges from 0.02 to 5.72% (overall mean value of 1.22% – see last row of table under the column titled ‘‘absolute percentage bias’’) across parameters for the MSL approach, and from 0.00 to 6.34% (overall mean value of 1.28%) across parameters for the CML approach. These are incredibly good measures for the ability to recover parameter estimates, and indicate that both the MSL and CML perform about evenly in the context of bias. Further, the ability to recover parameters does not seem to be affected at all by whether there is low correlation or high correlation (in fact, the overall APB reduces from the low correlation case to the high correlation case). Interestingly, the APB values are generally much higher for the correlation (r) parameters than for the slope (b) and threshold (y) parameters in the low correlation case, but the situation is exactly reversed in the high correlation case where the APB values are generally higher for the slope (b) and threshold (y) parameters compared to the correlation (r) parameters (for both the MSL and CML approaches). This is perhaps because the correlation parameters enter more nonlinearly in the likelihood function than the slope and threshold parameters, and need to be particularly strong before they start having any substantial effects on the log-likelihood function value. Essentially, the log-likelihood function tends to be relatively flat at low correlations, leading to more difficulty in accurately recovering the low correlation parameters. But, at high correlations, the log-likelihood function shifts considerably in value with small shifts in the correlation values, allowing them to be recovered accurately.9 The standard error measures provide several important insights. First, the finite sample standard error and asymptotic standard error values are quite close to one another, with very little difference in the overall mean values of these two columns (see last row). This holds for both the MSL and CML
Table 1a. Evaluation of Ability to Recover "True" Parameters by the MSL and CML Approaches – With Low Error Correlation Structure.

Note: For each parameter group, values are listed in the parameter order indicated. MASE = mean asymptotic standard error; SASE = simulation-adjusted standard error.

Slope coefficients (order: b11, b21, b31, b12, b22, b32, b42, b13, b23, b33, b14, b24, b34, b44, b15, b25, b35):
True values: 0.5000 1.0000 0.2500 0.7500 1.0000 0.5000 0.2500 0.2500 0.5000 0.7500 0.7500 0.2500 1.0000 0.3000 0.4000 1.0000 0.6000
MSL mean estimates: 0.5167 1.0077 0.2501 0.7461 0.9984 0.4884 0.2605 0.2445 0.4967 0.7526 0.7593 0.2536 0.9976 0.2898 0.3946 0.9911 0.5987
MSL absolute percentage bias: 3.34% 0.77% 0.06% 0.52% 0.16% 2.31% 4.19% 2.21% 0.66% 0.34% 1.24% 1.46% 0.24% 3.39% 1.34% 0.89% 0.22%
MSL finite sample standard errors: 0.0481 0.0474 0.0445 0.0641 0.0477 0.0413 0.0372 0.0401 0.0420 0.0348 0.0530 0.0420 0.0832 0.0481 0.0333 0.0434 0.0322
MSL asymptotic standard errors (MASE): 0.0399 0.0492 0.0416 0.0501 0.0550 0.0433 0.0432 0.0346 0.0357 0.0386 0.0583 0.0486 0.0652 0.0508 0.0382 0.0475 0.0402
MSL simulation standard errors: 0.0014 0.0005 0.0010 0.0037 0.0015 0.0017 0.0006 0.0008 0.0021 0.0005 0.0008 0.0024 0.0017 0.0022 0.0014 0.0016 0.0007
MSL simulation-adjusted standard errors (SASE): 0.0399 0.0492 0.0416 0.0503 0.0550 0.0434 0.0432 0.0346 0.0358 0.0386 0.0583 0.0487 0.0652 0.0508 0.0382 0.0475 0.0402
CML mean estimates: 0.5021 1.0108 0.2568 0.7698 0.9990 0.5060 0.2582 0.2510 0.5063 0.7454 0.7562 0.2472 1.0131 0.3144 0.4097 0.9902 0.5898
CML absolute percentage bias: 0.43% 1.08% 2.73% 2.65% 0.10% 1.19% 3.30% 0.40% 1.25% 0.62% 0.83% 1.11% 1.31% 4.82% 2.42% 0.98% 1.69%
CML finite sample standard errors: 0.0448 0.0484 0.0252 0.0484 0.0503 0.0326 0.0363 0.0305 0.0337 0.0441 0.0600 0.0491 0.0643 0.0551 0.0300 0.0441 0.0407
CML asymptotic standard errors (MASE): 0.0395 0.0482 0.0380 0.0487 0.0544 0.0455 0.0426 0.0342 0.0364 0.0389 0.0573 0.0483 0.0633 0.0498 0.0380 0.0458 0.0404
Relative efficiency MASE_MSL/MASE_CML: 1.0109 1.0221 1.0957 1.0283 1.0100 0.9518 1.0149 1.0101 0.9815 0.9929 1.0183 1.0067 1.0298 1.0199 1.0055 1.0352 0.9959
Relative efficiency SASE_MSL/MASE_CML: 1.0116 1.0222 1.0961 1.0311 1.0104 0.9526 1.0150 1.0104 0.9833 0.9930 1.0184 1.0079 1.0301 1.0208 1.0061 1.0358 0.9961

Correlation coefficients (order: r12, r13, r14, r15, r23, r24, r25, r34, r35, r45):
True values: 0.3000 0.2000 0.2200 0.1500 0.2500 0.3000 0.1200 0.2700 0.2000 0.2500
MSL mean estimates: 0.2857 0.2013 0.1919 0.1739 0.2414 0.2960 0.1117 0.2737 0.2052 0.2419
MSL absolute percentage bias: 4.76% 0.66% 12.76% 15.95% 3.46% 1.34% 6.94% 1.37% 2.62% 3.25%
MSL finite sample standard errors: 0.0496 0.0477 0.0535 0.0388 0.0546 0.0619 0.0676 0.0488 0.0434 0.0465
MSL asymptotic standard errors (MASE): 0.0476 0.0409 0.0597 0.0439 0.0443 0.0631 0.0489 0.0515 0.0378 0.0533
MSL simulation standard errors: 0.0020 0.0019 0.0035 0.0040 0.0040 0.0047 0.0044 0.0029 0.0022 0.0075
MSL simulation-adjusted standard errors (SASE): 0.0476 0.0410 0.0598 0.0441 0.0445 0.0633 0.0491 0.0516 0.0378 0.0538
CML mean estimates: 0.2977 0.2091 0.2313 0.1439 0.2523 0.3013 0.1348 0.2584 0.1936 0.2570
CML absolute percentage bias: 0.77% 4.56% 5.13% 4.05% 0.92% 0.45% 12.34% 4.28% 3.22% 2.78%
CML finite sample standard errors: 0.0591 0.0318 0.0636 0.0419 0.0408 0.0736 0.0581 0.0580 0.0438 0.0455
CML asymptotic standard errors (MASE): 0.0467 0.0401 0.0560 0.0431 0.0439 0.0610 0.0481 0.0510 0.0391 0.0536
Relative efficiency MASE_MSL/MASE_CML: 1.0174 1.0220 1.0664 1.0198 1.0092 1.0342 1.0154 1.0094 0.9662 0.9937
Relative efficiency SASE_MSL/MASE_CML: 1.0184 1.0231 1.0682 1.0239 1.0133 1.0372 1.0194 1.0110 0.9678 1.0034

Threshold parameters (order: y11, y21, y31, y12, y22, y13, y23, y33, y43, y14, y24, y15, y25, y35):
True values: 1.0000 1.0000 3.0000 0.0000 2.0000 2.0000 0.5000 1.0000 2.5000 1.0000 3.0000 1.5000 0.5000 2.0000
MSL mean estimates: 1.0172 0.9985 2.9992 0.0172 1.9935 2.0193 0.5173 0.9956 2.4871 0.9908 3.0135 1.5084 0.4925 2.0201
MSL absolute percentage bias: 1.72% 0.15% 0.03% – 0.32% 0.97% 3.47% 0.44% 0.52% 0.92% 0.45% 0.56% 1.50% 1.01%
MSL finite sample standard errors: 0.0587 0.0661 0.0948 0.0358 0.0806 0.0848 0.0464 0.0460 0.0883 0.0611 0.1625 0.0596 0.0504 0.0899
MSL asymptotic standard errors (MASE): 0.0555 0.0554 0.1285 0.0481 0.0831 0.0781 0.0462 0.0516 0.0981 0.0615 0.1395 0.0651 0.0491 0.0797
MSL simulation standard errors: 0.0007 0.0011 0.0034 0.0007 0.0030 0.0019 0.0005 0.0011 0.0040 0.0031 0.0039 0.0032 0.0017 0.0017
MSL simulation-adjusted standard errors (SASE): 0.0555 0.0554 0.1285 0.0481 0.0831 0.0781 0.0462 0.0516 0.0982 0.0616 0.1396 0.0652 0.0492 0.0798
CML mean estimates: 1.0289 1.0010 2.9685 0.0015 2.0150 2.0238 0.4968 1.0014 2.5111 1.0105 2.9999 1.4805 0.5072 2.0049
CML absolute percentage bias: 2.89% 0.10% 1.05% – 0.75% 1.19% 0.64% 0.14% 0.44% 1.05% 0.00% 1.30% 1.44% 0.24%
CML finite sample standard errors: 0.0741 0.0536 0.1439 0.0475 0.0904 0.0892 0.0519 0.0584 0.0735 0.0623 0.1134 0.0821 0.0380 0.0722
CML asymptotic standard errors (MASE): 0.0561 0.0551 0.1250 0.0493 0.0850 0.0787 0.0465 0.0523 0.1002 0.0625 0.1347 0.0656 0.0497 0.0786
Relative efficiency MASE_MSL/MASE_CML: 0.9892 1.0063 1.0279 0.9750 0.9778 0.9920 0.9928 0.9877 0.9788 0.9838 1.0356 0.9925 0.9897 1.0151
Relative efficiency SASE_MSL/MASE_CML: 0.9893 1.0065 1.0282 0.9751 0.9784 0.9923 0.9928 0.9879 0.9796 0.9851 1.0360 0.9937 0.9903 1.0154

Overall mean values across parameters: MSL absolute percentage bias 2.21%, finite sample standard error 0.0566, asymptotic standard error 0.0564, simulation standard error 0.0022, simulation-adjusted standard error 0.0564; CML absolute percentage bias 1.92%, finite sample standard error 0.0562, asymptotic standard error 0.0559; relative efficiency 1.0080 (MASE_MSL/MASE_CML) and 1.0092 (SASE_MSL/MASE_CML).
Table 1b. Evaluation of Ability to Recover "True" Parameters by the MSL and CML Approaches – With High Error Correlation Structure.

Note: For each parameter group, values are listed in the parameter order indicated. MASE = mean asymptotic standard error; SASE = simulation-adjusted standard error.

Slope coefficients (order: b11, b21, b31, b12, b22, b32, b42, b13, b23, b33, b14, b24, b34, b44, b15, b25, b35):
True values: 0.5000 1.0000 0.2500 0.7500 1.0000 0.5000 0.2500 0.2500 0.5000 0.7500 0.7500 0.2500 1.0000 0.3000 0.4000 1.0000 0.6000
MSL mean estimates: 0.5063 1.0089 0.2571 0.7596 1.0184 0.5009 0.2524 0.2473 0.5084 0.7498 0.7508 0.2407 1.0160 0.3172 0.3899 0.9875 0.5923
MSL absolute percentage bias: 1.27% 0.89% 2.85% 1.27% 1.84% 0.17% 0.96% 1.08% 1.67% 0.02% 0.11% 3.70% 1.60% 5.72% 2.54% 1.25% 1.28%
MSL finite sample standard errors: 0.0300 0.0410 0.0215 0.0495 0.0439 0.0343 0.0284 0.0244 0.0273 0.0302 0.0416 0.0311 0.0483 0.0481 0.0279 0.0365 0.0309
MSL asymptotic standard errors (MASE): 0.0294 0.0391 0.0288 0.0373 0.0436 0.0314 0.0294 0.0233 0.0256 0.0291 0.0419 0.0326 0.0489 0.0336 0.0286 0.0391 0.0316
MSL simulation standard errors: 0.0020 0.0026 0.0017 0.0028 0.0036 0.0023 0.0021 0.0015 0.0020 0.0019 0.0039 0.0033 0.0041 0.0028 0.0026 0.0036 0.0030
MSL simulation-adjusted standard errors (SASE): 0.0294 0.0392 0.0289 0.0374 0.0437 0.0315 0.0294 0.0234 0.0256 0.0291 0.0420 0.0327 0.0491 0.0337 0.0288 0.0393 0.0317
CML mean estimates: 0.5027 1.0087 0.2489 0.7699 1.0295 0.5220 0.2658 0.2605 0.5100 0.7572 0.7707 0.2480 1.0000 0.3049 0.4036 1.0008 0.6027
CML absolute percentage bias: 0.54% 0.87% 0.42% 2.65% 2.95% 4.39% 6.34% 4.18% 2.01% 0.96% 2.75% 0.80% 0.00% 1.62% 0.90% 0.08% 0.45%
CML finite sample standard errors: 0.0292 0.0479 0.0251 0.0396 0.0497 0.0282 0.0263 0.0269 0.0300 0.0365 0.0452 0.0234 0.0360 0.0423 0.0274 0.0452 0.0332
CML asymptotic standard errors (MASE): 0.0317 0.0410 0.0290 0.0395 0.0463 0.0352 0.0315 0.0251 0.0277 0.0318 0.0450 0.0363 0.0513 0.0368 0.0301 0.0398 0.0329
Relative efficiency MASE_MSL/MASE_CML: 0.9274 0.9538 0.9943 0.9451 0.9419 0.8931 0.9318 0.9274 0.9221 0.9150 0.9302 0.8977 0.9532 0.9133 0.9516 0.9821 0.9607
Relative efficiency SASE_MSL/MASE_CML: 0.9294 0.9560 0.9961 0.9477 0.9451 0.8955 0.9343 0.9293 0.9248 0.9170 0.9341 0.9022 0.9566 0.9165 0.9554 0.9862 0.9649

Correlation coefficients (order: r12, r13, r14, r15, r23, r24, r25, r34, r35, r45):
True values: 0.9000 0.8000 0.8200 0.7500 0.8500 0.9000 0.7200 0.8700 0.8000 0.8500
MSL mean estimates: 0.8969 0.8041 0.8249 0.7536 0.8426 0.8842 0.7184 0.8724 0.7997 0.8421
MSL absolute percentage bias: 0.34% 0.51% 0.60% 0.49% 0.87% 1.75% 0.22% 0.27% 0.04% 0.93%
MSL finite sample standard errors: 0.0224 0.0174 0.0284 0.0248 0.0181 0.0187 0.0241 0.0176 0.0265 0.0242
MSL asymptotic standard errors (MASE): 0.0177 0.0201 0.0265 0.0243 0.0190 0.0231 0.0280 0.0197 0.0191 0.0231
MSL simulation standard errors: 0.0034 0.0035 0.0061 0.0046 0.0081 0.0097 0.0072 0.0036 0.0039 0.0128
MSL simulation-adjusted standard errors (SASE): 0.0180 0.0204 0.0272 0.0247 0.0207 0.0251 0.0289 0.0200 0.0195 0.0264
CML mean estimates: 0.9019 0.8009 0.8151 0.7501 0.8468 0.9023 0.7207 0.8644 0.7988 0.8576
CML absolute percentage bias: 0.21% 0.11% 0.60% 0.01% 0.38% 0.26% 0.09% 0.65% 0.15% 0.89%
CML finite sample standard errors: 0.0233 0.0195 0.0296 0.0242 0.0190 0.0289 0.0295 0.0208 0.0193 0.0192
CML asymptotic standard errors (MASE): 0.0183 0.0203 0.0297 0.0251 0.0198 0.0244 0.0301 0.0220 0.0198 0.0252
Relative efficiency MASE_MSL/MASE_CML: 0.9669 0.9874 0.8933 0.9678 0.9606 0.9484 0.9298 0.8972 0.9645 0.9156
Relative efficiency SASE_MSL/MASE_CML: 0.9845 1.0023 0.9165 0.9849 1.0438 1.0284 0.9600 0.9124 0.9848 1.0480

Threshold parameters (order: y11, y21, y31, y12, y22, y13, y23, y33, y43, y14, y24, y15, y25, y35):
True values: 1.0000 1.0000 3.0000 0.0000 2.0000 2.0000 0.5000 1.0000 2.5000 1.0000 3.0000 1.5000 0.5000 2.0000
MSL mean estimates: 1.0110 0.9907 3.0213 0.0234 2.0089 2.0266 0.5086 0.9917 2.4890 0.9976 3.0101 1.4875 0.4822 1.9593
MSL absolute percentage bias: 1.10% 0.93% 0.71% – 0.44% 1.33% 1.73% 0.83% 0.44% 0.24% 0.34% 0.84% 3.55% 2.03%
MSL finite sample standard errors: 0.0600 0.0551 0.0819 0.0376 0.0859 0.0838 0.0305 0.0516 0.0750 0.0574 0.1107 0.0694 0.0581 0.0850
MSL asymptotic standard errors (MASE): 0.0520 0.0515 0.1177 0.0435 0.0781 0.0754 0.0440 0.0498 0.0928 0.0540 0.1193 0.0629 0.0465 0.0741
MSL simulation standard errors: 0.0023 0.0022 0.0065 0.0028 0.0066 0.0060 0.0030 0.0035 0.0066 0.0050 0.0125 0.0056 0.0041 0.0064
MSL simulation-adjusted standard errors (SASE): 0.0520 0.0515 0.1179 0.0436 0.0784 0.0757 0.0441 0.0499 0.0930 0.0542 0.1200 0.0632 0.0467 0.0744
CML mean estimates: 1.0322 1.0118 2.9862 0.0010 2.0371 2.0506 0.5090 0.9987 2.5148 1.0255 3.0048 1.5117 0.4968 2.0025
CML absolute percentage bias: 3.22% 1.18% 0.46% – 1.86% 2.53% 1.80% 0.13% 0.59% 2.55% 0.16% 0.78% 0.64% 0.12%
CML finite sample standard errors: 0.0731 0.0514 0.1185 0.0418 0.0949 0.0790 0.0378 0.0569 0.1144 0.0656 0.0960 0.0676 0.0515 0.0898
CML asymptotic standard errors (MASE): 0.0545 0.0528 0.1188 0.0455 0.0823 0.0776 0.0453 0.0509 0.0956 0.0567 0.1256 0.0649 0.0472 0.0761
Relative efficiency MASE_MSL/MASE_CML: 0.9538 0.9757 0.9906 0.9572 0.9491 0.9721 0.9702 0.9774 0.9699 0.9526 0.9498 0.9699 0.9868 0.9735
Relative efficiency SASE_MSL/MASE_CML: 0.9548 0.9766 0.9921 0.9592 0.9525 0.9752 0.9725 0.9798 0.9724 0.9566 0.9550 0.9737 0.9906 0.9771

Overall mean values across parameters: MSL absolute percentage bias 1.22%, finite sample standard error 0.0429, asymptotic standard error 0.0428, simulation standard error 0.0044, simulation-adjusted standard error 0.0432; CML absolute percentage bias 1.28%, finite sample standard error 0.0455, asymptotic standard error 0.0449; relative efficiency 0.9493 (MASE_MSL/MASE_CML) and 0.9621 (SASE_MSL/MASE_CML).
estimation approaches, and for both the low and high correlation cases, and confirms that the inverses of the sandwich information estimator (in the case of the MSL approach) and the Godambe information matrix estimator (in the case of the CML approach) recover the finite sample covariance matrices remarkably well. Second, the empirical and asymptotic standard errors for the threshold parameters are higher than for the slope and correlation parameters (for both the MSL and CML cases, and for both the low and high correlation cases). This is perhaps because the threshold parameters play a critical role in the partitioning of the underlying latent variable into ordinal outcomes (more so than the slope and correlation parameters), and so are somewhat more difficult to pin down. Third, a comparison of the standard errors across the low and high correlation cases reveals that the empirical and asymptotic standard errors are much lower for the correlation parameters in the latter case than in the former case. This reinforces the finding earlier that the correlation parameters are much easier to recover at high values because of the considerable influence they have on the log-likelihood function at high values; consequently, not only are they recovered accurately, but they are also recovered more precisely at high correlation values. Fourth, across all parameters, there is a reduction in the empirical and asymptotic standard errors for both the MSL and CML cases between the low and high correlation cases (though the reduction is much more for the correlation parameters than for the noncorrelation parameters). Fifth, the simulation error in the MSL approach is negligible to small. On average, based on the mean values in the last row of the table, the simulation error is about 3.9% of the sampling error for the low correlation case and 10.3% of the sampling error for the high correlation case. The higher simulation error for the high correlation case is not surprising, since we use the same number of Halton draws per individual in both the low and high correlation cases, and the multivariate integration is more involved with a high correlation matrix structure. Thus, as the levels of correlations increase, the evaluation of the multivariate normal integrals can be expected to become less precise at a given number of Halton draws per individual. However, overall, the results suggest that our MSL simulation procedure is well tuned, and that we are using adequate numbers of Halton draws per individual for the accurate evaluation of the log-likelihood function and the accurate estimation of the model parameters (this is also reflected in the negligible difference in the simulation-adjusted standard error and the mean asymptotic standard error of parameters in the MSL approach). The final two columns of each of Tables 1a and 1b provide a relative efficiency factor between the MSL and CML approaches. The first of these
columns provides the ratio of the asymptotic standard error of parameters from the MSL approach to the asymptotic standard error of the corresponding parameters from the CML approach. The second of these columns provides the ratio of the simulation-adjusted standard error of parameters from the MSL approach to the asymptotic standard error of parameters from the CML approach. As expected, the second column provides slightly higher values of relative efficiency, indicating that CML efficiency increases when one also considers the presence of simulation standard error in the MSL estimates. However, this efficiency increase is negligible in the current context because of the very small MSL simulation error. The more important and interesting point, though, is that the relative efficiency of the CML approach is as good as that of the MSL approach in the low correlation case. This is different from the relative efficiency results obtained in Renard, Molenberghs, and Geys (2004), Zhao and Joe (2005), and Kuk and Nott (2000) in other model contexts, where the CML has been shown to lose efficiency relative to an ML approach. However, note that all these earlier studies focus on a comparison of a CML approach vis-à-vis an ML approach, while, in our setting, we must resort to MSL to approximate the likelihood function. To our knowledge, this is the first comparison of the CML approach with an MSL approach, applicable to situations when the full information ML estimator cannot be evaluated analytically. In this regard, it is not clear that the earlier theoretical result (that the difference between the asymptotic covariance matrix of the CML estimator, obtained as the inverse of the Godambe matrix, and that of the ML estimator, obtained as the inverse of the cross-product matrix of derivatives, should be positive semidefinite) would extend to our case, because the asymptotic covariance of the MSL estimator is computed as the inverse of the sandwich information matrix.10 Basically, the presence of simulation noise, even if very small in the estimates of the parameters as in our case, can lead to a significant drop in the amount of information available in the sandwich matrix, resulting in increased standard errors of parameters when using MSL. Our results regarding the efficiency of individual parameters suggest that any reduction in efficiency of the CML (because of using only pairwise likelihoods rather than the full likelihood) is balanced by the reduction in efficiency from using MSL rather than ML, so that there is effectively no loss in asymptotic efficiency from using the CML approach (relative to the MSL approach) in the CMOP case with low correlation. For the high correlation case, the MSL does provide slightly better efficiency than the CML; however, even in this case, the relative efficiency of parameters in the CML approach ranges between 90 and 99% (mean of 95%) of the efficiency of the MSL approach, without
considering simulation standard error. When considering simulation error, the relative efficiency of the CML approach is even better, at about 96% of the MSL efficiency (on average across all parameters). Overall, there is little to no drop in efficiency because of the use of the CML approach in the CMOP simulation context.

5.2.2. The PMOP Model

Most of the observations made from the CMOP model results also hold for the PMOP model results presented in Table 2. Both the MSL and CML approaches recover the parameters extremely well. In the low correlation case, the APB ranges from 0.26 to 4.29% (overall mean value of 1.29%) across parameters for the MSL approach, and from 0.65 to 5.33% (overall mean value of 1.84%) across parameters for the CML approach. In the high correlation case, the APB ranges from 0.45 to 6.14% (overall mean value of 2.06%) across parameters for the MSL approach, and from 0.41 to 5.71% (overall mean value of 2.40%) across parameters for the CML approach. Further, the ability to recover parameters does not seem to be affected too much in an absolute sense by whether there is low correlation or high correlation. The CML approach shows a mean value of APB that increases about 1.3 times (from 1.84 to 2.40%) between the low and high ρ values, compared to an increase of about 1.6 times (from 1.29 to 2.06%) for the MSL approach. It is indeed interesting that the PMOP results indicate a relative increase in the APB values from the low to the high correlation case, while there was actually a corresponding relative decrease in the CMOP case. Another result is that the APB increases from the low to the high correlation case for the threshold (θ) and variance (σ²) parameters in both the MSL and CML approaches. On the other hand, the APB decreases from the low to the high correlation case for the correlation (ρ) parameter, and remains relatively stable between the low and high correlation cases for the slope (β) parameters. That is, the recovery of the slope parameters appears to be less sensitive to the level of correlation than is the recovery of the other parameters. The finite sample standard error and asymptotic standard error values are close to one another, with very little difference in the overall mean values of these two columns (see last row). This holds for both the MSL and CML approaches. Also, as in the CMOP case, the empirical and asymptotic standard errors for the threshold parameters are generally higher than for the other parameters. The simulation error in the MSL approach is negligible, at about 0.1% or less of the sampling error for both the low and high correlation cases. Note that, unlike in the CMOP case, the PMOP
Table 2. Evaluation of Ability to Recover "True" Parameters by the MSL and CML Approaches – The Panel Case.

Note: Values are listed in the parameter order β1, β2, ρ, σ², θ1, θ2, θ3. MASE = mean asymptotic standard error; SASE = simulation-adjusted standard error.

ρ = 0.30:
True values: 1.0000 1.0000 0.3000 1.0000 1.5000 2.5000 3.0000
MSL mean estimates: 0.9899 1.0093 0.2871 1.0166 1.5060 2.5129 3.0077
MSL absolute percentage bias: 1.01% 0.93% 4.29% 1.66% 0.40% 0.52% 0.26%
MSL finite sample standard errors: 0.1824 0.1729 0.0635 0.2040 0.2408 0.2617 0.2670
MSL asymptotic standard errors (MASE): 0.1956 0.1976 0.0605 0.2072 0.2615 0.2725 0.2814
MSL simulation standard errors: 0.0001 0.0001 0.0000 0.0002 0.0001 0.0002 0.0002
MSL simulation-adjusted standard errors (SASE): 0.1956 0.1976 0.0605 0.2072 0.2615 0.2725 0.2814
CML mean estimates: 0.9935 1.0221 0.2840 1.0142 1.5210 2.5272 3.0232
CML absolute percentage bias: 0.65% 2.21% 5.33% 1.42% 1.40% 1.09% 0.77%
CML finite sample standard errors: 0.1907 0.1955 0.0632 0.2167 0.2691 0.2890 0.2928
CML asymptotic standard errors (MASE): 0.1898 0.2142 0.0673 0.2041 0.2676 0.2804 0.2882
Relative efficiency MASE_MSL/MASE_CML: 1.0306 0.9223 0.8995 1.0155 0.9771 0.9719 0.9763
Relative efficiency SASE_MSL/MASE_CML: 1.0306 0.9223 0.8995 1.0155 0.9771 0.9719 0.9763
Overall mean values across parameters: MSL absolute percentage bias 1.29%, finite sample standard error 0.1989, asymptotic standard error 0.2109, simulation standard error 0.0001, simulation-adjusted standard error 0.2109; CML absolute percentage bias 1.84%, finite sample standard error 0.2167, asymptotic standard error 0.2159; relative efficiency 0.9705 and 0.9705.

ρ = 0.70:
True values: 1.0000 1.0000 0.7000 1.0000 1.5000 2.5000 3.0000
MSL mean estimates: 1.0045 1.0183 0.6854 1.0614 1.5192 2.5325 3.0392
MSL absolute percentage bias: 0.45% 1.83% 2.08% 6.14% 1.28% 1.30% 1.31%
MSL finite sample standard errors: 0.2338 0.1726 0.0729 0.4634 0.2815 0.3618 0.4033
MSL asymptotic standard errors (MASE): 0.2267 0.1812 0.0673 0.4221 0.2749 0.3432 0.3838
MSL simulation standard errors: 0.0001 0.0001 0.0001 0.0004 0.0002 0.0003 0.0003
MSL simulation-adjusted standard errors (SASE): 0.2267 0.1812 0.0673 0.4221 0.2749 0.3432 0.3838
CML mean estimates: 1.0041 1.0304 0.6848 1.0571 1.5304 2.5433 3.0514
CML absolute percentage bias: 0.41% 3.04% 2.18% 5.71% 2.03% 1.73% 1.71%
CML finite sample standard errors: 0.2450 0.1969 0.0744 0.4864 0.3101 0.3904 0.4324
CML asymptotic standard errors (MASE): 0.2368 0.2199 0.0735 0.4578 0.3065 0.3781 0.4207
Relative efficiency MASE_MSL/MASE_CML: 0.9572 0.8239 0.9159 0.9220 0.8968 0.9076 0.9123
Relative efficiency SASE_MSL/MASE_CML: 0.9572 0.8239 0.9159 0.9220 0.8968 0.9076 0.9123
Overall mean values across parameters: MSL absolute percentage bias 2.06%, finite sample standard error 0.2842, asymptotic standard error 0.2713, simulation standard error 0.0002, simulation-adjusted standard error 0.2713; CML absolute percentage bias 2.40%, finite sample standard error 0.3051, asymptotic standard error 0.2990; relative efficiency 0.9051 and 0.9051.
MSL estimation did not involve the same number of draws per individual for the low and high correlation cases; rather, the number of draws varied to ensure an absolute error tolerance of 0.001 for each six-dimensional integral forming the likelihood. Thus, it is no surprise that the simulation error does not increase much between the low and high correlation cases as it did in the CMOP case. A significant difference from the CMOP case is that the empirical standard errors and asymptotic standard errors are consistently larger for the high correlation case than for the low correlation case, with a particularly substantial increase in the standard error of σ². The final two columns provide a relative efficiency factor between the MSL and CML approaches. The values in these two columns are identical because of the very low simulation error. As in the CMOP case, the estimated efficiency of the CML approach is as good as that of the MSL approach in the low correlation case (the relative efficiency ranges between 90 and 103%, with a mean of 97%). For the high correlation case, the relative efficiency of parameters in the CML approach ranges between 82 and 96% (mean of 91%) of the efficiency of the MSL approach, indicating a reduction in efficiency as the dependence level goes up (again, consistent with the CMOP case). Overall, however, the efficiency of the CML approach remains high for all the parameters.
5.3. Nonconvergence and Computational Time

The simulation estimation of multivariate ordered-response models can involve numerical instability because of possibly unstable operations such as large matrix inversions and imprecision in the computation of the Hessian. This can lead to convergence problems. On the other hand, the CML approach is a straightforward approach that should be easy to implement and should not have any convergence-related problems. In this empirical study, we classified any estimation run that had not converged within 5 hours as nonconverged. We computed nonconvergence rates in two ways for the MSL approach. For the CMOP model, we computed the nonconvergence rates in terms of the starting seeds that led to failure in a complete estimation of 10 simulation runs (using different randomized Halton sequences) for each data set. If a particular starting seed led to failure in convergence for any of the 10 simulation runs, that seed was classified as a failed seed. Otherwise, the seed was classified as a successful seed. This procedure was applied for each of the 20 data sets generated for each of the low and high correlation
matrix structures until we had a successful seed.11 The nonconvergence rate was then computed as the number of failed seeds divided by the total number of seeds considered. Note that this would be a good reflection of nonconvergence rates if the analyst ran the simulation multiple times on a single data set to recognize simulation noise in statistical inferences. But, in many cases, the analyst may run the MSL procedure only once on a single data set, based on using a high level of accuracy in computing the multivariate integrals in the likelihood function. For the PMOP model, which was estimated based on as many draws as needed to obtain an absolute error tolerance of 0.001 for each six-dimensional integral forming the likelihood, we, therefore, consider another way of computing nonconvergence. This is based on the number of unsuccessful runs out of the 1,000 simulated estimation runs considered (100 data sets times 10 simulated estimation runs). The results indicated a nonconvergence rate of 28.5% for the low correlation case and 35.5% for the high correlation case in the CMOP model, and a nonconvergence rate of 4.2% for the low correlation case and 2.4% for the high correlation case in the PMOP model (note, however, that the rates cannot be compared between the CMOP and PMOP models because of very different ways of computing the rates, as discussed above). For both the CMOP and PMOP models, and both the low and high correlation cases, we always obtained convergence with the CML approach. Next, we examined the time to convergence per converged estimation run for the MSL and CML procedures (the time to convergence included the time to compute the standard error of parameters). For the CMOP model, we had a very well-tuned and efficient MSL procedure with an analytic gradient (written in the GAUSS matrix programming language). We used naïve independent probit starting values for the MSL as well as the CML in the CMOP case (the CML is very easy to code relative to the MSL, and was also undertaken in the GAUSS language for the CMOP model). The estimations were run on a desktop machine. But, for the PMOP model, we used an MSL code written in the R language without an analytic gradient, and a CML code written using a combination of C and R languages. However, we used the CML convergent values (which are pretty good) as the MSL start values in the PMOP model to compensate for the lack of analytic MSL gradients. The estimations were run on a powerful server machine. As a consequence of all these differences, one needs to be careful in the computational time comparisons. Here, we only provide a relative computational time factor (RCTF), computed as the mean time needed for an MSL run divided by the mean time needed for a CML run. In addition, we present the standard
deviation of the run times (SDR) as a percentage of mean run time for the MSL and CML estimations. The RCTF for the CMOP model for the case of the low correlation matrix is 18, and for the case of the high correlation matrix is 40. The substantially higher RCTF for the high correlation case is because of an increase in the mean MSL time between the low and high correlation cases; the mean CML time hardly changed. The MSL SDR in the CMOP model for the low correlation case is 30% and for the high correlation case is 47%, while the CML SDR is about 6% for both the low and high correlation cases. The RCTF for the PMOP model for the case of low correlation is 332, and for the case of high correlation is 231. The MSL SDR values for the low and high correlation cases in the PMOP model are on the order of 16–24%, though this small SDR is also surely because of using the CML convergent values as the start values for the MSL estimation runs. The CML SDR values in the PMOP model are low (6–13%) for both the low and high correlation cases. Overall, the computation time results do very clearly indicate the advantage of the CML over the MSL approach – the CML approach estimates parameters in much less time than the MSL, and the stability in the CML computation time is substantially higher than the stability in the MSL computation times. As the number of ordered-response outcomes increases, one can only expect a further increase in the computational time advantage of the CML over the MSL estimation approach.
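The two summary measures used here are straightforward to compute from recorded run times; the sketch below is our own illustration with made-up numbers.

```r
# Sketch: relative computational time factor (RCTF) and standard deviation of run
# times (SDR, as a percentage of the mean run time) from recorded per-run times.
rctf <- function(msl.times, cml.times) mean(msl.times) / mean(cml.times)
sdr  <- function(times) 100 * sd(times) / mean(times)

# Example with hypothetical run times (in minutes):
msl.times <- c(95, 130, 80, 150, 110); cml.times <- c(5.1, 4.8, 5.5, 5.0, 5.2)
rctf(msl.times, cml.times); sdr(msl.times); sdr(cml.times)
```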
6. CONCLUSIONS

This chapter compared the performance of the MSL approach with the CML approach in multivariate ordered-response situations. We used simulated data sets with known underlying model parameters to evaluate the two estimation approaches in the context of a cross-sectional ordered-response setting as well as a panel ordered-response setting. The ability of the two approaches to recover model parameters was examined, as was the sampling variance and the simulation variance of parameters in the MSL approach relative to the sampling variance in the CML approach. The computational costs of the two approaches were also presented. Overall, the simulation results demonstrate the ability of the CML approach to recover the parameters in a multivariate ordered-response choice model context, independent of the correlation structure. In addition, the CML approach recovers parameters as well as the MSL estimation
approach in the simulation contexts used in this study, while also doing so at a substantially reduced computational cost and with improved computational stability. Further, any reduction in the efficiency of the CML approach relative to the MSL approach is in the range of nonexistent to small. All these factors, combined with the conceptual and implementation simplicity of the CML approach, make it a promising and simple approach not only for the multivariate ordered-response model considered here, but also for other analytically intractable econometric models. Also, as the dimensionality of the model explodes, the CML approach remains practical and feasible, while the MSL approach becomes impractical and/or infeasible. Additional comparisons of the CML approach with the MSL approach for high-dimensional model contexts and alternative covariance patterns are directions for further research.
NOTES

1. The first three of these studies use the bivariate ordered-response probit (BORP) model in which the stochastic elements in the two ordered-response equations take a bivariate normal distribution, while the last study develops a more general and flexible copula-based bivariate ordered-response model that subsumes the BORP as but one special case.
2. A handful of studies (see Hjort & Varin, 2008; Mardia, Kent, Hughes, & Taylor, 2009; Cox & Reid, 2004) have also theoretically examined the limiting normality properties of the CML approach, and compared the asymptotic variance matrices from this approach with the maximum likelihood approach. However, such a precise theoretical analysis is possible only for very simple models, and becomes much harder for models such as a multivariate ordered-response system.
3. In this study, we assume that the number of panel observations is the same across individuals. Extension to the case of different numbers of panel observations across individuals does not pose any substantial challenges. However, the efficiency of the composite marginal likelihood (CML) approach depends on the weights used for each individual in the case of a varying number of observations across individuals (see Kuk & Nott, 2000; Joe and Lee (2009) provide a recent discussion and propose new weighting techniques). But one can simply put a weight of one, without any loss of generality, for each individual in the case of an equal number of panel observations for each individual. In our study, the focus is on comparing the performance of the maximum simulated likelihood approach with the CML approach, so we steer clear of issues related to optimal weights for the CML approach by considering the "equal observations across individuals" case.
4. Note that one can also use more complicated autoregressive structures of order p for the error terms, or use more general structures for the error correlation. For instance, while we focus on a time-series context, in spatial contexts related to ordered-response modeling, Bhat et al. (2010) developed a specification where the
correlation in physical activity between two individuals may be a function of several measures of spatial proximity and adjacency.
5. Intuitively, in the pairwise CML approach used in this chapter, the surrogate likelihood function represented by the CML function is the product of the marginal likelihood functions formed by each pair of ordinal variables. In general, maximization of the original likelihood function will result in parameters that tend to maximize each pairwise likelihood function. Since the CML is the product of pairwise likelihood contributions, it will, therefore, provide consistent estimates. Another equivalent way to see this is to assume we are discarding all but two randomly selected ordinal variables in the original likelihood function. Of course, we will not be able to estimate all the model parameters from two random ordinal variables, but if we could, the resulting parameters would be consistent because information (captured by other ordinal variables) is being discarded in a purely random fashion. The CML estimation procedure works similarly, but combines all ordinal variables observed two at a time, while ignoring the full joint distribution of the ordinal variables.
6. Bhat (2001) used the Halton sequence to estimate mixed logit models, and found that the simulation error in estimated parameters is lower with 100 Halton draws than with 1,000 random draws (per individual). In our study, we carried out the GHK analysis of the multivariate ordered-response model with 100 randomized Halton draws as well as 500 random draws per individual, and found the 100 randomized Halton draws case to be much more accurate/efficient as well as much less time-consuming. So, we present only the results of the 100 randomized Halton draws case here.
7. Note that we use more independent data sets for the panel case than for the cross-sectional case, because the number of individuals in the panel case is fewer than the number of individuals in the cross-sectional case. Essentially, the intent is to retain the same order of sampling variability in the two cases across individuals and data sets (the product of the number of observations per data set and the number of data sets is 20,000 in the cross-sectional and the panel cases). Further, the lower number of data sets in the cross-sectional case is helpful because maximum simulated likelihood is more expensive relative to the panel case, given that the number of parameters to be estimated is substantially more than in the panel case. Note also that the dimensionality of the correlation matrices is about the same in the cross-sectional and panel cases. We use T = 6 in the panel case because the serial correlation gets manifested in the last five of the six observations for each individual. The first observation error term ε_q1 for each individual q is randomly drawn from the normal distribution with variance σ².
8. The CML estimator always converged in our simulations, unlike the MSL estimator.
9. One could argue that the higher absolute percentage bias values for the correlation parameters in the low correlation case compared to the high correlation case are simply an artifact of taking percentage differences from smaller base correlation values in the former case. However, the sum of the absolute values of the deviations between the mean estimate and the true value is 0.0722 for the low correlation case and 0.0488 for the high correlation case.
Thus, the correlation values are indeed being recovered more accurately in the high correlation case compared to the low correlation case.
10. McFadden and Train (2000) indicate, in their use of independent number of random draws across observations, that the difference between the asymptotic covariance matrix of the MSL estimator obtained as the inverse of the sandwich information matrix and the asymptotic covariance matrix of the MSL estimator obtained as the inverse of the cross-product of first derivatives should be positivedefinite for finite number of draws per observation. Consequently, for the case of independent random draws across observations, the relationship between the MSL sandwich covariance matrix estimator and the CML Godambe covariance matrix is unclear. The situation gets even more unclear in our case because of the use of Halton or Lattice point draws that are not based on independent random draws across observations. 11. Note that we use the terminology ‘‘successful seed’’ to simply denote if the starting seed led to success in a complete estimation of the 10 simulation runs. In MSL estimation, it is not uncommon to obtain nonconvergence (because of a number of reasons) for some sets of random sequences. There is, however, nothing specific to be learned here in terms of what starting seeds are likely to be successful and what starting seeds are likely to be unsuccessful. The intent is to use the terminology ‘‘successful seed’’ simply as a measure of nonconvergence rates.
ACKNOWLEDGMENTS

The authors are grateful to Lisa Macias for her help in formatting this document. Two referees provided important input on an earlier version of the chapter.
REFERENCES

Apanasovich, T. V., Ruppert, D., Lupton, J. R., Popovic, N., Turner, N. D., Chapkin, R. S., & Carroll, R. J. (2008). Aberrant crypt foci and semiparametric modelling of correlated binary data. Biometrics, 64(2), 490–500. Balia, S., & Jones, A. M. (2008). Mortality, lifestyle and socio-economic status. Journal of Health Economics, 27(1), 1–26. Bellio, R., & Varin, C. (2005). A pairwise likelihood approach to generalized linear models with crossed random effects. Statistical Modelling, 5(3), 217–227. Bhat, C. R. (2001). Quasi-random maximum simulated likelihood estimation of the mixed multinomial logit model. Transportation Research Part B, 35(7), 677–693. Bhat, C. R. (2003). Simulation estimation of mixed discrete choice models using randomized and scrambled Halton sequences. Transportation Research Part B, 37(9), 837–855. Bhat, C. R., Sener, I. N., & Eluru, N. (2010). A flexible spatially dependent discrete choice model: Formulation and application to teenagers’ weekday recreational activity participation. Transportation Research Part B, 44(8–9), 903–921.
Bhat, C. R., & Srinivasan, S. (2005). A multidimensional mixed ordered-response model for analyzing weekend activity participation. Transportation Research Part B, 39(3), 255–278. Caragea, P. C., & Smith, R. L. (2007). Asymptotic properties of computationally efficient alternative estimators for a class of multivariate normal models. Journal of Multivariate Analysis, 98(7), 1417–1440. Chen, M.-H., & Dey, D. K. (2000). Bayesian analysis for correlated ordinal data models. In: D. K. Dey, S. K. Gosh & B. K. Mallick (Eds), Generalized linear models: A Bayesian perspective. New York: Marcel Dekker. Cox, D., & Reid, N. (2004). A note on pseudolikelihood constructed from marginal densities. Biometrika, 91(3), 729–737. Craig, P. (2008). A new reconstruction of multivariate normal orthant probabilities. Journal of the Royal Statistical Society: Series B, 70(1), 227–243. de Leon, A. R. (2005). Pairwise likelihood approach to grouped continuous model and its extension. Statistics & Probability Letters, 75(1), 49–57. Engle, R. F., Shephard, N., & Sheppard, K. (2007). Fitting and testing vast dimensional timevarying covariance models. Finance Working Papers, FIN-07-046. Stern School of Business, New York University. Engler, D. A., Mohapatra, M., Louis, D. N., & Betensky, R. A. (2006). A pseudolikelihood approach for simultaneous analysis of array comparative genomic hybridizations. Biostatistics, 7(3), 399–421. Ferdous, N., Eluru, N., Bhat, C. R., & Meloni, I. (2010). A multivariate ordered-response model system for adults’ weekday activity episode generation by activity purpose and social context. Transportation Research Part B, 44(8–9), 922–943. Genz, A. (1992). Numerical computation of multivariate normal probabilities. Journal of Computational and Graphical Statistics, 1(2), 141–149. Genz, A. (2003). Fully symmetric interpolatory rules for multiple integrals over hyper-spherical surfaces. Journal of Computational and Applied Mathematics, 157(1), 187–195. Genz, A., & Bretz, F. (1999). Numerical computation of multivariate t-probabilities with application to power calculation of multiple contrasts. Journal of Statistical Computation and Simulation, 63(4), 361–378. Genz, A., & Bretz, F. (2002). Comparison of methods for the computation of multivariate t probabilities. Journal of Computational and Graphical Statistics, 11(4), 950–971. Geweke, J. (1991). Efficient simulation from the multivariate normal and student-t distributions subject to linear constraints. Computer Science and Statistics: Proceedings of the Twenty Third Symposium on the Interface, Foundation of North America Inc., Fairfax, VA (pp. 571–578). Godambe, V. (1960). An optimum property of regular maximum likelihood equation. The Annals of Mathematical Statistics, 31(4), 1208–1211. Greene, W. H., & Hensher, D. A. (2010). Modeling ordered choices: A primer. Cambridge: Cambridge University Press. Hajivassiliou, V., & McFadden, D. (1998). The method of simulated scores for the estimation of LDV models. Econometrica, 66(4), 863–896. Hasegawa, H. (2010). Analyzing tourists’ satisfaction: A multivariate ordered probit approach. Tourism Management, 31(1), 86–97. Herriges, J. A., Phaneuf, D. J., & Tobias, J. L. (2008). Estimating demand systems when outcomes are correlated counts. Journal of Econometrics, 147(2), 282–298.
Hjort, N. L., & Varin, C. (2008). ML, PL, QL in Markov chain models. Scandinavian Journal of Statistics, 35(1), 64–82. Hothorn, T., Bretz, F., & Genz, A. (2001). On multivariate t and gauss probabilities in R. R News, 1(2), 27–29. Jeliazkov, I., Graves, J., & Kutzbach, M. (2008). Fitting and comparison of models for multivariate ordinal outcomes. Advances in Econometrics, 23, 115–156. Joe, H., & Lee, Y. (2009). On weighting of bivariate margins in pairwise likelihood. Journal of Multivariate Analysis, 100(4), 670–685. Keane, M. (1990). Four essays in empirical macro and labor economics. Ph.D. thesis, Brown University, Providence, RI. Keane, M. (1994). A computationally practical simulation estimator for panel data. Econometrica, 62(1), 95–116. Kuk, A. Y. C., & Nott, D. J. (2000). A pairwise likelihood approach to analyzing correlated binary data. Statistics & Probability Letters, 47(4), 329–335. LaMondia, J., & Bhat, C. R. (2009). A conceptual and methodological framework of leisure activity loyalty accommodating the travel context: Application of a copula-based bivariate ordered-response choice model. Technical paper, Department of Civil, Architectural and Environmental Engineering, The University of Texas at Austin. Lele, S. R. (2006). Sampling variability and estimates of density dependence: A compositelikelihood approach. Ecology, 87(1), 189–202. Lindsay, B. G. (1988). Composite likelihood methods. Contemporary Mathematics, 80, 221–239. Liu, I., & Agresti, A. (2005). The analysis of ordered categorical data: An overview and a survey of recent developments. TEST: An Official Journal of the Spanish Society of Statistics and Operations Research, 14(1), 1–73. Mardia, K., Kent, J. T., Hughes, G., & Taylor, C. C. (2009). Maximum likelihood estimation using composite likelihoods for closed exponential families. Biometrika, 96(4), 975–982. McFadden, D., & Train, K. (2000). Mixed MNL models for discrete response. Journal of Applied Econometrics, 15(5), 447–470. McKelvey, R., & Zavoina, W. (1971). An IBM Fortran IV program to perform n-chotomous multivariate probit analysis. Behavioral Science, 16, 186–187. McKelvey, R. D., & Zavoina, W. (1975). A statistical model for the analysis of ordinal-level dependent variables. Journal of Mathematical Sociology, 4, 103–120. Mi, X., Miwa, T., & Hothorn, T. (2009). mvtnorm: New numerical algorithm for multivariate normal probabilities. The R Journal, 1, 37–39. Mitchell, J., & Weale, M. (2007). The reliability of expectations reported by British households: Micro evidence from the BHPS. Discussion paper. National Institute of Economic and Social Research, London, UK. Molenberghs, G., & Verbeke, G. (2005). Models for discrete longitudinal data. New York: Springer Series in Statistics, Springer ScienceþBusiness Media, Inc. Muller, G., & Czado, C. (2005). An autoregressive ordered probit model with application to high frequency financial data. Journal of Computational and Graphical Statistics, 14(2), 320–338. R Development Core Team. (2009). R: A language and environment for statistical computing (ISBN 3-900051-07-0. Available at http://www.R-project.org.). Vienna, Austria: R Foundation for Statistical Computing. Renard, D., Molenberghs, G., & Geys, H. (2004). A pairwise likelihood approach to estimation in multilevel probit models. Computational Statistics & Data Analysis, 44(4), 649–667.
Scott, D. M., & Axhausen, K. W. (2006). Household mobility tool ownership: Modeling interactions between cars and season tickets. Transportation, 33(4), 311–328. Scott, D. M., & Kanaroglou, P. S. (2002). An activity-episode generation model that captures interactions between household heads: Development and empirical analysis. Transportation Research Part B, 36(10), 875–896. Scotti, C. (2006). A bivariate model of Fed and ECB main policy rates. International Finance Discussion Papers 875, Board of Governors of the Federal Reserve System, Washington, DC, USA. Train, K. (2003). Discrete choice methods with simulation (1st ed.). Cambridge: Cambridge University Press. Varin, C., & Czado, C. (2010). A mixed autoregressive probit model for ordinal longitudinal data. Biostatistics, 11(1), 127–138. Varin, C., Host, G., & Skare, O. (2005). Pairwise likelihood inference in spatial generalized linear mixed models. Computational Statistics & Data Analysis, 49(4), 1173–1191. Varin, C., & Vidoni, P. (2006). Pairwise likelihood inference for ordinal categorical time series. Computational Statistics & Data Analysis, 51(4), 2365–2373. Varin, C., & Vidoni, P. (2009). Pairwise likelihood inference for general state space models. Econometric Reviews, 28(1–3), 170–185. Winship, C., & Mare, R. D. (1984). Regression models with ordinal variables. American Sociological Review, 49(4), 512–525. Zhao, Y., & Joe, H. (2005). Composite likelihood estimation in multivariate data analysis. The Canadian Journal of Statistics, 33(3), 335–356.
PRETEST ESTIMATION IN THE RANDOM PARAMETERS LOGIT MODEL

Tong Zeng and R. Carter Hill

ABSTRACT

In this paper we use Monte Carlo sampling experiments to examine the properties of pretest estimators in the random parameters logit (RPL) model. The pretests are for the presence of random parameters. We study the Lagrange multiplier (LM), likelihood ratio (LR), and Wald tests, using conditional logit as the restricted model. The LM test is the fastest test to implement among these three test procedures since it only uses restricted, conditional logit, estimates. However, the LM-based pretest estimator has poor risk properties. The ratio of LM-based pretest estimator root mean squared error (RMSE) to the random parameters logit model estimator RMSE diverges from one with increases in the standard deviation of the parameter distribution. The LR and Wald tests exhibit properties of consistent tests, with the power approaching one as the specification error increases, so that the pretest estimator is consistent. We explore the power of these three tests for the random parameters by calculating the empirical percentile values, size, and rejection rates of the test statistics. We find the power of LR and Wald tests decreases with increases in the mean of the coefficient distribution.
The LM test has the weakest power for the presence of the random coefficient in the RPL model.
1. INTRODUCTION In this paper, we use Monte Carlo sampling experiments to examine the properties of pretest estimators in the random parameters logit (RPL) model, also called the mixed logit model. The pretests are for the presence of random parameters. We study the Lagrange multiplier (LM), likelihood ratio (LR), and Wald tests, using conditional logit as the restricted model. Unlike the conditional logit model, the mixed logit model does not impose the Independence from Irrelevant Alternatives (IIA) assumption. The mixed logit model can capture random taste variation among individuals and allows the unobserved factors of utility to be correlated over time as well. The choice probabilities in the mixed logit model cannot be calculated analytically because they involve a multidimensional integral, which does not have closed form solution. The integral can be approximated using simulation. The requirement of a large number of pseudo-random numbers during the simulation leads to long computational times. We are interested in testing the randomness of the mixed logit coefficients and the properties of pretest estimators in the mixed logit following the test for randomness. If the model coefficients are not random, then the mixed logit model reduces to the simpler conditional logit model. The most commonly used test procedures for this purpose are the Wald (or t-) test and the LR test for the significance of the random components of the coefficients. The problem is that in order to implement these tests the mixed logit model must be estimated. It is much faster to implement the LM test, as the restricted estimates come from the conditional logit model, which is easily estimated. We use Monte Carlo experiments in the context of one- and twoparameter choice models with four alternatives to examine the risk properties of pretest estimator based on LM, LR, and Wald tests. We explore the power of the three tests for random parameters by calculating the empirical 90th and 95th percentile values of the three tests and examining rejection rates of the three tests by using the empirical 90th and 95th percentile values as the critical values for 10 and 5% significance levels. We find the pretest estimators based on the LR and Wald statistics have
root mean squared error (RMSE) that is less than that of the random parameters logit model when the parameter variance is small, but that pretest estimator RMSE is worse than that of the random parameters logit model over the remaining parameter space. The LR and Wald tests exhibit properties of consistent tests, with the power approaching one as the specification error increases. However, the power of LR and Wald tests decreases with increases in the mean of the coefficient distribution. The ratio of LM-based pretest estimator RMSE to that RMSE of the random parameters logit model rises and becomes further away from one with increases in the standard deviation of the parameter distribution. The plan of the paper is as follows. In the following section, we review the conditional logit model and introduce the mixed logit specification. In Section 3, we introduce quasi-random numbers and describe our Monte Carlo experiments. We also show the efficiency of the quasi-random numbers in this section. Section 4 summarizes the RMSE properties of the pretest estimator based on LM, LR, and Wald tests, and the size-corrected rejection rates of these three tests. Some conclusions and recommendations are given in Section 5.
2. CONDITIONAL AND MIXED LOGIT MODELS

The conditional logit model is frequently used in applied econometrics. The related choice probability can be computed conveniently without multivariate integration. The IIA assumption of the conditional logit model is inappropriate in many choice situations, especially for choices that are close substitutes. The IIA assumption arises because in logit models the unobserved components of utility have independent and identical Type I extreme value distributions. This is violated in many cases, such as when unobserved factors that affect the choice persist over time. Unlike the probit model, the mixed logit model is fully flexible because its unobserved utility components are not limited to the normal distribution. It decomposes the random component of utility into two parts: one having the independent, identical Type I extreme value distribution and the other representing individual tastes, having any distribution. The utility associated with alternative i as evaluated by individual n in the mixed logit model is written as

U_{ni} = \beta_n' x_{ni} + \epsilon_{ni}    (1)
where x_ni are observed variables for alternative i and individual n, β_n is a vector of coefficients for individual n, varying over individuals in the population with density function f(β), and ε_ni is an i.i.d. extreme value random component that is independent of β_n and x_ni. If β_n is fixed, the mixed logit becomes the conditional logit model and the choice probability L_ni(β) for individual n choosing alternative i is

L_{ni}(\beta) = \frac{e^{\beta' x_{ni}}}{\sum_{j} e^{\beta' x_{nj}}}    (2)

In the mixed logit model we specify a distribution f(β|θ) for the random coefficients, where θ contains the distribution means and variances. These are the parameters to be ultimately estimated. The choice probability is

P_{ni} = \int \frac{e^{\beta' x_{ni}}}{\sum_{j} e^{\beta' x_{nj}}} f(\beta|\theta)\, d\beta = \int L_{ni}(\beta) f(\beta|\theta)\, d\beta    (3)

Hensher and Greene (2003) discuss how to choose an appropriate distribution for random coefficients. In the following section, we will describe how to estimate the unknown parameters θ and introduce the quasi-Monte Carlo methods. Train (2003) summarizes many aspects of conditional logit and mixed logit models.
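For concreteness, Eq. (2) can be computed as in the following minimal Python sketch. The function name and array shapes are illustrative assumptions, not the authors' GAUSS code.

```python
import numpy as np

def conditional_logit_probs(beta, X):
    """Eq. (2): conditional logit choice probabilities for one individual.

    beta : (K,) coefficient vector
    X    : (J, K) attributes of the J alternatives
    Returns a (J,) vector of probabilities that sums to one.
    """
    v = X @ beta            # systematic utilities beta'x_ni
    v = v - v.max()         # stabilize the exponentials
    expv = np.exp(v)
    return expv / expv.sum()
```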
3. QUASI-MONTE CARLO METHODS

3.1. Simulated Log-Likelihood Function

Unlike the conditional logit model, the mixed logit probability cannot be calculated analytically because the related integral does not have a closed form solution. The choice probability can be estimated through simulation, and the unknown parameters θ can be estimated by maximizing the simulated log-likelihood (SLL) function. With simulation, a value of β labeled β^r, representing the rth draw, is chosen randomly from a previously specified distribution. The standard logit L_ni(β) in Eq. (2) can be calculated given the value β^r. This process is repeated R times, and the simulated probability of individual n choosing alternative i is obtained by averaging L_ni(β^r):

\check{P}_{ni} = \frac{1}{R} \sum_{r=1}^{R} L_{ni}(\beta^r)    (4)
The simulated log-likelihood function is

SLL(\theta) = \sum_{n=1}^{N} \sum_{i=1}^{J} d_{ni} \ln \check{P}_{ni}    (5)

where d_ni = 1 if individual n chooses alternative i and d_ni = 0 otherwise. Each individual is assumed to make choices independently and faces a single choice occasion.
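The simulation in Eqs. (4) and (5) can be sketched as follows for a model with one random coefficient and a single attribute per alternative. This is an illustrative outline under assumed array shapes, not the estimation code used in the chapter; the uniform draws are assumed to lie strictly inside (0, 1).

```python
import numpy as np
from scipy.stats import norm

def simulated_loglik(theta, X, d, draws):
    """Eqs. (4)-(5): simulated log-likelihood of a one-coefficient mixed logit.

    theta : (mean, sd) of the normal coefficient distribution
    X     : (N, J) attribute of each alternative for each individual
    d     : (N, J) one-hot matrix of observed choices
    draws : (N, R) uniform draws in (0, 1), pseudo-random or Halton
    """
    mean, sd = theta
    beta_r = mean + sd * norm.ppf(draws)           # (N, R) coefficient draws
    v = X[:, :, None] * beta_r[:, None, :]         # (N, J, R) utilities
    v = v - v.max(axis=1, keepdims=True)
    expv = np.exp(v)
    p = expv / expv.sum(axis=1, keepdims=True)     # logit kernel L_ni(beta^r) per draw
    P_check = p.mean(axis=2)                       # Eq. (4): average over the R draws
    return np.sum(d * np.log(P_check))             # Eq. (5)
```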
3.2. The Halton Sequences

The classical Monte Carlo method is used above to estimate the probability P_ni. It reduces the integration problem to the problem of estimating an expected value on the basis of the strong law of large numbers. In general terms, the classical Monte Carlo method is described as a numerical method based on random sampling using pseudo-random numbers. In terms of the number of pseudo-random draws, N, it gives us a probabilistic error bound, or convergence rate O(N^{-1/2}), since there is never any guarantee that the expected accuracy is achieved in a concrete calculation (Niederreiter, 1992, p. 7). This reflects the stochastic character of the classical Monte Carlo method. For the classical Monte Carlo method, the convergence rate of the numerical integration does not depend on the dimension of the integration. Good estimates, however, require a large number of pseudo-random numbers, which leads to long computational times. To reduce the cost of long run times, we can replace the pseudo-random numbers with a constructed set of points; the same or even higher estimation accuracy can be reached with fewer points. The essence of the number theoretic method (NTM) is to find a set of uniformly scattered points over an s-dimensional unit cube. Such a set of points obtained by NTM is usually called a set of quasi-random numbers, or a number theoretic net. Sometimes it can be used in the classical Monte Carlo method to achieve a significantly higher accuracy. The difference between the quasi-Monte Carlo method and the classical Monte Carlo method is that the former uses quasi-random numbers instead of pseudo-random numbers. There are several methods to construct quasi-random numbers; here we use the Halton sequences proposed by Halton (1960). Bhat (2001) found the error measures of the estimated parameters were smaller using 100 Halton draws than using 1,000 random number draws in mixed logit model estimation.
Halton sequences are based on the base-p number system. Any integer n can be written as

n \equiv n_M n_{M-1} \cdots n_2 n_1 n_0 = n_0 + n_1 p + n_2 p^2 + \cdots + n_M p^M    (6)

where M = [\log_p n] = [\ln n / \ln p], square brackets denoting the integral part; the base p can be any integer except 1; n_i is the digit at position i, 0 \le i \le M, 0 \le n_i \le p - 1; and p^i is the weight of position i. With the base p = 10, the integer n = 468 has n_0 = 8, n_1 = 6, n_2 = 4. Using the base-p number system, we can construct one and only one fraction f that is smaller than 1 by writing n with a different base number system and reversing the order of the digits in n. This is called the radical inverse function and is defined as follows:

f = \phi_p(n) = 0.n_0 n_1 n_2 \cdots n_M = n_0 p^{-1} + n_1 p^{-2} + \cdots + n_M p^{-M-1}    (7)

The Halton sequence of length N is developed from the radical inverse function, and the points of the Halton sequence are \phi_p(n) for n = 1, 2, \ldots, N, where p is a prime number. The k-dimensional sequence is defined as

\varphi_n = (\phi_{p_1}(n), \phi_{p_2}(n), \ldots, \phi_{p_k}(n))    (8)
where p1 ; p2 ; . . . ; pk are prime to each other and are always chosen from the first k primes to achieve a smaller discrepancy. In applications, Halton sequences are used to replace random number generators to produce points in the interval [0,1]. The points of the Halton sequence are generated iteratively. A one-dimensional Halton sequence based on prime p divides 0–1 interval into p segments. It systematically fills in the empty space by iteratively dividing each segment into smaller p segments. The position of the points is determined by the base that is used to construct iteration. A large base implies more points in each iteration, or a long cycle. Due to the high correlation among the initial points of Halton sequence, the first 10 points of the sequences are usually discarded in applications (Morokoff & Caflisch, 1995; Bratley, Fox, & Niederreiter, 1992). Compared to pseudo-random numbers, the coverage of the points of Halton sequence are more uniform, since the pseudo-random numbers may cluster in some areas and leave some areas uncovered. This can be seen from Fig. 1, which is similar to a figure from Bhat (2001). Fig. 1(a) is a plot of 200 points taken from a uniform distribution of two dimensions using pseudo-random numbers. Fig. 1(b) is a plot of 200 points obtained by the Halton sequence. The latter scatters more uniformly on the unit
Fig. 1. Comparing Pseudo-Random Numbers to Halton Sequences. Note: Fig. 1(a) 200 pseudo-random points in two dimensions; Fig. 1(b) 200 points of a two-dimensional Halton sequence generated with primes 2 and 3.
square than the former. Since the points generated from the Halton sequences are deterministic, quasi-Monte Carlo provides a deterministic error bound instead of the probabilistic error bound of the classical Monte Carlo method. This bound is called the "discrepancy" in the literature on NTMs: the smaller the discrepancy, the more evenly the quasi-random numbers spread over the domain. The deterministic error bound of the quasi-Monte Carlo method with Halton sequences is O(N^{-1}(\ln N)^k) (Halton, 1960), which is smaller than the probabilistic error bound O(N^{-1/2}) of the classical Monte Carlo method. The shortcoming of the Halton sequences is that a large number of points is needed to ensure uniform scattering on the domain for large dimensions, usually k \ge 10. This increases the computational time and also leads to high correlation among higher coordinates of the Halton sequences. Higher-dimensional Halton sequences can be refined by scrambling their points, which is explored by Bhat (2003). Monte Carlo simulation methods require random samples from various distributions. A discrepancy-preserving transformation is often applied in quasi-Monte Carlo simulation to transform a set of n quasi-random numbers \{Y_k = (Y_{k1}, \ldots, Y_{ks}),\ k = 1, \ldots, n\}, generated from the s-dimensional unit cube with discrepancy d, to a random variable x with another statistical distribution by solving

x_k = (F_1^{-1}(Y_{k1}), \ldots, F_s^{-1}(Y_{ks})), \quad k = 1, \ldots, n
We achieve the same discrepancy d with respect to FðxÞ, where FðxÞ is an increasing continuous multivariate distribution function FðxÞ ¼ Fðx1 ; . . . ; xs Þ ¼ Psi¼1 F i ðxi Þ, and F i ðxi Þ is the marginal distribution function of x (Fang & Wang, 1994, Chapter 4). Due to the faster convergence rate and fewer draws, less computational time is needed. We apply the Halton sequences with the maximum simulated likelihood method to estimate the mixed logit model. How to choose the number of Halton draws is an issue in application of the Halton sequences. Different researchers provide different suggestions. To determine the number of Halton draws in our experiments, we compare the results of estimated mixed logit parameters with different sets of Halton draws and pseudorandom numbers. Specifically we compare estimator bias, Monte Carlo sampling variability (standard deviations), the average nominal standard errors, the ratio of average nominal standard errors to the Monte Carlo standard deviations, and the RMSE of random coefficient estimates.
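As an illustration of Eqs. (6)-(8) and the inverse-CDF transformation just described, the sketch below generates Halton points via the radical inverse function and maps them to normal coefficient draws. It is a minimal illustration, not the GAUSS routine used in the chapter; the prime choices and the discarding of the first 10 points follow the discussion above, and the (mean, sd) pairs are the coefficient distributions used in the experiments.

```python
import numpy as np
from scipy.stats import norm

def radical_inverse(n, p):
    """Eq. (7): reflect the base-p digits of the integer n about the decimal point."""
    f, base = 0.0, 1.0 / p
    while n > 0:
        n, digit = divmod(n, p)
        f += digit * base
        base /= p
    return f

def halton(n_points, primes=(2, 3), discard=10):
    """Eq. (8): k-dimensional Halton points, dropping the first correlated points."""
    idx = range(1 + discard, n_points + discard + 1)
    return np.array([[radical_inverse(n, p) for p in primes] for n in idx])

# Discrepancy-preserving transformation: map unit-cube points to coefficient
# draws from N(1.5, 0.8^2) and N(2.5, 0.3^2) via the inverse marginal CDFs.
u = halton(100, primes=(2, 3))
beta_draws = norm.ppf(u, loc=[1.5, 2.5], scale=[0.8, 0.3])
```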
3.3. Monte Carlo Experiment Design

Our experiments are based on a mixed logit model that has no intercept term, with one or two coefficients that are independent of each other. In our experiments, each individual faces four mutually exclusive alternatives on one choice occasion. The utility of individual n choosing alternative i is

U_{ni} = \beta_n' x_{ni} + \epsilon_{ni}    (9)

The explanatory variables x_ni for each individual and each alternative are generated from independent standard normal distributions. The coefficients β_n for each individual are generated from the normal distribution N(β̄, σ_β²). The values of x_ni and β_n are held fixed over each experiment design. The choice probability for each individual is generated by comparing the utility of each alternative:

I_{ni}^r = \begin{cases} 1 & \text{if } \beta_n' x_{ni} + \epsilon_{ni}^r > \beta_n' x_{nj} + \epsilon_{nj}^r \ \ \forall\, i \ne j \\ 0 & \text{otherwise} \end{cases}    (10)

The indicator function I_{ni}^r = 1 if individual n chooses alternative i, and is 0 otherwise. The values of the random errors ε_{ni}^r have i.i.d. extreme value Type I distributions, with r denoting the rth draw. We calculate and compare the utility of each alternative, and this process is repeated 1,000 times. The simulated choice probability P_ni for each individual n choosing alternative i is

P_{ni} = \frac{1}{1{,}000} \sum_{r=1}^{1{,}000} I_{ni}^r    (11)

The dependent variable values y_ni are determined by these simulated choice probabilities. In our experiments, we choose the estimation sample size N = 200 and generate 999 Monte Carlo samples for the coefficient distributions with specific means and variances. To test the efficiency of the mixed logit estimators using the Halton sequence, we use 25, 100, and 250 Halton draws and 1,000 pseudo-random draws to estimate the mean and variance of the coefficient distribution.
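The data-generating process of Eqs. (9)-(11) can be sketched as follows for the one-coefficient case. This is an interpretive outline under assumed array shapes, not the authors' experiment code; in particular, the way the simulated probabilities P_ni are turned into observed choices y_ni (sampling one alternative per individual) is our assumption.

```python
import numpy as np

rng = np.random.default_rng(12345)
N, J, R = 200, 4, 1000
beta_mean, beta_sd = 1.5, 0.8

X = rng.standard_normal((N, J))                    # attributes, fixed over the design
beta_n = rng.normal(beta_mean, beta_sd, size=N)    # individual coefficients in Eq. (9)

# Eq. (10): repeat the utility comparison with fresh Type I extreme value errors
eps = rng.gumbel(size=(R, N, J))
U = beta_n[None, :, None] * X[None, :, :] + eps    # (R, N, J) utilities
I = (U.argmax(axis=2)[:, :, None] == np.arange(J)) # indicator of the chosen alternative

P = I.mean(axis=0)                                 # Eq. (11): simulated choice probabilities
y = np.array([rng.choice(J, p=P[n]) for n in range(N)])   # assumed mapping to observed choices
```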
3.4. Findings in Simulation Efficiency

In the one coefficient case, the two parameters of interest are β̄ and σ_β. We denote the estimates of these parameters as β̂ and σ̂_β. We use SAS 9.2 and NLOGIT Version 4.0 to check and compare the results of our GAUSS program. The Monte Carlo experiments were programmed in GAUSS 9.0 using portions of Ken Train's posted GAUSS code [http://elsa.berkeley.edu/~train/software.html] and use NSAM = 999 Monte Carlo samples. Table 1 shows the Monte Carlo average of the estimated mixed logit parameters and the error measures of the mixed logit estimates with one random parameter. Specifically,

\text{MC average of } \hat{\beta} = \sum_i \hat{\beta}_i / NSAM

\text{MC standard deviation (s.d.) of } \hat{\beta} = \sqrt{\sum_i (\hat{\beta}_i - \text{MC average of } \hat{\beta})^2 / (NSAM - 1)}

\text{Average nominal standard error (s.e.) of } \hat{\beta} = \sum_i \sqrt{\widehat{\mathrm{var}}(\hat{\beta}_i)} / NSAM

\text{Root mean squared error (RMSE) of } \hat{\beta} = \sqrt{\sum_i (\hat{\beta}_i - \bar{\beta})^2 / NSAM}

Table 1. The Mixed Logit Model Estimated with Classical Monte Carlo and Quasi-Monte Carlo Estimation: One-Parameter Model.

                                             Classical MC         Quasi-Monte Carlo (Halton draws)
                                             1,000 random draws   25       100      250
Monte Carlo average value of β̂_i              1.486                1.468    1.477    1.477
Monte Carlo average value of σ̂_βi             0.625                0.594    0.606    0.602
Monte Carlo SD of β̂_i                         0.234                0.226    0.233    0.232
Monte Carlo SD of σ̂_βi                        0.363                0.337    0.372    0.375
Average nominal SE of β̂_i                     0.240                0.236    0.237    0.237
Average nominal SE of σ̂_βi                    0.434                0.417    0.447    0.465
Average nominal SE / MC SD of β̂_i             1.026                1.044    1.017    1.021
Average nominal SE / MC SD of σ̂_βi            1.196                1.238    1.202    1.241
RMSE of β̂_i                                   0.234                0.228    0.234    0.233
RMSE of σ̂_βi                                  0.402                0.395    0.419    0.424

Note: The true mean and standard deviation of the distribution of the random coefficient β_n are β̄ = 1.5 and σ_β = 0.8.
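The error measures just defined can be computed directly from the NSAM Monte Carlo estimates; a small illustrative helper (hypothetical variable names) is:

```python
import numpy as np

def error_measures(est, nominal_se, true_value):
    """MC average, MC s.d., average nominal s.e., and RMSE of a vector of estimates."""
    est = np.asarray(est)
    return {
        "mc_average": est.mean(),
        "mc_sd": est.std(ddof=1),                            # deviations from the MC average
        "avg_nominal_se": np.mean(nominal_se),
        "rmse": np.sqrt(np.mean((est - true_value) ** 2)),   # deviations from the true value
    }
```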
From Table 1, increases in the number of Halton draws changes error measures only by small amounts. The number of Halton draws influences the RMSE of mixed logit estimators slightly. The Monte Carlo average value of b^ i underestimates the true parameter b ¼ 1:5 by less than 2%. With Halton draws, the average nominal standard errors of b^ are only 1% larger than the Monte Carlo standard deviations. The average nominal standard errors for s^ b are 20% larger than the Monte Carlo averages. With 100 Halton draws, the ratios of average nominal standard errors to Monte Carlo standard errors are closest to 1. Compared to the classical Monte Carlo estimation, our results confirm the findings of Bhat (2001, p. 691). We can reach almost the same RMSE of estimated parameters with only 100 Halton draws as compared with 1,000 random draws, and the computational time is considerably reduced with 100 Halton draws. Considering the relatively accurate estimation of the standard errors of b^ and s^ b ; which are used to construct t-tests, and the acceptable estimation time, we use 100 Halton draws to estimate the mixed logit parameters in our Monte Carlo experiments in one-parameter designs. In Table 2, we use the same error measures and show the Monte Carlo average values of the estimated mixed logit parameters with two random coefficients. The true mean and standard deviation of new independent random coefficient distributions are 2.5 and 0.3, respectively. In Table 2, with increases in the number of Halton draws, the percentage changes of the Monte Carlo average values of the estimated mixed logit parameters are no more than 1%. Unlike the one random coefficient case, the Monte Carlo average values of estimated means of two independent random coefficient distributions are overestimated by 10%. However, the biases are stable and not sensitive to the number of Halton draws. From Table 2, the average nominal standard errors of b^ 1i and b^ 2i are underestimated and further away from the Monte Carlo standard deviations than in one random coefficient case. The ratios of the average nominal standard errors to the Monte Carlo standard deviations of the estimated parameters are slightly closer to one with 100 Halton draws. Using 100 Halton draws also provides smaller RMSE of the estimated parameters. Based on these results, 100 Halton draws are also used in our two independent random coefficient mixed logit model. All of these factors lead us to conclude that increasing the number of Halton draws in our experiments will not greatly improve the RMSE of estimated mixed logit parameters. Since the convergence rate of the quasiMonte Carlo method with Halton sequences in theory is mainly determined by the structure of the sequences, the simulation error will not considerably decline with increases in the number of Halton draws for each individual.
Table 2. The Mixed Logit Model Estimated with Classical Monte Carlo and Quasi-Monte Carlo Estimation: Two-Parameter Model.
(True values: β̄_1 = 2.5, σ_b1 = 0.3; β̄_2 = 1.5, σ_b2 = 0.8.)

                                       Classical MC         Quasi-Monte Carlo (Halton draws)
                                       1,000 random draws   25       100      250
MC average value β̂_1i                   2.733                2.754    2.732    2.728
MC average value σ̂_b1i                  0.332                0.401    0.318    0.302
MC average value β̂_2i                   1.674                1.680    1.676    1.672
MC average value σ̂_b2i                  0.601                0.615    0.605    0.592
MC s.d. β̂_1i                            0.491                0.497    0.477    0.490
MC s.d. σ̂_b1i                           0.428                0.438    0.435    0.448
MC s.d. β̂_2i                            0.327                0.325    0.316    0.323
MC s.d. σ̂_b2i                           0.439                0.423    0.430    0.447
Average nominal s.e. β̂_1i               0.445                0.450    0.445    0.443
Average nominal s.e. σ̂_b1i              0.737                0.678    0.772    0.833
Average nominal s.e. β̂_2i               0.298                0.300    0.297    0.297
Average nominal s.e. σ̂_b2i              0.512                0.494    0.499    0.537
Average nominal s.e./MC s.d. β̂_1i       0.907                0.906    0.933    0.904
Average nominal s.e./MC s.d. σ̂_b1i      1.721                1.548    1.776    1.859
Average nominal s.e./MC s.d. β̂_2i       0.912                0.923    0.940    0.919
Average nominal s.e./MC s.d. σ̂_b2i      1.167                1.168    1.160    1.202
RMSE β̂_1i                               0.543                0.558    0.531    0.540
RMSE σ̂_b1i                              0.429                0.449    0.435    0.448
RMSE β̂_2i                               0.370                0.371    0.361    0.366
RMSE σ̂_b2i                              0.481                0.461    0.472    0.492
4. PRETEST ESTIMATORS

Even though the mixed logit model is a highly flexible model, it requires the use of simulation to obtain empirical estimates. It is desirable to have a specification test to determine whether the mixed logit is needed or not. The LR and Wald tests are the most popular test procedures used for testing coefficient significance. The problem is that in order to implement these tests the mixed logit model must be estimated. It is much faster to implement the LM test. It is interesting and important to examine the power of these three tests for the presence of the random coefficients in the mixed logit model. We use Monte Carlo experiments in the context of one- and two-parameter
choice models with four alternatives to examine the properties of pretest estimators in the random parameters logit model with LR, LM, and Wald tests. In our experiments, the LR, Wald, and LM tests are constructed based on the null hypothesis H_0: σ_β = 0 against the alternative hypothesis H_1: σ_β > 0. With this one-tail test, when the null hypothesis is true, the parameter lies on the boundary of the parameter space. The asymptotic distribution of the estimator is complex, since the parameter space does not include a neighborhood of zero. The standard theory of the LR, Wald, and LM tests assumes that the true parameter lies inside an open set in the parameter space. Thus, the LR, Wald, and LM statistics do not have the usual chi-square asymptotic distribution. Under the null hypothesis, Gourieroux and Monfort (1995) and Andrews (2001) have shown that the asymptotic distribution of the LR and Wald statistics is a mixture of chi-square distributions. In our experiments, we use their results to analyze the power of the LR, LM, and Wald tests.
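A minimal sketch of the test-then-estimate logic used below follows; the helper names and the way the estimates are obtained are assumptions for illustration, not the authors' implementation. For a single restriction on the boundary, the null distribution is the mixture 0.5·χ²(0) + 0.5·χ²(1), so the one-tail critical value for level α is the (1 − 2α) quantile of χ²(1).

```python
from scipy.stats import chi2

def lr_statistic(loglik_restricted, loglik_unrestricted):
    """Likelihood ratio statistic: conditional logit (restricted) vs. mixed logit."""
    return 2.0 * (loglik_unrestricted - loglik_restricted)

def boundary_critical_value(alpha):
    """One-tail critical value for a single variance restriction on the boundary.

    The 0.5*chi2(0) + 0.5*chi2(1) mixture gives the (1 - 2*alpha) quantile of
    chi2(1); for example, 0.455 at alpha = 0.25.
    """
    return chi2.ppf(1.0 - 2.0 * alpha, df=1)

def pretest_estimate(lr, alpha, beta_conditional_logit, beta_mixed_logit):
    """Pretest estimator of the mean coefficient: keep the restricted (conditional
    logit) estimate unless the test rejects sigma_beta = 0."""
    return beta_mixed_logit if lr > boundary_critical_value(alpha) else beta_conditional_logit
```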
4.1. One-Parameter Model Results

In the one random parameter model, we use four different values for the parameter mean, β̄ ∈ {0.5, 1.5, 2.5, 3.0}. Corresponding to each value of the mean β̄, we use six different values for the standard deviation of the parameter distribution, σ_β ∈ {0, 0.15, 0.3, 0.8, 1.2, 1.8}. The restricted and unrestricted estimates come from the conditional logit and mixed logit models, respectively. The LR, Wald, and LM tests are constructed based on the null hypothesis H_0: σ_β = 0 against the alternative hypothesis H_1: σ_β > 0. The inverse of the information matrix in the Wald and LM tests is estimated using the outer product of gradients. Fig. 2 shows the ratio of the pretest estimator RMSE of β̄ relative to the random parameters logit model estimator RMSE of β̄ using the LR, Wald, and LM tests at a 25% significance level. We choose a 25% significance level because 5% pretests are not optimal in many settings, and this is also true in our experiments. Under the one-tail alternative hypothesis, the distribution of the LR and Wald test statistics is a mixture of chi-square distributions. In the one-parameter case, the (1 − 2α)-quantile of a standard chi-square is the critical value for significance level α (Gourieroux & Monfort, 1995, p. 265). For the 25% significance level the critical value is 0.455.

Fig. 2. Pretest Estimator RMSE of β̄ / Mixed Logit Estimator RMSE of β̄: One Random Parameter Model.

Fig. 2 shows that the pretest estimators based on the LR and Wald statistics have RMSE that is less than that of the random parameters logit model when the parameter variance is small, but that RMSE is larger than that of the random parameters logit model over the remaining parameter space. The LR and Wald tests exhibit properties of consistent tests, with the power approaching one as the specification error increases, so that the pretest estimator is consistent. The ratios of the LM-based pretest estimator RMSE of β̄ to that of the random parameters logit model rise and move further away from one with increases in the standard deviation of the parameter distribution. The poor properties of the LM-based pretest estimator arise from the poor power of the LM test in our experiments. It is interesting that even though the pretest estimators based on the LR and Wald statistics are consistent, the maximum risk ratio based on the LR and Wald tests increases in the parameter mean β̄. The range over which the risk ratio is less than one also increases in the mean of the parameter distribution β̄.

To explore the power of the three tests for the presence of a random coefficient in the mixed logit model further, we calculate the empirical 90th and 95th percentile values of the LR, Wald, and LM statistics given the different combinations of means and standard deviations of the parameter distribution in the one random parameter model. The results in Table 3 show that the Monte Carlo 90th and 95th percentile values of the three tests change with the changes in the mean and standard deviation of the parameter distribution. In general, the Monte Carlo critical values with different parameter means are neither close to 1.64 and 2.71 (the (1 − 2α)-quantiles of the standard chi-square statistic for the 10 and 5% significance levels, respectively) nor to the usual critical values 2.71 and 3.84. When β̄ = 0.5 and σ_β = 0, the 90th and 95th empirical percentiles of LR, Wald, and LM in our experiments are both greater than the asymptotic critical values 1.64 and 2.71. With increases in the true standard deviation of the coefficient distribution, the 90th and 95th empirical percentiles increase for the LR and Wald statistics, indicating that these tests will have some power in choosing the correct model with random coefficients. The corresponding percentile values based on the LM statistic decline, meaning that the LM test has declining power. An interesting feature of Table 3 is that most Monte Carlo critical values based on the LR and Wald statistics decrease in the mean of the coefficient distribution β̄. The results based on the empirical percentiles of the LR, Wald, and LM statistics imply that the rejection rates of the three tests will vary depending on the mean and standard deviation of the parameter distribution.

Table 3. 90th and 95th Empirical Percentiles of Likelihood Ratio, Wald, and Lagrange Multiplier Tests: One Random Parameter Model.
(Within each block, the columns correspond to σ_β = 0.00, 0.15, 0.30, 0.80, 1.20, 1.80.)

β̄ = 0.5
  LR 90th:    1.927 1.749 2.239 6.044 12.940 26.703
  LR 95th:    3.267 2.755 3.420 7.779 15.684 31.347
  Wald 90th:  4.006 3.850 4.722 9.605 14.472 19.225
  Wald 95th:  5.917 5.425 6.210 11.014 15.574 19.950
  LM 90th:    2.628 2.749 2.594 2.155 1.712 1.494
  LM 95th:    3.576 3.862 3.544 3.043 2.344 2.041

β̄ = 1.5
  LR 90th:    1.518 1.541 1.837 5.753 11.604 24.684
  LR 95th:    2.668 2.414 3.364 7.451 13.953 28.374
  Wald 90th:  3.671 3.661 4.361 8.603 12.930 17.680
  Wald 95th:  5.672 5.443 6.578 10.424 13.974 18.455
  LM 90th:    2.762 3.020 3.048 2.496 1.825 1.346
  LM 95th:    3.972 4.158 4.308 3.489 2.376 1.947

β̄ = 2.5
  LR 90th:    0.980 1.020 1.217 2.766 6.321 18.018
  LR 95th:    1.727 1.858 2.235 4.667 8.643 20.828
  Wald 90th:  2.581 2.598 2.751 6.387 9.700 14.895
  Wald 95th:  4.017 4.256 4.616 8.407 11.598 15.822
  LM 90th:    2.978 2.976 3.035 3.119 2.714 2.189
  LM 95th:    4.147 4.317 4.429 4.315 3.832 3.275

β̄ = 3.0
  LR 90th:    1.042 1.040 1.260 2.356 4.610 13.261
  LR 95th:    1.720 1.941 2.114 3.167 6.570 15.622
  Wald 90th:  2.691 2.548 3.068 4.915 8.086 12.960
  Wald 95th:  4.264 4.878 5.124 7.106 10.296 14.052
  LM 90th:    3.455 3.285 3.164 3.073 2.917 2.579
  LM 95th:    4.594 4.441 4.324 4.198 4.224 3.478

Note: Testing H_0: σ_β = 0; one-tail critical values are 1.64 (10%) and 2.71 (5%) compared to the usual values 2.71 and 3.84, respectively.

To get the rejection rates for the three tests, we choose the "corrected" chi-square critical values 1.64 and 2.71 for the 10 and 5% significance levels with one degree of freedom. Table 4 provides the percentage of rejections of the null hypothesis σ_β = 0 using critical values 1.64 and 2.71. When the null hypothesis is true, most empirical percentage rates of the LR test rejecting the true null hypothesis are less than the nominal rejection rates 10 and 5%, and become further away from the nominal rejection rates with increases in the parameter mean β̄. All empirical rejection rates of the Wald and LM tests given a true null hypothesis are greater than the related expected percentage rates. When the number of Monte Carlo samples is increased to 9,999, the results are essentially unchanged. For the case in which β̄ = 0.5, the rejection rate of the 10% LR test is 9.6%, and the 5% test rejects 5% of the time. As the parameter mean β̄ increases, we again see the percentage rejections decline. The Wald and LM test performance is relatively the same. Fig. 3 contains graphs based on the results of Table 4. From Fig. 3, we can see the changes in the rejection rates of these three test statistics with increases in the mean and standard deviation of the parameter distribution, respectively. We find that the rejection frequency of the LR and Wald statistics declines in the mean of the parameter distribution.

Table 4. Rejection Rates of Likelihood Ratio, Wald, and Lagrange Multiplier Tests: One Random Parameter Model.
(Within each block, the columns correspond to σ_β = 0.00, 0.15, 0.30, 0.80, 1.20, 1.80.)

β̄ = 0.5
  s.e.(β̂)^a:    0.123 0.125 0.125 0.135 0.153 0.195
  s.e.(σ̂_β)^a:  0.454 0.461 0.460 0.416 0.391 0.438
  LR 10%^b:     0.122 0.113 0.143 0.472 0.816 0.996
  LR 5%^b:      0.065 0.051 0.072 0.348 0.722 0.989
  Wald 10%^b:   0.219 0.233 0.281 0.665 0.916 1.000
  Wald 5%^b:    0.155 0.164 0.214 0.587 0.882 0.999
  LM 10%^b:     0.204 0.200 0.184 0.161 0.109 0.084
  LM 5%^b:      0.095 0.101 0.093 0.061 0.036 0.021

β̄ = 1.5
  s.e.(β̂)^a:    0.242 0.243 0.243 0.247 0.261 0.291
  s.e.(σ̂_β)^a:  0.593 0.586 0.567 0.439 0.391 0.443
  LR 10%^b:     0.092 0.090 0.115 0.390 0.777 0.995
  LR 5%^b:      0.048 0.042 0.068 0.264 0.659 0.990
  Wald 10%^b:   0.199 0.215 0.236 0.582 0.897 0.999
  Wald 5%^b:    0.139 0.148 0.160 0.461 0.816 0.996
  LM 10%^b:     0.215 0.225 0.233 0.184 0.116 0.075
  LM 5%^b:      0.102 0.116 0.119 0.083 0.037 0.016

β̄ = 2.5
  s.e.(β̂)^a:    0.416 0.416 0.410 0.392 0.392 0.412
  s.e.(σ̂_β)^a:  0.910 0.889 0.853 0.714 0.537 0.453
  LR 10%^b:     0.058 0.064 0.070 0.176 0.471 0.949
  LR 5%^b:      0.022 0.023 0.031 0.106 0.342 0.898
  Wald 10%^b:   0.143 0.146 0.159 0.335 0.641 0.985
  Wald 5%^b:    0.090 0.095 0.101 0.235 0.539 0.959
  LM 10%^b:     0.216 0.221 0.221 0.229 0.221 0.166
  LM 5%^b:      0.111 0.122 0.119 0.121 0.100 0.068

β̄ = 3.0
  s.e.(β̂)^a:    0.519 0.508 0.514 0.489 0.478 0.479
  s.e.(σ̂_β)^a:  1.131 1.062 0.975 0.910 0.701 0.505
  LR 10%^b:     0.052 0.060 0.076 0.135 0.304 0.808
  LR 5%^b:      0.028 0.026 0.030 0.074 0.199 0.714
  Wald 10%^b:   0.139 0.140 0.162 0.256 0.465 0.909
  Wald 5%^b:    0.099 0.096 0.113 0.190 0.389 0.858
  LM 10%^b:     0.229 0.248 0.237 0.226 0.221 0.217
  LM 5%^b:      0.140 0.128 0.130 0.117 0.114 0.095

^a The average nominal standard error of the estimated mean and standard deviation of the random coefficient distribution.
^b Testing H_0: σ_β = 0; one-tail critical values are 1.64 (10%) and 2.71 (5%).
Fig. 3. The Rejection Rates of LR, Wald, and LM Tests: One Random Parameter Model. Note: Testing H_0: σ_β = 0; one-tail critical values are 1.64 (10%) and 2.71 (5%).
Due to the different sizes of the three tests, power comparisons are invalid. We use the Monte Carlo percentile values for each combination of parameter mean and standard deviation as the critical value to correct the size of the three tests. Table 5 provides the size-corrected rejection rates for the three tests. The size-corrected rejection rates for the LR and Wald tests increase in the standard deviation of the coefficient distribution as expected.
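Size correction as described here amounts to replacing the asymptotic critical value with an empirical quantile of the test statistic simulated under the null. A brief sketch (illustrative function and variable names) is:

```python
import numpy as np

def size_corrected_rejection_rate(stats_null, stats_alt, alpha):
    """Reject using the empirical (1 - alpha) quantile of the statistic under H0.

    stats_null : test statistics from Monte Carlo samples generated with sigma_beta = 0
    stats_alt  : test statistics from samples generated under a given (mean, sigma_beta)
    """
    critical_value = np.quantile(stats_null, 1.0 - alpha)   # e.g. 90th or 95th percentile
    return np.mean(np.asarray(stats_alt) > critical_value)  # size-corrected power
```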
Table 5. Size-Corrected Rejection Rates of LR, Wald, and LM Tests: One Random Parameter Model.
(Within each block, the columns correspond to σ_β = 0.00, 0.15, 0.30, 0.80, 1.20, 1.80.)

β̄ = 0.5
  LR 10%:    0.100 0.094 0.121 0.431 0.792 0.995
  LR 5%:     0.050 0.035 0.055 0.287 0.676 0.980
  Wald 10%:  0.100 0.093 0.123 0.498 0.834 0.999
  Wald 5%:   0.050 0.036 0.056 0.336 0.746 0.991
  LM 10%:    0.100 0.108 0.099 0.066 0.040 0.022
  LM 5%:     0.050 0.060 0.049 0.028 0.016 0.005

β̄ = 1.5
  LR 10%:    0.100 0.100 0.124 0.407 0.788 0.995
  LR 5%:     0.050 0.043 0.068 0.269 0.663 0.990
  Wald 10%:  0.100 0.098 0.124 0.383 0.758 0.995
  Wald 5%:   0.050 0.047 0.067 0.240 0.616 0.988
  LM 10%:    0.100 0.112 0.115 0.078 0.035 0.011
  LM 5%:     0.050 0.056 0.058 0.031 0.014 0.005

β̄ = 2.5
  LR 10%:    0.100 0.101 0.119 0.256 0.565 0.971
  LR 5%:     0.050 0.060 0.069 0.166 0.460 0.942
  Wald 10%:  0.100 0.100 0.110 0.242 0.544 0.961
  Wald 5%:   0.050 0.056 0.065 0.173 0.444 0.931
  LM 10%:    0.100 0.099 0.103 0.104 0.082 0.062
  LM 5%:     0.050 0.052 0.057 0.051 0.037 0.022

β̄ = 3.0
  LR 10%:    0.100 0.099 0.120 0.197 0.403 0.873
  LR 5%:     0.050 0.058 0.071 0.133 0.294 0.803
  Wald 10%:  0.100 0.096 0.114 0.192 0.392 0.859
  Wald 5%:   0.050 0.059 0.080 0.121 0.282 0.764
  LM 10%:    0.100 0.089 0.083 0.079 0.072 0.051
  LM 5%:     0.050 0.046 0.042 0.042 0.041 0.031

Note: Testing H_0: σ_β = 0 using the Monte Carlo percentile values as the critical values to adjust the size of the LR, Wald, and LM tests.
Based on these results, there is not too much difference between these two size-corrected tests. The power of these two tests declines with increases in the parameter mean. In our experiments, at the 10 and 5% significance levels, the LM test shows the lowest power for the presence of the random coefficient among the three tests. Graphs in Fig. 4 are based on the results of Table 5. After adjusting the size of the test, the power of LR test declines slowly in the parameter mean. The results of the power of these three tests are consistent with the results of pretest estimators based on these three tests.
Fig. 4. The Size-Corrected Rejection Rates: One Random Parameter Model.
4.2. Two-Parameter Model Results

We expand the model to two parameters. The mean and standard deviation of the added random parameter β_2 are set as 1.5 and 0.8, respectively. We use four different values for the first parameter mean, β̄_1 ∈ {0.5, 1.5, 2.5, 3.0}. For each value of the mean β̄_1, we use six different values for the standard deviation, σ_b1 ∈ {0, 0.15, 0.3, 0.8, 1.2, 1.8}. In the two-parameter model, the LR, Wald, and LM tests are constructed based on the joint null hypothesis H_0: σ_b1 = 0 and σ_b2 = 0 against the alternative hypothesis H_1: σ_b1 > 0, or σ_b2 > 0, or both σ_b1 > 0 and σ_b2 > 0. Fig. 5 shows the ratios of the pretest estimator RMSE of β̄_1 and β̄_2 to the random parameters logit model estimator RMSE of β̄_1 and β̄_2 based on the joint LR, Wald, and LM tests at a 25% significance level. Here we use the standard chi-square critical value for the 25% significance level, 2.773. The joint LR and Wald tests show properties of consistent tests. The maximum risk ratio based on the joint LR and Wald tests still increases in the parameter mean β̄_1. In the two-parameter model, the pretest estimators based on the joint LR and Wald statistics have larger RMSE than that of the random parameters logit model. The properties of the joint LM-based pretest estimator are also poor in the two-parameter model.

Fig. 5. Pretest Estimator RMSE of β̄ / Mixed Logit Estimator RMSE of β̄: Two Random Parameter Model. RMSE of β̂ = \sqrt{\sum [(\hat{\beta}_1 - \bar{\beta}_1)^2 + (\hat{\beta}_2 - \bar{\beta}_2)^2] / NSAM}, where NSAM = 999.

Table 6 reports the 90th and 95th empirical percentiles of the joint LR, Wald, and LM tests. They differ across the different combinations of means and standard deviations. All the empirical 90th and 95th percentile values of the joint LR and Wald tests are much greater than the related standard chi-square statistics 4.605 and 5.991. The Monte Carlo empirical percentiles of the joint LM test are also not close to the standard chi-square statistics. Since the weighted chi-square statistics are even smaller than the standard chi-square statistics, we choose the standard ones to find the rejection rates of the three tests. These results change very little when the number of Monte Carlo samples is increased to 9,999.

Table 6. 90th and 95th Empirical Percentiles of Likelihood Ratio, Wald, and Lagrange Multiplier Tests: Two Random Parameter Model.
(In all blocks β̄_2 = 1.5 and σ_b2 = 0.8; the columns correspond to σ_b1 = 0.00, 0.15, 0.30, 0.80, 1.20, 1.80.)

β̄_1 = 0.5
  LR 90th:    14.634 13.583 13.504 14.961 19.940 29.429
  LR 95th:    17.899 17.001 16.043 17.867 23.966 32.083
  Wald 90th:  13.493 13.148 13.060 12.496 13.536 15.208
  Wald 95th:  14.647 14.118 14.156 13.157 14.305 16.081
  LM 90th:    4.033 4.164 4.208 4.052 4.168 3.989
  LM 95th:    5.196 5.242 5.420 5.062 5.215 5.218

β̄_1 = 1.5
  LR 90th:    14.109 12.645 11.955 12.341 15.529 22.300
  LR 95th:    16.844 15.466 14.415 14.569 17.472 25.700
  Wald 90th:  12.638 11.961 11.498 11.022 11.760 13.321
  Wald 95th:  14.074 13.448 12.641 12.017 12.860 14.155
  LM 90th:    6.329 5.991 5.881 4.480 4.478 4.682
  LM 95th:    8.105 7.689 7.444 5.601 5.699 5.639

β̄_1 = 2.5
  LR 90th:    11.315 10.449 9.998 10.388 14.168 21.625
  LR 95th:    13.966 13.120 12.437 12.690 17.001 24.694
  Wald 90th:  10.161 9.820 9.707 9.554 10.527 12.815
  Wald 95th:  11.439 11.137 10.986 10.657 11.433 13.704
  LM 90th:    5.094 4.920 5.051 4.714 4.552 4.994
  LM 95th:    6.275 6.368 6.230 6.092 5.829 6.248

β̄_1 = 3.0
  LR 90th:    9.713 9.185 8.384 8.219 13.704 20.939
  LR 95th:    12.354 11.450 10.388 10.083 15.917 23.476
  Wald 90th:  8.905 8.493 8.262 8.499 10.058 12.454
  Wald 95th:  10.552 10.215 9.754 10.010 10.967 13.282
  LM 90th:    4.528 4.434 4.245 4.486 4.972 5.273
  LM 95th:    5.729 5.923 5.418 5.716 6.353 6.544

Table 7 shows the rejection rates of the three joint tests based on the standard chi-square statistics for the 10 and 5% significance levels. The results are consistent with Table 6. When the null hypothesis is true, the joint LR and Wald tests reject the true null hypothesis more frequently than the nominal rejection rates 10 and 5%. They become closer to the nominal rejection rates with increases in the parameter mean β̄_1. When β̄_1 = 0.5 and 3.0, the joint LM test rejects the true null hypothesis less often than the nominal rejection rates. However, with β̄_1 = 1.5 and 2.5, it rejects more frequently than the nominal rejection rates 10 and 5%. Fig. 6 shows the graphs based on the results of Table 7. They have almost the same trends as in the one-parameter case. The rejection frequency of the joint LR and Wald statistics decreases in the mean of the parameter distribution β̄_1.

Table 7. Rejection Rates of Likelihood Ratio, Wald, and Lagrange Multiplier Tests: Two Random Parameter Model.
(In all blocks β̄_2 = 1.5 and σ_b2 = 0.8; the columns correspond to σ_b1 = 0.00, 0.15, 0.30, 0.80, 1.20, 1.80.)

β̄_1 = 0.5
  LR 10%:    0.719 0.681 0.668 0.749 0.949 0.999
  LR 5%:     0.594 0.563 0.534 0.631 0.892 0.992
  Wald 10%:  0.890 0.880 0.876 0.920 0.985 1.000
  Wald 5%:   0.824 0.781 0.779 0.823 0.960 0.997
  LM 10%:    0.069 0.077 0.083 0.070 0.077 0.077
  LM 5%:     0.031 0.033 0.031 0.032 0.033 0.030

β̄_1 = 1.5
  LR 10%:    0.600 0.563 0.520 0.620 0.796 0.969
  LR 5%:     0.476 0.430 0.386 0.482 0.672 0.936
  Wald 10%:  0.762 0.728 0.705 0.783 0.914 0.995
  Wald 5%:   0.632 0.603 0.561 0.640 0.816 0.980
  LM 10%:    0.205 0.191 0.176 0.092 0.093 0.105
  LM 5%:     0.114 0.099 0.096 0.039 0.035 0.035

β̄_1 = 2.5
  LR 10%:    0.492 0.451 0.388 0.428 0.755 0.963
  LR 5%:     0.381 0.327 0.284 0.317 0.628 0.928
  Wald 10%:  0.631 0.589 0.540 0.576 0.837 0.982
  Wald 5%:   0.462 0.427 0.352 0.429 0.718 0.954
  LM 10%:    0.127 0.116 0.126 0.105 0.097 0.131
  LM 5%:     0.059 0.059 0.061 0.053 0.044 0.058

β̄_1 = 3.0
  LR 10%:    0.411 0.374 0.333 0.284 0.623 0.965
  LR 5%:     0.291 0.253 0.223 0.188 0.528 0.913
  Wald 10%:  0.502 0.481 0.436 0.400 0.747 0.982
  Wald 5%:   0.332 0.293 0.272 0.244 0.608 0.939
  LM 10%:    0.094 0.092 0.078 0.088 0.119 0.137
  LM 5%:     0.039 0.048 0.032 0.042 0.059 0.067

Fig. 6. The Rejection Rates of LR, Wald, and LM Tests: Two Random Parameter Model.

To compare the power of the three joint tests in the two-parameter case, we also correct the size of the three joint tests using the Monte Carlo empirical critical values for the 10 and 5% significance levels. Table 8 provides the size-corrected rejection rates for the three joint tests. Fig. 7 presents the graphs based on Table 8. As in the one-parameter case, the joint LM test shows the weakest power for the presence of the random coefficient. The power of the joint LR and Wald tests decreases when the parameter mean β̄_1 increases from 0.5 to 1.5. However, the power of these two joint tests increases when the parameter mean β̄_1 increases further to 3.0.
Table 8. The Size-Corrected Rejection Rates of LR, Wald, and LM Tests: Two Random Parameter Model.
(In all blocks β̄_2 = 1.5 and σ_b2 = 0.8; the columns correspond to σ_b1 = 0.00, 0.15, 0.30, 0.80, 1.20, 1.80.)

β̄_1 = 0.5
  LR 10%:    0.100 0.076 0.079 0.108 0.344 0.788
  LR 5%:     0.050 0.034 0.037 0.049 0.185 0.618
  Wald 10%:  0.100 0.077 0.077 0.036 0.105 0.318
  Wald 5%:   0.050 0.032 0.031 0.010 0.040 0.148
  LM 10%:    0.100 0.110 0.113 0.100 0.106 0.099
  LM 5%:     0.050 0.053 0.060 0.046 0.050 0.050

β̄_1 = 1.5
  LR 10%:    0.100 0.065 0.054 0.059 0.145 0.446
  LR 5%:     0.050 0.036 0.028 0.021 0.060 0.287
  Wald 10%:  0.100 0.074 0.050 0.034 0.058 0.163
  Wald 5%:   0.050 0.032 0.013 0.010 0.011 0.052
  LM 10%:    0.100 0.088 0.086 0.034 0.028 0.028
  LM 5%:     0.050 0.040 0.035 0.008 0.010 0.007

β̄_1 = 2.5
  LR 10%:    0.100 0.074 0.071 0.083 0.214 0.609
  LR 5%:     0.050 0.035 0.027 0.029 0.105 0.422
  Wald 10%:  0.100 0.086 0.080 0.066 0.127 0.447
  Wald 5%:   0.050 0.042 0.042 0.027 0.049 0.235
  LM 10%:    0.100 0.090 0.098 0.075 0.076 0.092
  LM 5%:     0.050 0.053 0.048 0.040 0.040 0.049

β̄_1 = 3.0
  LR 10%:    0.100 0.088 0.069 0.056 0.275 0.720
  LR 5%:     0.050 0.036 0.024 0.023 0.151 0.535
  Wald 10%:  0.100 0.085 0.077 0.081 0.229 0.547
  Wald 5%:   0.050 0.044 0.027 0.035 0.064 0.302
  LM 10%:    0.100 0.094 0.081 0.094 0.124 0.145
  LM 5%:     0.050 0.056 0.041 0.048 0.064 0.080
Fig. 7. The Size-Corrected Rejection Rates: Two Random Parameter Model.
5. CONCLUSIONS AND DISCUSSION
Our first finding confirms earlier research by Bhat (2001): by standard metrics, there is no reason to use pseudo-random sampling to compute the required integral for maximum simulated likelihood estimation of the mixed logit model.
It appears that estimation is as accurate with 100 Halton draws as with 1,000 pseudo-random draws.
There are two major findings regarding testing for the presence of random parameters from our Monte Carlo experiments, neither of which we anticipated. First, the LM test should not be used in the random parameters logit model to test whether the parameters are randomly distributed across the population rather than being fixed population parameters. In the one-parameter model Monte Carlo experiment, the size of the LM test is approximately double the nominal level of Type I error. Moreover, the rejection rate decreases as the degree of the specification error rises, which is in direct contrast to the properties of a consistent test. This is the most troubling and disappointing finding, as the LM test is computed in a fraction of a second, while the LR and Wald tests, which require estimation of the mixed logit model, are time consuming even with a limited number of Halton draws. This outcome occurred despite our use of the now well-established adjusted chi-square critical value for one-tail tests on the boundary of a parameter space. It is also not due to programming errors on our part, as our GAUSS code produces estimates and LM test statistic values that are the same, allowing for convergence criteria differences, as those produced by NLOGIT 4.0. In the one-parameter problem the LR test had size close to the nominal level, while the Wald test rejected the true null hypothesis at about twice the nominal level.
Our second finding is that LR and Wald test performance depends on the "signal-to-noise" ratio, that is, the ratio of the mean of the random parameter distribution to its standard deviation. When this ratio is larger, the LR and Wald tests reject the null hypothesis that the parameter is fixed rather than random less frequently. Upon reflection, this makes perfect sense. When the parameter mean is large relative to its standard deviation, the tests have less ability to distinguish between random and fixed parameters: the "skinny" density function of the population parameter looks like a "spike" to the data. When the ratio of the mean of the random parameter distribution to its standard deviation is large, it matters less whether one chooses conditional logit or mixed logit from the point of view of estimating the population-mean parameter. This shows up in lower size-corrected power for the LR and Wald tests when signal is large relative to noise. It also shows up in the risk of the pretest estimator relative to that of the mixed logit estimator: for the portion of the parameter space where the relative risk is greater than one, as the signal increases relative to noise the relative risk function increases, indicating that pretesting is a less preferred strategy.
In the one-parameter case, the LR test is preferred overall. For the cases in which the signal-to-noise ratio is not large, the empirical critical values under the null are at least somewhat close to the one-tail critical values 1.64 (10%) and 2.71 (5%) from the mixture of chi-square distributions. When the signal-to-noise ratio increases, the similarity between the theoretically justified critical values and the test statistic percentiles becomes less clear. The Wald test statistic percentiles are not as close to the theoretically correct values as those of the LR test statistic. The LM test statistic percentiles under the null lie between those of the LR and Wald test statistic distributions, but are not encouragingly close to the theoretically correct values.
In the two random parameter case, we vary the value of one standard deviation parameter, starting from 0, while keeping the other standard deviation parameter fixed at a nonzero value. Thus, we do not observe the rejection rates of the test statistics under the null that both are zero. We observe, however, that the empirical percentiles of the LR and Wald test statistics when one standard deviation is zero are far greater than the $\chi^2(2)$ percentile values 4.605 (10%) and 5.991 (5%). The rejection rates of these two tests under the null, using these two critical values, are greater than 60% when the signal-to-noise ratio is lower, and fall to somewhat more than 30% when the signal-to-noise ratio is larger. Once again the rejection rate profile of the LM test is flat, indicating that it is not more likely to reject the null hypothesis at larger parameter standard deviation values. The "size-corrected" rejection rates are thus not strictly correct. In them we observe that the LR and Wald tests reject at a higher rate at higher signal-to-noise ratios. Further, in the two-parameter case the relative risk of the pretest estimators based on the LR and Wald test statistics is always greater than one. The pretesting strategy is not to be recommended under our Monte Carlo design.
Interesting questions arising from the Monte Carlo experiment results are: (1) why does the power of the LR and Wald tests for the presence of the random coefficient decline in the parameter mean, and (2) how can we refine the LM test in the setting of the random parameters logit model? The LM test was developed by Aitchison and Silvey (1958) and Silvey (1959) in association with the constrained estimation problem. In our setting, the Lagrangian function is
$$\ln L(\theta) + \lambda'\big(c(\theta) - q\big),$$
where $\ln L(\theta)$ is the log-likelihood function, which is maximized subject to the constraints $c(\theta) - q = 0$. The related first-order conditions are
$$\frac{\partial \ln L(\theta)}{\partial \theta} + \frac{\partial c'(\theta)}{\partial \theta}\,\lambda = 0, \qquad c(\theta) - q = 0.$$
Under the standard assumptions of the LM test, we know that
$$\sqrt{n}\,\big(\hat{\theta} - \theta\big) \sim N\big(0,\; I(\theta)^{-1}\big)$$
and
$$n^{-1/2}\,\hat{\lambda} \sim N\!\left(0,\; \left[\frac{\partial c(\theta)}{\partial \theta'}\, I(\theta)^{-1}\, \frac{\partial c'(\theta)}{\partial \theta}\right]^{-1}\right).$$
Based on the first-order conditions of the Lagrangian function, we have
$$\hat{\lambda}'\left[\frac{\partial c(\hat{\theta})}{\partial \theta'}\, I(\hat{\theta})^{-1}\, \frac{\partial c'(\hat{\theta})}{\partial \theta}\right]\hat{\lambda} \;=\; \frac{\partial \ln L(\hat{\theta})}{\partial \theta'}\, I(\hat{\theta})^{-1}\, \frac{\partial \ln L(\hat{\theta})}{\partial \theta}.$$
From these results, the LM statistic has an asymptotic chi-square distribution. The asymptotic distribution of the LM statistic is derived from the distribution of the Lagrange multipliers, which is essentially based on the asymptotic normality of the score vector. In the Lagrangian function, the log-likelihood is subject to equality constraints. The weak power of the LM test for the presence of the random coefficient is caused by its failure to take into account the one-sided nature of the alternative hypothesis. Gourieroux, Holly, and Monfort (1982) and Gourieroux and Monfort (1995) extended the LM test to a Kuhn-Tucker multiplier test and showed that it is asymptotically equivalent to the LR and Wald tests. However, computing the Kuhn-Tucker multiplier test is complicated. In the Kuhn-Tucker multiplier test, a duality problem replaces the two optimization problems with inequality and equality constraints, which is shown as follows:
$$\min_{\lambda}\; \frac{1}{n}\,\big(\lambda - \hat{\lambda}^{0}\big)'\, \frac{\partial g(\hat{\theta}^{0})}{\partial \theta'}\, I(\hat{\theta}^{0})^{-1}\, \frac{\partial g'(\hat{\theta}^{0})}{\partial \theta}\,\big(\lambda - \hat{\lambda}^{0}\big), \qquad \text{subject to } \lambda \geq 0,$$
where $\hat{\theta}^{0}$ and $\hat{\lambda}^{0}$ are the equality-constrained estimators. Compared to the standard LM test, the Kuhn-Tucker multiplier test uses $(\lambda - \hat{\lambda}^{0})$ to adjust
the estimated Lagrange multipliers $\hat{\lambda}^{0}$. How to refine the LM test in the random parameters logit model is a topic for our future research.
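To make the mechanics of the statistic concrete, the following minimal Python sketch (not the authors' GAUSS implementation) computes a generic LM statistic from the score vector and information matrix evaluated at the restricted, equality-constrained estimates. The toy score, information matrix, and degrees of freedom are hypothetical placeholders.

```python
import numpy as np
from scipy.stats import chi2

def lm_statistic(score, information):
    """Generic LM statistic: s' I^{-1} s, evaluated at the restricted estimates."""
    s = np.asarray(score, dtype=float)
    info = np.asarray(information, dtype=float)
    return float(s @ np.linalg.solve(info, s))

# Hypothetical values, for illustration only: score and information at the
# restricted estimate (standard deviation fixed at zero), one restriction tested.
score = np.array([0.0, 0.0, 1.8])          # only the restricted parameter has a nonzero score
information = np.diag([50.0, 40.0, 12.0])  # estimated information matrix

lm = lm_statistic(score, information)
p_value = chi2.sf(lm, df=1)                # ordinary chi-square(1) reference
print(f"LM = {lm:.3f}, chi-square(1) p-value = {p_value:.3f}")
# For a one-tailed test on the boundary, the chapter instead uses the mixture
# 0.5*chi2(0) + 0.5*chi2(1), i.e., critical values 1.64 (10%) and 2.71 (5%).
```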
ACKNOWLEDGMENT
We thank Bill Greene, Tom Fomby, and two referees for their comments. All errors are ours.
REFERENCES
Aitchison, J., & Silvey, S. D. (1958). Maximum likelihood estimation of parameters subject to restraints. Annals of Mathematical Statistics, 29, 813-828.
Andrews, D. W. K. (2001). Testing when a parameter is on the boundary of the maintained hypothesis. Econometrica, 69(3), 683-734.
Bhat, C. R. (2001). Quasi-random maximum simulated likelihood estimation of the mixed multinomial logit model. Transportation Research Part B, 35, 677-693.
Bhat, C. R. (2003). Simulation estimation of mixed discrete choice models using randomized and scrambled Halton sequences. Transportation Research Part B, 37(9), 837-855.
Bratley, P., Fox, B. L., & Niederreiter, H. (1992). Implementation and tests of low-discrepancy sequences. ACM Transactions on Modeling and Computer Simulation, 2, 195-213.
Fang, K., & Wang, Y. (1994). Number-theoretic methods in statistics. London: Chapman and Hall/CRC.
Gourieroux, C., Holly, A., & Monfort, A. (1982). Likelihood ratio test, Wald test, and Kuhn-Tucker test in linear models with inequality constraints on the regression parameters. Econometrica, 50(1), 63-88.
Gourieroux, C., & Monfort, A. (1995). Statistics and econometric models. Cambridge: Cambridge University Press.
Halton, J. H. (1960). On the efficiency of certain quasi-random sequences of points in evaluating multi-dimensional integrals. Numerische Mathematik, 2, 84-90.
Hensher, D., & Greene, W. (2003). The mixed logit model: The state of practice. Transportation, 30(2), 133-176.
Morokoff, W. J., & Caflisch, R. E. C. (1995). Quasi-Monte Carlo integration. Journal of Computational Physics, 122, 218-230.
Niederreiter, H. (1992). Random number generation and quasi-Monte Carlo methods. Philadelphia: Society for Industrial Mathematics.
Silvey, S. D. (1959). The Lagrangian multiplier test. Annals of Mathematical Statistics, 30, 389-407.
Train, K. E. (2003). Discrete choice methods with simulation. Cambridge: Cambridge University Press.
SIMULATED MAXIMUM LIKELIHOOD ESTIMATION OF CONTINUOUS TIME STOCHASTIC VOLATILITY MODELS
Tore Selland Kleppe, Jun Yu and H. J. Skaug
ABSTRACT
In this chapter we develop and implement a method for maximum simulated likelihood estimation of the continuous time stochastic volatility model with constant elasticity of volatility. The approach does not require observations on option prices or on volatility. To integrate out latent volatility from the joint density of return and volatility, a modified efficient importance sampling technique is used after the continuous time model is approximated using the Euler-Maruyama scheme. The Monte Carlo studies show that the method works well, and the empirical applications illustrate the usefulness of the method. Empirical results provide strong evidence against the Heston model.
Kleppe gratefully acknowledges the hospitality during his research visit to Sim Kee Boon Institute for Financial Economics at Singapore Management University. Yu gratefully acknowledges support from the Singapore Ministry of Education AcRF Tier 2 fund under Grant No. T206B4301-RS. We wish to thank two anonymous referees for their helpful comments.
1. INTRODUCTION
Continuous time stochastic volatility (SV) models have proven to be very useful in various respects. For example, it has been found that SV models provide a valuable tool for pricing contingent claims. Seminal contributions include Wiggins (1987), Hull and White (1987), Heston (1993), and Duffie, Pan, and Singleton (2000); see Bakshi, Cao, and Chen (1997) for an empirical analysis of SV models. For another example, SV models have proved successful in describing the time series behavior of financial variables. Important contributions include Andersen and Lund (1997) and Jones (2003).
Unfortunately, maximum likelihood estimation (MLE) of continuous time SV models poses substantial challenges. The first challenge lies in the fact that the joint transition density of price (or return) and volatility is typically unknown in closed form. This is a well-known problem in the continuous time literature (see Aït-Sahalia, 2002; Phillips & Yu, 2009a). The second challenge is that when only the time series of spot prices is observed, volatility has to be integrated out of the joint transition density. Such integrals are analytically unknown and have to be calculated numerically. The dimension of integration is the same as the number of observations. When the number of observations is large, which is typical in practical applications, the numerical integration is difficult.
In recent years, solutions have been provided to navigate these challenges. To deal with the second challenge, for example, Jones (2003) and Aït-Sahalia and Kimmel (2007) proposed estimating the model using data from both the underlying spot and options markets. Option price data are used to extract volatility, making the integration of volatility out of the joint transition density unnecessary. To deal with the first challenge, Jones (2003) suggested using in-filled Euler-Maruyama (EM) approximations that enable a Gaussian approximation to the joint transition density, whereas Aït-Sahalia and Kimmel (2007) advocated a closed-form polynomial approximation that can approximate the true joint transition density arbitrarily well. With the two problems circumvented, full likelihood-based inference is possible. For example, the method of Aït-Sahalia and Kimmel (2007) is frequentist, whereas the method of Jones (2003) is Bayesian.
It is well known that option prices are derived under the risk-neutral measure. Consequently, a benefit of using data from both spot and options markets jointly is that one can learn about the physical as well as the risk-neutral measure. However, this benefit comes at a cost. To connect the physical and risk-neutral measures, the functional form of the market price of risk has to be specified.
In this chapter, we develop and implement a method for maximum simulated likelihood estimation of the continuous time SV model with constant elasticity of volatility (CEV-SV). The approach does not require observations of option prices or volatility, and hence there is no need to specify the functional form of the market price of risk. As a result, we only learn about the physical measure. The CEV-SV model was first proposed by Jones (2003) as a simple way to nest some standard continuous time SV models, such as the square root SV model of Heston (1993) and the GARCH diffusion model of Nelson (1990). To our knowledge, the present chapter constitutes the first time ML is used to estimate the CEV-SV model using the spot price only. To deal with the second challenge, we propose to use a modified efficient importance sampling (EIS) algorithm, originally developed in Richard and Zhang (2007), to integrate out the latent volatility process. To deal with the first challenge, we consider the EM approximation. We examine the performance of the proposed maximum simulated likelihood method using both simulated data and real data. Based on the simulation results, we find that the algorithm performs well. The empirical illustration suggests that the Heston square root SV model does not fit the data well. This empirical finding reinforces those of Andersen, Benzoni, and Lund (2002), Jones (2003), Aït-Sahalia and Kimmel (2007), and others.
This chapter is organized as follows. Section 2 discusses the model and introduces the estimation method. Section 3 tests the accuracy of the method by performing Monte Carlo (MC) simulations for the square root SV model of Heston (1993), the GARCH diffusion model of Nelson (1990), and the CEV-SV model. In Section 4, we apply this estimation method to real data for the three SV models and analyze and compare the empirical results. A comparison with the log-normal (LN) model is also performed. Section 5 concludes.
2. MODEL AND METHODOLOGY
This section first presents the CEV-SV model under consideration and then outlines the MC procedure used to do likelihood analysis when only the price process is observed.
2.1. SV Model with Constant Elasticity of Volatility
The continuous time CEV model was recently proposed to model stochastic volatility (see, e.g., Jones, 2003; Aït-Sahalia & Kimmel, 2007).
Although we mainly focus on the CEV-SV model in this chapter, the proposed approach is applicable more generally. Let $s_t$ and $v_t$ denote the log-price of some asset and the volatility, respectively, at some time $t$. Then the CEV model is specified in terms of the Itô stochastic differential equation
$$d\begin{bmatrix} s_t \\ v_t \end{bmatrix} = \begin{bmatrix} a + b v_t \\ \alpha + \beta v_t \end{bmatrix} dt + \begin{bmatrix} \sqrt{(1-\rho^2)\,v_t} & \rho\sqrt{v_t} \\ 0 & \sigma v_t^{\gamma} \end{bmatrix}\begin{bmatrix} dB_{t,1} \\ dB_{t,2} \end{bmatrix}. \tag{1}$$
Here $B_{t,1}$ and $B_{t,2}$ denote a pair of independent canonical Brownian motions. The parameters $\theta = [\alpha, \beta, \sigma, \rho, \gamma, a, b]$ have the restrictions $\alpha > 0$, $\rho \in (-1, 1)$, $\gamma \geq 1/2$, and $\beta < 0$ whenever $\gamma \geq 1$ (see Jones (2003) for a treatment of the volatility process for $\gamma > 1$). In addition, for $\gamma = 1/2$ we have the restriction $2\alpha > \sigma^2$ to ensure that $v_t$ stays strictly positive (Cox, Ingersoll, & Ross, 1985). The CEV model nests the affine SV model of Heston (1993) ($\gamma = 1/2$) and the GARCH diffusion model of Nelson (1990) ($\gamma = 1$), and we will treat these special cases separately in addition to the full CEV model. Parameters $\alpha$ and $\beta$ characterize the linear drift structure of the volatility, in which $-\alpha/\beta$ is the mean and $\beta$ captures the mean reversion rate. Parameter $\sigma$ is the volatility-of-volatility and $\rho$ represents the leverage effect. Parameter $\gamma$ is the CEV elasticity. Parameters $a$ and $b$ represent, respectively, the long run drift and risk premium of the price process.
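To fix ideas, the short Python sketch below simulates one path of $(s_t, v_t)$ from a model of the form of Eq. (1) using a simple Euler-Maruyama step, the same type of scheme the chapter adopts later for discretization. The parameter values and the crude positivity floor are illustrative choices of ours, not estimates or code from the chapter.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative parameter values only (roughly Heston-like, with gamma = 1/2).
alpha, beta, sigma, rho, gamma = 0.2, -8.0, 0.4, -0.3, 0.5
a, b = 0.05, 1.5
delta = 1.0 / 252.0                      # daily steps on a yearly time scale
n_steps = 2022

s = np.empty(n_steps + 1)
v = np.empty(n_steps + 1)
s[0], v[0] = 0.0, -alpha / beta          # start volatility at its long-run mean

for i in range(n_steps):
    e1, e2 = rng.standard_normal(2)
    dv = (alpha + beta * v[i]) * delta + sigma * v[i] ** gamma * np.sqrt(delta) * e2
    ds = (a + b * v[i]) * delta + np.sqrt(delta) * (
        np.sqrt((1.0 - rho**2) * v[i]) * e1 + rho * np.sqrt(v[i]) * e2
    )
    v[i + 1] = max(v[i] + dv, 1e-8)      # crude floor to keep the Euler path positive
    s[i + 1] = s[i] + ds

returns = np.diff(s)                     # log-returns x_i = s_i - s_{i-1}
print(returns[:5], v[:5])
```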
2.2. A Change of Variable and Time Discretization
Under the parameter constraints described above, the volatility process $v_t$ is strictly positive with probability one. The importance sampling procedure proposed here uses (locally) Gaussian importance densities for $v_t$, and thus the supports of $v_t$ and the importance density are inherently conflicting. To remove this boundary restriction, we shall work with the logarithm of the volatility process. As the latent process gets integrated out, the actual representation of the volatility (or the log-volatility) is irrelevant theoretically, but it is very important for the construction of EIS procedures, as will be clear below. In addition, this change of variable will influence the properties of the time-discretization scheme that will be discussed shortly.
Define $z_t = \log(v_t)$. Then, by Itô's lemma, we have
$$d\begin{bmatrix} s_t \\ z_t \end{bmatrix} = \begin{bmatrix} a + b\exp(z_t) \\ M(z_t) \end{bmatrix} dt + \begin{bmatrix} \sqrt{1-\rho^2}\,\exp(z_t/2) & \rho\exp(z_t/2) \\ 0 & \sigma\exp(z_t(\gamma-1)) \end{bmatrix}\begin{bmatrix} dB_{t,1} \\ dB_{t,2} \end{bmatrix}, \tag{2}$$
where $M(z_t) = \beta + \alpha\exp(-z_t) - \sigma^2\exp(2z_t(\gamma-1))/2$. Clearly, the law of $s_t$ is unaltered, but the latent process $z_t$ now has support over the real line.
As the transition probability density (TPD) in the general case is not known under either representation (Eq. (1) or Eq. (2)), an approximation is needed. This is achieved by defining a discrete time model that acts as an approximation to Eq. (2) based on the EM scheme using a time-step equal to $\Delta$ time units. This discrete time process is given as the nonlinear and heteroskedastic auto-regression
$$\begin{bmatrix} s_{i+1} \\ z_{i+1} \end{bmatrix} = \begin{bmatrix} s_i + \Delta(a + b\exp(z_i)) \\ z_i + \Delta M(z_i) \end{bmatrix} + \sqrt{\Delta}\begin{bmatrix} \sqrt{1-\rho^2}\,\exp(z_i/2) & \rho\exp(z_i/2) \\ 0 & \sigma\exp(z_i(\gamma-1)) \end{bmatrix}\begin{bmatrix} \epsilon_{i,1} \\ \epsilon_{i,2} \end{bmatrix},$$
where $[\epsilon_{i,1}, \epsilon_{i,2}]$ are temporally independent bivariate standard normal shocks. It is convenient to work with the log-returns of the price, so we define $x_i = s_i - s_{i-1}$, as this process is stationary. Hence, the discrete time dynamics for $x_i$ are given by
$$\begin{bmatrix} x_{i+1} \\ z_{i+1} \end{bmatrix} = \begin{bmatrix} \Delta(a + b\exp(z_i)) \\ z_i + \Delta M(z_i) \end{bmatrix} + \sqrt{\Delta}\begin{bmatrix} \sqrt{1-\rho^2}\,\exp(z_i/2) & \rho\exp(z_i/2) \\ 0 & \sigma\exp(z_i(\gamma-1)) \end{bmatrix}\begin{bmatrix} \epsilon_{i,1} \\ \epsilon_{i,2} \end{bmatrix}. \tag{3}$$
Throughout the rest of this chapter, Eq. (3) will be the model that we shall work with. Several authors (see, e.g., Aït-Sahalia, 2002; Durham & Gallant, 2002; Durham, 2006) have argued that one should transform the latent process (instead of the log-transform applied here) in such a manner that it becomes a (nonlinear) Ornstein-Uhlenbeck process, i.e., one with a homoskedastic error term in the latent process. This variance stabilization transform is given
(see, e.g., Rao, 1999, p. 210), up to an affine transformation, by
$$Z(v) = \begin{cases} \log(v) & \text{if } \gamma = 1, \\ \dfrac{v^{1-\gamma} - 1}{1-\gamma} & \text{otherwise.} \end{cases}$$
However, excluding $\gamma = 1$, the variance stabilizing transform does not completely solve the finite boundary problem on the domain of the transformed volatility $Z(v_t)$. Thus, the Gaussian approximation obtained from the EM scheme will have a support that conflicts with the continuous time model. In Section 3, some MC experiments are conducted to verify that an approximate likelihood function based on the EM scheme (Eq. (3)) with observed log-volatility does not lead to unacceptable biases for sample sizes and time steps that are relevant in practice. Another reason for using the variance stabilization procedure would be to bring the posterior density (i.e., the density of the latent process given the observed price returns and parameters) closer to a multivariate Gaussian, as in Durham (2006) in the context of an EM discretization of the Heston model. This should in theory pave the way for using a Laplace approximation-based importance density (see, e.g., Shephard & Pitt, 1997), i.e., a multivariate Gaussian, to calculate the marginal likelihood of the data. By using the EIS procedure outlined below, there is no need to bring the posterior density closer to normality globally, as our importance density is only locally Gaussian. Thus, the argument against using the logarithm for all $\gamma$ does not apply here. We therefore conclude this discussion and use the logarithm throughout the rest of the chapter.
2.3. TPDs and Joint Densities
Assume that we have $n$ observations of $x_i$, i.e., $x = [x_1, \ldots, x_n]$ (see Note 1), sampled discretely over a regular time grid with $\Delta$ time units between the time-points. More general deterministic time grids are possible. Further, denote the unobserved vector of $z_i$'s at the corresponding times as $z = [z_1, \ldots, z_n]$. For simplicity, we assume for now that $z_0$ is a known constant. The marginal distribution of $z_0$ is not known in closed form in the general case, and in practice we will estimate $z_0$ along with the parameter vector $\theta$ by maximum likelihood. Let $f_i = f_i(z_i, x_i \mid z_{i-1}, \theta, \Delta)$ denote the Gaussian TPD of the discrete time process Eq. (3). From the specification, it is evident that $f_i$ is a bivariate
Gaussian density with mean vector and covariance matrix
$$\begin{bmatrix} \Delta(a + b\exp(z_{i-1})) \\ z_{i-1} + \Delta M(z_{i-1}) \end{bmatrix} \quad\text{and}\quad \Delta\begin{bmatrix} \exp(z_{i-1}) & \sigma\rho\exp\!\big(\tfrac{z_{i-1}}{2}(2\gamma-1)\big) \\ \sigma\rho\exp\!\big(\tfrac{z_{i-1}}{2}(2\gamma-1)\big) & \sigma^2\exp(2z_{i-1}(\gamma-1)) \end{bmatrix},$$
respectively. Exploiting the Markov structure of the discretized model, the joint density of $(z, x)$ is given by
$$p(z, x \mid \theta, z_0, \Delta) = \prod_{i=1}^{n} f_i(z_i, x_i \mid z_{i-1}, \theta, \Delta). \tag{4}$$
Clearly, this expression should also be regarded as an approximation to the continuous time joint density obtained when the $f_i$'s are exchanged with the (unknown) exact transition densities. The approximation is known to converge strongly as $\Delta \to 0$ (Kloeden & Platen, 1999).
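Since each term in Eq. (4) is a bivariate Gaussian log-density, the joint log-density is a simple sum along the path. The following Python sketch is a hypothetical helper written by us under the parameterization above; it is not the authors' code, and the toy inputs are for shape-checking only.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_log_joint(x, z, z0, theta, delta):
    """log p(z, x | theta, z0, delta) under the EM discretization, cf. Eq. (4)."""
    alpha, beta, sigma, rho, gamma, a, b = theta
    logp = 0.0
    z_prev = z0
    for xi, zi in zip(x, z):
        ez = np.exp(z_prev)
        M = beta + alpha * np.exp(-z_prev) - 0.5 * sigma**2 * np.exp(2 * z_prev * (gamma - 1))
        mean = np.array([delta * (a + b * ez), z_prev + delta * M])
        cxz = sigma * rho * np.exp(0.5 * z_prev * (2 * gamma - 1))
        cov = delta * np.array([[ez, cxz],
                                [cxz, sigma**2 * np.exp(2 * z_prev * (gamma - 1))]])
        logp += multivariate_normal.logpdf([xi, zi], mean=mean, cov=cov)
        z_prev = zi
    return logp

# Toy call with made-up inputs (illustrative numbers only).
theta = (0.2, -8.0, 0.4, -0.3, 0.5, 0.05, 1.5)
print(em_log_joint(x=[0.01, -0.02], z=[-3.6, -3.7], z0=-3.7, theta=theta, delta=1 / 252))
```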
2.4. Monte Carlo Evaluation of the Marginal Likelihood
Since the log-volatility $z$ is unobserved, approximate evaluation (based on the EM discretization) of the likelihood function for given values of $\theta$ and $z_0$ involves an integral over $z$, say
$$l(\theta, z_0 \mid x) = \int p(z, x \mid \theta, z_0)\, dz.$$
Due to the nonlinear structure of the discrete time model Eq. (3), no closed form expression for this integral is at hand, and hence numerical methods are generally needed. Since the dimension of the integral is typically of the order of 1,000-10,000, quadrature rules are of little use here. Instead, we apply an importance sampling technique where the importance density is constructed using the EIS algorithm of Richard and Zhang (2007). The EIS algorithm (approximately) minimizes the MC variance within a parametric class of auxiliary importance densities, say $m(z \mid a, x, z_0)$, indexed by the $n \times 2$-dimensional parameter $a$. We denote the optimal choice of $a$ by $\hat{a}$ (see Note 2). Further, we refer to $m(z \mid 0, x, z_0)$ as the baseline importance density, where $a = 0$ denotes $a$ with all elements equal to zero.
In this work, we restrict the importance densities to have the form
$$m(z \mid a, x, z_0) = \prod_{i=1}^{n} m_i(z_i \mid z_{i-1}, x_i, a_i).$$
Note that we allow the importance density to depend explicitly on the observed vector $x$. The weak law of large numbers for $S \to \infty$ suggests that $l(\theta, z_0 \mid x)$ may be approximated by
$$\tilde{l}(\theta, z_0 \mid x, a) = \frac{1}{S}\sum_{j=1}^{S} \frac{p(\tilde{z}^{(j)}, x \mid \theta, z_0)}{m(\tilde{z}^{(j)} \mid a, x, z_0)}, \tag{5}$$
where $\tilde{z}^{(j)}$, $j = 1, \ldots, S$, are drawn from $m(z \mid a, x, z_0)$. This law of large numbers applies in particular to $\hat{a}$ obtained using the EIS algorithm, so that the variance of $\tilde{l}$ is approximately minimized. Thus, the approximate MLE estimator will have the form
$$(\hat{\theta}, \hat{z}_0) = \arg\max_{(\theta, z_0)} \log \tilde{l}(\theta, z_0 \mid x, \hat{a}), \tag{6}$$
where the logarithm is taken for numerical convenience.
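As a simple illustration of the estimator in Eq. (5), the Python fragment below estimates a one-dimensional marginal likelihood by importance sampling. The toy model, with a Gaussian latent variable and a Gaussian importance density, is a stand-in of ours for $p(z, x \mid \theta, z_0)$ and $m(z \mid a, x, z_0)$; it is a generic sketch, not the chapter's FORTRAN90 implementation.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Toy joint density p(z, x): latent z ~ N(0, 1), observed x | z ~ N(z, 0.5^2).
def log_p_joint(z, x):
    return norm.logpdf(z, 0.0, 1.0) + norm.logpdf(x, z, 0.5)

x_obs = 0.7
S = 32  # number of importance draws, mirroring the S = 32 paths used in the chapter

# Importance density m(z): a Gaussian roughly centred on the posterior of z.
m_mean, m_std = 0.5, 0.7
z_draws = rng.normal(m_mean, m_std, size=S)

log_weights = log_p_joint(z_draws, x_obs) - norm.logpdf(z_draws, m_mean, m_std)
# Average the weights (log-sum-exp for numerical stability), cf. Eq. (5).
log_lik_hat = np.logaddexp.reduce(log_weights) - np.log(S)

# Exact value for this toy model: x ~ N(0, 1 + 0.5^2), so the two numbers should be close.
print(log_lik_hat, norm.logpdf(x_obs, 0.0, np.sqrt(1.0 + 0.25)))
```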
2.4.1. The Baseline Importance Density
Typically $m(z \mid a, x, z_0)$ is taken to be a parametric extension of the so-called natural sampler (i.e., $p(z \mid \theta, z_0)$) (see, e.g., Liesenfeld & Richard, 2003; Liesenfeld & Richard, 2006; Richard & Zhang, 2007; Bauwens & Galli, 2009). In this chapter, we depart from this practice by introducing information from the data into the baseline importance density. More precisely, we define
$$f_i(z_i \mid z_{i-1}, x_i, \theta, \Delta) = \frac{f_i(z_i, x_i \mid z_{i-1}, \theta, \Delta)}{\int f_i(z_i, x_i \mid z_{i-1}, \theta, \Delta)\, dz_i},$$
i.e., the conditional transition densities given $x_i$, and set $m_i(z_i \mid z_{i-1}, x_i, 0_i) = f_i(z_i \mid z_{i-1}, x_i, \theta, \Delta)$ (see Note 3). Since $f_i(z_i, x_i \mid z_{i-1}, \theta, \Delta)$ is a bivariate Gaussian density, $f_i(z_i \mid x_i, z_{i-1}, \theta, \Delta)$ is also Gaussian, with mean and standard deviation given by
$$m_{0i}(z_{i-1}, x_i) = z_{i-1} + \Delta M(z_{i-1}) + \sigma\rho\,\big(x_i - \Delta(a + b\exp(z_{i-1}))\big)\exp\!\big(z_{i-1}\big(\gamma - \tfrac{3}{2}\big)\big),$$
$$S_{0i}(z_{i-1}) = \sigma\sqrt{\Delta(1-\rho^2)}\,\exp(z_{i-1}(\gamma - 1)),$$
respectively.
2.4.2. The Parametrically Extended Importance Density
The baseline importance density is in itself a valid, but not very efficient, importance density. Consequently, we parametrically extend it to introduce more flexibility. Following Richard and Zhang (2007), each factor of the baseline importance density is (conditionally on $z_{i-1}$) perturbed within the univariate Gaussian family of distributions. Staying within the Gaussian family is numerically advantageous because sampling from $m$ then becomes fast and conceptually simple. The extension is done by multiplying $m_i(z_i \mid z_{i-1}, x_i, 0_i)$ by $\exp(a_{i,1} z_i + a_{i,2} z_i^2)$ and compensating with the appropriate normalization factor. More precisely, we write $m_i$ as
$$m_i(z_i \mid z_{i-1}, x_i, a_i) = \frac{B_i(z_i \mid z_{i-1}, x_i)\, c_i(z_i, a_i)}{w_i(z_{i-1}, x_i, a_i)},$$
where
$$\log B_i(z_i \mid z_{i-1}, x_i) = -\frac{(z_i - m_{0i}(z_{i-1}, x_i))^2}{2 S_{0i}(z_{i-1})^2},$$
$$\log c_i(z_i, a_i) = a_{i,1} z_i + a_{i,2} z_i^2,$$
$$w_i(z_{i-1}, x_i, a_i) = \int B_i(z_i \mid z_{i-1}, x_i)\, c_i(z_i, a_i)\, dz_i.$$
An explicit expression for $w_i$ is given in Appendix A. The mean and the standard deviation of $m_i(z_i \mid z_{i-1}, x_i, a_i)$ that are used when sampling from $m(z \mid a, x, z_0)$ have the form
$$m_{a_i}(z_{i-1}, x_i) = \frac{m_{0i}(z_{i-1}, x_i) + a_{i,1} S_{0i}(z_{i-1})^2}{1 - 2 a_{i,2} S_{0i}(z_{i-1})^2}, \tag{7}$$
$$S_{a_i}(z_{i-1}) = \frac{S_{0i}(z_{i-1})}{\sqrt{1 - 2 a_{i,2} S_{0i}(z_{i-1})^2}}. \tag{8}$$
For each $m_i$ to have finite variance, it is clear from Eq. (8) that $a_{i,2}$ must satisfy the restriction $a_{i,2} < 1/(2 S_{0i}^2)$.
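The update in Eqs. (7) and (8) is just a completion of the square. The short sketch below, a hypothetical helper of ours rather than the authors' code, computes the tilted mean and standard deviation from the baseline moments and enforces the finite-variance restriction $a_{i,2} < 1/(2 S_{0i}^2)$.

```python
import math

def tilted_moments(m0, s0, a1, a2):
    """Mean and std of N(m0, s0^2) tilted by exp(a1*z + a2*z^2), cf. Eqs. (7)-(8)."""
    if a2 >= 1.0 / (2.0 * s0**2):
        raise ValueError("a2 violates the finite-variance restriction a2 < 1/(2*S0^2)")
    denom = 1.0 - 2.0 * a2 * s0**2
    m_a = (m0 + a1 * s0**2) / denom
    s_a = s0 / math.sqrt(denom)
    return m_a, s_a

# Example with illustrative numbers only: baseline moments and a small tilt.
print(tilted_moments(m0=-1.0, s0=0.3, a1=0.2, a2=-0.5))
```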
2.4.3. The EIS Regressions
The final piece of notation introduced is the fraction
$$\xi_i(z_{i-1}, x_i) = \frac{f_i(z_i, x_i \mid z_{i-1}, \theta, \Delta)}{B_i(z_i \mid z_{i-1}, x_i)}.$$
As $B_i$ is the shape of the conditional density $f_i(z_i \mid z_{i-1}, x_i, \theta, \Delta)$, $\xi_i(z_{i-1}, x_i)$ is constant as a function of $z_i$. The expression for $\xi_i$ is given in Appendix A. Using this notation, we have
$$\frac{p(z, x \mid \theta, z_0)}{m(z \mid a, x, z_0)} = \prod_{i=1}^{n} \frac{f_i(z_i, x_i \mid z_{i-1}, \theta)}{m_i(z_i \mid z_{i-1}, x_i, a_i)} = \prod_{i=1}^{n} \frac{\xi_i(z_{i-1}, x_i)\, w_i(z_{i-1}, x_i, a_i)}{c_i(z_i, a_i)}$$
$$= \xi_1(z_0, x_1)\, w_1(z_0, x_1, a_1)\left[\prod_{i=1}^{n-1} \frac{\xi_{i+1}(z_i, x_{i+1})\, w_{i+1}(z_i, x_{i+1}, a_{i+1})}{c_i(z_i, a_i)}\right]\frac{1}{c_n(z_n, a_n)}. \tag{9}$$
This last representation enables us to work out how the parameter $a$ should be chosen to minimize MC variance using EIS-type regressions. First, we set $a_n = [0, 0]$ so that the last fraction is equal to 1 for all $z_n$. In fact, setting $a_n$ to zero effectively integrates out $z_n$ analytically, and thus for $n = 1$ the procedure is exact. Second, under the assumption that $z_0$ is non-stochastic, $\xi_1 w_1$ is also constant for fixed values of $z_0$ and does not add to the variance of the importance sampling procedure. Finally, we notice that the log-variation (as a function of $z$) of each of the factors in the bracketed product of Eq. (9) depends only on a single $z_i$. This gives rise to a recursive set of EIS ordinary least squares regressions of the form
$$\log \xi_{i+1}(\tilde{z}_i^{(j)}, x_{i+1}) + \log w_{i+1}(\tilde{z}_i^{(j)}, x_{i+1}, a_{i+1}) = c_i + \log c_i(\tilde{z}_i^{(j)}, a_i) + \eta_i^{(j)} = c_i + a_{i,1}\tilde{z}_i^{(j)} + a_{i,2}\big(\tilde{z}_i^{(j)}\big)^2 + \eta_i^{(j)}, \quad i = 1, \ldots, n-1,\; j = 1, \ldots, S, \tag{10}$$
where $\eta_i^{(j)}$ are the regression residuals. The constant term $c_i$ is estimated jointly with $a_i$. In particular, we notice that the regressions are linear in $a_i$, suggesting that computationally efficient linear least squares algorithms may be applied. Note also that the EIS regressions need to be calculated backwards in time, as the $i$th regression depends on $a_{i+1}$. The MC variance of Eq. (5), represented by $\eta_i^{(j)}$, stems from the fact that the left-hand side of Eq. (10) is nonquadratic (in $z_i$) and thus deviates from the quadratic model represented by $\log c_i$.
Still, since the $\tilde{z}_i^{(j)}$, $j = 1, \ldots, S$, are typically strongly located by the information provided from the baseline density, the quadratic approximation works reasonably well.
A fortunate by-product of the EIS regressions is that the log-weights in the likelihood estimate Eq. (5) are directly expressible in terms of the regression residuals. More precisely, Eq. (9) provides us with the following expression:
$$\log \frac{p(\tilde{z}^{(j)}, x \mid \theta, z_0)}{m(\tilde{z}^{(j)} \mid a, x, z_0)} = \log(\xi_1 w_1) + \sum_{i=1}^{n-1}\big[c_i + \eta_i^{(j)}\big], \quad j = 1, \ldots, S, \tag{11}$$
provided that we have set $a_n$ to zero. Thus, the estimate of the likelihood function can be calculated with very little effort once the relevant quantities in the regression models are calculated.
2.4.4. Iterative EIS and Implementation
Since the $\tilde{z}^{(j)}$'s depend on $a$, and $a_{i+1}$ needs to be known to calculate $a_i$, we may treat the regressions (Eq. (10)) as a fixed point condition for $\hat{a}$, toward which we generate a convergent sequence of $a^{(k)}$ for integers $k$. This is done using the following steps:
1. Set $a^{(0)} = 0$, $k = 0$, and let $w \in \mathbb{R}^{n \times S}$ denote a matrix filled with independent standard normal variates.
2. Simulate the paths $\tilde{z}^{(j)} = \tilde{z}^{(j)}(a^{(k)})$, $j = 1, \ldots, S$, forward in time (i.e., for $i = 1 \to n-1$) using
$$\tilde{z}_i^{(j)} = m_{a_i^{(k)}}(\tilde{z}_{i-1}^{(j)}, x_i) + S_{a_i^{(k)}}(\tilde{z}_{i-1}^{(j)})\, w_{i,j}, \quad j = 1, \ldots, S,\; i = 1, \ldots, n-1,$$
where for simplicity we define $\tilde{z}_0^{(j)} = z_0$.
3. Calculate $a_i^{(k+1)}$ backwards in time (i.e., for $i = n-1 \to 1$) by estimating the regression models (Eq. (10)) based on $a_{i+1}^{(k+1)}$ and the paths $\tilde{z}^{(j)}(a^{(k)})$.
4. Calculate the logarithm of the likelihood estimate Eq. (9) using the quantities calculated for the regressions in step 3, and stop the iteration if this estimate has converged to the desired precision.
5. Set $k \leftarrow k + 1$ and return to step 2.
Following Richard and Zhang (2007), we apply the same set of canonical standard normal variates $w$ to generate the paths in step 2 for each iteration. Moreover, this same set of canonical variates is used for each evaluation of the simulated log-likelihood function when doing the likelihood maximization. This usage of common random numbers ensures the smoothness of the
simulated log-likelihood function and allows us to apply a BFGS quasi-Newton optimizer (Nocedal & Wright, 1999) based on finite difference gradients. Another measure to keep the simulated log-likelihood function smooth is to terminate the EIS iteration when the change (from iteration $k$ to $k+1$) in the log-likelihood value is a small number. We have used a change in log-likelihood value of less than 1.0e-9 as our stopping criterion. The choice to apply a gradient-based optimization algorithm stems from the fact that the model has up to eight parameters, and a simplex-type optimization algorithm will generally require too many function evaluations to converge for such problems. The computational cost of the extra EIS iterations needed to obtain this high precision is thus typically won back when using a faster optimization algorithm. The typical number of EIS iterations required is between 20 and 40 to obtain precision of the order of 1.0e-9. However, once such an evaluation is complete, computing the log-likelihood values needed for finite difference gradients can be much faster, since we may start the EIS iteration using the previous $\hat{a}$ and apply it for a slightly perturbed parameter vector. Typically, this approach requires about 5 to 10 iterations to converge. One final detail to improve numerical stability is to add a simple line search, similar to those applied in line-searching optimization algorithms, to the EIS iteration. This is done by regarding the difference in iterates $d^{(k+1)} = a^{(k+1)} - a^{(k)}$ as a "search direction," along which we may, when necessary, take shorter steps. More precisely, when completing step 3 above, we set $a^{(k+1)} = a^{(k)} + \omega d^{(k+1)}$, $\omega \in (0, 1)$, if the "raw" iterate in step 3 leads to an infinite variance in the importance density or some other pathology. Typical computing times for our FORTRAN90 implementation (see Note 4) range from 30 to 1,000 seconds for locating a maximum likelihood estimator for data sets with around 2,000 observations using a standard PC. The LAPACK (Anderson et al., 1999) routine dgels was used for the linear regressions, and all the random numbers were generated using Marsaglia's KISS generator (see, e.g., Leong, Zhang, Lee, Luk, & Villasenor, 2005).
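The backward pass in step 3 amounts to $S$-observation least-squares fits of a quadratic in $\tilde{z}_i$, as in Eq. (10). The Python sketch below is a schematic stand-in for that pass, not the FORTRAN90/LAPACK implementation; it takes the simulated paths as given, and `target_fn` is a hypothetical callable supplied by the model that returns the left-hand-side values $\log \xi_{i+1} + \log w_{i+1}$.

```python
import numpy as np

def eis_backward_pass(z_paths, target_fn):
    """One backward EIS pass, cf. Eq. (10).

    z_paths   : array of shape (n, S) with simulated latent paths z~_i^(j).
    target_fn : callable (i, z_i, a_next) -> log xi_{i+1} + log w_{i+1} at the S draws
                (model-specific, assumed given).
    Returns a : array of shape (n, 2) with the fitted tilt coefficients (a_{i,1}, a_{i,2}).
    """
    n, S = z_paths.shape
    a = np.zeros((n, 2))            # a_n stays at (0, 0), which makes the last factor exact
    for i in range(n - 2, -1, -1):  # i = n-1, ..., 1 in the chapter's 1-based indexing
        z = z_paths[i]
        y = target_fn(i, z, a[i + 1])
        X = np.column_stack([np.ones(S), z, z**2])    # regressors (1, z, z^2)
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)  # (c_i, a_{i,1}, a_{i,2})
        a[i, 0], a[i, 1] = coef[1], coef[2]
    return a

# Toy usage with a made-up quadratic-plus-noise target, just to exercise the pass.
rng = np.random.default_rng(1)
z_paths = rng.normal(size=(5, 32))
toy_target = lambda i, z, a_next: -0.5 * (z + 1.0) ** 2 + 0.05 * rng.normal(size=z.size)
print(eis_backward_pass(z_paths, toy_target))
```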
3. MONTE CARLO EXPERIMENTS
To assess the statistical properties of the EIS-MC procedure outlined in Section 2, we have conducted some MC experiments. The main objectives of these experiments are to quantify the error arising from the EM discretization and from using EIS to integrate out the latent process.
The main sources (with no particular ordering) of statistical bias for the EIS-MC procedure are:
- Discretization of the continuous time model using an EM discretization. An indirect way of diagnosing this bias is to look for unacceptable errors using EM-based maximum likelihood when the latent process is observed. In this manner, importance sampling is avoided, and hence can be eliminated as a source of error.
- Small sample biases from using the integrated likelihood function. Diagnostics for this are provided by comparing the EM-based MLEs when the log-volatility is observed and unobserved.
- MC errors from using Eq. (5) instead of exact integration. These errors will be discussed in the next section by using many different random number seeds in the program.
All of the computations are done using a yearly time scale and with daily observations, corresponding to $\Delta = 1/252$. We use $S = 32$ paths in the importance sampler throughout both the MC simulations and the application to real data. All the MC experiments are done with $z$ both observed and unobserved, i.e., by maximizing Eqs. (4) and (5), respectively, with respect to $\theta$. Under observed $z$, the simulated $z_0$ is applied, whereas under unobserved $z$, we estimate $z_0$ along with the other parameters. The "true" parameter values used to generate the synthetic data sets are the empirical parameter estimates obtained from the Standard & Poor's 500 data set, which we consider in Section 4. We first consider the Heston model and the GARCH diffusion model, and then consider the full CEV model under two different simulation designs. The synthetic data sets are simulated using the EM scheme with a time step of $\Delta/2048$, so that the simulated data sampled at the $\Delta$ time grid can be regarded as sampled from the continuous time model. The first 3,000 data points for each simulation are discarded so that the simulated distribution of $z_0$ is close to the marginal distribution of $z_0$ dictated by the model. For all the experiments, we simulate and estimate 500 synthetic data sets.
3.1. Heston Model
The results for the simulations under the Heston model are given in Table 1. We use sample size $n = 2{,}022$, equal to the number of observations in the real data set considered in Section 4 (see Note 5).
Table 1. Parameter
Heston Model MC Study Results.
True Value
Bias
Std
MSE
a b s r a b
Observed volatility, n ¼ 2,022 0.2109 0.0081 7.7721 0.4366 0.3774 0.0042 0.3162 0.0044 0.0591 0.0072 1.6435 0.4320
0.0320 1.4286 0.0059 0.0190 0.0854 4.0063
0.0011 2.2272 0.0001 0.0004 0.0073 16.2035
a b s r a b
Unobserved volatility, n ¼ 2,022 0.2109 0.0040 7.7721 0.1068 0.3774 0.0342 0.3162 0.0194 0.0591 0.0344 1.6435 1.0805
0.0601 2.4411 0.0493 0.1209 0.1277 5.5070
0.0036 5.9577 0.0036 0.0150 0.0175 31.4291
Note: The bias is reported as E½y^ ytrue . ‘‘Std’’ denotes the statistical standard errors and MSE denotes the mean square error around the ‘‘true parameter’’.
It is well known that MLEs of the mean reversion parameter tend to be biased toward faster mean reversion in finite samples in the context of diffusion processes with a linear drift (see, e.g., Phillips & Yu, 2005; Phillips & Yu, 2009b). For the CEV model specified in Eq. (1), a faster mean reversion rate corresponds to higher negative values of b. This is also seen under the Heston model both for observed and unobserved logvolatility. Interestingly, we find the effect is stronger for observed log-volatility. Still, the bias under the Heston model is smaller than the corresponding statistical standard errors, both for the observed and the unobserved log-volatility. Thus, it seems that all three sources of bias discussed above are controlled for this model and under the amount of data. The loss of precision when using the integrated likelihood procedure ranges by a factor from 2 to 10 in increased statistical standard errors. In particular, the estimation precision of the volatility-of-volatility parameter s and the leverage parameter r is increased by a factor close to 10.
3.2. The GARCH Diffusion Model Simulation results for the GARCH diffusion model are summarized in Table 2. As before we use n ¼ 2,022.6 For both observed and unobserved
Simulated Maximum Likelihood Estimation of Continuous Time SV Models
151
Table 2. GARCH Diffusion Model MC Study Results. Parameter
True Value
Bias
Std
MSE
a b s r a b
Observed volatility, n ¼ 2,022 0.2411 0.0015 9.3220 0.3056 2.8202 0.0715 0.2920 0.0020 0.1019 0.0039 0.1139 0.3217
0.0323 2.0270 0.0430 0.0195 0.0881 4.4037
0.0010 4.1937 0.0070 0.0004 0.0078 19.4570
a b s r a b
Unobserved volatility, n ¼ 2,022 0.2411 0.0117 9.3220 0.8100 2.8202 0.0760 0.2920 0.0371 0.1019 0.0407 0.1139 1.4421
0.0756 3.6413 0.4254 0.1156 0.1320 6.1166
0.0058 13.8878 0.1864 0.0147 0.0190 39.4159
Note: See the note of Table 1 for details.
log-volatility, there is a negative bias in the mean reversion parameter b as one would expect. Once again we see that no biases are larger in magnitude than the corresponding standard errors when the log-volatility is unobserved. A downward bias in s is seen for the observed log-volatility to be larger in magnitude than the corresponding standard error, but the bias is still small compared with the ‘‘true’’ parameter. Again the loss of precision from using integrated MLE is the largest for the parameters s and r, where again about a tenfold increase in standard error is seen.
3.3. The CEV Model For the CEV model, we have performed two simulation studies, which are summarized in Tables 3 and 4. We denote these parameter settings as P1 and P2, respectively, corresponding to columns 3 and 4 in Table 5.7 Under P1 as ‘‘true parameters’’ are the estimates obtained using the full set of S&P 500 returns, which include the October 1987 crash. The experiment is done using both n ¼ 2; 022 and n ¼ 5; 000 data points. From Table 3 we see that the EIS-MLE procedure produces downward biased estimates of g when the log-volatility is unobserved. When the log-volatility is observed, this effect is negligible. The bias in g also leads to substantial bias in the other parameters governing the volatility, as we expect the MLE
152
TORE SELLAND KLEPPE ET AL.
Table 3. Parameter
CEV Model (P1) MC Study Results.
True Value
Bias
Std
MSE
a b s r g a b
Observed volatility, n ¼ 2,022 0.0434 0.0102 0.4281 0.5903 13.6298 0.3713 0.3317 0.0013 1.5551 0.0070 0.0820 0.0095 0.8716 0.5788
0.0232 1.5432 1.3716 0.0188 0.0266 0.0923 4.4829
0.0006 2.7252 2.0153 0.0004 0.0008 0.0086 20.3912
a b s r g a b
Unobserved volatility, n ¼ 2,022 0.0434 0.0634 0.4281 3.4599 13.6298 8.6680 0.3317 0.0227 1.5551 0.3539 0.0820 0.0030 0.8716 0.1687
0.0530 2.6160 3.7726 0.1355 0.2202 0.1191 5.4970
0.0068 18.7998 89.3375 0.0188 0.1736 0.0142 30.1819
a b s r g a b
Observed volatility, n ¼ 5,000 0.0434 0.0036 0.4281 0.2046 13.6298 0.4053 0.3317 0.0014 1.5551 0.0066 0.0820 0.0053 0.8716 0.2240
0.0138 0.9445 0.8616 0.0128 0.0169 0.0557 2.5162
0.0002 0.9322 0.9051 0.0002 0.0003 0.0031 6.3687
a b s r g a b
Unobserved volatility, n ¼ 5,000 0.0434 0.0480 0.4281 2.6796 13.6298 8.5886 0.3317 0.0309 1.5551 0.3082 0.0820 0.0154 0.8716 0.5007
0.0320 1.5569 2.5369 0.0787 0.1444 0.0720 3.1073
0.0033 9.5995 80.1870 0.0071 0.1158 0.0054 9.8867
Note: See the note of Table 1 for details.
estimates of them must have a strong correlation structure in distribution, even asymptotically. Increasing the sample size from 2,022 to 5,000 decreases the biases slightly, but it seems that very long time series will be needed to identify g with a decent precision for when the true parameter is in this range.
Simulated Maximum Likelihood Estimation of Continuous Time SV Models
Table 4. Parameter
153
CEV Model (P2) MC Study Results.
True Value
Bias
Std
MSE
a b s r g a b
Observed volatility, n ¼ 1,800 0.0754 0.0097 3.4022 0.5524 1.7587 0.0538 0.3912 0.0037 1.0804 0.0072 0.1811 0.0204 13.8246 1.2152
0.0216 1.3048 0.2565 0.0191 0.0359 0.1160 6.2209
0.0006 2.0042 0.0686 0.0004 0.0013 0.0138 40.0973
a b s r g a b
Unobserved volatility, n ¼ 1,800 0.0754 0.0231 3.4022 1.3109 1.7587 0.1653 0.3912 0.0030 1.0804 0.1273 0.1811 0.0220 13.8246 1.5312
0.0499 2.7270 1.6373 0.1618 0.2449 0.1755 9.2881
0.0030 9.1392 2.7025 0.0261 0.0761 0.0312 88.4301
a b s r g a b
Observed volatility, n ¼ 5,000 0.0754 0.0025 3.4022 0.1435 1.7587 0.0370 0.3912 0.0044 1.0804 0.0067 0.1811 0.0078 13.8246 0.4200
0.0122 0.7612 0.1511 0.0120 0.0213 0.0693 3.6322
0.0002 0.5989 0.0242 0.0002 0.0005 0.0049 13.3426
a b s r g a b
Unobserved volatility, n ¼ 5,000 0.0754 0.0044 3.4022 0.2826 1.7587 0.2897 0.3912 0.0201 1.0804 0.0770 0.1811 0.0101 13.8246 0.3163
0.0217 1.1640 0.8019 0.0876 0.1500 0.0936 4.7425
0.0005 1.4320 0.7257 0.0081 0.0284 0.0089 22.5458
Note: See the note of Table 1 for details.
To understand the impact of extreme observations (or ‘‘crashes’’) on this bias, we use the logarithm of the maximal absolute log-return as a simple proxy for ‘‘large’’ crash. In Fig. 1, a scatter plot of the estimated g against logarithm of the maximal absolute log-return is shown across all simulated paths under P1. A strong positive relationship is seen (correlation 0.52). This
154
TORE SELLAND KLEPPE ET AL.
Table 5. Parameter a
b
s
r
g
a
b
Log-likelihood
Parameter Estimation Results for the Standard and Poor’s 500 Data.
Heston
GARCH
CEV
LN
CEV1800
LN1800
0.2109 [0.0035] (0.0601) 7.7721 [0.1395] (2.4411) 0.3774 [0.0036] (0.0493) 0.3162 [0.0017] (0.1209) 0.5 – – 0.0591 [0.0013] (0.1277) 1.6435 [0.0459] (5.5070) 6514.90 [0.2457]
0.2411 [0.0020] (0.0756) 9.3220 [0.0860] (3.6413) 2.8202 [0.0115] (0.4254) 0.2920 [0.0006] (0.1156) 1.0 – – 0.1019 [0.0005] (0.1320) 0.1139 [0.0209] (6.1166) 6541.42 [0.0823]
0.0434 [0.0028] (0.0530) 0.4281 [0.1495] (2.6160) 13.6298 [0.3321] (3.7726) 0.3317 [0.0023] (0.1355) 1.5551 [0.0075] (0.2202) 0.0820 [0.0011] (0.1191) 0.8716 [0.0419] (5.4970) 6552.09 [0.3494]
29.5044 [0.2304] (8.7924) 7.6953 [0.0593] (2.2791) 2.4793 [0.0112] (0.3313) 0.3146 [0.0007] (0.0946) – – – 0.0683 [0.0005] (0.1102) 1.4183 [0.0196] (4.9598) 6538.23 [0.1502]
0.0754 [0.0027] (0.0499) 3.4022 [0.1489] (2.7270) 1.7587 [0.3088] (1.6373) 0.3912 [0.0010] (0.1618) 1.0804 [0.0432] (0.2449) 0.1811 [0.0044] (0.1755) 13.8246 [0.2251] (9.2881) 5916.20 [0.0457]
17.4672 [0.1234] (8.5628) 4.4413 [0.0313] (2.2166) 1.2965 [0.0063] (0.2664) 0.3934 [0.0010] (0.1525) – – – 0.2021 [0.0008] (0.1732) 14.8801 [0.0409] (8.7362) 5916.44 [0.0646]
Note: Monte Carlo standard errors are given squared brackets and statistical standard errors, taken from the MC experiments, are given in parentheses. For the Heston model and the CEV model, 1 of the 100 estimation replications failed to converge. Columns 5 and 6 report parameter estimates obtained when including only the first 1,800 observations in the sample
plot suggests that when the log-volatility is unobserved, extreme values in a sample are needed to identify high values of g. As a reference, the maximal absolute log-return of the October 1987 crash is roughly expð1:64Þ. In the simulated data sets, such extreme events occur in roughly 0.4% of the data sets. For the P2 ‘‘true parameters’’ obtained using the data excluding the October 1987 crash, we see much smaller biases, and the downward biases in g decrease substantially when the sample size is increased from 1,800 to 5,000. Still, all the biases are smaller in magnitude than the corresponding statistical standard error.
Simulated Maximum Likelihood Estimation of Continuous Time SV Models
155
2
γ
∧
1.5
1
0.5 −3.5
−3
−2.5 −2 log (maxi |xi|)
−1.5
−1
Fig. 1. Estimated g Values Explained by logðmaxi jxi jÞ from the MC Experiment with Parameter Setting 1 and n ¼ 2,022. The Sample Correlation Equals 0.52.
4. APPLICATION TO REAL DATA
For our application to real data, we use S&P 500 log-return data previously used in Jacquier, Polson, and Rossi (1994) and Yu (2005) (see Note 8). The data cover the period January 1980 to December 1987, and the sample has a total of $n = 2{,}022$ log-return observations. We also fit the following continuous time LN model using the proposed method:
st d zt
#
" ¼
# # " pffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi #" a þ b expðzt Þ ð1 r2 Þ expðzt =2Þ r expðzt =2Þ dBt;1 dt þ . dBt;2 a þ bzt 0 s (12)
There are two purposes for fitting the LN model. First, it is used to illustrate the flexibility of our estimation methods. Second, it would be empirically interesting to compare the performance of the LN model with that of the CEV models. The discretization and the recipe for adapting the EIS algorithm to this model are given in Appendix B.
156
TORE SELLAND KLEPPE ET AL.
Parameter estimates along with statistical and MC standard errors for the Heston model, the GARCH diffusion, the full CEV model, and the LN model are given in columns 1–4 in Table 5, respectively. In addition, we have included parameter estimates for the CEV model and the LN model when only the first 1,800 observations (and thus excluding the October 1987 crash) are used in column 5 and 6. The statistical standard errors are taken from the MC experiments reported in Tables 1–4 for the CEV model instances. For the LN model, the statistical standard errors are MC estimates based on 100 synthetic data sets. As both the Heston model and the GARCH diffusion model are special cases of the CEV model, it is sensible to compare the maximum likelihood values reported in the last row of the table. The likelihood ratio test suggests that there is strong empirical evidence against the Heston model. This empirical result reinforces what have been found when both the spot prices and option prices are jointly used to estimate the CEV-SV model (Jones, 2003; Aı¨ t-Sahalia & Kimmel, 2007). Moreover, for the GARCH diffusion model, when the complete data set is used, the likelihood ratio test gives rejection for any practical p-value comparing with the complete CEV model. For the shorter data set, we see that the estimate for g is less than one-half the standard error from that of the GARCH diffusion model. The estimates for the leverage effect parameter r are very much in accordance with the estimates of Yu (2005) ðposterior mean ¼ 0:3179Þ under the log-normal stochastic volatility (LN SV) model. In all cases, we obtained a positive estimate of b, suggesting a positive risk-return relation, but the parameter estimates are statistically insignificant. The parameter estimates of the CEV model with and without the October 1987 crash differ significantly. This observation suggests a poor identification of g and that the influence of the crash to the estimate of g when the logvolatility is unobserved is substantial. The finding is consistent with that in Jones (2003), even though he uses data from 1986–2000 and 1988–2000 along with implied volatility data. For the data set including the October 1987 crash, Jones (2003) obtains a posterior mean of 1.33 for the CEV parameter g. The corresponding estimated value for data excluding the October 1987 crash is 1.17. Our simulated maximum likelihood estimates for g are 1.56 and 1.08, respectively. Jones (2003) argues that to accommodate the large spike in volatility represented by the October 1987 crash, higher values of g and s are needed. Still, since Jones (2003) used both log-return and implied volatility data, it is expected that his parameter estimates differ less then ours with and without the October 1987 crash in the sample.
Simulated Maximum Likelihood Estimation of Continuous Time SV Models
157
The estimation results based on the LN SV model are reported in column 4 for the full sample and in column 6 for the subsample. The LN SV model has also been successfully estimated in both cases. For example, when the full sample is used, the estimates for the leverage effect parameter r are very much in accordance with the estimates of Yu (2005). A comparison of the log-likelihood values of the CEV model and the LN model reveals that for the full sample the 7-parameter CEV model outperforms the 6-parameter LN model. However, for the subsample, the LN model has a slightly higher likelihood value even with fewer parameters. To estimate the errors induced by integrating out the log-volatility using the above-described EIS-MC method (comparing with the exact unknown integral), we repeat the estimation process 100 times using different random number seeds. These MC standard errors for the parameters and maximum log-likelihood values are included in brackets in Table 5. It is seen that the MC errors are generally small compared with the statistical standard errors. Judging from the MC standard errors in maximum log-likelihood estimates, the EIS-MC method performs best when g is close to 1. As references for the standard errors of the maximum log-likelihoods, we may mention that Liesenfeld and Richard (2006) obtains an MC standard error of 0.11 (log-likelihood: 918) under a 3-parameter LN SV model with 945 latent variables using 30 paths in the importance sampler. For a 5-parameter time-discretized Heston model, Durham (2006) obtains an MC standard error of 2.49 (log-likelihood: 18,473) using 1,024 draws in a Laplace importance sampler. As the latent process under consideration here is both nonlinear and heteroskedastic, the standard errors reported in Table 5 are satisfactory. Comparing with the findings of Kleppe and Skaug (2009), much of this may be written back to constructing the importance sampler around the product of conditional transition densities, rather than around the natural sampler as is commonly done in other applications of the EIS algorithm to nonlinear state space models.
5. CONCLUDING REMARKS
This chapter outlines how the EIS algorithm may be applied to integrate out a latent process in an EM-discretized stochastic differential equation model. In terms of numerical precision, we find that the algorithm performs very well when considering the nonlinear and heteroskedastic structure of the latent process. In terms of the application to the CEV model, we find that
158
TORE SELLAND KLEPPE ET AL.
the integrated MLEs obtained perform well for moderate values of g, but the identification is more difficult for higher values of g. One direction for further research is to use the improved (relative to the EM) approximate continuous time TPDs proposed in Aı¨ t-Sahalia (2008) and for jump diffusions in Yu (2007). Using a simple Taylor expansion of these approximations (in zi ), one can obtain estimates of the conditional transition densities (i.e., conditional on xi ) that stays within the locally Gaussian importance samplers. The inclusion of jumps in the model will probably also improve the identifiability of the complete CEV model, as large returns may be regarded as jumps rather than be caused by large spikes in the volatility process. As a result, the volatility series will be smoother and hence it can be expected that the finite sample estimation bias of the mean reversion parameter will be more serious. Moreover, it should be noted that this procedure is by no means restricted to the CEV family of models. As shown in Appendix B, only the three model-dependent functions m0i ðzi1 ; xi Þ, S0i ðzi1 Þ, and xi ðzi1 ; xi Þ need to be respecified to implement a different model. The EM scheme suggests that any stochastic differential equation has an approximate Gaussian TPD for sufficiently short time-steps D. Thus, the technique of using the conditional on-data EM-TPD can be applied provided that data are given over a fine enough time grid. In particular, due to the explicit nature of Gaussian conditional densities, multivariate extensions (toward both multiple observed and unobserved processes) should also be straightforward. It is also worth noting that the above outlined procedure is closely related to the Laplace accelerated sequential importance sampling (LASIS) procedure of Kleppe and Skaug (2009). In the setting of the EM-discretized CEV model, their procedure would be equivalent to applying a Laplace importance sampler in w (which are standard normal) instead of z. This procedure would then bypass the much problems of heteroskedasticity and nonlinearity in much the same manner as outlined here, but we do not make further comparisons here.
NOTES 1. This is a slight abuse of notation, as the data are from the continuous time process (Eq. (1)) and not the discrete time approximation. 2. In general, both m and a^ depend on y and D, but we suppress this dependence in our notation. 3. 0i should be read as the ith row of a with the elements all equal to 0. 4. The source code is available on request from the first author.
Simulated Maximum Likelihood Estimation of Continuous Time SV Models
159
5. Under this simulation regime, 8% of the estimation replica for the unobserved log-volatility failed to converge and were subsequently ignored. 6. For this model, 2.8% of the simulation replications under unobserved logvolatility failed to converge. 7. For P1, 4.2% of the replications failed to converge, whereas for P2, 6.2% of the replications failed to converge. 8. The log-return data are multiplied by 0.01.
REFERENCES Aı¨ t-Sahalia, Y. (2002). Maximum-likelihood estimation of discretely-sampled diffusions: A closed-form approximation approach. Econometrica, 70, 223–262. Aı¨ t-Sahalia, Y. (2008). Closed-form likelihood expansions for multivariate diffusions. Annals of Statistics, 36(2), 906–937. Aı¨ t-Sahalia, Y., & Kimmel, R. (2007). Maximum likelihood estimation of stochastic volatility models. Journal of Financial Economics, 134, 507–551. Andersen, T. G., Benzoni, L., & Lund, J. (2002). An empirical investigation of continuous-time equity return models. The Journal of Finance, 57(3), 1239–1284. Andersen, T. G., & Lund, J. (1997). Estimation continuous-time stochastic volatility models of the short-term interest rate. Journal of Econometrics, 77, 343–377. Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Croz, J. D., Greenbaum, A., Hammarling, S., McKenney, A., & Sorensen, D. (1999). LAPACK Users’ Guide (3rd ed.). Philadelphia: Society for Industrial and Applied Mathematics. Bakshi, G., Cao, C., & Chen, Z. (1997). Empirical performance of alternative option pricing models. Journal of Finance, 52, 2003–2049. Bauwens, L., & Galli, F. (2009). Efficient importance sampling for ml estimation of scd models. Computational Statistics and Data Analysis, 53, 1974–1992. Cox, J. C., Ingersoll, J. E., & Ross, S. A. (1985). A theory of the term structure of interest rates. Econometrica, 53(2), 385–407. Duffie, D., Pan, J., & Singleton, K. J. (2000). Transform analysis and asset pricing for affine jump-diffusions. Econometrica, 68, 1343–1376. Durham, G. B. (2006). Monte carlo methods for estimating, smoothing, and filtering one and two-factor stochastic volatility models. Journal of Econometrics, 133, 273–305. Durham, G. B., & Gallant, A. R. (2002). Numerical techniques for maximum likelihood estimation of continuous-time diffusion processes (with discussion). Journal of Business and Economic Statistics, 20(3), 297–338. Heston, S. (1993). A closed-form solution for options with stochastic volatility with applications to bonds and currency options. Review of Financial Studies, 6, 327–343. Hull, J., & White, A. (1987). The pricing of options on assets with stochastic volatilities. Journal of Finance, 42, 281–300. Jacquier, E., Polson, N. G., & Rossi, P. E. (1994). Bayesian analysis of stochastic volatility models. Journal of Business & Economic Statistics, 12(4), 371–389. Jones, C. S. (2003). The dynamics of stochastic volatility: Evidence from underlying and options markets. Journal of Econometrics, 116, 181–224. Kleppe, T. S., & Skaug, H., (2009). Fitting general stochastic volatility models using laplace accelerated sequential importance sampling. Submitted for publication.
Kloeden, P. E., & Platen, E. (1999). Numerical solution of stochastic differential equations. New York: Springer-Verlag.
Leong, P. H. W., Zhang, G., Lee, D. U., Luk, W., & Villasenor, J. (2005). A comment on the implementation of the ziggurat method. Journal of Statistical Software, 12(7), 1–4.
Liesenfeld, R., & Richard, J.-F. (2003). Univariate and multivariate stochastic volatility models: Estimation and diagnostics. Journal of Empirical Finance, 10, 505–531.
Liesenfeld, R., & Richard, J.-F. (2006). Classical and Bayesian analysis of univariate and multivariate stochastic volatility models. Econometric Reviews, 25(2), 335–360.
Nelson, D. B. (1990). ARCH models as diffusion approximations. Journal of Econometrics, 45, 7–38.
Nocedal, J., & Wright, S. J. (1999). Numerical optimization. New York: Springer.
Phillips, P. C. B., & Yu, J. (2005). Jackknifing bond option prices. Review of Financial Studies, 18, 707–742.
Phillips, P. C. B., & Yu, J. (2009a). Maximum likelihood and Gaussian estimation of continuous time models in finance. In: T. G. Andersen et al. (Eds.), Handbook of Financial Time Series (pp. 497–530). New York: Springer.
Phillips, P. C. B., & Yu, J. (2009b). Simulation-based estimation of contingent-claims prices. Review of Financial Studies, 22(9), 3669–3705.
Rao, B. L. S. P. (1999). Statistical inference for diffusion type processes. Kendall's Library of Statistics 8. Arnold.
Richard, J.-F., & Zhang, W. (2007). Efficient high-dimensional importance sampling. Journal of Econometrics, 141(2), 1385–1411.
Shephard, N., & Pitt, M. K. (1997). Likelihood analysis of non-Gaussian measurement time series. Biometrika, 84, 653–667.
Wiggins, J. (1987). Option values under stochastic volatility: Theory and empirical estimates. Journal of Financial Economics, 19, 351–372.
Yu, J. (2005). On leverage in a stochastic volatility model. Journal of Econometrics, 127, 165–178.
Yu, J. (2007). Closed-form likelihood approximation and estimation of jump-diffusions with an application to the realignment risk of the Chinese yuan. Journal of Econometrics, 141(2), 1245–1280.
APPENDIX A. EXPLICIT EXPRESSIONS

The explicit expression for log ω_i is given as

\[
\log \omega_i(z_{i-1}, x_i, a_i)
= \tfrac{1}{2}\log(\pi)
- \tfrac{1}{2}\log\!\left(\frac{1}{2\,\Sigma_i^0(z_{i-1})^2} - a_{i,2}\right)
- \frac{\mu_i^0(z_{i-1}, x_i)^2}{2\,\Sigma_i^0(z_{i-1})^2}
+ \left(\frac{\mu_i^0(z_{i-1}, x_i)}{\Sigma_i^0(z_{i-1})^2} + a_{i,1}\right)^{\!2}
\Bigg/\; 4\left(\frac{1}{2\,\Sigma_i^0(z_{i-1})^2} - a_{i,2}\right).
\]
Moreover, log ξ_i is given as

\[
\log \xi_i(z_{i-1}, x_i)
= -\log(2\pi\Delta)
+ \frac{z_{i-1}(1-2\gamma)}{2}
- \tfrac{1}{2}\log\!\big(\sigma^2(1-\rho^2)\big)
- \frac{\big(x_i - \Delta(a + b\exp(z_{i-1}))\big)^2}{2\,\Delta\exp(z_{i-1})}.
\]
APPENDIX B. THE LOG-NORMAL MODEL

In this chapter, we use the following specification of the continuous time LN SV model:

\[
d\begin{bmatrix} x_t \\ z_t \end{bmatrix}
= \begin{bmatrix} a + b\exp(z_t) \\ a + b z_t \end{bmatrix} dt
+ \begin{bmatrix} \sqrt{1-\rho^2}\,\exp(z_t/2) & \rho\exp(z_t/2) \\ 0 & \sigma \end{bmatrix}
\begin{bmatrix} dB_{t,1} \\ dB_{t,2} \end{bmatrix}.
\tag{B.1}
\]

The EM scheme yields the discrete time dynamics

\[
\begin{bmatrix} x_{i+1} \\ z_{i+1} \end{bmatrix}
= \begin{bmatrix} \Delta\big(a + b\exp(z_i)\big) \\ z_i + \Delta(a + b z_i) \end{bmatrix}
+ \sqrt{\Delta}\,
\begin{bmatrix} \sqrt{1-\rho^2}\,\exp(z_i/2) & \rho\exp(z_i/2) \\ 0 & \sigma \end{bmatrix}
\begin{bmatrix} \epsilon_{i,1} \\ \epsilon_{i,2} \end{bmatrix},
\tag{B.2}
\]

which with a = b = 0 is equivalent to the ASV1 specification in Yu (2005). To adapt the above-described EIS algorithm, the following functions need to be altered:

\[
\mu_i^0(z_{i-1}, x_i) = z_{i-1} + \Delta(a + b z_{i-1})
+ \sigma\rho\exp(-z_{i-1}/2)\big(x_i - \Delta(a + b\exp(z_{i-1}))\big),
\tag{B.3}
\]
\[
\Sigma_i^0(z_{i-1}) = \sigma\sqrt{\Delta(1-\rho^2)},
\tag{B.4}
\]
\[
\log \xi_i(z_{i-1}, x_i) = -\log(2\pi\Delta) - \frac{z_{i-1}}{2}
- \tfrac{1}{2}\log\!\big(\sigma^2(1-\rho^2)\big)
- \frac{\big(x_i - \Delta(a + b\exp(z_{i-1}))\big)^2}{2\,\Delta\exp(z_{i-1})}.
\tag{B.5}
\]
This highlights the fact that the above EIS algorithm is easily adapted to other models cast in the form of EM-discretized stochastic differential equations. For the computations summarized in Table 5, we use M = 32 draws in the importance sampler.
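To make the adaptation concrete, the following is a minimal sketch in Python (hypothetical function and parameter names; it is not the chapter's own code, which the authors make available on request) of the three model-specific ingredients in Eqs. (B.3)–(B.5) as they would be supplied to an EM-discretized EIS sampler:

```python
import numpy as np

# Model-specific EIS ingredients for the EM-discretized LN SV model
# (Eqs. B.3-B.5). Parameters: a, b = drift coefficients; sigma = vol-of-vol;
# rho = leverage; delta = time step. Names are illustrative only.

def mu0(z_prev, x, a, b, sigma, rho, delta):
    """Conditional mean of z_i given (z_{i-1}, x_i), Eq. (B.3)."""
    return (z_prev + delta * (a + b * z_prev)
            + sigma * rho * np.exp(-z_prev / 2.0)
            * (x - delta * (a + b * np.exp(z_prev))))

def sigma0(z_prev, sigma, rho, delta):
    """Conditional std. dev. of z_i given (z_{i-1}, x_i), Eq. (B.4)."""
    return sigma * np.sqrt(delta * (1.0 - rho ** 2))

def log_xi(z_prev, x, a, b, sigma, rho, delta):
    """log xi_i(z_{i-1}, x_i), Eq. (B.5)."""
    resid = x - delta * (a + b * np.exp(z_prev))
    return (-np.log(2.0 * np.pi * delta) - z_prev / 2.0
            - 0.5 * np.log(sigma ** 2 * (1.0 - rho ** 2))
            - resid ** 2 / (2.0 * delta * np.exp(z_prev)))
```

Under this reading, moving to another EM-discretized model amounts to replacing these three functions and leaving the rest of the sampler unchanged.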
EDUCATION SAVINGS ACCOUNTS, PARENT CONTRIBUTIONS, AND EDUCATION ATTAINMENT

Michael D. S. Morris

ABSTRACT

This chapter uses a dynamic structural model of household choices on savings, consumption, fertility, and education spending to perform policy experiments examining the impact of tax-free education savings accounts on parental contributions toward education and the resulting increase in the education attainment of children. The model is estimated via maximum simulated likelihood using data from the National Longitudinal Survey of Young Women. Unlike many similarly estimated dynamic choice models, the estimation procedure incorporates a continuous variable probability distribution function. The results indicate that the accounts increase the amount of parental support, the percent contributing, and education attainment. The policy impact compares favorably to the impact of other policies such as universal grants and general tax credits, for which the model gives results in line with those from other investigations.
1. INTRODUCTION Parents in the United States commonly help pay for their children’s college education, and for many families this can be a major expense. The College Board (2006a) estimates that the average yearly total cost, including tuition, fees, room, and board, for a private 4-year school was $30,367 per year for the 2006–2007 academic year, and $12,796 for a public 4-year school. These expenses have been rising rapidly. Total costs at 4-year private schools increased an inflation adjusted 28 percent over the past decade, and 4-year public schools rose an even larger 38 percent. Certainly some of the increase has been financed by student aid packages including grants, scholarships, and subsidized loans, and the U.S. Department of Education (2005) reports that 63 percent of students in 2004 received some form of financial aid. The average annual net price after all forms of aid, however, still averaged $21,400 for private 4-year universities and $9,700 for public 4-year universities and while aid packages have grown along with rising college costs, the inflation adjusted net price has still increased annually at an average of 2.1 and 2.6 percent for private and public universities, respectively, over the last decade (College Board, 2006a). So it falls upon the individual students and their families to help pay increasingly more for college expenses. Using data gathered by the National Center for Education Studies, Choy, Henke, and Schmitt (1992) found that in 1989, 67 percent of parents contributed to their children’s college education and the average annual amount was $3,900 ($6,340 in 2006 dollars), a trend that has increased more as the relative cost of college education has risen. In response to increasing prices and reliance of parental assistance for college funding, a variety of programs including grants, subsidized loans, and tax-sheltered targeted savings accounts have been introduced. However, there has been very little investigation as to their impact on actually increasing the number of students attaining a college degree. Dynarski (2000, 2002), Ichimura and Taber (2002), and Keane and Wolpin (2001) find varying degrees of evidence that direct subsidies and grants to students increase college enrollments, but there remains little evidence as to what extent this then leads to an increase in final degree attainments. Furthermore, there has been no investigation as to whether tax-sheltered education savings accounts (ESA) actually lead to greater number of children attaining a college degree. The goal of this chapter is to examine the impact of ESA programs on the educational attainment of children by developing and estimating a structural dynamic programming model of household decisions on having
children, saving money, and making transfers to children in the form of educational funding. The model is estimated using a simulated maximum likelihood procedure and data from the National Longitudinal Survey (NLS) of Young Women cohort. The estimated model is then used to examine a policy experiment aimed to replicate the creation of an ESA program in order to determine the impact of such accounts on actually increasing the number of college degrees attained. The results here indicate that the accounts should increase both contributions and the attainment of college degrees, and that the impact compares favorably to some other policy options such as universal grants and general tax credits. The results for the grants, used for comparison, are also in line with other investigations. Section 2 of the chapter presents a literature review of related research, Section 3 of the chapter explains the model and estimation specification, including the numerical techniques associated with solving the model and the simulated maximum likelihood procedure used to estimate the model. Section 4 describes the data, both in terms of how the variables in the model are constructed and descriptive statistics of the sample used. Section 5 discusses the estimated parameter results and the fit of the estimated model. Section 6 presents the policy experiments and Section 7 concludes the chapter.
2. LITERATURE REVIEW Given the high costs of education, the fact that many children need support from their parents to attain a college degree, and the importance of a college degree in determining future earnings and productivity, there has been a push in the last decade to introduce a variety of tax breaks aimed at not only lowering the cost of education, but also to help parents contribute more toward their children’s college expenses. For example, the Taxpayer Relief Act of 1997 introduced two tax credits that parents can claim for dependants. The Hope Scholarship Credit is a tax credit for 100 percent of the first $1,000 of qualified tuition expenses and 50 percent of the next $1,000 of qualified tuition expenses. This credit can be claimed for each child supported, but is only available in the first 2 years of postsecondary education and is phased out at higher incomes ($40,000–$50,000 for individual taxpayers, twice that for married households). The Lifetime Learning Credit is similar except that it is available for any year of education (not just the first 2 years), but is only for 20 percent of up to $10,000 (only $5,000 prior to 2003) in expenses and is a single credit per
family, not per child. More directly to help parental contributions, the Taxpayer Relief Act of 1997 also created a federal educational savings account, now called a Coverdell Account, which is a custodial account for children under 18 in which $2,000 (was $500 before 2002) a year may be placed and for which the earnings and distributions are tax-free as long as they are used for qualified education expenses. Similarly, Section 529 of the IRS tax code in 1996 also allowed states to offer their own favorable tax treatment savings accounts to fund qualified state tuition programs. The most popular of these savings vehicles have been state-sponsored 529 plans, of which there are two general types. The first is a prepaid tuition plan where money is contributed in a child’s name and locks in an associated percentage of college expenses at current tuition rates at one of the state’s public universities.1 The other type of plan is an ESA where contributions that can grow tax-free, and generally withdrawals are also tax-free as long as the money is spent on education. The popularity of the state-run ESAs is due to their higher contribution limits, where the only limit is that they cannot exceed expected education expenses. These 529 plans have grown substantially over the decade since their creation in 1996: from less than 500,000 accounts averaging $4,959 in 1996 to over 8.8 million accounts with an average balance of $10,569 by 2006 (College Board, 2006b). Of the accounts in 2006, 81 percent are of the traditional tax-sheltered savings account as opposed to prepaid tuition accounts. Despite the wide use and political popularity of these savings accounts, there is very little evidence on their impact on actual educational attainment. Several studies have looked at the impact of aid programs in the form of grants and loans targeted directly at students as opposed to parents. Using data from the introduction of the Georgia HOPE scholarship program as a quasi-experiment, Dynarski (2000, 2002) finds the availability of an additional $1,000 subsidy increases attendance by 4 percent, a figure she finds consistent with previous findings. Ichimura and Taber (2002) estimate a 4.5 percent increase in attendance from the availability of a $1,000 subsidy using a reduced-form estimation derived from a model used by Keane and Wolpin (2001). They actually estimate that a $100 tuition increase (a negative subsidy) would lower enrollment rates of 18–24-year-olds by 1.2 percent. Interestingly, while finding that borrowing constraints are indeed tight for students (they cannot even support 1 year from borrowing alone), Keane and Wolpin do not find that allowing for easier loan access will increase attendance. Instead, they find that the major impact of reducing the borrowing constraint is in a reduction of working by students. Perhaps more importantly, Keane and Wolpin also find that parental
transfers contingent on college attendance are not only prevalent, but do significantly increase the educational attainment of children. These finding, though, still leave in question the effectiveness of ESA programs. These accounts may or may not actually increase educational spending. They could just be a tax break for parents who would have sent their children to college anyway. In addition, it is unclear exactly to what extent additional parental contributions will increase attendance, let alone the number of people who get a college degree. The related literature on targeted savings, namely IRAs for retirement savings, gives rather mixed results.2 Answering the question regarding ESAs is made even more difficult because clearly contributions toward education are contingent on having children in the first place, and if parents do save or adjust their spending behavior over time to help make contributions, then to fully examine the impact of policies on parental contributions it is necessary to consider parents dynamic and interdependent decisions on fertility, savings, and college spending together. This chapter uses a modified version of the life-cycle model that is the workhorse for economic research on household inter-temporal savings and consumption. A very thorough review of the history of these models and their ability to deal with observed microeconomic facts can be found in Browning and Lusardi (1996), including the enhancements used here such as liquidity constraints and precautionary savings. While these models are certainly useful, it has been pointed out that life-cycle models really need to account for a whole range of related interdependent choices including, importantly for this analysis, fertility decisions.3 The model in this chapter makes the choice to have children, the number of children to have, and the timing of when to have children endogenous decisions within the life-cycle framework, along with decisions on consumption and savings in the face of income uncertainty and borrowing constraints. Furthermore, families also choose to make transfers to their children in the form of college educational spending. Together, this allows for a rich relationship between the number of children to have, how much to provide children with money for college education, savings, and consumption within the life-cycle.
3. THE MODEL

3.1. The Model

The model is a finite horizon dynamic programming problem where households maximize their expected discounted utility over periods t = 1, …, T.
In each period, households receive contemporaneous utility that depends not only on the level of consumption, but also directly on the total number of children they have as well as the levels of education of those children. More specifically, the utility function allows for households to receive nonpecuniary utility from having children, and additional utility for those children having a moderately high level of education, indicated by some college education beyond high school or a 2-year degree, and a high level of education, indicated by a 4-year college degree or more. The contemporaneous utility at time t is

\[
U(c_t, n_t, q_{mt}, q_{ht}, \epsilon_{ct}, \epsilon_{nt}, \epsilon_{qt})
= \frac{(c_t \epsilon_{ct})^{1-\gamma}}{1-\gamma} + \lambda n_t + \alpha q_{mt} + \theta q_{ht}
\tag{1}
\]
\[
\lambda = \lambda_1 + \lambda_2 n_t + \lambda_3 c_t + \epsilon_{nt}, \qquad
\alpha = \alpha_1 + \alpha_2 q_{mt} + \epsilon_{qt}, \qquad
\theta = \theta_1 + \theta_2 q_{ht} + \epsilon_{qt}
\]

where c_t is consumption at time t, n_t is the total number of children at time t, q_{mt} is the number of children with a moderately high education level (some education beyond high school) at time t, q_{ht} is the number of children with a high level of education (4-year college degree or more) at time t, ε_{ct}, ε_{nt}, and ε_{qt} are taste shocks to consumption, number of children, and education levels, respectively, and the rest (γ, λ, α, and θ) are model parameters to be estimated. This specification allows the marginal utility of the number of children to depend on both the number of children and the level of consumption, along with a random shock. The marginal utilities of additional children attaining moderate and high education levels can also depend on the number of children at those levels and a random shock.4 The final specification that is estimated allows some additional heterogeneity in the utility parameters for children and education with respect to different parental education levels, and the child parameters are allowed to further vary in the first three periods, the details of which are shown in Appendix A. The shocks are assumed to be iid, with the multiplicative consumption shock distributed log-normal (so the shock maintains positive consumption), and the other shocks distributed normal:

\[
\ln(\epsilon_{ct}) \sim N(0, \sigma_c^2), \qquad
\epsilon_{nt} \sim N(0, \sigma_n^2), \qquad
\epsilon_{qt} \sim N(0, \sigma_q^2)
\tag{2}
\]
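For concreteness, a minimal sketch in Python (hypothetical names; illustrative only, not the author's implementation) of the contemporaneous utility in Eq. (1) and the taste-shock draws in Eq. (2):

```python
import numpy as np

def period_utility(c, n, qm, qh, eps_c, eps_n, eps_q, par):
    """Contemporaneous utility of Eq. (1) with the linear taste shifters."""
    lam = par["lam1"] + par["lam2"] * n + par["lam3"] * c + eps_n
    alpha = par["alpha1"] + par["alpha2"] * qm + eps_q
    theta = par["theta1"] + par["theta2"] * qh + eps_q
    g = par["gamma"]
    return (c * eps_c) ** (1.0 - g) / (1.0 - g) + lam * n + alpha * qm + theta * qh

def draw_shocks(rng, sig_c, sig_n, sig_q):
    """Taste shocks of Eq. (2): eps_c log-normal, eps_n and eps_q normal."""
    return (np.exp(rng.normal(0.0, sig_c)),
            rng.normal(0.0, sig_n),
            rng.normal(0.0, sig_q))
```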
A period in the model is set to last 6 years, which will allow for a smoother general evolution of assets than shorter periods.5 Households
choose a level of consumption and savings in each period. Younger households, in periods 1 through 3 (corresponding to ages 18–36), also choose how many additional children to have in each of those 6-year intervals. This in effect restricts households to having all their children by the age of 36. In addition, households in the model are further restricted to having no more than four children in any one of the 6-year periods, and no more than five children in total.6 Children are assumed to cost an amount per child, per period, C, for three periods (i.e., until they are 18), and the amount is allowed to differ by household education level. In the fourth period of a child's life (when they are 19–24), households have the option to offer to help pay for that child's college education, and it is assumed that parents only make these contributions to a particular child in this period of the child's life.7 At this time, the household chooses a one-time offer, o_t, of a per-year amount to contribute toward the child's college education. Given this offer, the child then attains a certain level of education. From the parent's view, a child's education level is a realization of a stochastic process that depends on the amount of support offered, along with other possible demographic variables. Let d_{jt} be the education outcome for child j in period t, where d_{jt} ∈ {high school or less, some college or 2-year degree, 4-year college degree or more}, as these are the categories that enter the utility function. This outcome is given by

\[
d_{jt} \sim D(d_{jt} \mid x_{jt})
\tag{3}
\]
so that djt is a realization of the conditional distribution D(djt|xjt), where xjt is a vector of characteristics. The stochastic process in Eq. (3) is specified as an ordered probit, and the vector of characteristics, xjt, includes a quadratic in the family contribution offer interacted with the level of the parents’ education, as shown in Appendix A. The amount the household actually pays toward a child’s college education is then a product of the peryear offer made and the number of years of schooling the child attends. Furthermore, the offer amount is binding (i.e., the parents cannot renege on the offer). In order to identify the impact of offers on education attainment, the model restricts parents to making the same offer to all children of eligible age in a given 6-year period. This is necessary because the data only report the amount parents actually contribute toward their children’s education. As such, there is no data for children who did not go to college, and yet it is probable that some of these children could have received support from their parents. The identification here comes from assuming that those children
receive the same offer of support as their siblings attending school within the same 6-year period. When looking at the NLS data, this does not seem overly restrictive, as most families do not greatly vary the amount they contribute toward their different children within a period. In fact, just over half made contributions that differed by less than $1,000 and almost 20 percent of contributions differed by less than $50. The restriction that children born within the same period receive the same offer, while providing identification for the impact of the offer, does imply that if offers positively impact the educational attainment of children, then one should observe that families with more children in college will on average be making larger contributions toward each of their children's education. It is not obvious that this should be the case. For example, a family with two children and both of them in college at the same time may have less money available to support either individually than if just one was attending college. Looking at the NLS data, though, for families with two children born within 6 years of each other, the average per-child, per-year contribution in 1999 dollars is $5,721 for those with both children attending college, and only $3,666 for those with just one child in college, which is consistent with the assumed restriction. Household income is assumed to be determined by

\[
I_t = \exp\{z_t \beta + \epsilon_{wt}\}
\tag{4}
\]

where the characteristic vector z_t contains dummies for the levels of education of parents interacted with a quadratic in age, as shown in more detail in Appendix A. This gives a standard semi-log specification for income. The income shock is assumed to be distributed iid:8

\[
\epsilon_{wt} \sim N(0, \sigma_w^2)
\tag{5}
\]
and the variance is further allowed to differ by household education as shown in Appendix A. Because the impact of marital transitions is not a focus of this chapter, marital status is assumed to be constant over the agent’s life. Without this assumption, the changes from single to married and vice versa will need to be modeled along with the associated changes to income and assets. Furthermore, the separate contributions toward their children’s education from divorced couples would also need to be modeled. Not only would this greatly increase the complexity of the model, but also the relevant data from one of the divorced parents (their contributions toward education as well as other data on asset accumulation and income) are not available, making it
impossible to estimate. Since the sample of stable continuously single parents in the NLS is so small, the model here will actually only consider continuously married families, which represent just under 60 percent of the NLS respondents. This assumption allows for a simpler model that focuses on the household savings and educational transfer decisions, and has the data needed for estimation, though it clearly may limit to some extent the inferences drawn from the results. The level of education of the household for earnings consideration is also not modeled as a dynamic decision, and as such is considered a constant household "type," identified as the highest level attained. At any period t, then, the household problem is to solve

\[
\max_{\{c_t, n_t, o_t\}} E\left[\sum_{\tau=t}^{T} \delta^{\tau-t}\,
U(c_\tau, n_\tau, q_{m\tau}, q_{h\tau}, \epsilon_{c\tau}, \epsilon_{n\tau}, \epsilon_{q\tau})
\,\Big|\, \Omega_t \right]
\tag{6}
\]

where δ is the discount rate and Ω_t is the state space at time t (i.e., the relevant information at time t for making the decision). The maximization is made subject to the following constraints:

\[
c_t = k_t(1+r) + I_t - C n_{ht} - \sum_{n=1}^{N_t} a_{nt} - k_{t+1}
\]
\[
c_t \ge 0, \qquad k_t \ge 0
\]
\[
n_{t+1} \le n_t + 4, \qquad n_t \le 5
\]
\[
4\, n_{4t}\, o_t \in \big[\,0,\; (1+r)k_t + I_t - C n_{ht}\,\big]
\tag{7}
\]

The first of these is the budget constraint, where k_t is the level of household assets at time t, r is the real interest rate, I_t is household income at time t, n_{ht} is the number of children living at home at time t, and a_{nt} is the amount the household pays toward child n's college expenses at time t. The second and third conditions imply that households have positive consumption and that they cannot take out uncollateralized loans, respectively. The fourth and fifth conditions restrict the child decisions as discussed above, namely that no more than four additional children may be born in a period and that the maximum family size is five children. In the final restriction, where n_{4t} is the number of children of age 4 (i.e., of college age) in period t and o_t is the per-year, per-child offer in period t (so 4 n_{4t} o_t is the total amount spent if all children of college age in the period attain a college degree), the offer is
restricted so that it must be nonnegative and, because the offer is binding, the parents are not allowed to offer more than can be covered without taking out an uncollateralized loan (i.e., by keeping their net assets positive). The decision timing of the model is as follows. At the beginning of each period the household receives a realization of ε_{ct}, ε_{nt}, ε_{qt}, and ε_{wt}. After the realization of these shocks, households receive income and pay expenses for children living at home and then make decisions for the period. In the first three periods, the choices of savings and the number of children to have in that period are made simultaneously. In periods 4–6, if parents have children of college age, the educational support offer and the savings decisions are made sequentially. First, parents choose a per-child, per-year amount to offer their children in educational support. Parents then realize the outcomes of their children's education choices and make the appropriate payments (to enforce the binding offer assumption) before proceeding to choose a level of savings and consumption.
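A minimal sketch in Python (hypothetical names and illustrative coefficients; not the author's implementation) of two pieces of the within-period problem described above: the budget identity and offer bound from Eq. (7), and ordered-probit education outcome probabilities in the spirit of Eq. (3):

```python
import numpy as np
from scipy.stats import norm

def next_assets(k, income, n_home, child_cost, college_paid, c, r):
    """Budget identity from Eq. (7): k_{t+1} implied by consumption c."""
    return k * (1.0 + r) + income - child_cost * n_home - college_paid - c

def max_offer(k, income, n_home, child_cost, n_college, r):
    """Largest per-year, per-child offer that keeps net assets nonnegative
    even if every college-age child completes a 4-year degree."""
    resources = (1.0 + r) * k + income - child_cost * n_home
    return max(resources, 0.0) / (4.0 * n_college) if n_college > 0 else 0.0

def education_probs(offer, parent_college, beta, cut1, cut2):
    """Ordered-probit probabilities over {no college, some college, 4-year
    degree}; the index is a quadratic in the offer interacted with parental
    education (the coefficient vector beta is purely illustrative)."""
    x = np.array([offer, offer ** 2, offer * parent_college,
                  offer ** 2 * parent_college, parent_college])
    idx = x @ beta
    p_no = norm.cdf(cut1 - idx)
    p_some = norm.cdf(cut2 - idx) - p_no
    return np.array([p_no, p_some, 1.0 - p_no - p_some])
```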
3.2. Model Solution

The household optimization problem defined in Eqs. (6) and (7) can be rewritten in a dynamic programming framework based on the value of entering period t with state space Ω_t, represented by the value function V_t that satisfies the Bellman (1957) equation as

\[
V_t(\Omega_t) = \max_{c_t, n_t, o_t}\big[\, U(c_t, n_t, q_{mt}, q_{ht}, \epsilon_{ct}, \epsilon_{nt}, \epsilon_{qt})
+ \delta E\big(V_{t+1}(\Omega_{t+1}) \mid \Omega_t, c_t, n_t, o_t\big) \big]
\]

The state space at time t consists of the beginning of period t assets, the number of children born in prior periods 1, 2, and 3, the number of children with mid- and high-level education, parent education level, and the realization of the income shock and taste shocks on consumption, children, and education. Given the current model assumptions and specification, however, the relevant choice set varies between different periods. For example, households only choose additional children in the first three periods. Furthermore, when children are going to college, households face an additional, within-period, sequential choice of an education support offer before making consumption decisions. So, the value functions can be more accurately defined by considering different periods separately.
For the first three periods, families simultaneously choose the number of children and consumption, giving value functions

\[
V_t(\Omega_t) = \max_{c_t, n_t}\big[\, U(c_t, n_t, 0, 0, \epsilon_{ct}, \epsilon_{nt}, \epsilon_{qt})
+ \delta E\big(V_{t+1}(\Omega_{t+1}) \mid \Omega_t, c_t, n_t\big) \big],
\qquad t = 1, 2, 3
\tag{8}
\]

where the expectation in Eq. (8) is with respect to the stochastic shocks ε_{ct}, ε_{nt}, ε_{qt}, and ε_{wt}. In periods 4–6, families no longer make a choice on having children. Instead, they first choose an offer of college support for children of college age, and after realizing the educational outcome and making appropriate payments, choose consumption. As such, the value function becomes

\[
V_t(\Omega_t) = \max_{o_t}\big[\, E_d\big(V_{t.5}(\Omega_{t.5}) \mid \Omega_t, o_t\big) \big]
\]
\[
V_{t.5}(\Omega_{t.5}) = \max_{c_t}\big[\, U(c_t, n_t, q_{mt}, q_{ht}, \epsilon_{ct}, \epsilon_{nt}, \epsilon_{qt})
+ \delta E\big(V_{t+1}(\Omega_{t+1}) \mid \Omega_{t.5}, c_t\big) \big],
\qquad t = 4, 5, 6
\tag{9}
\]

where the ".5" sub-period shows the sequential nature of the offer and consumption decisions. The E_d expectation is taken over the educational outcomes of the children of college age in period t. That outcome and the associated college payments are updated into state space point Ω_{t.5} before the consumption decision is made. After period 6, the only choice households make in a period is an amount to consume, so the value function is simply

\[
V_t(\Omega_t) = \max_{c_t}\big[\, U(c_t, n_t, q_{mt}, q_{ht}, \epsilon_{ct}, \epsilon_{nt}, \epsilon_{qt})
+ \delta E\big(V_{t+1}(\Omega_{t+1}) \mid \Omega_t, c_t\big) \big],
\qquad t > 6
\tag{10}
\]
In the final period, V_T consists only of the contemporaneous utility portion (i.e., V_{T+1}(·) = 0). The finite dynamic programming model laid out in Eqs. (8)–(10) can be solved using backward induction. However, since there is no analytic solution for this model, the solution is done with numerical techniques. The method used here is based on one proposed and used by Keane and Wolpin (1994, 1997, 2001). To solve the model you must determine the expectations of next period value functions, that is, E(V_{t+1}(Ω_{t+1})), which following Keane and Wolpin (1994) can be referred to as Emax_t. The expectation for Emax_t involves multiple integration and is approximated using Monte Carlo integration. Emax_t needs to be solved for every possible combination of
state space values in Ω_{t+1}. In this model, Ω_{t+1} ∈ R_+ × N × N × N × Q × Q, with assets being a positive real number, the number of children born in periods 1, 2, and 3 each an element of N = {0, 1, …, 4}, and the number of children with mid-level and high-level education each an element of Q = {0, 1, …, 5}. Since the household education level does not change, it can be suppressed from the state space and the model solved separately for each type. The specification restrictions in the model on the fertility process limit the number of possible different combinations of children born at different periods and their educational attainment in the state space to a manageable number of state points for which to solve Emax_t. However, since assets are a continuous variable, the solution still cannot be calculated for every state variable combination, and even if assets were to be discretized in a reasonable way, the resulting state space would still be very large. This dimensionality problem is exacerbated when the model is estimated because the solution must be recalculated many times during the estimation optimization. To overcome this, the model will be solved for a subset of asset values, along with all possible combinations of children and education levels, and least squares will be used to interpolate the remaining Emax_t values. For V_{t.5}(·) in periods 4–6 and for V_t(·) in periods other than 4–6, the interpolation is based on the regression of directly solved Emax values on the contemporaneous utility evaluated at the means of the stochastic components, and for V_t(·) in periods 4–6 on a quadratic in assets.9 Since there is no data on households beyond their late 50s, there is still a need to specify and estimate a terminal condition of some sort to fit and motivate the life-cycle decisions. The specification used here mimics a full life-cycle model, and has income after period 7 (after age 63) modeled as a portion of expected average lifetime income plus a random shock, and the final end point is set to T = 11 (corresponding to a model end at age 84). The portion parameter and shocks are allowed to differ by household education as shown in Appendix A, and these parameters dictating the terminal condition are estimated along with the others of the model.
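The two numerical devices described above can be sketched as follows (Python; hypothetical names, illustrative only): a Monte Carlo approximation of Emax_t at directly solved state points, and a least-squares regression used to interpolate Emax_t at the remaining asset values.

```python
import numpy as np

def emax_at_state(value_next, state, shock_draws):
    """Monte Carlo approximation of Emax_t = E[V_{t+1}(Omega_{t+1})]:
    average the next-period value function over simulated shock vectors."""
    return np.mean([value_next(state, eps) for eps in shock_draws])

def fit_emax_interpolant(asset_grid, emax_values, features):
    """Least-squares interpolating regression: regress directly solved Emax
    values on features of the state (e.g., a quadratic in assets)."""
    X = np.array([features(k) for k in asset_grid])
    coef, *_ = np.linalg.lstsq(X, np.asarray(emax_values), rcond=None)
    return lambda k: features(k) @ coef

def quad_features(k):
    """Example feature map: intercept plus a quadratic in assets."""
    return np.array([1.0, k, k ** 2])
```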
3.3. Estimation

The model is estimated using a simulated maximum likelihood procedure.10 For a given set of parameter values, a likelihood function value can be constructed based on simulations from the solution to the dynamic programming problem, and from this the optimal parameters can be found via numerical optimization. Several previous studies have utilized a similar
estimation procedure for discrete choice models.11 Since the model in this chapter contains two continuous choices, the exact same procedure cannot be followed.12 However, the same concept applies, but instead of a probability simulator, a nonparametric kernel density estimator is used with the simulated samples to construct the elements of the likelihood function. For a single household i, the data provide an observable sequence of state point values for assets, children, spending on college, and children's educational outcomes. Since the household decisions relating to these items in the model depend only on the current state space variables and exogenous, independently distributed shocks, i's contribution to the likelihood, L_i, can be written as a sequence of conditional densities:

\[
L_i = \prod_{t=0}^{T-1} f(\Omega_{i,t+1} \mid \Omega_{it})
\tag{11}
\]
where f(.) is the pdf of Otþ1 conditional on Ot. The sample likelihood is then calculated as the product of these individual likelihoods. Calculating the likelihood is still a problem because the functional form of f(.) is unknown. Given a set of the parameters, however, the model can be solved and therefore a sample of values for Otþ1, given a value of Ot, can be simulated. From this simulated sample, a density estimator can be calculated and used to estimate the value of f(Otþ1|Ot). There is a wide literature on density estimation and techniques for continuous and mixed variable distributions and the chapter here adopts a smooth kernel density estimator.13 Traditional kernel estimators can be inaccurate and difficult (if not impossible) to implement when used in higher dimension space. In the model here, however, no more than two observed state space values change between any two observed periods in which the conditional density must be estimated. The remaining values are fixed. As such, it is never necessary to estimate a joint density for more than two variables, which can be done accurately. After period 6, the number of children and their education levels are fixed and just assets change, requiring just a one-dimensional conditional density of next period assets. In the education periods, 4–6, the number of children is fixed, but there is an offer decision, and education levels and assets are changing. Since the decisions are made (and outcomes realized) sequentially, though, there are really three independent conditional densities for these periods. First, a one-dimensional density estimator is needed for the offer given the initial state. Next, a two-dimensional estimator is used for the changing mid- and high-levels educational attainment of children given the offer and initial state, and
finally a one-dimensional estimator of next period assets. In the first three periods, only the number of children born in the current period and assets are changing, requiring only a bivariate density estimator. For a univariate continuous density estimator, a standard Gaussian kernel is used:

\[
K(s, S_i; h) = \frac{1}{\sqrt{2\pi}} \exp\!\left(-\frac{1}{2}\left(\frac{s - S_i}{h}\right)^{2}\right)
\tag{12}
\]

with Silverman's (1996) plug-in value for an optimal window width for the Gaussian kernel, h = 1.06 σ̂ n^{−1/5}, as the smoothing parameter, where σ̂ is the standard deviation of the variable in the sample data and n here is now the sample size.14 The estimated density is then

\[
\hat{f}(s) = \frac{1}{nh} \sum_{i=1}^{n} K(s, S_i; h)
\tag{13}
\]

A bivariate product kernel form (see Scott, 1992) is used for bivariate densities:

\[
\hat{f}(s_1, s_2) = \frac{1}{n h_1 h_2} \sum_{i=1}^{n} K_1(s_1, S_{i1}; h_1)\, K_2(s_2, S_{i2}; h_2)
\tag{14}
\]

The bivariate densities in the likelihood are for either two discrete variables (the two educational outcome categories in periods 4–6) or mixed with a continuous variable (assets) and a discrete variable (new children in periods 1–3). For the continuous portion I again use a standard Gaussian kernel as in Eq. (12). For the discrete values, the variables here are not just categorical, but also are ordinal. To take advantage of this, I utilize a variant of the Habbema kernel, which was found to be highly effective by Titterington and Bowman (1985), with

\[
K(s, S_i; h) = \lambda^{|s - S_i|}
\tag{15}
\]

where λ here is a smoothness parameter. In this kernel, h is set to an appropriate weight by setting h = Σ_{j=0}^{J−1} λ^j, where J is the number of discrete distances possible between s and other data points in the sample. The amount of smoothing is then controlled by λ. For estimation, λ is set to 0.3.15 The estimation procedure works as follows. First, an initial guess of the parameters is made and the model is solved for these parameters. The value of the likelihood function is constructed using the model solution and
simulated density estimators, Eqs. (12)–(15). The likelihood is checked to see if it is maximized, and if not, the guess is updated and the procedure repeats itself. See Appendix B for more detail on these estimation steps.
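A minimal sketch in Python (hypothetical names; illustrative only) of the kernel density estimators in Eqs. (12)–(15) as they would be evaluated on a simulated sample when building the likelihood contributions in Eq. (11):

```python
import numpy as np

def gaussian_kernel(s, S, h):
    """Gaussian kernel of Eq. (12)."""
    return np.exp(-0.5 * ((s - S) / h) ** 2) / np.sqrt(2.0 * np.pi)

def silverman_h(S):
    """Silverman plug-in bandwidth h = 1.06 * sd * n^(-1/5)."""
    return 1.06 * np.std(S, ddof=1) * len(S) ** (-0.2)

def univariate_density(s, S):
    """f_hat(s) of Eq. (13) from a simulated sample S."""
    h = silverman_h(S)
    return gaussian_kernel(s, S, h).sum() / (len(S) * h)

def ordinal_kernel(s, S, lam, J):
    """Habbema-type kernel of Eq. (15) for an ordinal discrete variable;
    the weight h sums lam^j over the J possible discrete distances."""
    h = sum(lam ** j for j in range(J))
    return lam ** np.abs(s - S), h

def mixed_bivariate_density(s_cont, s_disc, S_cont, S_disc, lam, J):
    """Product-kernel estimate of Eq. (14) for one continuous and one
    ordinal discrete variable (e.g., next-period assets and new children)."""
    h1 = silverman_h(S_cont)
    k1 = gaussian_kernel(s_cont, S_cont, h1)
    k2, h2 = ordinal_kernel(s_disc, S_disc, lam, J)
    return np.sum(k1 * k2) / (len(S_cont) * h1 * h2)
```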
4. THE DATA 4.1. Sample Construction and Variable Definitions The data are taken from the Young Women cohort of the NLS, which consists 5,159 women who were 14–24 years old in 1968. Surveys were administered every year from 1968 through 1973 and basically biennially since. The results here use a sample including only the years through 1999 in order to allow for the baseline parameters of the model to be estimated for a time period where the existence of recent education policy initiatives would not have had a sizeable impact. When matching the 6-year period in the model with the data, the first period is matched to ages 18–23 in a household’s life, the second to ages 24–29, and so on. The age of the household is measured as the age of the women followed in the NLS data. While the NLS collects data on a wide variety of topics, the relevant data used in this chapter are on assets, income, children, education, marital status, children’s education, and spending on children’s education. Assets are measured as total net household assets, excluding vehicles. Comprehensive questions on assets were only asked in 1968, 1971–1973, 1978, 1983, 1988, 1993, 1995, 1997, and 1999. If more than one observation of assets is available within a 6-year period of the model, the earlier dated value is used because the model treats the assets as the assets available when entering a period. Assets are frequently subject to measurement error when information is collected by survey, so a random measurement error is specified as shown in Appendix A. Income is measured as the total income of a woman and her spouse. Income information is collected with every survey, though with some variation in detail. To fit the 6-year periods, the income amount is calculated as six times the average of the annual income observations within the appropriate age range. Education information for the respondents and spouses is updated in every survey, along with marital status. The parent’s education enters the model in the wage equation and as a determinant of the education attainment of children. Since the model assumes that the level of education is constant for a household’s lifetime, the household education level is
measured as the highest level of education attained by either parent in a household. The education level is then grouped into one of two categories, those with a college degree, and those without, giving two education ‘‘types’’ of households, with 56 percent having no degree. The limit to two education categories is because the sample sizes within more subdivided education groups become very small. Marital status, like parental education, is also assumed to be a constant in the model. As explained previously, the sample is restricted to continuously married households, which are defined as the respondent being married before age 36 and in no subsequent interview after getting married are they no longer married. This should be taken into account when trying to draw too great an implication from the exact point estimates of the model, though perhaps general implications would be robust even to some less stable households. In every interview, information is collected for each of the respondent’s children living in the household, including age but not education. Furthermore, in 1999 a full child roster was collected, which includes children’s birth-dates, and level of education. From this information, children can be identified as being born in the first, second, or third period of the model and the appropriate educational attainment category can also be assigned for each child. As previously mentioned, the model assumes households can only have children in the first three 6-year periods, and that they have no more than five children, and no more than four within a given period. Only 10.78 percent of the NLS sample violates one or more of these three restrictions and will not be used. This restriction, while perhaps not a large loss of data, should be kept in mind when considering the types of households for which the results are estimated. Beginning in 1991 and subsequently in 1993, 1995, 1997, and 1999, the survey includes questions about the college enrollment of children and the amount of financial support parents provide toward college for each child within the past 12 months, though there is no data on the total amount spent on a child’s postsecondary education. The offer in the model is measured as the average annual contribution parents made toward postsecondary education for children born in the appropriate period. The amount of total education spending is then computed as four times the offer for children earning a 4-year college degree, and two times the offer for those children with the some college. While this total is not matched to any particular figure in the data, it is used to update the household’s evolving assets. The original NLS of Young Women sample began with 5,159 respondents. The sample used in this chapter is restricted to those women who remained in the data set through 1999 and reported information on all
relevant variables. One impact of this is that it disproportionately excludes the older women in the sample, because data on contributions toward their children's education were only gathered starting in 1991, meaning some of these women had children already educated before any information on their contributions was collected. However, there is no reason to suspect that these women are systematically different from the slightly younger women in the cohort. The sample is further limited to those women who meet the fertility constraints and stable marriage criteria outlined above, and outlier observations for assets were removed.16 This leaves a sample of 556 households with 3,139 household-period observations.
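A minimal sketch in Python (hypothetical names; illustrative only) of the variable construction described above: mapping ages to 6-year model periods, forming period income as six times the average annual observation, and taking the earliest asset observation as entering assets.

```python
def model_period(age):
    """Map the respondent's age to a model period: 18-23 -> 1, 24-29 -> 2, ..."""
    return (age - 18) // 6 + 1

def period_income(annual_incomes):
    """Period income = six times the average of the annual observations
    falling inside the 6-year window."""
    return 6.0 * sum(annual_incomes) / len(annual_incomes)

def entering_assets(asset_obs_by_year):
    """Use the earliest dated asset observation within the period as the
    assets available when entering the period."""
    first_year = min(asset_obs_by_year)
    return asset_obs_by_year[first_year]
```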
4.2. Descriptive Statistics Table 1 shows some descriptive statistics for the data. Asset accumulation shows a typical average increase with age and a large variance, with households with a college degree having higher assets. When looking at the number of children, less-educated households have more children, averaging 2.4 per household versus 2.2 for households with a college degree. Not surprisingly, the timing of when to have children is also very different. College-educated households have very few children during ages 18 through 23, averaging 0.5 per household. Conversely, this is when households without a college degree have the most children, averaging 1.2 children per household. At older age ranges, it is the college-educated households who have more children. For the entire sample, 82 percent of households with children attending college helped pay for college. This rises to almost 95 percent for parents with a college degree, and is still 68 percent for parents without a college degree. The average per-year contribution made by families to financially support their children in college was $4,274 in 1999 dollars ($2,054 for parents without a college degree and $6,336 for those with a college degree). This indicates a significant amount of money going toward higher education, especially when multiplied over several years and several children. Looking at the distribution of education attainment for children overall, 42 percent do not attend college, 36 percent attend some college but do not earn a 4-year degree, and 22 percent earn a 4-year degree or more. This varies by parental education, with the percent earning a 4-year degree rising to 37 percent for parents with a college degree, while falling to 13 percent for parents without a college degree. This relationship between parents and
Table 1. Descriptive Statistics.

                                            All                No College         College
Number of households                        556                311                245
Assets at the age (a,b)
  24–29                                     24,140 (12,140)    19,291 (5,628)     30,296 (17,887)
  30–35                                     59,858 (41,957)    45,943 (28,166)    77,523 (56,912)
  36–41                                     85,820 (56,331)    62,472 (38,024)    115,458 (80,958)
  42–47                                     131,628 (85,364)   89,729 (57,938)    184,814 (137,546)
  48–53                                     218,947 (133,470)  157,730 (85,000)   304,100 (213,700)
Children (a)                                2.313 (1.213)      2.395 (1.293)      2.208 (1.098)
  Children born when 18–23                  0.896 (1.040)      1.225 (1.110)      0.478 (0.761)
  Children born when 24–29                  0.890 (0.878)      0.826 (0.863)      0.971 (0.894)
  Children born when 30–35                  0.527 (0.826)      0.344 (0.714)      0.759 (0.898)
Contributions to college (a,b)              4,273 (5,649)      2,054 (3,157)      6,336 (6,603)
Percent contributing to college             81.71%             67.72%             94.71%
Children education attainment
  Percent with no college                   42.37%             52.49%             23.10%
  Percent with some college                 35.73%             33.72%             39.56%
  Percent with 4-year degree or more        21.90%             13.79%             37.34%

(a) Means with standard deviations in parentheses.
(b) Dollar amounts are 1999 dollars.
children’s education has been consistently documented (see Haveman & Wolfe, 1995 for a review).
5. PARAMETER ESTIMATES AND MODEL FIT

The estimated model parameters are reported in Table 2. There are a total of 49 parameters in the specification as summarized in Appendix A, estimated over the 3,139 household-period observations in the sample. A few of the
parameter estimates have a direct interpretation of some interest and provide an indication of the appropriateness and fit of the model. The discount rate for a 6-year period, δ, is estimated at 0.8185, which is the equivalent of 0.967 per year, a reasonable discounting rate. The child-cost estimates, C_h for parents without a college degree and C_h + C_c for parents with a college degree, are $42,403 and $79,396, respectively, per 6-year
Table 2. Parameter Estimates.

Parameter estimates with standard errors in parentheses.

Utility function
  γ        2.7463      (0.0262)
  λ_1h     0.5350 (a)  (0.3537)
  λ_11h    6.2868 (a)  (0.0557)
  λ_12h    8.6061 (a)  (1.7222)
  λ_13h    1.7350 (a)  (0.2395)
  λ_1c     0.2882 (a)  (0.0812)
  λ_11c    0.8445 (a)  (0.1068)
  λ_12c    3.1622 (a)  (0.3211)
  λ_13c    4.2669 (a)  (1.4495)
  λ_2h     0.7749 (a)  (0.5963)
  λ_2c     0.1622 (a)  (0.4515)
  λ_3      0.0011 (b)  (0.0054)
  α_1h     0.0099 (a)  (0.0000)
  α_1c     0.0059 (a)  (0.0001)
  α_2h     0.9071 (b)  (1.1773)
  α_2c     0.5393 (b)  (0.5104)
  θ_1h     0.0228 (a)  (0.0001)
  θ_1c     0.0103 (a)  (0.0002)
  θ_2h     1.3717 (b)  (0.6621)
  θ_2c     0.7994 (b)  (1.1186)

Children's education
  μ_1      0.0722      (0.0123)
  μ_2      1.3001      (0.0155)
  κ_1      1.1021 (c)  (0.0869)
  κ_2      3.8651 (a)  (1.5151)
  κ_3      0.4039      (0.0157)
  κ_4      0.7197 (c)  (0.0533)
  κ_5      0.0989 (a)  (0.7014)

Income
  β_0h     11.1928     (0.2754)
  β_1h     0.3275      (0.0382)
  β_2h     0.9272      (0.0025)
  β_3h     0.3938      (0.0952)
  β_0c     0.1705      (0.1891)
  β_1c     0.0068      (0.0303)
  β_2c     0.9460      (0.0002)
  β_3c     0.1911      (0.0734)
  b_h      0.9663      (0.3240)
  b_c      0.0101      (0.1945)
  σ_wh     0.5451      (0.0109)
  σ_wc     0.8381      (0.0023)
  σ_rh     22,896      (4,031)
  σ_rc     1.8436      (0.2690)
  σ_Z      0.0301      (0.0911)

Other parameters
  δ        0.8185      (0.0113)
  C_h      42,403      (9,504)
  C_c      36,993      (5,827)

Error distribution
  σ_c      0.0998      (0.0068)
  σ_n      8.0146 (a)  (0.9691)
  σ_q      1.4151 (c)  (0.8247)

(a) Parameter multiplied by 10^9.
(b) Parameter multiplied by 10^12.
(c) Parameter multiplied by 10^4.
periods. This is reasonably in line with other estimates. For example, adjusting the estimates based on USDA surveys (see Lino, 1996) to 6-year amounts gives $35,630 for households in the lowest third of income, $48,410 for households in the middle third of income, and $70,610 for households in the highest third of income. The estimate for γ, the coefficient of relative risk aversion, is 2.8, which is within the range of prior estimates, though those estimates generally range quite widely. Last, the estimated taste parameters for children's education, while not of interest in size themselves, are suggestive in their relation between college-educated households and less-educated households. The utility of additional education is greater for less-educated households (i.e., α_1c < 0, θ_1c < 0), but the decrease in marginal utility is more rapid for such households (i.e., θ_2c > 0).

Table 3 shows the predicted probabilities, based on the estimated model parameters, for a child's education level for different amounts of parental financial support and different parental education levels. These fit with the well-established link that more educated parents tend to have higher educated children at all levels of financial support. For example, evaluated at the entire sample annual average contribution of $4,273 per year, children with a college-educated parent have a 31 percent chance of earning a 4-year degree versus 18 percent for children without a college-educated parent. The gap is even wider when factoring in the fact that more educated households contribute more money toward their children's education. For example, the probability of a child earning a 4-year degree is only 14 percent for parents without a college degree contributing the annual average of $2,054 for that group. For college-educated parents contributing the annual average of
Table 3. Estimated Education Outcome Probabilities.

                                              Parental Contribution
                                  $0        $2,054    $4,273    $6,336    $10,000
Parents without college degree
  Probability no college          52.88%    44.52%    37.14%    31.88%    26.00%
  Probability some college        37.44%    41.70%    44.45%    45.66%    46.06%
  Probability 4-year degree        9.68%    13.78%    18.41%    22.45%    27.94%
Parents with college degree
  Probability no college          37.00%    29.40%    23.21%    19.08%    14.75%
  Probability some college        44.49%    45.97%    45.79%    44.72%    42.42%
  Probability 4-year degree       18.51%    24.63%    31.00%    36.20%    42.84%
$6,336 for that group, the probability of a child earning a 4-year degree rises to 36 percent, an increase of 22 percentage points.

If increasing parental contributions are indeed to have an impact on educational attainment, then the model estimates should indicate that greater financial support has a significant impact on the education outcome probabilities. As seen in Table 3, increasing the parental contribution does change the probability distribution of education outcomes and raises the probability of a child earning a 4-year degree for all households. Evaluated at annual support of $0, $2,054, $4,273, $6,336, and $10,000 (the means for no-college households, the entire sample, and college households, plus two more extreme values), the probability of earning a college degree increases from 10 to 13, 18, 22, and up to 28 percent for children without a college-educated parent. A similar increase is seen at those support levels for children of parents with a college degree, rising from 19 to 25, 31, 36, and 43 percent, respectively.17 This suggests that a policy that successfully increases parental contributions to education should also increase the final educational attainment of children, though the marginal impact is nonlinear (notice that the almost $4,000 increase from $6,336 to $10,000 increases the probability of a college degree by about the same amount as the prior $2,000 increase) and the overall impact of such policies would need to take into account all of the related decisions in the model.

Before turning to the policy simulations to evaluate educational savings accounts, consider some information on the fit of the model. Table 4 presents some summary statistics for the actual data and as predicted by a simulated sample of 10,000 households based on the model and parameter estimates. As the table shows, the model does fairly well in matching the average characteristics in the data. The model captures the increasing accumulation of assets, though it slightly understates average asset levels for younger households, overstates them in middle age, and understates them again as households move into their 50s. Still, the mean levels are not too far off in general, and only for young households with a college degree can we reject the null hypothesis of equal means at a 10 percent level. The model also does fairly well in predicting the average number of children, 2.31 in the data versus 2.28 in the simulation, and does equally well for different ages and education levels, and at no point can we reject the null hypothesis of equal means. The model does a reasonably good job in matching the percentage of children with different levels of education, and statistically you cannot reject the hypothesis they are the same distribution. The model simulation also reasonably matches the average amounts contributed by households to their
Table 4. Actual and Predicted Outcomes.

                                                Actual      Predicted   P-Value
All households
  Assets (a) (mean by age)
    24–29                                       24,140      21,537      (0.118)
    30–35                                       59,858      57,374      (0.416)
    36–41                                       85,820      86,872      (0.921)
    42–47                                      131,628     138,429      (0.296)
    48–53                                      218,947     216,318      (0.996)
  Children (mean)                                2.313       2.284      (0.279)
  Children born when 18–23 (mean)                0.896       0.888      (0.523)
  Children born when 24–29 (mean)                0.890       0.879      (0.708)
  Children born when 30–35 (mean)                0.527       0.521      (0.716)
  Contributions to college (a) (annual mean)     4,273       4,414      (0.296)
  Percent contributing                          81.71       74.45       (0.000)
  Children education attainment (percent)                               (0.378)
    No college                                  42.37       40.56
    Some college                                35.73       36.28
    4-year degree or more                       21.90       23.16

By parent education
Households without college degree
  Assets (a) (mean by age)
    24–29                                       19,291      18,778      (0.912)
    30–35                                       45,943      43,647      (0.423)
    36–41                                       62,472      60,422      (0.576)
    42–47                                       89,729      98,867      (0.196)
    48–53                                      157,730     155,947      (0.889)
  Children (mean)                                2.395       2.370      (0.457)
  Children born when 18–23 (mean)                1.225       1.236      (0.849)
  Children born when 24–29 (mean)                0.826       0.803      (0.588)
  Children born when 30–35 (mean)                0.344       0.332      (0.644)
  Contributions to college (a) (annual mean)     2,054       2,215      (0.523)
  Percent contributing                          67.72       62.37       (0.137)
  Children education attainment (percent)                               (0.916)
    No college                                  52.49       52.41
    Some college                                33.72       34.24
    4-year degree or more                       13.79       13.35

Households with college degree
  Assets (a) (mean by age)
    24–29                                       30,296      26,151      (0.081)
    30–35                                       77,523      75,998      (0.734)
    36–41                                      115,458     118,981      (0.719)
    42–47                                      184,814     190,731      (0.501)
    48–53                                      304,100     300,142      (0.772)
  Children (mean)                                2.208       2.119      (0.404)
  Children born when 18–23 (mean)                0.478       0.445      (0.241)
  Children born when 24–29 (mean)                0.971       0.967      (0.937)
  Children born when 30–35 (mean)                0.759       0.761      (0.947)
  Contributions to college (a) (annual mean)     6,336       6,410      (0.889)
  Percent contributing                          94.71       83.41       (0.000)
  Children education attainment (percent)                               (0.618)
    No college                                  23.10       24.07
    Some college                                39.56       38.40
    4-year degree                               37.34       36.99

P-values are the probability of a type-1 error for a t-test with unequal variances for differences in means, and for a chi-square test of the same distribution for education attainment.
(a) Dollar amounts in 1999 dollars.
children’s college education. As Table 4 shows, overall the simulated sample average is $4,414 per-child, per-year, while the actual data average is $4,273, but the difference is statistically not significant. The mean comparison is equally good when looking within different education levels. However, the model does noticeably under-predict the percentage of parents contributing (74 percent predicted vs. 82 percent in the data). The discrepancy is statistically significant and holds across parent education levels. A further investigation into the distribution of the offers shows higher variation in offers predicted versus the data. In particular, in many cases the model tends to predict an offer level of zero when the data often have a very small, but positive offer, and the model tends to predict just slightly more high-end offers.
6. POLICY SIMULATION

This section uses the model parameter estimates and a simulation of 10,000 households, proportioned by education as in the data, to examine how their savings and educational contribution decisions would change, and the impact of these changes on children's education attainment, when different policies are implemented.18 In particular, a tax-advantaged ESA is introduced. Money invested in the ESA is allowed to grow tax-free and the earnings are not taxable as long as they are spent on education.19
Any amount not spent on education becomes taxable. Since taxes are not directly modeled here, the impact of the tax break is modeled as an increase in the return on savings equal the household marginal tax rate.20 The penalty for assets in the ESA not spent on education is an amount equal to the current marginal tax rate times the account balance. Parents can only contribute to these accounts if they have children and are not allowed to contribute more than $7,500 per-child, per-year. For comparison, both a direct college subsidy and a college spending tax credit are also considered. The subsidy is a flat $1,000 education grant. The impact in the model is approximated as a default offer to all children, so that, for example, when parents offer nothing, the child still has a $1,000 offer. Finally, a tax credit available on all money spent on children’s education is simulated. With no direct taxes in the model, the impact is that parents don’t actually have to pay part of their offered contribution when children go to school, the difference being equal to their marginal tax rate times the contribution. Table 5 presents simulated sample statistics for the different policies, with the base simulation being the original model as estimated and discussed above. The ESA actually gives the largest increase in contributions from parents. Over all households, the average annual, per-child parental contribution increases by $2,097. The increase is larger for college-educated households, but is still a sizeable $1,603 increase for less-educated parents. The increase is not just from giving a tax break to households already contributing toward college expenses. All households groups show an increase in the percentage of parents contributing: from 84 to 91 percent for parents with a college degree and from 62 to 76 percent for parents without a college degree. Furthermore, the accounts do actually generate a net increase in savings as can be seen in Table 5, which also shows that the ESAs reach a sizeable average of $48,000 at their peak before parents enter their 50s. The policy goal, presumably though, is to increase education attainment, not just contributions. With the higher contributions and higher contribution rate, the percentage of children earning a 4-year degree increases 4.78 percentage points, while the percentage without any college education falls by 4.68. The impact, particularly on the probability of a 4-year degree, is slightly larger for college-educated households, at a 5.69 increase. The cost of this program, in the form of lost revenue from taxable earnings on the education savings, comes to almost $6,000 per household. However, much of this lost revenue is on earning from savings that were not present from the base scenario, making this estimate hard to compare. If the amount of savings is held to the base simulated levels, the lost revenue is just $2,893 per household.
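As a rough sketch of how the three policy experiments enter the model, the functions below restate the transformations just described. The function names, the additive treatment of the grant, the multiplicative reading of the ESA return adjustment, and the 28 percent marginal rate are all illustrative assumptions on my part; the chapter itself uses the 1999 federal tax brackets for married households filing jointly (see note 20).

```python
# Illustrative parameters only; 0.28 is a placeholder marginal rate.
MARGINAL_TAX_RATE = 0.28
ESA_CAP_PER_CHILD = 7_500  # maximum annual ESA contribution per child

def esa_return(r, tau=MARGINAL_TAX_RATE):
    # ESA: earnings grow tax-free, modeled here as a return on educational
    # savings that exceeds the ordinary return by the marginal tax rate
    # (one possible reading of the text; the exact functional form is assumed).
    return r * (1.0 + tau)

def esa_penalty(unspent_balance, tau=MARGINAL_TAX_RATE):
    # ESA balances not spent on education are taxed at the current marginal rate.
    return tau * unspent_balance

def offer_with_grant(parent_offer, grant=1_000.0):
    # Universal grant: every child has a default $1,000 offer on top of
    # whatever the parents contribute.
    return parent_offer + grant

def parent_outlay_with_credit(contribution, tau=MARGINAL_TAX_RATE):
    # Tax credit on education spending: parents effectively pay only
    # (1 - tau) of the contribution their child takes up.
    return (1.0 - tau) * contribution
```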
Table 5. Policy Simulation Outcomes.

All households
                                             Base       ESA        Grant      Tax Credit
  Annual contributions to college (mean)     4,414      6,511      3,696      5,763
  Percent of parents contributing            71.55      82.88      55.21      77.55
  Children education (percent)
    No college                               40.56      35.88      35.81      38.10
    Some college                             36.28      36.18      37.67      36.64
    4-year degree or more                    23.16      27.94      26.52      25.35
  Children (mean)                             2.28       2.34       2.29       2.28
  Total assets (mean by age)
    24-29                                   21,573     24,379     21,086     22,015
    30-35                                   57,374     75,125     57,934     59,101
    36-41                                   86,872    101,277     84,956     88,854
    42-47                                  138,429    149,472    139,246    138,022
    48-53                                  216,318    224,015    217,818    218,997
  Educational savings under ESA (mean by age 24-29 / 30-35 / 36-41 / 42-47 / 48-53):
    1,511 / 13,447 / 31,638 / 48,573 / 36,216

By parent education

Household without college degree
  Parental contributions to college (mean)   2,215      3,818      1,372      2,966
  Percent of parents contributing            62.37      76.10      44.12      70.26
  Children education (percent)
    No college                               52.41      47.90      46.15      49.26
    Some college                             34.24      35.23      37.13      35.32
    4-year degree or more                    13.55      16.87      16.72      15.42
  Children (mean)                             2.37       2.42       2.39       2.37
  Total assets (mean by age)
    24-29                                   18,778     22,278     19,647     19,024
    30-35                                   43,647     53,222     44,005     44,291
    36-41                                   60,422     78,654     58,851     59,921
    42-47                                   98,867    108,638     98,807     97,553
    48-53                                  155,947    160,272    156,264    154,997
  Educational savings under ESA (mean by age 24-29 / 30-35 / 36-41 / 42-47 / 48-53):
    685 / 7,267 / 26,002 / 38,814 / 20,370

Household with college degree
  Parental contributions to college (mean)   6,410      8,939      5,924      8,007
  Percent of parents contributing            83.41      91.21      72.10      86.04
  Children education (percent)
    No college                               24.07      20.11      21.48      23.31
    Some college                             38.40      37.21      38.65      37.51
    4-year degree or more                    36.99      42.68      39.87      39.18
  Children (mean)                             2.12       2.20       2.11       2.12
  Total assets (mean by age)
    24-29                                   26,151     27,011     25,656     26,959
    30-35                                   75,998     98,129     76,624     79,824
    36-41                                  118,981    128,518    117,644    122,351
    42-47                                  190,731    202,620    191,449    191,728
    48-53                                  300,142    312,605    301,333    302,975
  Educational savings under ESA (mean by age 24-29 / 30-35 / 36-41 / 42-47 / 48-53):
    2,202 / 18,531 / 38,532 / 62,041 / 51,268
For a comparison of the magnitude of these results, consider also the grant and tax credit. These policies are much more in line with those examined in previous studies discussed in the introduction, though again the model here more directly considers the impact on parental behavior. The $1,000 grant has the smallest impact on the average amount contributed by parents, and the average is lower than in the base simulation. However, that is to be expected since every child is in effect already receiving a $1,000 peryear contribution. Interestingly, the average contribution from parents does not decline by a full $1,000 and on net the annual per-child contribution increases about $300. The percentage of parents contributing also falls noticeably, but again 100 percent of children are receiving a $1,000 contribution from the new grant. So while only 55 percent of parents contribute above and beyond the grant, a decline of just over 16 percentage points from the percentage contributing before the grant, there is still a sizeable impact on the distribution of education outcomes for children. The percentage of children with no college education falls by almost 5 percentage points. This is consistent with the 4 and 4.5 percent attendance increases
estimated by Dynarski (2002) and Ichimura and Taber (2002), respectively, and gives some additional validation for the model. The impact is actually greater for children from less-educated households, where percentage with no college education falls over 6 percent, while the percentage with a 4-year college degree rises by 3.4 percentage points. The impact on education outcomes here does differ from the ESA. In particular, the grant appears slightly more effective at reducing the probability of children having no college education at all, particularly for less-educated households. However, the ESA has a larger impact on increasing the probability of earning a 4-year degree, as it increased the number of households making sizeable contributions of support. The expected cost of the grant here comes to $4,146 per household, equal to the middle of the range of the estimated costs of the ESA. The tax credit generates a larger increase in average contributions than the grant, but still less than the ESA. It also increases the percentage of parents contributing, but again not by as much as the ESA policy and the net impact on education outcomes is the lowest here. Compared to the grant, which had a somewhat larger impact on less-educated households, the tax credit has a larger impact on households where parents have a college degree. Still, the impact is less on average for all households, and the expected cost of the credit is $5,775 per household, as costly as the highest cost range estimate for the ESA policy.
7. CONCLUSION The model presented here gives structural estimates of a dynamic, life-cycle model with endogenous choices of having children, spending on children’s college education, savings, and consumption. The model also allows for borrowing constraints, uncertain lifetime income, and allows for heterogeneity between parents with different levels of education. The model is solved numerically and estimated with a simulated maximum likelihood procedure using data from the NLS. The estimated model generally captures the feature of the data, including that most parents contribute toward their children’s college education and that the amounts are sizeable, averaging over $4,000 per-year, per-child. The estimated model is then used to run policy experiments to gauge the impact on children’s educational attainment of programs aiming to increase parental support. The policy simulations suggest that a tax-advantaged ESA would generate new savings and have a sizeable impact on both parental contributions and education attainment. The average increase over the
base simulation for contributions was over $2,000 per-year, per-child, and there was an increase in the probability of earning a 4-year degree of over 4 percent on average. The impact was slightly greater for households where the parent had a college degree, but the finding was still a 3.36 percent increase for less-educated households. The impact of the savings account was generally greater than a $1,000 universal grant or a traditional tax credit on money spent on education. The universal grant did have a slightly larger impact at reducing the percentage of children having no college education, particularly among less-educated households, though the ESA still had as large an impact, if not larger, on the probability of earning a 4-year degree, and the estimated cost of the ESA program was not greater than the grant. A traditional tax credit for education expenses underperformed the other two policies on all measures. The limitations of the model used here, though, should be factored in when considering further implications of these results. Due to data limitations and computational complexity, the model was only estimated for two-parent households and so the results, particularly any exact estimates of the impact of the policies considered, are not universal. However, it seems likely that the general relative implications would be robust to a broader sample and more complex model, with the results here being at least a good indication, especially considering the current lack of evaluation of the educational impact of ESAs. The model also holds constant a variety of factors that might also change as parents spend more on college education. In particular, there is no specific modeling of the impact of parental contributions on student financial aid or the price of education, outside of the generally estimated relationship between parental support and student educational attainment, which does show a diminishing marginal impact. In this aspect, the results could overstate the impact of increased parental support. However, it is not clear that would be the case, as such effects might already be captured in the marginal impact estimated, and might additionally just be played out in the redistribution of college attendance between schools of different costs and qualities, as the analysis here does not allow for such differentiation of school type and there is evidence of substitution between private and public 4-year schools in response to price. Of course, a more detailed interaction of the parents and children’s joint decision would be a useful extension for future research. Still, the general conclusions regarding the success of ESAs in generating new savings, increasing parental support for education both in amount and percentage contributing, and the resulting increase in the probability of children earning a higher-education degree are worth noting, along with its generally strong performance on these aspects relative to some other policy options.
NOTES 1. If you choose to use the money somewhere other than one of the state’s public schools, you will get the amount you contributed to the account, but often without any earnings. 2. For example, see Hubbard and Skinner (1996), Poterba, Venti, and Wise (1996) and Engen, Gale, and Scholz (1996). 3. See, for example, Browning and Lusardi (1996), Coleman (1992), and Keane and Wolpin (2001). 4. The random shocks in the utility function allow for unobserved variation in behavior necessary for estimation, otherwise the model gives a unique choice for a given set of state space values. The parameterization allows for complementarities between consumption and the number of children (i.e., l3), but in the interest of parsimony, such complementarities were not included for the quality variables. In a robustness check, inclusion of such parameters was found to be statistically insignificant. 5. The 6-year period closely corresponds with the frequency that asset information is collected and the general biennial data collection for the NLS as will be explained later. Furthermore, the longer periods will greatly reduce the computational complexity of the model solution. 6. These restrictions impact only a small portion of the data and significantly ease the computational burden of solving the model. Within the NLS data, only 7.18% of households have children when older than 36, only 1.94% have more than 4 children in any single 6-year period and only 4.27% have more than 5 children. In all, only 10.78% of the NLS sample violates one or more of these three restrictions. 7. This assumption again significantly eases the computability of the solution and does not impact very many households. In the NLS data only 14.45% of children attending school received support after age 24, and for most of these that support came within the next 2 years. 8. While earnings certainly may be argued to show short-term persistence, the iid assumption is not unreasonable for the longer 6-year periods of this model. 9. Keane and Wolpin (1994) find this style of approximation works well, and investigation in a simplified framework here suggested these interpolations fit well. 10. For a nice summary of simulation-based estimation techniques, see Gourieroux and Monfort (1993). 11. Probability estimators utilizing simulated samples generated from the model are used to estimate discrete choice probabilities of outcomes and, therefore, generate the likelihood that is maximized. For examples of such simulators, see McFadden (1989), Geweke, Keane, and Runkle (1994), Stern (1994), and Keane and Wolpin (1997). 12. Keane and Wolpin (2001) develop an expanded approach for a model that maintains a discrete choice set but allows for additional continuous but unobserved outcome variables. 13. For a summary see Silverman (1986). 14. This choice of a Gaussian kernel was just for its broad use and understanding, and a value for the smoothing parameter (and slight variants) is also proposed and discussed in Ha¨rdle (1991) and Scott (1992). In a comparative simulation study of kernel methods, Bowman (1985) finds that such a plug-in parameter selection, while simple, performs well even when compared to more advanced smoothing parameter selection methods. Given the computational difficulty in solving the model itself, no iterative
procedure to optimize the smoothing parameter was done, though estimation with a few similar fixed alternatives did not qualitatively change the general results; for the same reason alternative kernels to the Gaussian were not systematically investigated. 15. Setting l ¼ 0 results in a histogram and setting l ¼ 1 results in an equal density estimate of 1/J for all values. 0.3 is utilized by Titterington and Bowman (1985) and as with their studies, changing this within a reasonable range does not qualitatively impact the estimates. For more discussion on kernel methods for discrete distributions refer to Aitken (1983). 16. The upper and lower truncation points for outliers differed for each period. In total, 21 observations from below and 28 from above were cut. 17. The full probability distributions at the different offers shown in Table 3 are statistically significantly different, though point probabilities for specific outcomes, most notably some college, may not be. The probability of some college may increase or decrease as offers increase depending on the net impact of the decrease in the probability of no college and the increase in the probability of a 4-year degree. 18. While the tax reform act of 1997 did introduce some policies such as these, their 6-year incremental life-cycle paths up to 1999 would not have been noticeably impacted, assuming that the policies changes were generally unexpected. 19. For this experiment, there are now two types of assets: regular savings and educational savings. The addition of the second asset class does create an additional, continuous, state variable in the model. This creates a much larger state space across which the problem must be solved, though the extended model does not need to be estimated. The same solution procedure is used as before, with the smoothing regressions applied across both asset classes. 20. For tax rates, I use the federal 1999 tax brackets for the marginal rates on married households filing jointly.
ACKNOWLEDGMENTS

The author would like to thank the participants of the 8th Annual Advances in Econometrics Conference, two anonymous reviewers, Ken Wolpin, and seminar participants at the University of New Orleans and Oklahoma State University for helpful comments in writing this chapter.
REFERENCES

Acton, F. S. (1990). Numerical methods that work. Washington, DC: Mathematical Association of America.
Aitken, C. G. G. (1983). Kernel methods for the estimation of discrete distributions. Journal of Statistical Computation and Simulation, 16, 189-200.
Bellman, R. (1957). Dynamic programming. Princeton, NJ: Princeton University Press.
Bowman, A. W. (1985). A comparative study of some kernel-based nonparametric density estimators. Journal of Statistical Computation and Simulation, 21, 313-327.
Browning, M., & Lusardi, A. (1996). Household saving: Micro theories and micro facts. Journal of Economic Literature, 34, 1797-1855.
Choy, S. P., Henke, R. R., & Schmitt, C. M. (1992). Parental financial support for undergraduate education. Washington, DC: U.S. Department of Education, National Center for Education Statistics.
Coleman, A. (1992). Household savings: A survey of recent microeconomic theory and evidence. Working Paper No. 9808, New Zealand Treasury.
College Board. (2006a). Trends in college pricing. Washington, DC: The College Board.
College Board. (2006b). Trends in student aid. Washington, DC: The College Board.
Dynarski, S. (2000). Hope for whom? Financial aid for the middle class and its impact on college attendance. National Tax Journal, 53, 629-661.
Dynarski, S. (2002). The behavioral and distributional implications of aid for college. American Economic Review, 92, 279-285.
Engen, E. M., Gale, W. G., & Scholz, J. K. (1996). The illusory effects of saving incentives on saving. Journal of Economic Perspectives, 10, 113-138.
Geweke, J., Keane, M., & Runkle, D. (1994). Alternative computational approaches to inference in the multinomial probit model. Review of Economics and Statistics, 76, 609-632.
Gourieroux, C., & Monfort, A. (1993). Simulation-based inference: A survey with special reference to panel data. Journal of Econometrics, 59, 5-33.
Härdle, W. (1991). Smoothing techniques with implementation in S. New York: Springer-Verlag.
Haveman, R., & Wolfe, B. (1995). The determinants of children's attainments: A review of methods and findings. Journal of Economic Literature, 33, 1829-1878.
Hubbard, R. G., & Skinner, J. (1996). Assessing the effectiveness of saving incentives. Journal of Economic Perspectives, 10, 73-90.
Ichimura, H., & Taber, C. (2002). Semiparametric reduced-form estimation of tuition subsidies. American Economic Review, 92, 286-292.
Keane, M. P., & Wolpin, K. I. (1994). The solution and estimation of discrete choice dynamic programming models by simulation and interpolation: Monte Carlo evidence. Review of Economics and Statistics, 76, 648-672.
Keane, M. P., & Wolpin, K. I. (1997). The career decisions of young men. Journal of Political Economy, 105, 473-522.
Keane, M. P., & Wolpin, K. I. (2001). The effect of parental transfers and borrowing constraints on educational attainment. International Economic Review, 42, 1051-1103.
Lino, M. (1996). Expenditures on children by families, 1995 annual report. Washington, DC: U.S. Department of Agriculture, Center for Nutrition Policy and Promotion.
McFadden, D. (1989). A method of simulated moments for estimation of discrete response models without numerical integration. Econometrica, 57, 995-1026.
Nelder, J. A., & Mead, R. (1965). A simplex method for function minimization. Computer Journal, 7, 308-313.
Poterba, J. M., Venti, S. F., & Wise, D. (1996). How retirement saving programs increase saving. Journal of Economic Perspectives, 10, 91-112.
Scott, D. W. (1992). Multivariate density estimation: Theory, practice, and visualization. New York: Wiley.
Silverman, B. W. (1986). Density estimation for statistics and data analysis. New York: Chapman and Hall.
Stern, S. (1994). Two dynamic discrete choice estimation problems and simulation method solutions. Review of Economics and Statistics, 76, 695-702.
Titterington, D. M., & Bowman, A. W. (1985). A comparative study of smoothing procedures for ordered categorical data. Journal of Statistical Computation and Simulation, 21, 291-312.
U.S. Department of Education. (2005). 2003-2004 national postsecondary student aid study. Washington, DC: U.S. Department of Education, National Center for Education Statistics.
APPENDIX A. MODEL SPECIFICATION

Utility function:

$$U_t = \frac{(c_t\,\varepsilon_{ct})^{1-\gamma}}{1-\gamma} + (\lambda_1 + \lambda_2 n_t + \lambda_3 c_t + \varepsilon_{nt})\,n_t + (\alpha_1 + \alpha_2 q_{mt} + \varepsilon_{qt})\,q_{mt} + (\theta_1 + \theta_2 q_{ht} + \varepsilon_{qt})\,q_{ht}$$

where

$$\lambda_1 = \lambda_{1h} + \lambda_{11h}1(t{=}1) + \lambda_{12h}1(t{=}2) + \lambda_{13h}1(t{=}3) + \lambda_{1c}1(ed \ge \text{college}) + \lambda_{11c}1(ed \ge \text{college},\,t{=}1) + \lambda_{12c}1(ed \ge \text{college},\,t{=}2) + \lambda_{13c}1(ed \ge \text{college},\,t{=}3)$$

$$\lambda_2 = \lambda_{2h} + \lambda_{2c}1(ed \ge \text{college}), \quad \alpha_1 = \alpha_{1h} + \alpha_{1c}1(ed \ge \text{college}), \quad \alpha_2 = \alpha_{2h} + \alpha_{2c}1(ed \ge \text{college})$$

$$\theta_1 = \theta_{1h} + \theta_{1c}1(ed \ge \text{college}), \quad \theta_2 = \theta_{2h} + \theta_{2c}1(ed \ge \text{college})$$

Education attainment:

$$\Pr(\text{no college}) = \Phi(\mu_1 - S^*_t), \quad \Pr(\text{some college}) = \Phi(\mu_2 - S^*_t) - \Phi(\mu_1 - S^*_t), \quad \Pr(\text{college degree}) = 1 - \Phi(\mu_2 - S^*_t)$$

$$S^*_t = \kappa_1 o_t + \kappa_2 o_t^2 + \kappa_3 1(ed \ge \text{college}) + \kappa_4 o_t 1(ed \ge \text{college}) + \kappa_5 o_t^2 1(ed \ge \text{college})$$

Income function:

$$I_t = \exp\big[\beta_{0h} + \beta_{1h}t + \beta_{2h}t^2 + \beta_{3h}1(t{=}1) + \beta_{0c}1(ed \ge \text{college}) + \beta_{1c}t\,1(ed \ge \text{college}) + \beta_{2c}t^2 1(ed \ge \text{college}) + \beta_{3c}1(ed \ge \text{college},\,t{=}1) + \omega_{ht} + \ln(\omega_{ct})\,1(ed \ge \text{college})\big], \quad t \le 7$$

$$I_t = \big(\beta_h + \beta_c 1(ed \ge \text{college})\big)\,E\Big[\tfrac{1}{7}\textstyle\sum_{\tau=1}^{7} I_\tau\Big] + \rho_{ht} + \ln(\rho_{ct})\,1(ed \ge \text{college}), \quad t > 7$$

Other parameters:
Discount factor: $\delta$.
Child costs: $C = C_h + C_c\,1(ed \ge \text{college})$.
Measurement error: $\text{assets}^{obs}_{it} = \text{assets}^{true}_{it}\exp(\eta)$.

Error distributions:

$$\ln(\varepsilon_{ct}) \sim N(0,\sigma^2_c),\; \varepsilon_{nt} \sim N(0,\sigma^2_n),\; \varepsilon_{qt} \sim N(0,\sigma^2_q),\; \omega_{ht} \sim N(0,\sigma^2_{\omega h}),\; \ln(\omega_{ct}) \sim N(0,\sigma^2_{\omega c}),\; \rho_{ht} \sim N(0,\sigma^2_{\rho h}),\; \ln(\rho_{ct}) \sim N(0,\sigma^2_{\rho c}),\; \eta \sim N(0,\sigma^2_{\eta})$$

ed is the household level of education; 1(.) is an indicator function equal to 1 if the expression in parentheses is true and zero otherwise.
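The education-attainment block above is an ordered probit in the latent index S*_t. The following is a minimal sketch of how the three outcome probabilities could be evaluated for a given parental offer; the parameter values in the example are arbitrary placeholders, not the chapter's estimates.

```python
from math import erf, sqrt

def norm_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def education_probs(o, college_parent, kappa, mu1, mu2):
    """Ordered-probit probabilities of the child's attainment given the
    annual parental offer o (Appendix A). kappa = (k1, ..., k5)."""
    k1, k2, k3, k4, k5 = kappa
    d = 1.0 if college_parent else 0.0
    s = k1 * o + k2 * o**2 + d * (k3 + k4 * o + k5 * o**2)
    p_none = norm_cdf(mu1 - s)
    p_some = norm_cdf(mu2 - s) - norm_cdf(mu1 - s)
    p_degree = 1.0 - norm_cdf(mu2 - s)
    return p_none, p_some, p_degree

# Placeholder parameters (not the estimated values from the chapter).
print(education_probs(o=4_273.0, college_parent=False,
                      kappa=(1e-4, -1e-9, 0.3, 1e-5, -1e-10),
                      mu1=0.5, mu2=1.5))
```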
APPENDIX B. ESTIMATION PROCEDURE

The estimation procedure begins with initializing values for the parameters of the model. Given these parameters, the expected next-period value functions, as a function of the state choice variables, can be solved for; that is, $Emax_t(\Omega_{t+1};\Theta) = E(V_{t+1}(\Omega_{t+1}))$, as explained in Section 3.2, where the vector of model parameters, $\Theta$, is usually suppressed. The likelihood function is then constructed by simulating the choices of the households. To do so, first draw a set of the four shocks to the utility function (unobserved taste variation in consumption, children, some college, and 4-year college attainment) shown in Eqs. (1) and (2). For the estimation in this chapter 10,000 such draws were made. For each household observation, at each time period, the data give an observable value of state variables, $\Omega_{it}$ and $\Omega_{it+1}$ (i.e., assets, number of children, education of children, offer of support, etc., as appropriate). For each of the 10,000 shocks, the model gives the household choice of next-period state variables, $\tilde{\Omega}_{it+1}$, given this period's observed state variables, $\Omega_{it}$, as the solution to

$$\max_{\tilde{\Omega}_{it+1}} \; U_{it} + Emax_t(\tilde{\Omega}_{it+1}) \qquad \text{(B.1)}$$

This gives a set of 10,000 simulated values of next-period state variables, each of which is denoted $\tilde{\Omega}_{jit+1}$, which along with the single observed data value of next-period state variables, $\Omega_{it+1}$, can be used to construct the simulated likelihood of $\Omega_{it+1}$. As discussed in Section 3.3, only one or two elements of $\Omega$ will change between two observation points. Denoting these as $\Omega^1$ and $\Omega^2$, the simulated likelihood of $\Omega_{it+1}$ can be constructed using Eq. (13) or (14) as appropriate by

$$\hat{f}(\Omega_{it+1}) = \frac{1}{10{,}000\,h}\sum_{j=1}^{10{,}000} K\big(\Omega^1_{it+1}, \tilde{\Omega}^1_{jit+1}; h\big) \qquad \text{(B.2)}$$

if one state value is changing, or, if two are changing, by

$$\hat{f}(\Omega_{it+1}) = \frac{1}{10{,}000\,h_1 h_2}\sum_{j=1}^{10{,}000} K_1\big(\Omega^1_{it+1}, \tilde{\Omega}^1_{jit+1}; h_1\big)\,K_2\big(\Omega^2_{it+1}, \tilde{\Omega}^2_{jit+1}; h_2\big) \qquad \text{(B.3)}$$

where the kernel functions K(.) are given by Eqs. (12) and (15) as appropriate, and the smoothing parameters h are set as in Section 3.3. The simulated likelihood function is then

$$L(\Theta) = \prod_{i=1}^{n}\prod_{t=0}^{T-1}\hat{f}(\Omega_{it+1}) \qquad \text{(B.4)}$$

This likelihood function can then be numerically maximized over $\Theta$, where for each update of the parameters a new set of Emax functions is calculated, from which a new simulated likelihood value is computed. In this study, parameters were updated using a version of the simplex algorithm originally proposed by Nelder and Mead (1965) and further discussed in Acton (1990).
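As a rough illustration of the structure of this procedure, the sketch below implements a kernel-smoothed simulated likelihood in the spirit of Eqs. (B.2) and (B.4) and maximizes it with the Nelder-Mead simplex, as in the chapter. The `simulate_next_state` function is a hypothetical placeholder for the dynamic-programming solution step, and the Gaussian kernel and fixed bandwidth are simplifying assumptions rather than the exact kernels of Eqs. (12) and (15).

```python
import numpy as np
from scipy.optimize import minimize

R = 10_000  # simulated draws per observation, as in the chapter

def gaussian_kernel(x, x_sim, h):
    # Gaussian kernel comparing the observed next-period state with each
    # simulated one (the continuous-state case sketched in Eq. (B.2)).
    u = (x - x_sim) / h
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def simulated_loglik(theta, data, simulate_next_state, h=0.3):
    """Kernel-smoothed simulated log-likelihood (cf. Eqs. (B.2) and (B.4)).
    `simulate_next_state(theta, state, R)` stands in for solving the model
    and returning R simulated next-period states for a household."""
    loglik = 0.0
    for state, next_state in data:
        sims = simulate_next_state(theta, state, R)
        f_hat = gaussian_kernel(next_state, sims, h).sum() / (R * h)
        loglik += np.log(max(f_hat, 1e-300))  # guard against zero estimates
    return loglik

def estimate(theta0, data, simulate_next_state):
    # Maximize the simulated likelihood with the Nelder-Mead simplex,
    # mirroring the chapter's use of the Nelder and Mead (1965) algorithm.
    return minimize(lambda th: -simulated_loglik(th, data, simulate_next_state),
                    x0=theta0, method="Nelder-Mead")
```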
ESTIMATING THE EFFECT OF EXCHANGE RATE FLEXIBILITY ON FINANCIAL ACCOUNT OPENNESS

Raul Razo-Garcia

ABSTRACT

This chapter deals with the estimation of the effect of exchange rate flexibility on financial account openness. The purpose of our analysis is twofold: On the one hand, we try to quantify the differences in the estimated parameters when exchange rate flexibility is treated as an exogenous regressor. On the other hand, we try to identify how two different degrees of exchange rate flexibility (intermediate vs. floating regimes) affect the propensity to open the financial account. We argue that a simultaneous determination of exchange rate and financial account policies must be acknowledged in order to obtain reliable estimates of their interaction and determinants. Using a panel data set of advanced countries and emerging markets, a trivariate probit model is estimated via a maximum simulated likelihood approach. In line with the monetary policy trilemma, our results show that countries switching from an intermediate regime to a floating arrangement are more likely to remove capital controls. In addition, the estimated coefficients exhibit important differences when exchange rate flexibility is treated as an exogenous regressor relative to the case when it is treated as endogenous.
Maximum Simulated Likelihood Methods and Applications
Advances in Econometrics, Volume 26, 199-251
Copyright © 2010 by Emerald Group Publishing Limited
All rights of reproduction in any form reserved
ISSN: 0731-9053/doi:10.1108/S0731-9053(2010)00000260011
1. INTRODUCTION Coordinating the implementation of exchange rate (ER) and financial account (FA) policies has been a perennial challenge for policymakers. Factors such as the soundness and development of the financial system, the degree of currency mismatch, and the economic growth strategy (e.g., export-led growth) complicate the interaction of the two policies. For example, countries opening their financial markets to international capital flows face the task of adapting their ER regime (ERR) to the resulting environment of greater capital mobility. Some assert that the removal of capital controls requires the implementation of a more flexible ERR to prepare the domestic market to deal with the effects of higher capital flows (e.g., Eichengreen, 2004, 2005; Prasad, Rumbaug, & Wang, 2005).1 At the same time the liberalization of FA policies, coupled with a more flexible ERR, can pose significant risks for countries in which the financial system is weak or currency mismatch is high.2 Historically, we have observed many examples in which FA policy poses challenges for the choice of the ERR and vice versa. In recent years, however, the debate on Chinese policy reforms can be considered one of the best examples showing the complex interaction between ER and capital control policies. This debate has centered on one question; how the Chinese authorities will move toward a more open FA and a more flexible ERR. A more flexible regime would allow China to use monetary policy to buffer its economy against shocks. But greater ER flexibility would also increase the foreign currency exposure of the financial and nonfinancial sectors and generate the need for instruments to hedge it. The dilemma is that removing capital controls would allow agents to access the markets and instruments needed to hedge foreign exchange exposure but simultaneously pose significant risks for the Chinese economy given the weakness of its financial system. In a poorly regulated environment, capital inflows could be misallocated, and currency mismatches on the balance sheets of financial and corporate sectors might rise to dangerous levels; this makes a more flexible regime less attractive. In addition, a more flexible ERR that led to an appreciation of the Renminbi might diminish the competitiveness of the export sector and damage the performance of the economy.3 The complex interaction between the two policies can be considered as a corollary of the monetary policy trilemma, which states that policymakers in open economies have to choose two out of the three desirable objectives: (1) ER stability, (2) international capital mobility, and (3) monetary policy oriented toward domestic goals. Therefore, if the trilemma has indeed
constrained the actions of policymakers throughout history, as Obstfeld, Shambaugh, and Taylor (2004) and Shambaugh (2004) have shown, the ER policy should not be considered an exogenous determinant of FA openness, and this, in turn, cannot be assumed as an exogenous regressor of the ERR. As a consequence, the simultaneous determination of ER and FA policies must be acknowledged in order to obtain reliable estimates of their interaction and determinants. Yet despite the clear connection between the ER policy and FA openness, the existent studies on the determinants of capital controls have, in general, disregarded the simultaneous determination of the two policies. One strand of the empirical literature has treated the endogeneity of ER and FA policies within a simultaneous equations framework (von Hagen & Zhou, 2005, 2006; Walker, 2003). Another strand of the literature has simply ignored the simultaneity problem, limiting the analysis to univariate probit or logit models and relying on econometric techniques that are inappropriate under the presence of discrete endogenous regressors (e.g., Alesina, Grilli, & Milesi-Ferretti, 1994; Collins, 1996; Leblang, 1997; Quinn & Incla´n, 1997).4 Another alternative to deal with the endogeneity of ERR on FA policies is to rely on instrumental variables. Nevertheless, the main challenge of the instrumental variables approach is that the presence of dummy endogenous regressors in a discrete choice model makes the analysis differ substantially from the continuous regressor models. As pointed out by Carrasco (2001), the presence of endogenous dummy regressors leads to an inconsistency with the statistical assumptions of the nonlinear discrete model if the traditional two-stage method is applied. The objective of this chapter is twofold. First, it investigates the impact of exogenous changes of ER flexibility on financial openness. Second, it examines how the coefficients associated with ER flexibility are affected when the ERR is treated as an exogenous regressor of FA openness. In our attempt to quantify the effect of ER flexibility on the removal of capital controls (i.e., financial openness), we face an identification problem: the data alone are not sufficient to identify this effect. Hence, the inference depends on prior information available to the researcher about the probability distribution of the endogenous variables. In this chapter, using a panel data set of advanced and emerging countries, we estimate a trivariate probit model to account for the simultaneity of ER and FA policies. The trivariate probit model is composed of a switching probit model to identify the impact of ERR on the propensity of removing capital controls and a multinomial probit to study the choice of the ER arrangement. Since rich and emerging countries have different economic,
political, and institutional circumstances, we estimate the model separately for these two types of countries. To identify the effect of exogenous changes of ER flexibility on FA openness, we propose, along the geographical concentration of trade, measures of the world’s acceptance of intermediate and floating regimes as instruments. One of the contributions of this chapter is being able to assess how two levels of ER flexibility affect the propensity to remove capital restrictions. Another contribution of this chapter is the introduction of interactions between ER flexibility and the determinants of capital controls, which help to identify nonlinear relationships between the independent variables and the propensity to liberalize FA. Our estimation strategy departs from previous work in this area in at least three ways: (i) we use the maximum simulated likelihood (MSL) approach to estimate the trivariate probit model; (ii) we assume the residuals to be independent and identically distributed (i.i.d.) normal random variables; and (iii) we rely on Halton draws and the Geweke–Hajivassiliou–Keane (GHK) method to maximize the likelihood function. Assuming normal errors instead of logistic errors, as in von Hagen and Zhou (2006), avoids the restrictive substitution patterns due to the independence of irrelevant alternatives (IIA).5 The GHK method and the Halton draws are used to simulate the multidimensional normal integrals. There are six main results. First, the degree of ER flexibility strongly influences FA policy. In particular, a U-shape behavior is found between the propensity to remove capital controls and the ER flexibility. Second, the coefficients obtained when the ERR is treated as an endogenous regressor differ substantially from the estimated coefficients obtained when the ER arrangement is treated as exogenous. As a matter of fact, the effect of ER flexibility on the propensity to liberalize FA is overestimated when the endogeneity of the ERR is not properly accounted for. This overestimation varies across emerging and advanced countries and across ERRs. Third, interesting correlations between the exogenous variables and the degree of financial openness are unmasked when the effect of the exogenous regressors is allowed to vary across ERRs. Fourth, despite many external pressures, emerging countries have been more conservative in their processes to liberalize FA. Fifth, policymakers tend to adopt EERs with a higher degree of flexibility when these are more globally accepted. Finally, relative to other emerging markets, Asian countries display a higher degree of ‘‘fear of floating.’’6 The rest of the chapter is organized as follows. In Section 2, we review the empirical literature on the choice of ERR and the degree of financial
openness. Then, in Section 3, we describe the evolution of the bivariate distribution of ERR and FA openness. Sections 4 and 5 describe the empirical model and the data, respectively. In Section 6, we present the estimation strategy and in Section 7, we comment on the results. Some final remarks are included in Section 8.
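The introduction above notes that the trivariate probit likelihood is simulated with the Geweke-Hajivassiliou-Keane (GHK) method using Halton draws. The following is a minimal, generic sketch of a GHK simulator for a multivariate normal rectangle probability with Halton uniforms; it illustrates the technique rather than the chapter's implementation, and the covariance matrix and truncation points at the bottom are arbitrary placeholders.

```python
import numpy as np
from scipy.stats import norm

def halton(n, base):
    # Radical-inverse (Halton) sequence in a given prime base, skipping zero.
    seq = np.zeros(n)
    for i in range(n):
        f, x, k = 1.0, 0.0, i + 1
        while k > 0:
            f /= base
            x += f * (k % base)
            k //= base
        seq[i] = x
    return seq

def ghk_probability(b, sigma, n_draws=1000, primes=(2, 3, 5)):
    """GHK simulator for P(Y <= b) with Y ~ N(0, sigma), using Halton draws.
    A textbook version of the simulator, not the chapter's code."""
    L = np.linalg.cholesky(sigma)      # lower-triangular factor
    d = len(b)
    u = np.column_stack([halton(n_draws, primes[j]) for j in range(d)])
    e = np.zeros((n_draws, d))
    prob = np.ones(n_draws)
    for j in range(d):
        # Upper truncation point for e_j given the previously drawn components.
        t = (b[j] - e[:, :j] @ L[j, :j]) / L[j, j]
        p_j = norm.cdf(t)
        prob *= p_j
        # Draw e_j from the standard normal truncated to (-inf, t].
        e[:, j] = norm.ppf(np.clip(u[:, j] * p_j, 1e-12, 1 - 1e-12))
    return prob.mean()

# Example: a trivariate normal probability of the kind entering the
# trivariate probit likelihood (covariance values are illustrative only).
sigma = np.array([[1.0, 0.4, 0.3],
                  [0.4, 1.0, 0.5],
                  [0.3, 0.5, 1.0]])
print(ghk_probability(np.array([0.2, -0.1, 0.5]), sigma, n_draws=2000))
```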
2. LITERATURE REVIEW

The empirical literatures on both the choice of ERR and the removal or imposition of capital controls are vast and cannot be comprehensively summarized here. Different sample periods and country coverage have been used to shed light on these two issues. Nevertheless, a common characteristic can be observed: the econometric approach.
2.1. The Literature on the Choice of Capital Controls

The most common econometric specification used to analyze the factors affecting the level of FA openness is a discrete choice model in which an unobservable propensity, say, to remove capital controls is explained by some exogenous covariates:

$$K^*_{it} = X'_{it}\beta + \epsilon_{it} \qquad (1)$$

where $K^*_{it}$ is an unobservable latent index describing the propensity to open the FA, $X_{it}$ is a vector of exogenous regressors affecting the likelihood of lifting capital controls, $\beta$ is a vector of parameters associated with $X_{it}$, and $\epsilon_{it}$ is a random term representing all the factors affecting $K^*_{it}$ not included in $X$. Depending on the distributional assumption imposed on $\epsilon_{it}$, a logit or probit model may be obtained.7 Although $K^*_{it}$ cannot be observed, the econometrician observes the discrete variable $K_{it} = 1\{K^*_{it} > 0\}$, where 1{A} is an indicator function assuming the value of 1 if event A occurs.8 Common arguments in favor of the removal of capital controls are as follows: (1) FA liberalization promotes a more efficient international allocation of capital and boosts economic growth in developing countries; (2) it allows agents to smooth consumption (i.e., risk sharing); (3) it encourages capital inflows9; and (4) it allows the government to send a signal indicating that it will abstain from using the inflation tax or that it is committed to policies that favor investment.10 Conversely, traditional motives for the
adoption of capital controls include: (1) to allow monetary independence when a fixed ERR is in place; (2) to reduce or eliminate volatile capital flows; (3) to maintain domestic savings and inflation tax; (4) to limit vulnerability to financial contagion; (5) to reduce ER volatility caused by volatile short-run capital flows; and (6) to avoid excessive external borrowing.11 The discrete choice model discussed above is commonly used to test these presumptions. Some of the most widely-cited studies are Alesina et al. (1994), Grilli and Milesi-Ferretti (1995), Leblang (1997), Quinn and Incla´n (1997), Eichengreen and Leblang (2003), Leblang (2005), and (von Hagen and Zhou (2005, 2006). Although most of these studies have included the ERR as a determinant of financial openness, the majority have disregarded the simultaneous determination of these two policies.12 Central bank independence, the political leaning of the government, political instability, degree of democracy, per capita income, size of the government, and openness to trade are often used to explain the likelihood of opening the FA. Common findings include: (1) countries with an independent central bank have a higher likelihood of removing capital controls (Alesina et al., 1994; Grilli & Milesi-Ferretti, 1995; Quinn & Incla´n, 1997; Walker, 2003; von Hagen & Zhou, 2005); (2) policymakers implementing a floating ERR are more prone to open the FA (Alesina et al., 1994; Grilli & Milesi-Ferretti, 1995; Leblang, 1997; von Hagen & Zhou, 2005); (3) countries with leftist governments and high levels of trade openness maintain more open FA (Quinn & Incla´n, 1997; Walker, 2003)13; (4) large economies with a small stock of international reserves are more likely to impose capital controls (Grilli & Milesi-Ferretti, 1995; Leblang, 1997); (5) countries with larger governments exhibit lower levels of financial openness (Grilli & MilesiFerretti, 1995; von Hagen & Zhou, 2005); and (6) economies with large current account deficits and high inflation levels present a higher propensity to impose capital controls (Walker, 2003; von Hagen & Zhou, 2005).
2.2. The Literature on the Choice of Exchange Rate Regime

The Mundell-Fleming model and Mundell's (1961) optimum currency area (OCA) theory have been the main workhorses in the empirical literature on the choice of ERR. In recent years, however, models incorporating frictions other than nominal rigidities (Lahiri, Singh, & Vegh, 2007), balance sheet exposure to ER volatility (Chang & Velasco, 2006; Hausmann, Panizza, & Stein, 2001), and political and institutional factors
(e.g., Simmons & Hainmueller, 2005; Bernhard & Leblang, 1999) have enriched theoretical and empirical research on this topic. A model similar to the one presented above, (Eq. (1)), has been used to analyze the determinants of the ERR. In this context, an unobservable propensity, say to peg, is explained by some exogenous regressors.14 Determinants of the ERR in previous research can be grouped into three categories: variables related to OCA theory, political factors, and macroeconomic performance. With regard to OCA theory, factors such as trade openness, geographical concentration of trade, size of the economy, and economic development are typically included in the models. Central bank independence, democracy, political stability, proximity to an election, and the influence of partisan politics are used to control political factors. Finally, foreign exchange reserves, terms of trade volatility, economic growth, inflation, real ER volatility, and current account are included as measures of macroeconomic performance assumed to affect the likelihood of pegging, floating, or implementing an intermediate regime. The main results of previous studies are as follows: (1) a positive relationship exists between the size of the economy and the likelihood of floating; (2) countries with high levels of trade and highly dollarized financial systems are more likely to implement a peg; and (3) inflation is frequently found to be positively associated with freely floating rates. Evidence related to political factors suggests that democratic countries are more likely to adopt floating rates (Broz, 2002; Leblang, 2005) and that governments with both strong support in the legislature and fragmented opposition are more inclined to peg (Frieden et al., 2000). Also, it seems there is a lower probability of exiting from pegs in the run-up to elections (e.g., Blomberg, Frieden, & Stein, 2005; Frieden et al., 2000). Table 1 shows some of the models utilized by previous studies on the determinants of ERR and capital controls.
3. EVOLUTION OF THE POLICY MIX COMPOSED OF THE EXCHANGE RATE AND CAPITAL ACCOUNT REGIMES

3.1. De Facto versus De Jure Classifications

A key issue in this research area is how to classify ER and FA regimes. For both there are two possibilities: de jure and de facto classifications.
Table 1. Some Studies on the Determinants of the Exchange Rate Regime and the Openness of the Financial Account.

Exchange rate regime
  Author(s)                    Year   Type of Model
  Collins                      1996   Probit
  Klein and Marion             1997   Logit
  Bernhard and Leblang         1999   Logit
  Frieden et al.               2000   Ordered Logit
  Poirson                      2001   Ordered Probit and OLS
  Broz                         2002   Ordered Probit
  von Hagen and Zhou           2002   Ordered Logit
  Juhn and Mauro               2002   Bivariate Probit and Multinomial Logit
  Levy-Yeyati et al.           2002   Pooled Logit
  Barro and Tenreyro           2003   Probit (Instrumental Variables)
  Leblang                      2003   Strategic Probit
  Eichengreen and Leblang      2003   Bivariate Probit
  Walker                       2003   Simultaneous Equation Model (Probit)
  Blomberg et al.              2005   Duration Model
  Simmons and Hainmueller      2005   Logit, Probit and Markov Transition Model
  von Hagen and Zhou           2005   Simultaneous Equation Model (Ordered Probit for ERR)
  von Hagen and Zhou           2006   Simultaneous Equation Model (Logit)

Capital account openness
  Author(s)                    Year   Type of Model
  Alesina et al.               1994   Logit and Probit
  Grilli and Milesi-Ferretti   1995   Logit and Probit
  Leblang                      1997   Probit
  Quinn and Inclán             1997   OLS
  Walker                       2003   Simultaneous Equation Model (Probit)
  Eichengreen and Leblang      2003   Bivariate Probit
  Leblang                      2005   Logit
  von Hagen and Zhou           2005   Simultaneous Equation Model (Continuous Index for CA)
  von Hagen and Zhou           2006   Simultaneous Equation Model (Logit)

Note: See Rogoff et al. (2004) for other studies on the determination of the ERR.
The former are generated from arrangements reported by countries (i.e., official regimes), while the latter are constructed mainly on the basis of macroeconomic variables.15 Since our investigation deals with the effect of the implemented ER policy on the degree of financial openness, we use de facto arrangements when these are available. Three de facto ERR classifications have been proposed recently. Bubula and O¨tker-Robe (2002) (BOR) classification combines market ERs and other quantitative information with assessments of the nature of the regime drawn from consultations with member countries and IMF country desk
economists. BOR classify the ERR into 13 categories: (1) another currency as legal tender, (2) currency union, (3) currency board, (4) conventional fixed peg to a single currency, (5) conventional fixed peg to a basket, (6) pegged within a horizontal band, (7) forward-looking crawling peg, (8) forward-looking crawling band, (9) backward-looking crawling peg, (10) backward-looking crawling band, (11) tightly managed floating, (12) other managed floating, and (13) independent floating. Bubula and O¨tker-Robe (2002) classified the ER arrangements for all IMF members from 1990 to 2001. This classification has been updated by IMF staff through 2008. Levy-Yeyati and Sturzenegger (2005) (LYS) use cluster analysis and changes and the volatility of changes of the ER and international reserves to construct a de facto ERR classification. LYS argue that fixed ERs are associated with low volatility of ER (in levels and changes) and high volatility in international reserves, while floating regimes are characterized by low volatility of ER with more stable international reserves. This classification covers 183 countries over the period 1974–2004. LYS classify the ER arrangements into five categories: (1) inconclusive, (2) floats, (3) dirty floats, (4) crawling pegs, and (5) pegs. A country’s ER arrangement is classified as ‘‘inconclusive’’ when the volatility of the ER and international reserves are low.16 Reinhart and Rogoff (2004) (RR) categorize the ERR based on the variability of informal or black-market ERs and the official rate. This classification is most appropriate for the purposes of this chapter. When there are multiple ERs, RR use the information from black or parallel markets to classify the arrangement under the argument that marketdetermined dual or parallel markets are important, if not better, barometers of the underlying monetary policy.17 Two additional features of RR’s classification are that it is available for a longer period and it includes a ‘‘freely falling’’ regime.18 With the latter we generate a dummy variable, labeled CRISIS, to control periods of macroeconomic instability. Since we are interested in the ERR implemented even during macroeconomic stress periods, we use information in RR’s detailed chronologies to reclassify the ‘‘freely falling’’ countries into pegs, intermediate, or freely floating regimes. RR classified the ERR into 15 categories. BOR’s classification is not utilized due to its short coverage in terms of years. We do not use LYS’s classification because of the ‘‘inconclusive’’ regimes, which makes it less desirable for our purposes. Given the lack of de facto classifications for the openness of the FA, we rely on Brune’s financial openness index (BFOI) to classify the degree of
FA openness.19 A problem here is that the ER structure is one of the variables Brune considers to construct the index. Since the aim of the financial openness index is to reflect the openness of economies to capital flows, we exclude the ER structure from the index because this factor has already been taken into account in RR's classification.20

3.2. Definition of Financial Account Openness and Exchange Rate Regimes

Let $K_{it}$ denote the degree of FA openness in country i (i = 1, 2, ..., N) in year t (t = 1, 2, ..., T_i). BFOI is recoded on a 1, 2, 3, 4 scale representing four different degrees of FA openness: closed (K = 1), relatively closed (K = 2), relatively open (K = 3), and open (K = 4). Formally,

$$K_{it} = \begin{cases} 1 & \text{if } BFOI \in \{0,1,2\} \\ 2 & \text{if } BFOI \in \{3,4,5\} \\ 3 & \text{if } BFOI \in \{6,7,8\} \\ 4 & \text{if } BFOI \in \{9,10,11\} \end{cases}$$
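A one-line implementation of this recoding, reading the third bin as BFOI values 6-8 so that the four bins partition the 0-11 scale:

```python
def recode_bfoi(bfoi: int) -> int:
    """Map Brune's 0-11 financial openness index into the four FA categories:
    1 = closed, 2 = relatively closed, 3 = relatively open, 4 = open."""
    if not 0 <= bfoi <= 11:
        raise ValueError("BFOI is defined on 0-11")
    return bfoi // 3 + 1

# Example: recode_bfoi(4) -> 2 (relatively closed)
```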
We collapse RR’s ‘‘natural’’ classification of ERR into three categories: pegs, intermediate, and flexible regimes. Pegs are regimes with no separate legal tender, preannounced pegs or currency boards, preannounced horizontal bands that are narrower than or equal to plus or minus ðÞ2%, and de facto pegs. Intermediate regimes include preannounced crawling pegs, preannounced crawling bands that are narrower than or equal to 2%, de facto crawling peg, de facto crawling bands that are narrower than or equal to 2%, preannounced crawling bands that are wider than or equal to 2%, de facto crawling bands that are wider or equal to 5%, moving bands that are narrower than or equal to 2% (i.e., allows for both appreciation and depreciation over time), and de facto crawling bands that are narrower than or equal to 2%. Finally, floats include managed floating and freely floating arrangements. de facto pegs are classified as pegs in spite of the fact there is no commitment by the monetary authority to keep the parity irrevocable. We split the sample of countries into two groups: advanced and emerging markets. The definition of advanced countries coincides with industrial countries in the International Financial Statistics data set. Following Bubula and O¨tker-Robe (2002), countries included in the Emerging Market Bond Index Plus (EMBI+), the Morgan Stanley Capital International
Index (MSCI), Singapore, Sri Lanka, and Hong Kong SAR are defined as emerging markets. The resulting sample consists of 24 advanced countries and 32 emerging markets.21 The model is estimated on annual data for the period 1975–2006.
3.3. Evolution of the Policy Mix 1975–2006 We first describe the evolution of ER and FA policies pooling advanced and emerging countries in the same group. The bivariate distribution of ERR and FA openness has displayed a clear path over the past 35 years. As Fig. 1 shows, in 1975, 2 years after the breakdown of the Bretton Woods System, about three quarters of the advanced and emerging markets, or 36 out of 49 countries, were implementing an ER arrangement with some degree of flexibility (a soft peg or a floating regime). Nevertheless, most of these countries still kept capital controls.22 Preparing the ground for a more open FA took about 15 years. It was not until the early 1990s when a significant number of countries started to liberalize their FA. Important differences are evident between advanced and emerging countries. The former exhibit the path described above: after the breakdown of the Bretton Woods system the majority of the industrialized world moved to an intermediate regime first, after which they liberalized the FA and finally shifted to either a flexible ER or a peg (see Fig. 2). Emerging markets have been more reluctant to move in the direction of greater ER flexibility and capital mobility. After the Bretton Woods system collapsed, some emerging economies moved to an intermediate ERR and just a few of them liberalized capital flows (see Fig. 3). Relative to the advanced countries, the distribution of ERR and FA regimes in emerging markets has not changed dramatically in the past three decades. It is clear that in these countries the intermediate regimes are still a popular option. In fact, comparing Figs. 2 and 3, we can conclude that the ‘‘bipolar view’’ can be rejected for emerging markets but not for advanced countries (see Eichengreen & Razo-Garcia, 2006).23 Also, from Fig. 3, we can observe that capital restrictions are a common practice among emerging countries in these days. The main conclusion of this section is that during the post–Bretton Woods era, ER and FA reforms have not taken place at the same pace and time in advanced and emerging countries. The initial reluctance displayed by emerging markets to move to more flexible arrangements after the breakdown of the Bretton Woods system might be explained, for example, by the underdevelopment of their financial systems. In countries with less
[Figure 1 here. Three stacked bar panels, one per ERR (Fixed Regimes, Intermediate Regimes, Flexible Regimes); vertical axis: Number of Countries (0-30); horizontal axis: 1975-2003; bars shaded by FA regime (Closed FA, Relatively Closed FA, Relatively Open FA, Open FA).]

Fig. 1. Financial Account Policies Under Different Exchange Rate Regimes in Advanced and Emerging Countries: 1975-2006. Note: These three figures report the evolution of ERR and FA policies. The vertical axis measures the number of advanced and emerging countries with ERR j (j = [pegs, intermediate, flexible]) that are implementing FA regime l (l = [closed, partially closed, partially open, open]). For example, in 1975, 13 countries implemented a fixed ER. Out of these 13 countries, 8 had a closed FA, 2 had a partially closed FA, 2 had partially opened their FA, and only 1 had removed all capital restrictions. Darker colors are assigned to FA regimes with more restrictions. Therefore, a figure dominated by dark colors means that for that specific type of ERR the majority of the countries in the sample had intensive capital restrictions. For definitions of the FA regimes see Section 3.2.
[Figure 2 here. Three stacked bar panels, one per ERR (Fixed Regimes, Intermediate Regimes, Flexible Regimes); vertical axis: Number of Countries (0-16); horizontal axis: 1975-2003; bars shaded by FA regime (Closed FA, Relatively Closed FA, Relatively Open FA, Open FA).]

Fig. 2. Financial Account Policies Under Different Exchange Rate Regimes in Advanced Countries: 1975-2006. Note: These three figures report the evolution of ERR and FA policies. The vertical axis measures the number of advanced countries with ERR j (j = [pegs, intermediate, flexible]) that are implementing FA regime l (l = [closed, partially closed, partially open, open]). For example, in 1975, three advanced countries implemented a fixed ER. Out of these three countries, one had a closed FA, none of them had a partially closed FA, two had partially opened their FA, and no advanced countries had removed all capital restrictions. Darker colors are assigned to FA regimes with more restrictions. Therefore, a figure dominated by dark colors means that for that specific type of ERR the majority of the countries in the sample had intensive capital restrictions. For definitions of the FA regimes see Section 3.2.
[Figure 3 here. Three stacked bar panels, one per ERR (Fixed Regimes, Intermediate Regimes, Flexible Regimes); vertical axis: Number of Countries (0-20); horizontal axis: 1975-2003; bars shaded by FA regime (Closed FA, Relatively Closed FA, Relatively Open FA, Open FA).]

Fig. 3. Financial Account Policies Under Different Exchange Rate Regimes in Emerging Countries: 1975-2006. Note: These three figures report the evolution of ERR and FA policies. The vertical axis measures the number of emerging countries with ERR j (j = [pegs, intermediate, flexible]) that are implementing FA regime l (l = [closed, partially closed, partially open, open]). For example, in 1975, 10 countries implemented a fixed ER. Out of these 10 emerging markets, 7 had a closed FA, 2 had a partially closed FA, none of them had partially opened their FA, and only 1 had removed all capital restrictions. Darker colors are assigned to FA regimes with more restrictions. Therefore, a figure dominated by dark colors means that for that specific type of ERR the majority of the countries in the sample had intensive capital restrictions. For definitions of the FA regimes see Section 3.2.
developed financial sectors, economic agents may not have the financial tools to hedge currency risks, which can reduce the attractiveness of flexible rates. The institutional framework, "fear of floating," and the inability to borrow funds denominated in domestic currency are other causes behind the different sequences of ER and FA policies followed by industrialized countries and emerging markets. A situation in which the domestic currency cannot be used to borrow abroad, dubbed "original sin" by Eichengreen, Hausmann, and Panizza (2003), combined with a weak institutional framework (e.g., poor financial regulation), is another example of how the interaction of economic and institutional factors has resulted in different sequences of ER and FA policies across advanced and emerging countries. In this specific example, when emerging countries cannot borrow in their own currency and the country's banks have made loans in U.S. dollars, a depreciation of the currency against the dollar can hurt the balance sheets of financial institutions and greatly injure the financial system. Under these circumstances, the central bank is likely to display "fear of floating." In this example, the three factors mentioned above, the institutional framework (e.g., poor prudential regulation), "fear of floating," and the inability to borrow in the international capital market in domestic currency, interact to reduce the likelihood of implementing a flexible rate in emerging economies. Since advanced and emerging economies display important differences in macroeconomic institutions, degree of access to international capital markets, level of development, and other economic and political factors, we split the sample of countries into two groups: advanced and emerging markets. This will help us to obtain accurate measures of the interaction and determinants of ER and FA policies.
4. TRIVARIATE PROBIT MODEL

The econometric model consists of a random utility for the degree of FA openness. Let $K^*_{it}$ be the $i$th country's unobservable latent index that guides the decision regarding the liberalization of the FA in period $t$. The underlying behavioral model that we assume is

$$K^*_{it} = S_{it}\eta^s_k + F_{it}\eta^f_k + X_{1,it}\beta_k + S_{it}X_{1,it}\delta^s_k + F_{it}X_{1,it}\delta^f_k + \nu_{it,k} \qquad (2)$$
where the subscript k denotes parameters or variables associated with the FA equation and the superscripts s and f denote parameters associated with
the soft pegs (intermediate) and flexible regimes, respectively.24 $X_{1,it}$ is a row vector of $m$ exogenous regressors of country $i$ in period $t$; $S_{it}$ and $F_{it}$ are dummy variables indicating the implementation of an intermediate or a floating regime, respectively; $\eta^s_k$ is a parameter capturing the effect of intermediate regimes on the decision to liberalize the FA; $\eta^f_k$ captures the effect of flexible ERRs on the decision to open the FA; $\beta_k$ is a vector of parameters associated with $X_{1,it}$; $\delta^s_k$ and $\delta^f_k$ are vectors of parameters associated with the interactions of $X_{1,it}$ with $S_{it}$ and $F_{it}$, respectively; and $\nu_{it,k}$ is a residual term assumed to be i.i.d. (over time and countries) normally distributed, $(\nu_{it,k} \mid X_{1,it}=x_{1,it}, S_{it}, F_{it}) \sim N(0, \omega_{kk}=1)$. Since the constant term is included in the regression, the dummy variable associated with fixed ERRs is excluded to avoid perfect multicollinearity, making this arrangement the base category against which the intermediate and flexible regimes are assessed.25

With the normality assumption, the model described in Eq. (2) becomes a probit model. To estimate this switching probit model, we assume the model error variances are constant (i.e., we assume the errors are homoscedastic). We incorporate this assumption into the model by setting the conditional variance of $\nu_{it,k}$ equal to one. The reason for fixing the variance is that we cannot distinguish between a data-generating process with parameters $\eta^s_k$, $\eta^f_k$, $\beta_k$, $\delta^s_k$, $\delta^f_k$, and $\omega_{kk}$ and one with parameters $\eta^s_k/\sqrt{\omega_{kk}}$, $\eta^f_k/\sqrt{\omega_{kk}}$, $\beta_k/\sqrt{\omega_{kk}}$, $\delta^s_k/\sqrt{\omega_{kk}}$, $\delta^f_k/\sqrt{\omega_{kk}}$, and 1. This is just a normalization, therefore, and not a substantive assumption. The homoscedasticity assumption, however, is critical for two reasons. First, if we do not assume a constant variance, the model will not be identified. Second, if the errors are heteroscedastic, then the parameter estimates will be biased, inconsistent, and inefficient. Although the consequences of a misspecified model are serious, to avoid complicating the already cumbersome estimation process proposed in this chapter, we estimate a homoscedastic probit model.26

The degree of FA openness is modeled as a switching probit model with the ERR behaving as the switch. Hence, depending on the value of the switch, Eq. (2) has three states: one for pegs, one for intermediate arrangements, and another for floating arrangements:

$$K^*_{it} = \begin{cases} X_{1,it}\beta_k + \nu_{it,k} & \text{(Fixed Regime) if } S_{it}=0 \text{ and } F_{it}=0 \\ \eta^s_k + X_{1,it}(\beta_k + \delta^s_k) + \nu_{it,k} & \text{(Intermediate Regime) if } S_{it}=1 \text{ and } F_{it}=0 \\ \eta^f_k + X_{1,it}(\beta_k + \delta^f_k) + \nu_{it,k} & \text{(Flexible Regime) if } S_{it}=0 \text{ and } F_{it}=1 \end{cases}$$

Note that in Eq. (2) the effect of the explanatory variables varies among countries with different ERRs via the interaction terms. Although we do not
directly observe the propensity to open the FA, we do observe the following discrete variable:

$$K_{it} = \begin{cases} 1 & \text{if } -\infty < K^*_{it} \le \alpha_0 \quad \text{(Closed FA)} \\ 2 & \text{if } \alpha_0 < K^*_{it} \le \alpha_1 \quad \text{(Relatively Closed FA)} \\ 3 & \text{if } \alpha_1 < K^*_{it} \le \alpha_2 \quad \text{(Relatively Open FA)} \\ 4 & \text{if } K^*_{it} > \alpha_2 \quad \text{(Open FA)} \end{cases} \qquad (3)$$

where $\alpha = [\alpha_0, \alpha_1, \alpha_2]$ represents a vector of thresholds to be estimated. Here, for example, $\alpha_1$ is a positive threshold differentiating between two degrees of financial openness: relatively closed and relatively open.

Now, since the two endogenous dummy variables ($S$ and $F$) cannot follow a normal distribution (by their discrete nature), we cannot apply the traditional two-stage or instrumental variables approaches.27 In this chapter, we account for the endogeneity between the ERR and FA openness by assuming a trivariate probit model. In particular, for the two discrete endogenous regressors, we specify a reduced-form multinomial probit

$$S^*_{it} = X_{1,it}\Pi^s_1 + X_{2,it}\Pi^s_2 + \nu_{it,s} \qquad (4)$$

$$F^*_{it} = X_{1,it}\Pi^f_1 + X_{2,it}\Pi^f_2 + \nu_{it,f} \qquad (5)$$

with

$$S_{it} = \begin{cases} 1 & \text{if } S^*_{it} > 0 \text{ and } S^*_{it} > F^*_{it} \\ 0 & \text{otherwise} \end{cases} \qquad (6)$$

$$F_{it} = \begin{cases} 1 & \text{if } F^*_{it} > 0 \text{ and } F^*_{it} > S^*_{it} \\ 0 & \text{otherwise} \end{cases} \qquad (7)$$

where $S^*_{it}$ is an unobservable latent index that guides the policymakers' intermediate ERR decision of country $i$ in period $t$ (relative to a peg); $F^*_{it}$ is an unobservable latent index measuring the proclivity of country $i$ in time $t$ to implement a floating ERR (relative to a peg); $X_{2,it}$ is a row vector of $l$ excluded regressors (from the FA equation) that affects $K^*$ only through $S^*$ or $F^*$; and $\nu_{it} = [\nu_{it,s}\ \nu_{it,f}\ \nu_{it,k}]'$ is a vector of residuals assumed to be i.i.d. (over time and countries) normal with zero mean and variance–covariance matrix

$$\Omega_\nu = \begin{bmatrix} \omega_{ss} & \omega_{sf} & \omega_{sk} \\ & \omega_{ff} & \omega_{fk} \\ & & \omega_{kk} \end{bmatrix} \qquad (8)$$
From Eqs. (6) and (7), it can be verified that the ERR with the highest propensity will be adopted. For identification purposes, the coefficients of the random propensity for the pegs are normalized to zero, $P^*_{it} = 0$.28 This normalization is a consequence of the irrelevance of the level of utilities in discrete choice models. In this case, neither all the regime-specific constants nor all the attributes of the countries that do not vary across alternatives can be identified. Hence, the coefficients of the intermediate regimes are interpreted as differential effects on the propensity to adopt a soft peg compared to the propensity to implement a peg. A similar interpretation is given to the coefficients of the floating equation.29

The parameter vector to be estimated, $\theta$, is composed of two scale parameters ($\eta^s_k$ and $\eta^f_k$), five vectors of dimension $m \times 1$ ($\beta_k$, $\delta^s_k$, $\delta^f_k$, $\Pi^s_1$, $\Pi^f_1$), two vectors of dimension $l \times 1$ ($\Pi^s_2$ and $\Pi^f_2$), one vector containing the six distinct elements of the covariance matrix $\Omega_\nu$ ($\mathrm{Vech}(\Omega_\nu)$), and one vector of thresholds of dimension $3 \times 1$ ($\alpha = [\alpha_0, \alpha_1, \alpha_2]'$).30 Then $\theta' = (\eta^s_k, \eta^f_k, \beta'_k, \delta^{s\prime}_k, \delta^{f\prime}_k, \Pi^{s\prime}_1, \Pi^{s\prime}_2, \Pi^{f\prime}_1, \Pi^{f\prime}_2, \mathrm{Vech}(\Omega_\nu)', \alpha')$ is a vector of dimension $(2 + 5m + 2l + 6 + 3) \times 1$.
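To fix ideas, the following sketch simulates data from the system in Eqs. (2)–(8): the reduced-form latent indexes, the regime dummies, the FA latent index, and the ordered FA outcome. It is only an illustration of the model's structure; all parameter values, dimensions, and variable names are invented for the example and are not estimates from this chapter.

import numpy as np

# Illustrative simulation of the system in Eqs. (2)-(8). All parameter
# values, dimensions, and names below are invented for this sketch.
rng = np.random.default_rng(0)
n, m, l = 500, 2, 1                        # country-years, dim(X1), dim(X2)

X1 = rng.standard_normal((n, m))           # included exogenous regressors
X2 = rng.standard_normal((n, l))           # excluded instruments (e.g., SPILL)

# FA-equation and reduced-form ERR parameters (hypothetical values)
eta_s, eta_f = -0.5, 0.3
beta    = np.array([0.4, -0.2])
delta_s = np.array([0.1, 0.2])
delta_f = np.array([-0.3, 0.1])
Pi1_s, Pi2_s = np.array([0.2, 0.1]), np.array([0.8])
Pi1_f, Pi2_f = np.array([-0.1, 0.3]), np.array([0.6])
alpha = np.array([0.0, 0.5, 1.5])          # thresholds; the first is fixed at 0

# Correlated errors (nu_s, nu_f, nu_k): unit variances on s and k, free omega_ff
Omega_nu = np.array([[1.0, 0.3, 0.4],
                     [0.3, 1.2, 0.2],
                     [0.4, 0.2, 1.0]])
nu = rng.standard_normal((n, 3)) @ np.linalg.cholesky(Omega_nu).T

# Reduced-form latent indexes and observed ERR dummies, Eqs. (4)-(7)
S_star = X1 @ Pi1_s + X2 @ Pi2_s + nu[:, 0]
F_star = X1 @ Pi1_f + X2 @ Pi2_f + nu[:, 1]
S = ((S_star > 0) & (S_star > F_star)).astype(int)
F = ((F_star > 0) & (F_star > S_star)).astype(int)

# FA latent index, Eq. (2), and the observed ordered outcome, Eq. (3)
K_star = (S * eta_s + F * eta_f + X1 @ beta
          + (S[:, None] * X1) @ delta_s + (F[:, None] * X1) @ delta_f + nu[:, 2])
K = (1 + (K_star > alpha[0]).astype(int)
       + (K_star > alpha[1]).astype(int)
       + (K_star > alpha[2]).astype(int))  # K takes values in {1, 2, 3, 4}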
5. EXPLANATORY VARIABLES

The variables included as determinants of both the choice of ERR and the degree of FA openness are inflation, international reserves normalized by M2 (RESERVES/M2), financial development (FINDEV) proxied by the monetary aggregate M2 and normalized by GDP, relative size (GDP relative to the U.S. GDP), trade openness, geographical concentration of trade (SHARE), per capita income, and democratic level (POLITY).31 Previous research suggests that governments compelled to resort to inflation tax are more likely to utilize capital controls to broaden the tax base; hence, a negative correlation between inflation and the openness of the FA is expected. Since financial deepening and innovation reduce the effectiveness of capital controls, countries with more developed financial systems should exhibit a higher propensity to open the FA.
More democratic polities are more prone to lift capital controls. Eichengreen and Leblang (2003) argue that democratic countries have greater recognition of rights, including the international rights of residents, who have a greater ability to press for the removal of restrictions on their investment options. An empirical regularity found in previous studies is that advanced economies are less inclined to resort to controls. Previous authors have suggested that the development of general legal systems and institutions, and not of those specific to financial transactions, is crucial for a country to benefit from opening its financial markets; we use per capita income to proxy for this characteristic.32 While institutional development is difficult to measure, there is a presumption that it is most advanced in high-income countries (Eichengreen, 2002). Acemoglu et al. (2001) and Klein (2005) found empirical evidence supporting this presumption (i.e., the link between per capita income and institutional quality).33 Trade openness is commonly seen as a prerequisite to open the FA, (McKinnon, 1993). Furthermore, openness to trade can make capital controls less effective.34 Hence, the more open to trade a country is, the higher its propensity to remove capital controls. Two variables, SPILL and SHARE, are used as instruments in the reduced form (Eqs. (4) and (5)).35 The former variable, SPILL, measures the proportion of countries in the world implementing either an intermediate or a floating ERR. The second, SHARE, measures the geographical concentration of trade, which is proxied by the export share to the main trading partner.36 Frieden et al. (2000) and Broz (2002) use a variable similar to SPILL to control for the feasibility of the ER arrangement. The idea is to capture the‘‘climate of ideas’’ regarding the appropriate ERR, in this case an arrangement with some degree of flexibility. That is, the choice of an ER arrangement may be related to the degree of acceptance of that regime in the world. This being true, then if most countries are adopting regimes in which the currency is allowed to fluctuate, within or without limits, it would be more feasible to adopt or maintain an intermediate or a floating regime. Then, SPILL is expected to be positively correlated with the latent indexes of the intermediate and floating regimes. Since countries with a higher degree of concentration in exports can benefit more from fixed regimes, a negative coefficient associated with SHARE is predicted in the floating ERR equation. Regarding the choice of ERR, we expect the following relationships. Countries with nontransparent political systems (e.g., autocracies) are expected to have a higher propensity to peg relative to countries with more transparent systems (e.g., democracies). The argument is that nondemocratic
countries may choose a less flexible ERR as a commitment device to help them maintain credibility for low-inflation monetary policy objectives. Countries with underdeveloped financial systems do not possess the instruments needed to conduct open market operations and, as a consequence, are expected to adopt less flexible regimes. OCA theory holds that variables such as low openness to trade, large size, and low geographical concentration of trade are associated with more flexible regimes, since a higher volume and geographical concentration of trade increases the benefits from a less flexible ER, reducing transaction costs. Smaller economies have a higher propensity to trade internationally, leading to a higher likelihood of pegging. To the extent that a high level of international reserves is seen as a prerequisite for defending a less flexible regime, a negative association between the flexibility of the ER and the stock of international reserves is expected. To identify the effect of exogenous changes of ER flexibility on FA openness, we need more than correlation between the two external instruments, SPILL and SHARE, and the two latent indexes, S it and Fit . These two variables must be significant after controlling for the covariates included in X1t. Previous studies support our argument that these two variables significantly affect the choice of ERR even after controlling for other factors. Regarding the feasibility of intermediate and floating regimes, proxied by SPILL, Broz (2002) finds that the choice of a fixed ER is positively and significantly related to the general climate of opinion regarding pegging.37 Geographical concentration of trade, SHARE, has been found to significantly affect the propensity to adopt fixed, intermediate, and floating regimes. Studying the choice of de jure ERR for a group of 25 transition economies in the 1990s, von Hagen and Zhou (2002) find evidence that SHARE raises the chance to implement a fixed ERR among the Commonwealth of Independent States (CIS), but increases the probability of adopting a floating rate regime among non-CIS countries.38 For a larger set of developing countries, von Hagen and Zhou (2006) also find a significant relationship between geographical concentration of trade and the choice of the ER arrangement.39 In a slightly different but interrelated area, Klein and Marion (1997) find that SHARE is an important determinant of the duration of pegs in Latin American countries.
6. ESTIMATION

As we will show, the difficulty in evaluating the log-likelihood function is that the probabilities require the evaluation of a three-dimensional multivariate normal integral. We estimate, up to a constant, the parameters presented in Eqs. (2), (4), and (5) by MSL. Specifically, to simulate the multidimensional normal integrals, we use the GHK simulator and Halton draws.40

6.1. Maximum Simulated Likelihood

The log-likelihood function in this panel data context is

$$\mathcal{L} = \sum_{i=1}^{I} \log\!\left( \prod_{t=1}^{T} \Pr(K_{it}=l, S_{it}=s, F_{it}=f) \right) \qquad (9)$$

with

$$\Pr(K_{it}=l, S_{it}=s, F_{it}=f) = \prod_{l=1}^{4} P(K_{it}=l, S_{it}=0, F_{it}=0)^{(1-F_{it}-S_{it})K^l_{it}} \prod_{l=1}^{4} P(K_{it}=l, S_{it}=1, F_{it}=0)^{S_{it}K^l_{it}} \prod_{l=1}^{4} P(K_{it}=l, S_{it}=0, F_{it}=1)^{F_{it}K^l_{it}} \qquad (10)$$
where $l = \{1,2,3,4\}$, $s, f = \{0,1\}$, $K^l_{it} = \mathbf{1}\{K_{it} = l\}$, and $\mathbf{1}\{A\}$ is the indicator function. Note that all the choice probabilities are conditioned on $X_{1,it}$ and $X_{2,it}$.41 These probabilities can be simulated using the Cholesky decomposition of the variance–covariance matrix of the error terms (Train, 2003, p. 127). The normality assumption imposed on the residuals requires the application of the GHK simulator to approximate the integrals implicit in these joint normal probabilities.

For illustrative purposes suppose that country $i$ has a closed FA ($K_{it}=1$) and is not "treated" (neither a soft peg nor a floating ERR is adopted). Then, recalling that the first cutoff is normalized to zero (see Section 6.2), the probability of adopting that policy mix is

$$\begin{aligned} P(K_{it}=1, F_{it}=0, S_{it}=0) &= P(-\infty < K^*_{it} \le 0,\; F^*_{it} < 0,\; S^*_{it} < 0) \\ &= P(-\infty - L^k_{it} < \nu_{it,k} \le -L^k_{it},\; \nu_{it,f} < -L^f_{it},\; \nu_{it,s} < -L^s_{it}) \\ &= P(-\infty - L^k_{it} < \nu_{it,k} \le -L^k_{it} \mid \nu_{it,f} < -L^f_{it},\; \nu_{it,s} < -L^s_{it}) \\ &\quad \times P(\nu_{it,f} < -L^f_{it} \mid \nu_{it,s} < -L^s_{it}) \times P(\nu_{it,s} < -L^s_{it}) \end{aligned} \qquad (11)$$
where $L^k_{it} = S_{it}\eta^s_k + F_{it}\eta^f_k + X_{1,it}\beta_k + S_{it}X_{1,it}\delta^s_k + F_{it}X_{1,it}\delta^f_k$, $L^s_{it} = X_{1,it}\Pi^s_1 + X_{2,it}\Pi^s_2$, and $L^f_{it} = X_{1,it}\Pi^f_1 + X_{2,it}\Pi^f_2$. In the third equality of Eq. (11), the general multiplication rule is used to find the joint probability $P(-\infty - L^k_{it} < \nu_{it,k} \le -L^k_{it},\; \nu_{it,f} < -L^f_{it},\; \nu_{it,s} < -L^s_{it})$. This rule states that when $n$ events happen at the same time, and the events are dependent, the joint probability $\Pr(E_1 \cap E_2 \cap \cdots \cap E_n)$ can be obtained as the product of $n-1$ conditional probabilities and one marginal probability (e.g., if $n=3$ then $P(E_1 \cap E_2 \cap E_3) = P(E_1 \mid E_2 \cap E_3)\, P(E_2 \mid E_3)\, P(E_3)$).42

One more transformation is needed to make the model more convenient for simulation. Let $L_\nu$ be the Cholesky factor associated with the variance–covariance matrix $\Omega_\nu$ ($= L_\nu L'_\nu$):

$$L_\nu = \begin{bmatrix} c_{11} & 0 & 0 \\ c_{21} & c_{22} & 0 \\ c_{31} & c_{32} & c_{33} \end{bmatrix}$$

As is described in Train (2003, p. 127), using the Cholesky decomposition of $\Omega_\nu$, the residuals of the three equations, which are correlated, can be rewritten as linear combinations of uncorrelated standard normal variables:

$$\begin{bmatrix} \nu_{it,s} \\ \nu_{it,f} \\ \nu_{it,k} \end{bmatrix} = \begin{bmatrix} c_{11} & 0 & 0 \\ c_{21} & c_{22} & 0 \\ c_{31} & c_{32} & c_{33} \end{bmatrix} \begin{bmatrix} \epsilon^s_{it} \\ \epsilon^f_{it} \\ \epsilon^k_{it} \end{bmatrix} = L_\nu \epsilon_{it}$$

where

$$\epsilon_{it} = \begin{bmatrix} \epsilon^s_{it} \\ \epsilon^f_{it} \\ \epsilon^k_{it} \end{bmatrix} \overset{d}{\sim} N\!\left( \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \right).$$

With this transformation the error differences $\nu_{it,s}$, $\nu_{it,f}$, and $\nu_{it,k}$ remain correlated because all of them depend on $\epsilon^s_{it}$. Now we can rewrite the latent indexes (utility differences) in the following way:

$$S^*_{it} = L^s_{it} + c_{11}\epsilon^s_{it}$$
$$F^*_{it} = L^f_{it} + c_{21}\epsilon^s_{it} + c_{22}\epsilon^f_{it}$$
$$K^*_{it} = L^k_{it} + c_{31}\epsilon^s_{it} + c_{32}\epsilon^f_{it} + c_{33}\epsilon^k_{it}$$

The probability described in Eq. (11) is hard to evaluate numerically in terms of the $\nu$'s because they are correlated. However, using the Cholesky
decomposition of the variance–covariance matrix associated with the error differences, $\Omega_\nu$, this probability can be rewritten in such a way that it involves only independent random variables (the $\epsilon$'s). Hence, the probability in Eq. (11) becomes a function of the univariate standard cumulative normal distribution:

$$\begin{aligned} P(K_{it}=1, S_{it}=0, F_{it}=0) &= P\!\left( \epsilon^k_{it} < \frac{-L^k_{it} - c_{31}\epsilon^s_{it} - c_{32}\epsilon^f_{it}}{c_{33}},\; \epsilon^f_{it} < \frac{-L^f_{it} - c_{21}\epsilon^s_{it}}{c_{22}},\; \epsilon^s_{it} < \frac{-L^s_{it}}{c_{11}} \right) \\ &= \int_{-\infty}^{-L^s_{it}/c_{11}} \int_{-\infty}^{(-L^f_{it} - c_{21}\epsilon^s_{it})/c_{22}} \Phi\!\left( \frac{-L^k_{it} - c_{31}\epsilon^s_{it} - c_{32}\epsilon^f_{it}}{c_{33}} \right) \phi(\epsilon^f_{it})\,\phi(\epsilon^s_{it})\, d\epsilon^f_{it}\, d\epsilon^s_{it} \end{aligned}$$
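Before turning to the simulator itself, the Cholesky construction above is easy to verify numerically. The sketch below factors a hypothetical $\Omega_\nu$, draws independent standard normals, and checks that $\nu = L_\nu \epsilon$ reproduces the assumed covariance matrix; the numerical values are arbitrary and only illustrate the mechanics.

import numpy as np

# Hypothetical covariance matrix of (nu_s, nu_f, nu_k); the values are arbitrary.
Omega_nu = np.array([[1.0, 0.3, 0.4],
                     [0.3, 1.2, 0.2],
                     [0.4, 0.2, 1.0]])
L_nu = np.linalg.cholesky(Omega_nu)        # lower-triangular factor [c_ij]
c11, c21, c22 = L_nu[0, 0], L_nu[1, 0], L_nu[1, 1]
c31, c32, c33 = L_nu[2, 0], L_nu[2, 1], L_nu[2, 2]

rng = np.random.default_rng(1)
eps = rng.standard_normal((100_000, 3))    # independent standard normal epsilons
nu = eps @ L_nu.T                          # nu = L_nu * eps, correlated errors
print(np.round(np.cov(nu, rowvar=False), 2))   # approximately Omega_nu

# The latent indexes are then S* = L^s + c11*eps_s,
# F* = L^f + c21*eps_s + c22*eps_f, and
# K* = L^k + c31*eps_s + c32*eps_f + c33*eps_k, as in the text.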
To simulate this probability we use the GHK simulator described next; a short code sketch follows the steps.

1. Compute
$$P\!\left( \epsilon^s_{it} < \frac{-L^s_{it}}{c_{11}} \right) = \Phi\!\left( \frac{-L^s_{it}}{c_{11}} \right)$$

2. Draw a value of $\epsilon^s_{it}$, labeled $\epsilon^{s,q}_{it}$, from a standard normal truncated at $-L^s_{it}/c_{11}$. This draw is obtained in the following way:
(a) Let $\mu^q_1$ be the $q$th element of the first Halton sequence of length $Q$.
(b) Calculate
$$\epsilon^{s,q}_{it} = \Phi^{-1}\!\left( (1-\mu^q_1)\Phi(-\infty) + \mu^q_1 \Phi\!\left( \frac{-L^s_{it}}{c_{11}} \right) \right) = \Phi^{-1}\!\left( \mu^q_1 \Phi\!\left( \frac{-L^s_{it}}{c_{11}} \right) \right)$$

3. Compute
$$P\!\left( \epsilon^f_{it} < \frac{-L^f_{it} - c_{21}\epsilon^s_{it}}{c_{22}} \,\Big|\, \epsilon^s_{it} = \epsilon^{s,q}_{it} \right) = \Phi\!\left( \frac{-L^f_{it} - c_{21}\epsilon^{s,q}_{it}}{c_{22}} \right)$$

4. Draw a value of $\epsilon^f_{it}$, labeled $\epsilon^{f,q}_{it}$, from a standard normal truncated at $(-L^f_{it} - c_{21}\epsilon^{s,q}_{it})/c_{22}$. This draw is obtained in the following way:
(a) Let $\mu^q_2$ be the $q$th element of the second Halton sequence of length $Q$.
(b) Calculate
$$\epsilon^{f,q}_{it} = \Phi^{-1}\!\left( (1-\mu^q_2)\Phi(-\infty) + \mu^q_2 \Phi\!\left( \frac{-L^f_{it} - c_{21}\epsilon^{s,q}_{it}}{c_{22}} \right) \right) = \Phi^{-1}\!\left( \mu^q_2 \Phi\!\left( \frac{-L^f_{it} - c_{21}\epsilon^{s,q}_{it}}{c_{22}} \right) \right)$$

5. Compute
$$P\!\left( \epsilon^k_{it} < \frac{-L^k_{it} - c_{31}\epsilon^s_{it} - c_{32}\epsilon^f_{it}}{c_{33}} \,\Big|\, \epsilon^s_{it} = \epsilon^{s,q}_{it},\; \epsilon^f_{it} = \epsilon^{f,q}_{it} \right) = \Phi\!\left( \frac{-L^k_{it} - c_{31}\epsilon^{s,q}_{it} - c_{32}\epsilon^{f,q}_{it}}{c_{33}} \right)$$

6. The simulated probability for this $q$th draw of $\epsilon^s_{it}$ and $\epsilon^f_{it}$ is calculated as
$$P(K_{it}=1, S_{it}=0, F_{it}=0)^q = \Phi\!\left( \frac{-L^s_{it}}{c_{11}} \right) \Phi\!\left( \frac{-L^f_{it} - c_{21}\epsilon^{s,q}_{it}}{c_{22}} \right) \Phi\!\left( \frac{-L^k_{it} - c_{31}\epsilon^{s,q}_{it} - c_{32}\epsilon^{f,q}_{it}}{c_{33}} \right) \qquad (12)$$

7. Repeat steps 1–6 many times, $q = 1, 2, 3, \ldots, Q$.

8. The simulated probability is
$$\tilde{P}(K_{it}=1, S_{it}=0, F_{it}=0) = \frac{1}{Q} \sum_{q=1}^{Q} P(K_{it}=1, S_{it}=0, F_{it}=0)^q \qquad (13)$$
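Steps 1–8 translate almost line by line into code. The sketch below assumes the deterministic parts of the latent indexes ($L^s_{it}$, $L^f_{it}$, $L^k_{it}$) and the Cholesky elements $c_{11}, \ldots, c_{33}$ are already available for one country-year; the Halton sequences (bases 2 and 3) are generated from scratch, and the values in the example call are made up.

import numpy as np
from scipy.stats import norm

def halton(n, base):
    """First n terms of the Halton (van der Corput) sequence in a prime base."""
    seq = np.empty(n)
    for i in range(1, n + 1):
        f, k, x = 1.0, i, 0.0
        while k > 0:
            f /= base
            x += f * (k % base)
            k //= base
        seq[i - 1] = x
    return seq

def ghk_closed_fa_peg(L_s, L_f, L_k, c, Q=750):
    """GHK probability of the policy mix (K=1, S=0, F=0), following steps 1-8.
    L_s, L_f, L_k are the deterministic parts of the latent indexes for one
    country-year; c = (c11, c21, c22, c31, c32, c33) holds the Cholesky elements."""
    c11, c21, c22, c31, c32, c33 = c
    mu1 = halton(Q, 2)                             # first Halton sequence (base 2)
    mu2 = halton(Q, 3)                             # second Halton sequence (base 3)
    p_s = norm.cdf(-L_s / c11)                     # step 1
    eps_s = norm.ppf(mu1 * p_s)                    # step 2: truncated draws
    p_f = norm.cdf((-L_f - c21 * eps_s) / c22)     # step 3
    eps_f = norm.ppf(mu2 * p_f)                    # step 4: truncated draws
    p_k = norm.cdf((-L_k - c31 * eps_s - c32 * eps_f) / c33)   # step 5
    return np.mean(p_s * p_f * p_k)                # steps 6-8, Eqs. (12)-(13)

# Illustrative call with made-up index values and Cholesky elements.
print(ghk_closed_fa_peg(L_s=0.2, L_f=-0.1, L_k=0.5,
                        c=(1.0, 0.3, 0.95, 0.4, 0.2, 0.89)))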
Four critical points need to be addressed when the GHK simulator is applied for maximum likelihood estimation (Train, 2003, p. 133). First, we have to make sure that the model is normalized for scale and level of utility to ensure that the parameters are identified. Second, the GHK simulator takes utility differences against the regime for which the probability is being calculated, and so different differences must be taken for countries choosing different regimes.43 Third, for a country choosing to peg its currency, the GHK simulator uses the covariance matrix $\Omega_\nu$; for a country choosing a soft peg it uses $\Omega^s_\nu$; while for a country with a floating regime it needs $\Omega^f_\nu$. These three matrices are derived from a common covariance matrix $\Omega$ of the original errors (the nondifferenced residuals in the ERR equations). We must ensure that the elements of $\Omega_\nu$ are consistent with the elements of $\Omega^s_\nu$ and $\Omega^f_\nu$ in the sense that the three matrices are derived from the $\Omega$ matrix. Fourth, the covariance matrices implied by the maximum likelihood estimates must be positive definite. We address these issues in detail in Appendix A.
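Given a routine that returns the simulated probability of each observed $(K_{it}, S_{it}, F_{it})$ triple, the panel log-likelihood in Eqs. (9)–(10) is just a double sum of logged probabilities. A minimal sketch follows, in which sim_prob and data are hypothetical placeholders and the floor inside the logarithm is a numerical safeguard added for the example rather than part of the original formulation.

import numpy as np

def simulated_log_likelihood(sim_prob, data, theta):
    """Assemble Eq. (9) from GHK-simulated choice probabilities.
    sim_prob(theta, row) is assumed to return the simulated probability of the
    observed (K_it, S_it, F_it) for one country-year row, as in Eqs. (10) and (13);
    data maps each country i to its list of yearly rows. Both are placeholders."""
    loglik = 0.0
    for i, rows in data.items():
        # log of the product over t equals the sum of logs over t
        loglik += sum(np.log(max(sim_prob(theta, row), 1e-300)) for row in rows)
    return loglik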
6.2. Numerical Optimization

Initially we used the Davidon–Fletcher–Powell algorithm, a quasi-Newton method, to maximize the objective function.44 However, in many cases we obtained small changes in $\hat{\theta}$ and the objective function accompanied by a gradient vector that was not close to zero, so this quasi-Newton numerical routine was not effective at finding the maximum. Since Markov Chain Monte Carlo (MCMC) methods have been proposed for estimation problems that represent a formidable practical challenge, such as Powell's least absolute deviation regression and the nonlinear instrumental variables estimator (see, e.g., Chernozhukov & Hong, 2003), the Metropolis–Hastings algorithm, an MCMC method, is implemented in this chapter to maximize the log-likelihood function. This estimation strategy is not new in macroeconomics. Recently, Coibion and Gorodnichenko (2008) used the MCMC method developed by Chernozhukov and Hong (2003) to estimate a dynamic stochastic general equilibrium model to analyze firms' price-setting decisions. They rely on this stochastic search optimizer to achieve the global optimum because the objective function is highly nonlinear in the parameters.

To ensure monotonicity of the cutoff points, we use the reparametrization $\alpha_j = \alpha_{j-1} + \exp(a_j)$ for $j = 2, 3$ and estimate the unconstrained parameters $a_2$ and $a_3$. In order to identify the constant term in the FA equation, we fix $\alpha_1 = 0$. Additionally, since the probability described in the likelihood Eq. (11) is invariant to scale shifts, the three diagonal elements of the covariance matrix $\Omega_\nu$ are arbitrary. One strategy is to set the parameters $\omega_{kk}$ and $\omega_{ss}$ to unity.45 In terms of the Cholesky decomposition discussed above, this means the free parameters in $L_\nu$ are now the three strict lower triangular elements and $\omega_{ff}$. Thus, $c_{j,j} = \sqrt{1 - \sum_{i=1}^{j-1} c^2_{j,i}}$ for $j = \{k, s\}$. After imposing these identification restrictions, the number of parameters to estimate is reduced by three. A small numerical sketch of these reparametrizations and of the stochastic search step is given at the end of this subsection.

To facilitate the convergence of the algorithm and to avoid the potential bias that a few hyperinflationary outliers can cause, we follow Ghosh, Gulde, and Wolf (2003) and von Hagen and Zhou (2002) and rescale annual inflation as $\pi/(1+\pi)$.46 All covariates included in $X_{1,it}$, except dummy
variables, POLITY and SPILL, are lagged by one period to attenuate the potential simultaneity of these variables. All the exogenous variables, except the constant and the discrete variables, are standardized; this significantly improves the performance of the optimization program. The sources of the variables and the list of countries used in the estimation are contained in Appendix C.
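The reparametrizations described in this subsection and a bare-bones version of the stochastic search step can be sketched as follows. The function names, the proposal scale, and the use of the best visited draw as the point estimate are illustrative choices for the example, not the chapter's exact implementation of the Chernozhukov–Hong approach.

import numpy as np

rng = np.random.default_rng(2)

def cutoffs_from_unconstrained(a2, a3, alpha1=0.0):
    """Monotone cutoffs via alpha_j = alpha_{j-1} + exp(a_j), with alpha_1 fixed at 0."""
    alpha2 = alpha1 + np.exp(a2)
    alpha3 = alpha2 + np.exp(a3)
    return alpha1, alpha2, alpha3

def cholesky_diagonal(off_diagonal_row):
    """Diagonal element implied by a unit error variance:
    c_jj = sqrt(1 - sum of the squared off-diagonal elements in row j)."""
    return np.sqrt(1.0 - np.sum(np.asarray(off_diagonal_row) ** 2))

def metropolis_hastings(logL, theta0, n_draws=200_000, scale=0.02):
    """Random-walk Metropolis-Hastings on the quasi-posterior exp(logL(theta)),
    in the spirit of Chernozhukov and Hong (2003); logL is the simulated
    log-likelihood and scale is a tuning choice for the proposal step."""
    theta = np.asarray(theta0, dtype=float)
    ll = logL(theta)
    best_theta, best_ll = theta.copy(), ll
    for _ in range(n_draws):
        proposal = theta + scale * rng.standard_normal(theta.size)
        ll_prop = logL(proposal)
        if np.log(rng.uniform()) < ll_prop - ll:   # accept or reject the move
            theta, ll = proposal, ll_prop
            if ll > best_ll:
                best_theta, best_ll = theta.copy(), ll
    return best_theta, best_ll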
7. RESULTS

In this section, we report the estimates from different variants of the model described in Eqs. (2), (4), and (5). Our objective is twofold: (i) to investigate the impact of exogenous changes of ER flexibility on financial openness, and (ii) to examine how the coefficients on the FA equation associated with ER flexibility are affected when the endogeneity of the ERR is not accounted for. As mentioned above, we estimate the model first for advanced economies and then for emerging markets because these two types of countries differ in their economic, political, and institutional circumstances.47 Table 2 presents the estimations of four different models. In model 1, we estimate the FA equation neglecting the potential endogeneity of S and F and excluding the interaction terms.48 Model 2 adds the interaction terms to model 1. These benchmark models will help us to assess how the estimated coefficients are affected when S and F are treated as strictly exogenous. Model 3 treats the flexibility of the ERR (S and F) as endogenous and sets to zero all the interaction terms in Eq. (2). Finally, model 4 adds the interaction terms to model 3. In models 3 and 4, we use 750 Halton draws to simulate the multivariate normal integrals and 200,000 MCMC draws to maximize the log-likelihood function.
7.1. Effects of Exchange Rate Flexibility on Financial Account Policies

Four major findings emerge from model 4 (Table 2). First, the evidence shows that the degree of ER flexibility strongly influences FA policies. After controlling for other factors influencing financial openness, a U-shaped relationship between the probability of lifting capital controls and ER flexibility is found. In other words, we find that the most intensive capital restrictions are associated with intermediate regimes. Consistent with the trilemma of monetary policy, a country switching from a soft peg toward a floating arrangement is more likely to remove capital controls
Table 2.
Financial Account and Exchange Rate Regime Equations (1975–2006). Advanced Countries
Model 1
Model 2
y
y
Financial account equation Constant 0.604 Zsk 0.226 0.101 Zf k
FINDEV 0.077 S*FINDEV F*FINDEV RESERVES/M2 0.228 S*(RESERVES/M2) F*(RESERVES/M2) GDP per capita 0.043 S*GDP per capita F*GDP per capita OPENNESS 0.472 S*OPENNESS F*OPENNESS INFLATION 1.045 S*OPENNESS F*OPENNESS RELATIVE SIZE 0.042 S*(RELATIVE SIZE) F*(RELATIVE SIZE) POLITY S*POLITY F*POLITY ASIA S*ASIA F*ASIA CRISIS
S.E.
Emerging Market Countries
Model 3 S.E.
y
S.E.
Model 4 y
0.110 0.205 0.142 1.717 0.143 0.535
0.267 0.335 0.112 0.414 0.389 0.055 0.109 1.627 0.371 0.306 0.135 0.147
0.042 0.238 0.714 0.189 0.060 0.075 0.821 0.209 0.066 0.140 0.151 0.664 0.078 0.951 0.672 0.346 0.104 1.692 1.207 0.275 0.051 0.843 0.681 0.665
0.192 0.087 0.229 0.200 0.283 0.367 0.397 0.351 0.208 0.096 0.288 0.237 0.174 0.632 0.323 0.425 0.432 2.085 0.582 0.508 0.387 0.084 0.395 0.389
0.040 0.402 0.829 0.300 0.081 0.031 0.631 0.103 0.064 0.000 0.382 0.389 0.093 1.139 1.124 0.169 0.122 1.810 1.025 0.080 0.033 0.795 0.536 0.607
Model 1 S.E.
y
0.160 0.710 0.223 0.821 0.220 0.835
S.E.
0.134 0.107 0.125
0.245 0.388 0.078 0.229 0.246 0.216 0.098 0.046 0.299 0.239 0.214 0.762 0.141 0.213 0.271 0.194 0.324 0.066 0.295 0.267 0.348 0.306 0.065 0.395 0.438 0.181 0.375 0.327 0.201 0.186 0.022 0.044 0.145 0.140 0.623 0.570 0.102 2.728 2.537 3.099 0.031 0.205 0.158
Model 2 y 2.126 2.955 3.144
Model 3
S.E.
y
S.E.
0.366 0.539 0.440
1.108 2.085 0.701
0.130 0.084 0.169
0.476 0.137 0.279 0.051 1.100 0.207 1.068 0.257 0.525 0.091 0.112 0.033 1.056 0.130 0.714 0.114 3.742 0.529 0.861 0.087 2.552 0.566 4.939 0.645 0.192 0.117 0.163 0.038 0.167 0.160 0.505 0.268 0.202 0.127 0.167 0.036 0.313 0.194 0.074 0.168 6.282 1.187 1.484 0.304 4.268 1.655 9.770 1.476 0.097 0.021 0.029 0.242 0.119 0.222 0.161 0.402 0.249 0.141 0.066 2.398 0.303 2.417 0.311 2.801 0.288 0.941 0.124 1.409
Model 4 y
S.E.
1.135 2.399 0.646
0.229 0.302 0.416
0.383 0.935 0.452 0.437 0.895 0.545 2.534 1.457 2.934 0.311 0.270 0.003 0.260 0.148 0.208 4.452 2.155 4.489 0.069 0.089 0.082 0.167 0.179 0.239 0.156
0.090 0.133 0.149 0.074 0.097 0.082 0.213 0.255 0.222 0.085 0.104 0.123 0.068 0.101 0.071 0.732 1.027 0.817
Table 2. (Continued ) Advanced Countries
Cutoffs a2 a3
Emerging Market Countries
Model 1
Model 2
y
S.E.
y
S.E.
y
S.E.
y
S.E.
y
0.354 1.451
0.046 0.081
0.398 1.634
0.053 0.095
0.339 1.412
0.031 0.056
0.386 1.586
0.032 0.061
0.657 1.358
Intermediate exchange rate regime equation Constant 0.236 0.134 SPILL 0.282 1.082 SHARE 0.275 0.044 FINDEV 0.077 0.066 RESERVES/M2 0.041 0.059 GDP per capita 0.016 0.078 OPENNESS 0.244 0.066 INFLATION 0.392 0.082 RELATIVE SIZE 0.543 0.142 POLITY ASIA CRISIS
Model 3
Model 4
0.124 0.274 0.288 0.069 0.102 0.033 0.306 0.790 0.377
0.175 0.796 1.082 2.251 0.047 0.251 0.063 0.026 0.123 0.618 0.102 1.002 0.106 1.095 0.166 2.114 0.099 0.217
0.129 0.901 1.088 1.960 0.049 0.247 0.063 0.028 0.116 0.677 0.088 0.929 0.108 1.091 0.153 2.031 0.072 0.224
Floating exchange rate regime equation Constant 0.747 0.158 1.017 SPILL 0.843 1.157 0.843 SHARE 0.179 0.053 0.189 FINDEV 0.114 0.071 0.108 RESERVES/M2 0.357 0.068 0.728
0.199 4.516 1.157 5.491 0.056 0.991 0.068 0.658 0.140 3.899
1.182 5.256 1.451 2.174 0.377 1.552 0.293 1.194 1.134 5.420
Model 1
0.107 1.023 0.047 0.066 0.103 0.068 0.082 0.165 0.070 0.009 0.649 1.502
S.E.
0.048 0.074
0.467 0.216 0.923 1.034 0.061 0.052 0.159 0.080 0.043 0.051 0.698 0.136 0.153 0.059 0.039 0.066 -2.518 0.489 0.045 0.009 0.106 0.64936 0.233 1.502
1.312 0.952 1.267 6.349 0.524 0.045 0.424 0.208 1.607 0.098
0.224 1.304 0.058 0.112 0.059
Model 2
Model 3
Model 4
y
S.E.
y
S.E.
y
S.E.
0.818 1.764
0.063 0.104
0.497 1.009
0.032 0.058
0.611 1.425
0.042 0.060
0.467 0.923 0.061 0.159 0.043 0.698 0.153 0.039 2.518 0.045 0.106 0.233
0.216 1.034 0.052 0.080 0.051 0.136 0.059 0.066 0.489 0.041 0.7734 0.917
0.182 3.809 0.115 0.351 0.174 0.711 0.168 0.017 1.186 0.035 0.085 0.367
0.187 0.580 0.044 0.058 0.056 0.158 0.058 0.052 0.294 0.051 0.741 0.022
0.285 4.250 0.045 0.195 0.139 1.028 0.260 0.059 2.093 0.036 0.077 0.248
0.123 0.544 0.044 0.055 0.043 0.140 0.045 0.050 0.277
0.952 6.349 0.045 0.208 0.098
0.224 1.304 0.058 0.112 0.059
3.689 9.239 0.821 0.352 0.403
0.462 0.664 0.246 0.274 0.206
1.694 6.615 0.251 0.380 0.170
0.507 0.564 0.147 0.200 0.185
GDP per capita OPENNESS INFLATION RELATIVE SIZE POLITY ASIA CRISIS
0.348 0.942 0.030 0.401
0.083 0.441 0.140 1.324 0.070 0.059 0.149 0.282
sff sks skf ssf
Memorandum Observations Endogeneity accounted for log-likelihood
0.106 3.245 0.197 7.707 0.140 1.720 0.105 1.641
35.298 0.012 1.672 5.721
675 No 1358.70
675 No 1300.37
0.664 3.435 1.332 10.818 0.459 1.071 0.478 2.249
0.610 0.035 2.573 0.061 0.514 0.011 0.536 2.259 0.143 0.055 0.409 0.128 1.946 0.209
0.187 0.035 0.090 0.061 0.067 0.011 0.465 2.259 0.143 0.055 0.4089 0.128 1.946 0.209
52.459 0.072 2.352 7.002
675 Yes 1166.20
0.187 0.278 0.564 0.343 0.090 1.187 0.261 0.466 0.067 0.032 0.214 0.244 0.465 5.295 0.573 5.050 0.906 0.237 0.438 0.146 0.8106 0.292 1.2336 0.308 11.858 1.023 7.299 0.751
28.828 0.937 2.642 0.843
675 Yes 1109.00
877 No 1514.83
877 No 1406.11
0.613 0.172 0.187 0.547
0.000 13.414 0.089 3.574
877 Yes 1323.70
877 Yes 1218.70
Notes: *, **, *** denote coefficients statistically different from zero at 1%, 5%, and 10% significance levels, respectively. Intermediate regimes include: preannounced crawling peg, preannounced crawling band that is narrower than or equal to 2%, de facto crawling peg, de facto crawling band that is narrower than or equal to 2%, preannounced crawling band that is narrower than or equal to 2%, de facto crawling band that is wider or equal to 5%, moving band that is narrower than or equal to 2% (i.e., allows for both appreciation and depreciation over time), de facto crawling band that is narrower than or equal 2%. Floating regimes include managed floating and freely floating arrangements.
(i.e., $|\eta^f_k| < |\eta^s_k|$). Our findings also show that a country switching from a peg to a soft peg is less inclined to lift capital restrictions. The reluctance to remove capital controls is more prevalent among emerging countries. The intensification of capital controls can help to sustain the ERR.49 Using a de jure ERR classification, von Hagen and Zhou (2005, 2006) also find a U-shaped relationship between the flexibility of the ER and the probability of lifting capital controls; however, this nonlinear relationship disappears when they use a de facto classification.

Second, the coefficients estimated when the endogeneity of the ERR is not accounted for differ substantially from those obtained when the ERR is treated as an endogenous regressor (model 1 vs. model 3; model 2 vs. model 4).50 Regarding the effect of ER flexibility on the openness of the FA, the coefficient associated with floating regimes, $\eta^f_k$, exhibits the largest difference.51 In fact, for emerging markets, the coefficient estimated under strict exogeneity overestimates the (negative) effect of flexible regimes by a factor of 5 (3.144 vs. 0.646). Although advanced economies also exhibit a difference in that coefficient, it is not statistically significant. Moreover, for some of the exogenous variables, the difference is so large that the estimated coefficients flip signs when S and F are treated as endogenous regressors. Among advanced countries, for example, the effect of international reserves on the propensity to adopt an intermediate regime changes from an (insignificant) negative and unexpected effect to a positive effect when endogeneity is properly accounted for.

Third, the effect of the exogenous regressors on the propensity to open the FA varies considerably with the intensity of ER flexibility (model 3 vs. model 4). For advanced countries, for example, the coefficient associated with S in the FA equation increases by a factor of 29 when interaction terms are allowed.52 Therefore, the inclusion of interaction terms unmasks interesting relationships between the exogenous variables, the degree of FA openness, and the ERR. Fourth, the cutoff points (i.e., the $\alpha$'s) indicate that emerging economies have been more cautious than advanced countries in taking the first step to liberalize the FA.53

7.1.1. Determinants of the Financial Account

Advanced Countries. According to Table 2, the degree of financial development, stock of international reserves, quality of the institutional framework, size of the economy, trade openness, and inflation affect the degree of FA openness (model 4). The effect of these variables on the decision to remove capital restrictions, however, depends on the type of ERR implemented.54 The results of model 4 show important differences across
advanced and emerging countries. A major difference between these two types of countries is related to the interactions of F with the exogenous regressors. While the majority of these interactions are significant at conventional levels in the emerging market regressions, these are not statistically significant in the advanced country estimations.55 This result reflects the tendency of advanced economies to liberalize the FA under intermediate regimes (see Fig. 2). Policymakers in advanced countries with developed financial systems have displayed a higher propensity to remove capital restrictions when an intermediate regime is implemented than when a floating regime is in place. This result supports previous arguments that financial deepening and innovation reduce the effectiveness of capital controls, and this, in turn, increases the propensity to liberalize the FA. Per capita income is typically interpreted in this context as a measure of economic development. A number of economists have found that more developed countries are more likely to remove restrictions on capital flows (see, e.g., Alesina et al., 1994; Grilli & Milesi-Ferretti, 1995; Leblang, 1997). The observation that all of today’s high-income countries have lifted capital controls is consistent with the view that FA liberalization is a corollary of economic development and maturation (Eichengreen, 2002). Our results indicate that the effect of GDP per capita on the degree of FA openness varies considerably between intermediate and flexible regimes. While the development of general legal systems and institutions, proxied by per capita income, exhibits a positive correlation with the propensity to liberalize capital flows when an intermediate ERR is in place, this variable has no effect at the two ends of the ERR spectrum, pegs and flexible regimes.56 Regarding the stock of international reserves, the results presented in Table 2 indicate that advanced countries with a high stock of foreign reserves and a soft peg have more closed FAs. The interaction between the ERR and the degree of FA openness is one of the driving forces behind this result. The reason is that advanced countries with large reserves of foreign currency are more keen to adopt intermediate regimes, and this, in turn, decreases the propensity to open the FA. As we mentioned above, a lower degree of FA openness may help, the monetary authorities to maintain an intermediate ERR. Using a random effects probit model, Leblang (1997) obtains exactly the opposite correlation between reserves and financial openness.57 His estimates show that as countries run out of international reserves, policymakers become more likely to impose capital restrictions.58 In the studies most closely related to our empirical model,
(von Hagen & Zhou, 2006; Walker, 2003), the stock of international reserves is excluded from the vector of determinants of financial openness. Another important factor affecting FA policy is the degree of trade openness. There are important differences in the role played by this variable across advanced and emerging countries and across ERR. Advanced economies with flexible rates exhibit a positive correlation between the degree of trade and FA openness.59 Although advanced countries with intermediate regimes also exhibit a positive correlation, this is very close to zero (0.015=1.1391.124). The positive correlation between trade and FA openness is consistent with the idea that trade liberalization is a prerequisite to liberalize the FA (McKinnon, 1993). The interaction of ER and FA policies allowed in the model reinforces this correlation. The ERR equations show that advanced countries that are highly open to foreign trade have a stronger preference for fixed regimes. This preference for pegs is translated into a lower probability of choosing or maintaining an intermediate or floating regime (i.e., S=0 and F=0 more likely), and this, in turn, increases the likelihood of removing capital controls.60 Based on our estimated model, we find that developed countries with high inflation are more inclined to impose capital controls. This finding supports previous arguments suggesting that governments compelled to resort to inflation tax are more likely to utilize capital controls to broaden the tax base (e.g., Alesina et al., 1994; Leblang, 1997; Eichengreen & Leblang, 2003). The asymmetric effects of the exogenous variables on the degree of FA openness, caused by different degrees of ER flexibility, are clearly observed in this case again. When interactions between the endogenous and exogenous variables are included (model 4), the effect of inflation becomes more negative in countries implementing an intermediate regime relative to countries with flexible rates, 2.835 (=1.811.025) and 1.890 (=1.8900.08), respectively.61 It has been argued that small countries might benefit from risk sharing. The presumption involves the price of risky assets across countries. If these prices differ across countries, there might be some benefits from allowing individuals to trade assets (e.g., consumption smoothing). The gains from risk sharing may be larger for developing or small countries because these contribute less to global output than developed or large countries, making it less likely that their domestic output would be correlated with world output. Hence, more of their idiosyncratic risk can be eliminated by trading assets with residents in other parts of the world. In addition, developing countries’ GDP is often much more volatile than that of large or advanced countries, which means there is more scope to reduce the volatility of the output, and
this, in turn, may be translated into higher benefits from risk sharing. Our findings support this presumption for emerging markets but not for advanced countries. Specifically, the estimated coefficient associated with the relative size is negative for emerging markets but positive for advanced countries. Hence, our results show that larger advanced economies tend to have more open FAs.

7.1.2. Emerging Markets

The coefficients on S and F show, as they do for advanced economies, that emerging countries are less inclined to liberalize the FA when a soft peg is in place. Moreover, our results indicate a further tightening of capital controls in emerging markets, relative to developed countries, when an intermediate regime is chosen. As argued previously, countries following this practice can resort to capital restrictions to help maintain their ERRs. From Table 2 we conclude that the degree of financial development, the stock of international reserves, the quality of the institutional framework, the size of the economy, the degree of trade openness, inflation, and democratic level affect the degree of FA openness in emerging markets (model 4). Emerging countries implementing a soft peg or a flexible rate show a positive and significant correlation between the stock of international reserves and financial openness. To rationalize this result, it must be kept in mind that the size of domestic financial liabilities that could potentially be converted into foreign currency is equal to M2. Therefore, when the stock of international reserves increases relative to M2, the monetary authorities are in better shape to maintain an intermediate regime when there is a sudden increase in the demand for foreign currency (e.g., a speculative attack against the intermediate regime or a sudden capital flight). Also, a high stock of foreign reserves may provide the monetary authorities with the funds necessary to reduce the excessive volatility of the ER when the country is implementing a (managed) flexible regime. Thus, when the stock of international reserves is sufficiently large the country has an additional tool, besides capital controls, to maintain a soft peg or to reduce the volatility of the ER. This finding, then, indicates that there might be some substitutability between capital controls and the stock of international reserves. Furthermore, policymakers in emerging markets react more to the stock of international reserves than their counterparts in advanced economies. It can be argued that our results might be biased due to the potential endogeneity of the stock of foreign exchange reserves and the degree of financial openness.62 As mentioned above, this variable is lagged by one period in
order to mitigate this potential problem. Leblang (1997) also reports a positive relationship between foreign exchange and financial openness. A developed financial system tends to make emerging countries with a peg lift capital restrictions. On the contrary, emerging markets with developed financial systems and with either a soft peg or a flexible ER are less inclined to open their financial markets to the world. These asymmetries in the coefficients associated with financial development across ERR suggest that the quality of the institutional framework, including but not limited to the financial system, is what matters most to benefit from a more open FA. Similar to advanced economies, emerging markets with developed general legal systems and institutions are more likely to remove capital controls under an intermediate regime than when they allow their currencies to float (1.078 vs 0.400).63 A suggestive explanation of this result is the ‘‘fear of floating’’ hypothesis proposed by Calvo and Reinhart (2002). These authors argue that the reluctance to allow the currency to float is more commonly observed among emerging market countries. Therefore, if an emerging country decides to open its FA, it would be more likely to do it under a soft peg than under a floating regime due to the ‘‘fear of floating.’’ The positive correlation between per capita income and financial openness is one of the most robust regularities in the literature on the determinants of capital controls.64 This result is robust to the type of ERR classification, de jure or de facto, and sample period. Our contribution to the literature in this area is the identification of a nonlinear effect of per capita income on FA openness for different degrees of ER flexibility. While previous studies find a higher propensity to liberalize the FA when GDP per capita is high, we find the same positive relationship for emerging markets with pegs and intermediate regimes but a negative correlation when a flexible ER is in place. The positive correlation between trade openness and degree of FA openness is consistent with McKinnon’s (1993) ordering theory, which maintains that the current account must be liberalized before the FA. Our findings also show that trade openness has its major impact on FA policies at the two ends of the ERR spectrum, pegs and floating rates. Furthermore, if we consider the effect of trade openness on the ERR and how the latter affects FA policy, we can conclude that emerging countries that are highly open to foreign trade are more likely to adopt a peg and, therefore, more likely to remove capital restrictions. The results for inflation confirm our expectations that emerging markets experiencing high inflation are more reluctant to liberalize the FA. As matter of fact, among emerging countries with high inflation, the ones adopting a peg show a lower probability of removing capital controls. There are two suggestive explanations for this
result. The first is that capital restrictions in countries with high inflation can be used to protect the tax base (e.g., Grilli & Milesi-Ferretti, 1995; Leblang, 1997; Eichengreen & Leblang, 2003). The second is that those countries might be more reluctant to liberalize their FA in order to sustain their pegs when inflation is high (to prevent capital flight).65 Small emerging markets with either a peg or an intermediate regime tend to have more open FAs. This finding supports the risk-sharing idea. The evidence also indicates that the effect of relative size decreases, in absolute value, with the flexibility of the ER. In fact, when an emerging market switches from an intermediate to a floating regime, the correlation between the size of the economy and FA openness changes from negative to positive. Emerging markets with flexible ERs and democratic political systems have more open FA. In particular, when an emerging economy moves from an intermediate to a floating regime, the democratic level has the expected effect on the liberalization of the FA and plays a more important role in that process.66 This result contrasts dramatically with the null effect of the democratic level on the presence of controls found by von Hagen and Zhou (2006). The results for the dummy variables indicate that, relative to other emerging countries, Asian economies with either a soft peg or a floating regime maintain more open FA. Finally, we observe that emerging markets experiencing currency crises or hyperinflations (CRISIS) are more reluctant to liberalize the FA. This last result suggests that maintaining or imposing capital controls helps emerging markets to cope with financial crises (e.g., Malaysia during the Asian crisis).
7.2. Determinants of the Exchange Rate Regime

Again we will focus on the coefficients of model 4 (Table 2). The main result regarding the choice of ERR is related to the significance of the external instruments, SPILL and SHARE. Both variables have a strong and significant relationship with the ER arrangement. In all the estimations controlling for endogeneity, the coefficient associated with SPILL is, as expected, positive and statistically significant. Based on these estimates, we find evidence that policymakers tend to adopt more flexible ERRs when intermediate and floating regimes are widely accepted around the world (i.e., when SPILL is high). Regarding the second external instrument, SHARE, we find that policymakers in advanced countries with a high geographical concentration of trade are less likely to allow the ER to float but more likely to implement an intermediate regime.
Unexpectedly, emerging markets display a tendency to implement floating rates when the geographical concentration of trade is high. The coefficients associated with SPILL and SHARE support the relevance of the instruments and our assumption that these variables can be used to identify exogenous changes of ER flexibility on financial openness. 7.2.1. Advanced Countries Consistent with OCA theory, large advanced economies with low volume of trade adopt more flexible ERR (intermediate or flexible regimes). Furthermore, all else being equal, an increase in the size of the economy by one standard deviation or a decrease in the volume of trade by one standard deviation increases the preference for flexible regimes more than the preference for soft pegs. Thus, a large advanced country with a low degree of trade openness has a higher preference for flexible regimes. The argument is that in large economies the tradable sector has a lower participation in total production than in small economies. Therefore, the gains from fixing the ER are lower in large countries. Industrial countries with developed general legal systems and institutions (i.e., with high per capita income), high inflation, or a large stock of international reserves are more likely to let the ER to fluctuate to some extent – implementing either a soft peg or a floating regime. The positive correlation between the probability of adopting an intermediate or a floating regime and the stock of foreign reserves can be explained by the selfinsurance motive of holding international reserves. This motive tells us that a buffer stock represents a guaranteed and unconditional source of liquidity that can be used in bad states of nature (e.g., speculative attacks and capital flight). Although advanced countries may have better access to international capital markets during bad states of nature (e.g., speculative attacks), foreign reserves can be used immediately to maintain an intermediate ER or to prevent large swings of the ER in a (managed) floating environment. Regarding per capita income, we found that advanced economies with weak institutional frameworks are, as expected, more likely to peg. This result captures the tendency for countries with weak institutions to rely on currency boards, dollarization, and other forms of de facto pegs to solve credibility problems. The positive association between inflation and the probability of implementing an intermediate regime might be due to the need by countries experiencing high inflation to attain low-inflation monetary policy objectives. One common anti-inflationary strategy is to use the ER, either a peg or an intermediate regime, as a nominal anchor. Some European countries, for example, were required to attain price
stability to have access to the European Economic and Monetary Union (EMU).67 Also, the Exchange Rate Mechanism implemented in Europe required European countries to attain ER stability through the implementation of an intermediate ERR.68 Thus, the estimated correlation between inflation and the probability to adopt an intermediate regime might reflect the policies implemented in Europe before the implementation of the euro. As predicted by OCA theory, advanced countries with a lower geographic concentration of trade are more prone to implement a floating regime since a lower geographical concentration of trade decreases the benefits from pegs or intermediate regimes. Moreover, industrial economies with a higher degree of concentration in exports show a stronger preference toward intermediate regimes than toward hard pegs or de facto pegs. Finally, financial deepening tends to make advanced countries select hard or de facto pegs. This might be the consequence of the European Exchange Rate Mechanism and the recent move of European countries to a single currency.69 7.2.2. Emerging Markets The estimated coefficients of model 4 presented in Table 2 indicate that emerging countries with high inflation, large foreign reserves holdings, transparent political systems (e.g., democracy), developed institutional frameworks, and low levels of trade openness maintain more flexible regimes. The last result is in line with OCA theory. A positive coefficient on foreign international reserves is expected since emerging countries and particularly fast-growing Asian countries no longer see the IMF as a lender of last resort, causing them to accumulate foreign exchange reserves as a self-insurance against currency crises and sometimes as a mechanism to sustain or to manage the ER (e.g., to maintain an artificially low value for their currency). This result is consistent with that in von Hagen and Zhou (2005). Also, as we mentioned above, countries can accumulate a large stock of international reserves to prevent large swings of the ER in a flexible ER environment. Not surprisingly, emerging markets with nontransparent domestic political systems are more inclined to peg relative to countries with more transparent political systems (e.g., democracies). Nondemocratic countries can choose a less flexible ERR as a commitment mechanism to help them maintain credibility for low-inflation monetary policy objectives. The ability to adopt an ER with some flexibility is positively related to the level of development (i.e., the quality of the institutional environment).70 This finding captures the tendency for emerging countries with a poor
institutional framework to rely on pegs to solve credibility problems. In fact, the effect of institutional quality is bigger in the intermediate ER equation than in the flexible ER equation. The last result indicates that emerging countries with high GDP per capita prefer ERR with limited flexibility. Again, this is consistent with the ‘‘fear of floating’’ hypothesis. As expected by OCA theory, larger emerging countries are more inclined toward floating rates, while smaller countries are more attracted toward soft pegs. Like advanced economies, emerging markets with underdeveloped financial systems prefer flexible rates over intermediate regimes. This result reflects the difficulties of maintaining a soft peg when the central bank does not count with the necessary tools to defend the currency. Surprisingly, the results show a positive relationship between the degree of concentration in exports and the implementation of a flexible regime. There are important differences in the selection of the ERR between Asian and non-Asian emerging economies. While the former are more inclined to adopt soft pegs, the latter are more inclined to let the ER float. Hence, we can conclude that Asian emerging markets have shown higher ‘‘fear of floating’’ relative to other emerging economies. This result supports the idea of Dooley, Folkerts-Landau, and Garber (2003) regarding the informal Bretton Woods II system.71 Finally, emerging markets classified as ‘‘freely fallers’’ by RR (CRISIS=1) are more inclined to float. This finding reflects the common tendency to move, voluntarily or not, to a freely floating regime at the onset of a currency crisis. Mexico in 1994, Thailand in 1997, and Argentina in 2001 are examples of countries that ended up floating after a speculative attack. Finally, one question that can be answered with our model is how the results and statistical inference are affected if advanced and emerging markets are pooled in one group. The results of the pooled analysis are shown in Appendix B (Table B1). Comparing the results of model 4 in Tables 2 and B1, we verify that trends and preferences specific to emerging or advanced countries are masked when countries are pooled in one group. There are some cases in which the coefficient is statistically significant in the pooled regression but significant only for one type of country. There are other cases in which the estimated coefficients are not statistically significant when emerging and advanced countries are grouped in the same category but significant for emerging or developed countries. This evidence supports the argument that distinguishing countries by their stage of economic and financial development may help unmask trends that are specific to advanced or emerging markets.72
8. CONCLUDING REMARKS

In this chapter, we propose a trivariate probit model to investigate the effect of exogenous changes in ER flexibility on the openness of the FA. To identify this effect, we use a measure of the world’s acceptance of intermediate and floating regimes and the geographical concentration of trade as instruments. Some of the major findings are the following. First, a U-shaped relationship between the probability of lifting capital controls and ER flexibility is found. While pegs and flexible rates do not impose constraints on FA policies, the adoption of an intermediate regime decreases the probability of removing capital controls. Second, the effect of ER flexibility on the degree of FA openness changes considerably between countries implementing a regime with limited flexibility (intermediate regime) and economies with a flexible rate. Our results predict a tightening of capital controls when an intermediate regime is adopted. Third, treating ER flexibility as an exogenous regressor leads, in general, to an overestimation of the impact of ER flexibility on the degree of openness of the FA. Moreover, our findings indicate that the effect of the exogenous regressors varies considerably with the intensity of ER flexibility. For example, the negative correlation between inflation and the likelihood of liberalizing the FA is lower (more negative) when ER fluctuations are limited (i.e., under an intermediate regime) than when the ER is fully flexible. Fourth, we found that a network effect plays a role in the choice of ERR. Specifically, we found that policymakers are more likely to adopt more flexible ERRs when these regimes are widely accepted in the world. Finally, relative to other emerging markets, Asian countries displayed more ‘‘fear of floating.’’ Future extensions of this chapter may include developing a de facto index of financial openness, analyzing controls on capital outflows separately from controls on capital inflows, including other developing countries, and constructing an index of financial openness from the BFOI using weights for the different types of capital controls. Another interesting extension would be to control for time-invariant unobserved heterogeneity and to allow for dynamic relationships among the variables (feedback effects from lagged dependent variables to current and future values of the explanatory variables). These dynamics might be crucial in FA equations since the decision to impose capital controls can also be affected by lagged capital controls and by other macroeconomic variables.
NOTES

1. A more flexible regime might generate the incentives to develop the foreign exchange market, produce an awareness of currency risk and the need for hedging foreign exchange exposures, allow policymakers to tailor monetary policy to buffer the economy against shocks, and avoid the creation of ‘‘one-way bets,’’ thereby preventing speculators from all lining up on one side of the market, which creates losses when expectations of revaluations or devaluations are disappointed.
2. Currency mismatch occurs when a country’s banks and operating companies have their assets denominated in domestic currency but their liabilities denominated in foreign currency, so that a large depreciation generates a significant decline in the economy’s net worth.
3. The Japanese liberalization process implemented in the early 1960s is another example of how FA policies might be affected when a more flexible regime is adopted. At that time Japan embarked on a gradual removal of capital flow restrictions by implementing the Basic Plan for Liberalization of Trade and Foreign Exchange. More than a decade later, in 1973, the yen moved from a peg to a more flexible arrangement against the U.S. dollar, increasing the exposure of the Japanese economy to international shocks. This, in turn, affected FA policy and the pace of liberalization. In order to prevent short-term capital inflows and outflows from destabilizing the foreign exchange market during periods of international turmoil, caused by the two oil crises in the 1970s and the ‘‘learning to float’’ period, Japan then implemented a dizzying series of changes in foreign exchange controls and regulations affecting FA transactions (Aramaki, 2006). Immediately after the second oil shock occurred in 1979, the yen started to depreciate and the Japanese authorities took measures to encourage capital inflows to stabilize the ER.
4. For example, in the analysis of the evolution and performance of ERR presented by Rogoff, Husain, Mody, Brooks, and Oomes (2004), only 5 studies out of 14 controlled for the effects of FA openness on the choice of ERR (Holden, P., Holden, M., & Suss, 1979; Savvides, 1990; Edwards, 1996; Frieden, Ghezzi, & Stein, 2000; Poirson, 2001), and only one of them dealt with the endogeneity problem (Poirson, 2001).
5. Red bus versus blue bus being the canonical example.
6. Calvo and Reinhart (2002) coined the term ‘‘fear of floating’’ for the fear of large currency swings. They argue that this fear might arise from a combination of lack of credibility, a high pass-through from the ER to prices, and inflation targeting; from liability dollarization; or from an output cost associated with ER fluctuations.
7. We obtain a probit model if we assume that the error term follows a standard normal distribution, $N(0,1)$. If we instead assume that it follows a logistic distribution, then a logit model is obtained.
8. Some variants of this specification are multinomial or ordered models.
9. Many countries that have removed capital controls have experienced substantial capital inflows when capital controls are lifted (e.g., Italy, New Zealand, Uruguay, and Spain).
10. See Hanson (1995) and Bartolini and Drazen (1997a and 1997b).
11. FA restrictions could be seen as a further form of prudential regulation. Those types of restrictions preclude the banks from funding themselves offshore in foreign
currency, while prudential supervision and regulation prevent them from making foreign currency-denominated loans to firms in the nontraded goods sector.
12. Exceptions are Eichengreen and Leblang (2003) and von Hagen and Zhou (2006). The former estimate a bivariate probit model, while the latter use a simultaneous equation model.
13. Quinn and Inclán (1997) found that the effect of the political leaning of the government on the decision to lift or impose capital controls depends on the abundance of skilled labor. In this respect, their evidence suggests that leftist governments support financial openness in countries where skilled labor is abundant. Conversely, leftist governments in nations without a strong advantage in skilled labor tend to restrict capital flows.
14. The same framework has been used to explain the choice of intermediate or floating regimes.
15. Examples are foreign portfolio investment, for the case of the FA, and the nominal ER and macroeconomic variables related to the foreign exchange market for de facto ERR classifications.
16. Regarding the ‘‘inconclusive’’ regimes, Bubula and Ötker-Robe (2002) argue that countries such as France and Belgium with obvious ERRs (horizontal bands in IMF classifications and de facto pegs in Reinhart and Rogoff (2004)) are classified as ‘‘inconclusive.’’ In spite of this potential misclassification, in their latest update, less than 2% of the regimes are classified as ‘‘inconclusive.’’
17. As Reinhart and Rogoff (2004) argue, under official peg arrangements dual or parallel rates have been used as a form of back-door floating.
18. A country’s ER arrangement is classified as ‘‘freely falling’’ when the 12-month inflation rate equals or exceeds 40% per annum, or in the 6 months following an ER crisis where the crisis marked a movement from a peg or an intermediate regime to a floating regime (managed or freely floating). For more details on this classification, see the Appendix in Reinhart and Rogoff (2004).
19. The index is the sum of 12 components related to capital flows: (1) controls on inflows of invisible transactions (proceeds from invisible transactions and repatriation and surrender requirements); (2) controls on outflows of invisible transactions (payments for invisible transactions and current transfers); (3) controls on inflows of invisible transactions from exports; (4) controls on inflows pertaining to capital and money market securities; (5) controls on outflows pertaining to capital and money market securities; (6) controls on inflows pertaining to credit operations; (7) controls on outflows pertaining to credit operations; (8) controls on inward direct investment; (9) controls on outward direct investment; (10) controls on real estate transactions; (11) provisions specific to commercial banks; and (12) ER structure. These 11 dummy variables are obtained from the Annual Report on Exchange Arrangements and Exchange Restrictions published by the IMF.
20. Brune’s ER structure component is a binary variable assuming the value of 1 when a country has dual or multiple ERs.
21. A list of the countries included in the analysis is provided in Appendix C.
22. In terms of the sequencing of these two policies, this tendency suggests that countries tried to learn how to live with a more flexible ERR before liberalizing the FA.
23. The advocates of the ‘‘bipolar view’’ state that the intermediate ERRs are disappearing in favor of the two corners, hard pegs and flexible arrangements.
24. For example, $\eta_{sk}$ is the parameter associated with the effect of soft pegs (intermediate regimes) on financial openness. The subscript $k$ stands for capital (flows), which are intrinsically related to the FA.
25. See Section 3.2 for definitions of pegs, intermediate, and flexible regimes.
26. In heteroscedastic probit models, the variance of the error term is assumed to be a function of a vector of regressors $z$. For example, a common practice is to assume that $(\nu_{it,k} \mid X_{1,it}=x_{1,it}, S_{it}, F_{it}) \sim N(0, \sigma^2_{it,k})$ with $\sigma_{it,k} = \sqrt{\exp(z_{it}'\xi)}$.
27. Provided we had an instrument for each endogenous dummy variable $z_j$ such that $S \mid z_S, X_{1,it}=x_{1,it} \sim N(\mu_S, \sigma^2_S)$ and $F \mid z_F, X_{1,it}=x_{1,it} \sim N(\mu_F, \sigma^2_F)$, the reduced form for $K$ would also be a probit model, and, therefore, the parameters in Eq. (2) could be estimated by using a two-stage method.
28. The propensity to implement a peg, the third regime, is described by the following latent index: $P^{*}_{it} = X_{1,it}\Pi^{p}_{1} + X^{p}_{2,it}\Pi^{p}_{2} + \nu_{it,p}$.
29. Therefore, the first two residuals in $\nu_{it}$ are the difference in errors for intermediate regimes and pegs and the difference in errors for flexible regimes and pegs.
30. vech(M) is the vector obtained by stacking the columns of the lower triangular part of the symmetric matrix M on top of each other.
31. The polity variable ranges from +10 (strongly democratic) to −10 (strongly autocratic).
32. We admit that per capita GDP can be a crude proxy for the development of general legal systems and institutions. The reason is that there might be countries with high per capita income and indices of economic and institutional development below those of advanced economies. In spite of this concern, we use per capita GDP as a proxy for institutional development on two grounds: (i) on average a positive correlation is expected between per capita income and the development of general legal systems and institutions (see, e.g., the widely cited study of Acemoglu, Johnson, & Robinson, 2001); and (ii) it is difficult to find other measures for this variable.
33. Klein regresses the logarithm of per capita income on a composite index of five series measuring institutional quality. The five components of the index are Bureaucratic Quality, Control of Corruption in Government, Risk of Expropriation, Repudiation of Government Contracts, and Rule of Law.
34. Typically overinvoicing of imports or underinvoicing of exports.
35. Therefore, $X_{2,it} = [SPILL_{it}, SHARE_{it}]$.
36. For example, in 2000, about 88% of Mexican exports went to the United States. In this case, the variable SHARE for Mexico in period t = 2000 is equal to 0.88.
37. In another study, Frieden et al. (2000) use the variable VIEWS to measure the percentage of countries in the world under fixed ERRs. Since the correlation between the VIEWS variable and a time trend included in their regression turned out to be extremely high (0.96), they only present the results using the time trend.
38. The members of the CIS included in their regressions are Armenia, Azerbaijan, Belarus, Georgia, Kazakhstan, Kyrgyz Republic, Moldova, Russia, Tajikistan, Turkmenistan, Ukraine, and Uzbekistan. The non-CIS group is composed of Bulgaria, Czech Republic, Hungary, Poland, Romania, Slovak Republic, Slovenia, Estonia, Latvia, Lithuania, Albania, Croatia, and Macedonia.
39. The sign of the estimated coefficients and their significance levels are not invariant across ERR classifications: de jure versus de facto. While the evidence shows that countries with geographically concentrated foreign trade are less likely to adopt a fixed ERR when the de jure classification is used, the opposite effect is found when the de facto classification is used.
40. Halton draws obtained from a Halton sequence can provide better coverage than simple random draws because they are created to progressively fill in the unit interval evenly and ever more densely. In fact, it is argued that for mixed logit models 100 Halton draws might provide more precise results than 1,000 random draws in terms of simulation error (Train, 2003, p. 231).
41. In this context, a choice probability refers to the likelihood that a country chooses a particular outcome from the set of all possible outcomes (e.g., the probability of choosing a peg and a closed FA). To simplify notation, the conditioning on the exogenous regressors, $X_{1it}$ and $X_{2it}$, is omitted in all the probabilities described in the chapter.
42. In fact, for each joint probability involving n events, there are n! alternative combinations of conditional and marginal probabilities.
43. We take utility differences only between the ERRs.
44. We use the refinement of the updating formula for the approximation of the inverse of the Hessian proposed by Broyden, Fletcher, Goldfarb, and Shanno. This numerical algorithm is suitable when the evaluation of the Hessian is impractical or costly and the log-likelihood function is close to quadratic.
45. Note that $c_{11}$ is the (1,1) element of the Cholesky factor of $\Omega_{\nu}$. It is important to mention that all the elements of the Cholesky factors for the intermediate and floating regimes, $L_s$ and $L_f$, are functions of the elements of the Cholesky factor $L_{\nu}$. In fact, all the elements of $L_{\nu}$ are estimated along with the parameters of the random utilities.
46. This transformation measures the depletion of the real value of the currency.
47. In Appendix B, we show the results of a pooled model using data on both advanced and emerging markets.
48. In this model, the dummy variables S and F are treated as strictly exogenous.
49. See, for example, Begg et al. (2003) and von Hagen and Zhou (2006).
50. This does not necessarily mean that the difference in the estimated coefficients is the result of an endogeneity bias. In fact, the observed difference might be just a sampling difference.
51. Compared to the difference exhibited by $\eta_{sk}$.
52. Notice that $\hat{\eta}_{sk}$ in model 4 changes from 0.055 to 1.627 for advanced economies.
53. Notice that the estimate of $\alpha_2$ for advanced economies is almost half of the estimate for emerging markets. This indicates that, all else being equal, emerging economies need higher values of the explanatory variables to jump from the partially closed state toward a partially open FA.
54. Note that in this context the effect of an exogenous variable, say X, on the propensity to open the FA depends on the ERR. For example, when we refer to the effect of X under a float, we refer not only to the interaction term but also to the sum of the coefficients associated with X and (F × X).
55. Except for relative size, which is statistically different from zero in advanced countries with floating regimes.
56. Although the coefficient on the interaction of per capita income and F is negative, it is not statistically significant.
57. Of all the studies included in Table 1, only Leblang (1997) includes foreign exchange in the vector of ‘‘exogenous’’ determinants of capital controls. There are at least two reasons why our results differ from Leblang’s findings. First, he does not control for the endogeneity of the ERR. Second, he estimates a univariate model.
58. Leblang also obtains evidence that this effect is more intense in countries with fixed ERs.
59. This effect is equal to $0.970 = 1.139 - 0.169$.
60. A negative relationship between trade openness and capital controls is commonly found in the empirical literature (see, e.g., Grilli & Milesi-Ferretti, 1995; von Hagen & Zhou, 2006).
61. Surprisingly, von Hagen and Zhou (2006) and Walker (2003) do not include inflation as one of the determinants of capital controls.
62. The argument that the accumulation of foreign exchange reserves may substitute for what would otherwise be private sector capital outflows in countries with capital controls supports the presumption that FA policy affects the demand for reserves.
63. The effect of per capita income on FA openness in an emerging country with a soft peg is equal to $1.077 = 2.534 - 1.457$, and equal to $-0.4 = 2.534 - 2.934$ when a floating regime is implemented.
64. See, for example, Eichengreen and Leblang (2003), Leblang (1997), and von Hagen and Zhou (2006).
65. In our ERR equations, we find that countries with high inflation rates are more prone to peg their currencies.
66. POLITY was dropped from the regressions associated with advanced economies because the between and within variations of this variable were very small.
67. The Maastricht Treaty required members who wanted to become part of the European Economic and Monetary Union to attain price stability: a maximum inflation rate 1.5% above the average of the three lowest national rates among European Union members. The Maastricht Treaty included a provision calling for the introduction of the single European currency and a European central bank no later than the first day of 1999. By 1993, all 12 countries then belonging to the European Union had ratified the Maastricht Treaty: France, Germany, Italy, Belgium, Denmark, Ireland, Luxembourg, the Netherlands, Spain, Portugal, the United Kingdom, and Greece. Austria, Finland, and Sweden accepted the Treaty’s provisions upon joining the European Union in 1995.
68. The European Monetary System defined the Exchange Rate Mechanism to allow most currencies to fluctuate +/− 2.25% around target ERs (France, Germany, Italy, Belgium, Denmark, Ireland, Luxembourg, and the Netherlands). This mechanism allowed larger fluctuations (+/− 6%) for the currencies of Portugal, Spain, Britain (until 1992), and Italy (until 1990).
69. After the Maastricht Treaty, many European members have been classified as de facto fixers.
70. See Hausmann et al. (2001).
71. It has been argued that Pacific Asian countries are formally or informally managing their currencies with respect to the U.S. dollar in a similar fashion to the way they did during the Bretton Woods system.
72. See, for example, Eichengreen and Razo-Garcia (2006). These authors showed that splitting the countries into emerging and advanced countries helps reconcile the ‘‘bipolar’’ and ‘‘fear of floating’’ views.
REFERENCES Acemoglu, D., Johnson, S., & Robinson, J. A. (2001). The colonial origins of comparative development: An empirical investigation. American Economic Review, 91(5), 1369–1401. Alesina, A., Grilli, V., & Milesi-Ferretti, G. (1994). The political economy of capital controls. In: L. Leiderman & A. Razin (Eds), Capital mobility: The impact on consumption, investment and growth (pp. 289–321). Cambridge: Cambridge University Press. Aramaki, K. (2006). Sequencing of capital account liberalization – Japan’s experiences and their implications to China. Public Policy Review, 2(1). Barro, R., & S. Tenreyro (2003). Economic effects of currency unions. NBER Working Paper No. 9435. National Bureau of Economic Research Inc., Cambridge, MA. Bartolini, L., & Drazen, A. (1997a). Capital account liberalization as a signal. American Economic Review, 87(1), 249–273. Bartolini, L., & Drazen, A. (1997b). When liberal policies reflect external shocks, what do we learn? Journal of International Economics, 42(3-4), 249–273. Begg, D., Eichengreen, B., Halpern, L., von Hagen, J., & Wyplosz, C. (2003). Sustainable regimes of capital movements in accession countries. Policy paper no.10, Centre for Economic Policy Research, London. Bernhard, W., & Leblang, D. (1999). Democratic institutions and exchange rate commitments. International Organization, 53(1), 71–97. Blomberg, B., Frieden, J., & Stein, E. (2005). Sustaining fixed rates: The political economy of currency pegs in Latin America. Journal of Applied Economics, VIII(2), 203–225. Broz, L. (2002). Political economy transparency and monetary commitment regimes. International Organization, 56(4), 861–887. Bubula, A., & O¨tker-Robe, I. (2002). The evolution of exchange rate regimes since 1990: Evidence from De Facto policies. IMF Working Paper no. 02/155. International Monetary Fund, Washington, DC. Calvo, G., & Reinhart, C. (2002). Fear of floating. Quarterly Journal of Economics, 47(2), 379–408. Carrasco, R. (2001). Binary choice with binary endogenous regressors in panel data: Estimating the effect of fertility on female labor participation. Journal of Business & Economic Statistics, 19(4), 385–394. Chang, R., & Velasco, A. (2006). Currency mismatches and monetary policy: A tale of two equilibria. Journal of International Economics, 69(1), 150–175. Chernozhukov, V., & Hong, H. (2003). An MCMC approach to classical estimation. Journal of Econometrics, 115(2), 293–346.
Coibion, O., & Gorodnichenko, Y. (2008). Strategic interaction among heterogeneous pricesetters in an estimated DSGE model. Working Paper no. 14323. National Bureau of Economic Research, Inc., Cambridge, MA. Collins, S. M. (1996). On becoming more flexible: Exchange rate regimes in Latin America and the Caribbean. Journal of Development Economics, 51, 117–138. Dooley, M. P., Folkerts-Landau, D. & Garber, P. (2003). An essay on the revived Bretton Woods system. NBER Working Paper no. 5756. National Bureau of Economic Research, Inc., Cambridge, MA. Edwards, S. (1996). The determinants of the choice between fixed and flexible exchange rate regimes. NBER Working Paper no. 5756. National Bureau of Economic Research, Inc., Cambridge, MA. Eichengreen, B. (2002). Capital account liberalization: What do the cross country studies show us? World Bank Economic Review, 15, 341–366. Eichengreen, B. (2004). Chinese currency controversies. CEPR Discussion Paper no. 4375. CEPR Discussion Papers. Eichengreen, B. (2005). China’s exchange rate regime: The long and short of it. University of California, Berkeley. Eichengreen, B., Hausmann, R., & Panizza, U. (2003). Currency mismatches, debt intolerance and original sin: Why they are not the same and why it matters. NBER Working Paper no. 10036. National Bureau of Economic Research, Inc., Cambridge, MA. Eichengreen, B., & Leblang, D. (2003). Exchange rates and cohesion: Historical perspectives and political economy considerations. Journal of Common Market Studies, 41, 797–822. Eichengreen, B., & Razo-Garcia, R. (2006). The international monetary system in the last and next 20 years. Economic Policy, 21(47), 393–442. Frieden, J., Ghezzi, P., & Stein, E. (2000). Politics and exchange rates: A cross-country approach to Latin America. Research Network Working Paper no. R421. Inter-American Development Bank, Washington, DC. Ghosh, A., Gulde, A., & Wolf, H. (2003). Exchange rate regimes: Choices and consequences. Cambridge, MA: MIT Press. Grilli, V., & Milesi-Ferretti, G. (1995). Economic effects and structural determinants of capital controls. IMF Working Paper WP/95/31. International Monetary Fund, Washington, DC. Hanson, J. (1995). Opening the capital account: Costs, benefits and sequencing. In: S. Edwards (Ed.), Capital controls, exchange rates and monetary policy in the world economy. Cambridge: Cambridge University Press. Hausmann, R., Panizza, U., & Stein, E. (2001). Why do countries float the way they float? Journal of Development Economics, 66(2), 387–414. Holden, P., Holden, M., & Suss, E. (1979). The determinants of exchange rate flexibility: An empirical investigation. The Review of Economics and Statistics, 61(3), 327–333. Juhn, G., & Mauro, P. (2002). Long-run determinants of exchange rate regimes: A simple sensitive analysis. IMF Working Paper no. 02/104, p. 31. International Monetary Fund, Washington, DC. Klein, M. (2005). Capital Account Liberalization, Institutional Quality and Economic Growth: Theory and Evidence. NBER Working Paper no. 11112, p. 37. National Bureau of Economic Research, Inc., Cambridge, MA. Klein, M., & Marion, N. (1997). Explaining the duration of the exchange-rate pegs. Journal of Development Economics, 54, 387–404.
Lahiri, A., Singh, R., & Vegh, C. (2007). Segmented asset markets and optimal exchange rate regimes. Journal of International Economics, 72(1), 1–21. Leblang, D. (1997). Domestic and systemic determinants of capital controls in the developed and developing countries. International Studies Quarterly, 41(3), 435–454. Leblang, D. (2003). To defend or to devalue: The political economy of exchange rate policy. International Studies Quarterly, 47(4), 533–560. Leblang, D. (2005). Is democracy incompatible with international economic stability? In: M. Uzan (Ed.), The future of the international monetary system. London: Edward Elgar Publishing. Levy-Yeyati, E., Reggio, I., & Sturzenegger, F. (2002). On the endogeneity of exchange rate regimes. Universidad Torcuato Di Tella, Business School Working Papers no. 11/2002. Buenos Aires, Argentina. Levy-Yeyati, E., & Sturzenegger, F. (2005). Classifying exchange rate regimes: Deeds vs. Words. European Economic Review, 49, 1603–1635. McKinnon, R. (1993). The order of economic liberalization. Baltimore: The John Hopkins University Press. Mundell, R. (1961). A theory of optimum currency areas. American Economic Review, 51(4), 657–665. Obstfeld, M., Shambaugh, J., & Taylor, A. (2004). The trilemma in history: Tradeoffs among exchange rates, monetary policies, and capital mobility. NBER, Working Paper no. 10396. National Bureau of Economic Research, Inc., Cambridge, MA. Poirson, H. (2001). How do countries choose their exchange rate regime?. IMF Working Paper no. 01/46. International Monetary Fund, Washington, DC. Prasad, E., Rumbaug, T. & Wang, Q. (2005). Putting the cart before the horse? Capital account liberalization and exchange rate flexibility in China. IMF Policy Discussion Paper no. 05/1. Quinn, D., & Incla´n, C. (1997). The origins of financial openness: A study of current and capital account liberalization. American Journal of Political Science, 41(3), 771–813. Reinhart, C., & Rogoff, K. (2004). The modern history of exchange rate arrangements: A reinterpretation. Quarterly Journal of Economics, 119(1), 1–48. Rogoff, K., Husain, A., Mody, A., Brooks, R., & Oomes, N. (2004). Evolution and performance of exchange rate regimes. IMF Ocassional Paper 229. International Monetary Fund, Washington, DC. Savvides, A. (1990). Real exchange rate variability and the choice of the exchange rate regime by developing countries. Journal of International Money and Finance, 9, 440–454. Shambaugh, J. C. (2004). The effect of fixed exchange rates on monetary policy. Quarterly Journal of Economics, 119(1), 301–352. Simmons, B., & Hainmueller, J. (2005). Can domestic institutions explain exchange rate regime choice? The political economy of monetary institutions reconsidered. International Finance 0505011, EconWPA. Train, K. (2003). Discrete choice methods with simulation. Cambridge Books: Cambridge University Press. von Hagen, J., & Zhou, J. (2002). The choice of the exchange rate regimes: An empirical analysis for transition economies. ZEI Working Paper no B02-03. University of Bonn. Center for European Integration Studies (ZEI), Bonn, Germany. von Hagen, J., & Zhou, J. (2005). The determination of capital controls: Which role do exchange rate regimes play? Journal of Banking and Finance, 29, 227–248.
von Hagen, J., & Zhou, J. (2006). The interaction between capital controls and exchange rate regimes: Evidence from developing countries. CEPR Discussion Paper no. 5537. CEPR Discussion Papers. Walker, R. (2003). Partisan substitution and international finance: Capital controls and exchange rate regime choice in the OECD. Ph.D. thesis, Rochester University (draft), Rochester, New York.
APPENDIX A. GHK SIMULATOR WITH MAXIMUM LIKELIHOOD

As we mentioned in the main body of the chapter, we need to address four critical points when the GHK simulator is applied under the maximum likelihood framework (Train, 2003, p. 133). First, the model has to be normalized for scale and level of utility to ensure that the parameters are identified. Second, the GHK simulator takes utility differences against the regime for which the probability is being calculated, and so different differences must be taken for countries choosing different regimes. Third, for a country choosing to peg its currency the GHK simulator uses the covariance matrix $\Omega_{\nu}$, for a country choosing a soft peg it uses $\Omega_{s\nu}$, while for a country with a floating regime it needs $\Omega_{f\nu}$. In other words, the elements of $\Omega_{\nu}$ are consistent with the elements of $\Omega_{s\nu}$ and $\Omega_{f\nu}$ in the sense that the three matrices are derived from the $\Omega$ matrix. Fourth, the covariance matrices have to be positive definite. To address these issues we follow these steps:

1. To ensure that the model is identified, we start with the covariance matrix of the scaled utility differences, with the differences taken against the peg regime, $\Omega_{\nu}$. To ensure positive definiteness of the covariance matrices, we parameterize the model in terms of the Cholesky decomposition of $\Omega_{\nu}$:
$$
L_{\nu} = \begin{bmatrix} c_{11} & 0 & 0 \\ c_{21} & c_{22} & 0 \\ c_{31} & c_{32} & c_{33} \end{bmatrix}
$$
Since $\Omega_{\nu} = L_{\nu} L_{\nu}'$, $\Omega_{\nu}$ is positive definite for any estimated values of the $c$'s.
2. The covariance matrix of the original residuals, $\Omega$, is recovered using the following lower-triangular matrix:
$$
L = \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & c_{11} & 0 & 0 \\ 0 & c_{21} & c_{22} & 0 \\ 0 & c_{31} & c_{32} & c_{33} \end{bmatrix}
$$
Therefore $\Omega = L L'$. Be aware that the first row and column of this covariance matrix are equal to zero. This means that we are subtracting from all the ERR residuals of country $i$ in period $t$ the residual of the peg equation. With this $\Omega$ matrix we can derive $\Omega_{s\nu}$ and $\Omega_{f\nu}$:
$$
\Omega_{s\nu} = M_s \Omega M_s' = L_s L_s', \qquad \Omega_{f\nu} = M_f \Omega M_f' = L_f L_f'
$$
where $L_s$ and $L_f$ are the Cholesky factors of the covariance matrix when either a soft peg or a floating regime is chosen, respectively, and (with the signs of the differencing matrices restored)
$$
M_s = \begin{bmatrix} 1 & -1 & 0 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}, \qquad
M_f = \begin{bmatrix} 1 & 0 & -1 & 0 \\ 0 & 1 & -1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}
$$
This procedure takes into account the four issues mentioned above (Table B1).
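To make the mapping from the estimated Cholesky elements to the three covariance matrices concrete, the following short sketch (Python/NumPy) reproduces the steps above. The numerical values of the $c$'s are purely illustrative and are not the chapter's estimates; the signs in $M_s$ and $M_f$ follow the reconstruction discussed above.

```python
import numpy as np

# Illustrative Cholesky elements (not the chapter's estimates).
c11, c21, c22, c31, c32, c33 = 1.0, 0.3, 0.9, 0.2, 0.1, 0.8

# Step 1: Cholesky factor of the utility differences taken against the peg.
L_nu = np.array([[c11, 0.0, 0.0],
                 [c21, c22, 0.0],
                 [c31, c32, c33]])
Omega_nu = L_nu @ L_nu.T              # positive definite by construction

# Step 2: recover the covariance of the original residuals (first row/column zero).
L = np.zeros((4, 4))
L[1:, 1:] = L_nu
Omega = L @ L.T

# Differencing matrices against the soft peg and the float.
M_s = np.array([[1, -1, 0, 0],
                [0, -1, 1, 0],
                [0,  0, 0, 1]], dtype=float)
M_f = np.array([[1, 0, -1, 0],
                [0, 1, -1, 0],
                [0, 0,  0, 1]], dtype=float)

Omega_s = M_s @ Omega @ M_s.T
Omega_f = M_f @ Omega @ M_f.T

# Cholesky factors used by the GHK simulator for soft-peg and floating choosers.
L_s = np.linalg.cholesky(Omega_s)
L_f = np.linalg.cholesky(Omega_f)

print(np.all(np.linalg.eigvalsh(Omega_s) > 0),
      np.all(np.linalg.eigvalsh(Omega_f) > 0))   # both True: positive definite
```

Because $\Omega_{s\nu}$ and $\Omega_{f\nu}$ are built from the same $c$'s, positive definiteness and the consistency of the three matrices are automatic, which is exactly what the four issues above require.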
APPENDIX B. ESTIMATION POOLING ADVANCED AND EMERGING COUNTRIES Table B1.
Financial Account and Exchange Rate Regime Equations (1975–2006): Pooled Regression. Advanced and Emerging Countries Model 1 y
Financial account equation 0.748 Zsk 0.523 Zf k
Constant FINDEV S*FINDEV F*FINDEV RESERVES/M2 S*(RESERVES/M2) F*(RESERVES/M2) GDP per capita S*GDP per capita F*GDP per capita OPENNESS S*OPENNESS F*OPENNESS INFLATION S*INFLATION F*INFLATION RELATIVE SIZE S*RELATIVE SIZE F*RELATIVE SIZE POLITY S*POLITY F*POLITY ASIA S*ASIA F*ASIA EMERGING S*EMERGING F*EMERGING CRISIS Cutoffs a2 a2
Model 2 S.E.
y
Model 3 S.E.
y
0.079 0.722 0.209 1.925 0.091 0.411 0.243 1.040
0.422 0.043
0.092 0.035
0.049
0.043
0.564
0.059
0.284
0.044
0.455
0.072
0.055
0.023
0.058
0.044
0.748
0.098
0.359
0.135
0.118 0.470 1.336
0.208
0.190 0.082 0.057 0.134 0.384 0.877 0.442 0.775 0.131 0.897 0.551 0.645 0.487 0.418 0.258 0.073 0.241 0.031 0.190 0.055 0.020 0.371 2.529 2.100 2.496 1.366 0.818 2.592 0.029
0.168 0.091 0.117 0.110 0.073 0.105 0.096 0.151 0.178 0.179 0.078 0.108 0.155 0.146 0.201 0.180 0.240 0.251 0.242 0.073 0.099 0.119 0.220 0.268 0.284 0.277 0.357 0.364 0.253
0.031 0.053
0.522 1.490
0.035 0.059
Model 4 S.E.
0.081 0.869 0.255 0.145
1.139 0.004
0.094 0.029
0.116
0.036
0.499
0.040
0.188
0.035
0.277
0.043
0.009
0.037
0.075
0.026
0.676
0.057
0.389
0.082
0.407 0.209 0.395 1.125
y
0.025 0.041
S.E.
0.227 0.560
0.410 0.094 0.053 0.165 0.322 0.831 0.414 0.731 0.173 0.801 0.499 0.632 0.493 0.357 0.290 0.016 0.313 0.002 0.222 0.031 0.020 0.335 2.606 2.235 2.657 1.312 0.820 2.505 0.286
0.234 0.071 0.096 0.084 0.106 0.106 0.122 0.132 0.163 0.137 0.078 0.086 0.122 0.094 0.111 0.086 0.112 0.142 0.104 0.130 0.155 0.146 0.305 0.363 0.372 0.210 0.206 0.209 0.283
0.488 1.428
0.029 0.036
Table B1. (Continued ) Advanced and Emerging Countries Model 1 y
Model 2 S.E.
Model 3
y
S.E.
Intermediate exchange rate regime equation Constant 0.077 0.087 SPILL 0.296 0.716 SHARE 0.026 0.034 FINDEV 0.046 0.045 RESERVES/M2 0.073 0.042 GDP per capita 0.069 0.048 OPENNESS 0.008 0.043 INFLATION 0.225 0.059 RELATIVE SIZE 0.323 0.087 POLITY 0.034 0.040 CRISIS 1.781 0.214
0.077 0.296 0.026 0.046 0.073 0.069 0.008 0.225 0.323 0.034 1.781
0.087 0.716 0.034 0.045 0.042 0.048 0.043 0.059 0.087 0.040 0.214
Floating exchange rate regime equation Constant 1.224 SPILL 3.762 SHARE 0.015 FINDEV 0.065 RESERVES/M2 0.231 GDP per capita 0.224 OPENNESS 0.427 INFLATION 0.038 RELATIVE SIZE 0.446 POLITY 0.092 CRISIS 2.052
1.224 3.762 0.015 0.065 0.231 0.224 0.427 0.038 0.446 0.092 2.052
0.101 0.795 0.038 0.052 0.046 0.058 0.066 0.058 0.104 0.048 0.200
0.101 0.795 0.038 0.052 0.046 0.058 0.066 0.058 0.104 0.048 0.200
sff sks skf ssf Memorandum Observations Endogeneity accounted for log-likelihood
y
Model 4 S.E.
y
S.E.
0.190 1.516 0.096 0.019 0.242 0.038 0.114 0.253 0.145 0.071 0.148
0.056 0.392 0.024 0.031 0.036 0.039 0.034 0.054 0.053 0.029 0.238
0.217 1.085 0.058 0.051 0.207 0.001 0.083 0.321 0.078 0.061 0.652
0.061 0.493 0.028 0.026 0.039 0.034 0.036 0.042 0.065 0.033 0.157
2.465 7.807 0.080 0.133 0.690 0.573 1.280 0.024 1.033 0.271 5.687
0.630 1.097 2.340 4.391 0.082 0.026 0.113 0.060 0.143 0.392 0.153 0.268 0.265 0.630 0.146 0.106 0.252 0.578 0.104 0.202 1.234 2.795
5.565 0.804 0.077 0.503
1552 No 3094.08
1552 No 2979.08
1552 Yes 2757.20
0.358 0.694 0.061 0.057 0.089 0.111 0.160 0.084 0.164 0.080 0.870
1.268 0.043 0.293 0.195
1552 Yes 2655.40
Notes: ***, **, and * denote coefficients statistically different from zero at the 1%, 5%, and 10% significance levels, respectively. Intermediate regimes include preannounced crawling peg, preannounced crawling band that is narrower than or equal to 2%, de facto crawling peg, de facto crawling band that is narrower than or equal to 2%, preannounced crawling band that is narrower than or equal to 2%, de facto crawling band wider than or equal to 5%, moving band narrower than or equal to 2% (i.e., allows for both appreciation and depreciation over time), and de facto crawling band that is narrower than or equal to 2%. Floating regimes include managed floating and freely floating arrangements.
APPENDIX C. DATA Table C1.
Data sources.
Variable
Source
Definition or Transformation Units
POLITY CPIa INFLATIONb
Polity IV project IFS Line 64 CPI
Political regime Consumer price index Annual inflation
Index (10,10) Index (2000=100) D % over
M2c RESERVES EXPORTS IMPORTS GDP FINDEV OPENNESS
IFS Line 35 IFS Line 1L IFS Line 90C IFS Line 98C IFS Line 99 WDI and IFS WDI and IFS
National currency U.S. dollars National currency National currency National currency % %
RESM2 GDP per capita RELATIVE SIZE SHARE
WDI and IFS WDI WDI
Money + quasi money Total reserves – Gold Exports of goods and services Imports of goods and services Gross domestic product M2/GDP Exports plus imports over GDP Reserves/M2 GDP per capita Size relative to the U.S. (GDP) Percentage of total exports with main partner de facto ‘‘Natural’’ exchange rate regime classification =1 freely falling regime from Reinhart and Rogoff Financial openness index (excluding exchange rate regime)
previous year
ERR CRISIS BFOI
DOTSd Reinhart and Rogoff (2004) Authors’ calculation Nancy Brune
% 2000 U.S.dollar % % 15 categories Dummy variable Index (0,11)
IFS stands for International Financial Statistics. a For missing observations, we use CPI from Global Financial Data. b The inflation variable we use in the estimations is equal to ðp=1 þ pÞ, where p is the annual inflation. c For Eurozone members, we use data from the Yearbook of International Financial Statistics. d DOTS stands for Direction of Trade Statistics (IMF).
Table C2. Advanced Countries (24) Australia Austria Belgium Canada Denmark Finland France Germany
Classifications of Countries. Greece Iceland Ireland Italy Japan Luxembourg the Netherlands New Zealand
Emerging Markets Countries (32) Argentina India Brazil Indonesia Bulgaria Israel Chile Jordan China Korea Rep. Colombia Malaysia Czech Republic Mexico Ecuador Morocco Egypt Arab Rep. Nigeria Hong Kong China Pakistan Hungary Panama
Norway Portugal San Marino Spain Sweden Switzerland United Kingdom United States Peru Philippines Poland Russian Federation Singapore South Africa Sri Lanka Thailand Turkey Venezuela RB
ESTIMATING A FRACTIONAL RESPONSE MODEL WITH A COUNT ENDOGENOUS REGRESSOR AND AN APPLICATION TO FEMALE LABOR SUPPLY

Hoa B. Nguyen

ABSTRACT

This chapter proposes M-estimators of a fractional response model with an endogenous count variable in the presence of time-constant unobserved heterogeneity. To address the endogeneity of the right-hand-side count variable, I use instrumental variables and a two-step estimation procedure. Two methods of estimation are employed: quasi-maximum likelihood (QML) and nonlinear least squares (NLS). Using these methods, I estimate the average partial effects, which are shown to be comparable across linear and nonlinear models. Monte Carlo simulations verify that the QML and NLS estimators perform better than other standard estimators. For illustration, these estimators are used in a model of female labor supply with an endogenous number of children. The results show that the marginal reduction in women’s working hours per week is smaller as women have one additional child. In addition, the effect
Maximum Simulated Likelihood Methods and Applications
Advances in Econometrics, Volume 26, 253–298
Copyright © 2010 by Emerald Group Publishing Limited
All rights of reproduction in any form reserved
ISSN: 0731-9053/doi:10.1108/S0731-9053(2010)00000260012
of the number of children on the fraction of hours that a woman spends working per week is statistically significant and more significant than the estimates in all other linear and nonlinear models considered in the chapter.
1. INTRODUCTION

Many economic models employ a fraction or a percentage, instead of level values, as a dependent variable. In these models, economic variables of interest occur as fractions, such as employee participation rates in 401(k) pension plans, firm market shares, and fractions of hours spent working per week. Even though fractional response models (FRMs) with the quasi-maximum likelihood estimation (QMLE) approach have been developed, they have not accounted for binary or count endogenous regressors. The traditional two-stage least squares approach, which accounts for the problem of endogeneity, does not produce consistent estimators for nonlinear simultaneous equations models (Wooldridge, 2002). Maximum likelihood (ML) techniques or two-stage methods of moments are, therefore, needed (Terza, 1998; Wooldridge, 2002). In general, in simultaneous equations models in which the response variable or the endogenous regressor is a dummy, it is common to use the ML approach or a two-stage method of moments because the computational task is not demanding. With a count endogenous regressor, which takes more values than just 0 and 1, it is more difficult to specify and apply the ML approach for several reasons. First of all, the ML approach in an FRM, which requires specifying the conditional distribution function, may lead to predicted values lying outside the unit interval. Second, in nonlinear models, the ML technique is more computationally intensive than the corresponding QMLE approach in the presence of a count endogenous regressor. Third, it is desirable to specify a density function that admits specifications of different conditional moments and other distribution characteristics. These reasons motivate researchers to consider the method of QMLE. There are numerous studies that investigate the problem of count data and binary endogenous regressors. Some researchers have proposed to use standard assumptions such as a linear–exponential (LEF) specification. A Poisson or Negative Binomial distribution of the count response variable is considered in a model with binary endogenous regressors.
Mullahy (1997) and Windmeijer and Santos-Silva (1997) use Generalized Method of Moments (GMM) estimation based on an LEF specification and a set of instruments. Terza (1998) focuses on the two-stage method (TSM) and the weighted nonlinear least squares (WNLS) estimators using a bivariate normal distributional assumption with respect to the joint distribution of the unobserved components of the model. These models cannot be extended to count endogenous regressors unless a linear relationship between the regressor and instruments as well as the error term is allowed. This means the count endogenous regressors are treated as a continuous variable. The conditional mean of interest consequently has a closed form under this restricted assumption. Terza (1998) cited the Poisson/ Negative Binomial full information maximum likelihood (FIML) but the estimation of these models is not carried out, nor are their properties explored. In addition, the count variable in his model is a response variable instead of an explanatory variable. However, models with a count endogenous explanatory variable can benefit from his method. The assumption of a bivariate normal distribution for unobserved components suggested in Terza (1998) can be relaxed to be non normal. Weiss (1999) considers simultaneous binary/count models where the unobserved heterogeneity may be exponential gamma or normal so that the marginal distribution of the dependent variable is Negative Binomial. However, the estimation of the model depends on the function of the unobserved heterogeneity which depends on unknown parameters in the distribution of the unobserved effect. The joint standard normal distribution of the unobserved components in simultaneous binary choice/count models suggested in Weiss (1999) is a restrictive assumption in an application to FRMs. Other alternative estimations are semi-nonparametric or nonparametric as introduced in Das (2005). However, his paper does not have any application and the case of count endogenous variables is not thoroughly discussed. More importantly, semiparametric and nonparametric approaches have a major limitation in this case. These approaches have an inability to estimate the partial effects or the average partial effects (APEs). If we are interested in estimating both the parameters and the APEs, a parametric approach will be preferred. In this chapter, I show how to specify and estimate FRMs with a count endogenous explanatory variable and a time-constant unobserved effect. Based on the work of Papke and Wooldridge (1996), I also use models for the conditional mean of the fractional response in which the fitted value is always in the unit interval. I focus on the probit response function since the probit mean function is less computationally demanding in obtaining
256
HOA B. NGUYEN
the APEs. In order to consistently estimate the effect of an endogenous explanatory variable on the fractional response variable, I employ instrument variables to control for the endogeneity and then use the twostep estimation procedure. The count endogenous variable is assumed to have a Poisson distribution. As discussed in Winkelmann (2000), the error term (unobserved heterogeneity) in a Poisson model can be presented in terms of an additive correlated error or a multiplicative correlated error. However, the multiplicative correlated error has some advantage over the additive correlated error on grounds of consistency. As a result, a multiplicative correlated error is used in this chapter. I focus on the method of QMLE and NLS to get robust and efficient estimators. This chapter is organized as follows. Section 2 introduces the specifications and estimations of an FRM with a count endogenous explanatory variable and shows how to estimate parameters and the APEs using the QMLE and NLS approaches. Section 3 presents Monte Carlo simulations, and an application for the fraction of working hours for a female per week will follow in Section 4. Section 5 concludes.
2. THEORETICAL MODEL SPECIFICATION AND ESTIMATION

For a $1 \times K$ vector of explanatory variables $z_1$, the conditional mean model is expressed as follows:
$$
E(y_1 \mid y_2, z, a_1) = \Phi(\alpha_1 y_2 + z_1 \delta_1 + \eta_1 a_1) \qquad (1)
$$
where $\Phi(\cdot)$ is the standard normal cumulative distribution function (cdf), $y_1$ is a response variable ($0 \le y_1 \le 1$), and $a_1$ is a heterogeneous component. The exogenous variables are $z = (z_1, z_2)$, where the exogenous variables $z_2$ are excluded from Eq. (1); $z$ is a $1 \times L$ vector with $L > K$, and $z_2$ is a vector of instruments. $y_2$ is a count endogenous variable, and we assume that this endogenous regressor has a Poisson distribution:
$$
y_2 \mid z, a_1 \sim \text{Poisson}[\exp(z\delta_2 + a_1)] \qquad (2)
$$
Then the conditional density of $y_2$ is specified as follows:
$$
f(y_2 \mid z, a_1) = \frac{[\exp(z\delta_2 + a_1)]^{y_2} \exp[-\exp(z\delta_2 + a_1)]}{y_2!} \qquad (3)
$$
where $a_1$ is assumed to be independent of $z$, and $\exp(a_1)$ is distributed as $\text{Gamma}(\delta_0, 1/\delta_0)$ using a single parameter $\delta_0$, with $E[\exp(a_1)] = 1$ and $\text{Var}[\exp(a_1)] = 1/\delta_0$. After a transformation (see Appendix D for the derivation), the distribution of $a_1$ is derived as follows:
$$
f(a_1; \delta_0) = \frac{\delta_0^{\delta_0} [\exp(a_1)]^{\delta_0} \exp(-\delta_0 \exp(a_1))}{\Gamma(\delta_0)} \qquad (4)
$$
In order to get the conditional mean $E(y_1 \mid y_2, z)$, I specify the conditional density function of $a_1$. Using Bayes' rule, it is:
$$
f(a_1 \mid y_2, z) = \frac{f(y_2 \mid a_1, z)\, f(a_1 \mid z)}{f(y_2 \mid z)}
$$
Since $y_2 \mid z, a_1$ has a Poisson distribution and $\exp(a_1)$ has a gamma distribution, $y_2 \mid z$ is Negative Binomial II distributed, as a standard result. After some algebra, the conditional density function of $a_1$ is
$$
f(a_1 \mid y_2, z) = \frac{\exp[P]\, [\delta_0 + \exp(z\delta_2)]^{(y_2 + \delta_0)}}{\Gamma(y_2 + \delta_0)} \qquad (5)
$$
where $P = -\exp(z\delta_2 + a_1) + a_1(y_2 + \delta_0) - \delta_0 \exp(a_1)$. The conditional mean $E(y_1 \mid y_2, z)$, therefore, will be obtained as follows:
$$
E(y_1 \mid y_2, z) = \int_{-\infty}^{+\infty} \Phi(\alpha_1 y_2 + z_1 \delta_1 + \eta_1 a_1)\, f(a_1 \mid y_2, z)\, da_1 \equiv m(\theta; y_2, z) \qquad (6)
$$
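As a quick numerical check of Eqs. (3)–(5), the following sketch (Python with NumPy/SciPy) verifies that the Poisson–gamma mixture reproduces the Negative Binomial II marginal and that the conditional density in Eq. (5) integrates to one. The parameter values are illustrative and are not taken from the chapter.

```python
import numpy as np
from scipy import integrate
from scipy.special import gammaln

# Illustrative values: a single covariate index and parameters (assumptions).
z_delta2 = 0.4      # z * delta_2
delta0   = 3.0      # gamma parameter: E[exp(a1)] = 1, Var[exp(a1)] = 1/delta0
y2       = 2        # an observed count

def f_y2_given_a1(a1):
    # Poisson density in Eq. (3), evaluated in logs for stability
    lam = np.exp(z_delta2 + a1)
    return np.exp(y2 * np.log(lam) - lam - gammaln(y2 + 1))

def f_a1(a1):
    # density of a1 implied by exp(a1) ~ Gamma(delta0, 1/delta0), Eq. (4)
    return np.exp(delta0 * np.log(delta0) + delta0 * a1
                  - delta0 * np.exp(a1) - gammaln(delta0))

# Marginal f(y2 | z) by numerical integration of the Poisson-gamma mixture ...
mix, _ = integrate.quad(lambda a: f_y2_given_a1(a) * f_a1(a), -20, 20)

# ... and the closed-form Negative Binomial II probability it should equal.
mu = np.exp(z_delta2)
negbin = np.exp(gammaln(y2 + delta0) - gammaln(delta0) - gammaln(y2 + 1)
                + y2 * np.log(mu) + delta0 * np.log(delta0)
                - (y2 + delta0) * np.log(mu + delta0))

# Conditional density of a1 from Eq. (5); it should integrate to one.
def f_a1_post(a1):
    P = -np.exp(z_delta2 + a1) + a1 * (y2 + delta0) - delta0 * np.exp(a1)
    return np.exp(P + (y2 + delta0) * np.log(delta0 + mu) - gammaln(y2 + delta0))

total, _ = integrate.quad(f_a1_post, -20, 20)
print(mix, negbin, total)   # mix == negbin and total == 1, up to numerical error
```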
where $f(a_1 \mid y_2, z)$ is specified as above and $\theta = (\alpha_1, \delta_1, \eta_1)$. The estimators $\hat\theta$ of Eq. (6) are obtained by the QMLE or the NLS approach. The Bernoulli log-likelihood function is given by
$$
l_i(\theta) = y_{1i} \ln m_i + (1 - y_{1i}) \ln(1 - m_i) \qquad (7)
$$
The QMLE of $\theta$ is obtained from the maximization problem (see more details in Appendix A):
$$
\max_{\theta \in \Theta} \sum_{i=1}^{n} l_i(\theta)
$$
The NLS estimator of $\theta$ is obtained from the minimization problem (see more details in Appendix C):
$$
\min_{\theta \in \Theta}\; N^{-1} \sum_{i=1}^{N} \big[y_{1i} - m_i(\theta; y_{2i}, z_i)\big]^2 / 2
$$
Since $E(y_1 \mid y_2, z)$ does not have a closed-form solution, it is necessary to use a numerical approximation. The numerical routine for integrating out the unobserved heterogeneity in the conditional mean Eq. (6) is based on the Adaptive Gauss–Hermite quadrature. This adaptive approximation has proven to be more accurate with fewer points than the ordinary Gauss–Hermite approximation. The quadrature locations are shifted and scaled to be under the peak of the integrand. Therefore, the adaptive quadrature performs well with an adequate number of points (Skrondal & Rabe-Hesketh, 2004). Using the Adaptive Gauss–Hermite approximation, the above integral Eq. (6) can be obtained as follows:
$$
m_i = \int_{-\infty}^{+\infty} h_i(y_{2i}, z_i, a_1)\, da_1 \approx \sum_{m=1}^{M} \sqrt{2}\,\hat\sigma_i\, w_m \exp\{(a_m)^2\}\, h_i\big(y_{2i}, z_i, \sqrt{2}\,\hat\sigma_i a_m + \hat\omega_i\big) \qquad (8)
$$
where $\hat\sigma_i$ and $\hat\omega_i$ are the adaptive parameters for observation $i$, $w_m$ are the weights, $a_m$ are the evaluation points, and $M$ is the number of quadrature points. The approximation procedure follows Skrondal and Rabe-Hesketh (2004). The adaptive parameters $\hat\sigma_i$ and $\hat\omega_i$ are updated in the $k$th iteration of the optimization for $m_i$ with
$$
m_{i,k} = \sum_{m=1}^{M} \sqrt{2}\,\hat\sigma_{i,k-1}\, w_m \exp\{(a_m)^2\}\, h_i\big(y_{2i}, z_i, \sqrt{2}\,\hat\sigma_{i,k-1} a_m + \hat\omega_{i,k-1}\big)
$$
$$
\hat\omega_{i,k} = \sum_{m=1}^{M} (t_{i,m,k-1})\, \frac{\sqrt{2}\,\hat\sigma_{i,k-1}\, w_m \exp\{(a_m)^2\}\, h_i(y_{2i}, z_i, t_{i,m,k-1})}{m_{i,k}}
$$
$$
\hat\sigma_{i,k} = \sqrt{\; \sum_{m=1}^{M} (t_{i,m,k-1})^2\, \frac{\sqrt{2}\,\hat\sigma_{i,k-1}\, w_m \exp\{(a_m)^2\}\, h_i(y_{2i}, z_i, t_{i,m,k-1})}{m_{i,k}} - (\hat\omega_{i,k})^2 \;}
$$
where
$$
t_{i,m,k-1} = \sqrt{2}\,\hat\sigma_{i,k-1} a_m + \hat\omega_{i,k-1}
$$
This process is repeated until $\hat\sigma_{i,k}$ and $\hat\omega_{i,k}$ have converged for this iteration at observation $i$ of the maximization algorithm. This adaptation is
applied to every iteration until the log-likelihood difference from the last iteration is less than a relative difference of $10^{-5}$; after this adaptation, the adaptive parameters are fixed. Once the evaluation of the conditional mean has been done for all observations, the numerical values can be passed on to a maximizer in order to find the QMLE $\hat\theta$. The standard errors in the second stage are adjusted for the first-stage estimation and obtained using the delta method (see Appendix A for the derivation). Since the QMLE and NLS estimators in this chapter fall into the class of two-step M-estimators, it is shown that these estimators are consistent and asymptotically normal (see Newey & McFadden, 1994; Wooldridge, 2002, Chapter 12).
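A minimal sketch of how the quadrature in Eq. (8) and the adaptive updates can be coded for a single observation is given below (Python with NumPy/SciPy). The integrand $h_i$ assumes the probit-times-posterior form of Eq. (6) under the gamma heterogeneity of Eqs. (4)–(5); the parameter values, the starting values for $(\hat\omega_i,\hat\sigma_i)$, and the convergence tolerance are illustrative assumptions, not the author's implementation.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss
from scipy.stats import norm
from scipy.special import gammaln

# Illustrative parameter values and data for one observation (assumptions).
alpha1, delta1, eta1 = -1.0, 0.1, 0.5
delta0, z_delta2 = 3.0, 0.4
y2, z1 = 2, 0.3

def h(a1):
    """Integrand of Eq. (6): Phi(alpha1*y2 + z1*delta1 + eta1*a1) * f(a1 | y2, z)."""
    P = -np.exp(z_delta2 + a1) + a1 * (y2 + delta0) - delta0 * np.exp(a1)
    f_post = np.exp(P + (y2 + delta0) * np.log(delta0 + np.exp(z_delta2))
                    - gammaln(y2 + delta0))
    return norm.cdf(alpha1 * y2 + z1 * delta1 + eta1 * a1) * f_post

M = 12
a_m, w_m = hermgauss(M)              # Gauss-Hermite nodes and weights

omega, sigma = 0.0, 1.0              # starting adaptive parameters
for _ in range(50):
    t = np.sqrt(2.0) * sigma * a_m + omega
    g = np.sqrt(2.0) * sigma * w_m * np.exp(a_m ** 2) * h(t)   # terms of Eq. (8)
    m_i = g.sum()
    omega_new = (t * g).sum() / m_i
    sigma_new = np.sqrt((t ** 2 * g).sum() / m_i - omega_new ** 2)
    converged = abs(omega_new - omega) + abs(sigma_new - sigma) < 1e-8
    omega, sigma = omega_new, sigma_new
    if converged:
        break

print(m_i)   # approximation of E(y1 | y2, z) for this observation
```

In the full estimator this update runs inside each optimization iteration, and the adaptive parameters are frozen once the log-likelihood change falls below the stated tolerance.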
2.1. Estimation Procedure

(i) Estimate $\delta_2$ and $\delta_0$ by using ML of $y_{i2}$ on $z_i$ in the Negative Binomial model. Obtain the estimated parameters $\hat\delta_2$ and $\hat\delta_0$.
(ii) Use the fractional probit QMLE (or NLS) of $y_{i1}$ on $y_{i2}, z_{i1}$ to estimate $\alpha_1$, $\delta_1$, and $\eta_1$ with the approximated conditional mean. The conditional mean is approximated using the estimated parameters from the first step and the Adaptive Gauss–Hermite method.

After obtaining all the estimated parameters, $\hat\theta = (\hat\alpha_1, \hat\delta_1, \hat\eta_1)'$, the standard errors for these parameters can be derived using the delta method with the following formula:
$$
\widehat{\mathrm{Avar}}(\hat\theta) = \hat A_1^{-1} \left( N^{-1} \sum_{i=1}^{N} \hat r_{i1} \hat r_{i1}' \right) \hat A_1^{-1} \Big/ N
$$
For more details, see the derivation and matrix notation from Eq. (A.3) to Eq. (A.10) in Appendix A.
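The two-step procedure can be sketched as follows (Python with NumPy/SciPy). This is a schematic implementation under the stated Poisson–gamma and probit assumptions: the Negative Binomial first stage is hand-rolled, the second-stage conditional mean uses a fixed (non-adaptive) Gauss–Hermite rule for brevity, and the first-step correction to the standard errors is omitted. It is not the author's code; all function names and starting values are hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss
from scipy.optimize import minimize
from scipy.special import gammaln
from scipy.stats import norm

A_M, A_W = hermgauss(15)                     # fixed Gauss-Hermite rule

def negbin_negloglik(par, y2, Z):
    """Step (i): Negative Binomial II log likelihood for y2 given z (negated)."""
    d2, d0 = par[:-1], np.exp(par[-1])
    mu = np.exp(Z @ d2)
    ll = (gammaln(y2 + d0) - gammaln(d0) - gammaln(y2 + 1)
          + y2 * np.log(mu) + d0 * np.log(d0) - (y2 + d0) * np.log(mu + d0))
    return -ll.sum()

def cond_mean(theta, y2, Z, Z1, d2, d0):
    """Eq. (6) approximated by quadrature, with f(a1|y2,z) from Eq. (5)."""
    a1c, dl1, eta1 = theta[0], theta[1:-1], theta[-1]
    zd2 = Z @ d2
    m = np.zeros(len(y2))
    for a, w in zip(A_M, A_W):
        t = np.sqrt(2.0) * a
        P = -np.exp(zd2 + t) + t * (y2 + d0) - d0 * np.exp(t)
        fpost = np.exp(P + (y2 + d0) * np.log(d0 + np.exp(zd2)) - gammaln(y2 + d0))
        m += np.sqrt(2.0) * w * np.exp(a ** 2) \
             * norm.cdf(a1c * y2 + Z1 @ dl1 + eta1 * t) * fpost
    return np.clip(m, 1e-10, 1 - 1e-10)

def qmle_obj(theta, y1, y2, Z, Z1, d2, d0):
    """Step (ii): negative Bernoulli quasi-log-likelihood, Eq. (7)."""
    m = cond_mean(theta, y2, Z, Z1, d2, d0)
    return -(y1 * np.log(m) + (1.0 - y1) * np.log(1.0 - m)).sum()

def two_step(y1, y2, Z, Z1):
    r1 = minimize(negbin_negloglik, np.zeros(Z.shape[1] + 1),
                  args=(y2, Z), method="BFGS")
    d2_hat, d0_hat = r1.x[:-1], np.exp(r1.x[-1])
    r2 = minimize(qmle_obj, np.zeros(1 + Z1.shape[1] + 1),
                  args=(y1, y2, Z, Z1, d2_hat, d0_hat), method="BFGS")
    return r2.x, d2_hat, d0_hat
```

In practice one would add intercepts to $Z$ and $Z_1$, replace the fixed quadrature with the adaptive rescaling described above, and adjust the second-stage standard errors for the first step as in Eqs. (A.3)–(A.10). The NLS variant only changes the objective to the squared-error criterion given earlier.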
2.2. Average Partial Effects

Econometricians are often interested in estimating the APEs of explanatory variables in nonlinear models in order to obtain magnitudes comparable with those from other nonlinear models and from linear models. The APEs can be obtained by taking the derivatives or the differences of a conditional mean equation with respect to the explanatory variables. The APE cannot be estimated in the
presence of the unobserved effect. It is necessary to integrate out the unobserved effect in the conditional mean and take the average across the sample. Then we take the derivatives or changes with respect to the elements of $(y_2, z_1)$. In an FRM with all exogenous covariates, model (1) with $y_2$ exogenous (see Papke & Wooldridge, 2008) is considered. Let $w = (y_2, z_1)$; Eq. (1) is rewritten as $E(y_{1i} \mid w_i, a_{1i}) = \Phi(w_i \beta + \eta_1 a_{1i})$, where $a_{1i} \mid w_i \sim \text{Normal}(0, \sigma_a^2)$, and then $E(y_{1i} \mid w_i) = \Phi(w_i \beta_a)$, in which $\beta_a = \beta / \sqrt{1 + \sigma_a^2}$. The APEs are consistently estimated from
$$
N^{-1} \sum_{i=1}^{N} \Phi(w_i \beta_a) = N^{-1} \sum_{i=1}^{N} \Phi(\alpha_{1a} y_{2i} + z_{1i} \delta_{1a})
$$
Given the consistent estimator of the scaled parameters $\beta_a$, the APEs can be estimated by taking the derivatives or changes, with respect to the elements of $(y_2, z_1)$, of
$$
N^{-1} \sum_{i=1}^{N} \Phi(\hat\alpha_{1a} y_{2i} + z_{1i} \hat\delta_{1a})
$$
For a continuous $z_{11}$, the APE is
$$
N^{-1} \sum_{i=1}^{N} \hat\delta_{11a}\, \phi(\hat\alpha_{1a} y_{2i} + z_{1i} \hat\delta_{1a})
$$
For a count variable $y_2$, the APE is
$$
N^{-1} \sum_{i=1}^{N} \big[ \Phi(\hat\alpha_{1a} y_{21i} + z_{1i} \hat\delta_{1a}) - \Phi(\hat\alpha_{1a} y_{20i} + z_{1i} \hat\delta_{1a}) \big]
$$
For example, if we are interested in obtaining the APE when $y_2$ changes from $y_{20i} = 0$ to $y_{21i} = 1$, it is necessary to predict the difference in mean responses with $y_2 = 1$ and $y_2 = 0$ and average the difference across all units. In an FRM with a count endogenous variable, model (1) is considered with the estimation procedure provided in the previous section. The APEs are obtained by taking the derivatives or the differences in
$$
E_{a_1}\big[\Phi(\alpha_1 y_2 + z_1 \delta_1 + \eta_1 a_1)\big] \qquad (9)
$$
with respect to the elements of $(y_2, z_1)$. Since we integrate out $a_1$ and get the conditional mean as in Eq. (6), the estimator for the conditional mean is
$$
N^{-1} \sum_{i=1}^{N} m_i(\hat\theta; y_{2i}, z_{1i}) = N^{-1} \sum_{i=1}^{N} \int_{-\infty}^{+\infty} \Phi(\hat\alpha_1 y_{2i} + z_{1i}\hat\delta_1 + \hat\eta_1 a_1)\, f(a_1 \mid y_{2i}, z_{1i})\, da_1 \qquad (10)
$$
For a continuous $z_{11}$, the APE is
$$
\psi = \delta_{11}\, E\left[ \int_{-\infty}^{+\infty} \phi(g_i \theta)\, f(a_1 \mid y_{2i}, z_i)\, da_1 \right] \qquad (11a)
$$
and it is consistently estimated by
$$
\hat\psi = N^{-1} \sum_{i=1}^{N} \hat\delta_{11} \int_{-\infty}^{+\infty} \phi(g_i \hat\theta)\, \hat f(a_1 \mid y_{2i}, z_i)\, da_1 \qquad (11b)
$$
where $g_i = (y_{2i}, z_{1i}, a_1)$ and $\theta = (\alpha_1, \delta_1, \eta_1)'$. For a count variable $y_2$, its APE is
$$
\kappa = E_{a_1}\big[ \Phi(\alpha_1 y_2^{k+1} + z_1\delta_1 + \eta_1 a_1) - \Phi(\alpha_1 y_2^{k} + z_1\delta_1 + \eta_1 a_1) \big] \qquad (12a)
$$
and it is consistently estimated by
$$
\hat\kappa = N^{-1} \sum_{i=1}^{N} \left[ \int_{-\infty}^{+\infty} \Phi(g_i^{k+1}\hat\theta)\, \hat f(a_1 \mid y_{2i}, z_i)\, da_1 - \int_{-\infty}^{+\infty} \Phi(g_i^{k}\hat\theta)\, \hat f(a_1 \mid y_{2i}, z_i)\, da_1 \right] \qquad (12b)
$$
For example, in order to get the APE when $y_2$ changes from $y_2^k = 0$ to $y_2^{k+1} = 1$, it is necessary to predict the difference in the mean responses with $y_2^k = 0$ and $y_2^{k+1} = 1$ and take the average of the differences across all units. This APE gives us a number comparable to the linear model estimate. The standard errors for the APEs will be obtained using the delta method. The detailed derivation is provided from Eq. (A.11) to (A.29) in Appendix A.
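A sketch of how $\hat\kappa$ in Eq. (12b) can be computed once the two-step estimates are in hand is given below (Python with NumPy/SciPy). It uses a fixed Gauss–Hermite rule for the integral over $a_1$ and illustrative parameter values and simulated inputs, so it is a schematic illustration under the chapter's distributional assumptions rather than the author's implementation.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss
from scipy.special import gammaln
from scipy.stats import norm

A_M, A_W = hermgauss(15)

def mean_response(y2_cf, y2_obs, z_d2, z1_d1, alpha1, eta1, d0):
    """Integral over a1 of Phi(alpha1*y2_cf + z1*delta1 + eta1*a1) * f(a1 | y2_obs, z),
    approximated with a (non-adaptive) Gauss-Hermite rule."""
    out = np.zeros(len(z_d2))
    for a, w in zip(A_M, A_W):
        t = np.sqrt(2.0) * a
        P = -np.exp(z_d2 + t) + t * (y2_obs + d0) - d0 * np.exp(t)
        fpost = np.exp(P + (y2_obs + d0) * np.log(d0 + np.exp(z_d2)) - gammaln(y2_obs + d0))
        out += np.sqrt(2.0) * w * np.exp(a ** 2) \
               * norm.cdf(alpha1 * y2_cf + z1_d1 + eta1 * t) * fpost
    return out

def ape_count(k, y2_obs, z_d2, z1_d1, alpha1, eta1, d0):
    """Eq. (12b): average change in the mean response as y2 moves from k to k+1,
    holding each observation's covariates and conditional density fixed."""
    hi = mean_response(k + 1, y2_obs, z_d2, z1_d1, alpha1, eta1, d0)
    lo = mean_response(k,     y2_obs, z_d2, z1_d1, alpha1, eta1, d0)
    return np.mean(hi - lo)

# Illustrative call with simulated inputs (values are assumptions, not estimates):
rng = np.random.default_rng(0)
N = 500
z, z1 = rng.normal(0, 0.3, N), rng.normal(0, 0.2, N)
y2_obs = rng.poisson(np.exp(1.5 * z))
print(ape_count(0, y2_obs, 1.5 * z, 0.1 * z1, -1.0, 0.5, 3.0))
```

Standard errors for $\hat\kappa$ would then follow from the delta-method expressions referenced above.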
3. MONTE CARLO SIMULATIONS

This section examines the finite-sample properties of the QML and NLS estimators of the population-averaged partial effect in an FRM with a count endogenous variable. Some Monte Carlo experiments are conducted to
compare these estimators with other estimators under different scenarios. These estimators are evaluated under correct model specification with different degrees of endogeneity, with strong and weak instrumental variables, and with different sample sizes. The behavior of these estimators is also examined with respect to a choice of a particular distributional assumption.
3.1. Estimators

Two sets of estimators under two corresponding assumptions are considered: (1) $y_2$ is assumed to be exogenous and (2) $y_2$ is assumed to be endogenous. Under the former assumption, three estimators are used: the ordinary least squares (OLS) estimator in a linear model, the maximum likelihood estimator (MLE) in a Tobit model, and the quasi-maximum likelihood estimator (QMLE) in a fractional probit model. Under the latter assumption, five estimators are examined: the two-stage least squares (2SLS) estimator, the maximum likelihood estimator (MLE) in a Tobit model using the Blundell–Smith estimation method (hereafter the Tobit BS), the QMLE in a fractional probit model using the Papke–Wooldridge estimation method (hereafter the QMLE-PW), and the QMLE and NLS estimators in a fractional probit model using the estimation method proposed in the previous section.
3.2. Data-Generating Process

The count endogenous variable is generated from a conditional Poisson distribution:
$$
f(y_{2i} \mid x_{1i}, x_{2i}, z_i, a_{1i}) = \frac{\exp(-\lambda_i)\, \lambda_i^{y_{2i}}}{y_{2i}!}
$$
with a conditional mean:
$$
\lambda_i = E(y_{2i} \mid x_{1i}, x_{2i}, z_i, a_{1i}) = \exp(\delta_{21} x_{1i} + \delta_{22} x_{2i} + \delta_{23} z_i + \rho_1 a_{1i})
$$
using independent draws from normal distributions, $z \sim N(0, 0.3^2)$, $x_1 \sim N(0, 0.2^2)$, $x_2 \sim N(0, 0.2^2)$, and $\exp(a_1) \sim \text{Gamma}(1, 1/\delta_0)$, where 1 and $1/\delta_0$ are the mean and variance of the gamma distribution. Parameters in the conditional mean model are set to $(\delta_{21}, \delta_{22}, \delta_{23}, \rho_1, \delta_0) = (0.01, 0.01, 1.5, 1, 3)$.
The dependent variable is generated by first drawing a binomial random variable x with n trials and success probability p and then setting y1 = x/n. In this simulation, n = 100 and p is given by the probit conditional mean

$$p = E(y_{1i}\mid y_{2i}, x_{1i}, x_{2i}, a_{1i}) = \Phi(\delta_{11}x_{1i} + \delta_{12}x_{2i} + \alpha_1 y_{2i} + \eta_1 a_{1i})$$

with parameters set at (δ11, δ12, α1, η1) = (0.1, 0.1, 1.0, 0.5). Based on these population parameter values, the true value of the APE with respect to each variable is obtained. Since the exogenous variables are continuous, and in order to compare nonlinear models with linear models, y2 is first treated as a continuous variable. The true value of the APE with respect to y2 is then obtained by differentiating the conditional mean with respect to y2 and averaging:

$$APE = 1.0\times\frac{1}{N}\sum_{i=1}^{N}\phi(0.1\,x_{1i} + 0.1\,x_{2i} + 1.0\,y_{2i} + 0.5\,a_{1i})$$
When y2 is instead treated as a count variable, the true values of the APEs with respect to y2 are computed by taking differences in the conditional mean at values of interest. I take the first three cases, in which y2 increases from 0 to 1, 1 to 2, and 2 to 3, respectively; the true values of the APEs are

$$APE_{01} = \frac{1}{N}\sum_{i=1}^{N}\left[\Phi(0.1\,x_{1i}+0.1\,x_{2i}+1.0\cdot 1+0.5\,a_{1i}) - \Phi(0.1\,x_{1i}+0.1\,x_{2i}+1.0\cdot 0+0.5\,a_{1i})\right]$$
$$APE_{12} = \frac{1}{N}\sum_{i=1}^{N}\left[\Phi(0.1\,x_{1i}+0.1\,x_{2i}+1.0\cdot 2+0.5\,a_{1i}) - \Phi(0.1\,x_{1i}+0.1\,x_{2i}+1.0\cdot 1+0.5\,a_{1i})\right]$$
$$APE_{23} = \frac{1}{N}\sum_{i=1}^{N}\left[\Phi(0.1\,x_{1i}+0.1\,x_{2i}+1.0\cdot 3+0.5\,a_{1i}) - \Phi(0.1\,x_{1i}+0.1\,x_{2i}+1.0\cdot 2+0.5\,a_{1i})\right]$$

The reported true values of the APEs with respect to y2 and the other exogenous variables are presented in Tables 1–4. The experiment is conducted with 500 replications, and the sample size is normally set at 1,000 observations.
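Continuing the sketch above (it reuses `rng`, `x1`, `x2`, `a1`, and `y2` from the previous block), the fractional response and the true APEs can be generated and computed as follows.

```python
from scipy.stats import norm

d11, d12, a1_coef, eta1 = 0.1, 0.1, 1.0, 0.5
n = 100

p  = norm.cdf(d11 * x1 + d12 * x2 + a1_coef * y2 + eta1 * a1)
y1 = rng.binomial(n, p) / n                      # fractional response in [0, 1]

# True APEs when the count regressor moves from k to k+1, evaluated at the true parameters
def true_ape(k):
    idx = d11 * x1 + d12 * x2 + eta1 * a1
    return np.mean(norm.cdf(idx + a1_coef * (k + 1)) - norm.cdf(idx + a1_coef * k))

ape_01, ape_12, ape_23 = true_ape(0), true_ape(1), true_ape(2)
```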
3.3. Experiment Results

I report sample means, sample standard deviations (SD), and root mean squared errors (RMSE) of the 500 estimates. To compare estimators across linear and nonlinear models, I focus on the APE estimates from the different models.

3.3.1. Simulation Results with a Strong Instrumental Variable
Tables 1A–1C report the simulation outcomes of the APE estimates for the sample size N = 1,000 with a strong instrumental variable (IV) and different degrees of endogeneity: η1 = 0.1, η1 = 0.5, and η1 = 0.9. The IV is strong in the sense that the first-stage F-statistic is large: the coefficient on z is δ23 = 1.5 in the first stage, and the mean of the F-statistic is at least 100 across the 500 replications for all three designs of η1. The three values of η1 correspond to low, medium, and high degrees of endogeneity. Each table reports the true values of the APEs and the means, SDs, and RMSEs of the APE estimates from the different models and estimation methods over 500 replications, first with y2 assumed exogenous and then with y2 allowed to be endogenous.

The simulation outcomes for N = 1,000 and η1 = 0.5 (Table 1A.1) are discussed first. The APE estimates from the proposed QMLE and NLS methods are closest to the true values of the APEs when y2 is discrete (.3200, .1273, and .0212), and they are also very close to the true value (.2347) when y2 is treated as a continuous variable. These first three discrete APEs are reported as representative examples of the pattern of the means, SDs, and RMSEs of the APE estimates.

Table 1A.1 shows that the OLS estimate is about half the true value of the APE. The first source of the large bias in the OLS estimate is the endogeneity of the count variable y2 (with η1 = 0.5); the second is the nonlinearity of both the structural and first-stage equations, Eqs. (1) and (2). The 2SLS approach also produces a biased estimator of the APE, for this second reason, even though it accounts for endogeneity. The MLE estimators in the Tobit model have smaller bias than the estimators in the linear model but larger bias than those in the fractional probit model, because they do not respect the functional form of the fractional response variable or of the count explanatory variable.
Table 1A.1.  Simulation Result of the Average Partial Effects Estimates (N = 1,000, η1 = 0.5, 500 Replications).

Entries are the mean (SD) {RMSE} of the APE estimates over 500 replications, listed in the order: y2 continuous | y2 discrete 0–1 | 1–2 | 2–3 | x1 | x2. A dash indicates that the model does not report that APE.

True values of the APEs:       .2347 | .3200 | .1273 | .0212 | .0235 | .0235

y2 is assumed exogenous
  OLS (linear):                .1283 (.0046) {.0034} | – | – | – | .0224 (.0181) | .0218 (.0181)
  MLE (Tobit):                 .1591 (.0042) {.0024} | .2014 (.0046) {.0038} | .1610 (.0027) {.0011} | .0388 (.0030) {.0006} | .0210 (.0125) | .0214 (.0128)
  QMLE (fractional probit):    .2079 (.0051) {.0008} | .2763 (.0051) {.0014} | .1193 (.0017) {.0002} | .0259 (.0012) {.0001} | .0223 (.0130) | .0195 (.0129)

y2 is assumed endogenous
  2SLS (linear):               .1583 (.0110) {.0024} | – | – | – | .0237 (.0189) | .0230 (.0192)
  MLE (Tobit BS):              .1754 (.0064) {.0019} | .2262 (.0082) {.0030} | .1716 (.0031) {.0014} | .0317 (.0031) {.0003} | .0212 (.0131) | .0218 (.0131)
  QMLE-PW (fractional probit): .2295 (.0077) {.0002} | .3109 (.0099) {.0003} | .1258 (.0023) {.00005} | .0224 (.0014) {.00004} | .0231 (.0142) | .0241 (.0134)
  NLS (fractional probit):     .2368 (.0051) {.00008} | .3201 (.0041) {.00005} | .1280 (.0020) {.00004} | .0214 (.0014) {.00001} | .0240 (.0159) | .0243 (.0153)
  QMLE (fractional probit):    .2371 (.0050) {.00008} | .3204 (.0030) {.00001} | .1278 (.0016) {.00001} | .0212 (.0010) {.000001} | .0238 (.0140) | .0244 (.0136)

Note: Figures in parentheses () and braces {} are standard deviations and RMSEs, respectively.
Table 1A.2.  Simulation Result of the Coefficient Estimates (N = 1,000, η1 = 0.5, 500 Replications).

Entries are the mean (SD) of the coefficient estimates over 500 replications, listed in the order: y2 | x1 | x2.

True values of the coefficients:  1.0 | .1 | .1

y2 is assumed exogenous
  OLS (linear):                .1283 (.0044) | .0224 (.0181) | .0218 (.0181)
  MLE (Tobit):                 .2024 (.0046) | .0267 (.0160) | .0272 (.0163)
  QMLE (fractional probit):    .8543 (.0146) | .0917 (.0534) | .0956 (.0534)

y2 is assumed endogenous
  2SLS (linear):               .1583 (.0089) | .0237 (.0190) | .0231 (.0192)
  MLE (Tobit BS):              .2275 (.0084) | .0275 (.0171) | .0282 (.0170)
  QMLE-PW (fractional probit): .9387 (.0255) | .0945 (.0578) | .0987 (.0548)
  NLS (fractional probit):     1.045 (.0483) | .1061 (.0702) | .1071 (.0681)
  QMLE (fractional probit):    1.044 (.0424) | .1052 (.0619) | .1073 (.0600)

Note: Figures in parentheses are standard deviations.
Table 1B.  Simulation Result of the Average Partial Effects Estimates (N = 1,000, η1 = 0.1, 500 Replications).

Entries are the mean (SD) {RMSE} of the APE estimates over 500 replications, listed in the order: y2 continuous | y2 discrete 0–1 | 1–2 | 2–3 | x1 | x2. A dash indicates that the model does not report that APE.

True values of the APEs:       .2461 | .3383 | .1332 | .0208 | .0246 | .0246

y2 is assumed exogenous
  OLS (linear):                .1507 (.0042) {.0031} | – | – | – | .0267 (.0168) | .0241 (.0178)
  MLE (Tobit):                 .1854 (.0037) {.0019} | .2390 (.0042) {.0032} | .2001 (.0018) {.0021} | .0193 (.0029) {.00005} | .0210 (.0089) | .0214 (.0100)
  QMLE (fractional probit):    .2402 (.0046) {.0002} | .3289 (.0031) {.0003} | .1319 (.0011) {.00004} | .0219 (.0007) {.00003} | .0250 (.0063) | .0246 (.0070)

y2 is assumed endogenous
  2SLS (linear):               .1600 (.0102) {.0027} | – | – | – | .0265 (.0170) | .0242 (.0182)
  MLE (Tobit BS):              .1887 (.0053) {.0018} | .2445 (.0066) {.0030} | .2022 (.0025) {.0022} | .0177 (.0032) {.0001} | .0234 (.0090) | .0222 (.0100)
  QMLE-PW (fractional probit): .2442 (.0056) {.0001} | .3355 (.0051) {.00009} | .1331 (.0013) {.000004} | .0212 (.0008) {.00001} | .0250 (.0065) | .0246 (.0070)
  NLS (fractional probit):     .2490 (.0044) {.0001} | .3384 (.0019) {.000005} | .1332 (.0008) {.000001} | .0208 (.0007) {.000001} | .0255 (.0072) | .0252 (.0077)
  QMLE (fractional probit):    .2491 (.0043) {.0001} | .3385 (.0020) {.000007} | .1332 (.0010) {.000001} | .0208 (.0007) {.000001} | .0253 (.0066) | .0249 (.0072)

Note: Figures in parentheses () and braces {} are standard deviations and RMSEs, respectively.
Table 1C.  Simulation Result of the Average Partial Effects Estimates (N = 1,000, η1 = 0.9, 500 Replications).

Entries are the mean (SD) {RMSE} of the APE estimates over 500 replications, listed in the order: y2 continuous | y2 discrete 0–1 | 1–2 | 2–3 | x1 | x2. A dash indicates that the model does not report that APE.

True values of the APEs:       .2178 | .2973 | .1281 | .0253 | .0218 | .0218

y2 is assumed exogenous
  OLS (linear):                .1104 (.0042) {.0034} | – | – | – | .0327 (.0222) | .0215 (.0201)
  MLE (Tobit):                 .1368 (.0039) {.0026} | .1706 (.0049) {.0040} | .1303 (.0031) {.00007} | .0532 (.0022) {.0009} | .0276 (.0176) | .0212 (.0170)
  QMLE (fractional probit):    .1777 (.0054) {.0013} | .2307 (.0069) {.0021} | .1110 (.0024) {.0005} | .0319 (.0019) {.0002} | .0291 (.0182) | .0236 (.0179)

y2 is assumed endogenous
  2SLS (linear):               .1548 (.0148) {.0020} | – | – | – | .0263 (.0169) | .0244 (.0184)
  MLE (Tobit BS):              .1637 (.0096) {.0017} | .2100 (.0136) {.0028} | .1491 (.0060) {.0007} | .0452 (.0030) {.0006} | .0273 (.0184) | .0199 (.0187)
  QMLE-PW (fractional probit): .2144 (.0124) {.0001} | .2871 (.0196) {.0003} | .1232 (.0047) {.0002} | .0258 (.0023) {.00002} | .0305 (.0201) | .0233 (.0206)
  NLS (fractional probit):     .2208 (.0052) {.0001} | .3000 (.0054) {.00008} | .1288 (.0025) {.00002} | .0247 (.0021) {.00002} | .0318 (.0237) | .0200 (.0216)
  QMLE (fractional probit):    .2205 (.0045) {.0001} | .2994 (.0040) {.00006} | .1291 (.0020) {.00003} | .0249 (.0015) {.00001} | .0313 (.0208) | .0203 (.0187)

Note: Figures in parentheses () and braces {} are standard deviations and RMSEs, respectively.
When the endogeneity is corrected, the MLE estimator in the Tobit model using the Blundell–Smith method has smaller bias than its counterpart that assumes y2 exogenous. Among the fractional probit models, the QMLE estimator that assumes y2 exogenous has the largest bias because it ignores the endogeneity of y2; even so, it is less biased than the estimators from the linear and Tobit models. The QMLE-PW estimator provides useful results, and its estimates are also very close to the true values of the APEs, but it is more biased than the QMLE and NLS estimators proposed in this chapter. Like the Tobit MLE with the Blundell–Smith method, the QMLE-PW estimator adopts the control function approach, which relies on linearity of the first-stage equation. As a result, it ignores the discreteness of y2, which leads to its larger bias relative to the QMLE and NLS estimators proposed here.

The first set of estimators, which assumes y2 exogenous, has somewhat smaller SDs than the second set, which allows y2 to be endogenous, because methods that correct for endogeneity using IVs carry more sampling variation than their counterparts without endogeneity correction; this follows from the less-than-unit correlation between the instrument and the endogenous variable. However, the SDs of the proposed QMLE and NLS estimators are no worse than that of the QMLE estimator that assumes y2 exogenous. Among all estimators, the QMLE and NLS estimators proposed in this chapter have the smallest RMSEs, not only when y2 is treated as a discrete variable but also when y2 is treated as a continuous variable under the correct model specification. As discussed above, the QMLE estimator using the Papke–Wooldridge method has the third smallest RMSE, since it also uses the fractional probit model. Comparing each estimator that corrects for endogeneity with its counterpart that does not, the RMSEs of the methods correcting for endogeneity are smaller.

Table 1A.2 reports the simulation results for the coefficient estimates. Coefficient estimates are useful in that they give the directions of the effects; for studies that only need the signs of the effects, the coefficient tables suffice, whereas studies that compare the magnitudes of the effects ultimately require the APEs. Table 1A.2 shows that the means of the point estimates are close to their true values for all parameters using the QML (or NLS) approach (α1 = 1.0, δ11 = .1, and δ12 = .1). The bias is large for both the 2SLS and OLS methods. These results are as expected: the 2SLS method uses the predicted value from the first-stage OLS, so it ignores the distributional information of the right-hand-side (RHS) count variable, regardless of the functional form of the fractional response variable, and the OLS estimates do not account for endogeneity at all. Both the 2SLS and OLS estimates are biased because they do not take into account the presence of unobserved heterogeneity. The bias of the Tobit Blundell–Smith model is similar to the bias of the 2SLS method because it also ignores the distributional information of the RHS count variable, and it employs a functional form that is ill-suited to a fractional response variable with a small number of zeros. The biases of the QMLE estimator treating y2 as exogenous and of the QMLE-PW estimator are larger than those of the QMLE and NLS estimators in this chapter. In short, the simulation results indicate that the means of the point estimates are close to their true values for all parameters using the QMLE and NLS approaches of the previous section.

Simulations with different degrees of endogeneity, η1 = 0.1 and η1 = 0.9, are also conducted. Not surprisingly, with less endogeneity (η1 = 0.1) the estimators treating y2 as exogenous produce APE estimates closer to the true values, while the estimators treating y2 as endogenous are further from the true values. With more endogeneity (η1 = 0.9), the estimators treating y2 as endogenous move closer to the true values and those treating y2 as exogenous move further away. For example, as η1 increases, the APE estimates from the 2SLS method become less biased, whereas the APE estimates from the QMLE estimator treating y2 as exogenous become more biased, and the gap between these two sets of estimates narrows once the endogeneity is corrected. All of the earlier conclusions about bias, SD, and RMSE continue to hold with η1 = 0.1 and η1 = 0.9, confirming that the QMLE and NLS estimators perform very well under different degrees of endogeneity.

3.3.2. Simulation Results with a Weak Instrumental Variable
Table 2 reports the simulation outcomes of the APE estimates for the sample size N = 1,000 with a weak IV and η1 = 0.5. Following the rule of thumb on weak instruments suggested in Staiger and Stock (1997), the coefficient on z is set to δ23 = 0.3, which corresponds to a very small first-stage F-statistic (the mean of the F-statistic is less than 10 across the 500 replications).
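A quick way to verify the strong/weak-IV designs on a simulated sample is to compute the first-stage F-statistic on the excluded instrument. The sketch below uses statsmodels on the data generated in the earlier sketches; it is an illustrative check, not part of the chapter's own code.

```python
import pandas as pd
import statsmodels.formula.api as smf

df_sim = pd.DataFrame({"y2": y2, "x1": x1, "x2": x2, "z": z})
first_stage = smf.ols("y2 ~ x1 + x2 + z", data=df_sim).fit()
print(first_stage.f_test("z = 0"))   # rule of thumb: F well above 10 indicates a strong instrument
```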
Table 2.  Simulation Result of the Average Partial Effects Estimates (N = 1,000, η1 = 0.5, Weak IV, 500 Replications).

Entries are the mean (SD) {RMSE} of the APE estimates over 500 replications, listed in the order: y2 continuous | y2 discrete 0–1 | 1–2 | 2–3 | x1 | x2. A dash indicates that the model does not report that APE.

True values of the APEs:       .2402 | .3202 | .1275 | .0213 | .0240 | .0240

y2 is assumed exogenous
  OLS (linear):                .1352 (.0047) {.0033} | – | – | – | .0224 (.0181) | .0218 (.0181)
  MLE (Tobit):                 .1618 (.0037) {.0025} | .1992 (.0044) {.0038} | .1605 (.0026) {.0010} | .0386 (.0027) {.0005} | .0114 (.0118) | .0104 (.0123)
  QMLE (fractional probit):    .2094 (.0050) {.0010} | .2724 (.0055) {.0015} | .1195 (.0016) {.0003} | .0268 (.0011) {.0002} | .0117 (.0130) | .0109 (.0129)

y2 is assumed endogenous
  2SLS (linear):               .1661 (.0621) {.0023} | – | – | – | .0237 (.0189) | .0230 (.0192)
  MLE (Tobit BS):              .1823 (.0363) {.0023} | .2301 (.0522) {.0018} | .1676 (.0189) {.0029} | .0332 (.0108) {.0013} | .0235 (.0225) | .0227 (.0244)
  QMLE-PW (fractional probit): .2327 (.0366) {.0002} | .3157 (.0099) {.0001} | .1248 (.0023) {.00009} | .0227 (.0014) {.00005} | .0253 (.0142) | .0227 (.0134)
  NLS (fractional probit):     .2405 (.0046) {.00001} | .3199 (.0037) {.00001} | .1280 (.0016) {.00002} | .0215 (.0012) {.000001} | .0261 (.0146) | .0244 (.0135)
  QMLE (fractional probit):    .2407 (.0043) {.00001} | .3202 (.0029) {.000001} | .1279 (.0015) {.00001} | .0214 (.0010) {.000003} | .0252 (.0127) | .0250 (.0119)

Note: Figures in parentheses () and braces {} are standard deviations and RMSEs, respectively.
Table 2 reports the true values of the APEs and the means, SDs, and RMSEs of the APE estimates from the different models and estimation methods over 500 replications, first with y2 assumed exogenous and then with y2 allowed to be endogenous. The simulation results show that, even though the instrument is weak, the set of estimators treating y2 as endogenous still has smaller bias than the set treating y2 as exogenous. The QMLE and NLS APE estimates remain very close to the true values of the APEs, both when y2 is treated as a continuous variable and when it is allowed to be a count variable, and their SDs and RMSEs are still the lowest among the estimators treating y2 as endogenous.

3.3.3. Simulation Results with Different Sample Sizes
Four sample sizes are chosen to represent those commonly encountered in applied research, ranging from small to large: N = 100, 500, 1,000, and 2,000. Tables 3A–3D report the simulation outcomes of the APE estimates with a strong IV and η1 = 0.5 for these sample sizes; Table 3C corresponds to Table 1A.1. Each table again reports the true values of the APEs and the means, SDs, and RMSEs of the APE estimates, first with y2 assumed exogenous and then with y2 allowed to be endogenous. In general, the simulation results indicate that the SDs and RMSEs of all estimators shrink as the sample size grows; the discussion in Section 3.3.1 otherwise carries over. The QMLE and NLS estimators perform very well at all sample sizes, with the smallest SDs and RMSEs, and they are also the least biased of all the estimators considered.

3.3.4. Simulation Results with a Particular Distributional Assumption
The original assumption is that exp(a1) ~ Gamma(1, 1/δ0). Here, misspecification of this distribution is considered: the distribution of exp(a1) is no longer gamma; instead, a1 ~ N(0, 0.1²) is assumed, and the finite-sample behavior of all the estimators under this incorrect specification is examined. Table 4 shows the simulation results for the sample size N = 1,000 with a strong IV and η1 = 0.5 under misspecification. None of the conclusions reached under the correct specification in Section 3.3.1 are affected.
Table 3A.  Simulation Result of the Average Partial Effects Estimates (N = 100, η1 = 0.5, 500 Replications).

Entries are the mean (SD) {RMSE} of the APE estimates over 500 replications, listed in the order: y2 continuous | y2 discrete 0–1 | 1–2 | 2–3. A dash indicates that the model does not report that APE.

True values of the APEs:       .2350 | .3281 | .1308 | .0214

y2 is assumed exogenous
  OLS (linear):                .1419 (.0221) {.094} | – | – | –
  MLE (Tobit):                 .1695 (.0193) {.0066} | .2173 (.0253) {.0111} | .1767 (.0222) {.0046} | .0316 (.0129) {.0010}
  QMLE (fractional probit):    .2180 (.0216) {.0017} | .3000 (.0339) {.0028} | .1252 (.0096) {.0006} | .0243 (.0034) {.0003}

y2 is assumed endogenous
  2SLS (linear):               .1688 (.0667) {.0066} | – | – | –
  MLE (Tobit BS):              .1837 (.0356) {.0051} | .2386 (.0492) {.0090} | .1840 (.0257) {.0053} | .0286 (.0134) {.0007}
  QMLE-PW (fractional probit): .2340 (.0253) {.0001} | .3277 (.0457) {.00004} | .1305 (.0139) {.00004} | .0217 (.0041) {.00003}
  NLS (fractional probit):     .2371 (.0166) {.0002} | .3288 (.0137) {.00005} | .1300 (.0073) {.00009} | .0211 (.0036) {.00003}
  QMLE (fractional probit):    .2366 (.0162) {.00017} | .3281 (.0122) {.00004} | .1306 (.0063) {.00004} | .0213 (.0027) {.00001}

Note: Figures in parentheses () and braces {} are standard deviations and RMSEs, respectively.
Table 3B.  Simulation Result of the Average Partial Effects Estimates (N = 500, η1 = 0.5, 500 Replications).

Entries are the mean (SD) {RMSE} of the APE estimates over 500 replications, listed in the order: y2 continuous | y2 discrete 0–1 | 1–2 | 2–3. A dash indicates that the model does not report that APE.

True values of the APEs:       .2358 | .3285 | .1309 | .0214

y2 is assumed exogenous
  OLS (linear):                .1415 (.0177) {.0046} | – | – | –
  MLE (Tobit):                 .1710 (.0157) {.0034} | .2190 (.0219) {.0059} | .1782 (.0205) {.0028} | .0309 (.0109) {.0004}
  QMLE (fractional probit):    .2201 (.0163) {.0007} | .3026 (.0311) {.0012} | .1259 (.0082) {.0002} | .0240 (.0025) {.0001}

y2 is assumed endogenous
  2SLS (linear):               .1617 (.0175) {.0041} | – | – | –
  MLE (Tobit BS):              .1815 (.0114) {.0029} | .2351 (.0153) {.0052} | .1847 (.0158) {.0031} | .0267 (.0084) {.0002}
  QMLE-PW (fractional probit): .2334 (.0119) {.0001} | .3241 (.0192) {.0002} | .1300 (.0058) {.00004} | .0219 (.0018) {.00002}
  NLS (fractional probit):     .2379 (.0051) {.0001} | .3293 (.0109) {.00001} | .1310 (.0044) {.00001} | .0212 (.0017) {.00001}
  QMLE (fractional probit):    .2376 (.0086) {.0001} | .3290 (.0106) {.00002} | .1311 (.0043) {.000004} | .0213 (.0014) {.000004}

Note: Figures in parentheses () and braces {} are standard deviations and RMSEs, respectively.
Table 3C.  Simulation Result of the Average Partial Effects Estimates (N = 1,000, η1 = 0.5, 500 Replications).

Entries are the mean (SD) {RMSE} of the APE estimates over 500 replications, listed in the order: y2 continuous | y2 discrete 0–1 | 1–2 | 2–3. A dash indicates that the model does not report that APE.

True values of the APEs:       .2347 | .3200 | .1273 | .0212

y2 is assumed exogenous
  OLS (linear):                .1283 (.0046) {.0034} | – | – | –
  MLE (Tobit):                 .1591 (.0042) {.0024} | .2014 (.0046) {.0038} | .1610 (.0027) {.0011} | .0388 (.0030) {.0006}
  QMLE (fractional probit):    .2079 (.0051) {.0008} | .2763 (.0051) {.0014} | .1193 (.0017) {.0002} | .0259 (.0012) {.0001}

y2 is assumed endogenous
  2SLS (linear):               .1583 (.0110) {.0024} | – | – | –
  MLE (Tobit BS):              .1754 (.0064) {.0019} | .2262 (.0082) {.0030} | .1716 (.0031) {.0014} | .0317 (.0031) {.0003}
  QMLE-PW (fractional probit): .2295 (.0077) {.0002} | .3109 (.0099) {.0003} | .1258 (.0023) {.00005} | .0224 (.0014) {.00004}
  NLS (fractional probit):     .2368 (.0051) {.00008} | .3201 (.0041) {.00016} | .1280 (.0020) {.00004} | .0214 (.0014) {.00001}
  QMLE (fractional probit):    .2371 (.0050) {.00008} | .3204 (.0030) {.00001} | .1278 (.0016) {.00001} | .0212 (.0010) {.000001}

Note: Figures in parentheses () and braces {} are standard deviations and RMSEs, respectively.
Table 3D.  Simulation Result of the Average Partial Effects Estimates (N = 2,000, η1 = 0.5, 500 Replications).

Entries are the mean (SD) {RMSE} of the APE estimates over 500 replications, listed in the order: y2 continuous | y2 discrete 0–1 | 1–2 | 2–3. A dash indicates that the model does not report that APE.

True values of the APEs:       .2347 | .3201 | .1275 | .0213

y2 is assumed exogenous
  OLS (linear):                .1286 (.0028) {.0024} | – | – | –
  MLE (Tobit):                 .1591 (.0028) {.0017} | .2014 (.0034) {.0027} | .1609 (.0020) {.0008} | .0390 (.0020) {.0004}
  QMLE (fractional probit):    .2080 (.0031) {.0006} | .2766 (.0036) {.0010} | .1194 (.0012) {.0002} | .0259 (.0008) {.0001}

y2 is assumed endogenous
  2SLS (linear):               .1591 (.0082) {.0017} | – | – | –
  MLE (Tobit BS):              .1755 (.0044) {.0013} | .2263 (.0059) {.0021} | .1717 (.0024) {.0010} | .0317 (.0021) {.0002}
  QMLE-PW (fractional probit): .2293 (.0050) {.0001} | .3106 (.0074) {.0002} | .1258 (.0017) {.00004} | .0224 (.0010) {.00003}
  NLS (fractional probit):     .2369 (.0031) {.00005} | .3201 (.0029) {.000001} | .1281 (.0015) {.00001} | .0214 (.0010) {.000002}
  QMLE (fractional probit):    .2371 (.0030) {.00006} | .3204 (.0021) {.000007} | .1278 (.0011) {.000009} | .0212 (.0007) {.000001}

Note: Figures in parentheses () and braces {} are standard deviations and RMSEs, respectively.
Table 4.  Simulation Result of the Average Partial Effects Estimates (N = 1,000, η1 = 0.5, a1 Is Normally Distributed, 500 Replications).

Entries are the mean (SD) {RMSE} of the APE estimates over 500 replications, listed in the order: y2 continuous | y2 discrete 0–1 | 1–2 | 2–3 | x1 | x2. A dash indicates that the model does not report that APE.

True values of the APEs:       .2379 | .3409 | .1361 | .0215 | .0238 | .0238

y2 is assumed exogenous
  OLS (linear):                .1599 (.0053) {.0025} | – | – | – | .0265 (.0165) | .0234 (.0179)
  MLE (Tobit):                 .1876 (.0041) {.0015} | .2431 (.0040) {.0031} | .2032 (.0019) {.0021} | .0195 (.0027) {.00006} | .0223 (.0084) | .0217 (.0103)
  QMLE (fractional probit):    .2369 (.0050) {.00003} | .3393 (.0028) {.00005} | .1358 (.0010) {.000007} | .0217 (.0006) {.000005} | .0240 (.0059) | .0240 (.0064)

y2 is assumed endogenous
  2SLS (linear):               .1625 (.0088) {.0024} | – | – | – | .0237 (.0189) | .0230 (.0192)
  MLE (Tobit BS):              .1885 (.0051) {.0016} | .2445 (.0063) {.0031} | .2037 (.0026) {.0021} | .0192 (.0032) {.00007} | .0224 (.0084) | .0218 (.0104)
  QMLE-PW (fractional probit): .2375 (.0056) {.00001} | .3403 (.0050) {.00002} | .1360 (.0012) {.000002} | .0216 (.0007) {.000003} | .0240 (.0059) | .0240 (.0064)
  NLS (fractional probit):     .2390 (.0049) {.00003} | .3401 (.0023) {.00003} | .1360 (.0011) {.000002} | .0216 (.0008) {.000003} | .0242 (.0064) | .0239 (.0064)
  QMLE (fractional probit):    .2390 (.0048) {.00003} | .3401 (.0018) {.00003} | .1360 (.0010) {.000002} | .0216 (.0006) {.000003} | .0240 (.0061) | .0239 (.0061)

Note: Figures in parentheses () and braces {} are standard deviations and RMSEs, respectively.
Table 5.  Frequencies of the Number of Children.

Number of Kids    Frequency    Percent    Cumulative Relative Frequency
0                 16,200       50.90      50.90
1                 10,000       31.42      82.33
2                  3,733       11.73      94.06
3                  1,373        4.31      98.37
4                    323        1.01      99.39
5                    134         .42      99.81
6                     47         .15      99.96
7                      6         .02      99.97
8                      4         .01      99.99
9                      2         .01      99.99
10                     2         .01     100.00
Total             31,824      100.00

Table 6.  Descriptive Statistics.

Variable    Description                                                              Mean     SD       Min   Max
frhour      Fraction of hours that a woman spends working per week                  .126     .116     0     .589
kidno       Number of kids                                                           .752     .977     0     10
age         Mother's age in years                                                    29.742   3.613    21    35
agefstm     Mother's age in years when first child was born                          20.118   2.889    15    32
hispan      = 1 if race is hispanic; = 0 if race is black                            .593     .491     0     1
nonmomi     Non-mom income = Family income - Mom's labor income                      31.806   20.375   0     157.4
edu         Education = Number of schooling years                                    11.005   3.305    0     20
samesex     = 1 if the first two kids have the same sex; = 0 otherwise               .503     .500     0     1
multi2nd    = 1 if the 2nd birth is twin; = 0 otherwise                              .009     .093     0     1
Under this misspecification, the APE estimates from the fractional probit model remain very close to the true values of the APEs.
3.4. Conclusion from the Monte Carlo Simulations

This section has examined the finite sample behavior of the estimators proposed for the FRM with an endogenous count variable. The Monte Carlo results show that the QMLE and NLS estimators have the smallest SDs and RMSEs and are the least biased when endogeneity is present.
Table 7.  First-Stage Estimates Using Instrumental Variables (Dependent Variable: kidno).

              Linear Model (OLS)    Negative Binomial II Model (MLE)
edu           .065 (.002)           .078 (.002)
age           .096 (.002)           .119 (.002)
agefstm       .114 (.002)           .156 (.003)
hispan        .036 (.010)           .045 (.015)
nonmomi       .002 (.0002)          .003 (.0004)
samesex       .075 (.010)           .098 (.013)
multi2nd      .786 (.052)           .728 (.045)
constant      .911 (.042)           .013 (.067)

Note: Figures in parentheses are robust standard errors.
The QMLE and NLS methods also produce the least biased estimates, in terms of both the parameters and the APEs, compared with the other competing methods.
4. APPLICATION AND ESTIMATION RESULTS

The fraction of hours that a woman spends working per week, which is affected by the number of children, serves as an empirical application of the FRM with a count endogenous explanatory variable. The data used in this chapter come from Angrist and Evans (1998), who used them to estimate a linear model with a dummy endogenous variable: more than two kids. They estimated the effect of additional children on female labor supply, treating the number of children as endogenous and using samesex and twins at the first two births as instruments. They found that married women who have a third child reduce their labor supply, and their 2SLS estimates are roughly half as large as the corresponding OLS estimates. In this chapter, the fractional response variable is the fraction of hours that a woman spends working per week, generated from the number of working hours used in Angrist and Evans (1998) divided by the maximum number of hours per week (168).
Table 8.  Estimates Assuming Number of Kids is Conditionally Exogenous.

Coefficients and APEs are listed in the order: kidno (continuous) | kidno 0–1 | 1–2 | 2–3 | edu | age | agefstm | hispan | nonmomi. A dash indicates that the entry is not reported.

Linear (OLS), coefficients:              .019 (.0007) | – | – | – | .004 (.0002) | .005 (.0002) | .006 (.0003) | .032 (.001) | .0003 (.00003)
Tobit (MLE), coefficients:               .034 (.0013) | – | – | – | .008 (.0004) | .008 (.0003) | .010 (.0004) | .052 (.0022) | .0006 (.00006)
Tobit (MLE), APEs:                       .0225 (.0008) | .0231 (.0008) | .0207 (.0007) | .0183 (.0005) | .005 (.0002) | .006 (.0002) | .007 (.0003) | .034 (.0014) | .0004 (.00004)
Fractional probit (QMLE), coefficients:  .099 (.004) | – | – | – | .022 (.001) | .024 (.001) | .030 (.001) | .150 (.007) | .002 (.0002)
Fractional probit (QMLE), APEs:          .0202 (.0008) | .0207 (.0008) | .0185 (.0007) | .0163 (.0005) | .005 (.0002) | .005 (.0002) | .006 (.0003) | .031 (.0013) | .0004 (.00003)

Note: Figures in parentheses under coefficients are robust standard errors; those under APEs are bootstrapped standard errors.
There is a substantial number of women who do not spend any hours working; therefore, a Tobit model might also be of interest. Table 8 shows the estimation results of OLS in a linear model, MLE in a Tobit model, and QMLE in a fractional probit model when y2 is assumed exogenous. Table 9 shows the estimation results of 2SLS in a linear model, MLE in a Tobit BS model, and the QMLE-PW, QMLE, and NLS estimators in a fractional probit model when y2 is assumed endogenous. Since the Tobit BS model is also analyzed, its model specification, the derivation of its conditional mean and APEs, and its estimation approach are included in Appendix B; the NLS method, which uses the same conditional mean as the QMLE method, is presented in Appendix C. The count variable in this application is the number of children rather than the indicator for having more than two kids used in Angrist and Evans (1998).
Table 9.  Estimates Assuming Number of Kids is Endogenous.

Coefficients and APEs are listed in the order: kidno (continuous) | kidno 0–1 | 1–2 | 2–3 | edu | age | agefstm | hispan | nonmomi. A dash indicates that the entry is not reported. In the QMLE-PW column, kidno is treated as continuous.

Linear (2SLS), coefficients:                 .016 (.007) | – | – | – | .004 (.0005) | .005 (.0007) | .006 (.0008) | .032 (.001) | .0003 (.00004)
Tobit BS (MLE), coefficients:                .027 (.013) | – | – | – | .009 (.0009) | .008 (.001) | .010 (.001) | .052 (.002) | .0005 (.00006)
Tobit BS (MLE), APEs:                        .018 (.008) | .018 (.008) | .017 (.007) | .015 (.006) | .006 (.0006) | .005 (.0008) | .006 (.001) | .034 (.001) | .0004 (.00004)
Fractional probit (QMLE-PW), coefficients:   .078 (.037) | – | – | – | .024 (.002) | .022 (.004) | .028 (.004) | .150 (.007) | .002 (.0002)
Fractional probit (QMLE-PW), APEs:           .016 (.008) | .016 (.008) | .015 (.007) | .014 (.005) | .005 (.0005) | .004 (.0008) | .006 (.0009) | .031 (.001) | .0003 (.00004)
Fractional probit (QMLE), coefficients:      .081 (.007) | – | – | – | .024 (.001) | .021 (.001) | .027 (.002) | .151 (.007) | .002 (.0002)
Fractional probit (QMLE), APEs:              .017 (.001) | .017 (.0009) | .015 (.0009) | .014 (.001) | .005 (.0005) | .004 (.0008) | .005 (.0008) | .031 (.001) | .0003 (.00004)
Fractional probit (NLS), coefficients:       .081 (.007) | – | – | – | .024 (.001) | .021 (.001) | .027 (.002) | .151 (.007) | .002 (.0002)
Fractional probit (NLS), APEs:               .017 (.001) | .017 (.0009) | .015 (.0009) | .014 (.001) | .005 (.0005) | .004 (.0008) | .005 (.0008) | .031 (.001) | .0003 (.00004)

Note: Figures in parentheses under coefficients are robust standard errors; those under APEs are bootstrapped standard errors, except for the APEs of the count endogenous variable with the QMLE and NLS methods, which are computed standard errors (delta method, Appendix A).
The number of kids is considered endogenous, in line with the recent empirical literature. First, the number and timing of births are controlled by the mother, so fertility decisions are correlated with unobserved heterogeneity. Second, women's fertility is determined by both heterogeneous preferences and correlated, heterogeneous opportunity costs of children. The estimation sample contains 31,824 women, of whom more than 50% are childless, 31% have one kid, 11% have two kids, and the rest have more than two kids. Table 5 gives the frequency distribution of the number of children, which displays excess zeros and a long tail, with an average of about one child per woman. The other explanatory variables, which are exogenous and include demographic and economic characteristics of the family, are described in Table 6. Research on parents' preferences over the sex mix of their children using U.S. data shows that most families would prefer at least one child of each sex. For example, Ben-Porath and Welch (1976) found that 56% of families with either two boys or two girls had a third birth, whereas only 51% of families with one boy and one girl had a third child. Angrist and Evans (1998) found that only 31.2% of women with one boy and one girl had a third child, whereas 38.8% and 36.5% of women with two girls and two boys, respectively, had a third child. Given this evidence that women whose children are all of the same sex are more likely to have additional children, samesex and twins can be used as instruments.
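For concreteness, the constructed variables described above would look roughly as follows in pandas. The raw column names (hours, sex1, sex2, twins2) are hypothetical placeholders, since the original Angrist–Evans extract uses its own naming; the last line simply reproduces the same-sex fertility pattern discussed in the text.

```python
import pandas as pd

# df is assumed to hold the Angrist-Evans (1998) extract; column names below are placeholders.
df["frhour"]   = df["hours"] / 168                       # fraction of weekly hours worked
df["samesex"]  = (df["sex1"] == df["sex2"]).astype(int)  # first two children have the same sex
df["multi2nd"] = df["twins2"].astype(int)                # second birth is a twin birth

# Share of women having a third child, by whether the first two children share a sex
print(df.groupby("samesex")["kidno"].apply(lambda k: (k >= 3).mean()))
```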
4.1. Ordinary Least Squares

OLS estimation often serves as a benchmark: its computation is simple, its interpretation is straightforward, and it requires fewer assumptions for consistency. Table 8 provides the estimates of a linear model in which the fraction of total weekly working hours is the response variable and the number of kids is treated as exogenous. As in the literature on women's labor supply, the coefficient on the number of kids is negative and statistically significant. The linear model with OLS estimation, however, ignores the functional form issues that arise from the excess-zeros nature of the dependent variable. Moreover, the fraction of total weekly working hours must lie in the unit interval, while the linear model's predicted values need not; the linear model makes little sense when a predicted value falls outside this interval.
4.2. A Tobit Model with an Exogenous Number of Kids

A Tobit model may be attractive for two reasons: the fraction of working hours per week has many zeros, and the predicted value needs to be nonnegative. The estimates are given in Table 8. The Tobit coefficients have the same signs as the corresponding OLS estimates, and the statistical significance of the estimates is similar. For magnitudes, the Tobit partial effects are computed to make them comparable to the linear model estimates. The partial effect of a discrete explanatory variable is obtained by estimating the Tobit conditional mean and then computing the difference in the conditional mean at the two values of interest (e.g., plugging in y2i = 1 and then y2i = 0). As implied by the coefficient, having the first child reduces the estimated fraction of total weekly working hours by about .023, or 2.3 percentage points, a larger effect than the 1.9 percentage points implied by the OLS estimate. Having the second child and the third child reduces the mother's work by about .021 (2.1 percentage points) and .018 (1.8 percentage points), respectively. All of the OLS and Tobit statistics are fully robust and statistically significant. Compared with the OLS partial effect of about .019, or 1.9 percentage points, the Tobit partial effects are larger for the first kid but almost the same for the second and third kids. The partial effects of continuous explanatory variables can be obtained by taking derivatives of the conditional mean or, more practically, by applying adjustment factors that make the adjusted Tobit coefficients roughly comparable to the OLS estimates. All of the Tobit coefficients for continuous variables in Table 8 are larger in absolute value than the corresponding OLS coefficients, whereas the Tobit partial effects for continuous variables are only slightly larger in absolute value than the corresponding OLS estimates.
4.3. An FRM with an Exogenous Number of Kids

Following Papke and Wooldridge (1996), I also use the fractional probit model with an exogenous number of children for comparison purposes. The FRM estimates are similar to the Tobit estimates but even closer to the OLS estimates, and the statistical significance of the QML estimates is almost the same as that of the OLS estimates (see Table 8). Having the second child reduces the estimated fraction of total weekly working hours by 1.9 percentage points, roughly the same as the OLS estimate.
However, the first and third children have different partial effects: having the first kid reduces a mother's work by 2.0 percentage points, and having the third kid reduces it by 1.6 percentage points.
4.4. Two-Stage Least Squares

In the literature on female labor supply, Angrist and Evans (1998) treat fertility as endogenous. Their notable contribution is the use of two binary instruments, an indicator that the first two births have the same sex (samesex) and an indicator for twins at the second birth (multi2nd), to account for an endogenous third child. The 2SLS estimates are replicated and reported in Table 9. The first-stage estimates using OLS and a continuous number of children, given in Table 7, imply that an additional year of education reduces the predicted number of kids by .065. In magnitude, the 2SLS estimate for the number of kids is smaller than the OLS estimate, while the estimates for the other explanatory variables are roughly the same. With the IV estimates, having children reduces a mother's work by about 1.6 percentage points, smaller than the corresponding OLS estimate of about 1.9 percentage points. These findings are consistent with Angrist and Evans's results.
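The 2SLS point estimates can be reproduced with the textbook two-step construction that the passage describes (first-stage OLS, then OLS on the fitted value). The sketch below uses the hypothetical DataFrame columns introduced earlier; it reports coefficients only, since valid 2SLS standard errors require the usual IV variance formula rather than the second-stage OLS ones.

```python
import numpy as np

X_exog = np.column_stack([np.ones(len(df)),
                          df[["edu", "age", "agefstm", "hispan", "nonmomi"]].to_numpy()])
Z_iv   = np.column_stack([X_exog, df[["samesex", "multi2nd"]].to_numpy()])

pi_hat, *_ = np.linalg.lstsq(Z_iv, df["kidno"].to_numpy(), rcond=None)   # first stage
kidno_hat  = Z_iv @ pi_hat

X_2nd = np.column_stack([kidno_hat, X_exog])
beta_2sls, *_ = np.linalg.lstsq(X_2nd, df["frhour"].to_numpy(), rcond=None)
print(dict(zip(["kidno", "const", "edu", "age", "agefstm", "hispan", "nonmomi"], beta_2sls)))
```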
4.5. A Tobit BS Model with an Endogenous Number of Kids

A Tobit BS model is also estimated with an endogenous number of children (see Table 9). The Tobit APEs for the number of kids are slightly larger than the corresponding 2SLS estimates, while the APEs for the other explanatory variables are almost the same as the 2SLS estimates. Having the first, second, and third kids reduces the fraction of hours a mother spends working per week by around 1.8, 1.7, and 1.5 percentage points, respectively; the effect of the third kid is roughly the same as the 2SLS estimate. The statistical significance for the number of kids is also almost the same. The Tobit BS method resembles the 2SLS method in that its first stage is a linear estimation that ignores the count nature of the number of children, which explains why the Tobit BS results are very close to the 2SLS estimates.
4.6. An FRM with an Endogenous Number of Kids

Now consider the FRM with an endogenous number of kids. The fractional probit model with the Papke–Wooldridge (2008) method deals with the endogeneity problem but not with the count nature of the endogenous variable: the endogenous variable is treated as continuous, so partial effects at discrete values of the count variable are not considered. For comparability with the other APE estimates, the APEs of the QMLE-PW estimates are also computed here. Having the first kid reduces a mother's fraction of weekly working hours by about the same amount as the 2SLS estimate, and treating the number of children as continuous gives the same effect as the 2SLS estimate for the number of kids. As the number of children increases, a mother gives up more working hours: having the first, second, and third kids reduces the fraction of hours she works per week by around 1.6, 1.5, and 1.4 percentage points, respectively. The statistical significance for the number of kids is the same as for the Tobit BS estimates, and the APEs of the QMLE-PW estimates for the other explanatory variables are almost the same as the corresponding 2SLS estimates.

The fractional probit model with the methods proposed in this chapter is attractive because it controls for endogeneity, functional form issues, and the presence of unobserved heterogeneity; more importantly, the number of children is treated as a count variable rather than a continuous one. Both the QMLE and NLS are considered, and the NLS estimates are nearly identical to the QML estimates. The QML and NLS coefficients and robust standard errors are given in Table 9, and the first-stage estimates are reported in Table 7. In the first stage, the Poisson model for the count variable is preferred for two reasons. First, the long tail and excess zeros of the count variable's distribution point to gamma rather than normal heterogeneity. Second, adding unobserved heterogeneity with the standard exponential-gamma distribution to the Poisson model turns it into the Negative Binomial model, which can be estimated by ML. The OLS and Poisson estimates are not directly comparable: for instance, an additional year of education reduces the number of kids by .065 according to the linear coefficient and by 7.8% according to the Poisson coefficient. The fractional probit (FP) estimates have the same signs as the corresponding OLS and 2SLS estimates, and the results show that the QMLE is more efficient than OLS and 2SLS. For magnitudes, the FP APEs are computed to make them comparable to the linear model estimates. As in the Tobit model, the partial effect of a discrete
explanatory variable is obtained by estimating the conditional mean and taking differences at the values of interest. Regarding the number of kids, having more kids reduces the fraction of hours that a mother works weekly. Having the first child cuts the estimated fraction of total weekly working hours by about .017, or 1.7 percentage points, similar to the 2SLS estimate and a larger reduction than the OLS estimate implies. Having the second and the third child reduces a mother's work by about 1.5 and 1.4 percentage points, respectively. Although the third kid still reduces a mother's fraction of weekly working hours relative to the second, the marginal reduction shrinks: a marginal reduction of .2 percentage points for the second kid falls to .1 percentage points for the third. This can be seen as an 'adaptation effect': the mother adapts and works more effectively after having the first kid. The partial effects of continuous explanatory variables can be obtained by taking derivatives of the conditional mean, so that they are comparable to the OLS, 2SLS, and other alternative estimates.

All of the estimates in Table 9 tell a consistent story about fertility. Statistically, having any children significantly reduces a mother's weekly working hours, and the more kids a woman has, the more hours she needs to forgo. The FRM estimates that treat the number of kids as an endogenous count variable show that the marginal reduction in women's weekly working hours declines as women have additional children. Moreover, the FRM estimates, which account for the endogeneity and count nature of the number of children, are statistically significant and more significant than the corresponding 2SLS and Tobit estimates.
5. CONCLUSION

I present QMLE and NLS methodologies to estimate the FRM with a count endogenous explanatory variable. The unobserved heterogeneity is assumed to follow an exponential-gamma distribution (exp(a1) is gamma distributed), and the conditional mean of the FRM is evaluated numerically. The QMLE and NLS approaches are more efficient than the 2SLS and Tobit-with-IV estimators, and they are more robust and less difficult to compute than the standard MLE method. The approach is applied to estimate the effect of fertility on the fraction of hours a woman spends working per week. Allowing the number of kids to be endogenous and using the data from Angrist and Evans (1998), I find that the marginal reduction in women's weekly working hours declines as women have additional children. In addition, the effect of the number of children on the fraction of hours that a woman spends working per week is statistically significant and more significant than the estimates in all of the other linear and nonlinear models considered in the chapter.
NOTE

Details on CPU times, the number of quadrature points, and the code are available upon request.
ACKNOWLEDGMENTS

I express my special thanks to Jeffrey Wooldridge for his advice and support. I thank Peter Schmidt, Carter Hill, David Drukker, and two anonymous referees for helpful comments. All remaining errors are my own.
REFERENCES

Angrist, J., & Evans, W. (1998). Children and their parents' labor supply: Evidence from exogenous variation in family size. American Economic Review, 88, 450–477.
Ben-Porath, Y., & Welch, F. (1976). Do sex preferences really matter? Quarterly Journal of Economics, 90, 285–307.
Das, M. (2005). Instrumental variables estimators of nonparametric models with discrete endogenous regressors. Journal of Econometrics, 124, 335–361.
Mullahy, J. (1997). Instrumental-variable estimation of count data models: Applications to models of cigarette smoking behavior. Review of Economics and Statistics, 79, 586–593.
Newey, W. K., & McFadden, D. (1994). Large sample estimation and hypothesis testing. In: R. F. Engle & D. McFadden (Eds), Handbook of econometrics (Vol. 4, pp. 2111–2245). Amsterdam: North Holland.
Papke, L., & Wooldridge, J. (1996). Econometric methods for fractional response variables with an application to 401(k) plan participation rates. Journal of Applied Econometrics, 11, 619–632.
Papke, L., & Wooldridge, J. (2008). Panel data methods for fractional response variables with an application to test pass rates. Journal of Econometrics, 145, 121–133.
Skrondal, A., & Rabe-Hesketh, S. (2004). Generalized latent variable modeling: Multilevel, longitudinal and structural equation models. Boca Raton, FL: Chapman & Hall/CRC.
Smith, R., & Blundell, R. (1986). An exogeneity test for a simultaneous equation Tobit model with an application to labor supply. Econometrica, 54, 679–685.
Staiger, D., & Stock, J. (1997). Instrumental variables regression with weak instruments. Econometrica, 65, 557–586.
Terza, J. (1998). Estimating count data models with endogenous switching: Sample selection and endogenous treatment effects. Journal of Econometrics, 84, 129–154.
Weiss, A. (1999). A simultaneous binary choice/count model with an application to credit card approvals. In: R. Engle & H. White (Eds), Cointegration, causality, and forecasting: A festschrift in honour of Clive W. J. Granger (pp. 429–461). Oxford: Oxford University Press.
Windmeijer, F., & Santos-Silva, J. (1997). Endogeneity in count-data models: An application to demand for health care. Journal of Applied Econometrics, 12, 281–294.
Winkelmann, R. (2000). Econometric analysis of count data. Berlin: Springer.
Wooldridge, J. (2002). Econometric analysis of cross section and panel data. Cambridge, MA: MIT Press.
APPENDICES

Appendix A

This appendix derives asymptotic standard errors for the QML estimator in the second step and for the average partial effects. In the first stage, we have $y_{2i}\mid z_i, a_{1i} \sim \text{Poisson}[\exp(z_i\delta_2 + a_{1i})]$ with conditional density function

$$f(y_{2i}\mid z_i,a_{1i}) = \frac{[\exp(z_i\delta_2+a_{1i})]^{y_{2i}}\exp[-\exp(z_i\delta_2+a_{1i})]}{y_{2i}!} \qquad (A.1)$$

The density of $y_{2i}$ conditioned only on $z_i$ is obtained by integrating $a_{1i}$ out of the joint density,

$$f(y_{2i}\mid z_i) = \int_{a_{1i}} f(y_{2i}\mid z_i,a_{1i})\, f(a_{1i})\, da_{1i}$$

Let $m_i = \exp(z_i\delta_2)$ and $c_i = \exp(a_{1i})$. Then the conditional density is

$$f(y_{2i}\mid z_i,a_{1i}) = \frac{(m_i c_i)^{y_{2i}}\exp(-m_i c_i)}{\Gamma(y_{2i}+1)}$$

and the unconditional density is

$$f(y_{2i}\mid z_i) = \int_0^{\infty}\frac{(m_i c_i)^{y_{2i}}\exp(-m_i c_i)}{\Gamma(y_{2i}+1)}\cdot\frac{\delta_0^{\delta_0} c_i^{\delta_0-1}\exp(-\delta_0 c_i)}{\Gamma(\delta_0)}\,dc_i
= \frac{m_i^{y_{2i}}\delta_0^{\delta_0}}{\Gamma(y_{2i}+1)\Gamma(\delta_0)}\int_0^{\infty}\exp[-c_i(m_i+\delta_0)]\, c_i^{y_{2i}+\delta_0-1}\,dc_i
= \frac{m_i^{y_{2i}}\,\delta_0^{\delta_0}\,\Gamma(y_{2i}+\delta_0)}{\Gamma(y_{2i}+1)\,\Gamma(\delta_0)\,(m_i+\delta_0)^{y_{2i}+\delta_0}}$$

Defining $h_i = \delta_0/(m_i+\delta_0)$ gives

$$f(y_{2i}\mid z_i) = \frac{\Gamma(y_{2i}+\delta_0)\, h_i^{\delta_0}\,(1-h_i)^{y_{2i}}}{\Gamma(y_{2i}+1)\,\Gamma(\delta_0)}$$

where $y_{2i}=0,1,\ldots$ and $\delta_0>0$, which is the density function of the Negative Binomial distribution. The log-likelihood for observation $i$ is

$$\ell_i(\delta_2,\delta_0) = \delta_0\ln\!\left(\frac{\delta_0}{\delta_0+\exp(z_i\delta_2)}\right) + y_{2i}\ln\!\left(\frac{\exp(z_i\delta_2)}{\delta_0+\exp(z_i\delta_2)}\right) + \ln\!\left(\frac{\Gamma(y_{2i}+\delta_0)}{\Gamma(y_{2i}+1)\Gamma(\delta_0)}\right) \qquad (A.2)$$

and for all observations $L(\delta_2,\delta_0) = \sum_{i=1}^{N}\ell_i(\delta_2,\delta_0)$.
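The first-stage log-likelihood (A.2) can be maximized directly. The sketch below does so with SciPy on the simulated data from the Section 3 sketches; it is a minimal illustration of Eq. (A.2), not the chapter's own code, and it keeps δ0 positive through a log transformation.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def negbin_negloglik(params, y2, Z):
    """Negative of the first-stage log-likelihood (A.2); params = (delta2, log delta0)."""
    delta2, d0 = params[:-1], np.exp(params[-1])
    mu = np.exp(Z @ delta2)
    ll = (d0 * np.log(d0 / (d0 + mu))
          + y2 * np.log(mu / (d0 + mu))
          + gammaln(y2 + d0) - gammaln(y2 + 1) - gammaln(d0))
    return -ll.sum()

Z = np.column_stack([np.ones(N), x1, x2, z])
res = minimize(negbin_negloglik, x0=np.zeros(Z.shape[1] + 1),
               args=(y2, Z), method="BFGS")
delta2_hat, delta0_hat = res.x[:-1], np.exp(res.x[-1])
```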
We can estimate δ2 and δ0 jointly by a stepwise MLE. Let $\gamma = (\delta_2,\delta_0)'$, of dimension $L+1$, where $L$ is the dimension of δ2 (the number of exogenous covariates $K$ plus the number of instruments). Under standard regularity conditions,

$$\sqrt{N}(\hat\gamma-\gamma) = N^{-1/2}\sum_{i=1}^N r_{i2} + o_p(1) \qquad (A.3)$$

where

$$r_{i2} = \begin{bmatrix} A_{01}^{-1}s_{01}\\ A_{02}^{-1}s_{02}\end{bmatrix} \qquad (A.4)$$

in which

$$s_0 = \begin{pmatrix}\nabla_{\delta_2}\ell_i\\ \nabla_{\delta_0}\ell_i\end{pmatrix} = \begin{pmatrix}s_{01}\\ s_{02}\end{pmatrix},\qquad
A_0 = -E(\nabla^2_{\gamma}\ell_i) = -E\begin{pmatrix}H_{01}\\ H_{02}\end{pmatrix} = \begin{pmatrix}A_{01}\\ A_{02}\end{pmatrix}$$

Taking the first and second derivatives gives

$$s_{01} = \frac{z_i'\,\delta_0\,[y_{2i}-\exp(z_i\delta_2)]}{\delta_0+\exp(z_i\delta_2)},\qquad
H_{01} = -\frac{z_i'z_i\,\delta_0\exp(z_i\delta_2)}{\delta_0+\exp(z_i\delta_2)}$$

$$s_{02} = \ln\!\left(\frac{\delta_0}{\delta_0+\exp(z_i\delta_2)}\right) + \frac{\exp(z_i\delta_2)-y_{2i}}{\delta_0+\exp(z_i\delta_2)} + \frac{\Gamma'(y_{2i}+\delta_0)}{\Gamma(y_{2i}+\delta_0)} - \frac{\Gamma'(\delta_0)}{\Gamma(\delta_0)}$$

$$H_{02} = \frac{\exp(z_i\delta_2)}{\delta_0[\delta_0+\exp(z_i\delta_2)]} - \frac{\exp(z_i\delta_2)-y_{2i}}{[\delta_0+\exp(z_i\delta_2)]^2}
+ \frac{\Gamma''(y_{2i}+\delta_0)\Gamma(y_{2i}+\delta_0)-[\Gamma'(y_{2i}+\delta_0)]^2}{[\Gamma(y_{2i}+\delta_0)]^2}
- \frac{\Gamma''(\delta_0)\Gamma(\delta_0)-[\Gamma'(\delta_0)]^2}{[\Gamma(\delta_0)]^2}$$

Here $s_{01}$ and $H_{01}$ are $L\times 1$ and $L\times L$ matrices, $s_{02}$ and $H_{02}$ are scalars, and $r_{i2}(\gamma)$ is $(L+1)\times 1$.

With the two-step M-estimator, the asymptotic variance of $\sqrt{N}(\hat\theta-\theta)$ must be adjusted to account for the first-stage estimation of $\sqrt{N}(\hat\gamma-\gamma)$ (see Section 12.4.2 of Chapter 12, Wooldridge, 2002). The score of the QML objective (the gradient) for observation $i$ with respect to θ is

$$s_i(\theta,\gamma) = \nabla_\theta \ell_i(\theta)
= \frac{y_{1i}\nabla_\theta m_i}{m_i} - \frac{(1-y_{1i})\nabla_\theta m_i}{1-m_i}
= \frac{y_{1i}(1-m_i)\nabla_\theta m_i - m_i(1-y_{1i})\nabla_\theta m_i}{m_i(1-m_i)}
= \frac{(y_{1i}-m_i)\nabla_\theta m_i}{m_i(1-m_i)}$$

with

$$\nabla_\theta m_i = \int_{-\infty}^{+\infty}\frac{\partial\Phi(g_i\theta)}{\partial\theta}\, f(a_1\mid y_{2i},z_i)\,da_1$$

so that

$$s_i(\theta,\gamma) = \frac{y_{1i}-m_i}{m_i(1-m_i)}\int_{-\infty}^{+\infty} g_i'\,\phi(g_i\theta)\, f(a_1\mid y_{2i},z_i)\,da_1 \qquad (A.5)$$
where $g_i = (y_{2i}, z_{1i}, a_{1i})$ and $\theta = (\alpha_1,\delta_1,\eta_1)'$ has dimension $K+2$. Then

$$\sqrt{N}(\hat\theta-\theta) = A_1^{-1}\left(N^{-1/2}\sum_{i=1}^N r_{i1}(\theta,\gamma)\right) + o_p(1) \qquad (A.6)$$

$$A_1 = -E[\nabla_\theta s_i(\theta,\gamma)] = E\!\left[\frac{(\nabla_\theta m_i)'\,\nabla_\theta m_i}{m_i(1-m_i)}\right] = E\!\left[\frac{B'B}{m_i(1-m_i)}\right],\qquad
\hat A_1 = N^{-1}\sum_{i=1}^N\frac{\hat B'\hat B}{\hat m_i(1-\hat m_i)} \qquad (A.7)$$

where $B = \int_{-\infty}^{+\infty} g_i'\,\phi(g_i\theta)\, f(a_1\mid y_{2i},z_i)\,da_1$, and

$$r_{i1}(\theta,\gamma) = s_i(\theta,\gamma) - F_1 r_{i2}(\gamma),\qquad
\hat r_{i1}(\theta,\gamma) = \hat s_i(\theta,\gamma) - \hat F_1\hat r_{i2}(\gamma) \qquad (A.8)$$

where $r_{i1}(\theta,\gamma)$ and $s_i(\theta,\gamma)$ are $(K+2)\times 1$ matrices, $r_{i2}(\gamma)$ is $(L+1)\times 1$, $F_1$ is $(K+2)\times(L+1)$, and $A_1$ is $(K+2)\times(K+2)$. Here

$$F_1 = E[\nabla_\gamma s_i(\theta,\gamma)] = E\begin{pmatrix}\nabla_{\delta_2}s_i(\theta,\gamma)\\ \nabla_{\delta_0}s_i(\theta,\gamma)\end{pmatrix}$$

with

$$E[\nabla_{\delta_2}s_i(\theta,\gamma)] = E\!\left[\frac{1}{m_i(1-m_i)}\, B\int_{-\infty}^{+\infty}\Phi(g_i\theta)\,\frac{\partial f(a_1\mid y_{2i},z_i)}{\partial\delta_2}\,da_1\right],\qquad
E[\nabla_{\delta_0}s_i(\theta,\gamma)] = E\!\left[\frac{1}{m_i(1-m_i)}\, B\int_{-\infty}^{+\infty}\Phi(g_i\theta)\,\frac{\partial f(a_1\mid y_{2i},z_i)}{\partial\delta_0}\,da_1\right]$$

and

$$\hat F_1 = N^{-1}\sum_{i=1}^N
\begin{bmatrix}
\dfrac{1}{\hat m_i(1-\hat m_i)}\,\hat B\displaystyle\int_{-\infty}^{+\infty}\Phi(g_i\hat\theta)\,\dfrac{\partial\hat f(a_1\mid y_{2i},z_i)}{\partial\delta_2}\,da_1\\[2ex]
\dfrac{1}{\hat m_i(1-\hat m_i)}\,\hat B\displaystyle\int_{-\infty}^{+\infty}\Phi(g_i\hat\theta)\,\dfrac{\partial\hat f(a_1\mid y_{2i},z_i)}{\partial\delta_0}\,da_1
\end{bmatrix} \qquad (A.9)$$

where

$$\frac{\partial f(a_1\mid y_{2i},z_i)}{\partial\delta_2} = \frac{z_i'\,\exp(P)\,C\,[\delta_0+\exp(z_i\delta_2)]^{y_{2i}+\delta_0-1}}{\Gamma(y_{2i}+\delta_0)} \qquad (A.9.1)$$

with

$$P = -\exp(z_i\delta_2+a_1) + a_1(y_{2i}+\delta_0) - \delta_0\exp(a_1),\qquad
C = (y_{2i}+\delta_0)\exp(z_i\delta_2) - \exp(z_i\delta_2+a_1)[\delta_0+\exp(z_i\delta_2)]$$

and

$$\frac{\partial f(a_1\mid y_{2i},z_i)}{\partial\delta_0} = f(a_1\mid y_{2i},z_i)\, D \qquad (A.9.2)$$

with

$$D = a_1 - \exp(a_1) + \ln[\delta_0+\exp(z_i\delta_2)] + \frac{y_{2i}+\delta_0}{\delta_0+\exp(z_i\delta_2)} - \frac{\Gamma'(y_{2i}+\delta_0)}{\Gamma(y_{2i}+\delta_0)}$$

and

$$f(a_1\mid y_{2i},z_i) = \frac{\exp(P)\,[\delta_0+\exp(z_i\delta_2)]^{y_{2i}+\delta_0}}{\Gamma(y_{2i}+\delta_0)}$$

Finally,

$$\mathrm{Avar}\,\sqrt{N}(\hat\theta-\theta) = A_1^{-1}\,\mathrm{Var}[r_{i1}(\theta,\gamma)]\,A_1^{-1},\qquad
\widehat{\mathrm{Avar}}(\hat\theta) = \hat A_1^{-1}\left(N^{-1}\sum_{i=1}^N\hat r_{i1}\hat r_{i1}'\right)\hat A_1^{-1}\Big/ N \qquad (A.10)$$
i¼1
2i
1
is the vector of scaled coefficients times the scaled factor in the APE section and Z þ1 (A.12) fðgi hÞf ða1 jy2i ; zi Þda1 h w¼E 1
is the vector of scaled population coefficients times the mean response. If y2 is treated as a continuous variable: XN Z þ1 1 d ^ fð^a1 y2i þ z1i d1 þ Z^ 1 a1 Þf ða1 jy2i ; zi Þda1 a^ 1 APE ¼ N i¼1 1
293
Estimating a Fractional Response Model
For a continuous variable z11 : ! N Z þ1 X 1 d ¼ N APE fð^a1 y2i þ z1i d^ 1 þ Z^ 1 a1 Þf ða1 jy2i ; zi Þda1 d^ 11 i¼1
1 0
0
0
Using problem 12.12 in Wooldridge (2003), and let p^ ¼ ðh^ ; d^ 2 ; d^ 0 Þ0 we have N X pffiffiffiffi pffiffiffiffi ^ wÞ ¼ N 1=2 ½jðgi ; zi ; pÞ w þ E½rp jðgi ; zi ; pÞ N ðp^ pÞ þ op ð1Þ N ðw i¼1
(A.13) where jðgi ; zi ; pÞ ¼ ð
R þ1
fðgi hÞf ða1 jy2i ; zi Þda1 Þh and d0 þy2 d0 þ expðzd2 Þ ½expða1 Þy2 f ða1 jy2 ; zÞ ¼ f ða1 ; d0 ; d0 Þ d0 þ expðzd2 þ a1 Þ pffiffiffiffi First, we need to find N ðp^ pÞ ! N X A1 pffiffiffiffi 1 ri1 1=2 N ðp^ pÞ ¼ N þ op ð1Þ ri2 i¼1 1
N X pffiffiffiffi N ðp^ pÞ ¼ N 1=2 ki þ op ð1Þ
ðA:14Þ
i¼1
pffiffiffiffi ^ wÞ is Thus, the asymptotic variance of N ðw Z þ1 fðgi hÞf ða1 jy2i ; zi Þda1 h w þ JðpÞki Var
(A.15)
1
where JðpÞ ¼ E½rp jðgi ; zi ; pÞ. Next, we need to find rh jðgi ; zi ; pÞ; rd2 jðgi ; zi ; pÞ, and rd0 jðgi ; zi ; pÞ. Z þ1 fðgi hÞf ða1 jy2i ; zi Þda1 IKþ2 rh jðgi ; zi ; pÞ ¼ 1 Z þ1 ðA:16Þ fðgi hÞðgi hÞðhgi Þf ða1 jy2i ; zi Þda1 1
where IKþ2 is the identity matrix and (K+2) is the dimension of h Z þ1 0 @f ða1 jy2i ; zi Þ rd2 jðgi ; zi ; pÞ ¼ h fðgi hÞ da1 (A.17) @d2 1
294
HOA B. NGUYEN
where @f ða1i jy2i ; zi Þ=@d2 is defined in (A.9.1) and Z
þ1
rd0 jðgi ; zi ; pÞ ¼ h 1
@f ða1 jy2i ; zi Þ fðgi hÞ da1 @d0
0 (A.17.1)
where @f ða1i jy2i ; zi Þ=@d0 is defined in (A.9.2). rd2 jðgi ; zi ; pÞ is ðK þ 2Þ L matrix and rd0 jðgi ; zi ; pÞ is ðK þ 2Þ 1 matrix. Then (A.18) rp jðgi ; zi ; pÞ ¼ rh jðgi ; zi ; p; hÞjrd2 jðgi ; zi ; p; d2 Þjrd0 jðgi ; zi ; p; d0 Þ and its expected value is estimated as ^ ¼ N 1 J^ ¼ JðpÞ
N h X
^ d jðg ; zi ; p; d^ 2 Þjrd jðg ; zi ; p; d^ 0 Þ rh jðgi ; zi ; p; hÞjr i i 2 0
i
i¼1
(A.19) pffiffiffiffi ^ wÞ is consistently estimated as: Finally, Avar½ N ðw N X pffiffiffiffi d N ðw ^ wÞ ¼ N 1 Avar½
Z
Z
i¼1 þ1
þ1
^ ða1 jy ; zi Þda1 h^ w ^ þ J^ k^i fðgi hÞf 2i
1
0 ^ ða1 jy ; zi Þda1 h^ w ^ þ J^ k^i fðgi hÞf 2i
ðA:20Þ
1
where all quantities are evaluated at the estimators given above. The asymptotic standard error for any particular APE is obtained as the square root of the corresponding diagonal element of Eq. (A.20), divided by $\sqrt{N}$.

Now we obtain the asymptotic variance of $\sqrt{N}(\hat\kappa-\kappa)$ for a count endogenous variable, where

$$APE = E_{a_1}\left[\Phi(\alpha_1 y_2^{k+1}+z_1\delta_1+\eta_1 a_1) - \Phi(\alpha_1 y_2^{k}+z_1\delta_1+\eta_1 a_1)\right] \qquad (A.21)$$

For example, $y_2^k=0$ and $y_2^{k+1}=1$. Then

$$\widehat{APE} = N^{-1}\sum_{i=1}^N\left[\int_{-\infty}^{+\infty}\Phi(g_i^{k+1}\hat\theta)\,\hat f(a_1\mid y_{2i},z_i)\,da_1 - \int_{-\infty}^{+\infty}\Phi(g_i^{k}\hat\theta)\,\hat f(a_1\mid y_{2i},z_i)\,da_1\right] \qquad (A.22)$$

$$\mathrm{Var}[\sqrt{N}(\hat\kappa-\kappa)] = \mathrm{Var}\{\sqrt{N}[(\hat\kappa_{k+1}-\hat\kappa_k)-(\kappa_{k+1}-\kappa_k)]\}
= \mathrm{Var}[\sqrt{N}(\hat\kappa_{k+1}-\kappa_{k+1})] + \mathrm{Var}[\sqrt{N}(\hat\kappa_k-\kappa_k)]
- 2\,\mathrm{Cov}[\sqrt{N}(\hat\kappa_{k+1}-\kappa_{k+1}),\ \sqrt{N}(\hat\kappa_k-\kappa_k)] \qquad (A.23)$$

(1) We start with

$$\sqrt{N}(\hat\kappa_k-\kappa_k) = N^{-1/2}\sum_{i=1}^N\left[j(g_i^k,z_i,\pi)-\kappa_k\right] + E[\nabla_\pi j(g_i^k,z_i,\pi)]\,\sqrt{N}(\hat\pi-\pi) + o_p(1)$$

where $j(g_i^k,z_i,\pi) = \int_{-\infty}^{+\infty}\Phi(g_i^k\theta)\, f(a_1\mid y_{2i},z_i)\,da_1$, so that

$$\widehat{\mathrm{Var}}[\sqrt{N}(\hat\kappa_k-\kappa_k)] = N^{-1}\sum_{i=1}^N\left[\int_{-\infty}^{+\infty}\Phi(g_i^k\hat\theta)\,\hat f(a_1\mid y_{2i},z_i)\,da_1 - \hat\kappa_k + \hat J\hat k_i\right]^2 \qquad (A.24)$$

where $\hat k_i$ is as in Eq. (A.14) and $\hat J$ is defined by

$$\hat J = \hat J(\hat\pi) = N^{-1}\sum_{i=1}^N\left[\nabla_\theta j(g_i^k,z_i,\hat\pi;\hat\theta)\ \big|\ \nabla_{\delta_2} j(g_i^k,z_i,\hat\pi;\hat\delta_2)\ \big|\ \nabla_{\delta_0} j(g_i^k,z_i,\hat\pi;\hat\delta_0)\right] \qquad (A.25)$$

$$\nabla_\theta j(g_i^k,z_i,\pi;\theta) = \int_{-\infty}^{+\infty} g_i^{k\prime}\,\phi(g_i^k\theta)\, f(a_1\mid y_{2i},z_i)\,da_1 \qquad (A.26)$$

$$\nabla_{\delta_2} j(g_i^k,z_i,\pi;\delta_2) = \int_{-\infty}^{+\infty}\Phi(g_i^k\theta)\,\frac{\partial f(a_1\mid y_{2i},z_i)}{\partial\delta_2}\,da_1 \qquad (A.27)$$

$$\nabla_{\delta_0} j(g_i^k,z_i,\pi;\delta_0) = \int_{-\infty}^{+\infty}\Phi(g_i^k\theta)\,\frac{\partial f(a_1\mid y_{2i},z_i)}{\partial\delta_0}\,da_1 \qquad (A.28)$$

(2) $\widehat{\mathrm{Var}}[\sqrt{N}(\hat\kappa_{k+1}-\kappa_{k+1})]$ is obtained in the same way as in (1).

(3) Using the formula $\mathrm{Cov}(x,y)=E(xy)-E(x)E(y)$ and noting that $E(\hat\kappa_k)=\kappa_k$, after some algebra the estimator of the covariance term is 0.

Adding (1), (2), and (3) together,

$$\widehat{\mathrm{Var}}[\sqrt{N}(\hat\kappa-\kappa)] = \widehat{\mathrm{Var}}[\sqrt{N}(\hat\kappa_{k+1}-\kappa_{k+1})] + \widehat{\mathrm{Var}}[\sqrt{N}(\hat\kappa_k-\kappa_k)] \qquad (A.29)$$

The asymptotic standard error for the APE of the count endogenous variable is obtained as the square root of the corresponding diagonal element of Eq. (A.29), divided by $\sqrt{N}$.
Appendix B

Following the Smith–Blundell (1986) approach, the model with endogenous y2 is written as

$$y_1 = \max(0,\ \alpha_1 y_2 + z_1\delta_1 + v_2\xi_1 + e_1)$$

where the reduced form of y2 is $y_2 = z\pi_2 + v_2$, with $v_2\mid z\sim\text{Normal}(0,\Sigma_2)$ and $e_1\mid z,v_2\sim\text{Normal}(0,\sigma_e^2)$. The conditional mean of y1 is

$$E(y_1\mid z,y_2,v_2) = \Phi\!\left[\frac{\alpha_1 y_2 + z_1\delta_1 + v_2\xi_1}{\sqrt{1+\sigma_e^2}}\right] = \Phi(\alpha_{1e}y_2 + z_1\delta_{1e} + v_2\xi_{1e})$$

The Blundell–Smith procedure for estimating $\alpha_1$, $\delta_1$, $\xi_1$, and $\sigma_e^2$ is then:
(i) run the OLS regression of $y_{i2}$ on $z_i$ and save the residuals $\hat v_{i2}$, i = 1, 2, ..., N;
(ii) run a Tobit of $y_{i1}$ on $y_{i2}$, $z_{1i}$, and $\hat v_{i2}$ to obtain $\hat\alpha_{1e}$, $\hat\delta_{1e}$, and $\hat\xi_{1e}$, i = 1, 2, ..., N.

APEs for the Tobit model with an exogenous or endogenous regressor are obtained as follows.

APE in the Tobit model with exogenous y2. Here $y_1=\max(0,y_1^*)$, $y_1^*=\alpha_1 y_2+z_1\delta_1+a_1$, $a_1\mid y_2,z_1\sim N(0,\sigma^2)$. The conditional mean is

$$E(y_1\mid z_1,y_2) = \Phi(\alpha_{1s}y_2+z_1\delta_{1s})(\alpha_1 y_2+z_1\delta_1) + \sigma\,\phi(\alpha_{1s}y_2+z_1\delta_{1s}) \equiv m(y_2,z_1,\theta_{1s},\theta_1)$$

where $\alpha_{1s}=\alpha_1/\sigma$ and $\delta_{1s}=\delta_1/\sigma$. For a continuous variable y2,

$$APE = \partial E(y_1\mid z_1,y_2)/\partial y_2 = \Phi(\alpha_{1s}y_2+z_1\delta_{1s})\,\alpha_1,\qquad
\widehat{APE} = N^{-1}\sum_{i=1}^N\Phi(\hat\alpha_{1s}y_{2i}+z_{1i}\hat\delta_{1s})\,\hat\alpha_1$$

For a discrete variable y2 taking the two values c and c+1, $APE = m(y_{2i}=c+1)-m(y_{2i}=c)$, estimated by $\widehat{APE} = N^{-1}\sum_{i=1}^N[\hat m(y_{2i}=c+1)-\hat m(y_{2i}=c)]$, where $\hat m(y_{2i}=c)=\Phi(\hat\alpha_{1s}c+z_{1i}\hat\delta_{1s})(\hat\alpha_1 c+z_{1i}\hat\delta_1)+\hat\sigma\,\phi(\hat\alpha_{1s}c+z_{1i}\hat\delta_{1s})$.

APE in the Tobit model with endogenous y2 (Blundell & Smith, 1986). Here $y_1=\max(0,y_1^*)$, $y_1^*=\alpha_1 y_2+z_1\delta_1+\eta_1 a_1+e_1=\alpha_1 y_2+z_1\delta_1+u_1$, $y_2=z\delta_2+a_1$, $\mathrm{Var}(a_1)=\sigma^2$, and $e_1\mid z,a_1\sim N(0,\tau_1^2)$. The standard method obtains APEs by computing the derivatives or differences of $E_{a_1}[m(\alpha_1 y_2+z_1\delta_1+\eta_1 a_1,\ \tau_1^2)]$, where

$$E_{a_1}[m(\alpha_1 y_2+z_1\delta_1+\eta_1 a_1,\ \tau_1^2)] = m(\alpha_1 y_2+z_1\delta_1,\ \eta_1^2\sigma^2+\tau_1^2)
= \Phi(\alpha_{1s}y_2+z_1\delta_{1s})(\alpha_1 y_2+z_1\delta_1) + \sqrt{\eta_1^2\sigma^2+\tau_1^2}\,\phi(\alpha_{1s}y_2+z_1\delta_{1s})$$

with $\alpha_{1s}=\alpha_1/\sqrt{\eta_1^2\sigma^2+\tau_1^2}$ and $\delta_{1s}=\delta_1/\sqrt{\eta_1^2\sigma^2+\tau_1^2}$. Estimators of the APEs are obtained from the derivatives or differences of $m(\hat\alpha_1 y_2+z_1\hat\delta_1,\ \hat\eta_1^2\hat\sigma^2+\hat\tau_1^2)$ with respect to the elements of $(z_1,y_2)$, where $\hat\sigma^2$ is the error-variance estimate from the first-stage OLS regression:

$$\widehat{APE}\text{ with respect to } z_1 = N^{-1}\sum_{i=1}^N\Phi(\hat\alpha_{1s}y_{2i}+z_{1i}\hat\delta_{1s})\,\hat\delta_1,\qquad
\widehat{APE}\text{ with respect to } y_2 = N^{-1}\sum_{i=1}^N[\hat m(y_{2i}=c+1)-\hat m(y_{2i}=c)]$$

where $\hat m(y_{2i}=c)=\Phi(\hat\alpha_{1s}c+z_{1i}\hat\delta_{1s})(\hat\alpha_1 c+z_{1i}\hat\delta_1)+\sqrt{\hat\eta_1^2\hat\sigma^2+\hat\tau_1^2}\,\phi(\hat\alpha_{1s}c+z_{1i}\hat\delta_{1s})$.

An alternative method obtains the APEs from the derivatives or differences of $E_{a_1}[m(\alpha_1 y_2+z_1\delta_1+\eta_1 a_1,\ \tau_1^2)]$, where $m(z_1,y_2,a_1,\tau_1^2)=m(x,\tau_1^2)=\Phi(x/\tau_1)x+\tau_1\phi(x/\tau_1)$ with $x=\hat\alpha_1 y_2+z_1\hat\delta_1+\hat\eta_1\hat a_{1i}$ and $\hat a_{1i}$ the residual from the first-stage estimation:

$$\widehat{APE}\text{ with respect to } z_1 = N^{-1}\sum_{i=1}^N\Phi(x/\hat\tau_1)\,\hat\delta_{11},\qquad
\widehat{APE}\text{ with respect to } y_2 = N^{-1}\sum_{i=1}^N[\hat m_1-\hat m_0]$$

where $\hat m_0=\hat m[y_2=0]$ and $\hat m_1=\hat m[y_2=1]$. For more details, see the Blundell–Smith procedure and the APEs in Wooldridge (2002), Chapter 16.
Appendix C

In order to compare the NLS and the QML estimation, the basic framework is introduced below. The first stage is to estimate $\delta_2$ and $\delta_0$ by using the step-wise maximum likelihood of $y_{i2}$ on $z_i$ in the Negative Binomial model, and obtain the estimated parameters $\hat{\delta}_2$ and $\hat{\delta}_0$. In the second stage, instead of using the QMLE, we use the NLS of $y_{i1}$ on $y_{i2}, z_{i1}$ to estimate $\alpha_1$, $\delta_1$, and $\eta_1$ with the approximated conditional mean $m_i(\theta; y_2, z)$. The NLS estimator of $\theta$ solves
$$\min_{\theta\in\Theta}\; N^{-1}\sum_{i=1}^{N}\left[y_{1i} - \int_{-\infty}^{+\infty}\Phi(\alpha_1 y_{2i} + z_{1i}\delta_1 + \eta_1 a_{1i})\, f(a_1 \mid y_{2i}, z_i)\, da_{1i}\right]^2$$
or
$$\min_{\theta\in\Theta}\; N^{-1}\sum_{i=1}^{N}\big[y_{1i} - m_i(\theta; y_{2i}, z_i)\big]^2/2$$
The score function can be written as
$$s_i = (y_{1i} - m_i)\int_{-\infty}^{+\infty} g_i'\,\phi(g_i h)\, f(a_1 \mid y_{2i}, z_i)\, da_1$$
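One generic way to operationalize this simulated NLS estimator is sketched below: the inner integral is replaced by an average over draws of $a_1$, and the resulting objective is minimized numerically. The data, the value of $\delta_0$, and the draw distribution (the log-gamma density of Appendix D rather than the model's conditional density $f(a_1\mid y_{2i}, z_i)$) are simplifying assumptions for illustration only.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)

# --- illustrative simulated data (placeholders, not the chapter's data) ---
N, R = 500, 100
z1 = rng.normal(size=N)
a1_true = np.log(rng.gamma(shape=2.0, scale=1.0 / 2.0, size=N))   # log-gamma heterogeneity
y2 = rng.poisson(np.exp(0.5 * z1 + a1_true)).astype(float)
y1 = (norm.cdf(0.4 * y2 + 0.8 * z1 + 0.6 * a1_true) > rng.uniform(size=N)).astype(float)

# First stage assumed done: delta_0 fixed at an illustrative value
delta0_hat = 2.0
# Draws of a1 = ln(G), G ~ Gamma(delta0, 1/delta0), standing in for draws from f(a1 | y2i, zi)
a_draws = np.log(rng.gamma(shape=delta0_hat, scale=1.0 / delta0_hat, size=R))

def m_i(theta):
    """Approximate conditional mean: average of Phi(a1*y2 + d1*z1 + eta1*a) over the draws."""
    alpha1, d1, eta1 = theta
    idx = alpha1 * y2[:, None] + d1 * z1[:, None] + eta1 * a_draws[None, :]
    return norm.cdf(idx).mean(axis=1)

def nls_objective(theta):
    resid = y1 - m_i(theta)
    return 0.5 * np.mean(resid ** 2)

res = minimize(nls_objective, x0=np.zeros(3), method="BFGS")
print(res.x)
```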
Appendix D

We are given $\exp(a_1)$ distributed as $\mathrm{Gamma}(\delta_0, 1/\delta_0)$ using a single parameter $\delta_0$. We are interested in obtaining the density function of $Y = a_1$. Let $X = \exp(a_1)$. The density function of $X$ is specified as follows:
$$f(X; \delta_0) = \frac{\delta_0^{\delta_0} X^{\delta_0 - 1}\exp(-\delta_0 X)}{\Gamma(\delta_0)}, \qquad X > 0,\; \delta_0 > 0$$
Since $X > 0$ and $Y = \ln(X)$, $dX/dY = \exp(Y)$ and $Y \in (-\infty, \infty)$. The density function of $Y$ will be derived as
$$f(Y; \delta_0) = f[h(Y)]\left|\frac{dX}{dY}\right|, \qquad Y \in (-\infty, \infty)$$
where $f[h(Y)] = \delta_0^{\delta_0}\exp(Y)^{\delta_0 - 1}\exp[-\delta_0\exp(Y)]/\Gamma(\delta_0)$. Plugging in $Y = a_1$, we get
$$f(Y; \delta_0) = \frac{\delta_0^{\delta_0}\exp(a_1)^{\delta_0}\exp[-\delta_0\exp(a_1)]}{\Gamma(\delta_0)}$$
which is Eq. (4).
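A quick numerical check of this change of variables is sketched below, using an arbitrary value of $\delta_0$: draw $X \sim \mathrm{Gamma}(\delta_0, 1/\delta_0)$, set $Y = \ln X$, and compare the sample with the derived density.

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(42)
delta0 = 1.5                      # arbitrary illustrative value

# Draw X ~ Gamma(delta0, scale = 1/delta0) and transform to Y = ln(X)
x = rng.gamma(shape=delta0, scale=1.0 / delta0, size=200_000)
y = np.log(x)

def log_density_y(y_val, d0):
    """Log of the derived density: d0^d0 * exp(d0*y) * exp(-d0*exp(y)) / Gamma(d0)."""
    return d0 * np.log(d0) + d0 * y_val - d0 * np.exp(y_val) - gammaln(d0)

# Compare the histogram of Y with the analytical density at bin midpoints
grid = np.linspace(-4, 2, 7)
hist, edges = np.histogram(y, bins=grid, density=True)
midpoints = 0.5 * (edges[:-1] + edges[1:])
print(np.round(hist, 3))
print(np.round(np.exp(log_density_y(midpoints, delta0)), 3))
```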
ALTERNATIVE RANDOM EFFECTS PANEL GAMMA SML ESTIMATION WITH HETEROGENEITY IN RANDOM AND ONE-SIDED ERROR

Saleem Shaik and Ashok K. Mishra

ABSTRACT

In this chapter, we utilize the residual concept of productivity measures defined in the context of the normal-gamma stochastic frontier production model with heterogeneity to differentiate productivity and inefficiency measures. In particular, three alternative two-way random effects panel estimators of the normal-gamma stochastic frontier model are proposed using simulated maximum likelihood estimation techniques. For the three alternative panel estimators, we use a generalized least squares procedure involving the estimation of variance components in the first stage and the estimated variance–covariance matrix to transform the data. Empirical estimates indicate differences in the parameter coefficients of the gamma distribution, production function, and heterogeneity function variables between the pooled and the two alternative panel estimators. The difference between the pooled and panel models suggests the need to account for spatial, temporal, and within residual variations as in the Swamy–Arora estimator, and within residual variation in the Amemiya estimator with a panel framework. Finally, results from this study indicate that short- and long-run variations in financial exposure (solvency, liquidity, and efficiency) play an important role in explaining the variance of inefficiency and productivity.

Maximum Simulated Likelihood Methods and Applications
Advances in Econometrics, Volume 26, 299–322
Copyright © 2010 by Emerald Group Publishing Limited
All rights of reproduction in any form reserved
ISSN: 0731-9053/doi:10.1108/S0731-9053(2010)00000260013
1. INTRODUCTION

The stochastic frontier model, introduced simultaneously by Aigner, Lovell, and Schmidt, Meeusen and van den Broeck, and Battese and Corra in 1977, decomposes the error term e into a symmetrical random error v and a one-sided error or inefficiency u.1 Aigner et al. (1977) assumed a normal–half normal and an exponential distribution while Meeusen and van den Broeck (1977) assumed an exponential distribution of the inefficiency term, u. In 1982, Jondrow, Materov, Lovell, and Schmidt suggested a method to estimate firm-specific inefficiency measures. In 1980, Greene proposed a flexible frontier production model and followed up with a gamma-distributed stochastic frontier model in 1990. In 2003, Greene proposed the simulated maximum likelihood (SML) estimation of the normal-gamma stochastic frontier model to overcome the complexity associated with the log likelihood function.2 Since the 1990s, other theoretical contributions to stochastic frontier analysis include the estimation of time-invariant and time-varying models under alternative distributions of the one-sided inefficiency term for panel data. However, most panel stochastic frontier models include fixed effects dummies under time-invariant models and variations in the intercept (Schmidt & Sickles, 1984). This time-invariance assumption was relaxed later in the estimation of stochastic frontier models. Still, these models force the fixed effects dummies not only to capture the heterogeneity but also to estimate inefficiency across cross-sectional units. The proposed random effects time-invariant and time-varying stochastic frontier models primarily dealt with one-way random effects associated with cross-sectional variation (see Cornwell, Schmidt, & Sickles, 1990; Battese & Coelli, 1992). Additionally, research has focused on the influence of a broader set of determinants of inefficiency, heteroskedasticity, and heterogeneity, namely, geographic, market structure conduct and performance, policy, and size of the firm variables, using a two-step procedure. The two-step procedure has been the subject of analysis by earlier researchers, though it might be biased due to omitted variables (see Wang & Schmidt, 2002; Greene, 2004). Other extensions of the stochastic frontier analysis include differentiation into productivity and inefficiency measures using panel data. Productivity,
identified with residual error (Abramovitz, 1956; Solow, 1957), is defined as the difference between the log of the output and log of the input. In this chapter, we first utilize the residual concept of productivity measures defined by Abramovitz (1956) in the context of stochastic frontier production model to differentiate productivity and efficiency measures (see below). Following Greene (2004), instead of a two-step process, a stochastic frontier model with heteroskedasticity of v random error term identified as productivity and u one-sided error term identified as inefficiency is used to examine the importance of short- and long-run variation in financial ratios (liquidity, solvency, and efficiency) on productivity and inefficiency. Second, three alternative two-way random effects panel estimators of normal-gamma stochastic frontier model are proposed, using SML estimation techniques. For the three alternative panel estimators, we use a generalized least squares procedure3 involving the estimation of variance components in the first stage and the use of the estimated variance– covariance matrix to transform the data in the second stage. Several possibilities exist for the first stage, namely the use of pooled OLS residuals as in Wallace–Hussain (WH) approach (1969), within residuals as in Amemiya (AM) approach (1971) or within residuals, between crosssectional residuals and between time-series residuals as in Swamy–Arora (SA) approach (1972) in the estimation of alternative panel estimators. In Section 2, we extend Greene’s (2003) normal-gamma SML stochastic frontier methodology to include technical efficiency and productivity. Then three alternative two-way random effects panel estimators of normalgamma SML stochastic frontier model are presented. Section 3 will provide details of the panel data used in the analysis and simulation. Application of the three alternative two-way random effects panel estimators of normalgamma SML stochastic frontier is presented in Section 4, and some conclusions are drawn in Section 5.
2. RANDOM EFFECTS PANEL GAMMA SML STOCHASTIC FRONTIER MODELS

2.1. Stochastic Frontier Models to Include Efficiency and Productivity

Following Greene (2003), the gamma SML stochastic frontier model can be used to represent a Cobb–Douglas production function as
$$y = f(x; \beta) + v - u \tag{1}$$
where $y$ is the output and $x$ is a vector of inputs used in the production function, $\beta$ is a vector of coefficients associated with the inputs, $v$ represents the random error with $v \sim N(0, \sigma^2_v)$, and $u$ represents the one-sided inefficiency, which can be represented with alternative distributions including the normal-gamma with a scale parameter $q$ and a shape parameter $P$. The normal-gamma distribution of the inefficiency, $u$, following Greene (2003), can be defined as
$$f(u) = \frac{q^P \exp(-qu)\, u^{P-1}}{\Gamma(P)} \tag{2}$$
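To illustrate how simulation makes the normal-gamma composed error tractable, the sketch below approximates the density of $\epsilon = v - u$ by averaging the normal density of $v$ over gamma draws for $u$ generated from a Halton sequence. This is a generic simulator in the spirit of SML rather than Greene's (2003) exact formulation, and the parameter values are arbitrary illustrations.

```python
import numpy as np
from scipy.stats import norm, gamma, qmc

# Arbitrary illustrative parameters: gamma scale q, shape P, and sigma_v
q, P, sigma_v = 2.0, 1.5, 0.3
R = 50                                    # number of draws; the chapter reports using 50 Halton draws

# Gamma(P, scale = 1/q) draws for u via the inverse CDF applied to a Halton sequence
halton = qmc.Halton(d=1, scramble=True, seed=0).random(R).ravel()
u_draws = gamma.ppf(halton, a=P, scale=1.0 / q)

def simulated_density(eps):
    """f_eps(e) = E_u[ f_v(e + u) ], approximated by averaging over the R gamma draws."""
    return norm.pdf(eps[:, None] + u_draws[None, :], scale=sigma_v).mean(axis=1)

eps_grid = np.linspace(-3.0, 1.0, 9)
print(np.round(simulated_density(eps_grid), 4))
```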
Eq. (1) with the normal-gamma distribution can be extended by introducing heterogeneity in the random error, $v$, and the one-sided inefficiency, $u$, as
$$y = f(x; \beta) + v - u, \qquad \sigma^2_u = \exp(\delta' z), \qquad \sigma^2_v = \exp(\delta' z) \tag{3}$$
where $\sigma^2_u$ is the variance of the inefficiency term and $\sigma^2_v$ is the variance of the random error. The variances of the inefficiency and random error terms can be modeled as a function of the variance in variables $z$. Here, we define the variances as a function of the variance of financial ratio variables, which include solvency, liquidity, and efficiency. The inefficiency and random error variances in Eq. (3) can be paraphrased as variances of inefficiency and productivity measures.

Productivity or total factor productivity (TFP) is defined as the ratio of outputs over inputs. Mathematically, the production function assuming inefficiency can be represented as $y = f(\tilde{x}; \beta) + v$, where $f(\tilde{x}; \beta)$ is equal to $f(x; \beta) - u$. This production function, assuming inefficiency, can be used to represent TFP as $v = y/f(\tilde{x}; \beta)$. The productivity concept could be incorporated into the stochastic frontier production function (SFPF) with decomposed error terms, $y = f(x; \beta) + v - u$, where $v$ constitutes a conventional random error or TFP and $u$ constitutes a one-sided disturbance that is distributed as normal-gamma and represents inefficiency.

The SFPF with heteroskedasticity listed above is used to examine the importance of short- and long-run variation in liquidity, solvency, and efficiency financial ratios. Specifically, the model can be represented as
$$y = f(x; \beta) + v - u \qquad \ldots\; \text{Output}$$
$$\sigma^2_{\text{inefficiency}} = \exp(\delta' z) \qquad \ldots\; \text{Inefficiency}$$
$$\sigma^2_{\text{productivity}} = \exp(\delta' z) \qquad \ldots\; \text{Productivity} \tag{4}$$
2.2. Panel Gamma SML Stochastic Frontier Models

The time-series or cross-section gamma SML stochastic frontier model, Eq. (4), can be extended to one- and two-way fixed or random effects panel models. The basic panel gamma SML stochastic frontier production function with heterogeneity in the random error $v$ and the one-sided inefficiency $u$ can be represented as
$$y_{it} = f(x_{it}; \beta) + v_{it} - u_{it} \quad\text{or}\quad y_{it} = f(x_{it}; \beta) - u_{it} + v_{it}, \qquad \sigma^2_{u_{it}} = \exp(\delta' z_{it}), \qquad \sigma^2_{v_{it}} = \exp(\delta' z_{it}) \tag{5}$$
where $i = 1, \ldots, N$ indexes the cross-section observations (in our case 48) and $t = 1, \ldots, T$ the number of years (in our case 44), $y$ is the output vector of dimension $NT \times 1$, $x$ is the $NT \times K$ matrix of inputs, and $\beta$ is $K \times 1$ with $K$ being the number of explanatory variables used in the production function.

Let us consider a one-way error disturbance gamma SML stochastic frontier production function
$$y_{it} = f(x_{it}; \beta) + v_{it} - u_{it}, \quad\text{with } v_{it} = \mu_i + \epsilon_{it}, \qquad \sigma^2_{u_{it}} = \exp(\delta' z_{it}), \qquad \sigma^2_{v_{it}} = \exp(\delta' z_{it}) \tag{6}$$
where the random error for the one-way random effects model can be represented as $v_{it} = \mu_i + \epsilon_{it}$, with $\mu_i$ representing the temporally invariant cross-section or spatial effect and $\epsilon_{it}$ representing the remaining random error. If the $\mu_i$ representing individual cross-sectional units are assumed to be fixed, a one-way fixed effects gamma SML stochastic frontier production function with heterogeneity in the random error $v_{it}$ and the one-sided
inefficiency $u_{it}$ can be written as
$$y_{it} = f(x_{it}; \beta, Z_l, \mu_i) - u_{it} + \epsilon_{it}, \qquad \sigma^2_{u_{it}} = \exp(\delta' z_{it}), \qquad \sigma^2_{v_{it}} = \exp(\delta' z_{it}) \tag{7}$$
where $Z_l$ is a vector of individual cross-sectional dummies and $\mu_i$ is the associated parameter of the cross-sectional dummies.

An alternative to the estimation of too many parameters (dummies) is to treat $\mu_i$ as random, which leads to the one-way random effects model. The one-way random panel gamma SML stochastic frontier production function with heterogeneity in the random error $v_{it}$ and the one-sided inefficiency $u_{it}$ can be represented as
$$y_{it} = f(x_{it}; \beta) - u_{it} + \mu_i + \epsilon_{it}, \qquad \sigma^2_{u_{it}} = \exp(\delta' z_{it}), \qquad \sigma^2_{v_{it}} = \exp(\delta' z_{it}) \tag{8}$$
where $\mu_i$ is the temporally invariant spatial error, normally distributed with mean zero and variance $\sigma^2_\mu$; $\epsilon_{it}$ is the remaining random error, normally distributed with mean zero and variance $\sigma^2_\epsilon$; and $\mu_i$ is independent of $\epsilon_{it}$. Further, $x_{it}$ is independent of $\mu_i$ and $\epsilon_{it}$ for all $i$ and $t$.

Similarly, the two-way error gamma SML stochastic frontier production function with heterogeneity in the random error, $v_{it}$, and the one-sided inefficiency, $u_{it}$, can be represented as
$$y_{it} = f(x_{it}; \beta) + v_{it} - u_{it}, \quad\text{with } v_{it} = \mu_i + \lambda_t + \epsilon_{it}, \qquad \sigma^2_{u_{it}} = \exp(\delta' z_{it}), \qquad \sigma^2_{v_{it}} = \exp(\delta' z_{it}) \tag{9}$$
where $\mu_i$ represents the temporally invariant cross-section or spatial effect, $\lambda_t$ represents the spatially invariant time-series or temporal effect, and $\epsilon_{it}$ represents the remainder random error. If $\mu_i$ and $\lambda_t$, representing individual cross-sectional and time-series units, respectively, are assumed to be fixed, a two-way fixed effects gamma SML stochastic frontier production function can be written as
$$y_{it} = f(x_{it}; \beta, Z_l, \mu_i, Z_k, \lambda_t) - u_{it} + \epsilon_{it} \tag{10}$$
where $Z_l$ is a vector of individual cross-sectional dummies and $\mu_i$ is a vector of the associated parameters of the cross-sectional dummies, $Z_k$ is a vector of
individual time-series dummies, and $\lambda_t$ is the associated parameter of the time-series dummies.

Similarly, one can assume that $\mu_i$ and $\lambda_t$ are random – leading to a two-way random effects model. The two-way random panel gamma SML stochastic frontier production function with heterogeneity in the random error, $v_{it}$, and the one-sided inefficiency, $u_{it}$, can be represented as
$$y_{it} = f(x_{it}; \beta) - u_{it} + \mu_i + \lambda_t + \epsilon_{it}, \qquad \sigma^2_{u_{it}} = \exp(\delta' z_{it}), \qquad \sigma^2_{v_{it}} = \exp(\delta' z_{it}) \tag{11}$$
where $\mu_i$ is the temporally invariant spatial error with $\mu_i \sim N(0, \sigma^2_\mu)$, $\lambda_t$ is the spatially invariant temporal error with $\lambda_t \sim N(0, \sigma^2_\lambda)$, and $\mu_i$, $\lambda_t$, and $\epsilon_{it}$ are independent. Further, $x_{it}$ is independent of $\mu_i$, $\lambda_t$, and $\epsilon_{it}$ for all $i$ and $t$.

2.3. Alternative Two-Way Panel Estimators of Gamma SML Stochastic Frontier Models

The two-way random effects stochastic frontier production function with heterogeneity in the random error, $v_{it}$, and the one-sided inefficiency, $u_{it}$, can be represented as
$$y_{it} = f(x_{it}; \beta) - u_{it} + v_{it}, \qquad \sigma^2_{u_{it}} = \exp(\delta' z_{it}), \qquad \sigma^2_{v_{it}} = \exp(\delta' z_{it}) \tag{12}$$
where $u = (u_{11}, \ldots, u_{1T}, u_{21}, \ldots, u_{2T}, \ldots, u_{N1}, \ldots, u_{NT})'$, $v = (v_{11}, \ldots, v_{1T}, v_{21}, \ldots, v_{2T}, \ldots, v_{N1}, \ldots, v_{NT})'$, and $v = Z_l \mu + Z_k \lambda + Z_\epsilon \epsilon$ with
$$Z_l = (I_N \otimes i_T), \qquad \mu' = (\mu_1, \mu_2, \ldots, \mu_N)$$
$$Z_k = (I_T \otimes i_N), \qquad \lambda' = (\lambda_1, \lambda_2, \ldots, \lambda_T)$$
$$Z_\epsilon = (I_N \otimes I_T), \qquad \epsilon' = (\epsilon_1, \epsilon_2, \ldots, \epsilon_{NT})$$
where $I_N$ and $I_T$ ($i_N$ and $i_T$) represent identity matrices (vectors of ones) of $N$ and $T$ dimensions ($T$ and $N$ dimensions), respectively, and $\mu$, $\lambda$, and $\epsilon$ represent the
random error components with zero mean and covariance matrix
$$E\begin{pmatrix}\mu \\ \lambda \\ \epsilon\end{pmatrix}\big(\mu' \;\; \lambda' \;\; \epsilon'\big) = \begin{pmatrix}\sigma^2_\mu I_N & 0 & 0\\ 0 & \sigma^2_\lambda I_T & 0\\ 0 & 0 & \sigma^2_\epsilon I_{NT}\end{pmatrix} \tag{13}$$
The error variance–covariance matrix ($\Omega$) of the gamma SML stochastic frontier production function can be represented as
$$\Omega \equiv \sigma^2_v = \sigma^2_\mu Z_l Z_l' + \sigma^2_\lambda Z_k Z_k' + \sigma^2_\epsilon (I_N \otimes I_T) \tag{14}$$
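A small numerical sketch of how the error components assemble into $\Omega$ with Kronecker products is given below. The toy dimensions, the variance-component values, and the stacking convention (time running fastest within each cross-sectional unit, so the time-effect design is taken as $i_N \otimes I_T$) are assumptions of this sketch and may differ from the chapter's exact notational convention.

```python
import numpy as np

N, T = 4, 3                      # small illustrative dimensions (the chapter has N = 48, T = 44)
s2_mu, s2_lam, s2_eps = 0.5, 0.2, 1.0

i_T = np.ones((T, 1))
i_N = np.ones((N, 1))

# Design matrices for the two-way error components, observations stacked i-major (time fastest)
Z_mu = np.kron(np.eye(N), i_T)          # NT x N, loads the cross-section effect mu_i
Z_lam = np.kron(i_N, np.eye(T))         # NT x T, loads the time effect lambda_t

# Omega = s2_mu * Z_mu Z_mu' + s2_lam * Z_lam Z_lam' + s2_eps * I_NT
omega = (s2_mu * Z_mu @ Z_mu.T
         + s2_lam * Z_lam @ Z_lam.T
         + s2_eps * np.eye(N * T))
print(omega.shape, np.allclose(omega, omega.T))
```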
Finally, to estimate the two-way error component gamma SML stochastic frontier model with heterogeneity in the random error, $v_{it}$, and the one-sided inefficiency, $u_{it}$, we need to transform Eq. (12):
$$y^*_{it} = f(x^*_{it}; \beta) - u_{it} + v_{it}, \qquad \sigma^2_{u_{it}} = \exp(\delta' z_{it}), \qquad \sigma^2_{v_{it}} = \exp(\delta' z_{it}) \tag{15}$$
where $y^*_{it} = \Omega^{-1/2} y_{it}$ and $x^*_{it} = \Omega^{-1/2} x_{it}$ with $\Omega$ defined in Eq. (14), or
$$y^*_{it} = y_{it} - \theta_1 \bar{y}_{i\cdot} - \theta_2 \bar{y}_{\cdot t} + \theta_3 \bar{y}_{\cdot\cdot}$$
where
$$\theta_1 = 1 - \frac{\sigma_\epsilon}{\varphi_2^{1/2}}, \qquad \theta_2 = 1 - \frac{\sigma_\epsilon}{\varphi_3^{1/2}}, \qquad \theta_3 = \theta_1 + \theta_2 + \frac{\sigma_\epsilon}{\varphi_4^{1/2}} - 1$$
$\bar{y}_{i\cdot}$, $\bar{y}_{\cdot t}$, and $\bar{y}_{\cdot\cdot}$ in the above equation represent the cross-section, time-series, and overall means of the variable and are computed as $\bar{y}_{i\cdot} = \sum_{t=1}^{T} y_{it}/T$, $\bar{y}_{\cdot t} = \sum_{n=1}^{N} y_{it}/N$, and $\bar{y}_{\cdot\cdot} = \sum_{n=1}^{N}\sum_{t=1}^{T} y_{it}/NT$, respectively. The phi's, $\hat{\varphi}_2 = T\sigma^2_\mu + \sigma^2_\epsilon$, $\hat{\varphi}_3 = N\sigma^2_\lambda + \sigma^2_\epsilon$, and $\hat{\varphi}_4 = T\sigma^2_\mu + N\sigma^2_\lambda + \sigma^2_\epsilon$, are obtained from the variances of the between cross-section, between time-period, and within cross-section time-period errors. The phi's used in the computation of the thetas ($\theta_1$, $\theta_2$, and $\theta_3$) can be estimated by: (1) using residuals estimated from the pooled gamma SML stochastic frontier model, as proposed in the WH approach; or (2) using residuals estimated from the within gamma SML stochastic frontier model, as proposed in the AM approach; or (3) using the residuals estimated from the within, between cross-section, and between time-series gamma SML stochastic frontier models, as proposed in the SA approach.
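The following sketch turns a set of variance components into the $\varphi$'s and $\theta$'s and applies the two-way transformation $y^*_{it} = y_{it} - \theta_1\bar{y}_{i\cdot} - \theta_2\bar{y}_{\cdot t} + \theta_3\bar{y}_{\cdot\cdot}$. The panel and the variance-component values are placeholders; in practice they would come from one of the WH, AM, or SA first-stage estimates described below.

```python
import numpy as np

def two_way_gls_transform(y, s2_eps, s2_mu, s2_lam):
    """y is an N x T array; returns the GLS-transformed y* for the two-way random effects model."""
    N, T = y.shape
    phi2 = T * s2_mu + s2_eps
    phi3 = N * s2_lam + s2_eps
    phi4 = T * s2_mu + N * s2_lam + s2_eps
    theta1 = 1.0 - np.sqrt(s2_eps / phi2)
    theta2 = 1.0 - np.sqrt(s2_eps / phi3)
    theta3 = theta1 + theta2 + np.sqrt(s2_eps / phi4) - 1.0
    ybar_i = y.mean(axis=1, keepdims=True)      # cross-section means
    ybar_t = y.mean(axis=0, keepdims=True)      # time means
    ybar = y.mean()                             # overall mean
    return y - theta1 * ybar_i - theta2 * ybar_t + theta3 * ybar

rng = np.random.default_rng(3)
y = rng.normal(size=(48, 44))                   # placeholder panel with the chapter's dimensions
y_star = two_way_gls_transform(y, s2_eps=1.0, s2_mu=0.5, s2_lam=0.2)
print(y_star.shape)
```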
2.3.1. Swamy–Arora (SA) Estimator

The Swamy–Arora transformation involves the use of residuals obtained from the within, between time-series, and between cross-section gamma SML stochastic frontier models, which requires estimating the following three models.

Within Gamma SML Stochastic Frontier Production Function Model. Consider the within gamma SML stochastic frontier production function with heterogeneity in the random error, $v_{it}$, and the one-sided inefficiency, $u_{it}$,
$$\tilde{y}_{it} = f(\tilde{x}_{it}; \beta) - u_{it} + v_{it}, \qquad \sigma^2_{u_{it}} = \exp(\delta' \tilde{z}_{it}), \qquad \sigma^2_{v_{it}} = \exp(\delta' \tilde{z}_{it}) \tag{16}$$
where
$$\tilde{y}_{it} = Q_1 y = \big[(I_N \otimes I_T) - (I_N \otimes \bar{J}_T) - (\bar{J}_N \otimes I_T) + (\bar{J}_N \otimes \bar{J}_T)\big]\, y = y_{it} - \bar{y}_{i\cdot} - \bar{y}_{\cdot t} + \bar{y}_{\cdot\cdot}$$
with $\bar{J}_T = i_T i_T'/T$ and $\bar{J}_N = i_N i_N'/N$; $\bar{y}_{i\cdot}$, $\bar{y}_{\cdot t}$, and $\bar{y}_{\cdot\cdot}$ are defined in Eq. (15) and $Q_1 = (I_N \otimes I_T) - (I_N \otimes \bar{J}_T) - (\bar{J}_N \otimes I_T) + (\bar{J}_N \otimes \bar{J}_T)$. Consistent estimates of the within error variance ($\sigma^2_\epsilon$) can be obtained using the within error $\tilde{v}_{it} = \tilde{y}_{it} - \tilde{x}_{it}\beta$ estimated from the within SML stochastic frontier production function model:
$$\hat{\sigma}^2_\epsilon = \frac{\tilde{v}_{it}' Q_1 \tilde{v}_{it}}{\mathrm{trace}(Q_1)} \tag{17}$$

Between Cross-Section Gamma SML Stochastic Frontier Production Function. Similarly, consider a between cross-section gamma SML stochastic frontier production function with heterogeneity in the random error, $v_i$, and the one-sided inefficiency, $u_i$,
$$y_i = f(x_i; \beta) - u_i + v_i, \qquad \sigma^2_{u_i} = \exp(\delta' z_i), \qquad \sigma^2_{v_i} = \exp(\delta' z_i) \tag{18}$$
where
$$y_i = Q_2 y = \big[(I_N \otimes I_T) - (I_N \otimes \bar{J}_T)\big]\, y = y_{it} - \bar{y}_{i\cdot}$$
and $\bar{y}_{i\cdot}$ and $\bar{y}_{\cdot\cdot}$ are defined in Eq. (15) and $Q_2 = (I_N \otimes I_T) - (I_N \otimes \bar{J}_T)$. Consistent estimates of the between cross-section error variance ($\sigma^2_\mu$) can be obtained using the cross-section error $v_i = y_i - x_i\beta$ estimated from the between cross-section SML stochastic frontier production function model:
$$\hat{\sigma}^2_\mu = \frac{v_i' Q_2 v_i}{\mathrm{trace}(Q_2)} \tag{19}$$

Between Time-Series Gamma SML Stochastic Frontier Production Function. Finally, consider a between time-series gamma SML stochastic frontier production function with heterogeneity in the random error, $v_t$, and the one-sided inefficiency, $u_t$,
$$y_t = f(x_t; \beta) - u_t + v_t, \qquad \sigma^2_{u_t} = \exp(\delta' z_t), \qquad \sigma^2_{v_t} = \exp(\delta' z_t) \tag{20}$$
where
$$y_t = Q_3 y = \big[(I_N \otimes I_T) - (\bar{J}_N \otimes I_T)\big]\, y = y_{it} - \bar{y}_{\cdot t}$$
and $y_t$ and $\bar{y}_{\cdot t}$ are defined earlier in the text and $Q_3 = (I_N \otimes I_T) - (\bar{J}_N \otimes I_T)$. Consistent estimates of the between time-series error variance ($\sigma^2_\lambda$) can be obtained using the error $v_t = y_t - x_t\beta$ estimated from the between time-series SML stochastic frontier production function model:
$$\hat{\sigma}^2_\lambda = \frac{v_t' Q_3 v_t}{\mathrm{trace}(Q_3)} \tag{21}$$
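As a sketch of this first stage, the snippet below builds the $Q_1$, $Q_2$, and $Q_3$ projections for a small panel (i-major stacking, time fastest) and computes the variance components as the quadratic forms in Eqs. (17), (19), and (21). The residual vectors are random placeholders standing in for the residuals of the within and between SML frontier models.

```python
import numpy as np

N, T = 10, 8                    # small illustrative panel
I_NT = np.eye(N * T)
Jbar_T = np.ones((T, T)) / T
Jbar_N = np.ones((N, N)) / N

# Projection matrices corresponding to Q1, Q2, and Q3 above
Q1 = I_NT - np.kron(np.eye(N), Jbar_T) - np.kron(Jbar_N, np.eye(T)) + np.kron(Jbar_N, Jbar_T)
Q2 = I_NT - np.kron(np.eye(N), Jbar_T)
Q3 = I_NT - np.kron(Jbar_N, np.eye(T))

def variance_component(resid, Q):
    """sigma^2 = v' Q v / trace(Q), as in Eqs. (17), (19), and (21)."""
    return float(resid @ Q @ resid) / np.trace(Q)

rng = np.random.default_rng(7)
v_within = rng.normal(size=N * T)        # placeholder within-model residuals
v_between_cs = rng.normal(size=N * T)    # placeholder between cross-section residuals
v_between_ts = rng.normal(size=N * T)    # placeholder between time-series residuals

s2_eps = variance_component(v_within, Q1)
s2_mu = variance_component(v_between_cs, Q2)
s2_lam = variance_component(v_between_ts, Q3)
print(s2_eps, s2_mu, s2_lam)
```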
2.3.2. Wallace–Hussain (WH) Estimator

The Wallace–Hussain transformation involves the use of residuals from the pooled gamma SML stochastic frontier model with heterogeneity in the random error, $v_{it}$, and the one-sided inefficiency, $u_{it}$, as defined below:
$$\tilde{y}_{it} = f(\tilde{x}_{it}; \beta) - u_{it} + v_{it}, \qquad \sigma^2_{u_{it}} = \exp(\delta' \tilde{z}_{it}), \qquad \sigma^2_{v_{it}} = \exp(\delta' \tilde{z}_{it}) \tag{22}$$
Consistent estimates of the within error variance can be obtained from the residuals of the pooled gamma SML stochastic frontier model:
$$\hat{\sigma}^2_\epsilon = \frac{\tilde{v}_{it}' Q_1 \tilde{v}_{it}}{\mathrm{trace}(Q_1)} \tag{23}$$
The between cross-section error variance can be obtained from the residuals of the pooled gamma SML stochastic frontier model:
$$\hat{\sigma}^2_\mu = \frac{\tilde{v}_{it}' Q_1 \tilde{v}_{it}}{\mathrm{trace}(Q_2)} \tag{24}$$
and the between time-series error variance can be obtained from the residuals of the pooled gamma SML stochastic frontier model:
$$\hat{\sigma}^2_\lambda = \frac{\tilde{v}_{it}' Q_1 \tilde{v}_{it}}{\mathrm{trace}(Q_3)} \tag{25}$$
2.3.3. Amemiya (AM) Estimator

The Amemiya transformation involves the use of residuals from the within gamma SML stochastic frontier model with heterogeneity in the random error, $v$, and the one-sided inefficiency, $u$, as defined below:
$$y_{it} = f(x_{it}; \beta) - u_{it} + v_{it}, \qquad \sigma^2_{u_{it}} = \exp(\delta' z_{it}), \qquad \sigma^2_{v_{it}} = \exp(\delta' z_{it}) \tag{26}$$
Consistent estimates of the within error variance can be obtained from the residuals of the within gamma SML stochastic frontier model:
$$\hat{\sigma}^2_\epsilon = \frac{v_{it}' Q_1 v_{it}}{\mathrm{trace}(Q_1)} \tag{27}$$
The between cross-section error variance can be obtained from the residuals of the within gamma SML stochastic frontier model:
$$\hat{\sigma}^2_\mu = \frac{v_{it}' Q_1 v_{it}}{\mathrm{trace}(Q_2)} \tag{28}$$
and the between time-series error variance can be obtained from the residuals of the within gamma SML stochastic frontier model:
$$\hat{\sigma}^2_\lambda = \frac{v_{it}' Q_1 v_{it}}{\mathrm{trace}(Q_3)} \tag{29}$$
The empirical application of the above concepts and methods is simple and straightforward. Further, it is easy to estimate the two-way random effects gamma SML stochastic frontier production function by using the frontier module of the LIMDEP package:

Step 1: Estimate the pooled Eq. (26), within Eq. (16), between cross-section Eq. (18), and between time-series Eq. (20) gamma SML stochastic frontier production function models with heterogeneity in the random error, $v_{it}$, and the one-sided inefficiency, $u_{it}$, using the standard LIMDEP package. Specifically, estimate the gamma SML stochastic frontier model of $y_{it}$ on $x_{it}$ and $z_{it}$ for the pooled model, $\tilde{y}_{it}$ on $\tilde{x}_{it}$ and $\tilde{z}_{it}$ for the within model, and $y_i$ on $x_i$ and $z_i$ ($y_t$ on $x_t$ and $z_t$) for the between cross-section (between time-series) model.
Step 2: Compute estimates of $\sigma_\epsilon$, $\sigma_\mu$, and $\sigma_\lambda$ as described in Eqs. (17), (19), and (21), respectively, for the SA estimator; Eqs. (23), (24), and (25), respectively, for the WH estimator; and Eqs. (27), (28), and (29), respectively, for the AM estimator.
Step 3: Use the error variances $\sigma_\epsilon$, $\sigma_\mu$, and $\sigma_\lambda$ to develop $\hat{\varphi}_2$, $\hat{\varphi}_3$, and $\hat{\varphi}_4$ and estimate the thetas, $\theta_1$, $\theta_2$, $\theta_3$, in order to transform the output and input variables, $y_{it}$ and $x_{it}$, respectively (see Eq. (15)), for each of the three alternative estimators (SA, WH, and AM), i.e., $y^*_{it} = y_{it} - \theta_1\bar{y}_{i\cdot} - \theta_2\bar{y}_{\cdot t} + \theta_3\bar{y}_{\cdot\cdot}$.
Step 4: Finally, estimate the gamma SML stochastic frontier model with heterogeneity in the random error, $v$, and the one-sided inefficiency, $u$, on this transformed model as represented in Eq. (15).
3. DATA AND VARIABLES USED IN THE ANALYSIS

The U.S. Department of Agriculture's Economic Research Service (ERS) constructs and publishes the state and aggregate production accounts for the farm sector.4 The features of the state and national production accounts are consistent with a gross output model of production and are well documented in Ball, Gollop, Kelly-Hawke, and Swinand (1999). Output is defined as gross production leaving the farm, as opposed to real value added (quantity index, base 1960 = 100). The price of land is based on hedonic
regressions. Specifically, the price of land in a state is regressed against land characteristics and location (state dummy). Ball et al. (1999) point out that the land characteristics are obtained from climatic and geographic data contained in the State Soil Geographic (STATSGO) database (USDA). In addition, a 'population accessibility' index for each county is used to estimate the price of land. The indexes are derived from a gravity model of urban development, which provides measures of accessibility to population concentrations. The index increases as population increases and/or distance from a population center decreases. The population accessibility index is calculated on the basis of the population within a 50-mile radius of each parcel. Prices of capital inputs are based on investment goods prices, taking into account the flow of capital services per unit of capital stock in each state (Ball, Bureau, Butault, & Nehring, 2001). All inputs are quantity indexes, with 1960 = 100. The financial exposure variables are defined as follows and are available from the U.S. Department of Agriculture's Economic Research Service.5 Financial solvency is defined as the ratio of total farm debt to total farm assets. It measures debt pledged against farm business assets, indicating overall financial risk. Financial liquidity is defined as the ratio of interest plus principal payments to gross cash farm income, and it measures the share of the farm business's gross income needed to service the debt. Finally, financial efficiency is defined as the ratio of gross cash farm income to farm business assets, and it measures the gross farm income generated per dollar of farm business assets. Table 1 presents the summary statistics of the output, inputs, and variables that measure financial exposure.
4. EMPIRICAL APPLICATION AND RESULTS

To evaluate the importance of accounting for panel data using alternative panel estimators – SA, WH, and AM – we use a generalized least squares procedure involving the estimation of the variance components in the first stage of the gamma SML stochastic frontier model with heterogeneity in the random and one-sided inefficiency. Specifically, pooled Eq. (5), within Eq. (16), between cross-section Eq. (18), and between time-series Eq. (20) gamma SML stochastic production frontier models with heterogeneity in the random and one-sided inefficiency are estimated. In the second stage, we use the estimated variance–covariance matrix to transform the data as defined in Eq. (15). The SA panel estimator uses the errors estimated from
Table 1. Summary Statistics of Output, Input, and Financial Exposure Variables of the U.S. Agriculture Sector, 1961–2003.

Variable                                              Mean      SD        Minimum   Maximum
Output (quantity index, 1960 = 100)                   140.99    46.9772   59.5198   336.103
Capital (quantity index, 1960 = 100)                  108.141   28.026    39.381    219.242
Land (quantity index, 1960 = 100)                     80.212    17.192    33.868    104.955
Labor (quantity index, 1960 = 100)                    59.003    21.430    14.388    134.597
Chemicals (quantity index, 1960 = 100)                227.881   209.809   28.818    2901.580
Energy (quantity index, 1960 = 100)                   118.530   31.162    51.795    322.732
Materials (quantity index, 1960 = 100)                130.013   46.022    41.752    380.328
Measures of financial exposure
Liquidity: debt servicing ratio                       0.170     0.065     0.050     0.480
Liquidity debt servicing – long-run risk (a)          0.038     0.022     0.000     0.112
Liquidity debt servicing – short-run risk (b)         0.018     0.014     0.000     0.095
Solvency: debt-to-asset ratio                         15.876    4.199     3.533     31.640
Solvency debt to asset – long-run risk                2.100     0.954     0.007     5.073
Solvency debt to asset – short-run risk               1.021     0.778     0.007     6.037
Efficiency: asset turnover ratio                      20.620    7.361     6.700     65.720
Efficiency asset turnover ratio – long-run risk       2.810     1.864     0.057     12.176
Efficiency asset turnover ratio – short-run risk      1.525     1.188     0.057     9.589

(a) Long-run risk is defined as the cumulative standard deviation of the financial variables.
(b) Short-run risk is defined as a 5-year moving standard deviation of the financial variables.
within, between cross-section, and between time-series models. In contrast, the WH and AM panel estimator uses the errors estimated from the pooled and within models, respectively. To examine the importance of short and long variation in financial exposure on efficiency and productivity variance, gamma SML stochastic frontier model with heterogeneity in the random error, v, and the one-sided inefficiency, u, as defined in Eq. (5) is estimated. Second, the SA and AM panel gamma SML stochastic frontier production function with heterogeneity in the random error, v; and the one-sided inefficiency, u; as defined by Eq. (15), is used to estimate variance in financial exposure variables. The output and inputs in the production function equation are estimated using the logs of the variables and the variance in solvency, liquidity, and efficiency – financial exposure variables – while heteroskedasticity in inefficiency and productivity variance function is estimated in levels. Gamma SML stochastic frontier analysis of the production function with heteroskedasticity is estimated following Greene (2007). We specified a Cobb–Douglas functional form for the pooled, time-series, cross-section, SA, and AM panel gamma SML stochastic frontier models
with heterogeneity in the random error, v, and the one-sided inefficiency, u. The long- and short-run variances of the financial exposure variables were specified in the inefficiency and productivity heteroskedasticity variance functions. Further, we estimated the SML function with 50 Halton draws for all the models. The Cobb–Douglas functional form with heteroskedasticity was specified as:
$$\mathrm{Output}_{it} = \beta_0 + \beta_1 \mathrm{Capital}_{it} + \beta_2 \mathrm{Land}_{it} + \beta_3 \mathrm{Labor}_{it} + \beta_4 \mathrm{Chemicals}_{it} + \beta_5 \mathrm{Energy}_{it} + \beta_6 \mathrm{Materials}_{it} + \beta_7 \mathrm{Year} + \epsilon_{it}$$
$$\sigma^2_u = \gamma_{1,u}\,\mathrm{LR\ risk}_{it} + \gamma_{2,u}\,\mathrm{SR\ risk}_{it}$$
$$\sigma^2_v = \gamma_{1,v}\,\mathrm{LR\ risk}_{it} + \gamma_{2,v}\,\mathrm{SR\ risk}_{it} \tag{30}$$
where LR risk is the long-run risk, defined as the cumulative standard deviation of the financial exposure variable, and SR risk is the short-run risk, defined as a 5-year moving standard deviation of the financial exposure variable.

4.1. Pooled Gamma SML Stochastic Frontier Models

Parameter estimates of the pooled gamma SML stochastic frontier production function are presented in Table 2. In addition to showing the variables related to the production function, the table also demonstrates the impact of short- and long-run variations in financial – solvency, liquidity, and efficiency – risk variables due to the heterogeneity in the random error v and the one-sided inefficiency u specification related to the production function. Each of these variables measures a facet of financial risk faced in agriculture. By including them in the production function, we can easily assess the impact of short- and long-run variations in financial exposure variables, in three different ways, on efficiency and productivity, after accounting for technology change. Recall that the independent and dependent variables are in logarithms; hence, the coefficients represent elasticities of the endogenous variable with respect to exogenous variation. Results from Table 2 suggest that in the case of the pooled gamma SML, all three models of variations in financial exposure (solvency, liquidity, and efficiency) perform equally well. The year, a proxy for technology, is positively related to agricultural output, and returns to scale ranged from 0.91 for the financial liquidity model to as low as 0.84 for the financial efficiency model, with the inclusion of technology.6 The theta, P, and sigma (v) were positive and significant. In each case, the input variables in the production
Table 2. Pooled Gamma SML Stochastic Frontier Production Function Results for Solvency, Liquidity, and Efficiency Financial Variables.

                        Financial Solvency          Financial Liquidity         Financial Efficiency
                        Coefficient   P[|Z|>z]      Coefficient   P[|Z|>z]      Coefficient   P[|Z|>z]
Constant                21.4854       <0            24.5737       <0            21.4817       <0
Capital                 0.0365        0.0087        0.05892       <0            0.0517        0.0001
Land                    0.0930        <0            0.12895       <0            0.0843        <0
Labor                   0.0717        <0            0.08458       <0            0.0744        <0
Chemicals               0.0609        <0            0.05044       <0            0.0648        <0
Energy                  0.1361        <0            0.11983       <0            0.1371        <0
Materials               0.4473        <0            0.4551        <0            0.4167        <0
Year                    0.0113        <0            0.01279       <0            0.0114        <0
Theta                   18.9495       <0            48.1545       0.0006        18.9551       <0
Shape parameter (P)     1.0483        0.001         1.59415       <0            1.0046        0.0005
Sigma (v)               0.0835        <0            0.06904       <0            0.0820        <0
Theta (1/u)
  Long-run risk         0.0014        0.9876        12.2275       0.0011        0.0188        0.7637
  Short-run risk        0.1927        0.0219        7.88384       0.0499        0.0613        0.3353
Productivity
  Long-run risk         0.1507        <0            12.9806       <0            0.0978        <0
  Short-run risk        0.0200        0.415         3.01444       0.1166        0.0008        0.9628
function are all positive and statistically significant at the 1 percent level of significance. An interesting finding here is that, in all three models (measuring financial risk in different ways), the estimated coefficients are very similar. Results indicate an input elasticity of materials ranging from 0.46 in the financial liquidity model to 0.42 in the financial efficiency model (Table 2), indicating that a 100 percent increase in the use of material input increases output by 46, 45, and 42 percent, respectively. The second factor with a significant impact on agricultural production in the United States is energy. Results in Table 2 indicate an input elasticity of energy ranging from 0.14 in financial solvency and efficiency models to 0.12 in the financial liquidity model (Table 2), indicating that a 100 percent increase in the use of energy would increase the output by 14 percent in the financial solvency and efficiency models. Meanwhile, a 100 percent increase in energy input would increase agricultural output by 12 percent in the financial liquidity model. Land input elasticity ranges from 0.13 percent for financial liquidity model to about 0.09 for the financial solvency and efficiency model. Land
ranks third with respect to the magnitude of contributions to agricultural output. Farm labor with an elasticity of 0.072, 0.085, and 0.074 for liquidity, solvency, and efficiency models, respectively. Finally, the impact of capital and chemicals on agricultural output are somewhat similar in all models. In the case of chemicals the elasticity ranges from 0.04 to 0.06 percent, while for capital input the elasticity ranges from 0.01 to 0.05 percent (Table 2). With regard to short- and long-run variations in financial exposure variables, our analysis shows some conflicting results. For example, the short-run financial solvency risk (variability in debt-to-asset ratio) variable in theta or inverse of inefficiency variance function is negative and significant at 2 percent level of significance. Accordingly, an increase in the variation of financial solvency (debt-to-asset ratio) decreases the variation in theta or inverse of inefficiency variance in the short-run. A possible explanation is that more indebted farmers are higher costs farmers, and hence more technically inefficient. This finding is consistent with the ‘‘Agency costs’’7 theory proposed by Nasr, Barry, and Ellinger (1998). In contrast, short-run financial liquidity risk variable has a positive and significant impact on theta or inverse of inefficiency variance. The positive sign indicates that short-run variation in financial liquidity (debt servicing ratio) would increase the variation in theta or inverse or inefficiency variance. Findings here support the ‘‘Free Cash Flow’’ hypothesis, which postulates that excess cash flows encourage managerial laxness, which translates into technical inefficiency. Finally, in all three models of financial risk the sign on long-run risk variable coefficient is positive and statistically significant at the 1 percent level of significance. Results indicate that regardless of the measure of long-run financial risk, variation in financial risks would increase the variability in agricultural productivity (Table 2).
4.2. Panel Gamma SML Stochastic Frontier Models

In exploring alternatives to the pooled gamma SML frontier production function, we now switch to estimating various panel gamma SML stochastic frontier production functions with heterogeneity in the random error, v, and the one-sided inefficiency, u. In particular, we estimate panel gamma SML stochastic frontier production functions using SA, WH, and AM two-way random effects panel estimators. Due to the wrong skewness, we do not present results of the WH model. Parameter estimates of the panel gamma SML stochastic frontier production function (SA and AM) are presented in Tables 3 and 4, respectively.
Table 3. Swamy–Arora Alternative Panel Gamma SML Stochastic Frontier Production Function Results for Solvency, Liquidity, and Efficiency Financial Variables.

                        Financial Solvency          Financial Liquidity         Financial Efficiency
                        Coefficient   P[|Z|>z]      Coefficient   P[|Z|>z]      Coefficient   P[|Z|>z]
Constant                23.9604       <0            17.8617       <0            23.9168       <0
Capital                 0.0152        0.2329        0.0523        0.0001        0.0057        0.6451
Land                    0.1331        <0            0.14879       <0            0.1464        <0
Labor                   0.0973        <0            0.09633       <0            0.1147        <0
Chemicals               0.0656        <0            0.05383       <0            0.0919        <0
Energy                  0.0727        <0            0.06054       <0            0.0158        0.2448
Materials               0.4549        <0            0.46029       <0            0.4639        <0
Year                    0.0124        <0            0.01324       <0            0.0121        <0
Theta                   16.2376       <0            48.6613       <0            23.1985       <0
Shape parameter (P)     0.8436        0.0001        1.92584       <0            0.8217        <0
Sigma (v)               0.0575        <0            0.06144       <0            0.0635        <0
Theta (1/u)
  Long-run risk         0.0482        0.538         11.7154       0.0005        0.0470        0.2483
  Short-run risk        0.1997        0.0105        6.60034       0.0613        0.0752        0.1577
Productivity
  Long-run risk         0.2449        <0            15.0019       <0            0.0646        0.0008
  Short-run risk        0.0306        0.2897        3.45814       0.0706        0.0184        0.2717
Table 3 presents result from the SA alternative panel SML stochastic production frontier model with heterogeneity in the random error, v and the one-sided inefficiency, u. In addition to usual input factors, we also assess the impact of short- and long-run variations in financial exposure variables on technical efficiency and productivity variance function. Comparing the overall result between pooled and panel gamma SML stochastic frontier production function indicate that the coefficients of the inputs are roughly the same in case of material inputs. However, significant differences exist in all other inputs. In the case of variations in financial exposure variables (both short- and long-run) – theta and productivity variations – results show that in some cases the signs on the coefficient are reversed or the magnitude of the coefficient becomes larger. For instance, the coefficient of short-run financial liquidity risk increases from 3.01 and insignificant in pooled analysis (Table 2) to 3.46 and significant in panel analysis (Table 3). This may be due to the accounting of spatial, temporal, and within residual variation used to transform the data in the SA model.
Table 4. Amemiya Alternative Panel Gamma SML Stochastic Frontier Production Function Results for Solvency, Liquidity, and Efficiency Financial Variables.

                        Financial Solvency          Financial Liquidity         Financial Efficiency
                        Coefficient   P[|Z|>z]      Coefficient   P[|Z|>z]      Coefficient   P[|Z|>z]
Constant                21.4854       <0            19.1302       <0            21.4877       <0
Capital                 0.0366        0.0085        0.05695       <0            0.0507        0.0001
Land                    0.0931        <0            0.07576       <0            0.0708        0.0003
Labor                   0.0717        <0            0.07525       <0            0.0768        <0
Chemicals               0.0609        <0            0.08339       <0            0.0675        <0
Energy                  0.1359        <0            0.15157       <0            0.1360        <0
Materials               0.4474        <0            0.49155       <0            0.4166        <0
Year                    0.0113        <0            0.00991       <0            0.0114        <0
Theta                   18.9495       0.0035        34.6749       <0            18.9454       0.0011
Shape parameter (P)     1.0494        0.1241        1.0001        <0            1.0607        0.1483
Sigma (v)               0.0835        <0            0.0646        <0            0.0794        <0
Theta (1/u)
  Long-run risk         0.0016        0.9862        21.2496       <0            0.0209        0.6925
  Short-run risk        0.1930        0.0331        38.9859       <0            0.0946        0.0616
Productivity
  Long-run risk         0.1508        <0            15.4          <0            0.0979        <0
  Short-run risk        0.0202        0.4085        0.29185       <0            0.0036        0.8392
Results in Table 3 suggest that year, as a proxy for technology, is positively related to agricultural output. The output returns to scale, ranged from 0.89 for financial liquidity model to as low as 0.81 for the financial efficiency model, with the inclusion of technology.6 The theta, P, and sigma (v) were all positive and significant. Results also indicate that input variables in production are all positive and significantly related to the output, with the exception of energy and capital factors in the financial efficiency model (column 6, Table 3). The production function results are consistent with production theory, i.e., an increase in the quantity of input leads to an increase in the quantity of output produced. The results from the SA panel model indicate about 0.46 input elasticity for materials in all the three models. The elasticity is relatively higher to the other inputs, indicating that a 100 percent increase in material inputs would increase the output by 46 percent. The coefficient on land input is about 0.13–0.14, depending on the financial risk measure used in the model. It should be noted that land input ranks second with respect to the magnitude
of contributions to agricultural output, indicating that a 100 percent increase in land input increases agricultural output by about 13–14 percent. Farm labor with an elasticity of about 0.10 and capital input with an elasticity of 0.02 are much smaller compared to materials or land input, indicating that labor and capital inputs have a smaller positive influence on agricultural output. When comparing the impact of variations in financial exposure variables on technical efficiency and productivity in case of panel models, the findings are similar to those obtained in pooled analysis (Table 2). The short-run financial solvency risk (variability in debt-to-asset ratio) variable in theta or inverse of inefficiency variance function is negative and significant at 1 percent level of significance. This indicates that an increase in the variation of financial solvency decreases the variation in theta or inverse of inefficiency variance in the short-run. A possible explanation is that more indebted farmers are higher costs farmers, and hence more technically inefficient. This finding is consistent with the agency costs theory proposed by Nasr et al. (1998). In contrast, the short-run financial liquidity risk variable has a positive and significant impact on theta or inverse of inefficiency variance. The positive sign indicates that, short-run variation in financial liquidity (debt serving ratio) would increase the variation in theta or inverse of inefficiency variance. The final two rows in Table 3 assess the impact of long-run and short-run variations in financial exposure variables on productivity in case of panel gamma SML stochastic frontier functions. The results in Table 3 show that in all three models the impact of long-run variation in financial exposure is positive and statistically significant at the 1 percent level of significance. Findings here support the free cash flow hypothesis, which postulates that excess cash flows encourage managerial laxness, which translates into technical inefficiency. Finally, the coefficient on short-run variation in financial exposure is negative and significant for the financial liquidity model, indicating that an increase in the short-run variation of the financial liquidity (debt servicing ratio) would lead to a decrease in the variation of productivity. Table 4 presents the results of the Amemiya alternative panel gamma SML stochastic frontier production function. The parameter estimates of inputs (land, labor, capital, etc.) are similar to those obtained in Table 2 (pooled analysis) because the errors used in the transformation of the data were obtained from the pooled model in the first stage. However, when considering the impact of short- and long-run variations in financial exposure on theta and productivity variation, the results are similar to the
SA panel model. For example, short-run financial risk, in case of financial solvency and liquidity, has a negative and positive significant effect on theta or inverse of inefficiency, respectively. Additionally, long-run variations in financial exposure in all models have a positive and significant impact on productivity variation. Year, as a proxy for technology, is positively related to agricultural output and returns to scale, ranging from 0.94 for financial liquidity model to as low as 0.83 for financial efficiency model, with the inclusion of technology.6 The theta and sigma (v) are positive and significant (Table 4). Finally, production function results are consistent with production theory, i.e., an increase in the quantity of input leads to an increase in quantity of output produced. Interestingly, results in Table 4 show that short-run financial efficiency risk (asset turnover ratio) variable in theta or inverse of inefficiency is negative and significant. This indicates an increase in the variability of financial efficiency leads to a decrease in the variation of theta or inverse of inefficiency measure. It is highly likely that variability in gross farm income is the driving force behind the variability in asset turnover ratio. Higher variability would be associated with higher variance in production. It has been argued that more efficient farmers have higher asset turnover ratio. According to Nasr et al. (1998), lenders like to advance funds to ‘‘low-cost’’ (technically efficient) farmers. Under this hypothesis, we expect efficient farmers to have higher debt. Therefore, any fluctuations in the gross farm revenue would have a negative impact on variation in technical efficiency. Finally, results in Table 4 show that long-run variation in financial exposure in all three models has a positive and significant impact on productivity variance function. Findings here suggest that an increase in the variation in financial solvency, liquidity, and efficiency risk would lead to an increase in productivity variance. Finally, the estimated theta and sigma (v) for pooled, SA, and AM panel estimators are all significant at 1 percent level of significance. This result indicates a good fit of the gamma SML pooled and panel model with heteroskedasticity in one-sided and random errors. The shape parameter, p as defined in Eq. (2), was also significant for the pooled, SA, and AM panel estimators with the exception of financial solvency and financial efficiency in AM panel model. Larger values of p (greater than 1) allow the mass of the inefficiency distribution to move away from zero. Results indicate that for the pooled and most of panel models, the value of p was less than or close to zero, with the exception of financial liquidity pooled and SA panel models.
5. CONCLUSION

The contribution of the research presented in this chapter is twofold. First, three alternative two-way random effects panel estimators of the normal-gamma stochastic frontier model with heterogeneity in the random error and the one-sided inefficiency are proposed and tested using simulated maximum likelihood estimation techniques. In particular, we propose a generalized least squares procedure that involves estimating the variance components in the first stage and then using the estimated variance–covariance matrix to transform the data. The data transformation involves estimation of the pooled model (Wallace–Hussain estimator); the within model (Amemiya estimator); or the within, between cross-sectional, and between time-series models (Swamy–Arora estimator) in the estimation of the alternative panel estimators. Second, the stochastic frontier model with heteroskedasticity of a random error term, identified with productivity, and a one-sided error term, identified with inefficiency, is used to examine the importance of short- and long-run variations in financial risk variables – namely, financial liquidity, solvency, and efficiency. Empirical estimates indicate differences in the parameter estimates of the gamma distribution, production function, and heterogeneity function variables between the pooled and the two alternative panel estimators – namely, the SA and AM estimators. The difference between the pooled and the panel models suggests the need to account for spatial, temporal, and within residual variations as in the Swamy–Arora estimator, and within residual variation in the Amemiya estimator within the panel framework. Our findings show production increasing with increasing units of inputs. Results from this study indicate that variations in financial exposure measures (solvency, liquidity, and efficiency) play an important role in technical efficiency and productivity. For example, in the case of the financial solvency and financial liquidity risk models, our findings reveal a negative and a positive effect on technical efficiency in the long run and short run, respectively. Future research could examine the implications of time-invariant and specification-variant gamma SML stochastic frontier production and cost function models. Further, research could also focus on the robustness of the alternative two-way random effects models with application to farm-level data. Compared to aggregate production analysis, individual farm-level data results may vary with regard to production of agricultural output and the impact of financial risk on productivity and efficiency.
NOTES

1. Efficiency concept introduced by Farrell (1957) is defined as the distance of the observation from the production frontier and measured by the observed output of a firm, state, or country relative to realized output, i.e., output that could be produced if it were 100 percent efficient from a given set of inputs.
2. According to Greene (2003), the "normal-gamma model provides a richer and more flexible parameterization of the inefficiency distribution in the stochastic frontier model than either of the canonical forms, normal-half normal and normal-exponential."
3. Alternatively, maximum likelihood estimator of the two-way random effects panel stochastic frontier models can also be presented.
4. The data are available at the USDA/ERS website http://www.ers.usda.gov/data/agproductivity/
5. The data are available at the USDA/ERS website http://www.ers.usda.gov/data/farmbalancesheet/fbsdmu.htm
6. However, returns to scale (RTS) were slightly lower when technology was excluded from the regression.
7. Due to asymmetric information and misaligned incentives between lenders and borrowers, monitoring of borrowers by lenders is implied. Monitoring involves transactions costs and lenders may pass on the monitoring costs to the farmers in the form of higher interest rates and/or collateral requirements.
ACKNOWLEDGMENTS

The authors wish to thank the participants of the 8th Annual Advances in Econometrics Conference, Nov. 6–8, Baton Rouge, LA, for their useful comments and questions. We thank the editor and two anonymous reviewers for their valuable suggestions that greatly improved the exposition and readability of the paper. Mishra's time on this project was supported by the USDA Cooperative State Research Education & Extension Service, Hatch project # 0212495 and Louisiana State University Experiment Station project # LAB 93872. Shaik's time on this project was supported by USDA Cooperative State Research Education & Extension Service, Hatch project # 0217864 and North Dakota State University Experiment Station project # ND 01397.
REFERENCES

Abramovitz, M. (1956). Resource and output trends in the United States since 1870. American Economic Review, 46(2), 5–23.
Aigner, D. J., Lovell, C. A. K., & Schmidt, P. (1977). Formulation and estimation of stochastic frontier production function models. Journal of Econometrics, 6, 21–37.
Amemiya, T. (1971). The estimation of the variances in a variance-components model. International Economic Review, 12, 1–13.
Ball, V. E., Bureau, J.-C., Butault, J.-P., & Nehring, R. (2001). Levels of farm sector productivity: An international comparison. Journal of Productivity Analysis, 15, 5–29.
Ball, V. E., Gollop, F., Kelly-Hawke, A., & Swinand, G. (1999). Patterns of productivity growth in the U.S. farm sector: Linking state and aggregate models. American Journal of Agricultural Economics, 81, 164–179.
Battese, G., & Coelli, T. J. (1992). Frontier production functions, technical efficiency and panel data: With application to paddy farmers in India. Journal of Productivity Analysis, 3, 153–169.
Battese, G., & Corra, G. (1977). Estimation of a production frontier model: With application for the pastoral zone of eastern Australia. Australian Journal of Agricultural Economics, 21, 167–179.
Cornwell, C., Schmidt, P., & Sickles, R. C. (1990). Production frontiers with cross-sectional and time-series variation in efficiency levels. Journal of Econometrics, 46(1–2), 185–200.
Farrell, M. J. (1957). The measurement of productive efficiency. Journal of the Royal Statistical Society Series A, 120, 253–290.
Greene, W. H. (1980). Maximum likelihood estimation of econometric frontier functions. Journal of Econometrics, 13(1), 27–56.
Greene, W. H. (1990). A gamma-distributed stochastic frontier model. Journal of Econometrics, 46(1), 141–164.
Greene, W. H. (2003). Simulated likelihood estimation of the normal-gamma stochastic frontier function. Journal of Productivity Analysis, 19(2/3), 179–190.
Greene, W. (2004). Distinguishing between heterogeneity and inefficiency: Stochastic frontier analysis of the World Health Organization's panel data on National Health Care Systems. Health Economics, 13(10), 959–980.
Greene, W. (2007). LIMDEP computer program: Version 9.0. Plainview, NY: Econometric Software.
Jondrow, J., Materov, I., Lovell, K., & Schmidt, P. (1982). On the estimation of technical inefficiency in the stochastic frontier production function model. Journal of Econometrics, 19, 233–238.
Meeusen, W., & van den Broeck, J. (1977). Efficiency estimation from Cobb–Douglas production functions with composed error. International Economic Review, 18, 435–444.
Nasr, R. E., Barry, P. J., & Ellinger, P. N. (1998). Financial structure and efficiency of grain farms. Agricultural Finance Review, 58, 33–48.
Schmidt, P., & Sickles, R. C. (1984). Production frontiers and panel data. Journal of Business and Economic Statistics, 2(4), 367–374.
Solow, R. M. (1957). Technical change and the aggregate production function. Review of Economics and Statistics, 39(3), 312–320.
Swamy, P. A. V. B., & Arora, S. S. (1972). The exact finite sample properties of the estimators of coefficients in the error component regression models. Econometrica, 40, 261–275.
Wallace, T. D., & Hussain, A. (1969). The use of error components models in combining cross-section and time-series data. Econometrica, 37, 55–72.
Wang, H., & Schmidt, P. (2002). One-step and two-step estimation of the effects of exogenous variables on technical efficiency levels. Journal of Productivity Analysis, 18, 129–144.
MODELING AND FORECASTING VOLATILITY IN A BAYESIAN APPROACH

Esmail Amiri

ABSTRACT

In a Bayesian approach, we compare the forecasting performance of five classes of models: ARCH, GARCH, SV, SV-STAR, and MSSV using daily Tehran Stock Exchange (TSE) market data. To estimate the parameters of the models, Markov chain Monte Carlo (MCMC) methods are applied. The results show that the models in the fourth and the fifth class perform better than the models in the other classes.
1. INTRODUCTION

Volatility is a well-known characteristic of many financial time series, and it changes over time. In the basic option pricing model introduced by Black and Scholes (1973), the volatility of the returns of the underlying asset is the only one of the five parameters that is not directly observable and must be forecasted. Also, in some financial studies, for a more precise estimation of the value at risk, an accurate volatility forecast plays an important role.
Maximum Simulated Likelihood Methods and Applications
Advances in Econometrics, Volume 26, 323–356
Copyright © 2010 by Emerald Group Publishing Limited
All rights of reproduction in any form reserved
ISSN: 0731-9053/doi:10.1108/S0731-9053(2010)00000260014
In the literature, different econometric models have been suggested to forecast volatility; among them, the autoregressive conditional heteroskedasticity (ARCH) and generalized ARCH (GARCH) family plays an important role (e.g., Brooks, 1998; Yu, 2002). Our aim is to compare the performance of ARCH, GARCH, and some stochastic volatility (SV) models for predicting volatility in the Tehran (Iran) stock market in a Bayesian approach. This study contributes to the volatility forecasting literature in two ways. First, a data set from a stock market rarely used in the literature is processed. Second, more SV models are included among the competing candidates. In Section 2, the data is introduced; Section 3 displays five classes of volatility models; Section 4 is devoted to the Bayesian inference; Section 5 describes the likelihood of the models and priors; Section 6 presents the method of volatility forecasting; in Section 7, forecast evaluation measures are introduced; Section 8 displays the empirical results; and Section 9 is allocated to the conclusion.
2. DATA

We analyze a Tehran Stock Exchange (TSE) market data set. The TSE began its operations in 1967 and was inactive from 1979 (the beginning of Iran's Islamic revolution) to 1989. The Iranian government's new economic reforms and a privatization initiative in 1989 renewed attention to the private sector and brought life back to the TSE. The TSE, which is a full member of the World Federation of Exchanges (WFE) and a founding member of the Federation of Euro-Asian Stock Exchanges (FEAS), has been one of the world's best performing stock exchanges in recent years (e.g., Azizi, 2004; Najarzadeh & Zivdari, 2006). The government of Iran directly holds 35% of the TSE, while securing another 40% through Iranian pension funds and investment companies; foreign investment accounted for only about 2% of the stock market in 2009. The TSE is open for trading 5 days a week, from Saturday to Wednesday, excluding public holidays. As of June 2008, 400 companies with a market capitalization of US$70 billion were listed on the TSE. The sample for this study consists of 2,398 daily returns over the period from 1 January 1999 to 31 December 2008 (excluding holidays and no-trading days). Returns are defined as the natural logarithm of price relatives; that is, y_t = log(X_t / X_{t-1}), where X_t is the daily capital index. The basic framework is a 5-day trading week, with the markets closing for various holidays.
In the literature, there are different methods for obtaining a monthly volatility series from daily returns. Since we only have data at the daily frequency, we calculate the volatility in a given month simply, following the approach of Merton (1980) and Perry (1982):

\sigma_T^2 = \sum_{t=1}^{N_T} y_t^2        (1)

where y_t is the daily return on day t and N_T is the number of trading days in month T.
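A minimal sketch of the monthly volatility construction in Eq. (1): square the daily log returns and sum them within each calendar month. The synthetic dates and prices below are illustrative stand-ins, not the actual TSE data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.bdate_range("1999-01-01", "2008-12-31")           # proxy trading calendar
prices = 10_000 * np.exp(np.cumsum(rng.normal(0, 0.01, len(dates))))
returns = pd.Series(np.log(prices), index=dates).diff().dropna()  # y_t = log(X_t / X_{t-1})

monthly_vol = (returns ** 2).resample("M").sum()             # sigma^2_T = sum of y_t^2 in month T
print(monthly_vol.head())
```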
Figs. 1 and 2 plot the daily stock index and the daily return series, respectively; they exhibit a trend and volatility clustering, respectively. In total, we have 120 monthly volatilities, which are plotted in Fig. 3. From Fig. 3, two particularly volatile periods can easily be identified: the first corresponds to the 2003 crash, while the second occurred in 2008, the period of the world financial crisis.
Fig. 1. Plot of the Daily TSE Stock Index (TSE index vs. days of years 1999-2009).

Fig. 2. Plot of the Daily TSE-Return Series (TSE return vs. days of years 1999-2009).

Fig. 3. Plot of the Monthly Volatility of TSE from 1999 to the End of 2008 (volatility vs. months from 1999 to 2009).
Table 1. Summary Statistics of the Monthly Volatility of TSE.

  Mean        Median     Max.        Kurt.    rho_1
  0.0005068   0.000239   0.0075356   27.39    0.5673

  rho_2       rho_3      rho_4       rho_5    rho_6
  0.2256      0.1681     0.0965      0.0496   0.0229

Table 1 presents a descriptive picture of the volatility time series; it also shows that the autocorrelation function (ACF) of the volatility is decreasing. To test for stationarity, the KPSS statistic (the KPSS test is due to Kwiatkowski, Phillips, Schmidt, & Shin, 1992) is calculated (Zivot & Wang, 2006). The KPSS statistic for the entire sample is 0.1887, which is smaller than the 99% quantile, 0.762; therefore, the null hypothesis that the volatility series is stationary is accepted at the 1% level. After obtaining the monthly volatility series, the forecasting horizon has to be chosen; in this study, 1-month-ahead forecasts are used. Furthermore, one period has to be chosen for estimating the parameters and another for predicting volatility. The first 8 years of data are used to fit the models, so the first month for which an out-of-sample forecast is obtained is January 2007. As the sample is rolled over, the models are re-estimated and sequential 1-month-ahead forecasts are made; in total, 24 monthly volatilities are forecast. When the KPSS statistic is calculated separately for the first 8 years and the last 2 years, the values are 0.2283 and 0.4622, respectively. Again, for both subsamples, the null hypothesis that the volatility series is stationary is accepted at the 1% level.
3. VOLATILITY MODELS

We apply five classes of volatility models to the TSE-return time series and evaluate their forecast performance. The first and second classes are known as observation-driven models, and the other three classes are called parameter-driven. In the first two classes we examine ARCH and GARCH models, and in the last three classes we study SV models with different state equations. In the following, y_t is the return on an asset at time t = 1, ..., T, {epsilon_t} is an independent Gaussian white-noise process, and h_t = log sigma_t^2.
3.1. ARCH Models

Based on Engle (1982), the ARCH(p) model is defined as

y_t = \mu + \sigma_t \epsilon_t
\sigma_t^2 = \xi + \sum_{i=1}^{p} \alpha_i (y_{t-i} - \mu)^2        (2)

where \sigma_t^2 is the volatility at time t, and \xi and \alpha_i (i = 1, ..., p) are the parameters of the deterministic volatility model.
3.2. GARCH Models

The class of GARCH models builds on the fact that volatility is time varying and persistent, and that current volatility depends deterministically on past volatility and past squared returns. GARCH models are easy to estimate and quite popular, since it is relatively straightforward to evaluate the likelihood function for this kind of model. The ARCH model was extended in this direction by Bollerslev (1986). A GARCH(p,q) model is defined as

y_t = \mu + \sigma_t \epsilon_t
\sigma_t^2 = \xi + \sum_{i=1}^{p} \alpha_i (y_{t-i} - \mu)^2 + \sum_{j=1}^{q} \beta_j \sigma_{t-j}^2        (3)

where, given the observations up to time t - 1, the volatility \sigma_t^2 at time t is deterministic once the parameters \xi, \alpha_i, \beta_j (i = 1, ..., p; j = 1, ..., q) are known.
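A small sketch, under illustrative parameter values, of simulating returns from the GARCH(1,1) recursion in Eq. (3); setting beta = 0 gives the ARCH(1) case of Eq. (2). The chosen mu, xi, alpha, and beta are not estimates from the TSE data.

```python
import numpy as np

def simulate_garch11(T, mu=0.0, xi=1e-6, alpha=0.1, beta=0.85, seed=1):
    rng = np.random.default_rng(seed)
    y = np.empty(T)
    sigma2 = np.empty(T)
    sigma2[0] = xi / (1.0 - alpha - beta)        # start at the unconditional variance
    y[0] = mu + np.sqrt(sigma2[0]) * rng.standard_normal()
    for t in range(1, T):
        sigma2[t] = xi + alpha * (y[t - 1] - mu) ** 2 + beta * sigma2[t - 1]
        y[t] = mu + np.sqrt(sigma2[t]) * rng.standard_normal()
    return y, sigma2

y, sigma2 = simulate_garch11(2500)
```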
3.3. Stochastic Volatility Models

The SV model was originally introduced by Taylor (1986). In the class of SV models, the innovations to the volatility are random, and the volatility realizations are therefore unobservable and more difficult to recover from the data. The characteristic distinguishing SV models from GARCH models is thus the presence of an unobservable shock component in the volatility dynamics. The exact value of volatility at time t cannot be known even if all past information is employed to determine it. As more information becomes available, the volatility in a given past period can be better evaluated; both contemporaneous and future information thus contribute to learning about volatility. In contrast, in the deterministic
setting of the simple GARCH volatility process, the volatility in a certain time period is known given the information from the previous period (Rachev, Hsu, Bagasheva, & Fabozzi, 2008). The following log-normal SV model is well known in the SV literature (e.g., Shephard, 1996):

y_t = \sigma_t \epsilon_t = e^{0.5 h_t} \epsilon_t
h_t = \xi + \delta h_{t-1} + \sigma_\eta \eta_t        (4)

where y_t is the return on an asset at time t = 1, ..., T, {\epsilon_t} and {\eta_t} are independent Gaussian white-noise processes, \sigma_\eta is the standard deviation of the shock to h_t, and h_t has a normal distribution; \xi and \delta are the volatility model parameters. However, the likelihood function of SV models cannot be written in a simple closed-form expression: estimating an SV model involves integrating out the hidden volatilities.
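A minimal sketch of simulating the log-normal SV model of Eq. (4); the values of xi, delta, and sigma_eta are illustrative, not estimates.

```python
import numpy as np

def simulate_sv(T, xi=-0.7, delta=0.95, sigma_eta=0.2, seed=2):
    rng = np.random.default_rng(seed)
    h = np.empty(T)
    # initialize h_1 from the stationary distribution of the volatility equation
    h[0] = rng.normal(xi / (1 - delta), sigma_eta / np.sqrt(1 - delta ** 2))
    for t in range(1, T):
        h[t] = xi + delta * h[t - 1] + sigma_eta * rng.standard_normal()
    y = np.exp(0.5 * h) * rng.standard_normal(T)   # y_t = exp(h_t / 2) * eps_t
    return y, h

y, h = simulate_sv(2500)
```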
3.4. Stochastic Volatility Models with STAR Volatility

Different models have been proposed in the literature for generating the volatility sequence h_t (Kim, Shephard, & Chib, 1998). In a two-regime self-exciting threshold autoregressive (SETAR) model, the observations h_t are generated from the first regime when h_{t-d} is smaller than the threshold and from the second regime when h_{t-d} is greater than the threshold value. If the binary indicator function is replaced by a smooth transition function 0 < F(z_t) < 1 that depends on a transition variable z_t (like the threshold variable in TAR models), the model is called a smooth transition autoregressive (STAR) model. A general form of the STAR model is

h_t = X_t \phi^{(1)} (1 - F(z_t)) + X_t \psi F(z_t) + \eta_t,   \eta_t \sim N(0, \sigma^2)        (5)

where \psi = (1, \psi_1, ..., \psi_p), \phi^{(1)} = (1, \phi_1^{(1)}, ..., \phi_p^{(1)}), and X_t = (1, h_{t-1}, h_{t-2}, ..., h_{t-p}). For practical computation, let \phi^{(2)} = \psi - \phi^{(1)}; then Eq. (5) can be rewritten as

h_t = X_t \phi^{(1)} + X_t \phi^{(2)} F(z_t) + \eta_t        (6)

where \phi^{(2)} = (1, \phi_1^{(2)}, ..., \phi_p^{(2)}). Model (6) is similar to a two-regime SETAR model, but the observations h_t now switch between the two regimes smoothly, in the sense that the dynamics of h_t may be determined by both regimes, with one regime sometimes having more impact and the other regime having more impact at other times.
Two popular choices for the smooth transition function are the logistic function and the exponential function, respectively:

F(z_t; \gamma, c) = [1 + e^{-\gamma (z_t - c)}]^{-1},   \gamma > 0        (7)

F(z_t; \gamma, c) = 1 - e^{-\gamma (z_t - c)^2},   \gamma > 0        (8)
The resulting models are referred to as the logistic STAR (LSTAR) and the exponential STAR (ESTAR) models, respectively. In Eqs. (7) and (8), the parameter c is interpreted as the threshold, as in TAR models, and \gamma determines the speed and smoothness of the transition. If, in an SV model, the volatility sequence evolves according to a STAR(p) equation, the model is called a stochastic volatility model with STAR volatilities (SV-STAR):

y_t = \sigma_t \epsilon_t = e^{0.5 h_t} \epsilon_t,   \epsilon_t \sim N(0, 1)
h_t = X_t \phi^{(1)} + X_t \phi^{(2)} F(\gamma, c, h_{t-d}) + \sigma_\eta \eta_t,   \eta_t \sim N(0, 1)        (9)
where \phi^{(1)} and \phi^{(2)} are (p + 1)-dimensional vectors and F(\gamma, c, h_{t-d}) is a smooth transition function. We assume, without loss of generality, that d \le p. When p = 1, the STAR(1) reduces to an AR(1) model. In F(\gamma, c, h_{t-d}), \gamma > 0, c, and d are the smoothness, location (threshold), and delay parameters, respectively. When \gamma \to \infty, the STAR model reduces to a SETAR model, and when \gamma \to 0, the standard AR(p) model arises. We treat h_{-p+1}, h_{-p+2}, ..., h_0 as unknown quantities. For computational purposes, the second equation of Eq. (9) is written in matrix form,

h_t = W' \phi + \sigma_\eta \eta_t        (10)

where \phi' = (\phi^{(1)}, \phi^{(2)}) and W' = (X_t, X_t F(\gamma, c, h_{t-d})).
3.5. Markov Switching Stochastic Volatility Models

A Markov switching stochastic volatility model (MSSV), as in So, Lam, and Li (1998), is

y_t = \sigma_t \epsilon_t = e^{0.5 h_t} \epsilon_t,   \epsilon_t \sim N(0, 1)
h_t = \xi_{s_t} + \delta h_{t-1} + \sigma_\eta \eta_t,   \eta_t \sim N(0, 1)
\xi_{s_t} = r_1 + \sum_{j=2}^{k} r_j I_{jt},   r_j > 0        (11)
where s_t is a state variable and I_{jt} is an indicator variable equal to 1 when s_t \ge j. s_t follows a k-state first-order Markov process,

p_{ij} = \Pr(s_t = j \mid s_{t-1} = i),   i, j = 1, ..., k        (12)

where \sum_{j=1}^{k} p_{ij} = 1. The goal of this model is to separate clusters of high and low volatility, captured by the different \xi's, and therefore to estimate the persistence parameter \delta more precisely (Hamilton & Susmel, 1994). We assume that h_t is greater in the high-volatility state (s_t = k) than in the low-volatility state (s_t = 1); when s_t = 1, the MSSV model reduces to model (4). To simplify the notation, let \xi = (\xi_1, ..., \xi_k), s = (s_1, ..., s_T), h = (h_1, ..., h_T), and P = (p_{ij}). Moreover, to keep our notation consistent, we refer to a p-th order MSSV model as SV-MSAR(p), where p is the order of the autoregressive part of the MSSV model.
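A sketch of simulating a two-state MSSV path as in Eqs. (11) and (12). The transition matrix P and all parameters are illustrative; the indicator structure assumes I_{jt} = 1 when s_t >= j, as in the model description above.

```python
import numpy as np

def simulate_mssv(T, r=(-1.0, 0.8), delta=0.9, sigma_eta=0.2,
                  P=((0.98, 0.02), (0.05, 0.95)), seed=4):
    rng = np.random.default_rng(seed)
    P = np.asarray(P)
    k = P.shape[0]
    s = np.empty(T, dtype=int)
    h = np.zeros(T)
    s[0] = 0
    # xi_{s_t} = r_1 + sum_{j>=2} r_j * 1{s_t >= j}, with r_j > 0 for j >= 2
    xi = np.cumsum(r)                                  # xi per state 0,...,k-1
    h[0] = xi[s[0]] / (1 - delta)
    for t in range(1, T):
        s[t] = rng.choice(k, p=P[s[t - 1]])            # k-state first-order Markov chain
        h[t] = xi[s[t]] + delta * h[t - 1] + sigma_eta * rng.standard_normal()
    y = np.exp(0.5 * h) * rng.standard_normal(T)
    return y, h, s

y, h, s = simulate_mssv(2500)
```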
4. BAYESIAN INFERENCE OF VOLATILITY MODELS

For a given volatility model (GARCH-type or SV), we denote by \theta \in \Theta \subseteq R^d the vector of parameters, by p(\theta \mid y) the posterior of \theta, and by y = (y_1, y_2, ..., y_T) the vector of observations. The joint posterior density of \theta is then obtained by Bayes' theorem as

p(\theta \mid y) = \frac{f(y \mid \theta)\, p(\theta)}{\int_\Theta f(y \mid \theta)\, p(\theta)\, d\theta}        (13)

where p(\theta) is the prior density of \theta and f(y \mid \theta) is the joint density of y given \theta, called the likelihood. In Eq. (13), \int_\Theta f(y \mid \theta) p(\theta) d\theta is the marginal likelihood, the normalizing constant of p(\theta \mid y). Once f(y \mid \theta) p(\theta) has been obtained, a technical difficulty is calculating the high-dimensional integral needed for this normalizing constant. The likelihood function of an SV model is intractable; therefore, we have to resort to simulation methods. The likelihood functions of ARCH and GARCH models are, of course, better behaved than that of the SV model (see Note 1), but we prefer to use the same estimation method for all models. Markov chain Monte Carlo (MCMC) is a promising way of attacking likelihood estimation by simulation: computer-intensive MCMC methods are used to draw samples from the distribution of the volatilities conditional on the observations. Early work on these methods was pioneered by Hastings (1970) and Geman and Geman (1984), while more recent developments appear in Gelfand and Smith (1990) and Chib and Greenberg (1995,
1996). Kim and Shephard (1994) and Jacquier, Polson, and Rossi (1994) were among the first to apply MCMC methods to the estimation of SV models. MCMC makes it possible to obtain the posterior distribution of the parameters by simulation rather than by analytical methods. Estimation of the parameters of models (2)-(4) and (9)-(11) by MCMC methods is carried out in a Bayesian framework. For the application of MCMC, priors with specific hyper-parameters should be assumed for the distributions of the parameters. We follow Congdon (2003) and fit the models to the TSE-return time series using MCMC methods. The performance of the models is evaluated using in-sample and out-of-sample measures. For in-sample model selection, two model choice criteria are considered, DIC and PLC: DIC is the deviance at the posterior mean criterion of Spiegelhalter, Best, Carlin, and van der Linde (2002), and PLC is the prediction loss criterion of Gelfand and Ghosh (1998). For out-of-sample model fit, two criteria are chosen, the root mean square error (RMSE) and the linear-exponential (LINEX) loss (Yu, 2002).
4.1. Markov Chain Monte Carlo Methods

Markov chain Monte Carlo (MCMC) methods have virtually revolutionized the practice of Bayesian statistics. Early work on these methods was pioneered by Hastings (1970) and Geman and Geman (1984), while more recent developments appear in Gelfand and Smith (1990) and Chib and Greenberg (1995, 1996). When sampling directly from a high-dimensional posterior density is intractable, MCMC methods provide algorithms for obtaining the desired samples. Letting \pi(\theta) be the target posterior distribution of interest, the main idea behind MCMC is to build a Markov chain transition kernel

P(z, C) = \Pr\{\theta^{(m)} \in C \mid \theta^{(m-1)} = z\},   m = 1, ..., M        (14)

where M is the number of iterations and \theta^{(m)} is the sample of \theta at the m-th iteration of the chain, started from some initial state \theta^{(0)}, whose limiting invariant distribution equals \pi(\theta). It has been proved (e.g., by Chib & Greenberg, 1996) that under suitable conditions one can build such a transition kernel generating a Markov chain \{\theta^{(m)} \mid \theta^{(m-1)}\} whose realizations converge in distribution to \pi(\theta). Once convergence has occurred, a sample of serially dependent simulated observations on the parameter \theta is obtained, which can be used to perform Monte Carlo inference. Much effort has been devoted to the design of algorithms able to generate a convergent transition
kernel. The Metropolis-Hastings (M-H) algorithm and the Gibbs sampler are among the most famous algorithms and are very effective in building the above-mentioned Markov chain transition kernel.
4.2. The Metropolis-Hastings Algorithm

The M-H algorithm, introduced by Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller (1953) and Hastings (1970), has similarities with the adaptive rejection (A-R) algorithm (Gilks & Wild, 1992). Let \pi(\theta) denote the un-normalized posterior density, from which direct sampling is not possible. To sample from \pi(\theta), M-H uses an auxiliary density, called the candidate-generating density, denoted q(\theta, \theta^*) with \int q(\theta, \theta^*) d\theta^* = 1. This density is used to move from the current point \theta to a new point \theta^* by drawing \theta^* from q(\theta, \theta^*); in other words, q(\theta, \theta^*) plays the role of a kernel. The move is governed by the probability of move, \alpha(\theta, \theta^*), defined as

\alpha(\theta, \theta^*) = \frac{\pi(\theta^*)\, q(\theta^*, \theta)}{\pi(\theta)\, q(\theta, \theta^*)}        (15)

where \alpha(\theta, \theta^*) \le 1. When \pi(\theta) q(\theta, \theta^*) > \pi(\theta^*) q(\theta^*, \theta), moves from \theta to \theta^* would happen too often under q alone, and \alpha(\theta, \theta^*) < 1 corrects this; when \pi(\theta) q(\theta, \theta^*) < \pi(\theta^*) q(\theta^*, \theta), we can take \alpha(\theta, \theta^*) = 1, with \theta and \theta^* interchanged in Eq. (15). To summarize, the probability of move is

\alpha(\theta, \theta^*) = \min\left\{ \frac{\pi(\theta^*)\, q(\theta^*, \theta)}{\pi(\theta)\, q(\theta, \theta^*)},\; 1 \right\} if \pi(\theta) q(\theta, \theta^*) > 0, and \alpha(\theta, \theta^*) = 1 otherwise.        (16)

It is worth mentioning that if the candidate-generating density is symmetric, the fraction in Eq. (16) reduces to \pi(\theta^*)/\pi(\theta), and there is no need to normalize \pi(\theta). The algorithm is then as follows, where we denote \alpha(\theta_i, \theta^*) by r:

Step 0: Set i = 0 and \theta_i = \theta_0, where \theta_0 is chosen arbitrarily.
Step 1: Set i = i + 1; if i > m, stop, otherwise go to the next step.
Step 2: Draw \theta^* from q(\theta_i, \cdot) and u from U(0, 1).
Step 3: If r \ge u, accept \theta^*; otherwise stay at \theta_i.
Step 4: Go to Step 1.
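A minimal random-walk Metropolis-Hastings sketch for a one-dimensional, un-normalized target density pi(theta); with a symmetric proposal, the acceptance ratio reduces to pi(theta*)/pi(theta), as noted above. The target below (a standard normal kernel) is purely illustrative.

```python
import numpy as np

def log_target(theta):
    return -0.5 * theta ** 2             # un-normalized N(0,1) kernel

def metropolis_hastings(m=5000, step=1.0, theta0=0.0, seed=5):
    rng = np.random.default_rng(seed)
    draws = np.empty(m)
    theta = theta0
    for i in range(m):
        proposal = theta + step * rng.standard_normal()   # symmetric candidate
        log_r = log_target(proposal) - log_target(theta)
        if np.log(rng.uniform()) <= log_r:                # accept with prob min(1, r)
            theta = proposal
        draws[i] = theta
    return draws

draws = metropolis_hastings()
```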
4.3. The Gibbs Sampler

The Gibbs sampler is a version of the M-H algorithm for obtaining marginal distributions from a non-normalized joint density by a Markovian updating scheme. It was initially developed in the context of image restoration by Geman and Geman (1984); Tanner and Wong (1987) later used the method in a statistical framework, and Gelfand and Smith (1990) showed its applicability to general parametric Bayesian computation. In the Gibbs sampler, conditional distributions play a central role in simulating from multivariate distributions. Assume \pi(\theta) is a non-normalized multivariate density, where \theta = (\theta_1, \theta_2, ..., \theta_k). Distributions of the form \pi(\theta_i \mid \theta_{-i}) = \pi(\theta_i \mid \theta_1, ..., \theta_{i-1}, \theta_{i+1}, ..., \theta_k) are called full conditionals. If the full conditionals are available, the Gibbs sampler is called conjugate; if they are not, it is called non-conjugate. Essentially, the Gibbs sampler is the M-H algorithm in which the full conditionals play the role of the candidate-generating densities and the acceptance probability is one. A systematic form of the Gibbs sampler proceeds as follows:

Step 1: Choose an arbitrary starting set of values \theta_1^{(0)}, ..., \theta_k^{(0)}.
Step 2: Draw \theta_1^{(1)} from \pi(\theta_1 \mid \theta_2^{(0)}, ..., \theta_k^{(0)}), then \theta_2^{(1)} from \pi(\theta_2 \mid \theta_1^{(1)}, \theta_3^{(0)}, ..., \theta_k^{(0)}), and so on, up to \theta_k^{(1)} from \pi(\theta_k \mid \theta_1^{(1)}, \theta_2^{(1)}, ..., \theta_{k-1}^{(1)}), to complete one iteration of the scheme.
Step 3: After t iterations of Step 2 we arrive at (\theta_1^{(t)}, ..., \theta_k^{(t)}).
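A minimal Gibbs-sampler sketch: the full conditionals of a bivariate normal with correlation rho are univariate normals, so each component can be updated in turn. This toy target is purely illustrative; the chapter's models have far richer conditionals.

```python
import numpy as np

def gibbs_bivariate_normal(m=5000, rho=0.8, seed=6):
    rng = np.random.default_rng(seed)
    draws = np.empty((m, 2))
    theta1, theta2 = 0.0, 0.0
    sd = np.sqrt(1.0 - rho ** 2)
    for i in range(m):
        theta1 = rng.normal(rho * theta2, sd)   # theta1 | theta2 ~ N(rho*theta2, 1-rho^2)
        theta2 = rng.normal(rho * theta1, sd)   # theta2 | theta1 ~ N(rho*theta1, 1-rho^2)
        draws[i] = (theta1, theta2)
    return draws

draws = gibbs_bivariate_normal()
```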
5. LIKELIHOODS, CONDITIONAL POSTERIORS, AND PRIORS

Following Congdon (2003), the log-likelihood for the t-th observation under models (2) and (3) is

\log L_t = -0.5 \log(2\pi) - 0.5 \log \sigma_t^2 - 0.5 \frac{(y_t - \mu)^2}{\sigma_t^2}        (17)

and the log-likelihood for the t-th observation under models (4) and (9) is (e.g., Gerlach & Tuyl, 2006)

\log L_t \propto -0.5\, y_t^2 e^{-h_t} - 0.5\, h_t - 0.5 \log \sigma_\eta^2 - 0.5\, \eta_t^2        (18)
The log-likelihood for model (11) can be written as

\log L_t \propto -0.5\, y_t^2 e^{-h_t} - 0.5\, h_t - 0.5 \log \sigma_\eta^2 - 0.5\, (h_t - \xi_{s_t} - \delta h_{t-1})^2        (19)
The BRugs and SVPack software are used to facilitate the programming of the simulations. BRugs is a collection of R functions that allow users to analyze graphical models using MCMC techniques; most of the R functions in BRugs provide an interface to the BRugs dynamic link library (shared object file). BRugs is free software and is downloadable from http://mathstat.helsinki.fi/openbugs. The Bayesian study of volatility models is simple to implement in BRugs and does not require the programmer to know the precise formulae for any prior density or likelihood. The general steps of programming with BRugs are: (1) identify the likelihood and (2) identify the priors of the parameters. SVPack is a freeware dynamic link library for the Ox programming language (Doornik, 1996). In the following subsections, we summarize only the main steps of the MCMC algorithm for the studied models and present the conditional posteriors and priors.
5.1. MCMC Algorithm for ARCH and GARCH Models

To estimate the parameters of ARCH and GARCH models using MCMC methods, the conditional posterior pdfs are needed. By Bayes' theorem, the kernel of each conditional posterior pdf is a combination of the priors and the likelihood function. Let us denote \alpha = (\alpha_1, ..., \alpha_p), \beta = (\beta_1, ..., \beta_q), \sigma^2 = (\sigma_1^2, ..., \sigma_T^2), and \theta = (\xi, \alpha, \beta, \mu). We assume the following independent priors on the unknown parameters:

p(\theta) \propto p(\mu)\, p(\xi)\, p(\alpha)\, p(\beta)        (20)

where p(\mu) is a normal pdf and p(\xi), p(\alpha), and p(\beta) are uniform pdfs. The conditional likelihood function for the ARCH and GARCH models is

f(y \mid \theta, \tau_0) = (2\pi)^{-T/2} \left( \prod_{t=1}^{T} \sigma_t^{-1} \right) \exp\left( -\frac{1}{2} \sum_{t=1}^{T} \sigma_t^{-2} (y_t - \mu)^2 \right)        (21)
where \tau_0 is a set of initial conditions. In this chapter we assume the following initial conditions: (1) y_t = 0 for t < 0; and (2) \sigma_t^2 = s^2 for t \le 0, where s^2 denotes the sample variance of y_t. We use the following conditional pdfs within the Gibbs sampler algorithm:

\xi \mid y, \alpha, \beta, \mu, \sigma^2 \propto p(\xi)\, f(y \mid \theta, \tau_0)        (22)
\alpha \mid y, \xi, \beta, \mu, \sigma^2 \propto p(\alpha)\, f(y \mid \theta, \tau_0)        (23)
\beta \mid y, \xi, \alpha, \mu, \sigma^2 \propto p(\beta)\, f(y \mid \theta, \tau_0)        (24)
\mu \mid y, \xi, \alpha, \beta, \sigma^2 \propto p(\mu)\, f(y \mid \theta, \tau_0)        (25)

In our empirical work, we drew observations from these conditional pdfs using an M-H algorithm (see Chib & Greenberg, 1995). The main steps of the MCMC algorithm for ARCH and GARCH models, as described by Nakatsuma (1998), are:

1. Initialize \xi, \alpha, \beta, calculate \sigma^2, and set l = 0.
2. Sample a new \xi from \xi \mid y, \alpha, \beta, \mu, \sigma^2.
3. Sample a new \alpha from \alpha \mid y, \xi, \beta, \mu, \sigma^2.
4. Sample a new \beta from \beta \mid y, \xi, \alpha, \mu, \sigma^2.
5. Sample a new \mu from \mu \mid y, \xi, \alpha, \beta, \sigma^2.
6. Update \sigma^2.
7. If l < L, set l to l + 1 and continue with step 2, where L is the required number of iterations.

5.2. Priors for ARCH and GARCH Models
We use the following priors for the ARCH and GARCH models:

\mu \sim N(0, 1000),   \xi \sim U(0, 1000)        (26)
\alpha_i \sim U(0, 1),   i = 1, ..., p        (27)
\beta_j \sim U(0, 1),   j = 1, ..., q        (28)
\sigma_1^2 \sim IG(1, 0.001)        (29)

Also, we assume

Y_t \sim N(\mu, \sigma_t^2)        (30)
5.3. MCMC Algorithm for the SV Model

We use the same MCMC algorithm as described in Boscher, Fronk, and Pigeot (1998). Unlike in GARCH models, the sequence of volatilities is not observable in the SV model; therefore, it has to be estimated along with the other parameters. To estimate the parameters of an SV model, the conditional posteriors have to be constructed. Let us denote \theta = (\xi, \delta, \sigma_\eta^2) and h_{-t} = (h_1, h_2, ..., h_{t-1}, h_{t+1}, ..., h_T). The likelihood for the t-th observation is

f(y_t \mid h_t, \theta) \sim N(0, e^{h_t})        (31)

Using the Markov property, it can be shown that the full conditional posterior distribution of h_t is

p(h_t \mid h_{-t}, \theta, y) \propto p(h_t \mid h_{t-1}, h_{t+1}, \theta, y_t)        (32)

and

p(h_t \mid h_{-t}, \theta, y) \propto p(h_t \mid h_{t-1}, \theta)\, p(h_{t+1} \mid h_t, \theta)\, \ell(y_t \mid h_t, \theta)
  \propto \exp\left\{ -\frac{1}{2}\left( h_t + y_t^2 \exp(-h_t) \right) \right\} \exp\left\{ -\frac{(h_t - \bar{h}_t)^2}{2 v_t^2} \right\}        (33)

where

\bar{h}_t = \frac{\xi(1 - \delta) + \delta (h_{t+1} + h_{t-1})}{1 + \delta^2},   v_t^2 = \frac{\sigma_\eta^2}{1 + \delta^2},   t = 2, ..., T - 1.

For sampling from the full conditionals p(h_t \mid h_{-t}, \theta, y), we apply the Hastings algorithm. A proposal h_t' is made using the transition kernel p(h_t \mid h_{t-1}, h_{t+1}, \theta, y_t) and accepted with acceptance probability

\min\left\{ 1, \frac{f(y_t \mid h_t', \theta)}{f(y_t \mid h_t, \theta)} \right\}.

The initial log-volatility, h_1, can be treated either as constant or as stochastic. We assume h_1 is distributed according to the stationary volatility distribution

h_1 \sim N\left( \frac{\xi}{1 - \delta}, \frac{\sigma_\eta^2}{1 - \delta^2} \right)

Like Kim et al. (1998), we assume an inverse gamma (IG) prior for \sigma_\eta^2, which leads to an IG posterior distribution for \sigma_\eta^2 \mid \xi, \delta, h, y.
Kim et al. (1998) assert a normal prior for the intercept, \xi, in the volatility dynamics equation. The choice of prior distribution for the persistence parameter, \delta, is dictated by the goal of imposing stationarity (i.e., restricting \delta to the interval (-1, 1)). That prior is based on the beta distribution: define \delta^* \sim \mathrm{Beta}(\nu_1, \nu_2) and let \delta = 2\delta^* - 1. It is easily verified that the prior distribution of \delta is

p(\delta) = \frac{\Gamma(\nu_1 + \nu_2)}{2\,\Gamma(\nu_1)\Gamma(\nu_2)} \left( \frac{1 + \delta}{2} \right)^{\nu_1 - 1} \left( \frac{1 - \delta}{2} \right)^{\nu_2 - 1}        (34)

Then the full conditional posterior distribution of \delta is given by

p(\delta \mid \xi, \sigma_\eta^2, h) \propto \left( \frac{1 + \delta}{2} \right)^{\nu_1 - 1} \left( \frac{1 - \delta}{2} \right)^{\nu_2 - 1} \exp\left\{ -\frac{1}{2\sigma_\eta^2} \sum_{t=1}^{T-1} (h_{t+1} - \xi - \delta h_t)^2 \right\}        (35)

The hyper-parameters are set to \nu_1 = 20 and \nu_2 = 1.5. We proceed as in Kim, Shephard, and Chib (1996) and use the Hastings algorithm to sample from the conditional posterior \delta \mid \xi, \sigma_\eta^2, h with a normal distribution N(m_\delta, v_\delta) as proposal density, where

m_\delta = \sum_{t=1}^{T-1} h_{t+1} h_t \Big/ \sum_{t=1}^{T-1} h_t^2,   v_\delta = \sigma_\eta^2 \left( \sum_{t=1}^{T-1} h_t^2 \right)^{-1}        (36)

The main steps of the MCMC algorithm for the SV model are:

1. Initialize h, \xi, \delta, and \sigma_\eta^2 and set l = 0.
2. (a) For t = 1, 2, ..., T, sample a new h_t from h_t \mid h_{-t}, y, \xi, \delta, \sigma_\eta^2.
   (b) Sample a new \sigma_\eta^2 from \sigma_\eta^2 \mid h, y, \xi, \delta.
   (c) Sample a new \xi from \xi \mid h, \delta, \sigma_\eta^2, y.
   (d) Sample a new \delta from \delta \mid h, \xi, \sigma_\eta^2, y.
3. If l < L, set l to l + 1 and continue with step 2, where L is the chosen number of iterations.
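A sketch of step 2(a) above: a single-site Hastings update of each latent log-volatility, using the Gaussian kernel N(h_bar_t, v_t^2) from Eq. (33) as the proposal and accepting with probability min{1, f(y_t | h_t') / f(y_t | h_t)}. The parameter values and the toy data passed in below are illustrative, not draws from an actual chain.

```python
import numpy as np

def update_volatilities(h, y, xi, delta, sigma_eta, rng):
    T = len(h)
    for t in range(1, T - 1):                      # interior log-volatilities
        h_bar = (xi * (1 - delta) + delta * (h[t + 1] + h[t - 1])) / (1 + delta ** 2)
        v2 = sigma_eta ** 2 / (1 + delta ** 2)
        prop = rng.normal(h_bar, np.sqrt(v2))
        # log f(y_t | h) for y_t ~ N(0, e^h), up to an additive constant
        log_f = lambda ht: -0.5 * ht - 0.5 * y[t] ** 2 * np.exp(-ht)
        if np.log(rng.uniform()) <= log_f(prop) - log_f(h[t]):
            h[t] = prop
    return h

rng = np.random.default_rng(7)
y = np.exp(0.5 * rng.normal(size=500)) * rng.standard_normal(500)  # toy return series
h = update_volatilities(np.zeros(500), y, xi=-0.7, delta=0.95, sigma_eta=0.2, rng=rng)
```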
5.4. Priors for SV Model

We use the following priors for the SV model:

h_0 \sim N(\xi, \sigma_\eta^2)        (37)
\sigma_\eta^2 \sim IG(2.5, 0.025)  (IG distribution)        (38)
\xi \sim N(0, 10)        (39)
\delta = 2\delta^* - 1,   -1 < \delta < 1        (40)
\delta^* \sim \mathrm{Beta}(20, 1.5)        (41)

Also, we assume

Y_t \sim N(0, e^{h_t})        (42)
5.5. MCMC Algorithm for SV-STAR

In our application, y, h, and \theta = (\phi, \gamma, c, \sigma_\eta^2) are the vector of observations, the vector of log-volatilities, and the vector of identified unknown parameters, respectively. Following Kim et al. (1998), f(y \mid \theta) = \int f(y \mid h, \theta)\, p(h \mid \theta)\, dh is the likelihood function, and its calculation is intractable. The aim is to sample the augmented posterior density p(h, \theta \mid y), which includes the latent volatilities h as unknown parameters. To sample the posterior density p(h, \theta \mid y), following Jacquier et al. (1994), the full conditional distribution of each component of (h, \theta) is necessary. Let us assume p and d are known. Applying Lubrano's (2000) formulation, we assume the following priors:

p(\gamma) = \frac{1}{1 + \gamma^2},   \gamma > 0

where p(\gamma) is a truncated Cauchy density;

c \sim U[c_1, c_2]

where c has a uniform density, c \in [c_1, c_2], c_1 = \hat{F}(0.15), c_2 = \hat{F}(0.85), and \hat{F} is the empirical cumulative distribution function (CDF) of the time series; and

p(\sigma^2) \propto \frac{1}{\sigma^2}
With the assumption of independence of \gamma, c, \sigma_\eta^2, and \phi^{(1)}, an improper prior for \phi^{(1)},

p(\phi^{(1)}, \gamma, \sigma^2, c) \propto (1 + \gamma^2)^{-1} \sigma^{-2},

and

(\phi^{(2)} \mid \sigma^2, \gamma) \sim N(0, \sigma_\eta^2 e^{\gamma} I_{p+1}),

the joint prior density is

p(\theta) \propto \sigma_\eta^{-3} (1 + \gamma^2)^{-1} \exp\left\{ -\tfrac{1}{2}\left( \gamma + \sigma_\eta^{-2} e^{-\gamma} \phi^{(2)\prime} \phi^{(2)} \right) \right\}        (43)

A full Bayesian model consists of the joint prior distribution of all unknown parameters, here \theta, the unknown states h = (h_{-p+1}, ..., h_0, h_1, ..., h_T), and the likelihood. Bayesian inference is then based on the posterior distribution of the unknowns given the data. By successive conditioning, the prior density is

p(\theta, h) = p(\theta)\, p(h_0, h_{-1}, ..., h_{-p+1} \mid \sigma_\eta^2) \prod_{t=1}^{T} p(h_t \mid h_{t-1}, ..., h_{t-p}, \theta)        (44)

where we assume (h_0, h_{-1}, ..., h_{-p+1} \mid \sigma_\eta^2) \sim N(0, \sigma^2 I_p) and (h_t \mid h_{t-1}, ..., h_{t-p}, \theta) \sim N(W'\phi, \sigma_\eta^2), with \phi = (\phi^{(1)}, \phi^{(2)}). The likelihood is

f(y_1, ..., y_T \mid \theta, h) = \prod_{t=1}^{T} f(y_t \mid h_t)        (45)

where f(y_t \mid h_t) \sim N(0, e^{h_t}). Thus,

f(y_1, ..., y_T \mid \theta, h) = \frac{1}{(2\pi)^{T/2}} \exp\left\{ -\frac{1}{2} \sum_{t=1}^{T} \left( e^{-h_t} y_t^2 + h_t \right) \right\}        (46)
Using Bayes' theorem, the joint posterior distribution of the unknowns given the data is proportional to the prior times the likelihood, that is,

p(\theta, h \mid y_1, ..., y_T) \propto (1 + \gamma^2)^{-1} \sigma^{-(T+p+3)} \exp\left\{ -\frac{1}{2\sigma^2} \left[ \sigma^2 \gamma + e^{-\gamma} \phi^{(2)\prime} \phi^{(2)} + \sum_{t=-p+1}^{0} h_t^2 + \sum_{t=1}^{T} \left( (h_t - W'\phi)^2 + \sigma^2 (e^{-h_t} y_t^2 + h_t) \right) \right] \right\}        (47)

In order to apply MCMC methods, full conditional distributions are necessary; they are as follows:

p(\theta \mid h) \propto \frac{\sigma^{-(T+6)/2}}{1 + \gamma^2} \exp\left\{ -\frac{1}{2\sigma^2} \left[ \gamma\sigma^2 + e^{-\gamma} \phi^{(2)\prime} \phi^{(2)} + \sum_{t=1}^{T} (h_t - W_t'\phi)^2 \right] \right\}        (48)

h_t \mid h_{-t} \sim N(W'\phi, \sigma^2),   h_{-t} = (h_{-p+1}, ..., h_0, h_1, ..., h_{t-1}, h_{t+1}, ..., h_T)        (49)

(\phi \mid h, \gamma, c) \sim N\left\{ \left[ \sum_t W_t W_t' \sigma_\eta^{-2} + M \right]^{-1} \sum_t W_t h_t \sigma_\eta^{-2},\; \left[ \sum_t W_t W_t' \sigma_\eta^{-2} + M \right]^{-1} \right\}        (50)

where M = \mathrm{diag}(0, \sigma_\eta^{-2} e^{-\gamma} I_{p+1}).

(\sigma_\eta^2 \mid h, \phi) \sim IG\left( \frac{T + p + 1}{2},\; \left[ e^{-\gamma} \phi^{(2)\prime}\phi^{(2)} + \sum_t (h_t - W'\phi)^2 \right] \Big/ 2 \right)        (51)

where IG denotes the inverse gamma density function.

f(\gamma, c \mid h, \phi) \propto \frac{\sigma^{-(T+6)/2}}{1 + \gamma^2} \exp\left\{ -\frac{1}{2\sigma_\eta^2} \left[ \gamma\sigma_\eta^2 + e^{-\gamma} \phi^{(2)\prime}\phi^{(2)} + \sum_{t=1}^{T} (h_t - W'\phi)^2 \right] \right\}        (52)

f(h_t \mid h_{-t}, \theta, y) \propto f(y_t \mid h_t) \prod_{i=0}^{p} f(h_{t+i} \mid h_{t+i-1}, ..., h_{t+i-p}, \theta) = g(h_t \mid h_{-t}, \theta, y)        (53)
If p and d are not known, their conditional posterior distributions can be calculated as follows. Let p(d) be the prior probability of d \in \{1, 2, ..., L\}, where L is a known positive integer. The conditional posterior distribution of d is then

p(d \mid h, \theta) \propto f(d \mid h, \theta)\, p(d) \propto \frac{\sigma^{-T/2}}{(2\pi)^{T/2}} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{t=1}^{T} (h_t - W_t'\phi)^2 \right\} p(d)        (54)

Let p(p) be the prior probability of the order p \in \{1, 2, ..., N\}, where N is a known positive integer. Multiplying the prior by the likelihood and integrating out \theta, the conditional posterior distribution of the order k is

p(k \mid h, \gamma, c, d, \sigma_\eta^2) \propto (2\pi)^{(k+1)/2} \left[ \sigma_\eta^2 e^{\gamma} \right]^{(k+1)/2} \left| \sum_{t=1}^{T} W_t W_t' \sigma_\eta^{-2} + M \right|^{-1/2}
  \exp\left\{ -\frac{1}{2} \left[ \sigma_\eta^{-2} h'h - \left( \sum_{t=1}^{T} W_t h_t \sigma_\eta^{-2} \right)' \left( \sum_{t=1}^{T} W_t W_t' \sigma_\eta^{-2} + M \right)^{-1} \left( \sum_{t=1}^{T} W_t h_t \sigma_\eta^{-2} \right) \right] \right\}        (55)

The sampling strategy when p and d are known is as follows:

1. Initialize the volatilities and the parameter vector at some h^{(0)} and \theta^{(0)}, respectively.
2. Simulate the volatility vector h^{(i)} from the full conditional p(h_t \mid h_{-p+1}^{(i)}, ..., h_{t-1}^{(i)}, h_{t+1}^{(i-1)}, ..., h_T^{(i-1)}, \theta^{(i-1)}, y).
3. Sample \phi^{(i)} from p(\phi \mid h^{(i+1)}, \gamma^{(i)}, c^{(i)}, \sigma_\eta^2).
4. Sample \sigma_\eta^2 from p(\sigma_\eta^2 \mid h^{(i+1)}, \phi^{(i+1)}).
5. Sample \gamma and c from p(\gamma, c \mid h^{(i+1)}, \phi^{(i+1)}) using the M-H algorithm.
6. If i \le m, go to step 2, where m is the required number of iterations to generate samples from p(h, \theta \mid y).

If p and d are not known, the following steps can be inserted before the algorithm's final step:

1. Sample d from p(d \mid h^{(i+1)}, \theta^{(i+1)}).
2. Sample k from p(k \mid h^{(i+1)}, \gamma^{(i+1)}, c^{(i+1)}, d^{(i+1)}) using the M-H algorithm.
5.6. Priors for SV-STAR Models

We use the following priors for the SV-STAR model:

p(\gamma) = \frac{1}{1 + \gamma^2},   \gamma > 0

where p(\gamma) is a truncated Cauchy density;

c \sim U[c_1, c_2]

where c has a uniform density, c \in [c_1, c_2], c_1 = \hat{F}(0.15), c_2 = \hat{F}(0.85), and \hat{F} is the empirical CDF of the time series;

h_0 \sim N(\xi, \sigma_\eta^2)        (56)
\sigma_\eta^2 \sim IG(2.5, 0.025)  (IG distribution)        (57)
\xi_i \sim N(0, 10),   i = 1, 2        (58)
\delta_{ij} = 2\delta_{ij}^* - 1,   -1 < \delta_{ij} < 1,   i = 1, 2;  j = 1, ..., p        (59)
\delta_{ij}^* \sim \mathrm{Beta}(20, 1.5),   i = 1, 2;  j = 1, ..., p        (60)

Also, we assume

Y_t \sim N(0, e^{h_t})        (61)
5.7. MCMC Algorithm for MSSV Models

An MCMC algorithm for the MSSV model, based on forward filtering-backward sampling and the smoother algorithm of Shephard (1994), is developed by So, Lam, and Li (1998). For the unknown parameters in the MSSV model, we work with the following prior distributions:

r_1 \sim N(r_{10}, C_{r_1}),   \sigma_\eta^2 \sim IG(n_1, n_2),   \delta \sim TN_{(-1,1)}(d_0, C_d),   h_1 \sim N(h_{11}, C_{h_1})
r_i \sim TN_{(0,\infty)}(r_{i0}, C_{r_i}),   for i = 2, ..., k
p_i \sim \mathrm{Dir}(u_{i0}),   p_i = (p_{i1}, ..., p_{ik}),   for i = 1, ..., k.
The hyper-parameters are chosen to represent fairly non-informative priors, so we set d_0 = h_{11} = 0, C_d = C_{h_1} = 100, n_1 = 2.001, n_2 = 1, r_{i0} = 0, C_{r_i} = 100, and u_{i0} = (0.5, ..., 0.5) for i = 1, ..., k. To estimate the parameters of the MSSV model (11), the MCMC algorithm of So et al. (1998) generates samples from the joint posterior density of H_T, S_T, and \theta, namely p(H_T, S_T, \theta \mid D_T):

p(H_T, S_T, \theta \mid D_T) \propto f(D_T \mid H_T)\, p(H_T \mid S_T, \theta)\, p(S_T \mid \theta)\, p(\theta)        (62)

where \theta = (\xi, \delta, \sigma_\eta^2, P), D_t = (y_1, ..., y_t), H_t = (h_1, ..., h_t), and S_t = (s_1, ..., s_t). In Eq. (62), f(D_T \mid H_T), p(H_T \mid S_T, \theta), p(S_T \mid \theta), and p(\theta) are the likelihood, the conditional posterior density of H_T, the posterior distribution of S_T, and the prior density of \theta, respectively. The terms in Eq. (62) are defined as

f(D_T \mid H_T) \propto \prod_{t=1}^{T} e^{-0.5 h_t} \exp\left( -\frac{1}{2} y_t^2 e^{-h_t} \right)        (63)

p(H_T \mid S_T, \theta) \propto \sigma_\eta^{-T} \sqrt{1 - \delta^2} \exp\left\{ -\frac{1}{2\sigma_\eta^2} \left[ \sum_{t=2}^{T} (h_t - \xi_{s_t} - \delta h_{t-1})^2 + (1 - \delta^2) \left( h_1 - \frac{\xi_{s_1}}{1 - \delta} \right)^2 \right] \right\}        (64)

p(S_T \mid \theta) = \prod_{t=2}^{T} p_{s_{t-1} s_t} \, \pi_{s_1},   \pi_i = \Pr(s_1 = i),   i = 1, ..., k        (65)

and p(\theta) depends on the choice of the prior distribution for the unknown parameter \theta. We assume that

p(\theta) = p(\sigma_\eta^2)\, p(\delta) \prod_{i=1}^{k} \left[ p(r_i)\, p(p_{i1}, ..., p_{ik})\, p(\pi_i) \right]        (66)

In practice, the Gibbs sampling scheme of drawing successively from the full conditional distributions is iterated L = M + N times. The first M burn-in iterations are discarded, and the last N iterates are taken as an approximate sample from the joint posterior distribution. To simulate the latent variable S_T, consider the decomposition of p(S_T \mid D_T, H_T) as

p(S_T \mid D_T, H_T) = p(s_T \mid D_T, H_T) \prod_{t=1}^{T-1} p(s_t \mid D_T, H_T, S^{t+1})        (67)
where S^t = (s_t, ..., s_T). In Eq. (67), p(s_t \mid D_T, H_T, S^{t+1}) is

p(s_t \mid D_T, H_T, S^{t+1}) = p(s_t \mid D_T, H_T, s_{t+1}) \propto p(s_{t+1} \mid s_t, D_T, H_T)\, p(s_t \mid D_T, H_T) = p(s_{t+1} \mid s_t)\, p(s_t \mid H_T, D_T)        (68)

Using the discrete filter developed by Carter and Kohn (1994) and Chib (1996), the desired samples from p(S_T \mid D_T, H_T) are generated via Eqs. (67) and (68). The discrete filter is defined as

p(s_t \mid H_t, s_{t+1}) = \frac{p(s_t \mid H_t)\, p(s_{t+1} \mid s_t)}{p(s_{t+1} \mid H_t)}        (69)

p(s_t \mid H_{t-1}) = \sum_{i=1}^{k} p(s_t \mid s_{t-1}, H_{t-1})\, p(s_{t-1} = i \mid H_{t-1}) = \sum_{i=1}^{k} p(s_t \mid s_{t-1})\, p(s_{t-1} = i \mid H_{t-1})        (70)

p(s_t \mid H_t, D_T) = \frac{p(s_t \mid H_{t-1})\, p(h_t \mid H_{t-1}, s_t)}{\sum_{i=1}^{k} p(s_t \mid H_{t-1})\, p(h_t \mid H_{t-1}, s_t)}        (71)
To simulate the log-volatility variable H_T, So et al. (1998) formulate the MSSV model in partially non-normal state-space form:

\log y_t^2 = h_t + z_t
h_t = \xi_{s_t} + \delta h_{t-1} + \sigma_\eta \eta_t        (72)

where z_t = \log \epsilon_t^2. The trick of the approach is to approximate the \log \chi_1^2 distribution of z_t in Eq. (72) by a mixture of normal distributions; that is,

p(z_t) \approx \sum_{i=1}^{7} q_i\, p(z_t \mid r_t = i)        (73)

where z_t \mid r_t = i \sim N(m_i - 1.2704, \tau_i^2) and q_i = \Pr(r_t = i). Given the r_t's as specified under the normal mixture approximation, the MSSV model in Eq. (72) can be written in ordinary linear state-space form; that is,

\log y_t^2 = h_t + u_t
h_t = \xi_{s_t} + \delta h_{t-1} + \sigma_\eta \eta_t        (74)
where u_t \mid r_t \sim N(m'_t, \tau_t^2), m'_t = m_{r_t} - 1.2704, and \tau_t^2 = \tau_{r_t}^2. Instead of sampling the h_t's directly from f(h_1, ..., h_T \mid r_1, ..., r_T, \theta, S_T, D_T), So et al. (1998) sample the \eta_t's from their multivariate posterior distribution p(\eta_1, ..., \eta_T \mid r_1, ..., r_T, \theta, S_T, H_T, D_T). To simulate the r_t's, the full conditional density of r_t is proportional to

p(\log y_t^2 \mid r_t, h_t)\, p(r_t) \propto \frac{1}{\tau_{r_t}} \exp\left[ -\frac{1}{2\tau_{r_t}^2} \left( \log y_t^2 - h_t - m_{r_t} + 1.2704 \right)^2 \right] p(r_t)        (75)

The r_t's are independent with discrete support and are drawn separately using uniform draws over [0, 1]. The conditional posterior distribution of \sigma_\eta^2 is IG(T/2 + n_1, 1/w), where

w = \frac{1}{2} \left[ \sum_{t=2}^{T} (h_t - \xi_{s_t} - \delta h_{t-1})^2 + (1 - \delta^2) \left( h_1 - \frac{\xi_{s_1}}{1 - \delta} \right)^2 \right] + \frac{1}{n_2}

with n_1 = 2 + 10^{-2} and n_2 = 100. Define the time series z_{it}, i = 1, ..., k, by

z_{i1} = \frac{\sqrt{1 - \delta^2}}{1 - \delta} \left[ (1 - \delta) h_1 - \xi_{s_1} + r_i I_{i1} \right]

and

z_{it} = (h_t - \xi_{s_t} + r_i - \delta h_{t-1}) I_{it},   t = 2, ..., T

Then the posterior density of r_i is N(\tilde{r}_i, \tilde{v}_i), where

\tilde{v}_i = \frac{v_i \sigma_\eta^2 (1 - \delta)}{v_i (1 + \delta) + (1 - \delta)\left[ \sigma_\eta^2 + (T - 1) v_i \right]},
\tilde{r}_i = \frac{\tilde{v}_i}{\sigma_\eta^2 v_i} \left( \bar{r}\sigma_\eta^2 + \frac{\sqrt{1 - \delta^2}}{1 - \delta} z_{i1} + \sum_{t=2}^{T} z_{it} \right)

(with v_i = 10^{12}), and

I_{1t} = 1,   I_{it} = 1 if s_t \ge i and 0 otherwise,   i = 2, ..., k
The conditional posterior distribution of \delta is

p(\delta \mid H_T, S_T, r_1, ..., r_k) \propto Q(\delta) \exp\left\{ -\frac{a}{2\sigma_\eta^2} \left( \delta - \frac{b}{a} \right)^2 \right\} I_\delta

where

Q(\delta) = \sqrt{1 - \delta^2} \exp\left\{ -\frac{1}{2\sigma_\eta^2} (1 - \delta^2) \left( h_1 - \frac{\xi_{s_1}}{1 - \delta} \right)^2 \right\}

a = \sum_{t=2}^{T} h_{t-1}^2 + \frac{\sigma_\eta^2}{\sigma_d^2},   b = \sum_{t=2}^{T} h_{t-1} (h_t - \xi_{s_t}) + \bar{d}\, \frac{\sigma_\eta^2}{\sigma_d^2}

and I_\delta is an indicator function. As p(\delta \mid H_T, S_T, r_1, ..., r_k) does not have a closed form, we sample from it using the M-H algorithm with the truncated N_{[-1,1]}(b/a, \sigma_\eta^2/a) as proposal density (with \delta \sim N(\bar{d}, \sigma_d^2), \sigma_d = 10^6, \bar{d} = 0 as prior). Given i = i^*, with the use of a noninformative prior, the full conditional distribution of (p_{i^*1}, p_{i^*2}, ..., p_{i^*k}) is the Dirichlet distribution \mathrm{Dir}(d_{i^*1}, d_{i^*2}, ..., d_{i^*k}), where

d_{i^*j} = \sum_{t=2}^{T} I(s_{t-1} = i^*)\, I(s_t = j) + 0.5.

Using a noninformative prior, the full conditional distribution of (\pi_1, ..., \pi_k) is the Dirichlet distribution \mathrm{Dir}(0.5 + I(s_1 = 1), 0.5 + I(s_1 = 2), ..., 0.5 + I(s_1 = k)). To allow the components of \theta to vary on the real line, we take the logarithm of \sigma_\eta^2 and of r_i, i = 2, ..., k, and the logit p_{ij}/(1 - p_{ij}). The main steps of the MCMC algorithm for the MSSV model are summarized as follows:

1. Set i = 1.
2. Get starting values for the parameters \theta^{(i)} and the states S_T^{(i)} and H_T^{(i)}.
3. Draw \theta^{(i+1)} \sim p(\theta \mid H_T^{(i)}, S_T^{(i)}, D_T).
4. Draw S_T^{(i+1)} \sim p(S_T \mid \theta^{(i+1)}, H_T^{(i)}, D_T).
5. Draw H_T^{(i+1)} \sim p(H_T \mid \theta^{(i+1)}, S_T^{(i+1)}, D_T).
6. Set i = i + 1.
7. If i < L, go to step 3, where L is the required number of iterations to generate samples from p(H_T, S_T, \theta \mid D_T).
6. METHOD OF THE VOLATILITY FORECASTING

With MCMC, we sample from the posterior distribution of the parameters and estimate them. Based on the estimates obtained from a time series of length T in the l-th iteration, the future volatility at time T + s is generated for s = 1, ..., K. Thus, for models (2)-(4), (9), and (11), the s-day-ahead forecasts are as follows, respectively:

1. Model (2):
\sigma_{T+s}^{2(l)} = \alpha_0^{(l)} + \sum_{i=1}^{p} \alpha_i^{(l)} y_{T+s-i}^2
where \alpha_0^{(l)}, \alpha_i^{(l)}, and \sigma_t^{2(l)} denote the sampled values of the respective parameters in the l-th iteration.

2. Model (3):
\sigma_{T+s}^{2(l)} = \alpha_0^{(l)} + \sum_{i=1}^{p} \alpha_i^{(l)} y_{T+s-i}^2 + \sum_{j=1}^{q} \beta_j^{(l)} \sigma_{T+s-j}^{2(l)}
where \alpha_0^{(l)}, \alpha_i^{(l)}, \beta_j^{(l)}, \sigma_{T+s-j}^{2(l)}, and \sigma_{T+s}^{2(l)} denote the sampled values of the respective parameters in the l-th iteration.

3. Model (4):
h_{T+s}^{(l)} = \xi^{(l)} + \delta^{(l)} h_{T+s-1}^{(l)}
where \xi^{(l)}, \delta^{(l)}, h_{t-1}^{(l)}, and h_{T+s}^{(l)} denote the sampled values of the respective parameters in the l-th iteration.

4. Model (9):
h_{T+s}^{(l)} = X_{T+s} \phi^{(1)(l)} + X_{T+s} \phi^{(2)(l)} F(\gamma, c, h_{T+s-d})
where \phi^{(1)(l)}, \phi^{(2)(l)}, and h_{T+s}^{(l)} denote the sampled values of the respective parameters in the l-th iteration.

5. Model (11):
h_{T+s}^{(l)} = \xi_{s_t}^{(l)} + \delta^{(l)} h_{T+s-1}^{(l)}
where \xi_{s_t}^{(l)}, \delta^{(l)}, h_{t-1}^{(l)}, and h_{T+s}^{(l)} denote the sampled values of the respective parameters in the l-th iteration.

The volatility \sigma_{T+s}^2 is then estimated by

\hat{\sigma}_{T+s}^2 = \frac{1}{L} \sum_{l=1}^{L} \sigma_{T+s}^{2(l)}
After obtaining the daily volatility forecasts across all trading days in each month, the monthly volatility forecasts are calculated using

\hat{\sigma}_{T+1}^2 = \sum_{t=1}^{N_{T+1}} \hat{\sigma}_t^2,   T = 96, ..., 119
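A sketch of the forecasting recipe above for Model (3) with p = q = 1: for each retained MCMC draw, the GARCH recursion is iterated s steps ahead, and the per-draw forecasts are averaged as in the last equation; the daily forecasts are then summed over a month. Replacing future squared returns by the current variance forecast is a common convention that the text does not spell out, and the draws, the 21-day month length, and all numbers below are synthetic placeholders.

```python
import numpy as np

def garch11_forecast(draws, y_T, sigma2_T, mu, steps):
    """draws: array of shape (L, 3) holding (xi, alpha, beta) per MCMC iteration."""
    L = draws.shape[0]
    fc = np.empty((L, steps))
    for l, (xi, alpha, beta) in enumerate(draws):
        e2 = (y_T - mu) ** 2                       # last observed squared shock
        s2 = sigma2_T
        for s in range(steps):
            s2 = xi + alpha * e2 + beta * s2       # Model (3) recursion, one step ahead
            e2 = s2                                # E[(y - mu)^2] = sigma^2 beyond T (assumption)
            fc[l, s] = s2
    return fc.mean(axis=0)                         # hat(sigma)^2_{T+s}, s = 1,...,steps

rng = np.random.default_rng(8)
draws = np.column_stack([rng.normal(1e-6, 1e-7, 500).clip(1e-8),
                         rng.normal(0.10, 0.01, 500).clip(0.01),
                         rng.normal(0.85, 0.01, 500).clip(0.01)])   # toy "posterior draws"
monthly_forecast = garch11_forecast(draws, y_T=0.01, sigma2_T=1e-4, mu=0.0, steps=21).sum()
```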
7. FORECAST EVALUATION MEASURES

The performance of the models is evaluated using in-sample and out-of-sample measures.

7.1. In-Sample Measures

In model choice procedures, approaches similar in some ways to classical model validation are often required, because the canonical Bayesian model choice methods (via Bayes factors) are infeasible or difficult to apply in complex models or large samples (Gelfand & Ghosh, 1998; Carlin & Louis, 2000). The Bayes factor may be sensitive to the information contained in diffuse priors and is not defined for improper priors. Even under proper priors, with sufficiently large sample sizes the Bayes factor tends to attach too little weight to the correct model and too much to a less complex or null model. Hence, some advocate a less formal view of Bayesian model selection based on predictive criteria other than the Bayes factor. Following the original suggestion of Dempster (1974), a model selection criterion in the Bayesian framework was developed by Spiegelhalter et al. (2002). This criterion, named the Deviance Information Criterion (DIC), is a generalization of the well-known Akaike information criterion (AIC). It is preferred here to the Bayesian information criterion (BIC) and AIC because, unlike them, DIC accounts for the effective number of parameters of the model and is applicable to complex hierarchical random-effects models. DIC is defined from the posterior distribution of the classical deviance D(\Theta), as follows:

D(\Theta) = -2 \log f(y \mid \Theta) + 2 \log f(y)        (76)

where y and \Theta are the vectors of observations and parameters, respectively, and

DIC = \bar{D} + p_D        (77)
where \bar{D} = E_{\Theta \mid y}[D] and p_D = E_{\Theta \mid y}[D] - D(E_{\Theta \mid y}[\Theta]) = \bar{D} - D(\bar{\Theta}). DIC can also be presented as

DIC = \hat{D} + 2 p_D        (78)

where \hat{D} = D(\bar{\Theta}). The Predictive Loss Criterion (PLC) was introduced by Gelfand and Ghosh (1998). PLC is obtained by minimizing posterior loss for a given model and is a penalized deviance criterion: it comprises a piece that is a Bayesian deviance measure and a piece that is interpreted as a penalty for model complexity. The penalty function arises without specifying the model dimension or appealing to asymptotic justification. PLC can be presented as

PLC = P_m + G_m        (79)

where P_m is a penalty term and G_m is a goodness-of-fit measure, defined as

P_m = \sum_{i=1}^{T} \mathrm{Var}(Z_i),   G_m = \frac{w}{w + 1} \sum_{i=1}^{T} \left\{ E(Z_i) - y_i \right\}^2,   w > 0        (80)

Z_i is the replicate of observation y_i. Typical values of w at which to compare models might be w = 1, w = 10, and w = 100,000; larger values of w downweight the precision of predictions. For under-fitted models, predictive variances will tend to be large and thus so will P_m; but inflated predictive variances are also expected for over-fitted models, again making P_m large. Hence, models that are too simple will do poorly on both G_m and P_m. As models become increasingly complex, a trade-off is observed: G_m decreases but P_m begins to increase. Eventually, complexity is penalized and a parsimonious choice is encouraged. DIC and PLC are admittedly not formal Bayesian choice criteria, but they are relatively easy to apply over a wide range of models, including non-conjugate and heavily parameterized models; the latter is our case (Congdon, 2003). In comparison, PLC is a predictive check measure based on replicate sampling, while DIC is a likelihood-based measure.
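A sketch of computing DIC and PLC from MCMC output, assuming one already has per-draw deviances, the deviance evaluated at the posterior mean, and per-draw posterior predictive replicates Z_i of each observation; all inputs below are synthetic placeholders.

```python
import numpy as np

def dic(deviance_draws, deviance_at_mean):
    d_bar = deviance_draws.mean()
    p_d = d_bar - deviance_at_mean          # effective number of parameters
    return d_bar + p_d                      # Eq. (77); equivalently D_hat + 2*p_D

def plc(replicates, y, w=100_000.0):
    """replicates: array of shape (L, T) of posterior predictive draws Z_i."""
    P_m = replicates.var(axis=0).sum()                              # penalty term
    G_m = (w / (w + 1.0)) * ((replicates.mean(axis=0) - y) ** 2).sum()   # fit term
    return P_m + G_m

rng = np.random.default_rng(9)
y = rng.normal(size=120)
reps = y + rng.normal(scale=0.5, size=(1000, 120))
print(plc(reps, y), dic(rng.normal(1800, 10, 1000), 1795.0))
```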
7.2. Out-of-Sample Measures

Two measures are used to evaluate forecast accuracy: the RMSE and the LINEX loss function, the latter advocated in Yu (2002). They are defined by

RMSE = \sqrt{ \frac{1}{I} \sum_{i=1}^{I} \left( \hat{\sigma}_i^2 - \sigma_i^2 \right)^2 }

LINEX = \frac{1}{I} \sum_{i=1}^{I} \left[ \exp\left( a (\hat{\sigma}_i^2 - \sigma_i^2) \right) - a (\hat{\sigma}_i^2 - \sigma_i^2) - 1 \right]

where I is the number of forecast months and a is a given parameter. Despite its mathematical simplicity, the RMSE is symmetric, a property that is not very realistic and can be inappropriate in some circumstances (see Brailsford & Faff, 1996). It is well known that the use of symmetric loss functions may be inappropriate in many settings, particularly when positive and negative errors have different consequences. LINEX loss functions, introduced by Varian (1975), are asymmetric; their properties are discussed extensively by Zellner (1986). In the LINEX loss function, the magnitude of a reflects the degree of asymmetry, and positive errors are weighed differently from negative errors. The sign of the shape parameter a reflects the direction of asymmetry: we set a > 0 (a < 0) if overestimation is more (less) serious than underestimation. In this chapter we follow Yu (2002) and choose a = -20, -10, 10, and 20. Obviously, a = 10, 20 penalize overpredictions more heavily, while a = -10, -20 penalize underpredictions more heavily.
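A direct transcription of the two loss functions above; a is the LINEX shape parameter, and the inputs are arrays of forecast and realized monthly volatilities (the values below are illustrative).

```python
import numpy as np

def rmse(sig2_hat, sig2):
    return np.sqrt(np.mean((sig2_hat - sig2) ** 2))

def linex(sig2_hat, sig2, a):
    d = sig2_hat - sig2
    return np.mean(np.exp(a * d) - a * d - 1.0)

sig2_hat = np.array([4.2e-4, 5.1e-4, 3.9e-4])   # illustrative forecasts
sig2     = np.array([4.0e-4, 5.6e-4, 3.5e-4])   # illustrative realized volatilities
print(rmse(sig2_hat, sig2), linex(sig2_hat, sig2, a=20))
```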
8. EMPIRICAL RESULTS

In the following examples, the smooth transition function is the logistic function; the exponential function could easily be used instead. For convergence control, as a rule of thumb, the Monte Carlo error (MC error) for each parameter of interest should be less than 5% of the sample standard deviation.
8.1. In-Sample Fit

Table 2 presents the values of the model fit criteria DIC (Spiegelhalter et al., 2002) and PLC (Gelfand & Ghosh, 1998) for 14 competing models (we set w to 100,000). The DIC criterion shows that the SV models have the best fit.
Table 2. DIC and PLC Criterion for Volatility Models.

Model               DIC     PLC
ARCH(1)             1808    15.68
ARCH(2)             1807    15.99
ARCH(3)             1809    15.78
GARCH(1,1)          1805    15.70
GARCH(1,2)          1810    15.82
GARCH(2,2)          1810    15.87
GARCH(2,3)          1812    16.10
GARCH(3,3)          1815    16.20
SV-AR(1)            1805    15.75
SV-LSTAR(1), d=1    1798    15.71
SV-LSTAR(1), d=2    1795    15.70
SV-LSTAR(2), d=1    1805    15.73
SV-LSTAR(2), d=2    1817    15.75
SV-MSAR(1), k=2     1803    15.78
Moreover, within the SV family, SV-LSTAR(1) with d=2 has the lowest DIC value and hence the best in-sample fit. The superior in-sample fit of the SV family is broadly supported by the PLC values as well, although the ARCH(1) model has the lowest PLC value. We choose the 10 models with the lowest DIC and PLC values from Table 2 for the out-of-sample study.
8.2. Out-of-Sample Fit

The out-of-sample results are presented in Tables 3 and 4. Table 3 reports the value and ranking of the competing models under the RMSE, while Table 4 presents their values under the four LINEX loss functions. The RMSE statistic favors the SV family, with the SV-LSTAR(2), d=2 model among the best; Table 3 also shows the poor performance of the ARCH and GARCH models. In Table 4, the same models are evaluated under asymmetric loss, using four LINEX loss functions (a = -20, -10, 10, and 20). LINEX with a = -10 and a = -20 identifies the SV-LSTAR(1), d=2 model as the best performer, while the ARCH and GARCH models provide the worst forecasts; LINEX with a = 10 and a = 20 also ranks the SV-LSTAR family first.
Table 3. RMSE Criterion for Selected Volatility Models.

Model               RMSE        Rank
ARCH(1)             0.0060588   7
ARCH(2)             0.0060651   9
GARCH(1,1)          0.0060598   8
GARCH(1,2)          0.0061005   10
GARCH(2,2)          0.0061107   11
SV-AR(1)            0.0059815   6
SV-LSTAR(1), d=1    0.0051432   4
SV-LSTAR(1), d=2    0.0045760   1
SV-LSTAR(2), d=1    0.0048911   2
SV-LSTAR(2), d=2    0.0049951   3
SV-MSAR(1), k=2     0.0059415   5

Table 4. LINEX Criterion for Selected Volatility Models.

Model               a = -20    a = -10    a = 10     a = 20
ARCH(1)             4.87254    0.914421   1.425419   5.205047
ARCH(2)             4.99245    0.924153   1.332167   5.246215
GARCH(1,1)          4.66458    0.899152   1.328796   5.354812
GARCH(1,2)          4.66381    0.899123   1.352265   5.491561
GARCH(2,2)          5.21214    0.898822   1.256482   5.482564
SV-AR(1)            4.25454    0.885123   1.225894   5.145892
SV-LSTAR(1), d=1    4.21254    0.884521   1.215268   5.145467
SV-LSTAR(1), d=2    4.12112    0.884150   1.099282   5.002143
SV-LSTAR(2), d=1    4.35412    0.889893   1.154621   5.110056
SV-LSTAR(2), d=2    4.23415    0.885521   1.112451   5.125489
SV-MSAR(1), k=2     4.21211    0.88504    1.225774   5.145788
9. CONCLUSION

In a Bayesian framework using MCMC methods, we fitted five types of volatility models, namely ARCH, GARCH, SV-AR, SV-STAR, and SV-MSAR, to the TSE data set. Applying the RMSE and LINEX criteria, the results show that the SV models perform better than the ARCH and GARCH models. Also, among the SV models, SV-STAR(1) with d = 2 yields the smallest RMSE and LINEX values.
NOTE

1. There are simpler methods, such as maximum likelihood estimation (MLE), that can be used to estimate the parameters of ARCH and GARCH models; see Hamilton (1994) for more details.
ACKNOWLEDGMENTS

The author would like to thank the editors and the anonymous referees for helpful comments and suggestions.
REFERENCES

Azizi, F. (2004). Estimating the relation between the rate of inflation and returns on stocks in Tehran Stock Exchange (in Farsi, with English summary). Quarterly Journal of the Economic Research, 11-12, 17-30.
Black, F., & Scholes, M. (1973). The pricing of options and corporate liabilities. Journal of Political Economy, 81, 637-659.
Bollerslev, T. (1986). Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics, 31, 307-327.
Boscher, H., Fronk, E.-M., & Pigeot, I. (1998). Estimation of the stochastic volatility by Markov chain Monte Carlo. In: R. Galata & H. Küchenhoff (Eds.), Festschrift zum 65. Geburtstag von Prof. Dr. Hans Schneeweiß: Econometrics in theory and practice (pp. 189-203). Heidelberg: Physica-Verlag.
Brailsford, T. J., & Faff, R. W. (1996). An evaluation of volatility forecasting techniques. Journal of Banking and Finance, 20, 419-438.
Brooks, C. (1998). Predicting stock market volatility: Can market volume help? Journal of Forecasting, 17(1), 59-80.
Carlin, B., & Louis, T. (2000). Bayes and empirical Bayes methods for data analysis (2nd ed.). Texts in Statistical Science. Boca Raton: Chapman and Hall/CRC.
Carter, C. K., & Kohn, R. (1994). On Gibbs sampling for state space models. Biometrika, 81, 541-553.
Chib, S. (1996). Calculating posterior distributions and modal estimates in Markov mixture models. Journal of Econometrics, 75, 79-97.
Chib, S., & Greenberg, E. (1995). Understanding the Metropolis-Hastings algorithm. The American Statistician, 49, 327-335.
Chib, S., & Greenberg, E. (1996). Markov chain Monte Carlo simulation methods in econometrics. Econometric Theory, 12, 409-431.
Congdon, P. (2003). Applied Bayesian modelling. Chichester, UK: Wiley.
Dempster, A. P. (1974). The direct use of likelihood for significance testing. In Proceedings of the Conference on Fundamental Questions in Statistical Inference (pp. 335-352). Department of Theoretical Statistics, University of Aarhus.
Doornik, J. A. (1996). Ox: Object oriented matrix programming, 1.10. London: Chapman & Hall.
Engle, R. F. (1982). Autoregressive conditional heteroskedasticity with estimates of the variance of U.K. inflation. Econometrica, 50, 987-1008.
Gelfand, A., & Ghosh, S. (1998). Model choice: A minimum posterior predictive loss approach. Biometrika, 85(1), 1-11.
Gelfand, A. E., & Smith, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association, 85, 398-409.
Geman, S., & Geman, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721-741.
Gerlach, R., & Tuyl, F. (2006). MCMC methods for comparing stochastic volatility and GARCH models. International Journal of Forecasting, 22, 91-107.
Gilks, W., & Wild, P. (1992). Adaptive rejection sampling for Gibbs sampling. Applied Statistics, 41(2), 337-348.
Hamilton, J. D., & Susmel, R. (1994). Autoregressive conditional heteroskedasticity and changes in regime. Journal of Econometrics, 64, 307-333.
Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97-109.
Jacquier, E., Polson, N. G., & Rossi, P. (1994). Bayesian analysis of stochastic volatility models. Journal of Business & Economic Statistics, 12, 371-417.
Kim, S., & Shephard, N. (1994). Stochastic volatility: Likelihood inference and comparison with ARCH models. Economics Papers 3, Economics Group, Nuffield College, University of Oxford.
Kim, S., Shephard, N., & Chib, S. (1998). Stochastic volatility: Likelihood inference and comparison with ARCH models. Review of Economic Studies, 65, 361-393.
Kwiatkowski, D., Phillips, P. C. B., Schmidt, P., & Shin, Y. (1992). Testing the null hypothesis of stationarity against the alternative of a unit root. Journal of Econometrics, 54, 159-178.
Merton, R. (1980). On estimating the expected return on the market: An exploratory investigation. Journal of Financial Economics, 8, 323-361.
Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., & Teller, E. (1953). Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21(6), 1087-1092.
Najarzadeh, R., & Zivdari, M. (2006). An empirical investigation of trading volume and return of the Teheran Stock Exchange. Quarterly Journal of the Economic Research, 2, 59-80.
Nakatsuma, T. (1998). A Markov-chain sampling algorithm for GARCH models. Studies in Nonlinear Dynamics and Econometrics, 3(2), 107-117.
Perry, P. (1982). The time-variance relationship of security returns: Implications for the return-generating stochastic process. Journal of Finance, 37, 857-870.
Rachev, S., Hsu, J., Bagasheva, B., & Fabozzi, F. (2008). Bayesian methods in finance. Hoboken, NJ: Wiley.
Shephard, N. (1994). Partial non-Gaussian state space. Biometrika, 81, 115-131.
Shephard, N. (1996). Statistical aspects of ARCH and stochastic volatility. In: D. R. Cox, O. E. Barndorff-Nielsen & D. V. Hinkley (Eds.), Time series models in econometrics, finance and other fields (pp. 1-67). London: Chapman and Hall.
So, M. K. P., Lam, K., & Li, W. K. (1998). A stochastic volatility model with Markov switching. Journal of Business and Economic Statistics, 16, 244-253.
Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & van der Linde, A. (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society, Series B, 64, 583-639.
Tanner, M., & Wong, W. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 81, 82-86.
Taylor, S. J. (1986). Modeling financial time series. Chichester: Wiley.
Varian, H. R. (1975). A Bayesian approach to real estate assessment. In: S. E. Fienberg & A. Zellner (Eds.), Studies in Bayesian econometrics and statistics in honor of Leonard J. Savage (pp. 195-208). Amsterdam: North-Holland.
Yu, J. (2002). Forecasting volatility in the New Zealand stock market.
Zellner, A. (1986). Bayesian estimation and prediction using asymmetric loss functions. Journal of the American Statistical Association, 81, 446-451.
Zivot, E., & Wang, J. (2006). Modeling financial time series with S-PLUS. Berlin Heidelberg: Springer-Verlag.