Institute of Mathematical Statistics LECTURE NOTES-MONOGRAPH SERIES
Selected Proceedings of the Symposium on Inference for Stochastic Processes
I.V. Basawa, C.C. Heyde and R.L. Taylor, Editors
Volume 37
Institute of Mathematical Statistics Beachwood, Ohio
Institute of Mathematical Statistics Lecture Notes-Monograph Series Series Editor: Joel Greenhouse
The production of the IMS Lecture Notes-Monograph Series is managed by the IMS Business Office: Julia A. Norton, IMS Treasurer, and Elyse Gustafson, IMS Executive Director.
Library of Congress Control Number: 2001 135427 International Standard Book Number 0-940600-51-X Copyright © 2001 Institute of Mathematical Statistics All rights reserved Printed in the United States of America
Table of Contents

Section 1: Introduction ... 1
An Overview of the Symposium, I. V. Basawa, C. C. Heyde and R. L. Taylor ... 3
Shifting Paradigms in Inference, C. C. Heyde ... 9

Section 2: Stochastic Models: General ... 23
Modelling by Levy Processes, Ole E. Barndorff-Nielsen ... 25
Extreme Values for a Class of Shot-Noise Processes, W. P. McCormick and Lynne Seymour ... 33
Statistical Inference for Stochastic Partial Differential Equations, B. L. S. Prakasa Rao ... 47
Fixed Design Regression Under Association, George G. Roussas ... 71
Dependent Bootstrap Confidence Intervals, Wendy D. Smith and Robert L. Taylor ... 91

Section 3: Time Series ... 109
Kolmogorov-Smirnov Tests for AR Models Based on Autoregression Rank Scores, Faouzi El Bantli and Marc Hallin ... 111
Estimation of the Long-Memory Parameter: A Review of Recent Developments and an Extension, Rajendra J. Bhansali and Piotr S. Kokoszka ... 125
Stability of Nonlinear Time Series: What Does Noise Have to Do With It?, Daren B. H. Cline and Huay-min H. Pu ... 151

Section 4: Population Genetics ... 171
Testing Neutrality of mtDNA Using Multigeneration Cytonuclear Data, Susmita Datta ... 173
Inference on Random Coefficient Models for Haplotype Effects in Dynamic Mutation Using MCMC, Richard M. Huggins, Guoqi Qian and Danuta Z. Loesch ... 185

Section 5: Semiparametric Inference ... 203
Semiparametric Inference for Synchronization of Population Cycles, P. E. Greenwood and D. T. Haydon ... 205
Plug-In Estimators in Semiparametric Stochastic Process Models, Ursula U. Müller, Anton Schick and Wolfgang Wefelmeyer ... 213

Section 6: Estimating Functions ... 235
Nuisance Parameter Elimination and Optimal Estimating Functions, T. M. Durairajan and Martin L. William ... 237
Optimal Estimating Equations for Mixed Effects Models with Dependent Observations, Jeong-gun Park and I. V. Basawa ... 247

Section 7: Spatial Models ... 269
Reconstruction of a Stationary Spatial Process from a Systematic Sampling, Karim Benhenni ... 271
Estimating the Variance of the Maximum Pseudo-Likelihood Estimator, Lynne Seymour ... 281
A Review of Inhomogeneous Markov Point Processes, Eva B. Vedel Jensen and Linda Stougaard Nielsen ... 297

Section 8: Perfect Simulation ... 319
Perfect Sampling for Posterior Landmark Distributions with an Application to the Detection of Disease Clusters, Marc A. Loizeaux and Ian W. McKeague ... 321
A Review of Perfect Simulation in Stochastic Geometry, Jesper Møller ... 333
LIST OF CONTRIBUTORS

Bantli, Faouzi El: Université Libre de Bruxelles, Belgium
Barndorff-Nielsen, Ole E.: University of Aarhus, Denmark
Basawa, Ishwar: University of Georgia
Benhenni, Karim: Université Pierre Mendès-France
Bhansali, Rajendra: University of Liverpool, U.K.
Cline, Daren B. H.: Texas A&M University
Datta, Susmita: Georgia State University
Durairajan, T. M.: Loyola College, India
Greenwood, Priscilla: University of British Columbia & Arizona State University
Hallin, Marc: Université Libre de Bruxelles, Belgium
Haydon, D. T.: Centre for Tropical Veterinary Medicine, U.K.
Heyde, C. C.: Columbia University & Australian National University
Huggins, Richard: La Trobe University, Australia
Jensen, Eva B. Vedel: University of Aarhus, Denmark
Kokoszka, Piotr S.: University of Liverpool, U.K.
Loesch, Danuta Z.: La Trobe University, Australia
Loizeaux, Marc A.: Florida State University
McCormick, William: University of Georgia
McKeague, Ian W.: Florida State University
Møller, Jesper: Aalborg University, Denmark
Müller, Ursula U.: Universität Bremen, Germany
Nielsen, Linda Stougaard: University of Aarhus, Denmark
Park, Jeong-gun: Harvard University
Pu, Huay-min H.: Texas A&M University
Qian, Guoqi: La Trobe University, Australia
Rao, B. L. S. Prakasa: Indian Statistical Institute
Roussas, George: University of California at Davis
Schick, Anton: Binghamton University
Seymour, Lynne: University of Georgia
Smith, Wendy D.: U.S. Census Bureau
Taylor, Robert L.: University of Georgia
Wefelmeyer, Wolfgang: Universität Siegen, Germany
William, Martin L.: Loyola College, India
AN OVERVIEW OF THE SYMPOSIUM ON INFERENCE FOR STOCHASTIC PROCESSES

I. V. Basawa, University of Georgia
C. C. Heyde, Columbia University and Australian National University
R. L. Taylor, University of Georgia

Abstract

Some historical snapshots of developments in the general area of stochastic processes and statistical inference are given. An overview of the papers appearing in this volume is then presented.
1 Introduction
The Symposium on Inference for Stochastic Processes was held at the University of Georgia from May 10, 2000 to May 12, 2000. The Symposium was cosponsored by the Institute of Mathematical Statistics and was a satellite meeting of the Fifth World Congress of the Bernoulli Society. The major focus of the symposium was to provide a forum for the interchange of information on inference for stochastic processes and related applications. Partial funding by the University of Georgia's "State-of-the-Art" Conference Program, a National Security Agency Grant and a National Science Foundation Grant contributed to the success of the Symposium and is gratefully acknowledged. The Symposium attracted 79 registered participants from several countries including Australia, Belgium, Canada, Denmark, France, Germany, India, Iran, Portugal, Sweden, Taiwan, the United Kingdom and the United States of America. The program consisted of 17 sessions with 49 speakers. This volume represents selected proceedings of the Symposium.
2 Some Historical Perspectives
Before describing the current research and applications of inference for stochastic processes, it is important to consider some historical developments in this subject area. Our most influential ancestors in the subject were very strongly motivated by the scientific needs of their times. They did applied work of the highest quality for which they developed theoretical tools as necessary.

To illustrate, we might start with what is arguably some prehistory, namely with Daniel Bernoulli. He was one of the famous Swiss Bernoulli family and in 1760 he produced the first epidemic model - a deterministic model for tracking the spread of smallpox. This disease was still a major concern at that time and variolation, the first attempt at vaccination, was new and very topical. Bernoulli was a strong advocate of its advantages, for which he provided important quantitative support. Of course, the first stochastic epidemic model came much later, what is now called the chain binomial model, first published by the Russian physician Pyotr En'ko in 1889. He had measles data, and he did careful estimation and fitting.

Next let us move to France in the early 1840's. At the time there was considerable interest in the demographics of aristocratic families. Benoiston de Chateauneuf did considerable statistical work in which he reached conclusions such as, on average, an aristocratic family name dies out in around 300 years. There was interest and concern in these matters, inheritance being a key factor in the operation of society. This motivated I.-J. Bienayme, then a civil servant, to model the extinction of family names. He developed the branching process model, and was able to state the correct form of the criticality theorem. Galton and Watson took up the same subject in England in 1873-4, but they did not know of Bienayme's earlier work of 1845, and they did not find the correct form of the criticality theorem. Unfortunately, it is only their names that are generally remembered for the work.

From the start of the 20th Century the practical use of stochastic processes and associated inference began to flourish. In 1900 Louis Bachelier published the theory of Brownian motion, in his thesis, five years before the work of Einstein on the subject. Bachelier's motivation was the modelling of the stock market. But unfortunately his work was distinctly unappreciated till late in his life. Indeed, he did not manage to obtain a tenured university appointment until he was age 47. It was actually Kolmogorov in 1931 who first recognized the importance of his work, but his fame was established only in the 1960's and 1970's. Now we even have a Bachelier Society.

The next of our ancestors from the turn of the century who can be usefully mentioned is the Swede Filip Lundberg. To him we owe the basic theory and practice of collective risk as it is applied by insurance companies, his contributions beginning with his thesis of 1903. His starting point was the
description of the total claim by what we would now describe as a compound Poisson process. He went on to spend his working life in the most senior and influential positions in the insurance industry in Sweden.

One exception in terms of practical motivation was the work of Markov and his contributions on Markov chains which date from 1906. Markov's motivation was actually to strike a blow on behalf of the St. Petersburg School (founded by Chebyshev) against Nekrasov, then leader of the rival Moscow School. Nekrasov had unwisely asserted that independence was a necessary condition for the weak law of large numbers, and Markov pounced on this, developing the idea of chain dependence to show that Nekrasov was wrong. Philosophical and religious differences underpinned a bitter enmity between them. But practical use of the Markov chain idea had significantly predated Markov. In 1846 Quetelet used a two-state Markov chain to model the weather type from one day to the next. He noted from the available data that independence did not hold, rain being more likely to be followed by rain etc.

Much of our modern queueing theory comes from the Dane Agner Erlang. He worked for the Copenhagen Telephone Company from 1908 until his death in 1929, and he was involved in all facets of queueing performance. Indeed, he is reputed to have been regularly seen walking the streets of Copenhagen accompanied by a workman with a ladder. He was hunting for network loss sources.

Another very practical man was the British engineer Harold Hurst. It is to him that we owe the ideas of long-range dependence. These ideas were developed in the 1940's and 1950's when he played a key role in the design of the Aswan High Dam on the river Nile. Hurst had abundant data, and his numerical work convinced him that standard ARMA time series models could not match the data. Much new theory, largely developed by Benoit Mandelbrot, came out of Hurst's pioneering work.

These are some of our innovative predecessors who did much to influence the development of stochastic processes and its associated inference. They had good models and genuine data. They did first-rate science and they are good role models for us. More details can be found in articles in Heyde and Seneta (2001).

The mathematical foundations for the modern theory of statistical inference were laid by R. A. Fisher in the 1920's. Neyman and Pearson developed the theory of hypothesis testing in the 1930's. Subsequently, the work by Wald, LeCam, C. R. Rao and others unified various developments in the theory of inference. Most of the pioneering work on inference was, however, devoted to the classical framework of independent and identically distributed observations. Grenander (1950) addressed the problem of extending the classical inference theory to stochastic processes. See also Grenander (1981).
Billingsley (1961) discussed inference problems for Markov processes. Hall and Heyde (1980) gave a general treatment of likelihood based inference using martingales. The monograph by Basawa and Prakasa Rao (1980) gave a comprehensive survey of the general area of inference for stochastic processes. See also Basawa (2001) for a recent review on this topic. The recent monograph by M. M. Rao (2000) gives a rigorous mathematical treatment of the theory of inference for stochastic processes.
3 An Overview
This volume, containing the Selected Proceedings of the Symposium on Inference for Stochastic Processes, includes twenty refereed articles in addition to this overview. These papers are grouped into eight sections. The introductory Section 1 contains the overview article and Chris Heyde's foundational paper on shifting paradigms in inference.

Section 2 has five papers on applications to various stochastic models: Barndorff-Nielsen discusses applications of Levy processes in finance; McCormick and Seymour study extreme value results for a shot-noise model; inference problems for stochastic partial differential equations are discussed by Prakasa Rao; Roussas considers design problems under association and, finally, Smith and Taylor present their results on bootstrap confidence intervals for dependent data.

Section 3 contains three papers on time series: El Bantli and Hallin discuss Kolmogorov-Smirnov tests for autoregressive models based on ranks; Bhansali and Kokoszka review long-memory parameter estimation and discuss an extension; the problem of stability of nonlinear time series is studied by Cline and Pu.

Two papers on population genetics are included in Section 4: Susmita Datta studies the problem of testing neutrality using multigeneration data; Huggins, Qian and Loesch discuss inference problems for random coefficient models for haplotype effects.

Two papers on semiparametric models are included in Section 5: Greenwood and Haydon discuss semiparametric inference for synchronization of population cycles; Müller, Schick and Wefelmeyer study estimation for semiparametric stochastic processes.

Section 6 contains two papers on optimal estimating functions: Durairajan and William discuss nuisance parameter elimination in optimal estimating functions; Park and Basawa study optimal estimating functions for mixed effects nonlinear models with dependent observations.

Section 7 contains three papers on spatial models: Benhenni discusses systematic sampling from a stationary spatial process; Seymour studies variance estimation for the pseudo-likelihood estimator; Vedel Jensen and Nielsen review inhomogeneous spatial Markov point processes.

Finally, Section 8 contains two papers on perfect simulation: Loizeaux and McKeague
discuss perfect sampling from posterior distributions in a spatial model; Møller reviews perfect simulation in stochastic geometry.

Markov Chain Monte Carlo (MCMC) methods and, more recently, perfect simulation techniques provide a powerful link between stochastic processes and inference. The papers in Sections 7 and 8 illustrate the use of these methods.

Acknowledgements

The editors of these selected proceedings are very grateful to the many referees who carefully reviewed all papers which were submitted for publication in the proceedings. Also, the cooperation of individual authors in producing this volume in the IMS Lecture Notes Series is greatly appreciated. Special thanks go to Connie Durden for the preparation of this Volume. Ms. Durden worked tirelessly and patiently with the various authors in securing software files of their papers and providing uniformity of margins, spacing and similar editorial details, which greatly enhanced the general appearance of this Volume.
References

Basawa, I. V. (2001). Inference in Stochastic Processes. In Handbook of Statistics, Vol. 19, 55-77, Eds.: D. N. Shanbhag and C. R. Rao. Elsevier, Amsterdam.

Basawa, I. V. and B. L. S. Prakasa Rao (1980). Statistical Inference for Stochastic Processes, Academic Press, London.

Billingsley, P. (1961). Statistical Inference for Markov Processes, Univ. Chicago Press, Chicago.

Grenander, U. (1950). Stochastic Processes and Statistical Inference. Arkiv för Matematik 1, 195-277.

Grenander, U. (1981). Abstract Inference, Wiley, New York.

Hall, P. and Heyde, C. C. (1980). Martingale Limit Theory and its Application. Academic Press, New York.

Heyde, C. C. and E. Seneta (Eds.) (2001). Statisticians of the Centuries, Springer, New York.

Rao, M. M. (2000). Stochastic Processes: Inference Theory, Kluwer, Boston.
SHIFTING PARADIGMS IN INFERENCE

C.C. Heyde
Australian National University and Columbia University

Abstract

Some personal perspectives on changing paradigms in inference are presented. The topics discussed include the changes from independence to dependence, from estimators to estimating functions and from ad hoc methods to Fisher information based methods. Recent trends in time series, general theory of inference, estimating functions and information based techniques are discussed.
1 Introduction
The advent of the new millennium gives us a particularly good excuse to take stock of changing paradigms in inference, or more particularly inference for stochastic processes. Our subject can reasonably be thought of as roughly a century old, and it was strongly practical and model based from the outset. Our distinguished ancestors, such as Pyotr En'ko in 1889 with the chain binomial model for epidemics, Louis Bachelier in 1900 with Brownian motion and the modelling of the sharemarket and Filip Lundberg in 1903 with collective risk for insurance application, were very much motivated by the scientific needs of their times.

Any list of paradigms is, of course, rather subjective. The ones that I will treat in this paper are undoubtedly important, and are ones which have influenced me personally. But there are arguably others, and certainly one other that I would have liked to discuss. That is the advent of the computer as a tool for model exploration and simulation. I have learned a lot from the newly available technologies. A lot about bad models, poor behaviour of limit theorems, slow rates of convergence etc. But the constraints of the occasion have precluded discussion of these issues. So let me pass to the topics which I will discuss. These are the changes:

1. Independence => Dependence.
2. Estimators => Estimating Functions.
3. Ad hoc methods => Fisher Information based methods.
In connection with the first of these, I should remark that I am not seeking to minimize the role of independence in stochastic models. Regeneration in stochastic models is a key phenomenon and our capacity to simulate is strongly tied to independence. My focus is on what we can do in inference.
2 Independence to Dependence

The history of Inference for Stochastic Processes has two major strands of development: (1) General theory of inference. (2) Inference for time series. We shall examine the impediments to the development of each of these.

2.1 Time Series
Time series as an autonomous area of study basically dates from the 1920s. There were many contributors but I will particularly mention the name of George Udny Yule who wrote papers in 1926, 1927 which laid the foundation of autoregressive process theory. The idea of a linear model in terms of a finite past of the process together with a stochastic disturbance was natural and immediately successful in a wide range of problems. The subject evolved into the autoregressive moving average (ARMA) and autoregressive integrated moving average (ARIMA) forms and entered the modern era in large part through the computational implementation of Box and Jenkins (1970).

From the outset there was a special focus on the case of Gaussian time series. Indeed, the theory was developed for second order stationary Gaussian processes, which are fully characterized by their means and covariances $\{\gamma(k) = \mathrm{cov}(X_n, X_{n+k})\}$. Also, maximum likelihood (ML) and least squares (LS) estimation procedures were used from the outset, and continue to be used. Amongst the most influential early results was the following:

Wold Decomposition (1938). If $\{X_n\}$ is a purely non-deterministic (physically realizable) stationary process with zero mean and finite variance it is representable in the form
$$X_n = \sum_{j=0}^{\infty} a_j e_{n-j},$$
where $\{e_j\}$ is a stationary, uncorrelated, zero mean process and $\sum_{j=0}^{\infty} a_j^2 < \infty$. If the X's are normally distributed the e's can be taken as independent and identically distributed (iid) and normally distributed.
This theorem implicitly suggested that all second order stationary processes could be reasonably approximated by ARMA processes of sufficiently high order, a suggestion which was further reinforced by results such as the following:

Theorem. If $\gamma(\cdot)$ is any covariance function such that $\gamma(k) \to 0$ as $k \to \infty$, then there is a causal AR(K) process whose autocovariance function at lags $0, 1, \ldots, K$ coincides with $\gamma(j)$, $j = 0, 1, \ldots, K$.

The results lulled users into a false sense of security concerning the breadth of applicability of ARIMA models, and it was some decades before the need to deal with long-range dependent processes, and various nonlinear phenomena, could no longer be denied. In the meantime, the subject proceeded via development of inference for the Gaussian case and then the Gaussian assumption was dropped and replaced by that of iid innovations e. There was complete reliance on the Strong Law of Large Numbers (SLLN) and Lindeberg-Feller Central Limit Theorem (CLT) for sums of independent random variables to develop the consistency and asymptotic normality results which underpinned a useful inferential theory. This is where the subject stood at the end of the 1960s.

The 1970s saw the development of limit theory, in particular the SLLN and CLT, for martingales, subsuming the earlier results for independent random variables. With the martingale theory it became possible to treat issues such as when a linear model is appropriate. There is, indeed, a simple answer to this question. Consider a stationary finite variance process $\{X_n\}$ and write
$$e_j = X_j - E(X_j \mid \mathcal{F}_{j-1}),$$
where the $e_j$ are the prediction errors, $\{\mathcal{F}_n\}$ are the past history σ-fields and $E(X_j \mid \mathcal{F}_{j-1})$ is the best one-step predictor of $X_j$. Then, it turns out that the best linear predictor is the best predictor if and only if the e's are martingale
differences (Hannan and Heyde (1972)).

In recent times there has been an increasing realization of the role of non-linear models, but much of the development has been coming from other disciplines, such as physics (see for example Kantz and Schreiber (1997)). Dynamical systems, often with striking associated properties such as chaos, have attracted much attention and proponents of deterministic theory have thrown out a challenge to the stochastic community to which there has been all too little in the way of a reasoned response.

2.2 General Theory of Inference
This was developed in a setting of a random sample of iid rv (and is still typically taught in that setting!). Much of the theory rests on asymptotic
normality (or mixed normality) of estimators, and when it was developed there were nice CLT results only for independent rv. The first attempts at a discussion of inference for stochastic processes in a general setting came only in the 1960s and 1970s. This can be seen in the books of Billingsley (1961) and Roussas (1972) in a setting of stationary ergodic Markov chains. Soon thereafter, general central limit results for martingales became available subsuming independence results such as the Lindeberg-Feller CLT. Only then did it become possible to give a very general discussion of inference for stochastic processes in the traditional likelihood based setting.

The basic framework is as follows. We have a sample $\{X_1, X_2, \ldots, X_n\}$ whose distribution depends on a parameter θ (which we take as scalar for convenience). The likelihood $L_n(\theta)$ is assumed to be differentiable with respect to θ. Then, ordinarily, the score function
$$U_n(\theta) = \frac{\partial \log L_n(\theta)}{\partial \theta}$$
is a martingale. The whole classical theory of the maximum likelihood estimator (MLE) carries over in its entirety to the general setting. One uses martingale limit theory on $U_n(\theta)$ and local linearity with a Taylor expansion in the neighbourhood of the MLE. Details are given in Hall and Heyde (1980, Chapter 6). The essence of the results is as follows. If
$$I_n(\theta) = \sum_{i=1}^{n} E(u_i^2 \mid \mathcal{F}_{i-1}),$$
where the $u_i$ are the martingale differences of the score, is the generalized Fisher information, and if $I_n(\theta) \to \infty$ and $I_n(\theta)/EI_n(\theta) \stackrel{up}{\longrightarrow} \eta^2(\theta)$ for some $\eta(\theta) > 0$ a.s., 'up' denoting uniform convergence in probability, then with little else one has optimality of the MLE in terms of producing minimum size asymptotic confidence intervals for θ and the classical theory is nicely subsumed.

Martingale theory provides the natural setting but - 20 years later - these things have regrettably not yet become part of the statistical consciousness. The consequence is that many contemporary developments in inference, for example on the "general" linear model, missing data and the EM algorithm, multiple roots of the score function, ... - are carried out in an independence setting while a much more general treatment is possible. Martingales are not yet part of the statistical mainstream. They are still regarded as belonging to the domain of the probabilists. It is notable that, by contrast, the Econometricians have not been reluctant to embrace the theory. See, for example, Davidson (1994, p. xiii).
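As a simple worked illustration of this machinery, consider the standard textbook case (not an example taken from the paper) of a Gaussian first-order autoregression with known innovation variance; the LaTeX fragment below records its score, the score increments, and the generalized Fisher information in the notation used above.

```latex
% Sketch: score and generalized Fisher information for a Gaussian AR(1)
% model X_i = \theta X_{i-1} + \varepsilon_i, \varepsilon_i iid N(0,\sigma^2)
% with \sigma^2 known.  A standard illustration, not from the original paper.
\[
  U_n(\theta) = \frac{\partial \log L_n(\theta)}{\partial\theta}
  = \sigma^{-2}\sum_{i=1}^{n} X_{i-1}\,(X_i - \theta X_{i-1}),
  \qquad
  u_i = \sigma^{-2} X_{i-1}\,(X_i - \theta X_{i-1}),
\]
\[
  E(u_i \mid \mathcal{F}_{i-1}) = 0,
  \qquad
  I_n(\theta) = \sum_{i=1}^{n} E(u_i^2 \mid \mathcal{F}_{i-1})
  = \sigma^{-2}\sum_{i=1}^{n} X_{i-1}^{2},
\]
% so U_n(\theta) is a zero-mean martingale and the MLE
% \hat\theta_n = \sum_i X_{i-1}X_i / \sum_i X_{i-1}^2 solves U_n(\theta) = 0.
```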
3 Estimators to Estimating Functions
An estimating function (EF) is a function of data and parameter, typically with mean zero, which when equated to zero gives a parameter estimator as its root. The use of estimating functions is close to universal in statistical practice. It is just that there has been little focus to date on EF's themselves. Their usage dates back at least to Karl Pearson's method of moments (1894). For example, if the $X_i$ are iid with $EX_1 = \mu$ and $\mathrm{var}\,X_1 = \sigma^2$, then
$$G_n^{(1)}(\theta) = \sum_{i=1}^{n} (X_i - \mu), \qquad G_n^{(2)}(\theta) = \sum_{i=1}^{n} \{(X_i - \mu)^2 - \sigma^2\}$$
are estimating functions for θ = (μ, σ²)'. Maximum likelihood (ML) and least squares and its variants (LS, WLS) are basically EF methods, the parallel between them being shown below.

ML: Likelihood L(θ). Form the score d log L(θ)/dθ. Equate to zero and solve.
LS/WLS: Sum (weighted sum) of squares S(θ). Form dS(θ)/dθ. Equate to zero and solve.
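To make the estimating-function viewpoint concrete, the following Python sketch solves the two moment estimating equations above numerically for simulated iid data; the distribution, sample size and seed are arbitrary choices, and the root is of course just the sample mean and the uncorrected sample variance.

```python
# Sketch: solving the method-of-moments estimating equations
#   G1 = sum(X_i - mu),  G2 = sum((X_i - mu)^2 - sigma^2)
# numerically for simulated iid data.  Illustrative only.
import numpy as np
from scipy.optimize import root

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=500)   # hypothetical data

def estimating_function(theta):
    mu, sigma2 = theta
    g1 = np.sum(x - mu)
    g2 = np.sum((x - mu) ** 2 - sigma2)
    return [g1, g2]

sol = root(estimating_function, x0=[0.0, 1.0])
mu_hat, sigma2_hat = sol.x
print(mu_hat, sigma2_hat)          # numerical root of the estimating function
print(x.mean(), x.var())           # closed-form solution for comparison
```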
The score function is a benchmark (e.g. Godambe (1960)). It is the score function, rather than the MLE which comes from it, which is fundamental. Indeed, the optimality properties which we ascribe to the MLE are really optimality properties of the score function. For example:

• Fisher information is an EF property (Fisher information is var U).
• The Cramer-Rao inequality is an EF property. It gives var U as a bound on the variances of standardized estimating functions.

EF's have significant advantages over the estimators derived therefrom.

• EF's with information about an unknown parameter can be readily combined.
• EF's usually have straightforward asymptotics. That for the estimator is derived therefrom using local linearity plus regularity.

For a discussion of optimal inference it is best to choose an EF setting and to focus on optimality of the EF and not optimality of an estimator derived therefrom. The origins of this theory go back to the 1960s but it began a serious surge of development in the mid 1980s, much of the impetus being provided by Godambe's 1985 paper. A detailed treatment of the subject has been provided in book form in Heyde (1997). The theory, labelled as quasi-likelihood since it closely mimics the features of classical likelihood theory, is outlined below.
3.1 General QL Principles
The setting is of a sample $\{Z_t, t \in T\}$ from some stochastic system whose distribution involves θ. The θ to be efficiently estimated is a vector of dimension p. The approach is via a chosen family $\mathcal{G}$ of EFs $G_T(\theta)$, the $G_T$ being vectors of dimension p with $EG_T(\theta) = 0$ and the p x p matrices $E\dot{G}_T = E(\partial G_T/\partial \theta')$ being assumed nonsingular. Comparisons are made using an information criterion (generalized Fisher information)
$$\mathcal{E}(G_T) = (E\dot{G}_T)'(EG_T G_T')^{-1}(E\dot{G}_T)$$
for $G_T \in \mathcal{G}$. We choose $G_T^* \in \mathcal{G}$ to maximize $\mathcal{E}(G_T)$ in the partial order of non-negative definite matrices. (This amounts to a reformulation of the Gauss-Markov theorem.) Such a $G_T^*$ is called a quasi-score estimating function (QSEF) within $\mathcal{G}$. It should be emphasized that the choice of the family $\mathcal{G}$ is open and should be tailored to the particular application.

The estimator $\theta_T^*$ obtained from $G_T^*(\theta_T^*) = 0$, termed a quasi-likelihood estimator, has, under broad conditions, minimum size asymptotic confidence zone properties for θ, at least within $\mathcal{G}$. The basic properties are those of the MLE, but restricted to $\mathcal{G}$. The theory does not require a parametric setting, let alone the existence of a score function $U_T(\theta)$.

Important features of the theory include:

• It applies to general stochastic systems.
• It allows for the control of the problem of misspecification. This control is in the hands of the experimenter. No more than means and variances are required in many contexts.
• It carries with it all the classical theory of ML and LS. There is no extra baggage required for the discussion of inference for stochastic processes. But for this setting the detailed asymptotics does require modern limit theory (especially that for martingales).

QSEFs can usually be found with the aid of the following result (Heyde (1997), Theorem 2.1, p. 14):

Proposition. $G_T^* \in \mathcal{G}$ is a QSEF within $\mathcal{G}$ if
$$(E\dot{G}_T)^{-1} E(G_T G_T^{*\prime}) = C_T \qquad (3.1)$$
for all $G_T \in \mathcal{G}$, where $C_T$ is a fixed matrix. Conversely, if $\mathcal{G}$ is convex and $G_T^*$ is a QSEF then (3.1) holds.
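For orientation, the following LaTeX fragment records a standard special case of the quasi-score (a textbook specialisation, not a result particular to this paper), assuming a scalar θ and a family generated by martingale differences with modelled conditional means and variances only.

```latex
% Sketch of a standard special case (scalar theta).  Suppose the data
% X_1,...,X_n have conditional means mu_i(theta) = E(X_i | F_{i-1}) and
% conditional variances V_i(theta) = Var(X_i | F_{i-1}), and take
%   G = { sum_i a_{i-1} (X_i - mu_i(theta)) : a_{i-1} F_{i-1}-measurable }.
% Maximizing the information criterion over this family yields the quasi-score
\[
  G_n^{*}(\theta)
  =
  \sum_{i=1}^{n}
  \frac{\dot{\mu}_i(\theta)}{V_i(\theta)}
  \bigl(X_i - \mu_i(\theta)\bigr),
  \qquad
  \dot{\mu}_i(\theta) = \frac{\partial \mu_i(\theta)}{\partial\theta},
\]
% i.e. conditional weighted least squares (cf. Godambe (1985); Heyde (1997)).
```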
3.2 Finding Useful Families of EFs
Statistical models can generally, perhaps after suitable transformation, be described as

data = signal + noise,

where the signal is a predictable trend term and the noise is a zero-mean stochastic disturbance. This can then be conveniently reformulated in terms of a special semimartingale representation as
$$X_t = X_0 + A_t(\theta) + M_t(\theta),$$
where $A_t$ is a predictable finite variation process and $M_t$ is a local martingale. This provides a natural route to estimating parameters in the signal. Discrete time processes and most continuous time processes with finite means have this kind of representation, via a suitable rewrite if necessary. Thus, for example, if $\{X_t = \sum_{i=1}^{t} x_i\}$ is a discrete time process, with past history σ-fields $\{\mathcal{F}_i\}$, it can be rewritten in special semimartingale form as
$$X_t = \sum_{s=1}^{t} E(x_s \mid \mathcal{F}_{s-1}) + \sum_{s=1}^{t} \{x_s - E(x_s \mid \mathcal{F}_{s-1})\}.$$

A general strategy is to try the Hutton-Nelson family of EFs
$$\mathcal{G} = \Big\{ G_T : G_T = \int_0^T a_s(\theta)\, dM_s(\theta) = \int_0^T a_s(\theta)\, d(X_s - A_s(\theta)) \Big\}$$
for which the QSEF is
$$G_T^* = \int_0^T \big(E(d\dot{M}_s \mid \mathcal{F}_{s-})\big)' \big(d\langle M \rangle_s\big)^{-} dM_s,$$
where $\dot{M}_s = \partial M_s/\partial\theta$; in the discrete time case $M_t = \sum_{s=1}^{t} m_s$ and the integral reduces to the corresponding sum over the martingale differences $m_s$, while in the continuous time case it is a stochastic integral. Here $\langle M \rangle_t$ is the quadratic characteristic and the minus superscript denotes the generalized inverse.

Example. The membrane potential V across a neuron is well described by a stochastic differential equation
$$dV(t) = (-\rho V(t) + \lambda)\, dt + dM(t)$$
(e.g. Kallianpur (1983)), where M(t) is a martingale with a (centered) generalized Poisson distribution. Here $\langle M \rangle_t = \sigma^2 t$, σ > 0.
The QSEF for the Hutton-Nelson family on the basis of a single realization $\{V(t), 0 \le t \le T\}$ is
$$G_T^* = \int_0^T (-V(t), 1)'\, \big(dV(t) - (-\rho V(t) + \lambda)\, dt\big).$$
The estimators $\hat{\rho}$ and $\hat{\lambda}$ are then obtained from the estimating equations
$$\int_0^T V(t)\, dV(t) = \int_0^T (-\hat{\rho} V(t) + \hat{\lambda}) V(t)\, dt,$$
$$V(T) - V(0) = \int_0^T (-\hat{\rho} V(t) + \hat{\lambda})\, dt.$$
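As a numerical check on these estimating equations, the Python sketch below simulates the neuron model on a fine time grid, with M(t) built from a centered compound Poisson process, and then solves the discretized equations as a 2 x 2 linear system for the estimators of ρ and λ. The jump rate, jump-size law and parameter values are arbitrary illustrative choices, not those of Kallianpur (1983) or of the text.

```python
# Sketch: Euler-type simulation of dV = (-rho*V + lam) dt + dM, with M a
# centered compound Poisson martingale, followed by quasi-likelihood
# estimation of (rho, lam) from the two estimating equations in the text.
# All numerical values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
rho_true, lam_true = 0.5, 1.0          # hypothetical parameters
nu, jump_mean = 5.0, 0.2               # Poisson rate and mean (exponential) jump size
T, h = 200.0, 0.01                     # time horizon and step size
n = int(T / h)

V = np.empty(n + 1)
V[0] = lam_true / rho_true             # start near the stationary mean
for i in range(n):
    n_jumps = rng.poisson(nu * h)
    jumps = rng.exponential(jump_mean, size=n_jumps).sum()
    dM = jumps - nu * jump_mean * h    # centered increment => martingale
    V[i + 1] = V[i] + (-rho_true * V[i] + lam_true) * h + dM

# Discretized estimating equations:
#   int V dV   = -rho * int V^2 dt + lam * int V dt
#   V(T)-V(0)  = -rho * int V dt   + lam * T
dV = np.diff(V)
iV = np.sum(V[:-1]) * h                # int V dt
iV2 = np.sum(V[:-1] ** 2) * h          # int V^2 dt
iVdV = np.sum(V[:-1] * dV)             # int V dV

A = np.array([[-iV2, iV],
              [-iV,  T]])
b = np.array([iVdV, V[-1] - V[0]])
rho_hat, lam_hat = np.linalg.solve(A, b)
print(rho_hat, lam_hat)                # should be close to (0.5, 1.0)
```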
For more details about this subject, including such things as how to deal with parameters in the noise component of the semimartingale model, see Heyde (1997, Chapter 2).

An important role for semimartingales in inference for stochastic processes has been evident since the papers of Hutton and Nelson (1986) and Thavaneswaran and Thompson (1986). There are few contexts where these methods cannot play a vital role. Now there is a book on inference associated with semimartingale models (Prakasa Rao (1999)), although not with a focus of the kind that has been outlined above.

Amongst the (rare) processes which are not semimartingales is the fractional Brownian motion $B_H(t)$, $0 < H < 1$, $H \neq \tfrac12$. The case $H = \tfrac12$ corresponds to ordinary Brownian motion, which is a semimartingale. Fractional Brownian motion has a Gaussian distribution and the self-similarity property
$$B_H(ct) \stackrel{d}{=} c^H B_H(t), \qquad c > 0.$$
It has been widely used to model possible long-range dependence (see for example Beran (1994)). An example where this process has been used is in modelling departures from the standard geometric Brownian motion model of Black and Scholes for the price of a risky asset which may now exhibit long-range dependence. Here the price $P_t$ of the asset at time t is modelled by the stochastic differential equation (sde)
$$dP_t = P_t[\mu\, dt + \sigma\, dB_H(t)],$$
where $B_H(t)$, $0 < H < 1$, is a fractional Brownian motion process. The standard model corresponds to the case $H = \tfrac12$ and the nonstandard model, $H \neq \tfrac12$, is not amenable to the analysis described above. Substitute methods, however, have recently been developed. See, for example, Mikosch and
Norvaisa (2000) for a discussion of the above sde, and Norros, Valkeila and Virtamo (1999) for a discussion of the parameter estimation.

All the methods of inference, semimartingale based or not, make use, in some sense, of information or empirical information. Consistency results can generally be obtained via the martingale SLLN and (asymptotic) confidence intervals via the martingale CLT. For the latter, the most general results deal with the self-normalized case $[M]_T^{-1/2} M_T$, $[M]$ being the quadratic variation (Heyde (1997)). Ideas on information are the subject of the next, and last, section.
4 Fisher Information as a Statistical "Law of Nature"
The essential points that I wish to make are:

(1) The role of Fisher information in the general theory of inference has already been described (in the previous section).
(2) Fisher information has a key role as a scientific tool (for example in Physics).
(3) Many, perhaps most, statistical procedures rely on Fisher information in ways which have not hitherto been acknowledged.

Recently there has been some quite striking work in Physics based on the idea of Fisher information. The book Frieden (1998) caught the interest of the science journalist community. For example, it led to an article in New Scientist (Matthews (1999)). On the basis of this I bought the book and I found it both fascinating and frustrating. I wove consideration of it into a seminar course which I gave at Columbia University in the Fall of 1999. The ideas certainly warrant very serious consideration by the statistical community.

Frieden's thesis is that:

• All physical laws may be unified under the umbrella of measurement theory.
• With each phenomenon there is an associated Lagrangian, natural to the field. All Lagrangians consist entirely of two forms of Fisher information - data information and phenomenological information.

An informal explanation is that each context requires solution to some extremum problem. At the basis of this is a scalar function called the Lagrangian (like the likelihood). The solution of the problem can be phrased in terms of a pde involving the Lagrangian. The parallels with statistics look good at face value. But the reality is much more complex. Much of the book treats physical systems where information decreases over time. Of course, statistical problems are typically ones where information increases over time - corresponding to the collection
of more data. The sort of context where information decreases over time is, say, when the position of a particle is observed subject to noise. Over time the particle moves and its position is known with decreasing precision.

4.1 Procedures Involving Information
Most statistical procedures seem to be associated, directly or indirectly, with measures of information, and it is arguably of value to make the connection explicit as an aid to the development of useful methods. Also, it is important to note that there is a close connection between comparisons of information content and statistical distance. For example, the formulation of a quasi-score estimating function in terms of maximizing generalized Fisher information can be equivalently recast into a formulation in terms of minimizing dispersion distance (from the (generally unknown) score function) (Heyde (1997), p. 12). A new book focusing on the use of statistical distance is Lindsay and Markatou (2001). We now proceed to examine two applications in which information based ideas are not immediately apparent, in order to see the role that they can play.

Choosing the order of an autoregression

The most widely used procedure, AIC, involves choosing the order k to minimize
$$\mathrm{AIC}(k) = -2 \log L(\hat{\theta}_k) + 2k, \qquad (4.1)$$
where $\hat{\theta}_k$ is the MLE of θ restricted to $R^k$ and 2k is a penalty function. It is assumed that the order k < K for some fixed K. Akaike's original proof uses the Kullback-Leibler entropy given by
$$KL(p, r) = \int p(x) \log \frac{p(x)}{r(x)}\, dx,$$
measuring the distance between two pdf's p, r, and it should be noted that Fisher information can be thought of as a form of local entropy (Frieden (1998), pp. 31-32). The proof shows that, asymptotically, the order minimizing $E\, KL(\theta, \hat{\theta}_k)$ is the same as the one minimizing (4.1). It proceeds via the likelihood ratio statistic for testing the null hypothesis $H_0: \theta \in R^k$ versus the alternative $H_1: \theta \in R^K - R^k$ and suggests a QL generalization of AIC based on generalized Fisher information. Note the route to treating problems where one does not have estimating functions differentiable with respect to the parameter of interest. Here the variable in question is discrete.
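The AIC recipe (4.1) is easy to carry out numerically. The Python sketch below, with an arbitrarily chosen AR(2) data-generating process and sample size, fits each candidate order by conditional least squares, evaluates the Gaussian log-likelihood at the fit and reports the order minimizing -2 log L + 2k.

```python
# Sketch: choosing the order of an autoregression by minimizing
# AIC(k) = -2 log L(theta_k) + 2k, with Gaussian conditional likelihood.
# Illustrative only; the true order and coefficients are arbitrary choices.
import numpy as np

rng = np.random.default_rng(4)
n, K = 1000, 8
x = np.zeros(n)
for t in range(2, n):                       # simulate an AR(2) process
    x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + rng.normal()

def aic(k):
    y = x[K:]                               # common effective sample for all orders
    if k == 0:
        resid = y
    else:
        design = np.column_stack([x[K - j:n - j] for j in range(1, k + 1)])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ coef
    m = len(y)
    sigma2 = np.mean(resid ** 2)
    loglik = -0.5 * m * (np.log(2 * np.pi * sigma2) + 1.0)
    return -2.0 * loglik + 2.0 * k

scores = {k: aic(k) for k in range(K + 1)}
print(min(scores, key=scores.get), scores)  # selected order and all AIC values
```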
Stochastic Resonance

The core idea here is of a weak signal operating in a noisy environment which is normally undetectable. However, by suitably increasing the noise a "resonance" can be set up making the signal apparent. Resonance may be a very important phenomenon scientifically. It has, for example, been proposed as a possible explanation for the ice ages. There is a burgeoning literature on the phenomenon which can be conveniently accessed via the stochastic resonance web site http://www.umbrars.com/sr based in Perugia, Italy, which in turn has links to similar web sites in San Diego, USA and Saratov, Russia.

A very simple example concerns the tunable model
$$dX_t = (A \sin \omega t)\, dt + \sigma\, dW(t), \qquad (4.2)$$
where W is standard Brownian motion and the amplitude A is subthreshold (i.e. $A < A_0$). We want to estimate ω and the issue is the optimum choice of σ. When estimating a frequency from discretely observed data $x_t = X_t - X_{t-1}$, the conventional wisdom is to work with the periodogram
$$I_p = N^{-1} \Big| \sum_{t=1}^{N} x_t e^{-i t \omega_p} \Big|^2,$$
where $\omega_p = 2\pi p/N$, $p = 0, 1, \ldots, [N/2]$. The theory, which originated back with Fisher (1929), tells us to use the estimator $\hat{\omega}$ corresponding to the $\omega_p$ for which $\max_p I_p$ obtains. Consistency and rate of convergence results are available.

An information formulation can proceed as follows. From the semimartingale representation
$$x_t = A \sin \omega t + m_t,$$
say, derived from (4.2), we obtain an estimating function $G_p$ for each trial frequency $\omega_p$, and the empirical information associated with this is $\|G_p\|^2$. Asymptotically, $\|G_p\|^2$ and $I_p$ are maximized for the same p. Of course in the stochastic resonance problem we do not observe the $\{x_t\}$ process, but rather the censored process $\{T_t = x_t 1(|x_t| > A_0)\}$. But it seems that the periodogram based approach is still appropriate.

As a general conclusion, it seems profitable to think about statistical problems in a setting of maximizing an information. I see this as an important unifying principle.
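A small numerical experiment along these lines is sketched below in Python; the frequency, amplitude, threshold and noise levels are arbitrary choices. Increments of (4.2) are simulated at unit time steps, censored at the threshold, and the frequency is estimated by the maximizer of the periodogram of the censored series for several values of σ, illustrating how the choice of σ affects whether the frequency survives the censoring.

```python
# Sketch: periodogram-based frequency estimation for increments of
#   dX_t = (A sin(omega t)) dt + sigma dW(t),
# observed only through the censored series T_t = x_t 1(|x_t| > A0).
# Parameter values are illustrative assumptions only.
import numpy as np

rng = np.random.default_rng(5)
N, A, omega, A0 = 4096, 0.5, 0.3, 1.0        # A < A0: subthreshold signal

def periodogram_argmax(y):
    freqs = 2 * np.pi * np.arange(1, N // 2) / N
    I = np.abs(np.fft.rfft(y)[1:N // 2]) ** 2 / N
    return freqs[np.argmax(I)]

t = np.arange(1, N + 1)
for sigma in (0.2, 1.0, 3.0, 10.0):
    x = A * np.sin(omega * t) + sigma * rng.normal(size=N)   # unit-step increments
    censored = np.where(np.abs(x) > A0, x, 0.0)
    print(sigma, periodogram_argmax(censored))   # compare with omega = 0.3
```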
References

Beran, J. (1994). Statistics for Long-Memory Processes, Chapman & Hall, New York.

Billingsley, P. (1961). Statistical Inference for Markov Processes, Univ. Chicago Press, Chicago.

Box, G.E.P. and Jenkins, G.M. (1970). Time Series Analysis, Forecasting and Control, Holden-Day, San Francisco.

Davidson, J. (1994). Stochastic Limit Theory. Oxford University Press Advanced Texts in Econometrics, Oxford.

Fisher, R.A. (1929). Tests of significance in harmonic analysis. Proc. Roy. Soc. Ser. A 125, 54-59.

Frieden, B.R. (1998). Physics from Fisher Information: A Unification. Cambridge University Press, Cambridge.

Godambe, V.P. (1960). An optimum property of regular maximum-likelihood estimation. Ann. Math. Statist. 31, 1208-1211.

Godambe, V.P. (1985). The foundations of finite sample estimation in stochastic processes. Biometrika 72, 419-428.

Hall, P.G. and Heyde, C.C. (1980). Martingale Limit Theory and its Application, Academic Press, New York.

Hannan, E.J. and Heyde, C.C. (1972). On limit theorems for quadratic functions of discrete time series. Ann. Math. Statist. 43, 2058-2066.

Heyde, C.C. (1997). Quasi-Likelihood and its Application. A General Approach to Optimal Parameter Estimation, Springer, New York.

Hutton, J.E. and Nelson, P.I. (1986). Quasi-likelihood estimation for semimartingales. Stochastic Process. Appl. 22, 245-257.

Kallianpur, G.P. (1983). On the diffusion approximation to a discontinuous model for a single neuron. In P.K. Sen, Ed., Contributions to Statistics: Essays in Honor of Norman L. Johnson. North-Holland, Amsterdam, 247-258.

Kantz, H. and Schreiber, T. (1997). Nonlinear Time Series Analysis. Cambridge University Press, Cambridge.

Lindsay, B.G. and Markatou, M. (2001). Statistical Distances: A Global Framework for Inference. Springer, New York, to appear.

Matthews, R. (1999). I is the law. New Scientist, 30 January 1999, 24-28.
Mikosch, T. and Norvaisa, R. (2000). Stochastic integral equations without probability. Bernoulli 6, 401-434.

Norros, I., Valkeila, E. and Virtamo, J. (1999). An elementary approach to a Girsanov formula and other analytical results on fractional Brownian motion. Bernoulli 5, 571-587.

Prakasa Rao, B.L.S. (1999). Semimartingales and their Statistical Inference. Chapman & Hall/CRC, Boca Raton.

Roussas, G.G. (1972). Contiguity of Probability Measures, Cambridge Univ. Press, London and New York.

Thavaneswaran, A. and Thompson, M.E. (1986). Optimal estimation for semimartingales. J. Appl. Prob. 23, 409-417.
MODELLING BY LEVY PROCESSES

Ole E. Barndorff-Nielsen
MaPhySto (Centre for Mathematical Physics and Stochastics, funded by the Danish National Research Foundation), University of Aarhus
1 Introduction
A considerable body of recent work uses Levy processes to model and analyse financial time series. Section 2 provides a brief review of this work. The review is to a large extent based on two papers Barndorff-Nielsen and Shephard (2001a,b) where more detailed information may be found. See also Barndorff-Nielsen and Shephard (2001c,d,e). The models in question aim to incorporate one or more of the main stylised features of financial series, be they stock prices, foreign exchange rates or interest rates. A summary of these stylised features, and a comparison with related empirical findings in the study of turbulence, is given in Section 3. (In fact, the intriguing similarities between finance and turbulence have given rise to a new field of study coined 'econophysics'.)
2 Levy Processes in Finance
A Levy process is a stochastic process (in continuous time) with independent and homogeneous increments. The study of such processes, as part of probability theory generally, is currently attracting a great deal of attention; see Bertoin (1996, 1999), Sato (1999), Barndorff-Nielsen, Mikosch and Resnick (2001), and references given there. It is by now well recognised that Brownian motion generally provides a poor description of log price processes of stocks and other financial assets. Improved descriptions are obtained by substituting Brownian motion by suitably chosen alternative Levy processes, for instance hyperbolic Levy motion, normal inverse Gaussian Levy motion and, more generally, one of the
generalised hyperbolic Levy motions. See Eberlein (2001), Prause (1999), Mantegna and Stanley (1999), Barndorff-Nielsen and Prause (2001).

Merely changing from Brownian motion to another, more suitable, Levy process does not, however, provide a modelling of the important quasi long range dependencies (cf. Section 3) that pervade the financial markets. But such dependencies may be captured by further use of Levy processes, as innovation processes driving volatility processes in the framework of SV (Stochastic Volatility) models. Discrete time models of this kind were considered in Barndorff-Nielsen (1998b). That approach has since been developed, in joint work with Neil Shephard, into the continuous time setting, and the rest of the present note consists mainly in a summary of that work (Barndorff-Nielsen and Shephard (2001a,b; cf. also 2001c,d,e)).

The stochastic volatility models considered are of the form
$$dx^*(t) = \{\mu + \beta \sigma^2(t)\}\, dt + \sigma(t)\, dw(t) \qquad (2.1)$$
where, for concreteness, we may think of x*(t) as the log price process of a given stock. In (2.1), w(t) is Brownian motion and σ²(t), which represents the fluctuating and time dependent volatility, is a stationary stochastic process, for simplicity assumed independent of w(t). Of particular interest are cases where σ²(t) is of OU type (Ornstein-Uhlenbeck type) or is a superposition of such processes. In the former instance, σ²(t) satisfies a stochastic differential equation of the form
$$d\sigma^2(t) = -\lambda \sigma^2(t)\, dt + dz(\lambda t) \qquad (2.2)$$
where z(t) is a Levy process with positive increments; thus z(t) is a subordinator. Because of its role in (2.2), z(t) is referred to as the Background Driving Levy Process (BDLP, for short). The correlation function r(u) of the (stationary) solution σ²(t) of (2.2) has the exponential form r(u) = exp(-λu).

In particular, choosing the volatility process so that σ²(t) follows the inverse Gaussian law IG(δ, γ) with probability density
$$f(x) = \frac{\delta}{\sqrt{2\pi}}\, e^{\delta\gamma}\, x^{-3/2} \exp\Big\{-\tfrac{1}{2}\big(\delta^2 x^{-1} + \gamma^2 x\big)\Big\}, \qquad x > 0,$$
one obtains that the increments of the log returns over a lag Δ, i.e. x*(t + Δ) - x*(t), are approximately distributed according to a normal inverse Gaussian law, and these laws are known to describe the distributions of log returns well. More generally, one may consider the generalized inverse Gaussian distribution GIG(λ, δ, γ) with probability density
$$f(x) = \frac{(\gamma/\delta)^{\lambda}}{2 K_{\lambda}(\delta\gamma)}\, x^{\lambda - 1} \exp\Big\{-\tfrac{1}{2}\big(\delta^2 x^{-1} + \gamma^2 x\big)\Big\}, \qquad x > 0,$$
where $K_{\lambda}$ is a Bessel function, as the law for the volatility σ²(t). The corresponding approximate laws of the increments x*(t + Δ) - x*(t) are then of the generalized hyperbolic type which, besides the hyperbolic and normal inverse Gaussian distributions, inter alia includes the variance gamma laws, the Student distributions, and the Laplace distributions. It is important here to note that it is essential for the construction of these OU processes that the GIG(λ, δ, γ) distributions are selfdecomposable (cf. Barndorff-Nielsen (1998b)).

Furthermore, whatever the choice of the process σ²(t), if the parameter β is (approximately) 0 then the autocorrelations of the sequence of log returns will be (approximately) 0, reflecting another important stylized fact. The dependency structure in the log price process (as it manifests itself for instance in the autocorrelations of the absolute or squared returns) may be modelled by endowing the σ²(t) process with a suitable correlation structure. This can be done by taking σ²(t) to be a superposition of independent OU processes, while keeping the chosen marginal law of σ²(t). Already the superposition of just two OU processes (with different regression parameters λ₁ and λ₂) may go a long way in describing the observed dependency structure of x*(t) (see, for instance, Barndorff-Nielsen (1998b; Figure 1)). However, even processes with real long range dependence can be constructed in this way (Barndorff-Nielsen (2001)). Finally, the so-called leverage effect (see Section 3) can be modelled by adding an extra term in equation (2.1), defined again using the BDLP z(t).
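For readers who wish to experiment numerically, the following Python sketch simulates a crude discretisation of (2.1)-(2.2). For simplicity the BDLP is taken to be a compound Poisson subordinator with exponential jumps, which is convenient to simulate but does not reproduce the inverse Gaussian marginal law discussed above, and all parameter values are arbitrary illustrative choices.

```python
# Sketch: Euler simulation of a stochastic volatility model of OU type,
#   dx*(t)      = (mu + beta*sigma2(t)) dt + sigma(t) dw(t),
#   dsigma2(t)  = -lam*sigma2(t) dt + dz(lam*t),
# with z a compound-Poisson subordinator (exponential jumps) used purely
# for illustration; all parameter values are arbitrary assumptions.
import numpy as np

rng = np.random.default_rng(2)
mu, beta, lam = 0.0, 0.0, 0.05         # drift, risk premium, OU decay rate
jump_rate, jump_mean = 1.0, 0.01       # BDLP: Poisson rate and mean jump size
T, h = 2000.0, 0.1
n = int(T / h)

sigma2 = np.empty(n + 1)
x = np.zeros(n + 1)
sigma2[0] = jump_rate * jump_mean      # rough starting level
for i in range(n):
    # increment of z(lam*t) over a step of length h
    n_jumps = rng.poisson(jump_rate * lam * h)
    dz = rng.exponential(jump_mean, size=n_jumps).sum()
    sigma2[i + 1] = sigma2[i] - lam * sigma2[i] * h + dz
    dw = rng.normal(0.0, np.sqrt(h))
    x[i + 1] = x[i] + (mu + beta * sigma2[i]) * h + np.sqrt(sigma2[i]) * dw

returns = np.diff(x)
# lag-1 autocorrelation of returns: approximately zero (beta = 0)
print(np.corrcoef(returns[:-1], returns[1:])[0, 1])
# lag-1 autocorrelation of squared returns: positive (volatility clustering)
print(np.corrcoef(returns[:-1] ** 2, returns[1:] ** 2)[0, 1])
```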
3 Stylized Features of Finance and Turbulence
A number of characteristic features of observational series from finance and from turbulence are summarised in Table 1. The features are widely recognized as being essential for understanding and modelling within these two, quite different, subject areas. In finance the observational series concerned consist of values of assets such as stocks or (logarithmic) stock returns or exchange rates, while in wind turbulence the series typically give the velocities or velocity derivatives (or differences), in the mean wind direction of a large Reynolds number wind field. For some typical examples of empirical probability densities of logarithmic asset returns, on the one hand, and velocity differences in large Reynolds number wind fields, on the other, see, for instance, Eberlein and Keller (1995) and Shephard (1996) for the former, and Barndorff-Nielsen (1998a) for the latter.

A very characteristic trait of time series from turbulence as well as finance is that there seems to be a kind of switching regime between periods of relatively small random fluctuations and periods of high 'activity'. In turbulence this phenomenon is known as intermittency whereas in finance one
speaks of stochastic volatility or conditional heteroscedasticity. For cumulative processes x*(t) in finance a basic expression of the volatility is given by the quadratic variation process [x*](t), defined as
$$[x^*](t) = \lim \sum_{i=1}^{n} \{x^*(t_i) - x^*(t_{i-1})\}^2,$$
where $0 = t_0 < t_1 < \cdots < t_{n-1} < t_n = t$ and the limiting procedure is for the grid size $\max(t_i - t_{i-1})$ tending to 0. Similarly, in turbulence intermittency is expressed as the energy dissipation rate per unit mass at position ξ,
$$\varepsilon_r(\xi) = \frac{1}{r} \int_{\xi - r/2}^{\xi + r/2} \Big(\frac{\partial u}{\partial x}\Big)^2\, dx$$
(up to a constant factor). Here u = u(x) is the velocity at position x in the mean direction of the wind field. For detailed and informative discussions of the concepts of intermittency and energy dissipation, see Frisch (1995).
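A minimal Python sketch of the quadratic variation computation is given below, with a constant-volatility Brownian path used purely as a stand-in for x*(t); the sum of squared increments settles down around σ²t as the grid is refined.

```python
# Sketch: realized quadratic variation sum_i (x*(t_i) - x*(t_{i-1}))^2
# computed from a simulated path with constant volatility sigma, so that
# the theoretical quadratic variation over [0, T] is sigma^2 * T.
# Illustrative sketch only.
import numpy as np

rng = np.random.default_rng(3)
T, sigma, n_fine = 1.0, 0.2, 2 ** 14
dt = T / n_fine
increments = sigma * rng.normal(0.0, np.sqrt(dt), size=n_fine)
x = np.concatenate(([0.0], np.cumsum(increments)))   # x*(t) on the fine grid

for step in (2 ** 6, 2 ** 4, 2 ** 2, 1):              # coarser to finer grids
    grid = x[::step]
    rqv = np.sum(np.diff(grid) ** 2)
    print(step, rqv)                                  # concentrates around sigma**2 * T = 0.04
```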
                               Finance          Turbulence
varying activity               volatility       intermittency
semiheavy tails                +                +
asymmetry                      +                +
aggregational Gaussianity      +                +
0 autocorrelation              +
quasi long range dependence    +                [+]
scaling/selfsimilarity         [+]              +

TABLE 1. Stylised features.
The term 'semiheavy tails', in Table 1, is intended to indicate that the data suggest modelling by probability distributions whose densities behave, for x → ±∞, as
$$\mathrm{const.}\ |x|^{\rho_{\pm}} \exp(-\sigma_{\pm} |x|)$$
for some $\rho_+, \rho_- \in \mathbb{R}$ and $\sigma_+, \sigma_- > 0$. The generalised hyperbolic laws exhibit this type of behaviour.

Velocity differences in turbulence show an inherent asymmetry consistent with Kolmogorov's modified theory of homogeneous high Reynolds number turbulence (cf. Barndorff-Nielsen, 1986). Distributions of financial asset returns are generally rather close to being symmetric around 0, but for stocks there is a tendency towards asymmetry stemming from the fact that the equity market is prone to react differently to positive as opposed to negative
returns, cf. for instance Shephard (1996; Subsection 1.3.4). This reaction pattern, or at least part of it, is referred to as a 'leverage effect' whereby increased volatility tends to be associated with negative returns.

By aggregational Gaussianity is meant the fact that long term aggregation of financial asset returns, in the sense of summing the returns over longer periods, will lead to approximately normally distributed variates, and similarly in the turbulence context (although in turbulence a small skewness generally persists, in agreement with Kolmogorov's theory of isotropic turbulence). For illustrations of this, see for instance Eberlein and Keller (1995) and Barndorff-Nielsen (1998a).

The estimated autocorrelation functions based on log price differences on stocks or currencies are generally (closely) consistent with an assumption of zero autocorrelation. Nevertheless, this type of financial data exhibits 'quasi long range dependence' which manifests itself inter alia in the empirical autocorrelation functions of the absolute values or the squares of the returns, which stay positive for many lags.

For discussions of scaling phenomena in turbulence we refer to Frisch (1995). As regards finance, see Barndorff-Nielsen and Prause (2001) and references given there. In addition, it is relevant to mention the one-dimensional Burgers equation
$$\frac{\partial u}{\partial t} + u \frac{\partial u}{\partial x} = \nu\, \frac{\partial^2 u}{\partial x^2}.$$
This nonlinear partial differential equation may be viewed as a 'toy model' version of the Navier-Stokes equations of fluid dynamics and, as such, has been the subject of extensive analytical and numerical studies, see for instance Frisch (1995; p. 142-143) and Bertoin (2001), and references given there. In finance, Burgers' equation has turned up in work by Hodges and Carverhill (1993) and Hodges and Selby (1997). However, the interpretation of the equation in finance does not appear to have any relation to the role of the equation in turbulence.
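For completeness, a tiny explicit finite-difference scheme for the viscous Burgers equation on a periodic domain is sketched below in Python; the viscosity, grid and time step are arbitrary choices kept small enough for stability, and the sketch is meant only as a numerical illustration of the equation.

```python
# Sketch: explicit finite-difference solution of the one-dimensional
# viscous Burgers equation u_t + u u_x = nu u_xx with periodic boundary
# conditions.  Grid, time step and viscosity are illustrative choices.
import numpy as np

nu = 0.1
nx, nsteps, dt = 256, 4000, 2.5e-4
x = np.linspace(0.0, 2 * np.pi, nx, endpoint=False)
dx = x[1] - x[0]
u = np.sin(x)                      # smooth initial profile

for _ in range(nsteps):
    ux = (np.roll(u, -1) - np.roll(u, 1)) / (2 * dx)
    uxx = (np.roll(u, -1) - 2 * u + np.roll(u, 1)) / dx ** 2
    u = u + dt * (-u * ux + nu * uxx)

print(u.min(), u.max())            # the initial sine wave steepens and decays
```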
References

[1] Barndorff-Nielsen, O.E. (1986): Sand, wind and statistics. Acta Mechanica 64, 1-18.

[2] Barndorff-Nielsen, O.E. (1997): Normal inverse Gaussian distributions and stochastic volatility modelling. Scand. J. Statist. 24, 1-14.

[3] Barndorff-Nielsen, O.E. (1998a): Probability and statistics: selfdecomposability, finance and turbulence. In Accardi, L. and Heyde,
C.C. (Eds.): Probability Towards 2000. Proceedings of a Symposium held 2-5 October 1995 at Columbia University. New York: Springer-Verlag. Pp. 47-57.
[4] Barndorff-Nielsen, O.E. (1998b): Processes of normal inverse Gaussian type. Finance and Stochastics 2, 41-68.

[5] Barndorff-Nielsen, O.E. (2001): Superposition of Ornstein-Uhlenbeck type processes. Theory Prob. Its Appl. (To appear.)

[6] Barndorff-Nielsen, O.E., Mikosch, T. and Resnick, S. (Eds.) (2001): Levy Processes - Theory and Applications. Boston: Birkhauser.

[7] Barndorff-Nielsen, O.E. and Perez-Abreu, V. (1999): Stationary and selfsimilar processes driven by Levy processes. Stoch. Proc. Appl. 84, 357-369.

[8] Barndorff-Nielsen, O.E. and Prause, K. (2001): Apparent scaling. Finance and Stochastics 5, 103-113.

[9] Barndorff-Nielsen, O.E. and Shephard, N. (2001a): Modelling by Levy processes for financial econometrics. In Barndorff-Nielsen, O.E., Mikosch, T. and Resnick, S. (Eds.): Levy Processes - Theory and Applications. Boston: Birkhauser, 283-318.

[10] Barndorff-Nielsen, O.E. and Shephard, N. (2001b): Non-Gaussian OU based models and some of their uses in financial economics (with Discussion). J. Roy. Statist. Soc. B 63, 167-241.

[11] Barndorff-Nielsen, O.E. and Shephard, N. (2001c): Econometric analysis of realised volatility and its use in estimating stochastic volatility models. J. Roy. Statist. Soc. B 64 (to appear).

[12] Barndorff-Nielsen, O.E. and Shephard, N. (2001d): Integrated OU processes. (Submitted.)

[13] Barndorff-Nielsen, O.E. and Shephard, N. (2001e): Normal modified stable processes. (Submitted.)

[14] Bertoin, J. (1996): Levy Processes. Cambridge University Press.

[15] Bertoin, J. (1999): Subordinators: Examples and Applications. In Bernard, P. (Ed.): Lectures on Probability Theory and Statistics. Ecole d'Eté de St-Flour, 1997. Berlin: Springer. Pp. 1-91.
[16] Bertoin, J. (2001): Some properties of Burgers turbulence with white or stable noise initial data. In Barndorff-Nielsen, O.E., Mikosch, T. and Resnick, S. (Eds.): Levy Processes - Theory and Applications. Boston: Birkhauser. Pp. 267-279.

[17] Eberlein, E. (2001): Application of generalized hyperbolic Levy motion to finance. In Barndorff-Nielsen, O.E., Mikosch, T. and Resnick, S. (Eds.): Levy Processes - Theory and Applications. Boston: Birkhauser. Pp. 319-336.

[18] Eberlein, E. and Keller, U. (1995): Hyperbolic distributions in finance. Bernoulli 1, 281-299.

[19] Eberlein, E. and Raible, S. (1999): Term structure models driven by general Levy processes. Math. Finance 9, 31-54.

[20] Frisch, U. (1995): Turbulence. Cambridge University Press.

[21] Hodges, S. and Carverhill, A. (1993): Quasi mean reversion in an efficient stock market: The characterization of economic equilibria which support the Black-Scholes option pricing. Economic J. 103, 395-405.

[22] Hodges, S. and Selby, M.J.P. (1997): The risk premium in trading equilibria which support the Black-Scholes option pricing. In Dempster, M. and Pliska, S. (Eds.): Mathematics of Derivative Securities. Cambridge University Press. Pp. 41-53.

[23] Mantegna, R.N. and Stanley, H.E. (1999): Introduction to Econophysics: Correlation and complexity in finance. Cambridge University Press.

[24] Prause, K. (1999): The Generalized Hyperbolic Model: Estimation, Financial Derivatives and Risk Measures. Dissertation. Albert-Ludwigs-Universität, Freiburg i. Br.

[25] Sato, K. (1999): Levy Processes and Infinite Divisibility. Cambridge University Press.

[26] Shephard, N. (1996): Statistical aspects of ARCH and stochastic volatility. In D.R. Cox, D.V. Hinkley and O.E. Barndorff-Nielsen (Eds.): Time Series Models - in econometrics, finance and other fields. London: Chapman and Hall. Pp. 1-67.
Institute of Mathematical Statistics LECTURE NOTES — MONOGRAPH SERIES
EXTREME VALUES FOR A CLASS OF SHOT-NOISE PROCESSES W. P. McCormick Department of Statistics University of Georgia Athens, GA 30602 USA
Lynne Seymour Department of Statistics University of Georgia Athens, GA 30602 USA Abstract
The distribution of the maximum of a shot-noise process based on amplitudes which are heavy tailed and follow a chain-dependent structure is analysed. Asymptotic results are obtained. The process is seen to have a strong local dependence and its extremal index is computed. A simulation study shows the finite sample size performance of an asymptotic approximation to the distribution of the maximum.
1
Introduction
In this paper we are concerned with the asymptotic behavior of the extreme values for a class of shot noise processes. Shot noise processes provide a wide class of stochastic models that are particularly well suited to modeling time series with sudden jumps. Such processes have been applied to modeling river flow data, where a rise in the river flow level could, for example, be attributed to rainfall; see Lawrance and Kottegoda (1977) and Weiss (1973). Moreover, rainfall data itself has been modeled via shot noise processes; see Waymire and Gupta (1981). The basic model under study here takes the form
Y(t) = Σ_k A_k h(t − τ_k),   t ≥ 0,
where {A_k} is a sequence of random amplitudes, {τ_k} forms a point process of event times and h is the impulse response function, typically taken to be nonincreasing with support in [0, ∞). In the current investigation, we take the {A_k} to be a stochastic process of heavy-tailed random variables. In applications the sequence of shocks or amplitudes exhibits dependence. It may be that large shocks tend to occur in succession followed by periods
of mild or small shocks. To model this dependence, we assume that {A_k} forms a sequence of chain-dependent variables. A chain-dependent process operates in the following way. The observed process {A_k} is linked with a secondary Markov chain {ξ_k}. Conditional on the values of {ξ_k}, the A_k's are independent, with the distribution of A_k depending on the value of ξ_{k−1}. Chain-dependent processes form a useful class of stochastic models which, for example, have been applied with success in modeling extremes of precipitation data; see Guttorp (1995), p. 74. Theoretical work on extremes for chain-dependent models has been done in Resnick and Neuts (1970), Denzel and O'Brien (1975) and, more recently, in McCormick and Seymour (2001). In Section 2 we present an extreme value analysis of the shot noise model. Extremes for shot noise processes have been considered by several authors under various assumptions. However, all previous work has taken the {A_k} sequence to be iid. On the other hand, in practice, the data often appear to contradict such an assumption. When the {A_k} are constant and the {τ_k} form a homogeneous Poisson process, i.e. in the case of a filtered Poisson process, Hsing and Teugels (1989) have obtained results on the limiting distribution of the maximum. Doney and O'Brien (1991) provide an extension to the results of Hsing and Teugels (1989) while working under the assumption of constant amplitude. The case of light-tailed amplitudes, viz. Gamma or Weibull distributions, was considered in Homble and McCormick (1995), and the heavy-tailed amplitude case, e.g. Pareto distributions, was developed in McCormick (1997). The process under consideration has a strong local dependence quantified by a value referred to as its extremal index. The method developed in Chernick et al. (1991) for calculating the extremal index is conveniently applied here and represents an essential step in obtaining the asymptotics for the maxima. In Section 3 the results of a simulation study are shown.
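To make the chain-dependent shot-noise construction concrete, the following sketch simulates such a process and records its maximum. The two-state transition matrix, the Pareto amplitude distributions attached to each state, the exponential impulse response h, and the Poisson event times are illustrative assumptions, not the settings used in the simulation study of Section 3.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative ingredients (assumptions, not the paper's settings):
P = np.array([[0.9, 0.1],      # transition matrix of the hidden Markov chain {xi_k}
              [0.3, 0.7]])
alpha = [3.0, 1.5]             # Pareto tail indices of H_1, H_2 (state 2 gives heavier shocks)

def h(u):
    """Nonincreasing impulse response with support in [0, infinity)."""
    return np.where(u >= 0.0, np.exp(-u), 0.0)

def simulate(T=200.0, rate=1.0, dt=0.05):
    """Simulate X(t) = sum_k A_k h(t - tau_k) with chain-dependent amplitudes A_k."""
    n = rng.poisson(rate * T)
    tau = np.sort(np.concatenate(([0.0], rng.uniform(0.0, T, n))))  # fixed renewal at tau_0 = 0

    # Hidden chain; the state xi_k selects the amplitude distribution of the k-th shock
    # (an index shift of the paper's H_{xi_{k-1}} convention).
    xi = np.zeros(len(tau) + 1, dtype=int)
    for k in range(1, len(xi)):
        xi[k] = rng.choice(2, p=P[xi[k - 1]])
    A = np.array([1.0 + rng.pareto(alpha[xi[k]]) for k in range(len(tau))])

    grid = np.arange(0.0, T, dt)
    X = np.array([(A[tau <= t] * h(t - tau[tau <= t])).sum() for t in grid])
    return grid, X

grid, X = simulate()
print("simulated maximum of X over [0, T]:", round(float(X.max()), 3))
```

Repeating the simulation many times gives a Monte Carlo picture of the distribution of the maximum against which asymptotic approximations of the type derived below can be compared.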
2
Asymptotics
Let {ξ_n} be a stationary finite state space Markov chain with transition probability matrix P = (p_{ij}), 1 ≤ i, j ≤ r. Further, let {π_j, 1 ≤ j ≤ r} denote the stationary probability measure for the chain. Next, we define a chain-dependent sequence associated with the Markov chain {ξ_n} as follows. Let H_i(x), 1 ≤ i ≤ r, be distribution functions and let {A_n} be such that

P{A_n ≤ x, ξ_n = j | A_k, ξ_k, k < n} = p_{ξ_{n−1} j} H_{ξ_{n−1}}(x).

Further, let {τ_n, n ≥ 0} be a renewal process with a fixed renewal at τ_0 = 0. We assume that the sequence {τ_n} is independent of the sequence {A_n, ξ_n} and
define a shot noise process X by

X(t) = Σ_{k ≥ 0} A_k h(t − τ_k),   (2.1)
0 1, i.e. σi ~ G and σ^ — σ^-i ~ F, j > 2 in (2.2), r /I
C =
Proof: Consider for 0 < s\ < ... < sn < 1 x|σ^ = SJ, j = 1,..., n, n
Let κ(ί) = Ki and let {Xj} be iid with d.f. H and be independent of {ξj}. Then P{Y > x\σj = Sj,
1<J
ex,
we have n
P{Y > x\σj — Sj, 1 < j < n,σ n +χ > 1} ~ EN^κa(ξj-ι)ha(sj)H(x)
as x —> oo. (2.10)
Finally, it follows from (2.10) as in McCormick (1997) that oo
r 1
oo j=l
~
""
) asx^oo.
(2.11)
In the stationary point process case, i.e. σ\ ~ G, we have j
(2.12)
jj
Thus the lemma holds from (2.11) and (2.12).
D
Consider the stationary sequence -τj),
k = 0,1,...
(2.13)
where we define τ_j for j < 0 such that {−τ_{−j}, j = 1, 2, . . .} is an independent copy of {τ_j, j = 1, 2, . . .}. By Lemma 2.1,

P{W_0 > x} ~ c x^{−α} L(x)   as x → ∞,   (2.14)
where c = ( ^ ^ «f π2) ΣZ=ι J 0}, m — 1,2,3,..., satisfy mixing conditions D(un) and D^m+ι\un) for a suitable sequence un. By Corollary 1.3 in Chernick et al (1991), this implies that the sequence {W™, k > 0} has an extremal index θm for each m. We recall the definition of D(un) and D^k\un). For a stationary sequence ^ } and sequence of constants {un}, set for 1 < I < n — 1
α n ? / = snp\P{Wj < un,j e AuB}-P{Wj
< un,j e A}P{Wj < un,j E
where the supremum is taken over all A, B such that A ⊂ {1, . . . , k} and B ⊂ {k + l, . . . , n} for some k with 1 ≤ k ≤ n − l. The condition D(u_n) is said to hold for the stationary sequence {W_k} if for some sequence l_n = o(n), α_{n,l_n} → 0 as n → ∞. If D(u_n) holds for {W_k}, we say that condition D^{(k)}(u_n) holds provided there exist sequences of integers {s_n} and {l_n} with s_n → ∞, s_n α_{n,l_n} → 0, s_n l_n/n → 0 and

lim_{n→∞} n P{W_1 > u_n ≥ ⋁_{i=2}^{k} W_i, ⋁_{j=k+1}^{r_n} W_j > u_n} = 0,   (2.15)
where r_n = [n/s_n] and ⋁_i^j signifies the maximum as the index varies over i to j. For the sequence {u_n}, consider for any β > 0, u_n = u_n(β) satisfying, for all n sufficiently large,

u_n = H^{←}(1 − β/n),   (2.16)

where we take H^{←}(x) = inf{y : H(y) ≥ x} and H is as given in (2.8). Note that regular variation of 1 − H(x) implies n(1 − H(u_n)) → β as n → ∞. First, we note from Denzel and O'Brien (1975) that the chain-dependent process {A_n} is strong mixing, and since the m-tuples (h(τ_k − τ_{k−m}), . . . , h(τ_k − τ_{k−1})), k = 0, 1, . . ., are m-dependent, we see that {W_k^m, k ≥ 0} is strong mixing. Thus condition D(u_n) holds. Next, we consider D^{(m+1)}(u_n). Observe
rn
v wr
>un>\ιwr,
2=771+2
i-2
m + 2 un, W™ > un\ξk,
k un} j=l-m
j=i-m
where Pτ,x denotes taking the probability with respect to the r and X sequences only. Thus if K* — Vi=i «i? 771 + 1
> un,Wt
m
> un} < P
2
Hence, D^{(m+1)}(u_n) holds. We next turn our attention to computation of the extremal index for the W_k. Lemma 2.2. Let the W_k, k ≥ 0, be as defined in (2.13). Then {W_k} has an extremal index θ given by
Proof. We begin by computing the extremal index for {W_k^m}. Since conditions D(u_n) and D^{(m+1)}(u_n) hold for {W_k^m}, we have by Corollary 1.3 in Chernick et al. (1991) that the extremal index for the {W_k^m} exists and is given by θ_m, where

θ_m = lim_{n→∞} P{ ⋁_{i=2}^{m+1} W_i^m ≤ u_n | W_1^m > u_n }.
Now observe that m+l
nP{W?
>un>\/
W™} m+l
=
nEP{W^
>un > \J W™\τk,ξk,
k<m + l}
i=2
=
1
m+l i
ΠEPX{ ^
«(ίj-l)Λ(n -Tj)Xj >Un > \J ( Σ
j=l-m
Kiξj-lM^-Tj).
ϊ~2 j=i-m
Following the development in Chernick et al (1991), it is checked that as n -» oc, 1
m+l i τ
y ^ κ>{ζj-ι)h{τ~i — j)Xj > ^n ^ \f ( / ^ κ(ζj-i)h(ri
nEPχ{
j=l-m 1
j+m
j=l-m
i=2
1
β Σ
j+rn
E{ha\rx-τ3)-\l
j=l-m 1
β Σ j=l-m
~ Tj)
i=2 j=i-m
Λ ° ( T i - T ^ I + ^ & -I)
i—2 r E ha
a
l ^-^)-h (τ2-r3))(Σoo
\ 0 and the resolvent {k(θ)I - Cθ)~ι is compact. Let Λ# = (k(θ)I - Cθ)^. Let hi(θ) be an orthonormal system of eigen functions of Λ#. We assume that the following condition holds. (H3) There exists a complete orthonormal system {hi, i > 1} independent of θ such that
The elements of the basis {h{,i > 1} are also eigen functions for the operator CQ, that is i = μi hi
where
For s > 0, define H$ to be the set of all u G L
1 form an orthonormal basis in HQ. Condition (H2) imples that for every 5 ,the spaces H$ are equivalent for all θ. We identify the spaces HQ (denoted by Hs) and the norms !|.|| s ,0 for different θ € θ. In addition to the conditions (H1)-(H3), we assume that (H4)u 0 G H~a where a > \. Note that u0 e L 2 (G), (H5) the operator Ai is uniformly strongly elliptic of even order mi and has the same system of eigen functions {h^i > 1} as Cg. The conditons (H1)-(H5) described above are the same as those in Huebner and Rozovskii (1995). Note that u0 E H~a. For θ <E Θ, define θ
(2.8)
u 0ι = (uo,h-«)-a. Then the random field 00
θ
Σ
θ
(
x
)
(2-9)
1=1
is the solution of (2.1) subject to the boundary conditions (2.2) and (2.3) where u\ (ί) is the unique solution of the stochastic differential equation du\{i) = μθiUi(t)dt + \-a{θ)dWi{t),0 < t < T,
u{f)(0)=uθ0ι.
(2.10)
(2.11)
Let π_N be the orthogonal projection operator of H^{−α} onto the subspace
spanned by {h~θa, 1 < i < N}. Let uN'θ{t,x)
= πNuθ(t,x)
(2.12)
θ
where u τ(t) is the solution of (2.10) subject to (2.11). Note that duNfi(t,
x) = AeuNfi{t,
x)dt + dWN(t,
(2.13)
x),0 1 are independent standard Wiener processes.
Let P_θ^N be the probability measure generated by u^{N,θ} on C([0, T]; R^N). Let h_i^{−α} denote h_{i,θ_0}^{−α}, let u^N denote u^{N,θ_0} and let u denote u^{θ_0} when θ_0 is the true parameter. It is known that, for any θ ∈ Θ, the measures P_θ^N and P_{θ_0}^N are absolutely continuous with respect to each other and

log (dP_θ^N / dP_{θ_0}^N)(u^N) = (θ − θ_0) ∫_0^T (A_1 u^N(s), du^N(s))_0 − ((θ² − θ_0²)/2) ∫_0^T ‖A_1 u^N(s)‖_0² ds − (θ − θ_0) ∫_0^T (A_1 u^N(s), A_0 u^N(s))_0 ds.   (2.16)

It is easy to check that (cf. Huebner and Rozovskii (1995))

θ̂_N − θ_0 = ∫_0^T (A_1 u^N(s), dW^N(s))_0 / ∫_0^T ‖A_1 u^N(s)‖_0² ds,   (2.17)
where ΘN is the maximum likelihood estimator of ΘQ. Huebner and Rozovskii (1995) studied the asymptotic properties of this estimator under the
conditions (H1)-(H5). Furthermore, the Fisher information is given by

I_N = E ∫_0^T ‖A_1 u^{N,θ_0}(s)‖_0² ds.   (2.18)

Note that I_N → ∞ as N → ∞ by Lemma 2.1 of Huebner and Rozovskii (1995). Suppose that Λ is a prior probability measure on (Θ, B), where B is the σ-algebra of Borel subsets of the set Θ ⊂ R. We assume that the true parameter θ_0 ∈ Θ°, the interior of Θ. Further suppose that Λ has a density λ(·) with respect to the Lebesgue measure and that λ(·) is continuous and positive in an open neighbourhood of θ_0, the true parameter. Let

τ = I_N^{1/2}(θ − θ̂_N)   (2.19)

and

p*(τ | u^N) = I_N^{−1/2} p(θ̂_N + τ I_N^{−1/2} | u^N),   (2.20)
where ^ ( 0 1 ^ ) is the posterior density of θ given uN. Note that P(θ\uN) = —$
(2.21)
S^(u^)\{θ)dθ and let p*(τ\uN)
denote the posterior density of ij
(θ — §w). Let
dP? θ
In view of (2.16), it follows that 2
(2.23)
\ogvN{τ) = -\r l^ I WA^WWlds since HA1uN(s),duN(s)-A0(s)uN(s)d8)0 ΘN = 2 0
.
(2.24)
Let

C_N = ∫_{−∞}^{∞} v_N(τ) λ(θ̂_N + τ I_N^{−1/2}) dτ.   (2.25)

It can be checked that

p*(τ | u^N) = C_N^{−1} v_N(τ) λ(θ̂_N + τ I_N^{−1/2}).   (2.26)
T 1
Note that (Cl)
N
βN = I^ J \\Aχu {s)\\lds -> 1 a.s. [Pθo] as N -> oo. o from the Lemma 2.2 of Huebner and Rozovskii (1995). Then the following relations hold: (i)
iv
limo^(r)=exp(-ir2)a.s.[P(,0],
(ii) for any 0 < 7 < 1, 1
o.
x
for every r for sufficiently large JV, and (iii) for every 5 > 0, there exists 7' > 0 such that sup
vN(τ)
0.
Then OO
lim
1.
mm
NN
/ \τ\ \p*(τ\u \p*(τ\u ) ) -( (± ^ λ)* e-^\dr
= 0 a.s. [Pθo].
(2.28)
Remarks: It is obvious that the condition (D2) holds for m = 0. Suppose the condition (Dl) holds. Then it follows that 0 0
1/2
lim ί \p*(r\uN) - (±-)
e-^ldr - 0 a.s. [P,o].
(2.29)
This is the analogue of the Bernstein-von Mises theorem in the classical statistical inference. As a particular case of Theorem 2.2, we obtain that ~ θo)}m -> E[Z)m as N -> oo
(2.30)
where Z is N(0, 1).
For proofs of Theorems 2.1 and 2.2, see Prakasa Rao(1998).
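To illustrate how the ratio in (2.17) can be evaluated in practice, the sketch below works with a deliberately simplified diagonal model in which A_θ = θA_1, A_1 acts on the first N Fourier coefficients through assumed eigenvalues μ_i, and the driving noise has unit intensity; the maximum likelihood estimator then reduces to a ratio of discretized (stochastic) integrals. All concrete choices (eigenvalues, noise scaling, discretization step) are illustrative assumptions rather than the setting of Huebner and Rozovskii (1995).

```python
import numpy as np

rng = np.random.default_rng(1)

# Simplified diagonal toy model: the i-th Fourier coefficient solves
#   du_i = theta0 * mu_i * u_i dt + dW_i,
# i.e. A_theta = theta * A_1 with A_1 h_i = mu_i h_i and unit noise intensity.
theta0, N, T, dt = 0.5, 20, 10.0, 1e-3
mu = -np.arange(1, N + 1, dtype=float)   # assumed eigenvalues of A_1 (dissipative modes)

u = np.ones(N)                           # u_i(0) = v_i > 0
num = 0.0                                # accumulates  int (A_1 u, du)_0
den = 0.0                                # accumulates  int ||A_1 u||_0^2 dt
for _ in range(int(T / dt)):
    dW = rng.normal(0.0, np.sqrt(dt), N)
    du = theta0 * mu * u * dt + dW
    num += np.sum(mu * u * du)
    den += np.sum((mu * u) ** 2) * dt
    u += du

theta_hat = num / den                    # MLE; subtracting theta0 gives the discretized analogue of (2.17)
print(f"theta0 = {theta0:.3f},  theta_hat_N = {theta_hat:.3f}")
```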
3
Bayes Estimation
We define an estimator θ̃_N for θ to be a Bayes estimator based on the path u^N corresponding to the loss function L(θ, φ) and the prior density λ(θ) if it is an estimator which minimizes the function
BN(φ) = jL(θ,φ)p(θ\uN)dθ,φ(Ξ Θ where L(θ,φ) is defined on Θ x Θ. Suppose there exist a Bayes estimator 0/v. Further suppose that the loss function L{θ,φ) satisfies the following conditions: (El) L(θ,ψ)=L(\θ-φ\)>0; (E2) L(t) is nondecreasing for t > 0; (E3) there exists nonnegative functions (a) RNL{τI~1/2)
and G(τ) such that
RN,K(T)
< G{r) for all N>1;
I/O
(b)
RNL(TIN
' ) -> K(τ) as TV -> oo uniformly on bounded intervals
of τ; (c) the function oo _ !
/
2
K(r + ra)e~2r dr
—oo
achieves its minimum at m — 0, and (d) G(τ) satisfies the conditions similar to (C3) and (C4).
STOCHASTIC
57
PDE
The following result can be proved by arguments similar to those given in Borwanker et al. (1971). We omit the proof.
Theorem 3.1: Suppose the conditions (D1)-(D2) of Theorem 2.2 hold in addition to (H1)-(H5) stated earlier. In addition , suppose that the loss function L{θ,φ) satisfies the conditions (El) - (E3) stated above. Then Oa.s. [Pθo] asJV->oo
(3.1)
and lim RNBN(ΘN) N-+00
=
lim RNBN(ΘN)
(3.2)
N—ΪOO 1/2
Huebner and Rozovskii (1995) proved that ΘN -> 0O a.s. [Pθo] as N -> oo
(3.3)
and I]!2ΦN
- θ0) 4 iV(0,1) as N -> oo
(3.4)
under the conditions (H1)-(H5). As a consequence of Theorem 3.1, it follows that 07V -> 0o a.s [Pθo] as TV -> oo
(3.5)
and
l]ί2φN
- 0O) 4 7V(0,1) as N -> oo.
(3.6)
In other words the Bayes estimator ΘM of the paramaeter 0 in the parabolic SPDE given by (2.1) is strongly consistent, asymptotically normal and asymptotically efficient as N —> oo under the conditions (H1)-(H5) of Huebner and Rozovskii (1995) and the conditions stated in Theorem 3.1. Remarks:
A general approach for the study of asymptotic properties
of maximum likelihood estimators and Bayes estimators is by proving the local asymptotic normality of the loglikelihood ratio process as was done in
58
PRAKASA RAO
Prakasa Rao (1968), Ibragimov and Khasminskii (1981) in the classical i.i.d. cases and by Huebner and Rozovskii (1993) for some classes of SPDE. Our approach for Bayes estimation, via the comparison of the rates of convergence of the difference between the maximum likelihood estimator and the Bayes estimator, is a consequence of the the Bernstein - Von Mises type theorem .
We now consider a nonparametric version of the problem discussed earlier for a class of SPDE.
4
Stochastic PDE with Linear Multiplier
Let (Ω,^1*, P) be a probability space and consider the process u e (£,x),0 < x < 1,0 < t 0 and θ £ Θ where Θ is a class of real valued functions 0(i),O < t < T uniformly bounded , k times continuously differentiate and suppose that the λ -th derivative θ^k\.) satisfies the Lipschitz condition of order α G (0,1], that is, |0(*)(t) _ 0(*)( 5 )| < \t - s\a,β = k + a.
(4.2)
Further suppose the initial and the boundary conditions are given by ue(O,a0 = /(α0,/€L2[O,l]
( 43 )
u e (ί,0) = u e ( t , l ) = 0 , 0 < ί < T and Q is the nuclear covariance operator for the Wiener process WQ( 1 are independent one - dimensional standard Wiener processes and {e^} is a complete orthonormal system in Z^IP, 1] consisting of eigen vectors of Q and {^} eigen values of Q. We assume that the operator Q is a special covariance operator Q with βfc = sin(kπx),k
> 1 and λ^ = (πk)2,k > 1. Then {e^} is a complete
orthonormal system with eigen values qι = (1 + AJ" 1 ^' > 1 for the operator Q and Q = (/ - Δ)"" 1 . Note that dWQ = Qιl2dW.
(4.5)
We define a solution u£(t,x) of (4.1) as a formal sum oo
Uε{t,x) = (cf. Rozovskii (1990)). It can be checked that the Fourier coefficient Uiε(t) satisfies the stochastic differential equation duiε{t) = (0(t) - \τ)uιε{t)dt +
,S V Λi + 1
dWj{t):
0 < t < T
(4.7)
with the initial condition uιε(0) = vu vi=
ί f(x)eι(x)dx. Jo
(4.8)
We assume that the initial function / in (4.3) is such that Vi = / f{x)et(x)dx>0, Jo
i > 1.
Estimation of linear multiplier We now consider the problem of estimation of the function θ(t), 0 < t < T based on the observation of the Fourier coefficients Uiε(t), 1 < i < N over [0, T] or equivalently the projection uε ' (ί, x) of the process uε(t, x) onto the subspace spanned by {ei,..., ejy} in L2[0,1]. We will at first construct an estimator of #(.) based on the path {ixie(t), 0 < t < T}. Our technique follows the methods in Kutoyants (1994), p.155.
60
PRAKASA RAO Let us suppose that sup sup |0(t)| < L 0 .
(4.9)
θeθθ 0
(4.49)
for 1 < i < N. In view of (4.30), it follows that the estimator θ*Nε(t) is a consistent estimator of θ{t). Theorem 4.2: Under the conditions stated above, for 0 < t < T, fl vεW "^ θ(t) as ε -^ 0. Note that
ΣϊLi Since (i) 7ε(4(ί) - θ(t)) 4 ΛΓ(0,σ2(ί)) as ε -> 0 for 1 < i < iV, (ii) ύiε(t) 4 Ui(ί) as ε ~> 0 for 1 < i < N,
(4.50)
68
PRAKASA
RAO
and since the estimators θiε(t),l < i < N are independent random variables, it follows that the estimator θ*Nε(t) is asymptotically normal and we have the following theorem. Theorem 4.3: Under the conditions stated earlier, for 0 < t < T, Ίε{θ*Nε{t) - θ(t)) 4 iV(0, σ2{t)) as ε -> 0
(4.51)
= ε-2lτϊ
(4.52)
where
Ίζ
and
Remarks l:If k = 0 and /? = 1, that is, the function θ(.) G Θ where Θ is the class of uniformly bounded functions which are Lipschitzian of order one, then it follows that ε'Hθ*Nε{t) - θ(t)) 4 N(0,σ2{t))
as ε -> 0.
(4.54)
Remarks 2: It is known that the probability measures generated by stochastic processes satisfying the SPDE given by (4.1) are absolutely continuous with respect to each other when θ(.) is a constant (cf. Huebner et al.(1993)). There are classes of SPDE which generate probability measures which are singular with respect to each other when θ(.) is a constant. One can study the problem of nonparametric inference for a linear multiplier for such a class of SPDE by the above methods(cf. Prakasa Rao (2000b)).
References Borwanker, J.D., Kallianpur, G. and Prakasa Rao, B.L.S. 1971 The Bernstein-von Mises theorem for Markov processes, Ann. Math Statist. 42, 1241-1253.
STOCHASTIC PDE
69
Da Prato, G. and Zabczyk, J. 1992.Stochastic Equations in Infinite Dimensions, Cambridge University Press, Cambridge. Huebner, M., Khasminskii, R. and Rozovskii. B.L. 1993. Two examples of parameter estimation for stochastic partial differential equations, In Stochastic Processes : A Festschrift in Honour of Gopinath Kallianpur, Ed. S.Cambanis, J.K.Ghosh, R.L.Karandikar, P.K.Sen, Springer, New York, pp. 149-160. Huebner, M., and Rozovskii, B.L. 1995.On asymptotic properties of maximum likelihood estimators for parabolic stochastic SPDE's. Prob. Theory and Relat Fields, 103, 143-163. Ibragimov, LA., and Khasminskii, R. 1981. Statistical Estimation: Asymptotic Theory, Springer-Verlag, Berlin. Ito, K. 1984.^0^7-1^207^5 of Stochastic Differential Equations in Infinite Dimensional Spaces, Vol. 47 of CBMS Notes, SIAM, Baton Rouge. Kallianpur, G., and Xiong, J. 1995. Stochastic Differential Equations in Infinite Dimensions , Vol. 26, IMS Lecture Notes, Hayward, California. Kutoyants, Yu. 1994. Identification of Dynamical Systems with Small Noise , Kluwer Academic Publishers, Dordrecht. Prakasa Rao, B.L.S. 1968. Estimation of the location of the cusp of a continuous density, Ann. Math. Statist., 39, 76-87. Prakasa Rao, B.L.S. 1981. The Bernstein - von Mises theorem for a class of diffusion processes, Teor. Sluch. Proc.,9, 95-101 (In Russian). Prakasa Rao, B.L.S. 1984. On Bayes estimation for diffusion fields. In Statistics : Applications and New Directions, Ed. J.K. Ghosh and J. Roy, Statistical Publishing Society, Calcutta. Prakasa Rao, B.L.S. 1998. Bayes estimation for parabolic stochastic partial differential equations, (Preprint, Indian Statistical Institute, New Delhi).
70
PRAKASA
RAO
Prakasa Rao, B.L.S. 1999a. Statistical Inference for Diffusion type Processes, Arnold, London and Oxford university Press, New York. Prakasa Rao, B.L.S. 1999b. Semimartingales and their Statistical Inference, CRC Press, Boca Raton , Florida and Chapman and Hall, London. Prakasa Rao, B.L.S. 2000. Bayes estimation for some stochastic partial differential equations , J. Statist. Plan. Infer. , 9 1 , 511-524. Prakasa Rao, B.L.S. 2000a. Nonparametric inference for a class of stochastic partial differential equations, Tech. Report No. 293, Dept. of Statistics and Actuarial Science, University of Iowa. Prakasa Rao, B.L.S. 2000b. Nonparametric inference for a class of stochastic partial differential equations II, Statist. Infer, for Stock. Proc. (To appear). Rozovskii, B.L. 1990. Stochastic Evolution Systems, Kluwer, Dordrecht. Shimakura, N. 1992. Partial Differential Operators of Elliptic Type, AMS Transl. Vol. 99, Amer. Math. Soc, Providence.
71
Institute of Mathematical Statistics
LECTURE NOTES — MONOGRAPH SERIES
FIXED DESIGN REGRESSION UNDER ASSOCIATION George G. Roussas University of California, Davis Abstract For n = 1,..., 7i, let xni,i = 1,..., n, be points in a compact subset in Sftd,d > 1, at which observations Yn{ are taken. It is assumed that these observations have the structure Yni = g(xni) + εni> where g is a real-valued unknown function, and the errors (e n i, ^nn) coincide with the segment (£χ,... ,f n ) of a strictly stationary sequence of random variables ξi, & » — F° r each x G 5Rd, the function g(x) is estimated by gn(x]Xn) = i2?=iwni(x\xn)Yni, where xn = (xnli... ,x n n ) and Wnϊ( ; •) are weight functions. Under suitable conditions on the underlying stochastic process £1,62, and the weights wni( ; •), it is shown that the estimate gn(x\Xn) is asymptotically unbiased, and consistent in quadratic mean. By adding the assumption of (positive or negative) association of the sequence £i,&» • •> it is shown that ρ n (^;^n), properly normalized, is also asymptotically normal.
Key words and phrases: Fixed design regression, stationarity, weights, fixed design regression estimate, asymptotic unbiasedness, consistency in quadratic mean, association, asymptotic normality.
1
Introduction
For each natural number n, consider the design points xni, % = 1,... ,n in Sβd, d > 1, which, through a real-valued (Borel) function g defined on $ϊd, produce observations Ym, subject to errors ε n i, 1 < i < n. That is, Yni = g(xni)
+ εni,
l 1, is a (strictly) stationary and (positively or negatively) associated (see Definition 1.1) sequence of random variables (r.v.s). The problem we are faced with here is that of estimating
72
ROUSSAS
the function g in terms of the YniS and xniS, and establishing optimal properties for the proposed estimate. Following established tradition in this line of work, for each x G 5Rd, the contemplated estimate is gn(x\ xn) given by
gn{x;xn) = ^2wni(x;xn)Ynii
(1.2)
2=1
where xn = (xnι,... ,£ n n), and wni( ), 1 < i < n, are suitable weight functions. It will be shown that, under appropriate regularity conditions, the proposed estimate is asymptotically unbiased, consistent in quadratic mean, and asymptotically normal. Properties of this nature and for specific choices of the weight functions were established by Priestly and Chao (1972), and Gasser and Mύller (1979). This problem was also investigated by Georgiev and Greblicki (1986) and Georgiev (1988). In all of these cases, the errors εUu i = l , . . . 5 n , were assumed to be independent identically distributed (i.i.d.). When independence is replaced by strong mixing, the above cited results were established in Roussas (1989) and Roussas et al. (1992). In the present contribution, independence is suppressed again and is replaced by association. For a brief review on the significance of the concept of association, some of its applications, and a summary of some (statistical) results under association, the interested reader is referred to the review paper Roussas (1999). Relevant are also the papers of Cai and Roussas (1999 a,b). Important results on some limit theorems for dependent r.v.s, and, in particular, negatively dependent r.v.s may be found in Bozorgnia et al. (1996), Patterson and Taylor (1997), Taylor and Patterson (1997), and Taylor et al. (1999a,b). The paper is organized as follows. Asymptotic unbiasedness and consistency in quadratic mean are established in Section 2 after suitable assumptions are spelled out. Asymptotic mormality is proved in Section 3 along with a number of auxiliary results. Assumptions under which these results hold are also stated in this same section, and they are followed by some comments. This section is concluded with the definition of association. Definition 1.1. For a finite index set 7, the r.v.s {Xi\ i G /} are said to be positively associated (PA), if for any real-valued coordinatewise increasing functions G and H defined on SR7, Cσυ[G{Xuiεr),H(XJ9j
εη]>0,
provided EG2(Xi,i G I) < oo,£H2(XjJ G J) < oo. These r.v.s are said to be negatively associated (NA), if for any nonempty and disjoint subsets A and B of /, and any coordinatewise increasing functions G and H with G :
REGRESSION
73
UNDER ASSOCIATION
5ft and # : 5ftβ -> 5ft with SG2(Xi,i
e A) < oo,SH2(XjJ
i,z e A),H(XjJ
Eΰ) oo unless otherwise stated, and C stands for a generic (positive) constant.
2
Asymptotic Unbiasedness and Consistency in Quadratic Mean
Assumptions (A) (Al) For a compact subset S of 5ftrf, the function g : S —> 5? is continuous. (A2) For 1 < i < n and n > 1, the errors εniS have expectation 0. For each x G S and with xn = (x n χ,... , x n n ) ϊ ^ m € 3?d, i = 1 , . . . ,n, the weights wni(x]xn) are 0 for i > n, and satisfy the following requirements for 1 < i < n: (A3) ΣΊ-i \wni{x\ xn)\ < S , n > 1, for a positive constant B.
(A4) Σ ? = i k f e « n ) h l . (A5) For any c> 0, £ " = 1 |^ n ί (a;;x n )|/ ( || X n ._ ; r || > c ) (a;) -> 0, where || || is any one of the familiar norms in 3?d. All results in this paper hold for all x G 3id and with xn as defined above. Theorem
2.1 (asymptotic unbiasedness). £gn{x',xn)
Proof. Writing gn(x) and wni(x)
Under assumptions (Al) - (A5),
->g(χ)
instead oί gn(x\xn)
and wni(x;xn),
re-
74
ROUSSAS
spectively, we have n
Sgn(x) - g(x) I =:| ^2wni{x)g{xni)
- g{x) \
2=1
n
7/j
ίr^ II niΎ \ — n(τ\
I Wnι\?) II 9 ^ m J
9VXJ I i (||xni-^||'
i=l
For every ε > 0 and sufficiently small c = c(ε), consider those xnjS for which || ίr^i — re ||< c. Then | g(xni) — g{%) |< ε, and therefore | g{xni) — g(x) I /(||χ ni -χ||< c )(^) < ε Thus, for all sufficiently large n, (2.1) yields I £gn(x) — g(x) ϊ< 2Cε + εC + εC = 4εC, where C is a suitable bounding constant. This completes the proof. • Before the formulation of the second main result, assumptions (A) are augmented as follows. Assumptions (B) (Bl) For each n > 1, (ε n i,..., εnn) is equal in distribution to (ξi,..., ξ n ), where {ξn}5 n > 1, is a (strictly) stationary sequence of r.v.s, £ξf = σ 2 < oo, (B2) For each x G 3id and x n as above, {\wni(x] xn)\ 1 < i < n} -> 0. Theorem 2.2 (consistency in quadratic mean). Under assumptions (Al) (A5) and (Bl) - (B2), ε[gn(x;xn)-g{x)]2->0. Proof. For further notational simplification, write just wni instead of v>ni{x) — Wmθz;#n), and recall that wn = max{\wni\; 1 < i < n}. Then, by assumptions (A3) and (B2), (2.2)
Next, 2
2
£ [9n(x) - 9(x)] = ε [gn(x) - εgn{x)}
2
+ [£gn(x) - g(x)] ,
and the second term on the right-hand side above tends to 0, under assumptions (Al) - (A5), by Theorem 2.1. So, it suffices to show that Var(gn(x)) ->> 0. To this end,
Var(gn(x)) = Var Σ w
)
[ ^2 wniwnjS
£ε
li li
{εniεnj)
i=\
(2.3)
jS {εniεnj).
Since the first term on the right-hand side of (2.3) tends to 0 by (2.2), it suffices to show that 0. By assumption (Bl), jZ
(εniεnj)
iξj)
n—\
t=l
n-l
•\Cσo(ξltξn-i+ι)\] 2=1
(by stationarity) n—1
t=l
n—i
s.i=l
by assumptions (A3), (Bl) and (B2). • Remark 2.1. At this point, it is to be observed that Theorems 2.1 - 2.2 were established without reference to association. The property of association is used only in Theorem 3.1, stated and proved in Section 3.
3
Asymptotic Normality
Introduce the following notation by suppressing the argument z. Set Zni
= σ~ιwniεni,
σl
=Var
(gn>)
=
equal in distribution to σ~ιwniii, v
a r
t^rι
βM
,CΛ
Ί
,
ϊ
(όΛ)
.
Also, for m = 1,..., fc, let
Jm
={(m
and define ynm, y'nm and y^ by:
ynm = Σ Zni, y'nm = Σ Znj, y l = Σ ielm
jeJm
Z
^
(3 3)
l=k(p+q)+l
and let y'nm, ra=l
K = V'n
(3-4)
- Sgn).
(3.5)
m=l
We wish to show that Sn Λ ΛΓ(0,1),
where Sn = σ-\gn
Clearly, Sn = Tn+T'n+Tl
(3.6)
and (3.5) will be established by showing that Tn Λ JV(0,1),
(3.7)
and
ε{τ n γ + ε(τχf -> o.
(3.8)
77
REGRESSION UNDER ASSOCIATION
These assertions hold true under the set of assumptions stated below. Although some of the assumptions spelled out below coincide with assumptions previously made, we choose to gather all of them here for easy reference. Assumptions (C) (Cl) The sequence {£n}5 n > 1, is (either positively or negatively) associated and (strictly) stationary. 2+δ
(C2) Sξx = 0, S\ξι\
< oo for some δ > 0, and Σ ^ |Coυ(fi>£j+i)l < °°
(C3) For 1 < i < n and n > 1, (ε n i,... ,ε n r ι ) is equal in distribution to (fl,---,ίn). With wn = max{\wni(x\ # n ) | , 1 < i < n}, it is assumed that: (C4) (i) wn = 0{n-1). (ii) wn = O(σ^), where σ\ = σ\{x) =
Var(gn(x;xn)).
Let p = Pn and q = qn be positive integers with q < p < n and tending to oo, as n —> oo, and let k = kn be the largest integer for which k(p + q) < n. Then select p and q as just described, and also to satisfy the requirements: (C5)
(i) p = o{np), p = 2(ϊ^y (the same δ as in (C2)).
(ϋ) *?->!. Comments on some assumptions (a) The choice of p, q, and fcasθ• 0 and J -»• 0 (since ^ψ± = & + f and both *ίH±ώ,$-> 1, and 2 = f - + 0 ) . n 1
(c) If p = o(n) (which is implied by (C5)(i)), then fc -> oo (since £ =
5) -»° b y ( b ))
£±£
78
ROUSSAS
(d) Choices of p and q as described above and satisfying condition (C5) are readily available. Indeed, for 0 < 62 < δ\ < p, take p ~ nδl and q ~ nδ2 (where xn ~ yn means | ^ ->• 1). This choice of p is consistent with (C5)(i) (since £ = ;£-• ^ 7 Γ "^ °) Furthermore, A; ~ n 1 "* 1 (since ^ = ^ • & and k = ^ . ^ . . ^ which tends to 1). Therefore $ = ^
^
^ 1.
(e) That δ in assumptions (C2) and (C5)(i) must be the same stems from the proof of Lemma 3.2(iii). (f) Assumption (C4)(ii) is borrowed from Roussas et al.(2000) (see Remark 2.1(ii), page 265). Theorem 3.1. Under assumptions (Cl) - (C5), the convergence asserted in (3.5) holds; that is,
where Sn is defined in (3.5), gn = gn{xm, xn) is given in (1.2), and σ\ = σ^(x)
= Var(gn). The proof of the theorem follows by combining the two propositions below. The propositions, as well as the three lemmas employed in this section, hold under all or parts only of assumptions (Cl) - (C5). However, these lesser assumptions will not be explicitly stated. Proposition
3.1. The convergence asserted in (3.8) holds; that is,
ε{τ'nγ + ε{τ£γ -+ o, where T'n and T% are given in (3.4). Proposition
3.2. The convergence asserted in (3.7) holds; that is,
τnΛ;\r(o,i), where Tn is given in (3.4). Assuming for a moment that Propositions 3.1 and 3.2 have been established, we have Proof of Theorem 3.1. It follows from Propositions 3.1 - 3.2 and relation
(3.6). • The following three lemmas will be required in various parts of the proofs of Propositions 3.1 - 3.2.
79
REGRESSION UNDER ASSOCIATION f
Lemma 3.1. Let ynm and y nm be defined by (3.3). Then: (i) KKr 0. The proof of the proposition is completed. •
84
ROUSSAS
For the formulation of the second lemma, introduce the following notation. Let Ynm, m = l,...,fc be independent r.v.s with Ynm having the distribution of ynm, set si = Σm=iVar(Ynm), and let Xnm = ^ with distribution function Fnm, m = 1,..., k. Then the r.v.s Xnm, rn — 1,..., k are independent with 8Xnm = 0 and Σm=i Var(Xnm) = 1. Finally, for ε > 0, set *2dFnm(x).
(3-20)
Then we have Lemma 3.2. Let T n and gn(ε) be given by (3.4) and (3.20), respectively, and recall that si = Σ,km=ι Var(Ynm). Then: (i) εTl -> 1, (ii) si -> 1, and (iii) gn{ε) -+ 0. Proof, (i) From (3.5), £5;; = 1, whereas from (3.6),
= εs2n + ε{τ'n + τχ)2 - 2ε [sn{τ'n + rj)] = l + ε{τ'n + τχ)2 - 2ε [sn(τ^ + τ;')]. But by Proposition 3.1,
εHr^ + TZ)2 < εHTtf + εHTZ)2 -> o,
(3.21)
and
2
2
< (εϊsήεh (τ'n + τή = ε\ (τn + τή -+ o (by (3.21)). Thus,
ετl -41. (ii) Prom (3.4) again,
J] m=l
m=l
), Kl 1 with - + - = 1 ) s t \Vn
2s
\
125
(3.23) At this point, take s = ^ Then (3.23) becomes
and t = ^ , so that 2s = 2 + δ = v, and ^ = δ.
x2dFnm(x)
[
J(\x\>ε) However, by assumption (Bl)
ε \ynm\ =ε
y
zni
ε) Hence, with C = e~ιε \ ξι \u,
and therefore, by (3.24) - (3.25),
x2dFnm(x) 0] + (1 - a)I[u < 0]} , u € R,
aβ [0,1],
and x ( := (Yt, ..., Yt-p+ι)', t = 0,..., n - 1. This α-autoregression quantile can be obtained as the component ΰ(n)=
ίθ^\θinΛ
€ W+1 of the optimal solution ( d ( n ) , f + f - )
G E2n+P+1
of the linear program α l n r + + (1 — a)lnr~
:= min
Y n - l n 0 O - XnZ = Γ+ - ΓzeW,
i^eEl,
0 the components of a n (a) are determined by the equality constraints in (1.6). Clearly, the sample paths {a n (α), 0 < a < 1} are continuous, piecewise linear, and such that Qt (0) — 15 and άtn (1) = 0. An obvious modification of the algorithms of Koenker and dΌrey (1987 and 1994) allows for an efficient computation of the solutions ΰ
(a) and a n (α) over the whole interval [0,1].
A crucial property of autoregression rank scores is their autoregressioninvariance, i.e., denoting by άtn (α, Y n ) the solution of (1.6), the fact that ά; n ) (α,Y n + * l n + X n z) =aί n ) (α,Y n ),
(z,z) e KP+1,
(1.7)
which immediately follows from (1.6). Some further algebraic relations between autoregression quantiles and the corresponding autoregression rank scores are provided in Lemma 2.1 of Hallin and Jureckova (1999). Quite remarkably, no preliminary estimation of θ is needed in order to compute autoregression rank score statistics. This is in sharp contrast with the more familiar aligned rank methods (Hallin and Puri, 1994), where ranks are computed from estimated residuals; see Jureckova (1991), Gutenbrunner and
Jureckova (1992), Gutenbrunner et al. (1993), Hallin et al. (1997a, 1977b), Harel and Puri (1998), or Hallin and Jureckova (1999) for details and numerical applications. In the present paper, new tests based on a autoregression rank score version of the traditional Kolmogorov-Smirnov statistic are introduced for model (1.1). The asymptotic behaviour of these tests is investigated in Section 3, where we show that the limiting distributions of the test statistics coincide with those of the classical Kolmogorov-Smirnov statistics, both under the null hypothesis as under contiguous alternatives. Our results extend those of Jureckova (1991) from regression models to autoregression models. The local asymptotic efficiency of these tests is also investigated. Finally, the performance of the proposed tests is illustrated on simulated AR series with Normal, Laplace and Cauchy innovation densities, respectively.
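Because the linear program (1.6) defining the autoregression rank scores is only partially legible above, the sketch below implements a standard Gutenbrunner–Jurečková-type dual program for regression rank scores, applied to the autoregressive design of an intercept plus p lagged values; it is presented as an illustration rather than as the paper's exact formulation, and the simulated AR(2) series and the use of scipy's linear programming routine are assumptions of this sketch.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(4)

# Simulate an AR(2) series (coefficients and length are illustrative assumptions).
n, p = 200, 2
Y = np.zeros(n + p)
for t in range(p, n + p):
    Y[t] = 0.5 * Y[t - 1] - 0.3 * Y[t - 2] + rng.standard_normal()
y = Y[p:]                                                   # observations Y_1, ..., Y_n
X = np.column_stack([np.ones(n)] + [Y[p - j:n + p - j] for j in range(1, p + 1)])

def rank_scores(alpha):
    """a_n(alpha) from the dual program: maximize y'a subject to
    X'a = (1 - alpha) X'1_n and 0 <= a_t <= 1 (Gutenbrunner-Jureckova form)."""
    res = linprog(c=-y, A_eq=X.T, b_eq=(1.0 - alpha) * (X.T @ np.ones(n)),
                  bounds=[(0.0, 1.0)] * n, method="highs")
    return res.x

a_half = rank_scores(0.5)
print("a_t(1/2), first five observations:", np.round(a_half[:5], 3))
```

Evaluating the scores on a grid of alpha values and forming the weighted partial sums described in Section 2 then yields the Kolmogorov–Smirnov-type statistic studied below.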
2
Limiting Distributions of Kolmogorov-Smirnov Statistics Based on Autoregression Rank Scores
Assume that the density / of the innovations in the autoregressive model (1.1) remains unspecified within the family T of exponentially tailed densities satisfying (1.2) and the following conditions (borrowed from Hallin and Jureckova, 1999) : (Fl)
f(x) is positive for all x E M, and absolutely continuous, with a.e. /
d e r i v a t i v e /'a n d
finite
Fisher information l(f)
:=
£f ί
\\ ^
/ 1 . 1
f(x)dx
J \}\x)j < oo; moreover, there exists K/ > 0 such that, for all \x\ > Kf, f has two bounded derivatives, / and / " , respectively; (F2)
/ is monotonically decreasing to 0 as a; —> ±oo and, for some b = bf > 0, r = 77 > 1, -log(l-F(s))
-logFjx) z lim »oo
b\x\r
i m
l z->oo
b\x\r
We will focus on the problem of testing null hypotheses of the form Ή-o
' θp = 0,
0(!) := (0i,..., θp-ι) unspecified,
against alternatives U\ : θp φ 0,
0 ( 1 ) := {θu ..., 0p_i) unspecified.
Such tests play a crucial role, for instance, in the order identification process (see Garel and Hallin 1999).
115
K-S TESTS FOR AR MODELS Write the AR{p) model (1.1) as ^(l) + Xn;2^p + β n ,
(2.1)
where X n := ( X n ; i:X n ; 2 I is the (n x p) matrix with rows x^_i, 1 < t < n, Xn;2 := (Y-p+u , ^n-p)' (hence, Xn;2,t = Yt-p, t = 1,..., n), and ε n := ; (εi,...,ε n ) . Denote by P n := X n ; i(X n ; 1 X n ; i) 1
the (random) matrix projecting W onto the linear space spanned by the columns of X n; χ. Define ^ ; 2 := Xn;2 ~ Xn;2 ~ [Xn,2 ~ Xn;2J l n
where n-p a
n
d
t=l
t=-p+l
and let )
:
^ = n" 1 (X n ; 2 - Xnflln)'
[In - P n ] (X n ; 2 - X n ; 2 l n ) .
Denoting by 7it(d), A; = 1,... the autocovariances of the stationary solution of (1.1), the consistency under AR(p) dependence of empirical autocovariances implies that D\ converges in probability, as n —> oo, to (
TO W
71W
•••
Ίv-2{θ)\
-1
71
Ίp-ιiβ)
which, in view of classical Yule-Walker equations, under HQ reduces to p-1
a simple scale factor that does not depend on 0, nor on the shape of the innovation density /. Let a n (a) = Γά/1 (α),...,ά n n (cm, 0 < a < 1, be the autoregression rank scores computed under HQ , i.e., corresponding to the submodel Yn =
+ εn
(2.2)
116
EL BANTU AND HALLIN
(though of course a " (α), as a statistic, does not depend on 0(i)) For each n, consider the process {Tn(a) : 0 < a < 1} defined by Xn;2,tάt
;
(α),
0 < α < 1.
(2.3)
t=l
This process has trajectories in the space C[o,i] °f continuous functions α ι-> c(a), a 6 [0,1] (as usual, C[o,i] is equipped with the Borel σ-field C associated with the uniform metric ||ci — C21| := maxo ϋ
and
{
U;
1
2
2
l-2f)(-l)*- «p(-2ifc x ) The proof is based on the following lemma.
0
•
(2
'
7)
Lemma 2.1 Define the scores 0 a*t (a) := { Rn.t - not 0
no. < Rn]t - 1 Rn,t - 1 < not < i? n ; ί Rn t < na ,
(2.8)
117
K-S TESTS FOR AR MODELS
where Rn]t denotes the rank of εt among εi,..., εn. Assuming that (F1)-(F2) are satisfied, let at(a) := I[εt > F~ι{a)],
1 < t < n,
0 < a < 1.
Then, sup n
-1/2
Σ [(*n;2,t " *n;2) ^ (α) - X^fi,( (^-I
where P$n is computed under θn := (0(i),0p) Then, for any 0 < α < 1,
,
(2-11)
where f\ and F\ stand for the standardized versions of f and F, respectively. Furthermore, for τ —> 0 (the notation £(τ) ~ ζ(τ) means that the ratio ξ(τ)/ζ(r) tends to one as τ -> 0; Φ, as usual, stands for the standard normal distribution function), 1/2
( ( 1 λ ί1 ι2 B(a,/,r) — a~ I 2σfT~ ' (f)a ί — -logα 1 /
\ φ(u,f)ψ(a,u)du)r, (2.12)
ψ(a,u) ~ 2Φ U - I l o g α ) 1 ' (2u - l)(tι(l - u))" 1 / 2 ! - 1.
Proof. This proof, as well as the proof of Theorem 3.1, heavily relies on Sections VL4.5 and VΠ.2.3 of Hajek and Sidak (1967); the various constants appearing there here are to be taken as ct = n-ι'2X^t
(hence, c = 0),
p 2 = 1,
K-S TESTS FOR AR MODELS
119
and
Theorem VΊ.3.2 in Hajek and Sidak (1967) entails that, under 1-Ln, the process converges in distribution to the Brownian bridge {Z(a)}. Since K+ = maxop},p=l,2, The Whittle (an approximate maximum likelihood) estimate βpn obtained by fitting an FAR(p, d) model is then defined as the value of β G E p which minimizes the integral
G(β) = / J —π
/n(λ
^λ,
(3.8)
where i\t
2πn
(3.9)
t=i
is the periodogram and -2 ΛX\d
(3.10)
is the power transfer function, which is proportional to the spectral density of the process {Xt} in the finite variance case. Under Assumption 2A, the corresponding Whittle estimate of σ2 is given by
=Γ J-
'p,nJ
EXAMPLE. To give an example of a parameter space satisfying Assumption 4, consider a sequence of positive numbers x_0, x_1, . . . such that Σ_{j=0}^{∞} x_j < ∞ and |β_j| ≤ x_j. The set {β ∈ F : |β_j| ≤ x_j, j = 0, 1, . . .} is then a parameter space satisfying Assumption 4. Moreover, the corresponding set {β_p : |β_j| ≤ x_j, j = 0, 1, . . . , p; β_j = 0, j > p} is an example of the parameter space E_p.
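As an illustration of the FAR(p, d) Whittle procedure described above, the following sketch minimizes a discretized version of (3.8) over (d, a_1, . . . , a_p). The simulated ARFIMA(0, d, 0) series, the truncation of its moving-average expansion, the fitted order p = 1, and the optimizer settings are all illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)

# Simulate an ARFIMA(0, d, 0) series via a truncated MA(infinity) expansion (illustrative).
n, d_true = 1000, 0.3
k = np.arange(1, 5000)
psi = np.concatenate(([1.0], np.cumprod((k - 1 + d_true) / k)))   # MA weights of (1 - B)^(-d)
z = rng.standard_normal(n + len(psi))
x = np.array([psi @ z[t:t + len(psi)][::-1] for t in range(n)])

# Periodogram at the Fourier frequencies lambda_j = 2*pi*j/n, j = 1, ..., [n/2].
lam = 2.0 * np.pi * np.arange(1, n // 2 + 1) / n
I = np.abs(np.fft.fft(x)[1:n // 2 + 1]) ** 2 / (2.0 * np.pi * n)

def whittle_objective(par, p):
    """Discretized version of (3.8): sum of I(lambda)/g(lambda; beta) with
    g(lambda; beta) = |a(e^{i lambda})|^{-2} |1 - e^{i lambda}|^{-2d}."""
    d, a = par[0], np.concatenate(([1.0], par[1:1 + p]))
    ar = np.abs(np.exp(1j * np.outer(lam, np.arange(p + 1))) @ a) ** 2
    frac = np.abs(1.0 - np.exp(1j * lam)) ** (2.0 * d)
    return np.sum(I * ar * frac)

p = 1                                    # fit an FAR(1, d) model
fit = minimize(whittle_objective, x0=[0.1, 0.0], args=(p,),
               bounds=[(-0.49, 0.49), (-0.99, 0.99)])
print("Whittle estimates (d, a_1):", np.round(fit.x, 3))
```

In practice the fitted order p would be chosen by the AIC or BIC criteria of Section 5 rather than fixed in advance.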
4
Consistency of the Estimates
The consistency of the Whittle estimators β̂_{p,n} and σ̂²(p) obtained by fitting fractionally differenced autoregressive models is established below in Theorem 4.1. In common with related studies, see, for example, Berk (1974), we assume that p = p(n) is a sequence of integers such that p(n) → ∞, p(n)/n → 0, as n → ∞. For an observed time series, the value of p will be determined by appealing to the AIC, or BIC, criterion introduced in Section 5 by equations (5.1) and (5.2), respectively, or by a related criterion. Thus, in practice, p = p(n) will invariably be a random sequence of integers and Theorem 4.1 does not apply to this situation. In Section 5, we demonstrate the usefulness of the FAR approach in this situation by a simulation study, and also compare the relative behaviour of some of the alternative estimates of d discussed in Section 2 with that of the FAR estimate.

THEOREM 4.1 Suppose p = p(n) is a sequence of integers such that p(n) < n and p(n) → ∞, as n → ∞, and that Assumptions 1-4 hold.
(i) Under 2A, with probability one, β̂_p → β⁰ and σ̂²(p) → σ².
(ii) Under 2B, β̂_p → β⁰ in probability.
The proof of Theorem 4.1 is similar to the corresponding proofs for the correctly specified models, see Fox and Taqqu (1986) and Kokoszka and Taqqu (1996b). Modifications include Lemma 4.1 and the proof of Proposition 4.1. We present the proof only under Assumption 2A because the argument under Assumption 2B is essentially the same, but with the difference that one now works with the self-normalized periodogram and the a.s. convergence in Proposition 4.1 and Lemma 4.2 is replaced by convergence in probability, see analogous proofs in Kokoszka and Taqqu (1996b) and Mikosch et al. (1995). LEMMA
4.1 For any compact subset E ofF and any sequence p = p(n) such
that p -> oo, as n —» oo ;
lim sup y^ \aΛ = 0. PROOF.
(4.1)
Observe that for each fixed p, the function Tp:F3
(d,αi,α 2 , ..)
\cij\ G (0, oo) j=p+l
is continuous on F . Since for every β G F, l i m ^ - ^ Tp(β) = 0 and Tp+\ < Tp, the convergence is uniform on any compact subset of F. I Denote ΛX\d
gp{λ,β) = k=0
PROPOSITION 4.1 Suppose that the assumptions of Theorem 4-1 hold. Then, as n -» oo (and so p —> oo)j with probability one,
0(λ,/3°)
sup Denoting Ap(λ,β)
PROOF.
iXk
= Σ{=0 ake ,
dλ
A{λ,β)
(4.3)
0. αfe
= ΣT=o Ofce ,
= ζfj(λ,β°)(g(X,β))-1dλ, observe that 2
Gp(β)-σ (β) = jΓ
—e
d\
(4.4)
BHANSALI AND KOKOSZKA
134
iλ
{\Ap(X,β)\2 - \A(λ,β)\2} Jn(λ) 1-e
2d
dλ
\A(λ,β)\2 L(λ) - ^ By Lemma 4.2 below, the last term in (4.4) tends to zero uniformly on E with probability one. Thus it remains to verify that with probability one,
l^{\A(X,β)\2-\Ap(λ,β)\2}ln(X)
sup
1-e
iλ
2d
dX
0.
(4.5)
Note that (4.6) j=p+l
and ιλ sup l-e \
< oo.
(4.7)
/3GE
Observe also that since every linear process is ergodic, we have with probability one,
Γ In(X)dλ = - ΣX2
-> EX2.
(4.8)
Relation (4.5) now follows from (4.6) combined with (4.1), (4.7) and (4.8).
LEMMA 4.2 Suppose that the assumptions of Theorem 4-1 hold, andu(λ^β) is a function continuous on [—π,π] x E. Then, with probability one,
sup Γ u(\,β)In{\)d\ -£- Γ u(Kβ)g(\β°)dλ
/36E |«/-π
^7Γ
0.
J-π
PROOF. This is essentially a restatement of Lemma 1 of Fox and Taqqu (1986), the only difference being that in our setting Gaussianity is not assumed. However, since {Xt} is linear, it is also ergodic and the proof of Lemma 1 of Hannan (1973) applies. • PROOF OF THEOREM
4.1: The proof relies on the fact that for any β φ β° (4.9)
Relation (4.9) follows immediately from the inequality
/
^(λ,/3i) (\ / 3 \ ^ ^ ^ π '
whenever /3X ^ /3 2 ,
(4.10)
135
ESTIMATION OF LONG MEMORY
see Lemma 3.1 of Kokoszka and Taqqu (1999). For ARMA spectral densities, relation (4.10) is the content of Lemma 10.8.1 of Brockwell and Davis (1991) which was extended to fractional ARIMA models in Lemma 2.1 of Kokoszka and Taqqu (1996b). In our setting, (4.10) follows, for example, from Lemma 3.1 of Kokoszka and Taqqu (1999). In the sequel, all random quantities are evaluated at a fixed elementary event from the set on which (4.3) holds. Suppose ad absurdum that βpn
does not converge to β°. Since E is
compact, there is a subsequence \βp(m\m \ of {βp(n),n \ > denoted for brevity 2
by {/3r}, such that βr -> β' φ β°. By (4.3), lim Gr{βr) = σ (β'). On the 2 other hand, Gr(βr) < Gr(β°), so again by (4.3) limsup Gr(βr) < σ (β°). 2 2 Thus, σ (β') < σ (/3°), which contradicts (4.9). •
5
Simulations
We illustrate the efficacy of the FAR method of estimating the memory parameter, d, in situations where the short-memory process, Y^, is not necessarily purely random, by applying this method to simulated realizations of several different ARFIMA(p, d, q) processes, with both p and q taking varying values over the region 0 < p < 2, 0 < q - 0, where the periodogram I(λ) is defined by (3.9). Thus, for low frequencies the log-log plot of J(λ) versus λ should follow a straight line with slope — 2d. Following Taqqu and Teverovsky (1998) we used the lowest 10% of the frequencies to fit the regression line. Our own simulations showed that for models with moderate AR and MA coefficients using various frequency bands between 3% and 12% does not significantly affect the estimates. 2. Semiparametric: This method has been rigorously investigated by Robinson (1995b), who assumes that the spectral density /(λ) ~ G(d)|λ|~ 2 d , as λ -» 0. As the approximate Gaussian likelihood is maximized in a neighbourhood of the zero frequency, it is also known as, see Taqqu and Teverovsky (1998), Kύnsch (1987), "local Whittle" . As with the periodogram method, however, it is not clear how to choose the optimal frequency band over which the local likelihood function is maximized. Following the recommendation of Taqqu and Teverovsky (1998), we used the lowest 1/32 of all frequencies. Our own limited simulation study showed that the results remain essentially the same for any band from 1/50 to 1/20 of all frequencies. 3. ARFIMA(l,d,l): Here d was estimated by approximate Gaussian maximum likelihood and assuming that the simulated time series truly follows a fixed ARFIMA(l,d,l) process, whether or not the undelying simulated model was of this particular form. The maximization was carried out using the S+ function arima.fracdiff. 4. ARFIMA(2,d,2): Same as above, but ARFIMA(2,d,2) model was fitted. 5. ARFIMA AIC: In this method we considered ARFIMA(p, d, q) models for p, q = 0,1,2. The model for which d was estimated was selected
by minimizing the following AIC criterion: AIC{p, q) = -2loglik + 2{p + q).
(5.3)
For each value of the order p and q, the estimation was carried out using the S+ function arima.fracdiff. 6. ARFIMA BIC: Same as above, but with AIC(p,q) replaced by BIC{p, q) = -2loglik + (1 + In n)(p + q).
(5.4)
The periodogram and semiparametric methods were chosen for comparison because the extensive simulation study of Taqqu and Teverovsky (1998) demonstrated that these two methods are more robust and accurate than any other non-parametric method considered by them. The ARFIMA(1, d, 1) and ARFIMA(2, d, 2) methods were considered for two reasons: first, for models VI and VII described below their use corresponds to fitting the true generating processes respectively and thus for these two models, the simulation results should throw some light on possible effects of model selection on the estimation of d. Secondly, for the other six simulated models, their use corresponds to under- or over-fitting the generated process and the simulation results may again be expected to provide some information about how misspecification of the short memory model structure in this way influences the estimation of d. The ARFIMA AIC and ARFIMA BIC methods were considered in order to see if anything is lost or gained by fitting only fractional autoregressive models. We next describe the long memory models used in our study. Recall that d Yt = (1 — B) Xt denotes the fractionally differenced process and {Zt} is the noise sequence. The simulated process is {Xt}' I ARFIMA(0,d,0): Yt = Zt. II ARFIMA(l,d,O): Yt = .5Yt-ι + Zt. III ARFIMA(2,d,0): Yt = -.5Yt-ι
- .25Yt_2 + Zt.
IV ARFIMA(2,d,0): Yt = .5Yt-ι - .25Y*_2 + Zt. V ARFIMA(0,d, 1): Yt = Zt + .5Z t _i. VI ARFIMA(l,d, 1): Yt - .5Yt_i + Zt + .5Zt-i. VII ARFIMA(0,d,2): Yt = Zt-
.5Zt_i + .25Z*_2.
VIII ARFIMA(2,d,2):Yt = ,5Yt-ι - .25Yt-2 + Zt + .5Zt-i -
ESTIMATION OF LONG MEMORY
139
Note that all models have moderate AR and MA coefficients so as not to favour a priori any of the model fitting methods. The models were decided upon before any simulations were done. For comparing the behaviour of the estimates under Assumption 2A, the Zt were simulated as independent Gaussian deviates, each with mean 0 and variance 1, using the S+ function arima.fracdiff .sim. Only one value of n, namely n — 1000 is considered and the number of simulations for each (model, method) configuration was 250. The nominal value of d was set to equal 0.3. The simulation results for the Gaussian innovations are shown in Table 1, where for each cell corresponding to each (model, method) configuration, the bias, standard deviation, and the square root of the mean squared error are shown. For convenience, a summary of the simulation results is also given in Table 2, where the average squared roots of the mean squared errors averaged over all models and separately for only the fractional autoregressive, FAR, models are shown. As our simulation study is restricted to only the class of ARFIMA models the two periodogram-based non-parametric and semiparametric methods perform rather poorly: for all the simulated models the magnitudes of their biases and standard deviations are much larger than for other methods. Consider next the method of fitting either a fixed order ARFIMA(2,d,2) or ARFIMA(l,d,l) model. For Model VII the former coincides with the actual generated model and it performs particularly well, with a similar remark applying to the method of fitting a fixed order ARFIMA(l,d,l) for Model II. However, for all other models possible effects of misspecifying the short memory model on the estimation of d may be gleaned from our simulation results. Thus the fitting of an ARFIMA(2,d,2) model to any of the models I-VII tantamounts to fitting a model with too many parameters. The main effect of this overfitting as compared with the fitting of an FAR model with the order selected by the BIC criterion is seen to be an increase in the simulated variance; moreover the bias is also much larger for all models except Model VI. For Model VII, on the other hand, fitting a fixed ARFIMA(l,d,l) corresponds to fitting a model with too few parameters and the main effect of this underfitting is seen to be an increase in the bias though the variance is smaller than when the correct ARFIMA(2,d,2) is fitted. A possible explanation of these results is that when a model with too many parameters is fitted, the additional parameters attempt to model the long memory component, effectively altering the generated value of d. At the same time, the variance in estimating d increases because of the excess variability introduced by the estimation of the redundant parameters. It
BHANSALI AND KOKOSZKA
140
T A B L E 1. COMPARISON OF THE ESTIMATE OF d PROVIDED BY DIFFERENT METHODS IN 2 5 0 SIMULATIONS OF VARIOUS GAUSSIAN MODELS.
Method Periodogram
Semiparametric
ARFIMA(l,d,l)
ARFIMA(2,d,2)
FAR AIC
FAR BIC
ARFIMA AIC
ARFIMA BIC
a) b) c) a) b) c) a) b) c) a) b) c) a) b) c) a) b) c) a) b) c) a) b) c)
Model I II III IV V VI VII VIII .066 .070 .065 .084 .067 .089 .053 .059 .206 .192 .179 .208 .187 .192 .199 .194 .204 .188 .224 .206 .203 .216 .199 .212 -.031 .334 -.306 .267 .091 .494 -.252 .474 .022 .062 .028 .026 .023 .027 .028 .036 .495 .260 .475 .038 .340 .307 .268 .094 .078 -.170 -.078 -.055 .004 -.126 -.018 -.056 .094 .088 .097 .049 .047 .048 .037 .047 .049 .134 .051 .109 .086 .176 .118 .112 -.083 -.104 -.035 -.036 -.126 -.106 -.063 -.034 .074 .125 .116 .103 .065 .120 .122 .059 .177 .146 .160 .069 .082 .157 .121 .073 -.028 -.061 -.037 -.043 -.072 -.087 -.051 -083 .122 .106 .071 .107 .064 .081 .096 .075 .076 .107 .042 .150 .118 .112 .058 .120 -.005 -.054 -.019 -.021 -.039 -.064 .019 -.063 .027 .092 .037 .054 .101 .103 .100 .074 .027 .107 .042 .108 .121 .102 .058 .097 -.021 -.071 -.025 -.040 -.029 -.067 -.037 -.045 .114 .046 .070 .063 .032 .111 .090 .076 .073 .135 .052 .075 .043 .130 .097 .088 -.011 -.070 -.009 -.032 -.015 -.041 -.016 -.090 .034 .092 .045 .051 .033 .087 .051 .078 .036 .117 .046 .060 .036 .096 .053 .119
a) BIAS = SIMULATED MEAN - 0.3, b) STANDARD DEVIATION,
ESTIMATION OF LONG MEMORY
141
TABLE 2. AVERAGE SQUARE ROOTS OF MEAN SQUARE ERRORS OF VARIOUS ESTIMATES OF D IN ALL GAUSSIAN MODELS AND F A R MODELS IN INCREASING ORDER.
Method FAR BIC ARFIMA BIC FAR AIC ARFIMA AIC ARFIMA(l,d,l) ARFIMA(2,d,2) Periodogram Semiparametric
FAR Models .059 .065 .071 .083 .103 .114 .205 .238
Method ARFIMA BIC FAR BIC ARFIMA AIC FAR AIC
ARFIMA(l,d,l) ARFIMA(2,d,2) Periodogram Semiparametric
All
Models .070 .081 .087 .098 .104 .123 .207 .293
should be noted that the situation here is slightly different from when a pure short memory is being overfitted in which case the estimation of additional parameters increases the variance but does not unduly influence the bias in estimating the non-zero parameters. On the other hand, when a model with too few parameters is fitted, the short memory component is not adequately modelled and this introduces bias in the estimation of d because the spectral density of the generated process is being approximated by the spectral density of the underparametrized model. The variance in estimating d is, however, reduced because fewer parameters are estimated. Consider now the method of fitting a fractional autoregressive model proposed in this paper and its relative behaviour in comparison with the fitting of ARFIMA models, with order selected by the AIC or the BIC criterion. The method of fitting a fractional autoregressive model is seen to provide good results for all models, but especially for models I-IV, where the generated model for {Yf} is a finite autoregression. It should be noted, however, that the ARFIMA BIC method has a smaller mean squared error for models VI-VII, all of which have q > 0. This finding is probably not surprising because if the selected model coincides with the generated model the resulting estimate of d is known to be asymptotically efficient. It should be emphasised, nevertheless, that the full ARFIMA models were fitted only up to order 2 and the chance of selecting an incorrect model is quite small in our simulations. As regards the question of whether the AIC or BIC criterion should be used for implementing the FAR method suggested in this paper, the simulation results appear to favour the latter, probably because
142
BHANSALI AND
KOKOSZKA
the AIC criterion is known to frequently select an "overparametrized" model resulting in a large mean square error as explained above when discussing possible effects of overfitting the simulated models. Even though the main goal of the present simulation study is to examine the overall performance of the estimators for typical stationary models with moderate AR and MA coefficients, it is of interest to see how the methods perform for nearly non-stationary or non-invertible models, and thus to examine the limits of their applicability. We considered the following models for achieving this objective. 1) Almost unit root AR(1) models of the form Yt = φYt_x+Zu with φ = .9, .95, .99, 2) almost unit root MA(1) models of the form Yt = Zt + ΘZt-ι with θ = .9, .95, .99. Our findings can be summarized as follows. Focusing first on AR models, the periodogram and semiparametric methods fail. The FAR BIC method by contrast gives estimates almost as good as for AR processes with moderate coefficients considered in Table 1. By contrast, for the almost unit root MA(1) models, the performance of the FAR BIC method is worse than that for the models considered in Table 1, and, also, as compared with the ARFIMA BIC method. A main reason why the performance of the ARFIMA(1, d, 1) method is noticeably better for this class of models than that of all other methods is that the ARFIMA(l,d, 1) model just includes the simulated ARFIMA(0,d, 1) and ARFIMA(l,d,O) models as its special case and yet avoids the parameter identification difficulties (Hannan (1970), p. 388) associated with fitting ARFIMA(p,d,g) models with p > l,g > 1. A detailed analysis of the simulation results for MA(1) models revealed that the ARFIMA(l,c/, 1) method yields estimates of the autoregressive and moving average coefficients which are within 0.1 of the correct values 0 and 0, respectively, and thus its estimate of d is based on an almost MA(1) short memory component. To illustrate the relative behaviour of the estimators in the infinite variance setting, we simulated symmetric α-stable innovations Zt with a = 1.5 and unit scale parameter. The nominal value of d is .2, which lies approximately in the middle of the stationary invertible range 0 < d < 1 — I/a specified in Assumption 3. Figures 1-4 show histograms of the estimated values of d in 50 replications of models I, II, V and VI. Only six methods were considered, namely, Periodogram, ARFIMA(l,d, 1), FAR AIC, FAR BIC, ARFIMA AIC and ARFIMA BIC. For all four models, the periodogram method provides a biased as well as a highly dispersed estimate of d, a finding that accords with the results reported above in Tables 1 and 2 for the the Gaussian case. Somewhat surprisingly, however, a similar comment applies to the ARFIMA BIC method, which, unlike the Gaussian case, now provides a biased estimator for all four models, and for models V and VI the estimator is also highly dispersed. A
[Figure 1. Histograms of the estimated values of d for model I (ARFIMA(0,d,0)) with d = 0.2 and stable innovations. Panels: Periodogram, FARIMA(1,d,1), FAR AIC, FAR BIC, FARIMA AIC, FARIMA BIC; horizontal axis: estimated d, 0.0 to 0.5.]
A plausible explanation for this behaviour is not easily given, but the simulations appear to indicate that the question of model selection for an α-stable ARFIMA process requires further investigation and that a naive use of the BIC criterion (5.4) may not be recommended in a situation where "outliers" may be present. The simulation results for the FAR BIC method, by contrast, broadly support the asymptotic consistency property established in Theorem 4.1. For models I and VI, in particular, the histogram is centred around the actual generated value of d = 0.2 with a relatively small dispersion around this value. For model V, however, the estimate is biased, probably reflecting the difficulty of estimating d for processes with a short memory MA component. In conclusion, the simulations indicate that when the long-memory process generating the observed series can be well approximated by a fractional autoregressive process, the method proposed in this paper provides a good estimator of d even when the order of the generating process, finite or infinite, is unknown, and in this sense it is "non-parametric". The method, moreover, compares favourably with the periodogram based non-parametric methods, which in this situation tend to have a larger variance and often a greater bias.
[Figure 2. Histograms of the estimated values of d for model II (ARFIMA(1,d,0)) with d = 0.2 and stable innovations. Panels: Periodogram, FARIMA(1,d,1), FAR AIC, FAR BIC, FARIMA AIC, FARIMA BIC; horizontal axis: estimated d, 0.0 to 0.5.]
Data Example
We consider the Ethernet traffic data studied by Leland et al (1994), Willinger et al (1995) and Taqqu and Teverovsky (1997), among others. The data, described in detail by Leland et al (1994), were collected between August 1989 and February 1992 at the Bellcore Morristown Research and Engineering Center and represent the number of bytes per 10 milliseconds passing through a monitoring system during a "normal traffic hour" in August 1989. The periodogram of the data has a number of very sharp peaks at non-zero frequencies, suggesting that the data may not necessarily follow an ARFIMA model. Taqqu and Teverovsky (1997) used a graphical analysis of the periodogram and the semiparametric method to infer that the true value of d lies between 0.31 and 0.35. We consider here 18 consecutive 200-second-long time periods making up the "normal traffic hour". Thus, we consider 18 extremely long series, each consisting of 20,000 observations. Because of the self-similarity property, the long memory parameter for each of the 18 time series must be the same, and the estimates should lie in the range between .31 and .35. Figure 5 shows the estimates obtained using the FAR BIC, ARFIMA BIC and semiparametric methods.
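The semiparametric estimates referred to here are of the log-periodogram regression type. The sketch below is a minimal illustration in the spirit of Geweke and Porter-Hudak (1983), not the exact implementation used by the authors; the bandwidth choice of the lowest 1/128 of all frequencies mirrors the choice described later in this section and is an assumption for illustration.

```python
import numpy as np

def log_periodogram_d(x, frac=1.0 / 128.0):
    """Estimate d by regressing log I(w_j) on -log(4 sin^2(w_j / 2))
    over the lowest floor(frac * n) Fourier frequencies."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    m = max(int(np.floor(frac * n)), 2)                  # number of low frequencies used
    j = np.arange(1, m + 1)
    w = 2.0 * np.pi * j / n                              # Fourier frequencies
    dft = np.fft.fft(x - x.mean())
    I = np.abs(dft[1:m + 1]) ** 2 / (2.0 * np.pi * n)    # periodogram ordinates
    regressor = -np.log(4.0 * np.sin(w / 2.0) ** 2)
    slope, _ = np.polyfit(regressor, np.log(I), 1)
    return slope                                         # the slope estimates d

# d_hat = log_periodogram_d(series)
```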
[Figure 3. Histograms of the estimated values of d for model V (ARFIMA(0,d,1)) with d = 0.2 and stable innovations. Panels: Periodogram, FARIMA(1,d,1), FAR AIC, FAR BIC, FARIMA AIC, FARIMA BIC; horizontal axis: estimated d, 0.0 to 0.5.]
Following the analysis of Taqqu and Teverovsky (1997), we used for the semiparametric method the lowest 1/128 of all frequencies, to get clear of the peaks at non-zero frequencies. The estimates obtained using the FAR BIC and ARFIMA BIC methods appear to be more stable over time than those obtained using the semiparametric method. It is possible that the intensity of long range dependence decreased in the 7th period and increased in the 8th period, but the semiparametric estimates probably overestimate the magnitude of the change.

Acknowledgements. The software used in Section 5 for the periodogram and semiparametric methods and for simulating α-stable ARFIMA series was kindly made available to us by Murad Taqqu and Vadim Teverovsky, who also gave us plentiful advice on how to use it. Our colleague Simon Fear has invariably been willing to help us deal with the intricacies of LaTeX and S+. We also thank Jan Beran for a discussion which stimulated the present research and Murad Taqqu for providing us with the Ethernet data studied in Section 6. Clifford Hurvich, Carenne Ludeña, Adrian Raftery and Gennady Samorodnitsky also offered valuable comments.
[Figure 4. Histograms of the estimated values of d for model VI (ARFIMA(1,d,1)) with d = 0.2 and stable innovations. Panels: Periodogram, FARIMA(1,d,1), FAR AIC, FAR BIC, FARIMA AIC, FARIMA BIC; horizontal axis: estimated d, 0.0 to 0.5.]
[Figure 5. Estimated long memory parameter for 18 consecutive Ethernet data series by various methods. FAR BIC: continuous, ARFIMA BIC: dotted, Semiparametric: dashed. Horizontal axis: series number, 1 to 18.]
References

Agiakloglou, C., Newbold, P. and Wohar, M. (1993). Bias in an estimator of the
fractional difference parameter. Journal of Time Series Analisis, 14, 235-246. Akaike, H. (1978). A Bayesian analysis of the minimum AIC procedure. Ann. Inst. Statist. Math., 30 A, 9-14. Andel, J. (1986). Long memory time series models. Kybernetika, 22, 105-123. Bardet, J-M., Moulines, E. and Soulier, P. (1999). Recent advances on the semiparametric estimation of the long-range dependence coefficient. In ESAIM Proceedings, pp. 23-43. Societe de Mathematiques Appliquees et Industrielles. Beran, J. (1992). Statistical methods for data with long-range dependence. Statistical Science, 7, number 4, 404-416; With discussions and rejoinder, pages 404-427. Beran, J. (1994). Statistics for long-memory processes. Chapman & Hall, New York. Beran, J. (1995). Maximum likelihood estimation of the differencing parameter for invertible short and long memory autoregressive integrated moving average models. J. Royal Statist. Soc. B, 57, 659-673. Beran, J. (1997). Discussion of 'Heavy tail modelling and teletraffic data'. The Annals of Statistics, 25, 1852-1856. Beran, J., Bhansali, R. J. and Ocker, D. (1998). On unified model selection for stationary and nonstationary short- and long-memory autoregressive processes. Biometrika, 85, 921-934. Berk, K. N. (1974). Consistent autoregressive spectral estimates. The Annals of Statistics, 2, 489-502. Bhansali, R. J. (1978). Linear prediction by autoregressive model fitting in the time domain. The Annals of Statistics, 6, 224-231. Bhansali, R. J. (1980). Autoregressive and window estimates of the inverse correlation function. Biometrika, 67, 551-566. Bloomfield, P. (1973). An exponential model for the spectrum of a scalar time series. Biometrika, 60, 217-226. Box, G. E. P. and Jenkins, G. M. (1970). Time series analysis; forecasting and control. Holden Day, New York. Brockwell, P. J. and Davis, R. A. (1991). Time Series: Theory and Methods. Springer-Verlag, New York. Chan, N. H. and Palma, W. (1998). State space modeling of long-memory processes. The Annals of Statistics, 26, 719-740. Crato, N. and Ray, B. K. (1996). ModeΓselection and forecasting of long-range dependent processes: results of a simulation study. Journal of Forecasting, 15, 107-125. Dahlhaus, R. (1989). Efficient parameter estimation for self similar processes. Ann. Stat, 17, number 4, 1749-1766.
Fox, R. and Taqqu, M. S. (1986). Large-sample properties of parameter estimates for strongly dependent stationary Gaussian time series. The Annals of Statistics, 14, 517-532. Geweke, J. and Porter-Hudak, S. (1983). The estimation and application of long memory time series models. Journal of Time Series Analysis, 4, 221-238. Giraitis, L., Robinson, P. and Surgailis, D. (1999). Variance-type estimation of long memory. Stochastic Processes and their Applications, 80, 1-24. Giraitis, L. and Surgailis, D. (1990). CLT for quadratic forms in strongly dependent linear variables and application to asymptotical normality of Whittle's estimate. Prob. Th. Rel Fields, 86, 87-104. Granger, C. W. J. and Joyeux, R. (1980). An introduction to long-memory time series and fractional differencing. J. Time Series Anal, 1, 15-30. Hall, P. (1997). Defining and measuring long-range dependence. Fields Institute Communications, 11, 153-160. Hannan, E. J. (1970). Multiple Time Series. Wiley, New York. Hannan, E. J. (1973). The asymptotic theory of linear time series models. J. Appl. Prob., 10, 130-145. Hannan, E. J. (1980). The estimation of the order of an ARMA process. The Annals of Statistics, 8, 1071-1081. Hannan, E. J. and Quinn, B. G. (1979). The determination of the order of an autoregression. J. Royal Statist. Soc. B, 41, 190-195. Haslett, J. and Raftery, A. E. (1989). Space-time modelling with long-memory dependence: assesing Ireland's wind power resource. Appl. Statist, 38, number 1, 1-50. Heyde, C. C. and Yang, Y. (1997). On defining long-range dependence. J. Appl. Probab., 34, 939-944. Hosking, J. R. M. (1981). Fractional differencing. Biometrika, 68, number 1, 165-176. Hosking, J. R. M. (1984). Modeling persistence in hydrological time series using fractional differencing. Water Resources Research, 20, number 12, 1898-1908. Hurvich, C. M. and Beltrao, K. I. (1993). Asymptotics for the low-frequency ordinates of the periodogram of a long-memory time series. Journal of Time Series Analysis, 14, 455-472. Hurvich, C. M. and Brodsky, J. (2001). Broadband semiparametric estimation of the memory parameter of a long-memory time series using fractional exponential models. Journal of Time Series Analysis, 22, 221-249. Hurvich, C. M., Deo, R. and Brodsky, J. (1998). The mean squared error of Geweke and Porter-Hudak's estimator of the memory parameter of a long memory time series. Journal of Time Series Analysis, 19, 19-46.
Kokoszka, P. S. (1996). Prediction of infinite variance fractional ARIMA. Probability and Mathematical Statistics, 16/1, 65-83. Kokoszka, P. S. and Taqqu, M. S. (1995). Fractional ARIMA with stable innovations. Stochastic Processes and their Applications, 60, 19-47. Kokoszka, P. S. and Taqqu, M. S. (1996a). Infinite variance stable moving averages with long memory. Journal of Econometrics, 73, 79-99. Kokoszka, P. S. and Taqqu, M. S. (1996b). Parameter estimation for infinite variance fractional ARIMA. The Annals of Statistics, 24, 1880-1913. Kokoszka, P. S. and Taqqu, M. S. (1999). Discrete time parametric models with long memory and infinite variance. Mathematical and Computer Modelling, 29, 203-215. Kύnsch, H. (1987). Statistical aspects of self-similar processes. Bernoulli, 1, 67-74. Leland, W. E., Taqqu, M. S., Willinger, W. and Wilson, D.V. (1994). On the selfsimilar nature of Ethernet traffic (extended version). IEEE/ACM Transactions on Networking, 2, 1-15. Mikosch, T., Gadrich, T., Klύppelberg, C. and Adler, R. J. (1995). Parameter estimation for ARM A models with infinite variance innovations. The Annals of Statistics, 23, 305-326. Moulines, E. and Soulier, P. (1999). Broad band log-periodogram regression of time series with long range dependence. The Annals of Statistics, 27, 1415-1439. Pai, J. S. and Ravishanker, N. (1996). Bayesian modelling of arfima processes by markov chain monte-carlo methods. Journal of Forecasting, 15, 63-82. Pai, J. S. and Ravishanker, N. (1998). Bayesian analysis of autoregressive fractionally integrated moving average. Journal of Time Series Analysis, 19, 99-102. Parzen, E. (1969). Multiple time series modelling. In Multivariate Analysis II, New York (ed. P. R. Krishnaiah). Academic Press. Priestley, M. B. (1981). Spectral Analysis and Time Series: Volume 1. Academic Press. Ravishanker, N. and Ray, B. K. (1997). Bayesian analysis of vector arfima processes. Austral J. Statist, 39, 295-311. Robinson, P. M. (1995a). Log-periodogram regression of time series with long range dependence. The Annals of Statistics, 23, 1048-1072. Robinson, P. M. (1995b). Gaussian semiparametric estimation of long range dependence. The Annals of Statistics, 23, 1630-1661. Samorodnitsky, G. and Taqqu, M. S. (1994). Stable Non-Gaussian Random Processes: Stochastic Models with Infinite Variance. Chapman & Hall. Schwarz, G. (1978). Estimating the dimension of the model. The Annals of Statistics, 6, 461-464. Shibata, R. (1980). Asymptotically efficient selection of the order of the model for estimating parameters of a linear process. The Annals of Statistics, 8, 147-164.
Shibata, R. (1981). An optimal autoregressive spectral estimate. The Annals of Statistics, 9, 300-306. Sowell, F. B. (1992). Maximum likelihood estimation of stationary univariate fractionally integrated time series models. Journal of Econometrics, 53, 165188. Taqqu, M. S. and Teverovsky, V. (1997). Robustness of Whittle-type estimates for time series with long-range dependence. Stochastic Models, 13, 723-757. Taqqu, M. S. and Teverovsky, V. (1998). On estimating the intensity of long-range dependence in finite and infinite variance series. In A practical guide to heavy tails: Statistical techniques for analyzing heavy tailed distributions, Boston (eds R. Adler, R. Feldman and M. S. Taqqu), pp. 177-217. Birkhauser. Willinger, W., Taqqu, M. S., Leland, W. E. and Wilson, D.V. (1995). Self-similarity in high-speed packet traffic: analysis and modeling of Ethernet traffic measurements. Statistical Science, 10, 67-85.
Institute of Mathematical Statistics
LECTURE NOTES — MONOGRAPH SERIES
STABILITY OF NONLINEAR TIME SERIES: WHAT DOES NOISE HAVE TO DO WITH IT? Daren B.H. Cline and Huay-min H. Pu Department of Statistics Texas A&M University Abstract We survey results on the stability of various nonlinear time series, both parametric and nonparametric. The emphasis will be on identifying the role that the "error term" has in determining stability. The error term can indeed affect stability, even when additive and for simple, common parametric models. The stability of the time series is not necessarily the same as that of its related (noiseless) dynamical system. In particular, this means that care must be taken to ensure that estimates are actually within the valid parameter space when analyzing a nonlinear time series.
Key Words: ergodicity, Markov chain, nonlinear time series
1
Introduction
Fitting time series with nonlinear models has become increasingly popular, especially since the emergence of nonparametric function estimation methods (Collomb and Hardle (1986), Hardle and Vieu (1992), Chen and Tsay (1993a,b), TJ0stheim and Auestad (1994a,b), Masry and Tj0stheim (1995)). No matter what model is fit, a critical part of the estimation procedure is determining whether the model is stable or whether the parameters are within the appropriate parameter space (TJ0stheim (1994)). Additionally, knowing the stability properties of a particular model makes it possible to develop simulation and resampling procedures to be used for inference. For these procedures, as well as for the obvious questions of limit theorems and robustness, the nature of the noise (error) distribution is clearly a significant concern. What is not so clear, however, is how this distribution can affect — if it does at all — the stability question itself. Habit with
linear models has made it seem as though the error distribution is essentially irrelevant for stability or, as in the case of bilinear and ARCH models, the magnitude of the error variance can appear to be all that is relevant. On the contrary, even when additive, noise often plays a large and critical role in determining the stability of many nonlinear time series models. In particular, the assumption that the process is stochastic is fundamental. In a sense, the issue is between taking a stochastic view of the reasons underlying a process and taking a deterministic view. In the stochastic view, noise is not just a nuisance of observation and inference, a proxy for the uncertainty of scientific investigation. Rather it is integral to the behavior of the process. Of course this is no surprise to those who study stochastic processes: the effect of noise on the values of the time series persists due to dependence. This persistent effect, in turn, affects the stability of the process itself. Determinism as a tenet of science is ingrained into us at our earliest experiences with the scientific method. Linear models, it turns out, naturally lend themselves to this point of view. One's objective as a scientist is to identify the hidden principle, to strip away the flesh (as it were) of noise and distracting factors and with Occam's razor lay bare the skeleton of a true mechanism for movement in the process. This is determinism. If instead we choose to view nature as stochastic (and we will not argue whether this is a wise choice) then we also recognize the biases of traditional determinism, and this includes our view of the stability of nonlinear time series. In this paper, then, our objective is to identify the role that noise plays in determining stability conditions for nonlinear time series. This will involve general approaches to the problem of stability, illustrated with specific examples. We will also directly compare stability conditions for time series with those of dynamical systems which are deterministic and can sometimes be thought of as noiseless "skeletons" of the process (Tong (1990)). For some examples the stability conditions coincide and for others they do not. As have most authors studying stability, starting with Priestley (1980), we take the approach of embedding the time series (say, {ξ_t}) in a suitable Markov chain referred to as the state space model and defining stability in terms of the ergodicity, null recurrence or transience of that chain. State space models may vary with the application and for our purposes they need not be observable. For example, an autoregressive type of process can be embedded in

{X_t} = {(ξ_t, . . . , ξ_{t-p+1})}   (1)

for some order p, whereas a process that also has a moving average component of order q could use the state space model

{(X_t, U_t)} = {(ξ_t, . . . , ξ_{t-p+1}, e_t, . . . , e_{t-q+1})}.   (2)
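As a concrete illustration of the embeddings (1) and (2), the following sketch builds the state vectors from observed arrays of values and innovations. It is merely a bookkeeping device, written here as an assumption-free helper rather than anything taken from the paper; it assumes the series and the innovations are stored as equal-length arrays.

```python
import numpy as np

def embed_ar(xi, p):
    """Rows are X_t = (xi_t, ..., xi_{t-p+1}) for t = p-1, ..., n-1, as in (1)."""
    xi = np.asarray(xi, dtype=float)
    return np.column_stack([xi[p - 1 - k:len(xi) - k] for k in range(p)])

def embed_arma(xi, e, p, q):
    """Rows are (X_t, U_t) = (xi_t, ..., xi_{t-p+1}, e_t, ..., e_{t-q+1}), as in (2);
    xi and e are assumed to have the same length."""
    r = max(p, q)
    X = embed_ar(xi, p)[r - p:]      # drop initial rows so both blocks share the same t
    U = embed_ar(e, q)[r - q:]
    return np.hstack([X, U])
```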
The time series is most suitably stable — and the properties of estimators behave best — when the Markov chain is geometrically ergodic (Nummelin (1984), Athreya and Pantula (1986), Chan (1989, 1993a,b), Meyn and Tweedie (1992)), meaning the chain converges to its stationary distribution at a uniform geometric rate. Throughout the paper we set aside the questions of irreducibility and aperiodicity even though the errors can have a role in determining these properties as well. So unless we say otherwise, the stability conditions discussed below are to be taken in the context of a subspace on which the process is irreducible and/or with time lags according to the periodicity. Likewise, we assume continuity of the transitions in the sense of a T-chain (cf. Tuominen and Tweedie (1979), Meyn and Tweedie (1993), Cline and Pu (1998)). In addition, there is a distinction between geometric ergodicity of a Markov chain and geometrically stable drift of a Markov chain. A chain with geometrically stable drift tends to decrease geometrically in magnitude when it becomes too large and it will satisfy a drift condition such as those in Theorems 1 and 2 below, but such drift is not necessary for geometric ergodicity. For most of the paper we will focus on geometrically stable drift as it most directly compares with the notion of geometric stability of a dynamical system skeleton (see (5)). In section 7, however, we return to this distinction in a discussion of the role that the noise distribution tails can play. The error sequence will be denoted {e_t} and is assumed iid. It may or may not contribute additively to the time series. Sections 2 and 3 review stability of linear, bilinear and ARCH models, as well as stability of models that can be tied directly to their skeletons. Sections 4 and 5 go on to describe the standard uses of Foster-Lyapounov test functions with examples that behave like their skeletons and an example that does not. Section 6 presents a new approach to using such test functions to analyze models which either do not have skeletons or are characteristically different from their skeletons. Finally, in section 7 we discuss improvements possible when errors have sufficiently light distribution tails.

2. Linear, Bilinear and ARCH Models.

Example 1. Viewing stability deterministically works especially well for a linear model. When in reduced form, the usual linear ARMA(p, q) model

ξ_t = a_1ξ_{t-1} + · · · + a_pξ_{t-p} + e_t + b_1e_{t-1} + · · · + b_qe_{t-q}   (3)

has (2) as its state space model and

x*_t = a_1x*_{t-1} + · · · + a_px*_{t-p}   (4)
as its skeleton. The skeleton, in other words, is the time series stripped of its noise terms. Here, it is a linear dynamical system. A dynamical system {x*_t} is defined to be geometrically stable when a bounded solution exists for each initial condition and there exist K < ∞ and ρ < 1 such that

|x*_t| ≤ K(1 + ρ^t ||x_0||) for all t ≥ 1 and x_0 = (x*_0, . . . , x*_{1-p}).   (5)
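A quick numerical sanity check of (5) for a linear skeleton of the form (4) is sketched below: it computes the spectral radius of the companion matrix appearing in the equivalent algebraic condition stated next, and also iterates the recursion from a random initial condition. The particular coefficients in the usage line are illustrative assumptions, not values from the paper.

```python
import numpy as np

def companion(a):
    """Companion matrix of the skeleton x*_t = a_1 x*_{t-1} + ... + a_p x*_{t-p}."""
    p = len(a)
    A = np.zeros((p, p))
    A[0, :] = a                      # first row holds the coefficients
    A[1:, :-1] = np.eye(p - 1)       # ones on the subdiagonal
    return A

def check_linear_skeleton(a, t_max=200, seed=1):
    a = np.asarray(a, dtype=float)
    rho = max(abs(np.linalg.eigvals(companion(a))))      # spectral radius
    x = list(np.random.default_rng(seed).normal(size=len(a)))
    for _ in range(t_max):                               # iterate the skeleton (4)
        x.append(float(np.dot(a, x[-1:-len(a) - 1:-1])))
    return rho, abs(x[-1])

# check_linear_skeleton([0.5, 0.3])  -> spectral radius < 1 and a trajectory decaying to 0
```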
The system (4) is geometrically stable precisely when the solution is attracted to 0 regardless of the initial condition. An equivalent algebraic condition is that the eigenvalues of the so-called companion matrix,
a_1  a_2  · · ·  a_{p-1}  a_p
 1    0   · · ·     0      0
 0    1   · · ·     0      0
 ·    ·              ·     ·
 0    0   · · ·     1      0

have maximum modulus less than 1. Obviously, geometric stability for (3) (more precisely, for (2)) is thus identical to geometric stability for (4).

Example 2. The usual condition for stability of bilinear models also is essentially algebraic, though it does depend on the error variance σ². The model is expressed as

ξ_t = a′X_{t-1} + X′_{t-1}BU_{t-1} + e_t + c′U_{t-1},

where e_t has zero mean, X_{t-1} = (ξ_{t-1}, . . . , ξ_{t-p}), U_{t-1} = (e_{t-1}, . . . , e_{t-q}), B is a matrix and a and c are vectors. If B is subdiagonal, there are appropriately defined matrices A_1, B_1 such that if A_1 ⊗ A_1 + σ²B_1 ⊗ B_1 has spectral radius less than 1 then the state space process (e.g., (2)) is geometrically ergodic (Pham (1985, 1986)). See Bhaskara Rao et al (1983), Guegan (1987), Liu and Brockwell (1988), Liu (1992) and Pham (1993) for a treatment with more general B. Weaker conditions may actually suffice (Liu (1992)). These also depend on the distribution of e_t, again principally through the scale parameter.

Example 3. Combined autoregressive and autoregressive conditionally heteroscedastic (AR-ARCH) models such as

ξ_t = a_0 + a_1ξ_{t-1} + · · · + a_pξ_{t-p} + (b_0 + b_1ξ²_{t-1} + · · · + b_pξ²_{t-p})^{1/2} e_t

likewise have an algebraic condition depending on σ²
(Tong (1981), Quinn (1982), Lu (1998b); cf. also Diebolt and Guegan (1993), Lu (1996, 1998a), Borkovec (2000)). The relevant Markov chain is (1). Liu, Li and Li (1997) provided a similar algebraic condition for stability of a nonlinear AR-ARCH model with piecewise constant coefficient functions. We return to a generalization of this example in section 6.

3. Nonlinear AR(p) Models.

A common nonlinear model is the autoregressive model with functional coefficients (FCAR),

ξ_t = a_0(X_{t-1}) + a_1(X_{t-1})ξ_{t-1} + · · · + a_p(X_{t-1})ξ_{t-p} + c(e_t; X_{t-1}),   (6)
where c(e_t; x) has mean 0 for each x ∈ R^p and the state space model (1) defines the embedding Markov chain. The coefficient functions a_0(x), . . . , a_p(x) usually are assumed bounded. The error terms may be additive (c(e_t; x) = e_t) but more generally {c(e_t; x)} has some regularity condition such as being uniformly integrable across the choices of x. Self-exciting threshold (SETAR) models are special cases where the coefficient functions are piecewise constant, whereas threshold-like models only require the functions to be asymptotically piecewise constant, as ||x|| → ∞. Examples of all of these have found successful application (cf. Tong (1990)). Nonparametric fitting of the FCAR model is now common (cf. Tjøstheim (1994), Chen and Härdle (1995), Härdle et al (1997)). The skeleton of (6) is the dynamical system

x*_t = a(x*_{t-1}, . . . , x*_{t-p}),   (7)

where

a(x) = a_0(x) + a_1(x)x_1 + · · · + a_p(x)x_p,  x = (x_1, . . . , x_p).   (8)
Explicit conditions for geometric stability of (7) are difficult to state, and not always known. Chan and Tong (1985) and Chan (1990) have shown, however, that if (7) is geometrically stable and a(x) is Lipschitz continuous then (6) is likewise geometrically stable. That is, {X_t} is a geometrically ergodic Markov chain satisfying a geometric drift condition. Cline and Pu (1999a, Thm. 3.1; 1999b, Thm. 2.5) have extended this result to the case where a(x) is asymptotically Lipschitz as min_{i=1,...,p} |x_i| → ∞, in other words, for x far away from the axial hyperplanes. It has also been extended to nonlinear ARMA models (Cline and Pu (1999c)). The condition is essentially deterministic as it depends only on the dynamical system and not on the noise. For the models where a(x) is sufficiently smooth, the condition is sharp and thus stability of the time series
coincides with stability of the skeleton. This includes any threshold model where a(x) is piecewise linear and continuous. Unfortunately, the smoothness condition disallows the many threshold models for which the coefficient function is piecewise linear but not continuous.

Example 4. An example where this approach gives good results is the simple SETAR(1) model

ξ_t = (a_{01} + a_{11}ξ_{t-1})1_{ξ_{t-1}≤0} + (a_{02} + a_{12}ξ_{t-1})1_{ξ_{t-1}>0} + c(ξ_{t-1})e_t,

which has a piecewise linear and Lipschitz continuous autoregression function. Assuming c(x) is bounded, this process has geometrically stable drift if and only if

a_{11} < 1,  a_{12} < 1  and  a_{11}a_{12} < 1,

agreeing exactly with the geometric stability of its skeleton (Petruccelli and Woolford (1984), Chan et al (1985), Guo and Petruccelli (1991)). This example and others naturally bring the following questions to mind: when are stability of the time series and stability of the skeleton equivalent? That is, when is stability determined independently of the errors? Is it usual for stability of the time series and stability of the skeleton to be equivalent or does the error distribution normally play a critical role?
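To see Example 4's condition at work numerically, one can simulate the SETAR(1) recursion directly. The sketch below uses Gaussian noise with c(x) ≡ 1 and purely hypothetical coefficient values: one choice inside the region a_{11} < 1, a_{12} < 1, a_{11}a_{12} < 1 and one that violates it.

```python
import numpy as np

def setar1(a01, a11, a02, a12, n=5000, seed=0):
    """xi_t = (a01 + a11 xi_{t-1}) 1{xi_{t-1} <= 0}
            + (a02 + a12 xi_{t-1}) 1{xi_{t-1} > 0} + e_t,  e_t ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    xi, record = 0.0, np.empty(n)
    for t in range(n):
        if xi <= 0.0:
            xi = a01 + a11 * xi + rng.normal()
        else:
            xi = a02 + a12 * xi + rng.normal()
        record[t] = xi
        if not np.isfinite(xi) or abs(xi) > 1e12:   # stop once clearly explosive
            record[t:] = xi
            break
    return record

stable = setar1(0.5, -1.5, -0.5, 0.5)    # a11 < 1, a12 < 1, a11*a12 = -0.75 < 1
explosive = setar1(0.5, 0.8, -0.5, 1.3)  # a12 = 1.3 > 1 violates the condition
print(np.max(np.abs(stable)), np.max(np.abs(explosive)))
```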
4. The Foster-Lyapounov Drift Condition.

The connection between stability of a stochastic process and stability of a dynamical system is not superficial, even if it is not always as simple as one might like. To verify that a dynamical system is stable, one approach is to show that within some finite time the system (or some appropriate function of it) is sure to "drift" toward an attracting set. (See (5), for example.) A simple condition to check this is known as Lyapounov's drift condition (La Salle (1976)): for some nonnegative function V, K < ∞ and compact set C,

V(x*_t) ≤ V(x*_{t-1}) + K 1_C(x*_{t-1}) for all t ≥ 1.
Foster (1953) likewise showed that if the mean transition of a Markov chain on the nonnegative integers was uniformly negative for large states then the chain is certain to drift toward the origin whenever it gets too large, and therefore is ergodic. The method was generalized by Tweedie (1975, 1976, 1983a), Popov (1977), Nummelin and Tuominen (1982), Meyn and Tweedie (1992) and others (cf. Nummelin (1984), Meyn and Tweedie (1993)) to what is now called the Foster-Lyapounov drift condition for ergodicity of an irreducible, aperiodic Markov chain {X_t}: for some function V taking values in [1, ∞), K < ∞ and "small" set C,

E(V(X_t) - V(X_{t-1}) | X_{t-1} = x) ≤ -1 + K 1_C(x).   (9)

(Here a set C is "small" if there exist an integer m ≥ 1 and measure ν such that P(X_m ∈ B | X_0 = x) ≥ ν(B) for all x ∈ C and all B (cf. Nummelin (1984), Meyn and Tweedie (1993)). Typically, compact sets on an appropriately defined topological space are small.) The function V is then called a Foster-Lyapounov test function. Furthermore, a Markov chain (or more precisely, the sequence of distributions generated by the transition kernel from an initial distribution) is in fact a dynamical system on the space of probability distributions and thus (9) can be interpreted as an ordinary Lyapounov drift condition for that system. This is the connection with dynamical systems, therefore, to be exploited. Indeed, Meyn and Tweedie (1992, 1993) have explored the depth of this concept, and especially for the stronger drift condition for geometric ergodicity: for some function V taking values in [1, ∞), K < ∞, ρ < 1 and small set C,
E(V(X_t) | X_{t-1} = x) ≤ ρV(x) + K 1_C(x).   (10)
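Condition (10) can be explored numerically for simple models. The sketch below, for a linear AR(1) chain with the test function V(x) = 1 + |x| (an illustrative choice, not one used by the authors), estimates E(V(X_t) | X_{t-1} = x)/V(x) by Monte Carlo at a few large starting points; ratios settling below 1 are consistent with geometrically stable drift.

```python
import numpy as np

def drift_ratio(a, x0, n_rep=100_000, seed=0):
    """Monte Carlo estimate of E(V(X_1) | X_0 = x0) / V(x0), V(x) = 1 + |x|,
    for the AR(1) chain X_t = a X_{t-1} + e_t with e_t ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    x1 = a * x0 + rng.normal(size=n_rep)
    return (1.0 + np.abs(x1)).mean() / (1.0 + abs(x0))

for x0 in (5.0, 20.0, 100.0):
    print(x0, drift_ratio(0.7, x0))   # ratios approach |a| = 0.7 as |x0| grows
```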
This drift condition ensures, among other things, V-uniform ergodicity and a geometric rate of convergence of the marginal distributions to the stationary distribution (Nummelin (1984), Chan (1989), Meyn and Tweedie (1992)), a strong law of large numbers for n^{-1} Σ_{t=1}^n h(X_t) if |h(x)| ≤ V(x) (Meyn and Tweedie (1992)), and a central limit theorem for n^{-1} Σ_{t=1}^n h(X_t) if (h(x))² ≤ V(x) (Meyn and Tweedie (1992), Chan (1993a,b)). If the test function satisfies ||x||^r ≤ V(x) ≤ M + K||x||^r for some finite K and M then (10) is a condition for geometrically stable drift of the chain. (See Theorem 1 in the next section.)

Example 5. An example of the application of the drift condition (10) is to the FCAR model (6). Chan and Tong (1985, 1986) have shown that if Σ_{i=1}^p sup_x |a_i(x)| < 1 then (6) is geometrically ergodic.

The chain {Y_t} is assumed to be geometrically ergodic and, in particular, to satisfy the geometric drift condition with test function V_1(y). Its stationary distribution we denote G. If there is a function H(y) which somehow exemplifies (or bounds) the relative change in magnitude of X_1 when X_0 is large and Y_0 = y then, intuitively, the stationary value of H(Y_t) will measure the geometric drift of {X_t}. Thus, a log-drift condition for geometric stability would be

exp ( ∫ log(δ + H(y)) G(dy) ) < 1 for some δ > 0.   (15)
To obtain a test function that will yield such a condition, we first define
and let y(x) identify the "embedding" of Y_t into X_t. An integer m is chosen suitably large, a "correction" function c(x) constructed and the ultimate test function is (something like)

V(x) = c(x) [ ∏_{j=1}^m E( h(Y_j) | Y_0 = y(x) ) ]^{1/m}.
The key point for this paper is that the piggyback method and the resulting condition for ergodicity capture the implicit stochastic behavior of {Y_t}, not the behavior of a deterministic skeleton of {X_t} or {ξ_t}. Even if such a skeleton can be identified, its stability properties will not coincide with those of {ξ_t}. Alternatively, one may think of {X_t} as having a sort of stochastic skeleton which must be analyzed for stability.

Example 8. Our first example is a bivariate threshold model (Cline and Pu (1999a, Ex. 3.2; 2001, Ex. 3.2)). Indeed it is the simplest such model that is not just two independent univariate models joined together. Suppose

X_{t,1} = a_1(X_{t-1,1})X_{t-1,1} + e_{t,1},  X_{t,2} = a_2(X_{t-1,1})X_{t-1,2} + e_{t,2},

where a_i(x_1) = a_{i1}1_{x_1≤0} + a_{i2}1_{x_1>0}, i = 1, 2. Note that the nonlinearity of the second component X_{t,2} is driven by the univariate TAR(1) process {X_{t,1}}. The latter is our "embedded" process and is stable when

max(a_{11}, a_{12}, a_{11}a_{12}) < 1.   (16)

Let G be its stationary distribution. The function |a_2(x_1)| plays the role of H(y) so that the resulting (sharp) stability condition is, in addition to (16),

exp ( ∫ log |a_2(x_1)| G(dx_1) ) < 1,

which represents the stationary value of the relative change in magnitude of X_{t,2} when it is very large. This condition neither implies nor is implied by the stability condition for the corresponding skeleton process: (16) plus
Example 9. The second example (cf. Cline and Pu (1999c)) is the threshold ARMA(1,q) model (TARMA) with a delay d:

ξ_t = a_0(X_{t-1}) + a_1(X_{t-1})ξ_{t-1} + e_t + b_1(X_{t-1})e_{t-1} + · · · + b_q(X_{t-1})e_{t-q},

where X_{t-1} = (ξ_{t-1}, . . . , ξ_{t-d}), a_0, b_1, . . . , b_q are bounded and a_1(x) is (asymptotically) piecewise constant. We further assume the thresholds are affine, which implies the regions on which a_1(x) is constant are cones in R^d. This is the simplest interesting example of a TARMA process. See also Brockwell, Liu and Tweedie (1992) and Liu and Susko (1992). In the case q > 0, the time series must be embedded in the Markov chain {(X_t, U_t)} where X_t is as above and U_t = (e_t, . . . , e_{t-q+1}). Threshold ARMA models have not seen a lot of study, perhaps in part because the moving average terms can affect the irreducibility and periodicity
properties of the chain in complicated ways as yet not well understood. (See, for example, Cline and Pu (1999c).) Let the regions R_1, . . . , R_m be the partition of R^d such that a_1(x) is constant on each region, with a_{11}, . . . , a_{1m} being the corresponding constants. There are basically two types of situations that arise in these models when ξ_t is very large: cyclical and noncyclical. For the cyclical situation, {X_t} essentially cycles close to certain rays having the form
(Uϊ:lalji,nPiZΪalji,...,l)xu
(17)
if all are in the interior of the conical regions. Noise plays no role in determining the stability in this situation since Xt avoids the thresholds; all that matters is the product of coefficients realized by moving through the cycle. For a model which is purely cyclical the stability condition is based on the "worst case" cycle, it is deterministic and corresponds to that of the skeleton process, and it is very much like that of the TAR(l) process with delay discussed in section 5. The model may also have, however, situations where one or more of the rays of type (17) actually lie on a threshold. In such a case, Xt can fall on either side of the threshold, and thus into one of two possible regions, at random but depending on both the present error et and the past errors e t _i,..., et-q. If Jt denotes the region that Xt is in then {(Jt, Ut)} behaves something like a Markov chain where the first component is one of a finite number of states and the second component is stationary. (If q = 0 then {Jt} itself is like a finite state Markov chain.) We relate {(Jt, Ut)} to such a Markov chain denoted, say, {(JtiUt)}. This chain is not necessarily irreducible or aperiodic but clearly every invariant measure is finite. Indeed it may be decomposed into a finite number of uniformly ergodic subprocesses. The coefficients |αχj| play the role of H(y) in this model. Now let G be any stationary distribution for {(Jt, Ut)} and define ΈJ = IRQ G(j,du). If (condition (15))
regardless of the choice of G then {(X_t, U_t)} is geometrically ergodic and, again, the condition is sharp. Because at least one ray lies on a threshold, the noncyclical models are special cases, but the stability condition for a noncyclical process can be quite different from that of nearby purely cyclical processes. See the parameter spaces for the TARMA(1,1) with delay 2 in Cline and Pu (1999c).
Example 10. The third example combines the nonlinearity of piecewise continuous coefficient functions with a piecewise conditional heteroscedasticity, a model called the threshold AR-ARCH time series:

ξ_t = a(X_{t-1}) + b(X_{t-1})e_t + c(e_t; X_{t-1}),

where a(x) and b(x) are piecewise linear, {c(e_t; x)} is uniformly integrable and X_{t-1} = (ξ_{t-1}, . . . , ξ_{t-p}). We further suppose a(x) and b(x) are homogeneous, b(x) is locally bounded away from zero except at x = 0, the thresholds are subspaces containing the origin and the regions of constant behavior are cones. Note that these assumptions need only hold asymptotically (in an appropriate sense) as x gets large. Once again, the Markov chain under study is {X_t}. The basic idea on which we piggyback is that the process {X_t} collapsed to the unit sphere behaves very much like a Markov chain. The compactness of the unit sphere serves to make this chain stable and then the stability condition for the original chain can be computed. More specifically, define

ξ*_t = a(X*_{t-1}) + b(X*_{t-1})e_t,  X*_t = (ξ*_t, . . . , ξ*_{t-p+1})  and  θ*_t = X*_t / ||X*_t||.

Then, due to the homogeneity of a(x) and b(x), {θ*_t} is a Markov chain on the unit sphere and is uniformly ergodic with stationary distribution G, say. By the piggyback method, therefore, X_t has geometrically stable drift if
E ( ∫ log( |a(θ) + b(θ)e_1| / |θ_1| ) G(dθ) ) < 0.

For a simple demonstration, suppose p = 1, a(x) = (αil x [1, ∞) is locally bounded. Suppose there exists a random variable W(x) for each x such that V(X_1) ≤ W(x) whenever X_0 = x, {|W(x) - V(x)| + e^{r(W(x)-V(x))}} is uniformly integrable for some r > 0 and

limsup_{||x||→∞} E(W(x) - V(x)) < 0.   (18)

Then there exist s > 0 and V_1(x) = e^{sV(x)} such that {X_t} is V_1-uniformly ergodic (and hence geometrically ergodic).
Proof This follows directly from the drift condition for ^-uniform ergodicity (cf. Meyn and Tweedie (1993, Thm. 16.0.1)) and uniform convergence (cf. Cline and Pu (1999a, Lem. 4.2)). (See also the proof to Theorem 4.) D Essentially, this is the log-drift condition in another guise: if the test function in (10), for example, is replaced with V\{x) = esV^ with some sufficiently small s > 0 then (18) is a log-drift version of the condition. As a bonus, if V(x) is norm-like, satisfying ||x|| < V(x) < M +jfif||a;||, one gets strong laws and central limit theorems for all the sample moments (Meyn and Tweedie (1992), Chan (1993a,b)) and exponentially damping tails in the stationary distribution (Tweedie (1983a,b)). Example 12. For example, consider the FCAR(p) model discussed in section 3. If the noise term c(et;Xt-ι) is such that sup x E(er^eux^) < oo for some r > 0 then it frequently is possible to satisfy the requirements of Theorem 3 with a norm-like V(x). To illustrate how this can work, consider the FCAR(l) process, Xt = ξt = αi(6-i) + Φ t ; 6 - i ) with
-L < a\{x) < a\\x + αoi Ίix < -L,
, *
L > a\(x) > a\2X + αo2 if x > L, where a\\a\i — 1, a\\ < 0 and L < oo. We assume here that E(c(e\\ x)) = 0 for all x G R and sup x E(er^eux^) < oo for some r > 0. For the special case of equality on the right in (19) (the SETAR(l) model of Example 4), Chan et al (1985) showed {ξt} is ergodic if and only if 7 = 011^02 + «oi < 0. We thus assume 7 < 0. Let λi = λ^"1 = \J—a\\ and choose δ{ > 1, i = 1,2 so that -λiα O 2 + δχ- δ2 = X2OΌI + h - δ\ = λ 2 7/2. Define
V{x) = (\ι\x\ + δλ)lx 0 and K < oo, V(Xι) - V(x) < (λ 2 lχ< 0 - λil x >o)c(ei; x) + X2j/2 + K\c(eγ,x)\l\c{ei]x)\>€\x\ χ ιs when |Xo| — \ \ sufficiently large, which satisfies the conditions of Theorem 3 with the limit in (18) being λ27/2. The time series is thus geometrically ergodic. On the other hand its skeleton, while stable, is not geometrically stable since a\\aγι = 1. In fact we would say both have only a linear drift. Tanikawa (1999) studied this example and Cline and Pu (1999b) looked at similar first order threshold-like models, but with a possible delay. Using a similar approach but with stronger stability conditions, Diebolt and Guegan (1993) studied multivariate examples and An and Chen (1997) investigated FCAR(p) models withp > 1. One of the drawbacks to a log-drift condition such as the one in Theorem l(iii) is that it guarantees geometric ergodicity only with test functions of the form V(x) = l-hλ(α;)||x|| r where r may be arbitrarily small and therefore it fails to imply needed limit theorems for sample moments. To be able to conclude Vί-uniform ergodicity with an exponential-like Vί, the condition must again be boosted and then the desired limit theorems will hold. Theorem 4 Assume {Xt} is an aperiodic, φ-irreducible T-chain in W and V : W —> [l,oo) is locally bounded and V(x) -> oo as \\x\\ -> oo. Suppose there exists a random variable W(x) for each x such that V(X\) < W(x) whenever Xo = x, {\ log{W{x)/V{x))\ + e ( ^ W ) r - ( ^ ) ) Γ } is uniformly integrable for some r > 0 and limsup£ (log{W(x)/V(x))) < 0.
(20)
IMI->oo
Then there exist s > 0 and V\{x) = e^v^s such that {Xt} is V\-uniformly ergodic (and hence geometrically ergodic). Proof For v > w > 1 and 0 < s < r, we have i {ew3~vS - 1) < ^ ( ^ - l) and log(w/v) < j fe - l j < 0. By the uniform integrability of {log(W(x)/V(x))}, truncation and uniform convergence (as s I 0),
Λi ((ii ((e("W-(v(*))« ("W(v(*)) __ Λ
< limsupE (\og(W(x)/V(x))lw{x) 0 and 5 > 0 small enough.
Λ
+ e < -e,
(21)
NONLINEAR TS STABILITY For w > v > 1 and 0 < s < r/2, we have 0 < \og(w/υ) < - (ewS-υ° - l) < - (ewT~vT
-
and if wΓ-υr < K, v > M > 1 then ± (β 10 '-"* - 1) < g £ £ . By the uniform integrability of {e^ί*))'"^**))'}, truncation and V{x) -> oo as ||z|| ->• oo,
0
v(x})
\\x\\-κx>
)M^)) _ Λ i
) 0 and s > 0 small enough. From (20)-(22), therefore, we conclude there exists s > 0 small enough that limsupE (- (eWi»MV(*)) _ Λ IX o = \ Also, sup|| x ||< M J5 M v ( X l ) ) s
χ0 —xλ < oo for all M < oo, and hence geo-
metric ergodicity is assured with test function V\.
D
Example 13. We again consider an FCAR(l) model, ξt = «i(6-i) + c{et\it-ι), satisfying (19) but now we assume a n < 0 < anau < 1 and |c(ei;z)| < ci|a;|^|ei| where cx > 0, 0 < β < 1 and E(eη^) < oo for some η > 0. Let λi = y/-an, λ 2 = V~ α i2 and F(x) = 1 + (λil x o)|^|. Then for \XQ\ = \x\ sufficiently large and some e > 0 and K < oo, V{X\) < (1 - e)V(x) + ίί|x|^|ei|, which satisfies Theorem 4 with r < 1 - β. See also Diebolt and Guegan (1993) and Guegan and Diebolt (1994) for related results. When the errors are bounded, an otherwise unstable model can sometimes be stable. See Chan and Tong (1994) for an example. References An, H.Z. and Chen, S.G. (1997). A note on the ergodicity of non-linear autoregressive model, Stat. Prob. Letters 34, 365-372. An, H.Z. and Huang, F.C. (1996). The geometrical ergodicity of nonlinear autoregressive models, Stat Sinica 6, 943-956. Athreya, K.B. and Pantula, S.G. (1986). Mixing properties of Harris chains and autoregressive processes, J. Appl. Probab. 23, 880-892. Bhaskara Rao, M., Subba Rao, T. and Walker, A.M. (1983). On the existence of some bilinear time series models, J. Time Series Anal 4, 95-110. Borkovec, M. (2000). Extremal behavior of the autoregressive process with ARCH(l) errors, Stock. Proc. Appl 85, 189-207.
Brockwell, P.J., Liu, J. and Tweedie, R.L. (1992). On the existence of stationary threshold autoregressive moving-average processes. J. Time Series Anal 13, 95-107. Chan, K.-S. (1989). A note on the geometric ergodicity of a Markov chain, Adv. Appl. Probab. 21, 702-704. Chan, K.-S. (1990). Deterministic stability, stochastic stability, and ergodicity, Appendix 1 in Non-linear Time Series Analysis: A Dynamical System Approach, by H. Tong, Oxford University Press (London). Chan, K.-S. (1993a). A review of some limit theorems of Markov chains and their applications, Dimensions, Estimation and Models, ed. by H. Tong, World Scientific (Singapore), 108-135. Chan, K.-S. (1993b). On the central limit theorem for an ergodic Markov chain, Stock. Proc. Appl 47, 113-117. Chan, K.-S., Petruccelli, J.D., Tong, H. and Woolford, S.W. (1985). A multiple threshold AR(1) model, J. Appl Probab. 22, 267-279. Chan, K.-S. and Tong, H. (1985). On the use of the deterministic Lyapunov function for the ergodicity of stochastic difference equations, Adv. Appl Probab. 17, 666-678. Chan, K.-S. and Tong, H. (1986). On estimating thresholds in autoregressive models, J. Time Series Anal 7, 179-190. Chan, K.-S. and Tong, H. (1994). A note on noisy chaos, J. Royal Stat. Soc. 56, 301-311. Chen, R. and Hardle, W. (1995). Nonparametric time series analysis, a selective review with examples, Bulletin of the International Statistical Institute, 50th session of ISI, August, 1995, Beijing, China. Chen, R. and Tsay, R.S. (1991). On the ergodicity of TAR(l) process, Ann. Appl Probab. 1, 613-634. Chen, R. and Tsay, R.S. (1993a). Functional-coefficient autogressive models, J. Amer. Stat Assoc. 88, 298-308. Chen, R. and Tsay, R.S. (1993b). Nonlinear additive ARX models, J. Amer. Stat Assoc. 88, 955-967. Cline, D.B.H. and Pu, H.H. (1998). Verifying irreducibility and continuity of a nonlinear time series, Stat & Prob. Letters 40, 139-148. Cline, D.B.H. and Pu, H.H. (1999a). Geometric ergodicity of nonlinear time series, Stat Sinica 9, 1103-1118. Cline, D.B.H. and Pu, H.H. (1999b). Stability of nonlinear AR(1) time series with delay, Stock. Proc. Appl 82, 307-333. Cline, D.B.H. and Pu, H.H. (1999c). Stability of threshold-like ARMA time series, Statistics Dept., Texas A&M Univ. Cline, D.B.H. and Pu, H.H. (2001). Geometric transience of nonlinear time series, Stat Sinica 11, 273-287. Collomb, G. and Hardle, W. (1986). Strong uniform convergence rates in robust nonparametric time series analysis and prediction: kernel regression estimation from dependent observations, Stock. Proc. Appl 23, 77-89. Diebolt, J. and Guegan, D. (1993). Tail behaviour of the stationary density of general non-linear autoregressive processes of order 1, J. Appl. Probab. 30, 315-329.
Foster, F.G. (1953). On the stochastic matrices associated with certain queueing processes. Ann. Math. Stat. 24, 355-360. Guegan, D. (1987). Different representations for bilinear models, J. Time Series Anal. 8, 389-408. Guegan, D. and Diebolt, J. (1994). Probabilistic properties of the β-ARCH model, Stat. Sinica 4, 71-87. Guo, M. and Petruccelli, J.D. (1991). On the null-recurrence and transience of a first order SETAR model, J. Appl. Probab. 28, 584-592. Hardle, W., Lϋtkepohl, H. and Chen, R. (1997). A review of nonparametric time series analysis, Internal. Stat Review 65, 49-72. Hardle, W. and Vieu, P. (1992). Kernel regression smoothing of time series, J. Time Series Anal 13, 209-232. La Salle, J.P. (1976). The Stability of Dynamical Systems, CMBS 25, Society for Industrial and Applied Mathematics (Philadelphia). Lim, K.S. (1992). On the stability of a threshold AR(1) without intercepts, J. Time Series Anal. 13, 119-132. Liu, J. (1992). On stationarity and asymptotic inference of bilinear time series models, Stat. Sinica 2, 479-494. Liu, J. and Brockwell, P.J. (1988). On the general bilinear time series models, J. Appl. Probab. 25, 553-564. Liu, J., Li, W.K. and Li, C.W. (1997). On a threshold autoregression with conditional heteroscedastic variances, J. Stat. Plan. Inference 62, 279300. Liu, J. and Susko, E. (1992). On strict stationarity and ergodicity of a nonlinear ARMA model, J. Appl. Probab. 29, 363-373. Lu, Z. (1996). A note on the geometric ergodicity of autoregressive conditional heteroscedasticity (ARCH) model, Stat. Prob. Letters 30, 305-311. Lu, Z. (1998a). Geometric ergodicity of a general ARCH type model with applications to some typical models, in Advances in Operations Research and Systems Engineering, J. Gu, G. Fan and S. Wang, eds., Global-Link Informatics Ltd., 76-86. Lu, Z. (1998b). On the geometric ergodicity of a non-linear autoregressive model with an autoregressive conditional heteroscedastic term, Stat. Sinica 8, 1205-1217. Masry, E. and Tj0stheim, D. (1995). Nonparametric estimation and identification of nonlinear ARCH time series: strong convergence and asymptotic normality, Econometric Theory 11, 258-289. Meyn, S.P. and Tweedie, R.L. (1992). Stability of Markovian processes I: Criteria for discrete-time chains, Adv. Appl. Probab. 24, 542-574. Meyn, S.P. and Tweedie, R.L. (1993). Markov Chains and Stochastic Stability, Springer-Verlag (London). Nummelin, E. (1984). General Irreducible Markov Chains and Non-negative Operators. Cambridge University Press, Cambridge. Nummelin, E. and Tuominen, P. (1982). Geometric ergodicity of Harris recurrent Markov chains with applications to renewal theory, Stoch. Proc. Appl. 12, 187-202. Petruccelli, J.D. and Woolford, S.W. (1984). A threshold AR(1) model, J. Appl. Probab. 21, 270-286.
Pham D.T. (1985). Bilinear Markovian representation and bilinear models, Stock. Proc. Appl. 20, 295-306. Pham, D.T. (1986). The mixing property of bilinear and generalised random coefficient autoregressive models, Stock. Proc. Appl. 23, 291-300. Pham, D.T. (1993). Bilinear times series models, Dimensions, Estimation and Models, ed. by H. Tong, World Scientific Publishing (Singapore), 191-223. Popov, N. (1977). Conditions for geometric ergodicity of countable Markov chains, Soviet Math. Dokl. 18, 676-679. Priestley, M.B. (1980). State-dependent models: A general approach to non-linear time series analysis, J. Time Series Anal. 1, 47-71. Pu, H.H. and Cline, D.B.H. (2001). Stability of threshold AR-ARCH models, tech. rpt., Dept. Stat., Texas A&M University (forthcoming). Quinn, B.G. (1982). A note on the existence of strictly stationary solutions to bilinear equations, J. Time Series Anal. 3, 249-252. Spieksma, F.M. and Tweedie, R.L. (1994). Strengthening ergodicity to geometric ergodicity for Markov chains, Stock. Models 10, 45-74. Tanikawa, A. (1999). Geometric ergodicity of nonlinear first order autoregressive models, Stock. Models 15, 227-245. Tj0stheim, D. (1990). Non-linear time series and Markov chains, Adv. Appl. Probab. 22, 587-611. Tj0stheim, D. (1994). Non-linear time series: a selective review, Scand. J. Stat. 21, 97-130. Tj0stheim, D. and Auestad, B.H. (1994a). Nonparametric identification of nonlinear time series: projections, J. Amer. Stat. Assoc. 89, 1398-1409. Tj0stheim, D. and Auestad, B.H. (1994b). Nonparametric identification of nonlinear time series: selecting significant lags, J. Amer. Stat. Assoc. 89, 1410-1419. Tong, H. (1981). A note on a Markov bilinear stochastic process in discrete time, J. Time Series Anal. 2, 279-284. Tong, H. (1990). Non-linear Time Series Analysis: A Dynamical System Approach, Oxford University Press (London). Tuominen, P. and Tweedie, R.L., (1979). Markov chains with continuous components, Proc. London Math. Soc. (3) 38, 89-114. Tweedie, R.L. (1975). Sufficient conditions for ergodicity and recurrence of Markov chains on a general state space. Stock. Proc. Appl. 3, 385-403. Tweedie, R.L. (1976). Criteria for classifying general Markov chains. Adv. Appl. Probab. 24, 542-574. Tweedie, R.L. (1983a). Criteria for rates of convergence of Markov chains with application to queueing and storage theory, in Probability, Statistics and Analysis, London Math. Society Lecture Note Series, ed. by J.F.C. Kingman and G.E.H. Reuter, Cambridge Univ. Press (Cambridge). Tweedie, R.L. (1983b). The existence of moments for stationary Markov chains, Adv. Appl. Probab. 20, 191-196.
Institute of Mathematical Statistics LECTURE NOTES — MONOGRAPH SERIES
TESTING NEUTRALITY OF mtDNA USING MULTIGENERATION CYTONUCLEAR DATA Susmita Datta Department of Mathematics and Statistics Georgia State University Abstract The neutrality theory of evolutionary genetics assumes that DNA markers distinguishing individuals and species are neutral and have little effect on individual fitness (Kimura, 1983). Under this hypothesis, the action of genetic drift or genetic drift in combination with mutation or migration can be used to describe the evolution of most DNA markers. In recent years, scientists have set up experiments to collect cytonuclear data over several generations to test whether the empirical evidence is consistent with this theory. In this paper, we review the existing statistical tests for neutrality based on such data and propose a new test that we believe is vastly superior. The new test arises from the likelihood theory after embedding the neutral model in a larger class of selection models, where the selection effect takes place due to a difference in fertility of various gametes. A power study based on Monte Carlo simulation is presented to demonstrate the superior performance of the new test.
1
Introduction
A major debate amongst evolutionary geneticists in recent years is whether most DNA markers distinguishing individuals and species are neutral and have little effect on individual fitness (Kimura, 1983). As a profound application of this theory, DNA sequence differences between extant species have been used to reconstruct the history of life. The classical theoretical developments of random genetic drift is built around this assumption. Under this hypothesis, the action of genetic drift or genetic drift in combination with mutation or migration can be used to describe the evolution of most DNA markers. The recent attacks on the neutrality theory are twofold. Firstly, it has been pointed out that in some cases non-neutral models can also explain behavior consistent with empirical evidence. For example, Gillespie (1979) showed that his model of selection in a random environment has the same
stationary distribution as the infinite allele neutral model. Therefore, the agreement between observations and that predicted by the infinite allelic model noted by Fuerst et al. (1977) can be used with equal strength to support Gillespie's model of natural selection. In another context, Rothman and Templeton (1980) showed that, under some departure from the model assumptions, a neutral model (Watterson, 1977; Ewens, 1972) can yield frequency spectra and homozygosity similar to those expected from heterosis. In addition to the above results, a number of recent experiments suggest apparent non-neutral behaviors of mtDNA markers (Clark and Lyckegaard, 1988; MacRae and Anderson 1988; Fos et al., 1990; Nigro and Prout, 1990; Pollak, 1991; Arnason, 1991; Kambhampati et al., 1992; Scribner and Avise, 1994a, b; Hutter and Rand, 1995; etc.). Singh and Hale (1990) suggested that the apparent "non-neutral" behavior may also be caused by mating preference and that any attempt to understand the role of selection on mtDNA variants should first begin with simpler conspecific variants rather than with interspecific variants; however see MacRae and Anderson (1990), Jenkins et al. (1996). Multi-locus empirical comparisons have been undertaken by Karl and Avise (1992; also see McDonald, 1996), Berry and Kreitman (1993), McDonald (1994). In view of these recent experimental developments it is important to test whether the apparent non-neutral behavior of the markers are indeed statistically significant. Consequently it is more important than ever to devise appropriate statistical tests for testing the neutrality of a mtDNA marker. As we will see in Section 3, the existing statistical tests are often too limited to take full advantage of the multi-generation cytonuclear data that are now available. As a result, a new test based on the recent works by Datta (1999, 2001) is proposed. This test is based on an approximate likelihood for the full available data constructed from a broad parametric selection model and is therefore expected to perform well in practice. The data collection scheme and the underlying model of random drift for genetic evolution is introduced in the next section. This neutral model serves as the null model for the statistical tests which are introduced in Sections 3 and 4. A numerical power study based on Monte Carlo simulation is reported in Section 5. The paper ends with some concluding remarks in Section 6.
2
Data Collection Scheme and the Random Drift Model
In recent kitty-pool experiments, there are two potential sources of variation in cytonuclear frequencies, namely, genetic sampling variation and statistical sampling variation (Weir, 1990). Genetic sampling variation arises
from genetic drift, the sampling of gametes from a finite breeding pool of individuals in nature to constitute the next generation. Statistical sampling variation arises from sampling individuals from a population and using the genotypic frequencies from the sample in subsequent calculation. In Datta et al. (1996), test statistics based on cytonuclear disequilibria were constructed which can account for both sources of variation. The sampling scheme is described below. Such sampling schemes were introduced by Fisher and Ford (1947) and subsequently considered by Schaffer et al. (1977). Kiperskey (1995) also collected data on the fruit fly Drosophila melanogaster following such a scheme. We feel that these types of sampling schemes will become increasingly important in prospective tests for selection (White et al., 1998) using molecular markers in which a cytoplasmic marker is included as a control. Consider a population propagating through discrete non-overlapping generations. Although this is a simplifying and restrictive assumption, it can be achieved for an experimental population with specially selected species, such as Gambusia and fruit flies. At each generation, a portion of the adult population is collected by simple random sampling and sent for genotyping after they form the next generation's eggs by random mating. The eggs are then collected and placed in a cage to form the next generation. Thus, in this case, only the sample genotypic relative frequencies are available and are therefore subject to the additional source of sampling variation. We let g denote the number of consecutive generations from which samples were drawn. Throughout the rest of the paper, we will simultaneously concentrate on a nuclear site with possible alleles A and a and a cytoplasmic site with possible alleles C and c. The various relative frequencies at the genotypic and the gametic levels are indicated in Tables 1 and 2, respectively. Note that since the cytoplasmic marker is only maternally inherited, its representation remains the same at both levels. Also, if needed, we will denote the generation number (i.e., time) in parenthesis and the corresponding quantities at the sample level will be indicated by the hat notation.

Table 1. Genotypic frequencies

                  Nuclear Allele
Cytoplasm     AA     Aa     aa     Total
C             p1     p2     p3     q
c             p4     p5     p6     1-q
Total         u      v      w      1
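For orientation, the cytonuclear disequilibria that such test statistics are built on can be computed directly from the Table 1 frequencies. The sketch below uses the standard genotypic definitions D_AA = p1 - qu, D_Aa = p2 - qv, D_aa = p3 - qw; these definitions are standard in the cytonuclear literature and are supplied here for illustration rather than quoted from this paper, and the numerical frequencies in the usage line are hypothetical.

```python
def genotypic_disequilibria(p1, p2, p3, p4, p5, p6):
    """Disequilibria between cytotype C and each nuclear genotype,
    computed from the joint frequencies laid out in Table 1."""
    q = p1 + p2 + p3                       # frequency of cytotype C
    u, v, w = p1 + p4, p2 + p5, p3 + p6    # marginal genotype frequencies
    return p1 - q * u, p2 - q * v, p3 - q * w

# example with a mild association between C and AA
print(genotypic_disequilibria(0.30, 0.20, 0.10, 0.15, 0.15, 0.10))
```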
Table 2. Gametic frequencies

                      Nuclear allele
  Cytoplasm      A       a       Total
  C              e1      e2      q
  c              e3      e4      1 - q
  Total          p       1 - p   1
Under the action of genetic drift alone, the evolution of the population through generations can be modeled by the following Markov chain. Under the RUZ (random union of zygotes) model (Watterson, 1970), the probability of observing an offspring which received gametic types l and m, respectively, from the two parents is e_l e_m. Thus, the probability distribution of the counts X(t + 1) = (X_{lm}(t + 1)) in generation t + 1, given the gametic combination counts up to time t, is multinomial and is given by

X(t + 1) | X(0), ..., X(t) ~ Multinomial(N_{t+1}; {e_l(t) e_m(t)}),    (1)

that is,

P(X(t + 1) = x | X(0), ..., X(t)) = N_{t+1}! Π_{l,m} {e_l(t) e_m(t)}^{x_{lm}} / x_{lm}!,    (2)

where N_{t+1} = Σ_{l,m} x_{lm}(t + 1) is the size of the (t + 1)st generation. Finally, note that this in turn determines the distribution of the genotypic and the gametic proportions p(t + 1) and e(t + 1), since they are just linear combinations of the x(t + 1); viz., p_k(t + 1) = N_{t+1}^{-1} Σ_{l,m} a_{k,lm} x_{lm}(t + 1) for suitable constants a_{k,lm}.
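To make the drift step concrete, here is a small simulation sketch (our own illustration, not from the paper; the gametic frequencies, the ordering of the four gametic types, and the offspring number are invented for the example). It draws one generation of gametic-combination counts under the RUZ model, where an offspring receives types l and m with probability e_l e_m, so the vector of counts is multinomial as in (1)-(2).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical gametic frequencies e = (e1, e2, e3, e4), assumed to correspond
# to the gametic types AC, aC, Ac, ac in that order (an assumption for this sketch).
e = np.array([0.3, 0.2, 0.3, 0.2])
N_next = 500                      # size of generation t+1 (assumed for the example)

# RUZ model: an offspring receives gametic types (l, m) with probability e_l * e_m,
# so the 4x4 table of combination counts X(t+1) is multinomial(N_next, e outer e).
probs = np.outer(e, e).ravel()
x = rng.multinomial(N_next, probs).reshape(4, 4)

# Realized frequencies of the four gametic types among the 2*N_next gametes drawn;
# these drift randomly around e from one generation to the next.
e_realized = (x.sum(axis=0) + x.sum(axis=1)) / (2 * N_next)
print(x)
print(e_realized)
```

Iterating this step, and subsampling individuals for genotyping at each generation, reproduces the two sources of variation (genetic and statistical sampling) described above.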
SEMIPARAMETRIC INFERENCE FOR SYNCHRONIZATION OF POPULATION CYCLES

P.E. Greenwood
Dept. of Mathematics, Arizona State University, Tempe, Arizona
and Dept. of Mathematics, University of British Columbia, Canada

D.T. Haydon
Centre for Tropical Veterinary Medicine, Easter Bush, Roslin, Midlothian, U.K.
1 Introduction
We consider a dynamic random field. On each of a discrete array of sites is located a hyperbolic dynamical system perturbed by noise. We assume that the dynamics are identical, with a unimodal limit cycle and that the perturbations are independent, centered, and with the same distribution. In addition the individual processes are coupled with one another in a homogeneous pattern. The coupling may be global, in which case we are thinking of a mean-field type of system. Or the coupling may be local, the coupling strength between each site and its neighbors attenuating with distance. The application we have in mind is to cycling populations of animals, where the log of each local population increases roughly linearly to an apparent critical point from which it falls precipitously to a minimum. The data is discrete in space and time, being based on periodic reports from catchment or reporting regions. In a previous study [1] we focussed on data from Canadian lynx populations. Lynx population cycles are known to follow those of snowshoe hare. Previous analyses of this data have been concerned with inferring the length and regularity of the evident population cycles. We consider in [1] the very different challenge of estimating a parameter identified as strength of coupling among populations. That paper is primarily addressed to data analysis and interpretation. The emphasis in this report is on the steps involved in arriving at a suitable model and estimator. We explore some of the difficulties posed by this rather unusual problem. In forthcoming studies we, with colleagues, will apply the method described here to cycling population data from Canadian muskrat and mink, and from the greysided vole of Hokkaido.
2 Development of a Coupled Evolving Phase Field Model
To begin, let us introduce a general definition of synchronization applicable to random fields. Let X = {X_{it}, i ∈ I, t = 0, 1, 2, ...} denote the values at site i and time t of a dynamic random field. Let S = {S(·)_t, t ≥ 0} denote a real-valued functional or collection of functionals of {X_{is}, i ∈ I, s ≤ t}. Note that we allow S evaluated at time t to depend on data up to and including time t. Define S-synchronization to mean (SX)_t = 0, almost surely, for t in some time-set T. This gives us a flexible definition of exact synchronization. But a random field will not be exactly synchronized. In order to formulate a statistically useful definition we need to allow for and to measure departures from synchronization. For example, we probably want the mean and variance of S applied to random data to be small. We are led to the following:

Definition. A dynamic random field X departs from S-synchronization by no more than ε over the time interval T if E(SX)_t^2 ≤ ε for all t ∈ T.

One might use other norms, e.g. L_1 or Kullback-Leibler, instead of L_2. However this definition has a familiar form and is computationally convenient. At each site i in an array of sites I is located a "skeleton" process which, in the absence of any noise or coupling, can be written

X_{i,t+1} = f(X_{it}),   i ∈ I, t = 0, 1, 2, ....    (2.1)
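As a concrete reading of the definition above, the following sketch (our own illustration, not from the paper; the logistic-type map used for f, the noise level, and the choice of S as the across-site standard deviation are assumptions made only for this example) simulates a noisy, uncoupled version of the skeleton dynamics (2.1) and estimates E(SX)_t^2 at a fixed time t by Monte Carlo. This is exactly the quantity that the definition requires to stay below ε over the interval T.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    return 3.5 * x * (1.0 - x)          # hypothetical common skeleton map (has a period-4 cycle)

n_sites, t_star, n_rep, sigma = 20, 200, 500, 0.01

def S(field):
    return field.std()                  # one possible choice of functional: across-site spread

vals = np.empty(n_rep)
for r in range(n_rep):
    x = 0.3 + 0.001 * rng.standard_normal(n_sites)   # nearly identical initial points
    for _ in range(t_star):
        # independent, centered, identically distributed noise at each site, no coupling
        x = np.clip(f(x) + sigma * rng.standard_normal(n_sites), 0.0, 1.0)
    vals[r] = S(x) ** 2

print("Monte Carlo estimate of E(SX)_t^2 at t =", t_star, ":", vals.mean())
```

With independent noise and no coupling the sites gradually desynchronize, so this estimate grows with t; a coupling term is what can keep it small.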
In this treatment we write the dynamics as of first order in discrete time. We assume there is a unimodal limit cycle ℓ(t), t = 0, ..., p, ℓ(0) = ℓ(p), where, since f is the same for each i, the form of the cycle, ℓ(·), and its period, p, are the same for each site i. For definiteness let ℓ(0) be the minimum point of the unimodal cycle. If we picture the deterministic field (2.1) running in equilibrium, we see the periodic cycle ℓ(·) being executed at each site, the only difference between sites being the phase, which will differ if the initial points X_{i0} were not identical. Now suppose we watch, instead, the stochastic field,

X_{i,t+1} = f(X_{it}) + ε_{it},   i ∈ I, t = 0, 1, 2, ...,    (2.2)
where ε_{it} represents a small, identically distributed, centered noise. A theorem of deterministic dynamics called the "shadowing lemma" implies that the stochastic paths of (2.2) shadow the paths of (2.1), where "shadow" means that they remain in a distributional neighborhood of the deterministic path up to a time shift. The differences among the paths at the various sites, then, in addition to the small width of this neighborhood, are the phases, which change in a random way. The neighborhood width depends on the
noise variance. This picture motivates a model for coupling based on the randomly evolving phases of the components. We now move our attention from the evolving random field of population levels, (2.2), to a corresponding evolving random field of phases. Going back to the deterministic model (2.1) we can unambiguously define the phase φ_{it} at site i, time t, to be the time fraction of the current cycle, ℓ(0), ..., ℓ(p), which has been accomplished at time t. Then each phase φ_{it} is in the interval [0, 1). The arithmetic for phases is mod 1. We think of the set of points {φ_{it}, i ∈ I} as a set on the circle of circumference 1. As time advances the points progress around the circle. Since in fact we wish to consider the phase field associated with the stochastic field (2.2), where the "limit cycle" is a stochastic perturbation of the deterministic one, we may not see a unique minimum for each cycle, and the definition of φ_{it} may be ambiguous. We will assume that the noise ε_{it} is small enough so that the ambiguity can be resolved by a device described in the next section. For the moment let us ignore this problem and assume that in the stochastic model the phase φ_{it} is defined as the fraction of the current orbit which has been traversed at time t by each path X_i of the stochastic field (2.2). We describe the structure of the phase field φ_{it}, i ∈ I, t = 0, 1, 2, ..., by writing

φ_{i,t+1} = φ_{it} + g_{it} + ξ_{it}   mod 1,    (2.3)
where g_{it} is the fraction of the current orbit traversed at site i at time t. Now we introduce a hypothetical coupling force into the phase field which will shift the phase at each site i and time t in the direction of the "mean phase". For this purpose we need to devise an appropriate definition of mean for a set of random points on a circle. As mentioned above we identify the phase values with points on a circle of circumference 1. Between points x, y on this circle, let Δ(x, y) denote the signed smallest arc measured counter-clockwise between them. Then Δ(x, y) is positive or negative according to whether x leads or lags y, and |Δ(x, y)| ≤ 1/2.
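The signed arc Δ and a workable notion of mean phase can be computed as in the following sketch (our own code, not the authors'); representing the circular mean by the angle of the resultant vector is one standard convention, assumed here only for illustration.

```python
import numpy as np

def delta(x, y):
    """Signed smallest arc from y to x on the circle of circumference 1.
    Positive when x leads y, negative when x lags y; always in (-1/2, 1/2]."""
    d = (x - y) % 1.0
    return d - 1.0 if d > 0.5 else d

def mean_phase(phases):
    """Circular mean of phases in [0, 1), via the resultant vector (one common convention)."""
    angles = 2.0 * np.pi * np.asarray(phases)
    return (np.arctan2(np.sin(angles).mean(), np.cos(angles).mean()) / (2.0 * np.pi)) % 1.0

phases = [0.95, 0.02, 0.10]          # points clustered around the origin of the circle
m = mean_phase(phases)
print(m, [round(delta(p, m), 3) for p in phases])
```

Shifting each phase by a fraction of its Δ from the mean phase is one way such a coupling force could act; the sketch is only meant to make the geometric quantities concrete.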
B(k(ε)v(ε)) = (g_ϑ, Dv)   for v ∈ V.

Hence g_ϑ is a gradient of a(F) when ϑ is known. It fulfills

(m, g_ϑ) = E ṙ(ϑ, X) E(ℓ(ε) k(ε)).

By Remark 4, an appropriate bracketing condition on the functions b_τ(x, y) = k(y − r(τ, x)) − Ek(ε) implies stochastic differentiability (6.1). It follows from (3.3) that the plug-in estimator â_ϑ̂ is asymptotically linear for a(F) with influence function

g = g_ϑ − (m, g_ϑ) c = k(ε) − Ek(ε) − E ṙ(ϑ, X) E(ℓ(ε) k(ε)) (E ṙ(ϑ, X)^2)^{-1} ṙ(ϑ, x) ε.
Efficient estimators for ϑ are constructed in Drost, Klaassen and Werker (1997) and Koul and Schick (1997). The canonical gradient g and an efficient estimator for Ek(ε) are given in Schick and Wefelmeyer (2000).

Example 3. (Heteroscedastic linear autoregression.) The observations X_0, ..., X_n are real with

X_i = ϑ X_{i-1} + s(X_{i-1}) ε_i.

The ε_i are independent and, for simplicity, standard normal. Conditions for uniform ergodicity and efficient estimators for ϑ are in Maercker (1997) and Schick (1999). The model is semiparametric, with transition distribution

Q_{ϑ,s}(x, dy) = (1/s(x)) φ((y − ϑx)/s(x)) dy,

where φ is the standard normal density. Fix ϑ and s. Introduce perturbations ϑ_{nu} = ϑ + n^{-1/2} u, s_{nv}(x) = s(x)(1 + n^{-1/2} v(x)). The function v runs through V = L_2(f), where f is the stationary density. The perturbed transition distribution is

Q_{nuv}(x, dy) = Q_{ϑ_{nu} s_{nv}}(x, dy) = Q_{ϑs}(x, dy)(1 + n^{-1/2}(u m(x, y) + Dv(x, y))),

with

m(x, y) = x (y − ϑx)/s(x)^2,   Dv(x, y) = v(x)(((y − ϑx)/s(x))^2 − 1).
Since the normal distribution is symmetric, m and Dv are orthogonal, and s can be estimated adaptively with respect to ϑ. Suppose we want to estimate the functional

a(s) = ∫_0^1 s(x)^2 dx.

For all u ∈ ℝ and v ∈ V we have

n^{1/2}(a(s_{nv}) − a(s)) → 2 ∫_0^1 s(x)^2 v(x) dx = (Dv_a, Dv + u m),

with v_a = 1_{[0,1]} s^2/f. Hence a(s) is differentiable at (ϑ, s), with canonical gradient

Dv_a(x, y) = 1_{[0,1]}(x) (s(x)^2/f(x)) (((y − ϑx)/s(x))^2 − 1).
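The displayed derivative of a(s) is easy to verify numerically. The sketch below is an illustration we add here (the particular s, the direction v, and the grid are arbitrary choices); it compares n^{1/2}(a(s_nv) − a(s)) with 2∫_0^1 s(x)^2 v(x) dx for a large n.

```python
import numpy as np

# Finite-difference check of the derivative of a(s) = \int_0^1 s(x)^2 dx along the
# perturbation s_nv(x) = s(x)(1 + n^{-1/2} v(x)); s and v below are arbitrary choices.
x = np.linspace(0.0, 1.0, 10001)

def trap(y):
    # simple trapezoidal rule on the fixed grid
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

s = 1.0 + 0.5 * np.sin(2.0 * np.pi * x)   # hypothetical scale function s(x)
v = np.sin(2.0 * np.pi * x)               # hypothetical perturbation direction v(x)

n = 10 ** 8
s_nv = s * (1.0 + n ** -0.5 * v)

lhs = np.sqrt(n) * (trap(s_nv ** 2) - trap(s ** 2))
rhs = 2.0 * trap(s ** 2 * v)
print(lhs, rhs)   # the two values should nearly agree
```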
Assume first that ϑ is known. Then we can estimate a(s) by

â_ϑ = ∫_0^1 ĥ(x) dx,   where   ĥ(x) = Σ_{i=1}^n w_n(x − X_{i-1})(X_i − ϑ X_{i-1})^2 / Σ_{i=1}^n w_n(x − X_{i-1}).

Here w_n(x) = c_n^{-1} w(c_n^{-1} x), where w is a continuously differentiable symmetric density with compact support [−1, 1], and c_n is a bandwidth of order n^{-1/3}. We show that â_ϑ is asymptotically linear with influence function Dv_a. We do so under the assumption that s is twice continuously differentiable. Write (X_i − ϑX_{i-1})^2 = s(X_{i-1})^2(ε_i^2 − 1) + s(X_{i-1})^2. Expand s(X_{i-1})^2 around s(x)^2 to obtain, up to a negligible remainder,

â_ϑ − a(s) = ∫_0^1 (A(x) + 2 s(x) s'(x) f_1(x)) / f_0(x) dx,

where

A(x) = n^{-1} Σ_{i=1}^n w_n(x − X_{i-1}) s(X_{i-1})^2 (ε_i^2 − 1),
f_0(x) = n^{-1} Σ_{i=1}^n w_n(x − X_{i-1}),
f_1(x) = n^{-1} Σ_{i=1}^n w_n(x − X_{i-1})(X_{i-1} − x).
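The construction just described can be mimicked directly. The following sketch is our own illustration under convenience assumptions (a particular s, a smooth compactly supported kernel w, and the least-squares estimator as the n^{1/2}-consistent estimate of ϑ, so it is really the plug-in version â_ϑ̂ discussed below): it simulates the heteroscedastic autoregression, forms a kernel ratio estimate of s(x)^2 in the spirit of ĥ with bandwidth c_n of order n^{-1/3}, and integrates it over [0, 1].

```python
import numpy as np

rng = np.random.default_rng(2)

def s(x):
    return 0.5 + 0.3 * np.sin(2.0 * np.pi * x)      # hypothetical scale function

theta, n = 0.4, 20000
# Simulate X_i = theta * X_{i-1} + s(X_{i-1}) * eps_i with standard normal eps_i.
X = np.empty(n + 1)
X[0] = 0.2
eps = rng.standard_normal(n)
for i in range(1, n + 1):
    X[i] = theta * X[i - 1] + s(X[i - 1]) * eps[i - 1]

# n^{1/2}-consistent estimator of theta (plain least squares, one convenient choice).
theta_hat = np.sum(X[1:] * X[:-1]) / np.sum(X[:-1] ** 2)

# Kernel ratio estimate of s(x)^2, with w a smooth symmetric density on [-1, 1].
c_n = n ** (-1.0 / 3.0)
def w(u):
    return np.where(np.abs(u) <= 1.0, (35.0 / 32.0) * (1.0 - u ** 2) ** 3, 0.0)

def h_hat(x):
    weights = w((x - X[:-1]) / c_n) / c_n
    resid2 = (X[1:] - theta_hat * X[:-1]) ** 2
    return np.sum(weights * resid2) / np.sum(weights)

# Integrate h_hat over [0, 1] by the trapezoidal rule to get the plug-in estimate of a(s).
grid = np.linspace(0.0, 1.0, 201)
vals = np.array([h_hat(x) for x in grid])
a_hat = float(np.sum((vals[1:] + vals[:-1]) * np.diff(grid)) / 2.0)
s2 = s(grid) ** 2
a_true = float(np.sum((s2[1:] + s2[:-1]) * np.diff(grid)) / 2.0)
print(a_hat, a_true)
```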
The assumptions imply that f is twice continuously differentiable. Hence we obtain, uniformly for x ∈ [0, 1],

E A(x)^2 = O(n^{-1} c_n^{-1}) = O(n^{-2/3}),
E(f_0(x) − f(x))^2 = O(n^{-1} c_n^{-1} + c_n^4) = O(n^{-2/3}),
E(f_1(x) − c_n f'(x))^2 = O(n^{-1} c_n^{-1} + c_n^4) = O(n^{-2/3}).

We can also show that sup_{0 ≤ x ≤ 1} |f_0(x) − f(x)| converges to zero in probability. From this and the fact that f is bounded away from zero on [0, 1], we can conclude that

â_ϑ − a(s) = ∫_0^1 (A(x)/f(x)) dx + o_{P_{nϑs}}(c_n).

Now write
∫_0^1 (A(x)/f(x)) dx ≈ n^{-1} Σ_{i=1}^n (s(X_{i-1})^2 (ε_i^2 − 1)/f(X_{i-1})) I_n(X_{i-1}),   with   I_n(y) = ∫_0^1 w_n(y − x) dx.

It is easy to check that I_n converges in L_2(f) to the indicator of [0, 1]. Combining the above lets us conclude that â_ϑ has influence function Dv_a. Suppose now that ϑ is unknown. Let ϑ̂ be an n^{1/2}-consistent estimator of ϑ. We prove that the plug-in estimator â_ϑ̂ is efficient. We have already shown above that â_ϑ fulfills (3.4) with b_ϑ = Dv_a. By the argument of Section 3, it remains to show (3.5). Since (m, Dv_a) = 0 by adaptivity, (3.5) reduces to asymptotic equivalence of â_ϑ̂ and â_ϑ, i.e., n^{1/2}(â_ϑ̂ − â_ϑ) = o_{P_{nϑs}}(1). To prove this, we note first that

n^{1/2}(â_ϑ̂ − â_ϑ) = 2 n^{1/2}(ϑ − ϑ̂) ∫_0^1 (B(x)/f_0(x)) dx + o_{P_{nϑs}}(1),   where   B(x) = n^{-1} Σ_{i=1}^n w_n(x − X_{i-1}) X_{i-1} s(X_{i-1}) ε_i.

Since ∫_0^1 B(x)/f_0(x) dx converges to zero in probability, we obtain the desired result.
7 Extensions
1. We have assumed ϑ and a(F) to be one-dimensional. Extension to finite-dimensional a(F) is straightforward; infinite-dimensional a(F) require additional technicalities. In nonlinear regression, Example 1, we may, e.g., be interested in estimating the error distribution function F, defined by F(t) = P(ε ≤ t). For linear regression we refer to Klaassen and Putter (1999). Extension to finite-dimensional ϑ is also straightforward. We note that it may happen that a(F) is adaptive with respect to certain components of ϑ only. For efficiency of â_ϑ̂, efficient estimators are required only for the non-adaptive components of ϑ. Extensions of nonlinear regression, Example 1, are treated in Müller and Wefelmeyer (2000a). Extensions of nonlinear autoregression, Example 2, are treated in Schick and Wefelmeyer (2000).

2. We have restricted attention to functionals a(F) of F only. The results may be extended to functionals a(ϑ, F) which depend also on ϑ. An interesting application is estimation of invariant distributions of time series, for example in linear autoregression X_i = ϑ X_{i-1} + ε_i. Since Σ_{j≥0} ϑ^j ε_j is distributed as the invariant law, we can write the expectation of a function k under the invariant law as

Ek(X) = Ek(Σ_{j≥0} ϑ^j ε_j) = a(ϑ, F),
where F is the innovation distribution function. Hence Ek(X) can be estimated by a von Mises statistic or a U-statistic based on estimated innovations; see Schick and Wefelmeyer (2001).

3. The results extend from semiparametric models {P_{nϑF} : ϑ ∈ Θ, F ∈ 𝓕} to parametric families {𝒫_{nϑ} : ϑ ∈ Θ} of nonparametric models. This is of interest when we start from a nonparametric model 𝒫_n and impose a restriction which depends on an unknown parameter, say r_ϑ(P_n) = 0, leading to 𝒫_{nϑ} = {P_n : r_ϑ(P_n) = 0}. For example, let X_0, ..., X_n be observations from a Markov chain with transition distribution fulfilling ∫ Q(x, dy) y = r(ϑ, x) for some ϑ. This is the nonlinear autoregressive model X_i = r(ϑ, X_{i-1}) + ε_i, where the ε_i are martingale increments, not i.i.d. as in Example 2. For estimators of ϑ see Wefelmeyer (1994), (1996), (1997a), (1997b); for estimators of the stationary law see Schick and Wefelmeyer (1999). The model may be written as a semiparametric model by introducing transition distributions F(x, dy) with ∫ F(x, dy) y = 0 and writing Q(x, dy) = F(x, dy − r(ϑ, x)). This is, however, technically inconvenient because we perturb ϑ and would need differentiability of F. Another example is that of i.i.d. observations (X_1, Y_1), ..., (X_n, Y_n) with joint law fulfilling the constraint E(a(X, Y, ϑ) | X) = 0, where a(X, Y, ϑ) is a given function. For plug-in estimators in such models see Müller and Wefelmeyer (2000b). A special case is a(X, Y, ϑ) = Y − r(ϑ, X), i.e., Y_i = r(ϑ, X_i) + ε_i, which differs from Example 1 in that we do not assume ε_i and X_i to be independent.
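For the linear autoregression mentioned in item 2, an estimator of Ek(X) under the invariant law along these lines can be sketched as follows (our illustration, not the authors' construction; the choice k(x) = x^2, the truncation of the series Σ_j ϑ̂^j ε̂_j, and the Monte Carlo resampling of estimated innovations in place of an exact von Mises average are simplifying assumptions).

```python
import numpy as np

rng = np.random.default_rng(3)

theta, n = 0.6, 5000
eps = rng.standard_normal(n)               # innovations; any centered law would do
X = np.empty(n + 1)
X[0] = 0.0
for i in range(1, n + 1):
    X[i] = theta * X[i - 1] + eps[i - 1]

k = lambda x: x ** 2                       # functional of interest: Ek(X) under the invariant law

# Step 1: estimate theta and the innovations.
theta_hat = np.sum(X[1:] * X[:-1]) / np.sum(X[:-1] ** 2)
resid = X[1:] - theta_hat * X[:-1]

# Step 2: average k over draws of sum_j theta_hat^j * eps_j^*, with eps_j^* resampled
# from the estimated innovations and the series truncated at J terms.
J, B = 50, 20000
draws = (theta_hat ** np.arange(J)) * rng.choice(resid, size=(B, J))
estimate = k(draws.sum(axis=1)).mean()

print(estimate, 1.0 / (1.0 - theta ** 2))  # for k(x)=x^2 and N(0,1) innovations, the target is 1/(1-theta^2)
```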
References

Andersen, P. K., Borgan, Ø., Gill, R. D. and Keiding, N. (1993). Statistical Models Based on Counting Processes. Springer Series in Statistics, Springer, Berlin.

Andrews, D. W. K. (1994). Asymptotics for semiparametric econometric models via stochastic equicontinuity. Econometrica 62, 295-314.

Andrews, D. W. K. and Pollard, D. (1994). An introduction to functional central limit theorems for dependent stochastic processes. Internat. Statist. Rev. 62, 119-132.

An, H. Z. and Huang, F. C. (1996). The geometrical ergodicity of nonlinear autoregressive models. Statist. Sinica 6, 943-956.
Bhattacharya, R. and Lee, C. (1995). On geometric ergodicity of nonlinear autoregressive models. Statist. Probab. Lett. 22, 311-315.

Bickel, P. J. (1982). On adaptive estimation. Ann. Statist. 10, 647-671.

Bickel, P. J., Klaassen, C. A. J., Ritov, Y. and Wellner, J. A. (1998). Efficient and Adaptive Estimation for Semiparametric Models. Springer, New York.

Daniels, H. E. (1961). The asymptotic efficiency of a maximum likelihood estimator. Proc. Fourth Berkeley Symp. Math. Statist. Probab. 1, 151-163.

Drost, F. C., Klaassen, C. A. J. and Werker, B. J. M. (1997). Adaptive estimation in time-series models. Ann. Statist. 25, 786-817.

Greenwood, P. E. and Wefelmeyer, W. (1991). Efficient estimating equations for nonparametric filtered models. In: Statistical Inference in Stochastic Processes (N. U. Prabhu, I. V. Basawa, eds.), 107-141, Marcel Dekker, New York.

Fabian, V. and Hannan, J. (1985). Introduction to Probability and Mathematical Statistics. Wiley, New York.

Hajek, J. (1970). A characterization of limiting distributions of regular estimates. Z. Wahrsch. Verw. Gebiete 14, 323-330.

Huber, P. J. (1967). The behavior of maximum likelihood estimates under nonstandard conditions. Proc. Fifth Berkeley Symp. Math. Statist. Probab. 1, 221-233.

Jeganathan, P. (1995). Some aspects of asymptotic theory with applications to time series models. Econometric Theory 11, 818-887.

Klaassen, C. A. J. and Putter, H. (1999). Efficient estimation of Banach parameters in semiparametric models. Technical Report, Department of Mathematics, University of Amsterdam.

Koul, H. L. and Schick, A. (1997). Efficient estimation in nonlinear autoregressive time series models. Bernoulli 3, 247-277.

Kreiss, J.-P. (1987). On adaptive estimation in stationary ARMA processes. Ann. Statist. 15, 112-133.

Le Cam, L. (1986). Asymptotic Methods in Statistical Decision Theory. Springer-Verlag, New York.

Le Cam, L. and Yang, G. L. (1990). Asymptotics in Statistics. Springer Series in Statistics, Springer, Berlin.
Maercker, G. (1997). Statistical Inference in Conditional Heteroskedastic Autoregressive Models. Shaker, Aachen.

Müller, U. U. and Wefelmeyer, W. (2000a). Estimating parameters of the residual distribution in nonlinear regression. In preparation.

Müller, U. U. and Wefelmeyer, W. (2000b). Regression type models and optimal estimators. In preparation.

Ogata, Y. (1980). Maximum likelihood estimates of incorrect Markov models for time series and the derivation of AIC. J. Appl. Probab. 17, 59-72.

Pollard, D. (1985). New ways to prove central limit theorems. Econometric Theory 1, 295-314.

Schick, A. (1993). On efficient estimation in regression models. Ann. Statist. 21, 1486-1521. Correction: 23 (1995), 1862-1863.

Schick, A. (1999). Efficient estimation in a semiparametric heteroscedastic autoregressive model. Technical Report, Department of Mathematical Sciences, Binghamton University. http://math.binghamton.edu/anton/preprint.html

Schick, A. (2000). On asymptotic differentiability of averages. Statist. Probab. Lett. 51, 15-23.

Schick, A. (2001). Sample splitting with Markov chains. Bernoulli 7, 33-61.

Schick, A. and Wefelmeyer, W. (1999). Efficient estimation of invariant distributions of some semiparametric Markov chain models. Math. Meth. Statist. 8, 426-440.

Schick, A. and Wefelmeyer, W. (2000). Estimating the innovation distribution in nonlinear autoregressive models. To appear in: Ann. Inst. Statist. Math.

Schick, A. and Wefelmeyer, W. (2001). Estimating invariant laws of linear processes by U-statistics. Technical Report, Department of Mathematics, University of Siegen. http://www.math.uni-siegen.de/statistik/wefelmeyer.html

Wefelmeyer, W. (1991). A generalization of asymptotically linear estimators. Statist. Probab. Lett. 11, 195-199.

Wefelmeyer, W. (1994). Improving maximum quasi-likelihood estimators. In: Asymptotic Statistics (P. Mandl, M. Huskova, eds.), 467-474, Physika-Verlag, Heidelberg.

Wefelmeyer, W. (1996). Quasi-likelihood models and optimal inference. Ann. Statist. 24, 405-422.
Wefelmeyer, W. (1997a). Adaptive estimators for parameters of the autoregression function of a Markov chain. J. Statist. Plann. Inference 58, 389-398.

Wefelmeyer, W. (1997b). Quasi-likelihood regression models for Markov chains. In: Selected Proceedings of the Symposium on Estimating Functions (I. V. Basawa, V. P. Godambe and R. L. Taylor, eds.), 149-173, IMS Lecture Notes-Monograph Series, Institute of Mathematical Statistics, Hayward, California.

Wefelmeyer, W. (1999). Efficient estimation in Markov chain models: an introduction. In: Asymptotics, Nonparametrics, and Time Series (S. Ghosh, ed.), 427-459, Statistics: Textbooks and Monographs 158, Dekker, New York.
NUISANCE PARAMETER ELIMINATION AND OPTIMAL ESTIMATING FUNCTIONS

T. M. Durairajan and Martin L. William
Loyola College, Madras, India
Abstract In the context of obtaining optimal estimating functions for interesting parameters in the presence of nuisance parameters in parametric models, a method of elimination of nuisance parameters is proposed in this paper. The proposed method is direct and does not impose any 'factorization' conditions on the likelihood. In this direction, a sequence of lower bounds for the variance-covariance matrix of estimating functions is derived. A recipe which gives a transparent approach for obtaining optimal estimating functions is suggested. It is shown that minimum variance unbiased estimators could be obtained using the recipe. Keywords and Phrases: Lower bounds, nuisance parameter elimination, optimal estimating function.
1 Introduction
In the theory of estimating functions applied to parametric models involving nuisance parameters, the 'elimination' of nuisance parameters to obtain optimal estimating functions (EF) for interesting parameters is a very important task. In a pioneering work, Godambe (1976) suggested a method of eliminating nuisance parameters by multiplying and adding suitable functions to the score function and formally established the optimality of conditional score function. Lloyd (1987) and Bhapkar and Srinivasan (1993) claimed the optimality of marginal score function. However, that there are errors in the results of Lloyd (1987) and Bhapkar and Srinivasan (1993) has been pointed out by Bhapkar (1995, 1997) who imposed some more conditions and established the optimality of marginal score function. The conditional and marginal factorization properties were used by the above authors in the elimination of nuisance parameters. Heyde (1997) proposed a method of
obtaining optimal EF by eliminating nuisance parameters from a suitably chosen function that possesses the 'likelihood score property'. Heyde gives the optimal EF of 'first order theory' but not of the higher orders. The present work is an attempt in this direction. In this paper, a straightforward recursive method of elimination of nuisance parameters without going into the factorization aspects of the likelihood is proposed. In this direction, a theorem which gives a sequence of lower bounds for the variance-covariance matrix of the EFs is established in Section 2. This is achieved by considering higher order derivatives with respect to the nuisance parameters, drawing inspiration from Godambe (1984). Consequently, a recipe which gives a systematic approach for possible elimination of nuisance parameters leading to optimal EF is suggested. Section 3 presents several examples to illustrate the recipe. In Section 4, as another outcome of the main result of Section 2, a sequence of lower bounds for the variance-covariance matrix of unbiased estimators of the interesting parameters is given. This sequence is different from the sequence of Bhattacharya bounds both in context and in content. Further, it is shown that minimum variance bound unbiased estimators of the interesting parameters could be obtained by the suggested recipe.
2 The Main Result and the Recipe
Let X be a random vector with sample space 𝒳 and probability density function p(x; ω) with respect to some σ-finite measure μ on (𝒳, B(𝒳)). The family of densities is indexed by ω = (θ, φ) ∈ Ω with θ ∈ Ω_1 ⊂ ℝ^r, φ ∈ Ω_2 ⊂ ℝ^m, Ω = Ω_1 × Ω_2. The interesting parameter is θ, the nuisance parameter is φ, and estimation of θ in the presence of φ is considered. We assume the usual regularity conditions on the density function p and the EFs g = (g_1, ..., g_r)': 𝒳 × Ω_1 → ℝ^r (refer Godambe (1976, 1984), Bhapkar (1995, 1997)). Let D_g = ((E(∂g_i/∂θ_j))). Let the class of EFs satisfying the regularity conditions be denoted by G_0, and let M_g(ω) = D_g^{-1} E(g g')(D_g')^{-1}, the variance-covariance matrix of standardized EFs. In the sequel, the following notations are used:

l_θ = (∂ log p/∂θ_1, ..., ∂ log p/∂θ_r)',    (2.1)
I_{11} = E(l_θ l_θ'),    (2.2)
with

l_φ = (∂ log p/∂φ_1, ..., ∂ log p/∂φ_m)',   I_{12} = E(l_θ l_φ') = I_{21}',   I_{22} = E(l_φ l_φ'),    (2.3)

and the recursively defined functions

L_θ^{(1)} = l_θ − I_{12} I_{22}^{-1} l_φ,    (2.4)
L_θ^{(k+1)} = L_θ^{(k)} − I_{12}^{(k+1)} (I_{22}^{(k+1)})^{-1} l_φ^{(k+1)},   k = 1, 2, ....    (2.5)

I_{11} and I_{22} are assumed non-singular and, for simplicity, we write l_φ^{(1)} = l_φ, I_{12}^{(1)} = I_{12}, I_{21}^{(1)} = I_{21}, I_{22}^{(1)} = I_{22}. Let

B_k = (I_{11} − Σ_{j=1}^{k} I_{12}^{(j)} (I_{22}^{(j)})^{-1} I_{21}^{(j)})^{-1},   k = 1, 2, ....    (2.6)

Since the I_{22}^{(k)} are positive definite we have B_{k+1} ≥ B_k. Also,

E[L_θ^{(k)} L_θ^{(k)'}] = B_k^{-1}   ∀ k.    (2.7)
Theorem 2.1: For every g ∈ G_0, M_g ≥ B_k, k = 1, 2, ..., with equality if and only if g = A(θ, φ) L_θ^{(k)}, where A(θ, φ) is a non-singular matrix and the functions L_θ^{(k)} are defined recursively in (2.4) and (2.5).

Proof: For g ∈ G_0, we observe that E[g l_φ^{(k)'}] = 0 and E[g L_θ^{(k)'}] = E[g l_θ'] = −D_g ∀ k. Now, considering the n.n.d. matrix

E [ L_θ^{(k)} ; g ] [ L_θ^{(k)'} , g' ]  =
[ B_k^{-1}    −D_g'   ]
[ −D_g      E(g g')  ]    (2.8)

and applying matrix theory arguments, we get

Rank of the matrix in (2.8) = Rank E(g g') + Rank (B_k^{-1} − D_g' (E(g g'))^{-1} D_g) = r + Rank (B_k^{-1} − M_g^{-1}).

Also, B_k^{-1} − M_g^{-1} ≥ 0 by the non-negative definiteness of the full matrix in (2.8). This gives M_g ≥ B_k. Further, B_k^{-1} − M_g^{-1} = 0 if and only if the rank of the matrix in (2.8) is r. Now, if g = A(θ, φ) L_θ^{(k)} for some non-singular matrix A(θ, φ), then clearly the matrix in (2.8) has rank r so that M_g = B_k. Conversely, if M_g = B_k, i.e. the rank of the matrix in (2.8) is r, then as B_k^{-1}, D_g and E(g g') are all non-singular, it is necessary that for some non-singular matrix A(θ, φ), g = A(θ, φ) L_θ^{(k)}. Hence the theorem.

Remark 1: If for some k ≥ 1, B_k is attained by the EF A(θ, φ)L_θ^{(k)} for a suitable choice of A(θ, φ) (i.e. A(θ, φ)L_θ^{(k)} is free of φ), then we have I_{12}^{(k+1)} = 0, which gives L_θ^{(k+1)} = L_θ^{(k)} and B_{k+1} = B_k. Similarly, ∀ s ≥ k + 1, I_{12}^{(s)} = 0, L_θ^{(s)} = L_θ^{(k)} and B_s = B_k.
In view of this, we now propose the following.
Definition 2.1: If there exists a g* ∈ G_0 such that M_{g*} attains a lower bound B_k for some k, then g* is said to be a minimum variance bound estimating function (MVBEF).

Based on the theorem established and the above remark, we suggest the following recipe for eliminating nuisance parameters and obtaining the MVBEF: "Starting with the score function l_θ, consider recursively the functions L_θ^{(1)} = l_θ − I_{12} I_{22}^{-1} l_φ, L_θ^{(2)} = L_θ^{(1)} − I_{12}^{(2)} (I_{22}^{(2)})^{-1} l_φ^{(2)}, and so forth. If the nuisance parameters are essentially eliminated or appear as a multiplicative factor of an EF in some recursion, stop the process, as the EF thus obtained is optimal and further recursions would result in the same EF."
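Numerically, the first step of the recipe is a single linear projection; the sketch below (our own illustration; the toy numbers are not taken from the paper's examples) shows the computation L_θ^{(1)} = l_θ − I_12 I_22^{-1} l_φ for given score vectors and information blocks.

```python
import numpy as np

def first_step_ef(l_theta, l_phi, I12, I22):
    """First recursion of the recipe: L_theta^(1) = l_theta - I12 I22^{-1} l_phi.

    l_theta : (r,)  score for the interesting parameter
    l_phi   : (m,)  score for the nuisance parameter
    I12     : (r, m) block E(l_theta l_phi')
    I22     : (m, m) block E(l_phi l_phi'), assumed non-singular
    """
    return l_theta - I12 @ np.linalg.solve(I22, l_phi)

# Toy numbers only, to show the shapes involved.
l_theta = np.array([0.7])
l_phi = np.array([1.2, -0.4])
I12 = np.array([[0.3, 0.1]])
I22 = np.array([[2.0, 0.5], [0.5, 1.0]])
print(first_step_ef(l_theta, l_phi, I12, I22))
```

In practice the scores and information blocks are computed from the model at the working parameter value, and the same call is repeated with the higher-order quantities at each further recursion.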
3
Applications
In this section, a number of examples are discussed to illustrate the recipe suggested in the previous section. Throughout this section, we reserve the symbol θ for the interesting (real or vector) parameter. Example 3.1: Let x = (xi,...,x n ) and y = (j/i,. ,J/n) be independent where X{ are i.i.d. with density φexp(-φx), x > 0 and yι are i.i.d. with density φθ~ιexp(-φθ~λy), y > 0, 0, > 0 . Here,
is the MVBEF attaining the bound B\. Example 3.2: Let z\,..., zn be i.i.d. with z% = (x^, yu > ? VH) where X{ are i.i.d. exponential with mean φ and for each fixed j — 1,..., r, y^ are i.i.d. n
exponential with mean φθj. Denote x = Σ x^ yj = Y%=i yji, j = 1,..., r.
Here, φ L_θ^{(1)} = (g_1^*, ..., g_r^*)' is the MVBEF, with

g_j^* = (1/θ_j) [ y_j/θ_j − (x + Σ_{k=1}^r y_k/θ_k)/(r + 1) ],   j = 1, ..., r.
Remark 3: Examples 3.1 and 3.2 have been discussed respectively by Lloyd (1987) and Bhapkar and Srinivasan (1993) in the context of marginal factorization of the likelihood. These authors claimed that a marginal score function is the optimal EF. However, from the above discussion, we find that the optimal EFs obtained above do not coincide with the EFs claimed by these authors as optimal. In Example 3.1, we have M_{g^{(1)*}} = 2θ^2/n, whereas for the EF of Lloyd (1987), namely g_0 = n/θ − 2n/(θ + Σy_i/Σx_i), we have M_{g_0} = 2(n + 1)θ^2/n^2 > M_{g^{(1)*}}. This shows that Lloyd's claim that g_0 is the optimal EF is incorrect. The errors in Lloyd (1987) and Bhapkar and Srinivasan (1993) have been pointed out also by Bhapkar (1995, 1997), who has found the correct optimal EF for the model in Example 3.1 but not for Example 3.2. In contrast, the recipe of Section 2 and the explicit form L_θ^{(k)} for the optimal EF have enabled us to achieve this for Example 3.2 as well in an elegant manner.
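The comparison in Remark 3 can be checked by simulation. The sketch below is our own verification and relies on the forms of g^{(1)*} and of Lloyd's g_0 as reconstructed above; it estimates the standardized variance M_g = E(g^2)/(E ∂g/∂θ)^2 for both estimating functions. The first printed pair should be close to 2θ^2/n, and the last value visibly larger.

```python
import numpy as np

rng = np.random.default_rng(4)

theta, phi, n, reps = 2.0, 1.5, 20, 100000

# x_i ~ Exp(rate phi), y_i ~ Exp(rate phi/theta), as in Example 3.1 above.
x = rng.exponential(scale=1.0 / phi, size=(reps, n))
y = rng.exponential(scale=theta / phi, size=(reps, n))

def standardized_variance(g, dg_dtheta):
    # M_g = E(g^2) / (E dg/dtheta)^2, using E g = 0 for unbiased estimating functions
    return (g ** 2).mean() / dg_dtheta.mean() ** 2

# MVBEF of Example 3.1 (as reconstructed above): g* = sum_i (y_i/theta - x_i).
g1 = y.sum(axis=1) / theta - x.sum(axis=1)
d1 = -y.sum(axis=1) / theta ** 2
M1 = standardized_variance(g1, d1)

# Marginal-score EF of Lloyd (1987), as reconstructed above: g0 = n/theta - 2n/(theta + T).
T = y.sum(axis=1) / x.sum(axis=1)
g0 = n / theta - 2 * n / (theta + T)
d0 = -n / theta ** 2 + 2 * n / (theta + T) ** 2
M0 = standardized_variance(g0, d0)

print(M1, 2 * theta ** 2 / n)   # these two should agree
print(M0)                       # this should exceed the value above
```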
=θ + φi, V{xi) = V(yi) = 1. Here, L^ = Σ {x{ - θ) is the MVBEF. i l
Example 3.4: Let a i, j/i,... .Xn^Vn be independent normal with E(x{) = θ + φu E{yi) = φi, V{xi) = V{yi) = 1. Here, the MVBEF is L^ = n(x - y - θ)/2. Remark 4: Examples 3.3 and 3.4 were discussed by Godambe (1976) in the context of nuisance parameter elimination when conditional factorization property for the likelihood holds good. He has shown that the same EFs are optimal. In the above discussion, we have demonstrated the straight forward applicability of our recipe without investigating the factorization aspects which is required in Godambe's approach. Example 3.5: Let x = (xi,... ,xn) and y = (yi,... ,y m ) be independent where xι are i.i.d. Poisson (0), yj are i.i.d. Poisson (), θ,φ > 0. Here,
4
υ
= mn(x - θy)/(θ(nθ + m)) is the MVBEF.
This example with m = n = 1 has been discussed by Reid (1995) in illustrating the roles of conditioning in inference in the presence of nuisance parameters, wherein the estimation is based on conditioning upon a statistic called a 'cut' (Barndorff-Nielsen 1978) and involves a suitable reparametrization. In contrast, our approach is straightforward and does not require reparametrization.

Example 3.6: Consider the linear model of a randomized block design y_{ij} = μ + t_i + b_j + e_{ij}, i = 1, ..., k, j = 1, ..., r, where the e_{ij} are i.i.d. N(0, σ^2),
Σ_i t_i = Σ_j b_j = 0. Suppose estimation of the effect of the first treatment, t_1, alone is of interest. That is, θ = t_1, φ = (μ, t_2, ..., t_{k-1}, b_1, ..., b_{r-1}, σ^2). Here,

(σ^2(k − 1)/(kr)) L_θ^{(1)} = ȳ_{1·} − ȳ_{··} − t_1

is the MVBEF, where ȳ_{1·} = r^{-1} Σ_{j=1}^r y_{1j} and ȳ_{··} = (kr)^{-1} Σ_{i=1}^k Σ_{j=1}^r y_{ij}.
where Bk's are given in (2.6). This is verified by considering EFs of the form T-0. If any of the L ^ ' s defined recursively in (2.4) and (2.5) is such that A{θ,φ)L{θk) is of the form T* - θ for a suitable choice of A(θ,φ), then T* is minimum variance unbiased estimator (MVUE) of θ. Thus, the recipe of Section 2 could possibly be of help in finding T*. The following examples illuminate this point. Example 4.1: Consider the model in Example 3.3. Here LQ ' /n — x — θ so that x is MVUE of θ attaining bound B\. Example 4.2: Consider the model in Example 3.4. Here 2L,Θ '/n = x — y — θ so that x — y is MVUE of θ attaining bound B\.
244 Example 4.3: 2
(σ (fc - l)/kr)L^
DURAIRAJAN AND WILLIAM Consider the linear model in Example 3.6.
Here,
= ylm - y.. - tλ so t h a t ylm - ymm is M V U E of tλ.
Example 4.4: Consider Example 3.7. Hear, 0(1 - θ)Ly /n = n\\jn - θ so that nn/n is MVUE of θ. Example 4.5: Consider Example 3.8. Here, ((0 + φ)sntm)~1Lg [SnX(tm) - tmy(Sn)]/(Sntm) as MVUE of θ. Example 4.6: Consider Example 3.9. (σ\,..., σ2) attaining bound i?2
= 0 gives
Here, (S?,... ,Sj?) is MVUE of
Example 4.7: Consider the Neyman-Scott model in Example 3.10. Here 2θ2L%)/n = Σ{xi - yi)2/(2n) - θ so that Σ{Xi - y;)2/(2n) is MVUE of θ attaining bound B