Čížek • Härdle • Weron (Eds.): Statistical Tools for Finance and Insurance

Pavel Čížek • Wolfgang Karl Härdle • Rafał Weron (Eds.)

Statistical Tools for Finance and Insurance, Second Edition
Editors

Pavel Čížek
Tilburg University
Dept. of Econometrics & OR
P.O. Box 90153
5000 LE Tilburg, Netherlands
[email protected]

Rafał Weron
Wrocław University of Technology
Hugo Steinhaus Center
Wyb. Wyspiańskiego 27
50-370 Wrocław, Poland
[email protected]

Prof. Dr. Wolfgang Karl Härdle
Ladislaus von Bortkiewicz Chair of Statistics
C.A.S.E. Centre for Applied Statistics and Economics
School of Business and Economics
Humboldt-Universität zu Berlin
Unter den Linden 6
10099 Berlin, Germany
[email protected]

The majority of chapters have quantlet codes in Matlab or R. These quantlets may be downloaded from http://extras.springer.com directly or via a link on http://springer.com/978-3-642-18061-3 and from www.quantlet.de.

ISBN 978-3-642-18061-3
e-ISBN 978-3-642-18062-0
DOI 10.1007/978-3-642-18062-0
Springer Heidelberg Dordrecht London New York
Library of Congress Control Number: 2011922138

© Springer-Verlag Berlin Heidelberg 2005, 2011

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Cover design: WMXDesign GmbH
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Contents

Contributors

Preface to the second edition

Preface

Frequently used notation

I Finance

1 Models for heavy-tailed asset returns
  Szymon Borak, Adam Misiorek, and Rafal Weron
  1.1 Introduction
  1.2 Stable distributions
      1.2.1 Definitions and basic properties
      1.2.2 Computation of stable density and distribution functions
      1.2.3 Simulation of stable variables
      1.2.4 Estimation of parameters
  1.3 Truncated and tempered stable distributions
  1.4 Generalized hyperbolic distributions
      1.4.1 Definitions and basic properties
      1.4.2 Simulation of generalized hyperbolic variables
      1.4.3 Estimation of parameters
  1.5 Empirical evidence

2 Expected shortfall
  Simon A. Broda and Marc S. Paolella
  2.1 Introduction
  2.2 Expected shortfall for several asymmetric, fat-tailed distributions
      2.2.1 Expected shortfall: definitions and basic results
      2.2.2 Student's t and extensions
      2.2.3 ES for the stable Paretian distribution
      2.2.4 Generalized hyperbolic and its special cases
  2.3 Mixture distributions
      2.3.1 Introduction
      2.3.2 Expected shortfall for normal mixture distributions
      2.3.3 Symmetric stable mixture
      2.3.4 Student's t mixtures
  2.4 Comparison study
  2.5 Lower partial moments
  2.6 Expected shortfall for sums
      2.6.1 Saddlepoint approximation for density and distribution
      2.6.2 Saddlepoint approximation for expected shortfall
      2.6.3 Application to sums of skew normal
      2.6.4 Application to sums of proper generalized hyperbolic
      2.6.5 Application to sums of normal inverse Gaussian
      2.6.6 Application to portfolio returns

3 Modelling conditional heteroscedasticity in nonstationary series
  Pavel Čížek
  3.1 Introduction
  3.2 Parametric conditional heteroscedasticity models
      3.2.1 Quasi-maximum likelihood estimation
      3.2.2 Estimation results
  3.3 Time-varying coefficient models
      3.3.1 Time-varying ARCH models
      3.3.2 Estimation results
  3.4 Pointwise adaptive estimation
      3.4.1 Search for the longest interval of homogeneity
      3.4.2 Choice of critical values
      3.4.3 Estimation results
  3.5 Adaptive weights smoothing
      3.5.1 The AWS algorithm
      3.5.2 Estimation results
  3.6 Conclusion

4 FX smile in the Heston model
  Agnieszka Janek, Tino Kluge, Rafal Weron, and Uwe Wystup
  4.1 Introduction
  4.2 The model
  4.3 Option pricing
      4.3.1 European vanilla FX option prices and Greeks
      4.3.2 Computational issues
      4.3.3 Behavior of the variance process and the Feller condition
      4.3.4 Option pricing by Fourier inversion
  4.4 Calibration
      4.4.1 Qualitative effects of changing the parameters
      4.4.2 The calibration scheme
      4.4.3 Sample calibration results
  4.5 Beyond the Heston model
      4.5.1 Time-dependent parameters
      4.5.2 Jump-diffusion models

5 Pricing of Asian temperature risk
  Fred Espen Benth, Wolfgang Karl Härdle, and Brenda Lopez Cabrera
  5.1 The temperature derivative market
  5.2 Temperature dynamics
  5.3 Temperature futures pricing
      5.3.1 CAT futures and options
      5.3.2 CDD futures and options
      5.3.3 Infering the market price of temperature risk
  5.4 Asian temperature derivatives
      5.4.1 Asian temperature dynamics
      5.4.2 Pricing Asian futures

6 Variance swaps
  Wolfgang Karl Härdle and Elena Silyakova
  6.1 Introduction
  6.2 Volatility trading with variance swaps
  6.3 Replication and hedging of variance swaps
  6.4 Constructing a replication portfolio in practice
  6.5 3G volatility products
      6.5.1 Corridor and conditional variance swaps
      6.5.2 Gamma swaps
  6.6 Equity correlation (dispersion) trading with variance swaps
      6.6.1 Idea of dispersion trading
  6.7 Implementation of the dispersion strategy on DAX index

7 Learning machines supporting bankruptcy prediction
  Wolfgang Karl Härdle, Linda Hoffmann, and Rouslan Moro
  7.1 Bankruptcy analysis
  7.2 Importance of risk classification and Basel II
  7.3 Description of data
  7.4 Calculations
  7.5 Computational results
  7.6 Conclusions

8 Distance matrix method for network structure analysis
  Janusz Miśkiewicz
  8.1 Introduction
  8.2 Correlation distance measures
      8.2.1 Manhattan distance
      8.2.2 Ultrametric distance
      8.2.3 Noise influence on the time series distance
      8.2.4 Manhattan distance noise influence
      8.2.5 Ultrametric distance noise influence
      8.2.6 Entropy distance
  8.3 Distance matrices analysis
  8.4 Examples
      8.4.1 Structure of stock markets
      8.4.2 Dynamics of the network
  8.5 Summary

II Insurance

9 Building loss models
  Krzysztof Burnecki, Joanna Janczura, and Rafal Weron
  9.1 Introduction
  9.2 Claim arrival processes
      9.2.1 Homogeneous Poisson process (HPP)
      9.2.2 Non-homogeneous Poisson process (NHPP)
      9.2.3 Mixed Poisson process
      9.2.4 Renewal process
  9.3 Loss distributions
      9.3.1 Empirical distribution function
      9.3.2 Exponential distribution
      9.3.3 Mixture of exponential distributions
      9.3.4 Gamma distribution
      9.3.5 Log-Normal distribution
      9.3.6 Pareto distribution
      9.3.7 Burr distribution
      9.3.8 Weibull distribution
  9.4 Statistical validation techniques
      9.4.1 Mean excess function
      9.4.2 Tests based on the empirical distribution function
  9.5 Applications
      9.5.1 Calibration of loss distributions
      9.5.2 Simulation of risk processes

10 Ruin probability in finite time
  Krzysztof Burnecki and Marek Teuerle
  10.1 Introduction
      10.1.1 Light- and heavy-tailed distributions
  10.2 Exact ruin probabilities in finite time
      10.2.1 Exponential claim amounts
  10.3 Approximations of the ruin probability in finite time
      10.3.1 Monte Carlo method
      10.3.2 Segerdahl normal approximation
      10.3.3 Diffusion approximation by Brownian motion
      10.3.4 Corrected diffusion approximation
      10.3.5 Diffusion approximation by α-stable Lévy motion
      10.3.6 Finite time De Vylder approximation
  10.4 Numerical comparison of the finite time approximations

11 Property and casualty insurance pricing with GLMs
  Jan Iwanik
  11.1 Introduction
  11.2 Insurance data used in statistical modeling
  11.3 The structure of generalized linear models
      11.3.1 Exponential family of distributions
      11.3.2 The variance and link functions
      11.3.3 The iterative algorithm
  11.4 Modeling claim frequency
      11.4.1 Pre-modeling steps
      11.4.2 The Poisson model
      11.4.3 A numerical example
  11.5 Modeling claim severity
      11.5.1 Data preparation
      11.5.2 A numerical example
  11.6 Some practical modeling issues
      11.6.1 Non-numeric variables and banding
      11.6.2 Functional form of the independent variables
  11.7 Diagnosing frequency and severity models
      11.7.1 Expected value as a function of variance
      11.7.2 Deviance residuals
      11.7.3 Statistical significance of the coefficients
      11.7.4 Uniformity over time
      11.7.5 Selecting the final models
  11.8 Finalizing the pricing models

12 Pricing of catastrophe bonds
  Krzysztof Burnecki, Grzegorz Kukla, and David Taylor
  12.1 Introduction
      12.1.1 The emergence of CAT bonds
      12.1.2 Insurance securitization
      12.1.3 CAT bond pricing methodology
  12.2 Compound doubly stochastic Poisson pricing model
  12.3 Calibration of the pricing model
  12.4 Dynamics of the CAT bond price

13 Return distributions of equity-linked retirement plans
  Nils Detering, Andreas Weber, and Uwe Wystup
  13.1 Introduction
  13.2 The displaced double-exponential jump diffusion model
      13.2.1 Model equation
      13.2.2 Drift adjustment
      13.2.3 Moments, variance and volatility
  13.3 Parameter estimation
      13.3.1 Estimating parameters from financial data
  13.4 Interest rate curve
  13.5 Products
      13.5.1 Classical insurance strategy
      13.5.2 Constant proportion portfolio insurance
      13.5.3 Stop loss strategy
  13.6 Payments to the contract and simulation horizon
  13.7 Cost structures
  13.8 Results without costs
  13.9 Impact of costs
  13.10 Impact of jumps
  13.11 Summary

Index
Contributors

Fred Espen Benth  Center of Mathematics for Applications, University of Oslo
Szymon Borak  Center for Applied Statistics and Economics, Humboldt-Universität zu Berlin
Simon Broda  Department of Quantitative Economics, Amsterdam School of Economics
Krzysztof Burnecki  Hugo Steinhaus Center for Stochastic Methods, Wroclaw University of Technology
Brenda Lopez Cabrera  Center for Applied Statistics and Economics, Humboldt-Universität zu Berlin
Pavel Čížek  Center for Economic Research, Tilburg University
Nils Detering  MathFinance AG, Waldems, Germany
Wolfgang Karl Härdle  Center for Applied Statistics and Economics, Humboldt-Universität zu Berlin, and National Central University, Jhongli, Taiwan
Linda Hoffmann  Center for Applied Statistics and Economics, Humboldt-Universität zu Berlin
Jan Iwanik  RBS Insurance, London
Agnieszka Janek  Institute of Mathematics and Computer Science, Wroclaw University of Technology
Joanna Janczura  Hugo Steinhaus Center for Stochastic Methods, Wroclaw University of Technology
Tino Kluge  MathFinance AG, Waldems, Germany
Grzegorz Kukla  Towarzystwo Ubezpieczeniowe EUROPA S.A., Wroclaw
Adam Misiorek  Santander Consumer Bank S.A., Wroclaw
Janusz Miśkiewicz  Institute of Theoretical Physics, University of Wroclaw
Rouslan Moro  Brunel University, London
Marc Paolella  Swiss Banking Institute, University of Zurich
Dorothea Schäfer  Deutsches Institut für Wirtschaftsforschung e.V., Berlin
Elena Silyakova  Center for Applied Statistics and Economics, Humboldt-Universität zu Berlin
David Taylor  School of Computational and Applied Mathematics, University of the Witwatersrand, Johannesburg
Marek Teuerle  Institute of Mathematics and Computer Science, Wroclaw University of Technology
Andreas Weber  MathFinance AG, Waldems, Germany
Rafal Weron  Institute of Organization and Management, Wroclaw University of Technology
Agnieszka Wyłomańska  Hugo Steinhaus Center for Stochastic Methods, Wroclaw University of Technology
Uwe Wystup  MathFinance AG, Waldems, Germany
Preface to the second edition

The meltdown of financial assets in the fall of 2008 made the consequences of the financial crisis clearly visible to the broad public. The rapid loss of value of asset backed securities, collateralized debt obligations and other structured products was caused by the devaluation of complex financial products. We therefore found it important to revise our book and present up-to-date research in financial statistics and econometrics. We have dropped several chapters, thoroughly revised others and added a lot of new material.

In the Finance part, the revised chapter on stable laws (Chapter 1) seamlessly guides the Reader not only through the computationally intensive techniques for stable distributions, but also through those for tempered stable and generalized hyperbolic laws. This introductory chapter is now complemented by a new text on Expected Shortfall with fat-tailed and mixture distributions (Chapter 2). The book then continues with a new chapter on adaptive heteroscedastic time series modeling (Chapter 3), which smoothly introduces the Reader to Chapter 4 on stochastic volatility modeling with the Heston model. The quantitative analysis of new products like weather derivatives and variance swaps is conducted in two new chapters (5 and 6, respectively). Finally, two powerful classification techniques – learning machines for bankruptcy forecasting and the distance matrix method for market structure analysis – are discussed in the following two chapters (7 and 8, respectively).

In the Insurance part, two classical chapters on building loss models (Chapter 9) and on ruin probabilities (Chapter 10) are followed by a new text on property and casualty insurance with GLMs (Chapter 11). We then turn to products linking the finance and insurance worlds. Pricing of catastrophe bonds is discussed in Chapter 12 and a new chapter introduces the pricing and cost structures of equity linked retirement plans (Chapter 13).
The majority of chapters have quantlet codes in Matlab or R. These quantlets may be downloaded from the Springer.com page or from www.quantlet.de.

Finally, we would like to thank Barbara Choros, Richard Song, and Weining Wang for their help in the text management.

Pavel Čížek, Wolfgang Karl Härdle, and Rafal Weron
Tilburg, Berlin, and Wroclaw, January 2011
Preface

This book is designed for students, researchers and practitioners who want to be introduced to modern statistical tools applied in finance and insurance. It is the result of a joint effort of the Center for Economic Research (CentER), Center for Applied Statistics and Economics (C.A.S.E.) and Hugo Steinhaus Center for Stochastic Methods (HSC). All three institutions brought in their specific profiles and created with this book a wide-angle view on and solutions to up-to-date practical problems.

The text is comprehensible for a graduate student in financial engineering as well as for an inexperienced newcomer to quantitative finance and insurance who wants to get a grip on advanced statistical tools applied in these fields. An experienced reader with a sound knowledge of financial and actuarial mathematics will probably skip some sections but will hopefully enjoy the various computational tools. Finally, a practitioner might be familiar with some of the methods. However, the statistical techniques related to modern financial products, like MBS or CAT bonds, will certainly attract his or her attention.

"Statistical Tools for Finance and Insurance" consists naturally of two main parts. Each part contains chapters with a high focus on practical applications. The book starts with an introduction to stable distributions, which are the standard model for heavy tailed phenomena. Their numerical implementation is thoroughly discussed and applications to finance are given. The second chapter presents the ideas of extreme value and copula analysis as applied to multivariate financial data. This topic is extended in the subsequent chapter which deals with tail dependence, a concept describing the limiting proportion that one margin exceeds a certain threshold given that the other margin has already exceeded that threshold. The fourth chapter reviews the market in catastrophe insurance risk, which emerged in order to facilitate the direct transfer of reinsurance risk associated with natural catastrophes from corporations, insurers, and reinsurers to capital market investors. The next contribution employs functional data analysis for the estimation of smooth implied volatility surfaces. These surfaces are a result of applying an oversimplified market benchmark model – the Black-Scholes formula – to real data. An attractive approach to
overcome this problem is discussed in chapter six, where implied trinomial trees are applied to modeling implied volatilities and the corresponding state-price densities. An alternative route to tackling the implied volatility smile has led researchers to develop stochastic volatility models. The relative simplicity and the direct link of model parameters to the market makes Heston’s model very attractive to front office users. Its application to FX option markets is covered in chapter seven. The following chapter shows how the computational complexity of stochastic volatility models can be overcome with the help of the Fast Fourier Transform. In chapter nine the valuation of Mortgage Backed Securities is discussed. The optimal prepayment policy is obtained via optimal stopping techniques. It is followed by a very innovative topic of predicting corporate bankruptcy with Support Vector Machines. Chapter eleven presents a novel approach to money-demand modeling using fuzzy clustering techniques. The first part of the book closes with productivity analysis for cost and frontier estimation. The nonparametric Data Envelopment Analysis is applied to efficiency issues of insurance agencies. The insurance part of the book starts with a chapter on loss distributions. The basic models for claim severities are introduced and their statistical properties are thoroughly explained. In chapter fourteen, the methods of simulating and visualizing the risk process are discussed. This topic is followed by an overview of the approaches to approximating the ruin probability of an insurer. Both finite and infinite time approximations are presented. Some of these methods are extended in chapters sixteen and seventeen, where classical and anomalous diffusion approximations to ruin probability are discussed and extended to cases when the risk process exhibits good and bad periods. The last three chapters are related to one of the most important aspects of the insurance business – premium calculation. Chapter eighteen introduces the basic concepts including the pure risk premium and various safety loadings under different loss distributions. Calculation of a joint premium for a portfolio of insurance policies in the individual and collective risk models is discussed as well. The inclusion of deductibles into premium calculation is the topic of the following contribution. The last chapter of the insurance part deals with setting the appropriate level of insurance premium within a broader context of business decisions, including risk transfer through reinsurance and the rate of return on capital required to ensure solvability. Our e-book offers a complete PDF version of this text and the corresponding HTML files with links to algorithms and quantlets. The reader of this book may therefore easily reconfigure and recalculate all the presented examples and methods via the enclosed XploRe Quantlet Server (XQS), which is also
available from www.xplore-stat.de and www.quantlet.com. A tutorial chapter explaining how to set up and use XQS can be found in the third and final part of the book.

We gratefully acknowledge the support of Deutsche Forschungsgemeinschaft (SFB 373 Quantifikation und Simulation Ökonomischer Prozesse, SFB 649 Ökonomisches Risiko) and Komitet Badań Naukowych (PBZ-KBN 016/P03/99 Mathematical models in analysis of financial instruments and markets in Poland). A book of this kind would not have been possible without the help of many friends, colleagues, and students. For the technical production of the e-book platform and quantlets we would like to thank Zdeněk Hlávka, Sigbert Klinke, Heiko Lehmann, Adam Misiorek, Piotr Uniejewski, Qingwei Wang, and Rodrigo Witzel. Special thanks for careful proofreading and supervision of the insurance part go to Krzysztof Burnecki.

Pavel Čížek, Wolfgang Härdle, and Rafal Weron
Tilburg, Berlin, and Wroclaw, February 2005
Frequently used notation

x def= ...        x is defined as ...
[x]               integer part of x
x ≈ y             x is approximately equal to y
A⊤                transpose of matrix A
(F ◦ G)(x)        F{G(x)} for functions F and G
I                 indicator function
R                 real numbers
a_n, b_n, ...     sequences of real numbers or vectors
α_n = O(β_n)      α_n/β_n → const. as n → ∞
α_n = o(β_n)      α_n/β_n → 0 as n → ∞
X ∼ D             the random variable X has a distribution D
P(A)              probability of a set A
E(X)              expected value of random variable X
Var(X)            variance of random variable X
Cov(X, Y)         covariance of two random variables X and Y
N(μ, Σ)           normal distribution with expectation μ and covariance matrix Σ; a similar notation is used if Σ is the correlation matrix
Φ                 standard normal cumulative distribution function
ϕ                 standard normal density function
χ²_p              chi-squared distribution with p degrees of freedom
t_p               t-distribution (Student's) with p degrees of freedom
W_t               Wiener process
F_t               the information set generated by all information available at time t
A_n, B_n, ...     sequences of random variables
A_n = O_p(B_n)    ∀ε > 0 ∃M, ∃N such that P[|A_n/B_n| > M] < ε, ∀n > N
A_n = o_p(B_n)    ∀ε > 0: lim_{n→∞} P[|A_n/B_n| > ε] = 0
Part I
Finance
1 Models for heavy-tailed asset returns Szymon Borak, Adam Misiorek, and Rafal Weron
1.1 Introduction

Many of the concepts in theoretical and empirical finance developed over the past decades – including the classical portfolio theory, the Black-Scholes-Merton option pricing model or the RiskMetrics variance-covariance approach to Value at Risk (VaR) – rest upon the assumption that asset returns follow a normal distribution. But this assumption is not justified by empirical data! Rather, the empirical observations exhibit excess kurtosis, more colloquially known as fat tails or heavy tails (Guillaume et al., 1997; Rachev and Mittnik, 2000). The contrast with the Gaussian law can be striking, as in Figure 1.1 where we illustrate this phenomenon using a ten-year history of the Dow Jones Industrial Average (DJIA) index.

In the context of VaR calculations, the problem of the underestimation of risk by the Gaussian distribution has been dealt with by the regulators in an ad hoc way. The Basle Committee on Banking Supervision (1995) suggested that for the purpose of determining minimum capital reserves financial institutions use a 10-day VaR at the 99% confidence level multiplied by a safety factor s ∈ [3, 4]. Stahl (1997) and Danielsson, Hartmann and De Vries (1998) argue convincingly that the range of s is a result of the heavy-tailed nature of asset returns. Namely, if we assume that the distribution is symmetric and has finite variance σ², then Chebyshev's inequality yields P(Loss ≥ ε) ≤ σ²/(2ε²). Setting the right hand side to 1% gives the upper bound VaR_99% ≤ 7.07σ. On the other hand, if we assume that returns are normally distributed, we arrive at VaR_99% = 2.33σ, which is roughly three times lower than the bound obtained for a heavy-tailed, finite-variance distribution.
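To make the comparison concrete, the following minimal R sketch (R being one of the languages used for the book's quantlets) reproduces the two numbers quoted above; the 1% level and unit σ are illustrative choices only.

  # Chebyshev bound for a symmetric, finite-variance loss distribution vs. the
  # Gaussian quantile at the 99% level (sigma = 1 is an arbitrary illustrative scale).
  sigma <- 1
  var_chebyshev <- sqrt(0.5 / 0.01) * sigma  # solve sigma^2/(2*eps^2) = 0.01 for eps
  var_gaussian  <- qnorm(0.99) * sigma       # 99% quantile of N(0, sigma^2)
  round(c(chebyshev = var_chebyshev, gaussian = var_gaussian), 2)  # 7.07 and 2.33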
Figure 1.1: Left panel: Returns log(X_{t+1}/X_t) of the DJIA daily closing values X_t from the period January 3, 2000 – December 31, 2009. Right panel: Gaussian fit to the empirical cumulative distribution function (cdf) of the returns on a double logarithmic scale (only the left tail fit is displayed). STFstab01
Being aware of the underestimation of risk by the Gaussian law we should consider using heavy-tailed alternatives. This chapter is intended as a guide to such models. In Section 1.2 we describe the historically oldest heavy-tailed model – the stable laws. Next, in Section 1.3 we briefly characterize their recent lighter-tailed generalizations, the so-called truncated and tempered stable distributions. In Section 1.4 we study the class of generalized hyperbolic laws, which – like tempered stable distributions – can be classified somewhere between infinite variance stable laws and the Gaussian distribution. Finally, in Section 1.5 we provide numerical examples.
1.2 Stable distributions

1.2.1 Definitions and basic properties

The theoretical rationale for modeling asset returns by the Gaussian distribution comes from the Central Limit Theorem (CLT), which states that the sum of a large number of independent, identically distributed (i.i.d.) variables – say, decisions of investors – from a finite-variance distribution will be (asymptotically) normally distributed.
Figure 1.2: Left panel: A semi-logarithmic plot of symmetric (β = μ = 0) stable densities for four values of α. Note, the distinct behavior of the Gaussian (α = 2) distribution. Right panel: A plot of stable densities for α = 1.2 and four values of β. STFstab02
Yet, this beautiful theoretical result has been notoriously contradicted by empirical findings. Possible reasons for the failure of the CLT in financial markets are (i) infinite-variance distributions of the variables, (ii) non-identical distributions of the variables, (iii) dependences between the variables or (iv) any combination of the three. If only the finite variance assumption is released we have a straightforward solution by virtue of the generalized CLT, which states that the limiting distribution of sums of such variables is stable (Nolan, 2010). This, together with the fact that stable distributions are leptokurtic and can accommodate fat tails and asymmetry, has led to their use as an alternative model for asset returns since the 1960s.

Stable laws – also called α-stable, stable Paretian or Lévy stable – were introduced by Paul Lévy in the 1920s. The name 'stable' reflects the fact that a sum of two independent random variables having a stable distribution with the same index α is again stable with index α. This invariance property holds also for Gaussian variables. In fact, the Gaussian distribution is stable with α = 2.

For a complete description, the stable distribution requires four parameters. The index of stability α ∈ (0, 2], also called the tail index, tail exponent or characteristic exponent, determines the rate at which the tails of the distribution taper off, see the left panel in Figure 1.2. The skewness parameter β ∈ [−1, 1] defines the asymmetry. When β > 0, the distribution is skewed to the right, i.e.
the right tail is thicker, see the right panel in Figure 1.2. When it is negative, it is skewed to the left. When β = 0, the distribution is symmetric about the mode (the peak) of the distribution. As α approaches 2, β loses its effect and the distribution approaches the Gaussian distribution regardless of β. The last two parameters, σ > 0 and μ ∈ R, are the usual scale and location parameters, respectively.

A far-reaching feature of the stable distribution is the fact that its probability density function (pdf) and cumulative distribution function (cdf) do not have closed form expressions, with the exception of three special cases. The best known of these is the Gaussian (α = 2) law whose pdf is given by:

  f_G(x) = 1/(√(2π) σ) · exp{−(x − μ)²/(2σ²)}.   (1.1)

The other two are the lesser known Cauchy (α = 1, β = 0) and Lévy (α = 0.5, β = 1) laws. Consequently, the stable distribution can be most conveniently described by its characteristic function (cf) – the inverse Fourier transform of the pdf. The most popular parameterization of the characteristic function φ(t) of X ∼ S_α(σ, β, μ), i.e. a stable random variable with parameters α, σ, β and μ, is given by (Samorodnitsky and Taqqu, 1994; Weron, 1996):

  log φ(t) = −σ^α |t|^α {1 − iβ sign(t) tan(πα/2)} + iμt,     for α ≠ 1,
  log φ(t) = −σ|t| {1 + iβ sign(t) (2/π) log|t|} + iμt,       for α = 1.   (1.2)

Note, that the traditional scale parameter σ of the Gaussian distribution is not the same as σ in the above representation. A comparison of formulas (1.1) and (1.2) yields the relation: σ_Gaussian = √2 σ.

For numerical purposes, it is often useful to use Nolan's (1997) parameterization:

  log φ_0(t) = −σ^α |t|^α {1 + iβ sign(t) tan(πα/2) [(σ|t|)^{1−α} − 1]} + iμ_0 t,   for α ≠ 1,
  log φ_0(t) = −σ|t| {1 + iβ sign(t) (2/π) log(σ|t|)} + iμ_0 t,                     for α = 1,   (1.3)

which yields a cf (and hence the pdf and cdf) jointly continuous in all four parameters. The location parameters of the two representations (S and S^0) are related by μ = μ_0 − βσ tan(πα/2) for α ≠ 1 and μ = μ_0 − βσ (2/π) log σ for α = 1.
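As a quick illustration of representation (1.2), the following R sketch evaluates the stable characteristic function; stable_cf is our own helper name, not a function from any package.

  # Characteristic function of S_alpha(sigma, beta, mu) in parameterization (1.2).
  # For alpha = 1 evaluate at t != 0, since log|t| is singular at zero.
  stable_cf <- function(t, alpha, sigma = 1, beta = 0, mu = 0) {
    if (alpha != 1) {
      lphi <- -sigma^alpha * abs(t)^alpha *
        (1 - 1i * beta * sign(t) * tan(pi * alpha / 2)) + 1i * mu * t
    } else {
      lphi <- -sigma * abs(t) *
        (1 + 1i * beta * sign(t) * (2 / pi) * log(abs(t))) + 1i * mu * t
    }
    exp(lphi)
  }
  stable_cf(c(-1, 0.5, 2), alpha = 1.7, sigma = 1, beta = 0.5)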
The 'fatness' of the tails of a stable distribution can be derived from the following property: the p-th moment of a stable random variable is finite if and only if p < α. Hence, when α > 1 the mean of the distribution exists (and is equal to μ). On the other hand, when α < 2 the variance is infinite and the tails exhibit a power-law behavior (i.e. they are asymptotically equivalent to a Pareto law). More precisely, using a CLT type argument it can be shown that (Janicki and Weron, 1994a; Samorodnitsky and Taqqu, 1994):

  lim_{x→∞} x^α P(X > x)  = C_α (1 + β) σ^α,
  lim_{x→∞} x^α P(X < −x) = C_α (1 − β) σ^α,   (1.4)

where C_α = (2 ∫_0^∞ x^{−α} sin(x) dx)^{−1} = (1/π) Γ(α) sin(πα/2). The convergence to the power-law tail varies for different α's and is slower for larger values of the tail index. Moreover, the tails of stable cdfs exhibit a crossover from an approximate power decay with exponent α > 2 to the true tail with exponent α. This phenomenon is more visible for large α's (Weron, 2001).
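The tail formula (1.4) is easy to evaluate numerically; the short R sketch below (our own code, with illustrative parameter values) computes the asymptotic approximation of the right-tail probability P(X > x).

  # Asymptotic right-tail probability of a stable law, eq. (1.4):
  # P(X > x) ~ C_alpha * (1 + beta) * sigma^alpha * x^(-alpha) for large x.
  stable_tail_approx <- function(x, alpha, sigma = 1, beta = 0) {
    C_alpha <- gamma(alpha) * sin(pi * alpha / 2) / pi
    C_alpha * (1 + beta) * sigma^alpha * x^(-alpha)
  }
  stable_tail_approx(x = c(5, 10, 20), alpha = 1.7, sigma = 1, beta = 0)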
1.2.2 Computation of stable density and distribution functions

The lack of closed form formulas for most stable densities and distribution functions has far-reaching consequences. Numerical approximation or direct numerical integration have to be used instead of analytical formulas, leading to a drastic increase in computational time and loss of accuracy. Despite a few early attempts in the 1970s, efficient and general techniques were not developed until the late 1990s.

Mittnik, Doganoglu and Chenyao (1999) exploited the pdf–cf relationship and applied the fast Fourier transform (FFT). However, for data points falling between the equally spaced FFT grid nodes an interpolation technique has to be used. The authors suggested that linear interpolation suffices in most practical applications, see also Rachev and Mittnik (2000). Taking a larger number of grid points increases accuracy, however, at the expense of higher computational burden. Setting the number of grid points to N = 2^13 and the grid spacing to h = 0.01 allows one to achieve accuracy comparable to that of the direct integration method (see below), at least for the typically used values of α > 1.6. As for the computational speed, the FFT based approach is faster for large samples, whereas the direct integration method favors small data sets since it can be computed at any arbitrarily chosen point. Mittnik, Doganoglu and Chenyao (1999) report that for N = 2^13 the FFT based method is faster
for samples exceeding 100 observations and slower for smaller data sets. We must stress, however, that the FFT based approach is not as universal as the direct integration method – it is efficient only for large alpha's and only as far as the pdf calculations are concerned. When computing the cdf the former method must numerically integrate the density, whereas the latter takes the same amount of time in both cases.

The direct integration method, proposed by Nolan (1997, 1999), consists of a numerical integration of Zolotarev's (1986) formulas for the density or the distribution function. Set ζ = −β tan(πα/2). Then the density f(x; α, β) of a standard stable random variable in representation S^0, i.e. X ∼ S^0_α(1, β, 0), can be expressed as (note, that Zolotarev (1986, Section 2.2) used another parametrization):

• when α ≠ 1 and x ≠ ζ:

  f(x; α, β) = [α(x − ζ)^{1/(α−1)} / (π|α − 1|)] ∫_{−ξ}^{π/2} V(θ; α, β) exp{−(x − ζ)^{α/(α−1)} V(θ; α, β)} dθ   (1.5)

  for x > ζ, and f(x; α, β) = f(−x; α, −β) for x < ζ,

• when α ≠ 1 and x = ζ:

  f(x; α, β) = Γ(1 + 1/α) cos(ξ) / [π(1 + ζ²)^{1/(2α)}],

• when α = 1:

  f(x; 1, β) = (1/(2|β|)) e^{−πx/(2β)} ∫_{−π/2}^{π/2} V(θ; 1, β) exp{−e^{−πx/(2β)} V(θ; 1, β)} dθ   for β ≠ 0,
  f(x; 1, 0) = 1/{π(1 + x²)},

where

  ξ = (1/α) arctan(−ζ)   for α ≠ 1,      ξ = π/2   for α = 1,   (1.6)

and

  V(θ; α, β) = (cos αξ)^{1/(α−1)} [cos θ / sin α(ξ + θ)]^{α/(α−1)} cos{αξ + (α − 1)θ}/cos θ   for α ≠ 1,
  V(θ; 1, β) = (2/π) [(π/2 + βθ)/cos θ] exp{(1/β)(π/2 + βθ) tan θ}   for α = 1, β ≠ 0.
The distribution F(x; α, β) of a standard stable random variable in representation S^0 can be expressed as:

• when α ≠ 1 and x ≠ ζ:

  F(x; α, β) = c_1(α, β) + [sign(1 − α)/π] ∫_{−ξ}^{π/2} exp{−(x − ζ)^{α/(α−1)} V(θ; α, β)} dθ

  for x > ζ, and F(x; α, β) = 1 − F(−x; α, −β) for x < ζ, where c_1(α, β) = (1/π)(π/2 − ξ) for α < 1 and c_1(α, β) = 1 for α > 1,

• when α ≠ 1 and x = ζ:

  F(x; α, β) = (1/π)(π/2 − ξ),

• when α = 1:

  F(x; 1, β) = (1/π) ∫_{−π/2}^{π/2} exp{−e^{−πx/(2β)} V(θ; 1, β)} dθ   for β > 0,
  F(x; 1, 0) = 1/2 + (1/π) arctan x,
  F(x; 1, β) = 1 − F(x; 1, −β)   for β < 0.
Formula (1.5) requires numerical integration of the function g(·) exp{−g(·)}, where g(θ; x, α, β) = (x − ζ)^{α/(α−1)} V(θ; α, β). The integrand is 0 at −ξ, increases monotonically to a maximum of 1/e at the point θ* for which g(θ*; x, α, β) = 1, and then decreases monotonically to 0 at π/2 (Nolan, 1997). However, in some cases the integrand becomes very peaked and numerical algorithms can miss the spike and underestimate the integral. To avoid this problem we need to find the argument θ* of the peak numerically and compute the integral as a sum of two integrals: one from −ξ to θ* and the other from θ* to π/2.

To the best of our knowledge, currently no statistical computing environment offers the computation of stable density and distribution functions in its standard release. Users have to rely on third-party libraries or commercial products. A few are worth mentioning. The standalone program STABLE is probably the most efficient (downloadable from John Nolan's web page: academic2.american.edu/~jpnolan/stable/stable.html). It was written in Fortran
and calls several external IMSL routines, see Nolan (1997) for details. Apart from speed, the STABLE program also exhibits high relative accuracy (ca. 10^{−13} for default tolerance settings) for extreme tail events and 10^{−10} for values used in typical financial applications (like approximating asset return distributions). The STABLE program is also available in library form through Robust Analysis Inc. (www.robustanalysis.com). This library provides interfaces to Matlab, S-plus/R and Mathematica.

In the late 1990s Diethelm Würtz initiated the development of Rmetrics, an open source collection of S-plus/R software packages for computational finance (www.rmetrics.org). In the fBasics package stable pdf and cdf calculations are performed using the direct integration method, with the integrals being computed by R's function integrate. On the other hand, the FFT based approach is utilized in Cognity, a commercial risk management platform that offers derivatives pricing and portfolio optimization based on the assumption of stably distributed returns (www.finanalytica.com). The FFT implementation is also available in Matlab (stablepdf_fft.m) from the Statistical Software Components repository (ideas.repec.org/c/boc/bocode/m429004.html).
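For readers without access to the tools listed above, the pdf–cf relationship itself can be exploited directly. The R sketch below (our own code, not the fBasics or Cognity implementation) recovers the symmetric standard stable density by numerically inverting its characteristic function, which is the idea underlying the FFT-based approach.

  # Symmetric standard stable density via inversion of phi(t) = exp(-|t|^alpha):
  # f(x) = (1/pi) * integral_0^Inf cos(t*x) * exp(-t^alpha) dt.
  dstable_sym <- function(x, alpha) {
    sapply(x, function(xi)
      integrate(function(t) cos(t * xi) * exp(-t^alpha),
                lower = 0, upper = Inf)$value / pi)
  }
  # Sanity check: for alpha = 2 this is the N(0, sqrt(2)) density, cf. the remark below eq. (1.2)
  dstable_sym(1, alpha = 2)   # ~0.2197
  dnorm(1, sd = sqrt(2))      # ~0.2197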
1.2.3 Simulation of stable variables

Simulating sequences of stable random variables is not straightforward, since there are no analytic expressions for the inverse F^{−1}(x) nor the cdf F(x) itself. All standard approaches like the rejection or the inversion methods would require tedious computations. A much more elegant and efficient solution was proposed by Chambers, Mallows and Stuck (1976). They noticed that a certain integral formula derived by Zolotarev (1964) led to the following algorithm:

• generate a random variable U uniformly distributed on (−π/2, π/2) and an independent exponential random variable W with mean 1;

• for α ≠ 1 compute:

  X = (1 + ζ²)^{1/(2α)} · [sin{α(U + ξ)} / {cos(U)}^{1/α}] · [cos{U − α(U + ξ)} / W]^{(1−α)/α},   (1.7)

• for α = 1 compute:

  X = (1/ξ) [(π/2 + βU) tan U − β log{(π/2) W cos U / (π/2 + βU)}],   (1.8)
where ξ is given by eqn. (1.6). This algorithm yields a random variable X ∼ S_α(1, β, 0), in representation (1.2). For a detailed proof see Weron (1996).

Given the formulas for simulation of a standard stable random variable, we can easily simulate a stable random variable for all admissible values of the parameters α, σ, β and μ using the following property. If X ∼ S_α(1, β, 0) then

  Y = σX + μ                      for α ≠ 1,
  Y = σX + (2/π) βσ log σ + μ     for α = 1,   (1.9)

is S_α(σ, β, μ). It is interesting to note that for α = 2 (and β = 0) the Chambers-Mallows-Stuck (CMS) method reduces to the well known Box-Muller algorithm for generating Gaussian random variables. Many other approaches have been proposed in the literature, including application of Bergström and LePage series expansions (Janicki and Weron, 1994b). However, the CMS method is regarded as the fastest and the most accurate. Because of its unquestioned superiority and relative simplicity, it is implemented in some statistical computing environments (e.g. the rstable function in S-plus/R) even if no other routines related to stable distributions are provided. It is also available in Matlab (function stablernd.m) from the SSC repository (ideas.repec.org/c/boc/bocode/m429003.html).
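A compact R implementation of the CMS algorithm (1.7)-(1.9) might look as follows; rstable_cms is our own function name, and the sketch assumes the parameterization (1.2) discussed above.

  # Chambers-Mallows-Stuck generator for S_alpha(sigma, beta, mu).
  rstable_cms <- function(n, alpha, sigma = 1, beta = 0, mu = 0) {
    U <- runif(n, -pi / 2, pi / 2)   # uniform on (-pi/2, pi/2)
    W <- rexp(n, rate = 1)           # exponential with mean 1
    if (alpha != 1) {
      zeta <- -beta * tan(pi * alpha / 2)
      xi   <- atan(-zeta) / alpha
      X <- (1 + zeta^2)^(1 / (2 * alpha)) *
        sin(alpha * (U + xi)) / cos(U)^(1 / alpha) *
        (cos(U - alpha * (U + xi)) / W)^((1 - alpha) / alpha)
      Y <- sigma * X + mu                                        # eq. (1.9), alpha != 1
    } else {
      X <- (2 / pi) * ((pi / 2 + beta * U) * tan(U) -
             beta * log((pi / 2 * W * cos(U)) / (pi / 2 + beta * U)))
      Y <- sigma * X + (2 / pi) * beta * sigma * log(sigma) + mu  # eq. (1.9), alpha = 1
    }
    Y
  }
  set.seed(1)
  x <- rstable_cms(1e5, alpha = 1.7, sigma = 1, beta = 0)

For α = 2 and β = 0 the output reduces to 2 sin(U)√W, i.e. √2 times a standard normal variate, which provides a quick sanity check of the implementation.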
1.2.4 Estimation of parameters

The lack of known closed-form density functions also complicates statistical inference for stable distributions. For instance, maximum likelihood (ML) estimates have to be based on numerical approximations or direct numerical integration of the formulas presented in Section 1.2.2. Consequently, ML estimation is difficult to implement and time consuming for samples encountered in modern finance. However, there are also other numerical methods that have been found useful in practice and are discussed in this section.

Given a sample x_1, ..., x_n of i.i.d. S_α(σ, β, μ) observations, in what follows, we provide estimates α̂, σ̂, β̂ and μ̂ of all four stable law parameters. We start the discussion with the simplest, fastest and ... least accurate quantile methods, then develop the slower, yet much more accurate sample cf methods and, finally, conclude with the slowest but most accurate ML approach. All of the presented methods work quite well assuming that the sample under consideration is indeed stable.
However, testing for stability is not an easy task. Despite some more or less successful attempts (Brcich, Iskander and Zoubir, 2005; Paolella, 2001; Matsui and Takemura, 2008), there are no standard, widely-accepted tests for assessing stability. A possible remedy may be to use bootstrap (or Monte Carlo simulation) techniques, as discussed in Chapter 9 in the context of insurance loss distributions. Other proposed approaches involve using tail exponent estimators for testing if α is in the admissible range (Fan, 2006; Mittnik and Paolella, 1999) or simply 'visual inspection' to see whether the empirical densities resemble those of stable laws (Nolan, 2001; Weron, 2001).

Sample Quantile Methods. The origins of sample quantile methods for stable laws go back to Fama and Roll (1971), who provided very simple estimates for parameters of symmetric (β = 0, μ = 0) stable laws with α > 1. A decade later McCulloch (1986) generalized their method and provided consistent estimators of all four stable parameters (with the restriction α ≥ 0.6). After McCulloch define:

  v_α = (x_{0.95} − x_{0.05}) / (x_{0.75} − x_{0.25})   and   v_β = (x_{0.95} + x_{0.05} − 2x_{0.50}) / (x_{0.95} − x_{0.05}),   (1.10)
where x_f denotes the f-th population quantile, so that S_α(σ, β, μ)(x_f) = f. Statistics v_α and v_β are functions of α and β only, i.e. they are independent of both σ and μ. This relationship may be inverted and the parameters α and β may be viewed as functions of v_α and v_β:

  α = ψ_1(v_α, v_β)   and   β = ψ_2(v_α, v_β).   (1.11)
Substituting v_α and v_β by their sample values and applying linear interpolation between values found in tables given in McCulloch (1986) yields estimators α̂ and β̂. Scale and location parameters, σ and μ, can be estimated in a similar way. However, due to the discontinuity of the cf for α = 1 and β ≠ 0 in representation (1.2), this procedure is much more complicated.

In a recent paper, Dominicy and Veredas (2010) further extended the quantile approach by introducing the method of simulated quantiles. It is a promising approach which can also handle multidimensional cases as, for instance, the joint estimation of N univariate stable distributions (but with the constraint of a common tail index).
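The first step of McCulloch's method is straightforward to code; the R sketch below computes the sample statistics v_α and v_β of (1.10), while the subsequent look-up and interpolation in McCulloch's tables is not reproduced here.

  # Sample counterparts of v_alpha and v_beta from eq. (1.10).
  mcculloch_v <- function(x) {
    q <- quantile(x, probs = c(0.05, 0.25, 0.50, 0.75, 0.95), names = FALSE)
    c(v_alpha = (q[5] - q[1]) / (q[4] - q[2]),
      v_beta  = (q[5] + q[1] - 2 * q[3]) / (q[5] - q[1]))
  }
  mcculloch_v(rnorm(1e4))  # for Gaussian data v_alpha is close to 2.44 and v_beta to 0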
Sample Characteristic Function Methods. Given an i.i.d. random sample x_1, ..., x_n of size n, define the sample cf by φ̂(t) = (1/n) Σ_{j=1}^{n} exp(itx_j). Since |φ̂(t)| is bounded by unity, all moments of φ̂(t) are finite and, for any fixed t, it is the sample average of i.i.d. random variables exp(itx_j). Hence, by the law of large numbers, φ̂(t) is a consistent estimator of the cf φ(t).

To the best of our knowledge, Press (1972) was the first to use the sample cf in the context of statistical inference for stable laws. He proposed a simple estimation method for all four parameters, called the method of moments, based on transformations of the cf. However, the convergence of this method to the population values depends on the choice of four estimation points, whose selection is problematic.

Koutrouvelis (1980) presented a much more accurate regression-type method which starts with an initial estimate of the parameters and proceeds iteratively until some prespecified convergence criterion is satisfied. Each iteration consists of two weighted regression runs. The number of points to be used in these regressions depends on the sample size and starting values of α. Typically no more than two or three iterations are needed. The speed of the convergence, however, depends on the initial estimates and the convergence criterion.

The regression method is based on the following observations concerning the cf φ(t). First, from (1.2) we can easily derive:

  log(−log |φ(t)|²) = log(2σ^α) + α log|t|.   (1.12)

The real and imaginary parts of φ(t) are for α ≠ 1 given by:

  Re{φ(t)} = exp(−|σt|^α) cos[μt + |σt|^α β sign(t) tan(πα/2)],   (1.13)
  Im{φ(t)} = exp(−|σt|^α) sin[μt + |σt|^α β sign(t) tan(πα/2)].   (1.14)
Apart from considerations of principal values, equations (1.13)-(1.14) lead to:

  arctan( Im{φ(t)} / Re{φ(t)} ) = μt + βσ^α tan(πα/2) sign(t) |t|^α.   (1.15)

Equation (1.12) depends only on α and σ and suggests that we can estimate these two parameters by regressing y = log(−log |φ_n(t)|²) on w = log|t| in the model:

  y_k = m + α w_k + ε_k,

where t_k is an appropriate set of real numbers, m = log(2σ^α), and ε_k denotes an error term. Koutrouvelis (1980) proposed to use t_k = πk/25, k = 1, 2, ..., K; with K ranging between 9 and 134 for different values of α and sample sizes.
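The core regression (1.12) can be sketched in a few lines of R; the function below is our own illustration and uses an arbitrary grid of points t_k rather than the data-dependent choices of Koutrouvelis (1980) or Kogon and Williams (1998).

  # Estimate alpha and sigma by regressing log(-log|phi_hat(t)|^2) on log|t|, eq. (1.12).
  cf_regression <- function(x, t = seq(0.1, 1, by = 0.1)) {
    phi_hat <- sapply(t, function(tk) mean(exp(1i * tk * x)))  # sample cf
    fit <- lm(log(-log(Mod(phi_hat)^2)) ~ log(t))
    alpha_hat <- unname(coef(fit)[2])
    sigma_hat <- (exp(unname(coef(fit)[1])) / 2)^(1 / alpha_hat)  # from m = log(2 sigma^alpha)
    c(alpha = alpha_hat, sigma = sigma_hat)
  }
  cf_regression(rnorm(1e4, sd = sqrt(2)))  # should return alpha close to 2 and sigma close to 1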
Once α̂ and σ̂ have been obtained and α and σ have been fixed at these values, estimates of β and μ can be obtained using (1.15). Next, the regressions are repeated with α̂, σ̂, β̂ and μ̂ as the initial parameters. The iterations continue until a prespecified convergence criterion is satisfied. Koutrouvelis proposed to use Fama and Roll's (1971) formula and the 25% truncated mean for initial estimates of σ and μ, respectively.

Kogon and Williams (1998) eliminated this iteration procedure and simplified the regression method. For initial estimation they applied McCulloch's method, worked with the continuous representation (1.3) of the cf instead of the classical one (1.2) and used a fixed set of only 10 equally spaced frequency points t_k. In terms of computational speed their method compares favorably to the original regression method. It is over five times faster than the procedure of Koutrouvelis, but still about three times slower than the quantile method of McCulloch (Weron, 2004). It has a significantly better performance near α = 1 and β ≠ 0 due to the elimination of discontinuity of the cf. However, it returns slightly worse results for other values of α.

Matlab implementations of McCulloch's quantile technique (stabcull.m) and the regression approach of Koutrouvelis (stabreg.m) are distributed with the MFE Toolbox accompanying the monograph of Weron (2006) and can be downloaded from www.ioz.pwr.wroc.pl/pracownicy/weron/MFE.htm.

Maximum Likelihood Method. For a vector of observations x = (x_1, ..., x_n), the maximum likelihood (ML) estimate of the parameter vector θ = (α, σ, β, μ) is obtained by maximizing the log-likelihood function:

  L_θ(x) = Σ_{i=1}^{n} log f̃(x_i; θ),   (1.16)
where f̃(·; θ) is the stable pdf. The tilde reflects the fact that, in general, we do not know the explicit form of the stable density and have to approximate it numerically. The ML methods proposed in the literature differ in the choice of the approximating algorithm. However, all of them have an appealing common feature – under certain regularity conditions the ML estimator is asymptotically normal with the variance specified by the Fisher information matrix (DuMouchel, 1973). The latter can be approximated either by using the Hessian matrix arising in maximization or, as in Nolan (2001), by numerical integration.

Because of computational complexity there are only a few documented attempts of estimating stable law parameters via maximum likelihood worth mentioning. DuMouchel (1971) developed an approximate ML method, which was based on
grouping the data set into bins and using a combination of means to compute the density (FFT for the central values of x and series expansions for the tails) to compute an approximate log-likelihood function. This function was then numerically maximized. Much better, in terms of accuracy and computational time, are more recent ML estimation techniques. Mittnik et al. (1999) utilized the FFT approach for approximating the stable density function, whereas Nolan (2001) used the direct integration method. Both approaches are comparable in terms of efficiency. The differences in performance are the result of different approximation algorithms, see Section 1.2.2. Matsui and Takemura (2006) further improved Nolan’s method for the boundary cases, i.e. in the tail and mode of the densities and in the neighborhood of the Cauchy and the Gaussian distributions, but only in the symmetric stable case. As Ojeda (2001) observes, the ML estimates are almost always the most accurate, closely followed by the regression-type estimates and McCulloch’s quantile method. However, ML estimation techniques are certainly the slowest of all the discussed methods. For instance, ML estimation for a sample of 2000 observations using a gradient search routine which utilizes the direct integration method is over 11 thousand (!) times slower than the Kogon-Williams algorithm (calculations performed on a PC running STABLE ver. 3.13; see Section 1.2.2 where the program was briefly described). Clearly, the higher accuracy does not justify the application of ML estimation in many real life problems, especially when calculations are to be performed on-line. For this reason the program STABLE offers an alternative – a fast quasi ML technique. It quickly approximates stable densities using a 3-dimensional spline interpolation based on pre-computed values of the standardized stable density on a grid of (x, α, β) values. At the cost of a large array of coefficients, the interpolation is highly accurate over most values of the parameter space and relatively fast – only ca. 13 times slower than the Kogon-Williams algorithm. Alternative Methods. Besides the popular methods discussed so far other estimation algorithms have been proposed in the literature. A Bayesian Markov chain Monte Carlo (MCMC) approach was initiated by Buckle (1995). It was later modified by Lombardi (2007) who used an approximated version of the likelihood, instead of the twice slower Gibbs sampler, and by Peters, Sisson and Fan (2009) who proposed likelihood-free Bayesian inference for stable models. In a recent paper Garcia, Renault and Veredas (2010) estimate the stable law parameters with (constrained) indirect inference, a method particularly suited to situations where the model of interest is difficult to estimate but relatively
easy to simulate. They use the skewed-t distribution as an auxiliary model, since it has the same number of parameters as the stable law, with each parameter playing a similar role.
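As a toy illustration of the numerical ML idea (and of why it is slow), the following sketch fits the symmetric case with β = μ = 0 by maximizing (1.16), using the cf-inversion density from Section 1.2.2 rather than the FFT- or spline-based approximations employed by the tools discussed above; sample_returns is a placeholder name for the user's data.

  # Negative log-likelihood for a symmetric stable sample, with the density obtained
  # by numerically inverting the cf; slow and for illustration only.
  negloglik_sym <- function(par, x) {
    alpha <- par[1]; sigma <- par[2]
    if (alpha <= 1 || alpha > 2 || sigma <= 0) return(1e10)  # crude box constraints
    dens <- sapply(x, function(xi)
      integrate(function(t) cos(t * xi / sigma) * exp(-t^alpha),
                lower = 0, upper = Inf)$value / (pi * sigma))
    -sum(log(pmax(dens, 1e-300)))
  }
  # fit <- optim(c(1.5, 1), negloglik_sym, x = sample_returns)  # sample_returns: user data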
1.3 Truncated and tempered stable distributions

Mandelbrot's (1963) seminal work on applying stable distributions in finance gained support in the first few years after its publication, but subsequent works have questioned the stable distribution hypothesis, in particular, the stability under summation (for a review see Rachev and Mittnik, 2000). Over the next few years, the stable law temporarily lost favor and alternative processes were suggested as mechanisms generating stock returns.

In the mid 1990s the stable distribution hypothesis made a dramatic comeback, at first in the econophysics literature. Several authors have found a very good agreement of high-frequency returns with a stable distribution up to six standard deviations away from the mean (Cont, Potters and Bouchaud, 1997). For more extreme observations, however, the distribution they found fell off approximately exponentially. To cope with such observations the so called truncated Lévy distributions (TLD) were introduced by Mantegna and Stanley (1994). The original definition postulated a sharp truncation of the stable pdf at some arbitrary point. Later, however, exponential smoothing was proposed by Koponen (1995), leading to the following characteristic function:

  log φ(t) = −[σ^α / cos(πα/2)] { (t² + λ²)^{α/2} cos(α arctan(|t|/λ)) − λ^α },   (1.17)

where α ≠ 1 is the tail exponent, σ is the scale parameter and λ is the truncation coefficient (for simplicity β and μ are set to zero here). Clearly the symmetric (exponentially smoothed) TLD reduces to the symmetric stable distribution (β = μ = 0) when λ = 0. For small and intermediate returns the TLD behaves like a stable distribution, but for extreme returns the truncation causes the distribution to converge to the Gaussian and, hence, all moments are finite. In particular, the variance and kurtosis are given by:

  Var(X) = [α(1 − α) / cos(πα/2)] σ^α λ^{α−2},   (1.18)

  k(X) = cos(πα/2)(α − 2)(α − 3) / [α(1 − α) σ^α λ^α].   (1.19)
Figure 1.3: Top panels: Semilog and loglog plots of symmetric 1.7-stable, symmetric tempered stable (TSD) with α = 1.7 and λ = 0.2, and Gaussian pdfs. Bottom panels: Semilog and loglog plots of symmetric TSD pdfs with α = 1.7 and four truncation coefficients: λ = 5, 0.5, 0.2, 0.01. Note that for large λ's the distribution approaches the Gaussian (though with a different scale) and for small λ's the stable law with the same shape parameter α. STFstab03

The convergence to the Gaussian and stable laws can be seen in Figure 1.3, where we compare stable and exponentially smoothed TLDs (or TSDs, see below) for a value of the tail exponent typically reported for financial data (α = 1.7). Thus the observation that the asset returns distribution is a TLD explains both the short-term stable behavior and the long-run convergence to the normal distribution (for interesting insights on the CLT-type behavior of the TLD see the recent paper of Grabchak and Samorodnitsky, 2010).
The (exponentially smoothed) TLD was not recognized in finance until the introduction of the KoBoL (Boyarchenko and Levendorskii, 2000) and CGMY models (Carr et al., 2002). Around this time Rosinski coined the term under which the exponentially smoothed TLD is known today in the mathematics literature – tempered stable distribution (TSD; see Rosinski, 2007). Despite their interesting statistical properties, the TSDs (TLDs) have not been applied extensively to date. The most probable reason for this is the complicated definition of the TSD. As for stable distributions, only the characteristic function is known. No closed-form formulas exist for the density or the distribution functions. No integral formulas, like Zolotarev's (1986) for the stable laws (see Section 1.2.2), have been discovered to date. Hence, statistical inference is, in general, limited to ML utilizing the FFT technique for approximating the pdf (Bianchi et al., 2010; Grabchak, 2008). Moreover, compared to the stable distribution, the TSD introduces one more parameter (the truncation λ), making the estimation procedure even more complicated. Other parameter fitting techniques proposed so far comprise a combination of ad hoc approaches and moment matching (Boyarchenko and Levendorskii, 2000; Matacz, 2000). Apart from a few special cases, the simulation of TSD variables is also cumbersome and numerically demanding (Bianchi et al., 2010; Kawai and Masuda, 2010; Poirot and Tankov, 2006).
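Since only the characteristic function (1.17) is available in closed form, even a single pdf evaluation requires numerical inversion. A minimal R sketch for the symmetric case is given below; it uses plain one-dimensional quadrature rather than the FFT, and the truncation of the integration range at t = 50 is an ad hoc choice that is adequate for the parameter values of Figure 1.3 but is not part of any published routine:

    # symmetric TSD characteristic function, eqn. (1.17) with beta = mu = 0
    tsd_cf <- function(t, alpha, sigma, lambda)
      exp(-sigma^alpha / cos(pi * alpha / 2) *
            ((t^2 + lambda^2)^(alpha / 2) * cos(alpha * atan(abs(t) / lambda)) - lambda^alpha))

    # pdf by Fourier inversion; for a symmetric law the cf is real, so
    # f(x) = (1/pi) * integral_0^Inf cos(t*x) * phi(t) dt
    tsd_pdf <- function(x, alpha, sigma, lambda, tmax = 50)
      sapply(x, function(xi)
        integrate(function(t) cos(t * xi) * tsd_cf(t, alpha, sigma, lambda),
                  lower = 0, upper = tmax)$value / pi)

    tsd_pdf(c(0, 1, 5), alpha = 1.7, sigma = 1, lambda = 0.2)   # cf. Figure 1.3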
1.4 Generalized hyperbolic distributions

1.4.1 Definitions and basic properties

The hyperbolic distribution saw its appearance in finance in the mid-1990s, when a number of authors reported that it provides a very good model for the empirical distributions of daily stock returns from a number of leading German enterprises (Eberlein and Keller, 1995; Küchler et al., 1999). Since then it has become a popular tool in stock price modeling and market risk measurement (Bibby and Sørensen, 2003; Chen, Härdle and Jeong, 2008; McNeil, Rüdiger and Embrechts, 2005). However, the origins of the hyperbolic law date back to the 1940s and the empirical findings in geophysics. A formal mathematical description was developed years later by Barndorff-Nielsen (1977). The hyperbolic law provides the possibility of modeling heavier tails than the Gaussian, since its log-density forms a hyperbola while that of the Gaussian is a parabola (see Figure 1.4), but lighter than the stable.
Figure 1.4: Densities and log-densities of symmetric hyperbolic, TSD, NIG, and Gaussian distributions having the same variance, see eqns. (1.18) and (1.33). The name of the hyperbolic distribution is derived from the fact that its log-density forms a hyperbola, which is clearly visible in the right panel. STFstab04
As we will see later in this section, the hyperbolic law is a member of a larger, versatile class of generalized hyperbolic (GH) distributions, which also includes the normal-inverse Gaussian (NIG) and variance-gamma (VG) distributions as special cases. For a concise review of special and limiting cases of the GH distribution see Chapter 9 in Paolella (2007).

The Hyperbolic Distribution. The hyperbolic distribution is defined as a normal variance-mean mixture where the mixing distribution is the generalized inverse Gaussian (GIG) law with parameter λ = 1, i.e. it is conditionally Gaussian (Barndorff-Nielsen, 1977; Barndorff-Nielsen and Blaesild, 1981). More precisely, a random variable Z has the hyperbolic distribution if:

(Z \mid Y) \sim \mathrm{N}(\mu + \beta Y,\ Y),   (1.20)

where Y is a generalized inverse Gaussian GIG(λ = 1, χ, ψ) random variable and N(m, s²) denotes the Gaussian distribution with mean m and variance s². The GIG law is a positive domain distribution with the pdf given by:

f_{GIG}(x) = \frac{(\psi/\chi)^{\lambda/2}}{2 K_\lambda(\sqrt{\chi\psi})}\, x^{\lambda-1} e^{-\frac{1}{2}(\chi x^{-1} + \psi x)},  \qquad  x > 0,   (1.21)
where the three parameters take values in one of the ranges: (i) χ > 0, ψ ≥ 0 if λ < 0, (ii) χ > 0, ψ > 0 if λ = 0, or (iii) χ ≥ 0, ψ > 0 if λ > 0.

The generalized inverse Gaussian law has a number of interesting properties that we will use later in this section. The distribution of the inverse of a GIG variable is again GIG but with a different λ, namely if Y ∼ GIG(λ, χ, ψ), then:

Y^{-1} \sim \mathrm{GIG}(-\lambda, \psi, \chi).   (1.22)

A GIG variable can also be reparameterized by setting a = \sqrt{\chi/\psi} and b = \sqrt{\chi\psi}, and defining Y = a\tilde{Y}, where:

\tilde{Y} \sim \mathrm{GIG}(\lambda, b, b).   (1.23)

The normalizing constant K_λ(t) in formula (1.21) is the modified Bessel function of the third kind with index λ, also known as the MacDonald function. It is defined as:

K_\lambda(t) = \frac{1}{2} \int_0^\infty x^{\lambda-1} e^{-\frac{1}{2} t (x + x^{-1})}\, dx,  \qquad  t > 0.   (1.24)

In the context of hyperbolic distributions, the Bessel functions are thoroughly discussed in Barndorff-Nielsen and Blaesild (1981). Here we recall only two properties that will be used later. Namely, (i) K_λ(t) is symmetric with respect to λ, i.e. K_λ(t) = K_{−λ}(t), and (ii) for λ = ±1/2 it can be written in a simpler form:

K_{\pm 1/2}(t) = \sqrt{\frac{\pi}{2}}\, t^{-1/2} e^{-t}.   (1.25)

Relation (1.20) implies that a hyperbolic random variable Z ∼ H(ψ, β, χ, μ) can be represented in the form Z ∼ μ + βY + \sqrt{Y}\,N(0, 1), with the cf:

\phi_Z(u) = e^{iu\mu} \int_0^\infty e^{i\beta z u - \frac{1}{2} z u^2}\, dF_Y(z).   (1.26)

Here F_Y(z) denotes the distribution function of a GIG random variable Y with parameter λ = 1, see eqn. (1.21). Hence, the hyperbolic pdf is given by:

f_H(x; \psi, \beta, \chi, \mu) = \frac{\sqrt{\psi/\chi}}{2\sqrt{\psi+\beta^2}\, K_1(\sqrt{\psi\chi})}\, e^{-\sqrt{(\psi+\beta^2)\{\chi+(x-\mu)^2\}} + \beta(x-\mu)},   (1.27)

or in an alternative parameterization (with δ = \sqrt{\chi} and α = \sqrt{\psi + \beta^2}) by:

f_H(x; \alpha, \beta, \delta, \mu) = \frac{\sqrt{\alpha^2-\beta^2}}{2\alpha\delta\, K_1(\delta\sqrt{\alpha^2-\beta^2})}\, e^{-\alpha\sqrt{\delta^2+(x-\mu)^2} + \beta(x-\mu)}.   (1.28)

The latter is more popular and has the advantage of δ > 0 being the traditional scale parameter. Out of the remaining three parameters, α and β determine the shape, with α being responsible for the steepness and 0 ≤ |β| < α for the skewness, and μ ∈ R is the location parameter. Finally, note that if we have an efficient algorithm to compute K1, the calculation of the pdf is straightforward. However, the cdf has to be numerically integrated from (1.27) or (1.28).
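As a quick illustration of the last point, the hyperbolic pdf (1.28) can be coded in a few lines of R with the built-in Bessel function besselK, while the cdf is obtained by numerical integration; this is only a minimal sketch, not one of the chapter's quantlets:

    dhyperb <- function(x, alpha, beta, delta, mu) {
      stopifnot(alpha > abs(beta), delta > 0)
      g <- sqrt(alpha^2 - beta^2)
      k <- g / (2 * alpha * delta * besselK(delta * g, nu = 1))   # normalizing constant
      k * exp(-alpha * sqrt(delta^2 + (x - mu)^2) + beta * (x - mu))
    }

    phyperb <- function(q, alpha, beta, delta, mu)
      # cdf obtained by integrating the pdf; adequate for moderate arguments
      sapply(q, function(qi)
        integrate(dhyperb, lower = -Inf, upper = qi,
                  alpha = alpha, beta = beta, delta = delta, mu = mu)$value)

    dhyperb(0, alpha = 2, beta = 0.5, delta = 1, mu = 0)
    phyperb(c(-1, 0, 1), alpha = 2, beta = 0.5, delta = 1, mu = 0)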
The General Class. The generalized hyperbolic (GH) law can be represented as a normal variance-mean mixture where the mixing distribution is the generalized inverse Gaussian law with any λ ∈ R. Hence, the GH distribution is described by five parameters θ = (λ, α, β, δ, μ), using parameterization (1.28), and its pdf is given by:
f_{GH}(x; \theta) = \kappa \left\{\delta^2 + (x-\mu)^2\right\}^{\frac{1}{2}\left(\lambda-\frac{1}{2}\right)} K_{\lambda-\frac{1}{2}}\left(\alpha\sqrt{\delta^2+(x-\mu)^2}\right) e^{\beta(x-\mu)},   (1.29)

where:

\kappa = \frac{(\alpha^2-\beta^2)^{\lambda/2}}{\sqrt{2\pi}\, \alpha^{\lambda-\frac{1}{2}}\, \delta^\lambda\, K_\lambda\left(\delta\sqrt{\alpha^2-\beta^2}\right)}.   (1.30)

The tail behavior of the GH density is 'semi-heavy', i.e. the tails are lighter than those of non-Gaussian stable laws and TSDs with a relatively small truncation parameter (see Figure 1.4), but much heavier than Gaussian. Formally, the following asymptotic relation is satisfied (Barndorff-Nielsen and Blaesild, 1981):

f_{GH}(x) \approx |x|^{\lambda-1} e^{(\mp\alpha+\beta)x}  \quad\text{for}\quad  x \to \pm\infty,   (1.31)

which can be interpreted as exponential 'tempering' of the power-law tails (compare with the TSD described in Section 1.3). Consequently, all moments of the GH law exist. In particular, with ζ = δ\sqrt{\alpha^2-\beta^2}, the mean and variance are given by:

E(X) = \mu + \frac{\beta\delta^2}{\zeta}\, \frac{K_{\lambda+1}(\zeta)}{K_\lambda(\zeta)},   (1.32)

\mathrm{Var}(X) = \delta^2 \left[ \frac{K_{\lambda+1}(\zeta)}{\zeta K_\lambda(\zeta)} + \frac{\beta^2\delta^2}{\zeta^2} \left\{ \frac{K_{\lambda+2}(\zeta)}{K_\lambda(\zeta)} - \left(\frac{K_{\lambda+1}(\zeta)}{K_\lambda(\zeta)}\right)^2 \right\} \right].   (1.33)

The Normal-Inverse Gaussian Distribution. The normal-inverse Gaussian (NIG) laws were introduced by Barndorff-Nielsen (1995) as a subclass of the generalized hyperbolic laws obtained for λ = −1/2. The density of the NIG
distribution is given by:

f_{NIG}(x) = \frac{\alpha\delta}{\pi}\, \frac{K_1\left(\alpha\sqrt{\delta^2+(x-\mu)^2}\right)}{\sqrt{\delta^2+(x-\mu)^2}}\, e^{\delta\sqrt{\alpha^2-\beta^2}+\beta(x-\mu)}.   (1.34)
As for the hyperbolic law, the calculation of the pdf is straightforward, but the cdf has to be numerically integrated from eqn. (1.34). At the expense of four parameters, the NIG distribution is able to model asymmetric distributions with 'semi-heavy' tails. However, if we let α → 0 the NIG distribution converges to the Cauchy distribution (with location parameter μ and scale parameter δ), which exhibits extremely heavy tails. Obviously, the NIG distribution may not be adequate to deal with cases of extremely heavy tails such as those of Pareto or non-Gaussian stable laws. However, empirical experience suggests excellent fits of the NIG law to financial data (Karlis, 2002; Karlis and Lillestøl, 2004; Venter and de Jongh, 2002). Moreover, the class of normal-inverse Gaussian distributions possesses an appealing feature that the class of hyperbolic laws does not have. Namely, it is closed under convolution, i.e. a sum of two independent NIG random variables is again NIG (Barndorff-Nielsen, 1995). In particular, if X1 and X2 are independent NIG random variables with common parameters α and β but having different scale and location parameters δ1,2 and μ1,2, respectively, then X = X1 + X2 is NIG(α, β, δ1 + δ2, μ1 + μ2). Only two subclasses of the GH distributions are closed under convolution. The other class with this important property is the class of variance-gamma (VG) distributions, which is obtained when δ is equal to 0. This is only possible for λ > 0 and α > |β|. The VG distributions (with β = 0) were introduced to finance by Madan and Seneta (1990), long before the popularity of GH and NIG laws.
1.4.2 Simulation of generalized hyperbolic variables

The most natural way of simulating GH variables is derived from the property that they can be represented as normal variance-mean mixtures. Since the mixing distribution is the GIG law, the resulting algorithm reads as follows:

1. simulate a random variable Y ∼ GIG(λ, χ, ψ) = GIG(λ, δ², α² − β²);
2. simulate a standard normal random variable N;
3. return X = μ + βY + \sqrt{Y}\,N.
The algorithm is fast and efficient if we have a handy way of simulating GIG variates. For λ = −1/2, i.e. when sampling from the so-called inverse Gaussian (IG) distribution, there exists an efficient procedure that utilizes a transformation yielding two roots. It starts with the observation that if we let ϑ = \sqrt{\chi/\psi}, then the IG(χ, ψ) density (= GIG(−1/2, χ, ψ); see eqn. (1.21)) of Y can be written as:

f_{IG}(x) = \sqrt{\frac{\chi}{2\pi x^3}} \exp\left\{ -\frac{\chi(x-\vartheta)^2}{2 x \vartheta^2} \right\}.   (1.35)

Now, following Shuster (1968) we may write:

V = \frac{\chi (Y - \vartheta)^2}{Y \vartheta^2} \sim \chi^2_{(1)},   (1.36)

i.e. V is distributed as a chi-square random variable with one degree of freedom. As such it can be simply generated by taking the square of a standard normal random number. Unfortunately, the value of Y is not uniquely determined by eqn. (1.36). Solving this equation for Y yields two roots:

y_1 = \vartheta + \frac{\vartheta}{2\chi}\left( \vartheta V - \sqrt{4\vartheta\chi V + \vartheta^2 V^2} \right)  \quad\text{and}\quad  y_2 = \frac{\vartheta^2}{y_1}.
The difficulty in generating observations with the desired distribution now lies in choosing between the two roots. Michael, Schucany and Haas (1976) showed that Y can be simulated by choosing y1 with probability ϑ/(ϑ + y1 ). So for each random observation V from a χ2(1) distribution the smaller root y1 has to be calculated. Then an auxiliary Bernoulli trial is performed with probability p = ϑ/(ϑ + y1 ). If the trial results in a ‘success’, y1 is chosen; otherwise, the larger root y2 is selected. This routine is, for instance, implemented in the rnig function of the Rmetrics collection of software packages for S-plus/R (see also Section 1.2.2 where Rmetrics was briefly described). In the general case, the GIG distribution – as well as the (generalized) hyperbolic law – can be simulated via the rejection algorithm. An adaptive version of this technique is used to obtain hyperbolic random numbers in the rhyp function of Rmetrics. Rejection is also implemented in the HyperbolicDist package for S-plus/R developed by David Scott, see the R-project home page cran.r-project.org. The package utilizes a version of the algorithm proposed by Atkinson (1982), i.e. rejection coupled either with a two (‘GIG algorithm’ for any admissible value of λ) or a three (‘GIGLT1 algorithm’ for 0 ≤ λ < 1) part envelope (or majorizing function). However, finding the appropriate parameter values for these envelopes requires optimization and makes the technique burdensome.
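A compact R sketch of this two-root IG generator, followed by NIG simulation via the mixture representation of Section 1.4.2 (λ = −1/2, χ = δ², ψ = α² − β²), is given below; it is a bare-bones illustration rather than the Rmetrics rnig routine:

    rig <- function(n, chi, psi) {
      theta <- sqrt(chi / psi)          # vartheta in eqn. (1.35)
      v <- rnorm(n)^2                   # chi-square with one degree of freedom
      y1 <- theta + theta / (2 * chi) *
              (theta * v - sqrt(4 * theta * chi * v + theta^2 * v^2))
      # choose the smaller root y1 with probability theta / (theta + y1)
      ifelse(runif(n) < theta / (theta + y1), y1, theta^2 / y1)
    }

    rnig_mix <- function(n, alpha, beta, delta, mu) {
      y <- rig(n, chi = delta^2, psi = alpha^2 - beta^2)
      mu + beta * y + sqrt(y) * rnorm(n)
    }

    set.seed(42)
    x <- rnig_mix(1e5, alpha = 2, beta = 0.5, delta = 1, mu = 0)
    c(mean(x), var(x))                  # quick sanity check of the simulated sample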
This difficulty led to a search for a short algorithm which would give comparable efficiencies but without the drawback of extensive numerical optimizations. A solution, based on the 'ratio-of-uniforms' method, was provided by Dagpunar (1989). First, recalling properties (1.22) and (1.23), observe that we only need a method to simulate Ỹ ∼ GIG(λ, b, b) variables and only for non-negative λ's. Next, define the relocated variable Ỹ_m = Ỹ − m, where the shift m = \frac{1}{b}\left(\lambda - 1 + \sqrt{(\lambda-1)^2 + b^2}\right) is the mode of the density of Ỹ. Then Ỹ can be generated by setting Ỹ_m = V/U, where the pair (U, V) is uniformly distributed over the region \{(u, v): 0 \le u \le \sqrt{h(v/u)}\} with:

h(t) = (t+m)^{\lambda-1} \exp\left\{ -\frac{b}{2}\left( t + m + \frac{1}{t+m} \right) \right\},  \quad\text{for}\quad  t \ge -m.

Since this region is irregularly shaped, it is more convenient to generate the pair (U, V) uniformly over a minimal enclosing rectangle \{(u, v): 0 \le u \le u_+,\ v_- \le v \le v_+\}. Finally, the variate V/U is accepted if U² ≤ h(V/U). The efficiency of the algorithm depends on the method of deriving and the actual choice of u_+ and v_±. Further, for λ ≤ 1 and b ≤ 1 there is no need for the shift at the mode m. Such a version of the algorithm is implemented in UNU.RAN, a library of C functions for non-uniform random number generation developed at the Vienna University of Economics, see statistik.wu-wien.ac.at/unuran.
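The following R sketch illustrates Dagpunar's ratio-of-uniforms idea for GIG(λ, b, b); the rectangle bounds u_+ and v_± are found here by crude numerical optimization over an ad hoc search interval (adequate for moderate λ and b), which is an assumption of this sketch and not part of the original algorithm or of UNU.RAN:

    rgig_bb <- function(n, lambda, b) {
      stopifnot(lambda >= 0, b > 0)
      m <- (lambda - 1 + sqrt((lambda - 1)^2 + b^2)) / b       # mode of GIG(lambda, b, b)
      sqh <- function(t)                                       # sqrt of h(t), t >= -m
        (t + m)^((lambda - 1) / 2) * exp(-b / 4 * (t + m + 1 / (t + m)))
      up <- sqh(0)                                             # sqrt(h) is maximal at the mode
      vp <- optimize(function(t) t * sqh(t), c(0, m + 50 / b + 50), maximum = TRUE)$objective
      vm <- -optimize(function(t) -t * sqh(t), c(-m + 1e-10, 0), maximum = TRUE)$objective
      out <- numeric(n); i <- 0
      while (i < n) {
        u <- runif(1, 0, up); v <- runif(1, vm, vp); t <- v / u
        if (t > -m && u * u <= sqh(t)^2) { i <- i + 1; out[i] <- t + m }
      }
      out
    }

    # general GIG(lambda, chi, psi) variates via the reparameterization (1.23)
    rgig <- function(n, lambda, chi, psi)
      sqrt(chi / psi) * rgig_bb(n, lambda, sqrt(chi * psi))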
1.4.3 Estimation of parameters

Maximum Likelihood Method. The parameter estimation of GH distributions can be performed by the ML method, since there exist closed-form formulas (although involving special functions) for the densities of these laws. The computational burden is not as heavy as for stable laws, but it is still considerable. In general, the ML estimation algorithm is as follows. For a vector of observations x = (x_1, ..., x_n), the ML estimate of the parameter vector θ = (λ, α, β, δ, μ) is obtained by maximizing the log-likelihood function:

L(x; \theta) = n\log\kappa + \frac{\lambda - \frac{1}{2}}{2} \sum_{i=1}^n \log\left\{\delta^2 + (x_i-\mu)^2\right\} + \sum_{i=1}^n \log K_{\lambda-\frac{1}{2}}\left(\alpha\sqrt{\delta^2+(x_i-\mu)^2}\right) + \sum_{i=1}^n \beta(x_i-\mu),   (1.37)

where κ is defined by (1.30). Obviously, for hyperbolic (λ = 1) distributions the algorithm uses simpler expressions of the log-likelihood function due to relation (1.25).
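As an illustration of (1.37) in the simplest case, the R sketch below codes the negative log-likelihood of the hyperbolic law (λ = 1, parameterization (1.28)) directly with besselK and hands it to a general-purpose optimizer; the starting values and the simulated data are arbitrary choices for demonstration, not a recommended estimation recipe:

    hyp_negloglik <- function(par, x) {
      alpha <- par[1]; beta <- par[2]; delta <- par[3]; mu <- par[4]
      if (delta <= 0 || alpha <= abs(beta)) return(1e10)     # inadmissible parameters
      g <- sqrt(alpha^2 - beta^2)
      # log K_1 computed on the exponential scale for numerical stability
      logK1 <- log(besselK(delta * g, nu = 1, expon.scaled = TRUE)) - delta * g
      logkappa <- log(g) - log(2 * alpha * delta) - logK1    # log of the constant in (1.28)
      ll <- length(x) * logkappa +
            sum(-alpha * sqrt(delta^2 + (x - mu)^2) + beta * (x - mu))
      -ll
    }

    set.seed(7)
    x <- rt(2000, df = 5)                                    # heavy-tailed pseudo-data
    fit <- optim(c(2, 0, 1, median(x)), hyp_negloglik, x = x)
    fit$par                                                  # estimates of (alpha, beta, delta, mu)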
The routines proposed in the literature differ in the choice of the optimization scheme. The first software product that allowed statistical inference with hyperbolic distributions, the HYP program, used a gradient search technique (Blaesild and Sorensen, 1992). In a large simulation study Prause (1999) utilized the bracketing method. The Matlab functions hypest.m and nigest.m distributed with the MFE Toolbox (Weron, 2006) use the downhill simplex method, with slight modifications due to parameter restrictions. The main factor for the speed of the estimation is the number of modified Bessel functions to compute. Note that for λ = 1 (i.e. the hyperbolic distribution) this function appears only in the constant κ. For a data set with n independent observations we need to evaluate n and n + 1 Bessel functions for NIG and GH distributions, respectively, whereas only one for the hyperbolic. This leads to a considerable reduction in the time necessary to calculate the likelihood function in the hyperbolic case. Prause (1999) reported a reduction of ca. 33%; however, the efficiency results are highly sample and implementation dependent. We also have to say that the optimization is challenging. Some of the parameters are hard to separate since a flat-tailed GH distribution with a large scale parameter is hard to distinguish from a fat-tailed distribution with a small scale parameter, see Barndorff-Nielsen and Blaesild (1981), who observed such behavior already for the hyperbolic law. The likelihood function with respect to these parameters then becomes very flat, and may have local minima. In the case of NIG distributions Venter and de Jongh (2002) proposed simple estimates of α and β that can be used as starting values for the ML scheme. Starting from relation (1.31) for the tails of the NIG density (i.e. with λ = −1/2) they derived the following approximation:
\alpha - \beta \sim \frac{1}{2}\, \frac{x_{1-f} + E(X \mid X > x_{1-f})}{E(X^2 \mid X > x_{1-f}) - x_{1-f}\, E(X \mid X > x_{1-f})},

\alpha + \beta \sim -\frac{1}{2}\, \frac{x_f + E(X \mid X < x_f)}{E(X^2 \mid X < x_f) - x_f\, E(X \mid X < x_f)},

where x_f is the f-th population quantile, see Section 1.2.4. After the choice of a suitable value for f (Venter and de Jongh used f = 5%), the 'tail estimates' of α and β are obtained by replacing the quantiles and expectations by their sample values in the above relations.

Another method of providing the starting values for the ML scheme was suggested by Prause (1999). He estimated the parameters of a symmetric (β = μ = 0) GH law with a reasonable kurtosis (i.e. with δα ≈ 1.04) that had the variance equal to that of the empirical distribution.
Other Methods. Besides the ML approach, other estimation methods have been proposed in the literature. Prause (1999) tested different estimation techniques by replacing the log-likelihood function with other score functions, like the Anderson-Darling and Kolmogorov statistics or Lp-norms, but the results were disappointing. Karlis and Lillestøl (2004) made use of the MCMC technique; however, again the results obtained were not impressive. Karlis (2002) described an Expectation-Maximization (EM) type algorithm for ML estimation of the NIG distribution. The algorithm can be programmed in any statistical package supporting Bessel functions and it has all the properties of the standard EM algorithm, like sure, but slow, convergence, parameters in the admissible range, etc. Recently Fragiadakis, Karlis and Meintanis (2009) used this approach to construct goodness-of-fit tests for symmetric NIG distributions. The tests are based on a weighted integral incorporating the empirical cf of suitably standardized data. The EM scheme can also be generalized to multivariate GH distributions (but with fixed λ, see Protassov, 2004).
1.5 Empirical evidence

The empirical facts presented in Section 1.1 show that we should use heavy-tailed alternatives to the Gaussian law in order to obtain acceptable estimates of market losses. In this section we apply the techniques discussed so far to two samples of financial data: the Dow Jones Industrial Average (DJIA) index and the Polish WIG20 index. Both are blue chip stock market indexes. The DJIA is composed of 30 major U.S. companies that trade on the NYSE and NASDAQ. The WIG20 consists of 20 major Polish companies listed on the Warsaw Stock Exchange. We use daily closing index values from the period January 3, 2000 – December 31, 2009. Eliminating missing values (mostly U.S. and Polish holidays) we end up with 2494 (log-)returns for each index, see the top left panels in Figures 1.5 and 1.6. Like most financial time series, these index returns contain volatility clusters, which imply that the probability of a specific incurred loss is not the same on each day. During days of higher volatility we should expect larger than usual losses, and during calmer days smaller than usual. To remove volatility clusters it is necessary to model the process that generates them. Following Barone-Adesi, Giannopoulos, and Vosper (1999) and Kuester, Mittnik, and Paolella (2006) we eliminate volatility clusters by filtering the returns rt with
Table 1.1: Gaussian, hyperbolic, NIG, and stable fits to 2516 standardized (rescaled by the sample standard deviation) and σt-filtered returns of the DJIA from the period January 3, 2000 – December 31, 2009, see also Figure 1.5. The values of the Kolmogorov (K) and Anderson-Darling (AD) goodness-of-fit statistics suggest the hyperbolic and NIG distributions as the best models for filtered and standardized returns, respectively. The symbols ∗∗∗, ∗∗, and ∗ denote significance of the goodness-of-fit tests at the 10%, 5%, and 1% level, respectively.

                               Parameters                         Statistics
  Distribution      α         σ (δ)      β          μ             K           AD
  Returns (standardized)
  Gaussian          –         1.0000     –         -0.0026        4.083       38.61
  Hyperbolic        1.5003    0.0700    -0.0794     0.0691        0.756∗∗     0.671∗
  NIG               0.6988    0.6829    -0.0586     0.0548        0.651∗∗∗    0.414∗∗
  Stable            1.6150    0.4982    -0.1624    -0.0247        1.171∗∗∗    1.792∗∗∗
  Filtered returns
  Gaussian          –         1.0000     –         -0.0021        1.857       4.328
  Hyperbolic        2.1899    1.3793    -0.2852     0.2755        0.713∗∗     0.658∗
  NIG               1.8928    1.8172    -0.2953     0.2849        0.743∗∗     0.711∗
  Stable            1.9327    0.6669    -0.8026    -0.0101        1.074∗∗∗    1.504∗∗∗
a GARCH(1,1) process:

r_t = \sigma_t \varepsilon_t,  \quad\text{with}\quad  \sigma_t^2 = c_0 + c_1 r_{t-1}^2 + d_1 \sigma_{t-1}^2,   (1.38)

where c_0, c_1 and d_1 are constants and \varepsilon_t = r_t/\sigma_t are the filtered returns, see the top right panels in Figures 1.5 and 1.6. We could also insert a moving average term in the conditional mean to remove any serial dependency, if needed.

To find the right model class for each dataset we fit four distributions: Gaussian, hyperbolic, NIG, and stable to standardized (rescaled by the sample standard deviation) and σt-filtered returns. The results are presented in Tables 1.1 and 1.2, see also Figures 1.5 and 1.6. We compare the fits using the Kolmogorov (K) and Anderson-Darling (AD) test statistics (D'Agostino and Stephens, 1986). The latter may be treated as a weighted Kolmogorov statistic which puts more weight on the differences in the tails of the distributions. Naturally, the lower the values, the better the fit.
Table 1.2: Gaussian, hyperbolic, NIG and stable fits to 2514 standardized (rescaled by the sample standard deviation) and σt-filtered returns of the Polish WIG20 index from the period January 3, 2000 – December 31, 2009, see also Figure 1.6. The values of the Kolmogorov (K) and Anderson-Darling (AD) goodness-of-fit statistics suggest the hyperbolic distribution as the best model, with the NIG law following closely behind. The symbols ∗∗∗, ∗∗, and ∗ denote significance of the goodness-of-fit tests at the 10%, 5%, and 1% level, respectively.

                               Parameters                         Statistics
  Distribution      α         σ (δ)      β          μ             K           AD
  Returns (standardized)
  Gaussian          –         1.0000     –          0.0057        2.337       11.19
  Hyperbolic        1.5779    0.4884    -0.0069     0.0126        0.448∗∗∗    0.247∗∗∗
  NIG               1.0972    1.1054    -0.0042     0.0100        0.557∗∗∗    0.360∗∗∗
  Stable            1.7841    0.5977     0.1069     0.0222        0.926∗∗∗    1.550∗∗∗
  Filtered returns
  Gaussian          –         1.0000     –          0.0126        1.598       3.681
  Hyperbolic        2.0121    1.2288     0.0161    -0.0036        0.639∗∗∗    0.379∗∗
  NIG               1.6855    1.6984     0.0140    -0.0015        0.669∗∗∗    0.435∗∗
  Stable            1.8891    0.6529     0.1053     0.0200        0.890∗∗∗    1.062∗∗∗
For both datasets, both statistics suggest the hyperbolic distribution as the best model, with the NIG law following closely behind. Although no asymptotic results are known for stable or generalized hyperbolic laws, approximate critical values for these goodness-of-fit tests can be obtained via the bootstrap (or simulation) technique, see Stute, Manteiga and Quindimil (1993), Ross (2002), and Chapter 9, where Monte Carlo simulations are used for this purpose in the context of insurance loss distributions. As in the latter, the significance here is computed on the basis of 1000 simulated samples. Clearly, the Gaussian law cannot be accepted for either dataset. On the other hand, the stable law yields moderately high test values (indicating a poor fit), but high p-values. This is likely due to the significant dispersion of stable law parameter estimates for samples of this size. Finally, note that for the DJIA filtered returns the left tail is much heavier than the right tail, with the latter being quite well modeled by the Gaussian law, see the bottom right panel in Figure 1.5. This is also confirmed by the strongly negative skewness parameters (β). In contrast, the WIG20 filtered returns are roughly symmetric.
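A hypothetical sketch of the filtering step (1.38) in R is given below. It assumes a numeric vector prices of daily closing index values and the availability of the fGarch package; both are assumptions of this illustration, and the chapter's own quantlets (STFstab05, STFstab06) need not proceed this way:

    library(fGarch)                            # assumed; provides garchFit()

    r <- diff(log(prices))                     # 'prices': daily closing index values (assumed)
    fit <- garchFit(~ garch(1, 1), data = r, include.mean = FALSE, trace = FALSE)
    sigma_t <- volatility(fit)                 # GARCH(1,1) daily volatility estimate
    eps <- residuals(fit, standardize = TRUE)  # filtered returns eps_t = r_t / sigma_t
    z <- r / sd(r)                             # standardized (unfiltered) returns
    # 'eps' and 'z' can then be passed to the ML routines of Sections 1.2.4 and 1.4.3
    # and the fitted distributions compared via the K and AD statistics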
Figure 1.5: Top left: DJIA index (log-)returns and the GARCH(1,1)-based daily volatility estimate σt. Top right: σt-filtered DJIA returns. Middle: The left tails of the cdf of (standardized) returns (left) or filtered returns (right) and of four fitted distributions. Bottom: The corresponding right tails. STFstab05
Figure 1.6: Top left: WIG20 index (log-)returns and the GARCH(1,1)-based daily volatility estimate σt. Top right: σt-filtered WIG20 returns. Middle: The left tails of the cdf of (standardized) returns (left) or filtered returns (right) and of four fitted distributions. Bottom: The corresponding right tails. STFstab06
Bibliography Atkinson, A. C. (1982). The simulation of generalized inverse Gaussian and hyperbolic random variables, SIAM Journal of Scientific & Statistical Computing 3: 502–515. Barndorff-Nielsen, O. E. (1977). Exponentially decreasing distributions for the logarithm of particle size, Proceedings of the Royal Society London A 353: 401–419. Barndorff-Nielsen, O. E. (1995). Normal\\Inverse Gaussian Processes and the Modelling of Stock Returns, Research Report 300, Department of Theoretical Statistics, University of Aarhus. Barndorff-Nielsen, O. E. and Blaesild, P. (1981). Hyperbolic distributions and ramifications: Contributions to theory and applications, in C. Taillie, G. Patil, B. Baldessari (eds.) Statistical Distributions in Scientific Work, Volume 4, Reidel, Dordrecht, pp. 19–44. Barone-Adesi, G., Giannopoulos, K., and Vosper, L., (1999). VaR without correlations for portfolios of derivative securities, Journal of Futures Markets 19(5): 583–602. Basle Committee on Banking Supervision (1995). An internal model-based approach to market risk capital requirements, http://www.bis.org. Bianchi, M. L., Rachev, S. T., Kim, Y. S. and Fabozzi, F. J. (2010). Tempered stable distributions and processes in finance: Numerical analysis, in M. Corazza, P. Claudio, P. (Eds.) Mathematical and Statistical Methods for Actuarial Sciences and Finance, Springer. Bibby, B. M. and Sørensen, M. (2003). Hyperbolic processes in finance, in S. T. Rachev (ed.) Handbook of Heavy-tailed Distributions in Finance, North Holland. Blaesild, P. and Sorensen, M. (1992). HYP – a Computer Program for Analyzing Data by Means of the Hyperbolic Distribution, Research Report 248, Department of Theoretical Statistics, Aarhus University.
Boyarchenko, S. I. and Levendorskii, S. Z. (2000). Option pricing for truncated L´evy processes, International Journal of Theoretical and Applied Finance 3: 549–552. Brcich, R. F., Iskander, D. R. and Zoubir, A. M. (2005). The stability test for symmetric alpha stable distributions, IEEE Transactions on Signal Processing 53: 977–986. Buckle, D. J. (1995). Bayesian inference for stable distributions, Journal of the American Statistical Association 90: 605–613. Carr, P., Geman, H., Madan, D. B. and Yor, M. (2002). The fine structure of asset returns: an empirical investigation, Journal of Business 75: 305–332. Chambers, J. M., Mallows, C. L. and Stuck, B. W. (1976). A Method for Simulating Stable Random Variables, Journal of the American Statistical Association 71: 340–344. Chen, Y., H¨ardle, W. and Jeong, S.-O. (2008). Nonparametric risk management with generalized hyperbolic distributions, Journal of the American Statistical Association 103: 910–923. Cont, R., Potters, M. and Bouchaud, J.-P. (1997). Scaling in stock market data: Stable laws and beyond, in B. Dubrulle, F. Graner, D. Sornette (eds.) Scale Invariance and Beyond, Proceedings of the CNRS Workshop on Scale Invariance, Springer, Berlin. D’Agostino, R. B. and Stephens, M. A. (1986). Goodness-of-Fit Techniques, Marcel Dekker, New York. Dagpunar, J. S. (1989). An Easily Implemented Generalized Inverse Gaussian Generator, Communications in Statistics – Simulations 18: 703–710. Danielsson, J., Hartmann, P. and De Vries, C. G. (1998). The cost of conservatism: Extreme returns, value at risk and the Basle multiplication factor, Risk 11: 101–103. Dominicy, Y. and Veredas, D. (2010). The method of simulated quantiles, ECARES working paper, 2010–008. DuMouchel, W. H. (1971). Stable Distributions in Statistical Inference, Ph.D. Thesis, Department of Statistics, Yale University. DuMouchel, W. H. (1973). On the Asymptotic Normality of the Maximum– Likelihood Estimate when Sampling from a Stable Distribution, Annals of Statistics 1(5): 948–957.
Eberlein, E. and Keller, U. (1995). Hyperbolic distributions in finance, Bernoulli 1: 281–299.
Fama, E. F. and Roll, R. (1971). Parameter Estimates for Symmetric Stable Distributions, Journal of the American Statistical Association 66: 331– 338. Fan, Z. (2006). Parameter estimation of stable distributions, Communications in Statistics – Theory and Methods 35(2): 245–255. Fragiadakis, K., Karlis, D. and Meintanis, S. G. (2009). Tests of fit for normal inverse Gaussian distributions, Statistical Methodology 6: 553–564. Garcia, R., Renault, E. and Veredas, D. (2010). Estimation of stable distributions by indirect inference, Journal of Econometrics, Forthcoming. Grabchak, M. (2010). Maximum likelihood estimation of parametric tempered stable distributions on the real line with applications to finance, Ph.D. thesis, Cornell University. Grabchak, M. and Samorodnitsky, G. (2010). Do financial returns have finite or infinite variance? A paradox and an explanation, Quantitative Finance, DOI: 10.1080/14697680903540381. Guillaume, D. M., Dacorogna, M. M., Dave, R. R., M¨ uller, U. A., Olsen, R. B. and Pictet, O. V. (1997). From the birds eye to the microscope: A survey of new stylized facts of the intra-daily foreign exchange markets, Finance & Stochastics 1: 95–129. Janicki, A. and Weron, A. (1994a). Can one see α-stable variables and processes, Statistical Science 9: 109–126. Janicki, A. and Weron, A. (1994b). Simulation and Chaotic Behavior of αStable Stochastic Processes, Marcel Dekker. Karlis, D. (2002). An EM type algorithm for maximum likelihood estimation for the Normal Inverse Gaussian distribution, Statistics and Probability Letters 57: 43–52. Karlis, D. and Lillest¨ol, J. (2004). Bayesian estimation of NIG models via Markov chain Monte Carlo methods, Applied Stochastic Models in Business and Industry 20(4): 323–338. Kawai, R. and Masuda, H. (2010). On simulation of tempered stable random variates, Preprint, Kyushu University.
Kogon, S. M. and Williams, D. B. (1998). Characteristic function based estimation of stable parameters, in R. Adler, R. Feldman, M. Taqqu (eds.), A Practical Guide to Heavy Tails, Birkhauser, pp. 311–335. Koponen, I. (1995). Analytic approach to the problem of convergence of truncated Levy flights towards the Gaussian stochastic process, Physical Review E 52: 1197–1199. Koutrouvelis, I. A. (1980). Regression–Type Estimation of the Parameters of Stable Laws, Journal of the American Statistical Association 75: 918–928. Kuester, K., Mittnik, S., and Paolella, M.S. (2006). Value-at-Risk prediction: A comparison of alternative strategies, Journal of Financial Econometrics 4(1): 53–89. K¨ uchler, U., Neumann, K., Sørensen, M. and Streller, A. (1999). Stock returns and hyperbolic distributions, Mathematical and Computer Modelling 29: 1–15. Lombardi, M. J. (2007). Bayesian inference for α-stable distributions: A random walk MCMC approach, Computational Statistics and Data Analysis 51(5): 2688–2700. Madan, D. B. and Seneta, E. (1990). The variance gamma (V.G.) model for share market returns, Journal of Business 63: 511–524. Mandelbrot, B. B. (1963). The variation of certain speculative prices, Journal of Business 36: 394–419. Mantegna, R. N. and Stanley, H. E. (1994). Stochastic processes with ultraslow convergence to a Gaussian: The truncated L´evy flight, Physical Review Letters 73: 2946–2949. Matacz, A. (2000). Financial Modeling and Option Theory with the Truncated L´evy Process, International Journal of Theoretical and Applied Finance 3(1): 143–160. Matsui, M. and Takemura, A. (2006). Some improvements in numerical evaluation of symmetric stable density and its derivatives, Communications in Statistics – Theory and Methods 35(1): 149–172. Matsui, M. and Takemura, A. (2008). Goodness-of-fit tests for symmetric stable distributions – empirical characteristic function approach, TEST 17(3): 546–566. McCulloch, J. H. (1986). Simple consistent estimators of stable distribution
parameters, Communications in Statistics – Simulations 15: 1109–1136. McNeil, A. J., R¨ udiger, F. and Embrechts, P. (2005). Quantitative Risk Management, Princeton University Press, Princeton, NJ. Michael, J. R., Schucany, W. R. and Haas, R. W. (1976). Generating Random Variates Using Transformations with Multiple Roots, The American Statistician 30: 88–90. Mittnik, S., Doganoglu, T. and Chenyao, D. (1999). Computing the Probability Density Function of the Stable Paretian Distribution, Mathematical and Computer Modelling 29: 235–240. Mittnik, S. and Paolella, M. S. (1999). A simple estimator for the characteristic exponent of the stable Paretian distribution, Mathematical and Computer Modelling 29: 161–176. Mittnik, S., Rachev, S. T., Doganoglu, T. and Chenyao, D. (1999). Maximum Likelihood Estimation of Stable Paretian Models, Mathematical and Computer Modelling 29: 275–293. Nolan, J. P. (1997). Numerical Calculation of Stable Densities and Distribution Functions, Communications in Statistics – Stochastic Models 13: 759–774. Nolan, J. P. (2001). Maximum Likelihood Estimation and Diagnostics for Stable Distributions, in O. E. Barndorff-Nielsen, T. Mikosch, S. Resnick (eds.), L´evy Processes, Brikh¨ auser, Boston. Nolan, J. P. (2010). Stable Distributions – Models for Heavy Tailed Data, Birkh¨ auser, Boston. In progress, Chapter 1 online at academic2.american.edu/∼jpnolan. Ojeda, D. (2001). Comparison of stable estimators, Ph.D. Thesis, Department of Mathematics and Statistics, American University. Paolella, M. S. (2001). Testing the stable Paretian assumption, Mathematical and Computer Modelling 34: 1095–1112. Paolella, M. S. (2007). Intermediate Probability: A Computational Approach, Wiley, Chichester. Peters, G. W., Sisson, S. A. and Fan, Y. (2009). Likelihood-free Bayesian inference for α-stable models, Preprint: http://arxiv.org/abs/0912.4729. Poirot, J. and Tankov, P. (2006). Monte Carlo option pricing for tempered stable (CGMY) processes, Asia-Pacific Financial Markets 13(4): 327-
344.
Prause, K. (1999). The Generalized Hyperbolic Model: Estimation, Financial Derivatives, and Risk Measures, Ph.D. Thesis, Freiburg University, http://www.freidok.uni-freiburg.de/volltexte/15. Press, S. J. (1972). Estimation in Univariate and Multivariate Stable Distribution, Journal of the American Statistical Association 67: 842–846. Protassov, R. S. (2004). EM-based maximum likelihood parameter estimation for multivariate generalized hyperbolic distributions with fixed λ, Statistics and Computing 14: 67–77. Rachev, S. and Mittnik, S. (2000). Stable Paretian Models in Finance, Wiley. Rosinski, J. (2007). Tempering stable processes, Stochastic Processes and Their Applications 117(6): 677–707. Ross, S. (2002). Simulation, Academic Press, San Diego. Samorodnitsky, G. and Taqqu, M. S. (1994). Stable Non–Gaussian Random Processes, Chapman & Hall. Shuster, J. (1968). On the Inverse Gaussian Distribution Function, Journal of the American Statistical Association 63: 1514–1516. Stahl, G. (1997). Three cheers, Risk 10: 67–69. Stute, W., Manteiga, W. G. and Quindimil, M.P. (1993). Bootstrap Based Goodness-Of-Fit-Tests, Metrika 40: 243–256. Venter, J. H. and de Jongh, P. J. (2002). Risk estimation using the Normal Inverse Gaussian distribution, The Journal of Risk 4: 1–23. Weron, R. (1996). On the Chambers-Mallows-Stuck Method for Simulating Skewed Stable Random Variables, Statistics and Probability Letters 28: 165–171. See also R. Weron (1996) Correction to: On the Chambers-Mallows-Stuck Method for Simulating Skewed Stable Random Variables, Working Paper, Available at MPRA: http://mpra.ub.unimuenchen.de/20761/. Weron, R. (2001). Levy–Stable Distributions Revisited: Tail Index > 2 Does Not Exclude the Levy–Stable Regime, International Journal of Modern Physics C 12: 209–223. Weron, R. (2006). Modeling and Forecasting Electricity Loads and Prices: A Statistical Approach, Wiley, Chichester.
Weron, R. (2011). Computationally Intensive Value at Risk Calculations, in J. E. Gentle, W. H¨ ardle, Y. Mori (eds.) Handbook of Computational Statistics, 2nd edition, Springer, Berlin, 911–950. Zolotarev, V. M. (1964). On representation of stable laws by integrals, Selected Translations in Mathematical Statistics and Probability 4: 84–88. Zolotarev, V. M. (1986). One–Dimensional Stable Distributions, American Mathematical Society.
2 Expected shortfall for distributions in finance

Simon A. Broda and Marc S. Paolella
2.1 Introduction

It has been nearly 50 years since the appearance of the pioneering paper of Mandelbrot (1963) on the non-Gaussianity of financial asset returns, and their highly fat-tailed nature is now one of the most prominent and accepted stylized facts. The recent book by Jondeau et al. (2007) is dedicated to the topic, while other chapters and books discussing the variety of non-Gaussian distributions of use in empirical finance include McDonald (1997), Knight and Satchell (2001), and Paolella (2007). In the context of portfolio selection and risk management, the use of risk measures such as the Value-at-Risk (VaR), Expected Shortfall (ES; also known as conditional value-at-risk or mean excess loss, see Chapter 9) and Lower Partial Moments (LPM) is ubiquitous, and the accuracy of their measurement is highly influenced by the choice of distribution (both in an unconditional and conditional context); see, for example, Kuester et al. (2006) and the references therein in the context of VaR.

The calculation of the VaR for a particular return series amounts to calculating one or several quantiles from the predictive distribution; in a fully parametric model, this is readily done provided that a tractable expression for the cumulative distribution function (cdf) of the predictive distribution is available. While the VaR is still in great use, the expected shortfall (ES) is a preferable risk measure; see Dowd (2005) for an excellent survey of the literature and a discussion of its merits compared to the VaR. By way of its definition, the calculation of the ES will almost always involve more computation than the VaR. In this chapter, we present easily-computed expressions for evaluating the ES for
numerous distributions which are now commonly used for modeling asset returns. Some involve straightforward integral exercises; for others, saddlepoint approximations, hereafter SPAs, are developed to quickly and accurately approximate the desired integral. All the results are applicable in a conditional context by allowing for a scale term (which is time-varying according to, say, a GARCH model). This chapter is outlined as follows. Section 2.2 is the longest; subsection 2.2.1 introduces the ES measure and its calculation via the moment generating function, while subsections 2.2.2, 2.2.3 and 2.2.4 consider calculation of the ES for Student's t generalizations, the stable Paretian distribution, and special cases of the generalized hyperbolic distribution, respectively. Section 2.3 presents ES results for finite mixture distributions, and Section 2.4 compares the small-sample distributional properties of the different VaR and ES estimates via a resampling exercise. Section 2.5 considers calculation of the LPM and its relation to the ES. Finally, Section 2.6 details the calculation of the ES for sums of independent random variables for two cases of interest which generalize the Gaussian distribution. The associated working paper, Broda and Paolella (2009a), contains further details and examples.
2.2 Expected shortfall for several asymmetric, fat-tailed distributions

2.2.1 Expected shortfall: definitions and basic results

Let Q_X be the quantile function of r.v. X; that is, Q_X: [0, 1] \to \mathbb{R} with p \mapsto F_X^{\leftarrow}(p), where F_X^{\leftarrow}(p) is the generalized inverse function of F_X defined in the usual way as \inf\{x \mid F_X(x) \ge p\}. The γ-level expected shortfall (ES) of X is defined as

\mathrm{ES}_\gamma(X) \overset{\mathrm{def}}{=} \frac{1}{\gamma} \int_0^\gamma Q_X(p)\, dp.   (2.1)

(Note that the ES and VaR are often taken to be positive numbers; we choose to use the negative sign convention.) Expression (2.1) is equivalent to the more common formulation, also referred to as the tail conditional expectation, when a continuous distribution is used, as is the case herein. In particular, F_X^{\leftarrow}(p) = F_X^{-1}(p), and letting u = Q_X(p) and denoting the γ-quantile of X as
q_{X,\gamma} = Q_X(\gamma), we immediately have

\frac{1}{\gamma}\int_0^\gamma Q_X(p)\, dp = \frac{1}{\gamma}\int_{-\infty}^{q_{X,\gamma}} u\, f_X(u)\, du = E[X \mid X \le q_{X,\gamma}].   (2.2)
Furthermore, integration by parts shows that, if E[X] exists, then we can also write

\mathrm{ES}_\gamma(X) = q_{X,\gamma} - \frac{1}{\gamma}\int_{-\infty}^{q_{X,\gamma}} F_X(x)\, dx,   (2.3)

thus showing that the ES can be expressed as an integral involving the pdf, cdf, or quantile function.

Let Z be a location-zero, scale-one r.v., and let Y = σZ + μ for σ > 0. A simple calculation verifies that the ES preserves location-scale transformations; that is,

\mathrm{ES}_\gamma(Y) = \mu + \sigma\, \mathrm{ES}_\gamma(Z).   (2.4)

Also, it is straightforward to see that, for c < 0 and R ∼ N(0, 1) (standard normal distribution) with pdf φ and cdf Φ, \int_{-\infty}^c r f_R(r)\, dr = -\phi(c). Thus, E[R \mid R \le c] = -\phi(c)/\Phi(c) and

\mathrm{ES}_\gamma(R) = -\phi\left(\Phi^{-1}(\gamma)\right)/\gamma.   (2.5)

Similar to the normal, there are distributions for which the expected shortfall integral (2.2) can be analytically expressed, often in terms of special functions such as gamma and beta. Examples include the Student's t and its extensions such as the JoF and GAt, as detailed below. For other distributions, this will not be possible; besides numeric integration of the expressions in (2.1), (2.2) or (2.3), it might be possible to express the integral in terms of the moment generating function (mgf), if it exists. This could have enormous computational benefits depending on the cost of evaluating the integrands and their behaviors over the range of the integral. Finally, in some cases, it will be possible to avoid numeric integration by using an SPA.

Let X be an absolutely continuous random variable with support S_X, characteristic function \varphi_X, and mgf M_X related by \varphi_X(s) = M_X(is). The pdf of X at x ∈ S_X can be computed by the usual inversion formula

f_X(x) = \frac{1}{2\pi}\int_{-\infty}^{\infty} e^{-itx}\varphi_X(t)\, dt.   (2.6)
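For the standard normal, (2.5) is easily checked against the quantile-based definition (2.1) by direct numerical integration; the following base-R snippet is a minimal sketch of that comparison:

    es_normal_closed <- function(gamma) -dnorm(qnorm(gamma)) / gamma     # eqn. (2.5)
    es_from_quantiles <- function(gamma)                                 # eqn. (2.1)
      integrate(qnorm, lower = 0, upper = gamma)$value / gamma

    c(closed = es_normal_closed(0.01), quantile = es_from_quantiles(0.01))
    # both give approximately -2.665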
With K_X(s) = \log M_X(s) the cumulant generating function (cgf) of X, substituting s = it gives, for c = 0,

f_X(x) = \frac{1}{2\pi i}\int_{c-i\infty}^{c+i\infty} \exp\{K_X(s) - sx\}\, ds,   (2.7)

which, for any c ∈ R, is the well-known Fourier–Mellin integral; see, for example, Schiff (1999, Chap. 4). Substituting (2.7) into the ES integral, say U_X(q) = \int_{-\infty}^q x f_X(x)\, dx, reversing the order of the integrals, integrating the inner integral by parts and restricting c < 0 such that c is in the convergence strip of M_X (so that the real part of s is negative), gives

2\pi i\, U_X(q) = -\int_{c-i\infty}^{c+i\infty} \left(\frac{q}{s} + \frac{1}{s^2}\right) \exp\{K_X(s) - sq\}\, ds.   (2.8)

A further integration by parts gives

2\pi i\, U_X(q) = -\int_{c-i\infty}^{c+i\infty} \exp\{K_X(s) - qs\}\, K_X'(s)\, \frac{ds}{s}.   (2.9)

Observe that the integrand in (2.8) involves less computation and avoids having to develop the expression for K_X'.

For the evaluation of the lower partial moments, U_X(q; 2) = \int_{-\infty}^q x^2 f_X(x)\, dx is useful, that is, T_{2,q}(X) in the notation of Section 2.5. A similar derivation as for (2.8) and (2.9) yields the expressions

2\pi i\, U_X(q; 2) = -\int_{c-i\infty}^{c+i\infty} \exp\{K_X(s) - sq\} \left(\frac{q^2}{s} + \frac{2q}{s^2} + \frac{2}{s^3}\right) ds   (2.10)

                 = -\int_{c-i\infty}^{c+i\infty} (1 + sq) \exp\{K_X(s) - qs\}\, K_X'(s)\, \frac{ds}{s^2},   (2.11)

with those for U_X(q; 3), etc., similarly derived.
2.2.2 Student’s t and extensions
Throughout, B_z(a, b) = \int_0^z x^{a-1}(1-x)^{b-1}\, dx for z ∈ [0, 1] is the incomplete beta function, B(a, b) = B_1(a, b) is the beta function, and \bar{B}_z(a, b) = B_z(a, b)/B(a, b) is the incomplete beta ratio.
Let T ∼ t_n(μ, c) indicate a Student's t random variable with location μ and scale c. The standard (location-zero, scale-one) Student's t pdf with n degrees of freedom is given by

f_T(x; n) = \frac{n^{-1/2}}{B\left(\frac{n}{2}, \frac{1}{2}\right)} \left(1 + x^2/n\right)^{-\frac{n+1}{2}}.   (2.12)

A basic calculation shows that the tail component \mathrm{Ttail}(c, n) \overset{\mathrm{def}}{=} \int_{-\infty}^c x f_T(x; n)\, dx is

\mathrm{Ttail}(c, n) = -f_T(c; n) \left(\frac{n + c^2}{n - 1}\right),   (2.13)

and \mathrm{ES}_\gamma(T; n, 0, 1) = \gamma^{-1}\,\mathrm{Ttail}(q_{T,\gamma}, n), where q_{T,\gamma} is the γ-quantile of T ∼ t_n. Note that, as n → ∞, the expression approaches that based on the normal distribution.
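In R, the closed form (2.13) amounts to one line and can be checked against a direct tail-expectation integral; the sketch below uses only base-R functions:

    es_t <- function(gamma, n) {                   # ES via eqn. (2.13)
      q <- qt(gamma, df = n)
      -dt(q, df = n) * (n + q^2) / (n - 1) / gamma
    }
    es_t_num <- function(gamma, n)                 # ES via direct integration, eqn. (2.2)
      integrate(function(x) x * dt(x, df = n), -Inf, qt(gamma, df = n))$value / gamma

    c(es_t(0.01, 5), es_t_num(0.01, 5))            # the two values should agree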
There are numerous ways of extending the Student's t distribution to support asymmetry, not all of which possess closed-form cdf or ES expressions. We consider several such extensions; as shown in Figure 2.1, plots of their ES as a function of their asymmetry parameter, for comparable degrees of freedom parameters, show rather different behavior.

Jones and Faddy skewed t

We start with the asymmetric generalization of the Student's t distribution from Jones and Faddy (2003), which we abbreviate as JoF. The density of S ∼ JoF(a, b) is, for a, b ∈ R_{>0},

f_{JoF}(t; a, b) = C_{a,b} \left(1 + \frac{t}{y_t}\right)^{a+1/2} \left(1 - \frac{t}{y_t}\right)^{b+1/2},   (2.14)

where y_t = \sqrt{a + b + t^2} and C_{a,b} = \left\{B(a, b)\,(a+b)^{1/2}\, 2^{a+b-1}\right\}^{-1}. Let S ∼ JoF(a, b). If a < b (a > b), then S is negatively (positively) skewed, while S ∼ t_{2a} if a = b. Via its relation to a beta random variable, they show that the cdf is F_{JoF}(t; a, b) = \bar{B}_y(a, b), where 2y = 1 + t\left(t^2 + a + b\right)^{-1/2}, and that the rth moment, if it exists, is

E[S^r] = \frac{(a+b)^{r/2}}{B(a, b)} \sum_{j=0}^{r} \binom{r}{j} (-1)^j\, 2^{-j}\, B\left(a + r/2 - j,\ b - r/2\right),
with

E[S] = (a - b)\,\frac{(a+b)^{1/2}}{2}\, \frac{\Gamma(a - 1/2)\,\Gamma(b - 1/2)}{\Gamma(a)\,\Gamma(b)},   (2.15)

from which it is clear that, for a = b, E[S] = 0. Calculation then reveals that, with 2y = 1 + c(c^2 + a + b)^{-1/2},

\mathrm{JFtail}(c, a, b) \overset{\mathrm{def}}{=} \int_{-\infty}^{c} s f_S(s; a, b)\, ds = \frac{\sqrt{a+b}}{B(a, b)} \left\{ B_y(a + 1/2,\ b - 1/2) - \frac{1}{2} B_y(a - 1/2,\ b - 1/2) \right\},   (2.16)

from which the ES can be computed. Figure 2.1 shows the 1% ES for the JoF(a, b) distribution as a function of b, for three values of a. Note that the case with a = b corresponds to the Student's t distribution with 2a degrees of freedom and that, unlike the other distributions under consideration, the ES decreases with increasing b because the density is negatively skewed for a < b.

Generalized asymmetric t

The so-called generalized asymmetric t distribution, abbreviated GAt, is given by

f_{GAt}(z; d, \nu, \theta) = C_{d,\nu,\theta} \times \begin{cases} \left[1 + \dfrac{(-z\cdot\theta)^d}{\nu}\right]^{-(\nu + \frac{1}{d})}, & \text{if } z < 0, \\[1ex] \left[1 + \dfrac{(z/\theta)^d}{\nu}\right]^{-(\nu + \frac{1}{d})}, & \text{if } z \ge 0, \end{cases}   (2.17)

d, ν, θ ∈ R_{>0}, where C_{d,\nu,\theta} = \left\{ (\theta^{-1} + \theta)\, d^{-1}\, \nu^{1/d}\, B\left(\frac{1}{d},\ \nu\right) \right\}^{-1}. The Student's t distribution with n degrees of freedom is the special case d = 2, ν = n/2 and θ = 1, together with the introduction of a scale term. That is, with n > 0 and Z ∼ GAt(2, n/2, 1), it is easy to confirm that X = \sqrt{2}\, Z ∼ t_n. For θ > 1 (θ < 1) the distribution is skewed to the right (left), while for θ = 1, it is symmetric. Both parameters ν and d dictate the fatness of the tails, with ν playing a similar role as the degrees of freedom in the Student's t, and d having a much more pronounced effect on the peakedness of the mode.
Along with expressions for the cdf and the rth integer moment of Z ∼ GAt, basic calculations show that, for c < 0 and 0 ≤ r < νd,

E[Z^r \mid Z < c] = (-1)^r \nu^{r/d}\, \frac{(1+\theta^2)\, B_L\left(\nu - r/d,\ (r+1)/d\right)}{(\theta^r + \theta^{r+2})\, B_L\left(\nu,\ 1/d\right)},

where L = \nu/\{\nu + (-c\theta)^d\}, from which the ES can be computed. Figure 2.1 shows the ES as a function of θ for d = 2 and three values of ν, scaled as discussed below (2.17). Thus, at θ = 1, the ES coincides with that from a t(2ν) distribution.
Azzalini skew t

Azzalini (1985) considers the asymmetric generalization of the normal distribution with pdf given by

f_{SN}(z; \lambda) = 2\phi(z)\Phi(\lambda z),  \qquad  \lambda \in \mathbb{R},   (2.18)

and referred to as the skew normal. Here, φ and Φ refer to the standard normal pdf and cdf, respectively. Clearly, for λ = 0, f_SN reduces to the standard normal density, while for λ ≠ 0, it is skewed. In light of how a Student's t random variate is defined, it seems natural to define a skew t random variable X as X = S/Y, where S ∼ SN(λ) and Y = \sqrt{\chi^2_m/m}. This is indeed the univariate version of the structure proposed and studied in Azzalini and Capitanio (2003), and leads to the density expression

f_X(x; \lambda, m) = 2\phi_m(x)\, \Phi_{m+1}\left( \lambda x \sqrt{\frac{m+1}{x^2+m}} \right),   (2.19)

where φ_m(x) and Φ_m(x) denote the pdf and cdf, respectively, of the Student's t distribution at x with m degrees of freedom. Numeric integration can be used with the expressions in (2.2) and (2.19) to compute the ES, using, say, the 10^{-6} quantile as the lower integration bound. Figure 2.1 shows the 1% ES for the skew t as a function of the asymmetry parameter λ, for three different degrees of freedom.

Noncentral Student's t

The use of the (singly) noncentral t distribution in the context of modeling asset returns was advocated and popularized by Harvey and Siddique (1999). There
are several expressions for the density, one of which, useful for computation, is

f_T(t; k, \mu) = \frac{\Gamma\left((k+1)/2\right) k^{k/2}}{\sqrt{\pi}\,\Gamma(k/2)}\, \frac{e^{-\mu^2/2}}{(k+t^2)^{(k+1)/2}} \sum_{i=0}^{\infty} \frac{(t\mu)^i}{i!} \left( \frac{2}{t^2+k} \right)^{i/2} \frac{\Gamma\{(k+i+1)/2\}}{\Gamma\{(k+1)/2\}},   (2.20)
where k ∈ R_{>0} is the degrees of freedom and μ ∈ R is the noncentrality parameter which dictates the asymmetry; see Paolella (2007, Sec. 10.4) and the references therein. Due to the computational burden of the density calculation of the (singly) noncentral t involved in the model of Harvey and Siddique (1999), León et al. (2005) advocated use of distributions which are easier to calculate. This critique is particularly valid in the empirical examples shown below, in which the MLE needs to be repeatedly calculated in a resampling exercise, requiring a huge number of density evaluations. This issue, however, can be solved by using the SPA to the density (and cdf). It is a trivially computed, closed-form expression whose evaluation is about 275 times faster than evaluation of (2.20), even with an intelligent, parameter-driven choice of the upper limit of the sum. The SPA delivers an accuracy of at least three significant digits across the whole support and parameter range. This is more than enough to ensure that virtually any form of inference in empirical finance, such as model estimation, density and risk forecasting, will have the same order of accuracy as use of the exact pdf, even if the true innovations distribution were noncentral t, which obviously is not the case. Indeed, we show below in Section 2.4 that the choice of model and choice of data window have an enormous impact on inference for the ES, while, as illustrated in Figure 2.1 below, the impact of using the SPA instead of the true pdf is several orders of magnitude smaller. See Broda and Paolella (2007) for the derivation and formulae for the SPA of the (singly and doubly) noncentral t pdf and cdf, a detailed assessment of accuracy, and a comparison to numerous other approximations previously proposed in the literature. The SPA expression for the pdf can then be used to calculate the ES via (2.2) and numeric integration to a high degree of accuracy, and far faster than use of the exact pdf. Figure 2.1 shows the 1% ES of the noncentral t as a function of the asymmetry parameter μ, for three different degrees of freedom. These were all computed using (2.2) and the SPA pdf. To assess its accuracy, we overlay as a dashed line the ES for k = 2 degrees of freedom (worst case) computed
using the exact pdf; we see that it is graphically essentially indistinguishable from the calculation using the SPA. It is possible to approximate the ES of the noncentral t without any numeric integration: a trivially computed and highly accurate SPA to the ES in this case, and also for the Azzalini skewed t, is studied in Broda and Paolella (2009b). Because an mgf is not available for the latter two distributions, the development of the ES SPA is not as straightforward as that used below for the skew-normal and NIG.
2.2.3 ES for the stable Paretian distribution

We consider the stable Paretian distribution with tail index α restricted to 1 < α ≤ 2, and asymmetry parameter β ∈ [−1, 1]. Location and scale terms, say μ and σ, can be incorporated in the usual way, and we write X ∼ S_{α,β}(μ, σ). It should be kept in mind that σ² is not the variance; the second moment does not exist if α < 2. Several parametrizations of the characteristic function exist; we use that given in Samorodnitsky and Taqqu (1994): for 1 < α ≤ 2 and X ∼ S_{α,β}(0, 1),

\log \varphi_X(t) = -|t|^\alpha \left\{ 1 - i\beta\, \mathrm{sgn}(t) \tan\frac{\pi\alpha}{2} \right\}.   (2.21)

Clearly, the mgf of the stable distribution for α < 2 does not exist, and so (2.8) cannot be used. However, Stoyanov et al. (2006) give a numerically useful integral expression to compute the ES. For α > 1, S ∼ S_{α,β}(0, 1), 0 < γ < 1 and q_{S,\gamma} = F_S^{-1}(\gamma),

\mathrm{ES}_\gamma(S; \alpha, \beta, 0, 1) = \frac{1}{\gamma}\, \mathrm{Stoy}(q_{S,\gamma}, \alpha, \beta),

where the tail component \mathrm{Stoy}(c, \alpha, \beta) = \int_{-\infty}^{c} x f_S(x; \alpha, \beta)\, dx is

\mathrm{Stoy}(c, \alpha, \beta) = \frac{\alpha}{\alpha-1}\, \frac{|c|}{\pi} \int_{-\bar\theta_0}^{\pi/2} g(\theta) \exp\left\{ -|c|^{\alpha/(\alpha-1)} v(\theta) \right\} d\theta,   (2.22)

g(\theta) = \frac{\sin\left(\alpha(\bar\theta_0+\theta) - 2\theta\right)}{\sin\left(\alpha(\bar\theta_0+\theta)\right)} - \frac{\alpha \cos^2(\theta)}{\sin^2\left(\alpha(\bar\theta_0+\theta)\right)},

v(\theta) = \left(\cos\alpha\bar\theta_0\right)^{1/(\alpha-1)} \left(\frac{\cos(\theta)}{\sin\left(\alpha(\bar\theta_0+\theta)\right)}\right)^{\alpha/(\alpha-1)} \frac{\cos\left(\alpha(\bar\theta_0+\theta) - \theta\right)}{\cos(\theta)},
Figure 2.1: The 1% ES for several asymmetric, fat-tailed distributions as a function of their corresponding asymmetry parameters. Panels: the GAt as a function of θ (ν = 1, 2, 4); the Jones and Faddy skew t as a function of b (a = 1, 2, 4); the Azzalini skew t as a function of λ (df = 2, 4, 8); and the noncentral t as a function of μ (df = 2, 4, 8).
and

\bar\theta_0 = \frac{1}{\alpha} \arctan\left( \bar\beta \tan\frac{\pi\alpha}{2} \right)  \quad\text{and}\quad  \bar\beta = \mathrm{sign}(c)\, \beta.

The formulae for β̄ and Stoy(c, α, β) have a minus sign in front in Stoyanov et al. (2006) because they use the positive sign convention for the VaR and ES. The integrand is "well behaved" (not oscillatory) and, because it is a definite integral, is very quickly and reliably evaluated.
2.2 Expected shortfall for several asymmetric, fat-tailed distributions
67
0
−5
−10
α=1.6 α=1.8 α=2.0
−15
−20
−25 −1
−0.5
0
0.5
1
Figure 2.2: The 1% ES for the asymmetric stable Paretian distribution with tail index α, as a function of asymmetry parameter β. Recall that, as α approaches 2, β no longer affects the shape of the density, and so the ES is constant, and coincides with that of a N(0,√ 2) random variable. In particular, from (2.4) and (2.5) with σ = 2, this is −3.769.
2.2.4 Generalized hyperbolic and its special cases The generalized hyperbolic distribution, or GHyp, is an extremely flexible class arising as a mean-variance mixture of normal distributions. While it can be successfully used for modeling asset returns, it is so flexible that one of its parameters, λ, tends to be somewhat redundant; as such, it is more common to use one of its many special cases. We consider four such cases, all of which are fat-tailed and possibly skewed, and have been used in the empirical finance literature. An extensive discussion and full details on the construction of the GHyp and its special cases is given in Paolella (2007, Chap. 9); Chen et al. (2008), Bibby and Sørensen (2003) and Weron (2004) discuss their use in finance and Value-at-Risk calculation. Here, we concentrate on the behavior of the ES and, below, consider a fast approximate method for calculating the ES of convolutions of one of the special cases, the NIG.
68
2 Expected shortfall
For ν ∈ R, let def
Bν (x) =
1 2
∞
−1
tν−1 e− 2 x(t+t 1
)
dt,
x > 0,
0
which is the modified Bessel function of the third kind with index ν. In Matlab, Bv (x) is easily computed with the built–in function besselk(v,x). For convenience, let def
∞
kλ (χ, ψ) =
0
λ/2 1 χ xλ−1 exp − (χx−1 + ψx) dx = 2 Bλ χψ . 2 ψ
Then, the GHyp density can then be compactly represented as
kλ− 12 (x − μ)2 + δ 2 , α2 β(x−μ) fGHyp (x; λ, α, β, δ, μ) = √ e . 2π kλ (δ 2 , α2 − β 2 )
(2.23)
We now consider four special cases of interest for modeling asset returns. Normal inverse Gaussian From (2.23), the NIG takes λ = −1/2, α > 0, β ∈ (−α, α), δ > 0, μ ∈ R, and def
NIG(α, β, δ, μ) = GHyp(−1/2, α, β, δ, μ)
(2.24)
with fNIG (x; α, β, δ, μ) given by eδ
√
α2 −β 2
π
αδ δ 2 + (x − μ)2
B1 α δ 2 + (x − μ)2 eβ(x−μ) .
(2.25)
If X ∼ NIG(α, β, δ, μ), then E[X] = μ + βη, where η = δ/
Var(X) = η + β 2
η2 , ω
μ3 (X) = 3β
η2 η3 + 3β 3 2 , ω ω
(2.26)
α2 − β 2 and ω = δ α2 − β 2 .
Like most of the interesting special cases of the generalized hyperbolic, the NIG does not admit a closed-form, easily-computed expression for its exact cdf, and numeric integration of the pdf needs to be used. Given that the pdf is
2.2 Expected shortfall for several asymmetric, fat-tailed distributions
69
available in essentially closed-form (using the fast built-in Bessel functions in, for example, Matlab), Newton’s method can be used to iterate on qn+1 = qn −
FX (qn ; α, β, δ, μ) − γ , fX (qn ; α, β, δ, μ)
X ∼ NIG (α, β, δ, μ) ,
to determine the relevant γ–level quantile q for computation of the VaR and ES. As a starting value, assuming γ is 0.01, we can use the mean minus three times the standard deviation, calculated from the quantities in (2.26). This is about five times faster than simply applying a root-search algorithm to the cdf. Hyperbolic asymmetric t This takes λ < 0, β ∈ R, α = |β|, δ > 0, μ ∈ R, and def
HAt(n, β, μ, δ) = GHyp(λ, |β|, β, δ, μ), where n = −2λ. For β = 0 and yx = δ 2 + (x − μ)2 , the pdf is −n+1
2 2 δn fHAt (x; n, β, μ, δ) = √ πΓ(n/2)
yx |β|
− n+1 2
B− n+1 (|β| yx ) eβ(x−μ) . 2
As β → 0 and setting δ 2 = n, it reduces to the Student’s t pdf with n degrees of freedom. Hyperbolic This takes λ = 1, α > 0, β ∈ (−α, α), δ > 0, μ ∈ R, and def
Hyp(α, β, δ, μ) = GHyp(1, α, β, δ, μ) with fHyp (x; α, β, δ, μ) given by α2 − β 2 exp −α δ 2 + (x − μ)2 + β(x − μ) . 2αδ B1 (δ α2 − β 2 ) Variance Gamma This takes λ > 0, α > 0, β ∈ (−α, α), δ = 0, μ ∈ R, and def
VG(λ, α, β, μ) = GHyp(λ, α, β, 0, μ),
70
2 Expected shortfall
1% ES of HAt as a function of β
1% ES of NIG as a function of β 0
−4 −5 −6 −10 −8 −15
−10
−25
−14 df=2 df=3 df=4 df=8
−16 −18 −20
α=1 α=2 α=4
−20
−12
−0.2
−0.1
0
0.1
0.2
−30 −35 −40
−3
−2
−1
0
1
2
3
Figure 2.3: The 1% ES for two special cases of the generalized hyperbolic distribution as a function of their corresponding asymmetry parameters. √ In both cases, μ = 0. For the HAt, δ is set to n, in which case the distribution coincides with the Student’s t with n degrees of freedom as β → 0. For NIG, δ = 1. For HAt, β can be any real number; for NIG, β ∈ (−α, α). with fVG (x; λ, α, β, μ) given by 2 2 λ λ− 12 2 α −β 2 |x − μ| √ Bλ− 1 (α|x − μ|)eβ(x−μ) . 2 α 2πΓ(λ) Figure 2.3 shows the 1% ES, computed with numeric integration, for the HAt and NIG distributions. Those for Hyp and VG look very similar to the NIG and are not shown.
2.3 Mixture distributions 2.3.1 Introduction Random variable Y is said to have a (univariate) k-component normal mixture distribution, in short, Y ∼ MixN(μ, σ, λ), if its unconditional density is given
2.3 Mixture distributions
71
by fY (y) =
k
λj φ y; μj , σj2 ,
(2.27)
j=1
where μ = (μ1 , . . . , μk ), σ = (σ1 , . . . , σk ), λ = (λ1 , . . . , λk ), λj > 0, j = k 1, . . . , k, j=1 λj = 1, are the mixing weights and φ y; μj , σj2 is the normal density.
2.3.2 Expected shortfall for normal mixture distributions We present the general framework of ES with finite mixtures of continuous distributions in this section, using the normal as an illustrative case, given its prominence in empirical work. The applications to mixtures of non-normal distributions, such as those discussed above, follow in a straightforward manner, and we just mention two potentially useful cases with the symmetric stable and Student’s t. Let X ∼ MixN(μ, σ, λ), with components Xj ∼ N(μj , σj2 ), j = 1, . . . , k. From (2.27), the cdf of X is easily seen to be FX (x; μ, σ, λ) =
k
k λi Φ x; μi , σi2 = λi Φ
i=1
i=1
x − μi , 0, 1 . σi
The γ–quantile of X, qX,γ , can be found numerically by solving equation γ − FX (qX,γ ; μ, σ, λ) = 0. The ES can be computed directly from the definition, using numeric integration (and replacing −∞ with, say, −100). This is easy to implement and fast to compute. However, a bit of algebra shows that the ES can be expressed in other forms which are more convenient for numerical calculation and also interpretation. def
With Z ∼ N (0, 1) and letting cj = (qX,γ − μj ) /σj , basic calculation leads to cj cj k 1 ESγ (X; μ, σ, λ) = λj σj zfZ (z) dz + μj fZ (z) dz . γ j=1 −∞ −∞ (2.28) Thus, using (2.5) and (2.28), we can write, for X ∼ MixN(μ, σ, λ), k λj Φ (cj ) φ (cj ) ESγ (X; μ, σ, λ) = μj − σ j . γ Φ (cj ) j=1
(2.29)
72
2 Expected shortfall
Recalling (2.4) and (2.5), this has the appearance of a weighted sum of the component ES values, but notice that μj − σj φ (cj ) /Φ (cj ) is not ESγ (Xj ) because cj = (qX,γ − μj ) /σj = qXj ,γ − μj /σj = qZ,γ = Φ−1 (γ) . We could write ESγ (X; μ, σ, λ) =
k
ωj ESγ (Xj )
(2.30)
j=1
for def
ωj = λj
Φ (cj ) μj − σj φ (cj ) /Φ (cj ) , γ μj − σj φ (qZ,γ ) /Φ (qZ,γ )
which, if all μj = 0, reduces to ωj = λj φ (cj ) /φ (qZ,γ ). It then suggests itself k to let ωj∗ = ωj / j=1 ωj , so that the ωj∗ can be interpreted as the fraction of the ES attributed to component j.
2.3.3 Symmetric stable mixture Let X ∼ MixStab(α, μ, σ, λ), with components Xj ∼ Sαj (μj , σj ), j = 1, . . . , k, and α = (α1 , . . . , αk ). The pdf of X is fX (x; α, μ, σ, λ) =
k
λj fS (x; αj , μj , σj ),
(2.31)
j=1
where fS (x; α, μ, c) is the location-μ, scale-c, symmetric stable Paretian density function with tail index α. If, as in most applicationsin finance, all the αi > 1, then the μi are the individual means, and E[X] = ki=1 λi μi . Similar to the mixed normal case, the cdf of X is FX (x) =
k j=1
λj FS
x − μj ; αj , 0, 1 , σj
where FS (·; α, 0, 1) is the cdf of the standard symmetric stable distribution, evaluated using the methods discussed above. The γ-quantile of X, qX,γ , can be found by numerically solving FX (qX,γ ) − γ = 0.
2.4 Comparison study
73
Let X ∼ MixStab(α, μ, σ, λ) with qX,γ the γ-quantile of X for some γ ∈ (0, 1). def
Then, similar to (2.28), with S ∼ Sα (0, 1) and cj = (qX,γ − μj ) /σj , ESγ (X; α, μ, σ, λ) =
k 1 λj [σj Stoy (cj , αj ) + μj FS (cj ; αj )] . γ
(2.32)
j=1
As in (2.30), we can also write ESγ (X; α, μ, σ, λ) = ωj = and interpret ωj∗ = ωj / ponent j.
k j=1
ωj ESγ (Xj ; αj ) for
λj σj Stoy (cj , αj ) + μj FS (cj ; αj ) , γ ESγ (Xj ; αj ) k j=1
ωj as the fraction of the ES attributed to com-
2.3.4 Student’s t mixtures Similar to the previous mixture distributions, let X ∼ MixT(v, μ, σ, λ), with pdf k fX (x; v, μ, σ, λ) = λj fT (x; vj , μj , σj ), j=1
where fT (x; v, μ, c) is the location-μ, scale-c, student’s t density function with v degree of freedom and v = (v1 , . . . , vk ). Similar to (2.28) and (2.32), with def cj = (qX,γ − μj ) /σj and using Ttail from (2.13), ESγ (X; v, μ, σ, λ) =
k 1 λj [σj Ttail (cj , vj ) + μj FT (cj ; vj )] . γ j=1
(2.33)
2.4 Comparison study The parametric structure imposed for estimating the ES will certainly affect its value. For example, when working with daily asset returns data, we expect that use of the normal distribution will vastly underestimate the true value, while the stable Paretian distribution (and mixtures of stables), with their extremely fat tails, might tend to overestimate it (see Chapter 1. In this section, we study the behavior of the estimated ES with respect to all distributions
74
2 Expected shortfall
discussed above, using both simulated and real data, for the 1% quantile (γ = 0.01). In particular, for a given data set and distribution, we compute the maximum likelihood estimate (MLE) of the parameters and, from the formulae and methods detailed above, the corresponding 1% VaR and ES. These give just point estimates; more interestingly is their small-sample distribution taking parameter uncertainty into account. To approximate this, we resample (in the usual way, with replacement) the series 100 times and, for each resampled series, compute the MLE and the corresponding VaR and ES. The resulting sampling distribution of the VaR and ES can then be graphically shown via a boxplot, and compared across the candidate probability distributions (normal, Student’s t, etc.). To facilitate the comparison, the same 100 resampled series are used for each distribution. In addition, a nonparametric estimator is used which just takes the VaR to be the 1% sample quantile and the ES to be the average of those observations falling below the sample VaR. The first example uses simulated data. Figure 2.4 is the first, and shows the resulting boxplots when the original data consists of 10,000 iid observations from the generalized asymmetric t (GAt ) with parameters d = 1.71, v = 1.96, θ = 0.879, location μ = 0.175 and scale c = 1.21. These values correspond to the MLE based on the 500 daily percentage returns of the DAX index from 27 Nov. 2006 to 24 Oct. 2008, and so can be considered as typical for a stock index during recent periods of relatively high turbulence. Also note that The abbreviations in Figure 2.4 refer to, from left to right, nonparametric, normal, Student’s t, Jones and Faddy skewed t, generalized asymmetric t, Azzalini skewed t, noncentral t, asymmetric stable Paretian, normal inverse Gaussian, hyperbolic asymmetric t, two-component normal mixture, two-component Student’s t mixture, and two-component symmetric stable mixture. We see from the upper panel that the distribution of the sample VaR for the GAt (the correct parametric specification) is virtually symmetric and centered on the true theoretical value of −4.71, shown as the horizontal dashed line. The only other estimator which contains the true value within its lower and upper quartiles is the nonparametric estimator, though its variability is considerably higher than that of the GAt. Observe that, for iid samples of size 10,000, one would expect that the nonparametric estimator of the 1% quantile should be reasonably unbiased. All other parametric specifications for the VaR are either downwards or upwards biased, the most extreme being the normal, which highly underestimates risk, and the stable Paretian model, which greatly overestimates it. For the ES, with true value −7.11 shown as the horizontal dashed line, the behavior of the various distributions is very similar to the VaR, with the GAt and nonparamet-
2.4 Comparison study
75
ric being the most accurate, and the normal (stable Paretian) underestimating (overestimating) the most. As expected, the Student’s t model underestimates risk because it cannot capture the asymmetry, while all the other skewed t generalizations (except the two component Student’s t mixture) overestimate the risk. Note that the sample size of 10,000 allows us to draw conclusions about the large-sample behavior of both the correct and misspecified models, but in small samples such as 250 (corresponding to about one year of daily data and often used in practice), the particular features of the sample at hand will strongly affect the performance of the various estimators and general conclusions are more difficult to draw. We now apply the methods to real data. Figure 2.5 shows the daily levels and percentage returns of the DAX index, from Monday, 28 Sept. 1998 to Friday, 24 Oct. 2008, yielding 2,630 daily returns. Figure 2.6 shows the result of the resampling exercise. The Jones/Faddy, Azzalini skew t and HAt behave rather similarly, as they did in the previous simulations, with median ES around −7. Rather different behavior compared to the simulations is shown by the noncentral t, whose distribution is tightly centered around an ES of only −5. The nonparametric, GAt, NIG and mixed t distributions are relatively close, with median ES around −6. Similar to the results for the simulated series, the normal and mixed normal yield the smallest ES while the stable and mixed stable yield the largest. Figure 2.7 is similar, but just uses the last 250 DAX returns. In this case, the ES distributions are much wider and differ markedly from the nonparametric (and normal) estimates, with distinctly more variation appearing for the GAt, stable, and HAt. Clearly, the return distribution changes with time, and inference will change based on the size of the chosen data window length. One way to account for this is to perform the resampling on the entire data series, but such that a higher probability of being selected is given to more recent observations. We do this by assigning a probability proportional to (T − t + 1)ρ−1 , where T is the length of the series, t is the index of the series, t = 1, . . . , T , and ρ is the parameter dictating how much relative probability is assigned to more recent observations, with ρ = 1 resulting in the equal probability case, ρ < 1 assigning more to recent observations, and ρ > 1 assigning less. The indices (values between 1 and T ) to extract the resampled series are no longer drawn from a discrete uniform distribution, but rather a multinomial, with probabilities given by the renormalized vector (T − t + 1)ρ−1 . Figure 2.8 shows the resulting boxplots, for the DAX returns data, over a range of ρ values, using just the
76
2 Expected shortfall
Jones and Faddy skew t distribution. The case ρ = 1 coincides with the Jones and Faddy boxplot in Figure 2.6. Otherwise, the behavior of the VaR and ES distributions is as we expect: As ρ decreases from one, recent returns are more likely to be in the resampled series, and the VaR and ES distributions become wider and have lower means, as in Figure 2.7. Obviously, the choice of ρ is subjective; but such choices are ubiquitous as with γ, the amount of data, etc..
2.5 Lower partial moments If it exists, the nth order lower partial moment with respect to reference point c is, for n ∈ N, c
LPMn,c (X) =
−∞
n
(c − x) fX (x) dx.
This is an important measure for financial portfolio risk with many advantages over the traditional measure (variance); see, for example, Barbosa and Ferreira (2004) and the references therein. It can be computed with numeric integration, though for both the normal and for fat-tailed distributions, choosing the lower bound on the integral can be problematic. From the binomial theorem, we can write n LPMn,c (X) = Kh,cTh,c (X) , (2.34) n
h=0
def c with Kh,c = Kh,c (n) = h cn−h (−1) and Th,c (X) = −∞ xh fX (x) dx. For Z ∼ N (0, 1) and c < 0, calculation shows that, for h ∈ N, (−1)h 2h/2−1 h+1 h+1 √ Th,c (Z) = Γ − Γc2 /2 , (2.35) 2 2 π def
h
where Γx (a) is the incomplete gamma function. In particular, T0,c (Z) = Φ (c) and T1,c (Z) = −φ (c) as in (2.5). For X ∼ tv with density (2.12), we have, for h < v and w = {c2 /v}/{1 + c2 /v}, h (−1) v h/2 h+1 v−h h+1 v−h v 1 B Th,c (X; v) = , − Bw , , (2.36) 2 2 2 2 2B 2 , 2 and Bw is the incomplete beta function. In particular, T0,c (X; v) = FX (c; v) = Φv (c) and T1,c (X; v) = φv (c) v + c2 / (1 − v) as in (2.13). Using this expression for Th,c (X; v), (2.34) can be quickly and accurately evaluated without the aforementioned numeric integration problem.
2.5 Lower partial moments
77
Bootstrap distribution of the Value at Risk
−3.5
−4
Values
−4.5
−5
−5.5
−6
Nonp
Norm
Stud
JoFa
GAt
Azzt
Nonc
Stab
NIG
HAt
MixN
Mixt
MixS
MixN
Mixt
MixS
Bootstrap distribution of the Expected Shortfall −4
−6
Values
−8
−10
−12
−14
−16
−18 Nonp
Norm
Stud
JoFa
GAt
Azzt
Nonc
Stab
NIG
HAt
Figure 2.4: Boxplots of VaR and ES for γ = 0.01 computed from MLE estimates for 100 resampled seriesgenerated from generalized asymmetric t (GAt ) with parameters d = 1.71, v = 1.96, θ = 0.879, location μ = 0.175 and scale c = 1.21.
78
2 Expected shortfall
DAX Index 8000
7000
6000
5000
4000
3000
1998
2000
2001
2002
2004
2005
2006
2008
2006
2008
DAX Percentage Returns 10 8 6 4 2 0 −2 −4 −6 −8 1998
2000
2001
2002
2004
2005
Figure 2.5: The levels and percentage returns of the DAX index.
2.5 Lower partial moments
79
Bootstrap distribution of the Value at Risk −3.5
−4
Values
−4.5
−5
−5.5
−6
−6.5 Nonp
Norm
Stud
JoFa
GAt
Azzt
Nonc
Stab
NIG
HAt
MixN
Mixt
MixS
MixN
Mixt
MixS
Bootstrap distribution of the Expected Shortfall −4
−6
Values
−8
−10
−12
−14
−16
−18 Nonp
Norm
Stud
JoFa
GAt
Azzt
Nonc
Stab
NIG
HAt
Figure 2.6: Same as in Figure 2.4 but using the 2,630 daily percentage returns on the DAX index.
80
2 Expected shortfall
Bootstrap distribution of the Value at Risk −3
−4
−5
Values
−6
−7
−8
−9
−10
−11
−12 Nonp
Norm
Stud
JoFa
GAt
Azzt
Nonc
Stab
NIG
HAt
MixN
Mixt
MixS
MixN
Mixt
MixS
Bootstrap distribution of the Expected Shortfall −5
−10
−15
Values
−20
−25
−30
−35
−40
−45
Nonp
Norm
Stud
JoFa
GAt
Azzt
Nonc
Stab
NIG
HAt
Figure 2.7: Same as in Figure 2.6 but using just the last 250 daily returns from the DAX, corresponding to Monday, 12 Nov. 2007 to Friday, 24 Oct. 2008.
2.5 Lower partial moments
81
Weighted Bootstrap distribution of the Value at Risk −4
−5
Values
−6
−7
−8
−9
−10 0.4
0.5
0.6
0.7
0.8
0.9
1
1.1
1.2
1.1
1.2
Weighted Bootstrap distribution of the Expected Shortfall −6
−8
−10
Values
−12
−14
−16
−18
−20
−22 0.4
0.5
0.6
0.7
0.8
0.9
1
Figure 2.8: Boxplots of the VaR and ES for γ = 0.01 based on the 2,630 daily percentage returns on the DAX index and the Jones and Faddy skewed t distribution, based on 200 weighted resamples, with each boxplot using a different weight, ρ, from 0.4 to 1.2. As a benchmark, the horizonal dashed lines indicate the estimated VaR and ES, respectively, of the DAX return series under the Jones and Faddy skewed t assumption.
82
2 Expected shortfall
As with the ES, for many distributions, analytical expressions such as (2.35) and (2.36) will not exist, but if the mgf exists, (2.10) can be used. If X ∼ MixN (μ, σ, λ) with k components, then LPMn,c (X) can be computed from (2.34) and (2.35), where Th,c (X) is, similar to the derivation of (2.28), given by h h m h−m Th,c (X) = λj σ j μj Tm,(c−μj )/σj (Z) , m m=0 j=1 k
(2.37)
for Z ∼ N(0, 1). In particular, LPMn,c (X) =
n h=0
Kh,c
k j=1
λj
h h m h−m σ j μj Tm,(c−μj )/σj (Z) , m m=0
(2.38)
with Tm,(c−μj )/σj (Z) given in (2.35). An important special case is for k = 1, which corresponds to X ∼ N(μ, σ2 ). Similarly, if X ∼ MixT (v, μ, σ, λ) and h < min vj , the same result holds, just replace Tm,(c−μj )/σj (Z) in (2.37) and (2.38) with Tm,(c−μj )/σj (Y ; vj ) from (2.36), with Y ∼ tvj .
2.6 Expected shortfall for sums The returns on the set of assets constituting a portfolio are blatantly not independent; however, the method of independent components analysis (ICA) is well-suited for separating the non-Gaussian data into a set of independent time series. An excellent textbook presentation of the method is given in Hyv¨ arinen et al. (2001), while the method has been applied to modeling asset returns by Back and Weigend (1997), Wu and Yu (2005), Wu et al. (2006), Chen et al. (2007), and Broda and Paolella (2008). Once the independent component series are computed, they each are forecasted (by, for example, a univariate GARCH model) and the sum of the individual density forecasts computed (weighted by the portfolio weights). More details and an example will be given below. Thus, the ability to construct the convolution of independent, non-Gaussian distributions becomes relevant. This can be accomplished by a basic application of the inversion formula to the relevant characteristic function or, far faster, of the fast Fourier transform. Even faster is to apply an SPA; a further benefit of this is that the expected shortfall can be immediately approximated as a by-product from the terms constituting the SPA. This enormous speed benefit
2.6 Expected shortfall for sums
83
(coupled with the already fast ICA method) can be advantageous, if not crucial, for portfolio optimization with ES constraints. Below, we first outline the saddlepoint technique for approximating the pdf and cdf, and discuss how it can be used to calculate the ES. Then it is applied to the skew-normal, generalized hyperbolic, and normal inverse Gaussian.
2.6.1 Saddlepoint approximation for density and distribution Saddlepoint methods provide highly accurate approximations to the density and distribution function and are based on performing operations on the mgf of random variables. We state the following results without explanation or proof; see Paolella (2007, Chap. 5) for a basic introduction and examples, and Butler (2007) for a highly detailed treatment and extensive references to the literature. For univariate, continuous random variable X with an mgf MX (t) which exists on an open set about zero, and its log, the cumulant generating function KX (t), the (first order) SPA to fX is given by fˆX (x) =
1 2π KX (ˆ s)
exp {KX (ˆ s) − xˆ s} ,
x = KX (ˆ s) ,
(2.39)
where sˆ = sˆ (x) is the solution to the saddlepoint equation x = KX (s) and is referred to as the saddlepoint at x. It is valid for all values of x in the interior of the support of X. The density fˆX will not necessarily integrate to one, though it will be close; as such, re-normalizing it (usually by numeric integration) further increases the accuracy. The approximation (2.39) is very straightforward to derive; considerably more difficult is obtaining a relatively simple SPA to the cdf. This was achieved by Lugannani and Rice (1980): When X is continuous, they show that 1 1 FˆX (x) = Φ (w) ˆ + φ (w) ˆ − , x = E [X] , (2.40) w ˆ u ˆ where w ˆ = sgn (ˆ s) 2ˆ sx − 2KX (ˆ s), u ˆ = sˆ KX (ˆ s), and Φ and φ are the cdf and pdf of the standard normal distribution, respectively. If x = E [X], then KX (0) = E [X] and sˆ = 0 is the saddlepoint for E [X]. Thus, at the mean, KX (0) = 0, so that w ˆ = 0, rendering Fˆ in (2.40) useless. This singularity is removable; however, linear interpolation around x = E [X] is most effective for obtaining the continuity point of the approximation at the mean of X.
84
2 Expected shortfall
Expression (2.39) is the leading term in an asymptotic expansion; the second order approximation is given by f˜ (x) = fˆ (x) (1 + a1 ) ,
i/2 (i) where a1 = κ ˆ 4 /8 − 5ˆ κ23 /24 and κ ˆ i = KX (ˆ s) / KX (ˆ s) . Similarly, κ ˆ4 5 2 κ ˆ3 F˜ (x) = Fˆ (x) − φ (w) ˆ u ˆ−1 − κ ˆ3 − u ˆ−3 − 2 + w ˆ −3 8 24 2ˆ u
(2.41)
(2.42)
for x = E [X].
2.6.2 Saddlepoint approximation for expected shortfall Let Z be a random variable with an mgf which exists on an open neighborhood of zero and let q = qZ,γ denote the γ–quantile of Z. Martin (2006) has shown that the integral in ESγ (Z) can be approximated as q q − μZ def I1 (q) = μZ FZ (q) − fZ (q) ≈ zfZ (z) dz, (2.43) sˆ −∞ where μZ = E[Z]. Values fZ (q) and FZ (q) can of course be replaced by their SPA counterparts. Replacing them, (2.43) can be written in an equivalent form as 1 μZ q I1 (q) = Φ(w ˆq )μZ + exp{−w ˆq2 /2} − , (2.44) 2π w ˆq u ˆq where w ˆq , u ˆq (and also the saddlepoint, sˆ) correspond to those in (2.40) evaluated at x = q. Martin (2006) derived his approximation using the same technique employed by Lugannani and Rice (1980) in obtaining their expansion of the distribution function. Result (2.44) can also be derived by applying the result given in Temme (1982) to fˆ in (2.39) or f˜ in (2.41); taking a second order term yields (see Broda and Paolella (2009b) for details) 1 def I2 (q) = Φ(w ˆq )μZ + exp{−w ˆq2 /2} × (2.45) 2π μZ q q μZ qˆ κ3 qa1 1 − + − 3 + 2− − , 3 w ˆq u ˆq u ˆq w ˆq 2ˆ uq u ˆq sˆq u ˆq
2.6 Expected shortfall for sums
85
0
30
−0.5
25 20
−1 Exact SPA1 SPA2
−1.5 −2
10
SPA1 SPA2
5
−2.5 −3 −2
15
0
0
2
4
6
8
10
−5 −2
0
2
4
6
8
10
Figure 2.9: Left panel : The exact 1% ES for the Azzalini skew normal distribution (2.18), as a function of its skewness parameter λ, computed with numeric integration (solid); the first order SPA (2.44) (dashed); and the second order SPA (2.45) (dash-dot). Right panel : The relative percentage error, r.p.e. = 100(approx − exact)/exact, of the two SPAs. where κ ˆ 3 and a1 are given below (2.41) and evaluated at x = q. As an example, Figure 2.9 shows the true ES (solid) and the two SPAs (dashed and dash-dot) for γ = 0.01 for the skew normal distribution. (The mgf is given below.) The exact values were calculated via numeric integration. The approximations are extremely accurate for λ < 0, and worsen as λ increases from zero. Quite obviously, in this case, the cost of obtaining the exact results via numeric integration is low and the SPAs are not really necessary. Their value comes when working with distributions for which we do not have simple expressions for their density, such as convolutions, as studied next.
2.6.3 Application to sums of skew normal Denote the location-μ, scale-σ version of the skew normal distribution (2.18) as SN (λ, μ, σ). As indicated in Azzalini (1985), it is straightforward to show
86
2 Expected shortfall
that the mgf corresponding to the SN (λ, 0, 1) distribution is
MSN (t; λ) = 2 exp t2 /2 Φ (tδ) ,
δ=√
λ , 1 + λ2
where, as usual, Φ denotes the cdf of the standard normal distribution. Thus, n def if X = i=1 Xi , with Xi = ind∼ SN (λi , μi , σi ), i = 1, . . . n, and δi defined in an obvious manner, then KXi (t) = log MXi (t) = log 2 + tμi + t2 σi2 /2 + log Φ (tσi δi ) , n KX (t) = i=1 KXi (t), and an SPA to the pdf and cdf of X can be computed. In particular, defining R (z) = φ (z) /Φ (z), KXi (t) = μi + tσi2 + δi σi R (tδi σi ) and, as φ (t) = −tφ (t), we have φ (tδi σi ) = −δi2 σi2 tφ (tδi σi ), def
R (tδi σi ) = −δi2 σi2 tR (tδi σi ) − δi σi R2 (tδi σi ) , and KXi (t) = σi2 + δi σi R (tδi σi ). The 2nd order SPA requires KXi (t) and (3)
KXi (t); with R = R (tδi σi ) and R (tδi σi ) = −δi2 σi2 [tR + R]−2δi σi RR , these (4)
are easily seen to be KXi (t) = δi σi R and KXi (t) = −δi3 σi3 (tR + 2R ) − 2 2δi2 σi2 RR + (R ) . (3)
(4)
The saddlepoint equation is x = KX (t) and needs to be solved numerically. The computation of R (z) = φ (z) /Φ (z) becomes numerically problematic for z < −36, and leads the saddlepoint root search to fail. However, as z → −∞, R (z) ≈ −z, or even more accurately, 1 1 1 ≈ 3− . R (z) z z We use this approximation for z < −36. As an illustration, for two components, the convolution formula can be used to get the exact density. We use λ1 = 1, μ1 = −4, σ1 = 2, and λ2 = −4, μ2 = 2, σ2 = 6. Figure 2.10 shows the true (solid) and second order SPA (dashed), and the relative percentage error, 100(approx − true)/true for the first and second order SPAs. To get a sense of the relative computing times, for the grid of 320 values shown in the figure, the SPA took 1.05 seconds to compute, while the convolution required 228 seconds, using a moderate tolerance level for the integration. The time required for more than n = 2 components barely changes with the SPA, remaining about one second, while nested integration for calculating the exact density becomes infeasible. Observe that the exact
2.6 Expected shortfall for sums
87
0.1 0.09
6 Exact SPA 2nd
4
0.08 0.07
2
0.06 0
0.05 0.04
−2
0.03 −4
0.02 0.01 0 −25
−6 −20
−15
−10
−5
0
5
−25
rpe SPA 1st rpe SPA 2nd −20
−15
−10
−5
0
5
Figure 2.10: Left panel : Exact pdf (solid) and second order SPA (dashed) of X1 + X2 , where X1 ∼ SN(1, −4, 2) independent of X2 ∼ SN(−4, 2, 6). Right panel : The relative percentage error of the 1st order (solid) and second order (dashed) SPA. calculation of the ES is similarly computationally prohibitive, while the SPA is nearly instantaneous and, as shown for the n = 1 case above, extremely accurate. Observe that the characteristic function ϕX (t) of X ∼ SN (λ) is clearly not MSN (it; λ) as is often the case. Instead, as shown by Pewsey
x (2000), ϕX (t) = exp −t2 /2 {1 + iτ (δt)}, where, for x > 0, τ (x) = 0 2/π exp u2 /2 du, and τ (−x) = −τ (x). Note that the exponent term in function τ is not −u2 /2, so that it cannot be represented in terms of the normal cdf, and must be numerically calculated. Thus, application of the inversion theorem would require nested numeric integration. More problematic is the explosive behavior of τ (x) as x → ∞, so that numerical evaluation of ϕX (t) will be precarious, at best.
2.6.4 Application to sums of proper generalized hyperbolic We denote by the “proper GHyp” the case for which λ ∈ R, α > 0, β ∈ (−α, α), δ > 0 and μ ∈ R. This excludes the border cases such as VG and HAt, but continues to nest the Hyp and NIG. Before developing the SPA to sums, we need some preliminary results, all of which are detailed in Paolella (2007, Chap.
88
2 Expected shortfall def
9). With yx =
δ 2 + (x − μ)2 , the pdf is λ
fGHyp (x; λ, α, β, δ, μ) = √ def
λ− 1
(α2 − β 2 ) 2 yx 2 Bλ− 12 (αyx ) eβ(x−μ) . 1 2παλ− 2 δ λ Bλ (δ α2 − β 2 )
def
def
Let χ = δ 2 > 0 and ψ = α2 − β 2 > 0. With ψt = α2 − (β + t)2 , the mgf of the proper GHyp can be expressed as √ λ/2 ψ μt Bλ χψt √ MGHyp (t) = e . (2.46) ψt Bλ χψ def def √ With η = χ/ψ and ω = χψ, for X ∼ GHyp(λ, α, β, δ, μ), E[X] = μ + βηBλ+1 (ω)/Bλ (ω). If X ∼ GHyp(λ, α, β, δ, μ) and a, b ∈ R with a = 0, then
aX + b ∼ GHyp(λ, α/|a|, β/a, δ|a|, aμ + b)
(2.47)
so that, recalling the assumption that δ > 0 for the proper GHyp, this can be def def used to show that (λ, α, β, δ, μ) with α = αδ, β = βδ is a parametrization such that δ is a scale parameter and λ, α, β are location-scale invariant. In particular, let GHyp(2) denote usage of the alternative parameter set. Then, for the standard case δ = 1 and μ = 0, X ∼ GHyp(λ, α, β, 1, 0) = GHyp(2) (λ, α, β, 1, 0) and from (2.47), for a > 0, aX + b ∼ GHyp(λ,
α β , , a, b) = GHyp(2) (λ, α, β, a, b), a a
showing that λ, α, β are location-scale invariant and a is a scale parameter. We can now consider the SPA to sums of independent proper GHyp random variables. Note that, in finance applications, it might be desirable to work with a mean-zero version, to which a (possibly time-varying) scale term is applied. Let X ∼ GHyp(2) (λ, α, β, 1, 0) (same as GHyp(λ, α, β, 1, 0)) and let β Bλ+1 ( α2 − β 2 ) def μ = E[X] = α2 − β 2 B λ ( α2 − β 2 ) so that Z = X − μ has mean zero, and Z ∼ GHyp(2) (λ, α, β, 1, −μ). Then, for c > 0, we take R = cZ = cX − cμ ∼ GHyp(2) (λ, α, β, c, −cμ),
2.6 Expected shortfall for sums
89
and E[R] = cE[Z] = 0. d def Let S = j=1 aj Xj , where Xj = ind∼ GHyp(2) λj , α ¯ j , β¯j , δj , μj , and aj > 0 are fixed constants, j = 1 . . . , d. Then, from the scaling result, S = dj=1 Yj , where def Yj = aj Xj ∼ GHyp(2) λj , α ¯ j , β¯j , aj δj , aj μj α ¯j β¯j = GHyp λj , , , aj δj , aj μj aj δj aj δj = GHyp (λj , αj , βj , cj , dj ) , and αj , βj , cj , dj are so-defined. The characteristic function of Yj is indeed ϕYj (t) = MYj (it), with the bessel function Bλ (·) defined for complex arguments. Thus, the characteristic function of S is tractable and the usual inversion formulae or FFT approach can be successfully applied to get the density of S. This can be effectively used to assess the accuracy of the saddlepoint (or any other) approximation. For the SPA, we require the cgf KYi (t) = log MYi (t), and its first two derivad tives, from which we can compute KS (t) = i=1 KYi (t), with similar expres sions for KS (t) and KS (t). For notation convenience, let Y ∼ GHyp (λ, α, β, c, μ) √ (drop the subscript i). Then, with ω = χψ = δ α2 − β 2 , and defining def Qt = Q (t) = 1 − (2βt + t2 ) /ψ, it is easy to verify that the mgf can be written as Bλ (ωQt) −λ MY (t) = eμt Q , Bλ (ω) t so that KY (t) = μt + log Bλ (ωQt) − log Bλ (ω) − λ log (Qt ). Of critical use is the result that −2Bν (x) = Bν−1 (x) + Bν+1 (x),
ν ∈ R, x ∈ R>0 .
(2.48)
Then dQ (t) / dt = − (β + t) /Qtψ and, via (2.48), β + t ω Bλ−1 (ωQt ) + Bλ+1 (ωQt ) λ KY (t) = μ + + . Qt ψ 2 Bλ (ωQt ) Qt The saddlepoint tˆ needs to be numerically determined by solving KS (t) = x. It will have one, and only one, solution in the convergence strip of the mgf of S, max (−αj − βj ) < t < min (αj − βj ).
90
2 Expected shortfall
Finally, a bit of work then shows that β+t ω Bλ−1 (ωQt ) + Bλ+1 (ωQt ) λ KY (t) = × P1 + + Qt ψ 2 Bλ (ωQt ) Qt ! 2 1 (β + t) × 1+ , Qt ψ Q2t ψ where P1 =
ω λ (β + t) P2 + 2 Q3t ψ
and for P2 , B2λ (ωQt ) P2 is ω β+t Bλ (ωQt ) × (Bλ−2 (ωQt) + 2Bλ (ωQt ) + Bλ+2 (ωQt )) 2 Qt ψ − [Bλ−1 (ωQt) + Bλ+1 (ωQt )] ω β+t × (Bλ−1 (ωQt ) + Bλ+1 (ωQt )) . 2 Qt ψ
2.6.5 Application to sums of normal inverse Gaussian The NIG is a special case of the proper generalized hyperbolic considered above which leads to considerably simplified expressions, allowing dus to develop the second order SPA. As before, we are interested in S = j=1 Yj , where the Yj are independent (but not identically distributed) NIG, and for the SPA, it suffices to work out the derivatives of the cgf for the d = 1 case: From (2.24), with Y ∼ NIG(α, β, c, μ) = GHyp(−1/2, α, β, c, μ), and as before, ψt = 2 α2 − (β + t) and ψ = α2 − β 2 , KY (t) = μt + δ ψ − ψt , −α − β ≤ t ≤ α − β, and we easily find −1/2
KY (t) = μ + δψt
(β + t) ,
−1/2
KY (t) = δψt (3)
1 + ψt−1 (β + t)
(4)
2
. (2.49)
For the second order SPA, we require KY (t) and KY (t). Standard calculation yields (3) −3/2 2 KY (t) = 3δψt (β + t) 1 + ψt−1 (β + t) , (4) −3/2 2 2 KY (t) = 3δψt 1 + ψt−1 (β + t) 1 + 5ψt−1 (β + t) .
2.6 Expected shortfall for sums
91
The saddlepoint equation is then given by x = KS (t) = must be numerically solved when d > 1.
d i=1
KYi (t), which
It is interesting to note that, in the d = 1 case, we get a simple, closed-form solution to the saddlepoint equation: From KY (t) in (2.49) with z = (x − μ) /δ and h = β + t, the saddlepoint equation satisfies z 2 = h2 / α2 − h2 , which −1/2 leads to t being given by ±αkz − β, where kz = z 1 + z 2 . From the convergence strip of the mgf and that |kz | < 1, the positive root must be the correct one. Thus, the saddlepoint is αz tˆ = √ − β, 1 + z2
z=
x−μ . δ
j/2 (j) Furthermore, for d = 1, with κj (t) = KY (t) / KY (t) for j = 3, 4, we arrive at −1/4 −1/2 κ3 tˆ = 3 (δα) z 1 + z2 , −1/2 −1 2 κ4 tˆ = 3 (δα) 1+z 1 + 5z 2 . The left panel of Figure 2.11 shows the accuracy of the two SPAs (as a function d of β) for the γ = 0.01 level ES of S = j=1 Yj in the d = 1 case, with Y1 ∼ NIG(2, β, 1, 0), for which we can easily compute the true ES as in Section 2.2.4. As with the skew normal, the second order SPA is significantly more accurate. Note that the requisite quantile qS,γ for the ES is also computed with the SPA to the cdf; if the true value of qS,γ is used, the SPAs would of course have even higher accuracy, but that would take much longer to compute and defeat the point of their use. The right panel of Figure 2.11 is the level–0.01 ES, but now for d = 2, with Y1 ∼ NIG(2, β, 1, 0) and Y2 ∼ NIG(2, 0, 1, 0), shown as a function of β. It is easy to see from the mgf that, in terms of convolutions, NIG(α, β, δ1 , μ1 ) ∗ NIG(α, β, δ2 , μ2 ) = NIG(α, β, δ1 + δ2 , μ1 + μ2 ),
(2.50)
so that only in the β = 0 case does S follow an NIG distribution. To obtain the exact ES, we use the inversion formula applied to the characteristic function of S to compute the exact cdf, and use root-search to get the correct quantile qS,γ . Then, (2.8) is used to compute the ES. As a check of (2.8), numeric integration of (2.2) using the density obtained by the FFT method can be used; both are fast and yield the same value to several significant digits. For the first order SPA, root search is applied to the first order cdf SPA to get the (approximate)
92
2 Expected shortfall −1
−1
−2
−2
−3
−3
−4
−4
−5
−5
−6
−6
−7
−7
−8
True SPA 1 SPA 2
−9 −10 −1.5
−1
−0.5
0
0.5
1
1.5
True SPA 1 SPA 2
−8 −9 −10 −1.5
−1
−0.5
0
0.5
1
1.5
Figure 2.11: Left panel : The level–0.01 ES of the NIG distribution for α = 2, δ = 1, μ = 0, as a function of the skewness parameter β, computed exactly and with the first- and second-order SPA. Right panel : The level–0.01 ES for a sum of d = 2 independent NIG random variables, with Y1 ∼ NIG(2, β, 1, 0) and Y2 ∼ NIG(2, 0, 1, 0), shown as a function of β. quantile, say qˆS,γ , and then I1 (ˆ qS,γ ) from (2.43) is computed. Similarly, for the second order SPA, I2 (˜ qS,γ ) from (2.45) is used to compute the ES, where q˜S,γ refers to the quantile obtained from root search using the second order cdf SPA. As in the d = 1 case, the second order SPA is quite accurate. With respect to their computational time, the 20 true values shown in the right panel of Figure 2.11 took 87.2 seconds to compute; the 20 values for the firstand second-order SPAs took 0.38 and 0.41 seconds, respectively. Thus, the SPA offers a speed increase of over 200 times.
2.6.6 Application to portfolio returns Before applying the method to a multivariate time series of foreign exchange data, we give the necessary details on independent component analysis, hereafter ICA. Consider a portfolio of N assets, each with associated return rit at time t ∈ {1, . . . , T }, and let rt = [r1,t , . . . , rN,t ] . Denoting the vector of portfolio weights as b, the portfolio return is Rt = b rt . In the ICA model, the
2.6 Expected shortfall for sums
93
returns are modeled as linear combinations of N unobserved factors ft , i.e., rt = Aft ,
(2.51)
where the mixing matrix A is, by assumption, invertible and constant over time. The unobserved factors are assumed to be independent of each other, and to have unit unconditional variance (the latter is an identifying restriction). A polar decomposition uniquely factorizes A into a symmetric positive definite matrix Σ1/2 and an orthogonal matrix U, A = Σ1/2 U,
(2.52)
where Σ1/2 is the symmetric positive definite square root of the unconditional covariance matrix E [rt rt ] = AA = Σ1/2 UU Σ1/2 = Σ, assuming zero means for simplicity. The purpose of the ICA method is to estimate A, i.e., to find a matrix W = A−1 such that ft = Wrt are independent. The naive approach of taking W = $ −1/2 produces uncorrelated components, but this implies independence only Σ if the data are iid multivariate Gaussian. In terms of the decomposition (2.52), the orthogonal matrix U remains to be estimated. ICA achieves this by maximizing, over the set of all linear combinations of the assets, some measure of non-Gaussianity, such as excess kurtosis, negentropy, or heteroskedasticity. The idea is that, from the central limit theorem, sums of independent random variables will be closer to a Gaussian. Therefore, the least Gaussian linear combination is the one which undoes the effect of the mixing matrix A in (2.51), which corresponds to the rows of W = A−1 . The unconditional distributions of the factors fit can then be estimated one at a time. In our example, a NIG distribution with parameters (αi , βi , ci , μi ) is d N d fitted for each factor. Then, from (2.47) and (2.51), Rt = i=1 Yi , where = denotes equality in distribution, Yi ∼ NIG(αi /|ai |, βi /ai , ci |ai |, ai μi ), and ai is the ith element of b A. As an example, we consider a EUR-denominated portfolio of N = 3 currencies (US Dollars, Swiss Francs, and British Pounds). The data range from 9 Dec. 1999 to 9 Jan. 2009, corresponding to a total of 2,371 observations per return
94
2 Expected shortfall
series. The top left panel of Figure 2.12 plots the exchange rates. The estimated mixing matrix is ⎡ ⎤ 0.5586 0.3167 0.0605 ˆ = ⎣ 0.1205 0.3901 −0.2795 ⎦ , A −0.1010 0.1853 0.2002 and after fitting a NIG distribution to each factor, and for given portfolio weights b, the distribution of portfolio returns can be computed by means of the SPA developed in the previous section. We consider b = [1/3, 1/3, 1/3] as an example. The resulting density is plotted in the top right panel of Figure 2.12, overlaid with a kernel density estimate. The fit is indeed very accurate. The bottom row of the same figure is similar to the boxplots of Section 2.5. The dots represent the ES and VaR obtained, respectively, from the naive nonparametric estimator, the normal assumption, and the ICA method with NIG factors, and the boxplots correspond to their bootstrap distributions. The normal model underestimates both risk measures. The nonparametric and ICA estimates of the portfolio VaR are similar, but the nonparametric ES estimate is higher than the one obtained by ICA, which is likely due to outliers in the data. It is also apparent that the ICA estimates have a lower variance. It is important to note that once A and the NIG parameters have been estimated, the approximate distribution of portfolio returns can be computed very efficiently, for any set of portfolio weights. For example, computing the 100 pdf values in the top right panel of Figure 2.12 takes about half a second, with the cdf and tail conditional expectations obtained essentially as by-products. This is important if the distribution needs to be evaluated for different portfolio weights, such as in portfolio optimization exercises, and can be considered the main advantage of the method.
2.6 Expected shortfall for sums
95
Exchange Rates per Euro
Kernel and Fitted SPA pdfs of Portfolio Returns 1.5 USD/EUR CHF/EUR GBP/EUR
1.6
Kernel Fitted
1.4
1
f(R)
1.2
1
0.5
0.8
0.6 09−Dec−1999
17−Mar−2002
24−Jun−2004
02−Oct−2006
09−Jan−2009
0 −3
−2
−1
0
1
2
R
Bootstrap distribution of the Value at Risk
Bootstrap distribution of the Expected Shortfall −0.9
−0.8 −1 −0.85
Values
Values
−1.1 −0.9
−1.2
−0.95 −1.3 −1 −1.4 Nonp
Norm
NIG SPA
Nonp
Norm
NIG SPA
Figure 2.12: Top row : Exchange Rates per EUR for USD, CHF, and GBP (left), Kernel and fitted pdf of equally weighted portfolio returns (right). Bottom row : Estimates of Portfolio VaR (left) and ES (right), along with their bootstrap distributions.
Bibliography Azzalini, A. (1985). A Class of Distributions which Includes the Normal Ones. Scandinavian Journal of Statistics, 12:171–178. Azzalini, A. and Capitanio, A. (2003). Distributions Generated by Perturbation of Symmetry with Emphasis on a Multivariate Skew t-Distribution. Journal of the Royal Statistical Society, Series B, 65(2):367–389. Back, A. D. and Weigend, A. S. (1997). A First Application of Independent Component Analysis to Extracting Structure from Stock Returns. International Journal of Neural Systems, 8:473–484. Barbosa, A. R. and Ferreira, M. A. (2004). Beyond Coherence and Extreme Losses: Root Lower Partial Moment as a Risk Measure. Technical report. Available at SSRN. Bibby, B. M. and Sørensen, M. (2003). Hyperbolic Processes in Finance. In Rachev, S. T., editor, Handbook of Heavy Tailed Distributions in Finance, pages 211–248. Elsevier Science, Amsterdam. Broda, S. and Paolella, M. S. (2007). Saddlepoint Approximations for the Doubly Noncentral t Distribution. Computational Statistics & Data Analysis, 51:2907–2918. Broda, S. A. and Paolella, M. S. (2008). CHICAGO: A Fast and Accurate Method for Portfolio Risk. Research Paper 08-08, Swiss Finance Institute. Broda, S. A. and Paolella, M. S. (2009a). Calculating Expected Shortfall for Distributions in Finance. Mimeo, Swiss Banking Institute, University of Zurich. Broda, S. A. and Paolella, M. S. (2009b). Saddlepoint Approximation of Expected Shortfall for Transformed Means. Mimeo, Swiss Banking Institute, University of Zurich. Butler, R. W. (2007). Saddlepoint Approximations with Applications. Cambridge University Press, Cambridge.
98
Bibliography
Chen, Y., H¨ardle, W., and Jeong, S.-O. (2008). Nonparametric Risk Management With Generalized Hyperbolic Distributions. Journal of the American Statistical Association, 103(483):910–923. Chen, Y., H¨ardle, W., and Spokoiny, V. (2007). Portfolio Value at Risk Based on Independent Component Analysis. Journal of Computational and Applied Mathematics, 205:594–607. Dowd, K. (2005). Measuring Market Risk. Wiley, New York, 2nd edition. Harvey, C. R. and Siddique, A. (1999). Autoregressive Conditional Skewness. Journal of Financial and Quantitative Analysis, 34(4):465–487. Hyv¨ arinen, A., Karhunen, J., and Oja, E. (2001). Independent Component Analysis. John Wiley & Sons, New York. Jondeau, E., Poon, S.-H., and Rockinger, M. (2007). Financial Modeling Under Non-Gaussian Distributions. Springer-Verlag, London. Jones, M. C. and Faddy, M. J. (2003). A Skew Extension of the t Distribution with Applications. Journal of the Royal Statistical Society, Series B, 65:159– 174. Knight, J. and Satchell, S., editors (2001). Return Distribution in Finance. Butterworth Heinemann, Oxford. Kuester, K., Mittnik, S., and Paolella, M. S. (2006). Value–at–Risk Prediction: A Comparison of Alternative Strategies. Journal of Financial Econometrics, 4(1):53–89. Le´on, A., Rubio, G., and Serna, G. (2005). Autoregressive Conditional Volatility, Skewness and Kurtosis. The Quarterly Review of Economics and Finance, 45:599–618. Lugannani, R. and Rice, S. O. (1980). Saddlepoint Approximations for the Distribution of Sums of Independent Random Variables. Adv. Appl. Prob., 12:475–490. Mandelbrot, B. (1963). The Variation of Certain Speculative Prices. Journal of Business, 36:394–419. Martin, R. (2006). The Saddlepoint Method and Portfolio Optionalities. Risk Magazine, 19(12):93–95. McDonald, J. B. (1997). Probability Distributions for Financial Models. In Maddala, G. and Rao, C., editors, Handbook of Statistics, volume 14. Elsevier
Bibliography
99
Science. Paolella, M. S. (2007). Intermediate Probability: A Computational Approach. John Wiley & Sons, Chichester. Pewsey, A. (2000). The Wrapped Skew-Normal Distribution on the Circle. Communications in Statistics – Theory and Methods, 29:2459–2472. Samorodnitsky, G. and Taqqu, M. (1994). Stable Non-Gaussian Random Processes, Stochastic Models with Infinite Variance. Chapman & Hall, London. Schiff, J. L. (1999). The Laplace Transform - Theory and Applications. Springer Verlag, New York. Stoyanov, S., Samorodnitsky, G., Rachev, S., and Ortobelli, S. (2006). Computing the Portfolio Conditional Value-at-Risk in the alpha-stable Case. Probability and Mathematical Statistics, 26:1–22. Temme, N. M. (1982). The Uniform Asymptotic Expansion of a Class of Integrals Related to Cumulative Distribution Functions. SIAM Journal on Mathematical Analysis, 13:239–253. Weron, R. (2004). Computationally Intensive Value at Risk Calculations. In Gentle, J. E., H¨ardle, W., and Mori, Y., editors, Handbook of Computational Statistics: Concepts and Methods. Springer, Berlin, Germany. Wu, E. H. C. and Yu, P. L. H. (2005). Volatility Modelling of Multivariate Financial Time Series by Using ICA-GARCH Models. In Gallagher, M., Hogan, J., and Maire, F., editors, Intelligent Data Engineering and Automated Learning – IDEAL 2005, volume 3578, pages 571–579. Springer, Berlin. Wu, E. H. C., Yu, P. L. H., and Li, W. K. (2006). Value At Risk Estimation Using Independent Component Analysis – Generalized Autoregressive Conditional Heteroscedasticity (ICA-GARCH) Models. International Journal of Neural Systems, 16(5):371–382.
3 Modelling conditional heteroscedasticity in nonstationary series ˇ ıˇzek Pavel C´
3.1 Introduction A vast amount of econometrical and statistical research deals with modeling financial time series and their volatility, which measures the dispersion of a series at a point in time (i.e., conditional variance). Although financial markets have been experiencing many shorter and longer periods of instability or uncertainty in last decades such as Asian crisis (1997), start of the European currency (1999), the “dot-Com” technology-bubble crash (2000–2002) or the terrorist attacks (September, 2001), the war in Iraq (2003) and the current global recession (2008–2009), mostly used econometric models are based on the assumption of stationarity and time homogeneity; in other words, structure and parameters of a model are supposed to be constant over time. This includes linear and nonlinear autoregressive (AR) and moving-average models and conditional heteroscedasticity (CH) models such as ARCH (Engel, 1982) and GARCH (Bollerslev, 1986), stochastic volatility models (Taylor, 1986), as well as their combinations. On the other hand, the market and institutional changes have long been assumed to cause structural breaks in financial time series, which was for example confirmed in data on stock prices (e.g., Andreou and Ghysels, 2002, Beltratti and Morana, 2004, and Eizaguirre et al., 2010) and exchange rates (e.g., Herwatz and Reimers, 2001, or Morales-Zumaqueroa and Sosvilla-Rivero, 2010). Moreover, ignoring these breaks can adversely affect the modeling, estimation, and forecasting of volatility as suggested, for example, by Diebold and Inoue (2001), Mikosch and Starica (2004), Pesaran and Timmermann (2004), and
P. Čížek et al. (eds.), Statistical Tools for Finance and Insurance, DOI 10.1007/978-3-642-18062-0_3, © Springer-Verlag Berlin Heidelberg 2011
101
102
3 Modelling conditional heteroscedasticity in nonstationary series
Hillebrand (2005). Such findings led to the development of the change-point analysis in the context of CH models; see for example Chen and Gupta (1997), Kokoszka and Leipus (2000), Andreou and Ghysels (2006), and Chen et al. (2010). Although these methods to detect structural changes in the time series are useful in uncovering major change points, their power is often rapidly decreasing with the number of change points in a given time series. Combined with the fact that the distance between the end of the sample and a change point has to increase with the sample size and with the assumption of timehomogeneity between any two breaks, they cannot relax the assumption of time-homogeneity of CH models to a larger extent. This is particularly visible when forecasting a time series as it is usually not possible to detect structural changes close to the end of the current time series (an exception being e.g. Andrews, 2003, in the linear AR models). An alternative approach, which we concentrate upon in this chapter, lies in relaxing the assumption of time-homogeneity and allowing some or all model parameters to vary over time (e.g., as in Chen and Tsay, 1993, Cai et al., 2000, and Fan and Zhang, 2008). Without structural assumptions about the transition of model parameters over time, time-varying coefficient models have to be estimated nonparametrically under some additional identification conditions. A classical identification assumption in the context of varying coefficient models is that the parameters of interest are smooth functions of time (e.g., Cai et al., 2000, Xu and Phillips, 2008, and Fryzlewicz et al., 2008). Models with parameters smoothly varying over time are very flexible, but their main assumption precludes sudden changes in the parameter values. Thus, smoothly-varying coefficient models cannot account for classical structural breaks. A different strategy, which allows for nonstationarity of a time series, is based on the assumption that a time series can be locally, that is, over short periods of time, approximated by a parametric model. As suggested by Spokoiny (1998), such a local approximation can form a starting point in the search for the longest period of stability (homogeneity), that is, for the longest time interval in which the series is described well by the parametric model. In the context of the local constant approximation, this kind of strategy was employed for the volatility modeling by H¨ardle et al. (2003), Mercurio and Spokoiny (2004), Starica and Granger (2005), and Spokoiny (2009), for instance. A generalizaˇ ıˇzek et al. (2009) and, tion to ARCH and GARCH models can be found in C´ in a slightly different context, in Polzehl and Spokoiny (2004). The main advantage of the approach using the local-approximation assumption to search for the longest time-homogeneous interval in a given series is that it unifies the change point analysis and smoothly-varying coefficient models. First, since
3.2 Parametric conditional heteroscedasticity models
103
finding the longest time-homogeneous interval for a parametric model at any point in time corresponds to detecting the most recent change point in a time series, this approach resembles the change point modeling as in Bai and Perron (1998) or Mikosch and Starica (2004), for instance, but it does not require prior information such as the number of changes and it does not need a large number of observations before each break point (because no asymptotic results are used for the selection of time-homogenous intervals). Second, since the adaptively selected time-homogeneous interval used for estimation necessarily differs at each time point, the model coefficients can arbitrarily vary over time, but in addition to that, the parameter values can suddenly jump in contrast to models assuming smooth development of the parameters over time (Fan and Zhang, 2008). To understand the benefits of various varying coefficient models, we will discuss here the conditional heteroscedasticity models (Section 3.2) and their timevarying alternatives: smoothly-varying CH models (Section 3.3), pointwise adaptive estimation of CH models (Section 3.4), and adaptive weights smoothing of CH models (Section 3.5). A real-world comparison will be facilitated by means the analysis of the S&P 500 stock index. In particular, daily data on the log-returns of the index are used from years 1997 to 2005. The reason for choosing this data set is that it is a difficult one for modelling using timevarying coefficients: someone forecasting the stock index by varying coefficient models has a hard time to outperform the standard GARCH model, whereas this is possibly an easy task with other types of data (e.g., for the exchange rate series as shown by Fryzlewicz et al., 2008).
3.2 Parametric conditional heteroscedasticity models Consider a time series Yt in discrete time, t ∈ N , which represents the logreturns of an observed asset-price process St : Yt = log(St /St−1 ). Modelling Yt using the conditional heteroscedasticity assumption means that Yt = σt εt , t ∈ N , where εt is a white noise process and σt is a predictable volatility (conditional variance) process. Identification and estimation of the volatility process σt typically relies on some parametric CH specification such as the ARCH (Engle,
104
3 Modelling conditional heteroscedasticity in nonstationary series
1982) and GARCH (Bollerslev, 1986) models: σt2 = ω +
p i=1
2 αi Yt−i +
q
2 βj σt−j ,
(3.1)
j=1
where p ∈ N , q ∈ N , and θ = (ω, α1 , . . . , αp , β1 , . . . , βq ) is the parameter vector; the ARCH and GARCH models correspond then to q = 0 and q > 0, respectively. An attractive feature of this model is that, even with very few coefficients, one can model most stylized facts of financial time series like volatility clustering or excessive kurtosis, for instance. A number of (G)ARCH extensions were proposed to make the model even more flexible; for example, EGARCH (Nelson, 1991), QGARCH (Sentana, 1995), and TGARCH (Glosten et al., 1993) that account for asymmetries in a volatility process. All mentioned CH models can be put into a common class of generalized linear volatility models: Yt = σt εt = g(Xt )εt , (3.2) p q Xt = ω + αi h(Yt−i ) + βj Xt−j , (3.3) i=1
j=1
where g and h are known functions and Xt is a (partially) unobserved process (structural variable) that models the volatility coefficient σt2 via transformation g: σt2 = g(Xt ). The GARCH model (3.1) is described by g(u) = u and h(r) = r2 , for instance. Despite its generality, the generalized linear volatility model is time homogeneous in the sense that the process Yt follows the same structural equation at each time point. In other words, the parameter θ and hence the structural dependence in Yt is constant over time. Even though models like (3.2)–(3.3) can often fit data well over a longer period of time, the assumption of homogeneity is too restrictive in practical applications: to guarantee a sufficient amount of data for reasonably precise estimation, these models are often applied over time spans of many years.
3.2.1 Quasi-maximum likelihood estimation

The parameters in model (3.2)–(3.3) are typically estimated by the quasi-maximum likelihood (quasi-MLE) approach, which employs the estimating equations generated under the assumption of Gaussian errors ε_t. This guarantees efficiency under the normality of innovations and consistency under rather general moment conditions (Hansen and Lee, 1994, and Francq and Zakoian, 2007).
Using the observations Y_t from some time interval I = [t_0, t_1], the log-likelihood for the model (3.2)–(3.3) on an interval I can be represented in the form

$$L_I(\theta) = \sum_{t\in I}\ell\{Y_t, g[X_t(\theta)]\}$$

with the log-likelihood function ℓ(y, υ) = −0.5{log(υ) + y²/υ}, because the conditional distribution of Y_t has a zero mean and variance σ_t² = g[X_t(θ)]. We define the quasi-MLE estimate θ̃_I of the underlying parameter value θ_0 by maximizing the log-likelihood L_I(θ):

$$\tilde\theta_I = \arg\max_{\theta\in\Theta} L_I(\theta) = \arg\max_{\theta\in\Theta}\sum_{t\in I}\ell\{Y_t, g[X_t(\theta)]\}. \qquad (3.4)$$
Consider now the class of GARCH models, that is, g(u) = u and h(r) = r². The commonly used models in this class are the ARCH(p) and GARCH(1,1) models. There are several reasons why GARCH(1,1), with only one lag in both components, is typically used if the partial autocorrelation structure of Y_t² does not indicate an ARCH process. On the one hand, one needs several hundreds of observations to obtain significant and reasonably precise estimates of GARCH parameters (e.g., see Čížek et al., 2009). On the other hand, even GARCH(1,1) provides very good one-period-ahead forecasts, which often outperform more complicated parametric models (Andersen and Bollerslev, 1998). This is especially true for aggregated series such as stock indices.
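For concreteness, a minimal R sketch of the quasi-MLE (3.4) for the GARCH(1,1) case is given below. It maximizes the Gaussian quasi-log-likelihood numerically with optim and is only an illustration: the initialization of the variance recursion by the sample variance, the starting values, and the function names are our own choices, not the quantlets used in this chapter.

```r
# Variance recursion of GARCH(1,1); sigma2[t+1] is the conditional variance for day t+1.
garch11_sigma2 <- function(par, y) {
  omega <- par[1]; alpha <- par[2]; beta <- par[3]
  n <- length(y)
  sigma2 <- numeric(n + 1)
  sigma2[1] <- var(y)                            # ad hoc initialization of the recursion
  for (t in 1:n)
    sigma2[t + 1] <- omega + alpha * y[t]^2 + beta * sigma2[t]
  sigma2
}

# Gaussian quasi-MLE of (omega, alpha, beta) as in (3.4) with g(u) = u, h(r) = r^2.
garch11_fit <- function(y) {
  negloglik <- function(par) {
    if (par[1] <= 0 || any(par[2:3] < 0) || sum(par[2:3]) >= 1) return(1e10)
    s2 <- garch11_sigma2(par, y)[1:length(y)]
    0.5 * sum(log(s2) + y^2 / s2)                # minus L_I(theta), up to an additive constant
  }
  fit <- optim(c(0.05 * var(y), 0.05, 0.90), negloglik)
  par <- fit$par
  list(coef = c(omega = par[1], alpha = par[2], beta = par[3]),
       forecast = tail(garch11_sigma2(par, y), 1))  # one-day-ahead variance forecast
}
```

The forecast element, the variance predicted for the day after the last observation, is the quantity compared across methods in the empirical results below.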
3.2.2 Estimation results

Let us thus use the GARCH(1,1) model to estimate and predict the volatility of the stock index S&P 500. Although the data span from 1997 to 2005, we mostly concentrate on predictions within years 2001–2004. This period is marked by many substantial events affecting the financial markets, ranging from the September 11, 2001, terrorist attacks and the war in Iraq (2003) to the crash of the technology stock-market bubble (2000–2002); see Figure 3.1 for the log-returns of the S&P 500 index in years 2001–2004. In this and other examples where predictions are made and evaluated, we suppose that t_1 represents the current day and our aim is to forecast tomorrow's volatility, that is, σ̂²_{t_1+1} at time t_1 + 1, using the currently known data from times {t_0, ..., t_1}. Thus, σ̂²_{t_1+1} always represents an out-of-sample forecast here. To evaluate this out-of-sample forecasting performance over a given period {t_s, ..., t_e}, the mean
Figure 3.1: The standardized log-returns of the S&P500 index in years 2001– 2004. STFtvch01
absolute prediction error is used:

$$MAPE(t_s, t_e) = \frac{1}{t_e - t_s + 1}\sum_{t=t_s}^{t_e} |Y_{t+1}^2 - \hat\sigma_{t+1}^2|, \qquad (3.5)$$

where the squared future returns Y²_{t+1} are used as a noisy, but unbiased approximation of the underlying volatility (Andersen and Bollerslev, 1998). Note that we will report MAPE within each year 2001, ..., 2004 in tables and running monthly averages of MAPE in graphs.
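Coding the criterion (3.5) is immediate; the short R sketch below assumes a vector of squared returns and a vector of matching one-day-ahead variance forecasts (the helper names are our own).

```r
# Mean absolute prediction error (3.5): y2 holds the squared returns Y_{t+1}^2
# and s2 the corresponding one-day-ahead variance forecasts.
mape <- function(y2, s2) mean(abs(y2 - s2))

# MAPE per calendar year, given a vector of year labels for the forecast days.
mape_by_year <- function(y2, s2, year) tapply(abs(y2 - s2), year, mean)
```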
Predicting the volatility one day ahead using all available data points {1, ..., t_1} for the GARCH(1,1) estimation results in the forecasts shown in Figure 3.2. (Note that the data were for convenience rescaled so that the unconditional variance equals 1.) In the light of the (ex post) knowledge of possible structural changes in
Figure 3.2: The volatility forecasts of GARCH(1,1) for S&P 500 using all historical data. STFtvch02
the series, one can however wonder whether using all available historical data is a good estimation strategy. To this end, we run the GARCH estimation using various historical windows, that is, the data {t_1 − W, ..., t_1} for W = 125, 250, 500, and 1000, representing periods of one half to four years. The MAPEs of all estimates are summarized for every year in Table 3.1. Obviously, the estimation using all available historical data is best in years 2001 and 2002, whereas the optimal forecasts in years 2003 and 2004 are achieved using data from the past two years and the past six months, respectively. Thus, the optimal historical interval to use differs across time, most likely due to structural changes in the observed series. Unfortunately, at least in this basic setup, there is no clear rule with well-defined statistical properties for selecting the optimal historical window. One can only select the overall best method based on the historical performance. To judge this, Table 3.1 also contains the MAPE across all four years 2001–2004 (see row 'Total'). Since the total
Table 3.1: Mean absolute forecast errors in volatility by GARCH(1,1) using the last 125, 250, 500, 1000, and all observations.

                           Estimation window
Year        W = 125   W = 250   W = 500   W = 1000      All
2001          1.207     1.215     1.188      1.167    1.156
2002          1.807     1.773     1.739      1.728    1.714
2003          0.808     0.815     0.804      0.823    0.818
2004          0.355     0.360     0.367      0.397    0.435
Total         1.044     1.041     1.025      1.029    1.031
Weighted      1.046     1.050     1.041      1.064    1.087

STFtvch03
mean reflects the most volatile years more than the others, we also present the weighted mean with weights inversely proportional to the unconditional variance of the raw data within each year (see row 'Weighted'). In the case of GARCH(1,1), the best total performance can be attributed to the forecasts based on the past two years of data.

Alternatively, various attempts to account for the nonstationary nature of financial time series are discussed in the following Sections 3.3–3.5. To facilitate comparison across methods, we will mostly use the GARCH(1,1) results using all available data as a benchmark since it provides the overall best forecasting performance in years 2001–2003.
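A rolling-window exercise of the kind summarized in Table 3.1 can be sketched in R as follows; it reuses the hypothetical garch11_fit and mape helpers sketched above, takes eval.idx as the indices of the forecast origins, and treats W = NULL as "use all available history". It is an illustration of the setup, not the chapter's quantlet.

```r
# One-day-ahead GARCH(1,1) forecasts using a moving estimation window.
rolling_garch_forecasts <- function(y, eval.idx, W = NULL) {
  sapply(eval.idx, function(t1) {
    start <- if (is.null(W)) 1 else max(1, t1 - W + 1)
    garch11_fit(y[start:t1])$forecast      # forecast of the variance at t1 + 1
  })
}

# Comparison of window lengths in the spirit of Table 3.1:
# for (W in list(125, 250, 500, 1000, NULL)) {
#   s2 <- rolling_garch_forecasts(y, eval.idx, W)
#   print(mape(y[eval.idx + 1]^2, s2))
# }
```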
3.3 Time-varying coefficient models

An obvious feature of the generalized linear volatility model (3.2)–(3.3) is that the parametric structure of the process is assumed constant over the whole sample and thus cannot incorporate changes and structural breaks at unknown times in the model. A natural generalization leads to models whose coefficients may change over time (Fan and Zhang, 2008). In this context, a standard assumption is that the structural process X_t satisfies the relation (3.3) at any time, but the vector of coefficients θ may vary with time t, θ = θ(t). The
estimation of the coefficients as general functions of time is possible only under some additional assumptions on these functions. Typical assumptions are (i) time-varying coefficients are smooth functions of time (Cai et al., 2000, and Fryzlewicz et al., 2008) and (ii) time-varying coefficients are piecewise constant functions (Bai and Perron, 1998, and Mikosch and Starica, 2004). Due to the limitations of the latter approach, such as the asymptotically increasing length of the intervals where a parametric model (3.2)–(3.3) holds, we concentrate on the smoothly-varying coefficient models here.

Following Cai et al. (2000), for instance, one can define the following time-varying equivalent of the model (3.2)–(3.3) by

$$Y_t = \sigma_t\varepsilon_t = g(X_t)\varepsilon_t, \qquad (3.6)$$
$$X_t = \omega(t) + \sum_{i=1}^{p}\alpha_i(t)\,h(Y_{t-i}) + \sum_{j=1}^{q}\beta_j(t)\,X_{t-j}, \qquad (3.7)$$
where ω(t), α_i(t), and β_j(t) are smooth functions of time and have to be estimated from the observations Y_t. This very general model can be estimated, for example, by means of the kernel quasi-MLE method, see (3.4):

$$\hat\theta_I(t_1) = \arg\max_{\theta\in\Theta}\sum_{t\in I} W\!\left(\frac{t-t_1}{b|I|}\right)\ell\{Y_t, g[X_t(\theta)]\}, \qquad (3.8)$$
where θ(t) = (ω(t), α1 (t), . . . , αp (t), β1 (t), . . . , βq (t)), b denotes the bandwidth parameter, |I| is the length of the interval I, and W : [−1/2, 1/2] → R is a symmetric weighting function, which integrates to 1 (e.g., W (z) = 1 on [−1/2, 1/2] in the simplest case). The general model (3.6)–(3.7) is however not often used in practice due to data demands. Estimating parameters as functions of time means that θ(t) has to be estimated locally using possibly rather small numbers of observations from time periods close to t. This is however very difficult to achieve even with the basic GARCH(1,1) (see also Section 3.4 for more details). Hence, the main interest concerning the smoothly time-varying CH models lies in the time-varying ARCH (tvARCH) models.
3.3.1 Time-varying ARCH models

Formally, the tvARCH(p) model is defined by the structural equation (Dahlhaus and Subba Rao, 2006)

$$X_t = \omega(t/|I|) + \sum_{i=1}^{p}\alpha_i(t/|I|)\,Y_{t-i}^2, \qquad (3.9)$$
so that the parameter vector θ(t) = (ω(t), α_1(t), ..., α_p(t)) consists of real-valued functions on [0, 1] and all observations are assigned time t/|I| within [0, 1] irrespective of the sample size |I|. This model is able to characterize data with slowly decaying sample autocorrelations of the squared returns Y_t², which are normally attributed to structural breaks or long memory in the series.

To estimate the parameters of tvARCH(p), the kernel quasi-MLE method defined in (3.8) can be used, but it has a number of disadvantages in the context of small-sample or local estimation. When the sample size is small, the likelihood function tends to be flat around its minimum, which leads to a large variance of the estimates. Moreover, the kernel estimation requires solving a large number of these quasi-MLE problems, which is a computationally intensive task (especially taking into account also the selection of the bandwidth b discussed later). An alternative available in the class of ARCH models is the least squares (LS) estimation studied in the context of tvARCH by Fryzlewicz et al. (2008) because of its good small-sample properties, closed-form solution, and fast computation. The kernel-LS estimator for the tvARCH(p) process minimizes at a given time t_1 the following expression:

$$\hat\theta_I(t_1) = \arg\min_{\theta\in\Theta}\sum_{\{t,t-p\}\subset I} W\!\left(\frac{t-t_1}{b|I|}\right)\frac{\bigl(Y_t^2 - \omega - \sum_{j=1}^{p}\alpha_j Y_{t-j}^2\bigr)^2}{\kappa(t_1, Y_{t-1},\ldots,Y_{t-p})^2}, \qquad (3.10)$$

where W is a symmetric weighting function as in (3.8), b represents the bandwidth, and κ is also a positive weighting function. Although one does not need the weighting by κ in principle, its use is recommended for heavy-tailed data since using κ(t_1, Y_{t−1}, ..., Y_{t−p}) proportional to Y²_{t−1} + ... + Y²_{t−p} reduces the number of moments required for the consistency and asymptotic normality of the kernel-LS estimator.

Since (3.10) has a closed-form solution, an operational procedure – the two-step kernel-LS estimator – can be defined for a given interval I and a time point t_1 ∈ I as follows (Fryzlewicz et al., 2008). Denoting Y_{t−1} = (1, Y²_{t−1}, ..., Y²_{t−p})ᵀ,

1. estimate the variance of Y_t² at t_1,
$$\hat\mu_I(t_1) = \sum_{t\in I} W\!\left(\frac{t-t_1}{b|I|}\right) Y_t^2 \big/ (b|I|);$$
2. compute
$$\hat\kappa(t_1, Y_{t-1}) = \hat\mu_I(t_1) + Y_{t-1}^2 + \ldots + Y_{t-p}^2,$$
$$\hat R_I(t_1) = \sum_{\{t,t-p\}\subset I} W\!\left(\frac{t-t_1}{b|I|}\right)\frac{Y_{t-1}Y_{t-1}^{\top}}{\hat\kappa(t_1,Y_{t-1})^2}, \qquad \hat r_I(t_1) = \sum_{\{t,t-p\}\subset I} W\!\left(\frac{t-t_1}{b|I|}\right)\frac{Y_{t-1}\,Y_t^2}{\hat\kappa(t_1,Y_{t-1})^2};$$

3. and set the kernel-LS estimator to θ̂_I(t_1) = R̂_I⁻¹(t_1) r̂_I(t_1).

The only missing component needed for estimation is the bandwidth b. This can be determined either by the standard leave-one-out cross-validation (Fryzlewicz et al., 2008) or by some global forecasting criterion (Cheng et al., 2003) if the aim is to use tvARCH(p) for predicting volatility, as is the case here. Specifically, suppose the current time is t_1 and we want to forecast the volatility σ̂²_{t_1+1} at time t_1 + 1. For a given bandwidth b and a historical interval I, one can simply estimate θ̂_I(t_1) to identify the ARCH(p) model valid at time t_1 and then predict σ̂²_{t_1+1}(b) as in the case of the standard ARCH(p) model with parameters equal to θ̂_I(t_1). Because the bandwidth is unknown, we can choose it by evaluating the recent out-of-sample forecasts σ̂²_{t+1} at times t ∈ J = [τ, t_1] and minimizing the prediction error:

$$\hat b = \arg\min_{b>0} PE_{\lambda,H}(b) = \arg\min_{b>0}\sum_{t\in J}\sum_{h\in H} |Y_{t+h}^2 - \hat\sigma_{t+h}^2(b)|^{\lambda}, \qquad (3.11)$$

where λ > 0 determines the form of the loss function and H is the forecasting horizon set. In the case of MAPE in (3.5), λ = 1 and the forecasting horizon is one day, H = {1}. The historical period J can contain the last three or six months of data, for instance.
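Since steps 1–3 involve only weighted sums and one matrix inversion, the two-step kernel-LS estimator is easy to code. The R sketch below follows the formulas above literally, uses a rectangular weighting function and a one-sided historical interval I = [1, t_1], and all function and variable names are our own; it is an illustration, not the chapter's quantlet.

```r
# Two-step kernel-LS estimator of tvARCH(p) at time t1 (steps 1-3 above).
# y: return series, p: ARCH order, h = b*|I|: bandwidth in observations,
# W: symmetric weighting (kernel) function supported on [-1/2, 1/2].
tvarch_kernel_ls <- function(y, t1, p, h,
                             W = function(z) as.numeric(abs(z) <= 0.5)) {
  stopifnot(t1 > p)
  tt <- 1:t1                                     # historical interval I = [1, t1]
  w  <- W((tt - t1) / h)                         # kernel weights W((t - t1)/(b|I|))
  # Step 1: local estimate of the level of Y_t^2 at t1
  mu.hat <- sum(w * y[tt]^2) / h
  # Steps 2-3: weighted LS of Y_t^2 on (1, Y_{t-1}^2, ..., Y_{t-p}^2)
  R.hat <- matrix(0, p + 1, p + 1); r.hat <- numeric(p + 1)
  for (t in (p + 1):t1) {
    if (w[t] == 0) next
    Ylag  <- c(1, y[(t - 1):(t - p)]^2)          # regressor vector Y_{t-1}
    kappa <- mu.hat + sum(y[(t - 1):(t - p)]^2)  # kappa(t1, Y_{t-1}, ..., Y_{t-p})
    R.hat <- R.hat + w[t] * tcrossprod(Ylag) / kappa^2
    r.hat <- r.hat + w[t] * Ylag * y[t]^2 / kappa^2
  }
  theta <- drop(solve(R.hat, r.hat))             # (omega(t1), alpha_1(t1), ..., alpha_p(t1))
  list(theta = theta,
       forecast = sum(theta * c(1, y[t1:(t1 - p + 1)]^2)))  # variance forecast for t1 + 1
}
```

The bandwidth h = b|I| would then be chosen by evaluating the resulting forecasts over a recent period J and minimizing (3.11).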
3.3.2 Estimation results

Let us now have a look at the estimation and prediction of S&P 500 using the tvARCH(p) models. First, consider p = 1, for which the estimation results are summarized in Figures 3.3 and 3.4. By looking at the forecasted volatility and the prediction errors of tvARCH(1) relative to GARCH(1,1) (Figure 3.4), we see that it performs similarly or slightly worse in years 2001 and 2002, where the GARCH using all data was the best method, and outperforms GARCH(1,1) in years 2003 and 2004, where a very short window was optimal for GARCH(1,1)
(see Section 3.2). Before comparing the performance of the estimators numerically, the parameter estimates and the bandwidth used are of interest (Figure 3.3). One can see that the bandwidth b|I| ranges from very small values such as two weeks (10 days) in periods after possible structural changes or possible outliers (cf. Figure 3.1) to almost half a year (130 days). The bandwidth choice exhibits a repetitive pattern: after every possible change point or outlier in the stock-index returns, the selected bandwidth suddenly drops and then gradually increases until the next nonstationarity is encountered. The parameter estimates change more or less smoothly as functions of time except for the periods just after structural changes: due to a very low bandwidth and a low precision of estimation, one can then observe large fluctuations in the parameter estimates. Nevertheless, the ARCH coefficient has small values below 0.2 most of the time and there are prolonged periods with the ARCH coefficient being zero.

The estimation is now performed for the tvARCH(p) models with p ∈ {0, 1, 3, 5} and the corresponding MAPEs for years 2001–2004 are summarized in Table 3.2. Naturally, the results vary with the complexity of the ARCH model, but the differences are generally rather small. On the one hand, one could thus model volatility locally as a constant without losing much predictive power since tvARCH(0) performs overall as well as GARCH(1,1) in terms of MAPE; considering the weighted total MAPE, tvARCH(0) actually outperforms GARCH(1,1) using any estimation window (cf. Table 3.1). On the other hand, more complex models such as tvARCH(3) and tvARCH(5) seem to perform even better because (i) they do not perform worse than simpler models despite more parameters that have to be estimated locally (e.g., in year 2004, the bandwidth is rather small for all methods and yet tvARCH(5) matches tvARCH(1)) and (ii) they are by definition approximations of GARCH(1,1) and can thus match its performance in the periods where GARCH(1,1) is optimal. Finally, note that the total performance of tvARCH(3) and tvARCH(5) across the whole period is better than the best GARCH(1,1) model using two years of data (i.e., the last 500 observations) by all criteria.

The relatively small differences between tvARCH and GARCH can be explained by the many pronounced structural changes in the series (see Figure 3.3): while structural changes obviously affect the globally-specified GARCH(1,1) model adversely, they indirectly worsen the performance of the time-varying models as well since the intervals used for estimation shortly after a structural change are short and the estimation is thus less precise.
Figure 3.3: The estimated parameters (top panels) and the used bandwidth (bottom panel) of tvARCH(1) as functions of time for S&P 500. STFtvch04
Figure 3.4: The volatility forecast (top panel) by tvARCH(1) and the mean absolute prediction errors (bottom panel) of tvARCH(1) relative to GARCH(1,1) for S&P 500. STFtvch04
3.4 Pointwise adaptive estimation

The main limitation of the smoothly-varying CH models discussed in the previous Section 3.3 is the assumption of the continuity of θ(t), which precludes structural changes. Although we have seen in Section 3.3.2 that tvARCH(p) can actually deal with nonstationary data rather well, this was sometimes achieved
Table 3.2: Mean one-day-ahead forecast errors in volatility by GARCH(1,1) and tvARCH(p) for p = 0, 1, 3, and 5.

                          tvARCH(p)                    Global
Year        p = 0     p = 1     p = 3     p = 5    GARCH(1,1)
2001        1.192     1.195     1.185     1.189         1.156
2002        1.791     1.748     1.698     1.665         1.714
2003        0.778     0.843     0.806     0.786         0.818
2004        0.358     0.353     0.358     0.355         0.435
Total       1.030     1.034     1.012     0.998         1.031
Weighted    1.033     1.046     1.029     1.016         1.087

STFtvch05
by estimating within a window of length b|I| = 10. This is formally possible only if the sample size is fixed, as the asymptotic theory requires b|I| → ∞ as the interval length increases. These limitations motivate an alternative approach to the estimation of varying coefficient models (3.6)–(3.7), which is based on a finite-sample theory and thus does not suffer from the limitations of the classical change-point detection or smoothly-varying coefficient models.

This alternative strategy is based on the assumption that a time series can be locally, that is, over short periods of time, approximated by a parametric CH model such as (3.2)–(3.3). The aim is to find – by means of a finite-sample theory of testing – the longest historical time interval where such a parametric approximation is valid. This methodology was proposed by Mercurio and Spokoiny (2004) in the context of the local constant approximation of the volatility process and generalized by Čížek et al. (2009) to the local ARCH and GARCH approximations. Formally, we assume that the observed data Y_t are described by an unobserved process X_t as in (3.2), and at each point t_1, there exists a historical interval I(t_1) = [t_0, t_1] in which the process X_t "nearly" follows the parametric specification (3.3). This assumption enables us to apply locally the well-developed parametric estimation for data {Y_t}_{t∈I(t_1)} to estimate the underlying parameter vector θ(t_1) by θ̂(t_1) = θ̃_{I(t_1)}. Obviously, this modelling strategy results in a model with time-varying coefficients θ(t_1), but contrary to the smoothly-varying coefficient models in Section 3.3, it allows for discontinuous jumps in the parameter values of model (3.6)–(3.7): the lengths
of the intervals I(t_1) and I(t_1 − 1) are not assumed to be related in any way.

To estimate θ̂(t_1), we have to find the historical interval of homogeneity I(t_1), that is, the longest interval I = [t_0, t_1] where the data do not contradict the specified parametric model with some fixed parameter values. Starting at each time t_1 with a very short interval I = [t_1 − m + 1, t_1], where m is a small fixed integer independent of the sample size, the search is done by successively extending and testing the interval I for homogeneity against a change-point alternative: if the hypothesis of homogeneity is not rejected for a given I, a larger interval is taken and tested again. This procedure is repeated until a change point is found or I contains all past observations.

To test the null hypothesis that the observations {Y_t}_{t∈I} follow the parametric model (3.2)–(3.3) with a fixed parameter θ_0, one can use, for example, the supremum likelihood-ratio (supLR) test as proposed in Andrews (1993). Since the alternative is that the process Y_t follows different parametric models within I, the supLR test statistic equals

$$T_{I,T(I)} = \sup_{\tau\in T(I)} 2\Bigl[\max_{\theta_J,\theta_{J^c}\in\Theta}\{L_J(\theta_J) + L_{J^c}(\theta_{J^c})\} - \max_{\theta\in\Theta} L_I(\theta)\Bigr] \qquad (3.12)$$
$$\phantom{T_{I,T(I)}} = \sup_{\tau\in T(I)} 2\bigl[L_J(\tilde\theta_J) + L_{J^c}(\tilde\theta_{J^c}) - L_I(\tilde\theta_I)\bigr], \qquad (3.13)$$
where the intervals J = [t_0, τ] and J^c = [τ + 1, t_1], and the supremum is taken over a set T(I) = {τ : t_0 + m′ ≤ τ ≤ t_1 − m′′} for some fixed integers m′, m′′ > 0. Although this procedure has well-established asymptotic properties for one interval I and min{m′, m′′} → ∞ as |I| → ∞ (Andrews, 1993), the challenge of the search for the longest interval of homogeneity lies in performing sequentially multiple supLR tests and using some small fixed m′ and m′′, which do not increase with the interval length. This sequential search for the longest time-homogeneous region and the appropriate choice of critical values for the test statistics T_{I,T(I)} are discussed in Sections 3.4.1 and 3.4.2, respectively.
3.4.1 Search for the longest interval of homogeneity

The main steps in the pointwise adaptive estimation procedure are now described. At each point t_1, the aim is to estimate the unknown parameter vector θ(t_1) from historical data Y_t, t ≤ t_1. First, the procedure selects a historical interval Î(t_1) = [t_0, t_1] in which the data do not contradict the parametric model
(3.2)–(3.3). Afterwards, the quasi-MLE estimation is applied to the data within the selected historical interval Î(t_1) to obtain the estimate θ̂(t_1) = θ̃_{Î(t_1)}.

To perform the search, suppose that a growing set I_0 ⊂ I_1 ⊂ ... ⊂ I_K of historical interval candidates I_k = [t_1 − m_k + 1, t_1] with the right-end point t_1 is fixed, where the smallest interval I_0 = [t_1 − m_0 + 1, t_1] is so short that it can be accepted automatically as time-homogeneous. Every larger interval I_k will be successively tested for time-homogeneity using the test statistic T_{I_k,T(I_k)} defined in (3.12) and the critical values z_k constructed for each interval in Section 3.4.2. The interval Î(t_1) will then simply be chosen as the longest interval where the null hypothesis of time-homogeneity is accepted.

To perform and describe the procedure, it is thus necessary to select (i) a specific parametric model (3.2)–(3.3) (e.g., constant volatility, ARCH(p), GARCH(1,1)); (ii) the set I = (I_0, ..., I_K) of interval candidates (e.g., I_k = [t_1 − m_k + 1, t_1] using a geometric grid m_k = [m_0 a^k], a > 1, and m_0 ∈ ℕ); and (iii) the critical values z_1, ..., z_K as described later in Section 3.4.2. The complete description of the procedure as introduced in Čížek et al. (2009) follows.

1. Set k = 1, Î = I_0, and θ̂ = θ̃_{I_0}.

2. Test the hypothesis H_{0,k} of no change point within the interval I_k using the test statistic T_{I_k,T(I_k)} defined in (3.12) and the critical value z_k. If a change point is detected, that is, T_{I_k,T(I_k)} > z_k, go to point 4. Otherwise continue with point 3.

3. Set θ̂ = θ̃_{I_k} and θ̂_{I_k} = θ̃_{I_k}. Further, set k := k + 1. If k ≤ K, repeat point 2. Otherwise go to point 4.

4. Define Î = I_{k−1} = "the last accepted interval" and θ̂ = θ̃_{Î} = "the final estimate." Additionally, set θ̂_{I_k} = ... = θ̂_{I_K} = θ̂ if k ≤ K.

The result of the procedure is the longest interval of homogeneity Î and the corresponding pointwise adaptive estimate θ̂. Additionally, the estimate θ̂_{I_k} after k steps of the procedure is defined. Even though this method is rather insensitive to the choice of its parameters, the set-up of the procedure cannot be completely arbitrary. Specifically, the tested intervals depend on the multiplier a, which is often chosen equal to 1.25 in this context, and on the length m_0 of the initial interval I_0. The choice of m_0 should take into account the selected parametric model because I_0 is always assumed to be time-homogeneous and m_0 thus has to reflect the flexibility of the parametric model. For example, while m_0 = 20 might be reasonable for the GARCH(1,1) model, m_0 = 5 could be a reasonable choice for the locally constant approximation of a volatility process.
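For the locally constant volatility model, the maximized quasi-log-likelihood has a simple closed form, L_I(θ̃_I) = −0.5|I|{log(θ̃_I) + 1} with θ̃_I the average of Y_t² over I (using the Gaussian quasi-likelihood ℓ from Section 3.2.1), so the sequential search is easy to sketch in R. The critical values z, the initial length m0, the multiplier a, and the trimming constants m1 and m2 are user-supplied inputs here (the z_k would have to be constructed as in Section 3.4.2); the code is an illustration with our own names, not the chapter's quantlet.

```r
# Pointwise adaptive choice of the interval of homogeneity at time t1 for the
# locally constant volatility model; returns the local variance estimate.
adaptive_interval <- function(y, t1, z, m0 = 5, a = 1.25, K = 15,
                              m1 = 2, m2 = 2) {
  # maximized quasi-log-likelihood of a constant variance on squared returns y2
  Lmax <- function(y2) -0.5 * length(y2) * (log(mean(y2)) + 1)
  mk <- round(m0 * a^(0:K))                     # geometric grid of interval lengths
  mk <- mk[mk <= t1]
  theta.hat <- mean(y[(t1 - mk[1] + 1):t1]^2)   # I0 is accepted automatically
  if (length(mk) < 2) return(theta.hat)
  for (k in 2:length(mk)) {
    y2 <- y[(t1 - mk[k] + 1):t1]^2              # candidate interval I_k
    n  <- length(y2)
    taus <- m1:(n - m2)                         # change-point candidates T(I_k)
    T.stat <- max(sapply(taus, function(tau)    # supLR statistic (3.13)
      2 * (Lmax(y2[1:tau]) + Lmax(y2[(tau + 1):n]) - Lmax(y2))))
    if (T.stat > z[k]) break                    # change point detected: stop extending
    theta.hat <- mean(y2)                       # accept I_k and update the estimate
  }
  theta.hat                                     # local variance = one-day-ahead forecast
}
```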
3.4.2 Choice of critical values

The choice of the longest time-homogeneous interval Î is a multiple testing procedure. The critical values z_k are thus selected in the classical way to achieve a prescribed performance under the null hypothesis, that is, in the pure parametric situation. If data come from a parametric model (3.2)–(3.3), the correct choice of Î is the largest considered interval I_K, and the choice of any shorter interval can be interpreted as a "false alarm." The critical values are then selected to minimize the "loss" due to a false alarm.

Čížek et al. (2009) propose to measure the loss associated with a false alarm by the value L_{I_K}(θ̃_{I_K}, θ̂) = L_{I_K}(θ̃_{I_K}) − L_{I_K}(θ̂), that is, by the increase of the log-likelihood caused by estimating θ by θ̂ rather than by the optimal estimate θ̃_{I_K} under the null hypothesis. Given r > 0 and the upper bound on the log-likelihood risk E_{θ_0}|L_{I_K}(θ̃_{I_K}, θ_0)|^r ≤ R_r(θ_0), which is data-independent (Theorem 2.1 of Čížek et al., 2009), one can require the loss due to a false alarm to be bounded in the following way:

$$E_{\theta_0}\bigl|L_{I_K}(\tilde\theta_{I_K},\hat\theta)\bigr|^r \le \rho R_r(\theta_0), \qquad (3.14)$$

where ρ > 0 is a constant similar in meaning to the level of a test. Since one performs a test sequentially K times, one can decompose the upper bound on the log-likelihood loss in (3.14) into K equal parts (one per step) and require

$$E_{\theta_0}\bigl|L_{I_k}(\tilde\theta_{I_k},\hat\theta_{I_k})\bigr|^r \le \rho_k R_r(\theta_0), \qquad (3.15)$$

where ρ_k = ρk/K ≤ ρ for k = 1, ..., K. These K inequalities then define the K critical values z_k, k = 1, ..., K. To simplify this rather general choice, Čížek et al. (2009) show that z_k can be chosen as a linear function of |I_k| and describe its (straightforward) construction by simulation for a given r and ρ in detail. This facilitates a proper and general finite-sample testing procedure (see Čížek et al., 2009, for theoretical properties) with only one possible caveat: the critical values can depend on the true parameter values of the underlying process because (3.15) is specific to θ_0.

Similarly to the smoothly-varying coefficient models discussed in Section 3.3, the pointwise adaptive estimation also depends on some auxiliary parameters. A simple strategy is to use conservative values for these parameters and the corresponding set of critical values (e.g., r = 1 and ρ = 1). The choice of r and ρ can however be made data-dependent by using the same strategy as for the bandwidth choice of the tvARCH model – minimizing some global forecasting error (Cheng et al., 2003). Since different values of r and ρ lead to different sets
of critical values and hence to different estimates θ̂^{(r,ρ)}(t_1) and out-of-sample forecasts σ̂²_{t_1+h}(r, ρ), a data-driven choice of r and ρ can be done by minimizing the following prediction error:

$$(\hat r,\hat\rho) = \arg\min_{r>0,\rho>0} PE_{\lambda,H}(r,\rho) = \arg\min_{r>0,\rho>0}\sum_{t\in J}\sum_{h\in H} |Y_{t+h}^2 - \hat\sigma_{t+h}^2(r,\rho)|^{\lambda}, \qquad (3.16)$$
where λ > 0, H is the forecasting horizon such as H = {1}, and J = [τ, t1 ] represents a historical interval used for the comparison (e.g., J contains last three or six months of data).
3.4.3 Estimation results

The pointwise adaptive estimation method will now be applied to the log-returns of the S&P 500 stock index. We consider the method applied using three models: the volatility process is locally approximated by means of the constant volatility, ARCH(1), and GARCH(1,1) models. The application is more complicated than in the case of the tvARCH model since the critical values constructed from (3.15) depend on the values of the ARCH and GARCH parameters (see Čížek et al., 2009, for details and specific critical values). Since these methods are also computationally demanding, the values of the tuning parameters r and ρ are not selected by the criterion (3.16), but are instead fixed to r = 0.5 and ρ = 1.5 in all cases. This naturally has a negative influence on the performance of the pointwise adaptive estimation, but the closely related adaptive weights smoothing presented in the next Section 3.5 will demonstrate the possibilities of the methodology under a data-driven choice of tuning parameters.
Figure 3.5: The estimated parameters of the pointwise adaptive ARCH(1) model (top panels) and the estimated lengths of the timehomogenous intervals (bottom panel) for S&P 500.
Figure 3.6: The mean average prediction errors of the pointwise adaptive constant (top panel), ARCH(1) (middle panel), and GARCH(1,1) (bottom panel) models relative to the forecast errors of GARCH(1,1) for S&P 500.
To understand the nature of the estimation results, let us first look at the parameter estimates and the chosen interval lengths for the modelling based on the local ARCH(1) assumption; see Figure 3.5. The chosen intervals of homogeneity resemble the bandwidth choice b|I| of tvARCH(1) in years 2001–2003, but differ substantially in the second half of 2003 and in year 2004: the adaptively chosen intervals of homogeneity are much longer than the bandwidth choice of tvARCH(1). There are also some differences in the parameter estimates since the local ARCH(1) approximation results much more often in the ARCH parameter being equal to zero and thus in the local constant volatility model. Let us note at this point that the locally adaptive GARCH(1,1) estimates are of a different nature in this series: the ARCH parameter mostly fluctuates between 0.05 and 0.2 and the GARCH parameter attains large values ranging from 0.8 to 0.95 most of the time.

The performance of all pointwise adaptive estimates is characterized in terms of MAPE relative to the forecasting errors of GARCH(1,1) in Figure 3.6. Similarly to tvARCH(1), the pointwise adaptive estimation seems to perform slightly worse than GARCH(1,1) in years 2001 and 2002, while outperforming GARCH(1,1) in years 2003 and 2004. One can notice that the large peaks characterizing the worst performance of the pointwise adaptive estimation correspond to the times following structural changes, that is, the times with very short intervals of time-homogeneity (cf. Figure 3.5). The biggest performance gain of the pointwise adaptive estimation can therefore be observed in year 2004, which seems free of further structural shocks to the market and thus more stable (see Figure 3.5 again). Furthermore, the local GARCH(1,1) modelling matches the GARCH(1,1) forecast performance much more closely than the other two approximations of the volatility process. At this point, the experience with tvARCH(p) would hint that using an ARCH(p) approximation would be beneficial, but this is practically infeasible due to the dependence of the critical values on the parameters.

To formally judge the performance of all methods, MAPEs per year are summarized in Table 3.3. These figures confirm the better predictive performance of GARCH(1,1) in years 2001 and 2002, although the difference between the local and global GARCH(1,1) estimation is rather small in 2002. In years 2003 and 2004, the pointwise adaptive methods outperform GARCH(1,1), in the case of the local constant and ARCH(1) estimation even irrespective of the number of past observations used for the GARCH(1,1) estimation (cf. Table 3.1). Further, it is interesting to note that, even though the GARCH(1,1)-based estimators exhibit overall better performance in terms of the total MAPE, the local constant approximation is actually preferable to the local GARCH(1,1)
Table 3.3: Mean absolute forecast errors in volatility by the pointwise adaptive estimation based on the local approximation of volatility by constant, ARCH(1), and GARCH(1,1).

                    Local approximation by             Global
Year        Constant   ARCH(1)   GARCH(1,1)        GARCH(1,1)
2001           1.228     1.298        1.251             1.156
2002           1.853     1.891        1.722             1.714
2003           0.745     0.760        0.818             0.818
2004           0.361     0.351        0.367             0.435
Total          1.046     1.075        1.041             1.031
Weighted       1.040     1.058        1.057             1.087
modelling in all years but 2002, and it outperforms both the local and global GARCH(1,1) forecasts in terms of the weighted total MAPE (again irrespective of the historical-window size). Similarly to the tvARCH results, the locally applied ARCH(1) model is not a good option as it reduces to the locally constant volatility model most of the time. Finally, another advantage of the local constant approximation stems from its critical values being independent of the parameter values.
3.5 Adaptive weights smoothing

The last approach to local volatility modelling discussed here combines, in a sense, features of the smoothly-varying coefficient models and of the pointwise adaptive estimation. The adaptive weights smoothing (AWS) idea proposed by Polzehl and Spokoiny (2000) starts from an initial nonparametric fit such as the kernel quasi-MLE, that is, it first estimates a given time-varying model (3.6)–(3.7) nonparametrically using observations within small neighborhoods of a time period t_1. AWS then tries to expand these local neighborhoods and, similarly to the pointwise adaptive estimation, to find at each time point the largest interval I containing t_1 where θ(t), t ∈ I, in (3.6)–(3.7) is constant, that is, where a parametric model with fixed parameter values is applicable. The main difference with respect to tvARCH and similar models is that the initial neighborhoods do not have to increase with the sample size,
which makes it more flexible and well defined at times shortly after structural breaks. Compared to the pointwise adaptive estimation, AWS does not search for the historical intervals of homogeneity at each time point, but within the whole data set. Additionally, AWS does not necessarily require a time point to be surely within or outside of a time-homogeneous region: a point can be in such a region only with a certain probability.

We describe here the AWS procedure only in the special case when the unobserved volatility process is locally approximated by a constant. There are several reasons for this. Despite the existing extension of AWS to varying coefficient models (Polzehl and Spokoiny, 2003), the implementation exists only for models with one or two explanatory variables. This limits practical applications to volatility approximations by ARCH(p) models with a small p, which are typically outperformed by the local constant modelling (e.g., see Sections 3.3 and 3.4). The local constant modelling also seems to be similar or preferable to local GARCH(1,1) approximations of the volatility process (e.g., see Section 3.4 and Polzehl and Spokoiny, 2004), which is a likely result of the high data demand of the GARCH(1,1) model.
3.5.1 The AWS algorithm

For the sake of simplicity, let us now describe the basic AWS procedure as introduced by Polzehl and Spokoiny (2000) and adapted for volatility modelling; an extension to the time-varying GARCH(1,1) models is discussed in Polzehl and Spokoiny (2004). Similarly to the pointwise adaptive estimation, AWS searches for the largest neighborhood of a given time point t and thus requires an increasing sequence of neighborhoods U_0(t) ⊂ U_1(t) ⊂ ... ⊂ U_K(t), where typically U_k(t) = {τ : |τ − t| ≤ d_k} for an increasing sequence {d_k}_{k=0}^K. To describe AWS, one additionally needs again a symmetric integrable kernel function W and an alternative notation g(t) ≡ g(X_t), see (3.6).

1. Initialization: set k = 0, define for all t, t′ ∈ I the weights w_k(t, t′) = 1, and at each t ∈ I, compute the initial estimate of the volatility function

$$\hat g_k(t) = \frac{\sum_{t'\in U_k(t)} w_k(t,t')\,Y_{t'}^2}{\sum_{t'\in U_k(t)} w_k(t,t')}$$

and its variance

$$\hat s_k^2(t) = \frac{1}{|I|}\sum_{t'\in I}\{Y_{t'}^2 - \hat g_k(t')\}^2 \cdot \frac{\sum_{t'\in U_k(t)} w_k(t,t')^2}{\bigl(\sum_{t'\in U_k(t)} w_k(t,t')\bigr)^2}.$$
2. Adaptation: set k = k + 1, compute for all t, t′ ∈ I the weights

$$w_k(t,t') = W\!\left(\frac{\hat g_{k-1}(t) - \hat g_{k-1}(t')}{\phi\,\hat s_{k-1}(t)}\right),$$

and at each t ∈ I, compute the new estimate of the volatility function

$$\hat g_k(t) = \frac{\sum_{t'\in U_k(t)} w_k(t,t')\,Y_{t'}^2}{\sum_{t'\in U_k(t)} w_k(t,t')}$$

and its variance

$$\hat s_k^2(t) = \frac{1}{|I|}\sum_{t'\in I}\{Y_{t'}^2 - \hat g_k(t')\}^2 \cdot \frac{\sum_{t'\in U_k(t)} w_k(t,t')^2}{\bigl(\sum_{t'\in U_k(t)} w_k(t,t')\bigr)^2}.$$

3. Verification: if |ĝ_k(t) − ĝ_{k−s}(t)| > η ŝ_{k−s}(t) for any s ≤ k, set ĝ_k(t) = ĝ_{k−1}(t).

4. Stopping rule: stop if k = K or if ĝ_k(t) = ĝ_{k−1}(t) for all t ∈ I.

Step 1 represents the initial nonparametric estimation within the neighborhoods U_0(t). The adaptation step 2 defines the weights used for the nonparametric estimation of ĝ_k(t) within a larger neighborhood U_k(t), k > 0. Most importantly, the weights w_k(t, t′) are not proportional to the distance |t − t′| as for classical nonparametric estimates, but they depend on the difference between the volatility values at times t and t′. Hence, more distant observations t′ from a larger neighborhood U_k(t) are used only if the previously estimated volatilities at times t and t′ are close relative to the variance of these values and a tuning parameter φ. The third step contains another tuning parameter η, which prevents large changes of the estimated values when the neighborhoods are enlarged. Note that, in more recent versions of AWS, η instead defines a convex combination of the estimates ĝ_k(t) and ĝ_{k−s}(t) (e.g., Polzehl and Spokoiny, 2003) and is thus usually fixed. Finally, the algorithm stops if the largest neighborhood is reached or if there are no changes in the estimated volatility values anymore.

The procedure depends again on a tuning parameter, which is denoted φ this time. Similarly to the pointwise adaptive estimation, one can either use a fixed conservative value or make the choice of φ data-dependent by minimizing some global forecasting error. Analogously to (3.16), one can select φ by

$$\hat\phi = \arg\min_{\phi>0} PE_{\lambda,H}(\phi) = \arg\min_{\phi>0}\sum_{t\in J}\sum_{h\in H} |Y_{t+h}^2 - \hat\sigma_{t+h}^2(\phi)|^{\lambda}, \qquad (3.17)$$
Figure 3.7: The volatility forecasts and mean absolute prediction errors of the adaptive weights smoothing relative to GARCH(1,1) for S&P500. STFtvch06
where λ > 0, H is the forecasting horizon, and J = [τ, t1 ] represents a historical interval used for the comparison. Due to the fact that the volatility is locally approximated only by a constant here, J can be selected shorter than in the previous cases (e.g., one or two months).
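A compact R sketch of the AWS iteration for the locally constant volatility model is given below. It follows steps 1–4 above with a Gaussian-type kernel and a fixed sequence of neighborhood radii, but uses a simplified verification step that compares only with the previous iteration; it should therefore be read as an illustration rather than the implementation behind Figure 3.7, and the function and parameter names are our own.

```r
# Adaptive weights smoothing for a locally constant volatility (Section 3.5.1).
aws_volatility <- function(y, d = c(2, 4, 8, 16, 32, 64), phi = 2, eta = 2,
                           W = function(z) exp(-z^2)) {
  n <- length(y); y2 <- y^2
  g.old <- s.old <- NULL
  for (k in seq_along(d)) {
    g <- numeric(n); wsum <- numeric(n); w2sum <- numeric(n)
    for (t in 1:n) {
      nb <- max(1, t - d[k]):min(n, t + d[k])      # neighborhood U_k(t)
      w  <- if (k == 1) rep(1, length(nb))         # step 1: unit weights
            else W((g.old[t] - g.old[nb]) / (phi * s.old[t]))  # step 2 weights
      g[t] <- sum(w * y2[nb]) / sum(w)             # adaptive local average of Y_t^2
      wsum[t] <- sum(w); w2sum[t] <- sum(w^2)
    }
    s <- sqrt(mean((y2 - g)^2) * w2sum / wsum^2)   # estimated std. error of g-hat
    if (k > 1) {                                   # step 3 (simplified): compare with k-1 only
      jump <- abs(g - g.old) > eta * s.old
      g[jump] <- g.old[jump]
    }
    g.old <- g; s.old <- s
  }
  g.old                                            # final local variance estimates
}
```

The tuning constant phi would then be selected by minimizing the prediction error (3.17) over a recent period J.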
Table 3.4: Mean absolute forecast errors in volatility by the adaptive weights smoothing.

Year           AWS   GARCH(1,1)
2001         1.185        1.156
2002         1.856        1.714
2003         0.782        0.818
2004         0.352        0.435
Total        1.044        1.031
Weighted     1.037        1.087

STFtvch07
3.5.2 Estimation results

The data analyzed by AWS will again be the S&P500 index in years 2001–2004. The forecasts using φ determined by (3.17) are presented in Figure 3.7, including the ratio of MAPE relative to the MAPE of GARCH(1,1). The yearly averages of the absolute errors are reported in Table 3.4. The character of the predictions differs substantially from the ones produced by GARCH(1,1), see Figure 3.2. Similarly to the pointwise adaptive estimation, AWS performs worse than GARCH(1,1) in the first two years, while outperforming it in the last two years of data. What is much more interesting is that – with the exception of year 2002 – AWS performs better than GARCH(1,1) estimated using the 500-observation window. Additionally, AWS outperforms any GARCH(1,1) estimate in years 2003 and 2004 and also in terms of the weighted total MAPE (cf. Table 3.1). This means that the local approximation of volatility by a constant practically performs as well as or better than GARCH(1,1), even when one primitively "protects" the GARCH model against structural breaks by using just one or two years of data instead of the whole available history.
3.6 Conclusion

In this chapter, several alternatives to the standard parametric conditional heteroscedasticity modelling were presented: the semiparametric and adaptive
nonparametric estimators of models with time-varying coefficients, which can account for nonstationarities of the underlying volatility process. In most cases, adaptive or semiparametric estimation combined with a simpler rather than a more complicated model resulted in forecasts of the same or better quality than the standard GARCH(1,1) model. Although this observation is admittedly limited to the data used for demonstration, the results in Fryzlewicz et al. (2008), Polzehl and Spokoiny (2004), or Čížek et al. (2009) indicate that it is valid more generally. The only exception to this rule is tvARCH(p) with a larger number of lags, which behaves similarly to the local constant or ARCH(1) approximations of the volatility in unstable periods, but at the same time is able to capture the features of GARCH(1,1) where this parametric model predicts optimally. Unfortunately, it is currently not possible to study analogs of tvARCH(5) in the context of the pointwise adaptive estimation or adaptive weights smoothing due to practical limitations of these adaptive methods.
Bibliography

Andersen, T. G. and Bollerslev, T. (1998). Answering the skeptics: yes, standard volatility models do provide accurate forecasts, International Economic Review 39, 885–905.

Andreou, E. and Ghysels, E. (2002). Detecting multiple breaks in financial market volatility dynamics, Journal of Applied Econometrics 17, 579–600.

Andreou, E. and Ghysels, E. (2006). Monitoring disruptions in financial markets, Journal of Econometrics 135, 77–124.

Andrews, D. W. K. (1993). Tests for parameter instability and structural change with unknown change point, Econometrica 61, 821–856.

Andrews, D. W. K. (2003). End-of-sample instability tests, Econometrica 71, 1661–1694.

Bai, J. and Perron, P. (1998). Estimating and testing linear models with multiple structural changes, Econometrica 66, 47–78.

Beltratti, A. and Morana, C. (2004). Structural change and long-range dependence in volatility of exchange rates: either, neither or both?, Journal of Empirical Finance 11, 629–658.

Bollerslev, T. (1986). Generalized autoregressive conditional heteroskedasticity, Journal of Econometrics 31, 307–327.

Cai, Z., Fan, J. and Yao, Q. (2000). Functional coefficient regression models for nonlinear time series, Journal of the American Statistical Association 95, 941–956.

Chen, G., Choi, Y. K. and Zhou, Y. (2010). Nonparametric estimation of structural change points in volatility models for time series, Journal of Econometrics 126, 79–114.

Chen, J. and Gupta, A. K. (1997). Testing and locating variance changepoints with application to stock prices, Journal of the American Statistical Association 92, 739–747.

Chen, R. and Tsay, R. J. (1993). Functional-coefficient autoregressive models, Journal of the American Statistical Association 88, 298–308.

Cheng, M.-Y., Fan, J. and Spokoiny, V. (2003). Dynamic nonparametric filtering with application to volatility estimation, in M. G. Akritas and D. N. Politis (eds.), Recent Advances and Trends in Nonparametric Statistics, Elsevier, North-Holland, pp. 315–333.

Čížek, P., Härdle, W. and Spokoiny, V. (2009). Adaptive pointwise estimation in time-inhomogeneous conditional heteroscedasticity models, The Econometrics Journal 12, 248–271.

Dahlhaus, R. and Subba Rao, S. (2006). Statistical inference for time-varying ARCH processes, The Annals of Statistics 34, 1075–1114.

Diebold, F. X. and Inoue, A. (2001). Long memory and regime switching, Journal of Econometrics 105, 131–159.

Eizaguirre, J. C., Biscarri, J. G. and Hidalgo, F. P. G. (2010). Structural term changes in volatility and stock market development: evidence for Spain, Journal of Banking & Finance 28, 1745–1773.

Engle, R. F. (1982). Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation, Econometrica 50, 987–1008.

Fan, J. and Zhang, W. (2008). Statistical models with varying coefficient models, Statistics and Its Interface 1, 179–195.

Francq, C. and Zakoian, J.-M. (2007). Quasi-maximum likelihood estimation in GARCH processes when some coefficients are equal to zero, Stochastic Processes and their Applications 117, 1265–1284.

Fryzlewicz, P., Sapatinas, T. and Subba Rao, S. (2008). Normalised least-squares estimation in time-varying ARCH models, Annals of Statistics 36, 742–786.

Glosten, L. R., Jagannathan, R. and Runkle, D. E. (1993). On the relation between the expected value and the volatility of the nominal excess return on stocks, The Journal of Finance 48, 1779–1801.

Hansen, B. and Lee, S.-W. (1994). Asymptotic theory for the GARCH(1,1) quasi-maximum likelihood estimator, Econometric Theory 10, 29–53.

Härdle, W., Herwatz, H. and Spokoiny, V. (2003). Time inhomogeneous multiple volatility modelling, Journal of Financial Econometrics 1, 55–99.

Herwatz, H. and Reimers, H. E. (2001). Empirical modeling of the DEM/USD and DEM/JPY foreign exchange rate: structural shifts in GARCH-models and their implications, SFB 373 Discussion Paper 2001/83, Humboldt-Universität zu Berlin, Germany.

Hillebrand, E. (2005). Neglecting parameter changes in GARCH models, Journal of Econometrics 129, 121–138.

Kokoszka, P. and Leipus, R. (2000). Change-point estimation in ARCH models, Bernoulli 6, 513–539.

Mercurio, D. and Spokoiny, V. (2004). Statistical inference for time-inhomogeneous volatility models, The Annals of Statistics 32, 577–602.

Mikosch, T. and Starica, C. (2004). Changes of structure in financial time series and the GARCH model, Revstat Statistical Journal 2, 41–73.

Morales-Zumaquero, A. and Sosvilla-Rivero, S. (2010). Structural breaks in volatility: evidence for the OECD and non-OECD real exchange rates, Journal of International Money and Finance 29, 139–168.

Nelson, D. B. (1991). Conditional heteroskedasticity in asset returns: a new approach, Econometrica 59, 347–370.

Pesaran, M. H. and Timmermann, A. (2004). How costly is it to ignore breaks when forecasting the direction of a time series?, International Journal of Forecasting 20, 411–425.

Polzehl, J. and Spokoiny, V. (2000). Adaptive weights smoothing with applications to image restoration, Journal of the Royal Statistical Society, Ser. B 62, 335–354.

Polzehl, J. and Spokoiny, V. (2003). Varying coefficient regression modelling by adaptive weights smoothing, Preprint No. 818, WIAS, Berlin, Germany.

Sentana, E. (1995). Quadratic ARCH models, The Review of Economic Studies 62, 639–661.

Spokoiny, V. (1998). Estimation of a function with discontinuities via local polynomial fit with an adaptive window choice, Annals of Statistics 26, 1356–1378.

Spokoiny, V. (2009). Multiscale local change-point detection with applications to Value-at-Risk, Annals of Statistics 37, 1405–1436.

Starica, C. and Granger, C. (2005). Nonstationarities in stock returns, The Review of Economics and Statistics 87, 503–522.

Taylor, S. J. (1986). Modeling Financial Time Series, Chichester: Wiley.

Xu, K.-L. and Phillips, P. C. B. (2008). Adaptive estimation of autoregressive models with time-varying variances, Journal of Econometrics 142, 265–280.
4 FX smile in the Heston model

Agnieszka Janek, Tino Kluge, Rafal Weron, and Uwe Wystup
4.1 Introduction

The universal benchmark for option pricing is flawed. The Black-Scholes formula is based on the assumption of a geometric Brownian motion (GBM) dynamics with constant volatility. Yet, the model-implied volatilities for different strikes and maturities of options are not constant and tend to be smile shaped (or in some markets skewed). Over the last three decades researchers have tried to find extensions of the model in order to explain this empirical fact.

As suggested already by Merton (1973), the volatility can be made a deterministic function of time. While this approach explains the different implied volatility levels for different times of maturity, it still does not explain the smile shape for different strikes. Dupire (1994), Derman and Kani (1994), and Rubinstein (1994) came up with the idea of allowing not only time, but also state dependence of the volatility coefficient; for a concise review see e.g. Fengler (2005). This local (deterministic) volatility approach yields a complete market model. It lets the local volatility surface be fitted, but it cannot explain the persistent smile shape which does not vanish as time passes. Moreover, exotics cannot be satisfactorily priced in this model.

The next step beyond the local volatility approach was to allow the volatility to be driven by a stochastic process. The pioneering work of Hull and White (1987), Stein and Stein (1991), and Heston (1993) led to the development of stochastic volatility (SV) models; for reviews see Fouque, Papanicolaou, and Sircar (2000) and Gatheral (2006). These are multi-factor models with one of the factors being responsible for the dynamics of the volatility coefficient. Different driving mechanisms for the volatility process have been proposed, including GBM and mean-reverting Ornstein-Uhlenbeck type processes.
The Heston model stands out from this class mainly for two reasons. Firstly, the process for the volatility is non-negative and mean-reverting, which is what we observe in the markets. Secondly, there exists a fast and easily implemented semi-analytical solution for European options. This computational efficiency becomes critical when calibrating the model to market prices and is the greatest advantage of the model over other (potentially more realistic) SV models. Its popularity also stems from the fact that it was one of the first models able to explain the smile and simultaneously allow a front-office implementation and a valuation of many exotics with values closer to the market than the Black-Scholes model. Finally, given that all SV models generate roughly the same shape of implied volatility surface and have roughly the same implications for the valuation of exotic derivatives (Gatheral, 2006), focusing on the Heston model is not a limitation but rather a good starting point.

This chapter is structured as follows. In Section 4.2 we define the model and discuss its properties, including marginal distributions and tail behavior. Next, in Section 4.3 we adapt the original work of Heston (1993) to a foreign exchange (FX) setting. We do this because the model is particularly useful in explaining the volatility smile found in FX markets; in equity markets the typical volatility structure is a strongly asymmetric skew (also called a smirk or grimace). In Section 4.4 we show that the smile of vanilla options can be reproduced by suitably calibrating the model parameters. Finally, in Section 4.5 we briefly discuss the alternatives to the Heston model.
4.2 The model

Following Heston (1993), consider a stochastic volatility model with GBM-like dynamics for the spot price:

$$dS_t = S_t\left(\mu\,dt + \sqrt{v_t}\,dW_t^{(1)}\right) \qquad (4.1)$$

and a non-constant instantaneous variance v_t driven by a mean-reverting square root (or CIR) process:

$$dv_t = \kappa(\theta - v_t)\,dt + \sigma\sqrt{v_t}\,dW_t^{(2)}. \qquad (4.2)$$

The stochastic increments of the two processes are correlated with parameter ρ, i.e. dW_t^{(1)} dW_t^{(2)} = ρ dt. The remaining parameters – μ, θ, κ, and σ – can be interpreted as the drift, the long-run variance, the rate of mean reversion
Figure 4.1: Sample trajectories (left panel ) and volatilities (right panel ) of the GBM and the Heston spot price process (4.1) obtained with the same set of random numbers. STFhes01
to the long-run variance, and the volatility of variance (often called the vol of vol), respectively. Sample paths and volatilities of the GBM and the Heston spot price process are plotted in Figure 4.1.

By setting x_t = log(S_t/S_0) − μt, we can express the Heston model (4.1)-(4.2) in terms of the centered (log-)return x_t and v_t. The process is then characterized by the transition probability P_t(x, v | v_0) to have (log-)return x and variance v at time t given the initial return x = 0 and variance v_0 at time t = 0. The time evolution of P_t(x, v | v_0) is governed by the following Fokker-Planck (or forward Kolmogorov) equation:

$$\frac{\partial P}{\partial t} = \kappa\frac{\partial}{\partial v}\{(v-\theta)P\} + \frac{1}{2}\frac{\partial}{\partial x}(vP) + \rho\sigma\frac{\partial^2}{\partial x\,\partial v}(vP) + \frac{1}{2}\frac{\partial^2}{\partial x^2}(vP) + \frac{\sigma^2}{2}\frac{\partial^2}{\partial v^2}(vP). \qquad (4.3)$$
Solving this equation yields the following semi-analytical formula for the density of centered returns x, given a time lag t of the price changes (Dragulescu and Yakovenko, 2002):

$$P_t(x) = \frac{1}{2\pi}\int_{-\infty}^{+\infty} e^{i\xi x + F_t(\xi)}\,d\xi, \qquad (4.4)$$

with

$$F_t(\xi) = \frac{\kappa\theta}{\sigma^2}\,\gamma t - \frac{2\kappa\theta}{\sigma^2}\,\log\!\left[\cosh\frac{\Omega t}{2} + \frac{\Omega^2-\gamma^2+2\kappa\gamma}{2\kappa\Omega}\,\sinh\frac{\Omega t}{2}\right],$$

γ = κ + iρσξ, and Ω = √{γ² + σ²(ξ² − iξ)}.
Somewhat surprisingly, the introduction of SV does not change the properties of the spot price process in a way that could be noticed just by a visual inspection of its realizations, see Figure 4.1 where sample paths of a GBM and the spot process (4.1) in the Heston model are plotted. To make the comparison more objective, both trajectories were obtained with the same set of random numbers. In both cases the initial spot rate S_0 = 4 and the domestic and foreign interest rates are r_d = 5% and r_f = 3%, respectively, yielding a drift of μ = r_d − r_f = 2%. The volatility in the GBM is constant √v_t = √(4%) = 20%, while in the Heston model it is driven by the mean-reverting process (4.2) with the initial variance v_0 = 4%, the long-run variance θ = 4%, the speed of mean reversion κ = 2, and the vol of vol σ = 30%. The correlation is set to ρ = −0.05.

A closer inspection of the Heston model does, however, reveal some important differences with respect to GBM. For instance, the probability density functions (pdfs) of (log-)returns have heavier tails – exponential compared to Gaussian, see Figure 4.2. In this respect they are similar to hyperbolic distributions (Weron, 2004, see also Chapter 1), i.e. in the log-linear scale they resemble hyperbolas, rather than parabolas of the Gaussian pdfs.
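To reproduce pictures like Figure 4.1, the dynamics (4.1)–(4.2) can be simulated with a simple discretization. The R sketch below uses a log-Euler step for the spot and a full-truncation Euler step for the variance with the parameter values quoted above; it is only one of many possible discretization schemes and not the STFhes01 quantlet.

```r
# Euler-type simulation of the Heston model (4.1)-(4.2).
simulate_heston <- function(S0 = 4, v0 = 0.04, mu = 0.02, kappa = 2,
                            theta = 0.04, sigma = 0.3, rho = -0.05,
                            T = 1, n = 250) {
  dt <- T / n
  S <- numeric(n + 1); v <- numeric(n + 1)
  S[1] <- S0; v[1] <- v0
  for (i in 1:n) {
    z1 <- rnorm(1)
    z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(1)   # correlated Brownian increments
    vp <- max(v[i], 0)                            # truncate negative variance values
    S[i + 1] <- S[i] * exp((mu - 0.5 * vp) * dt + sqrt(vp * dt) * z1)
    v[i + 1] <- v[i] + kappa * (theta - vp) * dt + sigma * sqrt(vp * dt) * z2
  }
  list(t = seq(0, T, length.out = n + 1), S = S, v = pmax(v, 0))
}

# Example: one year of daily observations with the parameter values of Figure 4.1
# path <- simulate_heston()
# plot(path$t, path$S, type = "l", xlab = "Time [years]", ylab = "FX rate")
```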
4.3 Option pricing

Consider the value function of a general contingent claim U(t, v, S) paying g(S) = U(T, v, S) at time T. We want to replicate it with a self-financing portfolio. Due to the fact that in the Heston model we have two sources of uncertainty (the Wiener processes W^{(1)} and W^{(2)}), the portfolio must include the possibility to trade in the money market, the underlying and another derivative security with value function V(t, v, S). We start with an initial wealth X_0 which evolves according to:

$$dX = \Delta\,dS + \Gamma\,dV + r_d(X - \Gamma V)\,dt - (r_d - r_f)\Delta S\,dt, \qquad (4.5)$$
where Δ is the number of units of the underlying held at time t and Γ is the number of derivative securities V held at time t.
Figure 4.2: The marginal pdfs in the Black-Scholes (GBM) and Heston models for the same set of parameters as in Figure 4.1 (left panel ). The tails of the Heston marginal pdfs are exponential, which is clearly visible in the log-linear scale (right panel ). STFhes02
The goal is to find Δ and Γ such that X_t = U(t, v_t, S_t) for all t ∈ [0, T]. The standard approach to achieve this is to compare the differentials of U and X obtained via Itô's formula. After some algebra we arrive at the partial differential equation (PDE) which U must satisfy (for details on the derivation in the foreign exchange setting see Hakala and Wystup, 2002):

$$\frac{1}{2}vS^2\frac{\partial^2 U}{\partial S^2} + \rho\sigma vS\frac{\partial^2 U}{\partial S\,\partial v} + \frac{1}{2}\sigma^2 v\frac{\partial^2 U}{\partial v^2} + (r_d - r_f)S\frac{\partial U}{\partial S} + \bigl\{\kappa(\theta - v) - \lambda(t,v,S)\bigr\}\frac{\partial U}{\partial v} - r_d U + \frac{\partial U}{\partial t} = 0. \qquad (4.6)$$
The term λ(t, v, S) is called the market price of volatility risk. Heston (1993) assumed it to be linear in the instantaneous variance vt , i.e. λ(t, v, S) = λvt , in order to retain the form of the equation under the transformation from the statistical (or risky) measure to the risk-neutral measure.
138
4 FX smile in the Heston model
4.3.1 European vanilla FX option prices and Greeks We can solve (4.6) by specifying appropriate boundary conditions. For a European vanilla FX option these are: U (T, v, S) = max{φ(S − K), 0}, 1−φ U (t, v, 0) = Ke−rd τ , 2 ∂U 1 + φ −rf τ (t, v, ∞) = e , ∂S 2 ∂U (t, 0, S) + ∂S ∂U ∂U + κθ (t, 0, S) + (t, 0, S), ∂v ∂t Se−rf τ , for φ = +1, U (t, ∞, S) = Ke−rd τ , for φ = −1,
rd U (t, 0, S) =
(4.7) (4.8) (4.9)
(rd − rf )S
(4.10) (4.11)
where φ = ±1 for call and put options, respectively. The strike K is in units of the domestic currency and τ = T − t is the time to maturity (i.e. T is the expiration time in years and t is the current time). Heston (1993) solved the PDE analytically using the method of characteristic functions. For European vanilla FX options the price is given by: h(τ )
= =
HestonVanilla(κ, θ, σ, ρ, λ, rd , rf , vt , St , K, τ, φ)
φ e−rf τ St P+ (φ) − Ke−rd τ P− (φ) ,
(4.12)
where u1,2 = ± 21 , b1 = κ + λ − σρ, b2 = κ + λ, and * dj
=
gj
=
Cj (τ, ϕ)
Dj (τ, ϕ)
(ρσϕi − bj )2 − σ 2 (2uj ϕi − ϕ2 ),
(4.13)
bj − ρσϕi + dj , (4.14) bj − ρσϕi − dj = (rd − rf )ϕiτ + (4.15) dj τ κθ 1 − gj e + 2 (bj − ρσϕi + dj )τ − 2 log , σ 1 − gj bj − ρσϕi + dj 1 − edj τ = , (4.16) σ2 1 − gj edj τ
4.3 Option pricing
139
fj (x, vt , τ, ϕ)
= exp{Cj (τ, ϕ) + Dj (τ, ϕ)vt + iϕx}, −iϕy 1 1 ∞ e fj (x, vt , τ, ϕ) Pj (x, vt , τ, y) = + dϕ. 2 π 0 iϕ
(4.17) (4.18)
Note that the functions Pj are the cumulative distribution functions (in the variable y = log K) of the log-spot price after time τ = T − t starting at x = log St for some drift μ. Finally: P+ (φ)
=
P− (φ)
=
1−φ + φP1 (x, vt , τ, y), 2 1−φ + φP2 (x, vt , τ, y). 2
(4.19) (4.20)
The Greeks can be evaluated by taking the appropriate derivatives or by exploiting homogeneity properties of financial markets (Reiss and Wystup, 2001). In the Heston model the spot delta and the so-called dual delta are given by: Δ=
∂h(t) = φe−rf τ P+ (φ) ∂St
and
∂h(t) = −φe−rd τ P− (φ), ∂K
(4.21)
respectively. Gamma, which measures the sensitivity of delta to the underlying has the form: ∂Δ e−rf τ Γ= = p1 (log St , vt , τ, log K), (4.22) ∂St St where pj (x, v, τ, y) =
1 π
∞
e−iϕy fj (x, v, τ, ϕ) dϕ,
j = 1, 2,
(4.23)
0
are the densities corresponding to the cumulative distribution functions Pj (4.18). The time sensitivity parameter theta = ∂h(t)/∂t can be easily computed from (4.6), while the formulas for rho are the following: ∂h(t) = φKe−rd τ τ P− (φ), ∂rd
∂h(t) = −φSt e−rf τ τ P+ (φ). ∂rf
(4.24)
Note, that in the foreign exchange setting there are two rho’s – one is a derivative of the option price with respect to the domestic interest rate and the other is a derivative with respect to the foreign interest rate.
140
4 FX smile in the Heston model
The notions of vega and volga usually refer to the first and second derivative with respect to volatility. In the Heston model we use them for the first and second derivative with respect to the initial variance: ∂h(t) ∂vt
∂ 2 h(t) ∂vt2
∂ P1 (log St , vt , τ, log K) − ∂vt ∂ − Ke−rdτ P2 (log St , vt , τ, log K), ∂vt ∂2 = e−rf τ St 2 P1 (log St , vt , τ, log K) − ∂vt ∂2 − Ke−rdτ 2 P2 (log St , vt , τ, log K), ∂vt = e−rf τ St
where ∂ Pj (x, vt , τ, y) = ∂vt ∂2 Pj (x, vt , τ, y) = ∂vt2
(4.25)
(4.26)
1 ∞ D(τ, ϕ)e−iϕy fj (x, vt , τ, ϕ) dϕ, (4.27) π 0 iϕ 2 1 ∞ D (τ, ϕ)e−iϕy fj (x, vt , τ, ϕ) dϕ. (4.28) π 0 iϕ
4.3.2 Computational issues Heston’s solution is semi-analytical. Formulas (4.19-4.20) require to integrate functions fj , which are typically of oscillatory nature. Hakala and Wystup (2002) propose to perform the integration in (4.18) with the Gauss-Laguerre quadrature using 100 for ∞ and 100 abscissas. J¨ ackel and Kahl (2005) suggest using the Gauss-Lobatto quadrature (e.g. Matlab’s quadl.m function) and transform the original integral boundaries [0, +∞) to the finite interval [0, 1]. As a number of authors have recently reported (Albrecher et al., 2007; Gatheral, 2006; J¨ ackel and Kahl, 2005), the real problem starts when the functions fj are evaluated as part of the quadrature scheme. In particular, the calculation of the complex logarithm in eqn. (4.15) is prone to numerical instabilities. It turns out that taking the principal value of the logarithm causes Cj to jump discontinuously each time the imaginary part of the argument of the logarithm crosses the negative real axis. One solution is to keep track of the winding number in the integration (4.18), but is difficult to implement because standard numerical integration routines cannot be used. J¨ ackel and Kahl (2005) provide a practical solution to this problem.
4.3 Option pricing
141
A different, extremely simple approach can be traced back to Bakshi, Cao, and Chen (1997), but has not been recognized in the literature until the works of Albrecher et al. (2007), Gatheral (2006), and Lord and Kahl (2006). The idea is to make an efficient transformation when computing the characteristic function. Namely, to switch from gj in (4.14) to g˜j =
1 bj − ρσϕi − dj = , gj bj − ρσϕi + dj
(4.29)
which leads to new formulas for Cj and Dj : Cj (τ, ϕ)
Dj (τ, ϕ)
(rd − rf )ϕiτ + (4.30) κθ 1 − g˜j e−dj τ + 2 (bj − ρσϕi − dj )τ − 2 log , σ 1 − g˜j bj − ρσϕi − dj 1 − e−dj τ = , (4.31) σ2 1 − g˜j e−dj τ
=
in (4.17). Note, that the only differences between eqns. (4.14)-(4.16) and (4.29)(4.31), respectively, are the flipped minus and plus signs in front of the dj ’s. Recently, Lord and Kahl (2010) proved that the numerical stability of the resulting pricing formulas is guaranteed under the full dimensional and unrestricted parameter space. The mispricings resulting from using (4.14)-(4.16) are not that obvious to notice. In fact, if we price and backtest on short or middle term maturities only, we might not detect the problem at all. However, the deviations can become extreme for longer maturities (typically above 3-5 years; the exact threshold is parameter dependent, see Albrecher et al., 2007). Apart from the above semi-analytical solution for vanilla options, alternative numerical approaches can be utilized. These include finite difference and finite element methods, Monte Carlo simulations, and Fourier inversion of the cf. The latter is discussed in detail in Section 4.3.4. As for the other methods, finite differences must be used with care since high precision is required to invert scarce matrices. The Crank-Nicholson, ADI (Alternate Direction Implicit), and Hopscotch schemes can be used, however, ADI is not suitable to handle nonzero correlation. Boundary conditions must be set appropriately, for details see Kluge (2002). On the other hand, finite element methods can be applied to price both the vanillas and exotics, as explained for example in Apel, Winkler, and Wystup (2002). Finally, the Monte Carlo approach requires attention as the simple Euler discretization of the CIR process (4.2) may give rise to a negative variance. To deal with this problem, practitioners generally either adopt (i)
142
4 FX smile in the Heston model
the absorbing vt = max(vt , 0), or (ii) the reflecting principle vt = max(vt , −vt ). More elegant, but computationally more demanding solutions include sampling from the exact transition law (Glasserman, 2004) or application of the quadratic-exponential (QE) scheme (Andersen, 2008). For a recent survey see the latter reference, where several new algorithms for time-discretization and Monte Carlo simulation of Heston-type SV models are introduced and compared in a thorough simulation study. Both the absorbing/reflecting patches and the QE scheme are implemented in the simHeston.m function used to plot Figure 4.1.
4.3.3 Behavior of the variance process and the Feller condition The CIR process for the variance, as defined by (4.2), always remains nonnegative. This is an important property which, for instance, is not satisfied by the Ornstein-Uhlenbeck process. However, ideally one would like a variance process which is strictly positive, because otherwise the spot price process degenerates to a deterministic function for the time the variance stays at zero. As it turns out, the CIR process remains strictly positive under the condition that α :=
4κθ ≥ 2, σ2
(4.32)
which is often referred to as the Feller condition. We call α the dimensionality of the corresponding Bessel process (see below). If the condition is not satisfied, i.e. for 0 < α < 2, the CIR process will visit 0 recurrently but will not stay at zero, i.e. the 0-boundary is strongly reflecting. Unfortunately, when calibrating the Heston model to market option prices it is not uncommon for the parameters to violate the Feller condition (4.32). This is not a complete disaster, as the variance process can only hit zero for an infinitesimally small amount of time, but it is still worrying as very low levels of volatility (e.g. say below 1%) are repeatedly reached for short amounts of time and that is not something observed in the market. Besides being important from a modeling point of view, the Feller condition also plays a role in computational accuracy. For Monte Carlo simulations special care has to be taken so that the simulated paths do not go below zero if (4.32) is not satisfied. On the PDE side, the Feller condition determines whether the zero-variance boundary is in- or out-flowing, that is to say whether the
4.3 Option pricing
143
convection vector at the boundary points in- or outwards. To see this, we need to write the log-spot transformed Heston PDE in convection-diffusion form ∂ U = div(A grad U ) − div(U b) + f, ∂t and obtain
b(x, v) = v
1 2
κ+λ
1 +
2
ρσ + rf − rd , 1 2 σ − κθ 2
(4.33)
(4.34)
which is out-flowing at the v = 0 boundary if 1 2 σ − κθ < 0, 2
(4.35)
in which case the Feller condition (4.32) is satisfied. Having introduced and discussed the importance of this condition, we now give an overview of how it is derived; we refer the interested reader to Chapter 6.3 in Jeanblanc, Yor, and Chesney (2009) for a more thorough treatment. The main idea is to relate the CIR process to the family of squared Bessel processes which have well known properties. We call Xt an α-dimensional squared Bessel process and denote it by B2 (α) if it follows the dynamics of * dXt = α dt + 2 Xt+ dWt , (4.36) with Xt+ := max{0, Xt }. This definition makes sense for any real valued α. However, in the case of integer valued α we have an interesting interpretation: a B2 (α) process Xt , with X0 = 0, follows the same dynamics as the squared distance to the origin of an α-dimensional Brownian motion, i.e. Xt =
α
(i)
Bt ,
(4.37)
i=1
with B (i) being independent Brownian motions. From this we can conclude that for α = 2, Xt will never reach zero and, using the stochastic comparison theorem (Rogers and Williams, 2000), this property remains true for any α ≥ 2. Similarly for 0 < α ≤ 1 the value zero will be repeatedly hit (for α = 1 this happens as often as a one-dimensional Brownian motion crosses zero). The stochastic comparison theorem also immediately tells us that B2 (α) processes are non-negative for non-negative α: for α = 0 we get the trivial solution Xt = 0 and increasing the drift term to α > 0 cannot make the paths
144
4 FX smile in the Heston model
any smaller. For a proof of the above statements we refer to Chapter V.48 in Rogers and Williams (2000), from which we also state the following additional properties. If Xt is an α-dimensional squared Bessel process B2 (α) and X0 ≥ 0 then it is always non-negative and: 1. for 0 < α < 2 the process hits zero and this is recurrent, but the time spent at zero is zero, 2. for α = 2 the process is strictly positive but gets arbitrarily close to 0 and ∞, 3. for α > 2 the process is strictly positive and tends to infinity as time approaches infinity. To translate these properties to the class of CIR processes, we only need to apply the following space-time transformation to the squared Bessel process. Define dYt := e−κt Xβ(eκt −1) . Then Yt follows the dynamics of dYt = κ(αβ − Yt ) dt + 2 κβYt dWt ,
(4.38)
which is the same as the dynamics of the CIR process (4.2) if we set β = and α = βθ = 4κθ . σ2
σ2 4κ
4.3.4 Option pricing by Fourier inversion In this section, we briefly describe two Fourier inversion-based option pricing approaches: the Carr-Madan algorithm (Carr and Madan, 1999) and the LewisLipton approach (Lewis, 2001; Lipton, 2002). The basic idea of these methods is to compute the price by Fourier inversion given an analytic expression for the characteristic function (cf) of the option price. The Carr-Madan approach. The rationale for using this approach is twofold. Firstly, the FFT algorithm offers a speed advantage, including the possibility to calculate prices for a whole range of strikes in a single run. Secondly, the cf of the log-price is known and has a simple form for many models, while the pdf is often either unknown in closed-form or complicated from the numerical point of view. For instance, for the Heston model the cf takes the form (Heston, 1993; J¨ackel and Kahl, 2005): E{exp(iϕ log ST )} = f2 (x, vt , τ, ϕ), where f2 is given by (4.17).
(4.39)
4.3 Option pricing
145
Let us now follow Carr and Madan (1999) and derive the formula for the price of a European vanilla call option. Derivation of the put option price follows the same lines, for details see Lee (2004) and Schmelzle (2010). Alternatively we can use the call-put parity for European vanilla FX options (see e.g. Wystup, 2006). Let hC (τ ; k) denote the price of the call option maturing in τ = T − t years with a strike of K = ek : ∞ hC (τ ; k) = e−rT (es − ek )qT (s)ds, (4.40) k
where qT is the risk-neutral density of sT = log ST . The function hC (τ ; k) is not square-integrable (see e.g. Rudin, 1991) because it converges to S0 for k → −∞. However, by introducing an exponential damping factor eαk it is possible to make it an integrable function, H C (τ ; k) = eαk hC (τ ; k), for a suitable constant α > 0. A sufficient condition is given by: E{(ST )α+1 } < ∞,
(4.41)
which is equivalent to ψT (0), i.e. the Fourier transform of H C (τ ; k), see (4.42) below, being finite. In an empirical study Schoutens, Simons, and Tistaert (2004) found that α = 0.75 leads to stable algorithms, i.e. the prices are well replicated for many model parameters. This value also fulfills condition (4.41) for the Heston model (Borak, Detlefsen, and H¨ardle, 2005). Note, that for put options the condition is different: it is sufficient to choose α > 0 such that E{(ST )−α } < ∞ (Lee, 2004). Now, compute the Fourier transform of H C (τ ; k): ∞ ψT (u) = eiuk H C (τ ; k)dk −∞ ∞ ∞ = eiuk eαk e−rT (es − ek )qT (s)dsdk −∞ k ∞ s −rT = e qT (s) eαk+s − e(α+1)k eiuk dkds −∞ ∞
=
−rT
e
−∞ −rT
=
e
α2
−∞ (α+1+iu)s
qT (s)
e
α + iu
f2 {u − (α + 1)i} , + α − u2 + i(2α + 1)u
e(α+1+iu)s − α + 1 + iu
ds (4.42)
146
4 FX smile in the Heston model
where f2 is the cf of qT , see (4.39). We get the option price in terms of ψT using Fourier inversion: # e−αk ∞ −iuk e−αk ∞ " −iuk hC (τ ; k) = e ψ(u)du = e ψ(u) du (4.43) 2π −∞ π 0 Carr and Madan (1999) utilized Simpson’s rule to approximate this integral within the FFT scheme: N e−αkn − 2πi (j−1)(n−1) ibuj η h (τ ; kn ) ≈ e N e ψ(uj ) {3 + (−1)j − δj−1 }, (4.44) π j=1 3 C
where kn = η1 {−π + 2π N (n − 1)}, n = 1, . . . , N , η > 0 is the distance between the points of the integration grid, uj = η(j − 1), j = 1, . . . , N , and δ is the Dirac function. However, efficient adaptive quadrature techniques – like Gauss-Kronrod (see Matlab’s function quadgk.m; Shampine, 2008) – yield better approximations, see Figure 4.3. There is even no need to restrict the quadrature to a finite interval any longer. Finally note, that for very short maturities the option price approaches its non analytic intrinsic value. This causes the integrand in the Fourier inversion to oscillate and therefore makes it difficult to be integrated numerically. To cope with this problem Carr and Madan (1999) proposed a method in which the option price is obtained via the Fourier transform of a modified – using a hyperbolic sine function (instead of an exponential function as above) – time value. The Lewis-Lipton approach. Lewis (2000; 2001) generalized the work on Fourier transform methods by expressing the option price as a convolution of generalized Fourier transforms and applying the Plancherel (or Parseval) identity. Independently, Lipton (2001; 2002) described this approach in the context of foreign exchange markets. Instead of transforming the whole option price including the payout function as in Carr and Madan (1999), Lewis and Lipton utilized the fact that payout functions have their own representations in Fourier space. For instance, for the payout f (s) = (es − K)+ of a call option we have: ∞ ∞ fˆ(z) = eizs f (s)ds = eizs (es − K)ds −∞
=
log K
)∞ ) e e K iz+1 ) −K = − . iz + 1 iz )s=log K z 2 − iz (iz+1)s
izs
(4.45)
4.3 Option pricing
147
0.12
0.12 Carr−Madan Heston
0.1 Put option price
Call option price
0.1 0.08 0.06 0.04 0.02
0.08 0.06 0.04 0.02
0 1.1
1.15
1.2 Strike price
1.25
0 1.1
1.3
−3
1
1
1.25
1.3
1.15
1.2 Strike price
1.25
1.3
x 10
0.5 Error
Error
0.5
0
−1 1.1
1.2 Strike price
−3
x 10
−0.5
1.15
Carr−Madan C−M w/Gauss−Kronrod Lewis−Lipton 1.15
1.2 Strike price
1.25
0
−0.5
1.3
−1 1.1
Figure 4.3: European call (top left) and put (top right ) FX option prices obtained using Heston’s ‘analytical’ formula (4.12) and the CarrMadan method (4.44), for a sample set of parameters. Bottom panels: Errors (wrt Heston’s formula) of the original Carr-Madan FFT method, formula (4.43) of Carr-Madan, and the Lewis-Lipton formula (4.46). The integrals in the latter two methods (as well as in Heston’s ‘analytical’ formula) are evaluated using the adaptive Gauss-Kronrod quadrature with 10−8 relative accuracy. STFhes03
Note, that the Fourier transform is well behaved only within a certain strip of regularity in the complex domain. Interestingly the transformed payout of a put option is also given by (4.45), but is well behaved in a different strip in the complex plane (Schmelzle, 2010).
148
4 FX smile in the Heston model
The pricing formula is obtained by representing the integral of the product of the state price density and the option payoff function as a convolution representation in Fourier space using the Plancherel identity. For details we refer to the works of Lewis and Lipton. Within this framework, the price of a foreign exchange call option is given by (Lipton, 2002): U (t, vt , S) =
e−rd τ K e−rf τ S − × (4.46)
2π ∞ 1 1 exp (iϕ + 2 )X + α(ϕ) − (ϕ2 + 4 )β(ϕ)vt × dϕ, ϕ2 + 14 −∞
where κ ˆ = X = ζ(ϕ)
=
κ − 12 ρσ, log(S/K) + (rd − rf )τ * ϕ2 σ 2 (1 − ρ2 ) + 2iϕσρˆ κ+κ ˆ 2 + 14 σ 2 ,
ψ± (ϕ) = ∓(iϕρσ + κ ˆ ) + ζ(ϕ), κθ ψ− (ϕ) + ψ+ (ϕ)e−ζ(ϕ)τ α(ϕ) = − 2 ψ+ (ϕ)τ + 2 log , σ 2ζ(ϕ) β(ϕ)
=
1 − e−ζ(ϕ)τ . ψ− (ϕ) + ψ+ (ϕ)e−ζ(ϕ)τ
Note, that the above formula for β corrects a typo in the original paper of Lipton (2002), i.e. no minus sign. As can be seen in Figure 4.3 the differences between the different Fourier inversion-based methods and the (semi-)analytical formula (4.12) are relatively small. In most cases they do not exceed 0.5%. Note, that the original method of Carr-Madan (using FFT and Simpson’s rule) yields ‘exact’ values only on the grid ku . In order to preserve the speed of the FFT-based algorithm we use linear interpolation between the grid points. This approach, however, generally slightly overestimates the true prices, since the option price is a convex function of the strike. If we use formula (4.43) of Carr-Madan and evaluate the integrals using the adaptive Gauss-Kronrod quadrature the results nearly perfectly match the (semi-)analytical solution (also obtained using the Gauss-Kronrod quadrature). On the other hand, the Lewis-Lipton formula (4.46) yields larger deviations, but offers a speed advantage. It is ca. 50% faster than formula (4.43) with the adaptive Gauss-Kronrod quadrature, over 3 times faster than Heston’s (semi-)analytical formula, and over 4.5 times faster than the original method of Carr-Madan (using FFT with 210 grid points).
4.4 Calibration
149
4.4 Calibration 4.4.1 Qualitative effects of changing the parameters Before calibrating the model to market data we will show how changing the input parameters affects the shape of the fitted smile curve. This analysis will help in reducing the dimensionality of the problem. In all plots of this subsection the solid blue curve with x’s is the smile obtained for v0 = 0.01, σ = 0.2, κ = 1.5, θ = 0.015, and ρ = 0.05. First, take a look at the initial variance (top left panel in Figure 4.4). Apparently, changing v0 allows for adjustments in the height of the smile curve. On the other hand, the volatility of variance (vol of vol) has a different impact on the smile. Increasing σ increases the convexity of the fit, see the top right panel in Figure 4.4. In the limiting case, setting σ equal to zero would produce a deterministic process for the variance and, hence, a volatility which does not admit any smile (a constant curve). The effects of changing the long-run variance θ are similar to those observed by changing the initial variance, compare the left panels in Figure 4.4. It seems promising to choose the initial variance a √ priori, e.g. set v0 = implied at-the-money (ATM) volatility, and only let the long-run variance θ vary. In particular, a different initial variance for different maturities would be inconsistent. Changing the mean reversion κ affects the ATM part more than the extreme wings of the smile curve. The low/high deltas (Δ) remain almost unchanged while increasing the mean reversion lifts the center, see the bottom right panel in Figure 4.4. Moreover, the influence of κ is often compensated by a stronger vol of vol σ. This suggests fixing mean reversion (at some level, say κ = 1.5) and only calibrating the remaining three parameters. If the obtained parameters do not satisfy the Feller condition (4.32), it might be worthwhile to fix κ at a different level, say κ ˜ = 3, recalibrate the remaining parameters and check if the new estimates fulfill the condition and lead to a more realistic variance process. Finally, let us look at the influence of correlation. The uncorrelated case produces a fit that looks like a symmetric smile curve centered at-the-money, see Figure 4.5. However, it is not exactly symmetric. Changing ρ changes the degree of symmetry. In particular, positive correlation makes calls more expensive, negative correlation makes puts more expensive. Note, that the model yields a volatility skew, a typically observed volatility structure in equity markets, only when the correlation is set to a very high or low value
150
4 FX smile in the Heston model 13
13 σ = 0.2 σ = 0.1 σ = 0.3
v = 0.008
Implied volatility [%]
Implied volatility [%]
v0 = 0.010 0
12
v0 = 0.012 11
10
9
10
30
50 Delta [%]
70
10
10
30
50 Delta [%]
70
90
70
90
13 θ = 0.015 θ = 0.010 θ = 0.020
12
Implied volatility [%]
Implied volatility [%]
11
9
90
13
11
10
9
12
10
30
50 Delta [%]
70
90
κ = 1.5 κ = 0.5 κ = 3.0
12
11
10
9
10
30
50 Delta [%]
Figure 4.4: The effects of changing the model parameters on the shape of the smile: initial variance v0 (top left ), volatility of variance σ (top right), long-run variance θ (bottom left ), and mean reversion level κ (bottom right). STFhes04
4.4.2 The calibration scheme Calibration of SV models can be done in two conceptually different ways. One is to look at the time series of historical data. Estimation methods such as Generalized, Simulated, and Efficient Methods of Moments (respectively GMM, SMM, and EMM), as well as Bayesian MCMC have been extensively applied. See Broto and Ruiz (2004) for a review. In the Heston model we could also try to fit empirical distributions of returns to the marginal distributions specified in (4.4) via a minimization scheme. Unfortunately, all historical approaches
4.4 Calibration
151
13 ρ = 0.05 ρ = −0.15 ρ = 0.15
12
Implied volatility [%]
Implied volatility [%]
13
11
10
9
10
30
50 Delta [%]
70
90
ρ=0 ρ = −0.5 ρ = 0.5
12
11
10
9
10
30
50 Delta [%]
70
90
Figure 4.5: The effects of changing the correlation on the shape of the smile. STFhes05
have one common flaw: they do not allow for the estimation of the market price of volatility risk λ(t, v, S). Observing only the underlying spot price and estimating SV models with this information will not yield correct prices for the derivatives. This leads us to the second estimation approach: instead of using the spot data we calibrate the model to the volatility smile (i.e. prices of vanilla options). In this case we do not need to worry about estimating the market price of volatility risk as it is already embedded in the market smile. This means that we can set λ = 0 by default and just determine the remaining parameters. As a preliminary step, we have to retrieve the strikes since the smile in FX markets is specified as a function of delta. Comparing the Black-Scholes type formulas (in the FX market setting we have to use the Garman and Kohlhagen (1983) specification) for delta and the option premium yields the relation for the strikes Ki . From a computational point of view this stage requires only an inversion of the Gaussian distribution function. Next, based on the findings of Section 4.4.1, we fix two parameters (initial variance v0 and mean reversion κ) and fit the remaining three: volatility of variance σ, long-run variance θ, and correlation ρ for a fixed time to maturity and a given vector of market BlackScholes implied volatilities {ˆ σi }ni=1 for a given set of delta pillars {Δi }ni=1 . After fitting the parameters we compute the option prices in the Heston model using (4.12) and retrieve the corresponding Black-Scholes model implied volatilities {σi }ni=1 via a standard bisection method (function fzero.m in Matlab uses
152
4 FX smile in the Heston model
a combination of bisection, secant, and inverse quadratic interpolation methods). The next step is to define an objective function, which we choose to be the Sum of Squared Errors (SSE): SSE(κ, θ, σ, ρ, v0 ) =
n
{ˆ σi − σi (κ, θ, σ, ρ, v0 )}2 .
(4.47)
i=1
We compare the volatilities because they are all of similar magnitude, in contrast to the prices which can range a few orders of magnitude for it-the-money (ITM) vs. out-of-the-money (OTM) options. In addition, one could introduce weights for all the summands to favor ATM or OTM fits. Finally we minimize over this objective function using a simplex search routine (fminsearch.m in Matlab) to find the optimal set of parameters.
4.4.3 Sample calibration results We are now ready to calibrate the Heston model to market data. First, we take the EUR/USD volatility surface on July 1, 2004 and fit the parameters in the Heston model according to the calibration scheme discussed earlier. The results are shown in Figures 4.6–4.7. Note, that the fit is very good for intermediate and long maturities (three months and more). Unfortunately, the Heston model does not perform satisfactorily for short maturities (see Section 4.5.2 for a discussion of alternative approaches). Comparing with the fits in Weron and Wystup (2005) for the same data, the long maturity (2Y) fit is better. This is due to a more efficient optimization routine (Matlab 7.2 vs. XploRe 4.7) and utilization of the transformed formulas (4.29)-(4.31) instead of the original ones (4.14)-(4.16). Now we take a look at more recent data and calibrate the Heston model to the EUR/USD volatility surface on July 22, 2010. The results are shown in Figures 4.8–4.9. Again the fit is very good for intermediate and long maturities, but unsatisfactory for maturities under three months. The term structure of the vol of vol and correlation visualizes the problem with fitting the smile in the short term for both datasets. The calibrated vol of vol is very low for the 1W smiles, then jumps to a higher level. The correlation behaves similarly, for the 1W smiles it is much lower than for the remaining maturities. In 2004 it additionally changes sign as the skew changes between short and longer-term tenors. Also note, that the more skewed smiles in 2010 require much higher (anti-)correlation (−0.4 < ρ < −0.3 for τ ≥ 1M) to obtain a decent fit than the more symmetric smiles in 2004 (−0.01 < ρ < 0.05 for τ ≥ 1M).
4.4 Calibration
153 10.8 1W smile Heston fit
10.8
Implied volatility [%]
Implied volatility [%]
11
10.6 10.4 10.2 10
10
25
50 Delta [%]
75
10.2 10
10
25
50 Delta [%]
75
90
75
90
11.6 3M smile Heston fit
11
6M smile Heston fit
11.4 Implied volatility [%]
Implied volatility [%]
11.2
10.8 10.6 10.4 10.2 10
10.4
9.8
90
1M smile Heston fit
10.6
11.2 11 10.8 10.6 10.4
10
25
50 Delta [%]
75
90
10.2
10
25
50 Delta [%]
Figure 4.6: The EUR/USD market smile on July 1, 2004 and the fit obtained with the Heston model for τ = 1 week (top left), 1 month (top right ), 3 months (bottom left ), and 6 months (bottom right). STFhes06
As these examples show, the Heston model can be successfully applied to modeling the volatility smile of vanilla FX options in the mid- to longer-term. There are essentially three parameters to fit, namely the long-run variance (θ), which corresponds to the ATM level of the market smile, the vol of vol (σ), which corresponds to the convexity of the smile (in the market often quoted as butterflies), and the correlation (ρ), which corresponds to the skew of the smile (quoted as risk reversals). It is this direct link of the model parameters to the market that makes the Heston model so attractive to practitioners.
154
4 FX smile in the Heston model 11.8
11.8 1Y smile Heston fit
11.4 11.2 11 10.8 10.6 10
25
50 Delta [%]
75
11.2 11 10.8
10.4
90
0.9
10
25
50 Delta [%]
75
0.5
1 Tau [year]
1.5
90
0.06
0.8
0.04
0.7
Correlation (ρ)
Vol of vol (σ)
11.4
10.6
10.4
0.6 0.5 0.4
0.02 0 −0.02
0.3 0.2
2Y smile Heston fit
11.6 Implied volatility [%]
Implied volatility [%]
11.6
0
0.5
1 Tau [year]
1.5
2
−0.04
0
2
Figure 4.7: The EUR/USD market smile on July 1, 2004 and the fit obtained with the Heston model for τ = 1 year (top left ) and 2 years (top right). The term structure of the vol of vol and correlation visualizes the problem with fitting the smile in the short term (bottom panels). STFhes06
The key application of the model is to calibrate it to vanilla options and afterward use it for pricing exotics (like one-touch options, see Wystup, 2003) in either a finite difference grid or a Monte Carlo simulation. Surprisingly, the results often coincide with the traders’ rule of thumb pricing method (Wystup, 2006). This might also simply mean that a lot of traders use the same model. After all, it is a matter of belief which model reflects reality the best. Recent ideas are to take prices of one-touch options along with prices of vanilla options from the market and use this common input to calibrate the Heston model.
4.5 Beyond the Heston model
155 15.5
1W smile Heston fit
15 14.5 14 13.5 13
14.5 14 13.5 13 12.5
10
25
50 Delta [%]
75
12
90
10
25
50 Delta [%]
75
90
75
90
19 3M smile Heston fit
16
6M smile Heston fit
18 Implied volatility [%]
Implied volatility [%]
17
15 14 13 12
1M smile Heston fit
15 Implied volatility [%]
Implied volatility [%]
15.5
17 16 15 14 13
10
25
50 Delta [%]
75
90
12
10
25
50 Delta [%]
Figure 4.8: The EUR/USD market smile on July 22, 2010 and the fit obtained with the Heston model for τ = 1 week (top left), 1 month (top right ), 3 months (bottom left ), and 6 months (bottom right). STFhes07
4.5 Beyond the Heston model 4.5.1 Time-dependent parameters As we have seen in Section 4.4.3, performing calibrations for different time slices of the volatility matrix produces different values of the parameters. This suggests a term structure of some parameters in the Heston model. Therefore, we need to generalize the CIR process (4.2) to the case of time-dependent
156
4 FX smile in the Heston model 15 1Y smile Heston fit
17 16 15 14
13 12.5 12
12
11.5
25
50 Delta [%]
75
90
1.6
25
50 Delta [%]
75
0.5
1 Tau [year]
1.5
90
−0.2 Correlation (ρ)
1.2 1 0.8 0.6
−0.25 −0.3 −0.35
0.4 0.2
10
−0.15
1.4 Vol of vol (σ)
14 13.5
13 10
2Y smile Heston fit
14.5 Implied volatility [%]
Implied volatility [%]
18
0
0.5
1 Tau [year]
1.5
−0.4
2
0
2
Figure 4.9: The EUR/USD market smile on July 22, 2010 and the fit obtained with the Heston model for τ = 1 year (top left ) and 2 years (top right). Again the term structure of the vol of vol and correlation visualizes the problem with fitting the smile in the short term (bottom panels). STFhes07
parameters, i.e. we consider the process: √ dvt = κ(t){θ(t) − vt } dt + σ(t) vt dWt ,
(4.48)
for some nonnegative deterministic parameter functions σ(t), κ(t), and θ(t). The formula for the mean turns out to be: t E(vt ) = g(t) = v0 e−K(t) + κ(s)θ(s)eK(s)−K(t) ds, (4.49) 0
4.5 Beyond the Heston model with K(t) =
157
t
κ(s) ds. The result for the second moment is: t E(vt2 ) = v02 e−2K(t) + {2κ(s)θ(s) + σ 2 (s)}g(s)e2K(s)−2K(t) ds, 0
(4.50)
0
and hence for the variance (after some algebra): t Var(vt ) = σ 2 (s)g(s)e2K(s)−2K(t) ds.
(4.51)
0
The formula for the variance allows us to compute forward volatilities of variance explicitly. Assuming known values σT1 and σT2 for times 0 < T1 < T2 , we want to determine the forward volatility of variance σT1 ,T2 which matches the corresponding variances, i.e. T2 σT2 2 g(s)e2κ(s−T2 ) ds = (4.52) 0
T1
= 0
σT2 1 g(s)e2κ(s−T2 ) ds +
T2 T1
σT2 1 ,T2 g(s)e2κ(s−T2 ) ds.
The resulting forward volatility of variance is thus: σT2 1 ,T2 = where
t
H(t) =
g(s)e 0
2κs
σT2 2 H(T2 ) − σT2 1 H(T1 ) , H(T2 ) − H(T1 )
θ 2κt 1 1 ds = e + (v0 − θ)eκt + 2κ κ κ
(4.53)
θ − v0 . 2
(4.54)
Assuming known values ρT1 and ρT2 for times 0 < T1 < T2 , we want to determine the forward correlation coefficient ρT1 ,T2 to be active between times T1 and T2 such that the covariance between the Brownian motions of the variance process and the exchange rate process agrees with the given values ρT1 and ρT2 . This problem has a simple answer, namely: ρT1 ,T2 = ρT2 ,
T1 ≤ t ≤ T2 .
This can be seen by writing the Heston model in the form: √ (1) dSt = St μ dt + vt dWt , √ √ (1) (2) dvt = κ(θ − vt ) dt + ρσ vt dWt + 1 − ρ2 σ vt dWt ,
(4.55)
(4.56) (4.57)
158
4 FX smile in the Heston model
for a pair of independent Brownian motions W (1) and W (2) . Observe that choosing the forward correlation coefficient as stated does not conflict with the computed forward volatility.
4.5.2 Jump-diffusion models While trying to calibrate short term smiles, the volatility of volatility often seems to explode along with the speed of mean reversion. This is a strong indication that the process ‘wants’ to jump, which of course it is not allowed to do. This observation, together with market crashes, has lead researchers to consider models with jumps (Gatheral, 2006; Lipton, 2002; Martinez and Senge, 2002). Such models have been investigated already in the mid-seventies (Merton, 1976), long before the advent of SV. Jump-diffusion (JD) models are, in general, more challenging to handle numerically than SV models. Like the latter, they result in an incomplete market. But, whereas SV models can be made complete by the introduction of one (or a few) traded options, a JD model typically requires the existence of a continuum of options for the market to be complete. Bates (1996) and Bakshi, Cao, and Chen (1997) suggested using a combination of jumps and stochastic volatility. This approach allows for even a better fit to market data, but at the cost of a larger number of parameters to calibrate from the same market volatility smile. Andersen and Andreasen (2000) let the stock dynamics be described by a JD process with local volatility. This method combines ease of modeling steep, short-term volatility skews (jumps) and accurate fitting to quoted option prices (deterministic volatility function). Other alternative approaches utilize L´evy processes (Cont and Tankov, 2003; Eberlein, Kallsen, and Kristen, 2003) or mixing unconditional disturbances (Tompkins and D’Ecclesia, 2006).
Bibliography Albrecher, H., Mayer, P., Schoutens, W. and Tistaert, J. (2006). The little Heston trap, Wilmott Magazine, January: 83–92. Andersen, L. and Andreasen, J. (2000). Jump-Diffusion Processes: Volatility Smile Fitting and Numerical Methods for Option Pricing, Review of Derivatives Research 4: 231–262. Andersen, L. (2008). Simple and efficient simulation of the Heston stochastic volatility model, The Journal of Computational Finance 11(3), 1–42. Apel, T., Winkler, G., and Wystup, U. (2002). Valuation of options in Heston’s stochastic volatility model using finite element methods, in J. Hakala, U. Wystup (eds.) Foreign Exchange Risk, Risk Books, London. Bakshi, G., Cao, C. and Chen, Z. (1997). Empirical Performance of Alternative Option Pricing Models, Journal of Finance 52: 2003–2049. Bates, D. (1996). Jumps and Stochastic Volatility: Exchange Rate Processes Implicit in Deutsche Mark Options, Review of Financial Studies 9: 69– 107. Borak, S., Detlefsen, K., and H¨ardle, W. (2005). FFT-based option pricing, in P. Cizek, W. H¨ ardle, R. Weron (eds.) Statistical Tools for Finance and Insurance, Springer, Berlin. Broto, C., and Ruiz, E. (2004). Estimation methods for stochastic volatility models: A survey, Journal of Economic Surveys 18(5): 613–649. Carr, P. and Madan, D. (1999). Option valuation using the fast Fourier transform, Journal of Computational Finance 2: 61–73. Cont, R., and Tankov, P. (2003). Financial Modelling with Jump Processes, Chapman & Hall/CRC. Derman, E. and Kani, I. (1994). Riding on a Smile, RISK 7(2): 32–39.
160
Bibliography
Dragulescu, A. A. and Yakovenko, V. M. (2002). Probability distribution of returns in the Heston model with stochastic volatility, Quantitative Finance 2: 443–453. Dupire, B. (1994). Pricing with a Smile, RISK 7(1): 18–20. Eberlein, E., Kallsen, J., and Kristen, J. (2003). Risk Management Based on Stochastic Volatility, Journal of Risk 5(2): 19–44. Fengler, M. (2005). Semiparametric Modelling of Implied Volatility, Springer. Fouque, J.-P., Papanicolaou, G., and Sircar, K.R. (2000). Derivatives in Financial Markets with Stochastic Volatility, Cambridge University Press. Garman, M. B. and Kohlhagen, S. W. (1983). Foreign currency option values, Journal of International Monet & Finance 2: 231–237. Gatheral, J. (2006). The Volatility Surface: A Practitioner’s Guide, Wiley. Glasserman, P. (2004). Monte Carlo Methods in Financial Engineering, Springer-Verlag, New-York. Hakala, J. and Wystup, U. (2002). Heston’s Stochastic Volatility Model Applied to Foreign Exchange Options, in J. Hakala, U. Wystup (eds.) Foreign Exchange Risk, Risk Books, London. Heston, S. (1993). A Closed-Form Solution for Options with Stochastic Volatility with Applications to Bond and Currency Options, Review of Financial Studies 6: 327–343. Hull, J. and White, A. (1987). The Pricing of Options with Stochastic Volatilities, Journal of Finance 42: 281–300. J¨ackel, P. and Kahl, C. (2005). Not-So-Complex Logarithms in the Heston Model, Wilmott Magazine 19: 94–103. Jeanblanc, M., Yor, M. and Chesney, M. (2009). Mathematical Methods for Financial Markets, Springer. Kluge, T. (2002). Pricing derivatives in stochastic volatility models using the finite difference method, Diploma thesis, Chemnitz Technical University. Lee, R., (2004). Option pricing by transform methods: extensions, unification and error control, Journal of Computational Finance 7(3): 51–86. Lewis, A. (2000). Option Valuation under Stochastic Volatility: With Mathematica Code, Finance Press.
Bibliography
161
Lewis, A. (2001). A simple option formula for general jump-diffusion and other exponential L´evy processes, Envision Financial Systems and OptionCity.net, California. http://optioncity.net/pubs/ExpLevy.pdf. Lipton, A. (2001). Mathematical Methods for Foreign Exchange, World Scientific. Lipton, A. (2002). The vol smile problem, Risk, February, 61-65. Lord, R., and Kahl, C. (2006). Why the rotation count algorithm works, Tinbergen Institute Discussion Paper 2006065/2. Lord, R., and Kahl, C. (2010). Complex logarithms in Heston-like models, Mathematical Finance 20(4), 671-694. Martinez, M. and Senge, T. (2002). A Jump-Diffusion Model Applied to Foreign Exchange Markets, in J. Hakala, U. Wystup (eds.) Foreign Exchange Risk, Risk Books, London. Merton, R. (1973). The Theory of Rational Option Pricing, Bell Journal of Economics and Management Science 4: 141–183. Merton, R. (1976). Option Pricing when Underlying Stock Returns are Discontinuous, Journal of Financial Economics 3: 125–144. Reiss, O. and Wystup, U. (2001). Computing Option Price Sensitivities Using Homogeneity, Journal of Derivatives 9(2): 41–53. Rogers, L.C.G. and Williams, D. (2000). Diffusions, Markov Processes, and Martingales – Volume Two: Itˆ o Calculus, McGraw-Hill. Rubinstein, M. (1994). Implied Binomial Trees, Journal of Finance 49: 771– 818. Rudin, W., (1991). Functional Analysis, McGraw-Hill. Schoutens, W., Simons, E. and Tistaert, J. (2004). A perfect calibration! Now what? Wilmott March: 66–78. Schmelzle, M., (2010). Option pricing formulae using Fourier transform: Theory and application, Working paper. Shampine, L.F. (2008). Vectorized adaptive quadrature in Matlab, Journal of Computational and Applied Mathematics 211: 131–140. Stein, E. and Stein, J. (1991). Stock Price Distributions with Stochastic Volatility: An Analytic Approach, Review of Financial Studies 4(4): 727–752.
162
Bibliography
Tompkins, R. G. and D’Ecclesia, R. L. (2006). Unconditional Return Disturbances: A Non-Parametric Simulation Approach, Journal of Banking and Finance 30(1): 287–314. Weron, R. (2004). Computationally intensive Value at Risk calculations, in J.E. Gentle, W. H¨ ardle, Y. Mori (eds.) Handbook of Computational Statistics, Springer. Weron, R., and Wystup. U. (2005). Heston’s model and the smile, in P. Cizek, W. H¨ ardle, R. Weron (eds.) Statistical Tools for Finance and Insurance, Springer. Wystup, U. (2003). The market price of one-touch options in foreign exchange markets, Derivatives Week, 12(13). Wystup, U. (2006). FX Options and Structured Products, Wiley.
5 Pricing of Asian temperature risk Fred Espen Benth, Wolfgang Karl H¨ardle, and Brenda Lopez Cabrera
Global warming increases weather risk by rising temperatures and increasing between weather patterns. PricewaterhouseCoopers (2005) releases the top 5 sectors in need of financial instruments to hedge weather risk. An increasing number of business hedge risks with weather derivatives (WD): financial contracts whose payments are dependent on weather-related measurements. Chicago Mercantile Exchange (CME) offers monthly and seasonal futures and options contracts on temperature, frost, snowfall or hurricane indices in 24 cities in the US., six in Canada, 10 in Europe, two in Asia-Pacific and three cities in Australia. Notional value of CME Weather products has grown from 2.2 USD billion in 2004 to 218 USD billion in 2007, with volume of nearly a million contracts traded, CME (2005). More than the half of the total weather derivative business comes from the energy sector, followed by the construction, the retail and the agriculture industry, according to the Weather Risk Management Association, PricewaterhouseCoopers (2005). The use of weather derivatives can be expected to grow further. Weather derivatives are different from most financial derivatives because the underlying weather cannot be traded and therefore cannot be replicated by other financial instruments. The pricing of weather derivatives has attracted the attention of many researchers. Dornier and Querel (2000) and Alaton, Djehiche and Stillberger (2002) fitted Ornstein-Uhlenbeck stochastic processes to temperature data and investigated future prices on temperature indices. Campbell and Diebold (2005) analyse heteroscedasticity in temperature volatily and Benth (2003), Benth and Saltyte Benth (2005) and Benth, Saltyte Benth and Koekebakker (2007) develop the non-arbitrage framework for pricing different temperature derivatives prices. The market price of risk (MPR) is an important parameter of the associated equivalent martingale measures used to price and hedge weather futures/options in the market. The majority of papers so far have priced non-tradable assets assuming zero MPR, but this assumption underestimates WD prices.
P. Čížek et al. (eds.), Statistical Tools for Finance and Insurance, DOI 10.1007/978-3-642-18062-0_5, © Springer-Verlag Berlin Heidelberg 2011
163
164
5 Pricing of Asian temperature risk
The estimate of the MPR is interesting by its own and has not been studied earlier. We study therefore the MPR structure as a time dependent object with concentration on emerging markets in Asia. We find that Asian Temperatures (Tokyo, Osaka, Beijing, Teipei and Koahsiung) are normal in the sense that the driving stochastics are close to a Wiener Process. The regression residuals of the temperature show a clear seasonal variation and the volatility term structure of cumulative average temperature (CAT) futures presents a modified Samuelson effect. In order to achieve normality in standardized residuals, the seasonal dependence of variance of residuals is calibrated with a truncanted Fourier function and a Generalized Autoregressive Conditional Heteroscedasticity GARCH(p,q). Alternatively, the seasonal variation is smoothed with a Local Linear Regression estimator, that it is based on a locally fitting a line rather than a constant. By calibrating model prices, we imply the market price of temperature risk for Asian futures. Mathematically speaking this is an inverse problem that yields in estimates of MPR. We find that the market price of risk is different from zero when it is assumed to be (non)-time dependent for different contract types and it shows a seasonal structure related to the seasonal variance of the temperature process. The findings support theoretical results of reverse relation between MPR and seasonal variation of temperature process, indicating that a simple parametrization of the MPR is possible and therefore, it can be infered by calibration of the data or by knowing the formal dependence of MPR on seasonal variation for regions where there is no weather derivative market. This chapter is structured as follows, the next section we discuss the fundamentals of temperature derivatives (future and options), their indices and we also describe the monthly temperature futures traded at CME, the biggest market offering this kind of product. Section 3, - the econometric part - is devoted to explaining the dynamics of temperature data by using a continuous autoregressive model (CAR). In section 4, - the financial mathematics part - the weather dynamics are connected with pricing. In section 5, the dynamics of Tokyo and Osaka temperature are studied and by using the implied MPR from cumulative total of 24-hour average temperature futures (C24AT) for Japanese Cities or by knowing the formal dependence of MPR on seasonal variation, new derivatives are priced, like C24AT temperatures in Kaohsiung, where there is still no formal weather derivative market. Section 6 concludes the chapter. All computations in this chapter are carried out in Matlab version 7.6. The temperature data and the Weather Derivative data was provided by Bloomberg Professional service.
5.1 The temperature derivative market
165
5.1 The temperature derivative market The largest portion of futures and options written on temperature indices is traded on the CME, while a huge part of the market beyond these indices takes place OTC. A call option is a contract that gives the owner the right to buy the underlying asset for a fixed price at an agreed time. The owner is not obliged to buy, but exercises the option only if this is of his or her advantage. The fixed price in the option is called the strike price, whereas the agreed time for using the option is called the exercise time of the contract. A put option gives the owner the right to sell the underlying. The owner of a call option written on futures F(τ,τ1 ,τ2 ) with exercise time τ ≤ τ1 and measurement period [τ1 , τ2 ] will receive:
max F(τ,τ1 ,τ2 ) − K, 0 (5.1) where K is the strike price. Most trading in weather markets centers on temperature hedging using either heating degree days (HDD), cooling degree days (CDD) and Cumulative Averages (CAT). The HDD index measures the temperature over a period [τ1 , τ2 ], usually between October to April, and it is defined as: τ2 HDD(τ1 , τ2 ) = max(c − Tu, 0)du (5.2) τ1
where c is the baseline temperature (typically 18◦ C or 65◦ F) and Tu is the average temperature on day u. Similarly, the CDD index measures the temperature over a period [τ1 , τ2 ], usually between April to October, and it is defined as: τ2 CDD(τ1 , τ2 ) = max(Tu − c, 0)du (5.3) τ1
The HDD and the CDD index are used to trade futures and options in 20 US cities (Cincinnati, Colorado Springs, Dallas, Des Moines, Detroit, Houston, Jacksonville, Kansas City, Las Vegas, Little Rock, Los Angeles, MinneapolisSt. Paul, New York, Philidelphia, Portland, Raleigh, Sacramento, Salt Lake City, Tucson, Washington D.C), six Canadian cities (Calgary, Edmonton, Montreal, Toronto, Vancouver and Winnipeg) and three Australian cities (Brisbane, Melbourne and Sydney). The CAT index accounts the accumulated average temperature over a period [τ1 , τ2 ] days: τ2 CAT(τ1 , τ2 ) = Tu du (5.4) τ1
166
5 Pricing of Asian temperature risk −T
T
where Tu = t,max 2 t,min . The CAT index is the substitution of the CDD index for nine European cities (Amsterdam, Essen, Paris, Barcelona, London, Rome, Berlin, Madrid, Oslo, Stockholm). Since max(Tu −c, 0)−max(c−Tu , 0) = Tu −c, we get the HDD-CDD parity CDD(τ1 , τ2 ) − HDD(τ1 , τ2 ) = CAT(τ1 , τ2 ) − c(τ2 − τ1 )
(5.5)
Therefore, it is sufficient to analyse only HDD and CAT indices. An index similar to the CAT index is the Pacific Rim Index, which measures the accumulated total of 24-hour average temperature (C24AT) over a period [τ1 , τ2 ] days for Japanese Cities (Tokyo and Osaka): τ2 C24AT(τ1 , τ2 ) = T+u du (5.6) τ1
where T+u =
1 24
24 1
Tui dui and Tui denotes the temperature of hour ui .
The options at CME are cash settled i.e. the owner of a future receives 20 times the Degree Day Index at the end of the measurement period, in return for a fixed price (the future price of the contract). The currency is British pounds for the European Futures contracts, US dollars for the US contracts and Japanese Yen for the Asian cities. The minimum price increment is one Degree Day Index point. The degree day metric is Celsius and the termination of the trading is two calendar days following the expiration of the contract month. The Settlement is based on the relevant Degree Day index on the first exchange business day at least two calendar days after the futures contract month. The accumulation period of each CAT/CDD/HDD/C24AT index futures contract begins with the first calendar day of the contract month and ends with the calendar day of the contract month. Earth Satellite Corporation reports to CME the daily average temperature. Traders bet that the temperature will not exceed the estimates from Earth Satellite Corporation. At the CME, the measurement periods for the different temperature indices are standarized to be each month of the year and two seasons: the winter (October - April) and summer season (April - October). The notation for temperature futures contracts is the following: F for January, G for February, H for March, J for April, K for May, M for June, N for July, Q for August, U for October, V for November and X for December. J7 stands for 2007, J8 for 2008, etc. Table 5.1 describes the CME future data for Osaka historical temperature data, obtained from Earth Satellite (EarthSat) corporation (the providers of temperature derivative products traded at CME). The J9 contract corresponds to a contract with the temperature measurement period from 20090401 (τ1 ) to
5.2 Temperature dynamics
167
Table 5.1: C24AT Contracts listed for Osaka at the beginning of the measurement period (τ1 −τ2 ) and CME and C24ATs from temperature data. Source: Bloomberg Code F9 G9 H9 J9 K9
First-trade 20080203 20080303 20080403 20080503 20080603
Last-trade 20090202 20090302 20090402 20100502 20090602
τ1 20090101 20090201 20090301 20090401 20090501
τ2 20090131 20090228 20090331 20090430 20090531
CME 200.2 220.8 301.9 460.0 592.0
ˆ C24AT 181.0 215.0 298.0 464.0 621.0
20090430 (τ2 ) and trading period (t) from 20080503 to 20080502. At trading day t, CME issues seven contracts (i = 1, · · · , 7) with measurement period τ1i ≤ t < τ2i (usually with i = 1) or t ≤ τ1i < τ2i with i = 1, . . . , 7 (six months ahead from the trading day t). Table 5.1 also shows the C24AT from the historical temperature data obtained from Osaka Kansai International Airport. Both indices are notably differed and the raised question here is related to weather modelling and forecasting. The fair price of a temperature option contract, derived via a hedging strategy and the principle of no arbitrage, requires a stochastic model for the temperature dynamics. In the next section, a continuous-time process AR(p) (CAR(p)) is proposed for the temperature modelling.
5.2 Temperature dynamics Suppose that (Ω, F , P ) is a probability space with a filtration {Ft }0≤t≤τmax , where τmax denotes a maximal time covering all times of interest in the market. The various temperature forward prices at time t depends explicitly on the state vector Xt . Let Xq(t) be the q’th coordinate of the vector Xt with q = 1, .., p. Here it is claimed that Xt is namely the temperature at times t, t − 1, t − 2, t − 3 . . .. Following this nomenclature, the temperature time series at time t (q = 1): Tt = Λt + X1(t)
(5.7)
168
5 Pricing of Asian temperature risk
with Λt a deterministic seasonal function. Xq(t) of a continuous-time process AR(p) (CAR(p)). ⎛ 0 1 0 ... ⎜ 0 0 1 ... ⎜ ⎜ . . .. .. A=⎜ ⎜ ⎝ 0 ... ... 0 −αp −αp−1 . . .
can be seen as a discretization Define a p × p-matrix: ⎞ 0 0 0 0 ⎟ ⎟ .. ⎟ (5.8) 0 . ⎟ ⎟ ⎠ 0 1 0 −α1
in the vectorial Ornstein-Uhlenbeck process Xt ∈ Rp for p ≥ 1 as: dXt = AXt dt + ept σt dBt
(5.9)
where ek denotes the k’th unit vector in Rp for k = 1, ...p, σt > 0 states the temperature volatility, Bt is a Wiener Process and αk are positive constants. Note that the form of the Ap×p -matrix, makes Xq(t) to be a Markov process. By applying the multidimensional Itˆ o F ormula, the process in Equation (5.9) has the explicit form: s Xs = exp {A(s − t)}x + exp {A(s − u)}ep σu dBu (5.10) t
for s ≥ t ≥ 0 and stationarity when the eigenvalues of A have
tholds negative real 2 part or the variance matrix 0 σt−s exp {A(s)}ep e p exp A (s) ds converges as t → ∞. There is an analytical link between Xq(t) , and the lagged temperatures up to time t − p. We first say that the state vector Xt is given by the prediction from the dynamics in (5.9). Using the expected value as the prediction, and by abusing the notation, we say that the state Xt is given as the solution of the first-order system of differential equations dXt = AXt dt
(5.11)
By substituting iteratively into the discrete-time dynamics, one obtains that: p = 1, Xt = X1(t) and dX1(t) = −α1 X1(t) dt p = 2, dt = 1, X1(t+2) ≈ (2 − α1 )X1(t+1) + (α1 − α2 − 1)X1(t)
5.2 Temperature dynamics
169
p = 3, X1(t+1) − X1(t) X2(t+1) − X2(t)
= X2(t) dt = X3(t) dt
X3(t+1) − X3(t) X1(t+2) − X1(t+1)
= −α3 X1(t) dt − α2 X2(t) dt − α1 X3(t) dt = X2(t+1) dt
X2(t+2) − X2(t+1) X3(t+2) − X3(t+1)
= X3(t+1) dt = −α3 X1(t+1) dt − α2 X2(t+1) dt − α1 X3(t+1) dt
X1(t+3) − X1(t+2) X2(t+3) − X2(t+2)
= X2(t+2) dt = X3(t+2) dt
X3(t+3) − X3(t+2)
= −α3 X1(t+2) dt − α2 X2(t+2) dt − α1 X3(t+2) dt
substituting into the X1 dynamics and setting dt = 1: X1(t+3)
≈ (3 − α1 )X1(t+2) + (2α1 − α2 − 3)X1(t+1) + (−α1 + α2 − α3 + 1)X1(t)
(5.12)
Now, we approximate by Euler discretization to get the following for X1(t) , X2(t) and X3(t) . For X3(t) and using a time step of length 2 (dt = 2), we obtain X3(t+2) − X3(t) = −α3 X1(t) · 2 − α2 X2(t) · 2 − α1 X3(t ) · 2 . Using the Euler approximation on X2(t) with time step 1 yields X2(t+1) − X2(t) = X3(t) and similarly for X1t we get X1(t+1) − X1(t) = X2(t) and X1(t+2) − X1(t+1) = X2(t+1) Hence, inserting in the approximation of X3(t) we find X3(t+2)
= +
(1 − 2α1 + 2α2 − 2α3 )X1(t) + (4α1 − 2α2 − 2)X1(t+1) (1 − 2α1 )X1(t+2) (5.13)
Thus, we see that we can recover the state of X3(t) by inserting X1(t) = Tt − Λt at times t, t − 1 and t − 2. Next, we have X2(t+2) − X2(t+1) = X3(t+1)
170
5 Pricing of Asian temperature risk
which implies, using the recursion on X3(t+2) in (5.13) X2(t+2)
= +
X2(t+1) + (1 − 2α1 + 2α2 − 2α3 )X1(t−1) − (4α1 − 2α2 − 2)X1(t) (1 − 2α1 )X1(t+1)
Inserting for X2(t+1) , we get X2(t+2)
=
X1(t+2) − 2α1 X1(t+1) + (−2 + 4α1 − 2α2 )X1(t)
+
(1 − 2α1 + 2α2 − 2α3 )X1(t−1)
(5.14)
We see that X2(t+2) can be recovered by the temperature observation at times t + 2, t + 1, t and t − 1. Finally, the state of X1(t) is obviously simply today’s temperature less its seasonal state.
5.3 Temperature futures pricing As temperature is not a tradable asset in the market place, no replication arguments hold for any temperature futures and incompleteness of the market follows. In this context all equivalent measures Q will be risk-neutral probabilities. We assume the existence of a pricing measure Q, which can be parametrized and complete the market, Karatzas and Shreve (2001). For that, we pin down an equivalent measure Q = Qθt to compute the arbitrage free price of a temperature future: F(t,τ1 ,τ2 ) = EQθt [Y (Tt )|Ft ]
(5.15)
with Y (Tt ) being the payoff from the temperature index (CAT, HDD, CDD indices) and θt denotes the time dependent market price of risk (MPR). The risk adjusted probability measure can be retrieved via Girsanov’s theorem, by establishing: t Btθ = Bt − θu du (5.16) 0
Btθ is a Brownian motion for any time before the end of the trading time (t ≤ τmax ) and a martingale under Qθt . Here the market price of risk (MPR) θt = θ is as a real valued, bounded and piecewise continuous function. Under Qθ , the temperature dynamics of (5.10) become dXt = (AXt + ep σt θt )dt + ep σt dBtθ
(5.17)
5.3 Temperature futures pricing with explicit dynamics, for s ≥ t ≥ 0: Xs
171
s
= exp {A(s − t)}x + exp {A(s − u)}ep σu θu du t s + exp {A(s − u)}ep σu dBuθ
(5.18)
t
From Theorem 4.2 (page 12) in Karatzas and Shreve (2001) we can parametrize the market price of risk θt and relate it to the risk premium for traded assets (as WD are indeed tradable assets) by the equation μt + δt − rt = σt θt
(5.19)
where μt is the mean rate of return process, δt defines a dividend rate process, σt denotes the volatility process and rt determines the risk-free interest rate process of the traded asset. In other words, the risk premium is the compensation, in terms of mean growth rate, for taking additional risk when investing in the traded asset. Assuming that δt = 0, a sufficient parametrization of the MPR is setting θt = (μt − rt )/σt to make the discounted asset prices martingales. We later relax that assumption, by considering the time dependent market price of risk.
5.3.1 CAT futures and options Following Equation (5.15) and using Fubini-Tonelli, the risk neutral price of a future based on a CAT index under Qθ is defined as: τ2 Qθ FCAT (t,τ1 ,τ2 ) = E Ts ds|Ft (5.20) τ1
For contracts whose trading date is earlier than the temperature measurement period, i.e. 0 ≤ t ≤ τ1 < τ2 , Benth et al. (2007) calculate the future price explicitly by inserting the temperature model (5.7) into (5.20): τ2 τ1 FCAT (t,τ1 ,τ2 ) = Λu du + at,τ1 ,τ2 Xt + θu σu at,τ1 ,τ2 ep du τ1 t τ2 −1 + θ u σu e [exp {A(τ2 − u)} − Ip ] ep du (5.21) 1A τ1 −1 e 1A
with at,τ1 ,τ2 = [exp {A(τ2 − t)} − exp {A(τ1 − t)}] and p × p identity matrix Ip . While for CAT futures traded between the measurement period i.e.
172
5 Pricing of Asian temperature risk
τ1 ≤ t < τ2 , the risk neutral price is: t Qθ Qθ FCAT (t, τ1 , τ2 ) = E Ts ds|Ft + E τ1 t
=
E Qθ τ2
+
Ts ds|Ft
t
Ts ds|Ft +
τ1
τ2
τ2
Λu du + at,t,τ2 Xt
t
−1 θu σu e [exp {A(τ2 − u)} − Ip ] ep du 1A
t −1 e 1 A
where at,t,τ2 = [exp {A(τ2 − t)} − Ip ]. Since the expected value of the temperature from τ1 to t is already known, this time the future price consists on a random and a deterministic part. Details of the proof can be found in Benth, Saltyte Benth and Koekebakker (2008). Note that the CAT futures price is given by the aggregated mean temperature (seasonality) over the measurement period plus a mean reversion weighted dependency on Xt , which is depending on the temperature of previous days Tt−k , k ≤ p. The last two terms smooth the market price of risk over the period from the trading date t to the end of the measurement period τ2 , with a change happening in time τ1 . Similar results hold for the C24AT index futures. Note that the only coordinate of Xt that has a random component dB θ is Xpt , hence the dynamics under Qθ of FCAT (t, τ1 , τ2 ) is: dFCAT (t, τ1 , τ2 ) = σt at,τ1 ,τ2 ep dBtθ where σt at,τ1 ,τ2 ep denotes CAT future volatility. From the risk neutral dynamics of FCAT (t, τ1 , τ2 ), the explicit formulae for the CAT call option written on a CAT future with strike K at exercise time τ < τ1 during the period [τ1 , τ2 ]: CCAT (t,τ,τ1 ,τ2 )
= × +
exp {−r(τ − t)} " FCAT (t,τ1 ,τ2 ) − K Φ {d (t, τ, τ1 , τ2 )} τ Σ2CAT (s,τ1 ,τ2 ) dsΦ {d (t, τ, τ1 , τ2 )} t
2
where
τ
d (t, τ, τ1 , τ2 ) = FCAT (t,τ1 ,τ2 ) − K/ t
Σ2CAT (s,τ1 ,τ2 ) ds
and ΣCAT (s,τ1 ,τ2 ) = σt at,τ1 ,τ2 ep
(5.22)
5.3 Temperature futures pricing
173
Note that once that a risk neutral probability Qθ is chosen, the market of futures and options is complete and therefore we can replicate the option. In order to do that, one should compute the number of CAT-futures held in the portfolio, which is simply computed by the option’s delta: ∂CCAT (t,τ,τ1 ,τ2 ) = Φ {d (t, T, τ1 , τ2 )} ∂FCAT (t,τ1 ,τ2 )
(5.23)
The strategy holds close to zero CAT futures when the option is far out of the money, close to 1 otherwise.
5.3.2 CDD futures and options Analogously, one derives the CDD future price. Following (5.15), the risk neutral price of a CDD future which is traded at 0 ≤ t ≤ τ1 < τ2 is defined as: τ2 Qθ FCDD(t,τ1 ,τ2 ) = E max(Ts − c, 0)ds|Ft τ1 τ2 m{t,s,e exp{A(s−t)}Xt } − c 1 = υt,s ψ ds (5.24) υt,s τ1 where
m{t,s,x}
=
σu θ u e 1 exp {A(s − t)} ep du + x
t
2 υt,s
s
Λs − c + s
=
" #2 σu2 e du 1 exp {A(s − t)} ep
t
ψ(x) = xΦ(x) + ϕ(x)
(5.25)
For CDD futures contracts traded at τ1 ≤ t ≤ τ2 , the non-abitrage price of a CDD future is: τ2 FCDD(t,τ1 ,τ2 ) = EQθ max(Ts − c, 0)ds|Ft
τ1 t
max(Ts − c, 0)ds|Ft τ2 m{t,s,e exp{A(s−t)}Xt } − c 1 + υt,s ψ ds υt,s t = E
Qθ
τ1
(5.26)
174
5 Pricing of Asian temperature risk
2 with m{t,s,x} and υt,s defined as above. Note again that the expected value of the temperature from τ1 to t is known.
The dynamics of the FCDD(t,τ1 ,τ2 ) for 0 ≤ t ≤ τ1 under Qθ is given by: τ2 dFCDD (t, τ1 , τ2 ) = σt e 1 exp {A(s − t)} ep τ1 m{t,s,e exp{A(s−t)}Xt } − c 1 × Φ dsdBtθ υt,s The term structre of volatility for CDD futures is defined as: τ2 ΣCDD(s,τ1 ,τ2 ) = σt e 1 exp {A(s − t)} ep τ1 m{t,s,e exp{A(s−t)}Xt } − c 1 × Φ ds υt,s
(5.27)
For the call option written CDD-future, the solution is not analytical but is given in terms of an expression suitable for Monte Carlo simulation. The risk neutral price of a CDD call written on a CDD future with strike K at exercise time τ < τ1 during the period [τ1 , τ2 ]: = exp {−r(τ − t)} τ2 mindex − c × E max υτ,s ψ ds − K, 0 (5.28) υτ,s τ1 x=Xt τ index = τ, s, e exp {A(s − t)} x + e 1 1 exp {A(s − u)} ep σu θu du + Σ(s,t,τ ) Y CCDD(t,T,τ1 ,τ2 )
t
Y ∼N(0, 1)
Σ2(s,t,τ )
τ
=
" #2 e1 exp {A(s − u)} ep σu2 du
t
If the ΣCDD(s,τ1 ,τ2 ) is non-zero for almost everywhere t ∈ [0, τ ], then the hedging strategy HCDD is given by: τ2 m(τ,s,Z(x)) − c σt HCDD(t,τ1 ,τ2 ) = E 1 υτ,s ψ ds > K ΣCDD(s,τ1 ,τ2 ) υτ,s τ1 τ2 m(τ,s,Z(x)) − c × e1 exp {A(s − t)} ep Φ ds (5.29) υτ,s τ1 x=Xt
5.3 Temperature futures pricing
175
for t ≤ τ , where Z(x) is a normal random variable with mean τ e exp {A(s − t)} x + e 1 1 exp {A(s − u)} ep σu θu du t
and variance
Σ2(s,t,τ ) .
5.3.3 Infering the market price of temperature risk In the weather derivative market there is obviously the question of choosing the right price among possible arbitrage free prices. For pricing nontradable assets one essentially needs to incorporate the market price of risk (MPR), which is an important parameter of the associated equivalent martingale measures used to price and hedge weather futures/options in the market. MPR can be calibrated from data and thereby using the market to pin down the price of the temperature derivative. Once we know the MPR for temperature futures, then we know the MPR for options. By inverting (5.21), given observed prices, θt is inferred for contracts with trading date t ≤ τ1 < τ2 . Setting θti as a constant for each of the i contract, with i = 1 . . . 7, θˆti is estimated via: FAAT (t,τ1i ,τ2i ) −
arg min θˆti
−
θˆti t
τ2i
+ τ1i
τ2i τ1i
$t ˆ u du − a ˆt,τ1i ,τ2i X Λ
τ1i
ˆ t,τ1i ,τ2i ep du σ ˆu a
−1 σ ˆ u e 1A
"
exp
A(τ2i
# − u) − Ip ep du
!2 (5.30)
A simpler parametrization of θt is to assume that it is constant for all maturities. We therefore estimate this constant θt for all contracts with t ≤ τ1i < τ2i ,
176
5 Pricing of Asian temperature risk
i = 1, · · · , 7 as follows: FCAT (t,τ1i ,τ2i ) −
arg min Σ7i=1 θˆt
t
τ2i
+ τ1i
τ1i
ˆ u du − a ˆ t,τ1i ,τ2i Xt Λ
τ1i
θˆt
−
τ2i
ˆ t,τ1i ,τ2i ep du σ ˆu a
−1 σ ˆu e 1 A
"
# exp A(τ2i − u) − Ip ep du
!2
Assuming now that, instead of one constant market price of risk per trading day, we have a step function with jump θˆt = I (u ≤ ξ) θˆt1 + I (u > ξ) θˆt2 with jump point ξ (take e.g. the first 150 days before the beginning of the measurement period). Then we estimate θˆt for contracts with t ≤ τ1i < τ2i , i = 1, · · · , 7 by: f (ξ) =
arg min Σ7i=1 FCAT (t,τ1i ,τ2i ) − θˆt1 ,θˆt2
−
θˆt1
τ1i
t
τ2i
+ τ1i
θˆt2 t
τ1i
ˆ t,τ1i ,τ2i ep du I (u > ξ) σ ˆu a
I (u > τ1i
ˆ u du − a ˆ t,τ1i ,τ2i Xt Λ
ˆ t,τ1i ,τ2i ep du I (u ≤ ξ) σ ˆu a
τ2i
+
τ1i
"
# −1 I (u ≤ ξ) σ ˆu e exp A(τ2i − u) − Ip ep du 1 A
−
τ2i
−1 ξ) σ ˆu e 1 A
"
# exp A(τ2i − u) − Ip ep du
!2
Generalising the previous piecewise continuous function, the (inverse) problem of determining θt with t ≤ τ1i < τ2i , i = 1, · · · 7 can be formulated via a series
5.4 Asian temperature derivatives
177
expansion for θt :
FAAT (t,τ1i ,τ2i ) −
arg min Σ7i=1 γ ˆk
−
K τ1i
t
− −
τ2i
τ1i
ˆ u du − a $t ˆ t,τ1i ,τ2i X Λ
ˆ k (ui )ˆ ˆ t,τ1 ,τ2 ep dui γˆk h σui a
k=1 K τ2i
τ1i
"
−1 ˆ k (ui )ˆ γˆk h σui e exp A(τ2i − ui ) 1A
k=1
!2
Ip ] ep dui
(5.31)
where hk (ui ) is a vector of known basis functions and γk defines the coefficients. Here hk (ui ) may denote a spline basis for example. H¨ ardle and L´opez Cabrera (2009) show additional methods about how to infere the MPR.
5.4 Asian temperature derivatives 5.4.1 Asian temperature dynamics We turn now to the analysis of the weather dynamics for Tokyo, Osaka, Beijing and Taipei daily temperature data. The temperature data were obtained from the Tokyo Narita International Airport, Osaka Kansai International Airport and Bloomberg. We consider recordings of daily average temperatures from 19730101 - 20090604. In all studied data, a linear trend was not detectable but a clear seasonal pattern emerged. Figure 5.1 shows 8 years of daily average temperatures and the least squares fitted seasonal function with trend: 2π(t − a3 ) Λt = a0 + a1 t + a2 cos (5.32) 365 The estimated coefficients are displayed in Table 5.2. The low order polynomial deterministic trend smooths the seasonal pattern and makes the model to be parsimonius. The coefficient a ˆ0 can be interpretated as the average temperature, while a ˆ1 as the global warming trend component. In most of the Asian cases, as expected, the low temperatures are observed in the winter and high temperatures in the summer.
5 Pricing of Asian temperature risk
Temperature inTokyo
178
30
20
10
Temperature inOsaka
0 20010101
20050101 Time
20070101
20090101
20070101
20090101
20070101
20081231
20070101
20081231
30 20 10 0
20010101
Temperature inBeijing
20030101
20030101
20050101 Time
30 20 10 0 −10
Temperature inTaipei
20010101
20030101
20050101 Time
30 20
10 20010101
20040101 Time
Figure 5.1: Seasonality effect and daily average temperatures for Tokyo Narita International Airport, Osaka Kansai International Airport, Beijing and Taipei. STFAsianWeather1
After removing the seasonality in (5.32) from the daily average temperatures, Xt = Tt − Λt
(5.33)
we check whether Xt is a stationary process I(0). In order to do that, we apply the Augmented Dickey-Fuller test (ADF) (1 − L)X = c1 + μt + τ LX + α1 (1 − L)LX + . . . αp (1 − L)Lp X + εt , where p is the number of lags by which
5.4 Asian temperature derivatives
179
Table 5.2: Seasonality estimates of daily average temperatures in Asia. Data source: Bloomberg City Tokyo Osaka Beijing Taipei
Period 19730101-20081231 19730101-20081231 19730101-20081231 19920101-20090806
a ˆ0 15.76 15.54 11.97 23.21
a ˆ1 7.82e-05 1.28e-04 1.18e-04 1.68e-03
a ˆ2 10.35 11.50 14.91 6.78
a ˆ3 -149.53 -150.54 -165.51 -154.02
Table 5.3: Stationarity tests. City Tokyo Osaka Beijing Taipei
τˆ(p-value) -56.29(< 0.01) -17.86(< 0.01) -20.40(< 0.01) -33.21(< 0.01)
ˆ k(p-value) 0.091(< 0.1) 0.138(< 0.1) 0.094(< 0.1) 0.067(< 0.1)
the regression is augmented to get residuals free of autocorrelation. Under H0 (unit root), τ should be zero. Therefore the test statistic of the OLS estimator of τ is applicable. If the null hypothesis H0 (τ = 0) is rejected then Xt is a stationary process I(0). Stationarity can also be verified by using the KPSS Test: Xt = c + μt + t k i=1 ξi +εt with stationary εt and iid ξt with an expected value 0 and variance 1. If H0 : k = 0 is accepted then the process is stationary. The estimates of τ and k of the previuos stationarity tests are illustrated in Table 5.3, indicating that the stationarity is achieved. The Partial Autocorrelation Function (PACF) of (5.33) suggests that higher order autoregressive models AR(p), p > 1 are suitable for modelling the time evolution of Asia temperatures after removing seasonality, see Figure 5.2. The covariance stationarity dynamics were captured using autoregressive lags over different year-lengths moving windows, as it is denoted in Table 5.4 and Table 5.5 for the case of Tokyo and Osaka. The autoregressive models showed, for larger length periods, higher order p and sometimes lack of stability (AR *), i.e. the eigenvalues of matrix A (5.8) had positive real part. Since local estimates of the a fitted seasonal variation σt with GARCH models captures long memory affects and assuming that it shocks temperature residuals in the same way over different length periods, the autoregressive model AR(3) was therefore chosen. The choice p = 3 is also
180
5 Pricing of Asian temperature risk PACF, alpha=0.05 0.7
0.6
0.6
0.5
0.5
Partial Autocorrelation function
Partial Autocorrelation function
PACF, alpha=0.05 0.7
0.4
0.3
0.2
0.1
0.3
0.2
0.1
0
0
−0.1 0
0.4
5
10
15
20
25 Lags
30
35
40
45
−0.1 0
50
5
10
15
20
25 Lags
30
35
40
45
50
35
40
45
50
PACF, alpha=0.05 0.7
PACF, alpha=0.05 0.8
0.6 0.7 Partial Autocorrelation function
0.5 Partial Autocorrelation function
0.6 0.5 0.4 0.3 0.2
0.4 0.3 0.2 0.1 0
0.1 −0.1 0 −0.1 0
−0.2 0 5
10
15
20
25 Lags
30
35
40
45
5
50
10
15
20
25 Lags
30
Figure 5.2: Partial autocorrelation function (PACF) for Tokyo (upper left), Osaka (upper right), Beijing (lower left), Taipei (lower right) STFAsianWeather2
Table 5.4: Tokyo Moving window for AR, * denotes instability Year 73-75 76-78 79-81 82-84 85-87 88-90 91-93 94-96 97-99 00-02 03-05 06-09
every 3 years AR(1) AR(1) AR(1) AR(8)* AR(1) AR(1) AR(1) AR(1) AR(1) AR(1) AR(3) AR(1)
every 6 years AR(3)
every 9 years AR(3)
every 12 years AR(8)*
AR(8)* AR(3)
every 18 years
AR(9)* AR(9)* AR(3)
AR(3)
AR(3)
AR(1) AR(3)
AR(3) AR(3)
AR(3)
5.4 Asian temperature derivatives
181
Table 5.5: Osaka Moving window for AR, * denotes instability. Year 73-75 76-78 79-81 82-84 85-87 88-90 91-93 94-96 97-99 00-02 03-05 06-09
every 3 years AR(1) AR(3) AR(3) AR(2) AR(3) AR(3) AR(3) AR(1) AR(2) AR(1) AR(3) AR(1)
every 6 years
every 9 years
AR(3)
AR(3)
every 12 years
every 18 years
AR(3)
AR(3)
AR(6)* AR(3)
AR(3)
AR(6)* AR(3)
AR(6)*
AR(2) AR(3)
AR(7)* AR(3)
AR(7)*
Table 5.6: Coefficients of (C)AR(p), Model selection: AIC. AR
CAR
Eigenvalues
Coefficient β1 β2 β3 α1 α2 α3 real part of λ1 real part of λ2,3
Tokyo(p=3) 0.668 -0.069 -0.079 -2.332 1.733 -0.480 -1.257 -0.537
Osaka(p=3) 0.748 -0.143 -0.079 -2.252 -1.647 -0.474 -1.221 -0.515
Beijing(p=3) 0.741 -0.071 0.071 -2.259 -1.589 -0.259 -0.231 -1.013
Taipei(p=3) 0.808 -0.228 0.063 -2.192 -1.612 -0.357 -0.396 -0.8976
confirmed by the Akaike and Schwarz information criteria for each city. The coefficients of the fitted autoregressive process Xt+p =
p
βi Xt+p−i + σt εt
(5.34)
i=1
and their corresponding are CAR(3)-parameters displayed in Table 5.6. The stationarity condition is fulfilled since the eigenvalues of A have negative real parts. The element components of the matrix A (5.8) do not change over time, this makes the process stable. After trend and seasonal components were removed, the residuals εt and the squared residuals ε2t of (5.34) for Chinese temperature data are plotted in the Figure 5.3 and for Japan in Figure 5.4. According to the modified Li-McLeod Portmanteau test, we reject at 1% significance level the null hypothesis H0 that the residuals are uncorrelated. The ACF of the residuals of AR(3) for
182
5 Pricing of Asian temperature risk
Table 5.7: First 9 Coefficients of σt2 and GARCH(p = 1, q = 1). Tokyo Osaka Beijing Taipei
cˆ1 3.91 3.40 3.95 3.54
cˆ2 -0.08 0.76 0.70 1.49
cˆ3 0.74 0.81 0.82 1.62
cˆ4 -0.70 -0.58 -0.26 -0.41
cˆ5 -0.37 -0.29 -0.50 -0.19
cˆ6 -0.13 -0.17 -0.20 0.03
cˆ7 -0.14 -0.07 -0.17 -0.18
cˆ8 0.28 0.01 -0.05 -0.11
cˆ9 -0.15 -0.04 0.10 -0.16
α 0.09 0.04 0.03 0.06
β 0.50 0.52 0.33 0.33
Asian cities is close to zero and according to Box-Ljung statistic the first few lags are insignificant as it is displayed in Figure 5.5 for the case of China and Figure 5.6 for the case of Japan. However, the ACF for the squared residuals (also displayed in Figure 5.5, 5.6) shows a high seasonal pattern. 2 This seasonal dependence of variance of residuals of the AR(3) (ˆ σt,F T SG ) for the Asian cities is calibrated with a truncanted Fourier function and a Generalized Autoregressive Conditional Heteroscedasticity GARCH(p,q):
2 σ ˆt,F T SG
= c1 + +
16
2iπt 365
c2i cos i=1 2 2 α1 (σt−1 εt−1 )2 + β1 σt−1
+ c2i+1 sin
2iπt 365
(5.35)
Alternatively to the seasonal variation of the 2 step model, one can smooth the 2 data with a Local Linear Regression (LLN) σ ˆt,LLR estimator: min a,b
365 i=1
!2 2 σ ˆt,LLR,i
− a(t) − b(t)(Ti − t)
Ti − t K h
! (5.36)
Asympotically they can be approximated by Fourier series estimators. Table 5.7 shows the first 9 coefficients of the seasonal variation using the 2 steps model. Figure 5.7 shows the daily empirical variance (the average of 35 years squared residuals for each day of the year) and the fitted squared volatility 2 2 function for the residuals σ ˆt,F ˆt,LLR using Epanechnikov Kernel und T SG and σ bandwidth bandwidth h = 4.49, 4.49 for the Chinese cities and h = 3.79 for Japanese cities at 10% significance level. The results are different to the Campbell and Diebold (2005) effect for American and European temperature data, high variance in earlier winter - spring and low variance in late winter - late summer. Figure 5.8 and 5.9 shows the ACF plot of the Asian temperature residuals εˆt 2 and squared residuals εˆ2t , after dividing out the seasonal volatility σ ˆt,LLR from
5.4 Asian temperature derivatives
183
20
Residuals
10 0 −10 −20 1973
1979
1985
1991 Time
1997
2003
2008
1979
1985
1991 Time
1997
2003
2008
1995
1998
2001 Time
2004
2007
1995
1998
2001 Time
2004
2007
Squared Residuals
200 150 100 50 0 1973 20
Residuals
10 0 −10 −20 1992
Squared Residuals
200 150 100 50 0 1992
Figure 5.3: Residuals εˆt and squared residuals εˆ2t of the AR(p) (for Beijing (12 panel) and Taipei (3-4 pannel)) during 19730101-20081231. No rejection of H0 that the residuals are uncorrelated at 0% significance level, according to the modified Li-McLeod Portmanteau test STFAsianWeather3
the regression residuals. The ACF plot of the residuals remain unchanged and now the ACF plot for squared residuals presents a non-seasonal pattern. Table 5.8 shows the statistics for the standardized residuals under different
184
5 Pricing of Asian temperature risk 20
Residuals
10 0 −10 −20 1973
1979
1985
1991 Time
1997
2003
2008
1979
1985
1991 Time
1997
2003
2008
1979
1985
1991 Time
1997
2003
2008
1979
1985
1991 Time
1997
2003
2008
Squared Residuals
200 150 100 50 0 1973 20
Residuals
10 0 −10 −20 1973
Squared Residuals
200 150 100 50 0 1973
Figure 5.4: Residuals εˆt and squared residuals εˆ2t of the AR(p) (for Tokyo (12 panel) and Osaka (3-4 pannel)) during 19730101-20081231. No rejection of H0 that the residuals are uncorrelated at 0% significance level, according to the modified Li-McLeod Portmanteau test STFAsianWeather3
εˆt seasonal variations ( σt,FεˆtT S , σt,FεˆTt SG and σt,LLR ). The estimate of the seasonal variation by local linear regression was closer to the normal residuals. The acceptance of the null hyptohesis of normality is at 1% significance level.
5.4 Asian temperature derivatives
185
0.04
ACF Residuals
0.02 0 −0.02 0
100
200
300
400
500 Lag
600
700
800
900
1000
100
200
300
400
500 Lag
600
700
800
900
1000
100
200
300
400
500 Lag
600
700
800
900 1.000
100
200
300
400
500 Lag
600
700
800
900 1.000
0.08 ACF Squared residuals
0.06 0.04 0.02 0 −0.02 0
ACF Residuals
0.04 0.02 0
−0.02
ACF Squared residuals
0 0.1
0.05 0
−0.05 −0.1 0
Figure 5.5: ACF for residuals εˆt and squared residuals εˆ2t of the AR(p) of the AR(p) (for Beijing (1-2 panel) and Taipei (3-4 pannel)) during 19730101-20081231 STFAsianWeather4
The log Kernel smoothing density estimate against a log Normal Kernel evaluated at 100 equally spaced points for Asian temperature residuals has been plotted in Figure (5.10) to verify if residuals become normally distributed. The seasonal variation modelled by GARCH (1,1) and the local linear regression are adequately capturing the intertemporal dependencies in daily temperature.
186
5 Pricing of Asian temperature risk
ACF Residuals
0.04 0.02 0 −0.02 0
100
200
300
400
500 Lag
600
700
800
900
1000
100
200
300
400
500 Lag
600
700
800
900
1000
100
200
300
400
500 Lag
600
700
800
900
1000
100
200
300
400
500 Lag
600
700
800
900
1000
ACF Squared residuals
0.08 0.06 0.04 0.02 0 −0.02 0
ACF Residuals
0.04 0.02 0 −0.02 0
ACF Squared residuals
0.08 0.06 0.04 0.02 0 −0.02 0
Figure 5.6: ACF for residuals εˆt and squared residuals εˆ2t of the AR(p) of the AR(p) (for Tokyo (1-2 panel) and Osaka (3-4 panel)) during 19730101-20081231 STFAsianWeather4
5.4 Asian temperature derivatives
187
8
12
7
10
Seasonal Variance
Seasonal Variance
6
5
4
8
6
4
3 2
2
1 Jan
Feb
Mar
Apr
May
Jun Jul Time
Aug
Sep
Oct
Nov
0 Jan
Dec
9
8
Apr
May
Jun Jul Time
Aug
Sep
Oct
Nov
Dec
Feb
Mar
Apr
May
Jun Jul Time
Aug
Sep
Oct
Nov
Dec
6 Seasonal Variance
Seasonal Variance
Mar
7
7
6
5
4
5
4
3
3
2
2
1 Jan
Feb
8
Feb
Mar
Apr
May
Jun Jul Time
Aug
Sep
Oct
Nov
Dec
1 Jan
2 2 Figure 5.7: Daily empirical variance, σ ˆt,F σt,LLR for Beijing (upper left), T SG ,ˆ Taipei (upper right), Tokyo (lower left), Osaka (lower right)
STFAsianWeather5
Table 5.8: Statistics of the Asian temperature residuals εˆt and squared resid2 uals εˆ2t , after dividing out the seasonal volatility σ ˆt,LLR from the regression residuals εˆt σt,F T S
City Tokyo
Osaka
Beijing
Taipei
Jarque Bera Kurtosis Skewness Jarque Bera Kurtosis Skewness Jarque Bera Kurtosis Skewness Jarque Bera Kurtosis Skewness
158.00 3.46 -0.15 129.12 3.39 -0.15 234.07 3.28 -0.29 201.09 3.36 -0.39
ε ˆt σt,F T SG
127.23 3.39 -0.11 119.71 3.35 -0.14 223.67 3.27 -0.29 198.40 3.32 -0.39
ε ˆt σt,LLR
114.50 3.40 -0.12 105.02 3.33 -0.14 226.09 3.25 -0.29 184.17 3.3 -0.39
188
5 Pricing of Asian temperature risk
ACF Residuals
0.02 0.01 0 −0.01 −0.02
ACF Squared residuals
−0.03 0
100
200
300
400
500 Lag
600
700
800
900
1000
100
200
300
400
500 Lag
600
700
800
900
1000
100
200
300
400
500 Lag
600
700
800
900
1000
100
200
300
400
500 Lag
600
700
800
900
1000
0.02 0.01 0 −0.01 −0.02 −0.03 0
ACF Residuals
0.02 0.01 0 −0.01 −0.02
ACF Squared residuals
−0.03 0
0.02 0.01 0 −0.01 −0.02 −0.03 0
Figure 5.8: ACF for temperature (squared) residuals panel)and Taipei (3-4 pannel).
εˆt σt,LLR
for Beijing (1-2
STFAsianWeather6
5.4.2 Pricing Asian futures In this section, using Equation (5.30) and (5.31) but for C24AT index futures, we infered the market price of risk for C24AT Asian temperature derivatives as H¨ardle and L´ opez Cabrera (2009) did for Berlin monthly CAT futures. Table 5.9 shows the replication of the observed Tokyo C24AT index futures prices
5.4 Asian temperature derivatives
189
ACF Residuals
0.02 0.01 0 −0.01 −0.02
ACF Squared residuals
−0.03 0
100
200
300
400
500 Lag
600
700
800
900
1000
100
200
300
400
500 Lag
600
700
800
900
1000
100
200
300
400
500 Lag
600
700
800
900
1000
100
200
300
400
500 Lag
600
700
800
900
1000
0.02 0.01 0 −0.01 −0.02 −0.03 0
ACF Residuals
0.02 0.01 0 −0.01 −0.02
ACF Squared residuals
−0.03 0
0.02 0.01 0 −0.01 −0.02 −0.03 0
Figure 5.9: ACF for temperature (squared) residuals panel) and Osaka (3-4 panel).
εˆt σt,LLR
for Tokyo (1-2
STFAsianWeather6
traded in Bloomberg on 20090130, using the constant MPR for each contract per trading day and the time dependent MPR using cubic polynomials with number of knots equal to the number of traded contracts (7). One can notice that the C24AT index futures for Tokyo are underpriced when the MPR is equal to zero. From (5.21) for C24AT index futures, we observe that a high proportion of the price value comes from the seasonal exposure, showing
190
5 Pricing of Asian temperature risk 0
0
0
0
−2
−2
−2
−2
−4
−4
−4
−4
−6
−6
−6
−6
−8
−8
−8
−8
−10
−10
−10
−10
−12
−12
−12
−12
−14
−5
0
5
−14
−5
0
5
−14
−5
0
5
−14
0
0
0
0
−2
−2
−2
−2
−4
−4
−4
−4
−6
−6
−6
−8
−8
−8
−8
−10
−10
−10
−10
−12
−12
−12
−12
−14
−5
0
5
−14
−5
0
5
−14
−5
0
5
−5
0
5
−6
−5
0
5
−14
Figure 5.10: Log of Kernel smoothing density estimate vs Log of Normal Kernel εˆt for σt,LLR (upper) and σt,FεˆTt SG (lower) of Tokyo (left), Osaka (left middle), Beijing (right middle), Taipei (right) STFAsianWeather7
Table 5.9: Tokyo & Osaka C24AT future prices estimates on 20090130 from different MPR parametrization methods. City Tokyo Osaka
Code
FC24AT Bloomberg
FC24AT,θˆ0
FC24AT,θˆi
FC24AT,θˆspl
J9 K9 J9 K9
450.000 592.000 460.000 627.000
452.125 630.895 456.498 663.823
448.124 592.000 459.149 624.762
461.213 640.744 -
t
t
t
high CAT temperature futures prices from June to August and low prices from November to February. The influence of the temperature variation σt can be clearly reflected in the behaviour of the MPR. For both parametrization, MPR is close to zero far from measurement period and it jumps when it is getting closed to it. This phenomena is also related to the Samuelson effect, where the CAT volatility for each contract is getting closed to zero when the time to measurement period is large. C24AT index futures future prices with constant MPR estimate per contract per trading day full replicate the Bloomberg estimates and pricing deviations are smoothed over time when the estimations use smoothed MPRs. Positive (negative) MPR contributes positively (negatively) to future prices, leading to larger (smaller) estimation values than the real prices.
5.4 Asian temperature derivatives
191
The Chicago Mercantile exchange does not carry out trade CDD futures for Asia, however one can use the estimates of the smoothed MPR of CAT (C24AT) futures in (5.21) to price CDD futures. From the HDD-CDD parity (5.5), one can estimate HDD futures and compare them with real data. Since C24AT futures are indeed tradable assets, a simple and sufficient parametrization of the MPR to make the discounted asset prices martingales is setting θt = (μt − rt )/σt . In order to see which of the components (μt − rt or σt ) contributes more to the variation of the MPR, the seasonal effect that the MPR θt presents was related with the seasonal variation σt of the underlying process. In this case, the relationship between θt and σt is well defined given by the deterministic form of σt (σt,F T SG , σt,LLR ) in the temperature process. First, using different trading day samples, the average of the calibrated θti over the period [τ1 , τ2 ] was estimated as: i θˆ[τ = 1 ,τ2 ]
Tτ1 ,τ2
1 − tτ1 ,τ2
Tτ1 ,τ2
θˆti ,
t=tτ1 ,τ2
where t[τ1 ,τ2 ] and T[τ1 ,τ2 ] indicate the first and the last trade for the contracts with measurement month [τ1 , τ2 ]. Similarly, the variation over the measurement period [τ1 , τ2 ] was defined as: 2 σ ˆ[τ = 1 ,τ2 ]
τ2 1 σ ˆ2. τ2 − τ1 t=τ t 1
Then one can conduct a regression model of θˆτi 1 ,τ2 on σ ˆτ21 ,τ2 . Figure 5.11 shows the linear and quadratic regression of the average of the calibrated MPR and the temperature variation σt (σt,F T SG , σt,LLR ) of CATC24AT Futures with Measurement Period (MP) in 1 month for Berlin-Essen and Tokyo weather derivative from July 2008 to June 2009. The values of θˆti for contracts on Berlin and Essen were assumed coming from the same population, while for the asian temperature market, Tokyo was the only considered one for being the largest one. As we expect, the contribution of σt into θt = (μt −rt )/σt gets larger the closer the contracts are to the measurement period. Table 5.10 shows the coefficients of the parametrization of θˆti for the German and Japanese temperature market. A quadratic regression was fitting more suitable than a linear regression (see R2 coefficients). The previous findings generally support theoretical results of reverse relation between MPR θˆτ1 ,τ2 and seasonal variation σt (σt,F T SG , σt,LLR ), indicating that
192
5 Pricing of Asian temperature risk 0.6
0.1
0.4
0
0.2
MPR
MPR
0.2
−0.1
0
−0.2
−0.2
−0.3 3.5 4 4.5 5 5.5 6 average temperature variation in the measurement month
−0.4 5.5 5 4.5 4 3.5 3 2.5 average temperature variation in the measurement month
Figure 5.11: Average of the Calibrated MPR and the Temperature Variation of CAT-C24AT Futures with Measurement Period (MP) in 1 month (Linear, quadratic). Berlin and Essen (left) and Tokyo (right) from July 2008 to June 2009. STFAsianWeather8
Table 5.10: Parametrization of MPR in terms of seasonal variation for contracts with measurement period of 1 month. City BerlinEssen
Parameters a b c R2adj
θˆτ1 ,τ2 = a + b · σ ˆτ21 ,τ2 0.3714 -0.0874 0.4157
θˆτ1 ,τ2 = a + b · σ ˆτ21 ,τ2 + c · σ ˆτ41 ,τ2 2.0640 -0.8215 0.0776 0.4902
Tokyo
a b c
-
4.08 -2.19 0.28 0.71
R2adj
a simple parametrization is possible. Therefore, the MPR for regions without weather derivative markets can be infered by calibration of the data or by knowing the formal dependence of MPR on seasonal variation. We conducted an empirical analysis to weather data in Koahsiung, which is located in the south of China and it is characterized by not having a formal temperature market, see Figure 5.12. In a similar way that other Asian cities, a seasonal
5.4 Asian temperature derivatives
193
Table 5.11: Coefficients of the seasonal function with trend for Koahsiung i cˆi dˆi
1 5.11 -162.64
2 -1.34 19.56
3 -0.39 16.72
4 0.61 28.86
5 0.56 16.63
6 0.34 21.84
function with trend was fitted: ˆt Λ
3
2πi(t − dˆi ) = 24.4 + 16 · 10 t + cˆi · cos 365 i=1 6 2π(i − 4)(t − dˆi ) + I(t ∈ ω) · cˆi · cos 365 −5
(5.37)
i=4
where I(t ∈ ω) is the indicator function for the months of December, January and February. This form of the seasonal function makes possible to capture the peaks of the temperature in Koahsiung, see upper panel of Figure 5.12. The coefficient values of the fitted seasonal function are shown in Table 5.11. The fitted AR(p) process to the residuals of Koahsiung by AIC was of degree p = 3, where β1 = 0.77, β2 = −0.12, β3 = 0.04 and CAR(3) with coefficients α1 = −2.24, α2 = −1.59, α3 = −0.31 The seasonal volatility fitted with Local Linear Regression (LLR) is plotted in the middle panel of Figure 5.12, showing high volatility in late winter - late spring and low volatility in early summer - early winter. The standardized residuals after removing the seasonal volatility are very closed to normality (kurtosis=3.22, skewness=-0.08, JB=128.74), see lower panel of Figure 5.12. For 0 ≤ t ≤ τ1 < τ2 , the C24AT Future Contract for Kaohsiung is equal to: τ2 Qθ FC24AT (t,τ1 ,τ2 ) = E Ts ds|Ft τ1 τ2 τ1 = Λu du + at,τ1 ,τ2 Xt + θˆτ1 ,τ2 σu at,τ1 ,τ2 ep du τ1 t τ2 −1 + θˆτ1 ,τ2 σu e [exp {A(τ2 − u)} − Ip ] ep du (5.38) 1 A τ1
194
5 Pricing of Asian temperature risk
Beijing Tokyo
Osaka
Taipei
Temperature in °C
Kaohsiung
30 25 20 15
20010801
20030801
20050801
20070801
20090801
6
Variation
5 4 3 2 1 0 Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Time
0.5
0
0.4 −5
0.3 0.2
−10
0.1 0 −5
0
5
−15
−5
0
5
Figure 5.12: Map, Seasonal function with trend (upper), Seasonal volatility function (middle) and Kernel smoothing density estimate vs Norεˆt mal Kernel for σt,LLR (lower) for Koahsiung STFAsianWeather9
where θˆτ1 ,τ2 = 4.08 − 2.19 · σ ˆτ21 ,τ2 + 0.028 · σ ˆτ41 ,τ2 , i.e. the formal dependence of MPR on seasonal variation for C24AT-Tokyo futures. In this case σ ˆτ21 ,τ2 =1.10
5.4 Asian temperature derivatives
195
Table 5.12: C24AT Calls in Koahsiung Derivative Type Index r t Measurement Period Strike Tick Value FC24AT (20090901,20091027,20091031) CC24AT (20090901,20090908 ,20091027,20091031) CC24AT (20090901,20090915 ,20091027,20091031) CC24AT (20090901,20090922 ,20091027,20091031) CC24AT (20090901,20090929 ,20091027,20091031)
Parameters C24AT 1% 1. September 2009 27-31. October 2009 125◦ C 0.01◦ C=25 139.60 12.25 10.29 8.69 7.25
and θˆτ1 ,τ2 =2.01, therefore FC24AT (20090901,20091027,20091031) =139.60. The C24AT-Call Option written on a C24AT future with strike K at exercise time τ < τ1 during period [τ1 , τ2 ] is equal to: CC24AT (t,τ,τ1 ,τ2 ) = exp {−r(τ − t)} " × FC24AT (t,τ1 ,τ2 ) − K Φ {d(t, τ, τ1 , τ2 )} τ # + Σ2C24AT (s,τ1 ,τ2 ) dsΦ {d(t, τ, τ1 , τ2 )} , t
Table 5.12 shows the value of the C24AT-Call Option written on a C24AT future with strike price K = 125◦ C, the measurement period during the 2731th October 2009 and trading date on 1st. September 2009. The price of the C24AT-Call for Koahsiung decreases when the measurement period is getting closer. This example give us the insight that by knowing the formal dependence of MPR on seasonal variation, one can infere the MPR for regions where weather derivative market does not exist and with that one can price new exotic derivatives. Without doubt, the empirical findings need to be study in depth such that theoretical results can hold.
Bibliography Alaton, P., Djehiche, B. and Stillberger, D. (2002). On modelling and pricing weather derivatives, Appl. Math. Finance 9(1): 1–20. Barrieu, P. and El Karoui, N. (2002). Optimal design of weather derivatives, ALGO Research 5(1). Benth, F. (2003). On arbitrage-free pricing of weather derivatives based on fractional brownian motion., Appl. Math. Finance 10(4): 303–324. Benth, F. (2004). Option Theory with Stochastic Analysis: An Introduction to Mathematical Finance., Springer Verlag, Berlin. Benth, F. and Meyer-Brandis, T. (2009). The information premium for nonstorable commodities, Journal of Energy Markets 2(3). Benth, F. and Saltyte Benth, J. (2005). Stochastic modelling of temperature variations with a view towards weather derivatives., Appl. Math. Finance 12(1): 53–85. Benth, F., Saltyte Benth, J. and Koekebakker, S. (2007). Putting a price on temperature., Scandinavian Journal of Statistics 34: 746–767. Benth, F., Saltyte Benth, J. and Koekebakker, S. (2008). Stochastic modelling of electricity and related markets, World Scientific Publishing. Brody, D., Syroka, J. and Zervos, M. (2002). Dynamical pricing of weather derivatives, Quantit. Finance (3): 189–198. Campbell, S. and Diebold, F. (2005). Weather forecasting for weather derivatives, American Stat. Assoc. 100(469): 6–16. Cao, M. and Wei, J. (2004). Weather derivatives valuation and market price of weather risk, 24(11): 1065–1089. CME (2005). An introduction to cme weather products, http://www.vfmarkets.com/pdfs/introweatherfinal.pdf, CME Alternative Investment Products .
198
Bibliography
Davis, M. (2001). Pricing weather derivatives by marginal value, Quantit. Finance (1): 305–308. Dornier, F. and Querel, M. (2000). Caution to the wind, Energy Power Risk Management,Weather Risk Special Report pp. 30–32. Hamisultane, H. (2007). Extracting information from the market to price the weather derivatives, ICFAI Journal of Derivatives Markets 4(1): 17–46. H¨ardle, W. K. and L´ opez Cabrera, B. (2009). Infering the market price of weather risk, SFB649 Working Paper, Humboldt-Universit¨ at zu Berlin . Horst, U. and Mueller, M. (2007). On the spanning property of risk bonds priced by equilibrium, Mathematics of Operation Research 32(4): 784– 807. Hull, J. (2006). Option, Future and other Derivatives, Prentice Hall International, New Jersey. Hung-Hsi, H., Yung-Ming, S. and Pei-Syun, L. (2008). Hdd and cdd option pricing with market price of weather risk for taiwan, The Journal of Future Markets 28(8): 790–814. Ichihara, K. and Kunita, H. (1974). A classification of the second order degenerate elliptic operator and its probabilistic characterization, Z. Wahrsch. Verw. Gebiete 30: 235–254. Jewson, S., Brix, A. and Ziehmann, C. (2005). Weather Derivative valuation: The Meteorological, Statistical, Financial and Mathematical Foundations., Cambridge University Press. Karatzas, I. and Shreve, S. (2001). Methods of Mathematical Finance., Springer Verlag, New York. Malliavin, P. and Thalmaier, A. (2006). Stochastic Calculus of Variations in Mathematical finance., Springer Verlag, Berlin, Heidelberg. Mraoua, M. and Bari, D. (2007). Temperature stochastic modelling and weather derivatives pricing: empirical study with moroccan data., Afrika Statistika 2(1): 22–43. Platen, E. and West, J. (2005). A fair pricing approach to weather derivatives, Asian-Pacific Financial Markets 11(1): 23–53. PricewaterhouseCoopers (2005). Results of the 2005 pwc survey, presentation to weather risk managment association by john stell, http://www.wrma.org,
Bibliography
199
PricewatehouseCoopers LLP . Richards, T., Manfredo, M. and Sanders, D. (2004). Pricing weather derivatives, American Journal of Agricultural Economics 86(4): 1005–10017. Turvey, C. (1999). The essentials of rainfall derivatives and insurance, Working Paper WP99/06, Department of Agricultural Economics and Business, University of Guelph, Ontario. .
6 Variance swaps Wolfgang Karl H¨ardle and Elena Silyakova
6.1 Introduction Traditionally volatility is viewed as a measure of variability, or risk, of an underlying asset. However recently investors have begun to look at volatility from a different angle, variance swaps have been created. The first variance swap contracts were traded in late 1998, but it was only after the development of the replication argument using a portfolio of vanilla options that variance swaps became really popular. In a relatively short period of time these over-the-counter (OTC) derivatives developed from simple contracts on future variance to more sophisticated products. Recently we have been able to observe the emergence of 3G volatility derivatives: gamma swaps, corridor variance swaps, conditional variance swaps and options on realised variance. Constant development of volatility instruments and improvement in their liquidity allows for volatility trading almost as easily as traditional stocks and bonds. Initially traded OTC, now the number of securities having volatility as underlying are available on exchanges. Thus the variance swaps idea is reflected in volatility indices, also called ”fear” indices. These indices are often used as a benchmark of equity market risk and contain option market expectations on future volatility. Among those are VIX – the Chicago Board Options Exchange (CBOE) index on the volatility of S&P 500, VSTOXX on Dow Jones EURO STOXX 50 volatility, VDAX – on the volatility of DAX. These volatility indices represent the theoretical prices of one-month variance swaps on the corresponding index. They are calculated daily and on an intraday basis by the exchange from the listed option prices. Also, recently exchanges started offering derivative products, based on these volatility indices – options and futures.
P. Čížek et al. (eds.), Statistical Tools for Finance and Insurance, DOI 10.1007/978-3-642-18062-0_6, © Springer-Verlag Berlin Heidelberg 2011
201
202
6 Variance swaps
6.2 Volatility trading with variance swaps Variance swap is a forward contract that at maturity pays the difference be2 2 tween realised variance σR (floating leg) and predefined strike Kvar (fixed leg) multiplied by notional Nvar . 2 2 (σR − Kvar ) · Nvar
(6.1)
2 When the contract expires the realised variance σR can be measured in different ways, since there is no formally defined market convention. Usually variance swap contracts define a formula of a final realised volatility σR . It is a square root of annualized variance of daily log-returns of an underlying over a swap’s maturity calculated in percentage terms:
3 4 2 T 4 252 St 5 σR = log · 100 T t=1 St−1
(6.2)
There are two ways to express the variance swap notional: variance notional and vega notional. Variance notional Nvar shows the dollar amount of profit 2 (loss) from difference in one point between the realised variance σR and the 2 strike Kvar . But since market participants usually think in terms of volatility, vega notional Nvega turns out to be a more intuitive measure. It shows the profit or loss from 1% change in volatility. The two measures are interdependent and can substitute each other: Nvega = Nvar · 2Kvar
(6.3)
2 EXAMPLE 6.1 Variance notional Nvar = 2500. If Kvar is 20% (Kvar = 400) and the subsequent variance realised over the course of the year is (15%)2 2 (quoted as σR = 225), the investor will make a loss: 2 2 Loss = Nvar · (σR − Kvar )
437500 = 2500 · (400 − 225)
6.3 Replication and hedging of variance swaps
203
Marking-to-market of a variance swap is straightforward. If an investor wishes to close a variance swap position at some point t before maturity, he needs to define a value of the swap between inception 0 and maturity T . Here the 2 additivity property of variance is used. The variance at maturity σR,(0,T ) is just 2 a time-weighted sum of variance realised before the valuation point σR,(0,t) and 2 variance still to be realised up to maturity σR,(t,T ) . Since the later is unknown 2 yet, we use its estimate Kvar,(t,T ) . The value of the variance swap (per unit of variance notional) at time t is therefore: 2 2 2 T −1 tσR,(0,t) − (T − t)Kvar,(t,T ) − Kvar,(0,T )
(6.4)
6.3 Replication and hedging of variance swaps 2 The strike Kvar of a variance swap is determined at inception. The realised 2 variance σR , on the contrary, is calculated at expiry (6.2). Similar to any forward contract, the future payoff of a variance swap (6.1) has zero initial 2 2 value, or Kvar = E[σR ]. Thus the variance swap pricing problem consists in 2 finding the fair value of Kvar which is the expected future realised variance.
To achieve this, one needs to construct a trading strategy that captures the realised variance over the swap’s maturity. The cost of implementing this strategy will be the fair value of the future realised variance. One of the ways of taking a position in future volatility is trading a delta-hedged option. The P&L from delta-hedging (also called hedging error) generated from buying and holding a vanilla option up to maturity and continuously delta-hedging it, captures the realised volatility over the holding period. Some assumptions are needed: • the existence of futures market with delivery dates T ≥ T • the existence of European futures options market, for these options all strikes are available (market is complete) • continuous trading is possible • zero risk-free interest rate (r = 0) • the price of the underlying futures contract Ft following a diffusion process with no jumps:
204
6 Variance swaps
dFt = μt dt + σt dWt Ft
(6.5)
We assume that the investor does not know the volatility process σt , but believes that the future volatility equals σimp , the implied volatility prevailing at that time on the market. He purchases a claim (for example a call option) with σimp . The terminal value (or payoff) of the claim is a function of FT . For a call option the payoff is denoted: f (FT ) = (FT − K)+ . The investor can define the value of a claim V (Ft , t) at any time t, given that σimp is predicted correctly. To delta-hedge the long position in V over [0, T ] the investor holds a dynamic short position equal to the option’s delta: Δ = ∂V /∂Ft . If his volatility expectations are correct, then at time t for a delta-neutral portfolio the following relationship holds: 1 2 Θ = − σimp Ft2 Γ 2
(6.6)
V (FT , T ) = f (FT )
(6.7)
subject to terminal condition:
Θ = ∂V /∂t is called the option’s theta or time decay and Γ = ∂ 2 V /∂Ft2 is the option’s gamma. Equation (6.6) shows how the option’s value decays in time (Θ) depending on convexity (Γ). Delta-hedging of V generates the terminal wealth:
T
P &LΔ = −V (F0 , 0, σimp ) −
ΔdFt + V (FT , T )
(6.8)
0
which consists of the purchase price of the option V (F0 , 0, σimp ), P&L from delta-hedging at constant implied volatility σimp and final pay-off of the option V (FT , T ). Applying Itˆo’s lemma to some function f (Ft ) of the underlying process specified in (6.5) gives: f (FT ) = f (F0 )+ 0
T
∂f (Ft ) 1 dFt + ∂Ft 2
T
Ft2 σt2 0
∂ 2 f (Ft ) dt+ ∂Ft2
0
T
∂f (Ft ) dt (6.9) ∂t
6.3 Replication and hedging of variance swaps
205
For f (Ft ) = V (Ft , t, σt ) we therefore obtain:
T
V (FT , T ) = V (F0 , 0, σimp ) + 0
1 ΔdFt + 2
T
Ft2 Γσt2 dt 0
+
T
Θdt
(6.10)
0
Using relation (6.6) for (6.10) gives:
T
V (FT , T ) − V (F0 , 0, σimp ) =
ΔdFt + 0
1 2
T 2 Ft2 Γ(σt2 − σimp )dt
(6.11)
0
Finally substituting (6.11) into (6.8) gives P &LΔ of the delta-hedged option position:
P &LΔ =
1 2
T 2 Ft2 Γ(σt2 − σimp )dt
(6.12)
0
Thus buying the option and delta-hedging it generates P&L (or hedging error) equal to differences between instantaneous realised and implied variance, accrued over time [0, T ] and weighed by Ft2 Γ/2 (dollar gamma). However, even though we obtained the volatility exposure, it is path-dependent. To avoid this one needs to construct a portfolio of options with path-independent P&L or in other words with dollar gamma insensitive to Ft changes. Figure 6.1 represents the dollar gammas of three option portfolios with an equal number of vanilla options (puts or calls) and similar strikes lying in a range from 20 to 200. Dollar gammas of individual options are shown with thin lines, the portfolio’s dollar gamma is a bold line. First, one can observe, that for every individual option dollar gamma reaches its maximum when the option is ATM and declines with price going deeper out of the money. One can make a similar observation by looking at the portfolio’s dollar gamma when the constituents are weighted equally (first picture). However, when we use the alternative weighting scheme (1/K), the portfolio’s dollar gamma becomes flatter (second picture). Finally by weighting options with 1/K 2 the portfolio’s dollar gamma becomes parallel to the vertical axis (at least in 20 – 140 region), which suggests that the dollar gamma is no longer dependent on the Ft movements.
6 Variance swaps
Dollar gamma
Dollar gamma
Dollar gamma
206
200 100 0
20
40
60
80 100 120 140 Underlying price
160
180
200
20
40
60
80 100 120 140 Underlying price
160
180
200
20
40
60
80 100 120 140 Underlying price
160
180
200
200 100 0
200 100 0
Figure 6.1: Dollar gamma of option portfolio as a function of stock price. Weights are defined: equally, proportional to 1/K and proportional to 1/K 2. STFdollargamma2D
We have already considered a position in a single option as a bet on volatility. The same can be done with the portfolio of options. However the obtained exposure is path-dependent. We need, however the static, path-independent trading position in future volatility. Figures 6.1, 6.2 illustrate that by weighting the options’ portfolio proportional to 1/K 2 this position can be achieved. Keeping in mind this intuition we proceed to formal derivations. Let us consider a payoff function f (Ft ): 2 f (Ft ) = T
F0 Ft log + −1 Ft F0
(6.13)
This function is twice differentiable with derivatives: f (Ft ) =
2 T
1 1 − F0 Ft
(6.14)
6.3 Replication and hedging of variance swaps
207
Figure 6.2: Dollar gamma of option portfolio as a function of stock price and maturity. Weights are defined proportional to 1/K 2 . STFdollargamma3D
f (Ft ) =
2 T Ft2
(6.15)
and f (F0 ) = 0
(6.16)
One can give a motivation for the choice of the particular payoff function (6.13). The first term, 2 log F0 /T Ft , is responsible for the second derivative of the payoff f (Ft ) w.r.t. Ft , or gamma (6.15). It will cancel out the weighting term in (6.12) and therefore will eliminate path-dependence. The second term 2/T (Ft/F0 − 1) guarantees the payoff f (Ft ) and will be non-negative for any positive Ft . Applying Itˆ o’s lemma to (6.13) (substituting (6.13) into (6.9)) gives the expression for the realised variance: 1 T
T
σt2 dt 0
2 = T
F0 FT 2 T 1 1 log + −1 − − dFt FT F0 T 0 F0 Ft
(6.17)
208
6 Variance swaps
Equation (6.17) shows that the value of a realised variance for t ∈ [0, T ] is equal to • a continuously rebalanced futures position that costs nothing to initiate and is easy to replicate: 2 T
T
0
1 1 − F0 Ft
dFt
(6.18)
• a log contract, static position of a contract that pays f (FT ) at expiry and has to be replicated: 2 T
F0 FT log + −1 FT F0
(6.19)
Carr and Madan (2002) argue that the market structure assumed above allows for the representation of any twice differentiable payoff function f (FT ) in the following way: f (FT ) = f (k) + f (k)
k
" # (FT − k)+ − (k − FT )+ +
f (K)(K − FT ) dK +
+
+
0
∞
(6.20)
f (K)(FT − K)+ dK
k
Applying (6.20) to payoff (6.19) with k = F0 gives: log
F0 FT
+
FT −1 = F0
F0 0
1 (K − FT )+ dK + K2
∞
F0
1 (FT − K)+ dK (6.21) K2
Equation (6.21) represents the payoff of a log contract at maturity f (FT ) as a sum of • the portfolio of OTM puts (strikes are lower than forward underlying price F0 ), inversely weighted by squared strikes:
F0 0
1 (K − FT )+ dK K2
(6.22)
6.4 Constructing a replication portfolio in practice
209
• the portfolio of OTM calls (strikes are higher than forward underlying price F0 ), inversely weighted by squared strikes: ∞ 1 (FT − K)+ dK (6.23) 2 K F0 Now coming back to equation (6.17) we see that in order to obtain a constant exposure to future realised variance over the period 0 to T the trader should, at inception, buy and hold the portfolio of puts (6.22) and calls (6.23). In addition he has to initiate and roll the futures position (6.18). We are interested in the costs of implementing the strategy. Since the initiation of futures contract (6.18) costs nothing, the cost of achieving the strategy will be defined solely by the portfolio of options. In order to obtain an expectation 2 of a variance, or strike Kvar of a variance swap at inception, we take a riskneutral expectation of a future strategy payoff: 2 Kvar =
2 rT e T
0
F0
1 2 P0 (K)dK + erT 2 K T
∞
F0
1 C0 (K)dK K2
(6.24)
6.4 Constructing a replication portfolio in practice Although we have obtained the theoretical expression for the future realised variance, it is still not clear how to make a replication in practice. Firstly, in reality the price process is discrete. Secondly, the range of traded strikes is limited. Because of this the value of the replicating portfolio usually underestimates the true value of a log contract. One of the solutions is to make a discrete approximation of the payoff (6.19). This approach was introduced by Derman et al. (1998). Taking the logarithmic payoff function, whose initial value should be equal to the weighted portfolio of puts and calls (6.21), we make a piecewise linear approximation. This approach helps to define how many options of each strike investor should purchase for the replication portfolio. Figure 6.3 shows the logarithmic payoff (dashed line) and the payoff of the replicating portfolio (solid line). Each linear segment on the graph represents the payoff of an option with strikes available for calculation. The slope of this
210
6 Variance swaps
1.5
F(ST)=log(ST/S*)+S*/ST−1 − log−payoff (solid line)
F(ST)
1
Linear approximation (dashed line)
S* − threshold between calls and puts
0.5
0 0
50
100 S
150
Figure 6.3: Discrete approximation of a log payoff.
200
STFlogpayoff
linear segment will define the amount of options of this strike to be put in the portfolio. For example, for the call option with strike K0 the slope of the segment would be: w(K0 ) =
f (K1,c ) − f (K0 ) K1,c − K0
(6.25)
where K1,c is the second closest call strike. The slope of the next linear segment, between K1,c and K2,c , defines the amount of options with strike K1,c . It is given by w(K1,c ) =
f (K2,c ) − f (K1,c ) − w(K0 ) K2,c − K1,c
(6.26)
Finally for the portfolio of n calls the number of calls with strike Kn,c : f (Kn+1,c ) − f (Kn,c ) − w(Ki,c ) Kn+1,c − Kn,c i=0 n−1
w(Kn,c ) =
(6.27)
The left part of the log payoff is replicated by the combination of puts. For the
6.5 3G volatility products
211
portfolio of m puts the weight of a put with strike Km,p is defined by w(Km,p ) =
m−1 f (Km+1,p ) − f (Km,p ) − w(Kj,p ) Km,p − Km+1,p
(6.28)
j=0
Thus constructing the portfolio of European options with the weights defined by (6.27) and (6.28) we replicate the log payoff and obtain value of the future realised variance. Assuming that the portfolio of options with narrowly spaced strikes can produce a good piecewise linear approximation of a log payoff, there is still the problem of capturing the ”tails” of the payoff. Figure 6.3 illustrates the effect of a limited strike range on replication results. Implied volatility is assumed to be constant for all strikes (σimp = 25%). Strikes are evenly distributed one point apart. The strike range changes from 20 to 1000. With increasing numbers of options the replicating results approach the ”true value” which equals to σimp in this example. For higher maturities one needs a broader strike range than for lower maturities to obtain the value close to actual implied volatility. Table 6.1 shows the example of the variance swap replication. The spot price of S ∗ = 300, riskless interest rate r = 0, maturity of the swap is one year T = 1, strike range is from 200 to 400. The implied volatility is 20% ATM and changes linearly with the strike (for simplicity no smile is assumed).The weight of each option is defined by (6.27) and (6.28).
6.5 3G volatility products If we need to capture some particular properties of realised variance, standard variance swaps may not be sufficient. For instance by taking asymmetric bets on variance. Therefore, there are other types of swaps introduced on the market, which constitute the third-generation of volatility products. Among them are: gamma swaps, corridor variance swaps and conditional variance swaps.
212
6 Variance swaps
Table 6.1: Replication of a variance swaps strike by portfolio of puts and calls. Strike 200 210 220 230 240 250 260 270 280 290 300 310 320 330 340 350 360 370 380 390 400
IV 0.13 0.14 0.15 0.15 0.16 0.17 0.17 0.18 0.19 0.19 0.20 0.21 0.21 0.22 0.23 0.23 0.24 0.25 0.25 0.26 0.27
BS Price 0.01 0.06 0.23 0.68 1.59 3.16 5.55 8.83 13.02 18.06 23.90 23.52 20.10 17.26 14.91 12.96 11.34 9.99 8.87 7.93 7.14
Type of option Put Put Put Put Put Put Put Put Put Put Call Call Call Call Call Call Call Call Call Call Call
Weight 0.0003 0.0002 0.0002 0.0002 0.0002 0.0002 0.0001 0.0001 0.0001 0.0001 0.0000 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 0.0001 Kvar
Share value 0.0000 0.0000 0.0000 0.0001 0.0003 0.0005 0.0008 0.0012 0.0017 0.0021 0.0001 0.0014 0.0021 0.0017 0.0014 0.0011 0.0009 0.0008 0.0006 0.0005 0.0005 0.1894 STFvswapstrike
By modifying the floating leg of a standard variance swap (6.2) with a weight process wt we obtain a generalized variance swap.
2 σR
2 T 252 Ft = wt log T t=1 Ft−1
(6.29)
Now, depending on the chosen wt we obtain different types of variance swaps: wt = 1 defines a standard variance swap.
6.5 3G volatility products
213
Strike of a swap
0.25 0.2 0.15 0.1 0.05 0 1 0.8 0.6 Maturity
0.4 0.2
0
200
800 600 400 Number of options
1000
Figure 6.4: Dependence of replicated realised variance level on the strike range and maturity of the swap. STFaccurrepl
6.5.1 Corridor and conditional variance swaps The weight wt = w(Ft ) = IFt ∈C defines a corridor variance swap with corridor C. I is the indicator function, which is equal to one if the price of the underlying asset Ft is in corridor C and zero otherwise. If Ft moves sideways, but stays inside C, then the corridor swap’s strike is large, because some part of volatility is accrued each day up to maturity. However if the underlying moves outside C, less volatility is accrued resulting the strike to be low. Thus corridor variance swaps on highly volatile assets with narrow 2 2 corridors have strikes KC lower than usual variance swap strike Kvar . Corridor variance swaps admit model-free replication in which the trader holds statically the portfolio of puts and calls with strikes within the corridor C. In this case we consider the payoff function with the underlying Ft in corridor C = [A, B] f (Ft ) =
2 T
F0 Ft log + − 1 IFt ∈[A,B] Ft F0
(6.30)
214
6 Variance swaps
The strike of a corridor variance swap is thus replicated by 2 K[A,B] =
2 rT e T
F0
A
1 2 P0 (K)dK + erT 2 K T
B F0
1 C0 (K)dK K2
(6.31)
C = [0, B] gives a downward variance swap, C = [A, ∞] - an upward variance swap. Since in practice not all the strikes K ∈ (0, ∞) are available on the market, corridor variance swaps can arise from the imperfect variance replication, when just strikes K ∈ [A, B] are taken to the portfolio. Similarly to the corridor, realised variance of conditional variance swap is accrued only if the price of the underlying asset in the corridor C. However the accrued variance is averaged over the number of days, at which Ft was in the corridor (T ) rather than total number of days to expiry T . Thus ceteris 2 paribus the strike of a conditional variance swap KC,cond is smaller or equal to 2 the strike of a corridor variance swap KC .
6.5.2 Gamma swaps As it is shown in Table 6.2, a standard variance swap has constant dollar gamma and vega. It means that the value of a standard swap is insensitive to Ft changes. However it might be necessary, for instance, to reduce the volatility exposure when the underlying price drops. Or in other words, it might be convenient to have a derivative with variance vega and dollar gamma, that adjust with the price of the underlying. The weight wt = w(Ft ) = Ft /F0 defines a price-weighted variance swap or gamma swap. At maturity the buyer receives the realised variance weighted to each t, proportional to the underlying price Ft . Thus the investor obtains path-dependent exposure to the variance of Ft . One of the common gamma swap applications is equity dispersion trading, where the volatility of a basket is traded against the volatility of basket constituents. The realised variance paid at expiry of a gamma swap is defined by
σgamma
3 4 2 T 4 252 Ft St 5 = log · 100 T t=1 F0 St−1
(6.32)
6.5 3G volatility products
215
One can replicate a gamma swap similarly to a standard variance swap, by using the following payoff function: 2 f (Ft ) = T
Ft Ft Ft log − +1 F0 F0 F0
f (Ft ) =
(6.33)
2 Ft log T F0 F0
(6.34)
2 T F0 Ft
(6.35)
f (Ft ) =
f (F0 ) = 0
(6.36)
Applying Itˆ o’s formula (6.9) to (6.33) gives
1 T
0
T
Ft 2 2 σt dt = F0 T
T FT FT FT 2 Ft log − +1 − log dFt F0 F0 F0 T F0 0 F0
(6.37)
Equation (6.37) shows that accrued realised variance weighted each t by the value of the underlying is decomposed into payoff (6.33), evaluated at T , and T 2 Ft a continuously rebalanced futures position log dFt with zero value T F0 0 F0 at t = 0. Then applying the Carr and Madan argument (6.20) to the payoff (6.33) at T we obtain the t = 0 strike of a gamma swap:
2 Kgamma =
2 2rT e T F0
F0 0
1 2 2rT P0 (K)dK + e K T F0
∞
F0
1 C0 (K)dK K
(6.38)
Thus gamma swap can be replicated by the portfolio of puts and calls weighted by the inverse of strike 1/K and rolling the futures position.
216
6 Variance swaps
Table 6.2: Variance swap greeks. Greeks
Call
Put
Standard variance swap
Gamma swap
Delta
∂V ∂Ft
Φ(d1 )
Φ(d1 ) − 1
2 1 1 ( − ) T F0 Ft
2 Ft log T F0 F0
Gamma
∂ 2V ∂Ft2
φ(d1 ) √ Ft σ τ
φ(d1 ) √ Ft σ τ
2 Ft2 T
2 T F0 Ft
Ft2 ∂ 2 V 2∂Ft2
Ft φ(d1 ) √ 2σ τ
Ft φ(d1 ) √ 2σ τ
1 T
Ft T F0
∂V ∂σt
√ φ(d1 )Ft τ
∂V ∂σt2
Ft φ(d1 ) √ 2σ τ
Dollar gamma Vega
Variance vega
√ 2στ φ(d1 )Ft τ T
2στ Ft T F0
Ft φ(d1 ) √ 2σ τ
τ Ft T F0
τ T
6.6 Equity correlation (dispersion) trading with variance swaps 6.6.1 Idea of dispersion trading The risk of the portfolio (or basket of assets) can be measured by the variance (or alternatively standard deviation) of its return. Portfolio variance can be calculated using the following formula: 2 σBasket
=
n i=1
wi2 σi2
+2
n n
wi wj σi σj ρij
(6.39)
i=1 j=i+1
where σi - standard deviation of the return of an i-th constituent (also called volatility), wi - weight of an i-th constituent in the basket, ρij - correlation
6.6 Equity correlation (dispersion) trading with variance swaps
217
coefficient between the i-th and the j-th constituent. Let’s take an arbitrary market index. We know the index value historical development as well as price development of each of index constituent. Using this information we can calculate the historical index and constituents’ volatility using, for instance, formula ( 6.2). The constituent weights (market values or current stock prices, depending on the index) are also known to us. The only parameter to be defined are correlation coefficients of every pair of constituents ρij . For simplicity assume ρij = const for any pair of i, j and call this parameter ρ - average index correlation, or dispersion. Then having index volatility σindex and volatility of each constituent σi , we can express the average index correlation:
ρ=
2
σ2 − index n n i=1
n
wi2 σi2 j=i+1 wi wj σi σj i=1
(6.40)
Hence it appears the idea of dispersion trading, consisting of buying the volatility of index constituents according to their weight in the index and selling the volatility of the index. Corresponding positions in variances can be taken by buying (selling) variance swaps. By going short index variance and long variance of index constituents we go short dispersion, or enter the direct dispersion strategy. Why can this strategy be attractive for investors? This is due to the fact that index options appear to be more expensive than their theoretical Black-Scholes prices, in other words investors will pay too much for realised variance on the variance swap contract expiry. However, in the case of single equity options one observes no volatility distortion. This is reflected in the shape of implied volatility smile. There is growing empirical evidence that the index option skew tends to be steeper then the skew of the individual stock option. For instance, this fact has been studied in Bakshi et al. (2003) on example of the S&P500 and Branger and Schlag (2004) for the German stock index DAX. This empirical observation is used in dispersion trading. The most widespread dispersion strategy, direct strategy, is a long position in constituents’ variances and short in variance of the index. This strategy should have, on average, positive payoffs. Hoverer under some market conditions it is profitable to enter the trade in the opposite direction. This will be called - the inverse dispersion strategy.
218
6 Variance swaps
The payoff of the direct dispersion strategy is a sum of variance swap payoffs of each of i-th constituent 2 2 (σR,i − Kvar,i ) · Ni
(6.41)
and of the short position in index swap 2 2 (Kvar,index − σR,index ) · Nindex
(6.42)
Ni = Nindex · wi
(6.43)
where
The payoff of the overall strategy is: Nindex ·
n
! 2 wi σR,i
−
2 σR,Index
− ResidualStrike
(6.44)
i=1
The residual strike
ResidualStrike = Nindex ·
n
! 2 wi Kvar,i
−
2 Kvar,Index
(6.45)
i=1
is defined by using methodology introduced before, by means of replication portfolios of vanilla OTM options on index and all index constituents. However when implementing this kind of strategy in practice investors can face a number of problems. Firstly, for indices with a large number of constituent stocks (such as S&P 500) it would be problematic to initiate a large number of variance swap contracts. This is due to the fact that the market for some variance swaps did not reach the required liquidity. Secondly, there is still the problem of hedging vega-exposure created by these swaps. It means a bank should not only virtually value (use for replication purposes), but also physically acquire and hold the positions in portfolio of replicating options. These options in turn require dynamic delta-hedging. Therefore, a large variance swap trade (as for example in case of S&P 500) requires additional human capital from the bank and can be associated with large transaction costs. The
6.7 Implementation of the dispersion strategy on DAX index
219
remedy would be to make a stock selection and to form the offsetting variance portfolio only from a part of the index constituents. It has already been mentioned that, sometimes the payoff of the strategy could be negative, in order words sometimes it is more profitable to buy index volatility and sell volatility of constituents. So the procedure which could help in decisions about trade direction may also improve overall profitability. If we summarize, the success of the volatility dispersion strategy lies in correct determining: • the direction of the strategy • the constituents for the offsetting variance basket The next sections will present the results of implementing the dispersion trading strategy on DAX and DAX constituents’ variances. First we implement its classical variant meaning short position in index variance against long positions in variances of all 30 constituents. Then the changes to the basic strategy discussed above are implemented and the profitability of these improvements measured.
6.7 Implementation of the dispersion strategy on DAX index In this section we investigate the performance of a dispersion trading strategy over the 5 years period from January 2004 to December 2008. The dispersion trade was initiated at the beginning of every moth over the examined period. Each time the 1-month variance swaps on DAX and constituents were traded. First we implement the basic dispersion strategy, which shows on average positive payoffs over the examined period (Figure 6.5).Descriptive statistics shows that the average payoff of the strategy is positive, but close to zero. Therefore in the next section several improvements are introdused. It was discussed already that index options are usually overestimated (which is not the case for single equity options), the future volatility implied by index options will be higher than realized volatility meaning that the direct dispersion strategy is on average profitable. However the reverse scenario may also take place. Therefore it is necessary to define whether to enter a direct dispersion (short index variance, long constituents variance) or reverse dispersion (long
220
6 Variance swaps
1
0.5
0
−0.5 2004
2005
2006
2007
2008
2009
Figure 6.5: Average implied correlation (dotted), average realized correlation (gray), payoff of the direct dispersion strategy (solid black).
Table 6.3: Comparison of basic and improved dispersion strategy payoffs for the period from January 2004 to December 2008. Strategy Basic Improved
Mean 0.032 0.077
Median 0.067 0.096
Std.Dev. 0.242 0.232
Skew 0.157 -0.188
Kurt. 2.694 3.012
J-B 0.480 0.354
Prob. 0.786 0.838
index variance and short constituents’ variances) strategy. This can be done by making a forecast of the future volatility with GARCH (1,1) model and multiplying the result by 1.1, which was implemented in the paper of Deng (2008) for S&P500 dispersion strategy. If the variance predicted by GARCH is higher than the variance implied by the option market, one should enter the reverse dispersion trade (long index variance and short constituents variances).After using the GARCH volatility estimate the average payoff increased by 41.7% (Table 6.3).
The second improvement serves to decrease transaction cost and cope with market illiquidity. In order to decrease the number of stocks in the offsetting portfolio the Principal Components Analysis (PCA) can be implemented. Using PCA we select the most ”effective” constituent stocks, which help to
6.7 Implementation of the dispersion strategy on DAX index
221
capture the most of index variance variation. This procedure allowed us to decrease the number of offsetting index constituents from 30 to 10. According to our results, the 1-st PC explains on average 50% of DAX variability. Thereafter each next PC adds only 2-3% to the explained index variability, so it is difficult to distinguish the first several that explain together 90%. If we take stocks, highly correlated only with the 1-st PC, we can significantly increase the offsetting portfolio’s variance, because by excluding 20 stocks from the portfolio we make it less diversified, and therefore more risky. However it was shown that one still can obtain reasonable results after using the PCA procedure. Thus in the paper of Deng (2008) it was successfully applied to S&P500.
Bibliography Carr P., and Madan D. (1998). Towards a theory of volatility trading, Volatility: 417-427. Demeterfi, K., Derman, E., Kamal, M. and Zou, J. (1999). More Than You Ever Wanted To Know About Volatility Swaps, Goldman Sachs Quantitative Strategies Research Notes. Franke, J., H¨ ardle, W. and Hafner, C. M. (2011). Statistics of Financial Markets: An Introduction (Third ed.), Springer Berlin Heidelberg. Hull, J.(2008). Options, Futures, and Other Derivatives (7th revised ed.), Prentice Hall International. Neuberger, A.(1990). Volatility Trading, London Business School Working paper.
7 Learning machines supporting bankruptcy prediction Wolfgang Karl H¨ ardle, Linda Hoffmann, and Rouslan Moro
This work presents one of the more recent and efficient learning systems – support vector machines (SVMs). SVMs are mainly used to classify various specialized categories such as object recognition (Sch¨ olkopf (1997)), optical character recognition (Vapnik (1995)), electric load prediction (Eunite (2001)), management fraud detection (R¨ atsch and M¨ uller (2004)), and early medical diagnostics. It is also used to predict the solvency or insolvency of companies or banks, which is the focus of this work. In other words, SVMs are capable of extracting useful information from financial data and then label companies by giving them score values. Furthermore, probability of default (PD) values for companies can be calculated from those score values. The method is explained later. In the past, discriminant analysis (DA) and logit models were used for classification, especially in the financial world. The logit model is in a way a generalized DA model because it does not assume multivariate normality and equal covariance matrices. Over time, researchers, bankers and others found out that those two models were inefficient because they cannot classify satisfactorily if the data is non-linear separable. Consequently, the rate of prediction of new companies was low and demand for more accurate default estimations was developed, thus work with artificial neural nets (ANN), decision trees and SVMs started. The literature (S. Chen (accepted in 2009)) has shown that SVMs produce better classification results than parametric methods. Additionally, SVMs have single solutions characterized by the global minimum of the optimized function and they do not rely heavily on heuristics. SVM are attractive estimators and aree therefore worth studying.
P. Čížek et al. (eds.), Statistical Tools for Finance and Insurance, DOI 10.1007/978-3-642-18062-0_7, © Springer-Verlag Berlin Heidelberg 2011
225
226
7 Learning machines to help predict bankruptcy
7.1 Bankruptcy analysis Lending money is an act based on trust in the debtors ability to repay a loan. From where does the lender get this trust? Banks and other institutions rely heavily on statistical tools that try to predict the financial situation of borrowers. Since those tools are only estimates of reality, lenders risk losing what they have invested. Therefore, improving predictions of bankruptcies allow for better lending decisions. The task for statisticians is to bolster existing methods and to develop new ones. The problem of default and credit risks is not new and the idea of using financial ratios to analyze companies is more than a century old. Ramser and Foster (1931), Fitzpatrick (1932) and Winakor and Smith (1935) were one of the first scientists to apply financial ratios for bankruptcy predictions. The systematic application of statistics to bankruptcy analysis began with the work of Beaver (1966) and Altman (1968). They introduced the univariate and multivariate discriminant analysis (DA). In 1968, Altman presented a formula for predicting bankruptcy known as the linear Z-score model. This formula was widely popular for calculating defaults and even today it is used due to its simplicity. A drawback of the Z-score model is the assumption of equal normal distributions for both failing and successful companies with the same covariance matrix. In reality, the distributions may not be normal, therefore, financial institutions required a more sophisticated method. The centre of research shifted towards the logit and probit models. In 1977, Martin introduced ’Early warning of bank failure’ and a few years later Ohlson (1980) published ’Financial ratios and the probabilistic prediction of bankruptcy.’ Wiginton (1980), Zavgren (1983) and Zmijewski (1984) continued working on logit and probit models. During that time, other statisticians proposed different methods such as the gambler’s ruin model (Wilcox, 1971), option pricing theory (Merton, 1974), recursive partitioning (Frydman, Altman and Kao, 1985), neural networks (Tam and Kiang, 1992) and rough sets (Dimitras, Slowinski, Susmaga and Zopounidis, 1999). Glennon and Nigro (2005) suggested a hazard or survival analysis. From a geometrical point of view, SVMs classify solvent and insolvent companies into two groups by putting a margin of separation between them. For the best classification, the margin needs to be maximized and the error of misclassification minimized. The easiest way to classify occurs when the data is linearly separable. However, this is not always the case. Sometimes two groups cannot be separated linearly in the dimension they exist but can be in a higher
7.1 Bankruptcy analysis
227
dimensional space. The kernel technique (Hastie, Tibshirani, and Friedman (2001)) allows us to map the data into a higher dimensional feature space. For that reason a SVM is a more powerful tool than classical a DA, logit or probit. (The two later are only linear classifiers.) The purpose of an SVM is to classify new data x after training a classification function f . This f needs to be a good approximation of y when x is observed minimizing the expected risk R (f ) = |f (x) − y| dP (x, y). (7.1) y are the labels of x and |f (x) − y| is known as the loss function. f is an element of the set of measurable functions F . To avoid overfitting F is restricted to a smaller amount of functions. This approach is called empirical risk minimization principle (ERM). In practice the distribution P (x, y) is unknown. Therefore, R (f ) cannot be calculated but may be approximated. Because of the unknown distribution P (x, y), the empirical risk needs to be introduced: ˆ (f ) = 1 R |f (xi ) − yi | . n i=1 n
(7.2)
Our loss function is the average value of misclassifications over the training set. def 2 Many loss functions exist such as the least square LLS(y,t) = (1 − yt) , hinge def def Lhinge (y, t) = max {0, 1 − yt} and logistic Llogic(y,t) = log {1 + exp(−yt)}2 . These are used according to the problem posed or user preference. The minimization of the expected and empirical risk: fopt
= arg min R (f ) ,
(7.3)
fˆn
ˆ (f ) , = arg min R
(7.4)
f ∈F f ∈F
do not necessary coincide (Figure 7.1). Depending on the size of F, fopt and fˆn will become arbitrarily close as n increases. From this we conclude that minimizing the expected risk directly is not possible due to the unknown distribution P (x, y). According to statistical learning theory (Vapnik, 1995), it is possible to estimate the Vapnik-Chervonenkis (VC)
228
7 Learning machines to help predict bankruptcy
Risk
R
ˆ R ˆ (f) R
R (f)
f
f opt
fˆn
Function class
Figure 7.1: The minimum values fopt and fˆn of the expected (R) and empirical ˆ risk functions generally do not coincide. (R)
bound by putting an upper bound on R(f ). With probability 1 − η: 2 h log 2n h + 1 − log(η/4) ˆ R(f ) ≤ R(f ) + , n where h ≤ min
r2 w + 1, n + 1 (Vapnik 95).
r is the radius of the smallest sphere containing data and the margin. h is the VC dimension.
2
w
(7.5)
(7.6) is the width of
Next we review the VC dimension and its relation to the SVM. If for some d h f ∈ F , the objects
xi ∈d R , i = 1, ..., h can be shattered in all 2 possible ways and no set xj ∈ R , j = 1, ..., q exists with q > n, then h is called the VC dimension of F in a d-dimensional space. For instance, x1 , x2 and x3 ∈ R2 can be shattered linearly in 8 = 23 ways. If we add a fourth point x4 , we cannot shatter them into 24 = 16 different ways. Hence, the VC-dimension h is 3 in R2 as shown in Figure 7.2. The expression for the VC bound (7.5) is a regularized functional where the VC dimension h is a parameter controlling the complexity of the classifier function. We could find a function that makes no training error but its performance on new data would be low. Therefore, it is important to control the complexity.
7.1 Bankruptcy analysis
229
Figure 7.2: Eight possible ways of shattering 3 points on the plane with a linear indicator function.
x2
xT w+b=0
margin o
T
x w+b=-1 x x
x
x
o
o
x
-b x |w| x
0
o
o
o
o
w o
x x
o
x d --
o
d+
xT w+b=1 x1
Figure 7.3: The separating hyperplane x w + b = 0 and the margin in a linearly separable (left) and non-separable (right) case. Crosses denote solvent companies, zeros are the insolvent ones. The hyperplanes bounding the margin zone equidistant from the separating hyperplane are represented as x w + b = 1 and x w + b = −1. The misclassification penalty in the non-separable case is proportional to the distance ξ/ w. This means we have a trade-off between the number of classification errors on the training set and the complexity of the classifier function.
230
7 Learning machines to help predict bankruptcy
The second goal is to maximize the margin separating the two groups as shown in Figure 7.3. The separating function generated by a linear SVM is x w + b = 0,
(7.7)
where w is of dimension d × 1 called the weight or the slope. b is a scalar representing the location parameter. xi is a d × 1 vector of the characteristics of company i, e.g. financial ratios described in 7.4. There are d different characteristics. Figure 7.3 visualizes the geometrical idea behind the SVM. In both pictures, a margin is created by straight parallel lines which separate the two groups. Recall that the distance between the groups needs to be maximized in order to improve classification. If the two groups are perfect linearly separable, meaning that no observations are in the margin or the opposite group as shown on the left picture, then all observations will satisfy x i w + b ≥ 1 for x i w
+ b ≤ −1 for
yi = 1, yi = −1,
However, if there are observations in the marginal zone or in the opposite group shown on the right panel of Figure 7.3, the misclassification needs to be penalized with η. Hence, the inequalities are adjusted for all observations and we get: x i w + b ≥ 1 − ξi xi w + b ≤ −1 + ξi
for for
yi = 1, yi = −1,
ξi ≥ 0,
(7.8) (7.9) (7.10)
One of the main advantages of the SVM method is that only the data points on the margin, in the margin and in the opposite group, are used for classification. These observations are then called support vectors giving the technique its name. The logit and DA methods use all observations for classification causing a higher cost of iteration. From the above inequalities (7.8) and (7.9), the primal minimization problem of the SVM is written as: 1 min w2 + C ξi , w 2 n
i=1
(7.11)
7.1 Bankruptcy analysis
231
subject to yi (x i w + b) ≥ 1 − ξi
(7.12)
ξi ≥ 0.
(7.13)
C is the complexity or capacity. Smaller C’s lead to larger margins and avoid over-fitting, but the misclassification rate is potentially higher. Sometimes each group has a different C. These C’s are only necessary if the sizes of both groups are sufficiently different when training the SVM. Furthermore, the second term of (7.11) serves as the misclassification penalty. The primal problem above cannot be solved directly. The introduction of Lagrangian multipliers to (7.11) - (7.13) leads to following equations: 1 2 w + C ξi − αi {yi x βi ξi , i w + b − 1 + ξi } − 2 i=1 i=1 i=1 n
min max LP =
w,b,ξi αi ,μi
n
n
where αi ≥ 0 and βi ≥ 0 are the Lagrange multipliers. They are non-zero for support vectors. Equation (7.14) needs to be rewritten as a dual problem using the KarushKuhn-Tucker conditions (Gale, Kuhn and Tucker, 1951). The dual problem is
max = αi
n i=1
1 αi αj yi yj x i xj , 2 i=1 j=1 n
αi −
n
s.t.0 ≤ αi ≤ C, n αi yi = 0.
(7.14) (7.15) (7.16)
i=1
Note that this optimization problem is convex; therefore, a unique solution can be found. This dual optimization problem can be solved by hand or with a statistical computer program. Its solutions are the n Lagrangian multipliers αi . They determine the degree of influence of each training observation. The harder an observation is to classify the higher αi will be. This explains why the αi ’s are zero for the observations lying in the correct area. Once αi ’s are calculated the weight w of the d-variables is given: w=
n i=1
αi yi xi .
232
7 Learning machines to help predict bankruptcy Data Space
x2
x
x
o o
x x o
o o o o o o
x
x x x x
x2 2
o
x x
x x x x x o x o x o x x x o o x o x x o o x o o o o o o o o o x
x x x
Feature Space
o
o o x x x o x x o o o o o x1
o 2 1/2 x1 x2
x1 2
Figure 7.4: Mapping from a two-dimensional data space into a threedimensional space of features R2 to R3 .
Notice the Lagrangian multipliers are directly related to the weights. In logistic regression this comparison of α and w is not possible. If the data used is non-linear separable we transform it into a higher dimensional space, called the feature space. The minimization (7.11) depends only on the scalar product x xi , not on the original x and xi . Therefore, x xi can be replaced by a function k(x, xi ) so that there is a mapping from a lower dimensional space into higher dimensional space in which the data is linearly separable. The kernel function k(xi , yj ) must satisfy the Mercer conditions (Mercer (1909)): 1. k(xi , yj ) = k(yj , xi ) = (symmetric) 2. ∀(x1 , . . . , xn ), (y1 , . . . , yn ), kij = k(xi , yj ), a ka ≥ 0 (semi-positive definite) Hence, the data can be mapped into infinitely dimensional spaces as in the case with the Gaussian kernels. Figure 7.4 visualizes a simple example of the above theory. Let k be the quadratic kernel function: 2 k(x, xi ) = x xi ,
7.1 Bankruptcy analysis
233
which maps from R2 into R3 . The explicit map is: √ 2 ψ(x, xi ) = x2 , 2xxi , x2i Fortunately, we do not need to determine the transformation ψ explicitly because k(x, xi ) = ψ(x, xi ) ψ(x, xi ). Non-linear data can be classified into a higher dimensional feature space without changing the SVM solution. Higher dimensionality of the data and degree of the polynomial of k results in a larger number of features. The advantage of the ’kernel trick’ becomes obvious: We train the SVM in a higher dimensional feature space, where the data is linear separable, but the feature space does not need to be determined explicitly. Some examples of kernel functions are: d
• k(xi , xj ) = (xi · xj + c) (Polynomial) • k(xi , xj ) = exp − xi − xj 2 / 2σ 2 (RBF) • K(xi , xj ) = tanh(kx i xj − δ) – the hyperbolic tangent kernel −2
• K(xi , xj ) = e−(xi −xj )
r
Σ−1 (xi −xj )/2
(stationary Gaussian kernel)
The last kernel has an anisotropic radial basis. Later, we will apply this kernel taking Σ equal to the variance matrix of the training set and r as a constant standing for the radius of the smallest sphere containing data. The higher r the lower the complexity. Once, the support vector machine is trained and we have found the values for w we are able to classify a new company described by variables of x using the classification rule: g(x) = sign x w + b , (7.17) n where w = i=1 αi yi xi and b = 12 (x+1 + x−1 ) w. x+1 and x−1 are any two support vectors belonging to different classes. They must both lay on the margin boundary. To reduce numerical errors when training the SVM it is desirable to use averages over all x+ and x− instead of two arbitrarily chosen support vectors. The value of the classification function, or in our study, the score of a company is computed as: f (x) = x w + b.
(7.18)
234
7 Learning machines to help predict bankruptcy
Default
1
-1 1
2
3
4
5
6
7
8
Company Rank
Figure 7.5: Monotonisation of PD’s with the pool adjacent violator algorithm. The thin line denotes PD’s estimated with the k−N N method with uniform weights and k = 3 before monotonisation and the bold red line after monotonisation. Here y = 1 for insolvencies, y = −1 for solvent companies.
We note that each value of f (x) uniquely corresponds to a default probability (PD). There are different ways of calculating the PD values for each company. Here, one solution is presented with two steps. First, we calculate for all i = 1, 2, . . . , n observations of the training set: n w(z − zi )I(yi = 1) P˜D(z) = i=1n , i=1 w(z − zi ) 2 where w(z − zi ) = exp (z − zi ) /2h2 . zi = Rankf (xi ) is the rank of the ith company. Higher scores f (xi ) lead to higher ranks. The smoothness of P˜D(z) is influenced by the bandwidth h. Smaller h’s give higher smoothness. Using the company rank zi instead of the score f (xi ) we obtain a k − N N smoother i) with Gaussian weights nw(z−z which decay gradually as |z − zi | grows. j=1 w(z−zj After the first step, the PD values are not necessarily monotone as the thin black line shows in Figure 7.5. Therefore, a second step is necessary. In step 2, we monotonise P˜D(zi ) by using the Pool Adjacent Violator (PAV) algorithm (Barlow, Bartholomew, Bremmer, and Brunk (1972)). Figure 7.5 illustrates how the previous black line (before monotonisation) changes to the thick red line (PD after monotonisation). The PD values become monotone.
7.1 Bankruptcy analysis
235
0.5
PD
CPD
0
PD and cumulative PD
1
PD and Cumulative PD
0
50
100 Company rank
150
200
Figure 7.6: The circles represent the smoothing and monotonisation of default (y = 1) and non-default (y = 0) companies.
The companies are ordered according to their rank on the horizontal axis. The value y = 1 indicates insolvency and y = −1 indicates solvency. Between the ranking 1 and 2 monotonicity is violated. After applying the the PAV algorithm, the PD value is corrected. With the PAV we obtain monotonised probabilities of default, PD(xi ), for the observations of the training set as seen in Figure 7.6. A PD for any observation x of the testing set is computed by interpolating PDs for two adjacent observations in terms of the score from the training set. If the score for x lies beyond the range of the scores of the training set, then PD(x) is set equal to the score of the first neighbouring observation of the training set. Finally, we measure the quality of the rating method for which the accuracy ratio (AR) is applied. AR values close to one indicate good performances of the method. If the two groups are the same size then we calculate the AR:
236
7 Learning machines to help predict bankruptcy
0.6 0.4 0.0
0.2
F1 (Percentage of insolvency)
0.8
1.0
ROC curve
0.0
0.2
0.4
0.6
0.8
1.0
FO (Percentage of solvency)
Figure 7.7: Receiver Operating Characteristics Curve. An SVM is applied with the radial basis 2Σ1/2 and capacity C = 1.
AR ≈ 2
1
y(x)dx − 1. 0
From the AR, we can draw a receiver operating characteristics (ROC) curve as shown in Figure 7.7. The blue line demonstrates the perfect separation. Due to the intermesh of insolvent and solvent companies perfect separation does not exist. Usually the curve looks like the red line. In the case of a naive model, the ROC curve is simply the bisector.
7.2 Importance of risk classification and Basel II
237
7.2 Importance of risk classification and Basel II When we look back at the history of financial markets, we observe many ups and downs. The wish to create a stable financial system will always exist. In 1974, one of the largest post World War II bankruptcies happened, the liquidation of I. D. Herstatt KGaA, a private bank in Cologne. Consequently, a committee was formed with the intent of setting up regulations for banks. One way to prevent bankruptcies is to correctly assert the financial situation of the contractor. In 1988, the Basel Committee on Banking Supervision (BCBS) presented the Basel I Accord. It suggested a minimal capital requirement for banks to prevent future bank crashes. This treaty did not prevent the world market from a crisis in 2002. Economists realized that credit rating systems, which developed in the 1950, did not accurately price risks. Others said that the solution to market crashes are efficient regulations. The BCBS established Basel II, which addressed credit and operational risk, and also integrated supervisory review and market discipline. It gives banks the option to use their own rating systems, but they must base the validity of their systems on statistical models. This law came into effect at the end of 2006. The committee in Basel also decided to require banks to use rating systems with new particular properties such as statistical empirical justification for the calculation of PD values. All the additional requirements will be in effect by 2012. Thus the German Bundesbank, for example, is working hard adjusting their rating software. Unfortunately, up until five years ago, only a small percentage of firms and banks were rated so that developing any classification system brings many challenges. Even today, many companies are not listed on the database. Often, the premium for credits is not based on models but on the sole decision of a loan officer. According to Basel II, a company is declared insolvent if it overdraws its debts 90 days after maturity. Lending arbitrarily is not safe, and it is better to calculate the default probability (PD) of the borrower. The PD value can be directly calculated from the score found with the SVM method as explained in the previous section. Then, one can decide to which rating class the borrower belongs as shown in Table 7.1 or in Table 7.2 Those two tables are just general examples but do not correspond to each industry because different tables for different countries and industries exist. Banks and insurances can use tables such as 7.1 and 7.2 to establish the risk premium for each class. Whereas Table 7.3 displays the capital requirements according to Basel I and Basel II.
238
7 Learning machines to help predict bankruptcy Rating Class (S&P) AAA AA+ AA AAA+ A ABBB+ BBB BBBBB+ BB BBB+ B BCCC+ CCCD
One year PD (%) 0.01 0.01 0.02 0.03 0.05 0.08 0.13 0.22 0.36 0.58 0.94 1.55 2.50 4.08 6.75 10.88 17.75 29.35 100.00
Table 7.1: Rating classes and PDs. Source: Henking (2006)
7.3 Description of data The dataset used in this work comes from the credit reform database provided by the Research Data Center (RDC) of the Humboldt Universit¨ at zu Berlin. It contains financial information from 20000 solvent and 1000 insolvent German companies. The time period ranges from 1996 to 2002 and in the case of the insolvent companies the information was gathered 2 years before the insolvency took place. The last annual report of a company before it went bankrupt receives the indicator y = 1 and for the rest (solvent) y = −1. We are given 28 variables, i.e. cash, inventories, equity, EBIT, number of employees and branch code. From the original data, we create common financial indicators which are denoted as x1 . . . x25. These ratios can be grouped into four categories such as profitability, leverage, liquidity and activity. The in-
7.4 Calculations
239 Rating Class (S&P) AAA AA A A BBB BB B CCCD
Five year PD (%) 0.08 0.32 0.91 0.08 3.45 12.28 32.57 69.75 100.00
Table 7.2: Rating classes and PDs. Henking (2006)
dicators are presented in Table 7.4. For the x9 formula, INGA and LB mean intangible assets and lands & buildings, respectively.
7.4 Calculations In order to reduce the effect of the outliers on the results, all observations that exceeded the upper limit of Q75 + 1.5 ∗ IQ (Inter-quartile range) or the lower limit of Q25 − 1.5 ∗ IQ were replaced with these values. Table 7.5 gives an overview of the summary statistics. In the next table (7.6), the insolvent and solvent companies for each year are displayed. Insolvent company data for the year of 1996 are missing, we will therefore exclude them from further calculations. We are left with 1000 insolvent and 18610 solvent companies. Not all variables are good predictors for the classification method. The most common techniques to find the right variables are Mallows’ CP, backward stepwise and forward stepwise selection. The latter is preferred when dealing with many variables because the cost of computation is reduced. This method starts with a univariate model and continues adding variables until all variables are included. At each step the variable is kept whose addition to the model resulted in the highest median accuracy ratio (AR). As a result, ratios x24, x3, x15, x12, x25, x22, x5, and x2 generate the model with the highest AR (60.51%).
240
7 Learning machines to help predict bankruptcy Rating Class (S&P)
One-year PD (%)
AAA AA A+ A ABBB BB B+ B BCCC CC C D
0.01 0.02 – 0.04 0.05 0.08 0.11 0.15 – 0.40 0.65 – 1.95 3.20 7.00 13.00 > 13
Capital Requirements (%) (Basel I) 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00 8.00
Capital Requirements (%) (Basel II) 0.63 0.93 – 1.40 1.60 2.12 2.55 3.05 – 5.17 6.50 – 9.97 11.90 16.70 22.89 > 22.89
Table 7.3: Rating grades and capital requirements. Source: (Damodaran, 2002) and (F¨ user, 2002). The figures in the last column were estimated by the authors for a loan to an SME with a turnover of 5 million Euros with a maturity of 2.5 years using the data from column 2 and the recommendations of the Basel Committee on Banking Supervision (BCBS, 2003).
7.5 Computational results After choosing the best predictors, we start calculating score values for each company. As mentioned earlier, the results depend on the predefined C, the capacity, and r. To demonstrate how performance changes, we will use the accounts payable turnover (x24), which is the best univariate model, and calculate Type I and Type II errors. Type I errors are those when companies were predicted to stay solvent but turned insolvent. Type II errors are the mistakes we make by assuming companies will default but they do not. Keep in mind that different kernels will also influence performance. We use one of the most common ones, the radial Gaussian kernel.
7.5 Computational results Ratio No. x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 x20 x21 x22 x23 x24 x25
Definition NI/TA NI/Sales OI/TA OI/Sales EBIT/TA (EBIT+AD)/TA EBIT/Sales Equity/TA (Equity-ITGA)/ (TA-ITGA-Cash-LB) CL/TA (CL-Cash)/TA TL/TA Debt/TA EBIT/Interest exp. Cash/TA Cash/CL QA/CL CA/CL WC/TA CL/TL TA/Sales INV/Sales AR/Sales AP/Sales Log(TA)
241 Ratio Return on assets Net profit margin Operating Inc./Total ass. Operating profit margin EBIT/Total assets EBITDA EBIT/Sales Own funds ratio (simple) Own funds ratio (adj.)
Category Profit. Profit. Profit. Profit. Profit. Profit. Profit. Leverage Leverage
Current liab./Total ass. Net indebtedness Total liab./Total ass. Debt ratio Interest coverage ratio Cash/Total assets Cash ratio Quick ratio Current ratio Working Capital Current liab./Total liab. Asset turnover Inventory turnover Account receiv. turnover Account payable turnover Log(Total assets)
Leverage Leverage Leverage Leverage Leverage Liquidity Liquidity Liquidity Liquidity Liquidity Liquidity Activity Activity Activity Activity Activity
Table 7.4: Defintions of financial ratios.
In Figures 7.8–7.11 the triangles represent solvent and circles represent insolvent companies from a chosen training set. The solid shapes represent the support vectors. We randomly chose 50 solvent and 50 insolvent companies. The colored background corresponds to different score values f . The bluer the area, the higher the score and probability of default. Most successful companies are in the red area and have positive profitability and reasonable activity.
242
7 Learning machines to help predict bankruptcy Ratio x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 x20 x21 x22 x23 x24 x25
q0.05 -0.19 -0.015 -0.22 -0.16 -0.19 -0.13 -0.14 0.00 -0.01 0.18 -0.12 0.29 0.00 -7.90 0.00 0.00 0.18 0.56 -0.32 0.34 0.43 0.02 0.02 0.03 13.01
Med. 0.00 0.00 0.00 0.00 0.02 0.07 0.01 0.05 0.05 0.52 0.49 0.76 0.21 1.05 0.02 0.03 0.68 1.26 0.15 0.84 1.63 0.16 0.12 0.14 14.87
q0.95 0.09 0.06 0.10 0.07 0.13 0.21 0.10 0.40 0.56 0.91 0.89 0.98 0.61 7.20 0.16 0.43 1.90 3.73 0.63 1.00 4.15 0.89 0.33 0.36 17.16
IQR 0.04 0.03 0.06 0.04 0.07 0.08 0.04 0.13 0.17 0.36 0.36 0.35 0.29 2.47 0.05 0.11 0.54 0.84 0.36 0.37 1.41 0.26 0.11 0.10 1.69
q0.05 -0.09 -0.07 -0.011 -0.08 -0.09 -0.04 -0.07 0.00 0.00 0.09 -0.05 0.16 0.00 -6.78 0.00 0.00 0.25 0.64 -0.22 0.22 0.50 0.01 0.00 0.01 12.82
Med. 0.02 0.01 0.03 0.02 0.05 0.11 0.02 0.14 0.16 0.42 0.36 0.65 0.15 2.16 0.03 0.08 0.94 1.58 0.25 0.85 2.08 0.11 0.09 0.07 17.95
q0.05 0.19 0.10 0.27 0.13 0.27 0.35 0.14 0.60 0.95 0.88 0.83 0.96 0.59 73.95 0.32 1.40 4.55 7.15 0.73 1.00 6.19 0.56 0.25 0.24 1.657
IQR 0.06 0.03 0.09 0.04 0.09 0.12 0.05 0.23 0.32 0.39 0.41 0.40 0.31 5.69 0.10 0.29 1.00 1.56 0.41 0.4 1.76 0.16 0.09 0.08 2.37
Table 7.5: Descriptive statistics for financial ratios.
Figure 7.8 presents the classification results for an SVM using r = 100 and the fixed capacity C = 1. With the given priors, the SVM has trouble classifying between solvent and insolvent companies. The radial base r, which determines the minimum radius of a group, is too large. Default companies do not seem to exist. Notice that the SVM is doing a poor job of distinguishing between the groups even though most observations are used as support vectors. In Figure 7.9, the minimal radius is reduced to 2 while C remains the same. Clearly, the SVM starts recognizing the difference between solvent and insolvent
7.5 Computational results Year 1996 1997 1998 1999 2000 2001 2002
Solvent 1390 1468 1615 2124 3086 4380 5937
243 Insolv. 0 146 1274 179 175 187 186
Solv. Ratio 100.00% 90.95% 92.71% 92.23% 94.63% 95.91% 96.96%
Insolv. Ratio 0.00% 9.05% 7.29% 7.77% 5.37% 4.09% 3.04%
Table 7.6: The distribution of the data over the years for solvent and insolvent companies.
C 0.001 0.100 10.000 100.000 1000.000 10.0 10.0 10.0 10.0 10.0
r 0.600 0.600 0.600 0.600 0.600 0.002 0.060 6.000 60.000 2000.000
Type I error 40.57 38.42 34.43 25.22 25.76 37.20 31.86 36.97 37.27 41.09
Type II error 23.43 24.45 27.86 34.44 34.26 32.79 29.25 27.86 25.87 24.85
Table 7.7: Misclassification errors (30 randomly selected samples; one predictor x24).
companies resulting in sharper clusters for successful and failing companies. Decreasing r even further, i.e. to 0.5 as in Figure 7.10, emphasizes the groups in even more detail. If the radial base is too small, then the complexity will be too high for a given data set. If we increase the capacity C, we decrease the distance between the groups. Figure 7.11 demonstrates this effect on the classification results. The beige area outside the clusters is associated with score values of around zero. With higher C’s, the SVM localizes only one cluster of successful companies. It is crucial to use statistical methods, i.e. the leave-one-out method, to find the
244
7 Learning machines to help predict bankruptcy
SVM classification plot 1.0 0.2 0.5
0.0
0.1
x3
−0.5 0.0 −1.0
−1.5
−0.1
−2.0 −0.2 −2.5 0.00
0.05
0.10
0.15
0.20
0.25
0.30
0.35
x24
Figure 7.8: Ratings of companies in two dimensions. The case of a low complexity of classifier functions, the radial basis is 100, the capacity is fixed at C = 1.
STFsvm01
optimal priors C and r for classifying companies. After a SVM classification function f is trained, we can calculate the scores for new companies, determine their PD values and decide if they belong to the solvent or insolvent group. Further, a SVM learns the cluster of both groups given that the constants C and r are chosen appropriately. If the capacity is too high the knowledge of cluster centre vanishes. If r is too high, the groups might intermesh. The choice of the kernel k also affects the solution. Often, the most appropriate kernel is one of the Gaussian kernel. Earlier we introduced one way of converting scores into PDs including a smoothing technique. However, we can also calibrate the PDs by hand. Consider an example with r = 2 and C = 1. We choose three rating classes: safe, neutral and risky and give each class a corresponding score value of f < −0.0115, −0.0115 < f < 0.0115 and f > 0.0115, respectively. Next we count how many companies belong to each group then calculate the ratio of failing companies
7.6 Conclusions
245
SVM classification plot 1.5 0.2 1.0 0.1
x3
0.5
0.0
0.0
−0.5
−0.1
−1.0 −0.2
0.00
0.05
0.10
0.15
0.20
0.25
0.30
0.35
x24
Figure 7.9: Ratings of companies in two dimensions. The case of an average complexity of classifier functions, the radial basis is 2, the capacity is fixed at C = 1.
STFsvm01
giving us the estimated probability of default for each rating class. With a sufficient number of observations in the training set, the rating classes could be divided up into finer ones. We have seen an example in Table 7.1. The rating company S & P uses over a dozen rating classes such as AAA, AA ,. . . Different rating agencies use a variety of numbers and names for their classifications.
7.6 Conclusions SVMs are capable of predicting whether a company will be solvent or insolvent based only on its financial and economic information. Of course, the SVM needs to be provided with a training data set. Once we have learned the SVM, it divulges the financial situation of a company, which may not be obvious at
246
7 Learning machines to help predict bankruptcy
SVM classification plot
0.2 1.0
0.1
x3
0.5
0.0
0.0
−0.5
−0.1
−1.0 −0.2
0.00
0.05
0.10
0.15
0.20
0.25
0.30
0.35
x24
Figure 7.10: Ratings of companies in two dimensions. The case of an excessively high complexity of classifier functions, the radial basis is 0.5, the capacity is fixed at C = 1.
STFsvm01
first glance. SVMs are easy to implement with their low number of calculational steps and priors. They often give the best classification results compared to logit or other methods. Thus, SVMs have become more and more popular over the last decade. We have learned that the scores found with SVM models can be used to calculate the individual PDs for each company. Consequently, credits and other financial instruments can be adjusted accordingly. Banks and companies will profit from those results because it helps them to decide with what kind of risk they can carry. This leads hopefully to a more stable financial market by increasing our ability to predict defaults and accurately evaluate risk.
7.6 Conclusions
247
SVM classification plot 5 0.2
0
x3
0.1
−5
0.0
−0.1
−10
−0.2
−15 0.00
0.05
0.10
0.15
0.20
0.25
0.30
0.35
x24
Figure 7.11: Ratings of companies in two dimensions, the case of a high capacity (C = 200). The radial basis is fixed at 2.
STFsvm01
Bibliography Altman, E. (1968). Financial ratios, discriminant analysis and the prediction of corporate bankruptcy, The Journal of Finance pp. 589–609. Beaver, W. (1966). Financial ratios as predictors of failures. empirical research in accounting: Selected studies, Journal of Accounting Research pp. 71– 111. supplement to vol. 5. Dimitras, A., Slowinski, R., Susmaga, R. and Zopounidis, C. (1999). Business failure prediction using rough sets, European Journal of Operational Research (114): 263–280. Eunite (2001). Electricity load forecast competition of the EUropean Network on Intelligent TEchnologies for Smart Adaptive Systems, http://neuron.tuke.sk/competition/ . Fitzpatrick, P. (1932). A comparison of the ratios of successful industrial enterprises with those of failed companies. Frydman, H., Altman, E. and Kao, D.-L. (1985). Introducing recursive partitioning for financial classification: The case of financial distress, The Journal of Finance 40(1): 269–291. Gale, D., Kuhn, H. W. and Tucker, A. W. (1951). Linear Programming and the Theory of Games, in Activity Analysis of Production and Allocation, T. C. Koopmans (ed.), John Wiley & Sons, New York, NY. Henking, A. (2006). Kreditrisikomessung, first edn, Springer. Merton, R. (1974). On the pricing of corporate debt: The risk structure of interest rates, The Journal of Finance 29(2): 449–470. Ohlson, J. (1980). Financial ratios and the probabilistic prediction of bankruptcy, Journal of Accounting Research pp. 109–131. Ramser, J. and Foster, L. (1931). A demonstration of ratio analysis. bulletin no. 40.
250
Bibliography
R¨atsch and M¨ uller (2004). Zur Beurteilung des Fraud-risikos im Rahmen der Abschlusspr¨ ufung, Die Wirtschaftspr¨ ufung pp. 1057–1068. S. Chen, W. H. a. R. M. (accepted in 2009). Modelling default risk with support vector machines, Journal of Quantitative Finance . Sch¨ olkopf, B. (1997). Support vector learning, Oldenbourg. Tam, K. and Kiang, M. (1992). Managerial application of neural networks: the case of bank failure prediction, Management Science 38(7): 926–947. Vapnik, V. (1995). The Nature of Statistical Learning Theory, Springer, New York. Wiginton, J. (1980). A note on the comparison of logit and discriminant models of consumer credit behaviour, Journal of Financial and Quantitative Analysis 15(3): 757–770. Wilcox, A. (1971). A simple theory of financial ratios as predictors of failure, Journal of Accounting Research pp. 389–395. Winakor, A. and Smith, R. (1935). Changes in the financial structure of unsuccessful industrial corporations. bulletin no. 51. Zavgren, C. (1983). The prediction of corporate failure: The state of the art, Journal of Accounting Literature (2): 1–38. Zmijewski, M. (1984). Methodological issues related to the estimation of financial distress prediction models, Journal of Accounting Research 20(0): 59– 82.
8 Distance matrix method for network structure analysis Janusz Mi´skiewicz
8.1 Introduction The distance matrix method goes in line with the network analysis of market structure such as clustering analysis (Focardi and Fabozzi, 2004), geometry of crashes (Araujo and Louca, 2007), degree distribution of nodes (Boss et al., 2004; Liu and He, 2009) etc. These methods allow, among other things, to investigate the time evolution of correlations in time series such as stocks. The analysis of such time series correlations has emerged from investigations of portfolio optimization. The standard approach is based on the cross-correlation matrix analysis and optimizations of share proportions (see e.g. Adams et al., 2003; Cuthberson and Nitzsche, 2001). The basic question regarding what the most desirable proportion are among different shares in the portfolio lead to the introduction of a distance between time series, and in particular, of the ultrametric distance, which has become a classical method of correlation analysis between stocks (Bonanno et al., 2001; Mantegna and Stanley, 2000). The method allows to analyze the structure of the market, and therefore, simplifies the choice of shares. In fact, this question about structure of the stock market should be tackled before portfolio optimization. The distance matrix method was later adopted in investigations of other economic questions such as the structure analysis of dependencies between companies (Mantegna, 1999; Bonanno et al., 2001; Bonanno et al., 2003) or the analysis of country economies by their levels of globalization (Mi´skiewicz and Ausloos, 2006b; Mi´skiewicz and Ausloos, 2008a; Mi´skiewicz, 2008; Ausloos and Gligor, 2008; Gligor and Ausloos, 2007; Gligor and Ausloos, 2008). The main purpose of the matrix distance method is however to simplify the choice of stocks and to specify the investment horizon. The method is also capable to
P. Čížek et al. (eds.), Statistical Tools for Finance and Insurance, DOI 10.1007/978-3-642-18062-0_8, © Springer-Verlag Berlin Heidelberg 2011
251
252
8 Distance matrix method for network structure analysis
analyse similarities between stocks in terms of information hidden in the time series and to suggest how to react to those changes. Generally, the correlation analysis of a group of time series can be divided into two steps: (i) calculation of the distance between each two time series and (ii) the analysis of the resulting distance matrix, which consists of the distances for all pairs of time series within the group. The latter is performed e.g. by constructing appropriated networks such as the minimum spanning tree (MST), the bidirectional minimum length path (BMLP), or the unidirectional minimum length path (UMLP), which are defined below and accompanied by illustrations based on the share price evolution of two subsets of the New York Stocks Exchange and Warsaw Stocks Exchange. In this chapter, several distance measures are discussed, i.e. the Manhattan, the Ultrametric, the Entropy and Theil index base distances, respectively. A particular attention is paid to the Manhattan and Ultrametric distances because they can be used as a base for constructing other distances, i.e. the Entropy and Theil index distances here. Therefore, the role of the noise influence on these distances is discussed and results obtained by application of the Manhattan and Ultrametric distances on generated time series compared. Although the most popular one seems to be the Ultrametric Distance I would like to focus the reader attention toward the others.
8.2 Correlation distance measures A straightforward way to introduce a distance measure between two time series, resides in considering them as vectors and use the distance so defined in vector spaces. The distance D(A, B) between time series A and B should satisfy the following conditions: 0⇔A=B
D(A, B)
=
D(A, B) D(A, B)
= D(B, A) ≤ D(A, C) + D(A, B).
(8.1)
Although there are many distances defined on general vector spaces (see e.g. Deza and Deza, 2006), in a practical application only few of them are in use: the Manhattan and Euclidean distances. These distances can be however combined with additional time series transformations such as the entropy or Theil index calculations as we later demonstrate.
8.2 Correlation distance measures
253
8.2.1 Manhattan distance The Manhattan Distance (MD) DM (A, B) is defined as the sum of the absolute values of the difference between the elements of two time series A and B, such that A = (a1 , a2 , . . . , an ), B = (b1 , b2 , . . . , bn ), where i is a discrete time index: DM (A, B) =
n
|ai − bi |.
(8.2)
i=1
It is possible to generalize MD to the case of delayed distances: (k)
DM (A, B) =
n−k
|ai − bi+k |,
(8.3)
i=1
where k is a (positive) time lag.
8.2.2 Ultrametric distance The Ultrametric Distance (UD) can be defined as the Euclidean distance of the normalized time series (Mantegna and Stanley, 2000). The UD measure DU (A, B)(t,T ) is based on the linear correlation coefficient: E(AB)(t,T ) − E(A)(t,T ) E(B)(t,T ) Corr(t,T ) (A, B) = * , (8.4) (E(A2 )(t,T ) − E(A)2(t,T ) )(E(B 2 )(t,T ) − E(B)2(t,T ) ) where E(. . .)(t,T ) denotes the mean value at time t over a time window of size T . The correlation function (8.4) is transformed in order to fulfill distance axioms: * DU (A, B)(t,T ) = 2(1 − Corr(t,T ) (A, B)), (8.5) where t and T respectively denote the final point and the size of the time window over which an average is taken. The present definition is a generalization of the original definition (Mantegna and Stanley, 2000), where the time window T is the length of the analyzed time series. The correlation function (8.4) takes value in the interval [−1, 1]. The value equal to -1 corresponds to anticorrelated data and +1 to correlated data. The
254
8 Distance matrix method for network structure analysis
case when Corr(t,T ) (A, B) = 0 is interpreted as a uncorrelated data series. Equation (8.5) maps the linear space of the series Ln of length n onto the interval [0, 2] (DU (A, B) : Ln × Ln → [0, 2]). The important points are that √ DU (A, B) = 0 means linear correlation between time series, DU (A, B) = 2 represents no correlation, and DU (A, B) = 2 indicates that the time series are anticorrelated. In some application (e.g. Mi´skiewicz and Ausloos, 2008a; Mi´skiewicz and Ausloos, 2006a; Mi´skiewicz and Ausloos, 2008b), an alternative normalization of UD is used: * DU (A, B)(t,T ) = (1/2)(1 − Corr(t,T ) (A, B)). (8.6) In this case, the distance between time series is mapped onto the interval [0, 1] and the√ time series are linearly correlated if DU (A, B) = 0, the value DU (A, B) = 22 indicates that there are no correlations and the valueDU (A, B) = 1 that the time series are anticorrelated. In the following illustrative examples, the second definition (8.6) will be used.
8.2.3 Noise influence on the time series distance From an application point of view, the question of noise influence onto the final results is one of the most important ones. In this paragraph the noise impact on the results given by (8.2) and (8.6) is investigated. Let us assume that there are two time series A and B. The elements of A and B are denoted by ai and bi respectively. The correlation between time series is understood as the existence of a reversible map with a one-to-one correspondence (an injection). Then the correlation between the time series A and B can be described as: f : A → B. (8.7) Let us assume that there is a noise W with the mean value equal to zero E(W ) = 0.
(8.8)
The noise may influence the correlation in three ways: by being present either in the time series A, B, or A and B. Then (8.7) takes one of the following forms: f (A) = B + W, (8.9) B = f −1 (A) − W.
(8.10)
8.2 Correlation distance measures
255
or f (A + W ) = B + W,
(8.11)
Equations (8.9), (8.10), and (8.11) reflect the noise dependence problem, which is, in general, not trivial and should be treated with caution especially in the case of nonlinear correlations, because the reverse function might not exist or may have some strong existence constraints. Therefore the problem may be split into two cases: searching for either linear or non-linear dependencies. In what follows, the noise influence on the MD will be discussed in the case of correlation function having a well defined reverse function f −1 ; then the analysis performed in the following section is valid in each case (8.9), (8.10), and (8.11). Since the UD is defined for measuring linear correlations, the noise influence will be discussed only in the case of a linear correlation function f .
8.2.4 Manhattan distance noise influence
The noise influence on MD can be found by direct application of the distance definition (8.2) to the time series (8.10):

D_M(A, B) = \sum_{i=1}^{n} |a_i - f^{-1}(a_i) + w_i|.   (8.12)
It is assumed that the mean value of the noise is equal to zero [assumption (8.8)]. However, this does not imply that the mean value of the absolute value of the noise will also be equal to zero. Let us consider the following cases:

• the time series significantly differ from each other, such that

\forall i \quad a_i - b_i + w_i < 0 \;\; \text{or} \;\; a_i - b_i + w_i > 0;   (8.13)

• the time series are similar in values,

\frac{1}{n}\sum_{i=1}^{n} (a_i - b_i) \cong 0.   (8.14)
When there is a significant difference between time series as in (8.13), the noise
influence is reduced because if \forall i \; a_i - b_i + w_i > 0, then

\sum_{i=1}^{n} |a_i - f^{-1}(a_i) - w_i| = \sum_{i=1}^{n} \{a_i - f^{-1}(a_i) - w_i\} = \sum_{i=1}^{n} \{a_i - f^{-1}(a_i)\}.   (8.15)
The same conclusion can be easily drawn if \forall i \; a_i - b_i + w_i < 0. In the case of small differences between time series (8.14), the noise may significantly influence the value of MD. Because of (8.14), the difference a_i - f^{-1}(a_i) can be close to zero for sufficiently long time series. Assuming the worst case a_i - f^{-1}(a_i) = 0, only the noise remains in (8.12):

\sum_{i=1}^{n} |w_i| > 0 \quad \text{and} \quad \lim_{n \to \infty} \sum_{i=1}^{n} |w_i| = n \cdot E(|W|).   (8.16)
The expected value of |W| depends on the noise distribution; e.g. if W is assumed to be uniformly distributed on the interval [−1, 1], then the absolute value function maps this interval onto [0, 1] with the mean value E(|W|) = 1/2. Then MD is a linear function of the time series length. Considering the case of significantly different time series [assumption (8.13)], MD is a monotonic function of the time series size (or so-called length). The dependence on the series size allows one to investigate what kind of relation exists between different time series. Let us consider a few examples: in the simplest case, when the time series differ by a constant value, f(a_i) = b_i + z, z = const, MD will increase linearly with the length of the time series. In the case of linear correlations MD will increase with the square of the time series length. More generally, let us consider the evolution of MD as a function of the time series length n. In the case of arbitrary correlations the value of MD will evolve with n as does the space between the time series. Since the a_i > b_i and a_i < b_i cases are equivalent from the point of view of MD, one can consider only the case a_i > b_i without loss of generality. For sufficiently long time series (n \gg 0) and taking into account (8.15), the MD measure can be approximated by a continuous function of the time series length:

D_M(A, B)(n) \approx \int_{0}^{n} \{a(t) - b(t)\}\, dt.   (8.17)
If the set of discrete data points can be fitted by a continuous function, then the correlation function may be found as

f(n) = \frac{d\, D_M(A, B)(n)}{dn},   (8.18)
where n is considered here as a continuous parameter. Therefore the form of the correlation may be found as the derivative of MD in (8.18). The above observation may help in the classification of mutual correlations, because a log-log transformation of (8.17) leads to a single number, namely the polynomial rank (the slope of MD in a log-log plot). Therefore the calculation of the mutual correlation function along (8.18) allows one to apply the polynomial classification as a hierarchical scheme.
8.2.5 Ultrametric distance noise influence
The Ultrametric Distance (8.6) takes values in the interval [0, 1], where 0 corresponds to the case of linearly correlated time series and 1 is reached for anticorrelated data. In order to consider the case of the noise impact on UD defined in (8.6), let a noisy signal like (8.9) be introduced into both time series, so that

A = \hat{A} + W \quad \text{and} \quad B = \hat{B} + W,   (8.19)

where W denotes the noise. Therefore (8.6) takes the form:
D_U(A, B)_{(t,T)} = \Big\{ \tfrac{1}{2}\big[1 - \big(E((A+W)(B+W))_{(t,T)} - E(A+W)_{(t,T)} E(B+W)_{(t,T)}\big) \times \big((E((A+W)^2)_{(t,T)} - E(A+W)^2_{(t,T)})(E((B+W)^2)_{(t,T)} - E(B+W)^2_{(t,T)})\big)^{-1/2}\big] \Big\}^{1/2}.   (8.20)

In the following calculations the time and time window indices will be omitted. Knowing that the mean value of the noise is equal to zero [assumption (8.8)], (8.20) can be rewritten as

D_U(A, B) = \Big\{ \tfrac{1}{2}\Big(1 - \frac{E(AB + A W_B + W_A B + W_A W_B) - E(A) E(B)}{[(E((A+W_A)^2) - E(A)^2)(E((B+W_B)^2) - E(B)^2)]^{1/2}}\Big) \Big\}^{1/2}.   (8.21)
Considering the case of independent noise processes, it holds that E(AW) = E(A)E(W), and (8.21) can be rewritten as

D_U(A, B) = \Big\{ \tfrac{1}{2}\Big(1 - \frac{E(AB) + E(A W_B) + E(W_A B) + E(W_A W_B) - E(A) E(B)}{[(E((A+W_A)^2) - E(A)^2)(E((B+W_B)^2) - E(B)^2)]^{1/2}}\Big) \Big\}^{1/2}
= \Big\{ \tfrac{1}{2}\Big(1 - \frac{E(AB) + E(W_A W_B) - E(A) E(B)}{[(E(AA + W_A W_A) - E(A)^2)(E(BB + W_B W_B) - E(B)^2)]^{1/2}}\Big) \Big\}^{1/2}.   (8.22)

Considering that the noises W_A and W_B are independent, i.e. E(W_A W_B) = E(W_A)E(W_B), one finally has:

D_U(A, B) = \Big\{ \tfrac{1}{2}\Big(1 - \frac{E(AB) - E(A) E(B)}{[(E(A^2) + E(W_A^2) - E(A)^2)(E(B^2) + E(W_B^2) - E(B)^2)]^{1/2}}\Big) \Big\}^{1/2}.   (8.23)

Since in general E(W_A^2) \neq 0 and E(W_B^2) \neq 0, it seems that the noise may influence significantly the results obtained for UD. The properties discussed above are illustrated on five numerically generated time series. They were generated with a uniform random number generator. First, three examples of the distances between time series with given mean values are investigated. The following series are considered:

Ex.1 Random time series with an identical mean value (equal to 0.5).

Ex.2 Random time series with different mean values (equal to 0.0 and 0.5).

Ex.3 Two a priori linearly correlated time series:
Ex.3a Random time series "A" with a mean value 0.5,
Ex.3b Linearly correlated time series "B" such that b_i = 5a_i + w(0.5), where w(0.5) is an uncorrelated white noise.

All time series have 10^4 data points. The results of the UD and MD application to the above time series are presented in Figures 8.1–8.2. The MD between time series plotted as a function of the time series length (Figure 8.1) is seen to be stabilizing after a relatively short time (10–50
Figure 8.1: Analysis of correlations with the MD measure. The “noise 1” line corresponds to time series with the same mean value (Ex.1), the “noise 2” line to time series with different mean values (Ex.2), and the “noise 3” line to the linearly correlated time series (Ex.3). The correlation measure is plotted versus the time series length in a log-log plot.
data points). A stable, linear behavior of the MD function is observed for time series longer than 70–100 data points. The obtained slopes of the functions are: (noise 1) 1.012, (noise 2) 0.996, (noise 3) 1.004, i.e. they are close to unity. Applying (8.18), it can be found that these are zero-level correlation types, which indicates a constant distance between the mean values, as it should indeed be. In the case of UD (Figure 8.2) the correlation measure achieves a stable value after 100–200 data points. In the cases (Ex.1) and (Ex.2) of generated time series the UD is about 0.7, while in the case of the linearly correlated data (Ex.3) the distance is equal to 0.1. The largest distance, equal to 0.7 (≈ \sqrt{2}/2), would suggest that there is no correlation between the time series, whereas the resulting value 0.1 points to the possibility of linear correlations. Since the generated time series are linearly correlated, one could expect D_U(A, B) = 0; the difference between the expected and the obtained result is attributed to the noise influence. In order to investigate other correlations, two additional examples are next investigated:
Figure 8.2: Analysis of time series correlations through UD (8.6) as a function of the time series length. The “noise 1” line is the UD between time series which have identical mean values, the “noise 2” line is the UD between two time series with different mean values, and the “noise 3” line is the UD between the linearly correlated time series.
Ex.4 A time series with a trend: the time series A is defined as a_i = t_i + w(0.5), where t_i is increased at every time step by 0.01 and w(0.5) is a white noise with the mean value equal to 0.5; the time series B is linearly correlated to A: b_i = 5a_i + w(0.5).

Ex.5 A nonlinear correlation between time series: the first time series is defined as above, a_i = t_i + w(0.5), and the second time series is defined according to the equation b_i = 5a_i^2 + w(0.5).

In the case of MD, the results in Figure 8.3 at first do not differ between the pairs of time series correlated by linear or quadratic functions, at least up to time series lengths of about 100 data points. For time series longer than 100 data points, the distance functions between the time series with the linear (f1) and the nonlinear correlation (f2) depart from each other. Linear functions have been fitted on the interval (10^3, 10^4), and in such a case the
Figure 8.3: MD in the case of time series with a trend, as a function of the time series length, plotted in a log-log plot. “f1” is the MD between the time series a_i = t_i + w(0.5) and b_i = 5a_i + w(0.5) (Ex.4) and “f2” is the MD between the time series a_i = t_i + w(0.5) and b_i = 5a_i^2 + w(0.5) (Ex.5).
functions are indeed highly linear. The obtained slopes are 1.97 ≈ 2 for the function f1 and 2.96 ≈ 3 for f2, respectively. In Figure 8.4, the UDs between the time series with a trend are plotted as functions of the time series length. In the case of the linearly correlated time series (f1), UD decreases with the time series length, stabilizing around 4000–5000 data points and achieving a value close to zero. A similar conclusion can be drawn in the case of the quadratically correlated time series (Figure 8.4, “f2” line), where UD initially decreases and stabilizes at the level 0.12 (still close to zero) for time series longer than 1000 data points. Analogous results have been obtained in the case of the examples discussed at the beginning of this section, i.e. for the linearly correlated time series without a trend (Ex.3).
Figure 8.4: UD in the case of time series with a trend. “f1” is the UD between the time series a_i = t_i + w(0.5) and b_i = 5a_i + w(0.5) (Ex.4) and “f2” is the UD between the time series a_i = t_i + w(0.5) and b_i = 5a_i^2 + w(0.5) (Ex.5).
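The experiment behind Figures 8.3–8.4 can be reproduced along the following lines. This is a sketch under the assumptions stated above (uniform white noise w(0.5) on [0, 1], trend increment 0.01); it reuses the illustrative ud() helper from Section 8.2.2 and takes MD to be the plain sum of absolute differences, cf. (8.12).

set.seed(2)
n  <- 1e4
md <- function(A, B) sum(abs(A - B))            # Manhattan distance

t_i <- 0.01 * (1:n)                             # deterministic trend, step 0.01
a   <- t_i + runif(n)                           # a_i = t_i + w(0.5)
b1  <- 5 * a   + runif(n)                       # linear correlation    (Ex.4)
b2  <- 5 * a^2 + runif(n)                       # quadratic correlation (Ex.5)

# MD as a function of the time series length, fitted in log-log coordinates
# over (10^3, 10^4) as in the text; the slopes come out close to 2 and 3.
lens <- round(10^seq(1, 4, by = 0.1))
md1  <- sapply(lens, function(k) md(a[1:k], b1[1:k]))
md2  <- sapply(lens, function(k) md(a[1:k], b2[1:k]))
fit  <- lens >= 1e3
coef(lm(log10(md1[fit]) ~ log10(lens[fit])))[2]   # about 2 (cf. 1.97)
coef(lm(log10(md2[fit]) ~ log10(lens[fit])))[2]   # about 3 (cf. 2.96)

# UD (8.6) for the full series: close to 0 for Ex.4 and about 0.12 for Ex.5
ud(a, b1)
ud(a, b2)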
8.2.6 Entropy distance
The main problem in defining the Entropy Distance (ED) stems from the measurement of the time series entropy itself. The most popular measure of entropy is the Shannon entropy, defined as

S = -\sum_{i} p_i \log p_i,   (8.24)

where p_i is the probability of finding the system in the i-th state. Although one can apply measures based on the Shannon entropy (see Brigo and Liinev, 2005), there are two important problems. The main difficulty in the application of the above definition is that economic data are not a priori categorised; it is unlikely that some state (data value) would (often) repeat. In this situation one has to bin the data, but such a process is arbitrary and the results can vary from zero (one huge category) up to \log n, where n is the number of elements in the analyzed time series. The second complication is the length of the time series to be considered: most economic time series are nonstationary and, therefore, only relatively short time periods can be treated as quasi-stationary. In contrast to this, if the distance measure requires some averaging
(e.g., as in UD, or a categorisation into bins), then, in order to avoid smoothing out the salient fluctuations, the analyzed time series should be at most several dozen data points long (e.g. Liu et al., 2009). Therefore any approximation of the distribution function (p_i) is disputable. Because of the above mentioned problems, the Theil index (Shorrocks, 1980), which measures the difference between the system's maximum entropy and its entropy at a given time, is used as an alternative to the Shannon entropy. The Theil index is mainly used to compare incomes and economies, their features or grouping structures (Akita, 2003; Theil, 1965; Conceicao et al., 2000). It is defined here as

Th_A(t, T) = \sum_{i=t-T}^{t} \frac{A_i}{\sum_{j=t-T}^{t} A_j}\, \log \frac{A_i}{E(A)_{(t,T)}}.   (8.25)
Having calculated the Theil indices of the time series, the distance between them has to be calculated. Taking into account that UD is less robust to the noise influence than MD (see Section 8.2), the latter distance measure is chosen as the basis of the ED. However, the MD (8.2) depends on the time series length (see Section 8.2). In order to compare distances between time series of different lengths, MD is normalised by the length of the time series:

D_{ME}(A, B) = \sum_{i=1}^{n} \frac{|a_i - b_i|}{n}.   (8.26)

Finally, the Theil Index Distance (TD) is defined as

D_T(A, B)_{(t,T)} = D_{ME}(Th_A(t, T), Th_B(t, T)).   (8.27)
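A possible R sketch of (8.25)–(8.27) is given below. The helper names are illustrative, the data are assumed to be strictly positive, and the way the two windows are combined (a gliding Theil window of size T1 whose output series are compared by (8.26) over the last T2 points) is an assumption inspired by the two-window set-up (T1, T2) used later in Section 8.4.2, not a definitive implementation.

# Theil index (8.25) of the series x over the window [t - T1, t].
theil <- function(x, t, T1) {
  a <- x[(t - T1):t]
  sum(a / sum(a) * log(a / mean(a)))
}

# Normalised Manhattan distance (8.26).
d_me <- function(a, b) mean(abs(a - b))

# Theil index distance (8.27): compare the gliding-window Theil index series.
d_t <- function(A, B, T1 = 50, T2 = 50) {
  ends <- (T1 + 1):length(A)                    # admissible window end points
  thA  <- sapply(ends, function(t) theil(A, t, T1))
  thB  <- sapply(ends, function(t) theil(B, t, T1))
  m    <- length(thA)
  d_me(thA[(m - T2 + 1):m], thB[(m - T2 + 1):m])
}

# Example with two positive, uncorrelated series:
set.seed(4)
d_t(runif(300) + 1, runif(300) + 1)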
8.3 Distance matrices analysis
The distance matrix, indexed by the time series labels A, B, can be calculated for various sizes of time windows and as a function of time. Thereafter such distances can be compared in many different ways, e.g. by analyzing networks in which the shares, bonds or countries are the nodes and the distances define the links. Three network structures are hereby constructed and their structural and statistical properties are discussed. Thereafter the network evolutions are discussed.
For the sake of clarity, some basic definitions of network theory are presented below. The interested reader can find more information in, e.g., Levis (2009), Gross and Yellen (2006), Bondy and Murty (2008) or other textbooks on graph theory. A graph is defined as a set of nodes and edges. The edges can have weights, and in this case the graph is weighted. The node degree is defined as the number of edges connected to that node, or the sum of the weights of the edges connected to the node. A path is a graph in which all the nodes are connected and the degree of any node is not greater than two. The distance between two nodes is the number of edges (or the sum of their weights) in the shortest path connecting them. If a path begins and ends at the same node it is called a cycle. A non-empty graph is called connected if any two of its nodes are connected by a path which is a subset of the graph. A tree is a connected graph which does not contain cycles. The three types of trees considered here are:

1. Unidirectional minimal length path (UMLP). Algorithm: The chain begins with an arbitrarily chosen node. The node with the shortest distance to the first node is attached to the chain and becomes the end of the chain. One searches for the node closest to this end, attaches it, and so on. Nodes are attached only to the end of the chain. The UMLP graph does not contain loops and the node degree does not exceed two.

2. Bidirectional minimal length path (BMLP). Algorithm: First the pair of closest neighbours is searched for and becomes the seed with the first two ends of the chain. Then the two nodes closest to each end of the chain are searched for, and the one with the shortest distance to one of the ends is attached and becomes the new end on that side of the chain. The procedure is repeated, so the chain grows in two directions. The BMLP graph does not contain loops and the node degree does not exceed two.

3. Minimal spanning tree (MST). The chain methods presented above can be considered as a simplification of the minimal spanning tree (MST). In a connected weighted graph of n objects, the MST is a tree having n − 1 edges that minimize the sum of the distances between nodes. Algorithm: Several algorithms for generating an MST can be found in the literature. Prim's and Kruskal's algorithms (Kruskal, 1957; Prim,
1957) are the best known. Here Kruskal's algorithm will be used. The main idea of the algorithm is to repeatedly add the next shortest edge that does not produce a cycle. Assume that the distance matrix has been constructed according to one of the above distance definitions. In order to speed up the algorithm, the distances should be sorted in increasing order; the algorithm starts from the pair of closest neighbours. Then one searches for the pair of nodes with the next-smallest distance. At this moment there are two possibilities: if the two pairs have a common vertex, they are joined; if they are separate, they are considered, for the moment, as separate regions of the MST. The process is continued, rejecting only those pairs whose nodes are already connected on the tree (they have been attached in previous steps). The procedure ends with a connected graph without loops; a code sketch of this construction is given at the end of this section. The MST is used if one is interested in the detailed hierarchical structure of the analyzed subject, while BMLP and UMLP are useful if general properties of the given set are under investigation, such as the diameter of the graph, the average distance between nodes, etc. The UMLP and BMLP algorithms are faster than MST, so the chain structures are especially useful if the problem requires constructing a huge number of networks (e.g. in the analysis of the time evolution of a system). On the other hand, UMLP is suggested if there is a noticeable item which deserves special treatment, e.g. a particular stock among a set of stocks can be used as the specific seed of the UMLP chain.
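The sketch announced above: a base-R implementation of Kruskal's algorithm operating directly on a symmetric distance matrix. The function name mst_kruskal is illustrative and does not refer to the STFdmm quantlets; the chain graphs (UMLP, BMLP) can be coded analogously by keeping track of the chain ends instead of the components.

# Kruskal's algorithm: returns the n-1 MST edges of a symmetric distance matrix D
# as a data frame (i, j, d), in the order in which they are attached to the tree.
mst_kruskal <- function(D) {
  n <- nrow(D)
  labels <- rownames(D)
  if (is.null(labels)) labels <- as.character(1:n)

  # all node pairs (upper triangle), sorted by increasing distance
  pairs <- which(upper.tri(D), arr.ind = TRUE)
  pairs <- pairs[order(D[pairs]), , drop = FALSE]

  comp  <- 1:n                                   # component label of every node
  edges <- data.frame(i = character(0), j = character(0), d = numeric(0))
  for (k in seq_len(nrow(pairs))) {
    a <- pairs[k, 1]
    b <- pairs[k, 2]
    if (comp[a] != comp[b]) {                    # accept: this edge closes no cycle
      edges <- rbind(edges,
                     data.frame(i = labels[a], j = labels[b], d = D[a, b]))
      comp[comp == comp[b]] <- comp[a]           # merge the two components
      if (nrow(edges) == n - 1) break            # the spanning tree is complete
    }
  }
  edges
}

Applied to a UD matrix such as the one in Table 8.1, the routine produces an attachment sequence analogous to the one reported in Table 8.2.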
8.4 Examples

8.4.1 Structure of stock markets
In this example the classical analysis of a stock market structure is presented. It consists of the calculation of the UD distance matrix and the construction of the appropriate MST. Since Mantegna (1999) it has been shown that the method can point to the clustering of companies and to changes on the market, e.g. during crashes (Onnela et al., 2003). Consider a data set consisting of daily closure prices, obtained from Yahoo, for 20 stocks listed in the S&P 500 index. The considered time period extends from 02.01.2009 to 30.04.2010, including a total of 334 price quotes per stock. The following stocks have been selected: ABB Ltd. (ABB), Apple Inc. (AAPL), Boeing
Co. (BA), the Coca-Cola Company (KO), Emerson Electric Co. (EMR), General Electric Co. (GE), Hewlett-Packard Company (HPQ), Hitachi Ltd. (HIT), IBM (IBM), Intel Corporation (INTC), Johnson & Johnson (JNJ), Lockheed Martin Corporation (LMT), Microsoft Co. (MSFT), Northrop Grumman Corporation (NOC), Novartis AG (NVS), Colgate-Palmolive Co. (CL), Pepsico Inc. (PEP), Procter & Gamble Co. (PG), Tower Semiconductor Ltd. (TSEM), Wisconsin Energy Corporation (WEC). A preliminary step in the MST construction is the calculation of the log returns of the stocks:

Y_i(t) = \log \frac{p_i(t)}{p_i(t-1)},   (8.28)

where p_i(t) is the price of asset i at time t. Next, the UD distance matrix is calculated using (8.6). The obtained matrix is presented in Table 8.1. Since the distance matrix is symmetric, only the upper triangular part is presented. Finally, the MST is constructed using Kruskal's algorithm (Kruskal, 1957). The sequence of node attachments is presented in Table 8.2, while the obtained MST is reported in Figure 8.5. On the MST graph, Figure 8.5, a clustering of companies can be noticed – the companies belonging to the same industry form branches of the tree or are attached to the same node. Here the following industry clusters can be distinguished: (i) food industry: PG, PEP, KO, JNJ; (ii) computer and electronic industry: AAPL, HPQ and INTC, TSEM, MSFT; (iii) aircraft industry: BA, NOC, LMT. Therefore one can observe that the method allows one to properly classify companies into intuitively appropriate clusters. It also provides information about potential leaders: e.g. the node with the highest number of connected nodes is EMR, which is a diversified global technology company. This example already illustrates that the UD may group companies in an interesting way. However, the range of potential applications of the method is much wider. Many interesting features can be tackled by applying methods of graph theory, such as calculating node degree distributions (Bonanno et al., 2003; Górski et al., 2008), observing clique formation (Tumminello et al., 2005; Onnela et al., 2004), etc. The above discussed example shows the state of the system at one given moment of time, but the time window can also be changed, so that the evolution of the network can be monitored. A Hamiltonian-like description can follow (see e.g. Gligor and Ausloos, 2008; Ausloos and Gligor, 2008).
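Assuming the downloaded closing prices are stored in a numeric matrix prices with one column per ticker (the data retrieval itself is omitted here), the log returns (8.28), the UD matrix (8.6) and the MST can be obtained with a few lines of R; ud() and mst_kruskal() are the illustrative helpers sketched earlier, not the STFdmm quantlets.

# prices: matrix of daily closing prices, one column per ticker, rows in time order
returns <- apply(log(prices), 2, diff)   # log returns, eq. (8.28)

# UD distance matrix (8.6) for all pairs at once: D = sqrt((1 - cor) / 2)
D <- sqrt((1 - cor(returns)) / 2)
round(D[1:5, 1:5], 2)                    # compare with Table 8.1

# MST and the attachment sequence, cf. Table 8.2 and Figure 8.5
tree <- mst_kruskal(D)
tree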
Table 8.1: The UD distance matrix calculated for the considered set of 20 companies listed in the S&P 500 index. The matrix is symmetric; each line below lists a ticker followed by its distances to the tickers ABB, AAPL, BA, KO, EMR, GE, HPQ, HIT, IBM, INTC, JNJ, LMT, MSFT, NOC, NVS, CL, PEP, PG, TSEM, WEC, in this order, up to and including the ticker itself (distance 0).
ABB 0
AAPL 0.47 0
BA 0.47 0.52 0
KO 0.53 0.59 0.55 0
EMR 0.37 0.46 0.44 0.53 0
GE 0.46 0.51 0.49 0.58 0.45 0
HPQ 0.44 0.44 0.49 0.58 0.41 0.52 0
HIT 0.53 0.58 0.60 0.63 0.53 0.56 0.57 0
IBM 0.44 0.46 0.54 0.56 0.45 0.55 0.44 0.59 0
INTC 0.43 0.45 0.52 0.58 0.41 0.48 0.44 0.57 0.46 0
JNJ 0.51 0.55 0.55 0.48 0.50 0.55 0.54 0.61 0.56 0.55 0
LMT 0.59 0.57 0.50 0.57 0.54 0.54 0.57 0.62 0.606 0.59 0.54 0
MSFT 0.50 0.53 0.54 0.58 0.47 0.57 0.46 0.61 0.50 0.44 0.58 0.61 0
NOC 0.53 0.52 0.43 0.53 0.48 0.51 0.53 0.57 0.55 0.54 0.51 0.40 0.52 0
NVS 0.57 0.63 0.60 0.57 0.58 0.58 0.61 0.65 0.59 0.58 0.58 0.59 0.58 0.56 0
CL 0.53 0.60 0.56 0.52 0.54 0.62 0.56 0.61 0.53 0.58 0.52 0.59 0.58 0.56 0.61 0
PEP 0.54 0.59 0.54 0.45 0.55 0.60 0.54 0.63 0.59 0.57 0.50 0.57 0.56 0.54 0.61 0.53 0
PG 0.50 0.56 0.54 0.51 0.47 0.55 0.50 0.56 0.53 0.55 0.46 0.56 0.56 0.51 0.58 0.45 0.49 0
TSEM 0.45 0.49 0.56 0.62 0.48 0.56 0.51 0.60 0.53 0.37 0.62 0.64 0.52 0.59 0.62 0.64 0.62 0.59 0
WEC 0.56 0.58 0.57 0.55 0.54 0.58 0.55 0.65 0.57 0.56 0.51 0.57 0.60 0.55 0.60 0.56 0.57 0.56 0.61 0
As has been shown by Onnela et al. (2002) and Onnela et al. (2003), the network structure and the position of the assets on the MST are closely related to the portfolio properties, i.e. risk and return. If a node is chosen as the central point of the network, then the portfolio risk level can be connected with the distance between nodes. It can be easily understood that stocks included in a low-risk portfolio are located further away from the central node than those included in a high-risk portfolio. Moreover, the characterisation of a stock by its branch and its location within the branch enables one to identify the possibility of interchanging different stocks in the portfolio. In most cases, one can pick stocks from different asset clusters, but from a similar layer (a set of nodes within
Table 8.2: The distances and the sequence of attachments for the MST in the case of the UD distance matrix for 20 arbitrarily selected stocks listed in the S&P 500 index.

step   i      j      DU(i,j)
1      ABB    EMR    0.368
2      INTC   TSEM   0.372
3      LMT    NOC    0.400
4      EMR    HPQ    0.406
5      EMR    INTC   0.410
6      BA     NOC    0.429
7      BA     EMR    0.870
8      INTC   MSFT   0.435
9      ABB    IBM    0.439
10     AAPL   HPQ    0.443
11     EMR    GE     0.447
12     PEP    PG     0.449
13     KO     PEP    0.452
14     JNJ    PG     0.459
15     EMR    PG     0.467
16     KO     JNJ    0.480
17     JNJ    WEC    0.512
18     EMR    HIT    0.528
19     NOC    NVS    0.563

STFdmm01
the same distance from an appropriately chosen centre, the distance being measured as the number of links connecting the node with the centre) and interchange them without altering the portfolio properties. Therefore asset trees provide an intuition-friendly approach to the portfolio optimization problem.
Figure 8.5: The MST network generated in the case of UD for 20 arbitrarily selected stocks listed in the S&P 500 index. The distances and the sequence of attachments are given in Table 8.2.

8.4.2 Dynamics of the network
In this section the time evolution of possible networks made of shares is investigated. The results for the UMLP, BMLP and MST networks and for different distance measures (UD and TD) are compared in order to illustrate the differences between the distances and to show possible application areas of the mentioned network structures. In the first stage of the analysis of the network evolution, an appropriate time window should be chosen. The procedure for finding the network evolution consists of the following steps: having chosen a time window size, the appropriate part of the time series is taken, the network is constructed, and the chosen network characteristics are calculated. Then the time window is shifted by one time unit and the procedure is repeated. In this way, the time evolution of the network characteristics is obtained; a code sketch of this gliding-window procedure is given below. As an example, the stability of correlations within the network will be analysed. The problem is quite important because it is closely related to the robustness of the portfolio optimisation: if the correlations and the structure of the network are stable, one can expect the optimized portfolio to remain valid within the assumed investment horizon.
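The gliding-window procedure can be sketched as follows for the mean MST edge length (the quantity plotted later in Figure 8.8); returns and mst_kruskal() are the illustrative objects introduced in the earlier sketches, and the window conventions are a simplification.

# Mean MST edge length in a gliding window of T observations of the return matrix.
mean_mst_distance <- function(returns, T = 100) {
  ends <- T:nrow(returns)                       # final points of consecutive windows
  sapply(ends, function(t) {
    r <- returns[(t - T + 1):t, ]
    D <- sqrt((1 - cor(r)) / 2)                 # UD matrix (8.6) in the window
    mean(mst_kruskal(D)$d)                      # mean distance between nodes
  })
}

# Example: evolution for T = 100d, cf. Figure 8.8
# m100 <- mean_mst_distance(returns, T = 100); plot(m100, type = "l")

The standard deviation, skewness and kurtosis of the pooled edge lengths, used in Figures 8.6–8.7 and 8.9–8.10, can be accumulated in the same loop.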
The following analysis concentrates on two parameters: (i) the time evolution of the mean distance between nodes, and (ii) the rate of connections between nodes, i.e. how often there was a link between a given pair of nodes. Additionally, the role of the time window length is investigated. In practice, the networks are generated by gliding a fixed time window along the time axis; the mean distance, standard deviation, skewness and kurtosis of the ensemble of links of all generated networks are calculated. The dynamics of the networks analysed for the previously described data set of companies are also compared (to check the robustness of the procedure) with the corresponding analysis performed for 17 companies listed in the WIG20 index of the Warsaw Stock Exchange (GPW) plus the WIG20 index itself, thus 18 time series: WIG20, PEKAO (PEO), PKO BP (PKO), KGHM (KGH), PKN ORLEN (PKN), TPSA (TPS), BZ WBK (BZW), ASSECO POLAND (ACP), CEZ (CEZ), GETIN HOLDING (GTN), GTC (GTC), TVN (TVN), PBG (PBG), POLIMEXMS (PXM), BRE (BRE), LOTOS (LTS), CYFROWY POLSAT (CPS), BIOTON (BIO). The considered time period extends from 05.01.2009 to 30.04.2010, thereby including a total of 334 quotes. Note that the time series are first nonlinearly transformed into the log returns of the stock prices as in (8.28). For every time window T ∈ (20, 323), the ensemble of appropriate networks is thus constructed and then the statistical parameters of the distances between nodes are calculated. The results for the subsets of S&P 500 and WIG20 are presented in Figures 8.6–8.7, respectively. Comparing the plots obtained for MST, BMLP and UMLP, one can observe that the mean distance between nodes and its standard deviation behave similarly. Generally speaking, the plots differ by a constant value. This is due to the different optimisation levels used for establishing the network structures: MST is constructed under the strongest conditions, so the mean and the standard deviation of the distance ensemble are the smallest of all, while the UMLP algorithm generates the networks with the highest values of the mean and the standard deviation. Differences are observed in the case of the skewness and kurtosis. These two parameters measure the asymmetry and peakedness of the distance distributions. The main question at this stage is the choice of the time window length, which is related to the investment horizon. The optimal time window should be long enough to smooth out the noise in the time series, but as short as possible in order not to hide interesting phenomena. Especially interesting are the extreme points of these parameters. Besides, these characteristics give the opportunity to choose the investment horizon. If the skewness is negative, it means that many distances are smaller than the mean value and a high-risk portfolio can be constructed. For positive skewness, more links are longer than
Figure 8.6: Mean distance, standard deviation, skewness and kurtosis of the ensemble of distances between the 20 companies of the chosen subset of S&P 500 in the case of MST, BMLP and UMLP, as a function of the time window size. STFdmm02
the mean distance and there are thus chances to obtain a low-risk portfolio. A similar argument applies to the kurtosis parameter. A positive kurtosis (leptokurtic distribution) denotes a higher peak than expected for a Gaussian distribution; thus the probability of values near the mean is greater than in the Gaussian case, but the tails are also fatter. On the other hand, if the kurtosis is negative (platykurtic distribution), the situation is the opposite: the maximum of the distribution is smaller than for the equivalent Gaussian and the tails are thinner. Considering the above motivations, in the case of the MST networks (Figures 8.6–8.7) the following time windows are suspected to be interesting:

• due to the skewness
Figure 8.7: Mean distance, standard deviation, skewness and kurtosis of the ensemble of distances between the companies of the chosen subset of the WIG20 index in the case of MST, BMLP and UMLP, as a function of the time window size. STFdmm03
– for the S&P 500 subset: T = 34 days (d), T = 99d, T = 119d, T = 158d;
– for the WIG20 subset: T = 21d, T = 76d, T = 90d, T = 221d;

• due to the kurtosis
– for the S&P 500 subset: T = 25d, T = 54d, T = 82d, T = 112d, T = 209d, T = 250d;
– for the WIG20 subset: T = 27d, T = 41d, T = 123d, T = 190d, T = 227d.

Considering the results (Figures 8.6–8.7) obtained for the BMLP network, the interesting time windows are:
• due to the skewness
– for the S&P 500 subset: T = 64d, T = 123d, T = 200d, T = 270d;
– for the WIG20 subset: T = 18d, T = 36d, T = 51d, T = 219d;

• due to the kurtosis
– for the S&P 500 subset: T = 31d, T = 58d, T = 102d, T = 248d;
– for the WIG20 subset: T = 112d, T = 144d, T = 232d.

Since the main purpose is to illustrate the method, and neither to exhaust the reader with a massive analysis nor to repeat arguments, only one case of the mean distance evolution between network nodes is further discussed. Taking into account that one of the time window sizes worth investigating, present in most of the cases above, is ≈ 100 days, T = 100d is chosen for the following discussion. The evolution of the mean distance between companies for the subsets of S&P 500 and WIG20 is presented in Figure 8.8.

Figure 8.8: Time evolution of the mean distance between companies of the chosen subsets of WIG20 and S&P 500; UD, time window T = 100d; UMLP, BMLP and MST networks. STFdmm04

The time given on the bottom axes corresponds to the final point of the considered time window, i.e. the points on the plot correspond to the time interval [t − T − 1, t]. The time evolution of the mean distances is seen to be similar for all considered network structures. The differences between them result from the different optimisation levels used to construct the networks. In the case of the WIG20 companies subset, in the interval 2009.06.01–2009.10.21 the mean distance between nodes is scattered
around the level [0.47 (UMLP), 0.4 (BMLP), 0.39 (MST)], then goes down to the value [0.46 (UMLP), 0.39 (BMLP), 0.38 (MST)], and finally reaches its maximum at the mean distance [0.5 (UMLP), 0.435 (BMLP, MST)] close to the end of the analysed interval. In the case of the companies included in the S&P 500, the type of the studied network does not influence the results significantly. During the first 30 days of the analysis, the mean distance remains at the level [0.45 (UMLP), 0.5 (BMLP), 0.4 (MST)]; then, during the next 50 days, the mean distance rises up to [0.5 (UMLP, BMLP), 0.47 (MST)]; only for the UMLP network a short peak is observed at 2009.03.30 (up to 0.56). The mean distance remains at this level for 75 days and then decreases to the value [0.465 (UMLP, BMLP), 0.44 (MST)]. Summing up the above analysis, the mean distance between nodes (thus the size of the network) remains at roughly the same level. Thus one can expect that the network structure does not change much during the evolution. In order to verify this hypothesis, the rate of connections between nodes is calculated; it is presented in Tables 8.3–8.5. The MST network allows one to categorize shares into branches as well as to investigate the clustering of nodes. Therefore, among the networks considered here and from the point of view of structure analysis, MST is the most interesting network (Table 8.2). The rate of connections shows that most of the links are stable (rate greater than 5%) while the consecutive networks are generated in the moving time window. The results of the time evolution and the frequency of connections show that, for an MST network, one can perform a portfolio optimization procedure considering relatively long investment horizons. The rates of connections between the companies of the S&P 500 subset, in the UMLP, BMLP and MST cases, are presented in Tables 8.6–8.8. A slightly different situation is observed for the subset of companies of the S&P 500 index. Only in two cases (LMT–NOC and INTC–TSEM) does the frequency of connections exceed 5%. So if, in the latter case, a portfolio is constructed at some moment of time, its properties (such as being an optimal portfolio) may change in time. The TD analysis is numerically more demanding than UD since in the former method there are two different time windows, whence the number of combinations of possible time window sizes increases significantly. Here only the results for the MST are presented, while the other two network structures are left to the reader as an exercise. As in the already discussed case of UD, at the first stage the statistical parameters of the ensemble of networks generated for the given time window sizes (T1, T2 ∈ [5d, 100d]) are calculated. The results are presented in Figures 8.9 and 8.10. The mean value and its standard deviation
Table 8.3: The rate (in percent) of connections between companies in the case of UD, subset of WIG20, and UMLP; only nonzero values are presented. The table is split into two parts; rows containing only zero elements are omitted. WIG20 WIG20 ACP BIO BRE BZW CEZ CPS GTN
WIG20 ACP BIO BRE CEZ CPS GTN GTC KGH LTS PBG PEO PKN PKO PXM TPS
ACP
BIO
0.934
0.050
BRE 5.882
BZW
CEZ
CPS
GTN
0.227 1.035
0.252 1.591
1.414 0.025 3.989
0.101
4.696
GTC
0.177 0.076 1.742
KGH
LTS
PBG
0.959 1.212 0.278
0.126 0.025 1.035
1.742 1.060 0.025 3.232 0.404 0.833 2.398 0.328 0.151
0.025 1.439 0.025
0.151 0.631 0.606
PEO 3.433
1.490
0.454 1.439 0.050
PKN 1.515 0.379 0.050 0.985
0.278 0.732 2.196 4.772 0.151 0.050
PKO 0.934 0.429 1.616
0.707 0.101 0.783 2.373 4.645 0.076
PXM
TPS
TVN
1.161 0.884 1.868 0.177 0.025 0.606 2.676 0.303 1.212 0.278 0.202 0.505 0.050
1.136 2.651
2.853 0.278 0.252 1.035 0.353 0.050 3.030 1.641 0.480 0.076
1.288 1.994 0.076 0.076 0.480 0.151 0.884
0.076 0.050 0.985
0.833 0.808
STFdmm05
of the distance between nodes have the highest values for small time window sizes (T1, T2 < 10d). In the case of the standard deviation there is a small difference between the set of companies quoted on the New York Stock Exchange (Figure 8.9) and those quoted on the Polish GPW (Figure 8.10). The standard deviation of the distance between nodes for small T1 < 15d and long T2 > 30d takes a low value for the S&P 500 companies, while in the case of the subset of companies quoted on the GPW the standard deviation is relatively high. For the S&P 500 companies, the skewness and kurtosis plots show islands of low values for 40d ≤ T1 ≤ 70d and 35d ≤ T2 ≤ 65d in the case of the skewness, and for 20d ≤ T1 ≤ 40d and 20d ≤ T2 ≤ 70d in the case of the kurtosis. In the case of the subset of WIG20 companies, the highest values of the skewness and kurtosis
Table 8.4: The rate (in percent) of connections between companies in the case of UD, subset of WIG20, and in the BMLP case; only nonzero values are presented. The table is split into two parts for readability; rows containing only zero elements are omitted. WIG20
BRE CEZ CPS LTS
WIG20 5.882
ACP 5.882
BIO 4.847
BRE 5.882
BZW 5.201
CEZ 5.403
CPS 5.882
GTN 5.882
GTC
KGH 5.882
LTS 5.807
PBG 5.882
PEO 5.882 1.035
PKN 5.882
PKO 5.882
PXM 5.882
TPS 5.882
TVN 5.882
0.429
0.682 0.050 0.076
STFdmm06
are observed for the shortest time windows considered in the analysis, that is, for T1 ≈ 5d and T2 ≈ 5d. As in the case of the UD analysis, only one case of the history will be discussed here. In the previous example (UD) the time window T = 100d was chosen. On the other hand, in the case of TD, an island of low values of the skewness and kurtosis can be observed at the centre of the plots in Figure 8.9. Therefore the following time window sizes were chosen: T1 = 50d, T2 = 50d. The evolution of the mean distance between nodes is presented in Figure 8.11. The mean distance between nodes has the highest value for the BMLP network for both considered sets of companies. Analysing the evolution of the mean distance between companies, four stages can be distinguished:

1. the mean distance for the S&P 500 subset is initially decreasing (2009.06.01 – 2009.08.10) for all the networks; however, the mean distance of BMLP is very scattered; for the subset of WIG20 companies a decrease of the mean distance is observed in the period 2009.06.01 – 2009.09.01;

2. then the distances increase for the subset of S&P 500 companies during the next [20 (UMLP, MST), 50 (BMLP)] consecutive time windows; in the case of the WIG20 companies the increase of the mean distance lasts ≈ 60d (UMLP, MST) and ≈ 200d (BMLP);

3. later the mean distance remains on a relatively stable level for the subset of S&P 500 companies; in the case of the WIG20 companies the stable period
Table 8.5: The rate (in percent) of connections between companies in the case of UD, subset of WIG20, MST case; only nonzero values are presented. The table is split into two parts for readability; rows containing only zero elements are omitted. WIG20 ACP BIO BZW CEZ CPS GTN
WIG20 ACP BIO BRE BZW CEZ CPS GTN GTC LTS PKN
WIG20 5.882
ACP 0.909 1.035
BIO 3.433
BRE 5.882
BZW 0.505 0.429
CEZ 0.252
0.278
CPS 5.882
LTS 4.469
PBG 3.585 0.050
PEO 5.882
GTC
0.631 0.050 0.833
KGH 5.882
GTN 5.680
PKN 5.882
PKO 5.882
0.076
1.187
1.490 1.389 0.177
PXM 5.579
TPS 5.882
TVN 5.680
2.954 2.449 2.323 1.363 0.555 1.717
0.151 0.151
0.303 0.555
0.353
0.732
1.414
0.050 0.025 0.126
STFdmm07
spans over 50d;

4. finally, a decrease of the mean distance between nodes is observed in the case of the WIG20 companies.

For the sake of clarity, only the frequency tables of the MST network are presented, in Tables 8.9–8.10. In the rates of connections there are no prevailing links; the rate of connections rarely exceeds 1%. Therefore the mean value of the distances between companies can be treated as the main information of this analysis. It should be stressed that a TD based analysis compares, in terms of information theory (Gray, 1990), the information hidden in the time series and is thus focused on a different aspect of the time series properties than UD. On the other hand, since the Theil index measures how different a time series is from an absolutely random sequence, it detects patterns existing in the time
Table 8.6: The rate (in percent) of connections between companies in the case of UD, subset of S&P 500, UMLP case; only nonzero values are presented. The table is split into two parts for readability; rows containing only zero elements are omitted. ABB
AAPL
ABB AAPL BA KO EMR GE HPQ HIT IBM
ABB AAPL BA KO GE HPQ HIT IBM INTC JNJ LMT MSFT NOC NVS CL PEP PG TSEM
BA 0.091 0.136
KO
EMR 4.515
GE 2.087 1.679 1.248
HPQ 1.180 2.972 0.340
0.068
0.681 0.408
HIT 0.159 0.159 0.181 1.157
IBM 0.023 2.087
INTC 0.930 1.157 0.091
0.113 0.386 1.157 0.068
0.318 0.454 1.520
JNJ
LMT
2.064 0.068
2.427 0.499 0.590
0.113 1.225 0.045
0.318 0.023 0.318
MSFT 0.136 0.408 0.386 1.951 1.838 1.202 0.363 0.091 0.023
NOC 0.544 2.064 0.159 0.045 0.023 0.386 0.091 4.356 1.452
NVS 0.499 0.068 0.250 0.386 0.998 0.136 1.611 0.749 0.091 0.068 0.023 1.633 0.045
CL
PEP
PG
0.045 1.270 0.340
1.180 4.877 0.159
0.295 0.340
0.499 0.839 0.023 0.318 0.454 0.113 0.091 0.181
0.045 0.068 0.045 0.023 0.885 0.295 0.522 0.159 1.520
0.930 0.045 0.204 3.426 0.227 0.159 0.454 4.220 0.159
TSEM 0.658 2.110 0.023 0.136 0.023 0.272 0.454 0.613 4.946 0.159 0.091 0.408 0.340
WEC
1.679 0.499 0.613 0.340 0.250 0.136 2.518 0.068 0.340 0.340 2.019 0.544 0.590
0.068 0.227
STFdmm08
series. Therefore, TD shows the difference between time series in terms of the patterns existing in them. Such patterns in the price evolution might be connected with the behaviour of investors, so the results of a TD analysis could prompt a detailed analysis of the stock market players' decisions.
Table 8.7: The rate (in percent) of connections between companies in the case of UD, subset of S&P 500, BMLP case; only nonzero values are presented. The table is split into two parts for readability; rows containing only zero elements are omitted. ABB ABB AAPL BA KO EMR GE HPQ HIT IBM
ABB AAPL BA KO EMR GE HPQ HIT IBM INTC JNJ LMT MSFT NOC NVS CL PG
AAPL 0.363
BA 0.023
KO 0.998
EMR 2.609 2.246 2.586 1.611
GE 0.794
1.815
JNJ 0.227
LMT
2.382
2.609
0.250
0.272
0.023 1.112
0.817
MSFT 1.180 0.318 0.068 1.429 0.930 0.136 0.181 1.701 0.658 0.658
NOC 0.635 0.613 0.613 0.613 3.199 0.613 1.384 0.613 0.613 1.565 0.613 0.613 0.771
HPQ 0.930 0.930 0.613 0.839 3.539 0.930
HIT 2.314 0.091
0.295
IBM 0.363
0.023 2.246
0.681
0.749 0.023
TSEM 3.153 0.590 0.068
NVS 2.314 0.113
CL 1.407
PEP 0.794
PG 0.091
0.295
1.202
1.815
2.518
0.136
0.544
0.658
0.340
0.023 0.658
0.023 0.749
1.112
0.023 0.658
0.113 0.658 0.499
0.363 0.613
0.272 0.613
0.567 0.613
INTC 2.495 0.613 1.044 1.112 1.792 1.112 1.452 0.635 1.112
WEC
2.609 0.749 0.476 1.112 0.295 0.091 0.771 0.454 0.363 0.454
0.907 0.023 1.112 0.250 0.363
STFdmm09
8.5 Summary
In this chapter, the distance matrix methods have been presented. The reader's attention has been drawn to the most popular methods of time series comparison: the Manhattan Distance and the Ultrametric Distance. These two distances have been compared and it has been shown that, although the UD is very popular in various investigations, the MD should also be considered as an important research and analysis tool. It should be stressed that the MD is more robust to the noise influence than the UD. However, since the UD origin
Table 8.8: The rate (in percent) of connections between companies in the case of UD, subset of S&P 500, MST case; only nonzero values are presented. The table is split into two parts for readability; rows containing only zero elements are omitted. ABB
AAPL
ABB AAPL BA KO EMR GE HPQ HIT IBM
KO 0.068
EMR 4.696 1.225 3.131
GE 1.180
HIT 2.064
IBM 0.363 1.384
INTC 1.747 0.975
0.181
HPQ 0.907 3.199 0.091
3.267
4.492
0.635 0.590
1.248
0.522 0.635 1.565 1.86 0.635
1.407
JNJ ABB AAPL BA KO EMR GE HPQ HIT IBM INTC JNJ LMT MSFT NOC NVS CL PEP
BA 0.023
LMT
MSFT
0.363 0.998 2.359
NOC 0.771
0.113
4.741 0.703 0.318
3.766
0.045
NVS 2.246
CL
PEP
PG
0.113
4.424
0.771 2.813 1.361
TSEM 0.522 0.318
WEC
1.021 2.269 0.068
0.476
5.263
0.250 2.291
0.113 0.476
0.113 0.862 0.045
0.544
0.340 0.975 2.904
5.263 1.452
0.454 0.181 0.703
0.613 0.227 0.771
0.159 0.635 5.263 1.293
0.136
STFdmm10
lies in portfolio optimization methods, this distance is particularly useful in the process of portfolio construction. The analysis of UD and the investigation of the appropriate MST network may simplify and speed up the process of portfolio construction, particularly the choice of shares which should be included in the portfolio. The analysis of the time evolution of the distance between time series was presented in Section 8.4.2. Since most financial (and, generally, economic) time series are not stationary, the question about the time evolution of the distance between time series seems to be well motivated. The methods presented
Figure 8.9: Mean distance, standard deviation, skewness and kurtosis of the ensemble of distances between nodes of the chosen subset of S&P 500 as a function of the time window sizes; MST network. STFdmm11
in Section 8.4.2 are based on the classical idea of the moving time window. The procedure of calculating the distance matrix and constructing the chosen network is repeated in a moving time window. The main problem of such an analysis is the huge amount of data generated. In practice it is impossible to analyse in detail a few hundred (or thousand) generated networks. Therefore one has to focus on chosen parameters such as the mean distance, the diameter of the network, the node degree distribution, etc. Moreover, the time required for the analysis depends on the network type used. Although theoretically any of the networks can be used, in a real situation the computation time cannot be extended to infinity and the problem of the network choice is important. In this context, three different networks have been presented to the reader: UMLP, BMLP and MST. The UMLP network requires the least demanding algorithm; however, its structure
Figure 8.10: Mean distance, standard deviation, skewness and kurtosis of the ensemble of distances between nodes of the chosen subset of companies of the WIG20 index as a function of the time window sizes; MST network. STFdmm12
strongly depends on the choice of the seed (made “by hand” by the researcher). Of course, in some situations this can be an advantage, depending on the question posed. The second network, BMLP, requires more computer time than UMLP but less than MST. Moreover, the results obtained by constructing BMLP and MST are quite similar (see Section 8.4.2), so the suggested strategy is the following: first perform the analysis of the time evolution of the distance using BMLP and then, for the chosen cases, perform the calculations using the MST network. The last problem discussed in this chapter is the entropy based distance, in particular the Theil index based distance (TD). The TD measure can be particularly interesting in the analysis of the stock market or investors'
Figure 8.11: Time evolution of the mean distance between nodes of the chosen subsets of companies of the WIG20 and S&P 500 indexes in the case of TD; UMLP, BMLP and MST networks, time windows T1 = 50d, T2 = 50d. STFdmm13
behaviour, because this distance is focused on the search for structures hidden in the investigated time series and on the comparison of the differences between the information measures of the time series. The results of TD may provide interesting information, which may be difficult to find by other methods.
Table 8.9: The rate (in percent) of connections between companies in the case of TD, subset of S&P 500, MST; only nonzero values are presented. The table is split into two parts for readability; rows containing only zero elements are omitted. ABB
AAPL 0.204
BA 0.817 0.091
KO 0.658 0.204 0.136
EMR 0.272 0.885 0.227 0.885
GE 0.499 0.590 0.272 1.747 0.635
HPQ 0.159 0.476 0.771 0.544 0.340 0.771
HIT 0.113 1.066 0.386 0.567 0.862 1.021 0.703
IBM 1.338 0.794 0.204 0.091 0.136 0.068 0.975 0.953
INTC 0.068 1.180 0.613 0.272 2.042 0.726 0.181 0.590 0.227
JNJ 0.975 0.136 0.749 1.112 0.318 0.975 0.522 0.181 0.091 0.408
LMT 0.749 0.544 0.363 0.045 0.386 0.023 1.112 0.703 1.543 0.091 0.613
MSFT
NOC 0.703 0.953 0.272 0.227 0.408 0.522 1.157 1.044 0.250 0.204 0.522 0.181 0.181
NVS 0.476 0.227 1.021 0.204 0.703 0.340
CL 0.023 0.522 0.113 0.590 0.136 0.386 0.522 0.635 0.658 0.091 0.930 0.113 2.246 0.658 0.113
PEP 0.272 0.159 0.454 0.181 0.408 0.318 0.567 0.930 0.862 0.953 0.250 0.907 0.794 0.749 0.386 0.930
PG 0.295 0.250 0.363 1.248 0.613 0.703 0.681 0.363 0.045 0.159 0.091 0.408 2.223 0.045 0.045 0.295 0.272
TSEM 1.157 0.250 1.270 0.431 0.091 0.476 0.613
WEC 0.885 1.134 1.724 0.181 0.408 0.386 0.318 0.068 0.567 0.862 0.045 0.930 0.159 0.499 0.272 0.318 0.907 0.045 0.703
ABB AAPL BA KO EMR GE HPQ HIT IBM
ABB AAPL BA KO EMR GE HPQ HIT IBM INTC JNJ LMT MSFT NOC NVS CL PEP PG TSEM
0.862 0.204 1.180 0.045 0.068 0.113 0.159 0.318 0.726 0.476 0.091
0.181 0.658 0.590 0.862 1.157 0.635
0.749 0.159 0.499 0.567 0.817 1.270 0.227 0.136
STFdmm14
Table 8.10: The rate (in percent) of connections between companies in the case of TD, subset of WIG20, MST; only nonzero values are presented. The table is split into two parts for readability; rows containing only zero elements are omitted. WIG20 WIG20 ACP BIO BRE BZW CEZ CPS GTN
WIG20 ACP BIO BRE BZW CEZ CPS GTN GTC KGH LTS PBG PEO PKN PKO PXM TPS
ACP 0.025
KGH 1.691 0.126
LTS 0.429 1.262
1.035 1.591 0.227 0.606 0.379 0.202
0.530 0.732 0.454 0.328 0.278 0.151 0.076
BIO
PBG 0.757 0.884 1.136 0.101 0.050 1.439 0.076 1.464 0.505 0.328 1.414
BRE 0.505 1.363
BZW 1.464 1.086
CEZ 0.227 0.278
0.909
0.581 0.252
PEO 1.591 0.379
PKN 1.641 0.076
0.808 0.379 0.177 0.429 0.555
1.086 0.227 1.893
PKO 0.126 1.591 0.177 0.757 0.581 0.303 0.707 0.328 1.464
0.303 0.505 1.111
0.732 0.202 1.792 2.499 0.379
0.985 0.707 1.187
CPS 0.404 0.808 0.076 1.060 1.439 0.707
GTN 0.682 0.404 1.540 0.732 0.227 1.490 0.227
GTC 0.126 1.161 1.363 0.025 0.177 1.035 1.591 0.934
PXM 1.086 0.404
TPS 1.010 0.454 0.581 0.783 0.985 0.101 1.464 0.505 0.025 0.227 0.656 0.480 2.121 0.177 1.818 0.353
TVN
0.454 0.858 0.328 0.884 0.379 0.707 0.959 0.656 0.985 1.490 0.076 0.606
1.464 1.212 0.858 0.808 0.480 0.959 0.909 1.565 0.379 0.252 0.328 0.353 0.429 1.540 0.025
STFdmm15
Bibliography

Adams, A., Booth, P., Bowie, D. and Freeth, D. (2003). Investment Mathematics, Vol. 100 of Wiley Investment Classics, John Wiley & Sons.
Akita, T. (2003). Decomposing regional income inequality in China and Indonesia using two-stage nested Theil decomposition method, The Annals of Regional Science 37: 55–77.
Araujo, T. and Louca, F. (2007). The geometry of crashes. A measure of the dynamics of stock market crises, Quantitative Finance 7: 63–74.
Ausloos, M. and Gligor, M. (2008). Cluster Expansion Method for Evolving Weighted Networks Having Vector-like Nodes, Acta Phys. Pol. A 114: 491–499.
Bonanno, G., Caldarelli, G., Lillo, F. and Mantegna, R. N. (2003). Topology of correlation-based minimal spanning trees in real and model markets, Phys. Rev. E 68(4): 046130.
Bonanno, G., Lillo, F. and Mantegna, R. N. (2001). High-frequency cross-correlation in a set of stocks, Quantitative Finance 1: 96–104.
Bonanno, G., Lillo, F. and Mantegna, R. N. (2001). Levels of complexity in financial markets, Phys. A 299(1-2): 16–27.
Bondy, J. A. and Murty, U. R. S. (2008). Graph Theory, Springer.
Boss, M., Elsinger, H., Summer, M. and Thurner, S. (2004). Network topology of the interbank market, Quantitative Finance 4: 677–684.
Brigo, D. and Liinev, J. (2005). On the distributional distance between the lognormal Libor and swap market models, Quantitative Finance 5: 433–442.
Conceicao, P., Galbraith, J. K. and Bradford, P. (2000). The Theil Index in Sequences of Nested and Hierarchic Grouping Structures: Implications for the Measurement of Inequality through Time with Data Aggregated at Different Levels of Industrial Classification.
Cuthberson, K. and Nitzsche, D. (2001). Investments: Spot and Derivative Markets, John Wiley & Sons, Ltd.
Deza, E. and Deza, M. (2006). Dictionary of Distances, Elsevier.
Focardi, S. M. and Fabozzi, F. J. (2004). A methodology for index tracking based on time-series clustering, Quantitative Finance 4: 417–425.
Gligor, M. and Ausloos, M. (2007). Cluster structure of EU-15 countries derived from the correlation matrix analysis of macroeconomic index fluctuations, Eur. Phys. J. B 57: 139–146.
Gligor, M. and Ausloos, M. (2008). Convergence and cluster structures in EU area according to fluctuations in macroeconomic indices, Journal of Economic Integration 23: 297–330.
Gray, R. M. (1990). Entropy and Information Theory, Springer-Verlag New York, Inc.
Górski, A., Drożdż, S. and Kwapień, J. (2008). Scale free effects in world currency exchange network, Eur. Phys. J. B 66: 91–96.
Gross, J. and Yellen, J. (2006). Graph Theory and Its Applications, CRC Press.
Kruskal, J. B. (1957). On the Shortest Spanning Subtree of a Graph and the Traveling Salesman Problem, Proceedings of the American Mathematical Society, Vol. 7, pp. 48–50.
Levis, T. G. (2009). Network Science. Theory and Applications, John Wiley & Sons.
Liu, J., K., C. and He, K. (2009). Fierce stock market fluctuation disrupts scale-free distribution, Quantitative Finance.
Mantegna, R. N. (1999). Hierarchical structure in financial markets, Eur. Phys. J. B 11(1): 193–197.
Mantegna, R. N. and Stanley, H. E. (2000). An Introduction to Econophysics: Correlations and Complexity in Finance, Cambridge University Press.
Miśkiewicz, J. (2008). Globalization – Entropy unification through the Theil index, Phys. A 387: 6595–6604.
Miśkiewicz, J. and Ausloos, M. (2006a). An Attempt to observe Economy Globalization: The Cross Correlation Distance Evolution of the Top 19 GDP's, Int. J. Mod. Phys. C 17(3): 317–331.
Miśkiewicz, J. and Ausloos, M. (2006b). G7 country Gross Domestic Product (GDP) time correlations. A graph network analysis, in H. Takayasu (ed.), Practical Fruits of Econophysics, Springer Verlag.
Miśkiewicz, J. and Ausloos, M. (2008a). Correlation measure to detect time series distances, whence economy globalization, Phys. A 387: 6584–6594.
Miśkiewicz, J. and Ausloos, M. (2008b). Correlation measure to detect time series distances, whence economy globalization, Phys. A 387: 6584–6594.
Onnela, J. P., Chakraborti, A., Kaski, K. and Kertész, J. (2002). Dynamic asset trees and portfolio analysis, Eur. Phys. J. B 30: 285–288.
Onnela, J. P., Chakraborti, A., Kaski, K. and Kertész, J. (2003). Dynamic asset trees and black Monday, Phys. A 324(1-2): 247–252.
Onnela, J.-P., Chakraborti, A., Kaski, K., Kertész, J. and Kanto, A. (2003). Dynamics of market correlations: Taxonomy and portfolio analysis, Phys. Rev. E 68(5): 056110.
Onnela, J. P., Kaski, K. and Kertész, J. (2004). Clustering and information in correlation based financial networks, Eur. Phys. J. B 38: 353–362.
Prim, R. C. (1957). Shortest connection networks and some generalizations, Bell System Technical Journal 36: 1389–1401.
Shorrocks, A. F. (1980). The Class of Additively Decomposable Inequality Measures, Econometrica 48: 613–626.
Theil, H. (1965). The Information Approach to Demand Analysis, Econometrica 33: 67–87.
Tumminello, M., Aste, T., Di Matteo, T., Mantegna, R. N. and Stanley, H. E. (2005). A Tool for Filtering Information in Complex Systems, Proceedings of the National Academy of Sciences of the United States of America 102: 10421–10426.
Part II
Insurance
9 Building loss models Krzysztof Burnecki, Joanna Janczura, and Rafal Weron
9.1 Introduction

A loss model or actuarial risk model is a parsimonious mathematical description of the behavior of a collection of risks constituting an insurance portfolio. It is not intended to replace sound actuarial judgment. In fact, according to Willmot (2001), a well formulated model is consistent with and adds to intuition, but cannot and should not replace experience and insight. Moreover, a properly constructed loss model should reflect a balance between simplicity and conformity to the data since overly complex models may be too complicated to be useful.

A typical model for insurance risk, the so-called collective risk model, treats the aggregate loss as having a compound distribution with two main components: one characterizing the frequency (or incidence) of events and another describing the severity (or size or amount) of gain or loss resulting from the occurrence of an event (Kaas et al., 2008; Klugman, Panjer, and Willmot, 2008; Tse, 2009). The stochastic nature of both components is a fundamental assumption of a realistic risk model. In classical form it is defined as follows. If $\{N_t\}_{t\ge 0}$ is a process counting claim occurrences and $\{X_k\}_{k=1}^{\infty}$ is an independent sequence of positive independent and identically distributed (i.i.d.) random variables representing claim sizes, then the risk process $\{R_t\}_{t\ge 0}$ is given by
$$R_t = u + c(t) - \sum_{i=1}^{N_t} X_i. \qquad (9.1)$$
The non-negative constant u stands for the initial capital of the insurance company and the deterministic or stochastic function of time c(t) for the premium from sold insurance policies. The sum $\sum_{i=1}^{N_t} X_i$ is the so-called aggregate claim process, with the number of claims in the interval (0, t] being
modeled by the counting process $N_t$. Recall that the latter is defined as $N_t = \max\{n : \sum_{i=1}^{n} W_i \le t\}$, where $\{W_i\}_{i=0}^{\infty}$ is a sequence of positive random variables and $\sum_{i=1}^{0} W_i \equiv 0$. In the insurance risk context $N_t$ is also referred to as the claim arrival process. The collective risk model is often used in health insurance and in general insurance, whenever the main risk components are the number of insurance claims and the amount of the claims. It can also be used for modeling other non-insurance product risks, such as credit and operational risk (Chernobai, Rachev, and Fabozzi, 2007; Panjer, 2006). In the former, for example, the main risk components are the number of credit events (either defaults or downgrades) and the amount lost as a result of the credit event. The simplicity of the risk process defined in eqn. (9.1) is only illusory. In most cases no analytical conclusions regarding the time evolution of the process can be drawn. However, it is this evolution that is important for practitioners, who have to calculate functionals of the risk process like the expected time to ruin and the ruin probability, see Chapter 10. The modeling of the aggregate claim process consists of modeling the counting process $\{N_t\}$ and the claim size sequence $\{X_k\}$. Both processes are usually assumed to be independent, hence can be treated independently of each other (Burnecki, Härdle, and Weron, 2004). Modeling of the claim arrival process $\{N_t\}$ is treated in Section 9.2, where we present efficient algorithms for four classes of processes. Modeling of claim severities and testing the goodness-of-fit is covered in Sections 9.3 and 9.4, respectively. Finally, in Section 9.5 we build a model for the Danish fire losses dataset, which concerns major fire losses in profits that occurred between 1980 and 2002 and were recorded by Copenhagen Re.
9.2 Claim arrival processes

In this section we focus on efficient simulation of the claim arrival process $\{N_t\}$. This process can be simulated either via the arrival times $\{T_i\}$, i.e. moments when the ith claim occurs, or the inter-arrival times (or waiting times) $W_i = T_i - T_{i-1}$, i.e. the time periods between successive claims (Burnecki and Weron, 2005). Note that in terms of the $W_i$'s the claim arrival process is given by $N_t = \sum_{n=1}^{\infty} I(T_n \le t)$. In what follows we discuss four examples of $\{N_t\}$, namely the classical (homogeneous) Poisson process, the non-homogeneous Poisson process, the mixed Poisson process and the renewal process.
9.2.1 Homogeneous Poisson process (HPP)

The most common and best known claim arrival process is the homogeneous Poisson process (HPP). It has stationary and independent increments and the number of claims in a given time interval is governed by the Poisson law. While this process is normally appropriate in connection with life insurance modeling, it often suffers from the disadvantage of providing an inadequate fit to insurance data in other coverages with substantial temporal variability.

Formally, a continuous-time stochastic process $\{N_t : t \ge 0\}$ is a (homogeneous) Poisson process with intensity (or rate) λ > 0 if (i) $\{N_t\}$ is a counting process, and (ii) the waiting times $W_i$ are independent and identically distributed and follow an exponential law with intensity λ, i.e. with mean 1/λ (see Section 9.3.2). This definition naturally leads to a simulation scheme for the successive arrival times $T_i$ of the Poisson process on the interval (0, t]:

Algorithm HPP1 (Waiting times)
Step 1: set $T_0 = 0$
Step 2: generate an exponential random variable E with intensity λ
Step 3: if $T_{i-1} + E < t$ then set $T_i = T_{i-1} + E$ and return to step 2, else stop

Sample trajectories of homogeneous (and non-homogeneous) Poisson processes are plotted in the top panels of Figure 9.1. The thin solid line is a HPP with intensity λ = 1 (left) and λ = 10 (right). Clearly the latter jumps more often.

Alternatively, the homogeneous Poisson process can be simulated by applying the following property (Rolski et al., 1999). Given that $N_t = n$, the n occurrence times $T_1, T_2, \ldots, T_n$ have the same distribution as the order statistics corresponding to n i.i.d. random variables uniformly distributed on the interval (0, t]. Hence, the arrival times of the HPP on the interval (0, t] can be generated as follows:

Algorithm HPP2 (Conditional theorem)
Step 1: generate a Poisson random variable N with intensity λt
Step 2: generate N random variables $U_i$ distributed uniformly on (0, 1), i.e. $U_i \sim U(0,1)$, $i = 1, 2, \ldots, N$
Step 3: set $(T_1, T_2, \ldots, T_N) = t \cdot \mathrm{sort}\{U_1, U_2, \ldots, U_N\}$

In general, this algorithm will run faster than HPP1 as it does not involve a loop. The only two inherent numerical difficulties involve generating a Poisson
Figure 9.1: Top left panel : Sample trajectories of a NHPP with linear intensity λ(t) = a + b · t. Note that the first process (with b = 0) is in fact a HPP. Top right panel : Sample trajectories of a NHPP with periodic intensity λ(t) = a + b · cos(2πt). Again, the first process is a HPP. Bottom panel : Intensity of car accident claims in Greater Wroclaw area, Poland, in the period 1998-2000 (data from one of the major insurers in the region). Note, the larger number of accidents in late Fall/early Winter due to worse weather conditions. STFloss01
random variable and sorting a vector of occurrence times. Whereas the latter problem can be solved, for instance, via the standard quicksort algorithm implemented in most statistical software packages (like sortrows.m in Matlab), the former requires more attention.
A straightforward algorithm to generate a Poisson random variable would take
$$N = \min\{n : U_1 \cdot \ldots \cdot U_n < \exp(-\lambda)\} - 1, \qquad (9.2)$$
which is a consequence of the properties of the HPP (see above). However, for large λ, this method can become slow as the expected run time is proportional to λ. Faster, but more complicated methods are available. Ahrens and Dieter (1982) suggested a generator which utilizes acceptance-complement with truncated normal variates for λ > 10 and reverts to table-aided inversion otherwise. Stadlober (1989) adapted the ratio of uniforms method for λ > 5 and classical inversion for small λ's. Hörmann (1993) advocated the transformed rejection method, which is a combination of the inversion and rejection algorithms. Statistical software packages often use variants of these methods. For instance, Matlab's poissrnd.m function uses the waiting time method (9.2) for λ < 15 and Ahrens' and Dieter's method for larger values of λ.

Finally, since for the HPP the expected value of the process is $E(N_t) = \lambda t$, it is natural to define the premium function as $c(t) = ct$, where $c = (1+\theta)\mu\lambda$ and $\mu = E(X_k)$. The parameter θ > 0 is the relative safety loading which "guarantees" survival of the insurance company. With such a choice of the premium function we obtain the classical form of the risk process:
$$R_t = u + (1+\theta)\mu\lambda t - \sum_{i=1}^{N_t} X_i. \qquad (9.3)$$
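A minimal R sketch of Algorithms HPP1 and HPP2 (R being one of the languages used for the quantlets accompanying this book); the intensity and horizon in the example call are purely illustrative.

```r
# Algorithm HPP1: successive arrival times via exponential waiting times
simHPP1 <- function(lambda, t) {
  arrivals <- numeric(0)
  T.i <- rexp(1, rate = lambda)
  while (T.i < t) {
    arrivals <- c(arrivals, T.i)
    T.i <- T.i + rexp(1, rate = lambda)
  }
  arrivals
}

# Algorithm HPP2: conditional (order statistics) method
simHPP2 <- function(lambda, t) {
  N <- rpois(1, lambda * t)     # number of claims in (0, t]
  t * sort(runif(N))            # arrival times as sorted, rescaled uniforms
}

set.seed(1)
simHPP2(lambda = 10, t = 1)
```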
9.2.2 Non-homogeneous Poisson process (NHPP) The choice of a homogeneous Poisson process implies that the size of the portfolio cannot increase or decrease. In addition, it cannot describe situations, like in motor insurance, where claim occurrence epochs are likely to depend on the time of the year (worse weather conditions in Central Europe in late Fall/early Winter lead to more accidents, see the bottom panel in Figure 9.1) or of the week (heavier traffic occurs on Friday afternoons and before holidays). For modeling such phenomena the non-homogeneous Poisson process (NHPP) is much better. The NHPP can be thought of as a Poisson process with a variable (but predictable) intensity defined by the deterministic intensity (or rate) function λ(t). Note that the increments of a NHPP do not have to be stationary. In the special case when the intensity takes a constant value λ(t) = λ, the NHPP reduces to the homogeneous Poisson process with intensity λ.
The simulation of the process in the non-homogeneous case is slightly more complicated than for the HPP. The first approach, known as the thinning or rejection method, is based on the following fact (Bratley, Fox, and Schrage, 1987; Ross, 2002). Suppose that there exists a constant λ such that λ(t) ≤ λ for all t. Let $T_1^*, T_2^*, T_3^*, \ldots$ be the successive arrival times of a homogeneous Poisson process with intensity λ. If we accept the ith arrival time $T_i^*$ with probability $\lambda(T_i^*)/\lambda$, independently of all other arrivals, then the sequence $\{T_i\}_{i=0}^{\infty}$ of the accepted arrival times (in ascending order) forms a sequence of the arrival times of a non-homogeneous Poisson process with the rate function λ(t). The resulting simulation algorithm on the interval (0, t] reads as follows:

Algorithm NHPP1 (Thinning)
Step 1: set $T_0 = 0$ and $T^* = 0$
Step 2: generate an exponential random variable E with intensity λ
Step 3: if $T^* + E < t$ then set $T^* = T^* + E$, else stop
Step 4: generate a random variable U distributed uniformly on (0, 1)
Step 5: if $U < \lambda(T^*)/\lambda$ then set $T_i = T^*$ (→ accept the arrival time)
Step 6: return to step 2

As mentioned in the previous section, the inter-arrival times of a homogeneous Poisson process have an exponential distribution. Therefore steps 2–3 generate the next arrival time of a homogeneous Poisson process with intensity λ. Steps 4–5 amount to rejecting (hence the name of the method) or accepting a particular arrival as part of the thinned process (hence the alternative name). Note that in this algorithm we generate a HPP with intensity λ employing the HPP1 algorithm. We can also generate it using the HPP2 algorithm, which in general is much faster.

The second approach is based on the observation that for a NHPP with rate function λ(t) the increment $N_t - N_s$, 0 < s < t, is distributed as a Poisson random variable with intensity $\tilde\lambda = \int_s^t \lambda(u)\,du$ (Grandell, 1991). Hence, the cumulative distribution function $F_s$ of the waiting time $W_s$ is given by
$$F_s(t) = P(W_s \le t) = 1 - P(W_s > t) = 1 - P(N_{s+t} - N_s = 0) = 1 - \exp\left(-\int_s^{s+t} \lambda(u)\,du\right) = 1 - \exp\left(-\int_0^t \lambda(s+v)\,dv\right).$$
If the function λ(t) is such that we can find a formula for the inverse Fs−1 for each s, we can generate a random quantity X with the distribution Fs by using
the inverse transform method. The simulation algorithm on the interval (0, t], often called the integration method, can be summarized as follows:

Algorithm NHPP2 (Integration)
Step 1: set $T_0 = 0$
Step 2: generate a random variable U distributed uniformly on (0, 1)
Step 3: if $T_{i-1} + F_s^{-1}(U) < t$ then set $T_i = T_{i-1} + F_s^{-1}(U)$ and return to step 2, else stop

The third approach utilizes a generalization of the property used in the HPP2 algorithm. Given that $N_t = n$, the n occurrence times $T_1, T_2, \ldots, T_n$ of the non-homogeneous Poisson process have the same distributions as the order statistics corresponding to n independent random variables distributed on the interval (0, t], each with the common density function $f(v) = \lambda(v)/\int_0^t \lambda(u)\,du$, where $v \in (0, t]$. Hence, the arrival times of the NHPP on the interval (0, t] can be generated as follows:

Algorithm NHPP3 (Conditional theorem)
Step 1: generate a Poisson random variable N with intensity $\int_0^t \lambda(u)\,du$
Step 2: generate N random variables $V_i$, $i = 1, 2, \ldots, N$, with density $f(v) = \lambda(v)/\int_0^t \lambda(u)\,du$
Step 3: set $(T_1, T_2, \ldots, T_N) = \mathrm{sort}\{V_1, V_2, \ldots, V_N\}$

The performance of the algorithm is highly dependent on the efficiency of the computer generator of the random variables $V_i$. Simulation of the $V_i$'s can be done either via the inverse transform method by integrating the density f(v) or via the acceptance-rejection technique using the uniform distribution on the interval (0, t) as the reference distribution. In a sense, the former approach leads to Algorithm NHPP2, whereas the latter one to Algorithm NHPP1.

Sample trajectories of non-homogeneous Poisson processes are plotted in the top panels of Figure 9.1. In the top left panel realizations of a NHPP with linear intensity λ(t) = a + b · t are presented for the same value of parameter a. Note that the higher the value of parameter b, the more pronounced is the increase in the intensity of the process. In the top right panel realizations of a NHPP with periodic intensity λ(t) = a + b · cos(2πt) are illustrated, again for the same value of parameter a. This time, for high values of parameter b the events exhibit a seasonal behavior. The process has periods of high activity (grouped around natural values of t) and periods of low activity, where almost
no jumps take place. Such a process is much better suited to model the seasonal intensity of car accident claims (see the bottom panel in Figure 9.1) than the HPP.

Finally, we note that since in the non-homogeneous case the expected value of the process at time t is $E(N_t) = \int_0^t \lambda(s)\,ds$, it is natural to define the premium function as $c(t) = (1+\theta)\mu \int_0^t \lambda(s)\,ds$. Then the risk process takes the form:
$$R_t = u + (1+\theta)\mu \int_0^t \lambda(s)\,ds - \sum_{i=1}^{N_t} X_i. \qquad (9.4)$$
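A minimal R sketch of the thinning method (Algorithm NHPP1); the periodic intensity in the example mirrors the one shown in Figure 9.1 and the dominating constant is simply its maximum, both chosen for illustration only.

```r
# Algorithm NHPP1: thinning with a dominating intensity lambda.max >= lambda(s) for all s
simNHPP1 <- function(lambda.fun, lambda.max, t) {
  arrivals <- numeric(0)
  T.star <- rexp(1, rate = lambda.max)
  while (T.star < t) {
    if (runif(1) < lambda.fun(T.star) / lambda.max)   # accept with probability lambda(T*)/lambda.max
      arrivals <- c(arrivals, T.star)
    T.star <- T.star + rexp(1, rate = lambda.max)
  }
  arrivals
}

set.seed(1)
lambda.fun <- function(s) 10 + 1 * cos(2 * pi * s)    # periodic intensity as in Figure 9.1
simNHPP1(lambda.fun, lambda.max = 11, t = 2)
```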
9.2.3 Mixed Poisson process

In many situations the portfolio of an insurance company is diversified in the sense that the risks associated with different groups of policy holders are significantly different. For example, in motor insurance we might want to distinguish between male and female drivers or between drivers of different age. We would then assume that the claims come from a heterogeneous group of clients, each one of them generating claims according to a Poisson distribution with the intensity varying from one group to another. Another practical reason for considering yet another generalization of the classical Poisson process is the following. If we measure the volatility of risk processes, expressed in terms of the index of dispersion $\mathrm{Var}(N_t)/E(N_t)$, then often we obtain estimates in excess of one – a value obtained for the homogeneous and the non-homogeneous cases. These empirical observations led to the introduction of the mixed Poisson process (MPP), see Rolski et al. (1999).

In the mixed Poisson process the distribution of $\{N_t\}$ is given by a mixed Poisson distribution (Rolski et al., 1999). This means that, conditioning on an extrinsic random variable Λ (called a structure variable), the random variable $\{N_t\}$ has a Poisson distribution. Typical examples for Λ are two-point, gamma and general inverse Gaussian distributions (Teugels and Vynckier, 1996). Since for each t the claim numbers $\{N_t\}$ up to time t are Poisson variates with intensity Λt, it is now reasonable to consider the premium function of the form $c(t) = (1+\theta)\mu\Lambda t$. This leads to the following representation of the risk process:
$$R_t = u + (1+\theta)\mu\Lambda t - \sum_{i=1}^{N_t} X_i. \qquad (9.5)$$
The MPP can be generated using the uniformity property: given that $N_t = n$, the n occurrence times $T_1, T_2, \ldots, T_n$ have the same joint distribution as the order statistics corresponding to n i.i.d. random variables uniformly distributed on the interval (0, t] (Albrecht, 1982). The procedure starts with the simulation of n as a realization of $N_t$ for a given value of t. This can be done in the following way: first a realization of a non-negative random variable Λ is generated and, conditioned upon its realization, $N_t$ is simulated according to the Poisson law with parameter Λt. Then we simulate n uniform random numbers in (0, t). After rearrangement, these values yield the sample $T_1 \le \ldots \le T_n$ of occurrence times. The algorithm is summarized below.

Algorithm MPP1 (Conditional theorem)
Step 1: generate a mixed Poisson random variable N with intensity Λt
Step 2: generate N random variables $U_i$ distributed uniformly on (0, 1), i.e. $U_i \sim U(0,1)$, $i = 1, 2, \ldots, N$
Step 3: set $(T_1, T_2, \ldots, T_N) = t \cdot \mathrm{sort}\{U_1, U_2, \ldots, U_N\}$
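A minimal R sketch of Algorithm MPP1, assuming for illustration a gamma structure variable Λ (one of the typical choices mentioned above); the shape and rate values are arbitrary.

```r
# Algorithm MPP1 with a gamma structure variable Lambda (illustrative choice)
simMPP1 <- function(t, shape, rate) {
  Lambda <- rgamma(1, shape = shape, rate = rate)  # realization of the structure variable
  N <- rpois(1, Lambda * t)                        # mixed Poisson number of claims in (0, t]
  t * sort(runif(N))                               # occurrence times (conditional uniformity)
}

set.seed(1)
simMPP1(t = 1, shape = 2, rate = 0.2)
```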
9.2.4 Renewal process

Generalizing the homogeneous Poisson process we come to the point where instead of making λ non-constant, we can make a variety of different distributional assumptions on the sequence of waiting times $\{W_1, W_2, \ldots\}$ of the claim arrival process $\{N_t\}$. In some particular cases it might be useful to assume that the sequence is generated by a renewal process, i.e. the random variables $W_i$ are i.i.d., positive with a distribution function F. Note that the homogeneous Poisson process is a renewal process with exponentially distributed inter-arrival times. This observation lets us write the following algorithm for the generation of the arrival times of a renewal process on the interval (0, t]:

Algorithm RP1 (Waiting times)
Step 1: set $T_0 = 0$
Step 2: generate an F-distributed random variable X
Step 3: if $T_{i-1} + X < t$ then set $T_i = T_{i-1} + X$ and return to step 2, else stop

An important point in the previous generalizations of the Poisson process was the possibility to compensate risk and size fluctuations by the premiums. Thus, the premium rate had to be constantly adapted to the development of the
claims. For renewal claim arrival processes, a constant premium rate allows for a constant safety loading (Embrechts and Klüppelberg, 1993). Let $\{N_t\}$ be a renewal process and assume that $W_1$ has finite mean 1/λ. Then the premium function is defined in a natural way as $c(t) = (1+\theta)\mu\lambda t$, like for the homogeneous Poisson process, which leads to the risk process of the form (9.3).
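A minimal R sketch of Algorithm RP1, written for a generic waiting-time generator; the log-normal waiting-time law in the example call is only an illustration of a non-exponential choice of F.

```r
# Algorithm RP1: renewal process with i.i.d. waiting times drawn by rwait()
simRP1 <- function(rwait, t) {
  arrivals <- numeric(0)
  T.i <- rwait(1)
  while (T.i < t) {
    arrivals <- c(arrivals, T.i)
    T.i <- T.i + rwait(1)
  }
  arrivals
}

set.seed(1)
simRP1(function(n) rlnorm(n, meanlog = -2, sdlog = 1), t = 1)
```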
9.3 Loss distributions There are three basic approaches to deriving the loss distribution: empirical, analytical, and moment based. The empirical method, presented in Section 9.3.1, can be used only when large data sets are available. In such cases a sufficiently smooth and accurate estimate of the cumulative distribution function (cdf) is obtained. Sometimes the application of curve fitting techniques – used to smooth the empirical distribution function – can be beneficial. If the curve can be described by a function with a tractable analytical form, then this approach becomes computationally efficient and similar to the second method. The analytical approach is probably the most often used in practice and certainly the most frequently adopted in the actuarial literature. It reduces to finding a suitable analytical expression which fits the observed data well and which is easy to handle. Basic characteristics and estimation issues for the most popular and useful loss distributions are discussed in Sections 9.3.2-9.3.8. Note, that sometimes it may be helpful to subdivide the range of the claim size distribution into intervals for which different methods are employed. For example, the small and medium size claims could be described by the empirical claim size distribution, while the large claims – for which the scarcity of data eliminates the use of the empirical approach – by an analytical loss distribution. In some applications the exact shape of the loss distribution is not required. We may then use the moment based approach, which consists of estimating only the lowest characteristics (moments) of the distribution, like the mean and variance. However, it should be kept in mind that even the lowest three or four moments do not fully define the shape of a distribution, and therefore the fit to the observed data may be poor. Further details on the moment based approach can be found e.g. in Daykin, Pentikainen, and Pesonen (1994). Having a large collection of distributions to choose from, we need to narrow our selection to a single model and a unique parameter estimate. The type of the objective loss distribution can be easily selected by comparing the shapes of the empirical and theoretical mean excess functions. Goodness-of-fit can
Figure 9.2: Left panel : Empirical distribution function (edf) of a 10-element log-normally distributed sample with parameters μ = 0.5 and σ = 0.5, see Section 9.3.5. Right panel : Approximation of the edf by a continuous, piecewise linear function superimposed on the theoretical distribution function. STFloss02
be measured by tests based on the empirical distribution function. Finally, the hypothesis that the modeled random event is governed by a certain loss distribution can be statistically tested. In Section 9.4 these statistical issues are thoroughly discussed.
9.3.1 Empirical distribution function

A natural estimate for the loss distribution is the observed (empirical) claim size distribution. However, if there have been changes in monetary values during the observation period, inflation corrected data should be used. For a sample of observations $\{x_1, \ldots, x_n\}$ the empirical distribution function (edf) is defined as:
$$F_n(x) = \frac{1}{n}\#\{i : x_i \le x\}, \qquad (9.6)$$
i.e. it is a piecewise constant function with jumps of size 1/n at points $x_i$. Very often, especially if the sample is large, the edf is approximated by a continuous, piecewise linear function with the "jump points" connected by linear functions, see Figure 9.2.
The empirical distribution function approach is appropriate only when there is a sufficiently large volume of claim data. This is rarely the case for the tail of the distribution, especially in situations where exceptionally large claims are possible. It is often advisable to divide the range of relevant values of claims into two parts, treating the claim sizes up to some limit on a discrete basis, while the tail is replaced by an analytical cdf. If the claim statistics are too sparse to use the empirical approach it is desirable to find an explicit analytical expression for a loss distribution. It should be stressed, however, that many standard models in statistics – like the Gaussian distribution – are unsuitable for fitting the claim size distribution. The main reason for this is the strongly skewed nature of loss distributions. The lognormal, Pareto, Burr, and Weibull distributions are typical candidates to be considered in applications. However, before we review these probability laws we introduce two very versatile distributions – the exponential and gamma.
9.3.2 Exponential distribution

Consider the random variable with the following density and distribution functions, respectively:
$$f(x) = \beta e^{-\beta x}, \quad x > 0, \qquad (9.7)$$
$$F(x) = 1 - e^{-\beta x}, \quad x > 0. \qquad (9.8)$$
This distribution is called the exponential distribution with parameter (or intensity) β > 0. The Laplace transform of (9.7) is
$$L(t) \stackrel{\mathrm{def}}{=} \int_0^{\infty} e^{-tx} f(x)\,dx = \frac{\beta}{\beta + t}, \quad t > -\beta, \qquad (9.9)$$
yielding the general formula for the k-th raw moment
$$m_k \stackrel{\mathrm{def}}{=} (-1)^k \left.\frac{\partial^k L(t)}{\partial t^k}\right|_{t=0} = \frac{k!}{\beta^k}. \qquad (9.10)$$
The mean and variance are thus $\beta^{-1}$ and $\beta^{-2}$, respectively. The maximum likelihood estimator (equal to the method of moments estimator) for β is given by:
$$\hat\beta = \frac{1}{\hat m_1}, \qquad (9.11)$$
where
$$\hat m_k = \frac{1}{n}\sum_{i=1}^{n} x_i^k \qquad (9.12)$$
is the sample k-th raw moment.

To generate an exponential random variable X with intensity β we can use the inverse transform (or inversion) method (Devroye, 1986; Ross, 2002). The method consists of taking a random number U distributed uniformly on the interval (0, 1) and setting $X = F^{-1}(U)$, where $F^{-1}(x) = -\frac{1}{\beta}\log(1-x)$ is the inverse of the exponential cdf (9.8). In fact we can set $X = -\frac{1}{\beta}\log U$, since (1 − U) has the same distribution as U.

The exponential distribution has many interesting features. As we have seen in Section 9.2.1, it arises as the inter-occurrence time of the events in a HPP. It has the memoryless property, i.e. $P(X > x + y \,|\, X > y) = P(X > x)$. Further, the n-th root of the Laplace transform (9.9) is
$$\left(\frac{\beta}{\beta + t}\right)^{1/n}, \qquad (9.13)$$
which is the Laplace transform of a gamma variate (see Section 9.3.4). Thus the exponential distribution is infinitely divisible. The exponential distribution is often used in developing models of insurance risks. This usefulness stems in a large part from its many and varied tractable mathematical properties. However, a disadvantage of the exponential distribution is that its density is monotone decreasing (see Figure 9.3), a situation which may not be appropriate in some practical situations.
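A short R illustration of the inverse transform method and the estimator (9.11); the sample is simulated, so the output is illustrative only.

```r
set.seed(1)
beta <- 0.5
U <- runif(1000)
X <- -log(U) / beta          # inverse transform: X = -log(U)/beta
beta.hat <- 1 / mean(X)      # MLE / method of moments estimator, eqn (9.11)
beta.hat
```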
9.3.3 Mixture of exponential distributions

Using the technique of mixing one can construct a wide class of distributions. The most commonly used in applications is a mixture of two exponentials, see Chapter 10. In Figure 9.3 a pdf of a mixture of two exponentials is plotted together with the pdfs of the mixing laws. Note that the mixing procedure can be applied to arbitrary distributions. The construction goes as follows. Let $a_1, a_2, \ldots, a_n$ denote a series of non-negative weights satisfying $\sum_{i=1}^{n} a_i = 1$. Let $F_1(x), F_2(x), \ldots, F_n(x)$ denote an
Figure 9.3: Left panel: A probability density function (pdf) of a mixture of two exponential distributions with mixing parameter a = 0.5 superimposed on the pdfs of the component distributions. Right panel: Semi-logarithmic plots of the pdfs displayed in the left panel. The exponential pdfs are now straight lines with slopes −β. Note, the curvature of the pdf of the mixture of two exponentials. STFloss03
arbitrary sequence of exponential distribution functions given by the parameters $\beta_1, \beta_2, \ldots, \beta_n$, respectively. Then, the distribution function:
$$F(x) = \sum_{i=1}^{n} a_i F_i(x) = \sum_{i=1}^{n} a_i \{1 - \exp(-\beta_i x)\}, \qquad (9.14)$$
is called a mixture of n exponential distributions (exponentials). The density function of the constructed distribution is
$$f(x) = \sum_{i=1}^{n} a_i f_i(x) = \sum_{i=1}^{n} a_i \beta_i \exp(-\beta_i x), \qquad (9.15)$$
where $f_1(x), f_2(x), \ldots, f_n(x)$ denote the density functions of the input exponential distributions. The Laplace transform of (9.15) is
$$L(t) = \sum_{i=1}^{n} a_i \frac{\beta_i}{\beta_i + t}, \quad t > -\min_{i=1,\ldots,n}\{\beta_i\}, \qquad (9.16)$$
yielding the general formula for the k-th raw moment
$$m_k = \sum_{i=1}^{n} a_i \frac{k!}{\beta_i^k}. \qquad (9.17)$$
The mean is thus $\sum_{i=1}^{n} a_i \beta_i^{-1}$. The maximum likelihood and method of moments estimators for the mixture of n (n ≥ 2) exponential distributions can only be evaluated numerically.

Simulation of variates defined by (9.14) can be performed using the composition approach (Ross, 2002). First generate a random variable I, equal to i with probability $a_i$, i = 1, ..., n. Then simulate an exponential variate with intensity $\beta_I$. Note that the method is general in the sense that it can be used for any set of distributions $F_i$'s.
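A minimal R sketch of the composition approach for simulating variates from (9.14); the weights and intensities in the example call are arbitrary.

```r
# Composition method for a mixture of exponentials
rmixexp <- function(n, a, beta) {
  I <- sample(seq_along(a), size = n, replace = TRUE, prob = a)  # pick a component with prob a_i
  rexp(n, rate = beta[I])                                        # exponential variate with intensity beta_I
}

set.seed(1)
x <- rmixexp(1000, a = c(0.5, 0.5), beta = c(0.3, 1))
mean(x)   # compare with the theoretical mean sum(a / beta)
```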
9.3.4 Gamma distribution

The probability law with density and distribution functions given by:
$$f(x) = \beta(\beta x)^{\alpha-1}\frac{e^{-\beta x}}{\Gamma(\alpha)}, \quad x > 0, \qquad (9.18)$$
$$F(x) = \int_0^x \beta(\beta s)^{\alpha-1}\frac{e^{-\beta s}}{\Gamma(\alpha)}\,ds, \quad x > 0, \qquad (9.19)$$
where α and β are non-negative, is known as the gamma (or Pearson's Type III) distribution. In the above formulas
$$\Gamma(a) \stackrel{\mathrm{def}}{=} \int_0^{\infty} y^{a-1} e^{-y}\,dy \qquad (9.20)$$
is the standard gamma function. Moreover, for β = 1 the integral in (9.19):
$$\Gamma(\alpha, x) \stackrel{\mathrm{def}}{=} \frac{1}{\Gamma(\alpha)}\int_0^x s^{\alpha-1} e^{-s}\,ds \qquad (9.21)$$
is called the incomplete gamma function. If the shape parameter α = 1, the exponential distribution results. If α is a positive integer, the distribution is called an Erlang law. If β = 1/2 and α = ν/2 then it is called a chi-squared (χ²) distribution with ν degrees of freedom, see the top panels in Figure 9.4. Moreover, a mixed Poisson distribution with gamma mixing distribution is negative binomial.
Figure 9.4: Top panels: Three sample gamma pdfs, Gamma(α, β), on linear and semi-logarithmic plots. Note, that the first one (black solid line) is an exponential law, while the last one (dashed blue line) is a χ2 distribution with ν = 6 degrees of freedom. Bottom panels: Three sample log-normal pdfs, LogN(μ, σ), on linear and semilogarithmic plots. For small σ the log-normal distribution resembles the Gaussian. STFloss04
The gamma distribution is closed under convolution, i.e. a sum of independent gamma variates with the same parameter β is again gamma distributed with this β. Hence, it is infinitely divisible. Moreover, it is right-skewed and approaches a normal distribution in the limit as α goes to infinity.
The Laplace transform of the gamma distribution is given by:
$$L(t) = \left(\frac{\beta}{\beta + t}\right)^{\alpha}, \quad t > -\beta. \qquad (9.22)$$
The k-th raw moment can be easily derived from the Laplace transform:
$$m_k = \frac{\Gamma(\alpha + k)}{\Gamma(\alpha)\beta^k}. \qquad (9.23)$$
Hence, the mean and variance are
$$E(X) = \frac{\alpha}{\beta}, \qquad (9.24)$$
$$\mathrm{Var}(X) = \frac{\alpha}{\beta^2}. \qquad (9.25)$$
Finally, the method of moments estimators for the gamma distribution parameters have closed form expressions:
$$\hat\alpha = \frac{\hat m_1^2}{\hat m_2 - \hat m_1^2}, \qquad (9.26)$$
$$\hat\beta = \frac{\hat m_1}{\hat m_2 - \hat m_1^2}, \qquad (9.27)$$
but maximum likelihood estimators can only be evaluated numerically.

Unfortunately, simulation of gamma variates is not straightforward. For α < 1 a simple but slow algorithm due to Jöhnk (1964) can be used, while for α > 1 the rejection method is more optimal (Bratley, Fox, and Schrage, 1987; Devroye, 1986).

The gamma law is one of the most important distributions for modeling because it has very tractable mathematical properties. As we have seen above it is also very useful in creating other distributions, but by itself is rarely a reasonable model for insurance claim sizes.
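A short R sketch of the closed-form method of moments estimators (9.26)-(9.27); the simulated sample merely illustrates the call.

```r
# Method of moments for the gamma distribution, eqns (9.26)-(9.27)
gammaMoM <- function(x) {
  m1 <- mean(x)
  m2 <- mean(x^2)
  c(alpha = m1^2 / (m2 - m1^2),   # shape estimate
    beta  = m1 / (m2 - m1^2))     # rate estimate
}

set.seed(1)
gammaMoM(rgamma(1000, shape = 2, rate = 1))
```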
9.3.5 Log-Normal distribution

Consider a random variable X which has the normal distribution with density
$$f_N(x) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left\{-\frac{1}{2}\frac{(x-\mu)^2}{\sigma^2}\right\}, \quad -\infty < x < \infty. \qquad (9.28)$$
Let $Y = e^X$ so that $X = \log Y$. Then the probability density function of Y is given by:
$$f(y) = \frac{1}{y} f_N(\log y) = \frac{1}{\sqrt{2\pi}\sigma y}\exp\left\{-\frac{1}{2}\frac{(\log y - \mu)^2}{\sigma^2}\right\}, \quad y > 0, \qquad (9.29)$$
where σ > 0 is the scale and −∞ < μ < ∞ is the location parameter. The distribution of Y is called log-normal; in econometrics it is also known as the Cobb-Douglas law. The log-normal cdf is given by:
$$F(y) = \Phi\left(\frac{\log y - \mu}{\sigma}\right), \quad y > 0, \qquad (9.30)$$
where Φ(·) is the standard normal (with mean 0 and variance 1) distribution function.

The k-th raw moment $m_k$ of the log-normal variate can be easily derived using results for normal random variables:
$$m_k = E\left(Y^k\right) = E\left(e^{kX}\right) = M_X(k) = \exp\left(\mu k + \frac{\sigma^2 k^2}{2}\right), \qquad (9.31)$$
where $M_X(z)$ is the moment generating function of the normal distribution. In particular, the mean and variance are
$$E(X) = \exp\left(\mu + \frac{\sigma^2}{2}\right), \qquad (9.32)$$
$$\mathrm{Var}(X) = \left\{\exp\left(\sigma^2\right) - 1\right\}\exp\left(2\mu + \sigma^2\right), \qquad (9.33)$$
respectively. For both standard parameter estimation techniques the estimators are known in closed form. The method of moments estimators are given by:
$$\hat\mu = 2\log\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right) - \frac{1}{2}\log\left(\frac{1}{n}\sum_{i=1}^{n} x_i^2\right), \qquad (9.34)$$
$$\hat\sigma^2 = \log\left(\frac{1}{n}\sum_{i=1}^{n} x_i^2\right) - 2\log\left(\frac{1}{n}\sum_{i=1}^{n} x_i\right), \qquad (9.35)$$
while the maximum likelihood estimators by:
$$\hat\mu = \frac{1}{n}\sum_{i=1}^{n} \log(x_i), \qquad (9.36)$$
$$\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} \left\{\log(x_i) - \hat\mu\right\}^2. \qquad (9.37)$$
Finally, the generation of a log-normal variate is straightforward. We simply have to take the exponent of a normal variate. The log-normal distribution is very useful in modeling claim sizes. For large σ its tail is (semi-)heavy – heavier than the exponential but lighter than power-law, see the bottom panels in Figure 9.4. For small σ the log-normal resembles a normal distribution, although this is not always desirable. It is infinitely divisible and closed under scale and power transformations. However, it also suffers from some drawbacks. Most notably, the Laplace transform does not have a closed form representation and the moment generating function does not exist.
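A short R sketch of the simulation step and the maximum likelihood estimators (9.36)-(9.37); the sample is simulated for illustration.

```r
set.seed(1)
y <- exp(rnorm(1000, mean = 0.5, sd = 0.5))   # log-normal variate = exponent of a normal variate
mu.hat     <- mean(log(y))                    # eqn (9.36)
sigma2.hat <- mean((log(y) - mu.hat)^2)       # eqn (9.37)
c(mu.hat = mu.hat, sigma2.hat = sigma2.hat)
```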
9.3.6 Pareto distribution

Suppose that a variate X has (conditional on β) an exponential distribution with intensity β (i.e. with mean $\beta^{-1}$, see Section 9.3.2). Further, suppose that β itself has a gamma distribution (see Section 9.3.4). The unconditional distribution of X is a mixture and is called the Pareto distribution. Moreover, it can be shown that if X is an exponential random variable and Y is a gamma random variable, then X/Y is a Pareto random variable.

The density and distribution functions of a Pareto variate are given by:
$$f(x) = \frac{\alpha\lambda^{\alpha}}{(\lambda + x)^{\alpha+1}}, \quad x > 0, \qquad (9.38)$$
$$F(x) = 1 - \left(\frac{\lambda}{\lambda + x}\right)^{\alpha}, \quad x > 0, \qquad (9.39)$$
respectively. Clearly, the shape parameter α and the scale parameter λ are both positive. The k-th raw moment:
$$m_k = \lambda^k k!\,\frac{\Gamma(\alpha - k)}{\Gamma(\alpha)}, \qquad (9.40)$$
exists only for k < α. The mean and variance are thus:
$$E(X) = \frac{\lambda}{\alpha - 1}, \qquad (9.41)$$
$$\mathrm{Var}(X) = \frac{\alpha\lambda^2}{(\alpha - 1)^2(\alpha - 2)}, \qquad (9.42)$$
Figure 9.5: Three sample Pareto pdfs, Par(α, λ), on linear and doublelogarithmic plots. The thick power-law tails of the Pareto distribution (asymptotically linear in the log-log scale) are clearly visible. STFloss05
respectively. Note that the mean exists only for α > 1 and the variance only for α > 2. Hence, the Pareto distribution has very thick (or heavy) tails, see Figure 9.5. The method of moments estimators are given by:
$$\hat\alpha = \frac{2(\hat m_2 - \hat m_1^2)}{\hat m_2 - 2\hat m_1^2}, \qquad (9.43)$$
$$\hat\lambda = \frac{\hat m_1 \hat m_2}{\hat m_2 - 2\hat m_1^2}, \qquad (9.44)$$
where, as before, $\hat m_k$ is the sample k-th raw moment (9.12). Note that the estimators are well defined only when $\hat m_2 - 2\hat m_1^2 > 0$. Unfortunately, there are no closed form expressions for the maximum likelihood estimators and they can only be evaluated numerically.

Like for many other distributions the simulation of a Pareto variate X can be conducted via the inverse transform method. The inverse of the cdf (9.39) has a simple analytical form $F^{-1}(x) = \lambda\left\{(1-x)^{-1/\alpha} - 1\right\}$. Hence, we can set $X = \lambda\left(U^{-1/\alpha} - 1\right)$, where U is distributed uniformly on the unit interval. We have to be cautious, however, when α is larger but very close to one. The theoretical mean exists, but the right tail is very heavy. The sample mean will, in general, be significantly lower than E(X).
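A short R sketch of the inverse transform simulation and the moment estimators (9.43)-(9.44); the parameter values are illustrative and chosen so that the second moment exists.

```r
# Pareto: simulation via X = lambda * (U^(-1/alpha) - 1) and method of moments
rpareto <- function(n, alpha, lambda) lambda * (runif(n)^(-1 / alpha) - 1)

paretoMoM <- function(x) {
  m1 <- mean(x); m2 <- mean(x^2)
  c(alpha  = 2 * (m2 - m1^2) / (m2 - 2 * m1^2),   # eqn (9.43), defined only if m2 > 2*m1^2
    lambda = m1 * m2 / (m2 - 2 * m1^2))           # eqn (9.44)
}

set.seed(1)
paretoMoM(rpareto(10000, alpha = 3, lambda = 2))
```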
The Pareto law is very useful in modeling claim sizes in insurance, due in large part to its extremely thick tail. Its main drawback lies in its lack of mathematical tractability in some situations. Like for the log-normal distribution, the Laplace transform does not have a closed form representation and the moment generating function does not exist. Moreover, like the exponential pdf the Pareto density (9.38) is monotone decreasing, which may not be adequate in some practical situations.
9.3.7 Burr distribution

Experience has shown that the Pareto formula is often an appropriate model for the claim size distribution, particularly where exceptionally large claims may occur. However, there is sometimes a need to find heavy-tailed distributions which offer greater flexibility than the Pareto law, including a non-monotone pdf. Such flexibility is provided by the Burr distribution and its additional shape parameter τ > 0. If Y has the Pareto distribution, then the distribution of $X = Y^{1/\tau}$ is known as the Burr distribution, see the top panels in Figure 9.6. Its density and distribution functions are given by:
$$f(x) = \tau\alpha\lambda^{\alpha}\frac{x^{\tau-1}}{(\lambda + x^{\tau})^{\alpha+1}}, \quad x > 0, \qquad (9.45)$$
$$F(x) = 1 - \left(\frac{\lambda}{\lambda + x^{\tau}}\right)^{\alpha}, \quad x > 0, \qquad (9.46)$$
respectively. The k-th raw moment
$$m_k = \frac{1}{\Gamma(\alpha)}\,\lambda^{k/\tau}\,\Gamma\left(1 + \frac{k}{\tau}\right)\Gamma\left(\alpha - \frac{k}{\tau}\right), \qquad (9.47)$$
exists only for k < τα. Naturally, the Laplace transform does not exist in a closed form and the distribution has no moment generating function as it was the case with the Pareto distribution. The maximum likelihood and method of moments estimators for the Burr distribution can only be evaluated numerically.

A Burr variate X can be generated using the inverse transform method. The inverse of the cdf (9.46) has a simple analytical form $F^{-1}(x) = \left[\lambda\left\{(1-x)^{-1/\alpha} - 1\right\}\right]^{1/\tau}$. Hence, we can set $X = \left\{\lambda\left(U^{-1/\alpha} - 1\right)\right\}^{1/\tau}$, where U is distributed uniformly on the unit interval. Like in the Pareto case, we have to be cautious when τα is larger but very close to one. The sample mean will generally be significantly lower than E(X).
Figure 9.6: Top panels: Three sample Burr pdfs, Burr(α, λ, τ ), on linear and double-logarithmic plots. Note, the heavy, power-law tails. Bottom panels: Three sample Weibull pdfs, Weib(β, τ ), on linear and semilogarithmic plots. We can see that for τ < 1 the tails are much heavier and they look like power-law. STFloss06
9.3.8 Weibull distribution

If V is an exponential variate, then the distribution of $X = V^{1/\tau}$, τ > 0, is called the Weibull (or Fréchet) distribution. Its density and distribution functions are given by:
$$f(x) = \tau\beta x^{\tau-1} e^{-\beta x^{\tau}}, \quad x > 0, \qquad (9.48)$$
$$F(x) = 1 - e^{-\beta x^{\tau}}, \quad x > 0, \qquad (9.49)$$
respectively. For τ = 2 it is known as the Rayleigh distribution. The Weibull law is roughly symmetrical for the shape parameter τ ≈ 3.6. When τ is smaller the distribution is right-skewed, when τ is larger it is left-skewed, see the bottom panels in Figure 9.6. We can also observe that for τ < 1 the distribution becomes heavy-tailed. In this case, like for the Pareto and Burr distributions, the moment generating function is infinite. The k-th raw moment can be shown to be
$$m_k = \beta^{-k/\tau}\,\Gamma\left(1 + \frac{k}{\tau}\right). \qquad (9.50)$$
Like for the Burr distribution, the maximum likelihood and method of moments estimators can only be evaluated numerically. Similarly, Weibull variates can be generated using the inverse transform method.
9.4 Statistical validation techniques Having a large collection of distributions to choose from we need to narrow our selection to a single model and a unique parameter estimate. The type of the objective loss distribution can be easily selected by comparing the shapes of the empirical and theoretical mean excess functions. The mean excess function, presented in Section 9.4.1, is based on the idea of conditioning a random variable given that it exceeds a certain level. Once the distribution class is selected and the parameters are estimated using one of the available methods the goodness-of-fit has to be tested. A standard approach consists of measuring the distance between the empirical and the fitted analytical distribution function. A group of statistics and tests based on this idea is discussed in Section 9.4.2.
9.4.1 Mean excess function

For a claim amount random variable X, the mean excess function or mean residual life function is the expected payment per claim on a policy with a fixed amount deductible of x, where claims with amounts less than or equal to x are completely ignored:
$$e(x) = E(X - x \,|\, X > x) = \frac{\int_x^{\infty}\{1 - F(u)\}\,du}{1 - F(x)}. \qquad (9.51)$$
In practice, the mean excess function e is estimated by $\hat e_n$ based on a representative sample $x_1, \ldots, x_n$:
$$\hat e_n(x) = \frac{\sum_{x_i > x} x_i}{\#\{i : x_i > x\}} - x. \qquad (9.52)$$
Note that in a financial risk management context, switching from the right tail to the left tail, e(x) is referred to as the expected shortfall (Weron, 2004).

When considering the shapes of mean excess functions, the exponential distribution plays a central role. It has the memoryless property, meaning that whether the information X > x is given or not, the expected value of X − x is the same as if one started at x = 0 and calculated E(X). The mean excess function for the exponential distribution is therefore constant. One in fact easily calculates that for this case e(x) = 1/β for all x > 0.

If the distribution of X is heavier-tailed than the exponential distribution we find that the mean excess function ultimately increases, when it is lighter-tailed e(x) ultimately decreases. Hence, the shape of e(x) provides important information on the sub-exponential or super-exponential nature of the tail of the distribution at hand. Mean excess functions for the distributions discussed in Section 9.3 are given by the following formulas (and plotted in Figure 9.7):

• exponential:
$$e(x) = \frac{1}{\beta};$$

• mixture of two exponentials:
$$e(x) = \frac{\frac{a}{\beta_1}\exp(-\beta_1 x) + \frac{1-a}{\beta_2}\exp(-\beta_2 x)}{a\exp(-\beta_1 x) + (1-a)\exp(-\beta_2 x)};$$

• gamma:
$$e(x) = \frac{\alpha}{\beta}\cdot\frac{1 - F(x, \alpha+1, \beta)}{1 - F(x, \alpha, \beta)} - x = \beta^{-1}\{1 + o(1)\},$$
where F(x, α, β) is the gamma distribution function (9.19);
Figure 9.7: Left panel: Shapes of the mean excess function e(x) for the lognormal, gamma (with α < 1 and α > 1) and a mixture of two exponential distributions. Right panel: Shapes of the mean excess function e(x) for the Pareto, Weibull (with τ < 1 and τ > 1) and Burr distributions. STFloss07
• log-normal:
$$e(x) = \frac{\exp\left(\mu + \frac{\sigma^2}{2}\right)\left\{1 - \Phi\left(\frac{\ln x - \mu - \sigma^2}{\sigma}\right)\right\}}{1 - \Phi\left(\frac{\ln x - \mu}{\sigma}\right)} - x = \frac{\sigma^2 x}{\ln x - \mu}\{1 + o(1)\},$$
where o(1) stands for a term which tends to zero as x → ∞;

• Pareto:
$$e(x) = \frac{\lambda + x}{\alpha - 1}, \quad \alpha > 1;$$

• Burr:
$$e(x) = \frac{\lambda^{1/\tau}\,\Gamma\left(\alpha - \frac{1}{\tau}\right)\Gamma\left(1 + \frac{1}{\tau}\right)}{\Gamma(\alpha)}\left(\frac{\lambda}{\lambda + x^{\tau}}\right)^{-\alpha}\left\{1 - B\left(1 + \frac{1}{\tau}, \alpha - \frac{1}{\tau}, \frac{x^{\tau}}{\lambda + x^{\tau}}\right)\right\} - x = \frac{x}{\alpha\tau - 1}\{1 + o(1)\}, \quad \alpha\tau > 1,$$
where Γ(·) is the standard gamma function (9.20) and
$$B(a, b, x) \stackrel{\mathrm{def}}{=} \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\int_0^x y^{a-1}(1-y)^{b-1}\,dy \qquad (9.53)$$
is the beta function;

• Weibull:
$$e(x) = \frac{\Gamma(1 + 1/\tau)}{\beta^{1/\tau}}\left\{1 - \Gamma\left(1 + \frac{1}{\tau}, \beta x^{\tau}\right)\right\}\exp(\beta x^{\tau}) - x = \frac{x^{1-\tau}}{\beta\tau}\{1 + o(1)\},$$
where Γ(·, ·) is the incomplete gamma function (9.21).
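A minimal R implementation of the empirical mean excess function (9.52), evaluated on a grid of thresholds; the simulated log-normal sample merely stands in for real claim data.

```r
# Empirical mean excess function, eqn (9.52)
meanExcess <- function(x, u) sapply(u, function(t) mean(x[x > t]) - t)

set.seed(1)
x <- exp(rnorm(2000, 0.5, 0.5))                      # illustrative (log-normal) claims
u <- quantile(x, probs = seq(0.1, 0.9, by = 0.1))    # grid of thresholds
cbind(threshold = u, e.hat = meanExcess(x, u))
```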
9.4.2 Tests based on the empirical distribution function

A statistic measuring the difference between the empirical $F_n(x)$ and the fitted F(x) distribution function, called an edf statistic, is based on the vertical difference between the distributions. This distance is usually measured either by a supremum or a quadratic norm (D'Agostino and Stephens, 1986).

The most popular supremum statistic:
$$D = \sup_x |F_n(x) - F(x)|, \qquad (9.54)$$
is known as the Kolmogorov or Kolmogorov-Smirnov statistic. It can also be written in terms of two supremum statistics:
$$D^+ = \sup_x\{F_n(x) - F(x)\} \quad\text{and}\quad D^- = \sup_x\{F(x) - F_n(x)\},$$
where the former is the largest vertical difference when $F_n(x)$ is larger than F(x) and the latter is the largest vertical difference when it is smaller. The Kolmogorov statistic is then given by $D = \max(D^+, D^-)$. A closely related statistic proposed by Kuiper is simply a sum of the two differences, i.e. $V = D^+ + D^-$.

The second class of measures of discrepancy is given by the Cramer-von Mises family
$$Q = n\int_{-\infty}^{\infty}\{F_n(x) - F(x)\}^2\,\psi(x)\,dF(x), \qquad (9.55)$$
where ψ(x) is a suitable function which gives weights to the squared difference $\{F_n(x) - F(x)\}^2$. When ψ(x) = 1 we obtain the $W^2$ statistic of Cramer-von Mises. When $\psi(x) = [F(x)\{1 - F(x)\}]^{-1}$ formula (9.55) yields the $A^2$ statistic of Anderson and Darling.

From the definitions of the statistics given above, suitable computing formulas must be found. This can be done by utilizing the transformation Z = F(X). When F(x) is the true distribution function of X, the random variable Z is uniformly distributed on the unit interval. Suppose that a sample $x_1, \ldots, x_n$ gives values $z_i = F(x_i)$, $i = 1, \ldots, n$. It can be easily shown that, for values z and x related by z = F(x), the corresponding vertical differences in the edf diagrams for X and for Z are equal. Consequently, edf statistics calculated from the empirical distribution function of the $z_i$'s compared with the uniform distribution will take the same values as if they were calculated from the empirical distribution function of the $x_i$'s, compared with F(x). This leads to the following formulas given in terms of the order statistics $z_{(1)} < z_{(2)} < \cdots < z_{(n)}$:
$$D^+ = \max_{1\le i\le n}\left\{\frac{i}{n} - z_{(i)}\right\}, \qquad (9.56)$$
$$D^- = \max_{1\le i\le n}\left\{z_{(i)} - \frac{i-1}{n}\right\}, \qquad (9.57)$$
$$D = \max(D^+, D^-), \qquad (9.58)$$
$$V = D^+ + D^-, \qquad (9.59)$$
$$W^2 = \sum_{i=1}^{n}\left\{z_{(i)} - \frac{2i-1}{2n}\right\}^2 + \frac{1}{12n}, \qquad (9.60)$$
$$A^2 = -n - \frac{1}{n}\sum_{i=1}^{n}(2i-1)\left\{\log z_{(i)} + \log(1 - z_{(n+1-i)})\right\} \qquad (9.61)$$
$$= -n - \frac{1}{n}\sum_{i=1}^{n}\left\{(2i-1)\log z_{(i)} + (2n+1-2i)\log(1 - z_{(i)})\right\}. \qquad (9.62)$$
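A compact R function computing D, V, W² and A² from formulas (9.56)-(9.62), given the transformed sample $z_i = F(x_i)$.

```r
# edf statistics from the order statistics z_(1) <= ... <= z_(n), eqns (9.56)-(9.62)
edfStats <- function(z) {
  z <- sort(z); n <- length(z); i <- 1:n
  Dp <- max(i / n - z)                 # D+, eqn (9.56)
  Dm <- max(z - (i - 1) / n)           # D-, eqn (9.57)
  c(D  = max(Dp, Dm),                  # Kolmogorov statistic, eqn (9.58)
    V  = Dp + Dm,                      # Kuiper statistic, eqn (9.59)
    W2 = sum((z - (2 * i - 1) / (2 * n))^2) + 1 / (12 * n),               # eqn (9.60)
    A2 = -n - mean((2 * i - 1) * (log(z) + log(1 - rev(z)))))             # eqn (9.61)
}

# example: a uniform sample should yield small values of all four statistics
set.seed(1)
edfStats(runif(100))
```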
The general test of fit is structured as follows. The null hypothesis is that a specific distribution is acceptable, whereas the alternative is that it is not:
$$H_0: F_n(x) = F(x; \theta), \qquad H_1: F_n(x) \ne F(x; \theta),$$
where θ is a vector of known parameters. Small values of the test statistic T are evidence in favor of the null hypothesis, large ones indicate its falsity. To see how unlikely such a large outcome would be if the null hypothesis was true, we calculate the p-value by:
$$\text{p-value} = P(T \ge t), \qquad (9.63)$$
where t is the test value for a given sample. It is typical to reject the null hypothesis when a small p-value is obtained. However, we are in a situation where we want to test the hypothesis that the sample has a common distribution function F(x; θ) with unknown θ. To employ any of the edf tests we first need to estimate the parameters. It is important to recognize that when the parameters are estimated from the data, the critical values for the tests of the uniform distribution (or equivalently of a fully specified distribution) must be reduced. In other words, if the value of the test statistic T is d, then the p-value is overestimated by $P_U(T \ge d)$. Here $P_U$ indicates that the probability is computed under the assumption of a uniformly distributed sample. Hence, if $P_U(T \ge d)$ is small, then the p-value will be even smaller and the hypothesis will be rejected. However, if it is large then we have to obtain a more accurate estimate of the p-value.

Ross (2002) advocates the use of Monte Carlo simulations in this context. First the parameter vector is estimated for a given sample of size n, yielding $\hat\theta$, and the edf test statistic is calculated assuming that the sample is distributed according to $F(x; \hat\theta)$, returning a value of d. Next, a sample of size n of $F(x; \hat\theta)$-distributed variates is generated. The parameter vector is estimated for this simulated sample, yielding $\hat\theta_1$, and the edf test statistic is calculated assuming
that the sample is distributed according to $F(x; \hat\theta_1)$. The simulation is repeated as many times as required to achieve a certain level of accuracy. The estimate of the p-value is obtained as the proportion of times that the test quantity is at least as large as d.

An alternative solution to the problem of unknown parameters was proposed by Stephens (1978). The half-sample approach consists of using only half the data to estimate the parameters, but then using the entire data set to conduct the test. In this case, the critical values for the uniform distribution can be applied, at least asymptotically. The quadratic edf tests seem to converge fairly rapidly to their asymptotic distributions (D'Agostino and Stephens, 1986). Although the method is much faster than the Monte Carlo approach, it is not invariant – depending on the choice of the half-samples different test values will be obtained and there is no way of increasing the accuracy.

As a side product, the edf tests supply us with a natural technique of estimating the parameter vector θ. We can simply find the $\hat\theta^*$ that minimizes a selected edf statistic. Out of the four presented statistics $A^2$ is the most powerful when the fitted distribution departs from the true distribution in the tails (D'Agostino and Stephens, 1986). Since the fit in the tails is of crucial importance in most actuarial applications $A^2$ is the recommended statistic for the estimation scheme.
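A generic R sketch of the Monte Carlo p-value procedure described above; it reuses the edfStats helper sketched earlier, and the arguments fitdist, rdist and pdist are user-supplied assumptions (the estimator, simulator and cdf of the fitted family).

```r
# Parametric-bootstrap p-value for an edf statistic, in the spirit of Ross (2002).
# 'fitdist(x)' returns parameter estimates, 'rdist(n, theta)' simulates from F(x; theta),
# and 'pdist(x, theta)' evaluates F(x; theta); all three are supplied by the user.
edfPvalue <- function(x, fitdist, rdist, pdist, stat = "A2", M = 1000) {
  n <- length(x)
  theta.hat <- fitdist(x)
  d <- edfStats(pdist(x, theta.hat))[stat]          # test value for the original sample
  d.sim <- replicate(M, {
    xs <- rdist(n, theta.hat)                       # simulate, re-estimate, re-test
    edfStats(pdist(xs, fitdist(xs)))[stat]
  })
  mean(d.sim >= d)                                  # proportion of statistics at least as large as d
}
```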
9.5 Applications In this section we illustrate some of the methods described earlier in the chapter. We conduct the analysis for the Danish fire losses dataset, which concerns major fire losses in Danish Krone (DKK) that occurred between 1980 and 2002 and were recorded by Copenhagen Re. Here we consider only losses in profits. The Danish fire losses dataset has been adjusted for inflation using the Danish consumer price index.
9.5.1 Calibration of loss distributions We first look for the appropriate shape of the distribution. To this end we plot the empirical mean excess function for the analyzed data set, see Figure 9.8. Since the function represents an empirical mean above some threshold level, its values for high x’s are not credible, and we do not include them in the plot.
Figure 9.8: The empirical mean excess function $\hat e_n(x)$ for the Danish fire data. STFloss08
Before we continue with calibration let us note that in recent years outlier-resistant or so-called robust estimates of parameters are becoming more widespread in risk modeling. Such models – called robust (statistics) models – were introduced by P.J. Huber in 1981 and applied to robust regression analysis (Huber, 2004). Under the robust approach the extreme data points are eliminated to avoid a situation when outliers drive future forecasts in an unwanted (such as worst-case scenario) direction. One of the first applications of robust analysis to insurance claim data can be found in Chernobai et al. (2006). In that paper the top 1% of the catastrophic losses were treated as outliers and excluded from the analysis. This procedure led to an improvement of the forecasting power of the considered models. Also the resulting ruin probabilities were more optimistic than those predicted by the classical model. It is important to note, however, that neither of the two approaches – classical or robust – is preferred over the other. Rather, in the presence of outliers, the robust model can be used to complement the classical one. Due to space limits, in this chapter we present only the results of the classical approach. The robust approach can be easily conducted following the steps detailed in this section. The Danish fire losses show a super-exponential pattern suggesting a log-normal, Pareto or Burr distribution as the most adequate for modeling. Hence, in what follows we fit only these three distributions. We apply two estimation schemes: maximum likelihood and A² statistics minimization. Out of the three fitted distributions only the log-normal has closed form expressions for
Table 9.1: Parameter estimates obtained via the A² minimization scheme and test statistics for the fire loss amounts. The corresponding p-values based on 1000 simulated samples are given in parentheses.

Distributions:   log-normal       Pareto                 Burr
Parameters:      μ = 12.525       α = 1.3127             α = 0.9844
                 σ = 1.5384       λ = 4.0588 · 10^5      λ = 1.0585 · 10^6
                                                         τ = 1.1096
Tests:   D       0.0180           0.0262                 0.0266
                 (0.020)
         V
         W²
         A²
( 0 and MX (0) = −μ. Hence, the curves y = MX (z) and y = 1 + (1 + θ)μz may intersect, as shown in Figure 10.1.
10 Ruin probability in finite time
Table 10.1: Typical claim size distributions. In all cases x ≥ 0.

Light-tailed distributions
Name           Parameters                                pdf
Exponential    β > 0                                     $f_X(x) = \beta\exp(-\beta x)$
Gamma          α > 0, β > 0                              $f_X(x) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}x^{\alpha-1}\exp(-\beta x)$
Weibull        β > 0, τ ≥ 1                              $f_X(x) = \beta\tau x^{\tau-1}\exp(-\beta x^{\tau})$
Mixed exp's    $\beta_i > 0$, $\sum_{i=1}^{n} a_i = 1$   $f_X(x) = \sum_{i=1}^{n} a_i\beta_i\exp(-\beta_i x)$

Heavy-tailed distributions
Name           Parameters                                pdf
Weibull        β > 0, 0 < τ < 1                          $f_X(x) = \beta\tau x^{\tau-1}\exp(-\beta x^{\tau})$
Log-normal     μ ∈ R, σ > 0                              $f_X(x) = \frac{1}{\sqrt{2\pi}\sigma x}\exp\left\{-\frac{(\ln x-\mu)^2}{2\sigma^2}\right\}$
Pareto         α > 0, λ > 0                              $f_X(x) = \frac{\alpha}{\lambda+x}\left(\frac{\lambda}{\lambda+x}\right)^{\alpha}$
Burr           α > 0, λ > 0, τ > 0                       $f_X(x) = \frac{\alpha\tau\lambda^{\alpha}x^{\tau-1}}{(\lambda+x^{\tau})^{\alpha+1}}$
An analytical solution to equation (10.5) exists only for few claim distributions. However, it is quite easy to obtain a numerical solution. The coefficient R satisfies the inequality:
$$R < \frac{2\theta\mu}{\mu^{(2)}}, \qquad (10.6)$$
where $\mu^{(2)} = E(X_i^2)$, see Asmussen (2000). Let $D(z) = 1 + (1+\theta)\mu z - M_X(z)$. Thus, the adjustment coefficient R > 0 satisfies the equation D(R) = 0. In order to get the solution one may use the Newton-Raphson formula
$$R_{j+1} = R_j - \frac{D(R_j)}{D'(R_j)}, \qquad (10.7)$$
with the initial condition $R_0 = 2\theta\mu/\mu^{(2)}$, where $D'(z) = (1+\theta)\mu - M_X'(z)$. Moreover, if it is possible to calculate the third raw moment $\mu^{(3)}$, we can obtain a sharper bound than (10.6) (Panjer and Willmot, 1992):
R
2 – α3
+ – ατ > 2 – ατ < 2 ατ > 3
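A numerical R sketch of the Newton-Raphson scheme (10.7), assuming for illustration exponential claims with intensity β, for which $M_X(z) = \beta/(\beta - z)$ and the exact solution R = θβ/(1 + θ) is available as a check.

```r
# Newton-Raphson for the adjustment coefficient, eqn (10.7),
# illustrated for exponential claims with intensity beta (an assumed example).
adjustmentCoef <- function(beta, theta, tol = 1e-12) {
  mu  <- 1 / beta; mu2 <- 2 / beta^2
  MX  <- function(z) beta / (beta - z)            # m.g.f. of the exponential claim law
  dMX <- function(z) beta / (beta - z)^2
  D   <- function(z) 1 + (1 + theta) * mu * z - MX(z)
  dD  <- function(z) (1 + theta) * mu - dMX(z)
  R <- 2 * theta * mu / mu2                       # initial condition R0, see (10.6)
  repeat {
    R.new <- R - D(R) / dD(R)                     # Newton-Raphson step, eqn (10.7)
    if (abs(R.new - R) < tol) break
    R <- R.new
  }
  R.new
}

adjustmentCoef(beta = 1, theta = 0.3)   # exact value: theta*beta/(1+theta) = 0.2307692
```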
Figure 10.2: The reference ruin probability obtained via Monte Carlo simulations (left panel), the relative error of the approximations (right panel). The mixture of two exponentials case with T fixed and u varying. STFruin09
10.4 Numerical comparison of the finite time approximations

Now, we will illustrate all six approximations presented in Section 10.3. We consider three claim amount distributions which were best fitted to the Danish fire losses data in Chapter 9, namely the mixture of two exponentials (a running example in Section 10.3), log-normal and Pareto distributions. The parameters of the distributions are: $\beta_1 = 3.8617\cdot 10^{-7}$, $\beta_2 = 3.6909\cdot 10^{-6}$, a = 0.2568 (mixture), μ = 12.5247, $\sigma^2 = 1.5384$ (log-normal), and α = 1.3127, $\lambda = 4.0588\cdot 10^{5}$ (Pareto). The ruin probability will be depicted as a function of u, ranging from 0 to 30 million DKK, with fixed T = 1, or with fixed value of u = 5 million DKK and T varying from 0 to 5 years. The relative safety loading is set to 30%.

All figures have the same form of output. In the left panel, the reference ruin probability values obtained via 10 × 10,000 Monte Carlo simulations are presented. The right panel depicts the relative error with respect to the reference values.
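A minimal R sketch of the Monte Carlo estimate of ψ(u, T) for the risk process (9.3), assuming a homogeneous Poisson claim arrival; the annual intensity lambda below is a placeholder (it is not given in this excerpt), while the mixture parameters and θ = 0.3 are those quoted above.

```r
# Monte Carlo estimate of the finite time ruin probability psi(u, T) for the
# risk process (9.3): HPP claim arrivals (assumed intensity) and mixture-of-exponentials claims.
ruinProbMC <- function(u, T, lambda, theta, a, beta1, beta2, nsim = 10000) {
  mu <- a / beta1 + (1 - a) / beta2                  # mean claim size
  ruined <- replicate(nsim, {
    N <- rpois(1, lambda * T)                        # number of claims in (0, T]
    if (N == 0) return(FALSE)
    arrivals <- sort(runif(N)) * T                   # HPP arrival times (conditional method)
    claims <- ifelse(runif(N) < a, rexp(N, beta1), rexp(N, beta2))
    R <- u + (1 + theta) * mu * lambda * arrivals - cumsum(claims)
    any(R < 0)                                       # ruin occurs before T (checked at claim epochs)
  })
  mean(ruined)
}

# illustrative call with the mixture parameters quoted above and a placeholder lambda
ruinProbMC(u = 5e6, T = 1, lambda = 100, theta = 0.3,
           a = 0.2568, beta1 = 3.8617e-7, beta2 = 3.6909e-6)
```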
Figure 10.3: The reference ruin probability obtained via Monte Carlo simulations (left panel), the relative error of the approximations (right panel). The mixture of two exponentials case with u fixed and T varying. STFruin10
First, we consider the mixture of two exponentials. As we can see in Figures 10.2 and 10.3, the Brownian motion diffusion and Segerdahl approximations give highly inaccurate results for almost all values of u and T. The corrected diffusion and finite time De Vylder approximations yield acceptable errors, which are generally below 10%.

In the case of log-normally distributed claims, we can only apply two approximations: diffusion by Brownian motion and finite time De Vylder, see Table 10.9. Figures 10.4 and 10.5 depict the reference ruin probability values obtained via Monte Carlo simulations and the relative errors with respect to the reference values. Again, the finite time De Vylder approximation works better than the diffusion approximation, but, in general, the errors are not acceptable.

Finally, we take into consideration the Pareto claim size distribution. Figures 10.6 and 10.7 depict the reference ruin probability values and the relative errors with respect to the reference values for the α-stable Lévy motion diffusion approximation (we cannot apply the finite time De Vylder approximation as α < 3). We see that the error is quite low for high values of T.
Figure 10.4: The reference ruin probability obtained via Monte Carlo simulations (left panel), the relative error of the approximations (right panel). The log-normal case with T fixed and u varying. STFruin11
Figure 10.5: The reference ruin probability obtained via Monte Carlo simulations (left panel), the relative error of the approximations (right panel). The log-normal case with u fixed and T varying. STFruin12
Figure 10.6: The reference ruin probability obtained via Monte Carlo simulations (left panel), the relative error of the approximation with α-stable Lévy motion (right panel). The Pareto case with T fixed and u varying. STFruin13
Figure 10.7: The reference ruin probability obtained via Monte Carlo simulations (left panel), the relative error of the approximation with α-stable Lévy motion (right panel). The Pareto case with u fixed and T varying. STFruin14
Bibliography

Asmussen, S. (2000). Ruin Probabilities, World Scientific, Singapore.

Burnecki, K. (2000). Self-similar processes as weak limits of a risk reserve process, Probab. Math. Statist. 20(2): 261-272.

Burnecki, K., Miśta, P., and Weron, A. (2005). Ruin Probabilities in Finite and Infinite Time, in P. Čížek, W. Härdle, and R. Weron (eds.) Statistical Tools for Finance and Insurance, Springer, Berlin, 341-379.

Burnecki, K., Miśta, P., and Weron, A. (2005). What is the best approximation of ruin probability in infinite time?, Appl. Math. (Warsaw) 32(2): 155-176.

Degen, M., Embrechts, P., and Lambrigger, D.D. (2007). The quantitative modeling of operational risk: between g-and-h and EVT, Astin Bulletin 37(2): 265-291.

Embrechts, P., Kaufmann, R., and Samorodnitsky, G. (2004). Ruin Theory Revisited: Stochastic Models for Operational Risk, in C. Bernadell et al. (eds.) Risk Management for Central Bank Foreign Reserves, European Central Bank, Frankfurt a.M., 243-261.

Furrer, H., Michna, Z., and Weron, A. (1997). Stable Lévy motion approximation in collective risk theory, Insurance Math. Econom. 20: 97-114.

Grandell, J. (1991). Aspects of Risk Theory, Springer, New York.

Iglehart, D. L. (1969). Diffusion approximations in collective risk theory, Journal of Applied Probability 6: 285-292.

Kaishev, V.K., Dimitrova, D.S. and Ignatov, Z.G. (2008). Operational risk and insurance: a ruin-probabilistic reserving approach, Journal of Operational Risk 3(3): 39-60.

Nolan, J.P. (2010). Stable Distributions - Models for Heavy Tailed Data, Birkhäuser, Boston. In progress, Chapter 1 online at academic2.american.edu/~jpnolan.

Panjer, H.H. and Willmot, G.E. (1992). Insurance Risk Models, Society of Actuaries, Schaumburg.

Segerdahl, C.-O. (1955). When does ruin occur in the collective theory of risk?, Skand. Aktuar. Tidskr. 38: 22-36.
11 Property and casualty insurance pricing with GLMs

Jan Iwanik
11.1 Introduction

The purpose of insurance rate making is to ensure that the book of business generates enough revenue to pay claims and expenses, and to make a profit. Apart from managing the overall rate level, the actuary also needs to design a rating segmentation structure. The insurance company must understand which policies are more and which are less likely to generate claims. The price of insurance should reflect these differences.

Different lines of insurance are modeled by actuaries in different ways. An introduction to property and casualty insurance and standard pricing methods can be found in Ferguson et al. (2004) and Flinter and Trupin (2004). The lines of business where statistical regression modeling is most widely applied are those where insurers work with large books of small and similar risks. Good examples are private and light commercial auto or homeowners insurance. The methodology of choice for these lines is generalized linear models (GLMs). Other, more general, approaches include generalized additive models, generalized non-linear models and mixed models, see for example Wood (2006), McCulloch, Neuhaus and Searle (2008), and Härdle et al. (2004).

This chapter is a brief introduction to GLMs for insurance cost pricing. We will start with a discussion of the typical data sets used. Next, we will explain the building blocks of GLMs. Using a policy level insurance data set we will then demonstrate how GLMs can be used to model claim frequency and claim severity. Finally, we will present the tools commonly used for diagnosing and selecting the final models and the rate structure.
Table 11.1: An extract from a simplified exposure table for auto insurance evaluated as of January 1, 2009.

RISK ID   START DATE   EARN EXP.   VEH. VALUE   COVER       AGE   VEHICLE TYPE
NY7658    2008-07-01   0.50        12 700       full        55    4x4
NY7658    2008-10-01   0.25        16 790       3rd party   22    van
NY7658    2002-09-01   0.33         9 890       3rd party   44    convert
VA3453    2006-11-12   1.00        14 000       full        19    standard
MO6723    2007-11-11   1.00         7 800       full        31    standard
...       ...          ...         ...          ...         ...   ...
11.2 Insurance data used in statistical modeling

The main source of data used for rate making is the company's own records. Typically this data is organized in the form of two tables. The exposure table contains the risk details and earned exposure information, and the claim table holds information on claims.

Every row in the exposure table represents one policy. If a policy is renewed, the new policy term is normally registered as an additional row. The columns represent the risk details. Typical risk details for an auto policy include: age of policyholder, number of drivers, location of risk, type of car, deductible, amount of insurance, last year's mileage, etc. Earned exposure is assigned to every observation. Earned exposure represents the number of years the policy was active. An old annual policy has earned exposure of one. An annual policy written six months ago has earned exposure of 1/2. Table 11.1 shows a sample structure of the exposure table for an auto insurance product.

The other data set is the claim table. Every claim is attached to a specific policy and there can be multiple claims per policy. Table 11.2 demonstrates the structure of a sample claim table. The two data sets are often merged into a single file.

In the numerical examples in this chapter we will be using a sample data set of 100 000 homeowners policies. It combines the exposure and claims information from a hypothetical insurer.
Table 11.2: An extract from a simplified claim table for auto insurance evaluated as of January 1, 2009.

RISK ID   START DATE   EVENT DATE   CLAIM CAUSE   CLAIM TYPE   PAID (USD)   RESERVE (USD)
NY7658    2008-10-01   2008-12-26   accident      injury            1 000          22 000
PA3400    2003-01-03   2003-03-03   fire          own prop.         6 700               0
VA3453    2008-11-01   2008-11-18   accident      injury                0           4 200
GA7550    2002-05-09   2003-02-13   accident      own prop.         1 500               0
...       ...          ...          ...           ...                 ...             ...
External databases can also be used in rate making. They are available from industry organizations, specialized companies, credit bureaus, research institutions, and the government. External data can be added as new columns to the exposure table. These additional columns can include the credit score, demographic information, etc. Once the external data is merged with the company's data sets, it plays the same modeling role as the internal data.
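As an illustration of the merge mentioned above, the R sketch below aggregates claim counts and paid amounts per policy record and joins them to the exposure table. The data frames exposure and claims, and their column names, are hypothetical stand-ins for Tables 11.1 and 11.2 chosen only for this example.

# Aggregate the claim table to one row per policy record.
claim_summary <- aggregate(
  data.frame(claim_count = 1, paid = claims$paid),
  by  = claims[, c("risk_id", "start_date")],
  FUN = sum
)

# Left-join onto the exposure table; policies without claims get zeros.
policies <- merge(exposure, claim_summary,
                  by = c("risk_id", "start_date"), all.x = TRUE)
policies$claim_count[is.na(policies$claim_count)] <- 0
policies$paid[is.na(policies$paid)] <- 0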
11.3 The structure of generalized linear models

One of the tools of predictive modeling is statistical regression. The standard multivariate Gaussian regression model expresses the dependent random variable y as a linear function of the independent variables (x_i)_{i=1,...,k} in the following form:

y = β_0 + β_1 x_1 + β_2 x_2 + ... + β_k x_k + ε = β⊤x + ε,    (11.1)

where ε is distributed according to N(0, σ^2). The normality assumption used in Gaussian regression is relaxed for GLMs. A GLM is defined by the equation

E[y] = g^(-1)(β⊤x),    (11.2)

where g(·) is called the link function and y follows a distribution from the exponential family. We will now briefly discuss the building blocks of GLMs.
For a more comprehensive introduction, see for example McCullagh and Nelder (1989), Muller (2004).
11.3.1 Exponential family of distributions

In GLMs the random variable y needs to be distributed according to one of the distributions from the exponential family of distributions. A probability distribution is a member of the exponential family if its density function or probability mass function can be expressed as

f(y; θ, φ) = c(y, φ) exp{ (yθ − a(θ)) / φ }.    (11.3)

Functions a(·) and c(·) define the parametric subfamily of the exponential family with the canonical parameter θ and the dispersion parameter φ. The dispersion parameter may or may not be given, so one or two parameters of the parametric subfamily may have to be estimated from the data. The function a(·) must be twice differentiable and convex. If Y is from the exponential family, the following holds, see for example McCullagh and Nelder (1989, Ch. 2):

E[Y] = a′(θ),    (11.4)
Var(Y) = φ a′′(θ).    (11.5)

This means that, since a′(θ) is invertible, θ is a function of E[Y]. It is also true that c(·) does not depend on θ, so c(·) cannot depend on E[Y]. As a consequence, c(·) does not depend on (β_i)_{i=1,...,k} from equation (11.2); it is therefore irrelevant to the estimation of the model parameters and will be omitted from further discussion.

Note that for a(θ) = θ^2/2 and c(y, φ) = exp{−y^2/(2φ)}/√(2πφ), formula (11.3) becomes the density of the normal distribution with mean θ and variance φ. Other popular probability distributions which can be expressed by (11.3) include: gamma, inverse Gaussian, binomial, negative binomial, Poisson, and Tweedie. Although the exponential family is a general class, some useful distributions do not belong to it, for instance the log-normal distribution.
Table 11.3: Commonly used variance and link functions.

Variance function   v(μ)         Link function   g(x)
Gaussian            1            identity        x
Bernoulli           μ(1 − μ)     logistic        log{x/(1 − x)}
Poisson             μ            logarithmic     log(x)
gamma               μ^2          complementary   log(1 − x)
inv. Gaussian       μ^3
11.3.2 The variance and link functions

Since θ is a function of μ = E[Y] and it follows from (11.5) that Var(Y) is a function of θ, we can conclude that Var(Y) = φ v(μ). The function v(·) linking the mean of the distribution μ with its variance Var(Y) is called the variance function. For example, for the Poisson distribution v(μ) = μ. Some commonly used variance and link functions are summarized in Table 11.3.

As specified by equation (11.2), the dependent variable y is linked to the independent variables in a GLM through a monotonic and smooth function g(·) called the link function. The link function adds flexibility to the modeling process. For example, by using a logarithmic link function we can transform the additive regression model into a multiplicative structure. Non-identity link functions can be useful but they also introduce bias into GLM predictions. The bias diminishes as the variance of the estimators (β̂_i)_{i=1,...,k} decreases, so it should be low when the sample size is large.

Statistical packages often identify the class of GLM models by the couple {variance function, link function}, or simply {v(·), g(·)}. This parametrization is sufficient to determine the parameters of the underlying distributions, the model coefficients, the variance of the estimators and other relevant information.
11.3.3 The iterative algorithm

Except for special cases, there are no closed-form solutions to the maximum likelihood (ML) estimation problem in GLMs. To estimate the ML parameters of a GLM, numerical algorithms have to be used instead. There is more than one way to do this effectively, see McCullagh and Nelder (1989, Ch. 2) or de Jong and Heller (2008, Ch. 5).
Since for every observation θ_j is a function of μ_j, under the model assumptions it must also be a function of the product of the given (x_{j,i})_{i=1,...,k} and of the unknown (β_i)_{i=1,...,k}. So the likelihood function for a given set of n independent observations (y_j)_{j=1,...,n} from an exponential family is given by

L(y, θ, φ) = ∏_{j=1}^{n} f(y_j; θ_j, φ) = ∏_{j=1}^{n} c(y_j, φ) exp[ { y_j θ(β, x_j) − a(θ(β, x_j)) } / φ ].
The purpose of the algorithm is to find estimators (β̂_i)_{i=1,...,k} which, for the given set of n independent observations (x_j, y_j)_{j=1,...,n}, maximize the above likelihood function. The basic numerical algorithm is based on an iterative extension of weighted linear regression, which can be found for example in McCullagh and Nelder (1989) or de Jong and Heller (2008). Because the method involves an iterative procedure, in some practical situations the algorithm may not converge. However, it is rare to encounter a serious convergence issue with a well defined and correctly designed real life insurance problem.
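To make the idea concrete, here is a minimal R sketch of the iteratively reweighted least squares scheme for a Poisson GLM with the canonical log link. It is only a bare-bones illustration of the algorithm, not a replacement for glm(), which implements a more general and numerically robust version.

# Minimal IRLS for a Poisson GLM with log link (canonical link).
irls_poisson <- function(X, y, tol = 1e-8, maxit = 25) {
  beta <- rep(0, ncol(X))
  for (it in seq_len(maxit)) {
    eta <- drop(X %*% beta)        # linear predictor
    mu  <- exp(eta)                # inverse of the log link
    w   <- mu                      # working weights
    z   <- eta + (y - mu) / mu     # working response
    beta_old <- beta
    beta <- drop(solve(t(X) %*% (w * X), t(X) %*% (w * z)))
    if (max(abs(beta - beta_old)) < tol) break
  }
  beta
}

# Quick check against glm() on simulated data.
set.seed(42)
X <- cbind(1, rnorm(1000))
y <- rpois(1000, exp(0.5 + 0.3 * X[, 2]))
irls_poisson(X, y)
coef(glm(y ~ X[, 2], family = poisson(link = "log")))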
11.4 Modeling claim frequency

Some types of risks generate low frequency and high severity claims, for example product liability. Other claims have higher frequency and low severity, for example car audio theft or excess reimbursement. To correctly manage an insurance portfolio one must understand both claim frequency and claim severity. Some risk factors may be included in one model but excluded from the other. Others can have a positive impact on frequency but a negative impact on severity.

Building a credible model for claim frequency is usually easier than for claim severity because there is less variance in the frequency data than in the severity data. Also, there are normally more observations available for frequency modeling. This is because all observations are used in frequency modeling but only the ones with claims in severity modeling.
11.4.1 Pre-modeling steps

The first step is to make sure that we understand the business context. The analyst needs to understand what covers are being modeled, which types of claims are included, how they are trended and developed, and whether case reserves, reinsurance, costs and other financial adjustments were applied.

Table 11.2 stores two types of claims: bodily injury and own damage. Each claim type should normally be modeled separately because different models are useful for different claim types. For instance, the frequency of auto fire claims typically does not depend on the driver's age while the frequency of accidental damage claims does. A special type of claim is the catastrophic claim (see also Chapter 12). Catastrophic claims are typically modeled using specialized external tools and hence it is best to exclude them from the regression modeling, see Grossi, Kunreuter and Chandu (2005).

The data should be randomly divided into two subsamples for the purpose of split validation. One sample, say 75% of the full set, will be used for developing the model. The remaining 25% will be used later to measure and compare the predictive power of the models.
11.4.2 The Poisson model

Our objective is to create a model predicting the expected number of claims N_j for policy j with given risk details (x_{j,i})_{i=1,...,k}. We will use homeowners insurance as an example. The typical assumption is that the number of claims for every policy has a Poisson distribution with mean λ_j = exp(β⊤x_j). This assumption has strong analytical and theoretical justifications and is a part of the collective risk model (see Chapter 9). Hence, we assume that the expected number of claims from policy j with exposure e_j can be expressed as

E[N_j] = e_j exp(β⊤x_j) = log^(-1){ log(e_j) + β⊤x_j }.    (11.6)

This equation follows the GLM structure (11.2). The random distribution is Poisson and the link function is the logarithm. The logarithmic link function guarantees that the expected number of claims is positive and it enforces the multiplicative structure of the model. The task now is to estimate the parameters (β̂_i)_{i=0,1,...,k}.
The expected value of the number of claims is a product of a series of factors. Although this assumption is widely used in insurance practice, there is nothing about the nature of insurance risk that would guarantee that this structure is optimal. The question we are trying to answer is ‘assuming that the risk has a multiplicative structure, what are the best (βi )i=0,...,k to model it?’.
11.4.3 A numerical example

In R, GLM models are fitted with the glm() routine. The example in this subsection uses the 'formula' parameter to define the regression formula and the 'family' parameter to define the couple {v(·), g(·)}. More information and practical examples demonstrating the glm() routine can be found in Ripley and Venables (2002).

In this example we use the homeowners insurance data set with 100 000 observations. We will create a model expressing the expected number of claims with 18 coefficients. Each coefficient represents one level of one independent variable, grouped as described in Subsection 11.6.1. The earned exposure in this model is offset and included as a logarithm, see equation (11.6). The exposure coefficient is fixed rather than being fitted by the iterative GLM algorithm, see Subsection 11.6.2. The logarithmic link function ensures that exposure has a simple multiplicative impact on the predictions.

Table 11.4 lists the summary of this model. The intercept of the model is -1.95 so the base frequency is e^(-1.95) ≈ 0.14. If all claims had a severity of, say, USD 1 000 then the pure premium would be USD 140. The rest of the output represents the logarithms of the factors for different policyholder ages, ages of house, construction types and other risk factors. We will discuss the diagnostics for this model and steps to improve it in Section 11.7.
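A call along the following lines produces a fit of this kind. The data frame policies and its column names (claim_count, earned_exposure and the banded rating factors) are hypothetical stand-ins for the homeowners data set, so the snippet should be read as a sketch of the model structure rather than the exact quantlet.

# Poisson frequency model with log link; the log of earned exposure
# enters as an offset, as in equation (11.6).
freq_model <- glm(
  claim_count ~ offset(log(earned_exposure)) + house_age_band + ph_age_band +
    residents_band + other_products + construction + marital_status,
  family = poisson(link = "log"),
  data   = policies
)
summary(freq_model)      # coefficient table analogous to Table 11.4
exp(coef(freq_model))    # multiplicative frequency factors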
11.5 Modeling claim severity

Claim severity is usually harder to model statistically than claim frequency. One reason for this is the high volatility of the observed claim severities. In an auto insurance portfolio claims can range from, say, USD 30 to USD 1 million or more.
Table 11.4: The summary of the Poisson frequency model.

Coefficients               Estimate   Std. Error   z value   p value
(Intercept)                  -1.95       0.13       -14.59      0.00
house-age-06-to-15-TRUE       0.16       0.04         3.95      0.00
house-age-16-to-30-TRUE       0.21       0.03         5.41      0.00
house-age-31-to-35-TRUE       0.16       0.03         4.35      0.00
house-age-36-to-xx-TRUE       0.12       0.03         3.16      0.00
ph-age-36-to-45-TRUE          0.06       0.05         1.23      0.21
ph-age-46-to-55-TRUE          0.09       0.04         1.86      0.06
ph-age-56-to-65-TRUE          0.03       0.05         0.63      0.52
ph-age-66-to-80-TRUE         -0.05       0.05        -0.96      0.33
ph-age-81-to-xx-TRUE         -0.10       0.06        -1.46      0.14
residents-3-to-4-TRUE         0.00       0.02         0.26      0.79
residents-5-to-x-TRUE         0.09       0.03         3.05      0.00
other-products-Y-TRUE         0.01       0.02         0.58      0.55
constr-steel-TRUE            -0.43       0.12        -3.53      0.00
constr-masonry-TRUE          -0.61       0.12        -4.83      0.00
constr-steelmas-TRUE         -0.38       0.12        -3.07      0.00
constr-wood-TRUE             -0.51       0.16        -3.10      0.00
status-married-TRUE          -0.78       0.03       -21.29      0.00
Additional complexity arises from the development of claims. Claim severity changes over time as reserves change and partial payments are made. Figure 11.1 shows five examples of real auto claims developing over time. Some of these claims were handled in a straightforward way and some had multiple partial payments or reserve changes. One needs to decide how to treat this development. One approach is to use claims developed as of a given calendar day. Another approach would be to evaluate each claim as of n months after the accident date.
Figure 11.1: Each line represents the development of a different auto claim. Most claims stabilize about 25 months after being reported.

11.5.1 Data preparation

Similarly to the frequency modeling, the first step in severity modeling is to understand the data. One decision to be made is between using incurred claims (including case reserves) and paid claims. Both methods have advantages: the former uses the most recent estimate of claims, but the latter is independent of the reserving philosophy.

Another issue is the inflation. Historical claims could be corrected and brought to the current claim cost using an inflation index, or we could keep the dollar values but include the accident (or the underwriting) year as a factor in the GLM model so that all inflationary effects are reflected by this variable.

One also has to decide at what level to cap the claims. Claim capping is necessary since very large claims are often outliers and they would distort the results of the modeling process. If left uncapped, a large loss can have an enormous impact on the estimators for some segments of the portfolio. Catastrophic claims should be removed from the data set for similar reasons.
11.5.2 A numerical example

There are 8 930 claims in our data set of 100 000 household risks. Usually different loss types cannot be effectively accommodated by the same GLM model. For simplicity, we will however assume that all claims are of the same type.

First, we have to decide on the specific parametric subfamily of the exponential family of distributions. Chapter 9 of this book discusses sample loss distributions used in insurance applications and outlines methods of fitting these distributions to the data. However, in the case of regression modeling the observed (y_j)_{j=1,...,n} are not identically distributed because for every j the expected value E[y_j] is a function of (x_{j,i})_{i=1,...,k}. Simple methods of distribution fitting are therefore not directly applicable. However, the tail behavior of the distribution of (y_j)_{j=1,...,n} can be used as guidance in selecting the appropriate claim distribution for the GLM. The most commonly used claim severity distribution in GLMs is the gamma, and this is how we will start this analysis. The process of making the final decision on the claim severity distribution is described in Section 11.7.

We use the R program to build the GLM model for homeowners damage losses. The dependent variable is the capped gross paid claim amount evaluated as of March 2009, including allocated loss adjustment expense. The independent variables are: insurance limit, age of home, construction type and underwriting year. The link function is the logarithm so the model is multiplicative. No offsetting is used.

Table 11.5: The summary of the gamma severity model.

Coefficients               Estimate   Std. Error   t value   p value
(Intercept)                  9.99         0.12       82.95      0.00
limit                        0.3e-06      0.00       13.62      0.00
constr-steel-TRUE            0.02         0.11        0.15      0.87
constr-masonry-TRUE          0.02         0.11        0.16      0.87
constr-steelmas-TRUE         0.01         0.11        0.07      0.93
constr-wood-TRUE             0.47         0.16        2.87      0.00
house-age-06-to-15-TRUE     -0.07         0.03       -1.91      0.05
house-age-16-to-30-TRUE     -0.08         0.03       -2.46      0.01
house-age-31-to-45-TRUE      0.14         0.03        3.21      0.00
house-age-36-to-xx-TRUE      0.08         0.03        2.46      0.01
policy-year-2005-TRUE       -0.09         0.02       -3.51      0.00
policy-year-2006-TRUE       -0.01         0.02       -0.69      0.48
Based on the summary of the model in Table 11.5, the base severity is e^9.99 ≈ 22 026. The limit of insurance, construction, and age of home (if above 16 years) are statistically significant. We can also see how the underwriting year picks up the inflation. The multiplier for year 2005 is e^(-0.09) ≈ 0.91, for year 2006 it is e^(-0.01) ≈ 0.99, and for the base year of 2007 it is e^0 = 1.00. So the annual claim inflation is ca. 4.5%.
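The corresponding call is sketched below; again, the claims data frame and its column names are hypothetical stand-ins for the merged homeowners claim data.

# Gamma severity model with log link; the response is the capped gross
# paid claim amount described above.
sev_model <- glm(
  capped_paid_claim ~ limit + construction + house_age_band + policy_year,
  family = Gamma(link = "log"),
  data   = claims
)
summary(sev_model)       # coefficient table analogous to Table 11.5
exp(coef(sev_model))     # multiplicative severity factors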
11.6 Some practical modeling issues

11.6.1 Non-numeric variables and banding

Let's assume that we want to express the expected number of claims per policy as a function of the type of house. The house type can be detached, semi-detached or terrace. Since the type of house is not a number, it cannot be accommodated directly. The solution is to create two binary dummy variables, one for semi-detached and one for terrace houses, and to add them to the model.

The same procedure can be applied to numerical variables. The square footage of a house, rather than entering the model directly, can be treated as a non-numeric variable which can be either 750 or less, 751 to 1200, or 1201 or more. This gives the modeling software more freedom in fitting the functional form of the independent variables. Banding variables can also enforce regularities on the estimates or control the model's number of degrees of freedom. Since the final insurance rates often consist of multipliers for banded risk parameters, modeling the risk in a similar way simplifies rate making.
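In R, such banding can be done with cut(); the square-footage column below is hypothetical and the breaks simply mirror the three bands mentioned above.

# Band a numeric rating variable into the three groups discussed above.
policies$sqft_band <- cut(
  policies$square_footage,
  breaks = c(-Inf, 750, 1200, Inf),
  labels = c("750-or-less", "751-to-1200", "1201-or-more")
)
# The banded factor then enters the GLM formula instead of the raw value;
# glm() creates the corresponding dummy variables automatically.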
11.6.2 Functional form of the independent variables

It is common to pre-assume a functional form of the impact of an independent variable. For example, one could maintain that the frequency of theft claims is proportional to the crime rate in a given area or that deductibles have a certain impact on claim severity. GLMs allow restricting the functional form of independent variables. In the simplest case a coefficient β_i can be fixed. This is the case with earned exposure in our frequency model. The log of exposure enters the model directly and the corresponding coefficient is not estimated by the iterative algorithm.
This extreme form of restricting the model is called offsetting. It is easy to see that any other function f (xi ) can be offset in the model in the same way. More sophisticated functional forms commonly used in regression models are splines. Most statistical packages can estimate the parameters of a spline function of an independent variable with their standard GLM routines.
11.7 Diagnosing frequency and severity models

The successful estimation of the coefficients does not guarantee that the model is correct. The next step is to diagnose if the models we created are valid. There is no single way of determining this. However, there are a few standard checks that should be run every time a new frequency or severity model is created.
11.7.1 Expected value as a function of variance

The assumption in the example of Subsection 11.4.3 was that the number of claims N follows the Poisson distribution and hence E[N] = Var(N). To verify whether this assumption holds, it is not correct to simply test if the observed (y_j)_{j=1,...,n} are Poisson distributed. They may well not be, but the model could still be correct. This is because our hypothesis is that every y_j follows a Poisson distribution with a different expected value λ_j which is a function of x_j.

The simplest way to verify whether E[N] = Var(N) is to sort the policies by the predicted number of claims and then to group them into buckets of, say, 500. Next, calculate the observed average number of claims and the observed variance in each bucket and plot these numbers against each other. Figure 11.2 shows this method for our frequency model developed in Subsection 11.4.3. The points on the chart lie along the identity line. This suggests that the distribution used is correct.
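A sketch of this check in R, using the hypothetical freq_model and policies objects from the earlier sketches, could look as follows.

# Sort policies by predicted claim count, group into buckets of 500 and
# compare observed bucket means with observed bucket variances.
pred   <- fitted(freq_model)
ord    <- order(pred)
bucket <- ceiling(seq_along(ord) / 500)
obs    <- policies$claim_count[ord]

m <- tapply(obs, bucket, mean)
v <- tapply(obs, bucket, var)

plot(m, v, xlab = "Sample mean in groups of 500",
     ylab = "Sample variance in groups of 500")
abline(0, 1)   # identity line: E[N] = Var(N) under the Poisson assumption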
Figure 11.2: Average observed number of claims vs observed variance. Based on 200 groups, 500 policies each.

11.7.2 Deviance residuals

In the case of GLMs the residuals y_j − ŷ_j are not normally distributed, nor are they expected to be homoscedastic. However, we still want to check if they conform to the underlying GLM assumptions. We visually test this conformity by using deviance residuals, which represent each observation's contribution to the model's total deviance:

r_j = sign(y_j − ŷ_j) √[ 2 { y_j ( (a′)^(-1)(y_j) − θ̂_j ) − a( (a′)^(-1)(y_j) ) + a(θ̂_j) } / φ ],    (11.7)
where yj is the j-th observation and yˆj is its prediction from the model. If the model assumptions hold and the data sample is large enough, the deviance residuals should be approximately normally and identically distributed. Figure 11.3 depicts the standardized version of deviance residuals for our severity model from Subsection 11.5.2. The spread of the residuals does not form any strong pattern when we move along the horizontal axis. The residuals are not fully symmetric but perhaps they are sufficiently close to being normally distributed. The fit is not perfect but an amount of imperfection usually has to be accepted in real life problems.
Figure 11.3: Deviance residuals from the severity model.
11.7.3 Statistical significance of the coefficients

The significance of GLM coefficients can be tested in the standard way using t-statistics or z-statistics. In our frequency model the p-values related to the age of house are low, suggesting it has an impact on the claims frequency. Other significant factors are: the marital status, the number of occupants, and other policies with this insurer.

A practical way of testing the coefficients is to plot them against their confidence intervals. Figure 11.4 summarizes the multipliers in the frequency model and their 95% confidence intervals. This figure supports the conclusions based on the p-values: the age of home is probably an important risk factor. The age of policyholder probably is not.

It is sometimes useful to test multiple coefficients simultaneously. To test if construction type is a significant risk factor, we can remove all construction type coefficients from the model and test whether the fit deteriorated significantly, rather than test each individual coefficient separately. This can be done with the F-statistic based on the analysis of deviance.
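With the hypothetical freq_model from the earlier sketch, such a joint test can be carried out by comparing nested models with an analysis of deviance:

# Drop all construction-type coefficients at once and compare the fits.
freq_no_constr <- update(freq_model, . ~ . - construction)
anova(freq_no_constr, freq_model, test = "F")
# With a fixed dispersion (pure Poisson), test = "Chisq" is the usual choice.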
Figure 11.4: The solid lines show the estimated factors for different house and homeowners age bands. Dashed lines are the 95% confidence intervals of these estimates. If the number 1.0 lies within the confidence intervals then the factor may be insignificant.
11.7.4 Uniformity over time

The checks discussed so far are based on statistical and formal tests. They will only lead to correct conclusions if all the underlying assumptions for these tests are met. In real life, rather than independence we much more often deal with 'almost independence'; instead of normality we usually observe 'almost normality', etc. For practical reasons, it is therefore worth performing a time consistency test. We will determine whether the model changes when its coefficients are interacted with time. If a factor's impact does not change over time, then there is a good chance that this variable is correctly included in the model. If it changes, then we need to understand why. This practical approach was introduced in Anderson et al. (2004).

One way to check this is to divide the data set by underwriting year and to build separate models for every year. A better approach is to introduce interacted variables. The coefficient in question is interacted with some time variable, for instance the underwriting year. Instead of the detached house variable there will be a new variable for detached homes in 2005, detached homes in 2006, detached homes in 2007, etc. We have to gauge if the coefficients estimated this way vary a lot from year to year. Figure 11.5 demonstrates this comparison. The age of house has a similar impact on claim frequency regardless of the underwriting year. The curves are not identical but in each of the three years they initially go up and then stabilize or drop after the house is about thirty years old. Similarity of the pattern across the underwriting years is a common sense argument for leaving the age of house in the model.

Figure 11.5: Age of house factors for three different underwriting years: 2005 (solid), 2006 (dashed) and 2007 (dash-dotted).
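In R this amounts to interacting the factor with the underwriting year. The sketch below again uses the hypothetical freq_model and assumes a policy_year factor is available in the data.

# Replace the house-age main effect by its interaction with underwriting
# year, so each year gets its own set of house-age factors.
freq_time <- update(freq_model,
                    . ~ . - house_age_band + house_age_band:policy_year)
summary(freq_time)   # compare the per-year house-age coefficients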
11.7.5 Selecting the final models

After the initial models are validated, the actuary should build improved models by excluding or replacing the irrelevant variables, transforming variables, selecting more appropriate underlying distributions, or changing other model assumptions. This iterative process will result in a series of acceptable frequency and severity models. The actuary is now tasked with selecting the final model.
Most of the time there is no such thing as the best model. As George Box famously stated, 'all models are wrong but some are useful'. A good model is one which gives relatively credible results most of the time.

Because GLMs are estimated by ML, we normally have easy access to their likelihoods. This makes it particularly easy to apply comparison methods based on likelihood ratios and on information criteria such as the Akaike criterion and the Bayesian criterion of Schwarz, see Akaike (1974) and Schwarz (1978).

Another simple way of ranking the models is to compare their predictions with the observed data on the testing data set. Figure 11.6 depicts a comparison of cumulative predictions vs. cumulative observations. The data points are sorted by the predicted value and plotted against the identity line. For a model which predicts the claims perfectly, the curve would stay near zero for most of the low risks and then jump rapidly for the worst risks. If the model had no predictive power, the curve would follow the identity line. The area between the straight line and the model curve, sometimes called the Gini index, can be used to quantify this simple analysis.
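Both ideas are straightforward to apply in R. The sketch below uses the hypothetical model objects from the earlier sketches and a hold-out data frame test containing the 25% validation sample.

# Information criteria (lower is better).
AIC(freq_model, freq_time)
BIC(freq_model, freq_time)

# Cumulative observed claims against exposure, sorted by predicted frequency.
pred <- predict(freq_model, newdata = test, type = "response")
ord  <- order(pred)
x <- cumsum(test$earned_exposure[ord]) / sum(test$earned_exposure)
y <- cumsum(test$claim_count[ord])     / sum(test$claim_count)
plot(x, y, type = "l",
     xlab = "Fraction of exposure (sorted by prediction)",
     ylab = "Fraction of observed claims")
abline(0, 1, lty = 2)   # reference line for a model with no predictive power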
11.8 Finalizing the pricing models

Once the final frequency and severity models are selected, no further modeling is needed. The pure premium (or expected claim cost) multiplier for any given risk characteristic is simply the expected frequency multiplier times the expected severity multiplier. Table 11.6 demonstrates this calculation.

If frequency and severity were modeled for more than one type of claim and there is a need for a single multiplicative structure for pure premiums, these pure premium factors can be regressed from the multiple frequency and severity models. For instance, if auto insurance GLMs were built for own damage and third party damage, the final pure premium can be modeled as

E[y] = fˆ1 sˆ1 + fˆ2 sˆ2 = g^(-1)(β⊤x),    (11.8)

where (fˆ1, sˆ1) are the predictions from the frequency and severity models for own damage and (fˆ2, sˆ2) represent third party damage.
Figure 11.6: Exposure sorted by predicted claim frequency. The black curve shows the observed cumulative number of claims.
Table 11.6: Combining the frequency and severity claim cost models.

Coefficient                Frequency fˆ        Severity sˆ         Pure premium pˆ = fˆ · sˆ
(Intercept)                e^-1.95 = 0.14      e^9.99 = 22026      3083
house-age-06-to-15-TRUE    e^0.16 = 1.17       e^-0.07 = 0.93      1.09
house-age-16-to-30-TRUE    e^0.21 = 1.23       e^-0.08 = 0.92      1.13
house-age-31-to-45-TRUE    e^0.16 = 1.17       e^0.14 = 1.15       1.35
house-age-36-to-xx-TRUE    e^0.12 = 1.12       e^0.08 = 1.08       1.21
constr-steel-TRUE          e^-0.43 = 0.65      e^0.02 = 1.02       0.66
...                        ...                 ...                 ...
If this approach is used, the final risk model lacks the statistical information available for the frequency and severity models. The confidence intervals or the dispersion parameter of this model do not reflect the underlying insurance risks but rather those of the predictions from the frequency and severity models. As such, they carry little useful information.
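With the hypothetical freq_model and sev_model from the earlier sketches, pure premium factors of the kind shown in Table 11.6 can be obtained by multiplying the exponentiated coefficients the two models share; this is only a sketch of the arithmetic, not a statement about the original quantlets.

# Combine frequency and severity multipliers into pure premium factors.
f <- exp(coef(freq_model))
s <- exp(coef(sev_model))
shared <- intersect(names(f), names(s))
round(f[shared] * s[shared], 2)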
The final pure premium model is often used for offsetting any regulatory or business restrictions. If, for example, the statute requires that female drivers enjoy a discount of 5%, the multiplier of 0.95 should be offset for the appropriate coefficient in the pure premium model given by formula (11.8). This way the underlying frequency and severity models are unaffected by the external constraint and the other variables try to "undo the harm" the restriction did to the predictive power of the pure premium model.

Some practitioners prefer to model the pure premium directly rather than to model frequency and severity separately. This is done using the Tweedie distributions and requires additional estimation steps, see for example Denuit et al. (2008).

Finally, the rates implemented by the insurance company are a tradeoff between many different considerations. The GLMs developed in this chapter only reflect the actuarial cost of insurance. Other important elements are fixed costs, demand elasticity, risk premiums, strategic issues, and the competitive or regulatory environment. Some of these problems can also be modeled with GLMs and some are better predicted with other methods. For a brief introduction to the additional practical problems which need to be solved before the final rate is set, see Feldblum, McClenahan, Sherwood et al. (2001).
Bibliography

Akaike, H. (1974). A new look at the statistical model identification, IEEE Transactions on Automatic Control.

Anderson, D., Feldblum, S., Modlin, C., Schirmacher, D., Schirmacher, E., Thandi, N. (2004). A Practitioner's Guide to Generalized Linear Models, Casualty Actuarial Society.

Denuit, H., Dhaene, J., Goovaerts, M., Kaas, R. (2008). Modern Actuarial Risk Theory Using R, Springer.

de Jong, P., Heller, G. (2008). Generalized Linear Models for Insurance Data, Cambridge University Press.

Feldblum, S., McClenahan, C., Sherwood, M., et al. (2001). Foundations of Casualty Actuarial Science, Casualty Actuarial Society.

Ferguson, C., Luthardt, C., Rejda, G., Wiening, E. (2004). Personal Insurance, American Institute of Chartered Property Casualty Underwriters.

Flinter, A., Trupin, J. (2004). Commercial Insurance, American Institute of Chartered Property Casualty Underwriters.

Grossi, P., Kunreuter, H., Chandu, C. (2005). Catastrophe Modeling: A New Approach to Managing Risk, Springer.

Härdle, W., Muller, M., Sperlich, S., Werwatz, A. (2004). Nonparametric and Semiparametric Models, Springer.

McCullagh, P., Nelder, J. (1989). Generalized Linear Models, Chapman & Hall / CRC.

McCulloch, C., Neuhaus, J., Searle, S. (2008). Generalized, Linear, and Mixed Models, Wiley Blackwell.

Muller, M. (2004). Generalized Linear Models. In: Gentle, J., Härdle, W., Mori, Y. (eds) Handbook of Computational Statistics, Springer.

Ripley, B., Venables, W. (2002). Modern Applied Statistics with S, Springer.

Schwarz, G. (1978). Estimating the Dimension of a Model, Annals of Statistics.

Wood, S. (2006). Generalized Additive Models: An Introduction with R, Chapman & Hall / CRC.
12 Pricing of catastrophe bonds

Krzysztof Burnecki, Grzegorz Kukla, and David Taylor
12.1 Introduction

Catastrophe (CAT) bonds are one of the more recent financial derivatives to be traded on the world markets. In the mid-1990s a market in catastrophe insurance risk emerged in order to facilitate the direct transfer of re-insurance risk associated with natural catastrophes from corporations, insurers and reinsurers to capital market investors. The primary instrument developed for this purpose was the CAT bond.

CAT bonds are more specifically referred to as insurance-linked securities (ILS). The distinguishing feature of these bonds is that the ultimate repayment of principal depends on the outcome of an insured event. The basic CAT bond structure can be summarized as follows (Lane, 2004; Sigma, 2009):

1. The sponsor establishes a special purpose vehicle (SPV) as an issuer of bonds and as a source of reinsurance protection.

2. The issuer sells bonds to investors. The proceeds from the sale are invested in a collateral account.

3. The sponsor pays a premium to the issuer; this and the investment of bond proceeds are a source of interest paid to investors.

4. If the specified catastrophic risk is triggered, the funds are withdrawn from the collateral account and paid to the sponsor; at maturity, the remaining principal – or if there is no event, 100% of principal – is paid to investors.

There are three types of ILS triggers: indemnity, index and parametric. An indemnity trigger involves the actual losses of the bond-issuing insurer.
For example, the event may be the insurer's losses from an earthquake in a certain area of a given country over the period of the bond (Härdle and López, 2010). An industry index trigger involves, in the US for example, an index created from property claim service (PCS) loss estimates. A parametric trigger is based on, for example, the Richter scale readings of the magnitude of an earthquake at specified data stations. In this chapter we address the issue of pricing CAT bonds with indemnity and index triggers.
12.1.1 The emergence of CAT bonds

Until fairly recently, property re-insurance was a relatively well understood market with efficient pricing. However, naturally occurring catastrophes, such as earthquakes and hurricanes, are beginning to have a dominating impact on the industry. In part, this is due to the rapidly changing, heterogeneous distribution of high-value property in vulnerable areas. A consequence of this has been an increased need for a primary and secondary market in catastrophe related insurance derivatives. The creation of CAT bonds, along with allied financial products such as catastrophe insurance options, was motivated in part by the need to cover the massive property insurance industry payouts of the early- to mid-1990s. They also represent a "new asset class" in that they provide a mechanism for hedging against natural disasters, a risk which is essentially uncorrelated with the capital market indices (Doherty, 1997). Subsequent to the development of the CAT bond, the class of disaster referenced has grown considerably. As yet, there is almost no secondary market for CAT bonds, which hampers using arbitrage-free pricing models for the derivative.

Property insurance claims of approximately USD 60 billion between 1990 and 1996 (Canter, Cole, and Sandor, 1996) caused great concern to the insurance industry and resulted in the insolvency of a number of firms. These bankruptcies were brought on in the wake of hurricanes Andrew (Florida and Louisiana affected, 1992), Opal (Florida and Alabama, 1995) and Fran (North Carolina, 1996), which caused combined damage totalling USD 19.7 billion (Canter, Cole, and Sandor, 1996). These, along with the Northridge earthquake (1994) and similar disasters (for an illustration of the US natural catastrophe data see Figure 12.1), led to an interest in alternative means for underwriting insurance. In 1995, when the CAT bond market was born, the primary and secondary (or re-insurance) industries had access to approximately USD 240 billion in capital (Canter, Cole, and Sandor, 1996; Cummins and Danzon, 1997).
Figure 12.1: Graph of the PCS catastrophe loss data, 1990-1999. STF2cat01
Given the capital level constraints necessary for the reinsuring of property losses and the potential for single-event losses in excess of USD 100 billion, this was clearly insufficient. The international capital markets provided a potential source of security for the (re-)insurance market. An estimated capitalisation of the international financial markets, at that time, of about USD 19 trillion underwent an average daily fluctuation of approximately 70 basis points or USD 133 billion (Sigma, 1996).

The undercapitalisation of the re-insurance industry (and their consequential default risk) meant that there was a tendency for CAT re-insurance prices to be highly volatile. This was reflected in the traditional insurance market, with rates on line being significantly higher in the years following catastrophes and dropping off in the intervening years (Froot and O'Connell, 1997; Sigma, 1997). This heterogeneity in pricing has a very strong damping effect, forcing many re-insurers to leave the market, which in turn has adverse consequences for the primary insurers. A number of reasons for this volatility have been advanced (Winter, 1994; Cummins and Danzon, 1997).

CAT bonds and allied catastrophe related derivatives are an attempt to address these problems by providing effective hedging instruments which reflect long-term views and can be priced according to the statistical characteristics of the dominant underlying process(es). Their impact, since a period of standardisation between 1997 and 2003, has been substantial.
As a consequence, the rise in prices associated with the uppermost segments of the CAT re-insurance programs has been dampened. The primary market has developed and both issuers and investors are now well-educated and technically adept. In the years 2000 to 2003, the average total issue exceeded USD 1 billion per annum (McGhee, 2004). The catastrophe bond market witnessed yet another record year in 2003, with total issuance of USD 1.73 billion, an impressive 42 percent year-on-year increase from 2002's record of USD 1.22 billion. During the year, a total of eight transactions were completed, with three originating from first-time issuers. The year also featured the first European corporate-sponsored transaction (and only the third by any non-insurance company). Electricité de France, the largest electric utility in Europe, sponsored a transaction to address a portion of the risks facing its properties from French windstorms.

With USD 7 billion in publicly disclosed issuances for the year, 2007 was by far the most active year in the history of the catastrophe bond market. CAT bond issuance volume for 2007 increased by 49 percent over the 2006 record of USD 4.7 billion and 251 percent over the 2005 record of USD 2 billion (McGhee, Clarke, Fugit, and Hathaway, 2008). Despite the challenging financial conditions of late 2008 and early 2009, the catastrophe bond market continued to play a critical role for both sponsors and investors over the past 12 months. Throughout the year, as financial markets stabilized generally, catastrophe bond issuance conditions continued to improve. In 2009 USD 3.4 billion of risk capital was issued through 18 transactions. In terms of risk capital, this is a 25 percent increase over 2008. Heading into 2010, the catastrophe bond market continued to provide an increasingly attractive and worthwhile supplement to sponsors' risk transfer programs (Klein, 2010). It is interesting to note that very few of the issued bonds receive better than "non-investment grade" BB ratings; however, the first CAT bond was triggered only in 2007 (Kamp Re 2005 triggered by Zurich Re's Katrina losses).
12.1.2 Insurance securitization

Capitalisation of insurance, and the consequential risk spreading through share issue, is well established and the majority of primary and secondary insurers are public companies. Investors in these companies are thus de facto bearers of risk for the industry. This however relies on the idea of risk pooling through the law of large numbers, where the loss borne by each investor becomes highly predictable. In the case of catastrophic natural disasters, this may not be possible as the losses incurred by different insurers tend to be correlated.
In this situation a different approach to hedging the risk is necessary. A number of such products, which realize innovative methods of risk spreading, already exist and are traded (Litzenberger, Beaglehole, and Reynolds, 1996; Cummins and Danzon, 1997; Aase, 1999; Sigma, 2003). They are roughly divided into re-insurance share related derivatives, including Post-loss Equity Issues and Catastrophe Equity Puts, and asset–liability hedges such as Catastrophe Futures, Options and CAT Bonds.

In 1992, the Chicago Board of Trade (CBOT) introduced the CAT futures. In 1995, the CAT future was replaced by the PCS option. This option was based on a loss index provided by PCS. The underlying index represented the development of specified catastrophe damages, was published daily and eliminated the problems of the earlier ISO index. The options traded better, especially the call option spreads where insurers would appear on both sides of the transaction, i.e. as buyer and seller. However, they also ceased trading in 2000. Much work in the re-insurance industry concentrated on pricing these futures and options and on modelling the process driving their underlying indices (Canter, Cole, and Sandor, 1996; Embrechts and Meister, 1997; Aase, 1999). CAT bonds are allied but separate instruments which seek to ensure capital requirements are met in the specific instance of a catastrophic event.
12.1.3 CAT bond pricing methodology

In this chapter we investigate the pricing of CAT bonds. The methodology developed here can be extended to most other catastrophe related instruments. However, we are concerned here only with CAT specific instruments, e.g. California Earthquake Bonds (Sigma, 1996; Sigma, 1997; Sigma, 2003; McGhee, 2004), and not re-insurance shares or their related derivatives.

In the early market for CAT bonds, the pricing of the bonds was in the hands of the issuer and was affected by the equilibrium between supply and demand only. Consequently there was a tendency for the market to resemble the traditional re-insurance market. However, as CAT bonds become more popular, it is reasonable to expect that their price will begin to reflect the fair or arbitrage-free price of the bond, although recent discussions of alternative pricing methodologies have contradicted this expectation (Lane, 2003). Our pricing approach assumes that this market already exists.
Some of the traditional assumptions of derivative security pricing are not correct when applied to these instruments due to the properties of the underlying contingent stochastic processes. There is evidence that certain catastrophic natural events have (partial) power-law distributions associated with their loss statistics (Barton and Nishenko, 1994). This overturns the traditional log-normal assumption of derivative pricing models. There are also well-known statistical difficulties associated with the moments of power-law distributions, thus rendering it impossible to employ traditional pooling methods and consequently the central limit theorem. Given that heavy-tailed or large deviation results assume, in general, that at least the first moment of the distribution exists, there will be difficulties with applying extreme value theory to this problem (Embrechts, Resnick, and Samorodnitsky, 1999). It would seem that these characteristics may render traditional actuarial or derivatives pricing approaches ineffective.

There are additional features to modelling the CAT bond price which are not to be found in models of ordinary corporate or government issue (although there is some similarity with pricing defaultable bonds). In particular, the trigger event underlying CAT bond pricing is dependent on both the frequency and severity of natural disasters. In the model described here, we attempt to reduce to a minimum any assumptions about the underlying distribution functions. This is in the interests of generality of application. The numerical examples will have to make some distributional assumptions and will reference some real data. We will also assume that loss levels are instantaneously measurable and updatable. It is straightforward to adjust the underlying process to accommodate a development period.

There is a natural similarity between the pricing of catastrophe bonds and the pricing of defaultable bonds. Defaultable bonds, by definition, must contain within their pricing model a mechanism that accounts for the potential (partial or complete) loss of their principal value. Defaultable bonds yield higher returns, in part, because of this potential defaultability. Similarly, CAT bonds are offered at high yields because of the unpredictable nature of the catastrophe process. With this characteristic in mind, a number of pricing models for defaultable bonds have been advanced (Jarrow and Turnbull, 1995; Zhou, 1997; Duffie and Singleton, 1999). The trigger event for the default process has similar statistical characteristics to that of the equivalent catastrophic event pertaining to CAT bonds. In an allied application to mortgage insurance, the similarity between catastrophe and default in the log-normal context has been commented on (Kau and Keenan, 1996).

With this in mind, we have modelled the catastrophe process as a compound doubly stochastic Poisson process.
The underlying assumption is that there is a Poisson point process (of some intensity, in general varying over time) of potentially catastrophic events. However, these events may or may not result in economic losses. We assume the economic losses associated with each of the potentially catastrophic events to be independent and to have a certain common probability distribution. This is justifiable for the Property Claim Loss indices used as the triggers for the CAT bonds. Within this model, the threshold time can be seen as a point of a Poisson point process with a stochastic intensity depending on the instantaneous index position. We make this model precise later in the chapter.

In the article of Baryshnikov, Mayo, and Taylor (1998) the authors presented an arbitrage-free solution to the pricing of CAT bonds under conditions of continuous trading. They modelled the stochastic process underlying the CAT bond as a compound doubly stochastic Poisson process. Burnecki and Kukla (2003) applied their results in order to determine no-arbitrage prices of zero-coupon and coupon-bearing CAT bonds.

In Section 12.2 we present the doubly stochastic Poisson pricing model. In Section 12.3 we study 10-year catastrophe loss data provided by Property Claim Services. We find a distribution function which fits the observed claims in a satisfactory manner and estimate the intensity of the non-homogeneous Poisson process governing the flow of the natural events. In Section 12.4 we illustrate the values of different CAT bonds associated with this loss data with respect to the threshold level and maturity time. To this end we apply Monte Carlo simulations.
12.2 Compound doubly stochastic Poisson pricing model

The CAT bond we are interested in is described by specifying the region, type of events, type of insured properties, etc. More abstractly, it is described by the aggregate loss process L_s and by the threshold loss D. Set a probability space (Ω, F, F_t, ν) and an increasing filtration F_t ⊂ F, t ∈ [0, T]. This leads to the following assumptions:

• There exists a doubly stochastic Poisson process (Bremaud, 1981) M_s describing the flow of (potentially catastrophic) natural events of a given type in the region specified in the bond contract. The intensity of this Poisson point process is assumed to be a predictable bounded process m_s. This process describes the estimates based on statistical analysis and scientific knowledge about the nature of the catastrophe causes. We will denote the instants of these potentially catastrophic natural events as 0 ≤ t_1 ≤ ... ≤ t_i ≤ ... ≤ T.

• The losses incurred by each event in the flow {t_i} are assumed to be independent, identically distributed random values {X_i} with distribution function F(x) = P(X_i < x).

• A progressive process of discounting rates r. Following the traditional practice, we assume the process is continuous almost everywhere. This process describes the value at time s of USD 1 paid at time t > s by

exp{−R(s, t)} = exp{ −∫_s^t r(ξ) dξ }.
Therefore, one has Lt =
ti T . Define the process Zs = E(Z|Fs ). We require that Zs is a predictable process. This can be interpreted as the independence of payment at maturity from the occurrence and timing of the threshold. The amount Z can be the principal plus interest, usually defined as a fixed percentage over the London Inter-Bank Offer Rate (LIBOR). The no-arbitrage price of the zero-coupon CAT bond associated with a threshold D, catastrophic flow Ms , a distribution function of incurred losses F , and paying Z at maturity is given by Burnecki and Kukla (2003): Vt1
=
E [Z exp {−R(t, T )} (1 − NT )|F t ]
=
E Z exp {−R(t, T )}
·
⎧ ⎨ ⎩
T 1−
ms {1 − F (D − Ls )} I(Ls < D)ds t
⎫ ⎬ ⎭
|F t .
(12.2)
We evaluate this CAT bond price at t = 0, and apply appropriate Monte Carlo simulations. We assume for the purposes of illustration that the annual continuously-compounded discount rate r = ln(1.025) is constant and corresponds to LIBOR, T ∈ [1/4, 2] years, D ∈ [2.54, 30] billion USD (quarterly – 3*annual average loss). Furthermore, in the case of the zero-coupon CAT bond we assume that Z = 1.06 USD. Hence, the bond is priced at 3.5% over LIBOR when T = 1 year. Figure 12.4 illustrates the zero-coupon CAT bond values (12.2) with respect to the threshold level and time to expiry in the Burr and NHP1 case. We can see that as the time to expiry increases, the price of the CAT bond decreases. Increasing the threshold level leads to higher bond prices. When T is a quarter and D = 30 billion USD the CAT bond price approaches the value 1.06 exp {− ln(1.025)/4} ≈ 1.05 USD. This is equivalent to the situation when the threshold time exceeds the maturity (τ T ) with probability one.
12.4 Dynamics of the CAT bond price
383
1.04
0.83
0.63
0.42
0.21
2.54 8.14 13.73 19.32 24.91
0.60
0.95
1.30
1.65
2.00
Figure 12.4: The zero-coupon CAT bond price with respect to the time to expiry (left axis) and threshold level (right axis) in the Burr and NHP1 case.
Consider now a CAT bond which has only coupon payments Ct , which terminate at the threshold time τ . The no-arbitrage price of the CAT bond associated with a threshold D, catastrophic flow Ms , a distribution function of incurred losses F , with coupon payments Cs which terminate at time τ is given by Burnecki and Kukla (2003): T
Vt2
exp {−R(t, s)} Cs (1 − Ns )ds|F t
= E t
T
exp {−R(t, s)} Cs
= E ·
t
s 1−
mξ (1 − F (D − Lξ )) I(Lξ < D)dξ ds|F t .
(12.3)
t
We evaluate this CAT bond price at t = 0 and assume that Ct ≡ 0.06. The value of V02 as a function of time to maturity (expiry) and threshold level in
384
12 Pricing of catastrophe bonds
0.11
0.09
0.07
0.05
0.03
2.54 8.14 13.73 19.32 24.91
0.60
0.95
1.30
1.65
2.00
Figure 12.5: The CAT bond price, for the bond paying only coupons, with respect to the time to expiry (left axis) and threshold level (right axis) in the Burr and NHP1 case.
the Burr and NHP1 case is illustrated by Figure 12.5. We clearly see that the situation is different to that of the zero-coupon case. The price increases with both time to expiry and threshold level. When D = 30 USD billion and T = 2
2 years the CAT bond price approaches the value 0.06 0 exp {− ln(1.025)t} dt ≈ 0.12 USD. This is equivalent to the situation when the threshold time exceeds the maturity (τ T ) with probability one. Finally, we consider the case of the coupon-bearing CAT bond. Fashioned as floating rate notes, such bonds pay a fixed spread over LIBOR. Loosely speaking, the fixed spread may be analogous to the premium paid for the underlying insured event, and the floating rate, LIBOR, is the payment for having invested cash in the bond to provide payment against the insured event, should a payment to the insured be necessary. We combine (12.2) with Z equal to par value (PV) and (12.3) to obtain the price for the coupon-bearing CAT bond. The no-arbitrage price of the CAT bond associated with a threshold D, catastrophic flow Ms , a distribution function of incurred losses F , paying P V at
12.4 Dynamics of the CAT bond price
385
maturity, and coupon payments Cs which cease at the threshold time τ is given by Burnecki and Kukla (2003): Vt3
=
E P V exp {−R(t, T )} (1 − NT )
T
exp {−R(t, s)} Cs (1 − Ns )ds|F t
+ t
=
E P V exp{−R(t, T )} T
exp{−R(t, s)} Cs 1 −
+ t
!
s mξ (1 − F (D − Lξ )) I(Lξ < D)dξ t
− P V exp {−R(s, T )} ms (1 − F (D − Ls )) I(Ls < D) ds|F t . (12.4) We evaluate this CAT bond price at t = 0 and assume that P V = 1 USD, and again Ct ≡ 0.06. Figure 12.6 illustrates this CAT bond price in the Burr and NHP1 case. The influence of the threshold level D on the bond value is clear but the effect of increasing the time to expiry is not immediately clear. As T increases, the possibility of receiving more coupons increases but so does the possibility of losing the principal of the bond. In this example (see Figure 12.6) the price decreases with respect to the time to expiry but this is not always true. We also notice that the bond prices in Figure 12.6 are lower than the corresponding ones in Figure 12.4. However, we recall that in the former case P V = 1.06 USD and here P V = 1 USD. The choice of the fitted loss distribution affects the price of the bond. Figure 12.7 illustrates the difference between the zero-coupon CAT bond prices calculated under the two assumptions of Burr and log-normal loss sizes in the NHP1 case. It is clear that taking into account heavier tails (the Burr distribution), which can be more appropriate when considering catastrophic losses, leads to essentially different prices (the maximum difference in this example reaches 50%). Figures 12.8 and 12.9 show how the choice of the fitted Poisson point process influences the CAT bond value. Figure 12.8 illustrates the difference between the zero-coupon CAT bond prices calculated in the NHP1 and HP cases under the assumption of the Burr loss distribution. We see that the differences vary
386
12 Pricing of catastrophe bonds
1.00
0.80
0.61
0.41
0.22
2.54 8.14 13.73 19.32 24.91
0.60
0.95
1.30
1.65
2.00
Figure 12.6: The coupon-bearing CAT bond price with respect to the time to expiry (left axis) and threshold level (right axis) in the Burr and NHP1 case.
from −12% to 6% of the bond price. Finally, Figure 12.9 illustrates the difference between the zero-coupon CAT bond prices calculated in the NHP1 and NHP2 cases under the assumption of the Burr loss distribution. The difference is always below 10%. In our examples, equations (12.2) and (12.4), we have assumed that in the case of a trigger event the bond principal is completely lost. However, if we would like to incorporate a partial loss in the contract it is sufficient to multiply P V by the appropriate constant.
12.4 Dynamics of the CAT bond price
387
0.00 -0.11 -0.22 -0.33 -0.44
2.54 8.14 13.73 19.32 24.91
0.60
0.95
1.30
1.65
2.00
Figure 12.7: The difference between zero-coupon CAT bond prices in the Burr and the log-normal cases with respect to the threshold level (left axis) and time to expiry (right axis) under the NHP1 assumption.
0.06
0.02
-0.02
-0.05
-0.09
2.54 8.14 13.73 19.32 24.91
0.60
0.95
1.30
1.65
2.00
Figure 12.8: The difference between zero-coupon CAT bond prices in the NHP1 and the HP cases with respect to the time to expiry (left axis) and the threshold level (right axis) under the Burr assumption.
388
12 Pricing of catastrophe bonds
0.07
0.04
0.01
-0.03
-0.06
2.54 8.14 13.73 19.32 24.91
0.60
0.95
1.30
1.65
2.00
Figure 12.9: The difference between zero-coupon CAT bond prices in the NHP1 and the NHP2 case with respect to the time to expiry (left axis) and the threshold level (right axis) under the Burr assumption.
Bibliography Aase, K. (1999). An Equilibrium Model of Catastrophe Insurance Futures and Spreads, The Geneva Papers on Risk and Insurance Theory 24: 69–96. Barton, C. and Nishenko, S. (1994). Natural Disasters – Forecasting Economic and Life Losses, SGS special report. Baryshnikov, Yu., Mayo, A. and Taylor, D.R. (1998). Pricing of CAT bonds, preprint. Bremaud, P. (1981). Point Processes and Queues: Martingale Dynamics, Springer, New York. Burnecki, K., Kukla, G. and Weron R. (2000). Property Insurance Loss Distributions, Phys. A 287: 269–278. Burnecki, K. and Kukla, G. (2003). Pricing of Zero-Coupon and Coupon Cat Bonds, Appl. Math. (Warsaw) 30(3): 315–324. Burnecki, K. and Weron, R. (2008). Visualization Tools for Insurance, in Handbook of Data Visualization, Ch. Chen, W. Hrdle, and A. Unwin (eds.), Springer, Berlin, 899–920. Canter, M. S., Cole, J. B. and Sandor, R. L. (1996). Insurance Derivatives: A New Asset Class for the Capital Markets and a New Hedging Tool for the Insurance Industry, Journal of Derivatives 4(2): 89–104. Cummins, J. D. and Danzon, P. M. (1997). Price Shocks and Capital Flows in Liability Insurance, Journal of Financial Intermediation 6: 3–38. Cummins, J. D., Lo, A. and Doherty, N. A. (1999). Can Insurers Pay for the “Big One”? Measuring the Capacity of an Insurance Market to Respond to Catastrophic Losses, preprint, Wharton School, University of Pennsylvania. D’Agostino, R.B.and Stephens, M.A. (1986). Goodness-of-Fit Techniques, Marcel Dekker, New York.
390
Bibliography
Doherty, N. A. (1997). Innovations in Managing Catastrophe Risk, Journal of Risk & Insurance 64(4): 713–718. Duffie, D. and Singleton, K. J. (1999). Modelling Term Structures of Defaultable Bonds, The Review of Financial Studies 12(4): 687–720. Embrechts, P., Resnick, S. I. and Samorodnitsky, G. (1999). Extreme Value Theory as a Risk Management Tool, North American Actuarial Journal 3(2): 30–41. Embrechts, P. and Meister, S. (1997). Pricing Insurance Derivatives: The Case of Cat-Futures, in Securitization of Insurance risk, 1995 Bowles Symposium, SOA Monograph M-F197-1: 15–26. Froot, K. and O’Connell P. (1997). On the Pricing of Intermediated Risks: Theory and Application to Catastrophe Reinsurance, NBER Working Paper No. w6011. H¨ ardle, W. and Lopez, B. (2010). Calibrating CAT Bonds for Mexican Earthquakes, Journal of Risk and Insurance, to appear. Jarrow, R. A. and Turnbull, S. (1995). Pricing Options on Financial Securities Subject to Default Risk, Journal of Finance 50: 53–86. Kau, J. B. and Keenan, D. C. (1996). An Option-Theoretic Model of Catastrophes Applied to Mortgage Insurance, Journal of Risk and Insurance 63(4): 639–656. Klein, C. (2010). Reinsurance Market Review 2010, Guy Carpenter & Company, Inc. Klugman, S. A., Panjer, H.H., and Willmot, G.E. (2008). Loss Models: From Data to Decisions (3rd ed.), Wiley, New York. Lane, M.N. (2001). Rationale and Results with the LFC CAT Bond Pricing Model, Lane Financial L.L.C. Lane, M.N. (2004). The Viability and Likely Pricing of “CAT Bonds” for Developing Countries, Lane Financial L.L.C. Litzenberger, R.H., Beaglehole, D. R., and Reynolds, C. E. (1996). Assessing Catastrophe Reinsurance-Linked Securities as a New Asset Class, Journal of Portfolio Management (December): 76–86. McGhee, C. (2004). Market Update: The Catastrophe Bond Market at YearEnd 2003, Guy Carpenter & Company, Inc.
Bibliography
391
McGhee, C., Clarke, R., Fugit, J., and Hathaway, J. (2008). The Catastrophe Bond Market at Year-End 2007, Guy Carpenter & Company, Inc. Sigma (1996). Insurance Derivatives and Securitization: New Hedging Perspectives for the US Catastrophe Insurance Market?, Report Number 5, Swiss Re. Sigma (1997). Too Little Reinsurance of Natural Disasters in Many Markets, Report Number 7, Swiss Re. Sigma (2003). The Picture of ART, Report Number 1, Swiss Re. Sigma (2009). The role of indices in transferring insurance risks to capital markets, Report Number 4, Swiss Re. Winter, R. A. (1994). The dynamics of competitive insurance markets, Journal of Financial Intermediation 3: 379–415. Zhou, C. (1994). A Jump Diffusion Approach to Modelling Credit Risk and Valuing Defaultable Securities, preprint, Federal Board.
13 Return distributions of equity-linked retirement plans Nils Detering, Andreas Weber, and Uwe Wystup
13.1 Introduction In the recent years an increasing demand for capital guaranteed equity-linked life insurance products and retirement plans has emerged. In Germany, a retirement plan, called Riester-Rente, is supported by the state with cash payments and tax benefits. Those retirement plans have to preserve the invested capital. The company offering a Riester-Rente has to ensure that at the end of the saving period at least all cash inflows are available. Due to the investors demand for high returns, banks and insurance companies are not only offering saving plans investing in riskless bonds but also in products with a high equity proportion. For companies offering an equity-linked Riester-Rente the guarantee to pay out at least the invested capital is a big challenge. Due to the long maturities of the contracts of more than 30 years it is not possible to just buy a protective put. Many different concepts are used by banks and insurance companies to generate this guarantee or to reduce the remaining risk for the company. They vary from simple Stop Loss strategies to complex dynamic hedging strategies. In our work we analyze the return distribution generated by some of these strategies. We consider several examples: • A classical insurance strategy with investments in the actuarial reserve fund. In this strategy a large proportion of the invested capital is held in the actuarial reserve fund to fully generate the guarantee. Only the remaining capital is invested in products with a higher equity proportion. The actuarial reserve fund is considered riskless. It usually guarantees a minimum yearly interest rate. • A Constant Proportion Portfolio Insurance strategy (CPPI) which is similar to the traditional reserve fund in that it ensures not to fall below a
P. Čížek et al. (eds.), Statistical Tools for Finance and Insurance, DOI 10.1007/978-3-642-18062-0_13, © Springer-Verlag Berlin Heidelberg 2011
393
394
13 Return distributions of equity-linked retirement plans certain floor in order to generate the guarantee. In contrast to the traditional strategy the amount necessary to generate the guarantee is not fully invested in the riskless products. The amount invested in the more risky equity products is leveraged for a higher equity exposure. Continuous monitoring ensures that the guarantee is not at risk, since the equity proportion is reduced with the portfolio value becoming closer to the floor.
• A stop loss strategy where all the money is invested into pure equity until the floor is reached. If this happens all the invested capital is shifted into the riskless products in order to provide the guarantee at the end. There are also equity-linked life insurance guarantees sold in Germany. In these products the insurance company promises to pay out the maximum of the invested amount and an investment in an equity fund reduced by a guarantee cost (usually yearly as a percentage of the fund value). The return distribution of these products highly depends on the guarantee cost. Due to the long maturities of the contracts, the pricing of this guarantee cost is not straightforward and the price is extremely dependent on the model which is chosen. For this reason they are not included in this comparative study. An introduction to equity-linked guarantees and their pricing can be found in (Hardy, 2003). We simulate the return distribution for the different strategies and for different investment horizons. We analyze how fee structures, often used by insurance companies, affect the return distribution. We also study the impact of the cash payments of the state. Therefore we analyze an investment plan which maximizes the federal support. To model the distribution we extend the jump diffusion model by Kou (2002) by allowing the jumps to be displaced. Therefore, we go beyond the classical Black-Scholes model (Black and Scholes, 1973) and explicitly allow for jumps in the market as we could observe them within the last two years. One reason for using a jump diffusion model is that it better represents reality and therefore the rebalancing between equity and fixed income funds in the strategies under consideration is more realistic. The second reason is that Stop Loss and CPPI both only generate a sure guarantee in a market without jumps. Therefore, we analyze how often a CPPI and a Stop Loss strategy fail if we allow for jumps. We find out that one of the driving factors of the return distribution is the fee structure of the contract. For some contracts only 85% of the invested capital is actually invested in the strategy. The remaining 15% is taken by the insurance or bank as sales and maintenance fees. With the yearly management
13.2 The displaced double-exponential jump diffusion model
395
fees of the underlying funds the return distribution is additionally weakened. Another very important factor of the return distribution is the guarantee concepts. Due to an equity exposure ranging from 36% to 100% percent the return distributions vary significantly.
13.2 The displaced double-exponential jump diffusion model 13.2.1 Model equation As mentioned in Section 13.1, we would like to go beyond the classical BlackScholes Model and allow for exponentially distributed jumps in the market. We extend the model by Kou (Kou, 2002) by allowing the jumps to be displaced. The governing equation for the displaced jump diffusion model (DDE) is
dSt St− ST
=
⎛ ⎞ Nt μ dt + σ dWt + d ⎝ (Vj − 1)⎠ ,
(13.1)
j=1
=
N6 T −t σ2 St exp μ − − δ τ + σWT −t Vj , 2
(13.2)
j=1
where (Wt ) is a standard Brownian motion, (Nt ) a Poisson process with intensity λ > 0 and Vj independent identically distributed random variables Vj ∼ eY , where Y represents the relative jump size with a minimal jump of κ, therefore leading to jumps of Y in the range (−∞, −κ] ∪ [κ, +∞), with parameters • μ denoting the expected drift, • σ denoting the volatility, • λ denoting the expected number of jumps per year,
396
13 Return distributions of equity-linked retirement plans
Sample DDE path 12
10
value
8
6
4
2
0 0
5
10
15
20
25
30
35
40
time in years
Figure 13.1: Displaced Double-Exponential jump process: simulated paths with parameters T = 35 years, μ = 6%, σ = 14.3%, λ = 5.209, κ = 2.31%, η1 = η2 = η = 1/1.121%, p = 0.5, S0 = 1. • δ the drift adjustment which is chosen such that the process St has the desired drift μ. The processes (Wt ), (Nt ), and the random variables Vj are all independent. We illustrate sample paths drawn from this distribution for a period of 35 years in Figure 13.1. Except for the drift, the parameters are estimated to resemble the daily log returns of the MSCI World index for the last thirty years. For the drift we simulate different scenarios. We do not intend to simulate actively managed funds. We assume a minimal jump size in order to distinguish between jumps arising from the Poisson process and the Brownian motion. The minimum jump size
13.2 The displaced double-exponential jump diffusion model
397
DDE density 50
40
fY (y)
30
20
10
0 -0.1
-0.05
0
0.05
0.1
y
Figure 13.2: Displaced Double-Exponential density of Y with parameters κ = 2.31%, η1 = η2 = η =1/1.121%, p = 0.5 is chosen in order to qualify the 1% lowest and 1% highest daily log returns as jumps. We chose the jumps Y to be exponentially distributed assuming only values outside the interval (−κ, +κ). Therefore, the jump part of the process has the density ⎧ ⎨ pη1 e−(y−κ)η1 0 fY (y) = ⎩ (1 − p)η2 e(y+κ)η2 with η1 > 1, η2 > 0 and 0 ≤ p ≤ 1.
if y ≥ κ, if |y| < κ, if y ≤ −κ,
(13.3)
398
13 Return distributions of equity-linked retirement plans
13.2.2 Drift adjustment Similar to the work of Kou (Kou, 2002) we calculate the drift adjustment for the jump process by δ
E[eY − 1] e+κ e−κ = λ pη1 + (1 − p)η2 −1 . η1 − 1 η2 + 1 =
(13.4)
In this paper we use η = η1 = η2 and p = 0.5.
13.2.3 Moments, variance and volatility The variance of the random number log as St Var log S0
St S0
of the process (13.2) can be written
2
= σ t + Var
N t
Uk (κ + Hk )
k=1
= σ 2 t + λt((κ + h)2 + h2 ), where the Hk are independent exponentially distributed random numbers with expectation h = η1 . Uk is a random variable which takes the value of +1 and −1 with probability 12 . We calculate the first two moments
E
N t
Uk (κ + Hk )
k=1
= =
∞ n=0 ∞ n=0
= 0,
E
n
Uk (κ + Hk ) · P[Nt = n]
k=1
n · 0 · P[Nt = n]
13.3 Parameter estimation
E
399
N t
2 Uk (κ + Hk )
k=1
=
∞ n=0
E
n
2 Uk (κ + Hk )
· P[Nt = n]
k=1 2
= λt((κ + h) + h2 ). For the variance we obtain
Var
N t
Uk (κ + Hk )
k=1
=
E
N t
2
Uk (κ + Hk )
−
E
k=1
=
N t
!2 Uk (κ + Hk )
k=1
λt((κ + h)2 + h2 ).
Finally, the volatility of the DDE-process is 2
1 St Var log t S0
=
σ 2 + λ((κ + h)2 + h2 ).
(13.5)
13.3 Parameter estimation 13.3.1 Estimating parameters from financial data We estimate the parameters based on the historical data of the MSCI Daily TR (Total Return) Gross (gross dividends reinvested) in USD for the period between January 1 1980 and October 2 2009. We denote these prices with x0 , x1 , . . . , xN and the log-returns by Δ
ri = log
xi , xi−1
i = 0, 1, . . . , N.
(13.6)
400
13 Return distributions of equity-linked retirement plans
Table 13.1: Estimated parameters for the DDE-process. Parameter Total volatility σ ˆtot Volatility of the diffusion part σ ˆ Jump intensity λ Minimum jump size κ Expected jump size above minimum jump size h Drift adjustment δ
Value 14.3% 11.69% 5.209 2.31% 1.121% 0.339%
The estimate for the daily log return is
r¯ =
N 1 ri . N i=1
(13.7)
The estimate for the total volatility σ ˆtot is
2 σ ˆtot
=
#Prices per year N −1
N
! ri2 − N r¯2
.
(13.8)
i=1
To determine the parameters for the jump process we have to define a level κ such that ri with ri ≥ κ is considered to be a jump. To determine this κ we define for a given level u ∈ [0, 1] the u% smallest and u% biggest daily log returns The level u should be chosen such that the resulting returns are intuitively considered as jumps. If u is chosen too high, even small log-returns are considered jumps, and if u is too low, almost no jumps occur. Of course, this level is subjective. We have chosen u = 1% because in this case only daily changes of more than 2% are considered to be a jump. Changes of less than 2% can be explained with the diffusion part with sufficiently high probability. It turns out that for the analyzed MSCI World Index, the smallest up-jump and the smallest down-jump is almost equal. The absolute values of the relative jump size is on average 2.31%. We use this minimal jump size as an estimator for κ The value η of the single parameter exponential distribution is chosen such that η fits the mean of the observed jumps. From the financial data and
13.4 Interest rate curve
401
the already fixed parameters we obtain h = 1/η with η = 1.21%. The number of jumps divided by the total number of observations yields an estimate for the jump frequency. Annualizing this frequency we can estimate λ to be 5.21. Finally we have to correct the estimator for the volatility according to Equation (13.5) since the volatility consists of the jump part and the diffusion part. We summarize the estimated parameters in Table 13.1.
13.4 Interest rate curve To calculate the current value of the future liability (floor) and the performance of the riskless investments, we use the zero bond curve as of October 1 2009. The curve is extracted from the money market and swap rate quotes on Reuters. This curve is static and not simulated. Interpolation is done linear in the rates. See Table 13.2 for the calculated discount factors and the market quotes as seen on Reuters.
13.5 Products In this section we describe the analyzed products and the assumptions we made.
13.5.1 Classical insurance strategy with investment in the actuarial reserve fund The current value of the future liability is calculated and a sufficient amount to meet this liability in the future is invested in the actuarial reserve fund. The actuarial reserve fund is assumed to be riskless and accrues the interest implied by the current zero bond curve but at least the currently guaranteed interest of 2.25%. The return participation is added to the contract once a year. Only the excess amount which is not needed for the guarantee is invested in the risky asset. We assume that the calculation of the amount needed to meet the future liability is based on the guaranteed interest rate of 2.25%. See Figure 13.3 for a picture of the exposure distribution in the classical insurance case.
402
13 Return distributions of equity-linked retirement plans
Table 13.2: Extracted discount factors from money market quotes and swap rates. Date 11/3/2009 12/3/2009 1/4/2010 2/3/2010 3/3/2010 4/5/2010 5/3/2010 6/3/2010 7/5/2010 8/3/2010 9/3/2010 10/4/2010 10/3/2011 10/3/2012 10/3/2013 10/3/2014 10/5/2015 10/3/2016 10/3/2017 10/3/2018 10/3/2019 10/5/2020 10/4/2021 10/3/2022 10/3/2023 10/3/2024 10/3/2029
Instrument Money market Money market Money market Money market Money market Money market Money market Money market Money market Money market Money market swap swap swap swap swap swap swap swap swap swap swap swap swap swap swap swap
Rate 0.43% 0.59% 0.75% 0.84% 0.92% 1.01% 1.06% 1.10% 1.14% 1.18% 1.20% 1.20% 1.71% 2.15% 2.46% 2.71% 2.92% 3.15% 3.24% 3.36% 3.46% 3.55% 3.64% 3.72% 3.79% 3.84% 3.99%
Discount factor 99.96% 99.89% 99.80% 99.71% 99.61% 99.48% 99.37% 99.26% 99.13% 99.01% 98.88% 98.80% 96.64% 93.76% 90.64% 87.30% 83.85% 80.11% 77.01% 73.71% 70.47% 67.25% 64.09% 61.07% 58.18% 55.43% 44.76%
13.5.2 Constant proportion portfolio insurance The Constant Proportion Portfolio Insurance structure (CPPI) works similar to the classical strategy of investments in the actuarial reserve fund. The difference is that instead of investing only the excess amount into the risky asset the excess amount is leveraged in order to allow for a higher equity participa-
13.5 Products
403
Actuarial fund 1.4
floor pure equity fund equity exposure actuarial fund total
1.2
value
1 0.8 0.6 0.4 0.2 0 0
2
4
6
8
10
12
time in years
Figure 13.3: Simulated path for a classical insurance strategy and 10 year investment horizon
tion. The investment is monitored on a continuous basis to guarantee that the investment doesn’t fall below the floor. With F being the floor of the future obligations, N AV the net asset value of the fund and a the leverage factor, the rebalancing equation for the risky asset R is R = max (a (N AV − F ) , N AV ) .
(13.9)
The leverage factor determines the participation on the equity returns. The higher the leverage factor a, the higher the participation on positive return but also the risk to reach the floor. For a leverage factor of a = 1 the structure becomes static when assuming constant interest rates. A commonly used value for a in the industry is 3. For a sample path of a CPPI and the equity and fixed income distribution structure with leverage factor 3 see Figure 13.4. This strategy guarantees 100% capital protection in a continuous model. In a model with jumps this is not the case anymore. The strategy is subject to gap risk,
404
13 Return distributions of equity-linked retirement plans
CPPI 1.4
floor pure equity fund equity exposure CPPI fixed income CPPI total CPPI
1.2
value
1 0.8 0.6 0.4 0.2 0 0
2
4
6
8
10
12
time in years
Figure 13.4: Simulated CPPI path with leverage factor 3 and 10 years investment horizon
the risk that due to a jump in the market, rebalancing is not possible and the fund value drops below the floor. We neglect liquidity issues here which would cause an additional risk. However due to the continuous reduction of the equity exposure this risk is rather small. See for example (Black and Perold, 1992) for the general theory of constant proportion portfolio insurance.
13.5.3 Stop loss strategy In the stop loss strategy 100% of the equity amount is held in the risky fund until the floor is reached. In this case all the investment is moved to the fixed income fund to generate the guarantee at maturity. See Figure 13.5 for a path where the stop loss barrier is reached and all the investment is shifted into the fixed income fund. This strategy is riskless as the CPPI strategy in a continuous
13.6 Payments to the contract and simulation horizon
405
Stop Loss 1.4
total StopLoss 1.2
value
1 0.8 0.6 0.4 0.2 0 0
2
4
6
8
10
12
time in years
Figure 13.5: Simulated stop-loss path with 10 years investment horizon
model. In a model with jumps again we are imposed to gap risk. We neglect liquidity issues here which actually forces the insurer to liquidate the risky asset before it reaches the floor level. This issue becomes especially important for large funds, which is often the case for retirement plans or life insurance products since the amount to liquidate is so big that it actually influences the market.
13.6 Payments to the contract and simulation horizon Since capital guaranteed life insurance products are especially popular in Germany under the master agreement of the Riester-Rente, we consider a typical payment plan with an horizon of 20 years. To be eligible for the maximal
406
13 Return distributions of equity-linked retirement plans
amount of cash payments and tax benefits the insured has to spend at least 4% of his yearly gross income for the insurance product, including the payments from the state but no more than 2,100 EUR. In this case he receives cash payments of 154 EUR per year, and additional 185 EUR for each child born before January 1 2008 and 300 EUR for each child born on or after this date. Even though this is not the focus of this paper, we assume a saving plan that allows for these benefits in order to compare typical cost structure against the state benefits. We consider the situation of a person being 45 years old when entering the contract and earning 30,000 EUR a year. We further assume that he has one child born after January 1 2008 but before entering the contract. In this case the insured receives 454 EUR from the state, so he actually only has to pay 746 EUR per year to reach 1,200 per year (4% of his income). This is a very high support rate of 37%. For comparison, if we take an investor without children, earning 52,500 EUR per year, the support rate would only be maximal 7.3%. We assume a monthly payment of 100 EUR and do not distinguish between payments made by the state and by the insured. The total nominal amount is 20 × 1, 200 = 24, 000 EUR. This is the amount the issuer of the plan needs to guarantee at retirement. There is no guarantee during the lifetime of the contract. Especially in the case that the insured dies before retirement, the payments to the contract are not guaranteed and only the current account value can be transfered to another contract or payed out. In case the contract is payed out, payments form the state will be claimed back.
13.7 Cost structures We study the impact of different cost structures which are often seen in insurance products. Often the fee structure of these products is rather complex and consists of a combination of various fees. • Sales and Distribution cost: These costs are usually charged to pay a sales fee for the agent who closed the deal with the insured. These fees are usually dependent on the total cash contracted to pay into the contract until maturity. However they are usually charged uniformly distributed over the first 5 years of the contract. In insurance business they are called α-cost. • Administration cost: These costs are usually charged on the cash payments to the contract during the entire lifetime of the contract. They are usually charged to cover administrative costs of the contract. In insurance
13.8 Results without costs
407
Table 13.3: Model parameters for the different scenarios. The other model parameters are taken from the estimates in Table 13.1 Scenario Standard Optimistic Pessimistic
drift μ 6% 8% 4%
volatility σtot 14.3% 14.3% 14.3%
business they are called β-cost. • Capital management cost: These costs are charged based on the sum of the payments up to the effective date. They are usually charged for capital management. To compare the impact of these different cost structures we analyze costs that are equivalent in terms of the current value. We assume a total fee of 4% of all payments to the contract, i.e. 960 EUR. The current value of these fees with the applied zero bond curve is 681 EUR. Since the typical costs in insurance products are usually a combination of all these fees we also simulate the impact of the sum of these single fees which is a commonly used cost charge.
13.8 Results without costs We present the simulation results for different scenarios. Since the drift is hardly estimated from the past realization, we calculate the results for three different drift assumptions. In Figure 13.6 it can be seen how the distribution varies between the different strategies. The stop loss strategy has an expected distribution very close to the pure equity investment since it has the highest equity participation as can be seen in Table 13.4,Table 13.5 and Table 13.6. So, for the bullish investor this might be the optimal investment for his Riester-Rente. A similar return profile is provided by a CPPI structure with a high leverage factor. The advantage of the CPPI product in practice is that due to the continuous reduction of the exposure if the market is performing badly, the liquidity issue is smaller than for the stop loss strategy. However, as can be seen in Figure 13.6, the risk
408
13 Return distributions of equity-linked retirement plans
Actuarial fund 0.08 0.07 0.06 0.05 0.04 0.03 0.02 0.01 0
CPPI leverage 1.5 0.05 0.04 0.03 0.02 0.01 0
20 40 60 80 100 120 140 160
20 40 60 80 100 120 140 160
CPPI leverage 2.0 0.05 0.04 0.03 0.02 0.01 0
CPPI leverage 3.0 0.06 0.05 0.04 0.03 0.02 0.01 0
20 40 60 80 100 120 140 160
20 40 60 80 100 120 140 160
CPPI leverage 4.0 0.08 0.07 0.06 0.05 0.04 0.03 0.02 0.01 0
Stop Loss 0.14 0.12 0.1 0.08 0.06 0.04 0.02 0
20 40 60 80 100 120 140 160
20 40 60 80 100 120 140 160
Figure 13.6: Return distribution of the different strategies. We list the capital available at retirement (in units of 1,000 EUR) on the x-axes. of returns close to zero is rather high for both, the stop loss and the CPPI with a high leverage factor. For the bearish investor a classical product with an investment mainly in the actuarial fund or a CPPI product with a small leverage factor could be the better choice.
13.9 Impact of costs
409
Table 13.4: Results in the standard scenario. Strategy Actuarial CPPI Leverage CPPI Leverage CPPI Leverage CPPI Leverage Stop Loss
1.5 2 3 4
mean 40945 43636 44731 45211 45326 45443
median 39393 38826 38129 39726 40489 40867
exposure 36.54% 73.48% 87.83% 94.03% 95.56% 96.65%
Table 13.5: Results in the optimistic scenario. Strategy Actuarial CPPI Leverage CPPI Leverage CPPI Leverage CPPI Leverage Stop Loss
1.5 2 3 4
mean 45055 53074 55850 56765 56942 57059
median 42991 45937 48409 50825 51126 51252
exposure 38.91% 79.01% 92.66% 97.14% 98.06% 98.69%
Table 13.6: Results in the pessimistic scenario. Strategy Actuarial CPPI Leverage CPPI Leverage CPPI Leverage CPPI Leverage Stop Loss
1.5 2 3 4
mean 37706 37160 36892 36698 36640 36693
median 36541 34041 32332 30838 30751 32118
exposure 34.25% 67.81% 81.77% 89.19% 91.32% 92.96%
13.9 Impact of costs It can be seen that the fees have a high impact on the return distribution. Even if the fees have actually the same current value, the impact on the return distribution is different. The alpha cost weakens the expected return most since
410
13 Return distributions of equity-linked retirement plans
Table 13.7: Results for the actuarial reserve product in the standard scenario with costs. Cost No cost Alpha cost Beta cost Cost on accumulated payments Sum of all fees
mean 40945 38923 39120 39231 35387
median 39393 37678 37747 37782 34422
exposure 36.54% 30.89% 33.83% 35.17% 25.93%
Table 13.8: Results in the CPPI strategy with leverage factor 3 with costs in the standard scenario. Cost No cost Alpha cost Beta cost Cost on accumulated payments Sum of all fees
mean 45326 43236 43433 43546 39506
median 40486 38423 38496 38566 33966
exposure 95.56% 94.36% 94.42% 94.51% 91.13%
it decreases the exposure at the beginning of the saving period. This impact is very high for the actuarial reserve product which even without fees only has an average equity participation of 36.54%. The alpha fee reduces this further to 30.9% as can be seen in Table 13.7. The fees on the accumulated payments has the least impact since these are mainly charged at the end of the saving period and therefore the impact on the equity exposure is smaller. The insured has to carefully study whether the negative impact of the fee structure is actually fully compensated by the federal cash payments. This highly depends on the cost structure which varies massively between the different products and on the income and family situation of the insured, which in turn determines the amount of cash benefits from the state. In many cases it may be advisable to choose a product outside the class of Riester-Rente that has a smaller cost ratio. In this case the investor can buy less costly products and can freely choose a product without capital protection, which has a higher expected return. A more detailed analysis of the different cost structures can be found in (Detering, Weber and Wystup, 2009).
13.10 Impact of jumps
411
Table 13.9: Results in the stop loss strategy with costs in the standard scenario. Cost No cost Alpha cost Beta cost Cost on accumulated payments Sum of all fees
mean 45443 43376 43577 43706 39747
median 40869 38918 39044 39139 35141
exposure 96.65% 95.62% 95.67% 95.75% 92.83%
13.10 Impact of jumps The CPPI and the Stop Loss strategy are risk free in a continuous equity model. However, in a model with jumps, we are exposed to gap risk, which means that the value of portfolio of risky assets can fall below the floor. In this case the strategy fails to generate the guarantee. In practice the leverage factor is chosen such that even a very big jump in the market still maintains the guarantee. For example, a 20% jump in the market doesn’t cause a loss for a rebalanced portfolio if the leverage factor is below 5. In this case all the equity exposure would be lost, but the guarantee could exactly be generated. So, with a moderate leverage factor the remaining gap risk is negligible. This is also reflected in our jump diffusion model. For the stop loss strategy the situation is different since the equity exposure does not decrease when approaching the floor level. We calculate how often the strategy fails in case of a jump on average. The expected number of guarantee shortfalls is shown in Table 13.10. We assume that after the stop loss level is reached and all money is invested in the risk free fund it stays there until maturity. Only new cash flows are again invested in the risky fund. Therefore, it might happen that there are several guarantee shortfalls in one path. The shortfall number is the accumulated sum of shortfalls over one path. We show the number of shortfalls for a contract without fees and for a contract with fees. The fees increase the risk of a shortfall since the insurance company has to guarantee the cash payments made by the insured. Table 13.10 shows that in the CPPI strategy the guarantee is never at risk. The shortfalls in the situation with fees are actually not caused by large jumps, but are caused by high fees close to maturity. At this time the fee charge on the incoming cash flow is so high compared to the difference between invested
412
13 Return distributions of equity-linked retirement plans
Table 13.10: Shortfalls for stop loss and CPPI for 100,000 simulations. Product CPPI, factor 3, no fees CPPI, factor 3, with fees Stop loss, no fees Stop loss, with fees
number of paths with gap 0 265 18804 31317
average realized gap 0 50 219 354
amount and its present value that the cash contribution does not suffice to ensure the guarantee. In case of a low performing path history the investments done earlier are not able to compensate for this.
13.11 Summary We have compared the performance of savings plans within the class of difference capital guarantee mechanisms: from the stop loss to classic investments in actuarial reserve funds. CPPI strategies with different leverage factors can be viewed as a compromises between these two extremes. In bullish markets savings plans with a high equity ratio perform the best, in bearish markets the classic insurance concept shows better returns. A stop loss strategy suffers from gap risk, whence a CPPI strategy combines the strength of both gap risk minimization and equity ratio maximization. The effect of fees on the savings plans dominates the performance, especially in typical fee structures found in the German Riester-Rente. The private investor is advised to check carefully if the federal cash payments can compensate the fees taking into account his own salary and tax situation.
Bibliography Black, F. and Scholes, M. (1973). The pricing of Options and Corporate Liabilities, Journal of Political Economy 81 (3): 637-654. Black, F. and Perold, A.R. (1992). Theory of constant proportion portfolio insurance, Journal of Economic Dynamics and Control 16 : 403–426. Detering, N., Weber, A. and Wystup, U. (2009). Riesterrente im Vergleich Eine Simulationsstudie zur Verteilung der Rendite im Auftrag von EuroMagazin, MathFinance Research Paper. Hardy, M. (2003). Investment Guarantees: Modelling and Risk Management for Equity-Linked Life Insurance, Wiley Finance. Kou, S.G. (2002). A Jump-Diffusion-Model for option pricing, Management Science 48 (8): 1086–1101.
Index α-cost, 406 α-stable L´evy motion, 339 β-cost, 407 actuarial reserve fund, 393 adaptive weights smoothing, 123 algorithm, 124 adjustment coefficient, 331 asset risk, 201 autoregressive model continuous, 164 bandwidth choice, 111 bankruptcy prediction, 225 financial ratios, 226 Bessel function modified, 68 beta function, 60 incomplete, 60, 76 binomial theorem, 76 Brownian motion, 337 Carr-Madan approach, 144 casualty insurance, 349 catastrophe bond, 371 coupon-bearing, 384 pricing, 375 trigger, 371 with only coupon payments, 383 zero-coupon, 382
central limit theorem, 22 characteristic function, 65, 87, 89 claim aggregate process, 293 arrival process, 294 capping, 358 catastrophic, 355, 358 frequency, 354, 356, 365 incurred, 358 paid, 358 severity, 356, 359 table, 350 composition approach, 307 conditional heteroscedasticity model autoregressive, 104 estimation, 105 generalized autoregressive, 104 generalized linear, 104 varying coefficient, 109 autoregressive, 109 estimation, 109, 110 counting process, 294 CPPI (Constant proportion portfolio insurance), 402 cumulant generating function, 60, 83 Danish fire losses, 321 defaultable bond, 376 degree days
P. Čížek et al. (eds.), Statistical Tools for Finance and Insurance, DOI 10.1007/978-3-642-18062-0, © Springer-Verlag Berlin Heidelberg 2011
415
416 cooling (CDD), 165 heating (HDD), 165 delta pillar, 151 derivatives weather, → temperature derivatives deviance residuals, 361 diagnosing models, 361 discriminant analysis, 225 dispersion trading, 216, 217, 219 distance correlation measures, 252 entropy, 262 Manhattan, 252, 253 noise influence, 255 noise influence, 254 Theil index, 263 ultrametric, 252, 253 noise influence, 257 distance matrix, 251 analysis of, 263 distribution Burr, 313, 380 Cauchy, 24 chi-squared, 307 Cobb-Douglas, 310 Erlang, 307 exponential, 304, 311, 316, 334, 380 mixture, 305, 380 exponential family, 352, 354 canonical parameter, 352 dispersion parameter, 352 gamma, 307, 311 variance, 69 Gaussian, 24 generalized hyperbolic, 67, 87 density function, 39 maximum likelihood estimation, 42
Index mean, 39 simulation, 40 tails, 39 variance, 39 generalized inverse Gaussian (GIG), 37 simulation, 40–42 goodness-of-fit, 320 p-value, 320 heavy-tailed, 67, 76, 312, 316, 331, 380 hyperbolic, 36, 37, 69 density function, 38 inverse Gaussian (IG) simulation, 41 L´evy, 24 L´evy stable, 23 light-tailed, 331 log-normal, 309, 380 negative binomial, 307 normal, 337 mixture, 71 skew, 85 normal inverse Gaussian (NIG), 68, 90 density function, 40 simulation, 41 tail estimates, 43 normal mixture, 70 Pareto, 311, 313, 380 Pearson’s Type III, 307 Poisson, 297, 355 raw moment, 304, 309 semi-heavy- tailed, 311 stable, 22, 23, 72 S parametrization, 24 S 0 parametrization, 24 density function, 25 direct integration method, 26 distribution function, 25
Index maximum likelihood estimation, 32 mixture, 72 Paretian, 65 regression method, 31, 32 simulation, 28 STABLE program, 28, 33 stable Paretian, 23 Student, 60, 61 generalized asymmetric, 62 hyperbolic, 69 mixture, 73 noncentral, 63 skewed, 61, 63 tempered stable (TSD), 34 truncated L´evy, 34 characteristic function, 34 truncated stable, 34 variance-gamma, 40 Weibull, 314, 380 domain of attraction, 337 Dow Jones Industrial Average (DJIA), 21, 44 empirical distribution function (edf), 303 statistic, 318 entropy index, 252 Euler discretization, 169 expected shortfall, 57, 58, 84 exposure table, 350 Fast Fourier Transform (FFT), 92, 145 fees, 406 filtered return, 45 Fourier function truncanted, 164 truncated, 182 Fourier series estimator, 182
417 Fourier–Mellin integral, 60 gamma, 204, 359 swaps, 214 gamma function, 307 incomplete, 76 gap risk, 411 GARCH, 104, 163 generalized linear models (GLM), 349, 351 link function, 351, 353, 355 variance function, 353 graph, 263 edge, 263 node, 263 path, 263 Heston model, 134 calibration, 150, 152 characteristic function (cf), 144 correlation, 149 density, 135 drift, 134 Feller condition, 142 greeks, 139 initial variance, 149 long-run variance, 134, 149 rate of mean reversion, 134, 149 simulation, 136, 141 vol of vol, 135, 149 volatility of variance, 135 heteroskedasticity, 93 HPP, 295, 300, 305, 324 simulation algorithm conditional theorem, 295 waiting times, 295 implied volatility, 151 incomplete gamma function, 307 independent components analysis (ICA), 82, 92
418 insurance data, 350 product, 350 insurance-linked security, 371 intensity function, 297 inter-arrival time, 294 interacted variables, 365 interval of homogeneity, 116 inverse transform method, 299, 305, 312 inversion formula, 59, 87 inversion method, 305 Itˆo formula multidimensional, 168 Itˆ o’s lemma, 204
Index loss model, 293
market price of risk, 163 market price of volatility risk, 137, 151 martingale, 163 maximum likelihood estimation, 74 quasi, 105 kernel, 109 mean absolute error (MAE), 106 mean excess function, 315, 321 mean residual life function, 315 memoryless property, 305 Mercer conditions, 232 mixing matrix, 93, 94 modified Bessel function, 38 jump-diffusion model, 158 moment displaced double-exponential (DDE), generating function, 83, 85 395 generation function, 59 with stochastic volatility, 158 lower partial, 57, 76 MPP, 300 kernel simulation algorithm function, 232 conditional theorem, 301 Gaussian, 233 uniformity property, 301 hyperbolic tangent, 233 polynomial, 233 negentropy, 93 NHPP, 297, 324 technique, 227 Kuhn-Tucker conditions, 231 intensity function, 297, 299 kurtosis, 93 rate function, 297 simulation algorithm Laplace transform, 305, 309 conditional theorem, 299 least squares integration, 299 kernel, 110 thinning, 298 Lewis-Lipton approach, 146 non-homogeneous Poisson linear regression, 354 process (NHPP), 380 local linear regression, 163 local volatility, 158 option log-normal distribution, 309 delta-hedged, 203 logit, 225 gamma, loss distribution, 302 → gamma
Index PACF, 163, 179 Pareto distribution, 311 pointwise adaptive estimation, 114 Poisson process, 395 doubly stochastic, 376 homogeneous, → HPP mixed, → MPP non-homogeneous, → NHPP, 330 Poisson random variable, 297 polar decomposition, 93 policy, 350 auto, 350 portfolio, 93 replication, 209 prediction error, 111, 119, 125 premium function, 297, 300, 325 premium model pure, 368 probability of default, 225 process CIR, 134 Ornstein-Uhlenbeck, 168 square root, 134 Property Claim Services catastrophe data, 379 property insurance, 349
419 classification, 237 expected, 227 of asset, → asset risk risk model, 293 collective, 293, 355 risk process, 293, 297, 300, 325, 329 aggregate claim, 293 claim arrival, 294 initial capital, 293 inter-arrival time, 294 premium, 293 safety loading, 297 waiting time, 294 robust statistics, 322 ruin probability, 330 Brownian motion diffusion approximation, 337 corrected diffusion approximation, 338 exponential distribution, 334 finite time, 330 finite time De Vylder approximation, 340 infinite time, 330 L´evy motion diffusion approximation, 338 Segerdahl normal approximation, 335 ruin theory, 329
quantile, 71, 73, 84 rate function, 297 renewal process, 301 simulation algorithm waiting times (RP1), 301 return filtered, 45 Riester-Rente, 393 risk
saddlepoint approximation, 58, 64, 83, 84 second order, 90 safety loading, 297 seasonal variation, 163 seasonality, 163 Shannon entropy, 262 simulation composition approach, 307
420 inverse transform method, 299, 305 split validation, 355 statistic Anderson and Darling (A2 ), 319, 380 Cramer-von Mises (W2 ), 319, 380 Kolmogorov (D), 318 Kolmogorov-Smirnov (D), 318, 380 Kuiper (V), 319, 380 stochastic process, 163 stochastic volatility (SV), 133 stop loss strategy, 404 support vector machine, 225 tail component, 61, 65 temperature CAT futures, 171 CAT index, 171 CDD futures, 173 derivatives, 163, 170 Asian, 177 dynamics, 167 futures, 170 risk, 163 market price, 175 test statistics supremum likelihood-ratio, 116 Theil index, 252, 263, 277 thinning, 298 Value at Risk (VaR), 21, 57 conditional, 57 Vapnik-Chervonenkis bound, 228 variance swap, 201 conditional, 213 corridor, 213
Index downward, 214 generalized, 212 greeks, 215 hedging, 203 replication, 203 volatility, 103 3G products, 211 instruments, 201 trading, 202 volatility skew, 133, 152 volatility smile, 133, 149, 151, 152 waiting time, 294 Wiener process, 163