Pairs Trading: Quantitative Methods and Analysis
GANAPATHY VIDYAMURTHY
John Wiley & Sons, Inc.
Contents

Preface
Acknowledgments

PART ONE Background Material

CHAPTER 1 Introduction
  The CAPM Model
  Market Neutral Strategy
  Pairs Trading
  Outline
  Audience

CHAPTER 2 Time Series
  Overview
  Autocorrelation
  Time Series Models
  Forecasting
  Goodness of Fit versus Bias
  Model Choice
  Modeling Stock Prices

CHAPTER 3 Factor Models
  Introduction
  Arbitrage Pricing Theory
  The Covariance Matrix
  Application: Calculating the Risk on a Portfolio
  Application: Calculation of Portfolio Beta
  Application: Tracking Basket Design
  Sensitivity Analysis

CHAPTER 4 Kalman Filtering
  Introduction
  The Kalman Filter
  The Scalar Kalman Filter
  Filtering the Random Walk
  Application: Example with the Standard & Poor Index

PART TWO Statistical Arbitrage Pairs

CHAPTER 5 Overview
  History
  Motivation
  Cointegration
  Applying the Model
  A Trading Strategy
  Road Map for Strategy Design

CHAPTER 6 Pairs Selection in Equity Markets
  Introduction
  Common Trends Cointegration Model
  Common Trends Model and APT
  The Distance Measure
  Interpreting the Distance Measure
  Reconciling Theory and Practice

CHAPTER 7 Testing for Tradability
  Introduction
  The Linear Relationship
  Estimating the Linear Relationship: The Multifactor Approach
  Estimating the Linear Relationship: The Regression Approach
  Testing Residual for Tradability

CHAPTER 8 Trading Design
  Introduction
  Band Design for White Noise
  Spread Dynamics
  Nonparametric Approach
  Regularization
  Tying Up Loose Ends

PART THREE Risk Arbitrage Pairs

CHAPTER 9 Risk Arbitrage Mechanics
  Introduction
  History
  The Deal Process
  Transaction Terms
  The Deal Spread
  Trading Strategy
  Quantitative Aspects

CHAPTER 10 Trade Execution
  Introduction
  Specifying the Order
  Verifying the Execution
  Execution During the Pricing Period
  Short Selling

CHAPTER 11 The Market Implied Merger Probability
  Introduction
  Implied Probabilities and Arrow-Debreu Theory
  The Single-Step Model
  The Multistep Model
  Reconciling Theory and Practice
  Risk Management

CHAPTER 12 Spread Inversion
  Introduction
  The Prediction Equation
  The Observation Equation
  Applying the Kalman Filter
  Model Selection
  Applications to Trading

Index
Preface
Most book readers are likely to concur with the idea that the least read portion of any book is the preface. With that in mind, and the fact that the reader has indeed taken the trouble to read up to this sentence, we promise to leave no stone unturned to make this preface as lively and entertaining as possible. For your reading pleasure, here is a nice story with a picture thrown in for good measure. Enjoy!

Once upon a time, there were six blind men. The blind men wished to know what an elephant looked like. They took a trip to the forest and with the help of their guide found a tame elephant. The first blind man walked into the broadside of the elephant and bumped his head. He declared that the elephant was like a wall. The second one grabbed the elephant’s tusk and said it felt like a spear. The next blind man felt the trunk of the elephant and was sure that elephants were similar to snakes. The fourth blind man hugged the elephant’s leg and declared the elephant was like a tree. The next one caught the ear and said this is definitely like a fan. The last blind man felt the tail and said this sure feels like a rope. Thus the six blind men all perceived one aspect of the elephant and were each right in their own way, but none of them knew what the whole elephant really looked like.
Oftentimes, the market poses itself as the elephant. There are people who say that predicting the market is like predicting the weather, because you can do well in the short term, but where the market will be in the long run is anybody’s guess. We have also heard from others that predicting the market short term is a sure way to burn your fingers. “Invest for the long haul” is their mantra. Some will assert that the markets are efficient, and yet some others would tell you that it is possible to make extraordinary returns. While some swear by technical analysis, there are some others, the so-called fundamentalists, who staunchly claim it to be a voodoo science. Multiple valuation models for equities like the dividend discount model, relative valuation models, and the Merton model (treating equity as an option on firm value) all exist side by side, each being relevant at different times for different stocks. Deep theories from various disciplines like physics, statistics, control theory, graph theory, game theory, signal processing, probability, and geometry have all been applied to explain different aspects of market behavior. It seems as if the market is willing to accommodate a wide range of sometimes opposing belief systems. If we are to make any sense of this smorgasbord of opinions on the market, we would be well advised to draw comfort from the story of the six blind men and the elephant. Under these circumstances, if the reader goes away with a few more perspectives on the market elephant, the author would consider his job well done.
PART ONE Background Material

CHAPTER 1 Introduction
We start at the very beginning (a very good place to start). We begin with the CAPM model.
THE CAPM MODEL

CAPM is an acronym for the Capital Asset Pricing Model. It was originally proposed by William F. Sharpe. The impact that the model has made in the area of finance is readily evident in the prevalent use of the word beta. In contemporary finance vernacular, beta is not just a nondescript Greek letter, but its use carries with it all the import and implications of its CAPM definition. Along with the idea of beta, CAPM also served to formalize the notion of a market portfolio. A market portfolio in CAPM terms is a portfolio of assets that acts as a proxy for the market. Although practical versions of market portfolios in the form of market averages were already prevalent at the time the theory was proposed, CAPM definitely served to underscore the significance of these market averages.

Armed with the twin ideas of market portfolio and beta, CAPM attempts to explain asset returns as an aggregate sum of component returns. In other words, the return on an asset in the CAPM framework can be separated into two components. One is the market or systematic component, and the other is the residual or nonsystematic component. More precisely, if $r_p$ is the return on the asset, $r_m$ is the return on the market portfolio, and the beta of the asset is denoted as $\beta$, the formula showing the relationship that achieves the separation of the returns is given as

$$r_p = \beta r_m + \theta_p \tag{1.1}$$

Equation 1.1 is also often referred to as the security market line (SML). Note that in the formula, $\beta r_m$ is the market or systematic component of the return. $\beta$ serves as a leverage number of the asset return over the market return. For instance, if the beta of the asset happens to be 3.0 and the market moves 1 percent, the systematic component of the asset return is 3.0 percent. This idea is readily apparent when the SML is viewed in geometrical terms in Figure 1.1. It may also be deduced from the figure that $\beta$ is indeed the slope of the SML.

$\theta_p$ in the CAPM equation is the residual component or residual return on the portfolio. It is the portion of the asset return that is not explainable by the market return. The consensus expectation on the residual component is assumed to be zero.

Having established the separation of asset returns into two components, CAPM then proceeds to elaborate on a key assumption made with respect to the relationship between them. The assertion of the model is that the market component and residual component are uncorrelated. Now, many a scholarly discussion on the import of these assumptions has been conducted and a lot of ink used up on the significance of the CAPM model since its introduction. Summaries of those discussions may be found in the references provided at the end of the chapter. However, for our purposes, the preceding introduction explaining the notion of beta and its role in the determination of asset returns will suffice.

Given that knowledge of the beta of an asset is greatly valuable in the CAPM context, let us discuss briefly how we can go about estimating its value. Notice that beta is actually the slope of the SML. Therefore, beta may be estimated as the slope of the regression line between market returns and the asset returns. Applying the standard regression formula for the estimation of the slope we have
$$\beta = \frac{\mathrm{cov}(r_p, r_m)}{\mathrm{var}(r_m)} \tag{1.2}$$

that is, beta is the covariance between the asset and market returns divided by the variance of the market returns.

To see the typical range of values that the beta of an asset is likely to assume in practice, we remind ourselves of an oft-quoted adage about the markets, “A rising tide raises all boats.” The statement indicates that when the market goes up, we can typically expect the price of all securities to go up with it. Thus, a positive return for the market usually implies a positive return for the asset, that is, the sum of the market component and the residual component is positive. If the residual component of the asset return is small, as we expect it to be, then the positive return in the asset is explained almost completely by its market component. Therefore, a positive return in the market portfolio and the asset implies a positive market component of the return and, by implication, a positive value for beta. Therefore, we can expect all assets to typically have positive values for their betas.

[Figure 1.1: The Security Market Line. Asset return $r_p$ is plotted against market return $r_m$; the line has slope $\beta$, the systematic component is $\beta r_m$, and the residual is $\theta_p$.]
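As a rough illustration of Equation 1.2, the sketch below estimates beta as the covariance of asset and market returns divided by the variance of market returns. The function name `estimate_beta`, the synthetic return series, and the "true" beta of 1.5 are assumptions made up for this example; they do not come from the text.

```python
import numpy as np

def estimate_beta(asset_returns, market_returns):
    """Estimate beta as cov(r_p, r_m) / var(r_m), following Equation 1.2."""
    r_p = np.asarray(asset_returns, dtype=float)
    r_m = np.asarray(market_returns, dtype=float)
    # 1/N estimates of covariance and variance, as in the chapter appendix.
    cov_pm = np.mean((r_p - r_p.mean()) * (r_m - r_m.mean()))
    var_m = np.mean((r_m - r_m.mean()) ** 2)
    return cov_pm / var_m

# Synthetic data (assumption for illustration): an asset with beta 1.5
# and a small residual return that is independent of the market.
rng = np.random.default_rng(0)
r_m = rng.normal(0.0, 0.01, size=5000)
theta_p = rng.normal(0.0, 0.005, size=5000)
r_p = 1.5 * r_m + theta_p

beta_hat = estimate_beta(r_p, r_m)
print(f"estimated beta: {beta_hat:.3f}")
```

Because the ratio of covariance to variance is the ordinary least squares slope, the same number should come out of any standard regression routine applied to the two return series.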
MARKET NEUTRAL STRATEGY

Having discussed CAPM, we now have the required machinery to define market neutral strategies: They are strategies that are neutral to market returns, that is, the return from the strategy is uncorrelated with the market return. Regardless of whether the market goes up or down, in good times and bad the market neutral strategy performs in a steady manner, and results are typically achieved with a lower volatility. This desired outcome is achieved by trading market neutral portfolios. Let us therefore define what we mean by a market neutral portfolio.

In the CAPM context, market neutral portfolios may be defined as portfolios whose beta is zero. To examine the implications, let us apply a beta value of zero to the equation for the SML. It is easy to see that the return on the portfolio ceases to have a market component and is completely determined by $\theta_p$, the residual component. The residual component by the CAPM assumption happens to be uncorrelated with market returns, and the portfolio return is therefore neutral to the market. Thus, a zero beta portfolio qualifies as a market neutral portfolio.

In working with market neutral portfolios, the trader can now focus on forecasting and trading the residual returns. Since the consensus expectation or mean on the residual return is zero, it is reasonable to expect a strong mean-reverting behavior (value oscillates back and forth about the mean value) of the residual time series.[1] This mean-reverting behavior can then be exploited in the process of return prediction, leading to trading signals that constitute the trading strategy.

Let us now examine how we can construct market neutral portfolios and what we should expect by way of the composition of such portfolios. Consider a portfolio that is composed of strictly long positions in assets. We expect the betas of the assets to be positive. Then positive returns in the market result in a positive return for the assets and thereby a positive return for the portfolio. This would, of course, imply a positive beta for the portfolio. By a similar argument it is easy to see that a portfolio composed of strictly short positions is likely to have a negative beta. So, how do we construct a zero beta portfolio, using securities with positive betas? This would not be possible without holding both long and short positions on different assets in the portfolio. We therefore conclude that one can typically expect a zero beta portfolio to comprise both long and short positions. For this reason, these portfolios are also called long–short portfolios. Another artifact of long–short portfolios is that the dollar proceeds from the short sale are used almost entirely to establish the long position, that is, the net dollar value of holdings is close to zero. Not surprisingly, zero beta portfolios are also sometimes referred to as dollar neutral portfolios.
Example

Let us consider two portfolios A and B, with positive betas $\beta_A$ and $\beta_B$ and with returns $r_A$ and $r_B$:

$$\begin{aligned} r_A &= \beta_A r_m + \theta_A \\ r_B &= \beta_B r_m + \theta_B \end{aligned} \tag{1.3}$$

We now construct a portfolio AB, by taking a short position on $r$ units of portfolio A and a long position on one unit of portfolio B. The return on this portfolio is given as $r_{AB} = -r\,r_A + r_B$. Substituting for the values of $r_A$ and $r_B$, we have

$$r_{AB} = (-r\beta_A + \beta_B)\,r_m + (-r\theta_A + \theta_B) \tag{1.4}$$

Thus, the combined portfolio has an effective beta of $-r\beta_A + \beta_B$. This value becomes zero when $r = \beta_B/\beta_A$. Thus, by a judicious choice of the value of $r$ in the long–short portfolio we have created a market neutral portfolio.

[1] The assertion of CAPM that the expected value of residual return is zero is rather strong. It has been discussed extensively in academic literature as to whether this prediction of CAPM is indeed observable. It is therefore recommended that we explicitly verify the mean-reverting behavior of the spread time series. In later chapters we will discuss methods to statistically check for mean-reverting behavior.
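The example can be checked numerically. The sketch below picks the ratio $r = \beta_B/\beta_A$, confirms that the effective beta in Equation 1.4 vanishes, and then simulates returns to see that the combined portfolio is roughly uncorrelated with the market. The betas 1.2 and 0.9, the volatilities, and the helper names are hypothetical values chosen only for illustration.

```python
import numpy as np

def hedge_ratio(beta_a, beta_b):
    """Units of A to short per unit of B held long so that -r*beta_a + beta_b = 0."""
    return beta_b / beta_a

def portfolio_beta(r, beta_a, beta_b):
    """Effective beta of portfolio AB: the coefficient of r_m in Equation 1.4."""
    return -r * beta_a + beta_b

beta_a, beta_b = 1.2, 0.9            # hypothetical betas
r = hedge_ratio(beta_a, beta_b)
effective_beta = portfolio_beta(r, beta_a, beta_b)

# Simulated returns (assumption): residuals independent of the market.
rng = np.random.default_rng(1)
r_m = rng.normal(0.0, 0.01, size=10000)
r_a = beta_a * r_m + rng.normal(0.0, 0.004, size=10000)
r_b = beta_b * r_m + rng.normal(0.0, 0.004, size=10000)

r_ab = -r * r_a + r_b                # short r units of A, long one unit of B
corr_with_market = np.corrcoef(r_ab, r_m)[0, 1]
print(f"ratio r = {r:.4f}, effective beta = {effective_beta:.2e}, "
      f"correlation with market = {corr_with_market:.4f}")
```

In practice the betas themselves are estimates, so the resulting portfolio is only approximately market neutral.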
COCKTAIL CORNER

In cocktail situations involving investment professionals, it is fairly common to hear the terms long–short, market neutral, and dollar neutral investing bandied about. Very often they are assumed to mean the same thing. Actually, that need not be the case. You could be long–short and dollar neutral but still have a nonzero beta to the market, in which case you have a nonzero market component in the portfolio return and therefore are not market neutral. If you ever encountered such a situation, you could smile to yourself. Tempting as it might be to speak up, I strongly urge that you restrain yourself. But, of course, if you are looking to be anointed the “resident nerd,” you could go ahead and launch into an exhaustive explanation of the subtle differences to people with cocktails in hand not particularly looking for a lesson in precise terminology.
PAIRS TRADING

Pairs trading is a market neutral strategy in its most primitive form. The market neutral portfolios are constructed using just two securities, consisting of a long position in one security and a short position in the other, in a predetermined ratio. At any given time, the portfolio is associated with a quantity called the spread. This quantity is computed using the quoted prices of the two securities and forms a time series. The spread is in some ways related to the residual return component of the return already discussed. Pairs trading involves putting on positions when the spread is substantially away from its mean value, with the expectation that the spread will revert back. The positions are then reversed upon convergence. In this book, we will look at two versions of pairs trading in the equity markets; namely, statistical arbitrage pairs and risk arbitrage pairs.

Statistical arbitrage pairs trading is based on the idea of relative pricing. The underlying premise in relative pricing is that stocks with similar characteristics must be priced more or less the same. The spread in this case may be thought of as the degree of mutual mispricing. The greater the spread, the higher the magnitude of mispricing and greater the profit potential. The strategy involves assuming a long–short position when the spread is substantially away from the mean. This is done with the expectation that the mispricing is likely to correct itself. The position is then reversed and profits made when the spread reverts back. This brings up several questions: How do we go about calculating the spread? How do we identify stock pairs for which such a strategy would work? What value do we use for the ratio in the construction of the pairs portfolio? When can we say that the spread has substantially diverged from the mean? We will address these questions and provide some quantitative tools to answer them.
Risk arbitrage pairs occur in the context of a merger between two companies. The terms of the merger agreement establish a strict parity relationship between the values of the stocks of the two firms involved. The spread in this case is the magnitude of the deviation from the defined parity relationship. If the merger between the two companies is deemed a certainty, then the stock prices of the two firms must satisfy the parity relationship, and the spread between them will be zero. However, there is usually a certain level of uncertainty on the successful completion of a merger after the announcement, because of various reasons like antitrust regulatory issues, proxy battles, competing bidders, and the like. This uncertainty is reflected in a nonzero value for the spread. Risk arbitrage involves taking on this uncertainty as risk and capturing the spread value as profits. Thus, unlike the case of statistical arbitrage pairs, which is based on valuation considerations, risk arbitrage trade is based strictly on a parity relationship between the prices of the two stocks.
The typical modus operandi is as follows. Let us call the acquiring firm the “bidder” and the acquired firm the “target.” On the eve of merger announcement, the bidder shares are sold short and the target shares are bought. The position is then unwound on completion of the merger. The spread on merger completion is usually lower than when it was put on. The realized profit is the difference between the two spreads. In this book, we discuss how the ratio is determined based on the details of the merger agreement. We will develop a model for the spread dynamics that can be used to answer questions like, “What is the market expectation on the odds of merger completion?” We shall also demonstrate how the model may be used for risk management. Additionally, we will focus on trade timing and provide some quantitative tools for the process.
OUTLINE

The book provides an overview of two different versions of pairs trading in the equity markets. The first version is based on the idea of relative valuation and is called statistical arbitrage pairs trading. The second involves pairs trading that arises in the context of mergers and is called risk arbitrage. Even though they are commonly called arbitrage strategies in the industry, they are by no means risk-free. In this book we take an in-depth look at the various aspects of these strategies and provide quantitative tools to assist in their analysis.

I must also quickly point out at this juncture that the methodologies discussed in the book are not by any measure to be construed as the only way to trade pairs because, to put it proverbially, there is more than one way to skin a cat. We do, however, strive to present a compelling point of view attempting to integrate theory and practice. The book is by no means meant to be a guarantee for success in pairs trading. However, it provides a framework and insights on applying rigorous analysis to trading pairs in the equity markets.

The book is in three parts. In the first part, we present preliminary material on some key topics. We concede that there are books entirely devoted to each of the topics addressed, and the coverage of the topics here is not exhaustive. However, the discussion sets the context for the rest of the book and helps familiarize the reader with some important ideas. It also introduces some notation and definitions. The second part is devoted to statistical arbitrage pairs, and the third part is on risk arbitrage.

The book assumes some knowledge on the part of the reader of algebra, probability theory, and calculus. Nevertheless, we have strived to make the material accessible and the reader could choose to pick up the background along the way. As a refresher, the appendix at the end of this chapter lists the basic probability formulas that the reader can expect to encounter in the course of reading the book.

In terms of the sequence of chapters, we highly recommend that readers familiarize themselves with the chapters on time series and multifactor models before getting on to statistical arbitrage pairs, as those ideas and technical terms are referenced quite frequently in the course of the discussions. Concepts from Chapter 4, on Kalman filtering, are used in Chapter 12, related to smoothing risk arbitrage spreads. Other than the preceding dependencies, the rest of the material is mostly self-contained.
AUDIENCE

This book is written to appeal to a broad audience spanning students, practitioners, and self-study enthusiasts. It is written in an easy reading style, first presenting the broad ideas and concepts and subsequently delving into the details. The idea is to provide readers with the flexibility to revisit aspects of the details on their own timetable. To further facilitate this, a bullet summary highlighting the key points is provided at the end of every chapter.

The book could serve as a reference text for students pursuing a degree in mathematical finance or be used as part of an advanced course for MBA students. Also, the topics addressed in the book would be of keen interest not only to academicians but also to traders and quantitative analysts in hedge funds and brokerage houses. The background material in Part 1 provides a primer on various subjects that are drawn on in the course of the analysis. The background material and the analysis methodology appear as a recurring theme in strategy analysis and are generally applicable to other asset classes as well. Given this and the easily readable style of the book, we hope that this book serves as a reference for investment professionals.
SUMMARY

• The CAPM model helps separate out portfolio returns into a market component and a residual component.
• Portfolios with a zero market component are called market neutral portfolios.
• Market neutral strategies involve the trading of market neutral portfolios, and the returns generated by such strategies are uncorrelated with the market.
• Pairs trading is a genre of market neutral strategies in which a portfolio has only two assets.
• In the book, we will discuss two classes of pairs trading strategies; namely, risk arbitrage and statistical arbitrage.
FURTHER READING MATERIAL

CAPM

Elton, Edwin J. and Martin J. Gruber. Modern Portfolio Theory and Investment Analysis, 4th Edition. (New York: John Wiley & Sons, Inc., 1991).

Fama, Eugene F. and Kenneth R. French. “The Cross-Section of Expected Stock Returns.” Journal of Finance 47, no. 2 (June 1992): 427–465.

Market Neutral Strategies

Nicholas, Joseph G. Market Neutral Investing: Long/Short Hedge Fund Strategies. (New York: Bloomberg Press, 2000).
APPENDIX

Below are a few formulas on random variables that we are likely to encounter throughout the book.

DEFINITIONS

Let X, Y, and Z be random variables. Let $(x_1, y_1, z_1), (x_2, y_2, z_2), \ldots, (x_N, y_N, z_N)$ be N realization 3-tuples for these random variables.

Mean

The mean or expected value of X is denoted by $E[X] = \mu_x$. The estimated value of the mean of a random variable is known as the average. The formula for the average is

$$x_{avg} = \frac{1}{N}\sum_{i=1}^{N} x_i$$

Variance

The variance of X is $\mathrm{var}(X) = E\left[(X - \mu_x)^2\right]$. The estimated value of the square root of variance is the familiar standard deviation. Its value is calculated using the formula

$$x_{stddev} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - x_{avg})^2}$$

Covariance

The covariance between X and Y is denoted as $\mathrm{cov}(X, Y) = E\left[(X - \mu_x)(Y - \mu_y)\right]$. An estimation of the covariance may be calculated using the formula

$$\frac{1}{N}\sum_{i=1}^{N} (x_i - x_{avg})(y_i - y_{avg})$$

Correlation

The correlation between X and Y is

$$\mathrm{corr}(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{var}(X)\,\mathrm{var}(Y)}}$$

The formula for the estimate of correlation is given as

$$\frac{\frac{1}{N}\sum_{i=1}^{N} (x_i - x_{avg})(y_i - y_{avg})}{x_{stddev}\; y_{stddev}}$$

The correlation between any two random variables is always a value between +1 and –1. Every random variable is perfectly correlated with itself, that is, the correlation is 1.0. Two random variables are said to be uncorrelated when the correlation between them is 0.

FORMULAS

If a, b are nonrandom numbers, then the following formulas hold:

$E[aX + bY] = aE[X] + bE[Y]$
$\mathrm{var}(aX + b) = a^2\,\mathrm{var}(X)$
$\mathrm{var}(X + Y) = \mathrm{var}(X) + \mathrm{var}(Y) + 2\,\mathrm{cov}(X, Y)$
$\mathrm{var}(X - Y) = \mathrm{var}(X) + \mathrm{var}(Y) - 2\,\mathrm{cov}(X, Y)$
$\mathrm{cov}(aX, bY) = ab\,\mathrm{cov}(X, Y)$
$\mathrm{cov}(X, Y + Z) = \mathrm{cov}(X, Y) + \mathrm{cov}(X, Z)$
$\mathrm{corr}(aX, bY) = \mathrm{corr}(X, Y)$ when a and b are nonzero and have the same sign (the sign of the correlation flips when $ab < 0$)
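The identities above also hold for the 1/N sample estimates, which makes them easy to sanity-check on simulated data. The following is a minimal sketch; the helper names `var`, `cov`, and `corr`, the constants, and the simulated series are assumptions introduced for this check only.

```python
import numpy as np

def var(u):
    """1/N estimate of variance."""
    return np.mean((u - u.mean()) ** 2)

def cov(u, v):
    """1/N estimate of covariance."""
    return np.mean((u - u.mean()) * (v - v.mean()))

def corr(u, v):
    """Estimate of correlation: covariance over the product of standard deviations."""
    return cov(u, v) / np.sqrt(var(u) * var(v))

rng = np.random.default_rng(42)
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(size=1000)   # correlated with x by construction
a, b = 3.0, -2.0                      # nonrandom constants with opposite signs

checks = {
    "var(aX + b)": np.isclose(var(a * x + b), a ** 2 * var(x)),
    "var(X + Y)": np.isclose(var(x + y), var(x) + var(y) + 2 * cov(x, y)),
    "var(X - Y)": np.isclose(var(x - y), var(x) + var(y) - 2 * cov(x, y)),
    "cov(aX, bY)": np.isclose(cov(a * x, b * y), a * b * cov(x, y)),
    # With ab < 0 the correlation changes sign.
    "corr(aX, bY)": np.isclose(corr(a * x, b * y), -corr(x, y)),
}
for name, ok in checks.items():
    print(name, "holds" if ok else "fails")
```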
CHAPTER 2 Time Series
OVERVIEW

A time series is a sequence of values measured over time. These values may be derived from a fixed deterministic formula, in which case they are referred to as a deterministic time series. Alternatively, the value may be obtained by drawing a sample from a probability distribution, in which case they may be termed as probabilistic or stochastic time series. In this chapter, we will focus on stochastic time series.

Now, if the value at each instance in a stochastic time series is drawn from a probability distribution, how is it different from repeated drawings from a probability distribution? The added twist is that the probability distributions used for the drawings can themselves vary with time. The formal specification prescribing ways in which the distributions could vary with time and the discipline of analyzing stochastic time series was pioneered and popularized by Norbert Wiener.[1] For this reason, the subject area is also referred to at times as Wiener filtering.

In the early days of Wiener filtering, the ideas were in theorem form, and to use them in practical applications one had to work through the rigorous mathematical definitions and theorems. Along came George Box and Gwilym Jenkins in the early 1970s, who formulated the application of Wiener filtering concepts into a recipe-like format. Their step-by-step prescription to the process of model building not only had great intuitive appeal but also managed to transform what was considered an esoteric science into a robust engineering discipline. The approach could now be readily applied to forecasting problems. The methodology gained instant popularity with time series analysts and has become the staple by far for the analysis of stochastic time series. Fittingly, their methodology for time series forecasting is referred to as the Box-Jenkins approach.

In this chapter, we will describe the Box-Jenkins approach. Instead of doing this by definition, we will attempt to do this by way of construction and examples. We begin by introducing some basic notation. Throughout the chapter the value of a time series at time $t$ is denoted as $y_t$. It then follows that the general time series is the set of values $y_t$, $t = 0, 1, 2, 3, \ldots, T$. We denote the series as $\{y_t\}$.

[1] Norbert Wiener is also credited with coining the word cybernetics, the shortened version of which is the ubiquitous cyber, which has by usage become a prefix for a lot of terms associated with the Internet.
AUTOCORRELATION

Let us begin the discussion by introducing the notion of the autocorrelation. Given a stochastic time series, the first question one tends to ask in the process of analysis is, “Is there a relationship between the value now and the value observed one time step in the past?” We can choose to answer the question by measuring the correlation between the time series values one time interval apart. The strength of the (linear) relationship is reflected in the correlation number. And what about the relationship of the current value to the value two time steps in the past? What about three time steps in the past? The question seems to repeat itself naturally for the whole range of time steps. The answer to these questions, spanning the entire range of time steps, could very well be the autocorrelation function.

The autocorrelation function is the plot of the correlation between values in the time series based on the time interval between them. The x-axis denotes the length of the time lag between the current value and the value in the past. The y-axis value for a time lag $\tau$ ($x = \tau$) is the correlation between the values in the time series $\tau$ time units apart. This correlation is estimated using the formula

$$\hat{\rho}(\tau) = \frac{\frac{1}{T}\sum_{t=\tau+1}^{T} (y_t - \bar{y})(y_{t-\tau} - \bar{y})}{\frac{1}{T}\sum_{t=1}^{T} (y_t - \bar{y})^2} \tag{2.1}$$

where $\bar{y}$ is the calculated average of variable $y$. The plot of the estimated correlation against time intervals forms an estimation of the autocorrelation function, called the correlogram. It serves as a proxy for the autocorrelation function of the time series. We shall see in the ensuing discussions that the autocorrelation function serves as a signature or fingerprint for a time series and plays a key role in characterizing various cases of the time series that we describe in the following sections.
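The estimator in Equation 2.1 translates almost directly into code. The sketch below is one way to write it, assuming the series is held in a NumPy array; the function name `autocorrelation` and the tiny five-point series are illustrative choices, not part of the text.

```python
import numpy as np

def autocorrelation(y, max_lag):
    """Estimate rho_hat(tau) for tau = 0..max_lag, following Equation 2.1."""
    y = np.asarray(y, dtype=float)
    T = len(y)
    dev = y - y.mean()                     # deviations from the average
    denom = np.sum(dev ** 2) / T           # (1/T) * sum of squared deviations
    rho = np.empty(max_lag + 1)
    for tau in range(max_lag + 1):
        # (1/T) * sum over t = tau+1..T of (y_t - ybar)(y_{t-tau} - ybar)
        num = np.sum(dev[tau:] * dev[:T - tau]) / T
        rho[tau] = num / denom
    return rho

# Usage on a small series: the lag-0 value is 1 by construction.
correlogram = autocorrelation([1.0, 2.0, 3.0, 4.0, 5.0], max_lag=2)
print(correlogram)
```

Plotting the returned values against the lag gives the correlogram described above.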
TIME SERIES MODELS

The approach we will adopt in the description of time series models is to start with the special cases and eventually build up to the generalized version.
White Noise The white noise is the simplest case of a probabilistic time series. It is constructed by drawing a value from a normal distribution at each time instance. Furthermore, the parameters of the normal distribution are fixed and do not change with time. Thus, in this case, the time series is equivalent to drawing samples repeatedly from a probability distribution. If we denote the value from the drawing at time t as et, the value of the time series at time t is then yt = et. Note that there is no special requirement in the definition of white noise that the invariant distribution be a normal or Gaussian distribution. This is, however, the most widely used version of white noise in practice and is referred to as Gaussian white noise. A plot of a white noise series is shown in Figure 2.1a. The correlogram for that time series is calculated as is shown in Figure 2.1b. Note that at the lag value of zero, the correlation is unity; that is, every sample is perfectly correlated with itself. At all the other lag values the measured correlation is negligible. Let us see why that is. At all time steps, the values are drawn from identical independent normal distributions. It is also a fact that the correlation between independent random variables is zero; that is, they are uncorrelated. Therefore, for a white noise series, the correlation between the values for all time intervals is zero, and this is reflected in the correlogram. But what is the genesis of the term white noise? It has to do with the Fourier transform of the autocorrelation function. A discussion of that is a little beyond the scope of this introduction, so for that we direct the reader to other books written in the area, as noted in the reference section. Let us now focus on the predictability of the white noise time series. The question we ask is as follows: Does knowledge of the past realization help in the prediction of the time series value in the next time instant? It does help to some extent. 
Knowledge of the past realization helps us to estimate the variance of the normal distribution. This enables us to arrive at some intelligent conclusions about the odds of the next realization of the time series being greater than or less than some value. Summing up, in a white noise series, the variance of the value at each point in the series is the variance of the normal distribution used for drawing the white noise values. This distribution with a specific mean and variance is time invariant. Thus, a white noise series is a sequence of uncorrelated random variables with constant mean and variance.
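The flat correlogram of white noise can be checked numerically. The sketch below is only an illustration under our own assumptions: the helper name `autocorr`, the seed, and the sample size of 5,000 are arbitrary choices and are not the data behind Figure 2.1.

```python
import random

def autocorr(series, lag):
    """Sample autocorrelation of `series` at a non-negative lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[i] - mean) * (series[i + lag] - mean)
              for i in range(n - lag))
    return cov / var

random.seed(0)  # arbitrary seed, used only for reproducibility
# Gaussian white noise: independent draws from one fixed normal distribution.
white_noise = [random.gauss(0.0, 1.0) for _ in range(5000)]

# Autocorrelation at lags 0..5: unity at lag 0, negligible elsewhere.
acf = [autocorr(white_noise, lag) for lag in range(6)]
print([round(r, 3) for r in acf])
```

The value at lag zero is one by construction, while the remaining values should scatter around zero with a spread of roughly one over the square root of the sample size.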
FIGURE 2.1A White Noise Series.
FIGURE 2.1B White Noise ACF. (Axes: Auto Correlation versus Lag.)
Moving Average Process (MA) We now generate another time series from the white noise series above. The value $y_t$ of this time series at time t is given by the rule

$$y_t = \varepsilon_t + \beta\varepsilon_{t-1} \tag{2.2}$$

In words, the time series value is the sum of the current white noise realization plus beta² times the white noise realization one time step ago. Note that when $\beta = 0$, this is the same as the white noise series. Figure 2.2a is a plot of a time series of this type. This specific time series was generated from the white noise sequence in Figure 2.1 using the formula $y_t = \varepsilon_t + 0.8\varepsilon_{t-1}$. The correlogram of the series is plotted in Figure 2.2b. In the correlogram, note that there is a steep drop in the value after a lag of $\tau = 1$. To see why that is, let us consider the time series values for the three consecutive time steps t, t + 1, and t + 2.

$$\begin{aligned} y_t &= \varepsilon_t + \beta\varepsilon_{t-1} \\ y_{t+1} &= \varepsilon_{t+1} + \beta\varepsilon_t \\ y_{t+2} &= \varepsilon_{t+2} + \beta\varepsilon_{t+1} \end{aligned} \tag{2.3}$$

² Beta in this connotation is a nondescript Greek symbol denoting a constant and has no relationship to the CAPM model.
FIGURE 2.2A MA(1) Series.
FIGURE 2.2B MA(1) Series ACF. (Axes: Auto Correlation versus Lag.)
Observe that the values one time interval apart (lag $\tau = 1$) have in their terms one common white noise realization value (albeit with different coefficients). Between $y_t$ and $y_{t+1}$ the common white noise realization is $\varepsilon_t$. Similarly, between $y_{t+1}$ and $y_{t+2}$ there is $\varepsilon_{t+1}$. Because of this, we expect there to be some correlation between them. However, between $y_t$ and $y_{t+2}$, values two time intervals apart (lag $\tau = 2$), we have no common white noise realizations. They are built from independent drawings from normal distributions and are therefore uncorrelated (correlation = 0). Thus, after exhibiting strong correlation at a lag of one time step, the correlation goes to zero from the next lag onward. This explains the steep drop in correlation after $\tau = 1$. To examine the predictability of this time series, we again ask the same question: Does knowledge of the past realization help in the prediction of the next time series value? The answer here is a resounding yes. At time step t we know what the white noise realization was at time step t - 1. Thus our prediction for time step t would be a value that is normally distributed with the mean $y_t^{\text{pred}} = \beta\varepsilon_{t-1}$. The variance of the predicted value would be the variance of $\varepsilon_t$, which is the same as the variance of the white noise used to construct the time series. Since these values are based on the condition that we know the past realization of the time series, they are called the conditional mean and the conditional variance of the time series. To conclude, knowledge of the past definitely helps in the prediction of the time series. Summing up, the preceding series was constructed using a linear combination (moving average) of white noise realizations. The series is therefore called a moving average (MA) series. Also, because we used the current value and one lagged value of the white noise series, the series qualifies as a first-order moving average process, denoted as MA(1).
This idea is easily generalized to a series where the value is constructed using q lagged values of white noise realizations.

$$y_t = \varepsilon_t + \beta_1\varepsilon_{t-1} + \beta_2\varepsilon_{t-2} + \dots + \beta_q\varepsilon_{t-q} \tag{2.4}$$
Such a series is called the moving average series of order q or an MA(q) series.
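The steep drop in the MA(1) correlogram can be reproduced with a short simulation. This is a minimal sketch under our own assumptions: the seed, the sample size, and the helper `autocorr` are illustrative, and the series is not the one plotted in Figure 2.2. The lag 1 value is compared with $\beta/(1+\beta^2)$, the expression derived in the appendix to this chapter.

```python
import random

def autocorr(series, lag):
    """Sample autocorrelation of `series` at a non-negative lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    return sum((series[i] - mean) * (series[i + lag] - mean)
               for i in range(n - lag)) / var

random.seed(1)   # arbitrary seed
beta = 0.8       # same coefficient as the example in the text
n = 20000        # assumed sample size, large enough for stable estimates

eps = [random.gauss(0.0, 1.0) for _ in range(n + 1)]
# MA(1) rule: y_t = eps_t + beta * eps_{t-1}
ma1 = [eps[t] + beta * eps[t - 1] for t in range(1, n + 1)]

lag1 = autocorr(ma1, 1)
lag2 = autocorr(ma1, 2)
theory_lag1 = beta / (1 + beta ** 2)  # lag 1 correlation from the appendix

print(round(lag1, 3), round(theory_lag1, 3), round(lag2, 3))
```

With $\beta = 0.8$ the theoretical lag 1 correlation is about 0.49, and the lag 2 estimate should be close to zero.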
Autoregressive Process (AR) In the previous example we had constructed a time series by taking a linear combination of a finite number of past white noise realizations. In this section we will construct the series using a linear combination of infinite past values of the white noise realization. In practice, though, infinity is approximated by taking a very large number of values. A question that immediately pops to mind is that if we add an infinite sequence of numbers, will the sum not go to infinity? In some instances it might go to infinity. There are,
however, cases where the sum of an infinite sequence of numbers is actually a finite value.³ Let us denote the value of the time series at instant t as

$$y_t = \varepsilon_t + \alpha\varepsilon_{t-1} + \alpha^2\varepsilon_{t-2} + \dots \tag{2.5}$$

The infinite moving average representation above is called the MA(∞) representation. To simplify Equation 2.5, consider the value of the time series at t - 1. It is given as

$$y_{t-1} = \varepsilon_{t-1} + \alpha\varepsilon_{t-2} + \alpha^2\varepsilon_{t-3} + \dots \tag{2.6}$$

Examining the two equations, note that we can write $y_t$ in terms of $y_{t-1}$ as follows:

$$y_t = \alpha y_{t-1} + \varepsilon_t \tag{2.7}$$
In words, the value at time t is alpha times the value at time t - 1 plus a white noise term. Note that alpha may be viewed as the slope of the regression between two consecutive values of the time series. Since the next value in the time series is obtained by multiplying the past value with the slope of the regression, it is called an autoregressive (AR) series. Figure 2.3a is the plot of the AR time series, generated using the white noise values seen in Figure 2.1. The corresponding correlogram is shown in Figure 2.3b. Notice that the correlation values fall off gradually with increasing lag values; that is, there is no sharp drop. To get an insight into why that is, let us apply the same kind of reasoning as we did for the MA model. Every time step has in it additive terms comprising all the previous white noise realizations. Therefore, there will always be white noise realizations that are common between two values of the time series however far apart they may be. Naturally, we can expect there to be some correlation between any two values in the time series regardless of the time interval between them. It is therefore not surprising that the correlation exhibits a slow decay.

FIGURE 2.3A AR(1) Series.
FIGURE 2.3B AR(1) Series ACF. (Axes: Auto Correlation versus Lag.)

³ We touch upon this topic very briefly in the appendix. However, for a full-blown discussion on stability analysis, we recommend that the reader follow up with the references.

To answer the predictability question, here, too, as in the moving average case, knowledge of the past values of the time series is helpful in predicting what the next value is likely to be. In this case we have $y_t^{\text{pred}} = \alpha y_{t-1}$. The conditional variance of the predicted value would be the variance of $\varepsilon_t$, which is the same as the variance of the white noise used to construct the time series. The one-step autoregressive series may be extended to an autoregressive (AR) series of order p, denoted as AR(p). The value at time t is given as

$$y_t = \varepsilon_t + \alpha_1 y_{t-1} + \alpha_2 y_{t-2} + \dots + \alpha_p y_{t-p} \tag{2.8}$$
It is, however, important to bear in mind that the generalized AR series is generated from a white noise series using linear combinations of past realizations.
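The slow decay of the AR(1) correlogram can also be seen in simulation. The sketch below makes several assumptions of its own: $\alpha = 0.8$, the seed, the sample size, and a burn-in of 500 points are illustrative choices rather than the settings behind Figure 2.3. It compares the sample autocorrelation at lag k with $\alpha^k$, which is the standard result for a stationary AR(1) series (the appendix derives the lag 1 case).

```python
import random

def autocorr(series, lag):
    """Sample autocorrelation of `series` at a non-negative lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    return sum((series[i] - mean) * (series[i + lag] - mean)
               for i in range(n - lag)) / var

random.seed(2)  # arbitrary seed
alpha = 0.8     # assumed AR coefficient, |alpha| < 1 for stationarity
n = 20000       # assumed sample size

# AR(1) rule: y_t = alpha * y_{t-1} + eps_t
ar1 = [0.0]
for _ in range(n):
    ar1.append(alpha * ar1[-1] + random.gauss(0.0, 1.0))
ar1 = ar1[500:]  # discard a burn-in so the start value has little influence

for lag in (1, 2, 3, 5):
    print(lag, round(autocorr(ar1, lag), 3), round(alpha ** lag, 3))
```

Unlike the MA(1) case, the estimates do not cut off after lag 1 but shrink geometrically.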
The General ARMA Process The AR(p) and MA(q) models can be mixed to form an ARMA(p, q) model. By extrapolation it is easy to see that the generation rule for an ARMA(p, q) process is given as

$$y_t = \left[\alpha_1 y_{t-1} + \alpha_2 y_{t-2} + \dots + \alpha_p y_{t-p}\right] + \left[\varepsilon_t + \beta_1\varepsilon_{t-1} + \beta_2\varepsilon_{t-2} + \dots + \beta_q\varepsilon_{t-q}\right] \tag{2.9}$$

We once again underscore the main point (hoping to drive it home) by quoting our constant refrain pertaining to Wiener filtering: The preceding models are all constructed using a linear combination of past values of the white noise series. An important consequence of that fact is that the sum of two independent ARMA series is also ARMA.
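The generation rule in Equation 2.9 translates almost line for line into code. The following is a minimal sketch, not a production simulator: the function name `simulate_arma`, its parameters, the unit-variance Gaussian noise, and the burn-in length are all our own assumptions.

```python
import random

def simulate_arma(alphas, betas, n, seed=0, burn_in=200):
    """Generate n values of an ARMA(p, q) series plus the noise that drove it.

    alphas: AR coefficients [alpha_1, ..., alpha_p]
    betas:  MA coefficients [beta_1, ..., beta_q]
    Values before the start of the series are treated as zero.
    """
    rng = random.Random(seed)
    p, q = len(alphas), len(betas)
    total = n + burn_in
    eps = [rng.gauss(0.0, 1.0) for _ in range(total)]
    y = []
    for t in range(total):
        # AR bracket of Equation 2.9: weighted past values of the series.
        ar_part = sum(alphas[i] * y[t - 1 - i]
                      for i in range(p) if t - 1 - i >= 0)
        # MA bracket: current noise plus weighted past noise realizations.
        ma_part = sum(betas[j] * eps[t - 1 - j]
                      for j in range(q) if t - 1 - j >= 0)
        y.append(ar_part + eps[t] + ma_part)
    return y[burn_in:], eps[burn_in:]

# Hypothetical ARMA(1, 1) example with alpha_1 = 0.5 and beta_1 = 0.4.
y, eps = simulate_arma([0.5], [0.4], n=300, seed=7)
```

Passing empty coefficient lists recovers the special cases discussed earlier: `simulate_arma([], [], n)` is white noise, `simulate_arma([], [0.8], n)` is an MA(1) series, and `simulate_arma([0.8], [], n)` is an AR(1) series.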
The Random Walk Process An important and special ARMA series that merits discussion is the random walk. The random walk has been studied extensively by scientists from various disciplines. Phenomena ranging from the movement of molecules to fluctuations of stock prices have been modeled as random walks. Let us therefore discuss this in some detail. A random walk is an AR(1) series with $\alpha = 1$. From the definition of an AR series given, the value of the time series at time t is therefore

$$y_t = \varepsilon_t + \varepsilon_{t-1} + \varepsilon_{t-2} + \dots = \varepsilon_t + y_{t-1} \tag{2.10}$$
In words, the random walk is essentially a simple sum of all the white noise realizations up to the current time. The AR representation provides an alternate way to look at the random walk. It is the value of the time series one time step ago plus the white noise realization at the current time step. The white noise realization at the current time step in the case of the random walk is known as the innovation. Figure 2.4 is a picture of the random walk generated using the white noise series in Figure 2.1. Let us now begin to examine some properties of the random walk. What do we expect the variance of the random walk to be at time t? Applying the formulas from the appendix in Chapter 1 on the MA(∞) representation of the random walk, along with the fact that white noise drawings are uncorrelated, we have

$$\operatorname{var}(y_t) = \operatorname{var}(\varepsilon_t) + \operatorname{var}(\varepsilon_{t-1}) + \operatorname{var}(\varepsilon_{t-2}) + \dots + \operatorname{var}(\varepsilon_1) \tag{2.11}$$

Since these random white noise drawings all have the same variance, the variance of the random walk at any time t is clearly

$$\operatorname{var}(y_t) = t \operatorname{var}(\varepsilon_t) \tag{2.12}$$
FIGURE 2.4 Random Walk Series.

Note that in this case the variance depends on the time instant, and it increases linearly with time t. (If the variance increases linearly with t, then the standard deviation increases linearly with $\sqrt{t}$.) In this case, unlike all the previous cases, the variance increases monotonically with time; that is, the values are capable of moving to extremes with the passage of time. Also, statistical parameters of the series, here the unconditional variance, are not time invariant; that is, they are not stationary. The series is therefore called a nonstationary time series. The correlation between a value and its immediate lagging value is close to 1 (it equals $\sqrt{(t-1)/t}$, which approaches 1 as t grows). Our prediction for the next time step would then be a value with mean equal to the value at the current time step; that is, $y_t^{\text{pred}} = y_{t-1}$. The variance, of course, is the variance of the white noise realizations. As a matter of fact, our prediction for any number of time steps would be a distribution whose mean is the current value of the series. However, because the variance increases linearly with time, the error in our prediction progressively increases with the number of time steps. Of the different time series reviewed so far, the random walk is the only series in which the prediction of the mean value for the next time step is the current value. Such series where the expected value at the next time step is the value at the current time step are known as martingales. The random walk qualifies as a martingale. The random walk also exhibits a strong trending behavior. Let us examine that statement by contrasting the behavior of the random walk with
other time series. The other time series tend to oscillate about the mean of the series; that is, they exhibit mean reversion. To see what we mean, we suggest that the reader examine the time series plots and see how many times the different time series cross the mean (zero in this case). It is easy to see that the random walk has the least number of zero crossings. Even though the increments to the series at each time instance have equal odds of being positive or negative, it is not uncommon for the random walk series to stay positive (or negative) during the entire time.
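Equation 2.12 can be checked by simulating many independent random walks and measuring the spread of their values at two different times. This is a sketch under assumed settings: the number of paths, the two checkpoints (t = 25 and t = 100), the seed, and unit-variance innovations are our own choices.

```python
import random

def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

random.seed(3)                  # arbitrary seed
n_paths, n_steps = 4000, 100    # assumed simulation size
snapshots = {25: [], 100: []}   # walk values recorded at these times

for _ in range(n_paths):
    level = 0.0
    for t in range(1, n_steps + 1):
        level += random.gauss(0.0, 1.0)  # y_t = y_{t-1} + eps_t
        if t in snapshots:
            snapshots[t].append(level)

var_25 = variance(snapshots[25])
var_100 = variance(snapshots[100])
# With var(eps) = 1, Equation 2.12 predicts variances near 25 and 100.
print(round(var_25, 1), round(var_100, 1))
```

Quadrupling the elapsed time should roughly quadruple the variance, which means the standard deviation only doubles.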
FORECASTING Having discussed the stochastic time series models, let us now direct our attention to the problem of forecasting. The classical forecasting problem may be stated as follows: We are given historical time series data with values up to the current time. We are required to predict the value of the next time step value as closely as possible. In the stochastic time series context, this means that we first identify the ARMA model that is most likely to have resulted in the data set and then use the estimated parameters of the model to forecast the next value of the time series. Let us now formally lay down the steps involved in forecasting problems involving stochastic time series. The solution method is best described as a three-step process. The first step involves transforming the time series such that it is amenable to analysis. We call this the preprocessing step. The data are then analyzed for patterns that may clue us in on the dynamics of the time series. This means that we identify the ARMA model that is likely to have resulted in the data. This is the analysis step. Finally, we make our prediction in the prediction step. We now discuss each of the three steps in detail. Preprocessing involves dealing with pesky issues like checking for missing values, weeding out bad data, eliminating outliers, and so forth. It may also involve transforming the time series to prepare it for analysis. A simple transformation may be to subtract the mean of the series. Other methods may involve creating a new time series by a functional transformation. The application of the logarithmic function to values of the given series prior to analysis is a good example. In the context of ARMA models, an important transformation technique that is frequently used is known as differencing. It is a process by which a new series is constructed by taking the difference between two consecutive values in the given series. Let us discuss the motivation for doing that. 
The ARMA model based forecasting is typically focused on the stationary time series. If we are given a series that is deemed nonstationary, differencing helps transform the nonstationary series into a stationary series. The output from the differencing operation may be viewed as the
series of increments to the current value. Thus, analyzing the differenced output amounts to studying the changes in the values as opposed to the values themselves. The next step is the analysis step. It involves identifying the ARMA model used to generate the given time series data. An ARMA model is completely identified when we are given the white noise series and the rule to generate the time series from the white noise realizations. Sometimes, the white noise series is implicit. The estimated ARMA parameters are, however, stated explicitly. But why should we try to fit an ARMA model to a given data set? The answer is simply that ARMA models provide an empirical explanation for the data without concerning themselves with theoretical justifications. This makes them readily applicable to a variety of situations. Also, the fact that ARMA models are empirical is not necessarily a bad thing, as insights from the model fitting exercise can be later used to construct a plausible theory. Once the underlying ARMA model is identified, we can proceed to the prediction step. We use the model parameters to predict the next value in the series. This completes the forecasting exercise. As seen earlier in our discussion of the ARMA model, the prediction of the next time step value is rather straightforward once the model is identified. Therefore, insofar as forecasting is concerned, identifying the correct model is key to obtaining a good forecast. Not surprisingly, a good portion of the field of time series analysis is focused on model identification.
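The differencing transformation described above is a one-line operation. The sketch below, with an assumed helper name `difference` and an arbitrary seed, builds a random walk from known innovations and shows that differencing hands those innovations back.

```python
import random

def difference(series):
    """First difference: the change between consecutive values."""
    return [later - earlier for earlier, later in zip(series, series[1:])]

random.seed(4)  # arbitrary seed
innovations = [random.gauss(0.0, 1.0) for _ in range(1000)]

# Build a random walk by accumulating the innovations.
walk = []
level = 0.0
for eps in innovations:
    level += eps
    walk.append(level)

increments = difference(walk)
# Each increment is the innovation that produced that step of the walk.
print(len(walk), len(increments))
```

The differenced series is one element shorter than the input, and it is the stationary series of increments on which an ARMA model would then be fit.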
GOODNESS OF FIT VERSUS BIAS We noted that identifying the right model is key to obtaining a good forecast. There are quite a few software packages⁴ that estimate parameter values for ARMA models. While they are based on a variety of approaches, the basic underlying theme in all of them remains the same; that is, the goal is to find the most appropriate ARMA model. Note the use of the term most appropriate. Let us focus on what it actually means. Intuitively, a model may be deemed appropriate based on the accuracy with which it is able to account for the given data set. Let us call the number that quantifies this accuracy the “goodness of fit” measure. An example of the goodness of fit measure is the least squares criterion, which is simply the sum of squares of the prediction error. Prediction error is defined as the difference between the actual observation and the value predicted by the model. The idea then is to find a model that minimizes the least squares criterion (sum of squared errors) for the given data. Another example of the goodness of fit measure is the maximum likelihood criterion. This is a number representative of the probability that the given data set was produced by a particular set of parameter values. The idea here is to find the parameters that maximize the probability, or the maximum likelihood criterion. Thus, the goodness of fit measure helps identify the best model for the given data set.

Of course, the preceding statement is not without caveats. Let us say that we are required to choose the best four-parameter model fitting the data. The goodness of fit criterion would do a wonderful job in helping us achieve that. It is, however, very likely that the best five-parameter model would have a better goodness of fit score. As a matter of fact, we can in all likelihood keep improving our goodness of fit score by increasing the number of explanatory variables. Therefore, using the goodness of fit score without reservation amounts to advocating the philosophy of the more the merrier for explanatory variables. Is that necessarily a good thing? What happens when we apply the model to out-of-sample data? Will we get the same level of accuracy? To see the logic more clearly, let us discuss an extreme case where we fit 100 data points with a 99th-degree polynomial (100 coefficients, that is, 100 explanatory variables). With that, we can get an exact fit to the data and the best possible goodness of fit score ever. However, as a working model for prediction, it is probably not much use to us. Increasing the parameters indefinitely may result in a model that fits the current data set but performs poorly when used outside the current sample. Restating, we could say that our model with a large number of explanatory variables is hopelessly biased to the current data set.

⁴ Eviews, S-Plus, and SAS are some software packages that deal with time series modeling and forecasting.
So, here is our dilemma: We can improve the goodness of fit by increasing the number of explanatory variables and run the risk of bias, or we can use few explanatory variables and possibly miss further reduction in forecast error. The question at this point is, “How do I know the point at which I have a reasonable goodness of fit, and at the same time know that I am not overly biased to the current data set?” The resolution of this forms the topic of discussion in the following section.
MODEL CHOICE The model choice process attempts to achieve a trade-off between goodness of fit and bias. In order to decide whether to increase the number of explanatory variables, we pose the question, “Am I getting sufficient bang for the buck in terms of fit error reduction for the addition of the new explanatory variable?” If I am, then let us go with the additional variable; otherwise, we stick with the model at hand.
The Akaike information criterion (AIC) quantifies the preceding trade-off argument.⁵ In general, every model with k parameters is associated with an AIC number as follows:

$$\mathrm{AIC} = n \log\left(\frac{\sum_{i=1}^{n} e_i^2}{n}\right) + 2k \tag{2.13}$$

where $e_i$ is the forecast error on the ith data point and n is the number of data points. Here, the first term represents the goodness of fit, and the second term is the bias. For every additional variable, the second term increases by a value of 2. However, when a variable is added, we expect the fit to improve and the variance of the forecast error to go down, which lowers the first term. If this reduction in the first term is more than 2, then the AIC value for the model with an additional variable will be lower, and we will have got our proverbial bang for the buck. If the value is higher, then the trade-off is not worth it, and we stick with the current model. The rationale for the AIC formula and the quantitative value used for trade-off has a strong foundation in information theory and is far from arbitrary. Further follow-up material on this can be found in the reference section.
⁵ AIC is but one of many cost functions. The Schwarz information criterion (SIC), also known as the Bayesian information criterion (BIC), is also popular.

RAINING ON THE PARADE If you ever happen to make a presentation involving data analysis, here is a situation that you might encounter. After all the preparation involving umpteen coffees and bleary-eyed but vigorous mouse clicking at statistical packages, as you present your forecasting model, there is a wise guy in the audience who quips, “I am sure I can fit any model to the degree of accuracy I want by adding a lot of variables. I do not see how your model is any good.” While you would like to stare him down until he sulks and quietly leaves the room, more often than not the wise guy happens to be the boss. Unfortunately for you, more often than not he is also correct. The key, however, is to be one up on the wise guy! Based on the preceding discussion you can now wax eloquent about the tug of war between goodness of fit and the evil of bias and how you have meticulously taken into account the effect of adding multiple variables in the forecasting model. Dazzle everyone with your slides on AIC calculations and top it off with an out-of-sample test. If your presentation is close to the end of the fiscal year, you can chuckle to yourself about the bump in bonus you are likely to see due to this.

Example The application of the AIC idea is illustrated in the following exercise. An AR(3) time series that was generated is shown in Figure 2.5a. AR models of various orders were fit to it and the AIC values calculated. The result is plotted in Figure 2.5b. The x-axis denotes the number of parameters in the AR model and the y-axis is the AIC value. Note that the AIC value registers a minimum at four parameters. This is three AR parameters and a constant value for the mean of the series. Using more parameters will result in a better goodness of fit but will not help in forecasting. In some instances, it might actually hurt the forecasting results.

FIGURE 2.5A AR(3) Series.
FIGURE 2.5B AIC Plot. (Axes: AIC Value versus Number of Parameters.)
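An exercise of the same kind as the AR(3) example can be sketched in a few lines. Everything specific here is an assumption on our part and not the author's setup: the AR(3) coefficients (0.4, 0.47, -0.21), the seed, the sample size, and the use of the Yule-Walker equations (solved by the Levinson-Durbin recursion) to fit each AR(p) model. The AIC itself follows Equation 2.13, with k = p + 1 to count the mean along with the p AR coefficients.

```python
import math
import random

def autocov(y, lag):
    """Sample autocovariance of a demeaned series at the given lag."""
    n = len(y)
    return sum(y[i] * y[i + lag] for i in range(n - lag)) / n

def yule_walker(y, p):
    """AR(p) coefficients of a demeaned series via Levinson-Durbin."""
    r = [autocov(y, k) for k in range(p + 1)]
    phi, err = [], r[0]
    for k in range(1, p + 1):
        refl = (r[k] - sum(phi[j] * r[k - 1 - j] for j in range(k - 1))) / err
        phi = [phi[j] - refl * phi[k - 2 - j] for j in range(k - 1)] + [refl]
        err *= 1.0 - refl ** 2
    return phi

def aic(errors, k):
    """Equation 2.13: n * log(sum of squared errors / n) + 2k."""
    n = len(errors)
    return n * math.log(sum(e * e for e in errors) / n) + 2 * k

random.seed(5)                 # arbitrary seed
coeffs = [0.4, 0.47, -0.21]    # hypothetical stationary AR(3) coefficients
y = [0.0, 0.0, 0.0]
for _ in range(3000):
    y.append(sum(c * y[-1 - i] for i, c in enumerate(coeffs))
             + random.gauss(0.0, 1.0))
y = y[500:]                    # drop a burn-in
mean = sum(y) / len(y)
y = [v - mean for v in y]      # the mean is the extra (constant) parameter

max_order = 8
aic_by_order = {}
for p in range(1, max_order + 1):
    phi = yule_walker(y, p)
    # One-step forecast errors, evaluated on the same points for every order.
    errors = [y[t] - sum(phi[j] * y[t - 1 - j] for j in range(p))
              for t in range(max_order, len(y))]
    aic_by_order[p] = aic(errors, p + 1)

for p, value in aic_by_order.items():
    print(p, round(value, 1))
```

In a run of this kind one would expect a large fall in the AIC value up to order 3 and only small changes beyond it, since the extra coefficients buy little reduction in forecast error while each costs 2 in the second term.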
MODELING STOCK PRICES The model that is most commonly assumed for stock price movement is called a log-normal process; that is, the logarithm of the stock price is assumed to exhibit a random walk. Let us discuss the implications of such an assumption. First, this says that the logarithm of the stock price is a martingale. This is to say that the observed price of a stock at the next time period is roughly equal to the price at the current time, give or take a little. That is definitely reasonable. Next, let us examine the resulting time series when we difference the random walk. Differencing the random walk yields the increment to the random walk at each time step. The set of increments by definition are drawings from a normal distribution. But this is exactly how white noise is defined. Thus, differencing a random walk results in a white noise series. Also, bear in mind that the differencing output of the log-normal process (the difference in the logarithm of the prices) can be interpreted as the stock return.⁶ Putting the two together, the implication of the log-normal assumption is that stock returns are essentially a white noise process. Let us look at the plausibility of this implication. Figure 2.6a is a plot of the logarithm of the price of GE (General Electric) over a 100-day period. The series is then differenced, yielding the differenced plot in Figure 2.6b. To quickly check the nature of differenced values (returns), we urge the reader to examine Figure 2.6d. It is a Q-Q plot of the returns versus the normal distribution. The closer the points are to the straight line, the more the actual distribution behaves like a normal distribution. The autocorrelation plot of the returns is depicted in Figure 2.6c. Note that the correlation values are negligible, signifying that an assumption of white noise for the differenced series in a random walk is definitely plausible.

FIGURE 2.6A GE Series.
FIGURE 2.6B Returns.
FIGURE 2.6C GE Returns Q-Q Plot.
FIGURE 2.6D GE Returns ACF. (Axes: Auto Correlation versus Lag.)

⁶ $\log(p_2) - \log(p_1) \approx \frac{p_2 - p_1}{p_1}$. Hence the difference in the logarithm may be construed to be the return.

Now, let us discuss the issues surrounding predictability in a random walk. We know that for a random walk the predicted value at the next time step is the value at the current time step. That is all fine, but the purpose of prediction is to make profits, and profits are made by correctly predicting the increment to the random walk in the next time period. However, because the random walk is a martingale, the mean value of the predicted increment is zero. The actual realized value of the increment is anybody’s guess. Does the situation improve when we try to predict values two time steps ahead? Not very much really. The mean value of the predicted increment is still
zero. If anything, the variance of the normal distribution two time steps away increases, and the plausible range of values that the increment can assume actually increases, further increasing our prediction error. Therefore, knowing the past history of a random walk is not much help in predicting the forward-looking increments. The situation is very different for stationary processes. Armed with the knowledge that stationary processes are mean reverting, one can predict the increment to be greater than or equal to the difference between the current value and the mean. The prediction is guaranteed to hold true at some point in the future realizations of the time series. However, stock prices are modeled as a log-normal process, and that is definitely not stationary. So, where does that leave us in terms of making profits? Definitely not anywhere close to making money. The reader is probably wondering what the point of this whole chapter is. If the logarithm of stock prices is assumed to be random walk, there is no need to go at it in a roundabout way. Just say it is futile trying to predict stock returns and leave it at that. But all hope is not lost. We shall see in the later chapters that it may be possible to construct portfolios whose time series are actually stationary, and the returns for those portfolios are indeed predictable. Let us stop here with this teaser.
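The approximation that lets us read the differenced logarithm of prices as a return, $\log(p_2) - \log(p_1) \approx (p_2 - p_1)/p_1$, is easy to verify for small price moves. The prices below are hypothetical values chosen for illustration; they are not the GE data of Figure 2.6.

```python
import math

# Hypothetical daily prices (illustrative only, not GE data).
prices = [100.0, 101.0, 99.5, 100.2]

# Differencing the log-price series gives the log returns.
log_returns = [math.log(later) - math.log(earlier)
               for earlier, later in zip(prices, prices[1:])]

# Simple returns: (p2 - p1) / p1.
simple_returns = [(later - earlier) / earlier
                  for earlier, later in zip(prices, prices[1:])]

for lr, sr in zip(log_returns, simple_returns):
    print(round(lr, 5), round(sr, 5))
```

For moves of about one percent the two measures agree to within roughly one hundredth of a percentage point; the gap widens as the size of the move grows.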
SUMMARY A time series is constructed by periodically drawing samples from probability distributions that vary with time. The white noise process is the most elementary form of time series and is generated by drawing samples from a fixed distribution at every time instance. ARMA time series are generated using fixed linear combinations of white noise realizations. Time series forecasting for ARMA processes involves deciphering the linear combination and the white noise sequence used to generate the given data and using it to predict the future values. A random walk process is the time series where the current value is a simple sum of all the white noise realizations up to the present time. A random walk is a nonstationary time series. Nonstationary time series are usually transformed to stationary time series using differencing. The logarithm of the stock price series is usually modeled as a random walk.
FURTHER READING MATERIAL
Time Series
Diebold, Francis X. Elements of Forecasting. (Cincinnati, Ohio: South Western College Publishing, 1998).
AIC
Akaike, H. “A New Look at Statistical Model Identification.” IEEE Transactions on Automatic Control 19 (1974): 716–723.
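APPENDIX Lag 1 Correlation in a MA(1) Series The variance of $y_t$ can be calculated using the preceding formulas as

$$\operatorname{var}(y_t) = \operatorname{var}(\varepsilon_t + \beta\varepsilon_{t-1}) = \operatorname{var}(\varepsilon_t) + \beta^2\operatorname{var}(\varepsilon_{t-1}) + 2\beta\operatorname{cov}(\varepsilon_t, \varepsilon_{t-1}) = (1 + \beta^2)\operatorname{var}(\varepsilon_t)$$

$$\operatorname{corr}(y_t, y_{t-1}) = \frac{E[y_t y_{t-1}]}{\sqrt{\operatorname{var}(y_t)\operatorname{var}(y_{t-1})}} = \frac{\beta E(\varepsilon_{t-1}^2)}{\sqrt{\operatorname{var}(y_t)\operatorname{var}(y_{t-1})}} = \frac{\beta\operatorname{var}(\varepsilon_t)}{\operatorname{var}(y_t)} = \frac{\beta}{1 + \beta^2}$$

Therefore, unlike the white noise series, this series has a nontrivial correlation structure.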
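Lag 1 Correlation in an AR(1) Series The variance at each time instant is given as

$$\operatorname{var}(y_t) = E(y_t^2) = (1 + \alpha^2 + \alpha^4 + \alpha^6 + \dots)\operatorname{var}(\varepsilon_t)$$

If $|\alpha| \ge 1$, the series explodes and the variance becomes infinite. However, when $|\alpha| < 1$, the variance can be calculated as the sum of an infinite geometric series and written as

$$\operatorname{var}(y_t) = \frac{\operatorname{var}(\varepsilon_t)}{1 - \alpha^2}$$

The covariance is given as

$$\operatorname{cov}(y_t, y_{t-1}) = E[y_t y_{t-1}] = E[(\alpha y_{t-1} + \varepsilon_t) y_{t-1}] = \alpha\operatorname{var}(y_t)$$

where the last step uses the fact that $\varepsilon_t$ is uncorrelated with $y_{t-1}$ and that, the variance being time invariant, $\operatorname{var}(y_{t-1}) = \operatorname{var}(y_t)$. The correlation is therefore

$$\operatorname{corr}(y_t, y_{t-1}) = \frac{\operatorname{cov}(y_t, y_{t-1})}{\sqrt{\operatorname{var}(y_t)\operatorname{var}(y_{t-1})}} = \frac{\alpha\operatorname{var}(y_t)}{\operatorname{var}(y_t)} = \alpha$$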
Conditions under Which the Maximum Likelihood Is Equivalent to Minimizing Sum of Squares The substitution of the logarithm of the likelihood criterion with the sum of the squared errors hinges on a key assumption. The assumption is that the errors follow a normal distribution with a zero mean. Based on this assumption, every error value may be assigned a probability of occurrence.

$$p(e_i) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{(e_i/\sigma)^2}{2}\right)$$

We now make another assumption that the errors are independent of each other. Then the probability (likelihood) of obtaining the following error sequence is the product of these probabilities.

$$p(\text{error}) = \prod_{i=1}^{N} p(e_i)$$

Now, taking the logarithm of the above equation on both sides we have an expression for the logarithm of the likelihood.

$$\log(\text{likelihood}) = \sum_{i=1}^{N} \log[p(e_i)] = -\frac{N}{2}\log(2\pi) - \frac{1}{2\sigma^2}\sum_{i=1}^{N} e_i^2$$

Let us examine our motivation for doing that. If we arrange a sequence of numbers in ascending or descending order and take their logarithms in sequence, the logarithms are guaranteed to be in ascending or descending order, as the case may be. We might say that transforming a set of numbers into their logarithms preserves their ranks. Therefore, maximizing the log-likelihood is equivalent to maximizing the likelihood. We shall see in the following discussion that the log-likelihood can be simpler to deal with.

Examining the expression for the logarithm of the likelihood, we see that, for a fixed $\sigma$, the only variable term is $-\sum_{i=1}^{N} e_i^2$ (scaled by the positive constant $1/(2\sigma^2)$). Note that this is the sum of squared errors multiplied by the negative sign. Thus, maximizing the logarithm of the likelihood is the same as minimizing the sum of squared errors. Therefore, in situations where we make the assumptions as discussed, the sum of squares multiplied by a negative sign may be used as a proxy for the maximum likelihood.
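The equivalence can be illustrated on a toy problem. The setup below is entirely our own assumption: we estimate the mean of simulated Gaussian data by scanning a grid of candidate values, score each candidate by both criteria, and confirm that the same candidate wins under each. The log-likelihood function follows the expression above with $\sigma$ fixed at 1.

```python
import math
import random

def sum_of_squares(errors):
    """Least squares criterion: sum of squared errors."""
    return sum(e * e for e in errors)

def log_likelihood(errors, sigma=1.0):
    """Log-likelihood of independent zero-mean Gaussian errors, sigma fixed."""
    n = len(errors)
    return (-0.5 * n * math.log(2 * math.pi)
            - sum_of_squares(errors) / (2 * sigma ** 2))

random.seed(6)  # arbitrary seed
# Hypothetical data: 200 draws around a true mean of 2.0.
data = [random.gauss(2.0, 1.0) for _ in range(200)]

# Candidate values for the mean: 1.0, 1.1, ..., 3.0.
candidates = [1.0 + 0.1 * i for i in range(21)]
sse = [sum_of_squares([x - m for x in data]) for m in candidates]
loglik = [log_likelihood([x - m for x in data]) for m in candidates]

best_by_sse = min(range(len(candidates)), key=sse.__getitem__)
best_by_loglik = max(range(len(candidates)), key=loglik.__getitem__)
print(candidates[best_by_sse], candidates[best_by_loglik])
```

Because the log-likelihood is a constant minus a positive multiple of the sum of squared errors, ranking the candidates by one criterion is the exact reverse of ranking them by the other.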
CHAPTER
3
Factor Models
INTRODUCTION Factor models are models that are used to explain the risk/return characteristics of assets. It is actually a rather loose term that serves to describe a wide variety of models. However, all the models share the common characteristic that they may be viewed as extensions to the CAPM model. The premise of the CAPM model is that the returns of assets are explicable almost completely by the behavior of the overall market. Each asset is sensitive to the market in its own characteristic way, and this sensitivity is termed beta. 37
Thus in the CAPM model there is a single explanatory factor and exposure value; namely, the market return and beta. A natural extension to this idea would then be to have multiple explanatory factors and exposure/sensitivity values. For instance, it is possible to construe that the return on a stock depends on the sector of the economy in which it operates, the market capitalization, and a good number of other explanatory factors that can be drawn from the available repertoire of market variables. In this context of multiple explanatory factors, the return of a stock would then be an aggregate of the return contributions of the factors scaled according to the sensitivity/factor exposure. Thus, the return of a stock in a factor model is explained by the return contributions of the various factors. Depending on the type of the factors used, factor models may be loosely categorized into three main groups: statistical factor models, macroeconomic factor models, and fundamental factor models. The factors in a statistical factor model are what we shall call eigen portfolios. They are a set of building-block portfolios with the property that their returns are uncorrelated with each other. Also, the return on any portfolio can be expressed as a linear combination of the returns on the eigen portfolios. However, the eigen portfolios are actually statistical artifacts deduced from data, and interpreting the results is a task that is easier said than done. So, when looking to answer questions from a valuation or a risk control standpoint, one would have to examine the returns closely to answer the question: What is the predominant theme or themes that characterize the eigen portfolio? It is this problem of interpretation that makes the statistical factor models more of a black box and hard to use. 
Not surprisingly, the preference for practitioners has been models that allow them to specify the factors (macroeconomic or fundamental) allowing for a more intuitive explanation for the factor returns. These models are different from the statistical factor model in that the role of the eigen portfolios is actually assumed by some macroeconomic or fundamental variable that can be observed directly. The macroeconomic factor models are constructed using historical stock returns and observable macroeconomic variables. An example of proprietary macroeconomic factor models is the Burmeister, Ibbotson, Roll, and Ross (BIRR) model. The factors or attributes in such models typically include short-term bond yield changes, long-term bond yield changes, dollar value versus other currencies, investor confidence, and changes in long-run economic growth. In contrast to the macroeconomic model, the fundamental factor model uses company and industry attributes and market data as raw descriptors to explain the returns. Examples of commercially available models of this type are the BARRA and Wilshire Atlas models. The inputs to these models are typically industry factors comprising the industries in which the firms operate, and other fundamental factors like price/earnings
ratio, the price/book ratio, attributes relating to the capital structure of the firm like debt/equity ratios, and the like. Even though there exists a wide variety of models, it may not be necessary to discuss each of the models on an individual basis. The theoretical underpinning for the models is provided by arbitrage pricing theory (APT). Thus, by treating the factors used as inputs in an abstract way and discussing arbitrage pricing theory, we can cover a lot of ground on the behavior and use of these different models.
ARBITRAGE PRICING THEORY

Arbitrage pricing theory was originally proposed by Stephen A. Ross in 1976. Unlike the preceding introduction, in which APT was presented as an extension of CAPM, the original proposal by Ross is actually embedded in an arbitrage argument and is appropriately reflected in the name of the theory. In this chapter, however, we will avoid an elaborate discussion on the foundations of APT. For that, we direct the reader to the material listed in the references. Instead, we will provide simple definitions and focus on a few applications to familiarize the reader with the concepts and their application.

In the multifactor framework, an asset is fully characterized by its factor exposure/sensitivity profile. The contribution to the overall asset return due to each factor is commensurate with the exposure/sensitivity of the asset to that factor. The total return is the aggregate of the contributions. Therefore, if APT were to be summed up in one sentence, it would probably be something like this: “Give me the risk factor profile of a security, and I will tell you all about its risk and return characteristics.”

Let us now describe some terminology and notation surrounding APT. We will first start with risk factor exposures. Keeping with the idea of APT being an extension of the CAPM model, let us denote the factor exposures as $\beta_1, \beta_2, \beta_3, \ldots, \beta_k$. If $r_1, r_2, r_3, \ldots, r_k$ denote the factor returns, so that the return contribution of factor $i$ is $\beta_i r_i$, then the return on the stock is given as
$$ r = \beta_1 r_1 + \beta_2 r_2 + \beta_3 r_3 + \cdots + \beta_k r_k + r_e \tag{3.1} $$
where $r_e$ is the idiosyncratic return or specific return on the stock that is not explicable by the factors in the model. One of the key assumptions of APT is that the specific return for a given stock is uncorrelated with both the factor returns and the specific returns of any other stock.

Let us now focus on the evaluation of risk. The risk in a stock is measured as the variance of the return. The variance of return may in some ways be likened to the range of possible values that the return can assume. A small variance is indicative of a narrow range and therefore lower risk, whereas a large variance or wide range is indicative of higher uncertainty in the returns and therefore greater risk. This approach to measuring risk as the second moment of the return distribution was originally proposed by Markowitz in the context of portfolio optimization. The Markowitz approach to portfolio design is also sometimes referred to as mean-variance optimization, and it earned Markowitz the Nobel Prize in economics. Today it has become common practice to use the variance of the return as a measure of risk. We will also keep with this practice and illustrate how risk/variance of return is calculated in the APT framework.

We do this by way of an example. Let us consider an APT model with two factors. The return on the stock in the two-factor case is given as

$$ r = \beta_1 r_1 + \beta_2 r_2 + r_e \tag{3.2} $$
The risk is then measured as the variance of this return. To evaluate it, let us first expand the squared return of the stock using the algebraic identity

$$ (a + b + c)^2 = a^2 + b^2 + 2ab + 2ac + 2bc + c^2 \tag{3.3} $$
We then have

$$ r^2 = \beta_1^2 r_1^2 + \beta_2^2 r_2^2 + 2\beta_1\beta_2 r_1 r_2 + 2\beta_1 r_1 r_e + 2\beta_2 r_2 r_e + r_e^2 \tag{3.4} $$
Applying expectations on both sides and using the formulas in the appendix in the first chapter, we have

$$ \operatorname{var}(r) = \beta_1^2 \operatorname{var}(r_1) + \beta_2^2 \operatorname{var}(r_2) + 2\beta_1\beta_2 \operatorname{cov}(r_1, r_2) + \operatorname{var}(r_e) \tag{3.5} $$
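Note that since $r_e$ is uncorrelated with both $r_1$ and $r_2$, the terms with their products do not feature in the value for the variance. Also, Equation 3.5 can be written in matrix form as follows:

$$ \operatorname{var}(r) = \begin{bmatrix} \beta_1 & \beta_2 \end{bmatrix} \begin{bmatrix} \operatorname{var}(r_1) & \operatorname{cov}(r_1, r_2) \\ \operatorname{cov}(r_1, r_2) & \operatorname{var}(r_2) \end{bmatrix} \begin{bmatrix} \beta_1 \\ \beta_2 \end{bmatrix} + \operatorname{var}(r_e) \tag{3.6} $$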
Notice the structure of the equation. We have the factor exposure profile and its transpose on either side of a square matrix. This square matrix is structured such that it has the variance of the factor returns on its diagonal and the covariance as the off-diagonal elements. It is also commonly referred to as the covariance matrix of factor returns and plays a central role in the calculation of the risk of the security. We can simplify the notation for risk even further:
$$ \operatorname{var}(r) = e V e^T + \operatorname{var}(r_e) \tag{3.7} $$
where V is the covariance matrix and e is the factor exposure vector. Also note from Equation 3.7 that the variance of the return is expressed as a simple sum of two terms. The first term is the variance due to the common factors, and the second term is the idiosyncratic/specific variance. Also, given that the standard deviation is the square root of variance, Equation 3.7 may also be written as

$$ \sigma_{\text{ret}}^2 = \sigma_{\text{cf}}^2 + \sigma_{\text{specific}}^2 \tag{3.8} $$
One can easily remember the formula by drawing parallels between this and the Pythagorean theorem from high school geometry. The standard deviations may be represented as the sides of a right-angled triangle as shown in Figure 3.1. In practice, it turns out that the specific variance is the smaller component of the total variance, and a significant portion of the total variance is explained by the common factor variance.

[FIGURE 3.1 The Risk Diagram. A right-angled triangle with sides labeled σ common factor and σ specific, and hypotenuse σ total.]

Note that key to the evaluation of the common factor variance is the knowledge of the covariance matrix of factor returns. So, how is the covariance matrix calculated in practice? If we have a sample of historical factor returns, then it is a simple matter of using the formulas in the appendix of the first chapter to evaluate each of the entries of the covariance matrix. The question therefore now becomes, how do we get a sample of historical factor returns? To do this, we first write out the linear equations for the return of each stock with known stock returns and factor exposures, treating the factor returns as unknown variables. Next, we solve this system of equations to obtain an estimate of the factor and specific returns. We now have the past factor returns that may be used to estimate the covariance matrix. In other words, the covariance matrix can be deduced from the factor returns.

The converse of this statement is also true. Knowledge of the covariance matrix with the factor exposures and specific variances is sufficient for us to deduce the vector of expected factor returns. The reader is referred to the book by Grinold and Kahn on how that is done. Consequently, knowledge of the factor covariance matrix and the specific variances is sufficient in order to specify an APT model completely.

With that said, let us formally list the parameters that are typically provided in the specification of a factor model. They are as follows:

Factor Exposure Matrix. This is the matrix of exposure/sensitivity factors. If there are N stocks in our universe and k factors in the model, we can construct an N × k matrix with the exposures for each stock in a row. Let us denote this matrix as X.

Factor Covariance Matrix. This is denoted as V.

Specific Variance Matrix. This is the specific variance for each of the N stocks assembled in an N × N matrix with the specific variances on the diagonal. Because the specific returns are assumed to be uncorrelated, the nondiagonal elements are zero. This matrix is denoted as ∆.

Of the three parameters, the factor covariance matrix is by far the most interesting. We will therefore discuss some of the properties of the covariance matrix in its own section.
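The estimation procedure described in this section (solve the stock return equations for the unknown factor returns in each period, then compute the covariance of the resulting factor return history) can be sketched as follows. This is only an illustration under stated assumptions: a two-factor model, an ordinary least-squares solution of the equations, and made-up exposures and returns. The function names are hypothetical, and commercial models use more elaborate estimators.

```python
def solve_factor_returns(exposures, stock_returns):
    """Least-squares factor returns for one period in a two-factor model.

    exposures: list of (beta1, beta2) pairs, one per stock.
    stock_returns: list of stock returns for the same period.
    Solves the normal equations (X^T X) f = X^T r by Cramer's rule.
    """
    a = sum(b1 * b1 for b1, _ in exposures)
    b = sum(b1 * b2 for b1, b2 in exposures)
    d = sum(b2 * b2 for _, b2 in exposures)
    u = sum(b1 * r for (b1, _), r in zip(exposures, stock_returns))
    v = sum(b2 * r for (_, b2), r in zip(exposures, stock_returns))
    det = a * d - b * b
    f1 = (d * u - b * v) / det
    f2 = (a * v - b * u) / det
    # Whatever the factors do not explain is the specific return.
    specific = [r - b1 * f1 - b2 * f2
                for (b1, b2), r in zip(exposures, stock_returns)]
    return (f1, f2), specific


def sample_cov(xs, ys):
    """Sample covariance of two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical exposure matrix X: three stocks, two factors.
X = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Hypothetical stock returns, one list per period.
returns_by_period = [
    [0.02, -0.01, 0.01],
    [0.01, 0.03, 0.04],
    [-0.03, 0.02, -0.01],
]

factor_history = [solve_factor_returns(X, r)[0] for r in returns_by_period]
f1_series = [f[0] for f in factor_history]
f2_series = [f[1] for f in factor_history]

# Factor covariance matrix V estimated from the factor return history.
V = [
    [sample_cov(f1_series, f1_series), sample_cov(f1_series, f2_series)],
    [sample_cov(f2_series, f1_series), sample_cov(f2_series, f2_series)],
]
print(factor_history)
print(V)
```

With more stocks than factors the system is overdetermined, which is why a least-squares solution is assumed here; the residuals of that fit play the role of the specific returns.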
THE COVARIANCE MATRIX

The factor covariance matrix plays a key role in the determination of the risk. It is a square matrix: in a model with k factors, the dimensions of the covariance matrix are k × k. The diagonal elements form the variance of the individual factors, and the nondiagonal elements are the covariances and may have nonzero values. A nonzero covariance implies that the returns of two explanatory factors share some correlation. For example, consider the situation where market capitalization and the leverage of the firm are used as explanatory variables. It is not uncommon within an industry to find that the small cap names have a high amount of leverage. If we assume that the small cap names outperformed the overall market, then we can expect to see a nonzero correlation between the returns attributed to the leverage and capitalization factors. Hence, it is possible to have nonzero entries in the off-diagonal elements of the covariance matrix.

The covariance matrix is also symmetric. This is self-evident because the (i,j)th element and the (j,i)th element contain the entry for the covariance between the ith factor and the jth factor and are therefore the same. Additionally, the covariance matrix is positive definite (strictly speaking, positive semidefinite in general, and positive definite when no factor return is an exact linear combination of the others). Among other things, this means that the matrix has a square root; that is, $V = B^2$ for some symmetric matrix B.

Consider the situation where we are required to evaluate the covariance between the returns of securities A and B. Let $e_A$ and $e_B$ be the factor exposure vectors for the two stocks. Adapting the formula for variance previously discussed, we have the covariance as

$$ \operatorname{cov}(r_A, r_B) = e_A V e_B^T \tag{3.9} $$
We can therefore calculate the covariance between all the securities in our universe and make them entries in a covariance matrix. This matrix would come in handy to evaluate correlations between securities. Note that if the total universe of securities is about 5000 stocks, then the covariance matrix for the list is a square matrix with 25 million entries. Calculating the variance and covariance of each stock pair individually by sampling past data can be a tedious endeavor. Armed with the factor exposure vectors and the factor covariance matrix, the covariance and correlations between securities may be calculated readily. Thus, the use of the factor covariance matrix reduces the complexity of evaluating the correlations between securities in a dramatic way. The covariance matrix due to the common factors, for all the stocks in the universe, is given by

$$ \text{CovMatrix} = X V X^T \tag{3.10} $$

The off-diagonal entries of this matrix are the covariances of Equation 3.9. The diagonal entries hold only the common factor variances; as in Equation 3.7, the total variance of each stock also includes its specific variance.
It is also prudent to be aware of certain potential issues when working with the factor covariance matrix. For example, it is not uncommon in investment circles to hear someone say, “But you don’t want to be mining the covariance matrix!” Let us examine what they mean by that. Mining here refers to data mining, albeit with a negative connotation. The word is used synonymously with bias, indicating that since the covariance matrix is deduced from historical data, the values are a reflection of the past and may not hold going forward. While the empirical observation has been that covariance matrices are relatively stable, the values used in the covariance matrix may not be exact, and it may be useful to do some sensitivity analysis on applications where we use the covariance matrix. It is probably worthwhile to also bear in mind that along with the covariance matrix, the specific variance is also backward looking and is subject to the “mining syndrome.”

Another issue that is commonly cited with regard to the covariance matrix and its use is the underlying assumption of a Gaussian distribution for returns. The so-called fat tails that are ubiquitous in the distribution of returns in financial time series are not accounted for by the model.
Nevertheless, in the absence of any other information, the covariance matrix serves as a critical piece of information to assess correlations between securities and for use in factor models. Careful use of the covariance matrix can help keep the reconciliation process between the risk and return of a large universe of securities tractable.
Example

Factor exposure for stock A in the two-factor model = (0.5, 0.75)
The specific variance on stock A = 0.0123
The factor covariance matrix for the two-factor model is

$$ V = \begin{bmatrix} 0.0625 & 0.0225 \\ 0.0225 & 0.1024 \end{bmatrix} $$

The variance of return for stock A is

$$ \begin{bmatrix} 0.5 & 0.75 \end{bmatrix} \begin{bmatrix} 0.0625 & 0.0225 \\ 0.0225 & 0.1024 \end{bmatrix} \begin{bmatrix} 0.5 \\ 0.75 \end{bmatrix} + 0.0123 = 0.0901 + 0.0123 = 0.1024 $$

The square root of the variance is the standard deviation = 0.32, or 32%. Thus, stock A has a volatility value of 32%.

Factor exposures for stock B in the two-factor model = (0.75, 0.5)
The covariance between stocks A and B is

$$ \begin{bmatrix} 0.5 & 0.75 \end{bmatrix} \begin{bmatrix} 0.0625 & 0.0225 \\ 0.0225 & 0.1024 \end{bmatrix} \begin{bmatrix} 0.75 \\ 0.5 \end{bmatrix} = 0.0801 $$
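The arithmetic of this example can be reproduced with a few lines of Python. The sketch below takes the factor covariance matrix to be symmetric with off-diagonal entry 0.0225; the helper name `quad_form` and the variable names are illustrative choices, not part of the text.

```python
import math


def quad_form(x, M, y):
    """Return x M y^T for row vectors x, y and a square matrix M (nested lists)."""
    return sum(x[i] * M[i][j] * y[j]
               for i in range(len(x)) for j in range(len(y)))


# Inputs taken from the worked example.
V = [[0.0625, 0.0225],
     [0.0225, 0.1024]]
e_a = [0.5, 0.75]
e_b = [0.75, 0.5]
specific_var_a = 0.0123

common_var_a = quad_form(e_a, V, e_a)          # common factor variance of A
total_var_a = common_var_a + specific_var_a    # total variance of A
vol_a = math.sqrt(total_var_a)                 # volatility of A
cov_ab = quad_form(e_a, V, e_b)                # covariance between A and B

print(round(common_var_a, 4), round(total_var_a, 4),
      round(vol_a, 2), round(cov_ab, 4))
# prints: 0.0901 0.1024 0.32 0.0801
```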
APPLICATION: CALCULATING THE RISK ON A PORTFOLIO

In the earlier sections we discussed how the APT model may be used to calculate the risk on a particular asset. Now we will focus on assessing the risk of an entire portfolio. Just as with a single security, the risk can be expressed as a sum of two components; namely, common factor risk and specific risk. We will adopt the approach in which we reason out the formulas for each of these components, leading in turn to the expression for the risk in a portfolio.

TALKING POINTS

Bear in mind that the APT model is a linear model; that is, the return is a linear combination of factor returns. Is that a limitation of the model? Some would argue that linearity in returns is probably valid only in a fixed range of values for the factor exposures. As the values stretch more and more to the extremes, linearity leads to progressively poorer predictions of asset returns. This is akin to quite a few phenomena encountered in the physical sciences where linear relationships are valid only in a certain operating range. With that said, if you happen to find yourself in a finance conference, in the middle of a discussion on APT at cocktail hour, here is something to try out. Look to a distance, put a finger to your cheek and proclaim wistfully, “I wonder what the consequences are of neglecting the potentially inherent nonlinearities in the modeling process. . . .” In all likelihood, this will cause a whole lot of other people to also wonder with you. If, however, someone were to probe further and ask you to elaborate on your musings, you could always pretend to recognize someone at a distance, smile politely, and excuse yourself, saying that you need to mingle.

Let us start with the common factor variance for a security, which can be computed if we know the factor exposures for the security and the factor covariance matrix. The logic to evaluate the common factor variance for the portfolio is also along the same lines. If we are able to evaluate the net factor exposure for the portfolio, then we can treat it as a single security and evaluate the common factor variance using the formula discussed in the previous section. Let us therefore concentrate our efforts on the evaluation of the factor exposure for the portfolio.

The factor exposure of the portfolio is simply the weighted sum of the factor exposures of all the securities in it. The weight to be used for the exposure of each security is determined by the weight of the security in the portfolio. Let us consider a portfolio composed of two securities, A and B, with exposure vectors given by $e_A$ and $e_B$. Let the weights of the two securities in the portfolio be $h_A$ and $h_B$, respectively. The exposure vector of the portfolio is given as

$$ e_p = h_A e_A + h_B e_B \tag{3.11} $$
Assuming a two-factor model and writing out the formula in matrix form, we have
$$ e_p = \begin{bmatrix} h_A & h_B \end{bmatrix} \begin{bmatrix} \beta_{A,1} & \beta_{A,2} \\ \beta_{B,1} & \beta_{B,2} \end{bmatrix} \tag{3.12} $$
or

$$ e_p = hX \tag{3.13} $$
That is, the exposure of a portfolio is given by the matrix product of the holdings vector and the exposure matrix. Using the value in Equation 3.13 for the factor exposure of the portfolio, the common factor variance is given by substituting the calculated exposure value:
$$ \sigma_{\text{cf}}^2 = e_p V e_p^T \tag{3.14} $$
Performing the substitutions and applying the matrix identity on the transpose of a matrix product, we have
$$ \sigma_{\text{cf}}^2 = h X V X^T h^T \tag{3.15} $$
Let us now tackle the specific risk component. The specific risk is calculated as the variance of the specific return. Now, the specific return on the portfolio is the sum of the specific returns of the assets in the portfolio weighted by the share of the security in the portfolio holdings. Also, note that by model assumptions the specific returns of the securities are uncorrelated with each other. If the returns are uncorrelated with each other, then the overall variance is a sum of the individual specific variances, each weighted by the square of its holding. In the preceding two-security example, the specific variance is given as

$$ \sigma_{\text{specific}}^2 = h_A^2 \operatorname{var}(r_{e,A}) + h_B^2 \operatorname{var}(r_{e,B}) \tag{3.16} $$

Writing this out in matrix form, we have

$$ \sigma_{\text{specific}}^2 = \begin{bmatrix} h_A & h_B \end{bmatrix} \begin{bmatrix} \operatorname{var}(r_{e,A}) & 0 \\ 0 & \operatorname{var}(r_{e,B}) \end{bmatrix} \begin{bmatrix} h_A \\ h_B \end{bmatrix} \tag{3.17} $$

or

$$ \sigma_{\text{specific}}^2 = h \Delta h^T \tag{3.18} $$
Adding it all together we are now able to put down an expression for the total risk in a portfolio based on the portfolio holdings and the parameters of the APT model. The risk/portfolio variance is therefore given as
$$ \sigma_{\text{portfolio}}^2 = \sigma_{\text{cf}}^2 + \sigma_{\text{specific}}^2 = h X V X^T h^T + h \Delta h^T \tag{3.19} $$

The formula to evaluate the variance may be easily adapted to evaluate the covariance between two portfolios. We will leave it to the reader to reason out that the covariance between two portfolios Y and Z is given by

$$ \operatorname{cov}(r_Y, r_Z) = h_Y X V X^T h_Z^T + h_Y \Delta h_Z^T \tag{3.20} $$
In any case, the formula for variance in Equation 3.19 can be used to calculate the risk in a portfolio. Assuming normality, a two-standard-deviation movement on either side of the daily mark to market should be able to catch the price movement for the next day about 95 percent of the time. Yet in practice that may not turn out to be the case. Let us therefore examine the points of failure of the model. We start by listing the inputs to the risk calculation and then examine the different scenarios.

The key inputs are the factor exposures and the covariance matrix. When a big news event occurs with respect to a particular stock, the factor exposures that we use in the model for that stock are no longer valid. The market now trades on the expectation that the factor exposures after the news event are likely to be a lot different. The covariance structure between the factors is still intact. In any case, since the factor exposure input to the risk model is no longer valid, the calculation breaks down for that particular stock.

The second scenario is the occurrence of a huge, scenario-altering macroeconomic event, for example, one relating to interest rates. The big event typically manifests itself in the form of a liquidity crisis. In these situations, the covariance structure breaks down, leading to the breakdown of the model.

Another explanation for observing price moves larger than the two-standard-deviation range expected under the Gaussian assumption is the nonnormal, fat-tailed distribution of asset returns observed in practice. This can be somewhat addressed by calibrating the number of standard deviations to use in our assessment of the range of price movement. It is therefore important that users of the multifactor technology also be aware of the potential points of failure in the model.
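The portfolio risk calculation of this section can be sketched in Python as follows. This is a minimal illustration, assuming a two-stock, two-factor universe; the exposures and factor covariance matrix reuse the numbers of the earlier example, while the specific variance of stock B (0.02), the holdings (0.6, 0.4), and all function names are made up for the illustration.

```python
import math


def portfolio_exposure(h, X):
    """e_p = h X: holdings-weighted sum of the rows of the exposure matrix."""
    return [sum(h[i] * X[i][j] for i in range(len(h)))
            for j in range(len(X[0]))]


def quad_form(x, M, y):
    """Return x M y^T for row vectors x, y and a square matrix M."""
    return sum(x[i] * M[i][j] * y[j]
               for i in range(len(x)) for j in range(len(y)))


X = [[0.5, 0.75],    # exposures of stock A
     [0.75, 0.5]]    # exposures of stock B
V = [[0.0625, 0.0225],
     [0.0225, 0.1024]]
specific_vars = [0.0123, 0.02]   # diagonal of Delta; 0.02 is hypothetical
h = [0.6, 0.4]                   # hypothetical portfolio weights

e_p = portfolio_exposure(h, X)
common_var = quad_form(e_p, V, e_p)                 # h X V X^T h^T
specific_var = sum(w * w * s for w, s in zip(h, specific_vars))  # h Delta h^T
total_var = common_var + specific_var
volatility = math.sqrt(total_var)

print(e_p, common_var, specific_var, volatility)
```

Because the specific variance matrix is diagonal, the term h ∆ h^T reduces to a sum of squared holdings times specific variances, which is why the sketch never builds the full N × N matrix.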
APPLICATION: CALCULATION OF PORTFOLIO BETA

The title of this section is something of a misnomer, and it is likely to strike the reader as an anomaly. We had earlier stated that the APT is a more advanced version of CAPM. Then, if we have access to the APT model, why would we want to calculate the parameters of a simpler CAPM model? Is that not going back full circle? That would indeed be true. However, if we are looking to hedge our portfolio with the market portfolio, then the hedge ratio that provides the best possible hedge is given by the beta of the portfolio. Therefore, it does make sense to calculate beta. While the title may as well read “Determination of Hedge Ratio,” having the word beta in the title helps make the association between beta and the hedge ratio explicit.

To see the relationship between beta and the hedge portfolio, consider the linear combination of the two portfolios in the ratio 1:λ. The return of the linear combination is given by $r_p - \lambda r_m$, where $r_p$ is the return on the portfolio and $r_m$ is the return on the market. The value of λ that results in the least return variance of the combined portfolio is indeed the best hedge ratio. To evaluate the variance of the return, we will apply the algebraic identity
$$ (a + b)^2 = a^2 + b^2 + 2ab \tag{3.21} $$

to the combined portfolio.
$$ (r_p - \lambda r_m)^2 = r_p^2 + \lambda^2 r_m^2 - 2\lambda r_p r_m \tag{3.22} $$
Applying expectations on both sides and using the formulas in the appendix of Chapter 1, we have

$$ \operatorname{var}(r_p - \lambda r_m) = \operatorname{var}(r_p) + \lambda^2 \operatorname{var}(r_m) - 2\lambda \operatorname{cov}(r_p, r_m) \tag{3.23} $$
To find the value for λ that minimizes the variance, we differentiate with respect to λ and equate the derivative to zero. It is easy to then see that

$$ \lambda = \frac{\operatorname{cov}(r_p, r_m)}{\operatorname{var}(r_m)} \tag{3.24} $$
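For readers who want the intermediate step, the differentiation can be written out as below. This is a short worked derivation starting from the variance expression in Equation 3.23, under the assumption that the market variance is strictly positive; the second derivative confirms that the stationary point is a minimum.

```latex
\begin{aligned}
\frac{d}{d\lambda}\,\operatorname{var}(r_p - \lambda r_m)
  &= \frac{d}{d\lambda}\left[\operatorname{var}(r_p)
     + \lambda^2 \operatorname{var}(r_m)
     - 2\lambda \operatorname{cov}(r_p, r_m)\right] \\
  &= 2\lambda \operatorname{var}(r_m) - 2\operatorname{cov}(r_p, r_m) = 0
  \quad\Longrightarrow\quad
  \lambda = \frac{\operatorname{cov}(r_p, r_m)}{\operatorname{var}(r_m)}, \\
\frac{d^2}{d\lambda^2}\,\operatorname{var}(r_p - \lambda r_m)
  &= 2\operatorname{var}(r_m) > 0 .
\end{aligned}
```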
Equation 3.24 also happens to be the definition of beta, and we therefore conclude that beta is the best hedge ratio. Equation 3.24 is composed of the covariance and the variance terms, which we know how to evaluate in the APT framework. Applying the substitutions, we have

$$ \lambda = \frac{e_p V e_m^T + h_p \Delta h_m^T}{e_m V e_m^T + h_m \Delta h_m^T} \tag{3.25} $$
where ep and em are the factor exposures of the portfolio and the market, and hp and hm are their respective holdings vectors. We are thus able to evaluate the optimal hedge ratio.
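As an illustration, the hedge ratio can be evaluated numerically from the factor model parameters. The sketch below assumes a two-stock universe in which the market holds both stocks equally and the portfolio holds only stock A; those holdings, the specific variance of stock B, and the function names are hypothetical, and the specific variance matrix is assumed diagonal.

```python
def exposure(h, X):
    """e = h X: holdings-weighted factor exposure vector."""
    return [sum(h[i] * X[i][j] for i in range(len(h)))
            for j in range(len(X[0]))]


def portfolio_cov(h_y, h_z, X, V, specific_vars):
    """cov(r_Y, r_Z) = h_Y X V X^T h_Z^T + h_Y Delta h_Z^T, Delta diagonal."""
    e_y, e_z = exposure(h_y, X), exposure(h_z, X)
    common = sum(e_y[i] * V[i][j] * e_z[j]
                 for i in range(len(e_y)) for j in range(len(e_z)))
    specific = sum(a * s * b for a, s, b in zip(h_y, specific_vars, h_z))
    return common + specific


X = [[0.5, 0.75],
     [0.75, 0.5]]
V = [[0.0625, 0.0225],
     [0.0225, 0.1024]]
specific_vars = [0.0123, 0.02]   # 0.02 is a hypothetical value for stock B
h_p = [1.0, 0.0]                 # hypothetical portfolio: all in stock A
h_m = [0.5, 0.5]                 # hypothetical market portfolio

cov_pm = portfolio_cov(h_p, h_m, X, V, specific_vars)
var_m = portfolio_cov(h_m, h_m, X, V, specific_vars)
var_p = portfolio_cov(h_p, h_p, X, V, specific_vars)
beta = cov_pm / var_m            # the hedge ratio of Equation 3.25


def hedged_variance(lam):
    """var(r_p - lam * r_m) for a candidate hedge ratio lam."""
    return var_p + lam * lam * var_m - 2.0 * lam * cov_pm


print(beta, hedged_variance(beta), hedged_variance(1.0))
```

Evaluating `hedged_variance` at values on either side of `beta` is a quick way to confirm that no other ratio gives a lower variance for these inputs.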
APPLICATION: TRACKING BASKET DESIGN

A tracking basket is a basket of stocks that tracks an index. If the basket is the same as the index, then the prices of both will be the same at all times. However, if the tracking basket is composed of fewer stocks than the index, then there is likely to be tracking error; that is, the returns for the tracking basket are not exactly the same as the returns for the index. The discrepancy in the returns is expressed in terms of tracking error. Tracking error may be defined as the standard deviation of the difference in the return between the tracking basket and the index.

In the definition of tracking error, the mean value of the difference is assumed to be zero. To see this more clearly, consider a long–short portfolio where we are long the index and short the tracking basket. The expectation is that the return on the index and tracking basket is the same. Therefore, a profit made on one leg of the portfolio is likely to be neutralized by an equal loss on the other leg. The expected value of total return on the portfolio is therefore zero. However, while the expected return is zero, it is possible for the actual return value to be a nonzero value. The extent of this variation from zero is captured by the standard deviation of returns of the long–short portfolio and forms a measure of the tracking error.

A natural deduction from the preceding discussion is that the design of a tracking basket involves designing a portfolio such that it minimizes the tracking error. Writing out the equations for the variance of the error in returns, we have
$$ \text{Min: } E\left[(r_m - r_p)^2\right] = \operatorname{var}(r_m) + \operatorname{var}(r_p) - 2\operatorname{cov}(r_m, r_p) \tag{3.26} $$
where $r_p$ is the return on the tracking basket and $r_m$ is the return on the market. Expanding the terms using APT constructs, we have

$$ \text{Min: } h_m X V X^T h_m^T + h_m \Delta h_m^T + h_p X V X^T h_p^T + h_p \Delta h_p^T - 2\left[h_m X V X^T h_p^T + h_m \Delta h_p^T\right] \tag{3.27} $$
Bear in mind that there are some constraints on the values of hp; that is, some of them are forced to have zero values even though they are part of the index.
Note that in Equation 3.27, the variance of the market portfolio is not at all affected by changing the composition of the tracking basket. The only two terms that are affected when we change the tracking basket composition are the tracking basket variance and the covariance term. Therefore, minimizing the tracking error is equivalent to minimizing the tracking basket variance less twice the covariance term.

The error variance may also be viewed as a sum of two components; namely, a common factor component and a specific component. Now, if we were to design the tracking basket such that the factor exposures of the basket match the factor exposures of the index exactly (even though their contents may not be identical), then the common factor component goes to zero. We are now left only with the specific components of the variance. Recall from our earlier discussion that the contributions to the total variance from the specific components are a lot less than they would be from the common factor component. Furthermore, if the two portfolios are highly diversified, the expected value of the specific returns on the portfolios is zero. Hence, the tracking error contribution in this case is solely due to the different specific returns in the portfolios. It is now easy to appreciate that a good starting point to the design of tracking baskets is to ensure that the factor exposures match as closely as possible. Hence, APT constructs may be used to design tracking baskets.
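The minimization can be illustrated with a deliberately small search. The sketch below assumes a three-stock index and a basket restricted to the first two stocks, with basket weights that sum to one; the index weights, the exposures and specific variance of the third stock, the grid search, and the function names are all assumptions made for the illustration. A practical design would use a proper optimizer over a much larger universe.

```python
import math


def exposure(h, X):
    """e = h X: holdings-weighted factor exposure vector."""
    return [sum(h[i] * X[i][j] for i in range(len(h)))
            for j in range(len(X[0]))]


def portfolio_cov(h_y, h_z, X, V, specific_vars):
    """cov(r_Y, r_Z) = h_Y X V X^T h_Z^T + h_Y Delta h_Z^T, Delta diagonal."""
    e_y, e_z = exposure(h_y, X), exposure(h_z, X)
    common = sum(e_y[i] * V[i][j] * e_z[j]
                 for i in range(len(e_y)) for j in range(len(e_z)))
    specific = sum(a * s * b for a, s, b in zip(h_y, specific_vars, h_z))
    return common + specific


X = [[0.5, 0.75],
     [0.75, 0.5],
     [0.6, 0.6]]                       # third row is hypothetical
V = [[0.0625, 0.0225],
     [0.0225, 0.1024]]
specific_vars = [0.0123, 0.02, 0.015]  # last two values are hypothetical
h_m = [0.4, 0.35, 0.25]                # hypothetical index weights


def tracking_variance(h_p):
    """var(r_m - r_p), computed on the holdings difference h_m - h_p."""
    d = [m - p for m, p in zip(h_m, h_p)]
    return portfolio_cov(d, d, X, V, specific_vars)


# The basket may hold only the first two stocks, so the third holding is
# forced to zero. Weights summing to one is an assumption of this sketch.
candidates = [[w / 100.0, 1.0 - w / 100.0, 0.0] for w in range(101)]
best = min(candidates, key=tracking_variance)
tracking_error = math.sqrt(tracking_variance(best))

print(best, tracking_error)
print(exposure(h_m, X), exposure(best, X))
```

Computing the variance on the holdings difference is algebraically the same as expanding the index variance, the basket variance, and the covariance term separately; comparing the two printed exposure vectors shows how closely the chosen basket matches the factor exposures of the index.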
SENSITIVITY ANALYSIS

In our discussion on the covariance matrix we talked about the mining syndrome, the idea being that the estimation of the covariance matrix is biased to the past and may not hold going forward into the future. This apprehension may be objectively examined by studying the stability of the covariance matrix. One approach would be to estimate a sequence of covariance matrices and study the variations between two consecutive ones. The extent to which the variations affect the particular situation (say, the evaluation of beta, the measurement of risk, or the design of tracking baskets) may be gauged by perturbing the current covariance matrix with the sequence of observed changes and running the calculations with the perturbed matrix. This results in a set of values for the estimated parameter. We can then treat the set of values as realizations from a probability distribution and get an idea of the error in our estimates using the current covariance matrix to help us quantify the extent of uncertainty due to the mining syndrome.
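The perturbation procedure can be sketched as follows for the volatility of a single stock. The sequence of past covariance matrix estimates is invented for the illustration, as are the function names; only the current matrix, the exposures, and the specific variance reuse the numbers of the earlier example.

```python
import math


def quad_form(x, M, y):
    """Return x M y^T for row vectors x, y and a square matrix M."""
    return sum(x[i] * M[i][j] * y[j]
               for i in range(len(x)) for j in range(len(y)))


# Hypothetical factor covariance matrices estimated at successive dates,
# oldest first; the last one is treated as the current estimate.
history = [
    [[0.0600, 0.0200], [0.0200, 0.1000]],
    [[0.0640, 0.0230], [0.0230, 0.0990]],
    [[0.0610, 0.0215], [0.0215, 0.1050]],
    [[0.0625, 0.0225], [0.0225, 0.1024]],
]
current = history[-1]

# Observed changes between consecutive estimates.
changes = [
    [[later[i][j] - earlier[i][j] for j in range(2)] for i in range(2)]
    for earlier, later in zip(history, history[1:])
]

e = [0.5, 0.75]          # factor exposures of the stock
specific_var = 0.0123    # specific variance of the stock


def volatility(V):
    """Standard deviation of return for exposures e under covariance V."""
    return math.sqrt(quad_form(e, V, e) + specific_var)


baseline = volatility(current)

# Perturb the current matrix with each observed change and recompute.
perturbed = [
    volatility([[current[i][j] + d[i][j] for j in range(2)] for i in range(2)])
    for d in changes
]
spread = max(perturbed) - min(perturbed)

print(baseline, perturbed, spread)
```

The list `perturbed` plays the role of the set of realizations mentioned above; with a longer history one could summarize it by its standard deviation or by percentiles instead of the simple range used here.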
SUMMARY

Factor models are models that are used to explain the risk/return characteristics of assets and come in many flavors. Even though the details may vary, factor models are firmly based on the principles of arbitrage pricing theory (APT). A factor model is considered fully specified by the factor exposures, the factor covariance matrix, and the specific variance matrix. A factor model may be used as a framework to estimate many commonplace parameters that may be needed in the course of the investment process. Examples of such computations include the estimation of risk on a portfolio, the evaluation of portfolio beta, and computing the contents of a tracking basket. The factor covariance matrix is a crucial piece of information that the factor model provides. However, it must be used with care.
FURTHER READING MATERIAL

King, Benjamin F. “Market and Industry Factors in Stock Price Behavior.” Journal of Business 39, no. 1 (1966): 139–190.
Ross, Stephen A. “The Arbitrage Theory of Capital Asset Pricing.” Journal of Economic Theory 13 (1976): 341–360.
Jarrow, Robert, and Andrew Rudd. “A Comparison of the APT and CAPM: A Note.” Journal of Banking and Finance 7, no. 3 (1983): 295–303.
Grinold, Richard C., and Ronald N. Kahn. Active Portfolio Management, Second Edition. New York: McGraw-Hill, 1999.
CHAPTER
4
Kalman Filtering
INTRODUCTION

Control theory is a branch of engineering that deals with the control of engineering systems. The engineering systems could be from diverse domains, and the applications include controlling the power output of an automotive engine, stabilizing the rate of rotation of an electric motor, and controlling the “rate of reaction” (or the speed of a chemical process). The control of these systems is exercised by the manipulation of so-called control variables. For example, in the case of controlling the power output of an automotive engine, the control variable could be the amount of fuel injected into the engine. This would control the thrust of the piston and therefore the power output from the engine. Similarly, in the other two examples, the control variables could be the amount of current flowing through the motor coils or the ambient temperature of a chemical process. Thus, the control variables provide a harness that helps us to control the system effectively.

Prior to the proposal of Kalman filtering, the typical approach for system control involved the specification of a fully comprehensive mathematical model describing the system dynamics. The model is usually formulated in the form of a differential equation. This helps to determine in a quantitative manner the effects of the control variables on system dynamics. Control is then effected by manipulating the variables as prescribed by the model.

Along with the preceding approach also came a painful realization of its limitations. The mathematical models may not be 100-percent accurate, as there may be some approximations used in the modeling process. Additionally, the instruments used to measure the system parameters may have some built-in inaccuracies that could result in measurement error. To compound things even further, there may also be some extraneous disturbances to the system that cannot be anticipated and modeled in a deterministic fashion. Hence, the aforementioned methodology becomes increasingly harder to implement as the systems grow in complexity.
It was under these circumstances that the Kalman filter was proposed by R. E. Kalman. He addressed these issues in a direct and practical manner and presented his ideas in a ground-breaking article¹ titled “A New Approach to Linear Filtering and Prediction Problems.” The approach caught the attention and imagination of the engineering community, and the ideas found application in multiple domains.
One of the key contributions of Kalman’s article is the notion of “system state,” or the current state of the system. It is represented as a vector of the current values of various system parameters. The vector itself is deduced from a set of measurements on the system that is in turn translated into system-state terms. This translation is typically modeled as a linear equation. Thus, the means to make observations and the equation translating the observation into the system state fully characterize the notion of system state.
Having established the notion of system state, Kalman proceeded to cast a dynamical system in terms of system states. A dynamical system in the Kalman-filtering approach is modeled as a sequence of transitions from one system state to another. These transitions are also modeled as linear equations. Next, he asserted that to monitor the system effectively (for purposes of control) it makes sense to make an assessment of the state that we are currently in and the state that we expect to transition to in the next time step. In other words, we are in a situation where we are constantly predicting the next system state and taking measurements to verify the predictions. The Kalman filter provides a prescription to reconcile this sequence of predictions followed by measurements to arrive at a sequence of optimal estimates for system states. This approach to thinking of systems as a sequence of state transitions was a radical departure from the thinking at the time.
It started a revolution of sorts in the field of control theory and marks the beginning of a new era in the field commonly referred to by many as Modern Control Theory.
The practical nature of the modeling process and solution approach made the Kalman filter immediately applicable to a wide variety of situations. In fact, one of the first applications of the Kalman filter was in the lunar module of Apollo 11, the spacecraft for the first landing on the Moon. Therefore, if anything in the book should qualify as rocket science, this definitely fits the bill. ☺
So, how does this fit into our scheme of things? For one, we use the Kalman-filtering technique to filter the noise from the observed spread in the case of risk arbitrage. Describing this in the introduction therefore helps provide the context for its application later in the book. To help to illustrate the Kalman-filtering ideas and also as a matter of interest, we apply the Kalman-filtering concepts to smooth out a random walk. Now, many practitioners of technical analysis make use of so-called moving averages to smooth out or filter price series. This method of using moving averages may be thought of as an attempt to estimate the sequence of stock prices (states) after filtering out the noise. The common peeve against the moving averages has always been that they tend to lag when there is a sharp and sudden change in price movement. The Kalman filter can help construct better smoothers. Although we do not delve deeply into the matter, we believe that this approach may very well contain the seeds of reasoning for the observation of the so-called Fibonacci retracements, which is well documented in the area of technical analysis.
¹ Kalman, R. E. (1960). “A New Approach to Linear Filtering and Prediction Problems.” Transactions of the ASME, Journal of Basic Engineering 82 (Series D): 35–45.
THE KALMAN FILTER
Continuing on the theme of the previous section, the Kalman-filtering process can best be described as a three-step process of prediction, observation, and reconciliation or correction. In the prediction step, we predict the next system state based on our knowledge of the current system state. Along with it, we also estimate the error in our prediction. This completes the prediction step.
Next we take a reading of the state of the system after allowing for a fixed amount of time to elapse, the idea being that the system would now have transitioned to the new state. The readings can be translated into system-state terms based on a mathematical model. Similar to the prediction step, we also estimate the error associated with our observation. The observation along with an estimate of the error constitutes the observation step.
We now have two estimates for the states involved: one based on our prediction and the other based on our observation. The natural next step is then to reconcile the two state estimates, taking into account the magnitude of the associated errors. Stating it differently, the predicted estimate is corrected based on the observation. This is therefore called the correction step. This reconciled estimate of the system state from the correction step is the final estimate of the current system state. The preceding process is then repeated for the state at the next time instance, making the Kalman filter a recursive prediction–correction method. The preceding steps are also illustrated in the form of a diagram in Figure 4.1.
The reader is probably now curious as to how the correction to the predicted value is effected. Let us discuss that briefly. Note that it is possible to translate the prediction of the next state into a set of expected observations.
[FIGURE 4.1 The Kalman Filtering Process. A loop of three stages: Prediction Step (predict next state; estimate prediction error variance), Observation Step (take measurements; estimate observation error variance), and reconciliation (get reconciled state estimate and variance of estimate), after which the reconciled state estimate is used for the next prediction.]
Let us call this the predicted observation. The equation for the corrected state in a Kalman filter is given as
$$\text{corrected state} = \text{predicted state} + k\,(\text{actual observation} - \text{predicted observation})$$
The difference between the actual observation and the predicted observation is called the observation innovation. Note that a fraction of the observation innovation is added as a correction to the predicted state. The value of this fraction $k$ is known as the Kalman gain. The Kalman-filtering approach provides a prescription on what would be the most appropriate value to use for $k$. This value is decided such that the corrected state has the least amount of error variance associated with it.
Besides providing the prescription to reconcile the prediction and observation, Kalman also provided definitive proof that the process is indeed optimal in the case where the mathematical models of state and observation are both linear and the errors are drawn from independent Gaussian distributions. We will, of course, not delve into the proofs, but rather try to explain the basic idea by way of illustrations. With that said, we introduce some notation and formally list the steps involved in the Kalman-filtering process.
Let $X_t$ denote the state at time $t$. Note that the value for the state can also be a vector; that is, the state has a multidimensional representation. The mathematical model used to predict the state at time $t$ in a Kalman filter setting is typically of the form
$$X_t = A X_{t-1} + u_t \tag{4.1}$$
where $A$ is a matrix, $X_t$ and $X_{t-1}$ are the state vectors at time $t$ and $t-1$, respectively, and $u_t$ is the error vector that accounts for the impreciseness of the model. Next we make an observation at time $t$. Let us call this observation $Y_t$. The measurements made are a linear combination of the state elements and therefore can be written as
$$Y_t = H X_t + v_t \tag{4.2}$$
where $H$ is a matrix and $v_t$ is the measurement error vector. In this scenario the values of the matrices $A$ and $H$ are known. Initially we make a prediction of the state at time $t$, knowing all the state information up to time $t-1$. Let us denote this estimate $\hat{X}_{t|t-1}$. The error in this estimate is measured as the variance in the case of a single dimensional state and as a covariance matrix in the case of a multidimensional state. Let us denote it generally as $\hat{P}_{t|t-1}$. Just as in the case of the predicted value, the measurement also has an error variance/covariance matrix associated
with it. Let us call that $R$. We are now ready to formally list the steps in the Kalman-filtering process as follows:
1. Evaluate $\hat{X}_{t|t-1}$ and $\hat{P}_{t|t-1}$ using the state equation.
2. Find the observation $Y_t$ and $R$ by observing the system.
3. Evaluate $K_t$, also known as the Kalman gain, which will be used to obtain the linear minimum error variance estimate.
4. Evaluate $\hat{X}_{t|t}$ given by $\hat{X}_{t|t-1} + K_t\,(Y_t - H\hat{X}_{t|t-1})$.
5. Finally, evaluate $\hat{P}_{t|t}$, the error variance/covariance of $\hat{X}_{t|t}$.
These steps are repeated for the next time step. The formulas for the evaluations at each step are relegated to the appendix at the end of this chapter. Upon examining the equations in the appendix, one can say that they do seem a little cryptic, and the reasoning and rationale behind them are not evident. In the subsequent sections we will illustrate the ideas behind the equations.
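The five steps listed above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the book's own code: the function name `kalman_step` is hypothetical, and the matrix `Q` (the covariance of the model error $u_t$ in Equation 4.1) is an assumption introduced here so that the prediction error can grow between observations.

```python
import numpy as np

def kalman_step(x_prev, P_prev, y, A, H, Q, R):
    """One prediction-observation-correction cycle (hypothetical helper)."""
    # Step 1: predict the state and its error covariance from the state equation.
    # Q, the covariance of the model error u_t, is an assumption of this sketch.
    x_pred = A @ x_prev
    P_pred = A @ P_prev @ A.T + Q
    # Step 2: the observation y and its error covariance R are supplied by the caller.
    # Step 3: Kalman gain.
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)
    # Step 4: add a fraction of the observation innovation to the prediction.
    x_corr = x_pred + K @ (y - H @ x_pred)
    # Step 5: error covariance of the corrected state.
    I_KH = np.eye(len(x_prev)) - K @ H
    P_corr = I_KH @ P_pred @ I_KH.T + K @ R @ K.T
    return x_corr, P_corr

# One-dimensional example with made-up numbers: all variances equal to 1.
A = np.array([[1.0]])
H = np.array([[1.0]])
Q = np.array([[1.0]])
R = np.array([[1.0]])
x, P = kalman_step(np.array([0.0]), np.array([[1.0]]), np.array([1.0]), A, H, Q, R)
```

In this toy case the prediction variance is 2, so the gain works out to 2/3 and the corrected estimate sits two thirds of the way from the prediction toward the observation.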
THE SCALAR KALMAN FILTER
In this section, we will discuss the estimation of the value of a constant. Let us first examine how to do it in normal course. The typical method would be to take multiple measurements of the value and use the average of the measured values as an estimate of the constant. The reasoning behind the approach is that the measurements could have errors associated with them; that is, some measured values could be greater than the true value of the constant, and others could be lower. By taking an average of the values, we expect the errors to cancel each other out. More precisely, the standard deviation of the error in the average goes down by a square root of n factor, where n is the number of measurements.
The consequence of this method is in fact a well-known statistical concept. By increasing the number of observations of a constant variable and taking averages, we can make the error in our estimate of the constant as small as desired. However, a caveat to that approach is that we will need to wait until the last of the n measurements have been completed before coming up with an estimate of the constant (which may not be a bad idea at all). The Kalman filter, however, makes an estimate of the value of the constant based on the current available information and updates the estimate as and when more observations are made. Of course, the result after n observations in both cases will be the same.
Note that if the error variances associated with different observations are the same, then a simple average will be fine. However, in the case where the observed values have different levels of accuracy, we would like to assign more weight to the observations with greater accuracy. In such cases, instead of taking a simple average that weights each point uniformly, a weighted average solution would be more appropriate. We will illustrate how to address the weighted average situation using the Kalman filter concepts and thereby hope to provide some insight to the calculation of the Kalman gain.
In the estimation of the constant value, the system state is the one-dimensional constant value itself. Let us say that our current estimate of the constant value is $\hat{x}_{i|i}$, with the error in the estimate being $\varepsilon_i$; that is,
$$\hat{x}_{i|i} = x + \varepsilon_i \tag{4.3}$$
where $x$ is the true value of the constant. Although we do not know what the exact value of $\varepsilon_i$ is, we do know that it has a zero mean and known variance given by $\sigma_{\varepsilon,i}^2$. Let us now do the prediction step. Since the value stays a constant, our prediction for the next state is the current value itself. Therefore, we have
$$\hat{x}_{i+1|i} = \hat{x}_{i|i}, \qquad \operatorname{var}\left(\hat{x}_{i+1|i}\right) = \sigma_{\varepsilon,i}^2 \tag{4.4}$$
Next we make a measurement. Let us call the measurement of the constant value $y_i$ and the error associated with it $\eta_i$, with zero mean and known variance given as $\sigma_{\eta,i}^2$. Therefore at the end of the measurement step, we have the observation
$$y_i = x + \eta_i, \qquad \operatorname{var}\left(y_i\right) = \sigma_{\eta,i}^2 \tag{4.5}$$
We now have a prediction and an observation, and we wish to reconcile the two values. We do this in the reconciliation/correction step where we correct the predicted value based on the observation. Writing out the correction equation, we have
$$\hat{x}_{i+1|i+1} = \hat{x}_{i+1|i} + k\left(y_i - \hat{x}_{i+1|i}\right) \tag{4.6}$$
Note that this can be rewritten as
$$\hat{x}_{i+1|i+1} = (1 - k)\,\hat{x}_{i+1|i} + k\,y_i \tag{4.7}$$
That is, the corrected state estimate is a weighted average of the two state estimates, with k, the Kalman gain, being the weight. The choice for the
Kalman gain must be such that the error variance of the final estimated state is minimized. Writing out the variance for the final state estimate we have
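$$\operatorname{var}\left(\hat{x}_{i+1|i+1}\right) = (1 - k)^2 \operatorname{var}\left(\hat{x}_{i+1|i}\right) + k^2 \operatorname{var}\left(y_i\right) = (1 - k)^2 \sigma_{\varepsilon,i}^2 + k^2 \sigma_{\eta,i}^2 \tag{4.8}$$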
The task is now to find the value of k that minimizes the variance. Instead of doing the derivation mathematically, we will arrive at the value of k by analogy to some high school circuit theory. Consider the parallel circuit as shown in Figure 4.2. The fractions of current flowing into the two arms of the circuit are k and 1 – k, and the value of k is given in Equation 4.10. According to Ohm’s law, the flow on each arm is inversely proportional to the resistance of that arm, so more of the current takes the path of lower resistance. According to Kirchhoff’s law, the sum of the current flowing through each arm of the circuit must be equal to the total current flow I. Also, the flow of current is such that minimum energy is expended in the process. The energy of the circuit is given as
$$E = I^2\left[(1 - k)^2 R_\varepsilon + k^2 R_\eta\right] \tag{4.9}$$
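Equating $R_\varepsilon = \sigma_\varepsilon^2$ and $R_\eta = \sigma_\eta^2$, it is now easy to see the similarity between the two situations. The fraction of current and the Kalman gain is given as
$$k = \frac{\sigma_\varepsilon^2}{\sigma_\eta^2 + \sigma_\varepsilon^2} \tag{4.10}$$
[FIGURE 4.2 A Parallel Circuit. A total current I splits between two parallel arms: the arm with resistance $R_\varepsilon$ carries I(1 – k), and the arm with resistance $R_\eta$ carries I·k.]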
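For readers who prefer the direct route over the circuit analogy, the same gain can be obtained from Equation 4.8 with elementary calculus. The following is a short sketch of that step, using the symbols of Equations 4.3 to 4.8; the second derivative is positive, so the stationary point is indeed a minimum.

```latex
\begin{aligned}
\frac{d}{dk}\Big[(1-k)^2\,\sigma_{\varepsilon,i}^2 + k^2\,\sigma_{\eta,i}^2\Big]
  &= -2(1-k)\,\sigma_{\varepsilon,i}^2 + 2k\,\sigma_{\eta,i}^2 = 0 \\
\Longrightarrow\quad
k &= \frac{\sigma_{\varepsilon,i}^2}{\sigma_{\varepsilon,i}^2 + \sigma_{\eta,i}^2},
\qquad
\frac{d^2}{dk^2}\Big[\cdot\Big] = 2\,\sigma_{\varepsilon,i}^2 + 2\,\sigma_{\eta,i}^2 > 0 .
\end{aligned}
```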
The effective resistance of the circuit or the equivalent variance of the combination is given as
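$$R = \operatorname{var}\left(\hat{x}_{i+1|i+1}\right) = \frac{R_\varepsilon R_\eta}{R_\varepsilon + R_\eta} = \frac{\sigma_\varepsilon^2 \sigma_\eta^2}{\sigma_\varepsilon^2 + \sigma_\eta^2} \tag{4.11}$$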
Having determined the value of k, we can now compute the corrected state and the variance of the corrected state. The process is now repeated again for the next measurement until we reach the last of the planned n measurements. The final outcome of taking a weighted average after making all the measurements will work out to be the same value calculated using the Kalman procedure.
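The claim that the recursive procedure and the weighted average agree can be checked numerically. The sketch below is illustrative only: the function name `estimate_constant` and the sample numbers are made up, and starting the recursion from the first measurement is an assumption, since the text does not say how the first estimate is obtained.

```python
def estimate_constant(measurements, variances):
    """Recursive estimate of a constant from measurements with known error variances."""
    # Assumption: the first measurement and its variance initialize the state.
    x_hat, var = measurements[0], variances[0]
    for y, var_y in zip(measurements[1:], variances[1:]):
        # Prediction: the value is constant, so the prediction is the current estimate.
        k = var / (var + var_y)            # Kalman gain, Equation 4.10
        x_hat = x_hat + k * (y - x_hat)    # correction, Equation 4.6
        var = var * var_y / (var + var_y)  # variance of corrected state, Equation 4.11
    return x_hat, var

ys = [10.2, 9.7, 10.4, 9.9]   # made-up measurements
vs = [0.5, 1.0, 2.0, 0.25]    # made-up error variances
x_kf, var_kf = estimate_constant(ys, vs)

# Batch alternative: weight every measurement by the inverse of its variance.
inv = [1.0 / v for v in vs]
x_batch = sum(w * y for w, y in zip(inv, ys)) / sum(inv)
var_batch = 1.0 / sum(inv)
```

Both routes give the same estimate and the same variance; the recursive one simply has an answer available after every measurement.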
FILTERING THE RANDOM WALK
Let us now discuss the application of the Kalman filter to the random walk. From Chapter 2, on time series, we know that a random walk series is a simple sum of white noise realizations up to the current time. In other words, the next point in the random walk series is evaluated by adding to the current point a random drawing from a Gaussian distribution. Also note that this has relevance to stock prices, as the logarithm of stock prices is typically modeled as a random walk.
Now suppose we are assigned the task of watching the random walk. The outcome of the watching exercise is to come up with the random walk series. In stock price terms, the watching exercise translates to coming up with a time series of stock prices. To do that, we observe the prices at regular time intervals and record them. The resulting sequence of values constitutes a random walk, and our mission is accomplished.
Note that the last traded price at each instance is known without error, and it is therefore possible to observe the series without error. And if there is no error in the observation, then the Kalman filter model does not apply. So why do we even attempt such an exercise? We address this matter in the following discussion.
Broadly speaking, the price at any given time instance may be construed as the price at which the supply meets demand. Let us call this the equilibrium price. Let us now frame the aim of the watching exercise as generating the sequence of equilibrium prices over consecutive time intervals. In this context, the periodic measurement approach amounts to using the observed price at a specific time as the equilibrium price for the time interval. It is now easy to make the case that the prices at the end of regular chunks of time are indeed approximations of the equilibrium price for the time chunk, and the notion of observation error begins to make sense. It is therefore definitely reasonable to apply Kalman-filtering ideas to stock price series with the logarithm of prices modeled as a random walk.
Summarizing the discussions so far, we have assigned ourselves the task of watching a random walk and making observations at regular time intervals. Each of the observations has a measure of error associated with it. The purpose of the exercise is to come up with a plausible set of system states.
Keeping with the notational conventions already established, let us denote the sequence of observations starting from time t = 0, the beginning of our watching exercise, as $y_t$, and the true states as $x_t$. The observation equation may be written $y_t = x_t + e_t$; that is, the observation is the true state plus some error. Now, according to the definition of the random walk, we also have $x_t = x_{t-1} + \varepsilon_t$ (the current state is the previous state plus an innovation). Writing this as a sequence of predictions and observations, we have
$y_0 = x_0 + e_0$ (observation)
$x_1 = x_0 + \varepsilon_1$ (prediction)
$y_1 = x_1 + e_1$ (observation)
$x_2 = x_1 + \varepsilon_2$ (prediction)
$y_2 = x_2 + e_2$ (observation)
...
Writing the equations in matrix form, we have
$$\begin{bmatrix} y_0 \\ 0 \\ y_1 \\ 0 \\ y_2 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ -1 & 1 & 0 \\ 0 & 1 & 0 \\ 0 & -1 & 1 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x_0 \\ x_1 \\ x_2 \end{bmatrix} + \begin{bmatrix} e_0 \\ -\varepsilon_1 \\ e_1 \\ -\varepsilon_2 \\ e_2 \end{bmatrix}$$
For notational convenience let us denote the system of equations as
$$Y_2 = H_2 X_2 + \eta_2$$
The 2 in the subscript is the index of the state and is indicative of the number of state estimates used in forming the equations. The above is a set of five equations with three unknown values $x_0$, $x_1$, $x_2$. Given that there are more equations than unknowns, the system is also referred to as an overdetermined set. If the errors at each stage are drawn from identical, independent
normal distributions, then the typical solution to the overdetermined set is obtained by applying the least squares method:
$$X_2 = \left(H_2^T H_2\right)^{-1} H_2^T Y_2 \tag{4.12}$$
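As a numerical illustration of Equation 4.12, the sketch below builds the five-equation system for three states and solves it by least squares with NumPy. The observation values are made up, and the helper name `least_squares_states` is hypothetical.

```python
import numpy as np

# Rows follow the sequence of observation and prediction equations.
H2 = np.array([
    [ 1.0,  0.0, 0.0],   # y0 = x0 + e0
    [-1.0,  1.0, 0.0],   # 0  = -x0 + x1 - eps1
    [ 0.0,  1.0, 0.0],   # y1 = x1 + e1
    [ 0.0, -1.0, 1.0],   # 0  = -x1 + x2 - eps2
    [ 0.0,  0.0, 1.0],   # y2 = x2 + e2
])

def least_squares_states(y0, y1, y2):
    """Solve the normal equations (H2^T H2) X2 = H2^T Y2 for (x0, x1, x2)."""
    Y2 = np.array([y0, 0.0, y1, 0.0, y2])
    return np.linalg.solve(H2.T @ H2, H2.T @ Y2)

x_hat = least_squares_states(4.50, 4.52, 4.49)   # made-up log prices

# Weights that the estimate of x2 places on (y0, y1, y2).
W = np.linalg.solve(H2.T @ H2, H2.T)   # equals (H2^T H2)^{-1} H2^T
weights_x2 = W[2, [0, 2, 4]]
```

The last row of the solution operator shows how much each observation contributes to the estimate of the most recent state.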
That is, we multiply both sides by the transpose of $H_2$, to get a system of three equations with three unknowns and then solve the system of equations. This would be our solution method in the normal course. Now let us move on to the next observation $y_3$. Upon obtaining the new data point, the estimate of our state is determined by the solution of the equation
$$Y_3 = H_3 X_3 + \eta_3$$
a system of seven equations with four unknown variables. We could then use the typical approach to solve the overdetermined equations and obtain an estimate for the value of $x_3$. Note, however, that as the number of observations increases, the size of the matrices grows, and the computational costs could potentially go up. It is here that the Kalman-filtering algorithm comes to our rescue. The results from the previous computations are used in an iterative fashion to estimate the value of the next state. The value of $x_2$ and its variance as calculated in the previous step is used in the evaluation of the state $x_3$, thereby keeping the computational cost of evaluating the next step the same regardless of how far down the time scale we are. Regardless, the end result of the state estimate in the Kalman-filtering case is the same as solving the set of equations using the least squares approach.
Note that we have made an important assumption in the process; that is, the state variance at each time step is equal to the observation variance. (This is in addition to the assumption of independence of the error distributions.) Thus, with the preceding assumptions, the Kalman filter boils down to a least squares solution of equations. The twist is that the solution is calculated in a recursive fashion. This version of the Kalman filter is therefore known as the recursive least squares method.
The assumption of identical and independent error distributions manifests itself in the covariance matrix of the errors. The independence also implies that the errors are not correlated.
Therefore, the off-diagonal elements in the covariance matrix are all zero. The diagonal elements in the covariance matrix are the variances of the error terms. If they are drawn from identical distributions, then the variances should be the same. Therefore, the covariance matrix in this case may be represented as the identity matrix multiplied by a constant. Let us now turn our attention back to the solution of the preceding model. It turns out that the estimated state at a given time for that set of
equations can be represented as a weighted linear combination of the observations. Additionally, the weights are actually ratios of Fibonacci numbers. The Fibonacci sequence of numbers is constructed starting from two seed numbers, $F_0 = 0$ and $F_1 = 1$. The next number in the series is generated by adding the last two numbers in the series, $F_n = F_{n-1} + F_{n-2}$. Applying the formula in an iterative fashion, we obtain the Fibonacci sequence as follows: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, . . . . There are a variety of situations in which the Fibonacci numbers appear. Some sources of information on Fibonacci numbers are listed in the reference section. In any case, the solution to our problem, that is, the estimate of state $x_2$, is given as
$$x_2 = \frac{5}{8}\,y_2 + \frac{2}{8}\,y_1 + \frac{1}{8}\,y_0$$
Similarly, $x_3$ is given as
$$x_3 = \frac{13}{21}\,y_3 + \frac{5}{21}\,y_2 + \frac{2}{21}\,y_1 + \frac{1}{21}\,y_0$$
In general, if we are to estimate $x_T$, then
$$x_T = w_0 y_T + w_1 y_{T-1} + w_2 y_{T-2} + \cdots \tag{4.13}$$
where
$$\left(w_0, w_1, w_2, \ldots, w_T\right) = \left(\frac{F_{2(T+1)-1}}{F_{2(T+1)}}, \frac{F_{2(T+1)-3}}{F_{2(T+1)}}, \frac{F_{2(T+1)-5}}{F_{2(T+1)}}, \ldots, \frac{F_1}{F_{2(T+1)}}\right) \tag{4.14}$$
Note that the first weight is the ratio of two consecutive Fibonacci numbers. The ratio $F_{2(T+1)} / F_{2(T+1)-1}$ approaches the value $g$. The value $g$ is famously known as the golden mean ratio. It is given by the formula $g = (1 + \sqrt{5})/2$ and has an approximate value of 1.618. The first weight in the observation is actually the reciprocal of the ratio, $1/g \approx 0.618$. The subsequent weights are $1/g^3$, $1/g^5$, . . . and so on. To see that the second ratio is $1/g^3$, consider the following:
$$\frac{F_{2(T+1)-3}}{F_{2(T+1)}} = \frac{F_{2(T+1)-3}}{F_{2(T+1)-2}} \cdot \frac{F_{2(T+1)-2}}{F_{2(T+1)-1}} \cdot \frac{F_{2(T+1)-1}}{F_{2(T+1)}} \approx \frac{1}{g^3} \tag{4.15}$$
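The Fibonacci weights are easy to reproduce and to compare against the recursive filter. The sketch below is illustrative: the function names and sample observations are made up, both variances are set to 1 (only their ratio matters here), and the recursion is started from the first observation, which is an assumption consistent with the first equation $y_0 = x_0 + e_0$.

```python
def fibonacci(n):
    """Return F_n with F_0 = 0 and F_1 = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def fibonacci_weights(T):
    """Weights (w_0, ..., w_T) of Equation 4.14; w_0 multiplies the newest observation y_T."""
    denom = fibonacci(2 * (T + 1))
    return [fibonacci(2 * (T + 1) - (2 * j + 1)) / denom for j in range(T + 1)]

def kalman_random_walk(ys):
    """Scalar Kalman filter for a random walk with unit state and observation variances."""
    x_hat, var = ys[0], 1.0            # assumption: start from the first observation
    for y in ys[1:]:
        var_pred = var + 1.0           # the innovation adds one unit of variance
        k = var_pred / (var_pred + 1.0)
        x_hat = x_hat + k * (y - x_hat)
        var = (1.0 - k) * var_pred
    return x_hat

ys = [4.50, 4.52, 4.49, 4.55]          # made-up observations y_0 .. y_3
w = fibonacci_weights(3)               # weights for y_3, y_2, y_1, y_0
x_weighted = sum(wj * y for wj, y in zip(w, reversed(ys)))
x_filtered = kalman_random_walk(ys)
```

The weighted average and the recursive filter return the same estimate of the latest state, and the first weight moves toward $1/g$ as more observations are included.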
We will once again draw attention to the fact that the solution presented in Equations 4.13 and 4.14 stays valid only when the state variance is equal to the observation variance. In other situations, we need to obtain an estimate of the state and observation variance at each time step. We will therefore conclude this section with a brief discussion on the estimation of the state and observation variances.
In general random walk terms, the state variance can be estimated as the variance of the innovations. Modeling the stock price series as a random walk, the innovations correspond to the period returns of the price series. The state variance in this case is therefore the variance of the period returns. This calculation is rather commonplace in financial circles and is often referred to as historic volatility.
The observation variance is, however, a tricky issue. Let us assume that in addition to the closing price in a time interval, we also observe the high and low stock prices within the interval. The variance of the error in the observation must be the volatility of the stock price in the time period. This volatility is characterized by high and low values. Several methods to estimate the volatility based on the high–low prices exist, and the references are provided in the reference section. One may use any one of the methods described in those papers to estimate the observation variance. Once the state and observation variances are known, we are ready to apply the Kalman-filtering approach.
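The state variance calculation described above can be sketched as follows. This is a minimal illustration with made-up prices; using log returns (differences of log prices) is an assumption that follows from modeling the logarithm of prices as a random walk, and the function name `state_variance` is hypothetical. The observation variance is not covered here, since it depends on which high–low estimator is chosen.

```python
import math
from statistics import variance

def state_variance(prices):
    """Sample variance of period log returns, the innovations of the log-price random walk."""
    log_prices = [math.log(p) for p in prices]
    returns = [later - earlier for earlier, later in zip(log_prices, log_prices[1:])]
    return variance(returns)

closes = [100.0, 101.0, 99.5, 100.2, 100.9]   # made-up closing prices
sigma2_state = state_variance(closes)
```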
APPLICATION: EXAMPLE WITH THE STANDARD & POOR INDEX
Is it conceivable that the observation and innovation variance will be the same? Why not? After all, they are both representative of the volatility of the same underlying random walk. Let us therefore see how we fare in practice. We apply the random walk-filtering process with all its assumptions on the Standard & Poor (S&P) index. We use the closing prices of the S&P index depository receipt (spidr) with ticker SPY in our example. As discussed in the section Filtering the Random Walk, we use Equation 4.13 to determine the state of the process at a given time. The weights in Equation 4.14 are determined using Fibonacci numbers. Note that if we decide to use T + 1 observations in the state estimation process, the weight of the last data point according to Equation 4.14 is given by $1/F_{2(T+1)}$. This is the fraction of the oldest data point used in the estimate. It turns out that this value approaches zero rather quickly, and its contribution to the value of the state becomes insignificant as T increases. In order to demonstrate the point, we constructed a plot of the reciprocals of the Fibonacci series in Figure 4.3.
[FIGURE 4.3 Reciprocals of Fibonacci Numbers. Horizontal axis: Index, 0 to 20; vertical axis: 0.0 to 1.0.]
Therefore, numerically speaking, it may be sufficient to use, say, the last 10 observations in the state estimation process. Table 4.1 lists the weightings to use for the 10 observations. With the available weights and the observations, the computation process becomes a simple calculation of the weighted average of the observed values, resulting in a unique sequence of states. This set of states is optimal under the assumptions discussed. Although this is all nice, note that the typical user of moving averages probably works with them over multiple time periods. The time periods are used to modify the coarseness of the approximation, and it may be argued that looking at moving averages calculated over multiple time frames gives
TABLE 4.1 Reciprocals of Fibonacci Numbers.
Observation   1       2       3       4       5       6       7       8       9       10
Weight        0.6180  0.2360  0.0901  0.0344  0.0131  0.0050  0.0019  0.0007  0.0002  0.0001
the technical analyst additional information. The question before us now is, therefore, how can we fine-tune the coarseness of the approximations using the Kalman filter? To achieve varying levels of coarseness using the Kalman filter, we make use of an important property that relates to the sampling of a random walk sequence; that is, the random walk sequence sampled at any frequency results in a random walk sequence. To see why that is, consider a random walk sequence with observations at times 1, 2, 3, and so on. By definition, the observation at time 1 plus a value drawn from a normal distribution gives the observation at time 2, and so on. Now let us sample the random walk at half of the original observation frequency. This results in a new sequence with the values for the times 1, 3, 5, and so on. Note that the value at time 3 is given by the value at time 1 plus a drawing from a normal distribution to get it to time 2, and then again by adding another drawing from a normal distribution to get it to time 3. Thus, the transition from time 1 to time 3 is effected by summing two random drawings from independent normal distributions and, in turn, adding it to the value at time 1. But the sum of
[FIGURE 4.4 Kalman Smoothing of Random Walk. Series shown: log of S&P daily close; smoothing with two-period sampling. Horizontal axis: Days, 0 to 70; vertical axis: 4.40 to 4.60.]
two independent normal random variables may itself be treated as a drawing from a normal distribution with an adjustment to the variance of the distribution. Therefore, the transition from time 1 to time 3 is also effected by adding the value at time = 1 to a drawing from a normal distribution. This therefore fits the definition of a random walk sequence. We only need to apply this idea in an iterative fashion to see that the random walk sequence sampled at any frequency results in a random walk. Armed with this information, we can conclude that the Kalman smoothing approach may be applied to the random walk sequence sampled at multiple frequencies to achieve varying degrees of coarseness. In our example, we calculate the state values by sampling the random walk at half the frequency of the available data. Suppose we need to calculate the value of the state at time t = 22. We use the observations y22 ,y20 ,y18 , . . . . Similarly, to estimate the state value at time t = 23, we use the observation values y23 ,y21 ,y19 , . . . . The resulting smoothing on the logarithm of the S&P prices is shown in Figure 4.4.
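The smoothing with reduced sampling frequency can be sketched as a weighted average over every second (or every `step`-th) observation. This is an illustration, not the procedure used to produce Figure 4.4: the function names and prices are made up, the limiting weights $1/g, 1/g^3, 1/g^5, \ldots$ are truncated to the last ten samples, and renormalizing the truncated weights (and using fewer samples near the start of the series) is an assumption of this sketch.

```python
import math

GOLDEN = (1.0 + math.sqrt(5.0)) / 2.0

def truncated_weights(n_obs=10):
    """Limiting weights 1/g, 1/g^3, 1/g^5, ... for the newest n_obs samples."""
    return [GOLDEN ** -(2 * j + 1) for j in range(n_obs)]

def smooth(series, step=1, n_obs=10):
    """Estimate each state from y_t, y_{t-step}, y_{t-2*step}, ... (newest first)."""
    weights = truncated_weights(n_obs)
    smoothed = []
    for t in range(len(series)):
        # Samples that exist at this sampling frequency, newest first.
        picks = [series[t - j * step] for j in range(n_obs) if t - j * step >= 0]
        used = weights[:len(picks)]
        # Assumption: renormalize so the truncated weights sum to one.
        smoothed.append(sum(w * y for w, y in zip(used, picks)) / sum(used))
    return smoothed

log_prices = [4.50, 4.52, 4.49, 4.55, 4.53, 4.56, 4.58, 4.54]   # made-up values
one_period = smooth(log_prices, step=1)
two_period = smooth(log_prices, step=2)   # uses y_t, y_{t-2}, y_{t-4}, ...
```

A larger `step` draws on observations spread over a longer span, which gives a coarser approximation in the sense described above.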
SUMMARY
The Kalman filter is an optimal state estimation process applied to a dynamic system that involves random perturbations. Inherent in any discussion on the Kalman filter are the notions of state and observation. Kalman filtering may be summarized as a three-step process comprising prediction, observation, and correction, or reconciliation of the prediction with the observation. The simplest case of the Kalman filter reduces to finding the average of n numbers. The recursive least squares method is also a special case of the Kalman filter that may be applied to filtering random walks. When the state and observation variances are the same, that is, the signal-to-noise ratio is unity, then the estimation of the Kalman states for a random walk boils down to a weighted average of the observations, with the weights formed by ratios of Fibonacci numbers. The degree of smoothness to be achieved in a random walk can be controlled by varying the sampling rate of the random walk sequence.
FURTHER READING MATERIAL
Kalman Filter
Maybeck, Peter S. Stochastic Models, Estimation, and Control. (San Diego, California: Academic Press, 1979).
Harvey, A. C. Forecasting, Structural Time Series Models and the Kalman Filter. (Cambridge, UK: Cambridge University Press, 1991).
Variance Calculation
Magdon-Ismail, Malik and Amir Atiya. “A Maximum Likelihood Approach to Variance Estimation for a Brownian Motion Using the High Low and Close.” Quantitative Finance (forthcoming).
Rogers, L. and S. Satchell. “Estimating Variance from High Low and Closing Prices.” Annals of Applied Probability 1, no. 4 (1991): 504–512.
Rogers, L., S. Satchell and Y. Yoon. “Estimating the Volatility of Stock Prices: A Comparison of Methods That Use High and Low Prices.” Applied Financial Economics 4 (1994): 241–247.
Fibonacci Series
Huntley, H. E. The Divine Proportion: A Study in Mathematical Beauty. (New York: Dover Publications, 1970).
Fischer, Robert. Fibonacci Applications and Strategies for Traders. (New York: John Wiley & Sons, Inc., 1993).
APPENDIX
We describe the formulas for the Kalman-filtering steps here. The notation we use is the same as that discussed in the section on the Kalman filter.
1. Evaluate $\hat{X}_{t|t-1}$ and $\hat{P}_{t|t-1}$ using the state equation.
$$\hat{X}_{t|t-1} = A\hat{X}_{t-1|t-1}$$
$$\hat{P}_{t|t-1} = A\hat{P}_{t-1|t-1}A^T + Q$$
where $Q$ is the covariance matrix of the model error $u_t$ in Equation 4.1.
2. Find the observation $Y_t$ and $R$ by observing the system. Note that we have the matrix $H$ defined as follows:
$$Y_t = HX_t + v_t$$
3. Compute the Kalman gain $K_t$.
$$K_t = \hat{P}_{t|t-1}H^T\left(H\hat{P}_{t|t-1}H^T + R\right)^{-1}$$
4. Evaluate $\hat{X}_{t|t}$ given by $\hat{X}_{t|t-1} + K_t\left(Y_t - H\hat{X}_{t|t-1}\right)$.
5. Evaluate $\hat{P}_{t|t}$.
$$\hat{P}_{t|t} = \left(I - K_tH\right)\hat{P}_{t|t-1}\left(I - K_tH\right)^T + K_tRK_t^T$$
PART
Two Statistical Arbitrage Pairs
CHAPTER
5
Overview
HISTORY
The first practice of statistical pairs trading is attributed to Wall Street quant Nunzio Tartaglia, who was at Morgan Stanley in the mid-1980s. At the time, he assembled a group of mathematicians, physicists, and computer scientists. Their mission was to develop quantitative arbitrage strategies using state-of-the-art statistical techniques. The strategies developed by the group were automated to the point where they could generate trades in a
74
STATISTICAL ARBITRAGE PAIRS
mechanical fashion and, if needed, execute them seamlessly through automated trading systems. At that time, trading systems of this kind were considered the cutting edge of technology. One of the techniques they used for trading involved trading securities in pairs. The process involved identifying pairs of securities whose prices tended to move together. Whenever an anomaly in the relationship was noticed, the pair would be traded with the idea that the anomaly would correct itself. This came to be known on the street as “pairs trading.” Tartaglia and his group employed pairs trading with great success in 1987. The group, however, disbanded in 1989. Members of the group found themselves in various other trading firms, and knowledge of the idea of pairs trading gradually spread. Pairs trading has since increased in popularity and has become a common trading strategy used by hedge funds and institutional investors.
MOTIVATION
Let us now explain the idea behind pairs trading. The general theme for investing in the marketplace from a valuation point of view is to sell overvalued securities and buy the undervalued ones. However, it is possible to determine that a security is overvalued or undervalued only if we also know the true value of the security in absolute terms. This is very hard to do. Pairs trading attempts to resolve this using the idea of relative pricing; that is, if two securities have similar characteristics, then the prices of both securities must be more or less the same. Note that the specific price of the security is not of importance. The price may be wrong. It is only important that the prices of the two securities be the same. If the prices happen to be different, it could be that one of the securities is overpriced, the other security is underpriced, or the mispricing is a combination of both.
Pairs trading involves selling the higher-priced security and buying the lower-priced security with the idea that the mispricing will correct itself in the future. The mutual mispricing between the two securities is captured by the notion of spread. The greater the spread, the higher the magnitude of mispricing and the greater the profit potential. A long–short position in the two securities is constructed such that it has a negligible beta and therefore minimal exposure to the market. Hence, the returns from the trade are uncorrelated with market returns, a feature typical of market neutral strategies.
Based on the discussion so far, it is easy to deduce that the key to success in pairs trading lies in the identification of security pairs. In a study by Gatev et al., a purely empirical approach to achieving this end was adopted. They methodically chose pairs based entirely on the historical price movement of securities and checked to see how pairs trading would have fared in a double-blind study. Besides the set of pairs chosen using historical prices, another set of pairs was created by randomly pairing the securities with one another. The trades that would have been executed based on the empirical pairing approach were then pitted against the trades where the securities were randomly paired. The difference in the returns between the two groups was found to be statistically significant, and the return generated by the methodically paired set was better than that of the randomly paired sample set.
Unlike the purely empirical approach, the methodology that we subscribe to comprises theoretical valuation concepts that are then validated with empirical models and data. We will later show that the theoretical valuation approach helps us to easily identify pairs based on the fundamentals of the firm. It also leads naturally to the formula used to measure the spread, the degree of mispricing between the two securities.
Our theoretical explanation for the comovement of security prices stems from arbitrage pricing theory (APT). According to APT, if two securities have exactly the same risk factor exposures, then the expected return of the two securities for a given time frame is the same. The actual return may, however, differ slightly because of different specific returns for the two securities. It is important to note at this point that APT for the two securities has to be valid in all time frames. Let the prices of securities A and B at time t be $p_t^A$ and $p_t^B$, and at time t + i be $p_{t+i}^A$ and $p_{t+i}^B$, respectively. The return in the time period i for the two securities is given as $\log(p_{t+i}^A) - \log(p_t^A)$ and $\log(p_{t+i}^B) - \log(p_t^B)$.¹ Now let us say that we have the prices of both securities at the current time. The return on both securities is expected to be the same in all time frames. In other words, the increment to the logarithm of the prices at the current time must be about the same for both the securities at all time instances in the future.
This, of course, means that the time series of the logarithm of the two prices must move together, and the spread calculation formula is therefore based on the difference in the logarithm of the prices. Having explained our approach, we now need to define in precise terms what we mean when we say that the price series or the log price series of the two securities must move together. Fortunately for us, the idea of comovement of two time series has been well developed in the field of econometrics. We discuss it in the following section on cointegration.
¹ The value as calculated here is approximately equal to $(p_{t+i} - p_t)/p_t$. Thus, return can be thought of as the increment in the logarithm of the prices.
COINTEGRATION
In the introduction to time series we briefly discussed the preprocessing step for nonstationary series. The series is typically transformed into a stationary time series by differencing. By extension, when analyzing multivariate time series where each of the component series is nonstationary, it would then make sense to difference each component and then subject them to examination. However, that need not be the case. In the course of examining multivariate series to determine statistically if there is a cause–effect relationship between the variables represented by the time series, the econometricians Engle and Granger observed a rather interesting phenomenon. Even though two time series are nonstationary, it is possible that in some instances a specific linear combination of the two is actually stationary; that is, the two series move together in somewhat of a lockstep. Engle and Granger coined the term cointegration and proposed the idea in an article, the reference for which is at the end of this chapter. Notably, this was one of the ideas for which they won the Nobel Prize in economics in 2003.
Let us now state the idea of cointegration more formally. Let $y_t$ and $x_t$ be two nonstationary time series. If for a certain value $\gamma$, the series $y_t - \gamma x_t$ is stationary, then the two series are said to be cointegrated. Real-life examples of cointegration abound in economics. In fact, the first demonstrations and tests of cointegration involved economic variable pairs like consumption and income, short-term and long-term rates, the M2 money supply and GDP, and so forth.
The explanation for cointegration dynamics is captured by the notion of error correction. The idea behind error correction is that cointegrated systems have a long-run equilibrium; that is, the long-run mean of the linear combination of the two time series. If there is a deviation from the long-run mean, then one or both time series adjust themselves to restore the long-run equilibrium. The formal theorem stating that error correction and cointegration are essentially equivalent representations is called the Granger representation theorem.
We shall not attempt to discuss the proof of the theorem, but simply present here the error correction representation. Let $\{\varepsilon_{x_t}\}$ be the white noise process corresponding to the time series $\{x_t\}$. Let $\{\varepsilon_{y_t}\}$ be the white noise process corresponding to the time series $\{y_t\}$. The error correction representation is
$$y_t - y_{t-1} = \alpha_y (y_{t-1} - \gamma x_{t-1}) + \varepsilon_{y_t}$$
$$x_t - x_{t-1} = \alpha_x (y_{t-1} - \gamma x_{t-1}) + \varepsilon_{x_t} \tag{5.1}$$
Let us interpret Equations 5.1. The left-hand side is the increment to the time series at each time step. The right-hand side is the sum of two expressions, the error correction part and the white noise part. Let us look at the error correction part $\alpha_y (y_{t-1} - \gamma x_{t-1})$ from the first equation. The term $y_{t-1} - \gamma x_{t-1}$ is representative of the deviation from the long-run equilibrium (the equilibrium value is zero in this case), and $\gamma$ is the coefficient of cointegration. $\alpha_y$ is the error correction rate, indicative of the speed with which the time series corrects itself to maintain equilibrium. Thus, as the two series evolve with time, deviations from the long-run equilibrium are caused by white noise, and these deviations are subsequently corrected in future time steps.
We will now illustrate that the idea of error correction does indeed lead to a stationary time series for the spread. Two independent white noise series with zero mean and unit standard deviation were generated to represent $\{\varepsilon_{y_t}\}$ and $\{\varepsilon_{x_t}\}$, respectively. The other values were set as $\alpha_y = -0.2$, $\alpha_x = 0.2$, and $\gamma = 1.0$. Note that it is important to have the two coefficients $\alpha_y$ and $\alpha_x$ set to opposite signs for error-correcting behavior. With these values, subtracting $\gamma$ times the second of Equations 5.1 from the first shows that the spread $y_t - \gamma x_t$ equals $(1 + \alpha_y - \gamma \alpha_x) = 0.6$ times its previous value plus white noise, so deviations decay rather than accumulate. The values for the two time series $x_t$ and $y_t$ were then generated using the simulated data and the equations from the error correction representation. A plot of the two series is shown in Figure 5.1. Subsequently, the spread at each time instance was calculated using the known value for $\gamma$. A plot of the spread series and its autocorrelation is shown in Figures 5.2a and b. It is easy to appreciate from the autocorrelation function that the spread series is indeed stationary.
FIGURE 5.1 Cointegrated Time Series. (Plot of Time Series 1 and Time Series 2 against time.)
FIGURE 5.2A Spread. (Plot of the spread against time.)
FIGURE 5.2B Spread ACF. (Autocorrelation of the spread against lag.)
A more direct approach to model cointegration is attributed to Stock and Watson, called the common trends model. The primary idea of the common trends model is that of a time series being expressed as a simple sum of two component time series: a stationary component and a nonstationary component. If two series are cointegrated, then the cointegrating linear composition acts to nullify the nonstationary components, leaving only the stationary components. To see what we mean, consider two time series
$$y_t = n_{y_t} + \varepsilon_{y_t}$$
$$z_t = n_{z_t} + \varepsilon_{z_t} \tag{5.2}$$
where $n_{y_t}$ and $n_{z_t}$ are the random walk (nonstationary) components of the two time series, and $\varepsilon_{y_t}$ and $\varepsilon_{z_t}$ are the stationary components of the time series. Also, let the linear combination $y_t - \gamma z_t$ be the cointegrating combination that results in a stationary time series. Expanding the linear combination and rearranging some terms, we have
$$y_t - \gamma z_t = (n_{y_t} - \gamma n_{z_t}) + (\varepsilon_{y_t} - \gamma \varepsilon_{z_t}) \tag{5.3}$$
If the combination in Equation 5.3 must be stationary, the nonstationary component must be zero, implying that $n_{y_t} = \gamma n_{z_t}$, or the trend component of one series must be a scalar multiple of the trend component in the other series. Therefore, for two series to be cointegrated, the trends must be identical up to a scalar. In later chapters, we will rely on the Stock-Watson model to establish links between arbitrage pricing theory and cointegration.
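The simulation experiment described in this section (Equations 5.1 with error correction rates of –0.2 and 0.2 and a cointegration coefficient of 1.0) can be reproduced along the following lines. This is a sketch under our own assumptions: the series length, the random seed, the zero starting values, and the helper names simulate_ecm and autocorr are illustrative choices that do not appear in the text.

```python
import numpy as np

def simulate_ecm(n=1000, alpha_y=-0.2, alpha_x=0.2, gamma=1.0, seed=0):
    """Generate two series from the error correction representation (5.1)."""
    rng = np.random.default_rng(seed)
    eps_y = rng.standard_normal(n)  # white noise, zero mean, unit std
    eps_x = rng.standard_normal(n)
    y = np.zeros(n)
    x = np.zeros(n)
    for t in range(1, n):
        deviation = y[t - 1] - gamma * x[t - 1]  # distance from equilibrium
        y[t] = y[t - 1] + alpha_y * deviation + eps_y[t]
        x[t] = x[t - 1] + alpha_x * deviation + eps_x[t]
    return y, x

def autocorr(series, lag):
    """Sample autocorrelation at a positive lag."""
    s = series - series.mean()
    return float(np.dot(s[:-lag], s[lag:]) / np.dot(s, s))

y, x = simulate_ecm()
spread = y - 1.0 * x  # spread computed with the known gamma

lags = (1, 5, 10)
acf_spread = [autocorr(spread, k) for k in lags]
acf_y = [autocorr(y, k) for k in lags]
print("spread ACF:", acf_spread)
print("y ACF:     ", acf_y)
```

The autocorrelation of the spread should die out within a few lags, while that of either individual series stays high, which is the contrast that the spread ACF plot is meant to convey.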
HIGHLIGHTING THE POINT
This is an anecdote about an ingenious little kid. He was asked on a test to say a few sentences about a cow. The poor lad knew only how to say a few sentences about a tree. Thinking for a moment, the kid in his first sentence tied the cow to a tree and then went on to talk about the tree.
The example is probably a little tongue in cheek. It is, however, true that there is a strong urge to relate the unknown to something familiar and enhance understanding through association. To further underscore the point, it is worth mentioning that the spirit of that approach actually forms the basis for formal proof techniques in mathematics. Proof by induction relies on forming a chain of logical relationships that ties the general case back to a trivial base case. Proofs of NP-completeness, used to classify problems in the area of computational complexity theory (an approach popularized by Richard Karp), also rely on transformations involving a known problem. It is probably safe to say that in almost all fields of human endeavor there is a common tendency to relate the intractable to something manageable and leverage existing knowledge to arrive at meaningful conclusions.
We are no different from everyone else in this respect. When faced with the prospect of having to work with nonstationary time series, we immediately look for ways to construct portfolios that can be related to stationary time series. The transformation to stationarity is typically achieved using cointegration ideas and strict parity relationships. Needless to say, this approach appears as a recurrent theme in the design of trading strategies across all asset classes.
APPLYING THE MODEL
In this section, we fit the cointegration model to the logarithm of stock prices. For the cointegration model to apply, we would require the logarithm of stock prices to be a nonstationary series. The assumption that the logarithm of stock prices is a random walk (read as nonstationary) is a rather standard one. It has been used fairly extensively in option-pricing models with satisfactory results. We are therefore good on that assumption and ready to proceed further. Let us say that two stocks A and B are cointegrated with the nonstationary time series corresponding to them being $\{\log(p_t^A)\}$ and $\{\log(p_t^B)\}$, respectively. Applying the error correction representation described earlier, we have
$$\log(p_t^A) - \log(p_{t-1}^A) = \alpha_A \left[ \log(p_{t-1}^A) - \gamma \log(p_{t-1}^B) \right] + \varepsilon_A$$
$$\log(p_t^B) - \log(p_{t-1}^B) = \alpha_B \left[ \log(p_{t-1}^A) - \gamma \log(p_{t-1}^B) \right] + \varepsilon_B \tag{5.4}$$
The parameters that uniquely determine the model are the cointegration coefficient $\gamma$ and the two error correction constants $\alpha_A$ and $\alpha_B$. Therefore, estimating the model involves determining the appropriate values for $\alpha_A$, $\alpha_B$, and $\gamma$. The left-hand side of Equations 5.4 is the return of the stocks in the current time period. On the right-hand side, note the expression for the long-run equilibrium, $\log(p_{t-1}^A) - \gamma \log(p_{t-1}^B)$, in both the equations. In words, it is the scaled difference of the logarithm of price. Incidentally, this coincides with what we termed the spread in our earlier discussion.
Also notice that the subscript for stock prices in the expression for the long-run equilibrium is t – 1. The past deviation from equilibrium plays a role in deciding the next point in the time series. Therefore, knowledge of the past realizations may be used to give us an edge in predicting the increments to the logarithm of prices; that is, returns. This is important and exciting. Even though both stocks follow a log-normal process, one can eke out some predictability in their returns based on past realizations. Thus, one can attempt to trade either of the stocks in the pair based on predictions using the estimated values from the error correction representation.
Let us now focus on the cointegration part of the representation theorem. This is the assertion that the time series of the long-run equilibrium (also termed spread in our case) is stationary and mean reverting. Now, we definitely know a lot about predicting stationary time series. If only we could associate the time series of the spread to a portfolio, we could consider trading the portfolio based on our prediction of the time series values. We proceed to do exactly that. Consider a portfolio with long one share of A and short $\gamma$ shares of B. The return of the portfolio for a given time period is given as
$$\left[ \log(p_{t+i}^A) - \log(p_t^A) \right] - \gamma \left[ \log(p_{t+i}^B) - \log(p_t^B) \right] \tag{5.5}$$
COINTEGRATION AND TRACKING ERROR
Tracking error is an idea that arises naturally in the context of an investment technique called indexing. The basic premise for indexing is the notion that it is extremely hard to time the market. Therefore, the strategy adopted is to passively invest in the market index. Alternately, to reduce trading costs, the investment is made in portfolios designed to mimic index returns. These portfolios are sometimes referred to as tracking baskets. The ability of the tracking basket to mimic the market returns is characterized by its tracking error. The tracking error may therefore be thought of as a measure of discrepancy or margin of error that one can expect in the tracking process.
Implicit in the definition of tracking error is the time interval for which the returns are measured. It is typically one year (assumed to have 252 trading days). If the holding period is different from one year, then the tracking error needs to be scaled accordingly. The convention is to assume that the tracking error is a random walk series and scale the tracking error using the formula below. If T is the holding period for the portfolio (in trading days), then
$$\mathrm{err}(T) = \mathrm{err}(1\ \mathrm{year}) \sqrt{\frac{T}{252}} \tag{5.6}$$
The tricky part here is in the random walk assumption for tracking error in the process of scaling. In the case of long–short portfolios consisting of cointegrated stocks, the tracking error is not a random walk series. As a matter of fact, it is a stationary time series. The variance or standard deviation in this case is independent of the holding period. In other words, the tracking error remains the same regardless of the holding period, and scaling formulas are not required. We therefore caution against making the random walk assumption blindly, particularly in indexing situations, because it tends to distort the tracking error.
Rearranging the terms of Equation 5.5 a little bit, we have
$$\left[ \log(p_{t+i}^A) - \gamma \log(p_{t+i}^B) \right] - \left[ \log(p_t^A) - \gamma \log(p_t^B) \right] = \mathrm{spread}_{t+i} - \mathrm{spread}_t \tag{5.7}$$
Therefore, the return on the portfolio is the increment to the spread value in the time period i. We have successfully associated a portfolio with a stationary time series. The one thing that remains is providing an interpretation for $\gamma$, the cointegration coefficient. We will discuss this in later chapters.
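To make the estimation step concrete, the sketch below fits the two error correction constants of Equations 5.4 by regressing each stock's log return on the lagged spread, and then checks the identity in Equation 5.7 numerically. Several things here are our own assumptions rather than the procedure of the text: the cointegration coefficient is taken as known, the regression has no intercept because the equilibrium value is zero, and the data come from a hypothetical generator, simulate_log_prices, with made-up parameter values.

```python
import numpy as np

def simulate_log_prices(n, alpha_a, alpha_b, gamma, sigma=0.01, seed=0):
    """Hypothetical generator: two log price series obeying Equations 5.4."""
    rng = np.random.default_rng(seed)
    log_a = np.full(n, 3.0)
    log_b = np.full(n, 3.0)
    for t in range(1, n):
        spread_prev = log_a[t - 1] - gamma * log_b[t - 1]
        log_a[t] = log_a[t - 1] + alpha_a * spread_prev + sigma * rng.standard_normal()
        log_b[t] = log_b[t - 1] + alpha_b * spread_prev + sigma * rng.standard_normal()
    return log_a, log_b

def estimate_error_correction(log_a, log_b, gamma):
    """Least-squares slopes of each return on the lagged spread (no intercept)."""
    spread = log_a - gamma * log_b
    lagged = spread[:-1]
    ret_a = np.diff(log_a)
    ret_b = np.diff(log_b)
    denom = lagged @ lagged
    return float(lagged @ ret_a / denom), float(lagged @ ret_b / denom)

gamma = 1.0
log_a, log_b = simulate_log_prices(5000, alpha_a=-0.2, alpha_b=0.2, gamma=gamma)
alpha_a_hat, alpha_b_hat = estimate_error_correction(log_a, log_b, gamma)
print(alpha_a_hat, alpha_b_hat)

# Equation 5.7: the return on (long 1 A, short gamma B) over i steps
# equals the increment to the spread over the same interval.
spread = log_a - gamma * log_b
t, i = 100, 10
portfolio_return = (log_a[t + i] - log_a[t]) - gamma * (log_b[t + i] - log_b[t])
print(portfolio_return, spread[t + i] - spread[t])
```

The two printed returns agree because Equation 5.7 is an algebraic identity; the estimated constants only approximate the values used to generate the data.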
A TRADING STRATEGY
We now construct a simple trading strategy. The idea is to trade on the oscillations about the equilibrium value for the spread. We could put on a trade upon deviation from the equilibrium value and unwind the trade when equilibrium is restored. However, note that the equilibrium value is also the mean value of the time series. Therefore, given that the spread swings equally in both directions about the equilibrium value, we could potentially unwind the trade when the spread deviates in the other direction. This would reduce the trading frequency on average by a factor of two. Given that stocks have a bid-ask spread, we would incur a trading slippage every time a trade is executed. Reducing the trading frequency reduces the effect of this slippage.
Let us therefore consider the strategy where the trades are put on and unwound on a deviation of $\Delta$ in either direction from the long-run equilibrium $\mu$. We buy the portfolio (long A and short B) when the time series is $\Delta$ below the mean and sell the portfolio (sell A and buy B) when the time series is $\Delta$ above the mean in i time steps.
$$\log(p_t^A) - \gamma \log(p_t^B) = \mu - \Delta$$
$$\log(p_{t+i}^A) - \gamma \log(p_{t+i}^B) = \mu + \Delta \tag{5.8}$$
The profit on the trade is the incremental change in the spread, $2\Delta$.
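The trading rule can be expressed as a small position-generating routine. The sketch below is one possible reading of the rule and not a definitive implementation: the function name band_positions, the +1/–1/0 encoding, and the sample spread values are our own assumptions. A position of +1 means long the portfolio (long A, short B in the cointegration ratio), –1 means the reverse, and a signal in the opposite direction both unwinds the existing trade and puts on the new one.

```python
import numpy as np

def band_positions(spread, mu, delta):
    """Position held after each observation of the spread (hypothetical helper)."""
    positions = np.zeros(len(spread), dtype=int)
    current = 0  # flat until the first signal
    for t, s in enumerate(spread):
        if s <= mu - delta:
            current = 1    # spread is delta below the mean: buy the portfolio
        elif s >= mu + delta:
            current = -1   # spread is delta above the mean: sell the portfolio
        positions[t] = current  # otherwise keep the existing position
    return positions

# Illustrative spread path (made-up numbers) around an equilibrium of zero.
spread = np.array([0.00, -0.05, -0.01, 0.02, 0.05, 0.01, -0.06])
pos = band_positions(spread, mu=0.0, delta=0.045)
print(pos.tolist())
```

Each flip from +1 to –1 (or back) corresponds to one completed swing of roughly twice the band width in the spread, before slippage.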
Example
The Data
Consider two stocks A and B that are cointegrated with the following data:
Cointegration ratio = 1.5
Delta used for trade signal = 0.045
Bid price of A at time t = $19.50
Ask price of B at time t = $7.46
Ask price of A at time t + i = $20.10
Bid price of B at time t + i = $7.17
Average bid-ask spread for A = 0.0005 (5 basis points)
Average bid-ask spread for B = 0.0010 (10 basis points)
The Strategy
We first examine if trading is feasible given the average bid-ask spreads.
Average trading slippage = 0.0005 + 1.5 × 0.0010 = 0.002 (20 basis points)
This is smaller than the delta value of 0.045. Trading is therefore feasible.
At time t, buy shares of A and short shares of B in the ratio 1:1.5.
Spread at time t = log(19.50) – 1.5 × log(7.46) ≈ –0.044, close to –0.045 (natural logarithms throughout)
At time t + i, sell the shares of A and buy back the shares of B.
Spread at time t + i = log(20.10) – 1.5 × log(7.17) ≈ 0.046, close to 0.045
Total return = return on A + 1.5 × return on the short position in B = [log(20.10) – log(19.50)] + 1.5 × [log(7.46) – log(7.17)] ≈ 0.03 + 1.5 × 0.04 = 0.09 (9 percent)
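The arithmetic of the example can be checked with a few lines of code. This is a sketch that assumes natural logarithms and uses variable names of our own choosing; the prices and spreads are those listed in the data above.

```python
import math

gamma = 1.5              # cointegration ratio
delta = 0.045            # band used for the trade signal
bid_a_t, ask_b_t = 19.50, 7.46      # prices at time t
ask_a_ti, bid_b_ti = 20.10, 7.17    # prices at time t + i
bid_ask_a, bid_ask_b = 0.0005, 0.0010

# Feasibility: slippage on one unit of A plus gamma units of B.
slippage = bid_ask_a + gamma * bid_ask_b
feasible = slippage < delta

# Spread at entry and at exit.
spread_t = math.log(bid_a_t) - gamma * math.log(ask_b_t)
spread_ti = math.log(ask_a_ti) - gamma * math.log(bid_b_ti)

# Return on the long leg plus gamma times the return on the short leg.
return_a = math.log(ask_a_ti) - math.log(bid_a_t)
return_short_b = math.log(ask_b_t) - math.log(bid_b_ti)
total_return = return_a + gamma * return_short_b

print(f"slippage      = {slippage:.4f} (feasible: {feasible})")
print(f"spread at t   = {spread_t:.4f}")
print(f"spread at t+i = {spread_ti:.4f}")
print(f"total return  = {total_return:.4f}")
```

The total return equals the change in the spread between the two dates, as Equation 5.7 requires.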
ROAD MAP FOR STRATEGY DESIGN
The discussion so far briefly outlines how we might trade once we know two stocks are cointegrated. We do concede that the course of the discussion so far has brought up more questions on the details. How do we identify candidate stock pairs? Can we verify that they are indeed cointegrated? How do we determine the cointegration coefficient? What is the most appropriate value for delta? We explore the questions and issues involved in the subsequent chapters. To that end, we provide a road map for the design and analysis of the pairs trading strategy. The steps involved are as follows:
1. Identify stock pairs that could potentially be cointegrated. This process can be based on the stock fundamentals or alternately on a pure statistical approach based on historical data. Our preferred approach is to make the stock pair guesses using fundamental information.
2. Once the potential pairs are identified, we verify the proposed hypothesis that the stock pairs are indeed cointegrated based on statistical evidence from historical data. This involves determining the cointegration coefficient and examining the spread time series to ensure that it is stationary and mean reverting.
3. We then examine the cointegrated pairs to determine the delta. A feasible delta that can be traded on will be substantially greater than the slippage encountered due to the bid-ask spreads in the stocks. We also indicate methods to compute holding periods.
SUMMARY
Statistical pairs trading is a relative value arbitrage on two securities and is based on the premise that there is a long-run equilibrium between the prices of the stocks composing the pair. The degree of deviation from the long-run equilibrium is called the spread and represents the extent of mutual mispricing. Any deviation from the long-run equilibrium is compensated for in subsequent movements of the time series. Pairs trading involves trading on the oscillations about the equilibrium value. The econometric paradigm of cointegration and error correction is central to the analysis of the pairs-trading strategy.
FURTHER READING MATERIAL
Pairs Trading
Gatev, Evan G., William N. Goetzmann, and K. Geert Rouwenhorst. Pairs Trading: Performance of a Relative Value Arbitrage Rule. NBER Working Papers 7032, National Bureau of Economic Research Inc., 1999.
Cointegration
Engle, Robert F. and C. W. J. Granger. “Co-integration and Error Correction: Representation, Estimation and Testing.” Econometrica 55, no. 2 (March 1987): 251–276.
Stock, James H. and Mark W. Watson. “Testing for Common Trends.” Journal of the American Statistical Association 83, no. 404 (December 1988): 1097–1107.
Enders, Walter. Applied Econometric Time Series. (New York: John Wiley & Sons, Inc., 1995).
CHAPTER 6 Pairs Selection in Equity Markets
INTRODUCTION
In Chapter 5, we explained that strategy design was essentially a three-step process. The three steps are identification of stock pairs, cointegration testing, and trading rule formulation. In this chapter, we focus on the first of the three steps, the process of identifying potential stock pairs. In this step we essentially short-list the pairs for cointegration testing and further analysis. Why do we need to do that? We could just as well work with a candidate list of all possible pairs, run cointegration tests on all of them, and eliminate pairs that fail the tests. This certainly seems reasonable, except for the fact that in a universe of 5,000 stocks, we have a total of about 12 million pairs.
Running tests on 12 million pairs is definitely not a viable option. Therefore, the natural inclination is to try to reduce the number. The most expedient way (not necessarily the best) to accomplish that is by using heuristics, or rules of thumb. In the heuristics approach, the list of pairs is explicitly partitioned into two sets: potentially cointegrated and not potentially cointegrated. This partitioning is accomplished by applying a set of rules. The rules are designed to exclude pairs with a slim chance of being cointegrated. This limits the number of pairs in the candidate list and reduces the number of cointegration tests that need to be performed.
Although this approach seems reasonable, it is best characterized as being ad hoc. Different people have different belief systems about the market. Therefore, reasonable people can come up with dramatically different rule sets based on their personal experiences. This makes the rule sets anecdotal in nature. Furthermore, it is possible for individuals to hold opposing views. An aggregation of rules representing the beliefs of multiple individuals may end up being inconsistent. This is now a case where the whole is less than the sum of its parts and results in missed opportunities. It might therefore be useful to put some thought into this to come up with a more definitive methodology.
The methodology we prescribe here is distinctly different from the rules of thumb or heuristics approach. Instead of attempting to evaluate explicit partitions, this approach aims to arrive at a relative ordering of the pairs based on the degree of comovement. Each pair is associated with a score/distance measure. The higher the score, the greater the degree of comovement, and vice versa. Notably, such a structure lends itself to deductive reasoning. If we find that a pair is unsuitable for pairs trading, then we have good reason to believe that every pair with a score/distance measure worse than the current pair is also unsuitable. The pair selection process now becomes equivalent to choosing a suitable threshold value for the distance measure. Notice that this approach relies solely on the distance measure for ordering the pairs. Therefore, a proper choice of the distance measure is key to the pairs selection process.
Let us quickly examine the properties that would be desirable of the score/distance measure. First, if the evaluation of the score/distance measure took as much effort as cointegration testing, then it would defeat the purpose. We may as well test for cointegration directly with all the pairs in the exhaustive list. Therefore, at the very least, the evaluation of the distance measure must be relatively easy and straightforward. Additionally, it is desirable that the distance measure not be completely empirical. Empirical deductions rely solely on historical data. This comes with an underlying assumption that the fundamentals of the firm are essentially static. There are no changes that impact the valuation of the firm. Needless to say, this need not be true. Ideally, we would prefer to tie the evaluation of the distance measure to the fundamentals of the firm. Accomplishing this would help provide a theoretical justification for the use of the score/distance measure and therefore a sound economic rationale for our expectation of cointegrated behavior.
In this chapter, we focus on defining, justifying, and interpreting the score/distance measure. It is therefore a bit theoretical in nature. To help motivate the choice of the score/distance measure, the nexus between cointegration and arbitrage pricing theory (APT) is explored. We draw parallels between the common trends model for cointegration and the ideas of APT. Conditions to be satisfied for cointegration in the common trends model are translated into APT constructs. This makes it possible to evaluate the score/distance measure in the APT framework. Moreover, if the multifactor implementation of APT is composed of fundamental variables, then the distance measure calculated relates to the fundamentals of the firm. We will then have an easy evaluation procedure for the distance measure that is firmly based on the fundamentals of the firm, thus satisfying the requirements of the score/distance measure.
In addition, we bridge the gap between theoretical expectations and practical observations. Assumptions, caveats, and loopholes that are encountered in the process are stated explicitly. Insights gained in the exercise will help us understand what can go wrong and the things to beware of while trading. Sources of risk are identified and quantified with a view to enhance the understanding of the dynamics involved in pairs trading and the kinds of pairs to avoid. Let us start with the common trends model.
COMMON TRENDS COINTEGRATION MODEL
Let us recall the common trends model from Chapter 5. We are given two series $y_t$ and $z_t$, as shown in Equation 6.1.
$$y_t = n_{y_t} + \varepsilon_{y_t}$$
$$z_t = n_{z_t} + \varepsilon_{z_t} \tag{6.1}$$
$n_{y_t}$ and $n_{z_t}$ represent the so-called common trends or random walk components of the two time series; $\varepsilon_{y_t}$ and $\varepsilon_{z_t}$ are the stationary and specific components of the time series. If the two series are cointegrated, then their common trends must be identical up to a scalar,
$$n_{y_t} = \gamma n_{z_t} \tag{6.2}$$
where $\gamma$ is the cointegration coefficient. Let us examine some of the implications of the common trends model.
Inference 1: In a cointegrated system with two time series, the innovation sequences derived from the common trend components must be perfectly correlated. (The correlation value must be +1 or –1.)
Let us denote the innovation sequences derived from the common trends of the two series as $r_{y_t}$ and $r_{z_t}$. Recall from Chapter 2 that the innovation sequence for a random walk is obtained by differencing it. The innovation sequences shown here are therefore the result of differencing $n_{y_t}$ and $n_{z_t}$, respectively. In equation form, we have
$$n_{y_{t+1}} - n_{y_t} = r_{y_{t+1}}$$
$$n_{z_{t+1}} - n_{z_t} = r_{z_{t+1}} \tag{6.3}$$
According to the common trends model, the common trends must be identical up to a scalar.
$$n_{y_t} = \gamma n_{z_t} \tag{6.4}$$
Now, if we require
$$n_{y_{t+1}} = \gamma n_{z_{t+1}} \tag{6.5}$$
it follows from simple algebra on Equations 6.4 and 6.5 that
$$r_{y_{t+1}} = \gamma r_{z_{t+1}} \tag{6.6}$$
This means that the innovations derived from the common trends must also be identical up to a scalar. Incidentally, the scalar also happens to be the cointegration coefficient $\gamma$. Now, if two variables are identical up to a scalar (in this case the cointegration coefficient), they must be perfectly correlated. If the cointegration coefficient is positive, then the correlation value is +1. If it is negative, then the correlation value is –1. Thus, in a cointegrated system the innovation sequences derived from the common trends must be perfectly correlated.
Inference 2: The cointegration coefficient may be obtained by a regression of the innovation sequences of the common trends against each other.
Based on the preceding discussion we have a linear relationship between the innovation sequences given as
$$r_{y_t} = \gamma r_{z_t} \tag{6.7}$$
It therefore follows that the cointegration coefficient may be obtained by performing a simple regression of one innovation sequence against the other. The cointegration coefficient is therefore given as
$$\gamma = \frac{\mathrm{cov}(r_y, r_z)}{\mathrm{var}(r_z)} \tag{6.8}$$
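The two inferences can be verified numerically on synthetic data. In the sketch below, the common trend of one series is built as an exact scalar multiple of the other, the innovations are recovered by differencing, and the regression slope of Equation 6.8 returns the scalar. The value 1.5 for the cointegration coefficient, the sample size, and the seed are arbitrary assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
gamma_true = 1.5

# Innovations of the common trend of z, and those of y as a scalar multiple.
r_z = rng.standard_normal(500)
r_y = gamma_true * r_z

# Random walk (common trend) components are the cumulative sums.
n_z = np.cumsum(r_z)
n_y = np.cumsum(r_y)

# Recover the innovation sequences by differencing the trends (Equation 6.3).
r_z_hat = np.diff(n_z)
r_y_hat = np.diff(n_y)

# Inference 2: regression slope cov(r_y, r_z) / var(r_z) (Equation 6.8).
gamma_hat = np.cov(r_y_hat, r_z_hat)[0, 1] / np.var(r_z_hat, ddof=1)

# Inference 1: the innovations are perfectly correlated.
corr = np.corrcoef(r_y_hat, r_z_hat)[0, 1]

print(gamma_hat, corr)
```

Because the relationship between the trend innovations is exact, the slope recovers the scalar up to floating-point error; with real data the trend components are not observed directly, which is the issue taken up in the rest of the chapter.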
Discussion In this section we looked at conditions necessary for cointegration. We established that the innovation sequences derived from the common trend terms must be perfectly correlated. We urge the reader to take special note of this fact. The idea resonates throughout the chapter and forms the basis for the scoring function. A subtle point to highlight at this juncture is that there are in effect two correlation measures. One correlation measure pertains to the innovation sequences of the common trend alone. This correlation value is either +1 or –1. Another correlation may be calculated on the innovation sequences of the whole series, taking into account both the common trend and the stationary components. This measure can take on a whole range of values depending on the variance of the stationary components. The two correlation measures are very different, and it important not to confuse one for the other. Our next point pertains to the stationarity of the common trend and specific components of the two time series. As far as common trends go, there is no restriction on them. They can be stationary or nonstationary, and this is neither critical nor material for cointegration. The same cannot be said for the specific components. The definition of the common trends model relies on their being stationary time series. Therefore, for cointegration to exist, it is absolutely necessary for the specific components to be stationary. By implication, the first difference of the specific component must not be white noise, because if the differenced series were white noise, then the specific series would be a random walk, a nonstationary series. This violates the stationarity condition for the specific component. Therefore, the first difference of the specific component cannot be white noise. In summary, there are two conditions that need to be satisfied for cointegration in a common trends model. 
First, the innovation sequences derived from the common trends of the two series must be identical up to a scalar. Next, the specific components of the two series must be stationary. Having now established an understanding of the common trends model, we turn to the issue of applying this to stock data. Application is possible only if we separate the time series into nonstationary/common trends and stationary/specific components. To do that, we look to arbitrage pricing
theory (APT) and establish a link between APT and the common trends model. This will be the focus of the discussion in the following section, on common trends.
COMMON TRENDS MODEL AND APT
Earlier, in Chapter 2, the logarithm of stock price was modeled as a random walk. To accommodate the common trends model, let us modify that a little and consider the logarithm of the stock price to be the sum of a random walk and a stationary series:
$$\log(\text{price}_t) = n_t + \varepsilon_t \tag{6.9}$$
where $n_t$ is the random walk, and $\varepsilon_t$ is the stationary component. Differencing the logarithm of stock price yields the sequence of returns. Therefore, based on Equation 6.9, the return $r_t$ at time $t$ may also be separated into two parts:
$$\log(\text{price}_t) - \log(\text{price}_{t-1}) = (n_t - n_{t-1}) + (\varepsilon_t - \varepsilon_{t-1}) \tag{6.10}$$
$$r_t = r_t^{c} + r_t^{s} \tag{6.11}$$
where $r_t^{c}$ is the return due to the nonstationary trend component, and $r_t^{s}$ is the return due to the stationary component. Notice that the return due to the trend component is the same as the innovation derived from the trend component. Therefore, the cointegration criterion pertaining to the innovations of the common trend may be rephrased as follows: If two stocks are cointegrated, the returns from their common trends must be identical up to a scalar.
But why in the world should stocks ever have common returns? Is there a financial rationale for these to exist among stocks? The answer to that is a resounding yes, and APT comes in handy in providing a comprehensive explanation for this. Recalling the earlier discussions on APT, stock returns may be separated into common factor returns (returns based on the exposure of stocks to different risk factors) and specific returns (returns specific to the stock). If two stocks share the same risk factor exposure profile, then the common factor returns for both the stocks must be the same. This provides us with a rationale for when we might expect stocks to have a common return component.
We are now ready to draw parallels between the common trends model and APT. According to APT, stock returns for a time period may be separated into two types: common factor returns and specific returns. Let these correspond to the common trend innovation and the first difference of the
specific component in the common trends model. For the correspondence to be valid, the integration of the specific returns must be a stationary time series. Alternately, as discussed in the previous section, the specific returns must not be white noise. But APT is a single time frame model and cannot provide us with any guarantees on the time series of specific returns. We would therefore have to make that leap of faith and assume that the specific return series is not white noise. Nevertheless, it is reassuring to know that the validity of this assumption is tested when running the cointegration tests, and pairs where the specific component is nonstationary are eliminated.
Let us go ahead and make the assumption that the specific return is not white noise. We can now interpret the inferences from the common trends model in APT terms. The correlation of the innovation sequence is the common factor correlation. Also, the cointegration coefficient may be calculated using APT constructs to evaluate the common factor variance and covariance in its formula. Let us now formally put all of the preceding discussions into an observation.
Observation 1: A pair of stocks with the same risk factor exposure profile satisfies the necessary conditions for cointegration.
Condition 1
Now let us consider two stocks A and B with risk factor exposure vectors $\gamma x$ and $x$, respectively. The factor exposure vectors in this case are identical up to a scalar. We denote the factor exposures as
Stock A: $\gamma x = (\gamma x_1, \gamma x_2, \gamma x_3, \ldots, \gamma x_n)$
Stock B: $x = (x_1, x_2, x_3, \ldots, x_n)$
Geometrically, it may be interpreted that the factor exposure vectors of the two stocks point in the same direction; that is, the angle between them is zero. If $b = (b_1, b_2, b_3, \ldots, b_n)$ is the factor returns vector, and $r_A^{spec}$ and $r_B^{spec}$ are the specific returns for stocks A and B, then the returns for the stocks $r_A$ and $r_B$ are given as
$$r_A = \gamma\,(x_1 b_1 + x_2 b_2 + x_3 b_3 + \cdots + x_n b_n) + r_A^{spec}$$
$$r_B = (x_1 b_1 + x_2 b_2 + x_3 b_3 + \cdots + x_n b_n) + r_B^{spec}$$
The common factor returns for the stocks are therefore
$$r_A^{cf} = \gamma\,(x_1 b_1 + x_2 b_2 + x_3 b_3 + \cdots + x_n b_n)$$
$$r_B^{cf} = x_1 b_1 + x_2 b_2 + x_3 b_3 + \cdots + x_n b_n$$
Thus, $r_A^{cf} = \gamma\, r_B^{cf}$. The innovation sequences of the common trend are identical up to a scalar. This satisfies the first condition for cointegration as per the common trends model.
Condition 2
Now consider the linear combination of the returns.
$$r_A - \gamma r_B = \left(r_A^{cf} - \gamma r_B^{cf}\right) + \left(r_A^{spec} - \gamma r_B^{spec}\right)$$
The left-hand side of the equation represents the return of a portfolio long one share of A and short $\gamma$ shares of B. The right-hand side shows that this return is separable into common factor returns and specific returns of the portfolio. Therefore
$$r_{port} = r_{port}^{cf} + r_{port}^{spec}$$
where
$$r_{port}^{cf} = r_A^{cf} - \gamma r_B^{cf}$$
$$r_{port}^{spec} = r_A^{spec} - \gamma r_B^{spec}$$
Notice that if the stocks A and B are cointegrated, then the common factor return $r_{port}^{cf}$ becomes zero. Additionally, the return on the long–short portfolio may also be viewed as the output from differencing the spread time series. Based on the separation of the return series, the spread series may also be represented as the sum of two components. One is the integrated common factor return, which we shall call the common factor spread. The other is the integrated specific return, which we shall call the specific spread. Writing this in equation form, we have
$$\text{spread}_t^{cf} - \text{spread}_{t-1}^{cf} = r_{port}^{cf}$$
$$\text{spread}_t^{spec} - \text{spread}_{t-1}^{spec} = r_{port}^{spec}$$
$$\text{spread}_t^{port} = \text{spread}_t^{cf} + \text{spread}_t^{spec}$$
Again notice that if the stocks are cointegrated, then $\text{spread}_t^{cf}$ becomes zero, because $r_{port}^{cf}$ is zero. Therefore, when the stocks are cointegrated, the spread of the portfolio is the same as the spread due to the specific components. Additionally, the spread series must be stationary for cointegration. This will be true if the specific spread is stationary, and the specific spread will be stationary if the integration of the specific returns of the individual stocks is stationary. This is indeed the assumption that we make when relating APT with cointegration. Therefore, if this assumption bears out, we will have satisfied all the necessary conditions for cointegration.
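The two conditions can be illustrated with a small simulation. This is a sketch under stated assumptions: the exposure vector `x`, the scalar `gamma`, the factor return volatility, and the AR(1) model used for the stationary specific component are all hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
T, n_factors = 1000, 3
gamma = 0.8                                   # hypothetical scalar
x = np.array([1.0, 0.5, -0.3])                # hypothetical exposures of stock B
b = rng.normal(0.0, 0.01, (T, n_factors))     # factor returns, one row per period

def specific_returns(phi, sigma, size, rng):
    """First differences of a stationary AR(1) level, so that the
    integration (cumulative sum) of the returns is stationary."""
    level = np.zeros(size + 1)
    for t in range(1, size + 1):
        level[t] = phi * level[t - 1] + rng.normal(0.0, sigma)
    return np.diff(level)

r_spec_A = specific_returns(0.7, 0.02, T, rng)
r_spec_B = specific_returns(0.7, 0.02, T, rng)

# Condition 1: exposures gamma*x and x give common factor returns
# that are identical up to the scalar gamma.
r_cf_B = b @ x
r_cf_A = b @ (gamma * x)
r_A = r_cf_A + r_spec_A
r_B = r_cf_B + r_spec_B

# Condition 2: the long-short portfolio has no common factor return,
# so its spread equals the specific spread.
r_port = r_A - gamma * r_B
r_port_cf = r_cf_A - gamma * r_cf_B
spread = np.cumsum(r_port)
spread_spec = np.cumsum(r_spec_A - gamma * r_spec_B)

print(np.abs(r_port_cf).max(), np.abs(spread - spread_spec).max())
```

Both printed values are zero up to floating-point error, which is the algebraic content of the observation.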
Summary The driving idea in the common trends model has been to view a given time series as a sum of stationary and nonstationary components. The idea of viewing a time series as a sum of component series for ease of analysis is in fact a common practice. Notably, in the practical analysis of time series, the convention is to make an implicit assumption that the series is composed of trend, seasonal, and stochastic components. Arbitrage pricing theory declares this idea of composition rather explicitly and takes on a constructionist approach to modeling stock returns. Every risk factor in the APT model is associated with a time series of returns. The weighted sum of these series, with the factor exposures as weights, is the expected returns series for a stock. To get the actual returns, we need to add the specific return series to the expected returns. Two stocks with identical risk factor exposures would therefore have the same common factor returns. In other words, the common factor return of a long–short portfolio of the two stocks is zero. The common factor contribution to risk in this case is the least it can ever be, zero. Furthermore, if the integration of the specific returns of the stocks is stationary, then the two stocks are cointegrated. Thus, a key area to look for a measure of cointegration is the risk factor exposure profiles of the stocks and how closely aligned they are.
THE DISTANCE MEASURE We are now ready to define the distance measure. Recall from the discussions on the common trends model that the necessary condition for cointegration is that the innovation sequences derived from the common trends
must be perfectly correlated. We also established that the common factor return of the APT model might be interpreted as the innovations derived from the common trends. The correlation between the innovation sequences is therefore the correlation between the common factor returns. The closer the absolute value of this measure is to unity, the greater will be the degree of comovement. The distance measure we propose is therefore exactly that: the absolute value of the correlation of the common factor returns. The formula for the distance measure is therefore given as
$$d(A, B) = |\rho| = \frac{\left|\operatorname{cov}(r_A, r_B)\right|}{\sqrt{\operatorname{var}(r_A)\,\operatorname{var}(r_B)}} \tag{6.12}$$
In APT terms, if xA and xB are the factor exposure vectors of the two stocks A and B, and F is the covariance matrix, the distance measure may be calculated as
$$|\rho| = \frac{\left|x_A F x_B^T\right|}{\sqrt{\left(x_A F x_A^T\right)\left(x_B F x_B^T\right)}} \tag{6.13}$$
Note that in Equation 6.13 we have used only common factor terms. The specific variance contribution is not used.
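A minimal sketch of Equation 6.13 follows. The function name `distance_measure` and the sample covariance matrix and exposures in the usage lines are assumptions made for illustration; only the formula itself comes from the text.

```python
import numpy as np

def distance_measure(x_a, x_b, F):
    """Absolute common factor correlation between two stocks.

    x_a, x_b : factor exposure vectors
    F        : factor covariance matrix
    Specific variance is deliberately left out, as in Equation 6.13.
    """
    x_a = np.asarray(x_a, dtype=float)
    x_b = np.asarray(x_b, dtype=float)
    F = np.asarray(F, dtype=float)
    cov_ab = x_a @ F @ x_b
    var_a = x_a @ F @ x_a
    var_b = x_b @ F @ x_b
    return abs(cov_ab) / np.sqrt(var_a * var_b)

# Hypothetical two-factor inputs
F_demo = np.array([[0.04, 0.01],
                   [0.01, 0.09]])
x_demo = np.array([1.0, 2.0])

aligned = distance_measure(x_demo, 3.0 * x_demo, F_demo)      # same direction
opposed = distance_measure(x_demo, -2.0 * x_demo, F_demo)     # opposite direction
unrelated = distance_measure([1.0, 0.0], [0.0, 1.0], F_demo)  # different factors
print(aligned, opposed, unrelated)
```

Exposure vectors that are scalar multiples of each other score 1 regardless of the sign of the scalar, while the third pair scores well below 1.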
INTERPRETING THE DISTANCE MEASURE In previous discussions we hinted that the perfect alignment of the factor exposure vectors, that is, a zero angle between them, is indicative of cointegration. In this section we will show that the correlation measure as calculated in Equation 6.13 could actually be interpreted as the cosine of the angle between transformed versions of the factor exposure vectors corresponding to the two stocks. But why do we need to transform the factor exposure vectors? Can we not directly measure the angle between them? We could do that but for the fact that all factors in the multifactor model are not created equal. Returns are more sensitive to changes in some factors versus the others. We therefore need to transform from the space of factor exposure to the space of returns and then measure the angle between the transformed vectors. We will discuss the exact nature of this transformation and show that correlation can indeed be interpreted as the cosine of the angle.
Calculating the Cosine of the Angle between Two Vectors It is useful at this point to define the inner product of two vectors. Given two vectors eA and eB as follows:
$$e_A = \left(e_1^A, e_2^A, \ldots, e_N^A\right)$$
$$e_B = \left(e_1^B, e_2^B, \ldots, e_N^B\right) \tag{6.14}$$
The inner product between the two vectors is given by the formula
$$e_A \cdot e_B = e_1^A e_1^B + e_2^A e_2^B + \cdots + e_N^A e_N^B \tag{6.15}$$
In matrix notation, the inner product may be represented as $e_A e_B^T$, where $e_B^T$ is the transpose of the vector. The length of a vector is the square root of its inner product with itself. Therefore, we have
$$\text{length}(e_A) = \sqrt{e_A e_A^T} \tag{6.16}$$
With the preceding equations, we are now ready to calculate the angle between two vectors. The steps involved in calculating the angle between two vectors are as follows.
Step 1. We first evaluate the unit vectors (vectors of length one) pointing in the direction of the two vectors. We do this by scaling each element of the vector by its length.
Step 2. The cosine of the angle between the two vectors is now the inner product of the unit vectors pointing in their directions.
We will leave it to the reader to work out that the two steps may be condensed into a single formula as given in Equation 6.17.
$$\cos\theta = \frac{e_A e_B^T}{\sqrt{\left(e_A e_A^T\right)\left(e_B e_B^T\right)}} \tag{6.17}$$
Example
Let us say we are required to calculate the cosine of the angle between the vectors A = (0, 2) and B = (3, 0). Calculating the lengths of these vectors, we have
$$\text{length}(A) = \sqrt{0^2 + 2^2} = 2$$
$$\text{length}(B) = \sqrt{3^2 + 0^2} = 3$$
We now scale each vector by dividing every element in the vector by the length of the vector. Denoting the unit vectors in small letters, we have
a = (0, 1)
b = (1, 0)
The cosine of the angle between the vectors is given by the inner product of the unit vectors.
$$\cos\theta = a \cdot b = 0 \times 1 + 1 \times 0 = 0$$
The value of the cosine is zero, indicating that the angle between the two vectors is 90 degrees. The two vectors are indeed orthogonal to each other.
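The single-formula version in Equation 6.17 can be checked against this example with a few lines of NumPy. The helper name `cos_angle` is an assumption introduced here.

```python
import numpy as np

def cos_angle(e_a, e_b):
    """Cosine of the angle between two vectors (Equation 6.17)."""
    e_a = np.asarray(e_a, dtype=float)
    e_b = np.asarray(e_b, dtype=float)
    return (e_a @ e_b) / np.sqrt((e_a @ e_a) * (e_b @ e_b))

A = np.array([0.0, 2.0])
B = np.array([3.0, 0.0])
print(cos_angle(A, B))   # 0.0, so the vectors are orthogonal
```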
Geometric Interpretation
Key to doing the geometric interpretation is the idea of eigenvalue decomposition of the covariance matrix $F$. A brief discussion on eigenvalue decomposition is provided in the appendix. Let the eigenvalue decomposition of $F$ be given as $UDU^T$. If $x_A$ and $x_B$ are the factor exposure vectors of the two stocks, let us consider a transformation of the two vectors as shown in Equation 6.18.
$$e_A = x_A U D^{1/2}$$
$$e_B = x_B U D^{1/2} \tag{6.18}$$
This is the transformation from the factor exposure space to the factor return space. Now, using simple matrix manipulations it is easy to verify that
$$\text{length}(e_A) = \sqrt{x_A F x_A^T} = \sqrt{\operatorname{var}(r_A)} \tag{6.19}$$
$$\text{length}(e_B) = \sqrt{x_B F x_B^T} = \sqrt{\operatorname{var}(r_B)} \tag{6.20}$$
$$e_A e_B^T = x_A F x_B^T = \operatorname{cov}(r_A, r_B) \tag{6.21}$$
Using Equations 6.19, 6.20, and 6.21, the cosine of the angle between the vectors $e_A$ and $e_B$ works out as follows:
$$\cos\theta = \frac{e_A e_B^T}{\text{length}(e_A)\,\text{length}(e_B)} = \frac{x_A F x_B^T}{\sqrt{\left(x_A F x_A^T\right)\left(x_B F x_B^T\right)}} = \frac{\operatorname{cov}(r_A, r_B)}{\sqrt{\operatorname{var}(r_A)\,\operatorname{var}(r_B)}} = \rho \tag{6.22}$$
This completes the geometric interpretation.
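The chain of equalities can also be checked numerically. In the sketch below the covariance matrix and exposure vectors are hypothetical; `numpy.linalg.eigh` is used because the covariance matrix is symmetric.

```python
import numpy as np

F = np.array([[0.04, 0.01],
              [0.01, 0.09]])      # hypothetical factor covariance matrix
x_A = np.array([1.0, 0.5])        # hypothetical exposures
x_B = np.array([0.8, 1.2])

eigvals, U = np.linalg.eigh(F)    # F = U diag(eigvals) U^T
D_half = np.diag(np.sqrt(eigvals))

# Transformation from factor exposure space to factor return space
e_A = x_A @ U @ D_half
e_B = x_B @ U @ D_half

cos_theta = (e_A @ e_B) / (np.linalg.norm(e_A) * np.linalg.norm(e_B))
rho = (x_A @ F @ x_B) / np.sqrt((x_A @ F @ x_A) * (x_B @ F @ x_B))

print(cos_theta, rho)             # the two values agree
```

The squared length of each transformed vector equals the common factor variance of the corresponding stock, which is why the cosine and the correlation coincide.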
RECONCILING THEORY AND PRACTICE So far we have established the basis for the distance measure in theory. We have shown that the absolute value of the common factor correlation is a good way to measure the degree of comovement in stock prices. Let us now examine the practical issues surrounding it. There may be issues and sources of error that we may not be able to mitigate entirely. However, being aware of them and the risks they pose can provide insights into what can go wrong during trading.
Stationarity of Integrated Specific Returns
We discussed earlier that the necessary condition for cointegration is that the integration of the specific returns time series must be stationary. To verify this directly means that we must evaluate the common factor and specific returns for each time period. Alternately, this could be verified when performing the cointegration tests. Cointegration testing involves the estimation of the cointegration coefficient and ensuring that the spread series of the long–short portfolio constructed with this ratio is indeed stationary. If the integration of the specific returns is nonstationary, this would show up in the spread series being nonstationary. Readers can convince themselves of this by examining the observation made in the section Common Trends Model and APT. Thus, testing for stationarity of the spread is sufficient to ensure the stationarity of the integrated specific returns, and the assumption that we make here is verified in due course.
Deviations from Ideal Conditions From an APT point of view we established that two stocks will be cointegrated if their factor exposure vectors are perfectly aligned. More precisely, the common factor correlation between them must be +1 or –1. This is a
rather tall order and is seldom satisfied in practice. Unless the stocks are class A and class B shares of the same firm, it is unlikely that the stocks’ factor exposures will be perfectly aligned. If two stocks do not have their factor exposures perfectly aligned, then any long–short portfolio composed of the two stocks will have a nonzero component for the common factor returns. Our model for cointegration relies on a zero value for the common factor returns, and the violation of this represents deviation from ideal conditions for cointegration. Stating the above in formula form, we have
$$r_A - \gamma r_B = \left(r_A^{cf} - \gamma r_B^{cf}\right) + \left(r_A^{spec} - \gamma r_B^{spec}\right) \tag{6.23}$$
$$r_{port} = r_{port}^{cf} + r_{port}^{spec} \tag{6.24}$$
The value of $r_{port}^{cf}$ in Equation 6.24 is nonzero. Consequently, following the logic from the observation in the section Common Trends Model and APT, the common factor spread is also a nonzero quantity. It is helpful to write this in equation form.
$$\text{spread}_t = \text{spread}_t^{cf} + \text{spread}_t^{spec} \tag{6.25}$$
The common factor spread may very well be nonstationary, violating the cointegration condition of spread stationarity. But can we still make do with less than perfect conditions of cointegration? How do we quantify the deviation? Let us say that the spread series is composed of a stationary component (typically the specific spread) and a nonstationary component (typically the common factor spread). Let the variances of the two components be $\sigma^2_{stationary}$ and $\sigma^2_{nonstationary,T}$. Note that the variance of the nonstationary component is specified for a time horizon $T$. Also, let $t$ be the trading horizon. If the change in the nonstationary component of the spread is small, we could treat it more or less as a constant and say that we have a cointegrated pair. A measure of the deviation from cointegration is captured in the signal-to-noise ratio as given in Equation 6.26.
$$SNR = \frac{\sigma_{stationary}}{\sigma_{nonstationary,t}} \tag{6.26}$$
The ideal is to have the nonstationary component as close to zero as possible. If it is exactly zero, then the signal-to-noise ratio would be infinity. In practice, a very large number for the ratio would make our assumption of cointegration reasonable. Given that the variance of the nonstationary component increases linearly with the trading horizon, then, all things being equal, having a shorter trading horizon is definitely closer to the ideal condition of cointegration. This is of course determined by the dynamics of the spread and the rate at which the spread oscillates about the mean.
Based on Equation 6.25, the overall spread may also be interpreted as the specific return spread with a varying mean that is dictated by the common factor spread. If the common factor spread is a nonstationary time series, then the overall spread is equal to the specific spread with a stochastic drift to its mean value. Thus, deviation from ideal cointegration conditions results in what we shall call mean drift. The fallout from mean drift is that trading with symmetric bands may not be optimal because the movement of the spread series may be skewed. The worst part is that it could skew either way depending on the movement of the common factor spread, and it is not possible to know in advance. Also, if the common factor spread is nonstationary, then the variance of the skew scales linearly with time. An important insight to be gleaned in the process is that the mere passage of time represents an increase in risk in pairs trading and must be taken into account to design time-based stop orders.
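The effect of the trading horizon on the signal-to-noise ratio can be sketched as follows. The function assumes the common factor spread behaves like a random walk, so that its variance over a horizon is the per-period variance times the number of periods; the function name and the input volatilities are hypothetical.

```python
import numpy as np

def snr(sigma_stationary, sigma_cf_per_period, horizon):
    """Signal-to-noise ratio of Equation 6.26 for a given trading horizon.

    Assumes a random-walk common factor spread: variance grows linearly
    with the horizon, so the standard deviation grows with its square root.
    """
    sigma_nonstationary = sigma_cf_per_period * np.sqrt(horizon)
    return sigma_stationary / sigma_nonstationary

# Hypothetical inputs: specific spread volatility and per-period drift volatility
for horizon in (1, 4, 16):
    print(horizon, snr(0.06, 0.005, horizon))
```

Under this assumption, quadrupling the holding period halves the ratio, which is one way to see why the passage of time by itself adds risk.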
Model Error Model error occurs when our model specification is way off the mark. Let us say that there is dramatic news that could potentially result in a drastic change in fundamentals of a firm. The market immediately begins the process of adjustment to the news, predicting in some sense where the factor exposures of the firm would be under the new circumstances. In such cases where the expectation is for a dramatic change in the factor exposure, the common factor correlation evaluated before becomes dated, and the correlation structure between the two stocks breaks down. It is important to be aware of this and be constantly on the lookout for such events when trading pairs.
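Numerical Example
Consider three stocks A, B, and C with factor exposures in a two-factor model as follows:
$$x_A = \begin{bmatrix} 1 & 1 \end{bmatrix}, \quad x_B = \begin{bmatrix} 0.75 & 1 \end{bmatrix}, \quad x_C = \begin{bmatrix} 1 & 0.75 \end{bmatrix}$$
Let the factor covariance matrix be
$$F = \begin{bmatrix} .0625 & .0225 \\ .0225 & .1024 \end{bmatrix}$$
Step 1: Calculate the common factor variance and covariance.
$$\operatorname{var}(A) = \begin{bmatrix} 1 & 1 \end{bmatrix} \begin{bmatrix} .0625 & .0225 \\ .0225 & .1024 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \end{bmatrix} = .2099 \quad (\text{volatility } \sqrt{.2099} = 45.8\%)$$
$$\operatorname{var}(B) = \begin{bmatrix} .75 & 1 \end{bmatrix} \begin{bmatrix} .0625 & .0225 \\ .0225 & .1024 \end{bmatrix} \begin{bmatrix} .75 \\ 1 \end{bmatrix} = .1713 \quad (\text{volatility of } 41.4\%)$$
$$\operatorname{var}(C) = \begin{bmatrix} 1 & .75 \end{bmatrix} \begin{bmatrix} .0625 & .0225 \\ .0225 & .1024 \end{bmatrix} \begin{bmatrix} 1 \\ .75 \end{bmatrix} = .1539 \quad (\text{volatility of } 39.2\%)$$
$$\operatorname{cov}(A, B) = \begin{bmatrix} 1 & 1 \end{bmatrix} \begin{bmatrix} .0625 & .0225 \\ .0225 & .1024 \end{bmatrix} \begin{bmatrix} .75 \\ 1 \end{bmatrix} = .1887$$
$$\operatorname{cov}(A, C) = \begin{bmatrix} 1 & 1 \end{bmatrix} \begin{bmatrix} .0625 & .0225 \\ .0225 & .1024 \end{bmatrix} \begin{bmatrix} 1 \\ .75 \end{bmatrix} = .1787$$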
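Step 2: Calculate the correlation (the absolute value of the correlation is the distance measure). The Step 1 values are carried here to five decimal places.
$$\operatorname{corr}(A, B) = \frac{\operatorname{cov}(A, B)}{\sqrt{\operatorname{var}(A)\operatorname{var}(B)}} = \frac{.18865}{\sqrt{.2099 \times .17131}} = 0.9949$$
$$\operatorname{corr}(A, C) = \frac{\operatorname{cov}(A, C)}{\sqrt{\operatorname{var}(A)\operatorname{var}(C)}} = \frac{.17868}{\sqrt{.2099 \times .15385}} = 0.9943$$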
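If we have to choose one pair for purposes of trading, based on the distance measure, our choice would therefore be the pair (A, B).
Step 3: Calculate the cointegration coefficient. A detailed discussion on the reasoning for the formula is provided in the next chapter. Intermediate values below are shown rounded, but full precision is carried from one step to the next.
$$\gamma_{AB} = \frac{\operatorname{cov}(A, B)}{\operatorname{var}(B)} = \frac{.18865}{.17131} = 1.1012$$
$$\gamma_{AC} = \frac{\operatorname{cov}(A, C)}{\operatorname{var}(C)} = \frac{.17868}{.15385} = 1.1614$$
Step 4: Calculate the residual common factor exposure in the paired portfolio. This is the exposure that causes mean drift.
$$\text{exp}_{AB} = \begin{bmatrix} 1 & 1 \end{bmatrix} - 1.1012 \times \begin{bmatrix} 0.75 & 1 \end{bmatrix} = \begin{bmatrix} .1741 & -.1012 \end{bmatrix}$$
$$\text{exp}_{AC} = \begin{bmatrix} 1 & 1 \end{bmatrix} - 1.1614 \times \begin{bmatrix} 1 & 0.75 \end{bmatrix} = \begin{bmatrix} -.1614 & .1290 \end{bmatrix}$$
Step 5: Calculate the common factor portfolio variance, that is, the variance of the residual exposure.
$$\operatorname{var}\left(r_{AB}^{cf}\right) = \begin{bmatrix} .1741 & -.1012 \end{bmatrix} \begin{bmatrix} .0625 & .0225 \\ .0225 & .1024 \end{bmatrix} \begin{bmatrix} .1741 \\ -.1012 \end{bmatrix} = .00215, \qquad \sigma_{AB}^{cf} = \sqrt{.00215} = .0464$$
$$\operatorname{var}\left(r_{AC}^{cf}\right) = \begin{bmatrix} -.1614 & .1290 \end{bmatrix} \begin{bmatrix} .0625 & .0225 \\ .0225 & .1024 \end{bmatrix} \begin{bmatrix} -.1614 \\ .1290 \end{bmatrix} = .00239, \qquad \sigma_{AC}^{cf} = \sqrt{.00239} = .0489$$
Step 6: Calculate the specific variance of the portfolio. To simplify our illustration, let us assume the specific variance for all of the stocks to be 0.0016.
$$\operatorname{var}\left(r_{AB}^{spec}\right) = \operatorname{var}\left(r_A^{spec}\right) + \gamma_{AB}^2 \operatorname{var}\left(r_B^{spec}\right) = .0016 + 1.1012^2 \times .0016 = .00354, \qquad \sigma_{AB}^{spec} = \sqrt{.00354} = .0595$$
$$\operatorname{var}\left(r_{AC}^{spec}\right) = \operatorname{var}\left(r_A^{spec}\right) + \gamma_{AC}^2 \operatorname{var}\left(r_C^{spec}\right) = .0016 + 1.1614^2 \times .0016 = .00376, \qquad \sigma_{AC}^{spec} = \sqrt{.00376} = .0613$$
Step 7: Calculate the SNR ratio with white noise assumptions for residual stock return.
$$SNR_{AB} = \frac{\sigma_{AB}^{spec}}{\sigma_{AB}^{cf}} = \frac{.0595}{.0464} = 1.28$$
$$SNR_{AC} = \frac{\sigma_{AC}^{spec}}{\sigma_{AC}^{cf}} = \frac{.0613}{.0489} = 1.25$$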
Therefore, even on a signal-to-noise ratio basis the stock pair (A, B) does better than the stock pair (A, C). Notice that having a high value for the specific risk/variance (provided it is stationary) is highly desirable as it improves the SNR. A higher specific variance means higher stock volatility, indicating that a high volatility environment is conducive to pairs trading.
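The seven steps of the numerical example can be reproduced in a few lines. This is a sketch rather than a reference implementation: the helper name `analyze_pair` and the layout of its result are assumptions, while the exposures, covariance matrix, and specific variance are the ones given in the example. Because the code carries full precision through every step, its figures can differ in the third or fourth decimal place from values rounded by hand.

```python
import numpy as np

F = np.array([[0.0625, 0.0225],
              [0.0225, 0.1024]])           # factor covariance matrix
x = {"A": np.array([1.0, 1.0]),
     "B": np.array([0.75, 1.0]),
     "C": np.array([1.0, 0.75])}           # factor exposures
spec_var = 0.0016                          # specific variance of every stock

def analyze_pair(x_a, x_b, F, spec_var_a, spec_var_b):
    var_a = x_a @ F @ x_a                  # Step 1: common factor variances
    var_b = x_b @ F @ x_b
    cov_ab = x_a @ F @ x_b                 #         and covariance
    corr = cov_ab / np.sqrt(var_a * var_b) # Step 2: common factor correlation
    gamma = cov_ab / var_b                 # Step 3: cointegration coefficient
    resid = x_a - gamma * x_b              # Step 4: residual factor exposure
    sigma_cf = np.sqrt(resid @ F @ resid)  # Step 5: common factor volatility
    sigma_spec = np.sqrt(spec_var_a + gamma**2 * spec_var_b)  # Step 6
    return {"distance": abs(corr),
            "gamma": gamma,
            "residual_exposure": resid,
            "sigma_cf": sigma_cf,
            "sigma_spec": sigma_spec,
            "snr": sigma_spec / sigma_cf}  # Step 7: signal-to-noise ratio

results = {other: analyze_pair(x["A"], x[other], F, spec_var, spec_var)
           for other in ("B", "C")}

for other, res in results.items():
    print(f"A-{other}",
          {name: np.round(value, 4) for name, value in res.items()})
```

On both the distance measure and the signal-to-noise ratio the pair (A, B) comes out ahead of (A, C), although the margins are narrow.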
SUMMARY
The candidate list of potentially cointegrated stock pairs can be compiled by the process of identifying similar stocks. The notion of similarity is formalized using a distance measure between two stocks. The distance measure is based on an APT model, possibly with fundamental risk factors. The candidate list of pairs is determined by choosing pairs with distance values within a certain threshold. The distance measure is the absolute value of the common factor correlation between the two stocks. If the common factor correlation is +1 or –1 and the integration of the specific return series of each stock involved is stationary, then the conditions for cointegration are satisfied. It may be possible to trade pairs of stocks even though they deviate from ideal conditions of cointegration. The signal-to-noise ratio as defined is a measure of the deviation from the ideal condition of cointegration.
FURTHER READING MATERIAL
Cointegration Properties
Phillips, P. C. B., and S. N. Durlauf. “Multiple Time Series Regression with Integrated Processes.” Review of Economic Studies 53 (1986): 473–495.
Park, J. Y., S. Ouliaris, and B. Choi. “Spurious Regressions and Tests for Co-integration.” CAE working paper 88–07, Cornell University, Ithaca, New York, 1988.
Near Cointegration
Haldrup, Niels, and Michael Jansson. “Spurious Regression, Co-integration and Near Co-integration: A Unifying Approach.” Working papers, DK-8000, Department of Economics, University of Aarhus, Denmark, 1999.
APPENDIX: EIGENVALUE DECOMPOSITION
Consider a scalar $\lambda$ and a corresponding vector $v$. They are an eigenvalue, eigenvector pair of a square matrix $A$ if they satisfy the equation
$$Av = \lambda v$$
The equation means that the vector $v$ is special with respect to the matrix $A$. Multiplying the vector with the matrix does not change the direction or orientation of the vector. Its magnitude, however, is multiplied by the scalar $\lambda$. A square $n \times n$ matrix may have $n$ eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$ and $n$ corresponding eigenvectors $v_1, v_2, \ldots, v_n$. Therefore
$$Av_1 = \lambda_1 v_1$$
$$Av_2 = \lambda_2 v_2$$
$$\vdots$$
$$Av_n = \lambda_n v_n$$
Let us write this in matrix form. We construct a diagonal matrix $D$ with the eigenvalues forming the diagonal and a matrix $U$ with each column corresponding to an eigenvector. Then
$$AU = UD \quad \text{or} \quad A = UDU^{-1}$$
Note that we now have the matrix $A$ in terms of its eigenvalues and eigenvectors. This is the eigenvalue decomposition of $A$. The covariance matrix $F$ has special properties (it is symmetric), and in that case $U^{-1} = U^T$ and
$$F = UDU^T$$
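The properties stated in this appendix can be verified numerically. The sketch below uses the factor covariance matrix from the numerical example in this chapter and `numpy.linalg.eigh`, which is intended for symmetric matrices; the variable names are illustrative.

```python
import numpy as np

F = np.array([[0.0625, 0.0225],
              [0.0225, 0.1024]])   # symmetric covariance matrix

eigvals, U = np.linalg.eigh(F)     # columns of U are the eigenvectors
D = np.diag(eigvals)               # eigenvalues on the diagonal

# F v = lambda v for every eigenvalue, eigenvector pair
for lam, v in zip(eigvals, U.T):
    assert np.allclose(F @ v, lam * v)

print(np.allclose(F @ U, U @ D))          # AU = UD
print(np.allclose(U.T @ U, np.eye(2)))    # U^-1 = U^T
print(np.allclose(U @ D @ U.T, F))        # F = U D U^T
```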
CHAPTER
7
Testing For Tradability
INTRODUCTION In Chapter 6 we discussed the process of choosing potential stock pairs. In this chapter we will focus on whether the identified candidate pairs are actually tradable. Based on the discussions so far, we can state that a pair is tradable if the stocks making up the pair are cointegrated. We need to bear in mind, however, that in most cases we are dealing with systems that are not exactly cointegrated. As a matter of fact, in the course of examining the candidate pairs, we can almost always expect to have a residual factor exposure causing the phenomenon of mean drift, thereby resulting in the signal to be nonstationary. However, if the signal-to-noise ratio is good enough, then for all practical purposes we could treat the residual series as stationary for the time period of the trade. Based on the preceding observations we could refine the phrasing of our question as follows: How do we decide that a pair is tradable even though it deviates from ideal conditions of cointegration? To seek out an approach to answer that, we could draw on the insights gleaned from cointegration testing procedures. With that in mind, let us outline the process of verifying that two stocks are indeed cointegrated. Most verification processes are based on the “If it walks like a duck and quacks like a duck, it must be a duck” philosophy. Needless to add, the approach to cointegration testing is also along the same lines. We identify the properties that must be satisfied if cointegration were indeed true and check to see if the properties hold. If they do, then the stock pair in question is declared cointegrated. Let us therefore review some properties of cointegrated systems that could be potentially of use in the testing process. We start off with the Stock-Watson characterization of cointegrated systems. Each individual stock series is modeled as a sum of a trend component and a stationary component. 
The property characteristic of cointegrated systems is that the trend components of the two stocks must be the same. If the trend component is the same and the stationary component just oscillates about some value, then there must be a strong correlation between
the two stock series (not to be confused with the correlation between the first differences or returns of the series). A strong correlation is usually evidenced by a good fit in the regression of one time series against the other. Additionally, a good fit is characterized by a strong t statistic and r-squared measure. Therefore, deducing from the preceding information, we could look for a good value for the t statistic and the r-squared measure from the regression of the two time series, infer a strong correlation between the two series, and declare the existence of a common trend. This sounds logical, does it not? However, it is not true. Let us see why that is.
Upon careful examination of the preceding argument, it can be seen that it is actually incomplete. The argument would be complete if we could assert that the good fit upon regression or strong correlation property is a trait exclusive to cointegrated systems. Only if that were true could we conclude that evidence of the property implies cointegration. Unfortunately for us, the good fit on regression property is not exclusive to cointegrated systems. As a matter of fact and rather surprisingly, completely independent random walks when regressed against each other can also result in a high r-squared measure. This rather counterintuitive phenomenon was reported in the findings of a simulation experiment conducted by Granger and Newbold, who aptly coined the phrase “spurious regression” to describe it. Thus, as evidenced by spurious regression, the strong correlation property is not exclusive to cointegrated systems and therefore cannot be used as a definitive test for cointegration.
We now turn to another property of cointegrated systems. This is the existence of a linear combination of two time series that is stationary and mean reverting. That property is in fact a defining property of cointegrated systems. We could therefore design a cointegration test based on the verification of this property.
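The spurious regression phenomenon is easy to reproduce. The sketch below is not the original Granger and Newbold experiment; it is a small simulation in the same spirit, with the sample size, number of trials, and the threshold of 2 on the t statistic chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

def slope_t_stat(y, x):
    """t statistic of the slope in an OLS regression of y on x with intercept."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - 2)          # residual variance
    cov_beta = s2 * np.linalg.inv(X.T @ X)     # coefficient covariance
    return beta[1] / np.sqrt(cov_beta[1, 1])

n_obs, n_trials = 100, 500
t_stats = np.array([
    slope_t_stat(np.cumsum(rng.standard_normal(n_obs)),   # random walk 1
                 np.cumsum(rng.standard_normal(n_obs)))   # independent random walk 2
    for _ in range(n_trials)
])

# Share of regressions that look "significant" although the series are unrelated
share_significant = np.mean(np.abs(t_stats) > 2.0)
print(share_significant)
```

If the usual t test were valid here, roughly one regression in twenty would exceed the threshold; with independent random walks the share is far higher.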
Such a test for cointegration was first prescribed by Engle and Granger. The rationale behind the test proceeds somewhat like this: If two time series are cointegrated, then a simple regression of one time series against the other should produce a good estimate of the linear relationship. Also, the spread series resulting from the linear relationship, that is, the residual series from the regression, must be stationary. Therefore, to test for cointegration all we need to do is estimate the linear relationship between the two given series using simple regression and test for stationarity of the residuals. If the residuals form a stationary time series, then we have a cointegrated pair. Thus, cointegration testing is a two-step process:¹
1. Determination of the linear relationship.
2. Stationarity testing on the residuals.
¹ There is a one-step test for cointegration originally proposed by Johansen, which I do not discuss here. However, references for it are in the appendix.
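The two-step procedure can be sketched with NumPy alone. This is a simplified illustration, not a production test: step 2 uses a plain Dickey-Fuller regression without lagged difference terms, the function names and simulated series are hypothetical, and the resulting statistic must be compared against critical values tabulated for residual-based cointegration tests, which are more negative than the standard Dickey-Fuller ones.

```python
import numpy as np

def ols(y, X):
    """Least-squares coefficients and residuals of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta

def two_step_statistic(log_a, log_b):
    # Step 1: estimate the linear relationship log_a = mu + gamma * log_b + residual
    X = np.column_stack([np.ones_like(log_b), log_b])
    (mu, gamma), resid = ols(log_a, X)

    # Step 2: Dickey-Fuller regression  diff(resid)_t = rho * resid_{t-1} + error
    lagged = resid[:-1]
    delta = np.diff(resid)
    rho = (lagged @ delta) / (lagged @ lagged)
    err = delta - rho * lagged
    se = np.sqrt((err @ err) / (len(delta) - 1) / (lagged @ lagged))
    return gamma, mu, rho / se          # t statistic on rho

# Simulated pair sharing one random-walk trend (all parameters are assumptions)
rng = np.random.default_rng(42)
n = 1000
trend = np.cumsum(rng.normal(0.0, 0.01, n))
noise = np.zeros(n)
for t in range(1, n):
    noise[t] = 0.5 * noise[t - 1] + rng.normal(0.0, 0.01)   # stationary AR(1)
log_b = 3.0 + trend
log_a = 0.2 + 1.3 * log_b + noise
gamma_hat, mu_hat, t_coint = two_step_statistic(log_a, log_b)

# An unrelated random walk for comparison
log_c = np.cumsum(rng.normal(0.0, 0.01, n))
_, _, t_indep = two_step_statistic(log_a, log_c)

print(gamma_hat, t_coint, t_indep)
```

For the simulated cointegrated pair the statistic is strongly negative and the slope is close to the value used in the simulation, while the unrelated pair gives a much weaker statistic.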
We now go back to our original question. Knowing that we are dealing with systems that are not exactly cointegrated, how do we adapt the cointegration test to test for tradability? With regard to the first step, there is not much of a change because in any case we would want to know the linear relationship between the two stocks. The difference, however, is in the second step. Contrary to the strict requirement of stationarity of the residual series for cointegration, the pair is deemed tradable as long as the residual exhibits a sufficient degree of mean reversion. I will seek to quantify this idea of mean reversion and show how the results may be used to directly verify whether a stock pair is tradable or not. Thus, similar to the cointegration testing, testing for tradability is also a two-step process: estimation of the linear relationship and measuring the degree of mean reversion. We will discuss these two steps in detail in the following section, on the linear relationship. Let us start with the estimation of the linear relationship.
THE LINEAR RELATIONSHIP
The linear relationship between the two time series is given as
$$\log\left(p_t^A\right) - \gamma \log\left(p_t^B\right) = \mu + \varepsilon_t \tag{7.1}$$
In Equation 7.1, the left-hand side of the equation represents a linear combination of the two time series; g in the equation is the cointegration coefficient. The right-hand side of the equation therefore represents the residual series and is expressed as the sum of two components: m is the equilibrium value, and et is a time series with zero mean, which may be construed as the disturbance term in the equilibrium. If the series were mean reverting, then we would expect its value to oscillate about the equilibrium value. Owing to this, the linear relationship between the two series is also termed the equilibrium relationship, characterized by the two values g and m. It is therefore important to remind ourselves of the economic interpretation of these two values. The interpretation of g as the common factor beta between the two stocks was already discussed in Chapter 6. We are now left with the interpretation of m. To do that, let us consider a portfolio long one share of stock A and short g shares of stock B. The g shares of B represents a position in terms of stock B that is equivalent to one share of A. Now, according to the equilibrium relationship, such a portfolio yields an average cash flow of m, which is given back when the position is reversed. Thus m represents the premium paid for holding stock A over an equivalent position of stock B. But
do such premiums exist in real life? As a matter of fact, stocks do trade at a premium for a variety of reasons. Greater relative liquidity (liquidity premium), the possibility of the firm being a takeover target (takeover premium), and pure charisma on the part of some stocks are some reasons that come to mind. Therefore, in the evaluation of the equilibrium relationship care must be taken to estimate both values: γ, the cointegration coefficient, and μ, the premium. We will discuss two approaches to estimating the equilibrium relationship. The first approach is based on the multifactor framework. The second approach is the regression approach. Let us begin with the multifactor approach.
ESTIMATING THE LINEAR RELATIONSHIP: THE MULTIFACTOR APPROACH In Chapter 6 we mentioned that the cointegration coefficient could be estimated by performing a regression of the common factor returns of one stock against the other. The estimated value from the regression is the cointegration coefficient. Also, the formula for the regression coefficient can be expressed in terms of the covariance and variance of the stocks involved. Now, exploiting the multifactor framework the variance and covariance may be expressed in terms of the factor exposures and the factor covariance matrix. Thus, the regression formula may be expressed completely in terms of the constructs of the multifactor framework and may be used to evaluate the cointegration coefficient. It is important to note here that we have two values for the cointegration coefficient depending on the choice of the independent variable. If the linear relationship is expressed assuming stock B to be the independent variable, we have
$$\mathrm{res}_t = \log(p_t^A) - \gamma \log(p_t^B) \tag{7.2}$$
$$\gamma = \beta_{AB} = \frac{\operatorname{cov}(r_A, r_B)}{\operatorname{var}(r_B)} = \frac{e_A^T F e_B}{e_B^T F e_B} \tag{7.3}$$
Alternately, if the equilibrium relationship is represented assuming stock A to be the independent variable, we have
$$\mathrm{res}_t = \log(p_t^B) - \gamma' \log(p_t^A), \tag{7.4}$$
$$\gamma' = \beta_{BA} = \frac{\operatorname{cov}(r_A, r_B)}{\operatorname{var}(r_A)} = \frac{e_B^T F e_A}{e_A^T F e_A} \tag{7.5}$$
We now have two relationships and two values for the cointegration coefficient, bringing us to the question of which value to use in our tests. We suggest going with the larger of the two. From a purely numerical viewpoint in terms of reducing precision errors, we are better off estimating the larger of the two numbers. This choice has some additional implications. To see that, let us suppose that our choice was γ (stock B is the independent variable) because γ > γ′. Since both formulas share the same covariance in the numerator (taken here to be positive), it follows that $\operatorname{var}(r_B) < \operatorname{var}(r_A)$. Thus, by choosing the larger of the two values for the cointegration coefficient, we are by implication designating the stock with lower volatility as the independent variable. Once the value of the cointegration coefficient is determined, we can very easily evaluate the residual time series. From the earlier discussion on the linear relationship, the equilibrium value μ is the mean value of the residual time series. If this is significantly different from zero, we have a nonzero equilibrium value. Otherwise, we could assume that it is zero. To summarize, the steps involved in estimating the equilibrium relationship are as follows:
1. Calculate the two values γ and γ′ using multifactor model constructs.
2. Determine which value must be used for the cointegration coefficient. Our choice is guided by the larger of the two values γ and γ′.
3. Construct the time series corresponding to the appropriate linear combination and evaluate its mean. If it is significant, we have a nonzero equilibrium value; otherwise, it is zero.
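The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the exposure vectors `e_a` and `e_b`, the factor covariance matrix `F`, and all function names are hypothetical, and the significance test on the mean in step 3 is left out.

```python
def quad_form(u, F, v):
    """Return u^T F v for exposure vectors u, v and factor covariance matrix F."""
    return sum(u[i] * F[i][j] * v[j]
               for i in range(len(u)) for j in range(len(v)))


def cointegration_coefficients(e_a, e_b, F):
    """Step 1: gamma (Equation 7.3) and gamma' (Equation 7.5)."""
    cov_ab = quad_form(e_a, F, e_b)
    gamma = cov_ab / quad_form(e_b, F, e_b)        # B is the independent variable
    gamma_prime = cov_ab / quad_form(e_a, F, e_a)  # A is the independent variable
    return gamma, gamma_prime


def residual_mean(log_dep, log_indep, coeff):
    """Step 3: mean of log(dep) - coeff * log(indep), the candidate equilibrium value."""
    residual = [y - coeff * x for y, x in zip(log_dep, log_indep)]
    return sum(residual) / len(residual)


# Hypothetical two-factor exposures and factor covariance matrix (made-up numbers).
e_a = [1.2, 0.4]
e_b = [1.0, 0.3]
F = [[0.04, 0.01],
     [0.01, 0.09]]

gamma, gamma_prime = cointegration_coefficients(e_a, e_b, F)

# Step 2: go with the larger value; here that designates B as the independent variable.
b_is_independent = gamma >= gamma_prime
print(gamma, gamma_prime, b_is_independent)
```

With these made-up inputs, stock B has the lower common factor variance, so γ is the larger coefficient and B becomes the independent variable, in line with the argument above.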
ESTIMATING THE LINEAR RELATIONSHIP: THE REGRESSION APPROACH The linear relationship may also be estimated using a regression approach. The use of regression to estimate the linear relationship is based on the premise that if two series are cointegrated, then a simple regression of one time series against the other should give us the cointegration coefficient and the value of the premium. The slope of the regression line is the cointegration coefficient, and the intercept is the premium. The attractiveness of the regression approach is that the general methodology of regression is well known and has found ready application in innumerable situations. Implementations of the ordinary least squares approach are part of most software packages and may be readily used on the prepared
data set. In fact, it has often been argued that the simplicity of the regression process is probably its most powerful feature. Then why do we even attempt to discuss it at length? Well, it is also true that although the general idea of regression is relatively simple, the simplicity of the process lends itself to hasty application without much thought. Thus, ironically, the pervasiveness and simplicity of the regression approach is the reason to discuss aspects of the regression approach in some detail. Proper use of regression is possible only if we thoroughly understand the standard regression scenario and the deviations of our situation from the standard. Let us therefore examine the standard regression scenario as applied to physical systems. Central to physical systems is the study of cause and effect; that is, the response of the system to a particular stimulus. If the expectation is that the response is proportional to the stimulus, the process of linear regression comes in very handy in measuring this constant of proportionality. The typical experiment involves subjecting the system to a series of inputs or a stimulus in a specified range and measuring the response of the system to those inputs. The input–output pairs then form the data set on which the regression is run. The independent variable in this case is the input or stimulus, and the slope of the regression is the constant of proportionality connecting the stimulus to the response. Let us now highlight some aspects of the experimental process. Note that in this case, since the experimenter is the one administering the stimulus to the system, he or she can design it to be accurate with a very small magnitude of error. We can therefore assume that there is no error in the input data. However, the output data come from the response of the system and may not be known accurately due to imperfect experimental conditions. 
Also, if the experimental conditions do not change dramatically during the course of the trials, we may assume that the error manifest in each observation of the response is drawn from a common probability distribution. Thus, here is a situation where the input is known relatively accurately, and the source of error is only in the output with an error standard deviation that is constant across all observations. Now let us see how stock price data relates to the scenario described above. First, it may be argued that the price of a stock is known exactly. Then where is the source of error? Note that we use just one representative value for the price in a given time period, while in actuality the price changes constantly within the period itself. Therefore, a reasonable case may be made for the existence of uncertainty or error associated with our choice for the price in the time period. Next, in our situation, there is no distinct separation of cause and effect. The prices of the two stocks may very well be feeding off each other, as discussed in the error correction model for cointegration in Chapter 5. Since it is not possible to easily separate cause from effect in this scenario, and both of the price values are read as outputs, we are
now faced with a situation where there could be uncertainty or error in both variables. This is vastly different from the standard linear regression case, which assumes errors only in one variable. As mentioned earlier, each stock moves constantly, reaching its own highs and lows within a time period, with the high and low price characterizing the range of price movement. Although we allow ourselves to choose only one representative price for a given time period for each stock within this range, we also admit that there is a certain amount of uncertainty associated with our choice. But data points in each time period are chosen from price ranges of different magnitudes. Under such circumstances, it would be rather unreasonable to assume a constant probability distribution for the uncertainty or errors that are intricately linked to the magnitude of the price range in a given period. We can strongly assert that our situation is one where the uncertainty associated with each data point is different. This is also different from the standard regression scenario of constant variance in the observations. Therefore, to sum up, we have a situation where there is error associated with both observations, and the variance of the observation error is also a varying quantity. Although the differences of our situation from the standard regression scenario are substantial, they are by no means novel. Such situations have been encountered in a variety of other applications, and the solutions developed there may be applied to ours without change. Going that route, however, adds to the complexity of the process. We will briefly discuss the solution approach in this case to highlight the issues and proceed to suggest a much simpler approach. The situation of nonconstant error distributions coupled with errors in both variables can be handled by minimizing the chi-squared merit function, given as
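$$\chi^2(\gamma, \mu) = \sum_{t=1}^{N} \frac{\left[\log(p_t^A) - \gamma \log(p_t^B) - \mu\right]^2}{\operatorname{var}(\varepsilon_t^A) + \gamma^2 \operatorname{var}(\varepsilon_t^B)} \tag{7.6}$$
In Equation 7.6, $\operatorname{var}(\varepsilon_t^A)$ and $\operatorname{var}(\varepsilon_t^B)$ are the variances of the error in the observations $\log(p_t^A)$ and $\log(p_t^B)$. The errors are assumed to have a zero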
mean and may be calibrated based on the range of movement of the stock within each time period. Note that for our purposes it is not important to have an absolute measure of the variance of the errors, just that the values be proportional to the actual variances. To further understand how the chi-squared function handles the situation of nonconstant error distributions, let us examine it in a little bit of detail. The value in the numerator of the merit function is the squared error in the regression. If the variance as shown in the denominator was a constant, then the minimization boils down to minimizing the sum of squared errors,
which is the ordinary least squares procedure. The denominator term, in fact, weights each data point in the cost function to be inversely proportional to the variance of the individual error terms. Thus, it may also be thought of as the sum of squared errors each normalized by its variance. This approach to regression using the chi-squared merit function is sometimes also called the weighted least squares approach. However, typical applications adopting a weighted least squares approach assume the error only in the response variable and not in both. Specifically, those applications do not have the term with γ² in the denominator. The presence of this term in the denominator complicates the minimization of the chi-squared function, in that the derivative of the chi-squared function with respect to γ is now nonlinear, and so we may need to resort to numerical methods to solve this. In summary, the regression process can get fairly involved if we are to account for the varying error probability distributions in the measurement with errors in measurement for both the variables. Nevertheless, if there is a way to construct a price series such that the errors associated with the observation in each time period may be assumed to be the same, then we can do away with these complications, work with just ordinary least squares, and arrive at a reasonable answer. Let us see if we might be able to do that. Note that in the previous paragraph we mentioned that we choose only one representative price for a given time period from the range of stock price movement for that period. In a typical scenario where the length of a period is one trading day, the standard convention in the construction of daily time series is to use the closing price at the end of the trading day; that is, the latest price in the process of serial price adjustment. Let us call this method for recording the price time series the close-close method. 
Time series of this type are usually constructed to perform a mark to market of stock inventory. Given the nature of the auction process and the price discovery mechanism, this is definitely a reasonable choice. But is this approach to data construction appropriate for our purposes? It is seductive by habit to use the same construction regardless of purpose even though such an exercise may be ill-advised. Care must be taken to ensure that the process of data construction is a reflection of the specific purpose at hand. Our purpose is to examine the price relationship between two stocks. In this quest, to examine price relationships, more important than the closing price in a given time period is the answer to the question, “At what price in the time period was the liquidity a maximum?” That would be the consensus price in the time chunk at which the most buyers and sellers agreed that the price was right and a lot of shares changed hands. Therefore, conclusions drawn on the maximum liquidity price series of two stocks would be more reliable than using the close-close approach. A reasonable proxy for the
maximum liquidity price is the volume weighted average price, commonly termed the VWAP price. We could therefore construct our time series with VWAP prices. By choosing the VWAP price, we are arguably within reason to assume that the error distribution of the VWAP price relative to the maximum liquidity price is about the same regardless of the magnitude of the price range. We can therefore resort to the simple ordinary least squares version in our regression analysis. Of course, the calculation of the standard error that forms the basis for the t-statistic tests would still need alteration. In our scenario of cointegration or tradability testing, the emphasis on the t statistic is rather low, and therefore we contend that this is something we can live with. Additionally, note that using the VWAP price has the tendency to temper extreme values and therefore has the added benefit of minimizing the effect of outliers on the regressions. In conclusion, the time series constructed with the VWAP price is better suited to understanding equilibrium relationships and should be the preferred approach. However, this does not do away with the need to decide which of the two price series we should use for the independent variable. We will avoid being repetitive and just say that the same idea as was adopted in the multifactor model case may be applied here also.
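The regression approach can be sketched as follows, assuming that stock B is the independent variable and that per-period trade prices and volumes are available. The function names and the sample VWAP values are hypothetical. The merit function of Equation 7.6 is included only for evaluation: with a constant error variance for A and no error in B it reduces to the sum of squared errors, which is why the ordinary least squares estimate minimizes it in that special case.

```python
import math


def vwap(prices, volumes):
    """Volume weighted average price of the trades in one period."""
    return sum(p * v for p, v in zip(prices, volumes)) / sum(volumes)


def ols_fit(x, y):
    """Ordinary least squares of y on x; returns (slope, intercept)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def chi_squared(gamma, mu, log_a, log_b, var_a, var_b):
    """Merit function of Equation 7.6 for given per-period error variances."""
    return sum((a - gamma * b - mu) ** 2 / (va + gamma ** 2 * vb)
               for a, b, va, vb in zip(log_a, log_b, var_a, var_b))


# One period's VWAP from hypothetical trades: 100 shares at 10, 300 shares at 11.
example_vwap = vwap([10.0, 11.0], [100, 300])

# Hypothetical daily VWAP series for stocks A and B.
vwap_a = [50.0, 50.6, 51.1, 50.9, 51.8]
vwap_b = [40.0, 40.3, 40.9, 40.6, 41.2]
log_a = [math.log(p) for p in vwap_a]
log_b = [math.log(p) for p in vwap_b]

# Slope is the cointegration coefficient, intercept is the premium.
gamma, mu = ols_fit(log_b, log_a)

# Special case: constant error variance in A, no error in B.
ones, zeros = [1.0] * len(log_a), [0.0] * len(log_a)
chi2_at_ols = chi_squared(gamma, mu, log_a, log_b, ones, zeros)
print(example_vwap, gamma, mu, chi2_at_ols)
```

For the general case with errors in both variables, the minimization over γ would have to be done numerically, as noted in the text; that step is not shown here.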
TESTING RESIDUAL FOR TRADABILITY Subsequent to estimating the equilibrium relationship we need to construct the residual time series. Although we advocated using VWAP prices to estimate the equilibrium relationship, we recommend constructing the residual time series by applying the equilibrium relationship to the time series of stocks constructed using the close-close method. If the two series are indeed cointegrated, constructing such a time series provides us with a fairly good picture of the oscillations about the equilibrium value. We begin by reviewing the ideal situation. In an ideal situation for tradability, the two stocks would be cointegrated, and the residual series would be stationary. It is therefore desirable that the properties of the residual series exhibit the characteristics of stationary series. One of the properties of stationary series, the property of mean reversion, is of particular importance to us. This is relevant because pairs trading is a bet that the residual series will revert to its mean or equilibrium value. In other words, deviations from the mean are quickly corrected by the series moving back toward the mean. It would therefore be nice if we could quantify the degree of mean reversion of a given time series. It turns out that highly mean-reverting series are also characterized by a high frequency of zero-crossings. A zero-crossing is defined as the transition of the time series across its long-run mean. The frequency of zero-crossing is
then the number of times we can expect the time series to cross its equilibrium value in unit time. Thus, the zero-crossing frequency provides us with a quantitative characterization for the mean reversion property. Notice that if the zero-crossing rate is very high, then the time to revert to the mean is short, implying that the time we need to hold the paired position is small. The signal-to-noise ratio is bound to be good, and we could be more comfortable with the idea that the pair is tradable. Thus, a high zero-crossing rate for the residual series is a preferred trait and directly appeals to our requirements. A high zero-crossing rate is also indicative of a stationary series. To strengthen this conviction, we make an observation in contrast by considering the example of Brownian motion, a nonstationary series. Even though the distribution of Brownian motion is symmetric about the mean, the zero-crossing event is very infrequent. The theoretical explanation for the phenomenon is well captured by the famous arcsine law for Brownian motion discovered by P. Levy. The law provides us with information on the last passage time, or the last time that the Brownian motion visited zero. More precisely, let us consider a Brownian motion starting at zero at time t = 0 and stopped at time T. If g is the last time when zero is visited, then the probability distribution satisfies
(
)
Pg = $2.30 Thus, the order unambiguously specifies the tickers, the trade direction, trade quantity, the spread constraint, and the ratio constraint that are to be satisfied.
VERIFYING THE EXECUTION The paired execution order as specified to the broker is usually for a considerable size in terms of the number of shares involved. Given the large position sizes and the fact that this is a paired transaction, the order is worked by the broker in a series of executions. This is done on a best-effort basis. The broker then delivers to the arbitrageur a list of executions for both the stocks. It is now up to the arbitrageur to evaluate the quality of execution. The quality of executions may be measured two ways, speed and efficacy. In situations when there is news in the market and quick action is required, execution speed is of utmost importance and may very well be the criterion upon which the execution quality is measured. However, in most other situations the arbitrageur is interested in capturing as high a spread as possible when putting on the spread and unwinding at as low a spread as possible. In such cases, the execution efficacy, characterized by how well the executions measure up to the specified spread value, is more important. While one would ideally want to be able to capture as high a spread as possible in the shortest possible time, it is conceivable that a quick execution could mean that one gives up on efficacy, causing a trade-off between the two criteria for measurement of execution quality. We now look at issues related to measuring execution efficacy. Of course, in the course of the discussion we will see how it affects execution speed. To measure efficacy, one can attempt to pair up the executions and check to see if each paired execution satisfies the spread constraint, how many missed the specified spread value, and by how much. A ratio of the dollar value of the hits versus the misses could very well be the measure of execution efficacy. Another option is to compare the specified spread value
against the average achieved spread calculated using the execution prices of the two stocks. We could then use the difference as a measure of execution efficiency. The question therefore is, which measurement approach do we choose? We recommend going with the second method; that is, the one based on the average execution prices. We will justify our recommendation by demonstrating that the first idea of pairing up executions to judge efficacy is a process fraught with inconsistencies. To do just that, let us go through the exercise of pairing the executions. This exercise in addition to proving our point also provides insight into how we may characterize aggressive and conservative trading. Let us start with the pairing process. The input to the pairing exercise is the list of executions. Each execution is characterized by the stock ticker, the direction of execution (long or short),1 the fill price, and the fill quantity. The executions may therefore be partitioned into two sets based on the stock ticker. The idea is to pair the executions of one ticker against the executions of the other ticker such that the spread and ratio constraints are met. We formulate the trade pairing exercise as a network flow problem.
HISTORY One of the important subject areas of the latter part of the twentieth century that has found successful application in a multitude of situations has been the study of network flows. The subject area was pioneered by Ford and Fulkerson. In their seminal paper published in 1956, they framed the network flow problem as an integer linear program. Besides detailing an algorithm to solve for the optimal way to route the flow across a network, they also provided a novel interpretation of the dual of the network flow linear program, now famously known as the max flow–min cut theorem. Since then, network flows have found application in a wide variety of situations. Of relevance to finance, a subclass of the general max–flow problem, known as matching, has found application in game theory and to the theory of auctioning. Other applications of network flows range from something as practical as logistic planning to something as abstract as theorem proving,2 making this a fascinating subject area for study in its own right.
1 Since these trades are all done to fill the same order, we can expect the trade direction to be uniform across all executions for a given ticker.
2 For an example, see Umesh Vazirani, “Rapid Mixing Markov Chains,” Proceedings of the Symposia in Applied Mathematics, Vol. 44, 1991, pp. 99–121.
As a necessary preprocessing step, we evaluate the equivalent target share quantities for the fill quantities on the bidder side trades. This is accomplished by dividing the bidder fill quantities by the exchange ratio. These are the share quantities that will be used in the modeling exercise. The network flow problem can now be represented in a pictorial fashion. Every execution is represented as a node. We pick a node from each of the two sets (bidder trades and target trades) and determine if the fill prices on the two executions satisfy the spread constraint. If satisfied, we draw an edge or line between the two nodes or executions and assign a weight or capacity to it based on the number of shares that can be matched along that edge. Thus, each edge represents a potential way to pair the executions. We wish to match as many shares of the bidder as we possibly can with the target shares. The edges and the nodes form a network called a bipartite graph in graph-theory parlance, and the pairing problem is equivalent to maximizing the flow on this bipartite network. The idea becomes clearer with the following example:
Example Consider an order specification with the following data:
Exchange ratio = 1.0
Cash amount = 0.0
Spread value specified = $1.00
Action = put on
Now, since we are putting on a spread, we sell the bidder and buy the target. The spread constraint is therefore met when the calculated spread is at least the specified value of $1. The executions list for the bidder and target stocks when putting on the spread position is given in Table 10.1. The pairing problem can now be posed as a max-flow problem on the network shown in Figure 10.1.
TABLE 10.1 Execution List.

Bidder List
Label   Quantity   Fill Price   Ratio Adjusted Quantity
B1      100        19.0         100
B2      100        19.5         100
B3      100        20.0         100

Target List
Label   Quantity   Fill Price
T1      50         19.0
T2      150        18.5
T3      150        18.0
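FIGURE 10.1 The Max-Flow Formulation. [Bipartite network: the source feeds the bidder trades (100 at $19.0, 100 at $19.5, 100 at $20.0), edges carrying flows x1 through x6 connect them to the target trades (50 at $19.0, 150 at $18.5, 150 at $18.0), and the target trades drain into the sink.]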
Note that in Figure 10.1 we have explicitly shown the source and the sink to identify the direction of flow, which should be from the source to the sink. Each node represents an execution, and the details of the execution are listed for each node. The maximum capacity of each edge is not shown in the picture but is actually the smaller of the share quantities of the nodes on either end of the edge. The pairing problem is now equivalent to maximizing the flow on this bipartite network. The linear programming formulation3 may be written as
Maximize x1 + x2 + x3 + x4 + x5 + x6
subject to:
x1 ≤ 100   (1)
x2 + x3 ≤ 100   (2)
x4 + x5 + x6 ≤ 100   (3)
x4 ≤ 50   (4)
x2 + x5 ≤ 150   (5)
x1 + x3 + x6 ≤ 150   (6)
x1 ≤ 100   (7)
x2 ≤ 100   (8)
x3 ≤ 100   (9)
x4 ≤ 50   (10)
x5 ≤ 100   (11)
x6 ≤ 100   (12)
x1, x2, . . . , x6 ≥ 0
3 The linear programming formulation here is different from the conventional linear programming representation of the max-flow problem. The interpretation of the dual in this case is a minimum weighted set cover problem. However, since this is not germane to the discussion, we will not pursue this in detail.
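To make the formulation concrete, the network for the executions of Table 10.1 can be built and solved with a generic augmenting-path routine. The sketch below uses a textbook breadth-first (Edmonds-Karp) max-flow, which is not the tailored method described in the appendix; the function and variable names are hypothetical. It also assumes that an edge is drawn when the bidder fill price minus the target fill price is at least the specified spread, which is what yields the six edges x1 through x6.

```python
from collections import deque


def max_flow(capacity, source, sink):
    """Edmonds-Karp max flow; capacity is a dict of dicts of edge capacities."""
    residual = {u: dict(nbrs) for u, nbrs in capacity.items()}
    for u, nbrs in capacity.items():
        for v in nbrs:
            residual.setdefault(v, {}).setdefault(u, 0)  # reverse edges
    total = 0
    while True:
        # Breadth-first search for an augmenting path with spare capacity.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total, residual
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][w] for u, w in path)
        for u, w in path:
            residual[u][w] -= bottleneck
            residual[w][u] += bottleneck
        total += bottleneck


# Executions from Table 10.1: label -> (ratio-adjusted quantity, fill price).
bidder = {"B1": (100, 19.0), "B2": (100, 19.5), "B3": (100, 20.0)}
target = {"T1": (50, 19.0), "T2": (150, 18.5), "T3": (150, 18.0)}
min_spread = 1.00

capacity = {"source": {b: qty for b, (qty, _) in bidder.items()}}
for b, (b_qty, b_price) in bidder.items():
    # Edge capacity is the smaller share quantity of the two executions.
    capacity[b] = {t: min(b_qty, t_qty)
                   for t, (t_qty, t_price) in target.items()
                   if b_price - t_price >= min_spread}
for t, (t_qty, _) in target.items():
    capacity[t] = {"sink": t_qty}

edges = [(b, t) for b in bidder for t in capacity[b]]
flow, residual = max_flow(capacity, "source", "sink")
pairing = {(b, t): capacity[b][t] - residual[b][t] for b, t in edges}
print(flow, pairing)
```

The routine returns one optimal pairing; which one it finds depends on the order in which paths are explored, so the particular `pairing` printed should not be read as the only solution.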
Let us examine the formulation shown here. The objective in the linear program is the sum of the flows across all the edges, and we are rightfully looking to maximize this quantity. Constraints 1–3 state that the total outflow from a bidder node must not be greater than the share quantity of that node. Likewise, constraints 4–6 ensure that the inflow into a target node must not be greater than the share quantity of the node. The constraints 7–12 fix the maximum capacity of every edge. Thus, maximizing the flow along this bipartite network leads us to a solution for our pairing problem. The solution approach to the general max-flow problem has been well researched and a number of algorithms have been proposed to solve it. Also, the pairing problem we address has a special structure. On account of this, the solution method that would work for us is fairly straightforward. A detailed description of both the general method and one tailored to our problem can be found in the appendix. The important point, however, is that the modeling process and the solution method provide us with some insights into the nature of the pairing problem itself. In fact, the solution approach reveals that the answer to the pairing problem is not unique, and it is possible to have a number of optimal pairing configurations for a given set of executions. Looking at Figure 10.2 we can see that there are at least three possible configurations for the given set of executions in Table 10.1, raising the question as to which configuration we should use to measure execution efficiency. It is now easy to appreciate the futility of measuring efficiency by way of pairing. Note, however, that the average achieved spread is a constant value regardless of the pairing configuration. We therefore contend that it is this measure that should form the basis for verifying executions. There is a flip side to looking at just the average spread. The broker may trade away the edge gained on an earlier execution while still ensuring that the average spread criterion is met. To see what we mean, consider the situation with a target spread of $1.00 where the broker has completed about half of the whole order, achieving a spread of $1.05. To satisfy the average criterion, the broker needs to achieve a spread of just $0.95 on the remaining half, which may not be what was intended by the arbitrageur.
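The average achieved spread for the executions of Table 10.1 can be computed directly from the fill prices, without any pairing. The sketch below assumes the spread is defined as exchange ratio times the bidder price, plus the cash amount, minus the target price, and that the averages are share-weighted over all executions (including the 50 excess target shares); the function names are hypothetical.

```python
def average_price(executions):
    """Share-weighted average fill price for a list of (quantity, price) fills."""
    shares = sum(qty for qty, _ in executions)
    return sum(qty * price for qty, price in executions) / shares


def average_spread(bidder_fills, target_fills, ratio=1.0, cash=0.0):
    """Average achieved spread per target share (assumed spread definition)."""
    return ratio * average_price(bidder_fills) + cash - average_price(target_fills)


# Executions from Table 10.1 as (quantity, fill price).
bidder_fills = [(100, 19.0), (100, 19.5), (100, 20.0)]
target_fills = [(50, 19.0), (150, 18.5), (150, 18.0)]

achieved = average_spread(bidder_fills, target_fills)
meets_average_criterion = achieved >= 1.00
print(round(achieved, 4), meets_average_criterion)

# The flip side: half the order at $1.05 and the other half at $0.95
# still averages out to the $1.00 target.
blended = 0.5 * 1.05 + 0.5 * 0.95
```

Because the calculation never refers to a particular matching, the same number comes out whichever of the optimal pairing configurations one has in mind.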
FIGURE 10.2 Redundant Matchings. [Three optimal pairings of the bidder and target trades of Table 10.1. Configuration A: excess 50 shares of target at $18.5. Configuration B: excess 50 shares of target at $19.0. Configuration C: excess 50 shares of target at $18.0.]
In conclusion, it would help to indicate to the broker exactly what the evaluation criterion will be. This will hopefully ensure that the execution is completed as intended. The possibilities are
1. Conservative trading. Ensure that there is at least one way to pair the trade list such that each pair satisfies the spread criterion.
2. Aggressive trading. Ensure that the average spread achieved is the same as or better than the specified spread value; to the extent possible, the speed of execution then matters.
EXECUTION DURING THE PRICING PERIOD In this section, we focus on trading in the context of fixed value exchange transactions. In such transactions, the exchange ratio is determined based on the stock price of the acquiring firm during the pricing period and is gradually revealed as the pricing period unfolds. Since knowledge of the exchange ratio is important to executing a paired transaction, this would imply that we can trade only at the end of the pricing period. However, in some cases, this strict constraint may not be ideal for trading. To appreciate this, let us consider the situation wherein the end of the pricing period is very close to the shareholders’ vote, or even worse, the end of the period is well past the shareholders’ vote (these situations are not uncommon in real life). In such situations, the time interval between the end of the pricing period and the date of deal completion is rather short, leaving very little time to put on a position of substantial size. To further compound the situation, one has to deal with the prospect of rapidly narrowing spreads. Given this, the arbitrageur would prefer not to have to wait until the end of the pricing period to put on a spread position. Therefore, even though the exchange ratio is not known exactly, the arbitrageur would want to begin trading before the end of the pricing period but trade in a way that leaves him or her perfectly hedged (meeting the ratio constraints on the pair) eventually. In this section, we will discuss an approach where we trade during the pricing period and still manage to satisfy the ratio constraint at the end of the period. Consider the case where the exchange ratio is computed on a fixed value of the target stock $p^T$ and the average closing price of the bidder stock in the pricing period. Let the closing prices of the bidder stock be $p_1, p_2, \ldots, p_n$, starting from day 1 to day n, the last day in the pricing period. 
Now, according to the terms of the fixed value exchange transactions, the ratio (number of bidder stock exchanged for one target stock) is given as
$$r = \frac{n\,p^T}{p_1 + p_2 + \cdots + p_n} \tag{10.3}$$
This is the fixed value of target stock divided by the average of the closing prices of the bidder stock in the pricing period. The reciprocal of the exchange ratio r is therefore
$$\frac{1}{r} = \frac{1}{n}\left(\frac{p_1}{p^T} + \frac{p_2}{p^T} + \cdots + \frac{p_n}{p^T}\right) \tag{10.4}$$
Interpreting Equation 10.4, the reciprocal of the exchange ratio is an average of the reciprocals of the realized exchange ratio in the pricing period. We will exploit this idea in our approach to trade during the pricing period. If we decide to put on a total spread position consisting of $n^B$ shares of the acquirer, the number of shares of the target $n^T$ to be bought is given as
$$n^T = \frac{n^B}{r} \tag{10.5}$$
$$n^T = \frac{n^B}{n}\left(\frac{p_1}{p^T} + \frac{p_2}{p^T} + \cdots + \frac{p_n}{p^T}\right) \tag{10.6}$$
We can write this as
$$n^T = n_1^T + n_2^T + \cdots + n_n^T \tag{10.7}$$
where
$$n_i^T = \frac{n^B p_i}{n\,p^T}, \quad i = 1, \ldots, n \tag{10.8}$$
In other words, the formula tells us the amount of target shares to buy on a daily basis for a fixed number of bidder shares such that we would be fully hedged at the end of the pricing period. We can mimic this equation during execution by shorting $n_B/n$ shares of the acquirer and buying $(n_B/n) \times (p_i/p_T)$ shares of the target on the $(i + 1)$th day. At the end of the pricing period plus one more day, we will be perfectly hedged. Alternately, we could start by holding a naked short position on the bidder stock and buy the corresponding number of target stock on each day in the pricing period. Of course, the captured spread would vary depending on the timing of trades of the bidder stock. There is a small caveat in all of this, though. Recall from the introductory chapter that the dollar profit is calculated on a per-target-share basis. However, the total number of target shares that we buy in this case is known only at the end of the pricing period. Thus, it would be reasonable to say that we would know our position size (in target share terms) and expected profits only at the end of the pricing period. Hence, by trading in this fashion we will have replaced our uncertainty on the ratio by the uncertainty on the position size/profits.
Trade Execution
163
Example
Days in pricing period, $n$ = 20
Fixed value of target stock, $p_T$ = $60
Closing price of bidder stock on $i$th day, $p_i$ = $80.5
Total shares of bidder to short, $n_B$ = 100,000
Shares of bidder to short for the next day = $n_B/n$ = 100,000/20 = 5000
Shares of target to buy for the next day = $(n_B/n) \times (p_i/p_T)$ = 5000 × (80.5/60.0) = 6708
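The daily schedule in Equation 10.8 can be sketched in a few lines of Python. This is only an illustration: the function name `daily_target_buys` and the three-day price series are hypothetical, and the sketch ignores rounding to whole shares.

```python
def daily_target_buys(bidder_closes, target_value, total_bidder_shares):
    """Target shares to buy after each bidder close (Equation 10.8).

    bidder_closes: closing prices p_1..p_n of the bidder in the pricing period.
    target_value: fixed value p_T of one target share.
    total_bidder_shares: n_B, total bidder shares shorted over the period.
    """
    n = len(bidder_closes)
    short_per_day = total_bidder_shares / n  # n_B / n bidder shares per day
    return [short_per_day * p_i / target_value for p_i in bidder_closes]


# Hypothetical three-day pricing period, for illustration only.
closes = [80.5, 79.0, 81.25]
buys = daily_target_buys(closes, target_value=60.0, total_bidder_shares=15_000)

# The final exchange ratio (Equation 10.3) is known only after the last close.
ratio = len(closes) * 60.0 / sum(closes)

# The accumulated target position equals n_B / r, so the pair ends up hedged.
assert abs(sum(buys) - 15_000 / ratio) < 1e-6
```

Calling the function with the example's inputs (a close of $80.5 on one of 20 days, 100,000 bidder shares) gives about 6708 target shares for that day.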
Bounds on the Position Size It was noted in the last section that when trading during the pricing period, an arbitrageur has to be willing to replace the uncertainty in exchange ratio with an uncertainty in position. However, there is still an interest in evaluating some bounds on the position size of the target shares. In this section, we derive some bounds on the ratio and show how it can be converted to bounds on the position size. To see how we can estimate bounds on the ratio, consider the fact that the average of the closing prices of the stock is always between the maximum and the minimum prices. Thus, a very simple bound could be based on that. If we assume a log-normal process for the price movement of the bidder stock, that is, the logarithm of the prices executes a Brownian motion, the probability distribution for the maximum of a Brownian motion in time t is given as

$$ F_{\max}(x) = 2\,\Phi\!\left(\frac{x}{\sigma\sqrt{t}}\right) - 1, \quad x \ge 0 \qquad (10.9) $$
where $\sigma$ in the equation is the volatility of the Brownian motion and $\Phi$ is the cumulative distribution function of the standard normal distribution. The formula represents the probability that the Brownian motion will have a maximum value less than or equal to $x$ in the time duration $t$. By symmetry we can expect the distribution of the minimum to be a flipped version of the distribution of the maximum for values less than 0. A plot of the distribution functions for the maximum and minimum is shown in Figure 10.3. Reading the graph, one can say that the maximum value of the Brownian motion is less than the standardized value of 2.0 with 95-percent accuracy, the standardized value being $x/(\sigma\sqrt{t})$. Thus, we can say with 95-percent confidence that the average value of the Brownian motion will be between $x = -2\sigma\sqrt{t}$ and $x = 2\sigma\sqrt{t}$. We now combine the preceding information with the fact that the mean of a sample is no greater than the maximum of the sample. The maximum can now be used as an upper bound for the mean. In a similar manner, the minimum serves as a lower bound for the mean. The bounds on the reciprocal ratio $1/r$ can quickly be translated into bounds for the target share quantity by multiplying them by the bidder shares quantity (Equation 10.5). Thus, an estimate of the maximum and minimum price during the pricing period may be used to calculate the bounds on the exchange ratio.

FIGURE 10.3 Distribution of the Max and Min of Brownian Motion. [Plot of cumulative probability (0.0 to 1.0) against the standardized variable (–3 to 3), with one curve for the distribution of the maximum and one for the distribution of the minimum.]

We, however, believe that we could obtain a tighter bound than the one just shown. To see that, let us consider the average of $n$ numbers $x_1, x_2, \ldots, x_n$. Now let the upper bounds for each of the numbers be $b_1, b_2, \ldots, b_n$. Then

$$ \frac{x_1 + x_2 + \cdots + x_n}{n} \le \frac{b_1 + b_2 + \cdots + b_n}{n} \qquad (10.10) $$
Saying this in words, the upper bound for the average is the average of the upper bounds. It is easy to see that the same argument is also true for lower bounds. Exploiting this idea, we compute the bounds on the maximum and minimum for each day in the pricing period and use the average of the bounds as the bound for the average. This is illustrated in the following example.
Example Consider the bidder stock selling at $80, 15 days into the pricing period consisting of a total of 20 days. The realized average price in the 15 days for the bidder stock is $79. The volatility of the bidder stock on an annualized basis is 32 percent. The value of a target share is fixed at $60. We plan on shorting 100,000 bidder shares. So, what are the bounds on the number of target shares that we would end up holding in the end?
Step 1: List the given information.
The daily volatility, $\sigma = 32/\sqrt{252}$ = 2.01%
Number of days to end of pricing period, $N$ = 5 days
Current average price, $avg_c$ = $79
Current closing price, $p_{close}$ = $80
Total days in pricing period, $T$ = 20 days
Fixed value of target stock, $p_{tgt}$ = $60.0
Number of bidder shares, $n_B$ = 100,000 shares
Step 2: We now compute the individual bounds in Table 10.2.

TABLE 10.2 Bounds on Size of Target Position.

Column definitions: Days = $n$; Scaling Value = $\sqrt{n}$; Incremental Bound = $2\sigma\sqrt{n}$; Log of Upper Bound = $\log(p_{close}) + 2\sigma\sqrt{n}$; Log of Lower Bound = $\log(p_{close}) - 2\sigma\sqrt{n}$; Price Upper Bound = $e^{\log(UpBound)}$; Price Lower Bound = $e^{\log(lowBound)}$.

Days   Scaling   Incremental   Log of Upper   Log of Lower   Price Upper   Price Lower
       Value     Bound         Bound          Bound          Bound         Bound
1      1.000     0.0402        4.4222         4.3418         83.2815       76.8478
2      1.414     0.0569        4.4389         4.3252         84.6798       75.5788
3      1.732     0.0696        4.4517         4.3124         85.7687       74.6192
4      2.000     0.0804        4.4624         4.3016         86.6976       73.8198
5      2.236     0.0899        4.4719         4.2921         87.5243       73.1225
Step 3: Calculations
The average price upper bound for the five days, upperAvg = 85.5904
The average price lower bound for the five days, lowerAvg = 74.7976
The final price upper bound,

$$ p_{upper} = \frac{(T - N)\,avg_c + N \cdot upperAvg}{T} = 80.6476 $$

The final price lower bound,

$$ p_{lower} = \frac{(T - N)\,avg_c + N \cdot lowerAvg}{T} = 77.9494 $$

Upper bound on number of target shares

$$ n_B\,\frac{p_{upper}}{p_{tgt}} = 100000 \times \frac{80.6476}{60.0} = 134{,}413 \text{ shares (with 95% confidence)} $$

Lower bound on number of target shares

$$ n_B\,\frac{p_{lower}}{p_{tgt}} = 100000 \times \frac{77.9494}{60.0} = 129{,}916 \text{ shares (with 95% confidence)} $$
Note that in the preceding example, a volatility assumption is made for the bidder stock. Therefore, a realistic value for the volatility leads to good estimates. A reasonable approach to determining the volatility to use is to look at the implied volatility of the options on the bidder stock and choose the volatility corresponding to the contract with the most appropriate strike and expiration.
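The three steps of the example can be reproduced with a short Python sketch. The function name `target_share_bounds` and its parameter names are hypothetical. The sketch takes the daily volatility directly (the example's rounded value of 2.01%) and uses the two-standard-deviation band as its 95-percent level.

```python
import math


def target_share_bounds(p_close, avg_so_far, days_left, total_days,
                        daily_vol, target_value, bidder_shares, z=2.0):
    """Return (lower, upper) bounds on the final target share count.

    Follows the worked example: a per-day price band
    p_close * exp(+/- z * sigma * sqrt(k)) for k = 1..days_left,
    averaged and blended with the average realized so far.
    """
    steps = range(1, days_left + 1)
    upper = [p_close * math.exp(z * daily_vol * math.sqrt(k)) for k in steps]
    lower = [p_close * math.exp(-z * daily_vol * math.sqrt(k)) for k in steps]
    upper_avg = sum(upper) / days_left
    lower_avg = sum(lower) / days_left

    days_done = total_days - days_left
    p_upper = (days_done * avg_so_far + days_left * upper_avg) / total_days
    p_lower = (days_done * avg_so_far + days_left * lower_avg) / total_days

    # Target shares = bidder shares * (average bidder price / target value).
    return (bidder_shares * p_lower / target_value,
            bidder_shares * p_upper / target_value)


lo, hi = target_share_bounds(p_close=80.0, avg_so_far=79.0, days_left=5,
                             total_days=20, daily_vol=0.0201,
                             target_value=60.0, bidder_shares=100_000)
print(round(lo), round(hi))
```

With these inputs the bounds come out at roughly 129,900 and 134,400 target shares. Passing a different `daily_vol`, for instance one backed out of option implied volatility, shifts both bounds accordingly.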
SHORT SELLING As discussed in the introduction, the uptick rule makes it harder to short the stock. Until some time ago, a trading technique called married puts allowed the traders to sidestep the rules that prevented short sales when a stock’s price was falling steadily. The strategy involved the following steps. First, buy a deep-in-the-money put with the closest maturity date. Then, tender for
the shares. Now we are theoretically long the stock and can therefore sell it without waiting for the uptick. This loophole has, however, been detected and is now disallowed by the SEC. In another approach to circumvent the uptick rule, one may also enter into a stock swap, receiving the acquirer’s stock in exchange for cash. Now we are long the stock and can therefore sell the stock without the uptick rule restriction. To settle, we buy the acquirer’s stock in the market and return it or pay the stock returns. In turn, we would receive LIBOR with a haircut. Notice that this is similar to short selling except that we pay a premium for the stock.
SUMMARY The arbitrageur normally executes the paired transaction through a broker. Verifying the executions by pairing them is an approach fraught with inconsistencies. It makes sense for the arbitrageur to insist on a firm average spread or better. It is possible to put on a spread position during the pricing period and be perfectly hedged. In such situations, however, the exact position size of the target stock is gradually revealed. Putting on a spread position typically involves a short sale and must be executed in accordance with the uptick rule.
FURTHER READING MATERIAL On Linear Programming and Network Flows Chvátal, Vašek. Linear Programming. (New York: Freeman, 1983).
APPENDIX DINIC’S ALGORITHM FOR MAXIMUM FLOW IN A NETWORK In this section, we describe Dinic’s algorithm for maximum flow on a network. The following are some additional concepts used in the description of the algorithm.
Blocking Flow A flow on a network is a blocking flow if every path from source to sink contains a saturated edge (edge with full capacity flow through it). An example of a blocking flow is shown in Figure 10.4. Also note that the flow is not maximum. The optimal flow is shown in Figure 10.4.
Residual Graph Associated with every feasible flow through the network is a weighted digraph called the residual graph. This is constructed in the following manner. Let C be the capacity of the edge connecting vertices A and B. Let F be the flow from A to B. The residual graph has a directed edge from A to B with capacity C – F and another from B to A with capacity F. All edges with zero capacity are removed from the graph thus constructed. The resultant graph is the residual graph. An example of the residual graph corresponding to the problem and the blocking flow is shown in Figure 10.4.
Flow Augmenting Paths A flow augmenting path is a path on the residual graph from source to sink. The maximum flow that can be sent through this path is limited by capacity of the edges in the path. It is equal to the smallest capacity of an edge along the path.
Dinic’s Algorithm The steps in Dinic’s Algorithm are as follows:
1. Find a blocking flow for a given network.
2. Construct the residual graph corresponding to the current flow.
3. Find the shortest length augmenting path on the residual graph. (The length of the path is equal to the number of edges in it.) If such a path does not exist, then the current flow is the maximum flow.
4. If such a path exists, then augment the flow along the path. (Make the corrections to the flow graph and the residual graph.) Go to Step 3.
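Steps 2 through 4 can be illustrated with a small Python sketch. It is a simplification and not a full implementation of Dinic's algorithm: it starts from zero flow instead of a blocking flow, and it repeatedly augments along a shortest path found by breadth-first search on the residual graph. The function name `max_flow` and the five-edge network are hypothetical and are not the network of Figure 10.4.

```python
from collections import deque


def max_flow(capacity, source, sink):
    """Maximum flow by repeated shortest augmenting paths.

    capacity: dict mapping (u, v) edge tuples to non-negative capacities.
    """
    # Residual graph: forward edges carry the remaining capacity,
    # reverse edges carry the flow that could be cancelled.
    residual = {source: {}, sink: {}}
    for (u, v), cap in capacity.items():
        residual.setdefault(u, {})
        residual.setdefault(v, {})
        residual[u][v] = residual[u].get(v, 0) + cap
        residual[v].setdefault(u, 0)

    total = 0
    while True:
        # Breadth-first search finds a path with the fewest edges.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total  # no augmenting path: the current flow is maximum

        # Recover the path and its smallest residual capacity.
        path = []
        v = sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][v] for u, v in path)

        # Augment the flow and correct the residual graph.
        for u, v in path:
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
        total += bottleneck


# Hypothetical network, for illustration only.
network = {("s", "a"): 3, ("s", "b"): 2, ("a", "b"): 1,
           ("a", "t"): 2, ("b", "t"): 3}
print(max_flow(network, "s", "t"))
```

In this network the two edges leaving the source have a combined capacity of 5, and the sketch finds a flow of that value.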
FIGURE 10.4 Max-Flow Algorithm. [Four panels showing a network between a source (Src) and a sink, with edge labels in units: The Max-Flow Problem, Blocking Flow, Residual Graph, Optimal Solution.]
LAZY ALLOCATION ALGORITHM Dinic’s algorithm is required for a general network. We hasten to add that the matching problem we are required to solve has some additional properties. To recognize that, we organize the trades on the stock being sold in ascending order (group A) and the trades on the stock being bought in descending order (Group B). Thus, each trade has a rank associated with it. Also, if there is a feasible edge between the ith vertex in group A and the jth vertex in group B, then there is a feasible edge between the ith vertex and all the vertices in group B with rank greater than j. Similarly, there is a feasible edge between the jth vertex in group B and all the vertices in group A with rank less than i. This is illustrated in the ordering of the nodes in Figure 10.2. To perform lazy allocation, we start from the vertex with the lowest rank in group A and saturate the edge connecting it to the lowest possible rank in group B. When the entire trade quantity from the vertex is accounted for, we move on to the next vertex and do the same and work our way to the last vertex. The consequence of the lazy allocation is that on a graph with this structure, there is no flow augmenting path on the associated residual graph, and the allocation is optimal. An example of such an allocation method is demonstrated in Figure 10.2, configuration B. Although there may not be any flow augmenting paths, there may be an alternate flow routing on the residual graph, such that the flow along the directed edge is zero. This implies that there could potentially be redundant optimal routings.
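The lazy allocation pass can be sketched as follows. The rule that decides which pairs of trades are feasible is defined earlier in the chapter and is not reproduced here, so the sketch takes it as an input: `first_feasible[i]` is a hypothetical array giving the lowest rank in group B that vertex i of group A may be matched with, and it is assumed to be non-decreasing in i, as the nesting property above implies. All names and quantities are illustrative.

```python
def lazy_allocation(qty_a, qty_b, first_feasible):
    """Greedy matching of group A trades against group B trades.

    qty_a, qty_b: trade quantities listed in rank order.
    first_feasible[i]: lowest group B rank feasible for vertex i of
        group A, or None if vertex i has no feasible edge (assumed input).
    Returns a list of (i, j, quantity) allocations.
    """
    remaining_b = list(qty_b)
    allocation = []
    for i, qty in enumerate(qty_a):
        j = first_feasible[i]
        if j is None:
            continue
        # Saturate the lowest-ranked feasible edge first, then move up.
        while qty > 0 and j < len(remaining_b):
            fill = min(qty, remaining_b[j])
            if fill > 0:
                allocation.append((i, j, fill))
                remaining_b[j] -= fill
                qty -= fill
            if remaining_b[j] == 0:
                j += 1
    return allocation


# Hypothetical quantities: two trades in group A, three in group B.
result = lazy_allocation(qty_a=[100, 200], qty_b=[150, 100, 50],
                         first_feasible=[0, 1])
print(result)
```

Here the second trade in group A can only be matched with the last two trades in group B, which hold 150 shares between them, so the 250 shares allocated are the most that can be matched in this small case.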
CHAPTER
11
The Market Implied Merger Probability
INTRODUCTION The spread value in a merger deal is a measure of the profit potential of a trade. Knowledgeable players in the marketplace are likely to carefully assess the profit potential and inherent risks and put on a position according to their judgements. If the value of the spread is large and the risks inherent in
the successful completion of the merger are small, then we can expect that the players would put on a large position causing the spread to narrow. If, however, the value of the spread narrows disproportionate to the risks, then we can expect some profit taking causing the spread to widen. If there are sufficient players in the marketplace engaging in risk arbitrage, the spread would then in some sense represent the consensus estimate of the risk involved in the deal and thereby the odds of successful completion of the merger. Therefore, it makes sense to construct models to estimate the odds of merger success taking the value of the spread into account. In this chapter, we discuss a method to assess the probabilities of merger as reflected by the spread between the stock prices of the merging companies. The method is based firmly on the classical results of the Arrow-Debreu theory of contingent claims. The probabilities derived are called risk neutral probabilities. Risk neutral probabilities calculated for a merger are in many ways similar to the implied volatility parameter encountered in option pricing. For purposes of option pricing, stock price movement is modeled as a lognormal process. One of the important parameters of the log-normal process is its standard deviation. This standard deviation is commonly termed volatility. The volatility of the underlying stock plays a crucial role in the pricing of an option. Conversely, the price of an option quoted in the marketplace implies a certain volatility that when plugged into the option pricing model results in the quoted option price. This volatility implied by the option price is called the implied volatility. In an analogous fashion, the probability of merger success as implied by the observed spread may be calculated. Note that both implied volatility and the merger probability specify probability distributions. 
Furthermore, they are both calculated from values observed in the marketplace, and therein lies the similarity between the two constructs. A key question, however, pertains to the premise of the chapter. What is the value in knowing the market implied probability of merger? Well, if one has a view based on independent research that the true chance of merger success is better than the implied probability, then entering into a trade may have a good risk–reward profile. Thus, knowing the probability of merger can provide value in making investment decisions. The implied probability of deal success can also prove useful in the area of risk management. The risk manager is typically charged with assessing the risk in multiple portfolios running multiple strategies. Unlike the risk arbitrage traders, risk managers are not disposed to know the intimate details of every deal in the risk arbitrage portfolio. Using the market-implied probabilities, the risk manager will be able to design meaningful value at risk (VAR) measures for such deals. The VAR plays a key role in determining the
reserve capital requirements for a bank. The requirements for reserve capital are designed to help markets survive extreme conditions by making sure (to some extent) that the counterparties in a trade have enough reserve capital to meet their obligations. Ideally, we would like the reserve capital to be the proverbial Goldilocks value, not too little and not too much. If the risk measurement is too conservative, then we will need to post more reserves, leading to underutilization of capital. If it is too aggressive, then we may not have enough reserves to meet obligations during extreme moves. We will propose a practical value at risk measurement approach for risk arbitrage deals, based on the merger probabilities. The chapter is organized as follows. First, we discuss briefly the Arrow-Debreu theory, which forms the basis for our probability measure. We then describe the single-step model for measuring the probability of merger success. The single-step model is then extended to multiple steps. Subsequently, we reconcile the proposed theory with practice. This is followed by an example application to risk management.
IMPLIED PROBABILITIES AND ARROW-DEBREU THEORY The purpose of this section is not so much to provide a formal description of the Arrow-Debreu theory as much as to provide a flavor for it. Let us consider the scenario that involves placing bets on a set of outcomes. Examples of such events could be a boxing match or a horse race. In these cases, the set of outcomes is finite and well defined. We will use the horse race example for purposes of illustration. Important to the discussion is the notion of betting. If, for example, the bet is placed in favor of a horse and it wins the race, then the reward is the payoff from the bet. If it happens to lose, then here, too, the reward is the payoff from the bet, except that the payoff is probably zero dollars. Thus, a bet is completely defined when we specify the payoff for every possible outcome. To place a bet, one has to put up the stake money. This is specified by the bookie. The Arrow-Debreu theory states that the full and complete specification of bets with the stake money and the payoff for each outcome automatically implies a probability for a particular outcome.1 Additionally, the stake
1 The reasoning stems from the maximization of a linear utility function resulting in a linear program with constraints. The weights happen to be the values of the dual variables in the solution of the linear program. This work by Arrow and Debreu was awarded the Nobel Prize in Economics.
money must be the probability weighted payoff (also called the expected payoff) across all outcomes. Also, if it so happens that a single set of probability weights is not able to account for the entire set of bets, then arbitrage opportunities exist. According to the theory, the probabilities are derived such that any two bets with the same expected payoff have the same current value. It may be that one of the two bets yields the expected payoff almost certainly and the risk associated with it is minimal when compared to the other bet. This scheme, however, treats both bets on an equal footing; that is, we are neutral to risk. For this reason, the set of probabilities that are implied by the definition of the bets are called risk neutral probabilities. Continuing with the horse race example, let us say that the odds given by the bookie for the horse race are as follows: 3 to 5 in favor of NiceAndEasy, 2 to 3 in favor of WindSlicer, and 1 to 2 in favor of ButterBiscuit. For instance, a successful bet of one dollar on ButterBiscuit returns the stake plus two dollars, which is a total of three dollars. The terms of the bets are presented in Table 11.1. Note that the first scenario described in Table 11.1 is the risk-free scenario where a deposit of x dollars with the bookie results in a payoff of one dollar no matter which horse wins. According to Arrow-Debreu theory, the bet amount must be a weighted combination of the payoffs. If the probability weights for each of the three horses winning are denoted as $p_{ne}$, $p_{ws}$, $p_{bb}$, then the following equations as given in matrix form must hold.

$$ \begin{pmatrix} 1 & 1 & 1 \\ 8 & 0 & 0 \\ 0 & 5 & 0 \\ 0 & 0 & 3 \end{pmatrix} \cdot \begin{pmatrix} p_{ne} \\ p_{ws} \\ p_{bb} \end{pmatrix} = \begin{pmatrix} x \\ 3 \\ 2 \\ 1 \end{pmatrix} $$

TABLE 11.1 Terms of the Bet.

                                       Payoff from the Bets
Bet scenario            Bet amount   NiceAndEasy wins   WindSlicer wins   ButterBiscuit wins
Risk-Free Scenario      x            1                  1                 1
Bet on NiceAndEasy      3            8                  0                 0
Bet on WindSlicer       2            0                  5                 0
Bet on ButterBiscuit    1            0                  0                 3
or specifically

$p_{ne} + p_{ws} + p_{bb} = x$
$p_{ne} = 3/8 = 0.375$
$p_{ws} = 2/5 = 0.4$
$p_{bb} = 1/3 = 0.333$

The value of x from the preceding equations is therefore $p_{ne} + p_{ws} + p_{bb} = 1.108$. So, if we deposit close to a dollar and 11 cents with the bookie, we will get back a dollar. The loss for the bettor in this enterprise is therefore 100 × (0.108/1.108), which is approximately 9.7 percent. In other words, on average the bookie gets to keep 10 cents on every dollar that is deposited as stakes. Normalizing the probability weights to add up to one, we now have

Probability of NiceAndEasy winning = 0.375/1.108 = .338
Probability of WindSlicer winning = 0.4/1.108 = .361
Probability of ButterBiscuit winning = 0.333/1.108 = .301

Thus, WindSlicer is favored to win the race, with ButterBiscuit being the underdog. Note that we started our example with the specification of the bets and their odds and have now derived the probabilities from it. Are the probabilities the true probabilities for the outcome of the race? Maybe, then again maybe not. This is, however, where the bookie will allow the bet to be made. In the case of the markets, unlike the example here, the price/stake amount of a bet is decided by the auctioning process. The price, therefore, represents the consensus opinion of the participants. In such situations it may be argued that the risk neutral probabilities represent the consensus of the market.
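The horse race arithmetic is easy to check in Python. In this sketch the function name `implied_probabilities` and the dictionary layout are hypothetical. Each bet is given as a (stake, payoff if the horse wins) pair taken from Table 11.1.

```python
def implied_probabilities(bets):
    """Risk neutral probabilities implied by single-outcome bets.

    bets: dict mapping an outcome to (stake, payoff if that outcome occurs).
    Returns the raw weights, the cost x of a sure one-dollar payoff, and
    the weights normalized to sum to one.
    """
    # Each bet pays off in exactly one outcome, so stake = weight * payoff.
    weights = {name: stake / payoff for name, (stake, payoff) in bets.items()}
    x = sum(weights.values())
    probabilities = {name: w / x for name, w in weights.items()}
    return weights, x, probabilities


bets = {"NiceAndEasy": (3, 8), "WindSlicer": (2, 5), "ButterBiscuit": (1, 3)}
weights, x, probs = implied_probabilities(bets)

for horse, p in probs.items():
    print(f"{horse}: {p:.3f}")
```

The normalized values agree with the three probabilities listed above to three decimal places, and `x` comes out at about 1.108.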
THE SINGLE-STEP MODEL We are now ready to formulate the single-step model to calculate the probabilities implied by the spread. It is, however, good to remind ourselves of the implicit assumptions we make during the modeling process. First, putting on a spread involves a short position and we need to borrow stock. At times it may turn out that a particular stock is unavailable for borrow. However, for purposes of our model, we will assume that the stock is always available to borrow. Additionally, we also assume that there are no liquidity issues and that it is possible to put on a spread position in size at the current spread level observed in the market. This would imply that the transaction costs if any are negligible and may be assumed to be zero.
To construct our model, let us now go through the mechanics of a risk arbitrage trade. Initially, a spread position is created. We do this by establishing a long position in the target and a short position in the bidder, yielding a dollar value equal to the spread. The proceeds from the trades are reinvested at the risk-free rate. Let us call this spread at time zero $S_0$. Upon successful completion of the deal, the spread would converge to zero. The position is reversed for no additional cost at time $T$, and the reward earned by the investor will be $e^{rT} S_0$, with $r$ being the interest rate. In case there is some cash paid out for the target shares, the payoff will be $e^{rT} S_0 + cash$. However, if the deal ends in a failure, the spread will not converge to zero as expected. Instead, it inflates to a value of, say, $S_T$. The position will still need to be reversed, and a cost equal to the spread at time $T$, $S_T$, is incurred. The net payoff in the event of deal failure is therefore $e^{rT} S_0 - S_T$. The discussion of the scenarios here is illustrated in a state diagram as shown in Figure 11.1.
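FIGURE 11.1 Single-Step Model. [State diagram: from the current spread $S_0$, one branch with probability $\pi_{success}$ leads to the payoff on success, $e^{rT} S_0 + cash$, and the other branch with probability $\pi_{failure}$ leads to the payoff on deal break, $e^{rT} S_0 - S_T$.]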
The state diagram in Figure 11.1 comes with a few caveats. Very often, the fact that the deal is going to be unsuccessful is known much before the time it would take for successful completion, in which case, treating both the payoffs as if they are known at the same time would not be appropriate. For our purposes, however, we choose to reflect this difference in timing in the value of ST while keeping T a constant. Later on in the multistep model formulation, we will eliminate the need to estimate ST. Therefore, keeping T the same for both the success and failure scenarios does not affect the outcome of the model. Recall that the initial cost of setting up the trade was zero. By the no arbitrage condition the expected payoff is also zero. Writing out the equations, we have
$$ \pi_{success}\,(e^{rT} S_0 + cash) + \pi_{failure}\,(e^{rT} S_0 - S_T) = 0 \qquad (11.1) $$

$$ \pi_{success} + \pi_{failure} = 1 \qquad (11.2) $$

where $\pi_{success}$ and $\pi_{failure}$ are the risk-neutral probabilities of successful merger and failed merger, respectively. Solving the two equations, we have

$$ \pi_{failure} = e^{rT}\,(S_0 + e^{-rT} cash) \,/\, (S_T + cash) \qquad (11.3) $$
Note here that in the derivation of the one-step model we apply the Arrow-Debreu ideas to the spread. For the curious mind, the same results may be obtained by applying the Arrow-Debreu ideas to the individual stocks in question, also. The derivation of the formulas for the model using the individual stocks is presented in the appendix. In Equation 11.3 for failure probability, all the values are known or observable except for ST, the spread in the future if the deal happens to break. The value for this spread is anybody’s guess. Unless a reasonable value for the deal break spread is known, the one-step model as it stands is of not much use. We deal with this issue in the multistep model.
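Equation 11.3 translates directly into code. The sketch below is illustrative only: the function name `failure_probability` and all input values are hypothetical, and the deal break spread has to be supplied by the user, which is exactly the weakness of the one-step model discussed here.

```python
import math


def failure_probability(s0, s_break, cash=0.0, r=0.0, T=0.0):
    """Risk neutral probability of deal failure (Equation 11.3).

    s0: current spread S_0; s_break: assumed spread S_T on deal break;
    cash: cash paid per target share; r: interest rate; T: time to close.
    """
    # e^{rT} (S_0 + e^{-rT} cash) simplifies to e^{rT} S_0 + cash.
    return (math.exp(r * T) * s0 + cash) / (s_break + cash)


# Hypothetical inputs: a $2 spread today that would widen to $10 on a break.
r, T = 0.05, 0.5
pi_fail = failure_probability(s0=2.0, s_break=10.0, cash=1.0, r=r, T=T)
pi_success = 1.0 - pi_fail

# Check the no arbitrage condition of Equation 11.1.
growth = math.exp(r * T)
expected = pi_success * (growth * 2.0 + 1.0) + pi_fail * (growth * 2.0 - 10.0)
assert abs(expected) < 1e-12
```

The result is a valid probability only when the assumed break spread is at least $e^{rT} S_0$; a smaller guess pushes the value above one.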
THE MULTISTEP MODEL The multistep model relates the changes in the risk-neutral probability to the dynamics of the spread movement. In this case, it is crucial to have an estimate of deal break probability on the eve of the announcement. If we are able to assess the probability of successful merger on the eve of the announcement, the value may be updated periodically based on the spread dynamics. This eliminates the need to guess at the deal break spread and thus mitigates the problem of the one-step model. We now proceed to describe the model.
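FIGURE 11.2 Multistep Model. [State diagram over the interval from 0 to $T$: at times $t_1$ and $t_2$ the spreads are $S_{t_1}$ and $S_{t_2}$; the success branches, with probabilities $\pi^{t_1}_{success}$ and $\pi^{t_2}_{success}$, lead to Spread = 0, and the failure branches, with probabilities $\pi^{t_1}_{failure}$ and $\pi^{t_2}_{failure}$, lead to $S_T$.]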
Let $t_1$ and $t_2$ be points in the interval $[0,T]$ and $t_2 > t_1$. We denote the probabilities of failed merger at times $t_1$ and $t_2$ by $\pi^{t_1}_{failure}$, $\pi^{t_2}_{failure}$ and the spreads by $S_{t_1}$, $S_{t_2}$, respectively. See Figure 11.2. Applying the one-step model, we have

$$ \pi^{t_1}_{failure} = e^{r(T - t_1)}\,(S_{t_1} + e^{-r(T - t_1)} cash) \,/\, (S_T + cash) \qquad (11.4) $$

$$ \pi^{t_2}_{failure} = e^{r(T - t_2)}\,(S_{t_2} + e^{-r(T - t_2)} cash) \,/\, (S_T + cash) \qquad (11.5) $$

Dividing one by the other, we have

$$ \pi^{t_1}_{failure} \,/\, \pi^{t_2}_{failure} = e^{r(t_2 - t_1)}\,(S_{t_1} + e^{-r(T - t_1)} cash) \,/\, (S_{t_2} + e^{-r(T - t_2)} cash) \qquad (11.6) $$
To get a better feel for what the equation signifies, let us consider the situation of a pure stock-for-stock deal without a cash component. Setting the cash to zero, the equation in this case reduces to
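$$ \frac{\pi^{t_2}_{failure}}{\pi^{t_1}_{failure}} = e^{r(t_1 - t_2)}\,\frac{S_{t_2}}{S_{t_1}} \qquad (11.7) $$

Taking logarithms on both sides, we now have

$$ \log\left(\pi^{t_2}_{failure}\right) - \log\left(\pi^{t_1}_{failure}\right) = -r\,(t_2 - t_1) + \log\left(S_{t_2}\right) - \log\left(S_{t_1}\right) \qquad (11.8) $$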
Let us see how this equation may be interpreted. First, when $t_2 - t_1$ is a small quantity, the above becomes a difference equation. The left-hand side is the difference in the logarithms of the probabilities. The negative logarithm of probabilities can be interpreted as a measure of information content. Therefore, the left-hand side may be interpreted as the change in information content between the times $t_1$ and $t_2$. The right-hand side has a term consisting of the difference in the logarithm of the spreads. This is, in fact, the unrealized profit per target share. The other term on the right-hand side represents the risk-free return. Therefore, the equation may be interpreted as saying that any return in excess of the risk-free rate is equal to the change in the information content, that is, the reduction in the uncertainty of deal break. Once we have the initial probability value, that is, $\pi^{0}_{failure}$, all the probabilities in the interval $[0,T]$ can be evaluated using the difference Equation 11.8. We have thus eliminated the requirement to make a guess at the spread value in case of deal break. This is replaced with the assessment of the initial deal break probability. As a practical matter, the risk-free rate values in Equations 11.7 and 11.8 are rather negligible. We can therefore set r = 0 in our calculations.
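The updating rule of Equation 11.8 can be sketched as a short loop over a series of observed spreads. The function name `failure_probability_path`, the step size `dt`, and the sample spreads are all hypothetical. The sketch covers only the stock-for-stock case with positive spreads, since the logarithm is undefined otherwise.

```python
import math


def failure_probability_path(spreads, initial_failure_prob, r=0.0, dt=1 / 252):
    """Propagate the deal break probability with Equation 11.8.

    spreads: observed positive spreads S_0, S_1, ... at equally spaced times.
    initial_failure_prob: assessed probability of deal break at time 0.
    r: risk-free rate; dt: spacing between observations, in years.
    """
    log_p = math.log(initial_failure_prob)
    path = [initial_failure_prob]
    for prev, cur in zip(spreads, spreads[1:]):
        # log(pi_t2) - log(pi_t1) = -r (t2 - t1) + log(S_t2) - log(S_t1)
        log_p += -r * dt + math.log(cur) - math.log(prev)
        path.append(math.exp(log_p))
    return path


# Hypothetical narrowing spread and an initial break probability of 20%.
path = failure_probability_path([2.0, 1.5, 1.0, 0.5], initial_failure_prob=0.2)
print([round(p, 3) for p in path])
```

With r = 0 the failure probability simply scales with the spread, so in this case a spread that falls to a quarter of its initial level takes the implied break probability from 20 percent to 5 percent.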
LOGARITHMS AND INFORMATION THEORY The interpretation of the negative logarithm of probabilities as the information content was originally proposed by Claude E. Shannon in his groundbreaking article “A Mathematical Theory of Communication,” published in 1948.2 Shannon was then working at Bell Laboratories. The powerful ideas describing the ways to measure rates of information flow very soon became a discipline in its own right called information theory. As a matter of fact, the word bit (as in bits per second) that is so commonly used today is attributed to Shannon.
2 “A Mathematical Theory of Communication,” Bell System Technical Journal 27 (July and October 1948): 379–423; 623–656.
The initial deal break probability can be naively estimated using a number of robust statistical methods based on past data relating to deal announcements and successful completions. Alternately, an assessment of the fundamentals of the two merging companies can be used to arrive at the deal break probability. Yet another approach could be to base it simply on the instantaneous reduction of the spread on deal announcement. The average of the logarithm of the spread for the 10 days just prior to deal announcement could be used as the logarithm of the spread corresponding to a failure probability of 1.
RECONCILING THEORY AND PRACTICE To see how the theory bears out in practice, let us look at some implications of the preceding model. According to the theory, the expectation is that the spread will converge to zero in the case of successful completion of merger. Alternately, we expect it to blow up or widen in case the merger does not go through. We present both examples and counterexamples. In cases where the model might not hold, we also provide possible reasons as to why that would be the case. All the examples presented are taken from the merger boom period of the late 1990s. Also, the key value in the theory relates to the logarithm of the spread. If the logarithm of the spread goes down in a linear fashion, then the spread goes down in an exponential fashion. As a matter of fact, in the case of mergers with no glitches along the way, it is not unreasonable to expect the logarithm of the spread to decrease in a linear fashion. We could therefore expect the spread to exponentially approach zero as the merger date nears. Now when the spread goes to zero, the logarithm of the spread goes to negative infinity. In practice, we may assume a lower bound to the value of the spread between the two stocks to be at least equal to or greater than the bid-ask spread between them. At best, the spread could go to a penny. The logarithm of the spread for a penny is –4.6. That would be about the lowest value that we could expect for the logarithm of the spread. With that said, let us look at some real-life deals. We start by looking at some mergers where the spread behavior follows the model. Figure 11.3 shows examples where the theory holds. Figure 11.3a is a plot of the spread for a successful merger. The bidder in this case was Newell Company, and the target was Rubbermaid Corporation. The ratio for exchange was 0.7883. The deal was announced on October 21, 1998, and completed March 24, 1999.
Notice from the figure that it is conceivable that we could fit an exponential to the overall profile of the spread. Figure 11.3b is a plot of the spread of an unsuccessful merger. The bidder was American Home Products (AHP), and the target was Monsanto Corp. (MTC). The exchange ratio was 1.15. The deal was announced June 1, 1998. It became obvious that the deal was a failed deal around October 13, 1998. The plot of the spread in the graph is for these dates. Notice that the spread does not converge at all. This is indicative of the risk involved in this particular merger. When it becomes obvious that the deal will not go through, the spread in fact blows up.
FIGURE 11.3A Successful Merger (NWL-RBD). Axis label: Spread on Close.
FIGURE 11.3B Unsuccessful Merger (AHP-MTC). Axis label: Spread on Close.
We also present cases where the theory does not bear out. Figure 11.4a is a plot of an inverted spread. The bidder in this case was Venator Group (Z). The target was Sports Authority (TSA). The exchange ratio was 0.8. The deal was announced May 7, 1998, and it became apparent that the deal was unsuccessful around September 10, 1998. Note that the spread in this case takes on negative values. The logarithm of the spread in this case is undefined. It is not possible to apply the theory here. Such a situation is possible if there are not enough arbitrageurs participating in the merger. As a consequence, the spread dynamics do not imply anything. Figure 11.4b is a plot of an unconverged spread. The bidder in this case was Berkshire Hathaway (whose chairman was the famous Warren Buffett). The target was General Re, a reinsurance company. The deal was announced June 19, 1998, and completed December 21, 1998. It is interesting to note that the spread on close was about $4.00; that is, the convergence to zero did not happen. To see why that would be the case, note the fact that Berkshire stocks were priced very high, and the exchange ratio was about 0.0035; that is, for every thousand shares of General Re we would need 3.5 shares of Berkshire to be perfectly hedged. This awkward exchange ratio is hard to achieve given the average per-trade volume of General Re and Berkshire. Thus, the limited liquidity of the stocks involved increased the execution risk and caused the inefficiency.
FIGURE 11.4A Inverted Spread (Z-TSA). Axis label: Spread on Close.
FIGURE 11.4B Unconverged Spread (BRK-GRN). Axis label: Spread on Close.
We now apply the theory to the INTC–LEVL example discussed in Chapter 10. Figures 11.5a and b are plots of the spread and the probability implied by the spread as of the close. The initial deal break probability is taken as the spread just after deal announcement expressed as a fraction of the spread calculated on the 10-day average prices of the securities before the announcement. In this case it turns out to be close to 30 percent; that is, the spread narrowed by about 70 percent on deal announcement. Note that the probabilities calculated follow the spread values very closely. This results in a very noisy picture. It is conceivable that the true probabilities do not vary as much. It is therefore useful to use some smoothing function on the spread before evaluating the implied probabilities. We will discuss this in greater detail in Chapter 12.
FIGURE 11.5A Spread (INTC-LEVL). Axis label: Spread.
FIGURE 11.5B Probability of Deal Break (INTC-LEVL). Axis label: Probability.
RISK MANAGEMENT One of the important tasks in risk measurement is the estimation of VAR, an acronym for value at risk.
VAR Measurement Let us now detail what we mean by VAR measurement starting with the outcome of the measurement process. The outcome is a statement as follows: “Under usual market conditions, there is a 2.5-percent chance of losing X dollars in one day given the current list of positions.” The value X is the value at risk and is based on the current (day’s close) mark to market. The value of X is typically based on the statistics of the daily market movement in the past two to three years—that is, the distribution of price movements—and, more importantly, on the correlation between the price movements of various assets. The exercise is therefore to determine what the likely value of X is. Also note that there is nothing magical about use of the 2.5-percent value in the VAR statement. It is an artifact left from the cultural conditioning of dealing with Gaussian distributions and represents two standard deviations in the Gaussian case. That value could therefore be 1 percent or any number that one fancies. Another key aspect of the preceding statement is the use of the phrase, “usual market conditions.” This is because the measurement approach that works under usual market conditions is not sound under extreme market conditions. This is because the correlations that may be considered as relatively stable under usual market conditions become erratic under extreme market conditions causing the VAR approach to break down. Applying the same reasoning based on the correlation between asset prices, it is easy to deduce that commonly applicable methods to measure the VAR on a portfolio may not be used in the case of risk arbitrage. The announcement of the deal causes a qualitative shift in the way the two companies are viewed in the marketplace. The historical correlation between them is no longer relevant. 
The market dynamic now also reflects the possibility of merger between the two companies and there is an increased level of correlation between the returns of the bidder and target stocks. Therefore, any VAR methodology that relies on long-term historical correlation is rendered inappropriate. Furthermore, use of the historical correlation could potentially inflate the VAR numbers, increasing risk capital requirements and resulting in an unrealistically high cost of doing business. This leads us to look for reasonable ways to evaluate the value at risk involved in risk arbitrage trades. The risk arbitrage trade at hand can be viewed in two ways. On event of deal failure, it is equivalent to holding two separate securities, and traditional VAR methodologies will apply. However, when the deal is on, the risk is mostly due to the spread volatility, and the value at risk is best obtained by measuring the spread volatility. Thus, based on the view we choose to adopt, we now have two VAR values. We now reconcile the two views and arrive at a final VAR number by weighing each scenario. Fortunately for us, the consensus probabilities of deal success and failure provide us with a ready mechanism by which to weigh the two views. Putting things more precisely, each of the two views may be associated with a return distribution describing the probability of return of the trade in the next time period. The overall probability distribution is then a weighted sum of the two distributions, the weights corresponding to the probability of deal success and deal failure, respectively. Distributions that are constructed as a weighted sum of other distributions are called mixture (mixed density) distributions. To better understand the model, let us consider the process of drawing a sample from a mixture distribution. This is actually a two-step process. In the first step, we randomly choose the distribution from the available set of distributions. The choice is guided by the probability weights assigned to each distribution in the set. In the next step, we draw a random sample from the chosen distribution. Applying the mixture model to our situation, the probability of an outcome x dollars or lower is given by the distribution function shown in Equation 11.9
$$F(x) = \pi_{\text{success}}\,\Phi_{\text{spread}}(x) + \pi_{\text{failure}}\,\Phi_{\text{stockPortfolio}}(x) \tag{11.9}$$
where $\Phi_{\text{spread}}$ is the distribution of spread returns, and $\Phi_{\text{stockPortfolio}}$ is the distribution of the returns assuming that there is no deal. The success and failure probabilities serve as weights. Thus, given a value of x dollars, we can find the probability that the returns will be less than or equal to x using the above formula. Note that the VAR is the value of x for which Equation 11.9 yields a probability of 2.5 percent. Additionally, this distribution function, like every other distribution function, is nondecreasing. As a consequence, the VAR value may be deduced by applying a standard binary search method. Thus, knowledge of the consensus probability estimates of deal success and failure may be used to calculate the VAR in risk arbitrage trades.
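The binary search on Equation 11.9 can be sketched as follows. This is a minimal illustration, not the author's implementation: it assumes, purely for the sake of example, that both component distributions are Gaussian, and the function names, search bounds, and tolerance are hypothetical choices.

```python
from statistics import NormalDist


def mixture_cdf(x, p_success, spread_dist, portfolio_dist):
    """Equation 11.9: probability of an outcome of x dollars or lower."""
    p_failure = 1.0 - p_success
    return p_success * spread_dist.cdf(x) + p_failure * portfolio_dist.cdf(x)


def mixture_var(p_success, spread_dist, portfolio_dist,
                level=0.025, lo=-1e6, hi=1e6, tol=1e-6):
    """Find x with F(x) = level by bisection.

    Relies only on F being nondecreasing. The bounds lo and hi are
    assumed to bracket the answer (hypothetical defaults, in dollars).
    """
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mixture_cdf(mid, p_success, spread_dist, portfolio_dist) < level:
            lo = mid  # probability too small: move right
        else:
            hi = mid  # probability large enough: move left
    return 0.5 * (lo + hi)


# Assumed one-day P&L distributions (illustrative numbers only).
deal_on = NormalDist(mu=0.0, sigma=1_000.0)    # spread risk while the deal is on
deal_off = NormalDist(mu=0.0, sigma=20_000.0)  # two unhedged stocks after a break

var_mixture = mixture_var(0.9, deal_on, deal_off)
var_deal_on_only = mixture_var(1.0, deal_on, deal_off)
```

Because the deal-break distribution is much wider in this illustration, even a 10 percent break probability pushes the 2.5-percent loss level well beyond the figure obtained from the spread volatility alone.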
Event Risk Management Oftentimes in risk arbitrage, the outcome of merger success or deal break may hinge on a single event like the shareholders’ vote or a court decision.
The arbitrageur may find the reward adequate but the risk too high. In such situations, the arbitrageur can protect himself against a widening spread through the use of options. Typically, the target stock experiences higher volatility relative to the bidder stock. The common strategy under these circumstances is to purchase a put on the target stock, thus protecting against a steep drop in the price of the target. While the price paid to enter into the hedge eats away at some of the profit in the trade, the risk once the put is in place is a lot lower. It may also be that the deal poses a high degree of risk only for some time. Once the event resolves itself favorably, the risk of deal break diminishes substantially. The timing, strike, and expiration of the put used will therefore depend on the appetite for risk and the degree of risk aversion of the arbitrageur. It must therefore be noted that while on the surface the risk arbitrage process seems cut and dried, there is still a substantial amount of decision making and there are judgment calls that are left to the discretion of the arbitrageur.
SUMMARY The risk neutral probability of merger is the probability implied by the observed spread between the bidder and target firms. It is similar to the implied volatility parameter for options. The evolution of the risk neutral probability of merger can be related to the spread dynamics. It is useful in the design of more appropriate value at risk measures.
FURTHER READING MATERIAL
Risk-Neutral Probability
Jarrow, R., and S. Turnbull. Derivative Securities, 2nd Edition. (Cincinnati, Ohio: South Western Publishing, 2000).
APPENDIX
Notation
$p_0^A$: Price of security A on the eve of merger announcement
$p_0^B$: Price of security B on the eve of merger announcement
$p_T^{AB}$: Price of combined entity at time T
$p_T^A$: Price of security A in case of deal break at time T
$p_T^B$: Price of security B in case of deal break at time T
$\pi_{\text{success}}$: Probability of successful merger
$\pi_{\text{failure}}$: Probability of deal break
$\gamma$: Exchange ratio

We now construct Table 11.2 similar to Table 11.1.

TABLE 11.2 Investment Scenario
Investment | Initial Price | Payoff at Time T, Merger Successful | Payoff at Time T, Merger Failed
Risk-Free | $e^{-rT}$ | $1$ | $1$
Buy Stock A | $p_0^A$ | $p_T^{AB}$ | $p_T^A$
Buy Stock B | $\gamma p_0^B$ | $p_T^{AB} + \text{cash}$ | $\gamma p_T^B$

The Arrow-Debreu equations that need to be satisfied, written in matrix form, are as follows:
$$\begin{pmatrix} e^{-rT} \\ p_0^A \\ \gamma p_0^B \end{pmatrix} = \begin{pmatrix} 1 & 1 \\ p_T^{AB} & p_T^A \\ p_T^{AB} + \text{cash} & \gamma p_T^B \end{pmatrix} \begin{pmatrix} \pi'_{\text{success}} \\ \pi'_{\text{failure}} \end{pmatrix}$$
Solving for the values of $\pi'_{\text{success}}$ and $\pi'_{\text{failure}}$ and normalizing so that they add up to 1, we have
$$\pi_{\text{failure}} = \frac{e^{rT}\left(p_0^A - \gamma p_0^B\right) + \text{cash}}{\left(p_T^A - \gamma p_T^B\right) + \text{cash}}$$
Applying the fact that
$$S_0 = p_0^A - \gamma p_0^B, \qquad S_T = p_T^A - \gamma p_T^B$$
we have
$$\pi_{\text{failure}} = \frac{e^{rT} S_0 + \text{cash}}{S_T + \text{cash}}$$
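As a quick numerical check of the closing formula, the following sketch evaluates the implied deal break probability for one set of inputs. The function name and all the numbers are hypothetical; in particular, the spread on deal break has to be estimated somehow (for instance from pre-announcement prices), and that estimate is taken as given here.

```python
import math


def implied_failure_probability(s0, s_t_break, r, t_years, cash=0.0):
    """pi_failure = (e^{rT} * S_0 + cash) / (S_T + cash).

    s0        : current spread S_0
    s_t_break : assumed spread S_T at time T if the deal breaks
    r         : continuously compounded risk-free rate
    t_years   : time to deal resolution, in years
    cash      : cash component of the deal, if any
    """
    return (math.exp(r * t_years) * s0 + cash) / (s_t_break + cash)


# Illustrative inputs: spread of 1.50 today, 5.00 on deal break,
# 5 percent risk-free rate, three months to resolution, no cash component.
pi_failure = implied_failure_probability(1.50, 5.00, r=0.05, t_years=0.25)
pi_success = 1.0 - pi_failure
```

With these inputs the implied break probability comes out at roughly 30 percent, slightly above the raw ratio 1.50 / 5.00 because of the interest rate factor.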
CHAPTER 12 Spread Inversion
INTRODUCTION In Chapter 11 we discussed that the risk neutral probabilities of merger can be evaluated from the value of the spread as observed in the marketplace. The reasoning was that market forces that affect the value of the spread take into account the risk of deal completion. It is probably true that in most cases the deal completion risk is the predominant factor affecting the spread dynamics. However, the spread is also subject to idiosyncratic movement on the part of any one of the two stocks in question. This movement can be due to a variety of reasons that have nothing to do with the deal completion risk and hence contributes to what we shall term noise in the evaluation of the spread-implied probabilities. This fact is rather evident from the raw plot of the spreads. Some of the reasons for this behavior could be due to the bid-ask spread of the individual stocks and the order of the buys and sells as they occur on each individual stock. It could also be due to the market maker, looking to adjust inventory. The market maker may move the price sufficiently higher or lower albeit on a temporary basis to produce the desired supply or demand in the stock and thereby adjust his inventory levels. Another reason for the spread movement could be attributed to short covering on the target stock especially right after the merger announcement. Therefore, when attempting to evaluate the deal break probabilities, we need to bear in mind that the observed spread is not a consequence of the deal risk alone. However, if the deal risk is the predominant driver of the spread dynamics, the other effects may be treated as noise and filtered using classical filtering methods. In this chapter, we propose the Kalman filter for the task. The primary reason for proposing the Kalman filter is that it is robust and easy to use. 
Unlike Wiener filtering methods, which are valid only for stationary stochastic signals, the Kalman filter may be applied to a wide variety of situations without restriction. This property conveniently spares us from having to reach any conclusion on the stationarity of the spread and makes it a suitable candidate method.
A comprehensive introduction to the Kalman filter is provided in Chapter 4. However, for sake of continuity we will summarize the method here in a few sentences. The idea of Kalman filtering revolves around the notion of state of the system. The process involves a sequence of state predictions followed by observations. The predictions are then reconciled with the observations to obtain the best estimate of the state. Applying the idea to our situation, let the state correspond to the logarithm of the spread. This process then translates as follows. First, we make a prediction of the next value for the log-spread and follow this with an observation after the elapsed time. The predicted log-spread and the observed log-spread are then reconciled to form the best estimate of the spread at that time instance. The process is then repeated for the next time step. Based on the discussion here, it is now obvious that Kalman filter design for our problem involves two main steps. First, we need to design the prediction method that we plan to use, to provide us with an a priori estimate of the state. Next, we need to model the observation process. The outcome from the observation modeling process is a means of observing the state along with a method to estimate the error variance in the observation. Once we have the prediction and observation, the Kalman filtering process provides us with the means to reconcile between them to come up with an estimate of the state; that is, the logarithm of the spread. This value can in turn be used to estimate the spread implied probability of deal break. Let us now discuss the modeling process for the prediction and observation equations.
THE PREDICTION EQUATION The prediction equation is essentially an equation specifying the state transition for the Kalman filter. Since in our case the state corresponds to the logarithm of the spread, the state equation in our case involves coming up with a prediction scheme for the value of the spread at the next time step. We of course have at our disposal all the past values of the spread right up to the current time step. Let us therefore discuss how we can achieve that. In order to motivate our choice for the state equation, we recall the difference equation from the previous chapter.
$$\log\left(\pi_{\text{failure}}^{t_2}\right) - \log\left(\pi_{\text{failure}}^{t_1}\right) = -r\left(t_2 - t_1\right) + \log\left(S_{t_2}\right) - \log\left(S_{t_1}\right) \tag{12.1}$$
The left-hand side of the equation is equivalent to the rate of information flow. The right-hand side is the return in excess of the risk-free rate. The implication is that under static interest rate conditions and a constant rate of information arrival at the market, the rate of decay of the spread would be a constant. However, the various milestones leading to the merger like the filing of documents, the proxy vote, and so forth, all happen at discrete times. So, how realistic is the assumption of a constant rate of information flow? It turns out that it is not terribly unrealistic. When the merger announcement is made and events are set into motion, it creates an expectation in the marketplace as to the timing of the various milestones leading to the merger. If a day passes without any news contrary to expectation, then it strengthens the expectations. In that sense, the mere passage of time may be construed as providing information leading to the strengthening of expectations; this is one of the very rare cases in the market where no news is considered good news. A constant rate of information flow, according to the model, carries with it the implication that the logarithm of the spread decreases in a linear fashion, so that the spread itself approaches zero exponentially. In fact, this would be a very reasonable assumption, if the deal were to proceed without any hiccups. However, given the vicissitudes of the deal process, such an assumption might prove to be untenable. Nevertheless, it may still be reasonable to assume that the logarithm of the spread is piecewise linear; that is, the rate of spread reduction is about the same as observed in the recent past. Therefore, we model the log-spread at the next time instant to be the current log-spread plus its instantaneous rate of change, which is negative while the spread is narrowing. The state equation is given as
$$X_t = X_{t-1} + \Delta_{t-1} + \varepsilon_t \tag{12.2}$$
where $\Delta_{t-1}$ is the instantaneous rate of change of the spread. Note that Equation 12.2 can also be viewed as the first two terms of a Taylor expansion of the spread about $X_{t-1}$. This, therefore, brings us to the requirement of specifying a scheme to evaluate the instantaneous rate of change of the spread, $\Delta_{t-1}$. A naive method would be to use the correction rate of the past time step given by
$$\Delta_{t-1} = \hat{X}_{t-1|t-1} - \hat{X}_{t-2|t-2} \tag{12.3}$$
A little more sophisticated method would be to take the minimum variance weighted average of the past two correction rates given by
$$\Delta_{t-1} = r_t\left(\hat{X}_{t-1|t-1} - \hat{X}_{t-2|t-2}\right) + \left(1 - r_t\right)\left(\hat{X}_{t-2|t-2} - \hat{X}_{t-3|t-3}\right) \tag{12.4}$$
where $r_t$ is the weight. Alternately, one could obtain an estimate of the correction rate by taking the mean correction rate over the past d time steps; that is,
$$\Delta_{t-1} = \frac{\left(\hat{X}_{t-1|t-1} - \hat{X}_{t-2|t-2}\right) + \cdots + \left(\hat{X}_{t-d|t-d} - \hat{X}_{t-1-d|t-1-d}\right)}{d} = \frac{\hat{X}_{t-1|t-1} - \hat{X}_{t-1-d|t-1-d}}{d} \tag{12.5}$$
Thus, the process of evaluating the instantaneous rate of reduction of the spread could vary depending on d, the number of time steps in the past that we use in our estimation. Consequently, we could have different state equations corresponding to different values of d, the lag parameter, and for each such state equation, a version of the Kalman filter could be implemented. But which one of the state equations is most suited for our purpose? Let us defer this question for now and address it a little later in the chapter. Another point that is noteworthy in the modeling of the prediction equation is the fact that the actual equation used is different for each time step. The exact values for the coefficients in the prediction equation are estimated at each time step based on the instantaneous rate of spread reduction. This is a little different from typical Kalman filter implementations that have a fixed prediction equation with fixed coefficients. We are now ready to move on to the observation equation.
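The mean correction rate over the past d steps telescopes to a single difference, which makes it easy to compute. A minimal sketch follows; the function name and the sample values are hypothetical, and the list is assumed to hold the posterior state estimates obtained so far, oldest first.

```python
def mean_correction_rate(estimates, d):
    """Mean one-step change of the state estimates over the past d steps.

    The sum of the d one-step differences telescopes, so only the latest
    estimate and the estimate d steps earlier are needed. With d = 1 this
    reduces to the naive rule of using the last correction rate.
    """
    if d < 1:
        raise ValueError("lag d must be at least 1")
    if len(estimates) < d + 1:
        raise ValueError("need at least d + 1 past estimates")
    return (estimates[-1] - estimates[-1 - d]) / d


# Illustrative log-spread estimates falling by 0.1 per step.
past = [0.0, -0.1, -0.2, -0.3]
rate_lag1 = mean_correction_rate(past, 1)
rate_lag3 = mean_correction_rate(past, 3)

# The same lag-3 value obtained by averaging the three one-step differences.
steps = [b - a for a, b in zip(past, past[1:])]
rate_lag3_explicit = sum(steps) / len(steps)
```

A larger d averages over more history, so the estimated rate reacts more slowly to recent changes in the spread.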
THE OBSERVATION EQUATION The observation at a given time instant is the logarithm of the spread. Associated with an observation is a measure of the error. The magnitude of this error is quantified in terms of the error variance. The observation equation is written as
$$Y_t = X_t + \eta_t \tag{12.6}$$
where $Y_t$ is the logarithm of the observed spread at time t and $\eta_t$ is the observation error with zero mean and variance $\sigma_{\eta_t}^2$. We now therefore need to estimate the variance of our observation. To estimate the variance of the observation, we draw on the notion of realized volatility. Realized volatility is essentially an empirical volatility measure that sums the squared tick-by-tick returns over a given period. The construction of the realized volatility measure is based on some theoretical results of integrated volatility defined in the context of continuous stochastic processes. A detailed discussion on realized volatility measures and their scaling properties can be found in the reference material. Bear in mind that if we use the realized volatility as a measure of the variance in the observation, then we would need access to tick data. The spread can then be evaluated on a tick-by-tick basis and the sum of squared returns on the spread could be used as a measure of observation variance. In the absence of tick data we use the following approximation to estimate the realized volatility measure. Initially, the four spreads corresponding to the open, high, low, and close prices for the day of the two stocks are computed. Note that the times at which the two stocks registered their highs could be different; nevertheless, we treat them as if the highs were registered simultaneously. The sum of squared returns on the spread is calculated on the two possible spread paths; namely, open-high-low-close and open-low-high-close. The smaller of the two values is then chosen to be the variance of the observation. The question the reader might ask at this point is, "Is this really the variance of the error in the observation?" It is probably not. However, the variances thus calculated are not used to price any instrument. It should suffice that the measured variances are proportional to the actual error variances and that the relative order of the errors associated with the observations is preserved. This ensures that the weighting of the observations in the evaluation of the state estimate is done in a consistent manner, resulting in state estimates that seem to be satisfactory. To get an intuitive feel for the volatility measure, consider the following. As the spread gets closer to zero, the bid-ask spread of the individual stocks measured as a percentage of the spread between the two stocks becomes higher. It then gets increasingly harder to tell whether the change in spread is real or due to microstructure or bid-ask effects. Thus, as the spread reduces, we should expect an increase in its variance.
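The open-high-low-close approximation can be sketched as below. Two points are assumptions made for illustration and are not spelled out in the text: the spread is computed as the exchange ratio times the bidder price minus the target price, and a "return on the spread" is taken to be a difference of log-spreads. The function name and the prices are hypothetical.

```python
import math


def ohlc_observation_variance(bidder_ohlc, target_ohlc, ratio):
    """Approximate the observation variance of the log-spread from daily bars.

    bidder_ohlc, target_ohlc : (open, high, low, close) prices for the day
    ratio                    : exchange ratio
    The highs (and lows) of the two stocks are treated as simultaneous.
    """
    spreads = [ratio * b - t for b, t in zip(bidder_ohlc, target_ohlc)]
    if min(spreads) <= 0:
        raise ValueError("log-spread is undefined for a non-positive spread")
    o, h, l, c = (math.log(s) for s in spreads)

    def sum_sq_returns(path):
        # Sum of squared log-spread changes along one ordering of the bar.
        return sum((b - a) ** 2 for a, b in zip(path, path[1:]))

    # Two candidate intraday paths; keep the smaller sum.
    return min(sum_sq_returns([o, h, l, c]), sum_sq_returns([o, l, h, c]))


# Illustrative day: the four spreads work out to about 2.0, 2.2, 1.9, and 2.2.
variance = ohlc_observation_variance(
    bidder_ohlc=(50.0, 51.0, 49.5, 50.5),
    target_ohlc=(38.0, 38.6, 37.7, 38.2),
    ratio=0.8,
)
```

The guard against non-positive spreads reflects the inverted-spread case discussed in Chapter 11, where the logarithm is undefined and the method cannot be applied.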
APPLYING THE KALMAN FILTER Summarizing the discussions so far, in order to apply the Kalman filtering approach to smooth the spread, we went through the exercise of modeling the prediction equation and the observation equation. The Kalman state in our situation corresponds to the logarithm of the true spread, and the observation corresponds to the logarithm of the spread as observed in the market. The state equation as applicable in our situation is given as
$$X_t = X_{t-1} + \Delta_{t-1} + \varepsilon_t \tag{12.7}$$
where $X_t$ is the state at time t, $\Delta_{t-1}$ is the time derivative of $X_{t-1}$, and $\varepsilon_t$ is the state noise. The time derivative $\Delta_{t-1}$ may be estimated by measuring the average change in the spread over different lag values. The observation equation is given by
$$Y_t = X_t + \eta_t \tag{12.8}$$
where $Y_t$ is the observation and $\eta_t$ is the observation noise with zero mean and variance $\sigma_{\eta_t}^2$. With this information we are now ready to apply the Kalman-filtering ideas in a recipe-like fashion (as discussed in Chapter 4). We list the recipe below once again for sake of continuity. The a priori estimate of the state at time t given all observations up to time t – 1 is denoted as $\hat{X}_{t|t-1}$, and the a posteriori estimate of the state at time t given all the observations up to time t is given as $\hat{X}_{t|t}$. The various steps are then as follows:
1. Evaluate $\hat{X}_{t|t-1}$ and $\operatorname{var}(\hat{X}_{t|t-1})$ using the state equation.
2. Find the observation $Y_t$ and $\operatorname{var}(Y_t)$ by observing the system.
3. Evaluate $K_t$, also known as the Kalman gain, which will be used to obtain the linear minimum error variance estimate.
4. Evaluate $\hat{X}_{t|t}$ given by $\hat{X}_{t|t-1} + K_t\,(Y_t - \hat{X}_{t|t-1})$.
5. Finally, evaluate $\operatorname{var}(\hat{X}_{t|t})$.
These steps are repeated for the next time step. The exact formulas to use for the various steps are derived in simplified form in the appendix.
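The five steps can be sketched as a short loop. This is a simplified illustration rather than the appendix design: the a priori variance is taken to be the previous posterior variance plus an assumed constant state noise variance (`state_var`, a hypothetical parameter), which ignores the uncertainty in the estimated rate of change that the appendix tracks through a covariance term. The function name and the sample data are likewise hypothetical.

```python
def kalman_smooth_log_spread(y, obs_var, d=1, state_var=1e-4):
    """Scalar Kalman recursion for the log-spread.

    y         : observed log-spreads, one per time step
    obs_var   : observation variance for each time step
    d         : lag used for the mean correction rate
    state_var : assumed state noise variance (simplification, see lead-in)
    Returns (posterior estimates, posterior variances, a priori predictions).
    """
    x_post, p_post, x_prior = [y[0]], [obs_var[0]], [y[0]]
    for t in range(1, len(y)):
        # Rate of change from the estimates so far; shorter lag early on.
        lag = min(d, t - 1)
        drift = (x_post[-1] - x_post[-1 - lag]) / lag if lag > 0 else 0.0
        # Step 1: a priori estimate and its (simplified) variance.
        pred = x_post[-1] + drift
        pred_var = p_post[-1] + state_var
        # Step 2: the observation y[t] and its variance obs_var[t].
        # Step 3: Kalman gain.
        gain = pred_var / (pred_var + obs_var[t])
        # Step 4: a posteriori estimate.
        est = pred + gain * (y[t] - pred)
        # Step 5: a posteriori variance.
        est_var = (1.0 - gain) * pred_var
        x_prior.append(pred)
        x_post.append(est)
        p_post.append(est_var)
    return x_post, p_post, x_prior


# Illustrative noisy, downward-drifting log-spreads.
obs = [0.50, 0.45, 0.47, 0.38, 0.33, 0.35, 0.25]
est, est_var, pred = kalman_smooth_log_spread(obs, [0.01] * len(obs), d=3)
```

Each posterior estimate lies between the prediction and the observation, and its variance is smaller than either the prediction variance or the observation variance, which is what makes the reconciled value the preferred estimate of the state.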
MODEL SELECTION Note that based on the preceding scheme, we have multiple models for the prediction equation. The differentiating factor amongst them is d, the lag factor used in the estimation of the instantaneous error correction rates. For each of the prediction equations, we can implement a unique Kalman filter. We therefore need to choose the implementation that is most appropriate; that is, we need to make an appropriate choice for the parameter d. The principle guiding our choice for the lag value d is the maintenance of the delicate balance between prediction and observation. Relying overly on the prediction would mean that we rely excessively on our model and do not give enough weight to what is observed in practice. However, relying too much on the observation would mean that we do not have any view whatsoever on where the spread must be and so must rely excessively on the noisy observation. The right model choice achieves a happy medium between the two extremes. Let us therefore look at how we can quantify this notion of a happy medium. The basic idea is to work with the two sources of error; namely, the measurement/observation error and the state transition prediction error. We denote the measurement cost as the sum of all squared measurement errors and the prediction cost as the sum of all squared prediction errors. More precisely, if $X_1, X_2, X_3, \ldots, X_n$ are the estimated states of the Kalman filter, $Y_1, Y_2, Y_3, \ldots, Y_n$ the measurements, and $Z_1, Z_2, Z_3, \ldots, Z_n$ the predicted values from the state equation, then
$$\text{measurement cost} = \sum_{i=1}^{n} \left(X_i - Y_i\right)^2$$
$$\text{prediction cost} = \sum_{i=1}^{n} \left(X_i - Z_i\right)^2$$
These cost functions may be considered the measures of effectiveness of the measurement and the state transition models. A high prediction cost indicates a poor prediction model. Similarly, a high measurement cost indicates a poor observation model. If the two models must be equally effective, it makes sense to require that the values of the two cost functions be more or less the same. Additionally, we also know that the Kalman estimate for the state is a convex combination of the measured and predicted states; that is, given $k_i$ to be the Kalman gain at the ith time step, we have
$$X_i = \left(1 - k_i\right) Z_i + k_i Y_i \tag{12.9}$$
The discussion on model selection in the case of smoothing functions can be viewed as a situation where we attempt to separate the signal from noise given a sum of both. On the one hand, we can be very conservative and treat every kink as meaningful and separate out very little as noise. On the other hand, we can throw the baby out with the bathwater and oversmooth the given signal, discarding part of the signal along with the noise. The model selection criterion above provides us with a methodology to achieve a balance between both extremes.
This dependency on the measured and predicted states requires both the measurement model and the prediction model to be fairly precise. Having one of them to be very precise and the other to be erroneous leads us to overly rely on one or the other and is likely to produce a mediocre result. Thus, in some sense there is a trade-off to be made between the observation and prediction costs, and the reduction of one of the costs at the expense of the other is highly undesirable. With this motivation, we define the cost function associated with a Kalman filter to be
$$\text{cost function} = \text{measurement cost} + \text{prediction cost} \tag{12.10}$$
This cost function serves to keep the system honest. If in an attempt to reduce the cost function, we try to reduce the prediction cost, it would be all right as long as it does not increase the measurement cost and vice versa. The best choice for our prediction equation is therefore the one that results in the minimum value for this cost function. To demonstrate the approach, let us apply it to a real-life situation. Figure 12.1 is a plot of the spread and the corresponding Kalman smoother for various lags. The bidder in this case is McKesson Inc., and the target is HBO and Company. The exchange ratio is 0.37. The deal was announced October 19, 1998, and completed January 13, 1999. Notice that smaller lags tend to follow the data more closely. Increasing the lag results in greater smoothing and greater deviations from the data. We also have a plot of the cost function as discussed for various lags. See Figure 12.2. Note that the minimum cost value occurs at lag 3. A visual examination of the smoothed series for lag 3 against the series for other lags shows that it is indeed a reasonable choice. Thus, a low value for the lag parameter d implies a noisy set of state estimates, making the Kalman filter very sensitive to the observations. Alternately, a high value for the lag parameter d denotes a smoother set of states and the observations are largely ignored. The most suitable value for the lag parameter is one that minimizes the cost function.
FIGURE 12.1 Kalman Filter Implementations (MCK-HBOC). Panels plot log(Spread).
FIGURE 12.2 Model Choice (MCK-HBOC). Axis label: Model Cost.
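The selection rule can be sketched in a few lines. The function names are hypothetical, and the per-lag cost values in the usage example are made-up numbers for illustration, not the values behind Figure 12.2.

```python
def model_cost(estimates, observations, predictions):
    """Equation 12.10: measurement cost plus prediction cost.

    estimates    : Kalman state estimates X_1..X_n
    observations : measurements Y_1..Y_n
    predictions  : a priori predictions Z_1..Z_n from the state equation
    """
    measurement = sum((x - y) ** 2 for x, y in zip(estimates, observations))
    prediction = sum((x - z) ** 2 for x, z in zip(estimates, predictions))
    return measurement + prediction


def best_lag(cost_by_lag):
    """Return the lag d whose filter has the smallest total cost."""
    return min(cost_by_lag, key=cost_by_lag.get)


# Tiny worked case: measurement cost 0 + 1, prediction cost 1 + 0.
example_cost = model_cost([1.0, 2.0], [1.0, 3.0], [0.0, 2.0])

# Hypothetical costs obtained by running one filter per candidate lag.
chosen = best_lag({1: 12.0, 2: 9.8, 3: 8.5, 5: 9.7})
```

In practice one would run the filter once per candidate lag on the same spread history, compute the cost for each run, and keep the lag with the smallest total.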
APPLICATIONS TO TRADING Along with the smoothed version of the spread, the Kalman filter also estimates the error standard deviation at every time instant. This can be treated as error bands about a mean estimate, similar to Bollinger bands in technical analysis. Trading can be undertaken when the observed spread is on the upper or lower fringe of the band. We, however, add a note of caution. When the spread widens to the upper fringe of the band, it may be because there is some deterioration in the fundamentals of the merger and may not be just an aberration. Therefore, one needs to exercise extreme caution when putting the spread on. The matter is straightforward when timing the unwind. If the observed spread is near the bottom fringe of the band, then one can safely unwind the position and get back into it again at a higher spread level. Illustrated in Figure 12.3 is a plot of the spread, the Kalman smoother, and the error bands. The bidder in this case was Alza Pharmaceuticals. The target was Sequus Pharmaceuticals. The exchange ratio was 0.4 share. The deal was announced October 5, 1998, and completed March 17, 1999. Last, it is important to bear in mind that the smoothing scheme in Figure 12.3 has been set up with daily data in mind. When looking at data on an intraday basis, there is a likelihood of running into intraday effects on the spread variance at market open and market close. One can expect that the spread variance is high around those periods and lower during the middle of the day. The constant arrival of information assumption is probably not correct under such circumstances.
FIGURE 12.3 Kalman Smoothing with Confidence Bands (AZA-SEQU, lag 5). Axis label: Log (Spread).
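The band check described in this section can be sketched as follows. The band width of two standard deviations and the function name are assumptions made for illustration; the text does not specify a width. All quantities are in log-spread units.

```python
import math


def band_position(observed, estimate, variance, width=2.0):
    """Locate an observed log-spread relative to the Kalman error band.

    observed : observed log-spread
    estimate : smoothed (posterior) log-spread from the filter
    variance : posterior variance of the estimate
    width    : band half-width in standard deviations (assumed value)
    Returns "upper", "lower", or "inside".
    """
    sd = math.sqrt(variance)
    if observed >= estimate + width * sd:
        return "upper"   # wide spread: candidate entry, but check fundamentals
    if observed <= estimate - width * sd:
        return "lower"   # narrow spread: candidate point to unwind
    return "inside"


# Illustrative values: estimate -1.5 with standard deviation 0.2.
wide = band_position(-1.0, -1.5, 0.04)
narrow = band_position(-2.0, -1.5, 0.04)
normal = band_position(-1.5, -1.5, 0.04)
```

An "upper" reading is only a prompt to look closer, since a widening spread may reflect a genuine deterioration in the deal rather than noise.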
SUMMARY
The spread observed in the marketplace between two companies involved in a merger is likely to be distorted by other market effects, such as bid-ask spread effects and market maker inventory adjustments. The Kalman filtering approach is a suitable smoothing technique for estimating the actual spread levels. The filtered spread can be used in calculating the risk-neutral probability and can also assist in timing executions.
FURTHER READING MATERIAL
Kalman Filter
Harvey, A. C. Time Series Models, 2nd Edition. (Cambridge, Mass: MIT Press, 1993), pp. 82–104.
Model Choice
Kalaba, R., and T. Tesfatsion. “A Multicriteria Approach to Model Specification and Estimation.” Computational Statistics and Data Analysis 21 (1996): 193–214.
Realized Volatility
Anderson, T. G., and others. “The Distribution of Exchange Rate Volatility.” Symposia 99, Statistical Issues in Risk Management. Leonard N. Stern School of Business, April 1999.
APPENDIX
Kalman Filter Design: Lag 1
The state equation of the Kalman filter is given as
$$\hat{X}_{t|t-1} = 2\hat{X}_{t-1|t-1} - \hat{X}_{t-2|t-2}$$
$$\operatorname{var}(\hat{X}_{t|t-1}) = 4\operatorname{var}(\hat{X}_{t-1|t-1}) + \operatorname{var}(\hat{X}_{t-2|t-2}) - 4\operatorname{cov}(\hat{X}_{t-1|t-1}, \hat{X}_{t-2|t-2})$$
The observation equation is given as
$$Y_t = X_t + \eta_t$$
The variance of $Y_t$ is calculated as described in the discussion on the observation equation. We now define
$$g_t = \frac{\operatorname{var}(Y_t)}{\operatorname{var}(Y_t) + \operatorname{var}(\hat{X}_{t|t-1})}$$
where $g_t = 1 - K_t$, and $K_t$ is the Kalman gain as described in the standard predictor-corrector framework. The a posteriori estimate of the state and its variance are given as
$$\hat{X}_{t|t} = g_t \hat{X}_{t|t-1} + (1 - g_t) Y_t$$
$$\operatorname{var}(\hat{X}_{t|t}) = \frac{\operatorname{var}(\hat{X}_{t|t-1}) \operatorname{var}(Y_t)}{\operatorname{var}(\hat{X}_{t|t-1}) + \operatorname{var}(Y_t)}$$
We note that the a posteriori estimate is actually a convex combination of the a priori estimate and the observation. The value of $g_t$ as computed here ensures that the variance of the resulting combination is a minimum. We now proceed to obtain a recursive relation for $\operatorname{cov}(\hat{X}_{t-1}, \hat{X}_{t-2})$. The Kalman equation for the subscript $t-1$ is
$$\hat{X}_{t-1|t-1} = g_{t-1} \hat{X}_{t-1|t-2} + (1 - g_{t-1}) Y_{t-1}$$
Substituting for $\hat{X}_{t-1|t-2}$, we have
$$\hat{X}_{t-1|t-1} = g_{t-1} \left( 2\hat{X}_{t-2|t-2} - \hat{X}_{t-3|t-3} \right) + (1 - g_{t-1}) Y_{t-1}$$
Multiplying the above by $\hat{X}_{t-2|t-2}$ and evaluating the expected value, we have
$$\operatorname{cov}(\hat{X}_{t-1|t-1}, \hat{X}_{t-2|t-2}) = g_{t-1} \left( 2\operatorname{var}(\hat{X}_{t-2|t-2}) - \operatorname{cov}(\hat{X}_{t-2|t-2}, \hat{X}_{t-3|t-3}) \right)$$
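The lag 1 recursion above can be put together as a short Python sketch. Several choices here are assumptions made only for illustration and are not specified in the text: the function name `kalman_lag1`, seeding the first two states with the first two observations and their observation variances, a starting covariance of zero, and passing var(Y_t) as a ready-made sequence `obs_var` (the text computes it separately, as described in the discussion on the observation equation).

```python
def kalman_lag1(observations, obs_var):
    """Lag 1 design: extrapolate linearly, then blend with the observation.

    observations : sequence of observed values Y_t
    obs_var      : sequence of var(Y_t), assumed positive and supplied by the caller
    Returns a list of (posterior estimate, posterior variance) pairs.
    """
    if len(observations) < 2:
        raise ValueError("need at least two observations to initialise")
    # Assumed initialisation: the first two states equal the first two observations.
    x2, x1 = float(observations[0]), float(observations[1])
    v2, v1 = float(obs_var[0]), float(obs_var[1])
    c12 = 0.0  # cov(X_{t-1}, X_{t-2}); zero at the start is an assumption
    estimates = [(x2, v2), (x1, v1)]
    for y, var_y in zip(observations[2:], obs_var[2:]):
        # State equation and its variance
        x_prior = 2.0 * x1 - x2
        v_prior = 4.0 * v1 + v2 - 4.0 * c12
        # g_t = 1 - K_t
        g = var_y / (var_y + v_prior)
        # A posteriori estimate and variance
        x_post = g * x_prior + (1.0 - g) * y
        v_post = v_prior * var_y / (v_prior + var_y)
        # Covariance recursion: cov(X_t, X_{t-1}) = g_t (2 var(X_{t-1}) - cov(X_{t-1}, X_{t-2}))
        c_new = g * (2.0 * v1 - c12)
        x2, v2 = x1, v1
        x1, v1, c12 = x_post, v_post, c_new
        estimates.append((x_post, v_post))
    return estimates
```

On observations that lie exactly on a straight line, the prediction and the observation coincide, so the posterior estimate reproduces the line while the posterior variance shrinks below the observation variance.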
Kalman Filter Design: Lag 2
In order to enhance readability, we use $\hat{X}_t$ and $\hat{X}_{t|t}$ interchangeably to denote the posteriori state estimate. The state equation of the Kalman filter is obtained as a first order approximation of the Taylor expansion about the current point. To do that, an estimate of the derivative is required. The previous design used the first difference of the previous step as an estimate of the derivative. In this design, the derivative is estimated with two sample points. The mean and variance of the first sample are as follows:
$$E(\text{sample 1}) = \hat{X}_{t-1} - \hat{X}_{t-2}$$
$$\operatorname{var}(\text{sample 1}) = \operatorname{var}(\hat{X}_{t-1}) + \operatorname{var}(\hat{X}_{t-2}) - 2\operatorname{cov}(\hat{X}_{t-1}, \hat{X}_{t-2})$$
The mean and variance of the second sample are as follows:
$$E(\text{sample 2}) = \hat{X}_{t-2} - \hat{X}_{t-3}$$
$$\operatorname{var}(\text{sample 2}) = \operatorname{var}(\hat{X}_{t-2}) + \operatorname{var}(\hat{X}_{t-3}) - 2\operatorname{cov}(\hat{X}_{t-2}, \hat{X}_{t-3})$$
The covariance between the two samples is as follows:
$$\operatorname{cov}(\text{samples}) = \operatorname{cov}(\hat{X}_{t-1} - \hat{X}_{t-2}, \hat{X}_{t-2} - \hat{X}_{t-3})$$
$$\operatorname{cov}(\text{samples}) = \operatorname{cov}(\hat{X}_{t-1}, \hat{X}_{t-2}) - \operatorname{var}(\hat{X}_{t-2}) - \operatorname{cov}(\hat{X}_{t-1}, \hat{X}_{t-3}) + \operatorname{cov}(\hat{X}_{t-2}, \hat{X}_{t-3})$$
The minimum variance linear combination of the two samples is given by
$$r_t (\hat{X}_{t-1} - \hat{X}_{t-2}) + (1 - r_t)(\hat{X}_{t-2} - \hat{X}_{t-3})$$
where $r_t$ is given as
$$r_t = \frac{\operatorname{var}(\text{sample 2}) - \operatorname{cov}(\text{samples})}{\operatorname{var}(\text{sample 1}) + \operatorname{var}(\text{sample 2}) - 2\operatorname{cov}(\text{samples})}$$
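For readers who want to see where the expression for r_t comes from, the following is a brief derivation sketch. The shorthand symbols $s_1$, $s_2$, $\sigma_1^2$, $\sigma_2^2$, and $\sigma_{12}$ are introduced here for convenience and do not appear in the original text; they stand for the two samples, their variances, and their covariance.

```latex
\begin{align*}
V(r) &= \operatorname{var}\bigl(r s_1 + (1-r) s_2\bigr)
      = r^2 \sigma_1^2 + (1-r)^2 \sigma_2^2 + 2r(1-r)\sigma_{12} \\
\frac{dV}{dr} &= 2r\sigma_1^2 - 2(1-r)\sigma_2^2 + 2(1-2r)\sigma_{12} = 0 \\
r\,(\sigma_1^2 + \sigma_2^2 - 2\sigma_{12}) &= \sigma_2^2 - \sigma_{12} \\
r &= \frac{\sigma_2^2 - \sigma_{12}}{\sigma_1^2 + \sigma_2^2 - 2\sigma_{12}} \\
\frac{d^2V}{dr^2} &= 2(\sigma_1^2 + \sigma_2^2 - 2\sigma_{12})
      = 2\operatorname{var}(s_1 - s_2) \ge 0
\end{align*}
```

The second derivative is nonnegative, so the stationary point is a minimum. Setting $\sigma_{12} = 0$, with $s_1$ as the a priori estimate and $s_2$ as the observation, recovers the same form as the weight $g_t$ used in the Kalman update.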
Now the state equation is given as
$$\hat{X}_{t|t-1} = \hat{X}_{t-1} + r_t (\hat{X}_{t-1} - \hat{X}_{t-2}) + (1 - r_t)(\hat{X}_{t-2} - \hat{X}_{t-3})$$
$$\hat{X}_{t|t-1} = (1 + r_t)\hat{X}_{t-1} + (1 - 2r_t)\hat{X}_{t-2} - (1 - r_t)\hat{X}_{t-3}$$
$$\begin{aligned}
\operatorname{var}(\hat{X}_{t|t-1}) ={} & (1 + r_t)^2 \operatorname{var}(\hat{X}_{t-1}) + (1 - 2r_t)^2 \operatorname{var}(\hat{X}_{t-2}) + (1 - r_t)^2 \operatorname{var}(\hat{X}_{t-3}) \\
& + 2(1 + r_t)(1 - 2r_t)\operatorname{cov}(\hat{X}_{t-1}, \hat{X}_{t-2}) \\
& - 2(1 + r_t)(1 - r_t)\operatorname{cov}(\hat{X}_{t-1}, \hat{X}_{t-3}) \\
& - 2(1 - 2r_t)(1 - r_t)\operatorname{cov}(\hat{X}_{t-2}, \hat{X}_{t-3})
\end{aligned}$$
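The lag 2 prediction step can be sketched in Python as follows. The function name `lag2_predict`, the nested-list layout of the covariance input, and the fallback weight of 0.5 when the denominator vanishes are all assumptions made for this illustration. The prior variance is computed as the quadratic form of the weights (1 + r_t, 1 − 2r_t, −(1 − r_t)) with the covariance matrix, which expands term by term into the variance expression above.

```python
def lag2_predict(x, cov):
    """One prediction step of the lag 2 design.

    x   : (X_{t-1}, X_{t-2}, X_{t-3}) posterior state estimates
    cov : 3x3 covariance of those estimates, as nested lists (assumed layout)
    Returns (r_t, prior estimate, prior variance).
    """
    var_s1 = cov[0][0] + cov[1][1] - 2.0 * cov[0][1]
    var_s2 = cov[1][1] + cov[2][2] - 2.0 * cov[1][2]
    cov_s = cov[0][1] - cov[1][1] - cov[0][2] + cov[1][2]
    denom = var_s1 + var_s2 - 2.0 * cov_s
    # Minimum variance weight; equal weighting is an assumed fallback
    # for the degenerate case where the denominator is not positive.
    r = (var_s2 - cov_s) / denom if denom > 0 else 0.5
    # Coefficients of the state equation
    w = (1.0 + r, 1.0 - 2.0 * r, -(1.0 - r))
    x_prior = sum(wi * xi for wi, xi in zip(w, x))
    # Quadratic form w' * cov * w, equal to the expanded variance formula
    v_prior = sum(w[i] * w[j] * cov[i][j] for i in range(3) for j in range(3))
    return r, x_prior, v_prior
```

With uncorrelated unit-variance estimates, the two slope samples have equal variance, so the weight comes out at one half and the prediction is the last estimate plus the average of the two slopes.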
The observation equation is given as
$$Y_t = \hat{X}_t + \eta_t$$
The variance of $Y_t$ is calculated as described in the discussion of the observation equation. We now define
$$g_t = \frac{\operatorname{var}(Y_t)}{\operatorname{var}(Y_t) + \operatorname{var}(\hat{X}_{t|t-1})}$$
$g_t = 1 - K_t$, where $K_t$ is the Kalman gain as described in the standard predictor-corrector framework. The Kalman equations providing the minimum variance linear estimate are as follows:
$$\hat{X}_{t|t} = g_t \hat{X}_{t|t-1} + (1 - g_t) Y_t$$
$$\operatorname{var}(\hat{X}_{t|t}) = \frac{\operatorname{var}(\hat{X}_{t|t-1}) \operatorname{var}(Y_t)}{\operatorname{var}(\hat{X}_{t|t-1}) + \operatorname{var}(Y_t)}$$
We now proceed to obtain the recursive relation for $\operatorname{cov}(\hat{X}_{t-1}, \hat{X}_{t-2})$. The Kalman equation for subscript $t-1$ is as follows:
$$\hat{X}_{t-1|t-1} = g_{t-1} \hat{X}_{t-1|t-2} + (1 - g_{t-1}) Y_{t-1}$$
Substituting for $\hat{X}_{t-1|t-2}$, we have
$$\hat{X}_{t-1|t-1} = g_{t-1} \left[ (1 + r_{t-1})\hat{X}_{t-2} + (1 - 2r_{t-1})\hat{X}_{t-3} - (1 - r_{t-1})\hat{X}_{t-4} \right] + (1 - g_{t-1}) Y_{t-1}$$
Multiplying the above by $\hat{X}_{t-2}$ and evaluating the expected value, we have
$$\operatorname{cov}(\hat{X}_{t-1}, \hat{X}_{t-2}) = g_{t-1} \left[ (1 + r_{t-1})\operatorname{var}(\hat{X}_{t-2}) + (1 - 2r_{t-1})\operatorname{cov}(\hat{X}_{t-2}, \hat{X}_{t-3}) - (1 - r_{t-1})\operatorname{cov}(\hat{X}_{t-2}, \hat{X}_{t-4}) \right]$$
Multiplying by $\hat{X}_{t-3}$ and evaluating the expected value, we have
$$\operatorname{cov}(\hat{X}_{t-1}, \hat{X}_{t-3}) = g_{t-1} \left[ (1 + r_{t-1})\operatorname{cov}(\hat{X}_{t-2}, \hat{X}_{t-3}) + (1 - 2r_{t-1})\operatorname{var}(\hat{X}_{t-3}) - (1 - r_{t-1})\operatorname{cov}(\hat{X}_{t-3}, \hat{X}_{t-4}) \right]$$
Kalman Filter Design: Lag d (d ≥ 3)
To enhance readability, we use $\hat{X}_{t|t}$ and $\hat{X}_t$ interchangeably to denote the posteriori state estimate. The slope is estimated as the mean of the last d slope samples. This is given as
$$\frac{\hat{X}_{t-1} - \hat{X}_{t-1-d}}{d}$$
The state equation is therefore
$$\hat{X}_{t|t-1} = \hat{X}_{t-1} + \frac{\hat{X}_{t-1} - \hat{X}_{t-1-d}}{d}$$
$$\hat{X}_{t|t-1} = \left(1 + \frac{1}{d}\right)\hat{X}_{t-1} - \frac{1}{d}\hat{X}_{t-1-d}$$
The state variance is given by
$$\left(1 + \frac{1}{d}\right)^2 \operatorname{var}(\hat{X}_{t-1}) + \frac{1}{d^2}\operatorname{var}(\hat{X}_{t-1-d})$$
the assumption being that $\hat{X}_{t-1}$ and $\hat{X}_{t-1-d}$ are not correlated. The observation equation is given as
$$Y_t = \hat{X}_t + \eta_t$$
The variance of $Y_t$ is calculated as described in the discussion of the observation equation. We now define
$$g_t = \frac{\operatorname{var}(Y_t)}{\operatorname{var}(Y_t) + \operatorname{var}(\hat{X}_{t|t-1})}$$
The Kalman equations providing the minimum variance linear estimate are as follows:
$$\hat{X}_{t|t} = g_t \hat{X}_{t|t-1} + (1 - g_t) Y_t$$
$$\operatorname{var}(\hat{X}_{t|t}) = \frac{\operatorname{var}(\hat{X}_{t|t-1}) \operatorname{var}(Y_t)}{\operatorname{var}(\hat{X}_{t|t-1}) + \operatorname{var}(Y_t)}$$
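A minimal Python sketch of the lag d design follows. As with any illustration of this kind, several details are assumptions rather than part of the text: the function name `kalman_lag_d`, seeding the first d + 1 states with the raw observations and their observation variances, and supplying var(Y_t) as a precomputed sequence `obs_var`.

```python
def kalman_lag_d(observations, obs_var, d):
    """Lag d design: the slope is the mean of the last d slope samples.

    observations : sequence of observed values Y_t
    obs_var      : sequence of var(Y_t), assumed positive and supplied by the caller
    d            : lag parameter (the text uses this design for d >= 3)
    Returns (posterior estimates, posterior variances).
    """
    if d < 1:
        raise ValueError("lag d must be a positive integer")
    if len(observations) < d + 1:
        raise ValueError("need at least d + 1 observations to initialise")
    # Assumed initialisation: the first d + 1 states equal the observations.
    xs = [float(y) for y in observations[:d + 1]]
    vs = [float(v) for v in obs_var[:d + 1]]
    for t in range(d + 1, len(observations)):
        # State equation and variance, treating X_{t-1} and X_{t-1-d} as uncorrelated
        x_prior = (1.0 + 1.0 / d) * xs[t - 1] - xs[t - 1 - d] / d
        v_prior = (1.0 + 1.0 / d) ** 2 * vs[t - 1] + vs[t - 1 - d] / d ** 2
        # g_t weights the prior; (1 - g_t) weights the observation
        g = obs_var[t] / (obs_var[t] + v_prior)
        xs.append(g * x_prior + (1.0 - g) * observations[t])
        vs.append(v_prior * obs_var[t] / (v_prior + obs_var[t]))
    return xs, vs
```

Running the sketch for several values of `d` on the same spread series gives the family of smoothed series from which a lag can be chosen, for instance by comparing a cost function across lags.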