A simulation-based Bayes’ procedure for robust prediction of pairs trading strategies

Lukasz T. Gatarek∗, Lennart F. Hoogerheide and Herman K. Van Dijk

February 8, 2011

Abstract: We propose a new simulation method to estimate the cointegration model with nonnormal disturbances in a nonparametric Bayesian framework, in order to present a robust prediction of some alternative trading strategies. We apply the theory of Dirichlet processes to estimate the distribution of the disturbances in the form of an infinite mixture of normal distributions. The simulation algorithm based on Dirichlet process priors is confronted with the standard method based on the singular value decomposition and the encompassing prior. We test the methodology on both simulated and real data, evaluating our technique in terms of predictive accuracy. In the empirical exercise we apply the method to a statistical arbitrage (pairs trading) strategy.

Keywords: Bayesian analysis; Cointegration models; Dirichlet process; Simulation; Pairs trading.

JEL Classification: C11, C14, C50



∗ Econometric Institute and Tinbergen Institute, Erasmus University Rotterdam, P.O. Box 1738, NL-3000 DR Rotterdam, The Netherlands. E-mail: [email protected]

Introduction

The motivation for statistical arbitrage techniques has its roots in work that argues for predictability of stock prices and the existence of long-term relations in stock markets. This literature challenges the stylized fact in financial economics which says that stock prices should be described by independent random walk processes, which would imply no predictability of stock prices. The key references in this area are [Lo and MacKinlay, 1988], [Lo, 1991], [Lo and MacKinlay, 1992] and [Guidolin et al., 2009]. Based on these empirical investigations, trading strategies might be formed to exploit the inefficiencies of stock markets. [Khadani and Lo, 2007] consider a specific strategy, first proposed by [Lehmann, 1990] and [Lo and MacKinlay, 1990], that can be analyzed directly using individual U.S. equity returns. Given a collection of securities, they consider a long/short market-neutral equity strategy consisting of an equal dollar amount of long and short positions, where at each rebalancing interval the long positions consist of losers (underperforming stocks, relative to some market average) and the short positions consist of winners (outperforming stocks, relative to the same market average). This strategy is the opposite of the momentum strategy, which aims to capitalize on the continuance of existing trends in the market.

One of the strategies hidden under the common notion of statistical arbitrage is the pairs trading strategy. In pairs trading we do not deal with trends established for particular assets, but with the long-run equilibrium between a pair of stocks. When the spread between two assets is positive we sell it, i.e. we short sell the outperforming stock and go long the other one. When the spread is negative we buy it. In both cases we expect the spread


to come back to its equilibrium value of zero. [Gatev et al., 2006] show the performance of this arbitrage rule over a period of 40 years and find strong empirical evidence in its favour. The crucial steps in building the pairs trading strategy are the local estimation of both the current and the expected spread. In the framework of cointegration analysis the spread is modeled as the local deviation from the long-term equilibrium among the time series. Therefore the current spread between the assets is computed as the product of the cointegrating vector and the current stock prices. On the other hand, the expected spread is estimated as the product of the cointegrating vector and the predicted stock prices. The spread prediction is based on the assumption of a sound cointegration relation between the pair of assets. To sum up, the pairs trading technique is based on the assumption that the linear combination of prices (scaled by the cointegrating vector) reverts to zero, and a trading rule can be constructed to exploit the expected temporary deviations. The problems concerning the implementation of this technique have two main sources: poor estimates of the cointegrating vector and inaccurate prediction of the expected spread. In this work we address these issues. We apply a Bayesian sampling algorithm based on Dirichlet process priors in order to estimate the distribution of asset returns, which directly influences the quality of the estimates of both the model parameters and the spread. Moreover, the implementation of this nontrivial algorithm results in more accurate predictive densities, which in effect contributes to a more profitable pairs trading strategy. Finally, we find that in the time-varying model the normalization of the model parameters heavily influences the accuracy of the spread prediction. This issue is also addressed by our empirical analysis.

The paper is organized as follows.

1 Preliminaries

In order to test the profitability of the pairs trading strategy we need to identify long-term relations in the stock prices. Therefore we apply the cointegration model, see [Juselius, 2006]. The distributions of stock market returns are typically nonnormal, thus usually the t-distribution and other fat-tailed distributions are applied. In the case of pairs trading, we try to identify cointegrating relations in a huge universe of assets. It might be incorrect to assume a common distribution of returns across different stocks. We propose a general algorithm to estimate the cointegration model in a Bayesian way under nonnormality. The outline of such an algorithm is as follows:

1. Estimate the cointegration model

$\Delta y_t = \Pi y_{t-1} + \varepsilon_t$

under normality of the disturbances (M-H procedure under the encompassing prior, see for instance [Kleibergen and Van Dijk, 1998]). Obtain the fit $\hat{y}_t$.

2. Extract the residuals $\varepsilon_t$.

3. Assume that the distribution of the residuals from every series $y_j$ is approximated by a mixture of normal distributions. Find the components of this mixture and their weights. Label the residuals according to their source in the mixture. Then each residual $\varepsilon_{t,j}$ is characterized by the parameters of the corresponding normal distribution, $\mu_{t,j}$ and $V_{t,j}$.

4. Standardize the residuals and construct an artificial time series $y_t$ according to $y_t = \hat{y}_t + (\varepsilon_{t,j} - \mu_{t,j})/\sqrt{V_{t,j}}$.

5. Go to Step 1 using the artificial time series.

The challenge in constructing such an algorithm lies in finding an accurate method to select the number of components in the mixture of normal distributions. The components of this mixture are different in each repetition of the algorithm, thus a flexible method for estimating this mixture is needed. We propose to model this distribution as a Dirichlet Process Mixture (DPM): a mixture with a countably infinite number of components. Due to this property the technique is more flexible than a finite ordered mixture model, which specifies the number of components ex ante. For a general introduction to modelling via Dirichlet processes refer to [Ferguson, 1973], [Ferguson, 1974] and in particular to [Antoniak, 1974]. The model is hierarchical. For given $j$ the distribution of the residuals $\varepsilon_{t,j}$, $t \in \{1, \ldots, T\}$, is modeled by a set of latent parameters $\psi_j = \{\psi_{1,j}, \ldots, \psi_{T,j}\}$, where $\psi_{t,j} = (\mu_{t,j}, V_{t,j})$. The parameter $\psi_{t,j}$ unequivocally determines the affiliation of $\varepsilon_{t,j}$ to the normal distribution $N(\psi_{t,j})$. Thus the latent parameters automatically cluster the residuals. Every $\psi_{t,j}$ is drawn independently and identically from the distribution $G$. Therefore $G$ might be interpreted as a distribution over distributions, see [Ferguson, 1974]. The discreteness of $G$ is the reason for the clustering-related interpretation of Dirichlet processes.


For $\varepsilon_{t,j}$, $t \in \{1, \ldots, T\}$, this general model might be formulated as

$\varepsilon_{t,j} \mid \psi_{t,j} \overset{iid}{\sim} N(\psi_{t,j})$,   (1.1)

$\psi_{t,j} \mid G \overset{iid}{\sim} G$,   (1.2)

$G \mid \alpha_0, G_0 \sim DP(G_0, \alpha_0)$.   (1.3)

The information available in the data updates the prior knowledge in $G$. In particular, the prior $G \sim DP(G_0, \alpha_0)$ is parametrized by the centering distribution $G_0$ and the precision parameter $\alpha_0$, which determines how closely $G$ resembles $G_0$, see [Escobar and West, 1995]. The number of components in the mixture is not fixed; it is inferred from the data by means of the Bayesian updating mechanism. The reader unfamiliar with Bayesian theory is referred to [Box and Tiao, 1973] for a general discussion of Bayesian analysis, to [Bernardo and Smith, 1994] for an introduction to nonparametric Bayesian models, and to [Escobar and West, 1995] for details of estimating the distribution by means of Dirichlet processes. In this paper we combine Dirichlet process prior techniques with Bayesian estimation of cointegration models under the encompassing prior. We propose a coherent framework for estimating cointegration models under nonnormal distributions.
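To illustrate the flexibility of the hierarchy (1.1)-(1.3), the following minimal sketch simulates disturbances from a DP mixture of normals via truncated stick-breaking, a standard constructive representation of the Dirichlet process. It is an illustration only, not the estimation algorithm used in this paper, and the base measure $G_0$ chosen here (normal means, inverse-gamma-type variances) is an assumption for the example.

```python
import numpy as np

def draw_dp_mixture(T, alpha0, rng, trunc=100):
    """Simulate T draws from a DP mixture of normals (truncated stick-breaking)."""
    # Stick-breaking weights: w_k = v_k * prod_{l<k}(1 - v_l), v_k ~ Beta(1, alpha0)
    v = rng.beta(1.0, alpha0, size=trunc)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    # Component parameters psi_k = (mu_k, V_k) drawn iid from an illustrative G0
    mu = rng.normal(0.0, 1.0, size=trunc)
    V = 1.0 / rng.gamma(2.0, 1.0, size=trunc)
    # Each observation picks a component; coincident labels are the clustering
    k = rng.choice(trunc, size=T, p=w / w.sum())
    return rng.normal(mu[k], np.sqrt(V[k])), k

rng = np.random.default_rng(0)
eps, labels = draw_dp_mixture(T=250, alpha0=1.0, rng=rng)
print(len(np.unique(labels)), "distinct clusters among", len(eps), "draws")
```

A small $\alpha_0$ concentrates the draws on few clusters, while a large $\alpha_0$ pushes $G$ towards the base measure $G_0$, mirroring the role of the precision parameter described above.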

2 Methodology

2.1 Model

[Koop et al., 2005] give an overview of Bayesian approaches to the cointegration model. This literature might be structured by means of different classes of priors: the flat, Jeffreys', encompassing and subspace priors. In our approach we work with the encompassing prior, see

[Kleibergen and Van Dijk, 1998] or Appendix A. Apart from that, the Dirichlet process prior is imposed on the mixture distribution of the disturbances. It determines the assignment of the disturbances to the components of this mixture. This approach is new in the context of cointegration models. Many authors place the Dirichlet process directly on the parameters of the econometric model; [Campolieti, 2001] gives a list of some applications of the Dirichlet process prior in statistical modelling. We specify the bivariate cointegration model under the Dirichlet process prior as

$\Delta y_{t,1} = \alpha_1(\beta_1 y_{t-1,1} + \beta_2 y_{t-1,2}) + \varepsilon_{t,1},$
$\Delta y_{t,2} = \alpha_2(\beta_1 y_{t-1,1} + \beta_2 y_{t-1,2}) + \varepsilon_{t,2},$
$\varepsilon_{t,j} \sim N(\psi_{t,j}), \quad \psi_{t,j} = (\mu_{t,j}, V_{t,j}),$
$\psi_j \mid G_j \overset{iid}{\sim} G_j,$
$G_j \mid G_{0,j}, \alpha_{0,j} \sim DP(G_{0,j}, \alpha_{0,j}).$

We index the variables by $j$; let $p = 2$ denote the dimension of the cointegration model. Subscript $t$ refers to the timing of an observation. The model specification allows every disturbance to come from a distinct normal distribution defined by parameters $\mu$ and $V$. Those parameters, stacked in $\psi_j$, come from the prior distribution $G_j$. Finally, the distribution of $\psi$ is determined by a Dirichlet process prior parametrized by a positive scalar $\alpha_{0,j}$ and the centering distribution $G_{0,j}$. This distribution spans means and variances, i.e. $\psi_j \in \mathbb{R} \times \mathbb{R}^+$. As far as the interpretation of the parameters of $G_j$ is concerned, from the theoretical properties of the Dirichlet distribution we note that $E\{G_j(\psi_j)\} = G_{0,j}(\psi_j)$ for all $\psi_j \in \mathbb{R} \times \mathbb{R}^+$. Finally, $\alpha_{0,j}$ is a precision parameter determining the concentration

of the prior for $G_j$ about $G_{0,j}$. Further details on the model structure can be found in [Ferguson, 1973]. The key feature of the Dirichlet process is its discreteness, i.e. in any sample $\psi_j$ of size $T$ there is positive probability of coincident values. This implies that the $T$ elements $\{\psi_{t,j}\}$ are configured into exactly $T^*$ distinct values $\psi^*_{j,1}, \ldots, \psi^*_{j,T^*}$, where $T^* \leq T$. When we work with bivariate series of disturbances, the clustering implied by $\{\psi_1\}$ need not be identical to the one implied by $\{\psi_2\}$. In that way we allow for maximum freedom in choosing $\mu$ and $V$ across $j$. To account for the correlation between $\varepsilon_{t,1}$ and $\varepsilon_{t,2}$ we apply the Cholesky decomposition.

The construction of the matrix $\Pi = \beta\alpha$ automatically implies an identification problem. Postmultiplying the matrix $\beta$ by a full rank invertible matrix and premultiplying the matrix $\alpha$ by its inverse leaves the matrix $\Pi$ unchanged. In order to uniquely identify the elements of $\beta$ and $\alpha$, a normalization of the parameters is required. There are two types of normalizations that are applied in the cointegration literature: linear and orthogonal. Under linear normalization, at least $r^2$ restrictions need to be applied. A straightforward way of identifying the elements of $\alpha$ and $\beta$ is by normalizing $\beta$ as

$\beta = \begin{pmatrix} \beta_1 \\ \beta_2 \end{pmatrix} \equiv \begin{pmatrix} I_r \\ -\beta_2 \end{pmatrix}.$   (1.4)

Normalization (1.4) is equivalent to imposing $r^2$ linear restrictions on $\beta$, whereas $\alpha$ remains unrestricted. An interesting discussion of this normalization with respect to the cointegration model is given in [Strachan, 2003]. For a more introductory exposition of the topic we refer to [Juselius, 2006]. The main drawback of this method is its ordinality, i.e. the linear normalization imposes an order on the variables in the sense that the rows of $\beta_1$ are linearly independent. Driven by this insight we also apply another normalization, which is an unordinal method:

$\beta'\beta = I_r.$   (1.5)

In the two-dimensional case we can visualize it as a mapping of the cointegrating vectors onto the unit circle around the centre of the coordinate system. The support of the coefficients is bounded in this case. In the simulation and empirical parts we apply both normalizations in order to investigate whether the normalization has any impact on the profitability of the pairs trading strategy.

2.2 DP algorithm

In this subsection we present an algorithm to estimate the cointegration model under nonnormal distributions. The steps of the algorithm shall be repeated $Z$ times, where each run of the algorithm, $z$, leads to a new set of model parameter estimates: $\alpha^z$, $\beta^z$. The algorithm remains unaltered by the choice of normalization of the parameter matrix $\beta$. The cointegration rank is chosen under normality according to the ordering of Bayes factors, in line with [Kleibergen and Paap, 2002] for linear normalization and [Gatarek et al., 2010] for orthogonal normalization.

Step 1. For $z = 1$ obtain initial estimates of the model parameters under the normal distribution, for the period $1, \ldots, T$. Extract the error terms

$\hat{\varepsilon}_{t,j} = y_{t,j} - \hat{y}_{t,j},$

where $\hat{y}_{t,j}$ denotes the fit determined by the estimates. We apply the M-H algorithm of [Kleibergen and Paap, 2002] under linear normalization; it is extended by [Gatarek et al., 2010] to orthogonal normalization. (For $z = 2, \ldots, Z$, Step 2 until Step 8 should be repeated for each $j$.)

Step 2. Draw a starting value for each $\psi_{t,j}$. We follow [Escobar and West, 1995] in assuming a Normal × Gamma prior for $\psi$.

Step 3. Sample the elements of the vector $\psi_j$ sequentially, by drawing from the distribution of $(\psi_{1,j} \mid \psi^{(1,j)}, \text{Data})$, then $(\psi_{2,j} \mid \psi^{(2,j)}, \text{Data})$, up to $(\psi_{T,j} \mid \psi^{(T,j)}, \text{Data})$, where for variable $j$, $\psi^{(t,j)}$ denotes the entire vector $\psi_j$ without the element corresponding to observation $t$. (Note that if we configure the $T$ elements of $\{\psi_j\}$ into exactly $T^*$ distinct realizations $\psi^*_1, \ldots, \psi^*_{T^*}$, then $\psi_{j,t} = \psi^*_h$, $h = 1, \ldots, T^*$, implies $\mu_{j,t} = \mu^*_h$ and $V_{j,t} = V^*_h$.)

Step 4. Return to Step 3 and proceed iteratively until convergence in $\psi_j$.

Step 5. Construct an artificial dataset in line with

$y^z_{t,j} = \hat{y}^{z-1}_{t,j} + (\hat{\varepsilon}^{z-1}_{t,j} - \mu^z_{t,j})/\sqrt{V^z_{t,j}}.$

Step 6. Reestimate the model parameters under normality (as in Step 1) with the dataset constructed in Step 5.

Step 7. Extract and save the error terms. They deliver the new dataset for the subsequent iteration of the algorithm.

Step 8. Save the model parameters from Step 6 and return to Step 2. (Repeat the procedure until $z = Z$.)

Step 9. The empirical distribution of the parameters from $z \in \{2, \ldots, Z\}$ provides the distribution of the model parameters at $t = T$.
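As a rough illustration of the overall loop (Steps 1-8), the sketch below iterates estimation and residual standardization on a bivariate system. Two loud simplifications are made for brevity: plain OLS stands in for the M-H sampler under the encompassing prior (Steps 1 and 6), and a fixed two-component EM mixture stands in for the Dirichlet process clustering of Step 3; the artificial series of Step 5 is rebuilt here in differences and cumulated back to levels, which is one possible reading of that step.

```python
import numpy as np

def ols_step(y):
    """Stand-in for Steps 1/6: fit dy_t = Pi y_{t-1} + eps_t by OLS."""
    dy, ylag = np.diff(y, axis=0), y[:-1]
    Pi = np.linalg.lstsq(ylag, dy, rcond=None)[0].T
    fit = ylag @ Pi.T
    return Pi, fit, dy - fit          # residuals eps_hat = dy - fit

def em_two_normals(e, iters=100):
    """Stand-in for Step 3: per-observation (mu, V) from a 2-component mixture."""
    mu = np.array([e.mean() - e.std(), e.mean() + e.std()])
    V = np.array([e.var(), e.var()])
    w = np.array([0.5, 0.5])
    for _ in range(iters):
        d = np.stack([w[k] * np.exp(-0.5 * (e - mu[k]) ** 2 / V[k]) / np.sqrt(V[k])
                      for k in range(2)])
        r = d / d.sum(axis=0)         # responsibilities of the two components
        n = r.sum(axis=1)
        w, mu = n / n.sum(), (r * e).sum(axis=1) / n
        V = (r * (e - mu[:, None]) ** 2).sum(axis=1) / n
    k = r.argmax(axis=0)              # hard labels, as in the residual clustering
    return mu[k], V[k]

def dp_style_loop(y, Z=10):
    """Steps 1-8: reestimate, cluster residuals, standardize, rebuild the data."""
    draws = []
    for _ in range(Z):
        Pi, fit, eps = ols_step(y)
        draws.append(Pi)
        cols = []
        for j in range(y.shape[1]):   # Step 5 applied series by series
            m, v = em_two_normals(eps[:, j])
            cols.append(fit[:, j] + (eps[:, j] - m) / np.sqrt(v))
        dy_art = np.column_stack(cols)
        y = np.vstack([y[:1], y[:1] + np.cumsum(dy_art, axis=0)])
    return draws

rng = np.random.default_rng(1)
common = np.cumsum(rng.normal(size=(300, 1)), axis=0)    # shared stochastic trend
y = np.hstack([common + rng.normal(size=(300, 1)),
               common + rng.normal(size=(300, 1))])
print(dp_style_loop(y, Z=5)[-1])
```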


3 Simulation

The performance of the algorithm is evaluated by the accuracy of the predictive density for observation $y_{T+1}$. We compare the DP algorithm with the M-H algorithms derived under normality in [Kleibergen and Paap, 2002] and [Gatarek et al., 2010], respectively, for linear and orthogonal normalization. See Appendix A for a general discussion of these algorithms. We work with simulated time series. Distinct features of the algorithms result in different approaches to estimating the predictive densities. In the case of the DP algorithm, the predictive density is obtained based on the exchangeability of the DP processes. The out-of-sample prediction is given by

$\varepsilon_{T+1,j} \sim \dfrac{\alpha_{0,j}}{\alpha_{0,j} + T}\, T_s + \dfrac{1}{\alpha_{0,j} + T} \sum_{t=1}^{T^*} n_{t,j}\, N(\psi_{t,j}),$   (1.6)

$\Delta\hat{y}_{T+1,j} = \hat{\Pi} y_{T,j} + \varepsilon_{T+1,j},$   (1.7)

where $n_{t,j}$ denotes the number of error terms assigned to a particular component of the mixture; it determines the weight of this component in the mixture. $T_s$ denotes the t-distribution, see [Escobar and West, 1995] for details. Thus, the new realization $\varepsilon_{T+1,j}$ is either drawn from the t-distribution or from an existing component of the mixture of normal distributions estimated with realizations $1, \ldots, T$. In order to obtain an estimate of the predictive density based on (1.6), we introduce Step 4' into our algorithm. This step is performed after convergence in the $\psi_j$'s has been attained. In Step 4' we sample a new predicted $\varepsilon_{T+1,j}$ based on (1.6) and construct $\hat{y}_{T+1,j}$ with (1.7). Moreover, instead of sampling $\hat{y}_{T+1,j}$ once, for every $z$ we sample 1000 realizations of $\hat{y}_{T+1,j}$, based on the exchangeability principle. After $Z$ rounds of the algorithm this approach shall result in both a more precise predictive density and better point predictions. The method of averaging conditional expectations of a random variable instead of averaging draws of this variable was suggested by [Gelfand and Smith, 1990], who called it Rao-Blackwellization. For an application of this method in the context of Dirichlet processes see [Escobar, 1994].
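A minimal sketch of the mixture draw in (1.6) is given below, assuming the clusters (means mu, variances V, member counts n) have already been recovered by the Gibbs steps. The new-component term is drawn here as a standard Student t with illustrative degrees of freedom; [Escobar and West, 1995] derive the exact location, scale and degrees of freedom of the $T_s$ term from the base measure $G_0$.

```python
import numpy as np

def draw_eps_pred(mu, V, n, alpha0, df, rng):
    """One draw from (1.6): a new component with prob alpha0/(alpha0+T),
    otherwise an existing cluster chosen with prob n_k/(alpha0+T)."""
    T = n.sum()
    if rng.uniform() < alpha0 / (alpha0 + T):
        return rng.standard_t(df)              # illustrative stand-in for Ts
    k = rng.choice(len(n), p=n / T)            # existing cluster by its weight
    return rng.normal(mu[k], np.sqrt(V[k]))

# toy clusters: two components with 40 and 60 members
rng = np.random.default_rng(0)
mu, V, n = np.array([-0.5, 0.8]), np.array([0.2, 1.1]), np.array([40.0, 60.0])
draws = np.array([draw_eps_pred(mu, V, n, alpha0=1.0, df=4.0, rng=rng)
                  for _ in range(1000)])       # many draws per z, as in Step 4'
print(draws.mean(), draws.std())
```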

Under the normal distribution the prediction is computed according to

$\Delta\hat{y}^i_{T+1,j} = \Pi^i y_{T,j},$   (1.8)

where $i$ refers to a single draw of the M-H simulation algorithm presented in Appendix A. In order to obtain the point prediction for $y_{T+1,j}$ we average across the $i$'s. If the estimation procedure has high accuracy, the predictive density obtained with the simulation algorithms shall be close to the density of the random disturbances in the data generating process. We measure the distances between those densities. The rectilinear distance, also known as the city block measure, is applied for the distance between the densities, see [Sung-Hyuk, 2007] for details. The distances are further denoted by $d_{norm}$ and $d_{DP}$, respectively, for the normality and DP algorithms. It is of interest how $d_{norm}$ relates to $d_{DP}$. The objective of the following simulation is to evaluate both the $d_{norm}$ and $d_{DP}$ statistics. First let us transform the cointegration model into VAR form:

$y_t = \Pi_{VAR}\, y_{t-1} + \varepsilon_t.$   (1.9)

The so-called companion matrix $\Pi_{VAR}$ is unequivocally related to the matrix $\Pi$ in the cointegration model by the simple relation $\Pi_{VAR} = \Pi + I$, where $I$ denotes the identity matrix. The dynamic properties of VAR processes are determined by the eigenvalues of the matrix $\Pi_{VAR}$; for a discussion of these properties we refer to [Juselius, 2006]. The distance measures $d_{norm}$ and $d_{DP}$ are evaluated over a grid of eigenvalues of $\Pi_{VAR}$: $(\lambda_1, \lambda_2) \in [0,1] \times [0,1]$ with a distance of 0.05 between two adjacent points in both directions. Each point $(\lambda_1, \lambda_2)$ on the grid holds the arithmetic averages of the statistics $d_{norm}$ and $d_{DP}$, respectively, obtained for 100 different models, each simulated with (1.9). The models are determined by distinct $\Pi_{VAR}$ matrices, characterized however by a common pair of eigenvalues $\lambda_1$ and $\lambda_2$. The $\varepsilon$ in (1.9) are simulated from mixtures of normal distributions. The numbers of components in those mixtures range from 2 to 6 and are chosen randomly. The weights of the components in a mixture are drawn from a multinomial distribution. The mean and variance of every component in the mixture are drawn from a truncated normal and a gamma distribution, respectively. The truncation is imposed to obtain unimodal distributions: as we work with financial data we favour skewed and fat-tailed but unimodal distributions, which are common for financial returns, rather than mixtures of very distant components. The statistics $d_{norm}$ and $d_{DP}$ are computed both under linear and under orthogonal normalization of the parameter matrix. As the disturbances in the data generating processes are drawn from nonnormal multimodal distributions, we aim at investigating the ability of our method to estimate the model parameters and the predictive density under these nontrivial circumstances.

To understand the results we need to fix ideas regarding the interplay between the eigenvalues of the companion matrix and the rank of the matrix. When both eigenvalues of the companion matrix are smaller than 1, the time series are stationary, i.e. the rank $r$ equals the dimension of the system, $p = 2$. When one eigenvalue is equal to 1 and the other is

smaller than 1, the rank is 1, which means that the process has one unit root (cointegration). In the limiting case, when both eigenvalues equal 1, the bivariate time series consists of two independent random walks, i.e. $r = 0$. We know that the cointegration hypothesis can be formulated as a reduced rank restriction on the equation system parameters of a vector autoregressive model. This restriction must be incorporated in the prior knowledge on the parametric structure, as the dimensions of both $\alpha$ and $\beta$ in the cointegration model strictly depend on $r$. In the case of pairs trading we select time series that are cointegrated. Therefore in the simulation we estimate the models under the assumption that $r = 1$. The assumption $r = 1$ holds over the entire grid of eigenvalues, irrespective of the true rank implied by the pair of eigenvalues in a particular area of the grid. This shall shed some light on the performance of the estimation procedure when the rank is assumed incorrectly a priori. We try to find out whether the DP methodology is able to offset this mistake to some extent and lead to more accurate predictions than under normality.

In Figure 1 we present the outcome of the simulation experiment for both normalizations. The top panel corresponds to the $d_{norm}$ statistic and the middle one to $d_{DP}$. Most importantly, the bottom panel shows the difference between the two statistics, $d = d_{norm} - d_{DP}$. We plot the surfaces only over half of the domain as they are symmetric. Intuitively, if the accuracy of our estimators/predictors is high, then in the top and middle panels we shall obtain low distance statistics. The value 0 indicates that the true predictive density and the estimated one are identical. The higher the value of $d$ in the bottom panels, the more accurate is the predictive density obtained under the DP algorithm compared to normality. The top and middle panels in Figure 1 indicate that both algorithms are most accurate

when the true rank is 1 or close to it, i.e. when one of the eigenvalues is equal or close to 1. This corresponds to the darker region where $\lambda_1 = 1$. In that case the distance between the true density and the estimated ones is very low. This holds irrespective of the normalization and algorithm applied. The bottom panel indicates that DP tends to outperform the benchmark algorithm under normality over the entire grid, again irrespective of the normalization. Moreover, in the bottom right graph we observe a lighter shade over the interior than in the corresponding figure on the left. This indicates that we are able to obtain a substantial increase in the accuracy of prediction even when our assumption on the rank is incorrect. This might be useful for pairs of stocks that are close to cointegration, driven by dynamics governed by one eigenvalue close to 1 but not exactly equal to 1. Overall, the simulation implies that the DP algorithm can substantially increase the accuracy of the predictive density.
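A sketch of the data-generating scheme used in this experiment is given below: a random 2x2 companion matrix with prescribed eigenvalues, and disturbances from a random finite normal mixture. The truncation that the paper imposes on the component means and variances (to keep the mixtures unimodal) is omitted here for brevity.

```python
import numpy as np

def companion_with_eigs(lam1, lam2, rng):
    """Random Pi_VAR = Q diag(lam1, lam2) Q^{-1} with the prescribed eigenvalues."""
    Q = rng.normal(size=(2, 2))
    while abs(np.linalg.det(Q)) < 0.1:         # keep Q safely invertible
        Q = rng.normal(size=(2, 2))
    return Q @ np.diag([lam1, lam2]) @ np.linalg.inv(Q)

def mixture_noise(T, rng):
    """Disturbances from a random normal mixture with 2 to 6 components."""
    K = rng.integers(2, 7)
    w = rng.dirichlet(np.ones(K))              # random component weights
    mu = rng.normal(0.0, 0.5, size=K)
    V = rng.gamma(2.0, 0.5, size=K)
    k = rng.choice(K, size=T, p=w)
    return rng.normal(mu[k], np.sqrt(V[k]))

def simulate_var(lam1, lam2, T, rng):
    """One model simulated with (1.9): y_t = Pi_VAR y_{t-1} + eps_t."""
    Pi = companion_with_eigs(lam1, lam2, rng)
    eps = np.column_stack([mixture_noise(T, rng), mixture_noise(T, rng)])
    y = np.zeros((T, 2))
    for t in range(1, T):
        y[t] = Pi @ y[t - 1] + eps[t]
    return y

rng = np.random.default_rng(0)
y = simulate_var(1.0, 0.7, T=500, rng=rng)     # one unit root: cointegrated pair
```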

4 Statistical arbitrage

Figure 1: $d_{norm}$ (top panel), $d_{DP}$ (middle panel) and $d_{norm} - d_{DP}$ (bottom panel). Linear normalization (left panel) and orthogonal normalization (right panel) of the matrix $\beta$. Estimation under $r = 1$.

The word 'statistical' in the context of an investment approach indicates the speculative character of the investment strategy. It is based on the assumption that patterns observed in the past are going to be repeated in the future. This is in opposition to the fundamental investment strategy, which both explores and predicts the behaviour of the economic forces that influence share prices. Thus statistical arbitrage is a purely statistical approach designed to exploit equity market inefficiencies, defined as deviations from the long-term equilibrium across the stock prices observed in the past. To be more specific, we consider pairs trading, a form of statistical arbitrage designed for trading on the spread between two securities. The usual pairs trading strategies speculate on future convergence of the spread

between similar securities. Similarity concerns industry, sector, market capitalization and other exposures. However, a profitable strategy might also be constructed with stocks from different sectors, based purely on the statistical properties of the time series. Once a pair is identified, the customary rule is to buy one security and sell the other short according to the current estimate of the spread between them. This is identical to buying or selling short a spread depending on its sign. It attempts to create a market-neutral trading system that is able to make a profit in both upturns and downturns of the market. One of the main motivations for handling statistical arbitrage with advanced econometric methodology is the arbitrary form of computing the spread, whose quotation is based on a scaling of the stock prices. In the case of cointegration analysis the scaling is performed by the estimated cointegrating vector. Inaccuracy of the estimates might negatively impact the profitability of the trading strategy. The inaccuracy is mostly caused by an incorrect assumption about the distribution of returns, which influences both the estimates and the predictive density.

Typically the pairs trading algorithm must encompass two subalgorithms. The first fundamental building block of this methodology is a pairs selection algorithm, which is essentially based on cointegration testing. The objective of this phase is to identify pairs whose linear combination exhibits a significant predictable component that is uncorrelated with underlying movements in the market as a whole. Detecting the statistical fair-price relationships between the assets is identical to assuming that the deviations, or statistical mispricings, have a potentially predictable component. Cointegration analysis itself has a long tradition in investigating the interdependence between stock market prices. [Bessler and Yang, 2003] identify the long-run relations among the major stock market

indexes worldwide. [Kolari et al., 2004] detect cointegration relations between emerging and US stock markets. We work directly with individual securities. For the detected cointegrating relations, the second subalgorithm creates trading signals based on a predefined investment decision rule. The choice of rule is essential; however, the spectrum of possible decision rules is unlimited. We adopt the general classification introduced by [Burgess, 1999], who classifies the decision rules into implicit (ISA) and conditional (CSA) statistical arbitrage strategies. The implicit strategies are so called because the trading rules upon which they are based rely implicitly on the mean-reverting behaviour of the mispricing time series. They assume that the higher the current spread (mispricing) between two stocks, the higher the probability of reverting back to the equilibrium between their prices, and hence the larger the position opened. Thus the size of the open position is proportional to the current estimate of the spread. By contrast, the conditional statistical arbitrage rules are based on the prediction of the spread for the next period. This immediately introduces the need for a method to accurately predict the spread based on the long-run equilibrium among the stock prices. The ISA strategies might be interpreted as CSA with predicted spread zero. The simulation algorithms are applied to both the implicit and the conditional statistical arbitrage strategies. In the case of ISA we aim at increasing the accuracy of the parameter estimates, in order to compute the current spread more precisely. In the case of CSA we also explore the predictive density. We adhere to the simulation results, which indicate that the predictive densities from the DP algorithm tend to outperform the predictive accuracy of the M-H algorithm developed under normality. We are more focused on the CSA strategies as they explore the predictive ability of the algorithm that we have developed. We assume that the inadequacy of the spread prediction is mainly caused by the poor quality of the predictive density in the regular cointegration models.

Figure 2: Returns of components of the Dow Jones Composite Average for which cointegration relations have been detected. ALCOA Inc. (AA), Altria Group, Inc. (MO), CenterPoint Energy, Inc. (CNP), Duke Energy Corp. (DUK), International Business Machines Corp. (IBM), NiSource, Inc. (NI), Norfolk Southern Corp. (NSC), Overseas Shipholding Group, Inc. (OSG), Ryder System, Inc. (R), Union Pacific Corp. (UNP) and United Parcel Service, Inc. (UPS). The period covered is 1.01.2009-31.12.2009.

5 Implementation of pairs trading strategy

We work with the components of the Dow Jones Composite Average index, which tracks 65 prominent companies. Our target is to build a pairs trading strategy among the components of that index. We work with closing prices recorded over the period 01.01.2009-31.12.2009. In what follows we describe the general outline of the experiment in a nutshell. At first we test for cointegration pairwise, altogether (65 × 64)/2 = 2080 pairs. Among the


universe of pairs we select the ones with the strongest cointegrating relations, based on Bayes factors. In the second round we backtest the pairs trading strategies with these series.

5.1 Cointegration testing

We start with a concise discussion of the Bayesian approach to the cointegration test that we follow. A review of standard procedures to approximate Bayes factors (BF) appears in [Kass and Raftery, 1995], where the similarities between different methods are explored. [Kleibergen and Paap, 2002] show how to approximate BFs for the cointegration model under the encompassing prior. They derive a closed-form expression for the acceptance weights $w(\Sigma^i, \alpha^i, \lambda^i, \beta^i)$ in the M-H sampling algorithm. The BFs are approximated by the averages of these weights. Following [Chen, 1994] and [Kleibergen and Paap, 2002] we have

$\sqrt{N}\left(\dfrac{1}{N}\sum_{i=1}^{N} w(\Sigma^i, \alpha^i, \lambda^i, \beta^i) - c_r\, BF(r|n)\right) \Rightarrow N(0, v),$

where $i$ corresponds to a single draw of the M-H algorithm and $c_r$ denotes a scaling constant. The expression $BF(r|n)$ corresponds to the posterior odds of rank $r$ versus rank $n$. The formula for $w(\Sigma^i, \alpha^i, \lambda^i, \beta^i)$ is derived in [Kleibergen and Paap, 2002] under linear normalization; [Gatarek et al., 2010] derive the corresponding expression under orthogonal normalization. This result is presented in Appendix B.

To identify the cointegrating relations, for each pair of stocks $(a, b)$ we compute the Bayes factors $BF(0|2)_{ab}$ and $BF(1|2)_{ab}$ and construct the statistic $logBF_{ab} = \log BF(0|2)_{ab} - \log BF(1|2)_{ab}$. According to our test, a pair $(a, b)$ is found to be cointegrated, i.e. the system is of rank $r = 1$, when the following relations hold:

$\log BF(1|2)_{ab} > 0,$   (1.10)

$logBF_{ab} < 0,$   (1.11)

which is equivalent to

$Pr(r = 1 \mid \text{Data}) > Pr(r = 2 \mid \text{Data}), \quad Pr(r = 1 \mid \text{Data}) > Pr(r = 0 \mid \text{Data}).$

Conditions (1.10) and (1.11) say that there is a strong indication of cointegration between the time series $a$ and $b$ when $\log BF(1|2)_{ab}$ is higher than $\log BF(0|2)_{ab}$. The top panel of Figure 3 presents the surfaces corresponding to the statistics $\log BF(0|2)_{ab}$ and $\log BF(1|2)_{ab}$ computed for every pair of stocks. The calculation is performed under normality, with the initial 6 months of observations. The rank ordering shall remain unaltered irrespective of the applied normalization; see [Gatarek et al., 2010] for in-depth simulation results that confirm this hypothesis. In this particular application it is confirmed by the top panel of Figure 3, where both normalizations lead to a similar selection of cointegrated pairs. The differences are assumed to have their roots in wrong distributional assumptions with respect to the stock returns. We find that most pairs behave as independent random walks. However, irrespective of the abundant evidence in favour of random walk processes, we are still able to detect a few cointegrated pairs. The bottom panel of Figure 3 presents the pairs that are indicated by the Bayes factors to be strongly cointegrated. Among them we choose the 10 with the lowest statistics $logBF_{ab}$ under

both normalizations. Those pairs are selected to test the pairs trading strategies. The density functions of the returns on these prices are given in Figure 2. They are clearly nonnormal, which confirms the demand for advanced estimation techniques.
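In code, the decision rule (1.10)-(1.11) and the subsequent ranking amount to a few lines. The sketch below assumes the log Bayes factors log BF(0|2) and log BF(1|2) have already been computed for every pair by the M-H weight averaging described above; the names and array layout are illustrative.

```python
import numpy as np

def select_pairs(log_bf_02, log_bf_12, n_keep=10):
    """Keep pairs with log BF(1|2) > 0 and logBF = log BF(0|2) - log BF(1|2) < 0,
    i.e. Pr(r=1|Data) dominating both alternatives, ranked by lowest logBF."""
    log_bf = log_bf_02 - log_bf_12
    cointegrated = (log_bf_12 > 0) & (log_bf < 0)
    order = np.argsort(np.where(cointegrated, log_bf, np.inf))
    return order[:min(n_keep, int(cointegrated.sum()))]

# toy example: 2080 pairs with random log-BFs
rng = np.random.default_rng(0)
idx = select_pairs(rng.normal(-1, 2, 2080), rng.normal(0.5, 1, 2080))
print(len(idx), "pairs selected")
```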

5.2 Investment decision rules

The ISA rule does not make use of a prediction mechanism. It requires highly accurate model estimates in order to obtain a precise estimate of the current spread. According to the ISA trading rule, the desired statistical arbitrage holding (i.e. the number of spreads bought or shorted) is given by

$ISA(s_t, l) = -\mathrm{sign}(s_{t-1})\,|s_{t-1}|^l.$   (1.12)

The parameter $l$ might be interpreted as a leverage control. $l = 0$ indicates that only the direction of the current spread influences the position being opened, i.e. a one-spread-size position is opened with the appropriate sign, irrespective of the size of the current spread. $l = 1$ corresponds to linear propagation of the spread size: the size of the estimated spread is also taken into consideration. Values of $l$ higher than 1 correspond to specifications in which the holdings are nonlinear functions of the spread. To fix ideas regarding the timing convention, the index $t$ corresponds to a holding or price recorded after the closing of the session on day $t - 1$; it holds up to the next day's closing. Thus with $s_t$ we denote the spread observed after yesterday's close (at $t - 1$), held over the entire session today ($t$) and replaced only by the new spread calculated with today's closing price. By adjusting the portfolio at the close of the market we assume that the transaction is performed at the closing price of $t - 1$ without any impact on the price. The holding implied by those prices determines the return of the portfolio at $t$, i.e. under

Figure 3: Statistics $\log BF(1|2)_{ab}$ and $\log BF(0|2)_{ab}$ (top panel). Statistic $logBF_{ab} = \log BF(0|2)_{ab} - \log BF(1|2)_{ab}$ computed over the collection of pairs $(a, b)$ of stocks in the Dow Jones Composite Average index (middle panel); pairs that pass the cointegration test (bottom panel). Among them we select the pairs for testing the trading strategy. Linear normalization (left panel), orthogonal normalization (right panel).

today's closing price. The return made by ISA at $t$ is given by

$ISA_{ret}(s_t, y_{t,1}, y_{t,2}, l) = ISA(s_t, l)\,\Delta s_t - c\,|\Delta ISA(s_t, l)|\,(|\beta_{t,1}|\,y_{t,1} + |\beta_{t,2}|\,y_{t,2}),$   (1.13)

i.e. the current portfolio holding multiplied by the change in value of the portfolio, adjusted to account for transaction costs incurred as a result of portfolio rebalancing. The first term on the right-hand side corresponds to the change ($\Delta s_t$) in the value of the spread, scaled by the total size of the spread holdings (reflected by $ISA(s_t, l)$) in the mispricing portfolio. We follow the convention that the product of the asset prices and the absolute values of the cointegrating vector, $|\beta_1|\,y_{t,1} + |\beta_2|\,y_{t,2}$, is subject to transaction costs at $t$. The change in position at $t$, $|\Delta ISA(s_t, l)|$, multiplied by the percentage cost $c$, results in the transaction costs incurred at $t$, which lower or increase the profit or loss. We fix $c = 0.001$, which is a common choice for the level of transaction costs. The cointegrating vector $(\beta_1, \beta_2)'$ is either linearly or orthogonally normalized. With $ISA_{cum\,ret}(t)$ we denote the return accumulated up to $t$.

In contrast to the ISA strategy, the CSA strategy provides a mechanism for exploiting the predictive power of cointegration models. The CSA holding is defined as

$CSA(s_t, l) = \mathrm{sign}(E(\Delta s_t))\,|E(\Delta s_t)|^l.$   (1.14)

In this case $l$ determines the sensitivity of the trading position to the magnitude of the predicted spread. As for ISA, for the CSA strategy we assume $l = 0$ and $l = 1$. The return made by this rule is given by

$CSA_{ret}(s_t, y_{t,1}, y_{t,2}, l) = CSA(s_t, l)\,\Delta s_t - c\,|\Delta CSA(s_t, l)|\,(|\beta_{t,1}|\,y_{t,1} + |\beta_{t,2}|\,y_{t,2}),$   (1.15)

and the reasoning behind this formula is identical to that for the ISA return. $CSA_{cum\,ret}(t)$ denotes the

return cumulated up to $t$. For CSA we perform rolling-window predictions. In the first round we treat the initial 6 months of the sample as the learning sample; the remaining 6 months constitute the testing sample. The last observation of the learning sample is denoted by $T$. We predict the spread for the next day ($T + 1$) and on this basis the investment decision is made at $T$. We repeat this calculation for every observation of the testing sample, always rolling the learning sample forward by one day and computing the prediction for the next day's closing prices.
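The decision rules (1.12) and (1.14) and the return accounting (1.13)/(1.15) translate directly into code. The sketch below is a simplified backtest step under the stated timing convention; the spread path, prices and cointegrating vector are toy inputs, and the paper's time-varying $\beta_t$ is held fixed here for brevity.

```python
import numpy as np

def isa_holding(s_prev, l):
    """(1.12): bet against yesterday's observed spread."""
    return -np.sign(s_prev) * np.abs(s_prev) ** l

def csa_holding(exp_ds, l):
    """(1.14): bet on the predicted spread change E(ds_t)."""
    return np.sign(exp_ds) * np.abs(exp_ds) ** l

def strategy_returns(s, y1, y2, beta1, beta2, holdings, c=0.001):
    """(1.13)/(1.15): holding times spread change, minus cost c on the
    rebalanced notional |beta1|*y1 + |beta2|*y2."""
    ds = np.diff(s)
    dh = np.diff(holdings, prepend=0.0)        # position changes trigger costs
    notional = np.abs(beta1) * y1 + np.abs(beta2) * y2
    return holdings[1:] * ds - c * np.abs(dh[1:]) * notional[1:]

# toy mean-reverting spread and placeholder prices
rng = np.random.default_rng(0)
s = np.zeros(100)
for t in range(1, 100):
    s[t] = 0.9 * s[t - 1] + rng.normal(scale=0.1)
y1, y2 = 50 + rng.normal(size=100), 30 + rng.normal(size=100)
h = isa_holding(np.concatenate(([0.0], s[:-1])), l=1)   # holding at t uses s_{t-1}
print(strategy_returns(s, y1, y2, 1.0, -0.8, h).cumsum()[-1])
```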

5.3 Predictors

Prediction is performed via the conditional mean. However, as we aim at exploring the shape of the predictive density rather than adhering to point predictions, we condition on the quantiles of the predictive distribution. We open the position if the bounds of the confidence interval symmetric around the median are of like sign. The nominal probability $P$ covered by the intervals is equal to 0.2, 0.3, 0.4, 0.5 and 0.6, respectively. Under normality, we condition on the quantiles of the normal distribution with variance 1 and mean located at the level of the predicted spread. The general formula for these predictors, irrespective of the distribution, is given by

$\hat{s}^p_t = E(s_t \mid y_t)\, I\left\{\mathrm{sign}\left(\xi_{0.5 - \frac{P}{2}}\right) = \mathrm{sign}\left(\xi_{0.5 + \frac{P}{2}}\right)\right\},$   (1.16)

where $I$ denotes the indicator function, $\xi_{prob}$ the quantile of the predictive distribution for probability $prob$, and $P$ the nominal probability (the width of the confidence interval). Interval predictors are compared with point forecasts. Conditional and unconditional predictions are also performed under the t-distribution, where we extend the sampling algorithms in line with [Geweke, 1993]. It is interesting to investigate whether the Dirichlet Process Mixture is able to outperform the t-distribution, a standard tool for modeling the distribution of returns in the financial literature.
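Given simulated draws of the predictive distribution, the predictor (1.16) reduces to a quantile check; a minimal sketch, assuming the predictive draws are available as an array:

```python
import numpy as np

def interval_predictor(draws, P):
    """(1.16): keep the mean prediction only if the symmetric P-interval
    around the median excludes zero (both quantiles of like sign)."""
    lo, hi = np.quantile(draws, [0.5 - P / 2, 0.5 + P / 2])
    return draws.mean() if np.sign(lo) == np.sign(hi) else 0.0

rng = np.random.default_rng(0)
spread_draws = rng.normal(0.3, 1.0, size=5000)   # e.g. predictive draws of the spread
for P in (0.2, 0.3, 0.4, 0.5, 0.6):
    print(P, interval_predictor(spread_draws, P))
```

Wider intervals (larger $P$) make the same-sign condition harder to satisfy, so trades are accepted more rarely, which is exactly the risk trade-off discussed in Section 5.4.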

5.4 Prediction evaluation measures

Profitability. We evaluate the strategies with the ratio of cumulated income to the average daily capital engagement. This measure is chosen due to the specifics of the pairs trading strategy. As the daily capital engagement is determined by the product of the absolute values of the cointegrating vector and the prices of the stocks, it varies throughout the testing period. The investor is required to increase or decrease her exposure in terms of dollars during the testing period according to the current spread. Therefore the strategy cannot be evaluated as a percentage return on an initial capital involvement. Instead, the average engagement is computed in order to approximate the amount of capital required to finance the strategy on a daily basis:

$ISA_{profitability} = \dfrac{ISA_{cum\,ret}(T)}{\frac{1}{T}\sum_{t=1}^{T}(|\beta_{t,1}|\,y_{t,1} + |\beta_{t,2}|\,y_{t,2})},$   (1.17)

$CSA_{profitability} = \dfrac{CSA_{cum\,ret}(T)}{\frac{1}{T}\sum_{t=1}^{T}(|\beta_{t,1}|\,y_{t,1} + |\beta_{t,2}|\,y_{t,2})}.$   (1.18)

Table 1 presents these statistics for linear and orthogonal normalization. Detailed results, for every cointegrated pair of stocks, are presented in Appendix C.

Risk. A common definition of investment risk is deviation from an expected outcome. We follow this idea to evaluate the risk inherent in the strategies. We find that the risk is typically higher the narrower the confidence band that serves as the decisive force for our strategy (whether or not to open the position). If the band is wide we accept the trade very rarely, as it is unlikely that both bounds have an identical sign.

Table 1: Performance evaluation measures (in %): ISA_profitability and CSA_profitability. Average over 10 pairs of assets. For CSA both point and interval predictions are reported.

l = 0
                                     CSA
               ISA point   point    20%     30%     40%     50%     60%
linear
  normal         17.46     12.52   17.08   19.72   26.94   31.32   47.37
  t              16.61     19.87   24.23   30.11   32.9    38.74   61.03
  DP             25.44     30.03   29.99   46.71   35.22   22.97   65.18
orthogonal
  normal         51.26     57.1    64.8    63.61   76.56   83.49   112.2
  t              50.35     60.27   66.41   67.67   81.78   91.53   114.05
  DP             51.47     48.39   59.91   66.95   64.27   61.78   46.18

l = 1
                                     CSA
               ISA point   point    20%     30%     40%     50%     60%
linear
  normal         15.42      4.8     5.73    6.07    6.98    8.82   11.89
  t              16.82      7.81    9.77   12.57   13.53   15.32   20.34
  DP             21.76     20.28   22.59   31.61   16.62   14.09   29.25
orthogonal
  normal         46.11     67.78   83.03   90.24  104.62  118.09  160.68
  t              46.32     68.38   83.75   91.37  106.04  113.93  155.67
  DP             46.17     66.71   90     114.16  121.22  102.66   90.34

Risk is measured by the fraction of off-the-mark trades among all executed trades in the testing period. There are two potential sources of trade failure: a wrong direction of the prediction, or a profit lower than the cost of the trade.

$ISA_{risk} = \dfrac{\#\left(ISA_{cum\,ret}(t+1) < ISA_{cum\,ret}(t)\right)}{\#\left(ISA_{cum\,ret}(t+1) \neq ISA_{cum\,ret}(t)\right)},$   (1.19)

$CSA_{risk} = \dfrac{\#\left(CSA_{cum\,ret}(t+1) < CSA_{cum\,ret}(t)\right)}{\#\left(CSA_{cum\,ret}(t+1) \neq CSA_{cum\,ret}(t)\right)}.$   (1.20)
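Read off the path of cumulative returns, (1.19)-(1.20) count losing trades among executed trades; a minimal sketch:

```python
import numpy as np

def risk_measure(cum_ret):
    """(1.19)-(1.20): fraction of losing trades among all executed trades."""
    d = np.diff(cum_ret)
    executed = d != 0                  # a flat step means no trade was executed
    return (d < 0).sum() / max(int(executed.sum()), 1)
```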

In order to estimate risk we use the paths of cumulative return. When a trade at $t$ impacts the path negatively, the cumulative return decreases at $t$, which contributes to perceiving the strategy as risky. If the cumulative return grows or remains steady over most periods, the strategy is of low risk: the signals generated by the trading rule are accurate and yield profit. Table 2 presents these statistics for both normalizations. Detailed results, for

every cointegrated pair of stocks, are presented in Appendix C.

Table 2: Risk evaluation measures: ISA_risk and CSA_risk. Average over 10 pairs of assets. For CSA both point and interval predictions are reported.

l = 0
                                     CSA
               ISA point   point    20%     30%     40%     50%     60%
linear
  normal         0.45      0.45    0.43    0.42    0.42    0.42    0.35
  t              0.46      0.44    0.42    0.41    0.41    0.40    0.32
  DP             0.43      0.43    0.42    0.34    0.35    0.53    0.42
orthogonal
  normal         0.44      0.40    0.40    0.42    0.41    0.40    0.34
  t              0.44      0.39    0.40    0.40    0.39    0.36    0.33
  DP             0.44      0.45    0.46    0.48    0.51    0.50    0.58

l = 1
                                     CSA
               ISA point   point    20%     30%     40%     50%     60%
linear
  normal         0.47      0.51    0.48    0.45    0.45    0.44    0.35
  t              0.48      0.50    0.47    0.46    0.46    0.44    0.34
  DP             0.44      0.47    0.43    0.35    0.36    0.53    0.42
orthogonal
  normal         0.45      0.48    0.46    0.48    0.46    0.45    0.37
  t              0.46      0.47    0.47    0.45    0.42    0.39    0.35
  DP             0.45      0.51    0.48    0.50    0.52    0.52    0.61

6 Conclusions

Pairs trading is based on the spread between stock prices, $s_t = \beta_{t,1} y_{t,1} + \beta_{t,2} y_{t,2}$. In the case of linear normalization we assume that $s_t/\beta_{t,1} = y_{t,1} - \beta_2^* y_{t,2}$, where $\beta_2^* = \beta_2/\beta_1$. With time-varying parameters the linear normalization might result in under- or overestimation of the spread $s_t/\beta_{t,1}$, according to the true value of $\beta_{t,1}$, which is unobserved because of the identifying restriction $\beta_2^* = \beta_2/\beta_1$. This hypothesis is confirmed by the empirical findings. Table 1 presents the average profitability of the strategies calculated over the 10 selected pairs. We observe that the orthogonal normalization substantially outperforms the linear normalization. This is particularly pronounced for $l = 1$. The DP algorithm clearly outperforms the M-H algorithm in the case of linear normalization. For orthogonal normalization the results are inconclusive with respect to the superiority of any distributional assumption. It seems that the normalization is a key factor in the analysis.

The analysis of the risk evaluation measures again indicates a clear dichotomy between the linear and orthogonal normalizations. Under linear normalization the predictive density from the DP algorithm typically outperforms the prediction under normality when $P$ equals 30% or 40%. In those cases the strategies based on the DP algorithm are both less risky and more profitable than their counterparts under normality or the t-distribution. This implies that intervals of this size are able to correctly predict the direction of the spread. Under orthogonal normalization the results are similar across the strategies. They are again inconclusive with respect to the distributional assumptions under these circumstances.


References

Antoniak, C.E. (1974), Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems, Annals of Statistics, 2, 1152-1174.

Berger, J.O., Pericchi, L.R. (1996), The intrinsic Bayes factors for model selection and prediction, Journal of the American Statistical Association, 91, 109-122.

Bernardo, J.M., Smith, A.F.M. (1994), Bayesian Theory, Wiley, New York.

Bessler, D.A., Yang, J. (2003), The structure of interdependence in international stock markets, Journal of International Money and Finance, 22, 261-287.

Box, G.E.P., Tiao, G.C. (1973), Bayesian Inference in Statistical Analysis, Addison-Wesley Publishing Company.

Burgess, A.N. (1999), A computational methodology for modelling the dynamics of statistical arbitrage, PhD thesis, London Business School.

Campolieti, M. (2001), Bayesian semiparametric estimation of discrete duration models: an application of the Dirichlet process prior, Journal of Applied Econometrics, 16, 1-22.

Casella, G., George, E.I. (1992), Explaining the Gibbs sampler, The American Statistician, 46(3), 167-174.

Chen, M.H. (1994), Importance-weighted marginal Bayesian posterior density estimation, Journal of the American Statistical Association, 89, 818-824.

Chen, M., Shao, Q., Ibrahim, J. (2000), Monte Carlo Methods in Bayesian Computation, Springer-Verlag, New York.

Chib, S., Greenberg, E. (1995), Understanding the Metropolis-Hastings algorithm, The American Statistician, 49, 327-335.

Escobar, M.D. (1994), Estimating normal means with a Dirichlet process prior, Journal of the American Statistical Association, 89, 268-277.

Escobar, M.D., West, M. (1995), Bayesian density estimation and inference using mixtures, Journal of the American Statistical Association, 90, 577-588.

Ferguson, T.S. (1973), A Bayesian analysis of some nonparametric problems, Annals of Statistics, 1(2), 209-230.

Ferguson, T.S. (1974), Prior distributions on spaces of probability measures, Annals of Statistics, 2, 615-629.

Gatarek, L., Hoogerheide, L.F., Kleijn, R., Van Dijk, H.K. (2010), Prior ignorance, normalization and reduced rank probabilities in cointegration models, unpublished working paper.

Gatev, E., Goetzmann, W.N., Rouwenhorst, K.G. (2006), Pairs trading: performance of a relative-value arbitrage rule, The Review of Financial Studies, 19, 797-827.

Gelfand, A.E., Smith, A.F.M. (1990), Sampling-based approaches to calculating marginal densities, Journal of the American Statistical Association, 85, 398-409.

Geweke, J. (1993), Bayesian treatment of the independent Student-t linear model, Journal of Applied Econometrics, 8, 19-40.

Geweke, J. (1996), Bayesian reduced rank regression in econometrics, Journal of Econometrics, 75, 121-146.

Golub, G.H., Van Loan, C.F. (1989), Matrix Computations, The Johns Hopkins University Press, Baltimore.

Guidolin, M., Hyde, S., McMillan, D., Ono, S. (2009), Non-linear predictability in stock and bond returns: when and where is it exploitable?, International Journal of Forecasting, 25, 373-399.

Jeffreys, H. (1939), Theory of Probability, Oxford University Press.

Juselius, K. (2006), The Cointegrated VAR Model: Methodology and Applications, Oxford University Press, Oxford.

Kass, R.E., Raftery, A.E. (1995), Bayes factors, Journal of the American Statistical Association, 90(430), 773-795.

Khadani, A.E., Lo, A.W. (2007), What happened to the quants in August 2007?, Working Paper, Massachusetts Institute of Technology, Sloan School of Management, Cambridge.

Kleibergen, F.R., Van Dijk, H.K. (1994), On the shape of the likelihood/posterior in cointegration models, Econometric Theory, 10, 514-551.

Kleibergen, F.R., Van Dijk, H.K. (1998), Bayesian simultaneous equation analysis using reduced rank structures, Econometric Theory, 14, 701-743.

Kleibergen, F.R., Paap, R. (2002), Priors, posteriors and Bayes factors for a Bayesian analysis of cointegration, Journal of Econometrics, 111, 223-249.

Kleibergen, F.R., Paap, R. (2004), Generalized reduced rank tests using singular value decomposition, Journal of Econometrics, 133(1), 97-126.

Kolari, J.W., Sutanto, P.W., Yang, J. (2004), On the stability of long-run relationships between emerging and US stock markets, Journal of Multinational Financial Management, 14(3), 233-248.

Koop, G., Strachan, R.W., Van Dijk, H.K., Villani, M. (2005), Bayesian approaches to cointegration, EI 2005-13, Econometric Institute Report, Erasmus University Rotterdam.

Lehmann, B. (1990), Fads, martingales and market efficiency, Quarterly Journal of Economics, 105, 1-28.

Lo, A.W. (1991), Long-term memory in stock market prices, Econometrica, 59, 1279-1313.

Lo, A.W., MacKinlay, A.C. (1988), Stock market prices do not follow random walks: evidence from a simple specification test, Review of Financial Studies, 1, 41-66.

Lo, A.W., MacKinlay, A.C. (1990), When are contrarian profits due to stock market overreaction?, Review of Financial Studies, 3, 175-206.

Lo, A.W., MacKinlay, A.C. (1992), Maximizing predictability in the stock and bond markets, Working Paper No. 3450-92-EFA, Sloan School of Management, MIT.

Magnus, J.R., Neudecker, H. (1999), Matrix Differential Calculus with Applications in Statistics and Econometrics, Revised Edition, Wiley, Chichester.

Mardia, K.V., Jupp, P.E. (2000), Directional Statistics (2nd edition), Wiley, New York.

Strachan, R.W. (2003), Valid Bayesian estimation of the cointegrating error correction model, Journal of Business and Economic Statistics, 21(1), 185-195.

Sung-Hyuk, C. (2007), Comprehensive survey on distance/similarity measures between probability density functions, International Journal of Mathematical Models and Methods in Applied Sciences, 4(1), 300-307.

Verdinelli, I., Wasserman, L. (1995), Computing Bayes factors using a generalization of the Savage-Dickey density ratio, Journal of the American Statistical Association, 90, 614-618.

Zellner, A. (1971), An Introduction to Bayesian Inference in Econometrics, Wiley, New York.

A Posterior simulator under encompassing prior

Consider the cointegration model

$Y_t = \Pi Y_{t-1} + \varepsilon_t, \quad \varepsilon_t \sim N(0, \Sigma).$   (1.21)

The encompassing prior assumes that the rank restriction on $\Pi$ can be expressed explicitly using the following decomposition:

$\Pi = \beta\alpha + \beta_\perp \lambda \alpha_\perp.$   (1.22)

Not only the prior but also the posterior of $\alpha, \beta, \lambda \mid \Sigma, Y$ satisfies the transformation of random variables defined by (1.22), such that

$p(\alpha, \beta, \lambda \mid \Sigma, Y) = p(\Pi \mid \Sigma, Y)\big|_{\Pi = \beta\alpha + \beta_\perp\lambda\alpha_\perp}\, |J(\Pi, (\alpha, \beta, \lambda))|,$   (1.23)

where $J(\Pi, (\alpha, \beta, \lambda))$ denotes the Jacobian of the transformation from the space of $\Pi$ to $(\alpha, \beta, \lambda)$. The conditional posterior of $\alpha, \beta \mid \lambda, \Sigma, Y$, which is proportional to it, can be evaluated at $\lambda = 0$ to obtain the posterior of $\alpha, \beta \mid \Sigma, Y$:

$p(\alpha, \beta \mid \Sigma, Y) = p(\alpha, \beta \mid \lambda, \Sigma, Y)\big|_{\lambda=0} \propto p(\alpha, \beta, \lambda \mid \Sigma, Y)\big|_{\lambda=0} = p(\Pi \mid \Sigma, Y)\big|_{\Pi=\beta\alpha}\, |J(\Pi, (\alpha, \beta, \lambda))|\big|_{\lambda=0}.$   (1.24)

Hence, we can consider the rank reduction as the parameter realization $\lambda = 0$.

Sampling algorithm based on a diffuse prior on $\Pi$ and nesting

Straightforwardly using a Gibbs sampler by simulating $\alpha$ and $\beta$ from the full conditional posteriors is not possible due to their difficult dependence structure, see [Kleibergen and Van Dijk, 1994].


We specify a diffuse prior on the parameter $\Pi$. Conditional on $\Sigma$, $\Pi$ has a matricvariate normal posterior distribution. The marginal posterior of $\Sigma$ is an inverted Wishart distribution with $T - 1$ degrees of freedom. See [Zellner, 1971] for a discussion of Bayesian analysis in the linear model. The decomposition in (1.22) allows us to obtain a (joint) draw of $\alpha$ and $\beta$ (and $\lambda$) from a draw of $\Pi$. The dependencies between the full conditionals of $\alpha$ and $\beta$ are avoided by determining $\alpha$ and $\beta$ simultaneously. This poses the problem that our posterior of interest, $p(\alpha, \beta \mid \Sigma, Y)$, does not involve $\lambda$ while it is sampled. [Kleibergen and Paap, 2002] adopt the approach suggested by [Chen, 1994]. For simulating from the posterior $p(\alpha, \beta \mid \Sigma, Y)$ it is first extended with an artificial extra parameter $\lambda$, whose density we denote by $g(\lambda \mid \alpha, \beta, \Sigma, Y)$. We use a Metropolis-Hastings (M-H) sampling algorithm, see e.g. [Chib and Greenberg, 1995], for simulating from the joint density

$p_g(\alpha, \beta, \lambda, \Sigma \mid Y) = g(\lambda \mid \alpha, \beta, \Sigma, Y)\, p(\alpha, \beta, \Sigma \mid Y).$   (1.25)

The posterior $p(\alpha, \beta, \lambda \mid \Sigma, Y)$ from (1.23) is used as the candidate generating density. When $p_g(\alpha, \beta, \lambda, \Sigma \mid Y)$ is marginalized with respect to $\lambda$ in order to remove the artificial parameter $\lambda$, the resulting distribution is $p(\alpha, \beta, \Sigma \mid Y)$. The simulated values of $\alpha, \beta, \Sigma$ (discarding $\lambda$) therefore are a sample from $p(\alpha, \beta, \Sigma \mid Y)$. The choice of $g(\lambda \mid \alpha, \beta, \Sigma, Y)$ leads to the weight function $w(\alpha, \beta, \lambda, \Sigma)$ for use in the M-H algorithm. The acceptance probability in the M-H algorithm depends on a weight function


which is the ratio of the target density (1.25) and the candidate generating density (1.23):

$w(\alpha, \beta, \lambda, \Sigma) = \dfrac{p_g(\alpha, \beta, \lambda, \Sigma \mid Y)}{p(\alpha, \beta, \lambda, \Sigma \mid Y)} = \dfrac{g(\lambda \mid \alpha, \beta, \Sigma, Y)\, p(\alpha, \beta \mid \Sigma, Y)}{p(\alpha, \beta, \lambda \mid \Sigma, Y)} = \dfrac{g(\lambda \mid \alpha, \beta, \Sigma, Y)\, \exp\!\big(-(\beta\alpha - \hat\Pi)'(\beta\alpha - \hat\Pi)\big)\, |J|\big|_{\lambda=0}}{\exp\!\big(-(\beta\alpha + \beta_\perp\lambda\alpha_\perp - \hat\Pi)'(\beta\alpha + \beta_\perp\lambda\alpha_\perp - \hat\Pi)\big)\, |J|},$   (1.26)

where $\hat\Pi$ is the OLS estimator. The exponentiated trace expressions in the numerator and denominator are related to each other by

$(\beta\alpha + \beta_\perp\lambda\alpha_\perp - \hat\Pi)'(\beta\alpha + \beta_\perp\lambda\alpha_\perp - \hat\Pi) = (\beta\alpha - \hat\Pi)'(\beta\alpha - \hat\Pi) + (\lambda - \beta_\perp'\hat\Pi\alpha_\perp')'(\lambda - \beta_\perp'\hat\Pi\alpha_\perp') - (\beta_\perp'\hat\Pi\alpha_\perp')'(\beta_\perp'\hat\Pi\alpha_\perp') = (\beta\alpha - \hat\Pi)'(\beta\alpha - \hat\Pi) + (\lambda - \tilde\lambda)'(\lambda - \tilde\lambda) - \tilde\lambda'\tilde\lambda,$   (1.27)

where $\tilde\lambda = \beta_\perp'\hat\Pi\alpha_\perp'$. A sensible choice for the density function $g(\lambda \mid \alpha, \beta, \Sigma, Y)$ thus turns out to be

$g(\lambda \mid \alpha, \beta, \Sigma, Y) \propto \exp\!\big(-(\lambda - \tilde\lambda)'(\lambda - \tilde\lambda)\big).$   (1.28)

Using this choice of $g(\lambda \mid \alpha, \beta, \Sigma, Y)$ the weight function reduces to

$w(\alpha, \beta, \lambda, \Sigma) \propto \exp\!\big(-\tilde\lambda'\tilde\lambda\big)\, \dfrac{|J(\Pi, (\alpha, \beta, \lambda))|\big|_{\lambda=0}}{|J(\Pi, (\alpha, \beta, \lambda))|}.$   (1.29)

For the unrestricted Jacobian determinant $|J(\Pi, (\alpha, \beta, \lambda))|$ we refer to the appendix of [Kleibergen and Paap, 2002]; [Gatarek et al., 2010] derive some convenient computational reductions for this expression. The steps required in the sampling algorithm are:

1. Draw $\Sigma^{i+1}$ from $p(\Sigma \mid Y)$.

2. Draw $\Pi^{i+1}$ from $p(\Pi \mid \Sigma, Y)$.

3. Compute $\alpha^{i+1}$, $\beta^{i+1}$, $\lambda^{i+1}$ from $\Pi^{i+1}$ using the singular value decomposition.

4. Accept $\Sigma^{i+1}$, $\alpha^{i+1}$ and $\beta^{i+1}$ with probability $\min\left\{\dfrac{w(\alpha^{i+1}, \beta^{i+1}, \lambda^{i+1}, \Sigma^{i+1})}{w(\alpha^i, \beta^i, \lambda^i, \Sigma^i)},\, 1\right\}.$
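A schematic sketch of this accept/reject loop is given below; draw_sigma, draw_pi and weight are assumed callables implementing $p(\Sigma \mid Y)$, $p(\Pi \mid \Sigma, Y)$ and (1.29), and the SVD split follows the convention of Appendix B ($\beta = U_1$, $\alpha = S_1 V_1'$, $\lambda = S_2$).

```python
import numpy as np

def svd_split(Pi, r):
    """Decompose Pi as in (1.30): beta = U1, alpha = S1 V1', lambda = S2."""
    U, S, Vt = np.linalg.svd(Pi)
    return np.diag(S[:r]) @ Vt[:r], U[:, :r], np.diag(S[r:])   # alpha, beta, lam

def mh_sampler(draw_sigma, draw_pi, weight, r, n_iter, rng):
    """M-H loop of Appendix A; candidates come from the unrestricted posterior."""
    state, w_old, draws = None, None, []
    for _ in range(n_iter):
        Sigma = draw_sigma()
        alpha, beta, lam = svd_split(draw_pi(Sigma), r)
        w_new = weight(alpha, beta, lam, Sigma)
        if w_old is None or rng.uniform() < min(w_new / w_old, 1.0):
            state, w_old = (alpha, beta, Sigma), w_new          # accept candidate
        draws.append(state)                                     # else keep old state
    return draws
```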



B Approximation of M-H weights under orthogonal normalization

In order to evaluate the M-H weights under orthogonal normalization we derive a closed-form expression for the Jacobian $J(\Pi, (\alpha, \beta, \lambda))$. All matrices involved in decomposition (1.22) can be computed from $\Pi$ using the singular value decomposition

$\Pi = USV' = \begin{pmatrix} U_1 & U_2 \end{pmatrix} \begin{pmatrix} S_1 & 0 \\ 0 & S_2 \end{pmatrix} \begin{pmatrix} V_1' \\ V_2' \end{pmatrix},$   (1.30)

where $U = (U_1 \;\; U_2)$ and $V = (V_1 \;\; V_2)$ are orthonormal matrices, $U_1$ and $V_1$ are $p \times r$, $U_2$ and $V_2$ are $p \times (p-r)$, and $S_1$ and $S_2$ are diagonal $r \times r$ and $(p-r) \times (p-r)$ matrices. Then for orthogonal normalization the following relations hold: $\beta = U_1$, $\alpha = S_1 V_1'$, $\beta_\perp = U_2$, $\alpha_\perp = V_2'$, and $\lambda = S_2$. $\beta_1$ is identified uniquely by $\beta_2$, because of the normalization $\beta'\beta = I_r$. Therefore it suffices to derive $J(\Pi, (\alpha, \beta_2, \lambda))$:

$J(\Pi, (\beta_2, \alpha, \lambda)) = \begin{pmatrix} \dfrac{\partial\, \mathrm{vec}\,\Pi}{\partial(\mathrm{vec}\,\beta_2)'} & \dfrac{\partial\, \mathrm{vec}\,\Pi}{\partial(\mathrm{vec}\,\alpha)'} & \dfrac{\partial\, \mathrm{vec}\,\Pi}{\partial(\mathrm{vec}\,\lambda)'} \end{pmatrix}.$   (1.31)

The expression for $\dfrac{\partial\, \mathrm{vec}\,\Pi}{\partial(\mathrm{vec}\,\alpha)'}$ is given by

$\dfrac{\partial\, \mathrm{vec}\,\Pi}{\partial(\mathrm{vec}\,\alpha)'} = (I_p \otimes \beta) + \dfrac{\partial\, \mathrm{vec}\,\Pi}{\partial(\mathrm{vec}\,\alpha_\perp)'}\, \dfrac{\partial\, \mathrm{vec}\,\alpha_\perp}{\partial(\mathrm{vec}\,\alpha)'}$   (1.32)

and

$\dfrac{\partial\, \mathrm{vec}\,\Pi}{\partial(\mathrm{vec}\,\alpha_\perp)'} = (I \otimes \beta_\perp\lambda).$   (1.33)

If we assume $c = \begin{pmatrix} I_r \\ 0 \end{pmatrix}$ and $c_\perp = \begin{pmatrix} 0 \\ I_{p-r} \end{pmatrix}$, we have $\alpha_\perp = c_\perp'\big(I_p - \alpha'((\alpha c)')^{-1} c'\big)$ and

$\dfrac{\partial\, \mathrm{vec}\,\alpha_\perp}{\partial(\mathrm{vec}\,\alpha)'} = \Big(c(\alpha c)^{-1} \otimes \big(c_\perp'\alpha'((\alpha c)')^{-1}c'\big) - c(\alpha c)^{-1} \otimes c_\perp'\Big) K_{r,p},$   (1.34)

so that

$\dfrac{\partial\, \mathrm{vec}\,\Pi}{\partial(\mathrm{vec}\,\alpha)'} = (I_p \otimes \beta) + \Big(c(\alpha c)^{-1} \otimes \beta_\perp\lambda c_\perp'\big((c(\alpha c)^{-1}\alpha)' - I_p\big)\Big) K_{r,p}.$   (1.35)

Then for $\beta_2$ we obtain

$\dfrac{\partial\, \mathrm{vec}\,\Pi}{\partial(\mathrm{vec}\,\beta_2)'} = \dfrac{\partial\, \mathrm{vec}\,\Pi}{\partial(\mathrm{vec}\,\beta)'}\, \dfrac{\partial\, \mathrm{vec}\,\beta}{\partial(\mathrm{vec}\,\beta_1)'}\, \dfrac{\partial\, \mathrm{vec}\,\beta_1}{\partial(\mathrm{vec}\,\beta_2)'} + \dfrac{\partial\, \mathrm{vec}\,\Pi}{\partial(\mathrm{vec}\,\beta_\perp)'}\, \dfrac{\partial\, \mathrm{vec}\,\beta_\perp}{\partial(\mathrm{vec}\,\beta)'}\, \dfrac{\partial\, \mathrm{vec}\,\beta}{\partial(\mathrm{vec}\,\beta_1)'}\, \dfrac{\partial\, \mathrm{vec}\,\beta_1}{\partial(\mathrm{vec}\,\beta_2)'}$

$= (\alpha' \otimes I_n)\left(I_r \otimes \begin{pmatrix} I_r \\ 0_{(n-r)\times r} \end{pmatrix}\right) \dfrac{\partial\, \mathrm{vec}\,\beta_1}{\partial(\mathrm{vec}\,\beta_2)'} + (\alpha_\perp'\lambda' \otimes I_n)\, \dfrac{\partial\, \mathrm{vec}\,\beta_\perp}{\partial(\mathrm{vec}\,\beta)'} \left(I_r \otimes \begin{pmatrix} I_r \\ 0_{(n-r)\times r} \end{pmatrix}\right) \dfrac{\partial\, \mathrm{vec}\,\beta_1}{\partial(\mathrm{vec}\,\beta_2)'}.$

∂ vec β1 ∂(vec β2 )0

  Ir ⊗   

Ir 0(n−r)×r

 ∂ vec β1   ∂(vec β2 )0

is derived based on the orthogonal normalization condition. Ir = β10 β1 + β20 β2 0 = d(β10 β1 ) + d(β20 β2 )

0 = (β1 ⊗ Ir )d vec β10 + (Ir ⊗ β10 )d vec β1 +(β2 ⊗ Ir )d vec β20 + (Ir ⊗ β20 )d vec β2 As Kr =0 , see [Magnus and Neudecker, 1999] p.47, we have 0 =Kr (Ir ⊗ β10 )d vec β1 + (Ir ⊗ β10 )d vec β1 + Kr (Ir ⊗ β20 )d vec β2 + (Ir ⊗ β20 )d vec β2   =(Kr + Ir ) (Ir ⊗ β10 )d vec β1 + (Kr + Ir ) (Ir ⊗ β20 )d vec β2   =2Nr (Ir ⊗ β10 )d vec β1 + 2Nr (Ir ⊗ β20 )d vec β2 , (1.36) 38

where Nr = 12 (Ir + Kr ). Thus we obtain 0 = 2Nr (Ir ⊗ β10 )d vec β1 + 2Nr (Ir ⊗ β20 )d vec β2

(1.37)

−1  ∂ vec β1 = − Nr (Ir ⊗ β10 ) Nr (Ir ⊗ β20 ) . 0 ∂(vec β2 )

(1.38)

and

Further, because $\beta = U_1$ and $\beta_\perp = U_2$, we can derive $\dfrac{\partial\, \mathrm{vec}\,\beta_\perp}{\partial(\mathrm{vec}\,\beta)'}$ based on the orthomorphic transformation between $U = (U_1 \;\; U_2)$ and $\tilde X$, where $\tilde X = (I + U)^{-1}(I - U)$ and $U = (I + \tilde X)^{-1}(I - \tilde X)$. We find that

$\dfrac{\partial\, \mathrm{vec}\,\beta_\perp}{\partial(\mathrm{vec}\,\beta)'} = \dfrac{\partial\, \mathrm{vec}\,\beta_\perp}{\partial(\mathrm{vec}\,\tilde X)'}\, \dfrac{\partial\, \mathrm{vec}\,\tilde X}{\partial(\mathrm{vec}\,\beta)'} = \left(-\left(I \otimes \begin{pmatrix} 0 \\ I_{p-r} \end{pmatrix}'\right)\big((I_p + U)' \otimes (I_p + \tilde X)^{-1}\big)\right) \times \left(-\left(I \otimes \begin{pmatrix} I_r \\ 0 \end{pmatrix}'\right)\big((I_p + \tilde X)' \otimes (I_p + U)^{-1}\big)\right),$   (1.39)

where $U_1 = U\begin{pmatrix} I_r \\ 0 \end{pmatrix}$, $U_2 = U\begin{pmatrix} 0 \\ I_{p-r} \end{pmatrix}$, and $\mathrm{d}\big((I + U)^{-1}(I - U)\big) = -(I + U)^{-1}\,\mathrm{d}U\,(I + U)^{-1}(I - U) - (I + U)^{-1}\,\mathrm{d}U$.

we obtain

∂vec Π/∂(vec λ)′ = α⊥′ ⊗ β⊥.    (1.40)
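Expression (1.40) follows from the identity vec(AXB) = (B′ ⊗ A) vec X and provides a convenient sanity check for Jacobian code. The finite-difference verification below is our sketch, with a random Π purely for illustration; since Π is linear in λ, the check is exact up to rounding.

    import numpy as np

    rng = np.random.default_rng(2)
    p, r = 4, 2
    U, s, Vt = np.linalg.svd(rng.standard_normal((p, p)))
    beta, beta_perp = U[:, :r], U[:, r:]
    alpha, alpha_perp = np.diag(s[:r]) @ Vt[:r, :], Vt[r:, :]
    lam = np.diag(s[r:])

    vec = lambda A: A.reshape(-1, order="F")
    m, eps = p - r, 1e-6

    # finite-difference columns of d vec(Pi) / d (vec lambda)'
    base = vec(beta @ alpha + beta_perp @ lam @ alpha_perp)
    J_fd = np.zeros((p * p, m * m))
    for k in range(m * m):
        dlam = np.zeros(m * m)
        dlam[k] = eps
        lam_k = lam + dlam.reshape(m, m, order="F")
        J_fd[:, k] = (vec(beta @ alpha + beta_perp @ lam_k @ alpha_perp) - base) / eps

    # analytic block (1.40): Kronecker product of alpha_perp' and beta_perp
    assert np.allclose(J_fd, np.kron(alpha_perp.T, beta_perp), atol=1e-4)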

C    Tables

Table 3: Performance evaluation measure under linear normalization of the parameter matrix β. Ratio of cumulative income to the average daily capital absorption (in %).

[Table body lost in extraction. For each of the ten pairs (AA-OSG, DUK-IBM, DUK-OSG, NI-NSC, NI-OSG, CNP-OSG, MO-UPS, NI-R, NI-UNP, NI-UTX) the table reports the performance measure under the ISA strategy (point prediction) and the CSA strategy (point prediction and the 20%, 30%, 40%, 50% and 60% columns), for l = 0 and l = 1, in three panels: Normal disturbances, t-distribution and DP priors. The individual cell values could not be realigned reliably from the extracted text and are omitted here.]

Table 4: Performance evaluation measure under orthogonal normalization of the parameter matrix β. Ratio of cumulative income to the average daily capital absorption (in %).

[Table body lost in extraction; same layout as Table 3 (ten pairs; ISA and CSA strategies at point and 20%-60% columns; l = 0 and l = 1; panels for Normal, t-distribution and DP priors). The cell values could not be realigned reliably and are omitted here.]

Table 5: Risk evaluation measure under linear normalization of the parameter matrix β. Fraction of failed predictions.

[Table body lost in extraction; same layout as Table 3, reporting the fraction of failed predictions. The cell values could not be realigned reliably and are omitted here.]

Table 6: Risk evaluation measure under orthogonal normalization of the parameter matrix β. Fraction of failed predictions.

[Table body lost in extraction; same layout as Table 3, reporting the fraction of failed predictions. The cell values could not be realigned reliably and are omitted here.]
