Data stream mining for market-neutral algorithmic trading

Giovanni Montana
Department of Mathematics, Statistics Section, Imperial College London, London SW7 2AZ, UK
[email protected]

Kostas Triantafyllopoulos
Department of Probability and Statistics, University of Sheffield, Sheffield S3 7RH, UK
[email protected]

Theodoros Tsagaris*
Department of Mathematics, Statistics Section, Imperial College London, London SW7 2AZ, UK
theodoros.tsagaris@imperial.ac.uk

* The author is also affiliated with BlueCrest Capital Management. The views presented here reflect solely the author's opinion.

ABSTRACT

In algorithmic trading applications, a large number of co-evolving financial data streams are observed and analyzed. A recurrent and important task is to determine how a given stream depends on others, over time, accounting for dynamic dependence patterns and without imposing any probabilistic law governing this dependence. We demonstrate how Flexible Least Squares (FLS), a penalized version of ordinary least squares that accommodates dynamic regression coefficients, can be deployed successfully in this context. We describe a market-neutral algorithmic trading system based on a combined use of on-line feature extraction and recursive regression. The system is shown to perform successfully when trading the S&P 500 Futures Index.

Categories and Subject Descriptors G.3 [Mathematics of Computing]: Probability and Statistics; I.5.4 [Pattern Recognition]: Applications; J.1 [Computer Applications]: Administrative Data Processing

General Terms Algorithms

Keywords Temporal data mining, flexible least squares, incremental principal component analysis, algorithmic trading

1. INTRODUCTION

Algorithmic trading refers to the use of expert systems that, without any user intervention, decide on all aspects of a trading order, such as its timing, price, and quantity. Based on the analysis of a large number of financial data streams observed in real time, an algorithmic trading system attempts to detect and exploit temporary market inefficiencies for speculative purposes. A particularly appealing class of trading systems implements market-neutral strategies: these are strategies that are neutral to general market conditions, in the sense that the return from the strategy is expected to be uncorrelated with the market return. The simplest special case of a market-neutral strategy is perhaps pairs trading (see [1, 2]), a popular investment strategy among hedge funds and investment banks. The underlying premise in pairs trading is that assets with similar characteristics should be priced more or less the same. The first step of this strategy consists in finding two assets whose prices, in the long term, are expected to be tied together by some common stochastic trend (see, for instance, [5] for a formal procedure to test this hypothesis). What this implies is that, although the two price data streams do not necessarily need to move in the same direction at all times, their difference (or spread) will fluctuate around an equilibrium level. The strategy is thus based on the idea of relative mispricing: the spread quantifies the degree of mispricing of one asset relative to the other. If a common stochastic trend indeed exists between the two assets, any temporary mispricing is likely to correct itself over time. For instance, if the current spread is unusually large (so that one asset is overpriced compared to the other), the trader expects the spread to revert back to its long-run equilibrium level. An obvious trading decision, in this scenario, is to go short (sell) the asset that is currently overpriced and go long (buy) the undervalued asset, according to a predetermined ratio. A profit is made when the spread does revert back.

Clearly, opportunities for pairs trading in the simple form described above depend upon the existence of similar pairs of assets, and thus are naturally limited. Although data mining approaches to automate the process of identifying suitable pairs may be built (see, for instance, [8]), in this paper we propose a different strategy. Rather than selecting two assets, our trading system exploits discrepancies between a target asset selected by the trader and a paired artificial asset. The latter is synthetically created as a linear combination of co-evolving data streams representing the prices of other assets belonging to the same sector or industry, or constituents of a benchmark index; again, the synthetic asset is expected to share a common stochastic trend with the target asset we intend to trade. In this sense, the price of the artificial asset can be interpreted as the fair price of the target asset, given all available information and market conditions. Discrepancies between the target and the synthetic asset flag possible temporary market inefficiencies. Similarly to pairs trading, this is a relative value strategy: what matters is the size and dynamics of the spread, not the current market conditions. Moreover, given that this construction indirectly accounts for all sources of variation due to various market-related factors, the spread data stream is more likely to contain predictable patterns (e.g. a mean-reverting behaviour). Several statistical tests exist to assess whether, over time, a given data stream exhibits some form of predictability (e.g. non-parametric versions of the variance ratio test, as in [10]).
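As a rough illustration of such a check, the sketch below computes a simple homoskedastic variance-ratio statistic for a candidate spread stream in Python/NumPy; the function name and the choice of q are ours, and this is only a crude diagnostic, not the rank- and sign-based tests of [10].

```python
import numpy as np

def variance_ratio(spread, q=10):
    """Homoskedastic variance-ratio statistic VR(q) for a spread stream.

    Values well below 1 hint at mean reversion; values near 1 suggest a
    random-walk-like stream. Illustrative only, not Wright's tests [10]."""
    s = np.asarray(spread, dtype=float)
    r1 = np.diff(s)                    # one-period changes
    rq = s[q:] - s[:-q]                # overlapping q-period changes
    return np.var(rq, ddof=1) / (q * np.var(r1, ddof=1))
```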

2. METHODS

It should be clear that the construction above relies on selecting a set of explanatory streams with good predictive power. Moreover, the regression method should be able to account for a dynamically varying relationship between such streams and the target asset. The system we have developed comprises three main steps that are performed sequentially on each trading day:

(a) Having chosen a large population of $p$ explanatory data streams, the system incrementally updates their first principal components as new data points become available. The rationale for this step lies in the belief that the first few components capture the market factors that most affect the target asset, and that these components will vary and evolve over time. At the current time $t$, only the largest $k \ll p$ components are retained as predictive streams.

(b) The selected $k$ features are used in a linear regression model with time-varying coefficients. As with PCA, the coefficients are updated on-line, as new information arrives. Adopting time-varying coefficients provides more flexibility in modeling the relationship between the extracted streams (market factors) and the asset. With the current regression estimate at hand (interpreted as the current fair price of the asset), the spread can be computed.

(c) The one-step-ahead prediction of the spread is obtained using, again, time-varying regression. This prediction (or trading signal) is then mapped onto a trading rule specifying what the current position in the market should be.

Each of these three steps is described in more detail in the following sections.

2.1 Incremental feature extraction

Suppose that $R_t = E(r_t r_t')$ is the unknown population covariance matrix of the $p$ explanatory streams, with data available up to time $t = 1, \ldots, T$. The algorithm proposed by [9] provides an efficient procedure to incrementally update the eigenvectors of the $R_t$ matrix as new data are made available at time $t+1$. In turn, this procedure allows us to extract the first few principal components of the explanatory data streams in real time, and effectively perform on-line dimensionality reduction and noise reduction. A brief outline of the procedure follows. First, note that the eigenvector $g_t$ of $R_t$ satisfies the characteristic equation

$$h_t = \lambda_t g_t = R_t g_t \qquad (1)$$

where $\lambda_t$ is the corresponding eigenvalue. Let us call $\hat{h}_t$ the current estimate of $h_t$ using all the data up to time $t$ ($t = 1, \ldots, T$). We can write the above characteristic equation in matrix form as

$$h = \begin{pmatrix} h_1 \\ \vdots \\ h_T \end{pmatrix} = \begin{pmatrix} R_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & R_T \end{pmatrix} \begin{pmatrix} g_1 \\ \vdots \\ g_T \end{pmatrix} = Rg$$

and then, noting that

$$\frac{h_1 + \cdots + h_T}{T} = \frac{1}{T} \sum_{i=1}^{T} R_i g_i$$

the estimate $\hat{h}_T$ is obtained as $\hat{h}_T = (h_1 + \cdots + h_T)/T$ by substituting $R_i$ with $r_i r_i'$. This leads to

$$\hat{h}_t = \frac{1}{t} \sum_{i=1}^{t} r_i r_i' g_i \qquad (2)$$

which is the incremental average of $r_i r_i' g_i$, where $r_i r_i'$ accounts for the contribution to the estimate of $R_i$ at point $i$. Observing that $g_t = h_t/\|h_t\|$, an obvious choice is to estimate $g_t$ as $\hat{h}_{t-1}/\|\hat{h}_{t-1}\|$; in this setting, $\hat{h}_0$ is initialized by equating it to $r_1$, the first direction of data spread. Plugging this estimator into (2), we obtain

$$\hat{h}_t = \frac{1}{t} \sum_{i=1}^{t} r_i r_i' \frac{\hat{h}_{i-1}}{\|\hat{h}_{i-1}\|} \qquad (3)$$

In an on-line setting, we need a recursive expression for $\hat{h}_t$. Equation (3) can be rearranged into an equivalent expression that only uses $\hat{h}_{t-1}$ and the most recent data point $r_t$, that is

$$\hat{h}_t = \frac{t-1}{t}\,\hat{h}_{t-1} + \frac{1}{t}\, r_t r_t' \frac{\hat{h}_{t-1}}{\|\hat{h}_{t-1}\|}$$

The weights $(t-1)/t$ and $1/t$ control the influence of old values in determining the current estimates. Full details on the computation of the subsequent eigenvectors and other extensions can be found in [9].
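To make the recursion concrete, the following Python/NumPy sketch updates the leading eigenvector estimate exactly as in the displayed recursion, with the initialisation $\hat{h}_0 = r_1$ taken from the text; the toy five-stream input, the stream length and the names are illustrative only.

```python
import numpy as np

def update_leading_direction(h_prev, r_t, t):
    """One step of h_t = ((t-1)/t) h_{t-1} + (1/t) r_t (r_t' h_{t-1} / ||h_{t-1}||)."""
    g = h_prev / np.linalg.norm(h_prev)           # current eigenvector estimate
    return ((t - 1) / t) * h_prev + (1.0 / t) * r_t * np.dot(r_t, g)

rng = np.random.default_rng(0)
streams = rng.standard_normal((500, 5))           # toy returns of p = 5 explanatory streams

h = streams[0].copy()                             # initialise with the first observation, h_0 = r_1
for t, r_t in enumerate(streams, start=1):        # t = 1, 2, ...
    h = update_leading_direction(h, r_t, t)

g = h / np.linalg.norm(h)                         # estimated first principal direction
```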

2.2 Flexible least squares for on-line learning

The ordinary linear regression (OLR) model involves a response variable $y_t$ and $p$ predictor variables $x_1, \ldots, x_p$, which usually form a predictor column vector $x_t = (x_{1t}, \ldots, x_{pt})'$. The model postulates that $y_t$ can be approximated well by $x_t'\beta$, where $\beta$ is a $p$-dimensional vector of regression parameters. Estimates $\hat{\beta}$ of the parameter vector are found as those values that minimize the cost function

$$C(\beta) = \sum_{t=1}^{T} (y_t - x_t'\beta)^2 \qquad (4)$$

When both the response variable $y_t$ and the predictor vector $x_t$ are observations at time $t$ of co-evolving data streams, it may well be that the linear dependence between $y_t$ and $x_t$ keeps changing and evolving, dynamically, over time. Flexible least squares was introduced at the end of the 1980s by [7] as a generalization of the standard linear regression model above that allows for time-dependent regression coefficients. Together with the usual regression assumption that

$$y_t - x_t'\beta_t \approx 0 \qquad (5)$$

the FLS model also postulates that

$$\beta_{t+1} - \beta_t \approx 0 \qquad (6)$$

for $t = 1, \ldots, T$; that is, the regression coefficients evolve slowly over time. FLS has a number of advantages that are particularly appealing for our application. First, it does not require the specification of probabilistic properties for the residual error in (5). This is a favorable aspect of the method for applications in temporal data mining, and trading applications in particular, where we are either unable or unwilling to specify a precise model for the errors, and where the relationship between regressors and response is expected to change over time. Moreover, changes in the regression coefficients are tracked accurately. We have found that FLS performs well even when assumption (6) is violated and there are large and sudden changes between $\beta_{t-1}$ and $\beta_t$ for some $t$ (see, for instance, Figure 1).

Figure 1: Simulated versus estimated time-varying regression coefficients using FLS in both off-line and on-line mode.

With these minimal assumptions in place, given a predictor vector $x_t$, a procedure is called for that estimates a unique path of coefficients $\beta_t = (\beta_{1t}, \ldots, \beta_{pt})'$, for $t = 1, 2, \ldots$. The FLS approach consists in minimizing a penalized version of the OLS cost function (4), namely

$$C(\beta; \mu) = \sum_{t=1}^{T} (y_t - x_t'\beta_t)^2 + \mu \sum_{t=1}^{T-1} \xi_t \qquad (7)$$

where we have defined

$$\xi_t = (\beta_{t+1} - \beta_t)'(\beta_{t+1} - \beta_t) \qquad (8)$$

and $\mu \ge 0$ is a scalar to be determined. In their original formulation, [4] propose an algorithm that minimizes this cost with respect to every $\beta_t$ in a sequential way. They discuss a situation where all data points are stored in memory and promptly accessible, in an off-line fashion. The core of their approach is summarized below for completeness. The smallest cost of the estimation process at time $t$ can be written recursively as

$$c(\beta_{t+1}; \mu) = \inf_{\beta_t} \left\{ (y_t - x_t'\beta_t)^2 + \mu \xi_t + c(\beta_t; \mu) \right\} \qquad (9)$$

Furthermore, the cost-of-estimation function is assumed to have the quadratic form

$$c(\beta_t; \mu) = \beta_t' Q_{t-1} \beta_t - 2\beta_t' p_{t-1} + r_{t-1} \qquad (10)$$

where $Q_{t-1}$ and $p_{t-1}$ have dimensions $p \times p$ and $p \times 1$, respectively, and $r_{t-1}$ is a scalar. Substituting (10) into (9) and then differentiating the cost (9) with respect to $\beta_t$, conditioning on $\beta_{t+1}$, one obtains a recursive updating equation for the time-varying regression coefficient

$$\hat{\beta}_t = e_t + M_t \beta_{t+1} \qquad (11)$$

with

$$e_t = \mu^{-1} M_t (p_{t-1} + x_t y_t) \qquad (12)$$

$$M_t = \mu (Q_{t-1} + \mu I + x_t x_t')^{-1} \qquad (13)$$

Now, using (11), the cost (9) can be written as

$$c(\beta_{t+1}; \mu) = \beta_{t+1}' Q_t \beta_{t+1} - 2\beta_{t+1}' p_t + r_t$$

where

$$Q_t = \mu (I_p - M_t) \qquad (14)$$

$$p_t = \mu e_t \qquad (15)$$

$$r_t = r_{t-1} + y_t^2 - (p_{t-1} + x_t y_t)' e_t \qquad (16)$$

and where $I_p$ is the identity matrix. The recursions are started with some initial $Q_0$ and $p_0$. In order to apply (11), this procedure requires all data points up to time $T$ to be available, so that the coefficient vector $\beta_T$ can be computed first. In [4] it is shown that the estimate of $\beta_T$ can be obtained sequentially as

$$\hat{\beta}_T = (Q_{T-1} + x_T x_T')^{-1} (p_{T-1} + x_T y_T)$$

Subsequently, (11) can be used to estimate all remaining coefficient vectors $\beta_{T-1}, \ldots, \beta_1$, going backwards in time.
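For concreteness, a batch sketch of this forward-backward scheme is given below in Python/NumPy; the zero initialisation of Q_0 and p_0 is an assumption (the text only asks for "some initial Q0 and p0"), the scalar recursion (16) is omitted because it does not affect the coefficient path, and the final solve assumes enough data for the matrix to be invertible.

```python
import numpy as np

def fls_offline(X, y, mu):
    """Batch FLS: forward recursions (12)-(15), then the backward pass (11)."""
    T, p = X.shape
    I = np.eye(p)
    Q = np.zeros((p, p))                                              # Q_0 (assumed zero)
    pvec = np.zeros(p)                                                # p_0 (assumed zero)
    e = np.zeros((T, p))
    M = np.zeros((T, p, p))
    for t in range(T):
        x_t, y_t = X[t], y[t]
        M[t] = mu * np.linalg.inv(Q + mu * I + np.outer(x_t, x_t))    # eq. (13)
        e[t] = M[t] @ (pvec + x_t * y_t) / mu                         # eq. (12)
        if t == T - 1:
            # final coefficient vector, computed directly from Q_{T-1}, p_{T-1}
            beta_T = np.linalg.solve(Q + np.outer(x_t, x_t), pvec + x_t * y_t)
        Q = mu * (I - M[t])                                           # eq. (14)
        pvec = mu * e[t]                                              # eq. (15)
    beta = np.zeros((T, p))
    beta[-1] = beta_T
    for t in range(T - 2, -1, -1):                                    # eq. (11), backwards in time
        beta[t] = e[t] + M[t] @ beta[t + 1]
    return beta
```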

The procedure relies on the specification of the regularization parameter $\mu \ge 0$; this scalar penalizes the dynamic component of the cost function (7), defined in (8), and acts as a smoothness parameter that forces the time-varying coefficient vector towards or away from the fixed-coefficient OLS solution. We prefer the alternative parameterization $\mu = (1 - \delta)/\delta$, controlled by a scalar $\delta$ varying in the unit interval. With $\delta$ set very close to 0 (corresponding to very large values of $\mu$), near total weight is given to minimizing the static part of the cost function (7). This is the smoothest solution and results in standard OLS estimates. As $\delta$ moves away from 0, greater priority is given to the dynamic component of the cost, which results in time-varying estimates.

As noted above, the original FLS was introduced for situations in which all the data points are available, in batch, prior to the analysis. In contrast, we are interested in situations where each data point arrives sequentially, one step at a time. Each component of the $p$-dimensional vector $x_t$ represents a new point of a data stream, and the path of regression coefficients needs to be updated at each time step so as to incorporate the most recently acquired information. Using the FLS machinery in this setting, the estimate of $\beta_t$ is given recursively by

$$\hat{\beta}_t = (S_{t-1} + x_t x_t')^{-1} (s_{t-1} + x_t y_t) \qquad (17)$$

where we have defined the quantities

$$S_t = \mu (S_{t-1} + \mu I_p + x_t x_t')^{-1} (S_{t-1} + x_t x_t')$$

$$s_t = \mu (S_{t-1} + \mu I_p + x_t x_t')^{-1} (s_{t-1} + x_t y_t) \qquad (18)$$

The recursions are initially started with some arbitrarily chosen values $S_0$ and $s_0$.
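As a rough illustration, a direct transcription of the on-line recursions (17)-(18) into Python/NumPy might look as follows; the class name is ours, and starting S_0 from a small multiple of the identity (rather than zero) is an assumption made purely to keep the first inversion well defined.

```python
import numpy as np

class OnlineFLS:
    """On-line flexible least squares via the recursions (17)-(18).

    delta in (0, 1) maps to the penalty through mu = (1 - delta) / delta;
    delta close to 0 pushes the estimates towards the static OLS solution."""

    def __init__(self, p, delta=0.2):
        self.mu = (1.0 - delta) / delta
        self.I = np.eye(p)
        self.S = 1e-4 * np.eye(p)   # S_0: arbitrary start, small to keep (17) well defined
        self.s = np.zeros(p)        # s_0: arbitrary start

    def update(self, x_t, y_t):
        xxT = np.outer(x_t, x_t)
        beta_t = np.linalg.solve(self.S + xxT, self.s + x_t * y_t)   # eq. (17)
        A = np.linalg.inv(self.S + self.mu * self.I + xxT)
        self.S = self.mu * A @ (self.S + xxT)                        # eq. (18), S update
        self.s = self.mu * (A @ (self.s + x_t * y_t))                # eq. (18), s update
        return beta_t
```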

By exploiting similarities between FLS and the Kalman filter, we have been able to obtain the same algebraic results without having to perform any matrix inversion. To briefly detail this argument, we note that from (17) we have $\hat{\beta}_t = Q_t^{-1} p_t$; then we can write

$$\hat{\beta}_t - \hat{\beta}_{t-1} = Q_{t-1}^{-1} x_t y_t - \frac{Q_{t-1}^{-1} x_t x_t' Q_{t-1}^{-1} (p_{t-1} + x_t y_t)}{x_t' Q_{t-1}^{-1} x_t + 1} = K_t r_t \qquad (19)$$

where we have used the matrix inversion lemma, with

$$K_t = \frac{Q_{t-1}^{-1} x_t}{x_t' Q_{t-1}^{-1} x_t + 1}$$

being the Kalman gain and $r_t = y_t - x_t'\hat{\beta}_{t-1}$ being the residual of $y_t$. The updating of $Q_t^{-1}$ is achieved by re-arranging equation (18) as

$$Q_t^{-1} = \mu^{-1} I_p + (Q_{t-1} + x_t x_t')^{-1} = P_t - S_t K_t K_t' + \mu^{-1} I_p \qquad (20)$$

where again we have used the matrix inversion lemma, and $P_t = Q_{t-1}^{-1}$, $S_t = x_t' Q_{t-1}^{-1} x_t + 1$. Equations (19) and (20) form the Kalman filter equations. Starting with an arbitrary $Q_0^{-1}$, we can perform the computation of $\hat{\beta}_t$ without using any matrix inversion.

Figure 2: Spread stream $s_t$ for a subset of the entire period. FLS uses the first principal component and $\delta_1 = 0.2$.

Figure 1 illustrates how accurately the FLS algorithm recovers the path of the time-varying coefficients, in both off-line and on-line settings, for an artificially created data stream. The target stream $y_t$ for this example has been generated using the model $y_t = x_t \beta_t + \epsilon_t$, where $\epsilon_t$ is uniformly distributed over the interval $[-2, 2]$ and the explanatory stream $x_t$ evolves as $x_t = 0.8\, x_{t-1} + z_t$, with $z_t$ being white noise. The regression coefficients have been generated using a moderately complex mechanism, including non-linear dynamics, in order to illustrate the flexibility of FLS. Both the off-line and the on-line FLS estimates track the unobserved regression coefficients well, with the latter quickly adapting to abrupt jumps.
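As a usage example, a simulation in the spirit of the one just described can be run with the OnlineFLS sketch given earlier; the exact coefficient-generating mechanism used for Figure 1 is not specified in the text, so the sinusoidal path with an abrupt jump below is purely illustrative.

```python
import numpy as np
# assumes the OnlineFLS class from the earlier sketch is in scope

rng = np.random.default_rng(1)
T = 300
beta = 2.0 * np.sin(np.linspace(0, 4 * np.pi, T))   # illustrative coefficient path ...
beta[150:] += 3.0                                    # ... with an abrupt jump at t = 150

x = np.zeros(T)
for t in range(1, T):
    x[t] = 0.8 * x[t - 1] + rng.standard_normal()    # x_t = 0.8 x_{t-1} + z_t
eps = rng.uniform(-2, 2, size=T)                     # noise uniform on [-2, 2]
y = x * beta + eps                                   # y_t = x_t * beta_t + eps_t

fls = OnlineFLS(p=1, delta=0.5)
estimates = np.array([fls.update(np.array([x_t]), y_t)[0] for x_t, y_t in zip(x, y)])
```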

2.3 Signal generation and trading rule

Assume that the spread data stream has been updated at time $t$ using FLS with a control parameter $\delta_1$. The trading rule is a function that maps the expected spread at time $t+1$ onto the number of contracts to hold at the end of the current day, namely

$$\vartheta_t(s_t) = \phi(\hat{s}_{t+1}; s_t)\,\pi_t$$

where $\phi(\hat{s}_{t+1}; s_t)$ is a function of the expected spread and $\pi_t$ is the number of contracts to trade at time $t$. Our choice here is to set

$$\phi(\hat{s}_{t+1}; s_t) = \mathrm{sign}(\gamma_t s_t) \qquad (21)$$

where the parameter $\gamma_t$ is also estimated using FLS, with a control parameter $\delta_2$. Potential patterns of the spread data stream, such as mean reversion (see Figure 2), may be exploited in alternative ways; in our experience, the simple linear model has proved very satisfactory. With this trading rule in place, the daily order size is given by $\varphi_t = \vartheta_t(s_t) - \vartheta_{t-1}(s_t)$, rounded to the nearest integer.
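A literal reading of this rule, with $\gamma_t$ supplied by the second FLS recursion, could be sketched as follows in Python/NumPy; the helper names are ours and are for illustration only.

```python
import numpy as np

def target_position(gamma_t, s_t, n_contracts):
    """Contracts to hold at the end of day t: theta_t = sign(gamma_t * s_t) * pi_t, eq. (21)."""
    return np.sign(gamma_t * s_t) * n_contracts

def daily_order(theta_t, theta_prev):
    """Order size phi_t = theta_t - theta_{t-1}, rounded to the nearest integer."""
    return int(round(theta_t - theta_prev))
```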

3. EXPERIMENTAL RESULTS

We have developed a data mining system that trades S&P 500 stock-index futures contracts. A futures contract is an obligation to buy or sell a certain underlying instrument at a specific date and price in the future. The underlying instrument in this case is the S&P 500 Price Index, a world-renowned index of 500 US equities. The S&P 500 Futures Index stream was obtained from Bloomberg and covers a period of nearly ten years, from 02/01/1997 to 26/10/2006. The explanatory streams comprise a subset of the price index constituents. For our experiments, we have selected 430 data streams, corresponding to all the constituents for which we had sufficient historical data available over the period under study. Data points prior to 01/11/2000 were used as a training set to obtain stable estimates of the first few dominant eigenvectors, and were excluded from the back-testing results.

We assume an initial investment of $100 million, denoted by $w$. The number of contracts traded on a daily basis, $\pi_t$, is given by the ratio of this initial endowment $w$ to the price of the contract at time $t$. The price of the contract at time $t$ is given by the S&P 500 Futures Index at time $t$ multiplied by $250, as set by the Chicago Mercantile Exchange (CME). Instead of using prices, we have taken the log-returns $r_{it} = \log p_{it} - \log p_{i(t-1)}$, where $p_{it}$ is the observed price of asset $i$ at time $t$. The monetary return realized by the system at each time $t$ is given by $f_t = 250\,(p_t - p_{t-1})\,\vartheta_{t-1}(s_t)$, where $p_t$ is the price of the index.

We have tested the system over a grid of values for the smoothing parameters $\delta_1$ and $\delta_2$ in order to understand the effect of their specification. Here we report results for an important and commonly used financial indicator, the Sharpe ratio, defined as the ratio between the average monetary return and its standard deviation; this gives a measure of the mean excess return per unit of risk. Figure 3 clearly shows that the system produces positive Sharpe ratios for the entire range of values.
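To make the bookkeeping above concrete, a minimal back-testing loop (with no transaction costs, as assumed in the text) might look as follows; the variable names and helper structure are ours.

```python
import numpy as np

CONTRACT_MULTIPLIER = 250.0      # CME point value for S&P 500 futures, as stated above
ENDOWMENT = 100e6                # initial investment w

def contracts_tradable(futures_price, w=ENDOWMENT, mult=CONTRACT_MULTIPLIER):
    """pi_t = w / (contract price at time t)."""
    return w / (mult * futures_price)

def backtest(futures_prices, positions, w=ENDOWMENT, mult=CONTRACT_MULTIPLIER):
    """Daily monetary returns f_t = mult * (p_t - p_{t-1}) * theta_{t-1},
    the Sharpe ratio (mean over standard deviation), and cumulative gross % returns f_t / w."""
    p = np.asarray(futures_prices, dtype=float)
    theta = np.asarray(positions, dtype=float)
    f = mult * np.diff(p) * theta[:-1]
    sharpe = f.mean() / f.std(ddof=1)
    return f, sharpe, f.cumsum() / w
```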

Figure 3: Sharpe ratio as a function of $\delta_1$ and $\delta_2$.

In particular, values of $\delta_1$ around 0.2 and $\delta_2$ around 0.9, corresponding to time-varying regression coefficients, produce ratios that are notably high for a single-asset strategy. In all our experiments we have retained only the first principal component, representing the entire "market". We remark that all these results are out-of-sample.

Figure 4 shows gross percentage returns over the initial endowment for the constituent set, $f_t/w$, without transaction costs. The percentage returns made by our system (FLS-SVD) are plotted against the returns made by two alternative strategies. The first one (FLS-nSVD) uses all 430 data streams as explanatory streams and adopts the function $\phi(\hat{s}_{t+1}; s_t) = -\mathrm{sign}(s_t)$, which exploits mean reversion directly. The second strategy, buy-and-hold, is typical of asset management firms: the investor buys a number of contracts and holds them throughout the investment period. Clearly, the FLS-SVD system outperforms the index and is able to make a steady gross profit over time. The assumption of no transaction costs is not overly restrictive, as we expect the strategy not to be dominated by costs, given that we transact at a relatively low frequency.

Figure 4: Cumulative gross percentage returns of three competing strategies (FLS-SVD, FLS-nSVD, and buy-and-hold).

4. CONCLUSIONS

Our algorithmic trading approach relies on a feature extraction step, based on incremental principal component analysis, and an incremental regression step, based on flexible least squares. A more realistic trading system based on the methods proposed here would implement a number of improvements, such as more stringent statistical tests for the existence of a common stochastic trend between the target and the synthetic asset, better calibrated and more robust trading rules, and loss limitation strategies (e.g. [6]). The feature extraction step could also be replaced by feature selection, so that the explanatory data streams entering the regression model are selected automatically, and dynamically, from a very large basket of streams, on the basis of their similarity to the target asset (e.g. [3]).

5. REFERENCES


[1] R. Elliott, J. van der Hoek, and W. Malcolm. Pairs trading. Quantitative Finance, pages 271-276, 2005.
[2] E. Gatev, W. N. Goetzmann, and K. G. Rouwenhorst. Pairs trading: Performance of a relative-value arbitrage rule. Review of Financial Studies, 19(3):797-827, 2006.
[3] S. Guha, D. Gunopulos, and N. Koudas. Correlating synchronous and asynchronous data streams. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 529-534, 2003.
[4] R. Kalaba and L. Tesfatsion. The flexible least squares approach to time-varying linear regression. Journal of Economic Dynamics and Control, 12(1):43-48, 1988.
[5] S. J. Leybourne, P. Newbold, D. Vougas, and T. Kim. A direct test for cointegration between a pair of time series. Journal of Time Series Analysis, 23:173-191, 2002.
[6] Y. Lin, M. McCrae, and C. Gulati. Loss protection in pairs trading through minimum profit bounds: A cointegration approach. Journal of Applied Mathematics and Decision Sciences, pages 1-14, 2006.
[7] L. Tesfatsion and R. Kalaba. Time-varying linear regression via flexible least squares. Computers & Mathematics with Applications, 17(8-9):1215-1245, 1989.
[8] G. Vidyamurthy. Pairs Trading. Wiley Finance, 2004.
[9] J. Weng, Y. Zhang, and W. S. Hwang. Candid covariance-free incremental principal component analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(8):1034-1040, 2003.
[10] J. Wright. Alternative variance-ratio tests using ranks and signs. Journal of Business & Economic Statistics, 18(1):1-9, 2000.