Algorithmic Daily Trading Based on Experts' Recommendations

Andrzej Ruta1, Dymitr Ruta2, and Ling Cen2

1 ING Bank Slaski, Katowice, Poland, [email protected]
2 Emirates ICT Innovation Center, EBTIC, Khalifa University, Abu Dhabi, UAE, {dymitr.ruta, cen.ling}@kustar.ac.ae
Abstract. Trading of financial products has evolved from manual transactions, carried out on investors' behalf by well-informed market experts, to automated software agents trading with millisecond latencies on continuous data feeds at computerised market exchanges. While high-frequency trading is dominated by algorithmic robots, the mid-frequency spectrum around daily trading appears left open for deep human intuition and complex knowledge, acquired over years, to make optimal trading decisions. Banks, brokerage houses and independent experts use these insights to make daily trading recommendations for individual and business customers. How good and reliable are they? This work explores the value of such expert recommendations for algorithmic trading, using various state-of-the-art machine learning models in the context of the ISMIS 2017 Data Mining Competition. We point at the highly unstable nature of market sentiments and the generally poor performance of individual experts, both of which limit the utility of their recommendations for successful trading. However, after a thorough investigation of different competitive classification models applied to sparse features derived from experts' recommendations, we identified several successful trading strategies that showed top performance in the ISMIS 2017 Competition, and retrospectively analysed how to prevent such models from over-fitting.

Keywords: algorithmic trading, feature selection, classification, gradient boosting decision trees, sparse features, k-NN
1 Introduction
Algorithmic trading uses computer programs and mathematical models to define trading strategies and execute transactions automatically in financial markets for optimal returns, and has become a modern replacement for manual human trading. Nowadays, banks, trading houses and investment firms rely heavily on algorithmic trading in the stock markets, especially for high-frequency trading (HFT), which requires processing large amounts of information to make instantaneous investment decisions. Algorithmic trading allows consistent and systematic execution of designed trading strategies that are
free from the (typically damaging) impact of human emotions. It also makes markets more liquid and efficient [1].

1.1 Related Work
The key question in algorithmic trading is how to define a set of rules or build mathematical models, based on historical stock data, e.g. prices, volume or the order book, as well as other available information such as companies' annual reports, expert recommendations or commodity prices, to accurately predict market behaviour and correctly identify trading opportunities. Simple trade criteria can, for example, be defined on 5-day and 20-day moving averages as follows: buy 100 shares when the 5-day moving average of a stock price goes above its 20-day moving average, and sell half of the shares when the 5-day moving average goes below the 20-day moving average. Based on such rules, a machine can be coded to monitor stock prices and the corresponding indicators, and automatically place buy/sell orders when the defined conditions are met. However, stock markets are non-stationary and chaotic, and are influenced by many direct or indirect variables and uncertainties beyond a trader's control or knowledge. Simple rules like the ones above typically do not suffice to account for all impact factors, and fail to simultaneously achieve high returns at low risk of financial loss.

Machine learning (ML) and data mining (DM) have undergone a rapid development in recent years and have found numerous applications in predictive analytics across different disciplines and industries. In algorithmic trading, ML/DM can help to discover hidden patterns of market behaviour from related financial data in order to decode their complex impact on market movements or trends at different time horizons. In [2], several types of stock analysis, including typical price (TP), Bollinger bands, relative strength index (RSI) and moving averages (MA), were combined to predict the trend of the closing price on the following day. With the help of data mining techniques their model achieved an average accuracy of well over 50%.
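The moving-average crossover rule described above can be sketched as follows; this is a minimal illustration, with a hypothetical function name, not a recommendation of the rule itself:

```python
import pandas as pd

def ma_crossover_signal(prices: pd.Series) -> pd.Series:
    """Toy 5/20-day moving-average crossover rule from the text:
    +1 = buy signal (5-day MA crosses above the 20-day MA),
    -1 = sell signal (5-day MA crosses below the 20-day MA),
     0 = no action."""
    fast = prices.rolling(5).mean()
    slow = prices.rolling(20).mean()
    above = (fast > slow).astype(int)
    # diff() yields +1 on an upward cross and -1 on a downward cross
    return above.diff().fillna(0).astype(int)
```

A trading machine would then watch this signal on a live price feed and place the buy/sell orders whenever it becomes non-zero.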
Application of supervised learning to determine future trends in the stock market has been a subject of intense research, the bulk of which focused on using well-established models, such as support vector machines (SVM), neural networks (NN) or linear regression, to learn market behaviour from its own historical signals. In [3], classification and regression models were both designed to predict daily stock returns from direct price/volume signals: open, close, high, low and volume, and from indirect features corresponding to external economic indicators. Simple logistic regression (LR) predicting the daily trend direction was reported to outperform SVM, yielding over 2000% cumulative return over 14 years. An automated stock trading system was proposed in [4]. It considered a hierarchy of features and methods selected based on multiple risk-adjusted investment performance indicators, used backward search and four ML algorithms (linear/logistic regression, l1-regularized ν-SVM, and multiple additive regression trees (MART)), and was capable of online learning. The system traded automatically based on the following day's trend prediction and reported a high average accuracy in excess of 90% [4].
In [5], the stock price was modelled on daily or longer intervals to predict future single- or multi-day averages, respectively. LR, quadratic discriminant analysis (QDA) and SVM models were tested on historical price data over the period between 2008 and 2013. LR reported the top performance in next-day price prediction, 58.2%, further improved to 79.3% by the SVM-based long-term model with a 44-day window. In [6], a random forest classifier was built on features extracted from technical analysis indicators, such as RSI or the stochastic oscillator, and achieved 96.92% accuracy in 88-day-window price prediction. Neural network (NN) based approaches have also been applied to stock price trend prediction [7], classification into groups of buying, holding and selling [8], and other related tasks.

Recently, deep learning (DL) has achieved tremendous success in diverse applications, e.g. visual object recognition, speech recognition or information retrieval [9]. In [10], an auto-encoder composed of stacked restricted Boltzmann machines (RBM) was utilized to extract features from the history of individual stock prices, which successfully enhanced the momentum trading strategy [11] and delivered an annualized return of 45.93% over the period 1990-2009, versus 10.53% for basic momentum. In [12], a high-frequency strategy was developed based on deep neural networks (DNN) trained to predict the next-minute average price from the most recent and n-lagged one-minute pseudo-returns, price standard deviations and trend indicators. It achieved 66% accuracy on tick-by-tick transactions of Apple Inc. stock over the period Sep-Nov 2008. Although these models have been successful in predicting stock price trends, they were built solely on historical data, which contradicts a basic rule in finance known as the Efficient Market Hypothesis [13].
It implies that if one were to gain an advantage through historical stock data analysis, the entire market would immediately become aware of this advantage, causing a correction of the stock price [6]. This dynamic and reactive nature of international financial markets, combined with their high sensitivity to all kinds of micro- and macroeconomic events in the business, financial and geopolitical spheres, makes them appear chaotic, very noisy and allegedly truly unpredictable, especially over short time horizons [6].

Very little research reported in the public-domain literature has been devoted to algorithmic trading based on information other than historical price data. In [14], a discrete stock price was predicted using a synthesis of linguistic, financial and statistical techniques in the Arizona Financial Text System (AZFinText). Specifically, the system combined stock prices with the presence of key terms in related financial news articles within a window of up to 20 minutes after their release, yielding a 2% higher return than the best-performing quantitative funds monitoring the same securities.

Banks, trading houses and investment experts use deep human intuition and complex knowledge acquired over years of practice to make daily trading recommendations for individual and business customers. The main concern for non-professional investors who follow these recommendations is their non-guaranteed reliability, especially when inconsistent recommendations are given by various experts. In [15], expert investors with similar investment preferences based
on their publicly available portfolios were first matched to non-professional investors by taking advantage of social network analysis. Then, appropriately managed portfolios were recommended to them according to their assigned financial experts. Although the authors proposed an interesting way for non-professional investors to identify appropriate experts to follow, the reliability of the recommendations was not investigated. In this work, in the context of the ISMIS 2017 Data Mining Competition, we explore the feasibility and value of expert recommendations in stock trend prediction for algorithmic daily trading using various machine learning models, a subject seldom studied in the literature.

1.2 ISMIS 2017 Competition Problem Formulation
The trading recommendation problem has been defined as predicting the best trading decision from the set of classes {sell, hold, buy}, corresponding to a considerable negative, near-zero or considerable positive return observed for different financial assets in the subsequent 3 months, based on multiple trading recommendations made by many different experts in a period of up to 60 days prior to the trading decision. The expert recommendations for every asset were structured as a table including the expert id, the number of days prior to the trading decision that the recommendation was made, the expected return, and the suggested trading action from the same set as the target classes: {sell, hold, buy}. To factor in the uneven impact of trading decisions on the corresponding return, the performance of the trading classification system in response to a vector of features X has been defined by the following cost-weighted accuracy metric:

    ACC(X) = Σ_{i=1}^{3} (C_{i,i} W_{i,i}) / Σ_{i=1}^{3} Σ_{j=1}^{3} (C_{i,j} W_{i,j})    (1)

where C_{i,j} denotes the confusion matrix entry for the i-th true and j-th predicted class and W_{i,j} is the corresponding weight from the following cost matrix W:
Table 1. Cost matrix of the weighted accuracy ACC (1)

                    Predicted
                  Sell  Hold  Buy
    Actual Sell     8     1    8
           Hold     4     1    4
           Buy      8     1    8
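With the class order (sell, hold, buy), the metric (1) under this cost matrix can be sketched as follows (a minimal illustration, not the organizers' code):

```python
import numpy as np

# Cost matrix W from Table 1: rows = actual, columns = predicted,
# class order (sell, hold, buy).
W = np.array([[8, 1, 8],
              [4, 1, 4],
              [8, 1, 8]], dtype=float)

def weighted_accuracy(C, W=W):
    """Eq. (1): weighted diagonal of the confusion matrix C
    over the weighted sum of all its entries."""
    C = np.asarray(C, dtype=float)
    return np.trace(C * W) / (C * W).sum()
```

Note that for a uniform all-hold submission all weights in the hold column equal 1, so the score collapses to the fraction of actual hold examples, i.e. the hold prior itself.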
The contestants were provided with a labelled training set of 12234 examples as well as the facility to score their predictions on a chunk of the testing set (7555 examples) via the web-based KnowledgePit platform. Although the submissions with predicted trading labels had to be made for the whole testing set, the
feedback in the form of the ACC score was based on only 10% of randomly chosen testing examples, the identities of which were hidden from the competitors.

1.3 Market Expectation
The competition setup allowed us to extract general market expectations, or sentiments, in the target period in which the testing set was prepared, in terms of the prior trading class probabilities. Namely, given the performance cost matrix W shown in Table 1 and the feedback from the uniform prediction submissions all-sell, all-hold and all-buy, scored ACC_S, ACC_H and ACC_B respectively, the prior class probabilities can be extracted using the following formula:
    p(H) = ACC_H
    p(S) = ACC_H ACC_S / (2(1 − ACC_B − ACC_S))    (2)
    p(B) = ACC_H ACC_B / (2(1 − ACC_B − ACC_S))
derived by solving the system of 3 equations with 3 unknowns determined by the element-wise multiplication of the confusion and cost matrices, as defined in (1). Compared to the training period, for which the sentiments among sell, hold and buy were distributed as 0.25, 0.28 and 0.47, respectively, in the testing period the distribution extracted via (2) changed significantly to 0.36, 0.30 and 0.34, i.e. from strong buy to strong sell. As we show in the subsequent analysis, access to this information significantly contributed to model over-fitting, since the contestants attempted to exploit it to boost their preliminary leaderboard positions. It is also worth noting that in a real trading scenario the testing-set market expectations constitute information from the future and would not be available to fine-tune or calibrate the trading model.
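Formula (2) can be sketched as follows; the function name is hypothetical, and the ACC values in the usage below are forward-simulated from the Table 1 cost matrix purely for illustration:

```python
def priors_from_uniform_scores(acc_s, acc_h, acc_b):
    """Eq. (2): recover the test-set class priors p(S), p(H), p(B)
    from the ACC scores of the three uniform submissions
    (all-sell, all-hold, all-buy)."""
    denom = 2.0 * (1.0 - acc_b - acc_s)
    p_h = acc_h                        # the all-hold score is the hold prior itself
    p_s = acc_h * acc_s / denom
    p_b = acc_h * acc_b / denom
    return p_s, p_h, p_b
```

Three throwaway submissions on the leaderboard thus suffice to reconstruct the hidden test-set sentiment distribution.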
2 Trading Classification Models

2.1 Representation and Extraction of Recommendation Features
There were a number of choices of how to represent the recommendation features, which parts of a recommendation to include in the feature definitions, and how many features to feed to the model. We adopted a simple, sparse feature representation associating each feature with an individual expert e. Numerous experiments were carried out to determine the best function mapping the original observation x associated with a given stock to the feature value f_e(x). Possible choices involved the most recent return class suggested by the expert, the last percentage return they expected, and the average return with multiple variants of temporal weighting and missing-return imputation schemes. Finally, we decided to consider the two best feature families f_e(x): the most recent return class and the time-weighted return from each expert:
    f_e(x) = c(argmin_i t_i),        f_e(x) = Σ_{i=1}^{k} r_i e^{−λt_i} / Σ_{i=1}^{k} e^{−λt_i}    (3)

where r_i denotes the percentage return expected by expert e at the time point distant by t_i from the decision date, c(·) maps a recommendation to its return class, and λ = 0.05. The resulting temporal weighting scheme emphasises returns expected more recently and follows our intuition that human experts tend to correctly discount near-future rather than distant-future market changes. The feature matrix X [12234×2832] obtained this way for the training set was extremely sparse, with less than 0.1% of non-zero values.

Interestingly, no other feature generation scheme appeared to contribute to the task of return class prediction. We trialled a new feature space in which, for every data record, the transformed vector contained values of hand-crafted functions, among others: the number/percentage of recommendations with a specific return class label, the minimum/maximum/average expected return, the average lead time of the recommendations, or the percentage of missing return expectations. The total number of data dimensions generated this way exceeded 40. In this new feature space we observed low generalization power of several state-of-the-art classifiers, outperforming the best dummy (all-sell) solution by no more than 2%.

These disappointing results made us focus more on the recommenders rather than the recommendations. Specifically, we ranked the experts according to the global accuracy of their recommendations over the training set and then retained for every data record only the {expert id, recommendation, expected return, days to decision date} entries assigned to the best-performing experts. With this modification no visible improvement was observed either. We also attempted to use the company membership of experts for clustering and for the generation of company-wide features, but this likewise did not improve performance compared to the sparse matrix representation.
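The time-weighted feature family in (3) can be sketched as below; this is a minimal per-expert computation with a hypothetical function name, leaving out the vectorisation over the sparse feature matrix:

```python
import numpy as np

def time_weighted_return(returns, lead_times, lam=0.05):
    """Second feature family of Eq. (3): exponentially decayed average
    of the percentage returns r_i that expert e expected t_i days
    before the trading decision."""
    r = np.asarray(returns, dtype=float)
    w = np.exp(-lam * np.asarray(lead_times, dtype=float))
    return float((r * w).sum() / w.sum())
```

With λ = 0.05, a recommendation issued 60 days before the decision carries only e^{−3} ≈ 0.05 of the weight of one issued on the decision day, which implements the recency emphasis described above.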
2.2 Explored Model Design Choices
Trading recommendation was posed as a 3-class classification problem; however, it is arguable whether the hold class should be considered a genuine independent class or just a state in between the sell and buy classes. An argument in support of the latter is that a hold occurring during a transition from sell to buy (when the price rises) is clearly different from the same hold happening during a transition from buy to sell (when the price drops). Given this ambiguity of the hold class, as well as the aforementioned discrepancy in the target variable distribution between the training and the test set, we tried various modelling approaches and obtained rather diverse results, without a clear winner in terms of the consistency of the design. However, we exploited the diversity of the model solutions to refine the final prediction.

2.3 Top Baseline Classifiers and Their Fine-tuning
We have explored a number of standard classification models with thorough parametric fine-tuning. Specifically, the following models have been considered:
– Naive Bayes with a multivariate multinomial distribution and enforced uniform prior class probabilities, which achieved a performance score of 0.45,
– a sum of votes over a selected subset of binarised expert recommendations, which achieved a performance of above 0.45,
– k-nearest neighbours with 9 neighbours, standard Euclidean distance, uniform class priors and a hold-class-penalizing cost matrix, which scored 0.47,
– a Support Vector Machine (SVM) with a linear kernel and test class priors, which achieved a score of almost 0.45,
– a boosted classifier ensemble with decision trees acting as weak learners, which jointly with class distribution rebalancing reached an accuracy of over 0.47, and as much as 0.49 after a further "sum-of-scores" combination of three ensembles.

Below we present the modelling strategy that was ultimately adopted. We considered it most beneficial to focus on correct sell and buy class predictions, even at the high cost of making mistakes. In principle, we attempted to obtain as many correct sell and buy predictions based on pure recommendation evidence as possible, and then use the diversity of our model versions and the extracted market expectation to refine some of the extreme-class instances towards hold. To get a "raw" model we chose two different baseline classifiers: k-nearest neighbours (k-NN) and a boosted decision tree ensemble. The former appeals for its conceptual simplicity, lack of training, and the ability to generate arbitrarily complex decision boundaries. The latter is natural for robust selection of predictors out of an overcomplete feature space. A boosted tree ensemble has the additional property of providing a measure of relative importance, which reflects how frequently each variable is chosen in individual splits and how much choosing it contributes to the reduction of the prediction error. Initially, the multi-class AdaBoost [16] algorithm was used for building the model and selecting the best predictors in parallel.
It was later replaced by Gradient Boosted Decision Trees (GBDT) [17], which offered slightly better accuracy. While k-NN does not require explanation, we provide some details of the boosted ensemble configuration. In GBDT, weak learners, which are themselves decision trees, are combined into a strong classifier in such a way that in each subsequent round of training the new weak learner is fit to the error of the previously obtained classifier. In the implementation used, the error is expressed in terms of deviance for classification with probabilistic outputs, as in logistic regression. In addition, aiming at minimisation of the model variance, the trees were randomized, i.e. we set the fraction of samples used for training individual base learners to less than 1.0 and reduced the number of features considered when looking for the best split at each level of each individual tree to log2 M, where M is the number of all features.
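With scikit-learn, for example, such a configuration could look as below; the concrete hyperparameter values other than subsample < 1.0 and the log2 feature subsampling are illustrative, not the ones used in the competition:

```python
from sklearn.ensemble import GradientBoostingClassifier

# The default loss is the probabilistic deviance (as in logistic
# regression) mentioned in the text; randomization per the text:
# a sample fraction < 1.0 per base tree and log2(M) candidate
# features per split.
gbdt = GradientBoostingClassifier(
    n_estimators=200,     # illustrative
    learning_rate=0.1,    # illustrative
    max_depth=3,          # illustrative
    subsample=0.8,        # < 1.0, randomizes samples per tree
    max_features="log2",  # log2(M) features per split
    random_state=0,
)
```

Randomizing both the rows and the split candidates trades a little bias for a substantial reduction in variance, which matters on an extremely sparse 2832-dimensional feature matrix.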
3 Preliminary Evaluation and Post-processing
Further refinement of the return class prediction model was based on the exploitation of market expectations extracted as shown in (2). The top baseline
classifier outputs were taken as a starting point for the refinement process and subjected to several layers of label replacement aimed at reconstruction of the retrieved market expectations. The output class replacement followed a simple logic of identifying subsets of outputs in which a pair of classes was significantly over- or under-represented compared to the market expectations, and switching them accordingly to bring the output distribution closer to the reconstructed figures: p(S) = 0.36, p(H) = 0.30 and p(B) = 0.34 (2). The following post-processing output replacement methods have been applied based on the feedback received during the preliminary stage of the competition:

– Correction based on output label imbalance. For various characteristics of the model's output scores, as well as auxiliary expert-specific properties such as the average age of recommendations or their inconsistency over time, histograms of output labels were inspected and reshaped in the regions where the actual label distribution deviated most significantly from the expected test label proportions.
– Correction based on candidate model output agreement. The classification output correction was carried out along various levels of agreement among the candidate models' outputs. For every output class separately, the class agreement was measured as the percentage of candidate models that agreed with the final model output. Using this corrective strategy, for different ranges of agreement levels the final output class was switched from the most over-represented to the most under-represented.
– Correction based on disagreement with baseline classifiers. The soft outputs from auxiliary baseline classifiers were compared against the current final model outputs. In the case of significant disagreement between the two, the final predictions in the most extreme subset of up to 100 examples were replaced with the most under-represented class.
– Correction based on disagreement among submitted model versions. The outputs from all previously submitted model versions were taken as inputs to derive a model disagreement measure. Given the input example x_i and its corresponding m competing outputs, or votes, y_i = {y_ij}, j = 1, ..., m, from the ensemble of m predictors, each taking values from the set of c = 3 classes {sell, hold, buy}, their disagreement d was defined as the normalised cardinality of the least popular class outputs:

    d(y_i) = c · min_{k=1,...,c} |{y_ij : y_ij = k}| / m    (4)
taking values between 0 (at least one class gets no votes) and 1 (votes equally distributed among the classes). Outputs for cases with the extreme disagreement were replaced with the "safe" hold class, while subsets within contiguous disagreement ranges exhibiting a high class imbalance compared to the expectations were re-labelled towards the most under-represented class.

The two ML models based on k-NN and GBDT, with the presented post-processing, took the 1st and 2nd places during the preliminary evaluation stage.
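The disagreement measure (4) can be sketched as follows (a minimal illustration with a hypothetical function name):

```python
from collections import Counter

def disagreement(votes, classes=("sell", "hold", "buy")):
    """Eq. (4): c times the vote count of the least popular class,
    normalised by the number of voters m. Equals 0 when some class
    receives no votes and 1 when the votes are spread evenly."""
    counts = Counter(votes)
    c, m = len(classes), len(votes)
    return c * min(counts.get(k, 0) for k in classes) / m
```

Counting the *least* popular class, rather than the most popular one, makes the measure sensitive to how evenly contentious an example is across all three classes at once.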
4 Final Evaluation
The final evaluation provided surprising, yet not entirely unexpected, results that completely changed the leaderboard ranking. The top performer of the validation stage (the second author's model) scored only ACC = 0.387 on the full test set, yielding the 24th place, while the 2nd contestant of the preliminary stage (the first author's model) scored ACC = 0.424 (6th place). The overall winner received a score of ACC = 0.437. At the same time, the organizers revealed that an intermediate solution of the 2nd-ranked contestant from the preliminary stage achieved the highest test score of all contestants. Unfortunately, that solution was not picked as final since it had not received the best score on the validation set.

Figures 1(a) and 1(b) show the evolution of the preliminary scores on the validation set and the corresponding final test scores that were hidden throughout the competition. Clearly, the market-expectation-guided post-processing led to massive over-fitting and spoiled the well-performing early model setups that could otherwise have won the contest. This can be recognised in particular by the divergence of the validation and test score evolutions, which in both cases started to play a role before any relabelling was attempted. The 10% validation sample available to the contestants at the model development stage appeared to be insufficiently representative and thus misleading.
Fig. 1. Evolution of the validation and test performance of submitted solutions from the 2 top contestants in the preliminary competition leaderboard: (a) model based on the k-NN classifier; (b) gradient boosted tree ensemble
5 Conclusions
The competition uncovered very weak predictability of the future market direction based on experts' recommendations learnt from past observations. Isolated expert-specific features capturing the most recent experts' opinions proved best at discriminating between the three major asset return classes. A number of baseline classifiers, led by gradient boosted decision trees and a simple
voting scheme, resulted in the top overall accuracy in detecting optimal trading actions reported during the ISMIS 2017 Data Mining Competition. The subsequent post-processing of the classification outputs, guided by the market feedback (unavailable in reality), turned out to be detrimental to the model and led to massive over-fitting.
References

1. Algorithmic Trading, www.investopedia.com/terms/a/algorithmictrading.asp
2. Kannan K.S., Sekar P.S., Sathik M.M. and Arumugam P., Financial Stock Market Forecast using Data Mining Techniques, Int. MultiConference of Engineers and Computer Scientists, vol.1, 2010.
3. Li H., Yang Z.J. and Li T.L., Algorithmic Trading Strategy Based On Massive Data Mining, Stanford University, 2014.
4. Shao C.X. and Zheng Z.M., Algorithmic trading using machine learning techniques: final report, 2013.
5. Dai Y. and Zhang Y., Machine Learning in Stock Price Trend Forecasting, Stanford University, 2013.
6. Khaidem L., Saha S. and Dey S.R., Predicting the direction of stock market prices using random forest, Applied Mathematical Finance, 2016.
7. Giacomel F., Galante R. and Pareira A., An Algorithmic Trading Agent based on a Neural Network Ensemble: a Case of Study in North American and Brazilian Stock Markets, Int. Conf. on Web Intelligence and Intelligent Agent Technology, 2015.
8. Boonpeng S. and Jeatrakul P., Decision Support System for Investing in Stock Market by using OAA-Neural Network, 8th Int. Conf. on Advanced Computational Intelligence, 2016.
9. Deng L., Three Classes of Deep Learning Architectures and Their Applications: A Tutorial Survey, APSIPA Trans. on Signal and Information Processing, 2012.
10. Takeuchi L. and Lee Y.A., Applying Deep Learning to Enhance Momentum Trading Strategies in Stocks, 2013.
11. Jegadeesh N. and Titman S., Returns to buying winners and selling losers: implications for stock market efficiency, The Journal of Finance, vol.48, no.1, pp.65-91, 1993.
12. Arevalo A., Nino J., Hernandez G. and Sandoval J., High-Frequency Trading Strategy Based on Deep Neural Networks, Int. Conf. on Intelligent Computing, pp.424-436, 2016.
13. Malkiel B.G. and Fama E.F., Efficient capital markets: A review of theory and empirical work, The Journal of Finance, vol.25, no.2, pp.383-417, 1970.
14. Schumaker R.P. and Chen H., A Quantitative Stock Prediction System based on Financial News, Information Processing and Management, vol.45, no.5, pp.571-583, 2009.
15. Koochakzadeh N., Kianmehr K., Sarraf A. and Alhajj R., Stock Market Investment Advice: A Social Network Approach, Int. Conf. on Advances in Social Networks Analysis and Mining, pp.71-78, 2012.
16. Zhu J., Zou H., Rosset S. and Hastie T., Multi-class AdaBoost, Statistics and Its Interface, vol.2, pp.349-360, 2009.
17. Friedman J., Greedy Function Approximation: A Gradient Boosting Machine, The Annals of Statistics, vol.29, no.5, pp.1189-1232, 2001.
18. Jegadeesh N., Kim J., Krische S.D. and Lee C.M.C., Analyzing the Analysts: When Do Recommendations Add Value?, The Journal of Finance, vol.59, no.3, pp.1083-1124, 2004.