Expert Systems With Applications 112 (2018) 258–273


Predicting short-term stock prices using ensemble methods and online data sources

Bin Weng a, Lin Lu a, Xing Wang a, Fadel M. Megahed b, Waldyn Martinez b,∗

a Department of Industrial & Systems Engineering, Auburn University, AL 36849, USA
b Department of Information Systems & Analytics, Miami University, Oxford, OH 45056, USA

Article info

Article history: Received 18 October 2017; Revised 18 December 2017; Accepted 7 June 2018; Available online 15 June 2018.

Keywords: Big data; Ensembles; Google trends; R programming; Sentiment analysis; Wikipedia

Abstract

With the ubiquity of the Internet, platforms such as Google, Wikipedia and the like can provide insights pertaining to firms' financial performance, as well as capture the collective interest of traders through search trends, the number of web page visitors and/or financial news sentiment. Information emanating from these platforms can significantly affect, or be affected by, changes in the stock market. The overarching goal of this paper is to develop a financial expert system that incorporates these features to predict short-term stock prices. Our expert system is comprised of two main modules: a knowledge base and an artificial intelligence (AI) platform. The "knowledge base" for our expert system captures: (a) historical stock prices; (b) several well-known technical indicators; (c) counts and sentiment scores of published news articles for a given stock; (d) trends in Google searches for the given stock ticker; and (e) the number of unique visitors for pertinent Wikipedia pages. Once the data is collected, we use a structured approach for data preparation. Then, the AI platform trains four machine learning ensemble methods: (a) a neural network regression ensemble; (b) a support vector regression ensemble; (c) a boosted regression tree; and (d) a random forest regression. In the cross-validation phase, the AI platform picks the "best" ensemble for a given stock. To evaluate the efficacy of our expert system, we first present a case study based on the Citi Group stock ($C) with data collected from 01/01/2013 to 12/31/2016. We show the expert system can predict the 1-day ahead $C stock price with a mean absolute percent error (MAPE) ≤ 1.50% and the 1–10 day ahead price with a MAPE ≤ 1.89%, which is better than the reported results in the literature. We show that the use of features extracted from online sources does not substitute for the traditional financial metrics, but rather supplements them to improve upon the prediction performance of machine learning based methods. To highlight the utility and generalizability of our expert system, we predict the 1-day ahead price of 19 additional stocks from different industries, volatilities and growth patterns. We report an overall mean for the MAPE statistic of 1.07% across our five different machine learning models, including a MAPE of under 0.75% for 18 of the 19 stocks for the best ensemble (boosted regression tree).

1. Introduction

Stock market prediction has continued to be an attractive topic in academia and business. Historically, the topic of predicting stock prices revolved around the following question: "To what extent can the past history of a common stock's price be used to make meaningful predictions concerning the future price of the stock?" (Fama, 1965, p. 34). Important financial theories, specifically the Efficient Market Hypothesis (Fama, 1965) and the random walk model (Cootner, 1964; Fama, Fisher, Jensen, & Roll, 1969), have suggested that stock prices cannot be predicted, since they are driven by new information that cannot be captured by an analysis of stock prices (Geva & Zahavi, 2014).



∗ Corresponding author.
https://doi.org/10.1016/j.eswa.2018.06.016

Proponents of these hypotheses believe that stock prices follow a random walk and that the accuracy of any prediction of stock movement will be around 50% (Bollen, Mao, & Zeng, 2011). However, many studies have rejected the premise of these two hypotheses and showed that the market can be predicted to some extent (Abdullah & Ganapathy, 2000; Ballings, Van den Poel, Hespeels, & Gryp, 2015; Bollen et al., 2011; Chong, Han, & Park, 2017; Malkiel, 2003; Mok, Lam, & Ng, 2004; Nassirtoussi, Aghabozorgi, Wah, & Ngo, 2015; Nguyen, Shirai, & Velcin, 2015; Nofsinger, 2005; Oliveira, Cortez, & Areal, 2017; Patel, Shah, Thakkar, & Kotecha, 2015a; Prechter Jr & Parker, 2007; Smith, 2003; Weng, Ahmed, & Megahed, 2017).


In our estimation, the literature on stock prediction can be categorized according to four different metrics: (1) the type of outcome variable used for prediction, i.e., a dichotomous outcome for movement or a continuous outcome for price/returns; (2) the predictors included in the model, which comprise traditional predictors (features extracted from market, economic and technical indicators) and/or crowd-sourced predictors (e.g., features extracted from web searches, financial news sentiment, etc.); (3) the type of prediction models used, which are typically based on the assumptions made in metrics (1)-(2); and (4) the length of the prediction period (i.e., short-term versus long-term investment windows). These metrics (and the corresponding grouping of the literature) are discussed in more detail in the paragraphs below.

There are two main types of prediction outcomes in the stock market prediction literature: (a) stock market movement (see, e.g., Ballings et al. 2015; Bollen et al. 2011; Nguyen et al. 2015; Patel et al. 2015a; Schumaker & Chen 2009; Weng, Ahmed, et al. 2017), where the prediction goal is whether the stock is going up or down at a predefined time point; and (b) a continuous target, where the goal is to predict either the price (e.g., Göçken, Özçalıcı, Boru, & Dosdoğru 2016; Patel, Shah, Thakkar, & Kotecha 2015b; Ticknor 2013) or the returns on investment (e.g., Chong et al. 2017; Oliveira et al. 2017; Rather, Agarwal, & Sastry 2015). From a financial market point of view, the underlying motivation behind the movement and continuous prediction models is somewhat different. Specifically, the literature on movement prediction implicitly assumes that the task is to "generate profitable action signals (buy and sell) than to accurately predict future values of a time series" (Gidofalvi, 2001, p. 1). On the other hand, the prediction of the specific stock price, index or return can provide decision makers with more accurate information pertaining to risk-adjusted trading profits (Kara, Boyacioglu, & Baykan, 2011). In this paper, our objective is to predict the stock price, since it provides more complete information than predicting movement alone. It should also be clear to the reader that the movement information can be generated from the price, but not the other way around.

From a predictors' (explanatory variables) perspective, the literature has traditionally relied on the time series data of the stock market, technical analysis/indicators and economic indicators in predicting the future performance of stocks and indices. The trading of a given stock can be characterized using: (a) the stock's opening and/or closing prices; (b) statistics capturing the variation/volatility of the stock; and (c) the trading volume of the stock. Some of these features are included in most (if not all) stock market prediction models. Technical analysis considers historical financial market data, such as past prices and the volume of a stock, and uses charts as primary tools to predict price trends and make investment decisions (Murphy, 1999). Commonly used technical indicators include the moving average, moving average convergence and divergence, the relative strength index, and the commodity channel index (Tsai, Lin, Yen, & Chen, 2011). For an introduction to how the market data and technical indicators are used in the literature, we refer the reader to Weng, Ahmed, et al. (2017).
Economists have noted that stock prices can be correlated to: (a) macroeconomic indices, (b) seasonal effects, and (c) political events (Kao, Chiu, Lu, & Yang, 2013; Mok et al., 2004). For instance, observed daily stock returns reflect the stock market reaction to factors such as the release of economic indicators, government intervention or political issues, among others (Mok et al., 2004). Our previous work (Weng et al., 2017, Under Review) shows that using ensemble methods with only macroeconomic indicators to predict the one-month-ahead prices of several U.S. indices and sector indices can result in predictions with a mean absolute percent error (MAPE) < 1.87%. These results build on the observations of Tsai and Hsiao (2010), who noted that economic performance has a clear impact on the prospects of growth and earnings of companies.


Generally speaking, economic indicators can be divided into coincident, leading and lagging indicators, which can be obtained concurrently with, prior to, or after the related economic activity occurs (Tsai et al., 2011).

With the increased popularity of web technologies and their continued evolution, various sources of online data and analysis have become more accessible to the public. These sources contain financial information either explicitly (e.g., a Google News article discussing/predicting future stock performance) or implicitly (e.g., measures of public interest in a stock/index through Google Search trends). Utilizing such insights, stock market prediction models have started to capitalize on these online data sources (Geva & Zahavi, 2014; Moat et al., 2013; Nassirtoussi et al., 2015; Nguyen et al., 2015; Preis, Moat, & Stanley, 2013; Tetlock, 2007; Weng, Ahmed, et al., 2017; Zhai, Hsu, & Halgamuge, 2007). The research literature conjectures that combining extensive crowdsourcing and/or financial news data with the aforementioned traditional data sources facilitates more accurate predictions.

Consequently, in this paper, we examine the following sets of predictors to form the "knowledge base" of our financial expert system: (a) market data (e.g., the opening, closing, low and high prices of an index); (b) technical indicators (e.g., the Relative Strength Index and the Chande Momentum Oscillator); (c) counts and sentiment scores of financial news (which were shown to have prediction significance in Tetlock 2007); (d) trends in Google query volumes (relevance shown in Preis et al. 2013); and (e) Wikipedia page visit trends (relevance shown in Moat et al. 2013; Weng, Ahmed, et al. 2017). To the best of our knowledge, these five sets of predictors have never been examined in combination in the literature. Note that we do not consider macroeconomic indicators in this paper, since they update monthly and are thus invariant over shorter prediction intervals.

Numerous models have been proposed/implemented to predict stock/index performance. The literature shows that machine learning models typically outperform statistical and econometric models (Hsu, Lessmann, Sung, Ma, & Johnson, 2016; Meesad & Rasel, 2013; Patel et al., 2015b; Weng, Ahmed, et al., 2017; Zhang & Wu, 2009). Perhaps more importantly, the use of machine learning models provides more flexibility when compared to the more traditional models, since they: (a) do not require distributional assumptions (Zhang & Wu, 2009); (b) more easily recognize patterns hidden in time series data (Meesad & Rasel, 2013); and (c) can combine individual classifiers to reduce the variance and obtain better prediction accuracy (Patel et al., 2015b). The literature pertaining to stock price prediction using machine learning models can be categorized into: (a) methods utilizing single/individual classifiers (see e.g., Alkhatib, Najadat, Hmeidi, & Shatnawi 2013; Chen & Hao 2017; Chong et al. 2017; Geva & Zahavi 2014; Guresen, Kayakutlu, & Daim 2011; Khansa & Liginlal 2011; Meesad & Rasel 2013; Schumaker & Chen 2009; Tsai & Hsiao 2010; Wang, Wang, Zhang, & Guo 2011; Zhang & Wu 2009); and (b) methods utilizing ensemble classifiers (see e.g., Araújo, Oliveira, & Meira 2015; Barak & Modarres 2015; Booth, Gerding, & Mcgroarty 2014; Chen, Yang, & Abraham 2007; Göçken et al. 2016; Hassan, Nath, & Kirley 2007; Kristjanpoller, Fadic, & Minutolo 2014; Patel et al. 2015b; Qian & Rasheed 2007; Rather et al. 2015; Tsai et al. 2011; Wang, Wang, Zhang, & Guo 2012; Wang, Zeng, & Chen 2015).
From a machine learning perspective, it is well documented that "ensembles can often perform better than single classifiers" (Dietterich, 2000a, p. 1). The superiority of ensembles has also been shown in the context of financial expert systems (Chen et al., 2007; Qian & Rasheed, 2007; Tsai et al., 2011). Thus, in this paper, we examine ensemble methods in an effort to predict stock prices using multiple data streams. Specifically, we evaluate the effectiveness of the following ensemble methodologies: (a) a neural network regression bagged ensemble, (b) a support vector regression bagged ensemble, (c) a boosted regression tree, and (d) a random forest regression.


Table 1
A tabular view of the stock price prediction literature using machine learning methods.

Single classifiers, one time interval: Geva and Zahavi (2014)∗; Alkhatib et al. (2013); Guresen et al. (2011); Wang et al. (2011); Tsai and Hsiao (2010); Schumaker and Chen (2009)∗.
Single classifiers, multiple intervals: Chen and Hao (2017); Chong et al. (2017); Meesad and Rasel (2013); Khansa and Liginlal (2011); Zhang and Wu (2009).
Ensembles, one time interval: Araújo et al. (2015); Barak and Modarres (2015); Rather et al. (2015); Wang et al. (2015); Booth et al. (2014); Wang et al. (2012); Tsai et al. (2011); Chen et al. (2007); Hassan et al. (2007); Qian and Rasheed (2007).
Ensembles, multiple intervals: Göçken et al. (2016); Patel et al. (2015b); Kristjanpoller et al. (2014).

∗ Incorporates multiple data sources (traditional with online).

In terms of the time point for prediction, the majority of the papers discussed above focus on one particular time point. From an investor/practitioner's perspective, a single-time-period model implicitly assumes the following: (a) buy and sell decisions are made periodically, where the trading cost is minimal compared to the investment; and (b) it is reasonable to sell the stock and then buy it again in the next time period. From our experience, these assumptions are somewhat restrictive. An investor would like to have more information (with the understanding that there is uncertainty in the predictions) pertaining to how the stock price will perform over multiple time periods. Ideally, this information can allow the investor to make more informed decisions.

In Table 1, we categorize the literature on stock price prediction according to the machine learning approach and the time intervals used. We use the "∗" symbol to denote papers that incorporated multiple data sources (i.e., traditional sources with online sources). Note that none of the reviewed ensemble methods incorporate features from both traditional and online sources as potential predictors. Based on the insights from Nassirtoussi et al. (2015), Nguyen et al. (2015) and Weng et al. (2017), we hypothesize that the prediction performance can be improved by filling this research gap.

The overarching goal of this paper is to develop a financial expert system based on ensemble methods that utilizes multiple data sources and is able to more accurately predict stock prices over multiple short-term time periods. To encourage the adoption of our financial expert system and/or similar approaches, we make all our code freely available at: https://github.com/martinwg/stockprediction. Note that our code and documentation provide practitioners and researchers the tools and software packages to scrape data pertaining to any stock (and not just the stocks analyzed in our case study), with the purpose of broadening the utility of our financial expert system.

The remainder of this paper is organized as follows. In Section 2, we provide the details for both the "knowledge base" construction and the "artificial intelligence platform". We discuss our experimental results in Section 3. Finally, we present a summary of the main contributions and limitations of this work in Section 4, as well as some ideas for future research.

2. Methods

We propose a data-driven approach that consists of two main phases, as shown in Fig. 1. In Phase 1, the data is collected through four web APIs: the Yahoo YQL API, the Wikimedia RESTful API, the Quandl Database API, and the Google Trends API. Four sets of data are generated, which include: (a) publicly available market information on stocks, including opening/closing prices, trade volume, and the NASDAQ and DJIA indices, among others; (b) the number of unique visitors for pertinent Wikipedia pages per day; (c) daily counts of financial news on the stocks of interest, together with sentiment scores that measure the bullishness and bearishness of equity prices, calculated as a statistical index of the positivity and negativity of the news corpus; and (d) the daily trend of stock-related topics searched on Google. We obtain commonly used technical indicators that reflect price variation over time (Stochastic Oscillator, MACD, Chande Momentum Oscillator, etc.) from the R package TTR (Ulrich, 2016) to comprise our fifth set of data.

The data then enter two sequential preprocessing steps: (a) data cleaning, which deals with missing and erroneous values; and (b) data transformation, which is required by some machine learning models, such as neural networks. A dimension reduction technique is applied to reduce the complexity of the data and keep the most important and relevant information. In Phase 2, we make the stock price prediction for different periods (lags) using four machine learning ensemble techniques. A modified leave-one-out cross validation (LOOCV) is employed to minimize the bias associated with the sampling. The models are compared and evaluated based on the modified LOOCV using three evaluation criteria. The details for each of these phases are presented in the subsections below.

2.1. Knowledge base: data acquisition

In the data acquisition phase, five sets of data are obtained from the four open source APIs and the TTR R package (Ulrich, 2016). These include traditional time series stock market data, Wikipedia hits, financial news, Google trends and technical indicators. The data sets are preprocessed and merged in Phase I.

First, we obtain publicly available market data on the chosen stock through the Yahoo YQL Finance API. The following variables are obtained as inputs: the daily opening and closing prices, the daily highest and lowest prices, the volume of trades, and the related stock indices (e.g., NASDAQ, DJIA).

The second set of data is queried through the Wikimedia RESTful API for pageview data, which allows us to retrieve the daily visits for the selected stock-related pages while also filtering by visitor class and platform. The reader is referred to https://en.wikipedia.org/api/rest_v1/ for more details. The names of the stock/company Wikipedia pages need to be input by users to process the queries.

The third set of data is acquired using the Quandl Database API, the largest public API integrating millions of financial and economic datasets. The database "FinSentS Web News Sentiment", a subscription-based resource, is used in this study. The R package Quandl (McTaggart, Daroczi, & Leung, 2016) is used to access the database through its API. The queried dataset includes daily news counts and daily average sentiment scores since 2013, derived from publicly available Internet sources.

The fourth data set is the daily trends (number of hits) for stock-related topics on Google Search. Our study uses the recently released Google Trends API (2017) to capture information on stock trends. The default setting of our methodology is to search the trends on the stock tickers and company names. Users are strongly encouraged to select more accurate stock- or company-related terms to improve the performance of the prediction model.

Researchers list several technical indicators that could potentially have an impact on stock price/return prediction, including the Stochastic Oscillator, the moving average and its convergence/divergence (MACD), the relative strength index (RSI), etc. (see e.g. Göçken et al. 2016; Kim & Han 2000; Tsai & Hsiao 2010). In our study, eight commonly used technical indicators are selected, which are shown in Table 2; a minimal sketch of how such indicators can be computed is given below.
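As an illustration only (not the authors' released code), the following R sketch pulls the $C market data with quantmod and computes several of the Table 2 indicators with TTR; the download step and the default indicator windows are our assumptions.

```r
# Sketch: market data via quantmod, technical indicators via TTR.
library(quantmod)
library(TTR)

prices <- getSymbols("C", src = "yahoo", from = "2013-01-01",
                     to = "2016-12-31", auto.assign = FALSE)
close <- Cl(prices)   # daily closing price

indicators <- data.frame(
  fastK = stoch(HLC(prices))[, "fastK"],  # Stochastic Oscillator
  rsi   = RSI(close),                     # Relative Strength Index
  cmo   = CMO(close),                     # Chande Momentum Oscillator
  cci   = CCI(HLC(prices)),               # Commodity Channel Index
  macd  = MACD(close)[, "macd"],          # MACD
  ma5   = SMA(close, n = 5),              # 5-day moving average
  ma10  = SMA(close, n = 10),             # 10-day moving average
  roc   = ROC(close)                      # Rate of Change
)
```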


Fig. 1. An overview of the proposed method. [Diagram. Phase I, Knowledge Base (data acquisition, preprocessing and feature generation): five data sources feed the knowledge base: stock market data (open price, close price, volume, index, etc.); technical indicators (Stochastic Oscillator, MACD, Chande Momentum Oscillator, etc.); financial news (counts and sentiment scores on the stock of interest); Google Trends (number of hits for stock-related searches); and Wikipedia hits (number of unique visitors for pertinent pages). Data preprocessing covers cleaning (missing data, outliers) and transformation (scaling, centering); feature generation covers correlation analysis/visualization and principal component analysis. Phase II, AI Platform (AI models, evaluation and user interface): machine learning ensemble models, time-slicing cross-validation, model evaluation (root mean squared error, mean absolute error, mean absolute percentage error) and a user interface.]

Table 2
Description of technical indicators used in this study.

Technical indicator                   Description
Stochastic Oscillator                 Shows the location of the close relative to the high-low range.
Relative Strength Index (RSI)∗        Measures the speed and change of price movements.
Chande Momentum Oscillator (CMO)∗     Captures the recent gains and losses relative to the price movement over the period.
Commodity Channel Index (CCI)         Used to identify a new trend or warn of extreme conditions.
MACD∗                                 Moving average convergence/divergence oscillator for trend following.
Moving Average∗                       Smooths the time series to form a trend-following indicator.
Rate Of Change (ROC)∗                 Measures the percent change from one period to the next.
Percentage Price Oscillator∗          Measures the difference between two moving averages as a percentage.

∗ Also applied to the financial news, Google Trends and Wikipedia traffic series to generate additional features (see Section 2.1).

Furthermore, counterparts of these technical indicators are computed for the trend and news sentiment data obtained from Wikipedia, the financial news, and Google Trends: six of the selected indicators are applied to generate additional features for these three datasets. The six indicators are marked with an asterisk in Table 2. Please refer to http://stockcharts.com/ for a detailed calculation of the indicators. Finally, ten target series (based on prediction lags) are calculated using the "Close Price" acquired from the Yahoo YQL API. The five sets of data and the ten targets are combined to form the original input data set for preprocessing purposes.

2.2. Knowledge base: data preprocessing

Given that the data is automatically collected through the APIs in this study, some features have missing values or no meaning for a given sample. The preprocessing approach here includes two main steps: dealing with the missing data and removing potential outliers.

First and foremost, we scan through all features queried from the APIs and determine whether any pattern of missing data exists. For missing data, the statistical average is imputed to the appropriate observation when applicable. Otherwise, the corresponding date with missing values is removed from the data sets. The spatial sign (Serneels, De Nolf, & Van Espen, 2006) transformation is used to check for outliers and remove the corresponding data points.

Feature scaling is performed to put each predictor on a common scale. Scaling is required by the models used in this study, especially the support vector regression and neural networks, in order to avoid attributes with greater numeric ranges dominating those with smaller ranges. This study deploys a straightforward and common data transformation approach to center and scale the predictor variables: we use a simple standardization, taking the deviation of each observation from the average of each predictor, divided by the standard deviation. A sketch of these steps is given below.
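The steps of Section 2.2 can be expressed with caret, as an illustrative sketch only (not the authors' exact code); X_raw is a hypothetical data frame holding the merged numeric predictors.

```r
# Sketch of Section 2.2: mean imputation, then centering, scaling and
# the spatial-sign outlier transformation via caret.
library(caret)

impute_mean <- function(x) {              # replace missing values with the
  x[is.na(x)] <- mean(x, na.rm = TRUE)    # column average
  x
}
X <- as.data.frame(lapply(X_raw, impute_mean))  # X_raw: merged numeric predictors

pp  <- preProcess(X, method = c("center", "scale", "spatialSign"))
X_t <- predict(pp, X)                     # standardized, outlier-dampened data
```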


2.3. Knowledge base: feature extraction

For each of the five sets of data, around ten features are collected for each given period, leading to more than fifty variables. The final dataset contains 42 predictors including date, along with lagged stock prices from 1 up to 10 days, for a total of 52 variables. All the variables are numeric except for date. Due to the curse of dimensionality, the accuracy and speed of many common predictive techniques degrade on high dimensional and high velocity data. Therefore, the process of dimension reduction is necessary and might improve the performance of at least some of the prediction models considered. On the other hand, capturing most of the information provided by the original variables is of utmost importance. We apply principal component analysis (PCA) to our training set for the prediction models. Researchers show that PCA improves, in some instances, the accuracy and stability of stock prediction models (Lin, Yang, & Song, 2009; Tsai & Hsiao, 2010).

PCA is probably the most commonly used multivariate technique. Its origin can be traced back to Pearson (1901), who described the geometric view of this analysis as looking for lines and planes of closest fit to systems of points in space. Hotelling (1933) further developed the technique and was the first to use the term "principal component". The goal of PCA is to extract and keep only the most important and relevant information from a given set of data. To achieve this, PCA projects the original data onto principal components (PCs), which are linear combinations of the original variables, so that the (second-order) reconstruction error is minimized. For normal variables (with mean zero), the (second-order) covariance matrix contains all the information about the data. Thus the PCs provide the best linear approximation to the original data: the first PC is computed as the linear combination capturing the largest possible variance; the second PC is then constrained to be orthogonal to the first PC while capturing the largest possible variance unaccounted for; and so the process goes on. The PCs that capture the most variance are obtained through singular value decomposition (SVD). Since the variance depends on the scale of the variables, standardization (i.e., centering and scaling) is needed beforehand, so that each variable has a zero mean and unit standard deviation.

To further understand the properties of PCA, let \(X\) be the standardized data matrix. The covariance matrix can be obtained as \(\Sigma = \frac{1}{n}XX^\top\), which is symmetric and positive definite. By spectral decomposition, we can write \(\Sigma = Q\Lambda Q^\top\), where \(\Lambda\) is a diagonal matrix consisting of the ordered eigenvalues of \(\Sigma\), and the column vectors of \(Q\) are the corresponding eigenvectors, which are orthonormal. The PCs can be obtained as the columns of \(Q\). It can be shown (Fodor, 2002) that the total variation is equal to the sum of the eigenvalues of the covariance matrix, \(\sum_{i=1}^{p}\operatorname{Var}(PC_i) = \sum_{i=1}^{p}\lambda_i = \operatorname{trace}(\Sigma)\), and the fraction \(\sum_{i=1}^{k}\lambda_i / \operatorname{trace}(\Sigma)\) gives the cumulative proportion of the variance explained by the first \(k\) PCs. In many cases, the first few PCs capture most of the variation, so the remaining components can be disregarded with only minor information loss.
PCA derives orthogonal components, meaning they are uncorrelated with each other, and since our stock market data contains many highly correlated variables, applying PCA helps us alleviate the effect of strong correlations between features while reducing the dimensionality of the feature space. However, as an unsupervised learning algorithm, PCA does not consider the target while summarizing the data variation. The relationship between the target and the derived components might be more complex, or the surrogate predictors could provide no suitable relationship to the target, so we provide results using the PCs as predictors and also using the original features. Moreover, since PCA utilizes only the first and second moments, it relies heavily on the assumption that the original data have an approximately Gaussian distribution. We use the PCs that retain the majority of the variance (information), setting the threshold to 95%. The results of the PCA analysis are discussed in Section 3, where the prediction performance of the proposed models with and without the dimension reduction is analyzed. A minimal sketch of this step is given below.
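A minimal sketch of the PCA step with base R's prcomp, assuming X_t is the standardized predictor matrix from the preprocessing step; the 95% threshold matches Section 2.3.

```r
# Sketch: PCA on the standardized predictors X_t, keeping enough
# components to explain 95% of the variance.
pca <- prcomp(X_t)                        # data already centered and scaled

cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
k <- which(cum_var >= 0.95)[1]            # 17 components for the $C data

scores <- pca$x[, 1:k]                    # surrogate predictors for training
head(pca$rotation[, 1:2])                 # loadings on the first two PCs
```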

2.4. The inference engine: AI model comparison and evaluation

In this phase, our models and their evaluation approach are introduced. We compare the effectiveness of four machine learning models: a neural network regression ensemble, a support vector regression ensemble, a boosted regression tree and a random forest regression. The four models are considered ensembles of individual learners, with the main differences stemming from the type of base learner used and the choice of ensemble approach: boosting, bagging or random forest. From a machine learning perspective, the following two components should be taken into consideration for a successful stock price prediction model: (a) capturing the dimensionality of the input space; and (b) detecting the trade-off between bias and variance. A more detailed discussion of our feature extraction approach using PCA is presented in Section 2.3. Therefore, this section focuses on describing the proposed models based on the bias/variance trade-off. The reader should note that a cross-validation approach has been applied to the four models during training. In the following subsections, we first provide a short overview of our proposed models and cross validation. We then introduce the performance evaluation metrics used in this study to identify the most suitable approach.

2.4.1. Neural networks regression ensemble (NNRE)

Inspired by the complex biological neuron systems in our brain, artificial neurons were proposed by McCulloch and Pitts (1943) using threshold logic. Werbos (1974) and Rumelhart, Hinton, and Williams (1985) independently discovered the backpropagation algorithm, which could train complex multi-layer perceptrons effectively by computing the gradient of the objective function with respect to the weights. Neural networks have been widely used since then, especially since the revival of the deep learning field in 2006 as parallel computing emerged. Neural networks have been used successfully in stock market prediction, due to their ability to handle complex nonlinear systems of stock market data.

In neural networks, we describe the features as input \(x\) and the corresponding weighted sum as \(z = w^\top x\). The information is transformed by the activation functions within each neuron and propagated through layers, finally resulting in a given output. If there are hidden layers between the input and output layers, the network is called "deep", giving rise to the term deep learning. The hidden layers can distort the linearity of the weighted sum of inputs, so that the outputs become linearly separable. Theoretically, we can approximate any function that maps the inputs to the output if the number of neurons is not limited. This flexibility gives neural networks the ability to obtain higher accuracy in stock market prediction, where the true data generating mechanism is extremely complicated. The functions in each neuron are called "activations", and they can be of many different types.
The most commonly used activation is the sigmoid function, which is smooth and has an easy-to-express first order derivative (in terms of the sigmoid function itself), thus it is appropriate to train by using back-propagation. Furthermore, its S-shaped curve is good for classification, but as for regression, this property might be a disadvantage. It is worth to note that the rectified linear unit (ReLu), which


It is worth noting that the rectified linear unit (ReLU), which takes the simple form \(f(z) = \max(z, 0)\), is less likely to have a vanishing gradient; instead, the gradient is rather constant (when \(z > 0\)). This might result in faster learning for networks with many layers. Also, sparsity of the weights arises as \(z < 0\), reducing the complexity of the representation in a large architecture. Both properties have allowed the ReLU to become one of the most dominant non-linear activation functions in the last few years, especially in the field of deep learning (LeCun, Bengio, & Hinton, 2015).

One of the main concerns with using ensembles of neural networks is that, because of their complexity, neural networks are not weak learners (classifiers with accuracy slightly higher than 50%). Ensembles rely on the use of unstable and weak classifiers to reduce the variance of the predictions. To alleviate this problem we do not fit a deep network and instead make use of a two-layer neural network structure (Foresee & Hagan, 1997; MacKay, 1992), with the number of neurons chosen using cross-validation at each iteration of the ensemble. We then construct the ensemble of neural networks by using bagging; that is, we take bootstrap samples of the training data set and iterate the process multiple times to reduce the variance in the bias-variance decomposition framework. The final prediction is computed as the average across iterations. In our experiments, the bagging approach results in an average improvement of 30% in test performance metrics compared to a single two-layer neural network with the same characteristics and features, including the number of neurons. We use 100 iterations in our bagging ensemble; a minimal sketch is given below.
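A minimal bagged neural-network sketch using the nnet package (one hidden layer with linear output). The hidden-layer size, maxit and MaxNWts settings are illustrative assumptions; the paper tunes the number of neurons by cross-validation, and its Bayesian-regularization details are omitted here.

```r
# Sketch: bagged single-hidden-layer networks via nnet.
library(nnet)

bagged_nnet <- function(X, y, B = 100, size = 10) {
  n <- nrow(X)
  lapply(seq_len(B), function(b) {
    idx <- sample(n, n, replace = TRUE)           # bootstrap sample
    nnet(X[idx, ], y[idx], size = size, linout = TRUE,
         maxit = 500, MaxNWts = 5000, trace = FALSE)
  })
}

predict_bagged <- function(models, X_new) {
  rowMeans(sapply(models, predict, newdata = X_new))  # average member output
}
```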

2.4.2. Support vector regression ensemble (SVRE)

To explain the learning process from a statistical point of view, Vapnik and Chervonenkis (1974) proposed the VC learning theory, one of whose major components characterizes the construction of learning machines that enable them to generalize well. Based on that, Vapnik and his colleagues developed the support vector machine (SVM) (Boser, Guyon, & Vapnik, 1992; Cortes & Vapnik, 1995), which has been shown to be one of the most influential supervised learning algorithms. The key insight of the SVM is that those points closest to the separating hyperplane, called the support vectors, are more important than others. Assigning non-zero weights only to those support vectors while constructing the learning machine can lead to better generalization. The separating hyperplane is called the maximum margin separator. Drucker et al. (1997) then expanded the idea to regression problems by omitting, while calculating the cost, the training points that deviate from the actual targets by less than a threshold ε. These points with small errors are also called support vectors, and the corresponding learning machine for regression is called support vector regression (SVR). The goal of training the SVM/SVR is to find a hyperplane that maximizes the margin, which is equivalent to minimizing the norm of the weight vector for every support vector, subject to the constraints that make each training sample valid; i.e., for SVR, the optimization problem can be written as

$$\begin{aligned} \min_{w,\,b} \quad & \tfrac{1}{2}\lVert w \rVert^2 \\ \text{s.t.} \quad & y_i - w^\top x_i - b \le \varepsilon, \\ & w^\top x_i + b - y_i \le \varepsilon, \end{aligned}$$

where \(x_i\) is a training sample with target \(y_i\). We will not show the details here, but maximizing its Lagrangian dual is a much simpler quadratic programming problem. This optimization problem is convex, thus it does not get stuck in local optima. Convex optimization is solved by well-studied techniques, such as the sequential minimal optimization (SMO) algorithm. Theoretically, SVR can be deployed in our regression model to capture the important factors that significantly affect the stock price while avoiding the problem of overfitting.


The reason is not limited to the picking of support vectors, but also includes the introduction of the idea of soft margins (Cortes & Vapnik, 1995). The allowance of softness in margins dramatically reduces the computational work during training. More importantly, it captures the noisiness of real-world data (such as stock market data) and can produce more generalizable results. Another key technique that makes the SVM/SVR so successful is the use of the well-known kernel trick, which maps the non-linearly-separable original input into a higher dimensional space so that the data become linearly separable, thus greatly expanding the hypothesis space (Russell & Norvig, 1995).

The SVM/SVR has its own disadvantages. Its performance is extremely sensitive to the selection of the kernel function, as well as the parameters. In our case, we picked the Radial Basis Function (RBF) as the kernel in our SVR, since the stock market data contains high noise. Another major drawback of kernel machines is that the computational cost of training is high when the dataset is large (Goodfellow, Bengio, & Courville, 2016). The SVM/SVR also suffers from the curse of dimensionality and struggles to generalize well under certain conditions. We again use bagging with 100 iterations to form an ensemble of SVRs; that is, we take bootstrap samples of the training data set and iterate the process multiple times. The final prediction (SVRE) is also computed as the average predicted value across iterations, as sketched below.
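A minimal sketch of the bagged SVR with an RBF kernel via e1071; cost, gamma and epsilon are left at the package defaults here, whereas in practice they would be tuned.

```r
# Sketch: bagged epsilon-SVR with an RBF kernel via e1071.
library(e1071)

bagged_svr <- function(X, y, B = 100) {
  n <- nrow(X)
  lapply(seq_len(B), function(b) {
    idx <- sample(n, n, replace = TRUE)           # bootstrap sample
    svm(X[idx, ], y[idx], type = "eps-regression", kernel = "radial")
  })
}
# SVRE prediction: rowMeans(sapply(models, predict, newdata = X_new))
```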
2.4.3. Boosted regression tree (BRT)

Rooted in probably approximately correct (PAC) learning theory (Valiant, 1984), Kearns and Valiant (1988) posed the question of whether a set of "weak" learners (i.e., learners that perform slightly better than random guessing) can be combined to produce a learner with arbitrarily high accuracy. Schapire (1990) and Freund (1990) then answered this question affirmatively with the first provable boosting algorithm. AdaBoost, the most popular boosting algorithm, was developed by Freund and Schapire (1995). AdaBoost addresses two fundamental questions in the idea of boosting: how to choose the distribution in each round, and how to combine the weak rules into a single strong learner (Schapire, 2003). AdaBoost uses "importance weights" to force the learner to pay more attention to the examples with larger errors; that is, it iteratively fits a learner using weighted data and updates the weights with the errors from the fitted learner. Lastly, AdaBoost combines these weak learners through weighted majority voting. Boosting is computationally efficient and has very few parameters to set, while (theoretically) guaranteeing a desired accuracy given sufficient data. Practically, however, the performance of boosting depends significantly on the sufficiency of the data as well as the choice of the base learner. Applying base learners that are too weak can fail to work, while overly complex base learners can result in overfitting. Boosting also seems susceptible to uniform noise (Dietterich, 2000b; Martinez & Gray, 2016), since it may over-emphasize highly noisy examples.

As "off-the-shelf" supervised learning methods, decision trees are the most common choice of base learner in AdaBoost. Decision trees are simple to train, yet powerful predictive tools. They partition the space of all joint predictor variables into disjoint regions using greedy search, based either on the error or on the information gain. However, due to their greedy strategy, the results obtained by decision trees can be unstable and have high variance, so they often achieve lower generalization accuracy. Boosting improves upon the performance of decision trees by reducing the bias as well as the variance (Friedman, Hastie, & Tibshirani, 2001). We use 100 iterations of unpruned regression trees as the base learner for our boosting (AdaBoost) approach; a compact sketch is given below.
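The paper's BRT uses AdaBoost with unpruned regression trees; the sketch below is a compact AdaBoost.R2-style loop (Drucker, 1997) with rpart base learners, written by us for illustration. For brevity it returns normalized log(1/β) weights for a weighted-mean prediction, a simplification of the weighted median used by AdaBoost.R2.

```r
# Compact AdaBoost.R2-style loop with unpruned rpart trees.
library(rpart)

boost_trees <- function(df, B = 100) {    # df holds predictors and column y
  n <- nrow(df); w <- rep(1 / n, n)
  fits <- list(); alphas <- numeric(0)
  for (b in seq_len(B)) {
    idx  <- sample(n, n, replace = TRUE, prob = w)
    fit  <- rpart(y ~ ., data = df[idx, ],
                  control = rpart.control(cp = 0, minsplit = 2))  # unpruned
    L    <- abs(predict(fit, df) - df$y)
    if (max(L) == 0) break                # perfect fit; nothing left to boost
    L    <- L / max(L)                    # linear loss scaled to [0, 1]
    Lbar <- sum(w * L)                    # weighted average loss
    if (Lbar >= 0.5) break                # base learner too weak; stop
    beta <- Lbar / (1 - Lbar)
    w    <- w * beta^(1 - L); w <- w / sum(w)  # emphasize hard examples
    fits[[length(fits) + 1]] <- fit
    alphas <- c(alphas, log(1 / beta))
  }
  list(fits = fits, alphas = alphas / sum(alphas))
}
```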


2.4.4. Random forest regression (RFR)

Breiman (2001) defines a random forest (RF) as an algorithm consisting of a collection of tree-structured classifiers built from independently and identically distributed random vectors. Each tree casts a unit vote for the most popular class for each input when the response is binary. For regression problems, the RF prediction is the average prediction from the regression trees. RFs inject randomness by growing each tree on a random subsample of the training data, and also by using a small random subset of the predictors at each decision node split. The RF method is similar to boosting in that it combines classifiers that have been trained on a subset sample or a weighted subset, but differs in that boosting gives different weights to the base learners based on their accuracy, while random forests use uniform weights. There has been ample research on these ensemble methods and how they perform under different settings. For a more complete review of their performance, the reader is referred to Quinlan (1996), Maclin and Opitz (1997), Dietterich (2000a), and Maclin and Opitz (2011). We use 100 trees for the RFR implementation here.

2.4.5. Time series cross validation

In this study, the modified LOOCV is applied throughout the prediction model comparison and evaluation. The objective is to minimize the bias associated with the random sampling of the training and test data samples (Arlot et al., 2010). Traditional random cross validation (e.g., k-fold) is not suitable for this study because of the time series nature of stock price prediction. Thus, the modified LOOCV approach is used, which performs a time window slicing cross validation strategy. The methodology moves the training and test sets in time by creating time slice windows. There are three parameters to be set in the training process: (a) the Initial Window, which dictates the initial number of consecutive values in each training set sample; (b) the Horizon, which determines the size of the test set samples; and (c) the Fixed Window, a logical parameter that determines whether the size of the training set is varied. The R package caret (R Core Team, 2016) is used to perform this approach. We set the Initial Window parameter to 80% of the observations, the Horizon parameter to 5%, and the Fixed Window to TRUE for a static moving window of 80% of the data; a sketch of this setup is given below.
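A sketch of the time-slicing setup via caret's trainControl (initial window 80% of the data, horizon 5%, fixed window), here wrapped around a random forest (method = "rf") as an example member model; X_t and y are the hypothetical predictor matrix and 1-day-ahead target from the earlier sketches.

```r
# Sketch: time-slicing cross-validation via caret, applied to an RFR.
library(caret)

n <- nrow(X_t)
ctrl <- trainControl(method        = "timeslice",
                     initialWindow = floor(0.80 * n),
                     horizon       = floor(0.05 * n),
                     fixedWindow   = TRUE)

rfr <- train(x = X_t, y = y, method = "rf",
             ntree = 100, trControl = ctrl)   # RFR evaluated over the slices
```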

2.4.6. Model evaluation

To evaluate the performance of the four modeling methods, three commonly used evaluation criteria are used in this study: (a) the root mean square error (RMSE), (b) the mean absolute error (MAE), and (c) the mean absolute percentage error (MAPE), where

$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}(A_t - F_t)^2}, \qquad \text{MAE} = \frac{1}{n}\sum_{t=1}^{n}\lvert A_t - F_t\rvert, \qquad \text{MAPE} = \frac{1}{n}\sum_{t=1}^{n}\left\lvert \frac{A_t - F_t}{A_t}\right\rvert \times 100,$$
and \(A_t\) is the actual target value for the \(t\)th observation, \(F_t\) is the predicted value for the corresponding target, and \(n\) is the sample size. The RMSE is the most popular measure of the error rate of regression models; as \(n \to \infty\), it converges to the standard deviation of the theoretical prediction error. However, the quadratic error may not be an appropriate evaluation criterion for all prediction problems, especially in the presence of large outliers. In addition, the RMSE is scale-dependent and sensitive to outliers. The MAE considers the absolute deviation as the loss and is a more "robust" measure for prediction, since the absolute error is more sensitive to small deviations and much less sensitive to large ones than the squared error. However, since the training process for many learning models is based on a squared loss function, the MAE can be (logically) inconsistent (Woschnagg & Cipan, 2004) with the model optimization selection criteria.

The MAE is also scale-dependent, and thus not suitable for comparing prediction accuracy across different variables or time ranges. In order to achieve scale independence, the MAPE measures the error proportional to the target value. The MAPE, however, is extremely unstable when the actual value is small (consider the case when the denominator \(A_t = 0\) or is close to 0). We consider all three measures to obtain a more complete view of the performance of the models, given the limitations of each performance measure.

The fourth evaluation criterion is training runtime. We measure the time in seconds to complete the ensembles (for 100 iterations) using an Intel Xeon E5-2695 24-core workstation clocked at 2.30 GHz per core. We do not make use of parallel multicore processing. The reader should note that the runtime is not intended to measure the theoretical computational complexity of the algorithms presented here, but merely to illustrate a comparison of the time it takes to run each algorithm under the same circumstances with fixed physical computational power. A small helper computing the three error criteria is sketched below.
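For reference, the three criteria of Section 2.4.6 can be computed with a small helper; actual and forecast are vectors of targets and predictions.

```r
# Helper computing RMSE, MAE and MAPE for vectors of actuals/forecasts.
eval_metrics <- function(actual, forecast) {
  err <- actual - forecast
  c(RMSE = sqrt(mean(err^2)),
    MAE  = mean(abs(err)),
    MAPE = mean(abs(err / actual)) * 100)
}
```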
3. Experimental results and discussion

In this section, we go through the techniques and methodologies used to complement and build our final ensemble models. The first step in our approach uses visualization techniques to recognize highly correlated features. The extracted features, with and without the PCA transformation, are then used to build the predictive models. Finally, we compare the proposed ensemble models using the performance measures described in Section 2.4.6.

3.1. Explanatory analysis

We explain here our exploratory analysis of the original data and our approach to capture the characteristics containing the most information from the available features. As discussed in Section 2.2, the features collected through the APIs have high variability and contain missing/meaningless samples. After exploring each feature we perform the necessary data cleaning, feature centering and feature scaling. Furthermore, we also pay close attention to the correlation structure of the features. To illustrate our approach, a case study based on the Citi Group stock ($C) is presented here. The data is collected from January 2013 to December 2016 on a daily basis.

Fig. 2 shows a visualization of the correlation matrix of the five sets of input features, in which the features are grouped using a hierarchical clustering algorithm (so that features with high correlations are close to each other), and the colors indicate the magnitude of the pairwise correlations among features. Dark blue implies strong positive correlation, dark red stands for strong negative correlation, and white implies that the two features are uncorrelated. The dark blue blocks along the diagonal indicate that the features fall into several large clusters, and within each cluster the features show strong collinearity. For example, the different prices (open, close, high, or low) on the same day are clearly close to each other in most cases and thus probably fall into the same cluster. There are also features negatively correlated with each other; for instance, the volume and the index have opposite trends, which might be due to the low volatility of the Citi Group ($C) stock. This suggests investors tend to buy other stocks when the corresponding market index is increasing. A sketch of this visualization step is given below.
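A sketch of how a figure like Fig. 2 can be produced; the corrplot package is our choice here (the paper does not name its plotting tool), with features ordered by hierarchical clustering.

```r
# Sketch: cluster-ordered correlation heatmap of the predictors.
library(corrplot)
corrplot(cor(X_t), order = "hclust", tl.cex = 0.5)
```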


Fig. 2. Correlation matrix for features.

3.2. Feature extraction

For our Citi Group ($C) stock analysis, the first three principal components, extracted from all the features considered, accounted for 21.13%, 16.86%, and 10.95% of the total variance of the data, respectively. Fig. 3(a) shows the cumulative percentages of the total variation in the data explained by each component, from which we can observe that the first 13 principal components describe 90.78% of the information in the features, and the first 17 components capture 95.29%. The first 26 components explain > 99.26% of the total variance, i.e., the remaining 15 components capture < 0.74% of the variation in the data. Deploying the predetermined threshold of 95%, we use 17 components out of the total of 41 for training.

Fig. 3(b) characterizes the loadings (i.e., the coefficients in the linear combination of features that derive a component) for each feature associated with the first two principal components. It is quite clear that the loadings of the prices, as well as the technical indicators, have the largest effect on the first component; e.g., the coefficient of the close price is 0.2668, and that of the RSI is 0.2645. As for the second component, the external Internet features contribute the most in the positive direction. For instance, the coefficients for Google Trend, Wiki Traffic and News Count are 0.2547, 0.1957 and 0.2137, respectively. Also note that the News Sentiment plays a role that is negatively associated with the second component, with coefficient −0.1018. Note that Fig. 3(b) shows that the relationship between the first two principal components is "scattered", i.e., they are uncorrelated, which is expected since they are orthogonal. We highlight this observation here, however, to note the utility of using PCA to generate uncorrelated features that capture different information.

Based on the observations above, PCA provides two benefits: (a) reducing the dimension space; and (b) ensuring that the features selected are not correlated. However, as an unsupervised learning algorithm, PCA does not consider the target while projecting the data. The implications of the unsupervised nature of PCA include: (a) the surrogate predictors may provide no suitable relationship to the dependent variable (stock price); and/or (b) the connection between the target and the predictors may be weakened by the PCA transformation. Thus, in this paper, we examine the performance of the aforementioned ensemble models with and without using PCA.

3.3. Set of predictor stages

To further evaluate how much additional predictive value the proposed ensemble models obtain from the use of financial news sentiment, trends and online data sources, we divide our set of predictors into four stages. Table 3 shows the variables selected at each stage. For instance, in Stage 1 we consider only technical indicators as predictors for the proposed ensemble methods. We expect the ensembles created using only the technical indicators as predictors to be highly predictive, as these variables provide the most information about how the stock market behaves. Stage 2 adds the news sentiment and count variables to the variables considered in Stage 1.


Fig. 3. Illustration of the variation explained by principal components.

Table 3
Model variables selected at each stage. Stages are cumulative: each stage includes all variables from the preceding stage.

Stage 1 (market data and technical indicators): Open, High, Low, Close, Volume, Index, Market_fastK, Market_fastD, Market_slowD, Market_RSI, Market_CMO, Market_CCI, Market_MACD, Market_MA5, Market_MA10, Market_ROC.
Stage 2 (adds financial news features): newsSentiment, newsCount, newsCount_RSI, newsCount_CMO, newsCount_MACD, newsCount_MA5, newsCount_MA10, newsCount_ROC, newsCount_OSCP.
Stage 3 (adds Google Trends features): gTrend, gTrend_RSI, gTrend_CMO, gTrend_MACD, gTrend_MA5, gTrend_MA10, gTrend_ROC, gTrend_OSCP.
Stage 4 (adds Wikipedia traffic features): Wikitraffic, Wikitraffic_RSI, Wikitraffic_CMO, Wikitraffic_MACD, Wikitraffic_MA5, Wikitraffic_MA10, Wikitraffic_ROC, Wikitraffic_OSCP.

We hypothesize that adding these variables should provide additional improvement to the predictive power, albeit a significantly smaller improvement than the overall contribution of the variables in Stage 1. Stage 3 adds the Google Trends data, and Stage 4 includes the Wikipedia traffic, forming the most complete set of predictors. The ensemble models are trained using both the PCA-transformed and the untransformed sets of predictors for each stage.

3.4. Model comparison and evaluation

As previously mentioned, four commonly used machine learning models have been applied to our ($C) stock case study: a neural network regression ensemble (NNRE), a support vector regression ensemble (SVRE), AdaBoost with unpruned regression trees as base learners (BRT) and a random forest with unpruned regression trees as base learners (RFR). We use three evaluation metrics (MAE, MAPE, RMSE), in addition to the training runtime in seconds, to gauge the performance of the four models in this study. The data is split into two sets, training and test. As explained in Sections 2.4.5 and 2.4.6, the modified LOOCV approach using time-slicing windows is applied throughout model development. Since the stock market is essentially a time series, 80% of the data is used for training, and during the time-slicing process the training set only contains data points that occur prior to the data points in the validation set. Thus, no future samples are used to predict past samples. Specifically, the size of each training sample is 80% of the data across each time slice, and the test set contains 5% of the data. The process is repeated with different training sets, where the training size is not varied through the time slicing. Therefore, a series of training and test sets is generated and used for training and evaluating the models. Afterwards, the prediction performance is computed by averaging the metrics over the validation sets, with the exception of training runtime. The performance of the four models in predicting the one-day-ahead stock price using features with and without the PCA transformation is shown in Fig. 4. The number of iterations is set at 100 for each of the ensembles considered for a more even comparison, but we should note that the number of iterations is a parameter that can be further optimized through cross-validation to achieve better results.


Fig. 4. Performance of the NNRE, SVRE, BRT and RFR ensembles at each stage.

Several conclusions/observations can be made from Fig. 4. First, the test error improves for most ensembles as information on news sentiment, trends and other online sources is appended to the technical indicators as predictors (stages). The SVRE model is the exception, showing a consistently worsening performance as more variables are added, which is indicative of overfitting. The use of PCA has a positive impact on predictive performance in most criteria analyzed. Overall, the boosting (BRT) and random forest (RFR) ensembles have the best average performance on most of the metrics analyzed, which also includes consistency from training to testing performance. For instance, using the MAPE for illustration, \(\Delta\text{MAPE} = 100 \times |\text{MAPE}_{\text{Test}} - \text{MAPE}_{\text{Train}}| / \text{MAPE}_{\text{Train}}\) for the two models is an impressive 2.16% and 6.3%, respectively.

The SVRE and NNRE ensembles present a drop in consistency performance in line with results typically published in the machine learning literature, ΔMAPE < 20%. From a practical perspective, all 8 models (4 models × 2 [i.e., PCA/no PCA]) can predict the 1-day-ahead price of the stock with a MAPE ≤ 1.5% (with 6 models under 1%) using all available predictors (Stage 4). Secondly, we can see that the use of PCA improves the predictive performance of the ensembles in some instances (irrespective of which metric is used for evaluation), but in some cases it does not. From a practitioner's perspective, the decision to use PCA or not hinges on two factors: (a) what is an acceptable MAPE for the testing data (e.g., do they pick the best model or any model under a pre-specified acceptable MAPE?); and (b) how much time are they willing to dedicate to training the model. For the first factor, three of the six models with a MAPE under 1% involved the use of PCA, so there is not a significant drop in performance by choosing either option. For the second factor, the use of PCA can cut the training time significantly, as can be seen in Fig. 4. For the BRT, NNRE, RFR and SVRE Stage 4 models, the corresponding reductions in runtime are 58%, 268%, 10%, and 41%, respectively. We note that there exists a significant difference in runtime between the different ensemble methods (irrespective of whether PCA is used or not). We attribute this both to the complexity of the method and to the availability of optimized R packages.

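Relatedly, the PCA preprocessing step that produces these runtime savings can be sketched as follows; X_train and X_test are hypothetical predictor matrices, and the 95% variance threshold is an illustrative choice rather than necessarily the one in our supplemental code.

# A sketch of PCA preprocessing with caret: the rotation is learned on
# the training predictors only and then reused on the test predictors.
library(caret)

pp <- preProcess(X_train, method = c("center", "scale", "pca"),
                 thresh = 0.95)   # keep components explaining 95% of the variance
X_train_pca <- predict(pp, X_train)
X_test_pca  <- predict(pp, X_test)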

Thus, our results (for a given ensemble model) may not be typical if Python or some other software/programming language is used. Hereafter, we focus on the ensembles' performance using PCA, since the performance is similar and the training time is shorter than for the no-PCA models. Note that a model predicting next-day stock prices can be feasibly implemented as long as the time to train does not exceed the time between the close of trading and the next day's opening.

Fig. 4 does not provide insights into the effectiveness of each ensemble in capturing the turning points in the stock price. To overcome this limitation, we depict the Stage 4 prediction performance of the competing ensembles over time in Figs. 5 and 6. Fig. 5 illustrates the prediction pattern of the competing ensemble models for the $C stock price compared to the actual price, and Fig. 6 shows the prediction bias, defined as b = |y − ŷ|. Some patterns can be observed in the predictive errors, but the main finding is that the BRT and RFR methods have the smallest prediction errors and bias. An interesting observation is that when the stock price is stable (i.e., it only changes within a small range), the SVRE method overestimates the volatility by exaggerating both the amplitude and the frequency of the oscillations, resulting in the highest test error rate; however, the SVRE ensemble does a decent job at predicting price turns. A finely tuned SVRE might thus be better suited to predicting price turns and identifying opportunities for arbitrage. From the above discussion, we have found that the BRT and RFR ensembles outperform the SVRE and NNRE ensembles for one-day ahead stock price prediction, not only in predictive performance but also in consistency and runtime.

To formally understand the usefulness of our approach as the prediction window increases, we consider the forecasting performance for up to 10 lags, using the notation Lag X for the target that predicts the price X days ahead of the market. As an example, we consider the BRT ensemble using PCA; the results are presented in Table 4. From the results, it is clear that the performance degrades as the prediction period increases. Given the natural volatility of the market, the rate of change in prices is commonly larger for long-term predictions than for short-term ones. Moreover, the results validate that the features obtained from internet sources, such as Google Trends, shock the stock market significantly for only a relatively short period (one or two days); the impact of these variables on the predictive performance therefore gradually diminishes as the lag grows.
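For concreteness, the Lag X targets can be built by shifting the closing-price series, as in the sketch below (stock_data is again a hypothetical data frame sorted by date):

# The target for day t is the closing price X days later; trailing
# days with no observed future price are set to NA.
make_lag_target <- function(close, X) {
  c(close[(1 + X):length(close)], rep(NA, X))
}

targets <- sapply(1:10, function(X) make_lag_target(stock_data$close, X))
colnames(targets) <- paste0("Lag", 1:10)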


[Figure: line plot comparing the actual $C closing price (roughly $35 to $60) with the NNRE, SVRE, BRT and RFR predictions; y-axis: Price, x-axis: Date, 2015-12-29 through 2016-11-02.]
Fig. 5. Ensemble predictions and actual price of the $C stock over time.

[Figure: four stacked panels (NNRE, SVRE, BRT, RFR) showing each ensemble's absolute prediction bias (0 to 4) over dates from 2015-12-29 through 2016-12-01.]
Fig. 6. Prediction bias over time.

Table 4
The performance of the BRT ensemble on different targets.

Target    MAE     MAPE    RMSE
Lag 1     0.349   0.787   0.482
Lag 2     0.668   1.371   0.849
Lag 3     0.754   1.555   0.952
Lag 4     0.662   1.490   0.868
Lag 5     0.759   1.720   1.010
Lag 6     0.748   1.690   1.020
Lag 7     0.778   1.760   1.080
Lag 8     0.788   1.790   1.110
Lag 9     0.837   1.890   1.150
Lag 10    0.800   1.790   1.120

An analysis of the importance of each variable to the predictive performance across lags shows that the technical indicators remain considerably important in terms of prediction power, whereas the importance of the variables obtained from online sources varies significantly after lag 3. The reader is referred to our online supplemental material at https://github.com/martinwg/stockprediction for more information on variable importance.
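As an illustration of how such importance scores can be obtained, the sketch below uses the permutation importance of a random forest fit; the object names are hypothetical, and our supplemental code may compute importance differently.

# Permutation (out-of-bag) importance for the random forest ensemble.
library(randomForest)

rfr_fit <- randomForest(x = X_train_pca, y = y_train,
                        ntree = 100, importance = TRUE)
imp <- importance(rfr_fit, type = 1)   # type = 1: %IncMSE (permutation importance)
head(imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE], 10)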

3.5. Evaluating the generalizability of our expert system

In this subsection, we analyze 19 additional stocks to gauge the prediction performance of our expert system under a wider range of industries, volatilities, growth patterns and general conditions. Table 5 shows the test MAE, MAPE and RMSE under the same methodological conditions as the $C case study for both the PCA and no-PCA formulations. The stocks have been chosen to evaluate how the proposed methodologies would perform under different circumstances. For instance, Amazon's ($AMZN) stock was consistently increasing in price across the analysis period, while: (a) Duke Energy's stock ($DUK) had both periods of growth and decline; and (b) McDonald's ($MCD) stock price was very stable. In addition, these 19 stocks capture several industries: (a) retail (e.g., Amazon and Walmart); (b) restaurants (e.g., McDonald's); (c) medical industries (e.g., Pfizer); (d) energy and oil & gas (e.g., Chevron and Duke Energy); (e) technology stocks (e.g., Facebook, IBM and Twitter); (f) communications (e.g., Time Warner and Verizon); etc. In this analysis, we have also included a HYBRID method that consists of averaging the predictions of the proposed ensembles, integrating them in an effort to obtain a more stable predictive model (a minimal sketch is given below).
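In its simplest form, the HYBRID prediction is an unweighted average of the four ensemble outputs; the pred_* vectors below are hypothetical test-set predictions, and mape() is the helper defined earlier.

# One combined (HYBRID) prediction per test day
preds <- cbind(NNRE = pred_nnre, SVRE = pred_svre,
               BRT  = pred_brt,  RFR  = pred_rfr)
pred_hybrid <- rowMeans(preds)

mape(y_test, pred_hybrid)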


Table 5
The performance of the ensemble models for the 1-day ahead price prediction on different stocks.

                             ----------- No PCA -----------    ----------- w/PCA -----------
Stock / Model                   MAE      MAPE      RMSE           MAE      MAPE      RMSE

Amazon.com, Inc. ($AMZN)
  NNRE                        5.9176    0.8308    7.6973        6.6202    0.9296    8.6108
  SVRE                       12.2509    1.7032   15.5321       17.9721    2.5373   22.6149
  BRT                         4.2573    0.6009    5.6486        7.2265    1.0279    9.6373
  RFR                         3.2064    0.4538    4.5678        9.5250    1.3088   12.7378
  HYBRID                      5.0736    0.7115    6.3352        7.4839    1.0472    9.0729
Apple Inc. ($AAPL)
  NNRE                        0.8276    0.8104    1.0929        0.8969    0.8768    1.1781
  SVRE                        1.9908    1.9623    2.5911        2.8127    2.7419    3.7080
  BRT                         0.5194    0.5064    0.7030        0.5318    0.5165    0.6927
  RFR                         0.3905    0.3831    0.5481        0.7609    0.7507    0.9985
  HYBRID                      0.7322    0.7204    0.9918        0.9896    0.9636    1.2188
Chevron Corporation ($CVX)
  NNRE                        0.6260    0.7951    0.8192        0.6645    0.8458    0.8795
  SVRE                        0.5521    0.6952    0.6535        0.3812    0.4799    0.3821
  BRT                         0.4755    0.6037    0.6532        0.2626    0.3312    0.3449
  RFR                         0.3118    0.3962    0.4335        0.6537    0.8187    0.8888
  HYBRID                      0.4530    0.5755    0.6088        0.4451    0.5625    0.5855
The Coca-Cola Company ($KO)
  NNRE                        0.2372    0.5663    0.3277        0.2477    0.5904    0.3405
  SVRE                        0.5100    1.2150    0.6748        0.6927    1.6560    0.9037
  BRT                         0.1650    0.3941    0.2184        0.1161    0.2768    0.1488
  RFR                         0.1105    0.2643    0.1595        0.2098    0.4964    0.2921
  HYBRID                      0.1651    0.3946    0.2267        0.1653    0.3929    0.2226
The Walt Disney Company ($DIS)
  NNRE                        0.5341    0.5325    0.7111        0.6318    0.6298    0.8699
  SVRE                        0.3974    0.4064    0.3985        1.1131    1.1384    1.1161
  BRT                         0.2845    0.2866    0.3661        0.4624    0.4611    0.6346
  RFR                         0.3410    0.3366    0.4664        0.8705    0.8694    1.1494
  HYBRID                      0.3408    0.3398    0.4689        0.5379    0.5384    0.7096
Duke Energy ($DUK)
  NNRE                        0.5242    0.7041    0.6843        0.5735    0.7717    0.7334
  SVRE                        1.4937    1.9983    1.8603        1.6727    2.2360    2.2100
  BRT                         0.4723    0.6331    0.6187        0.5222    0.6997    0.7092
  RFR                         0.2625    0.3525    0.3485        0.5978    0.7923    0.7445
  HYBRID                      0.4015    0.5387    0.5290        0.4938    0.6592    0.6208
Facebook, Inc. ($FB)
  NNRE                        0.9674    0.8149    1.2636        0.9939    0.8374    1.3558
  SVRE                        2.1867    1.8412    2.7752        2.8258    2.3646    3.4875
  BRT                         0.6541    0.5504    0.8697        0.5956    0.5047    0.7643
  RFR                         0.4401    0.3709    0.6570        1.1996    0.9883    1.5523
  HYBRID                      0.6497    0.5476    0.8877        0.7859    0.6566    1.0118
IBM ($IBM)
  NNRE                        0.9646    0.6590    1.2282        0.9740    0.6637    1.2621
  SVRE                        2.4874    1.6923    3.1237        4.2481    2.8866    5.6314
  BRT                         0.8689    0.5941    1.1909        0.9619    0.6543    1.2109
  RFR                         0.4842    0.3311    0.6830        1.0816    0.7280    1.4377
  HYBRID                      0.7357    0.5030    0.9808        0.8432    0.5724    1.0923
Marriott International, Inc. ($MAR)
  NNRE                        0.5972    0.8775    0.8443        0.6549    0.9603    0.9148
  SVRE                        1.3910    2.0313    1.6977        1.4811    2.1612    1.9069
  BRT                         0.5193    0.7628    0.7581        0.4342    0.6389    0.6029
  RFR                         0.2969    0.4348    0.4517        0.4891    0.6965    0.6822
  HYBRID                      0.4507    0.6621    0.6584        0.4612    0.6716    0.6376
McDonald's ($MCD)
  NNRE                        0.6565    0.5663    0.8464        0.7149    0.6171    0.9419
  SVRE                        0.6815    0.5879    0.6820        0.4747    0.4095    0.4750
  BRT                         0.4819    0.4162    0.6410        0.3914    0.3370    0.4984
  RFR                         0.3328    0.2867    0.4627        0.7935    0.6752    1.0200
  HYBRID                      0.4653    0.4014    0.6216        0.5319    0.4557    0.6697
Newmont Mining Corporation ($NEM)
  NNRE                        0.6236    1.9175    0.8588        0.6880    2.1332    0.9638
  SVRE                        0.9635    3.0024    1.2410        1.0608    3.3064    1.4018
  BRT                         0.2370    0.7175    0.3316        0.7613    2.3448    1.0122
  RFR                         0.3408    1.0342    0.4702        1.2043    3.5748    1.5026
  HYBRID                      0.4578    1.4080    0.6214        0.6787    2.0560    0.8547
Nike, Inc. ($NKE)
  NNRE                        0.4540    0.8352    0.5807        0.4887    0.8989    0.6383
  SVRE                        0.4024    0.7449    0.4032        0.2348    0.4345    0.2352
  BRT                         0.3357    0.6173    0.4438        0.3911    0.7180    0.4943
  RFR                         0.2200    0.4051    0.2975        0.3916    0.7359    0.5101
  HYBRID                      0.3174    0.5840    0.4185        0.3704    0.6847    0.4819
Pfizer, Inc. ($PFE)
  NNRE                        0.2033    0.6447    0.2791        0.2120    0.6704    0.2901
  SVRE                        0.5420    1.7113    0.5429        0.2598    0.8201    0.2602
  BRT                         0.2088    0.6616    0.2917        0.1088    0.3449    0.1469
  RFR                         0.1054    0.3349    0.1524        0.2092    0.6536    0.2708
  HYBRID                      0.1614    0.5118    0.2303        0.1372    0.4280    0.1733
Prudential Financial ($PRU)
  NNRE                        0.8024    1.0720    1.0596        0.8693    1.1631    1.1447
  SVRE                        1.3340    1.7916    1.6525        1.6573    2.2272    2.1813
  BRT                         0.6568    0.8891    0.9006        0.6677    0.8886    0.8571
  RFR                         0.4090    0.5432    0.5942        0.9161    1.1956    1.1942
  HYBRID                      0.6335    0.8552    0.8145        0.7036    0.9334    0.9082
Southwest Airlines Co. ($LUV)
  NNRE                        0.4383    1.0822    0.6212        0.4717    1.1578    0.6608
  SVRE                        0.9012    2.1956    1.1364        0.8364    2.0293    1.0690
  BRT                         0.3839    0.9493    0.5803        0.1907    0.4672    0.2590
  RFR                         0.2196    0.5402    0.3363        0.3961    0.9549    0.5414
  HYBRID                      0.3645    0.8986    0.5076        0.3473    0.8466    0.4546
Time Warner, Inc. ($TWX)
  NNRE                        0.4767    0.7035    0.7142        0.5091    0.7518    0.7671
  SVRE                        1.0651    1.5650    1.3819        1.3136    1.9340    1.7119
  BRT                         0.3221    0.4739    0.4669        0.2887    0.4253    0.3952
  RFR                         0.2140    0.3159    0.3286        0.5184    0.7564    0.7210
  HYBRID                      0.4179    0.6153    0.5503        0.4966    0.7305    0.6500
Twitter Inc. ($TWTR)
  NNRE                        0.5100    2.8188    0.8240        0.4730    2.6803    0.6979
  SVRE                        0.6580    3.5255    1.1659        0.4998    2.9238    0.6554
  BRT                         0.4852    2.6492    0.8032        0.4553    2.6671    0.6045
  RFR                         0.4967    2.6253    0.8559        0.3922    2.8556    0.5181
  HYBRID                      0.4634    2.5165    0.7890        0.3895    2.2544    0.5345
Verizon Communications ($VZ)
  NNRE                        0.2775    0.5714    0.3540        0.3219    0.6642    0.4092
  SVRE                        1.0309    2.1205    1.0319        0.8128    1.6718    0.8135
  BRT                         0.2080    0.4290    0.2669        0.1621    0.3343    0.2123
  RFR                         0.1515    0.3124    0.1982        0.3246    0.6576    0.4442
  HYBRID                      0.2033    0.4194    0.2600        0.2307    0.4730    0.2986
Wal-mart Stores, Inc. ($WMT)
  NNRE                        0.4487    0.6627    0.6749        0.4990    0.7373    0.7301
  SVRE                        1.0222    1.5010    1.3245        1.1679    1.7063    1.5083
  BRT                         0.3307    0.4871    0.4853        0.2381    0.3521    0.3322
  RFR                         0.2131    0.3148    0.3396        0.4643    0.6791    0.6444
  HYBRID                      0.3199    0.4721    0.4873        0.3360    0.4957    0.4916

There are five main observations from Table 5. First, the mean absolute percentage error (MAPE) was, on average, lower for the runs without PCA than for those with PCA. Second, the BRT and RFR methodologies are the most predictive for the investigated stocks, with the BRT method being, on average, the top performer. Third, the SVRE typically had the lowest performance among the different ensemble methods. Fourth, the HYBRID method does not improve on the best-performing ensembles considered; however, it does perform slightly better than the average of the four ensemble methods. The fifth, and perhaps most important, observation is that the expert system performs strongly across the different scenarios considered; for example, the BRT method without PCA has a MAPE under 0.75% for 18 of the 19 stocks considered.

4. Concluding remarks and future work

4.1. An overview of the impacts and contributions of our proposed expert system

In this paper, we propose a financial expert system that can be used to predict the 1-day ahead stock price. While there are several proposed financial expert systems in the literature, our approach has three main unique characteristics. First, our "knowledge base" combines five different data sources: (a) traditional predictors extracted from stock market data, (b) features/insights extracted from financial news, (c) features capturing public interest based on Google Trends, (d) features capturing the public's interest in a stock's related Wikipedia pages, and (e) technical indicators applied to the aforementioned four data sources. To the best of our knowledge, this is the first financial expert system for predicting stock prices that combines these data sources. Typically, the methodologies in the literature focus on either traditional or online sources, with few (if any) methods that combine stock data with both technical indicators and different sources of online data.

The underlying hypothesis behind the construction of our "knowledge base" is that the prediction performance of the AI platform will improve when disparate data sources are integrated in the knowledge base. This hypothesis is founded on evidence from both the data mining and stock movement (up/down) prediction literatures. Second, our AI platform trains ensemble models to predict stock prices over multiple time periods. As shown in Table 1, multiple time period prediction using ensembles has received limited attention in the literature. This is somewhat surprising since: (a) ensemble methods generally outperform single classifiers; and (b) predicting the price over multiple time periods provides investors with more information and thus can potentially lead to better decision-making. Third, by making our code publicly available, both investors and researchers can utilize our expert system to predict the price of any stock; the models are retrained for each stock picked by the user.

To demonstrate the utility of our system, we presented a case study based on the Citi Group stock ($C) utilizing data from 01/01/2013 - 12/31/2016. From our case study, the AI platform identified the boosted regression tree (BRT) and the random forest (RFR) as the best models for predicting the 1-day ahead stock price. Based on our analysis, the predictions from the BRT model for any of the time periods have a test mean absolute percent error (MAPE) ≤ 1.5%. In the $C case study, 6 out of the 10 ensemble models we applied using both the online and traditional data sources achieved a MAPE < 1% on the 1-day ahead test dataset. Compared to the literature that uses either single data sources or individual/ensemble learning models, as shown in Table 1, our results compare favorably with the reported MAPEs of above 1%. In addition, using the BRT ensemble to predict the price over a window of 10 days, the mean absolute percent error for all periods was ≤ 1.89%. A closer examination of the features extracted for these time periods indicates that online data contribute significantly to the prediction accuracy; however, the importance of those online features diminishes or varies significantly over time. While this makes sense, since online data seem to capture the crowd's interest at a given moment, this observation has not been reported previously in the literature. We believe that this


is due to the limited research performed on combining different data sources for multiple time period prediction.

To assess the generalizability of our expert system, we investigated its 1-day ahead prediction performance for 19 additional stocks. The stocks were chosen to capture different industries, volatilities and growth patterns. The results summarized in Table 5 indicate that the observations obtained from the main case study (involving the $C stock) extend to a wide variety of stocks. In addition, our results indicate that the BRT approach typically outperforms the four other predictive models investigated.

4.2. Using our expert system in practice: some advice to investors

Accurately predicting stock prices and estimated returns is the "dream" of every investor. In this paper, we present an ensemble-based approach to predict the 1-day ahead stock price using various data sources. Based on our results, we believe that a 1-day ahead MAPE of ≤ 0.75% has the potential to be informative for investors. We make our code publicly available for further evaluation and application to different stocks. Since we do not require a potential investor to have a detailed (or any) knowledge of R, we provide a tutorial for how they can modify/tweak our code to predict the price of any U.S. stock in the future. The tutorial is hosted at https://github.com/martinwg/stockprediction and covers all the details, from setting up and installing R to running our code. Note that our code shows the fundamental steps the investor should take to scrape the online data, provided that he/she presents R with keywords for the financial news query and the titles of the pertinent Wikipedia pages for the stock (a minimal sketch of how these online signals can be pulled is given at the end of this subsection).

Over the past couple of years, there has been an increasing number of articles on the use of artificial intelligence for automating trading decisions (see, e.g., the investigation in Wired by Metz (2016)). Thus, it is important to highlight two major differences in motivation and scope between our endeavor and the efforts highlighted in Metz (2016). First, we have released all the details behind our approach. While transparency is important in the context of academic research, it does not carry the same connotation in the context of arbitraging, as any competitive advantage is lost once methodologies are publicly available. However, we believe the insights from our research can be generalized. Specifically, it is important: (a) to consider how the stock will perform over multiple time horizons; and (b) to incorporate non-traditional data sources, which can improve prediction performance. Second, our expert system does not include an optimization or a decision-making engine. This is primarily because the overarching (practical) goal of this research is to provide an investor with: (a) a novel data-driven forecast that has predictive potential; or (b) insights into some predictors that should be considered prior to making an investment decision. These forecasts or insights can also be incorporated as part of a larger model.
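As one possible starting point, the two public-interest signals can be pulled from R as sketched below; the gtrendsR and pageviews packages are illustrative choices and may differ from the wrappers used in our tutorial, and the keyword and article title are user-supplied.

library(gtrendsR)
library(pageviews)

# Google search interest for a user-supplied keyword
trends   <- gtrends(keyword = "citigroup", time = "2013-01-01 2016-12-31")
interest <- trends$interest_over_time

# Daily visits to a pertinent Wikipedia page (timestamps are YYYYMMDDHH);
# note that the Wikimedia pageviews API only covers traffic from mid-2015 onward.
views <- article_pageviews(project = "en.wikipedia", article = "Citigroup",
                           start = "2015070100", end = "2016123100")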
4.3. Limitations and future research

Despite the predictive performance of our method, there are some limitations in our study that need to be highlighted. In the previous paragraphs we noted some limitations from a practical perspective; here, we highlight some limitations from a research viewpoint. First, we have only examined the utility of our model for predicting the price from 1 up to 10 days ahead. While these horizons represent up to two trading weeks, there is no standard definition of what constitutes short-term stock prediction. The range can be in minutes/hours, as in Geva and Zahavi (2014) and Schumaker and Chen (2009), and can go up to a month (see Wang et al. (2012, 2011) and Khansa and Liginlal (2011)). We have not investigated how our models would work at these extreme ends of the short-term prediction time frame, especially since some of our predictors cannot be obtained at a finer granularity (e.g., Wikipedia releases its traffic information per hour and Google releases its trends by day). Second, our analysis was limited to 20 U.S.-based stocks (the Citi stock and the 19 additional stocks presented in Table 5) during the time period from 2013 to 2016. We did not attempt to monitor any indices or stocks from non-U.S. markets. It is not clear whether the performance in our case study would translate to future time frames and/or other stocks; the reader should note that this is a limitation of any machine learning model. We attempted to mitigate the effect of this limitation by making our code freely available to encourage other researchers to apply our method to future time periods and/or other datasets. Third, our financial expert system currently has no mechanism for detecting its obsolescence, i.e., when it needs to be retrained. While this is a common limitation in the stock market prediction literature, there exist some statistical surveillance tools that can be used to detect a change in the model's performance. The reader is referred to Megahed and Jones-Farmer (2015) for an introductory discussion.

In our estimation, there are three major opportunities for future research. First, with the exception of using technical indicators to generate features, we did not capitalize on the time-series nature of the stock market. Other researchers can investigate whether (a) additional features that capture the time-series nature of the price, or (b) ensemble approaches that capitalize on this inherent property of the data (e.g., a recurrent neural network, which considers the time effect while connecting neuron layers), can improve the prediction performance. Second, researchers can examine how a firm's location affects the importance of the different predictors. For example, Alibaba ($BABA) and Amazon ($AMZN) are direct competitors in the global market. $BABA trades on the NYSE and $AMZN trades on NASDAQ, yet their operational footholds differ significantly, with Alibaba predominantly in China and Amazon in the U.S. Thus, it would be interesting to see how these differences affect the predictors' importance and the AI's accuracy. Third, it is logical to extend our system into a trading engine, which uses our predictions to maximize returns while minimizing investment risk.

In summary, this paper proposed a novel financial expert system for predicting short-term stock prices. Our expert system is comprised of: (a) a detailed knowledge base that captures data from both traditional and online sources; and (b) an AI platform that utilizes ensembles and a hybrid model to predict the price over multiple time periods. We have shown that our expert system tackles a gap in the literature, and we hypothesized that it would perform better than its predecessors since it captures more information and utilizes superior artificial intelligence methodologies. From our analysis, we have shown that our system has excellent predictive performance; to the best of our knowledge, the error rates achieved by our proposed method are lower than those reported in the literature. We have also presented some advice to investors and outlined three major future research streams that can build on the limitations of our work. Our code and data are made available at https://github.com/martinwg/stockprediction to encourage researchers to reproduce and/or extend our work.
References

Abdullah, M., & Ganapathy, V. (2000). Neural network ensemble for financial trend prediction. In TENCON 2000 proceedings: 3 (pp. 157–161). IEEE.
Alkhatib, K., Najadat, H., Hmeidi, I., & Shatnawi, M. K. A. (2013). Stock price prediction using k-nearest neighbor (knn) algorithm. International Journal of Business, Humanities and Technology, 3(3), 32–44.
Araújo, R. d. A., Oliveira, A. L., & Meira, S. (2015). A hybrid model for high-frequency stock market forecasting. Expert Systems with Applications, 42(8), 4081–4096.
Arlot, S., & Celisse, A. (2010). A survey of cross-validation procedures for model selection. Statistics Surveys, 4, 40–79.


Ballings, M., Van den Poel, D., Hespeels, N., & Gryp, R. (2015). Evaluating multiple classifiers for stock price direction prediction. Expert Systems with Applications, 42(20), 7046–7056.
Barak, S., & Modarres, M. (2015). Developing an approach to evaluate stocks by forecasting effective features with data mining methods. Expert Systems with Applications, 42(3), 1325–1339.
Bollen, J., Mao, H., & Zeng, X. (2011). Twitter mood predicts the stock market. Journal of Computational Science, 2(1), 1–8.
Booth, A., Gerding, E., & Mcgroarty, F. (2014). Automated trading with performance weighted random forests and seasonality. Expert Systems with Applications, 41(8), 3651–3661.
Boser, B. E., Guyon, I. M., & Vapnik, V. N. (1992). A training algorithm for optimal margin classifiers. In Proceedings of the fifth annual workshop on computational learning theory (pp. 144–152). ACM.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
Chen, Y., & Hao, Y. (2017). A feature weighted support vector machine and k-nearest neighbor algorithm for stock market indices prediction. Expert Systems with Applications, 80, 340–355.
Chen, Y., Yang, B., & Abraham, A. (2007). Flexible neural trees ensemble for stock index modeling. Neurocomputing, 70(4), 697–703.
Chong, E., Han, C., & Park, F. C. (2017). Deep learning networks for stock market analysis and prediction: Methodology, data representations, and case studies. Expert Systems with Applications, 83, 187–205.
Cootner, P. (1964). The random character of stock market prices. M.I.T. Press.
Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273–297.
Dietterich, T. G. (2000a). Ensemble methods in machine learning. In International workshop on multiple classifier systems (pp. 1–15). Springer.
Dietterich, T. G. (2000b). An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40(2), 139–157.
Drucker, H., Burges, C. J., Kaufman, L., Smola, A., & Vapnik, V. (1997). Support vector regression machines. In Advances in Neural Information Processing Systems (pp. 155–161).
Fama, E. F. (1965). The behavior of stock-market prices. The Journal of Business, 38(1), 34–105.
Fama, E. F., Fisher, L., Jensen, M. C., & Roll, R. (1969). The adjustment of stock prices to new information. International Economic Review, 10(1), 1–21.
Fodor, I. K. (2002). A survey of dimension reduction techniques. Center for Applied Scientific Computing, Lawrence Livermore National Laboratory, 9, 1–18.
Foresee, F. D., & Hagan, M. T. (1997). Gauss-newton approximation to bayesian learning. In Neural networks, 1997, international conference on: 3 (pp. 1930–1935). IEEE.
Freund, Y. (1990). Boosting a weak learning algorithm by majority. In COLT: 90 (pp. 202–216).
Freund, Y., & Schapire, R. E. (1995). A decision-theoretic generalization of on-line learning and an application to boosting. In European conference on computational learning theory (pp. 23–37). Springer.
Friedman, J., Hastie, T., & Tibshirani, R. (2001). The elements of statistical learning: 1. Springer series in statistics. Springer, Berlin.
Geva, T., & Zahavi, J. (2014). Empirical evaluation of an automated intraday stock recommendation system incorporating both market data and textual news. Decision Support Systems, 57, 212–223.
Gidofalvi, G. (2001). Using news articles to predict stock price movements. Technical Report. Department of Computer Science and Engineering, University of California, San Diego.
Göçken, M., Özçalıcı, M., Boru, A., & Dosdoğru, A. T. (2016). Integrating metaheuristics and artificial neural networks for improved stock price prediction. Expert Systems with Applications, 44, 320–331.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press. http://www.deeplearningbook.org.
Guresen, E., Kayakutlu, G., & Daim, T. U. (2011). Using artificial neural network models in stock market index prediction. Expert Systems with Applications, 38(8), 10389–10397.
Hassan, M. R., Nath, B., & Kirley, M. (2007). A fusion model of hmm, ann and ga for stock market forecasting. Expert Systems with Applications, 33(1), 171–180.
Hotelling, H. (1933). Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology, 24(6), 417.
Hsu, M.-W., Lessmann, S., Sung, M.-C., Ma, T., & Johnson, J. E. (2016). Bridging the divide in financial market forecasting: Machine learners vs. financial economists. Expert Systems with Applications, 61, 215–234.
Kao, L.-J., Chiu, C.-C., Lu, C.-J., & Yang, J.-L. (2013). Integration of nonlinear independent component analysis and support vector regression for stock price forecasting. Neurocomputing, 99, 534–542.
Kara, Y., Boyacioglu, M. A., & Baykan, Ö. K. (2011). Predicting direction of stock price index movement using artificial neural networks and support vector machines: The sample of the Istanbul stock exchange. Expert Systems with Applications, 38(5), 5311–5319.
Kearns, M. J., & Valiant, L. G. (1988). Learning boolean formulae or finite automata is as hard as factoring. Harvard University, Center for Research in Computing Technology, Aiken Computation Laboratory.
Khansa, L., & Liginlal, D. (2011). Predicting stock market returns from malicious attacks: A comparative analysis of vector autoregression and time-delayed neural networks. Decision Support Systems, 51(4), 745–759.
Kim, K.-j., & Han, I. (2000). Genetic algorithms approach to feature discretization in artificial neural networks for the prediction of stock price index. Expert Systems with Applications, 19(2), 125–132.
Kristjanpoller, W., Fadic, A., & Minutolo, M. C. (2014). Volatility forecast using hybrid neural network models. Expert Systems with Applications, 41(5), 2437–2442.
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.
Lin, X., Yang, Z., & Song, Y. (2009). Short-term stock price prediction based on echo state networks. Expert Systems with Applications, 36(3), 7313–7317.
MacKay, D. J. (1992). Bayesian interpolation. Neural Computation, 4(3), 415–447.
Maclin, R., & Opitz, D. (1997). An empirical evaluation of bagging and boosting. AAAI/IAAI, 1997, 546–551.
Maclin, R., & Opitz, D. (1999). Popular ensemble methods: An empirical study. Journal of Artificial Intelligence Research, 11, 169–198.
Malkiel, B. G. (2003). The efficient market hypothesis and its critics. The Journal of Economic Perspectives, 17(1), 59–82.
Martinez, W., & Gray, J. B. (2016). Noise peeling methods to improve boosting algorithms. Computational Statistics & Data Analysis, 93, 483–497.
McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. The Bulletin of Mathematical Biophysics, 5(4), 115–133.
Meesad, P., & Rasel, R. I. (2013). Predicting stock market price using support vector regression. In Informatics, electronics & vision (ICIEV), 2013 international conference on (pp. 1–6). IEEE.
Megahed, F. M., & Jones-Farmer, L. A. (2015). Statistical perspectives on big data. In Frontiers in statistical quality control 11 (pp. 29–47). Springer.
Metz, C. (2016). The rise of the artificially intelligent hedge fund. Wired Inc., http://fortune.com/2012/02/25/buffett-beats-the-sp-for-the-39th-year/. [Online, last accessed 08/08/2017].
Moat, H. S., Curme, C., Avakian, A., Kenett, D. Y., Stanley, H. E., & Preis, T. (2013). Quantifying wikipedia usage patterns before stock market moves. Scientific Reports, 3.
Mok, P., Lam, K., & Ng, H. (2004). An ICA design of intraday stock prediction models with automatic variable selection. In Neural networks, 2004, proceedings, 2004 IEEE international joint conference on: 3 (pp. 2135–2140). IEEE.
Murphy, J. J. (1999). Technical analysis of the financial markets: A comprehensive guide to trading methods and applications. Penguin.
Nassirtoussi, A. K., Aghabozorgi, S., Wah, T. Y., & Ngo, D. C. L. (2015). Text mining of news-headlines for forex market prediction: A multi-layer dimension reduction algorithm with semantics and sentiment. Expert Systems with Applications, 42(1), 306–324.
Nguyen, T. H., Shirai, K., & Velcin, J. (2015). Sentiment analysis on social media for stock movement prediction. Expert Systems with Applications, 42(24), 9603–9611.
Nofsinger, J. R. (2005). Social mood and financial economics. The Journal of Behavioral Finance, 6(3), 144–160.
Oliveira, N., Cortez, P., & Areal, N. (2017). The impact of microblogging data for stock market prediction: Using twitter to predict returns, volatility, trading volume and survey sentiment indices. Expert Systems with Applications, 73, 125–144.
Patel, J., Shah, S., Thakkar, P., & Kotecha, K. (2015a). Predicting stock and stock price index movement using trend deterministic data preparation and machine learning techniques. Expert Systems with Applications, 42(1), 259–268.
Patel, J., Shah, S., Thakkar, P., & Kotecha, K. (2015b). Predicting stock market index using fusion of machine learning techniques. Expert Systems with Applications, 42(4), 2162–2172.
Pearson, K. (1901). On lines and planes of closest fit to systems of points in space. Philosophical Magazine, 2(11), 559–572.
Prechter Jr, R. R., & Parker, W. D. (2007). The financial/economic dichotomy in social behavioral dynamics: The socionomic perspective. The Journal of Behavioral Finance, 8(2), 84–108.
Preis, T., Moat, H. S., & Stanley, H. E. (2013). Quantifying trading behavior in financial markets using google trends. Scientific Reports, 3, 1684.
Qian, B., & Rasheed, K. (2007). Stock market prediction with multiple classifiers. Applied Intelligence, 26(1), 25–33.
Quinlan, J. R. (1996). Bagging, boosting, and C4.5. AAAI/IAAI, 725–730.
R Core Team (2016). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing.
Rather, A. M., Agarwal, A., & Sastry, V. (2015). Recurrent neural network and a hybrid model for prediction of stock returns. Expert Systems with Applications, 42(6), 3234–3241.
McTaggart, R., Daroczi, G., & Leung, C. (2016). Quandl: API wrapper for Quandl.com. R package version 2.8.0.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1985). Learning internal representations by error propagation. Technical Report. DTIC Document.
Russell, S., & Norvig, P. (1995). Artificial intelligence: A modern approach. Prentice-Hall, Englewood Cliffs.
Schapire, R. E. (1990). The strength of weak learnability. Machine Learning, 5(2), 197–227.
Schapire, R. E. (2003). The boosting approach to machine learning: An overview. In Nonlinear estimation and classification (pp. 149–171). Springer.
Schumaker, R. P., & Chen, H. (2009). Textual analysis of stock market prediction using breaking financial news: The AZFin text system. ACM Transactions on Information Systems (TOIS), 27(2), 12.
Serneels, S., De Nolf, E., & Van Espen, P. J. (2006). Spatial sign preprocessing: A simple way to impart moderate robustness to multivariate estimators. Journal of Chemical Information and Modeling, 46(3), 1402–1409.
Smith, V. L. (2003). Constructivist and ecological rationality in economics. The American Economic Review, 93(3), 465–508.

Tetlock, P. C. (2007). Giving content to investor sentiment: The role of media in the stock market. The Journal of Finance, 62(3), 1139–1168.
Ticknor, J. L. (2013). A bayesian regularized artificial neural network for stock market forecasting. Expert Systems with Applications, 40(14), 5501–5506.
Tsai, C.-F., & Hsiao, Y.-C. (2010). Combining multiple feature selection methods for stock prediction: Union, intersection, and multi-intersection approaches. Decision Support Systems, 50(1), 258–269.
Tsai, C.-F., Lin, Y.-C., Yen, D. C., & Chen, Y.-M. (2011). Predicting stock returns by classifier ensembles. Applied Soft Computing, 11(2), 2452–2459.
Ulrich, J. (2016). TTR: Technical trading rules. R package version 0.23-1.
Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27(11), 1134–1142.
Vapnik, V. N., & Chervonenkis, A. J. (1974). Theory of pattern recognition. Nauka.
Wang, J.-J., Wang, J.-Z., Zhang, Z.-G., & Guo, S.-P. (2012). Stock index forecasting based on a hybrid model. Omega, 40(6), 758–766.
Wang, J.-Z., Wang, J.-J., Zhang, Z.-G., & Guo, S.-P. (2011). Forecasting stock indices with back propagation neural network. Expert Systems with Applications, 38(11), 14346–14355.


Wang, L., Zeng, Y., & Chen, T. (2015). Back propagation neural network with adaptive differential evolution algorithm for time series forecasting. Expert Systems with Applications, 42(2), 855–863.
Weng, B., Ahmed, M. A., & Megahed, F. M. (2017). Stock market one-day ahead movement prediction using disparate data sources. Expert Systems with Applications, 79, 153–163.
Weng, B., Tsai, Y.-T., Li, C., Barth, J. R., Martinez, W., & Megahed, F. M. (2017). An ensemble based approach for major U.S. stock and sector indices prediction. Applied Soft Computing. Under review.
Werbos, P. (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences. Harvard University Ph.D. thesis.
Woschnagg, E., & Cipan, J. (2004). Evaluating forecast accuracy. University of Vienna, Department of Economics.
Zhai, Y., Hsu, A., & Halgamuge, S. K. (2007). Combining news and technical indicators in daily stock price trends prediction. In International symposium on neural networks (pp. 1087–1096). Springer.
Zhang, Y., & Wu, L. (2009). Stock market prediction of s&p 500 via combination of improved bco approach and bp neural network. Expert Systems with Applications, 36(5), 8849–8854.
