An automated stock price prediction system with Rapid Analytics

Gábor I. Nagy*, Csaba Gáspár-Papanek†
Budapest University of Technology and Economics
Dept. of Telecommunications and Media Informatics, Hungary
* [email protected]  † [email protected]
Abstract

We show a proof of concept stock signalling service which predicts one-day-ahead closing prices of more than 1000 stocks on the NASDAQ Stock Exchange. A webserver on a LAMP architecture is used to download, update and store the daily OHLCV (Open, High, Low, Close and Volume) price datasets of the stocks from Yahoo Finance. The stored datasets can be enriched with various technical indicators and are provided to Rapid Miner and Rapid Analytics as inputs for model building. We use Rapid Miner for model building and Rapid Analytics to store and execute the models and to provide predictions. With the help of Rapid Analytics public webservices we can train and execute models remotely from the webserver using cURL and access the predictions, which are stored by the signalling webserver and forwarded to the subscribers of the signalling service. Data, models and predictions can be updated daily with very little user interaction. We provide some experimental results on the average prediction trend accuracy (PTA) of the running system; according to the Rapid Miner help, the PTA "measures the average of times a regression prediction was able to correctly predict the trend of the regression".
1 Introduction
1.1 The idea of a stock signalling webservice
There are sites on the Internet, like Yahoo Finance and Google Finance, which offer tools for technical analysis and charting of stock price data on various markets. These services provide data access through web services: CSV files can be downloaded and analysed with data mining tools such as Rapid Miner, and the resulting models can be used to predict price movements or volatility for a particular stock [11].

A stock signalling service provides subscribers with signals to buy, sell or hold stocks, or with price predictions, on a daily or weekly basis. Such a service uses predictive models to forecast price movements, and these models could harness the data mining capabilities of Rapid Miner and Rapid Analytics: the models could be built with Rapid Miner, stored on a Rapid Analytics server, and their predictions accessed through a Rapid Analytics public webservice by a webserver which stores and forwards the predicted prices to the subscribers. In this paper we discuss the steps of building such a system.
1.2 Introduction to stock price prediction
Stock price prediction is a challenging task in the financial data mining field [6]. There are several schools in the economic literature that deal with price movement and equity market efficiency. Some of them suggest that price movement is random and cannot be predicted. Others believe that prices tend to move in trends and patterns because of natural human behaviour [8], and as such they can be modelled and predicted with machine learning algorithms to some extent [3]. There have been many studies confirming the validity of stock price prediction by machine learning algorithms such as feed-forward neural networks, support vector machines and genetic algorithms [3][4]. Used in trading strategies, these predictive models provided better results than benchmark strategies like Buy-and-Hold [1][2]. Naturally, our standpoint is that machine learning algorithms are useful for price prediction. However, the main topic of this paper is not to validate this claim, but to provide a signalling service framework based on machine learning algorithms.
1.3 Main goals
1. Our goal is to build models that predict one-day-ahead closing prices fairly accurately. By fairly accurate we mean that the average prediction trend accuracy should be greater than 50% on a large set of stocks (a sketch of this metric is given below).

2. Because we provide signals for a large number of stocks, we need to train and apply the models quickly enough to present the predicted prices to subscribers in time for them to apply this information in their trading strategies.
We address these goals and provide a brief outline of the system in the next section.
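Since the prediction trend accuracy is the yardstick for the first goal, the following minimal PHP sketch shows how it can be computed from aligned series of actual and predicted closing prices. This is our own formulation of the definition quoted from the Rapid Miner help; the function and variable names are ours.

```php
<?php
// Minimal sketch of prediction trend accuracy (PTA): the fraction of
// steps on which the predicted close moves in the same direction as
// the actual close, relative to the last known actual close.
function pta(array $actual, array $predicted) {
    $hits = 0;
    $n = count($actual) - 1;                           // number of one-day-ahead steps
    for ($i = 1; $i <= $n; $i++) {
        $real_move = $actual[$i]    - $actual[$i - 1]; // actual trend
        $pred_move = $predicted[$i] - $actual[$i - 1]; // predicted trend from last known close
        if ($real_move * $pred_move > 0) {
            $hits++;                                   // same sign: trend predicted correctly
        }
    }
    return $hits / $n;                                 // fraction of correctly predicted trends
}

// e.g. pta(array(10.0, 10.5, 10.2), array(10.1, 10.4, 10.6)) == 0.5
```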
2 System components
The system consists of three main components: Rapid Miner (RM), Rapid Analytics (RA) and a Signalling Webserver (SW). We describe the main functions of these components in the following three sections.
2.1 Rapid Miner
Rapid Miner is used to create a pair of processes. The first is the modelling process, which trains models from stock price datasets and saves them into the RA repository along with performance measures. The second is a model applier process for applying the trained models to unseen data; this process provides the predictions for the SW. Before we start building the processes, we create a directory for the pair of processes on RA where they can be stored.
2.1.1 Modelling process
Figure 1: The model building process
The modelling process can be seen in Figure 1. It always starts with a Read CSV operator whose csv file parameter is set to a URL pointing to the SW's dataset provider webservice. The URL (http://elorejelzes.tozsde.hu/ds/ta/%{market}/%{ticker}/%{id}) has three parameters which are macros. The market parameter defines the market of the stock. The ticker parameter is the stock's symbol on the market. The ID parameter defines a set of predefined technical indicators; if the ID is omitted, only the mandatory attributes of the given stock are presented by the SW to the modelling process. The standard attributes are shown in Table 1.

Attribute name   Description
Date             The date of the quote for the given stock
Open             Opening price
High             Highest price
Low              Lowest price
Close            Closing price
Volume           Trading volume of the day defined by the date

Table 1: The standard attributes of the Signalling Webserver's data provider webservice

The date attribute is set as an ID. The next node in the process is the Windowing operator; here we set the windowing parameters and create the label attribute, which is the close price. The Sliding Window Validation operator is used for model building and validation. We compute the average PTA, which goes to the output of the process. The performance and the model are stored under a certain subdirectory in the repository, under a unique name derived from the input parameters. For example, the model for Apple Inc. (market=NASDAQ, symbol=AAPL) is stored under svm/models/nasdaq_AAPL_2 and the performance under svm/performance/nasdaq_AAPL_2, where 2 denotes the ID of the input dataset. The model building process is greatly influenced by Thomas Ott's tutorials [11].
2.1.2 Model applier process
The model applier process (Figure 2) also begins with a Read CSV operator just like the one shown in the previous section, but the csv file parameter is set to a slightly different URL (http://elorejelzes.tozsde.hu/ds/ap/%{market}/%{ticker}/%{id}). The three parameters identify the dataset the same way as mentioned above. The difference is that this webservice only outputs the last 20 rows of the dataset, which speeds up model application. The date is again set as an ID and the model is loaded from the repository by the given parameters. After the Windowing operator constructs the dataset with the appropriate window length, an Apply Model operator is used to predict the next-day closing prices. The output attributes are filtered so that only the date and the prediction are given as outputs.

Figure 2: The model applier process

These processes can be tested in Rapid Miner alone (by disabling the store operators to prevent overwriting of models and performance measures).
2.2 Rapid Analytics
The main role of Rapid Analytics is to provide the computational capacity required for building and executing the pair of processes, and to provide an interface accessible by the SW to remotely run the modelling process and retrieve the results after a successful model build. Rapid Analytics is also used to store the models and performances in the repository. The pair of processes described above can be exposed as webservices on Rapid Analytics, with the required parameters (market, ticker and id) mapped to the macros. The outputs are read from the processes when they complete: the model building process outputs an XML file and the model applier process outputs a JSON object. Providing a public webservice is mainly a matter of convenience; single sign-on (SSO) could be incorporated to provide security.
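From the SW's side, calling an exposed process reduces to an HTTP request. The sketch below, in which the base URL and service path are hypothetical and the parameter order follows the macros above, shows how the model applier webservice could be called with cURL and its JSON output decoded:

```php
<?php
// Minimal sketch: call a Rapid Analytics process exposed as a public
// webservice and decode its JSON output. The endpoint is a placeholder;
// market/ticker/id correspond to the process macros described above.
function call_ra_service($base, $market, $ticker, $id) {
    $url = sprintf('%s/%s/%s/%d', $base, urlencode($market), urlencode($ticker), $id);
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the body instead of printing it
    curl_setopt($ch, CURLOPT_TIMEOUT, 120);           // process execution can be slow
    $body = curl_exec($ch);
    curl_close($ch);
    return ($body === false) ? null : json_decode($body, true);
}

// e.g. retrieve the next-day prediction for Apple Inc. (hypothetical endpoint)
$prediction = call_ra_service('http://ra.example.com/process/apply_model',
                              'NASDAQ', 'AAPL', 2);
```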
2.3 Signalling Webserver
We put most of the data preprocessing and presentational functions on the SW. The SW was built using the Codeigniter 1.7.2 PHP framework [7] and resides on a LAMP architecture on a shared webhost with a MySQL database and CRON enabled. The main tasks of the SW are the following:

1. Retrieve and refresh raw OHLCV data from public data services on a daily basis and compute some descriptive statistics.
2. Select appropriate stocks for model building.
3. Compute various technical indicators from the raw OHLCV dataset inputs and provide them for the model building and model applier processes.
4. Call the modelling webservice with market, ticker and ID parameters remotely and retrieve the PTA.
5. Call the model applier webservice with market, ticker and ID parameters remotely and retrieve the predicted prices.
6. Store model performance and price predictions.
7. Present predicted prices to the subscribers.
2.3.1 Data retrieval, preprocessing and stock selection
Currently we have developed three data scrapers. The Yahoo Scraper scrapes OHLCV data from the Yahoo Finance website. The Portfolio Scraper gathers transaction lists from a well-known Hungarian financial portal called Portfolio.hu. The third, the Stook Scraper, scrapes the stook.com website for European end-of-day OHLCV data.

Some descriptive statistics, such as average price, average volume, average daily range, number of trading days and number of price movements, are computed and stored in a MySQL table for the scraped stocks. These statistics can be used as filters in an SQL query with which we select the set of stocks we want to make signals for.

The retrieval and preprocessing scripts are timed with CRON to execute at the closing of each market. The time to scrape price data depends on the size of the market: the Budapest Stock Exchange takes only 1 minute to scrape and half a minute to preprocess, while the NASDAQ Stock Exchange takes around 15 to 20 minutes. Model building for the Budapest Stock Exchange takes about 30 minutes, plus another minute for model application and prediction retrieval. This shows that in general the most time consuming job is the model building process.

We developed a helper function that can produce some technical indicators from the OHLCV CSV files. These indicators are: ATR, AROON, EMA, MACD, SMA, ROC, RSI. Descriptions of the indicators can be found in [5].
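As an illustration of these helpers, here is a minimal sketch of an n-day EMA over an array of closing prices (oldest first); the function name and the seeding choice are our own, and the production helper covers the remaining indicators in the same style.

```php
<?php
// Sketch of one indicator helper: an n-day Exponential Moving Average.
// $close holds closing prices in chronological order (oldest first).
function ema(array $close, $n) {
    $alpha = 2.0 / ($n + 1);       // standard EMA smoothing factor
    $out = array();
    $prev = $close[0];             // seed the EMA with the first close
    foreach ($close as $price) {
        $prev  = $alpha * $price + (1.0 - $alpha) * $prev;
        $out[] = $prev;
    }
    return $out;
}
```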
We constructed a data provider webservice which can be accessed by the modelling process and the model applier process. It is easy to create other instances of this webservice which compute different indicators. In the future a more flexible, parametrized approach will be developed so that a process can optimize the technical indicator parameters, perhaps with the Optimize Selection operator.
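A sketch of how such a webservice might look as a Codeigniter 1.7.2 controller follows; the class, method and model names are hypothetical. Mapped to /ds/ta/{market}/{ticker}/{id}, it prints the enriched dataset as CSV for the Read CSV operator.

```php
<?php
// Hypothetical Codeigniter 1.7 controller sketch of the data provider.
// CI maps /ds/ta/NASDAQ/AAPL/2 to Ds::ta('NASDAQ', 'AAPL', '2').
class Ds extends Controller {
    function ta($market, $ticker, $id = 0) {
        $this->load->model('quotes');                       // hypothetical data-access model
        $rows = $this->quotes->ohlcv($market, $ticker);     // Date, Open, High, Low, Close, Volume
        if ((int)$id > 0) {
            $rows = $this->quotes->add_indicators($rows, (int)$id); // predefined indicator set
        }
        header('Content-Type: text/csv');
        echo implode(',', array_keys($rows[0])) . "\n";     // attribute names as header row
        foreach ($rows as $r) {
            echo implode(',', $r) . "\n";
        }
    }
}
```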
2.3.2 Model building and price prediction retrieval
After the updating and preprocessing finishes, the modelling process webservice on RA is called with cURL in batches, scheduled by CRON every 10-20 minutes; the frequency of the CRON execution is determined by the time it takes to train a model. Each time the script starts, it selects a portion of the set of tracked stocks (usually 10-30 in one batch), iterates through them and calls the modelling process webservice with the market, ticker and ID parameters. The cURL call returns the model's average PTA, which is stored in a MySQL table. If the model did not give any output or was halted, the PTA is recorded as zero; such models can be retrained manually by the administrator after the tracked set of stocks has been processed (see the sketch below).

After the models are finished, we execute the price prediction retrieval in the same fashion: we call the model applier process webservice with cURL in batches timed with CRON, retrieve the prices, store them in a MySQL table and log the execution. If any error arises, the administrator can rerun the price retrieval script manually and refresh the prediction database. The subscribers of the service can see the stock price predictions for each tracked stock on a candlestick chart, where a blue dot denotes the next-day prediction.
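The batch trainer might look like the following sketch; the database credentials, table and column names, the RA endpoint and the XML element holding the PTA are all hypothetical.

```php
<?php
// Sketch of the CRON-driven batch trainer. It picks the next batch of
// tracked stocks, calls the modelling webservice for each one and
// stores the returned average PTA; a failed or halted run is stored as 0.
$db = new mysqli('localhost', 'user', 'pass', 'signals');          // placeholder credentials
$batch = $db->query("SELECT market, ticker, dataset_id FROM tracked_stocks
                     WHERE trained_at < CURDATE() LIMIT 20");      // 10-30 stocks per batch
while ($s = $batch->fetch_assoc()) {
    $url = sprintf('http://ra.example.com/process/build_model/%s/%s/%d', // hypothetical endpoint
                   urlencode($s['market']), urlencode($s['ticker']), $s['dataset_id']);
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 300);                        // model building is slow
    $xml = curl_exec($ch);
    curl_close($ch);
    $pta = 0.0;                                                    // default: failed or halted run
    if ($xml !== false && ($doc = @simplexml_load_string($xml)) !== false) {
        $pta = (float)$doc->averagePTA;                            // hypothetical element name
    }
    $stmt = $db->prepare("UPDATE tracked_stocks SET pta = ?, trained_at = NOW()
                          WHERE market = ? AND ticker = ?");
    $stmt->bind_param('dss', $pta, $s['market'], $s['ticker']);
    $stmt->execute();
}
```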
3 Experiment
In this section we provide an experimental setup to show that it is possible, with the framework described above, to build a prediction model for all the relevant stocks of NASDAQ and to update it on a daily basis. We show, as a proof of concept, that Rapid Analytics is capable of modelling these price movements on a 64-bit Sun X4100 Opteron server with four 2.6 GHz cores, with the Rapid Analytics Java VM given 6 GB of RAM. With the experiment we also shed light on some details that were not covered in the System components section.
3.1 Dataset
The stock price datasets for this experiment were gathered using the Yahoo Scraper. The datasets are daily historical OHLCV quotes from the NASDAQ stock exchange covering the last 5 years. We downloaded the stock symbol set from eoddata.com, found 2894 different stocks and loaded them into a MySQL table. We set up the scraper to scrape all 2894 stocks; it turned out that 2834 symbols could be scraped through Yahoo Finance. A CRON job runs the Yahoo Scraper at the end of each trading day; the update usually takes around 15 to 20 minutes. After scraping, we compute the average price, average volume, average daily range, the number of trading days and the number of days prices moved (i.e. days with different high and low prices) for the downloaded files, and perform error checking to find which files were downloaded correctly. The computation of these variables takes about 2 minutes, and the results are saved into the same MySQL table mentioned above. A CRON job runs this step 25 minutes after NASDAQ closes.
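The download step of the Yahoo Scraper can be sketched as below, assuming the historical-quotes CSV endpoint Yahoo Finance offered at the time (months in its query string are zero-based); the function name and the simplified error check are ours.

```php
<?php
// Sketch of the Yahoo Scraper's download step for one symbol: fetch the
// last $years years of daily OHLCV quotes as CSV. Returns null on failure
// so the error-checking step can flag the symbol for a retry.
function fetch_ohlcv($symbol, $years = 5) {
    $to   = time();
    $from = strtotime("-$years years", $to);
    $url  = sprintf(
        'http://ichart.finance.yahoo.com/table.csv?s=%s&a=%d&b=%d&c=%d&d=%d&e=%d&f=%d&g=d',
        urlencode($symbol),
        (int)date('n', $from) - 1, (int)date('j', $from), (int)date('Y', $from),
        (int)date('n', $to) - 1,   (int)date('j', $to),   (int)date('Y', $to));
    $csv = @file_get_contents($url);
    return ($csv !== false && strpos($csv, 'Date,Open,High,Low,Close') === 0) ? $csv : null;
}
```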
3.2 Stock selection
For the sake of comparable results we need to eliminate those stocks that do not have enough history, which means insufficient training data for model building: we filter out stocks that did not trade on all days in the past 5 years. We also need to eliminate stocks with a low level of price action. This is essential because an automated trading system needs liquidity to perform trades whenever a signal occurs, and a low level of liquidity can be detected from the number of times the price moved. For the experiment we used stocks that moved on almost every day in the past 5 years (10 days without movement are permitted). After filtering the market we are left with 1077 stocks for the modelling phase. A query along the lines of the sketch below performs this selection.
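A sketch of the selection query; the table and column names are hypothetical, while the statistics themselves are those computed in Section 3.1.

```php
<?php
// Sketch of the stock selection filter. A year has roughly 250 trading
// days, so 5 full years correspond to about 1250 rows per stock; stocks
// with more than 10 days without price movement are discarded.
$sql = "SELECT market, ticker
        FROM stock_stats
        WHERE trading_days >= 1250                -- traded on all days of the past 5 years
          AND trading_days - moving_days <= 10    -- at most 10 days without price movement
        ORDER BY ticker";
$db = new mysqli('localhost', 'user', 'pass', 'signals');  // placeholder credentials
$result = $db->query($sql);                                 // 1077 rows in our experiment
```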
3.3 Stock price modelling and price prediction
Our goal is to provide a single base model for every stock that predicts the next-day closing price, so we need a model type that can be trained for the entire set of stocks in a reasonable time. The NASDAQ stock market is open from 9:30 AM to 4:00 PM, i.e. for 6.5 hours, which leaves us 17.5 hours from 4:00 PM. Subtracting half an hour for dataset preprocessing, we have 17 hours for model building and price prediction. From previous process executions we found that price prediction for one stock takes roughly 5 seconds; 1077 stocks therefore take approximately 1.5 hours, leaving us 15.5 hours to build the models. This means we have approximately 50 seconds to perform one model building step.
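As a back-of-the-envelope check of this budget, using the figures quoted above:

```latex
t_{\mathrm{avail}}   = 24 - 6.5 - 0.5 = 17\ \mathrm{h}, \qquad
t_{\mathrm{predict}} = 1077 \times 5\ \mathrm{s} \approx 1.5\ \mathrm{h}, \qquad
t_{\mathrm{model}}   = \frac{(17 - 1.5) \times 3600\ \mathrm{s}}{1077} \approx 52\ \mathrm{s}.
```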
For input data, besides the mandatory columns shown in Table 1, some technical indicators are used, which are shown in Table 2.

Attribute name   Description
EMA7             7 day Exponential Moving Average
EMA20            20 day Exponential Moving Average
EMA200           200 day Exponential Moving Average
MACD             Moving Average Convergence Divergence, with the fast moving average set to 9 days, the slow moving average set to 7 days and the signal set to 9 days
ATR              Average True Range indicator, 14 day exponential moving average
RSI              Relative Strength Index, 9 day look-back period
ROC              Rate of Change indicator, 10 day look-back period

Table 2: Technical indicators used in the input datasets

For fast model building we chose an SVM model with a dot kernel, C=10.0 and convergence epsilon set to 1.0; all other parameters remain at their defaults. A static window size of 2 is used in each model, with the close price as the label attribute. For validation the Sliding Window Validation operator is used, with a training window size of 200, a training step size of 20, a test window size of 200 and a horizon of 20. We compute the average prediction trend accuracy (PTA) for the models. We found that the Rapid Analytics server was capable of building 20 models of the above sort in 10 minutes, i.e. about 30 seconds per model, which is within the approximately 50 seconds available per model. A CRON job executing every 10 minutes calls the RA server to build models from 5:00 PM until about 2:00 AM the next day. After all executions are done, an error checking CRON job retrains the models that did not produce any output, from 3:00 AM until 5:00 AM. From 5:00 AM a CRON job calls the model applier webservice, and at 6:30 AM the whole process finishes. At 7:00 AM the administrator of the service reads the log file and runs SQL queries to check that everything is in order; if errors are found, the modelling and model applying processes can be started manually and remotely from the SW.
4 Results
We provide brief results, which are far from conclusive, of the models generated by this automated signalling system. The snapshot was taken from the system on 31/03/2011.
Figure 3: Distributions of Prediction Trend Accuracy of the greater than average (blue) and smaller than average (red) price stocks
The average PTA for the 1077 models was 51.2%, which means we achieved the goal of exceeding 50% average PTA. The maximum PTA was 57.8%, on GSI Group Inc. (GSIG); the worst was Woodward Inc. (WWD) with 47.2%. We ran a linear regression (R=0.456, shown in (1)) on the resulting data, with 70% of the data as the modelling set and the remaining 30%, sampled randomly, for validation. We found that the logarithm of the average price, the logarithm of the average daily range (high minus low) and the logarithm of the average trade volume all have a negative impact on the PTA. In other words, we obtained larger prediction trend accuracy on stocks that have small average prices, trade in a small average price range and have small average trading volume.

PTA = -0.001 log(vol) - 0.003 log(range) - 0.009 log(price) + 0.548    (1)

This finding can be incorporated in the stock selection algorithm: we could presumably train better models on a smaller set of stocks, focusing mainly on stocks with low average price, low average range and low average trading volume.
Figure 3 shows the PTA distribution of the smaller than average price stocks versus the larger than average price stocks. The distribution of the smaller price stocks is shifted slightly to the right relative to that of the larger price stocks, i.e. the smaller than average price stocks have a slightly higher overall average PTA.
5 Further work
We are currently trying to deploy Rapid Analytics to the Amazon EC2 environment and to extend the job scheduler algorithm to multiple instances of Rapid Analytics so that other markets can be tracked. On Amazon EC2 instances the models could presumably run one order of magnitude faster. The other development path is the integration of R scripts into the process: using the quantmod library [9] we could eliminate the scraping of the data, because it could be downloaded directly by the R script. Portfolio management can be incorporated into the processes using the blotter R library [10]. We have had good results in Rapid Miner and have integrated R scripts and blotter functionality into the model builder process. The blotter library produces images of the portfolio value and some useful statistics that can be used by portfolio managers who want to use the signals produced by this black box system.
6 Conclusion
We showed a proof of concept webservice-based framework which can be used to track a large set of stocks (more than 1000) efficiently with only one instance of Rapid Analytics. We could update the price predictions and the models on a daily basis with little user interaction. We found that SVM models make better predictions for stocks with smaller than average prices and smaller than average trading volumes.
References

[1] O'Connor, N., Madden, M. G., A neural network approach to predicting stock exchange movements using external factors. Knowledge-Based Systems 19, 2006.
[2] Leigh, W., Purvis, R., Ragusa, J. M., Forecasting the NYSE composite index with technical analysis, pattern recognizer, neural network, and genetic algorithm: a case study in romantic decision support. Decision Support Systems 32, 361-377, 2002.
[3] LeBaron, B., Agent-based computational finance: Suggested readings and early research. Journal of Economic Dynamics & Control, 679-702, 2000.
[4] Kim, K.-j., Financial time series forecasting using support vector machines. Neurocomputing, 307-319, 2003.
[5] Achelis, S. B., Technical Analysis from A to Z. Chicago: Probus, 1995.
[6] Hand, D., Mannila, H., Smyth, P., Principles of Data Mining. Cambridge, Massachusetts: The MIT Press, 2001.
[7] Codeigniter PHP framework, http://codeigniter.com
[8] Sewell, M., Behavioural Finance. University of Cambridge, 2007. http://www.behaviouralfinance.net/behaviouralfinance.pdf
[9] quantmod: Quantitative Financial Modelling & Trading Framework for R, http://www.quantmod.com/
[10] blotter R package: Transaction-oriented infrastructure for defining instruments, transactions, portfolios and accounts for trading systems and simulation, https://r-forge.r-project.org/projects/blotter/
[11] Ott, T., Neural Market Trends: Rapidminer Evangelism & Consulting, http://www.neuralmarkettrends.com