25/04/2017
Stock Price Change Prediction Using News Text Mining
Marcelo Beckmann
Advisors: Nelson F. F. Ebecken, Beatriz S. L. Pires de Lima
DSc. Thesis
Agenda
Introduction
Related Works
Methodology
Experiments
Conclusions
Future Work
Introduction
The ability to predict changes in a market economy is powerful: it can create wealth and avoid losses
Investors read news to guide their investment decisions, in the short and long term
Buy low, sell at a higher price = profit
Earn dividends
Introduction
The advent of the Internet allowed news to be published in real time
Data Mining algorithms, together with Text Mining techniques and Linguistics, started to be applied to the financial market
Supported by Behavioral Economics (BE)
BE analyses the psychological, social, cognitive, and emotional aspects of human behavior when taking investment decisions
An interdisciplinary field of research has been created for Text Mining Applied to Financial Market Prediction (TMFP)
Introduction This work
Aims to prove that data mining and text mining can be used effectively to interpret news articles automatically and learn patterns that predict market movements
For this purpose, a long, automated process was developed to identify surges in stock prices
Main problems faced: a long mining process requiring adjustment of algorithms and user parameters, class imbalance, and noise
Classification and simulation results outperformed the other results found in an extensive review of the related literature
Incorrect use of classification measurements and model evaluation was identified in the related literature
Introduction Financial Economics Background
Introduction Data Mining
Introduction CRISP-DM
Introduction Text Mining
Related Works Number of publications in TMFP by year
Related Works Number of publications by number of items (news articles)
Related Works Number of publications by time frame
Related Works Number of publications grouped by feature selection and feature representation
Related Works Number of publications grouped by the dimensionality reduction method
Related Works Number of publications by the type of machine learning algorithm
Related Works Number of publications grouped by training vs. testing method
Related Works Number of publications by application of sentiment analysis
Related Works Other aspects
Source of news: Traditional financial market communication vehicles, Google, Yahoo, local sources
TMFP is predominantly applied to Stocks and ForEx
Only 22% applied sliding window for training/testing
About 50% applied some semantic technique (word meaning)
Less than 33% applied some syntax technique, mostly n-grams (word sequence)
Only 6 studies applied some data balancing technique
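The sliding-window training/testing scheme mentioned above can be illustrated with a minimal sketch. It is not taken from any of the reviewed studies; the function name `sliding_window_splits` and the window sizes are assumptions made only for this example.

```python
from typing import Iterator, List, Tuple


def sliding_window_splits(
    n_samples: int, train_size: int, test_size: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) pairs over time-ordered samples.

    Each test window starts right after its training window, so a model is
    never tested on data older than what it was trained on.
    """
    start = 0
    while start + train_size + test_size <= n_samples:
        train = list(range(start, start + train_size))
        test = list(range(start + train_size, start + train_size + test_size))
        yield train, test
        start += test_size  # slide forward by one test window


# Hypothetical sizes: 10 time-ordered samples, 4 for training, 2 for testing.
splits = list(sliding_window_splits(n_samples=10, train_size=4, test_size=2))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```

Unlike k-fold cross validation, every training index here precedes every test index of the same split.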
Related Works Problems found
A low number of scientific publications was found in this area compared with other branches of research, and the number has been decreasing in recent years
Lack of information about how the model was evaluated (training vs. testing), and use of cross validation in time series
Lack of treatment of the class imbalance problem and noise
Around 50% of the studies reported their results only as Accuracy
These problems diminish investors' confidence in TMFP
To be discussed ahead
Methodology The process developed for TMFP
Methodology The process developed for TMFP
17 automated sub processes
RapidMiner, Text Mining and Web Mining extensions
Development of a new extension called TradeMiner
Java, Amazon EC2
Predict the price movements of the 30 companies listed in the Dow Jones Industrial Average (DJIA)
One predictive model per stock symbol
Methodology Obtain news articles
RSS web crawler reading news associated with a stock symbol
http://finance.yahoo.com
http://finance.google.com
24 hours x 7 days, updated every 5 minutes
480k records collected
Obtain market data
Stock prices
Web service client
http://www.restfulwebservices.net
6 hours x 5 days, updated every 1 minute
2.9M records collected
Operation
Started Apr/2012. Full operation from Jan/2013 to Sep/2013
Methodology Text Cleaning
Removal of HTML tags
Removal of news article records with decommissioned content (page not found messages)
Market Data Labeling
Price segments with slopes >= 75% → SURGE (Buy)
Price segments with negative slopes [...] = threshold, then the majority instance will be removed
3- Repeat the process for all majority instances
Only applicable to the training set
Site blacklist rule applied to the test set
If all news from a website were removed during training, they must also be removed in the test
Methodology Training
Support Vector Machine (SVM)
LIBSVM implementation
RBF Kernel
C and Gamma hyperparameters were adjusted through grid search
Search for the best F-Measure after cross validation in the training set
Methodology Feature selection (Test)
Same process as in the training phase (BOW, n-grams, TF-IDF)
The resulting matrix must have the same variables
The word list and counts from the training phase are used to select the features and calculate the TF-IDF again in the test set
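Reusing the training word list and counts on the test set can be sketched as follows. This is a minimal illustration, not the RapidMiner workflow used in the thesis: the function names `fit_idf` and `transform`, the IDF variant `log(N / df)`, and the toy documents are all assumptions.

```python
import math
from collections import Counter
from typing import Dict, List


def fit_idf(train_docs: List[List[str]]) -> Dict[str, float]:
    """Learn the vocabulary and IDF weights from the training documents only."""
    n_docs = len(train_docs)
    doc_freq = Counter(term for doc in train_docs for term in set(doc))
    # Assumed IDF variant: log(N / df); other variants exist.
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def transform(docs: List[List[str]], idf: Dict[str, float]) -> List[List[float]]:
    """Build TF-IDF rows whose columns follow the training vocabulary."""
    vocabulary = sorted(idf)
    rows = []
    for doc in docs:
        counts = Counter(doc)
        # Terms unseen in training are dropped; absent terms get weight 0.
        rows.append([counts[term] * idf[term] for term in vocabulary])
    return rows


# Toy tokenized documents (hypothetical).
train_docs = [["shares", "rise"], ["shares", "fall"], ["profit", "rise"]]
test_docs = [["shares", "surge", "rise", "rise"]]

idf = fit_idf(train_docs)
train_matrix = transform(train_docs, idf)
test_matrix = transform(test_docs, idf)
print(sorted(idf))
print(test_matrix[0])
```

Because both matrices are built from the same sorted vocabulary, they share the same columns, which is what the classifier trained on `train_matrix` expects.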
Dimensionality reduction
Same process as in the training phase (stop words, Chi-Square, min/max occurrence)
Chi-Square weights from the training phase are used to filter variables again
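One way to picture Chi-Square filtering is the sketch below, which scores each term against the class label on the training set and then applies the same selected term list to a test document. The function name `chi_square_weights`, the cut-off of 2.0, and the toy data are assumptions; the thesis does not state its threshold here.

```python
from typing import Dict, List, Sequence


def chi_square_weights(
    docs: Sequence[Sequence[str]], labels: Sequence[str], positive: str
) -> Dict[str, float]:
    """Chi-Square statistic of each term versus the class, from a 2x2 table."""
    n = len(docs)
    n_pos = sum(1 for label in labels if label == positive)
    vocabulary = {term for doc in docs for term in doc}
    weights = {}
    for term in vocabulary:
        # a: positive docs with the term, b: negative docs with the term
        a = sum(1 for d, y in zip(docs, labels) if term in d and y == positive)
        b = sum(1 for d, y in zip(docs, labels) if term in d and y != positive)
        c = n_pos - a          # positive docs without the term
        d_ = (n - n_pos) - b   # negative docs without the term
        denom = (a + b) * (c + d_) * (a + c) * (b + d_)
        weights[term] = n * (a * d_ - b * c) ** 2 / denom if denom else 0.0
    return weights


# Toy training data (hypothetical).
train_docs = [["profit", "up"], ["profit", "record"], ["loss", "down"], ["loss", "up"]]
train_labels = ["SURGE", "SURGE", "NOT RECOMMENDED", "NOT RECOMMENDED"]

weights = chi_square_weights(train_docs, train_labels, positive="SURGE")
selected = {term for term, w in weights.items() if w >= 2.0}  # assumed cut-off

# The test set is filtered with the term list learned in training.
test_docs = [["profit", "up", "merger"]]
filtered_test = [[t for t in doc if t in selected] for doc in test_docs]
print(sorted(selected), filtered_test)
```

The weights are never recomputed on the test set, so no label information from the test period enters the feature selection.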
Test
The SVM model from the training phase is used to predict the outcome (SURGE, NOT RECOMMENDED) of news articles in the test set
Methodology News Aggregation
The time offset τ defines how long a news article takes to affect the stock prices, for τ=1, 2, 3, and 5 minutes
Along the day, each time offset can have one or more news articles to be predicted
Only one recommendation, SURGE or NOT RECOMMENDED, is needed for each period of time with duration τ
It is therefore necessary to provide a single decision given a set of documents in the same time offset
A novel ensemble approach named Cascading Aggregation for Time Series (CATS) was created
A Genetic Algorithm is used to train decision rules given the counts of SURGE and NOT RECOMMENDED outcomes from the SVM in the same time offset τ
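The counting step that feeds the aggregation can be sketched as below. This is not the CATS algorithm: the Genetic Algorithm that learns the decision rules is not reproduced, and a fixed ratio threshold stands in for it. The names `count_by_window` and `decide`, the bucketing by integer division, and the threshold are all assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

SURGE = "SURGE"
NOT_RECOMMENDED = "NOT RECOMMENDED"


def count_by_window(
    predictions: List[Tuple[int, str]], tau_minutes: int
) -> Dict[int, Counter]:
    """Count SVM outcomes per window of tau minutes.

    Each prediction is (minute_of_day, label); the window index is the
    minute divided by tau (an assumed, simple bucketing scheme).
    """
    windows: Dict[int, Counter] = defaultdict(Counter)
    for minute, label in predictions:
        windows[minute // tau_minutes][label] += 1
    return dict(windows)


def decide(counts: Counter, min_surge_ratio: float = 0.5) -> str:
    """Placeholder rule: SURGE when its share of the window exceeds a threshold."""
    total = counts[SURGE] + counts[NOT_RECOMMENDED]
    if total and counts[SURGE] / total > min_surge_ratio:
        return SURGE
    return NOT_RECOMMENDED


# Hypothetical per-article predictions, as (minute of day, SVM outcome).
predictions = [
    (570, SURGE),
    (571, SURGE),
    (571, NOT_RECOMMENDED),
    (575, NOT_RECOMMENDED),
]
windows = count_by_window(predictions, tau_minutes=3)
decisions = {window: decide(counts) for window, counts in windows.items()}
print(decisions)
```

Each window ends up with exactly one decision, regardless of how many articles fell into it.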
Methodology News Aggregation
Methodology Model Evaluation (Good Model?)
At least 10 predictive models have a minimal value of G-Mean >= 55.00
Investment simulation
Check if the predictions are profitable in an investment scenario
Strategy: if a SURGE prediction occurs, purchase $10,000 of shares of the related stock at τ-1 minutes after the news article is published
Hold the stock position for n=3 minutes; if during those n=3 minutes the stock can be sold for a profit of >=2%, sell it immediately
At the end of n minutes, the stock is sold at the current market price, taking a loss if necessary
High-frequency trading (HFT)
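The trading rule above can be sketched for a single SURGE signal as follows. This is an illustration under stated assumptions, not the simulator used in the thesis: the function name `simulate_trade` is hypothetical, prices are sampled once per minute, fractional shares are allowed, and transaction costs are ignored.

```python
from typing import Sequence


def simulate_trade(
    prices: Sequence[float],
    capital: float = 10_000.0,
    hold_minutes: int = 3,
    target_profit: float = 0.02,
) -> float:
    """Return the profit or loss in dollars for one SURGE signal.

    prices[0] is the entry price; prices[i] is the price i minutes later.
    """
    entry = prices[0]
    shares = capital / entry  # fractional shares allowed (assumption)
    last = min(hold_minutes, len(prices) - 1)
    for minute in range(1, last + 1):
        if prices[minute] >= entry * (1 + target_profit):
            return shares * (prices[minute] - entry)  # take profit early
    return shares * (prices[last] - entry)  # forced exit at market price


# Hypothetical minute-by-minute prices after the entry point.
early_exit = simulate_trade([100.0, 101.0, 102.5, 99.0])
forced_exit = simulate_trade([100.0, 99.5, 100.2, 99.0])
print(early_exit, forced_exit)
```

In the first series the 2% target is reached at minute 2, so the position is closed with a gain; in the second it is never reached, so the position is closed at minute 3 with a loss.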
Methodology Good simulation?
Cumulative return of the US T-Bond in the same period: 0.05%
Compare the results with a random trader
The random trader is the null hypothesis, while TradeMiner is the alternative hypothesis
P-value (t-value) with 99% confidence
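A one-sample t statistic for this comparison could be computed as in the sketch below. The setup is an assumption: the returns of repeated random-trader runs form the sample, and the model's cumulative return is the reference value. The function name `one_sample_t` and all numbers are hypothetical.

```python
import math
import statistics
from typing import Sequence


def one_sample_t(sample: Sequence[float], reference: float) -> float:
    """t statistic for H0: the mean of `sample` equals `reference`."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)  # sample standard deviation (n - 1)
    return (mean - reference) / (sd / math.sqrt(n))


# Hypothetical cumulative returns (%) of five random-trader runs.
random_returns = [-1.0, 0.0, 1.0, 0.5, -0.5]
model_return = 5.0  # hypothetical return of the predictive model

t_value = one_sample_t(random_returns, model_return)
print(round(t_value, 2))
```

The p-value then follows from a Student's t distribution with n - 1 degrees of freedom, which the Python standard library does not provide; a statistics package would be needed for that step.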
Real investment recommendation
A good simulation leads to the decision to apply the recommendation model in a real investment scenario
On-line test
Experiments Setup
Four experiments comparing the performance with time offset τ=1, 2, 3, and 5 minutes
From 3rd/July to 3rd/September 2013 (3 months)
Classifier results: F-Measure and G-Mean
Cumulative return of simulation with null hypothesis test
Comparison of results with the state of the art
Discussion about good practices
Experiments Classifier results
Measures: G-Mean and F-Measure for τ=3, τ=2, and τ=1
Stock symbols: AA, AXP, BA, DD, GE, MCD, MMM, MRK, PFE, TRV, VZ, XOM
Values: 69.61 (0.0), 68.09 (0.0), 62.87 (13.4), 59.29 (0.6), 54.95 (0.5), 63.03 (0.1), 60.57 (1.0), 68.92 (0.3), 57.63 (0.6), 55.34 (2.7), 59.45 (7.7), 60.90 (5.5), 87.85 (4.6), 63.79 (0.2), 59.47 (4.5), 71.25 (0.3), 60.70 (3.7), 68.98 (8.9), 67.26 (4.6), 60.50 (0.1), 63.25 (0.5), 65.86 (1.9), 77.04 (6.6), 70.27 (0.1), 61.54 (16.9), 63.04 (3.7), 75.31 (1.3), 51.95 (0.4), 72.12 (3.8), 57.94 (1.3), 92.66 (7.8), 60.93 (6.3), 85.52 (0.2), 65.65 (2.4), 59.59 (4.5), 76.00 (3.0), 69.75 (11.1), 59.80 (2.2)
Experiments Classifier results
Experiments Investment simulation
Origin of Predictions          τ=3            τ=2            τ=1
TradeMiner                     4.61 (0.38)    5.29 (1.49)    21.47 (0.13)
Random                         -0.22 (1.11)   0.86 (2.03)    0.17 (2.71)
p-value (one sample t-test)    39.46          9.34           504.93
Experiments Investment simulation
Experiments Comparison with the state of the art – Classifier results
Experiments Comparison with the state of the art – Investment simulation
Experiments Discussion
The current work outperformed the existing results in terms of classifier results and cumulative return, with the exception of one work with AUC 70.30, which used cross validation
11% of reviewed works used cross validation on time series problems
Cross validation disrupts the time series and leaks information from the future
27% of reviewed works did not provide any information about model selection
Accuracy is not a good classification measure for imbalanced class problems
In a problem with 100 examples, 98 are negative and 2 are positive. If an algorithm classifies all of them as negative, its Accuracy will be 98%
~50% of reviewed works reported results only as accuracy
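The 98/2 example can be worked through numerically. The sketch below uses the usual definitions of Accuracy, G-Mean (square root of sensitivity times specificity) and F-Measure; setting precision to 0 when no positive prediction is made is a convention assumed here, since the ratio is otherwise undefined.

```python
import math

# Confusion matrix for the example above: 98 negatives, 2 positives,
# and a classifier that labels every example as negative.
tp, fn = 0, 2
tn, fp = 98, 0

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)  # sensitivity
specificity = tn / (tn + fp)
# Precision is undefined with no positive predictions; 0.0 by convention.
precision = tp / (tp + fp) if (tp + fp) else 0.0
g_mean = math.sqrt(recall * specificity)
f_measure = (
    2 * precision * recall / (precision + recall)
    if (precision + recall)
    else 0.0
)

print(accuracy, g_mean, f_measure)  # prints: 0.98 0.0 0.0
```

Accuracy rewards the classifier for the majority class alone, while G-Mean and F-Measure both drop to zero because no positive example is ever found.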
Conclusions
This work presented a computational process that uses data and text mining to forecast the intraday price movements of 30 stocks listed on the DJIA
An extensive survey of TMFP was conducted, and problems such as incorrect use of classification measurements and invalid model evaluation were identified
Accuracy (99.77), Precision (99.88), Recall (92.74), AUC (67.87), G-mean (92.66), and F-Measure (76.00)
Investment simulation with a cumulative return of 21.47% in three months
This is credited to precise workflow development, proper use of classification measures, and the new algorithms KNN-Und and CATS proposed in this work
Conclusions
These results provide evidence that stock price movements can be effectively predicted using text mining
Stock prices start to be affected by news articles within a few minutes after they are published
A loss of signal was observed when the news articles are accumulated over a wider time offset
Future work
The raw data used in this work is available for download (https://osf.io/gc6u6/)
The CATS algorithm
Sliding window training, normalized counting values, add new measurements, attributes
Apply new ensemble strategies to CATS
Apply CATS to other time series problems
Apply the TMFP process to on-line test
Use high capacity hardware or Hadoop if necessary
Develop an appropriate news alignment algorithm for wider time offsets
Future work
Use more t-SNE and unsupervised learning to visualize and explore the data
Use deep learning algorithms
Use the improvements above to model the 20 companies with low performance in this work
The TMFP could be applied to ForEx, sentiment detection, automatic content interpretation to be used in fundamental analysis, risk, and the forecast of other economic events