Stock Price Change Prediction Using News Text

25/04/2017

Stock Price Change Prediction Using News Text Mining

Marcelo Beckmann
Advisors: Nelson F. F. Ebecken, Beatriz S. L. Pires de Lima
DSc. Thesis

Agenda
Introduction
Related Works
Methodology
Experiments
Conclusions
Future Work


Introduction

Predicting changes in a market economy is a powerful ability, capable of creating wealth and avoiding losses
Investors read news to guide their investment decisions, in the short and long term
Buy low, sell at a higher price = profit
Earn dividends

Introduction

The advent of the Internet allowed news to be published in real time
Data Mining algorithms, together with Text Mining techniques and Linguistics, started to be applied to the financial market
Supported by Behavioral Economics (BE)
    BE analyses the psychological, social, cognitive, and emotional aspects of human behavior in investment decisions
An interdisciplinary field of research has emerged: Text Mining Applied to Financial Market Prediction (TMFP)


Introduction

This work
    Aims to show that data mining and text mining can be used effectively to interpret news articles automatically and to learn patterns that predict market movements
    For this purpose, a long, automated process was developed to identify surges in stock prices
    Main problems faced: a long mining process requiring adjustment of algorithms and user parameters, class imbalance, and noise
    Classification and simulation results outperformed the other results found in an extensive review of the related literature
    Incorrect use of classification measures and model evaluation was identified in the related literature

Introduction Financial Economics Background


Introduction Data Mining

Introduction CRISP-DM


Introduction Text Mining

Related Works Number of publications in TMFP by year


Related Works Number of publications by number of items (news articles)

Related Works Number of publications by time frame


Related Works Number of publications grouped by the feature selection, and feature representation

Related Works Number of publications grouped by the dimensionality reduction method


Related Works Number of publications by the type of machine learning algorithm

Related Works Number of publications grouped by training vs. testing method


Related Works Number of publications by application of sentiment analysis

Related Works Other aspects

Source of news: traditional financial market communication vehicles, Google, Yahoo, local sources
TMFP is predominantly applied to stocks and ForEx
Only 22% applied a sliding window for training/testing
About 50% applied some semantic technique (word meaning)
Less than 33% applied some syntactic technique, mostly n-grams (word sequences)
Only 6 studies applied some data balancing technique
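The sliding window for training/testing mentioned on this slide keeps every test example later in time than the examples used for training. A minimal Python sketch, assuming time-ordered samples addressed by index; the function name and the window sizes are illustrative and are not taken from the reviewed works:

```python
from typing import Iterator, List, Tuple


def sliding_window_splits(
    n_samples: int, train_size: int, test_size: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs over time-ordered samples.

    Every test index is strictly later than every training index in the
    same split, so no information from the future reaches the model.
    """
    start = 0
    while start + train_size + test_size <= n_samples:
        train_end = start + train_size
        train_idx = list(range(start, train_end))
        test_idx = list(range(train_end, train_end + test_size))
        yield train_idx, test_idx
        start += test_size  # slide the window forward by one test block


# Hypothetical sizes, for illustration only.
splits = list(sliding_window_splits(n_samples=10, train_size=4, test_size=2))
for train_idx, test_idx in splits:
    print(train_idx, "->", test_idx)
```

Unlike shuffled k-fold cross validation, each split here trains only on the past relative to its test block.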


Related Works Problems found

Few scientific publications were found in this area compared with other branches of research, and the number has been decreasing in recent years
Lack of information about how the model was evaluated (training vs. testing), and use of cross validation on time series
Lack of treatment of the class imbalance problem and noise
Around 50% of the studies published their results only in terms of Accuracy
These problems diminish investors' confidence in TMFP
    To be discussed ahead

Methodology The process developed for TMFP


Methodology The process developed for TMFP

17 automated subprocesses
RapidMiner, with the Text Mining and Web Mining extensions
Development of a new extension called TradeMiner
Java, Amazon EC2
Predict the price movements of the 30 companies listed in the Dow Jones Industrial Average (DJIA)
One predictive model per stock symbol

Methodology

Obtain news articles
    RSS web crawler reading the news associated with a stock symbol
    http://finance.yahoo.com
    http://finance.google.com
    24 hours x 7 days, updated every 5 minutes
    480k records collected

Obtain market data
    Stock prices
    Web service client
    http://www.restfulwebservices.net
    6 hours x 5 days, updated every 1 minute
    2.9M records collected

Operation
    Started Apr/2012. Full operation from Jan/2013 to Sep/2013
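The RSS reading step can be illustrated with a short Python sketch. It is not the TradeMiner crawler: it only parses one RSS 2.0 document that has already been downloaded, and the function name, the dictionary keys, and the sample feed are all hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def parse_rss_items(rss_xml: str) -> List[Dict[str, str]]:
    """Extract title, link, and publication date from each RSS <item>."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        items.append(
            {
                "title": (item.findtext("title") or "").strip(),
                "link": (item.findtext("link") or "").strip(),
                "published": (item.findtext("pubDate") or "").strip(),
            }
        )
    return items


# Hypothetical feed content, for illustration only.
SAMPLE_FEED = """<rss version="2.0"><channel><title>Example feed</title>
<item>
  <title>Company X beats estimates</title>
  <link>http://example.com/a</link>
  <pubDate>Wed, 03 Jul 2013 14:30:00 GMT</pubDate>
</item>
</channel></rss>"""

items = parse_rss_items(SAMPLE_FEED)
print(items)
```

A crawler along these lines would fetch the feed for each stock symbol on a fixed schedule (every 5 minutes in this work) and store only the items it has not seen before.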


Methodology

Text Cleaning
    Removal of HTML tags
    Removal of news article records with decommissioned content (page not found messages)
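The two cleaning steps can be sketched with the Python standard library. This is an illustration under assumptions, not the RapidMiner process used in the work: the helper names and the "page not found" marker are hypothetical.

```python
from html.parser import HTMLParser
from typing import List, Tuple


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping the content of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._chunks: List[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        # Join the fragments and collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def strip_html(raw: str) -> str:
    """Return the text of an HTML fragment with all tags removed."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return parser.text()


def is_decommissioned(text: str, markers: Tuple[str, ...] = ("page not found",)) -> bool:
    """Flag records whose body is only an error page (assumed marker list)."""
    lowered = text.lower()
    return any(marker in lowered for marker in markers)


print(strip_html("<p>Shares of <b>ACME</b> rose 3%.</p>"))
```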

Market Data Labeling
    Price segments with slopes >= 75% → SURGE (Buy)
    Price segments with negative slopes [...]

[...] = threshold, then the majority instance will be removed
3- Repeat the process for all majority instances

Only applicable to the training set
Site blacklist rule applied to the test set
    If all news from a website was removed during training, it must also be removed in the test


Methodology Training

Support Vector Machine (SVM)
LIBSVM implementation
RBF kernel
The C and gamma hyperparameters were adjusted through grid search
Search for the best F-Measure after cross validation on the training set
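The grid search over C and gamma can be sketched generically in Python. The slide does not state which grid was searched, so the exponential grids below are an assumption (a range commonly used with LIBSVM). `toy_score` is a hypothetical stand-in: in the real process the scoring function would train an RBF SVM with the given C and gamma and return the cross-validated F-Measure on the training set.

```python
import math
from itertools import product
from typing import Callable, Iterable, Tuple


def grid_search(
    score_fn: Callable[[float, float], float],
    c_values: Iterable[float],
    gamma_values: Iterable[float],
) -> Tuple[float, float, float]:
    """Return (best_c, best_gamma, best_score) over the Cartesian grid."""
    best = (float("nan"), float("nan"), float("-inf"))
    for c, gamma in product(c_values, gamma_values):
        score = score_fn(c, gamma)
        if score > best[2]:
            best = (c, gamma, score)
    return best


# Assumed exponential grids: C in 2^-5 .. 2^15, gamma in 2^-15 .. 2^3.
C_GRID = [2.0 ** k for k in range(-5, 16, 2)]
GAMMA_GRID = [2.0 ** k for k in range(-15, 4, 2)]


def toy_score(c: float, gamma: float) -> float:
    """Hypothetical score surface that peaks at C = 2^3, gamma = 2^-5."""
    return -((math.log2(c) - 3) ** 2 + (math.log2(gamma) + 5) ** 2)


best_c, best_gamma, best_score = grid_search(toy_score, C_GRID, GAMMA_GRID)
print(best_c, best_gamma)
```

Passing the scoring function as a parameter keeps the search loop independent of the SVM library used.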

Methodology

Feature selection (Test)
    Same process as in the training phase (BOW, n-grams, TF-IDF)
    The resulting matrix must have the same variables
    The word list and counts from the training phase are used to select the features and recalculate the TF-IDF in the test set

Dimensionality reduction
    Same process as in the training phase (stop words, Chi-Square, min/max occurrence)
    Chi-Square weights from the training phase are used to filter the variables again

Test
    The SVM model from the training phase is used to predict the outcome (SURGE, NOT RECOMMENDED) of news articles in the test set
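The requirement that the test matrix has the same variables as the training matrix can be illustrated with a small Python sketch. It assumes already tokenized documents and one common TF-IDF variant (relative term frequency times log(N/df)); the exact weighting used by the RapidMiner operators may differ, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def fit_idf(train_docs: List[List[str]]) -> Dict[str, float]:
    """Learn the vocabulary and IDF weights from the training documents only."""
    n_docs = len(train_docs)
    doc_freq: Counter = Counter()
    for tokens in train_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def transform(docs: List[List[str]], idf: Dict[str, float]) -> List[List[float]]:
    """Build TF-IDF rows using the training vocabulary and training IDF.

    Words that were never seen in training are ignored, so every row has
    exactly one column per training term, in the same order.
    """
    vocabulary = sorted(idf)
    rows = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens) or 1
        rows.append([counts[term] / total * idf[term] for term in vocabulary])
    return rows


# Hypothetical toy documents, for illustration only.
train_docs = [["stock", "surge"], ["stock", "falls"]]
test_docs = [["surge", "unseen"]]

idf = fit_idf(train_docs)
test_matrix = transform(test_docs, idf)
print(sorted(idf), test_matrix)
```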


Methodology News Aggregation

The time offset τ defines how long a news article takes to affect stock prices, for τ = 1, 2, 3, and 5 minutes
Throughout the day, each time offset can contain one or more news articles to be predicted
Only one recommendation (SURGE or NOT RECOMMENDED) is needed for each period of duration τ
A single decision must therefore be produced from the set of documents in the same time offset
A novel ensemble approach named Cascading Aggregation for Time Series (CATS) was created
A Genetic Algorithm is used to train decision rules from the counts of SURGE and NOT RECOMMENDED outcomes produced by the SVM in the same time offset τ
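The counting step that feeds the aggregation can be sketched in Python. This is not the CATS algorithm: the slide does not give the decision rules learned by the Genetic Algorithm, so `placeholder_rule` is a simple majority rule used here only as a hypothetical stand-in. All names are illustrative.

```python
from collections import Counter, defaultdict
from datetime import datetime
from typing import Dict, Iterable, Tuple


def window_start(ts: datetime, tau_minutes: int) -> datetime:
    """Floor a timestamp to the start of its tau-minute window within the day."""
    minute_of_day = (ts.hour * 60 + ts.minute) // tau_minutes * tau_minutes
    return ts.replace(
        hour=minute_of_day // 60, minute=minute_of_day % 60, second=0, microsecond=0
    )


def count_predictions(
    predictions: Iterable[Tuple[datetime, str]], tau_minutes: int
) -> Dict[datetime, Counter]:
    """Count SVM outcomes per time window of duration tau."""
    counts: Dict[datetime, Counter] = defaultdict(Counter)
    for ts, label in predictions:
        counts[window_start(ts, tau_minutes)][label] += 1
    return dict(counts)


def placeholder_rule(window_counts: Counter) -> str:
    """Stand-in decision rule (simple majority), NOT the GA-trained CATS rules."""
    if window_counts["SURGE"] > window_counts["NOT RECOMMENDED"]:
        return "SURGE"
    return "NOT RECOMMENDED"


# Hypothetical per-article predictions, for illustration only.
predictions = [
    (datetime(2013, 7, 3, 14, 30, 10), "SURGE"),
    (datetime(2013, 7, 3, 14, 31, 50), "SURGE"),
    (datetime(2013, 7, 3, 14, 32, 5), "NOT RECOMMENDED"),
    (datetime(2013, 7, 3, 14, 33, 0), "NOT RECOMMENDED"),
]
decisions = {
    window: placeholder_rule(counts)
    for window, counts in count_predictions(predictions, tau_minutes=3).items()
}
print(decisions)
```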

Methodology News Aggregation


Methodology

Model Evaluation (Good Model?)
    At least 10 predictive models have a minimum G-Mean >= 55.00
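For reference, the two measures used in this work can be computed from a confusion matrix as follows. The sketch assumes the standard definitions (G-Mean as the geometric mean of sensitivity and specificity, F-Measure as the harmonic mean of precision and recall) with SURGE as the positive class; the counts in the example are hypothetical. The slides report both on a 0 to 100 scale, which corresponds to multiplying these values by 100.

```python
import math


def g_mean(tp: int, fp: int, tn: int, fn: int) -> float:
    """Geometric mean of sensitivity (recall) and specificity."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


def f_measure(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall (F1)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical confusion matrix: 10 positives, 90 negatives.
example_g = g_mean(tp=8, fp=9, tn=81, fn=2)
example_f = f_measure(tp=8, fp=9, fn=2)
print(round(100 * example_g, 2), round(100 * example_f, 2))
```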

Investment simulation
    Check whether the predictions are profitable in an investment scenario
    Strategy: if a SURGE prediction occurs, purchase $10,000 of shares of the related stock at τ-1 minutes after the news article is published. Hold the position for n=3 minutes; if during those 3 minutes the stock can be sold at a profit of >= 2%, sell it immediately. Otherwise, at the end of the n minutes, sell the stock at the current market price, taking a loss if necessary
    High-frequency trading (HFT)
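The strategy for a single SURGE prediction can be written as a small Python function. It is a sketch under simplifying assumptions that the slide does not state: one price per minute, fractional shares allowed, and no transaction costs. The function and parameter names are illustrative.

```python
from typing import Sequence


def simulate_trade(
    prices: Sequence[float],
    capital: float = 10_000.0,
    hold_minutes: int = 3,
    take_profit: float = 0.02,
) -> float:
    """Return the profit (or loss) in dollars of one simulated trade.

    prices[0] is the purchase price; prices[i] is the price i minutes later.
    """
    entry = prices[0]
    shares = capital / entry
    last = min(hold_minutes, len(prices) - 1)
    for minute in range(1, last + 1):
        if prices[minute] >= entry * (1 + take_profit):
            # Take-profit level reached: sell immediately.
            return shares * (prices[minute] - entry)
    # Holding period over: sell at the current price, even at a loss.
    return shares * (prices[last] - entry)


# Hypothetical minute-by-minute prices, for illustration only.
profit_case = simulate_trade([100.0, 101.0, 102.5, 99.0])
loss_case = simulate_trade([100.0, 100.5, 99.5, 99.0])
print(profit_case, loss_case)
```

Summing the results over all SURGE predictions in the period, relative to the invested capital, gives a cumulative return comparable to the one reported in the experiments.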

Methodology

Good simulation?
    Cumulative return of the US T-Bond in the same period: 0.05%
    Compare the results with a random trader
    The random trader is the null hypothesis, while TradeMiner is the alternative hypothesis
    P-value (t-value) at the 99% confidence level
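The comparison against the random trader can be sketched as a one-sample t-test in Python. The slide does not say which quantity is the sample and which is the reference mean, so the setup below is an assumption: the sample holds the returns of repeated simulation runs, and `mu0` is the mean return expected under the null hypothesis. The returns and the number of runs are hypothetical; 2.821 is the one-sided critical value of Student's t for 9 degrees of freedom at the 99% level.

```python
import math
import statistics
from typing import Sequence


def one_sample_t(sample: Sequence[float], mu0: float) -> float:
    """t statistic for H0: the population mean of `sample` equals mu0."""
    n = len(sample)
    mean = statistics.fmean(sample)
    sd = statistics.stdev(sample)  # sample standard deviation (n - 1)
    return (mean - mu0) / (sd / math.sqrt(n))


# Hypothetical cumulative returns (%) from 10 simulation runs.
returns = [4.1, 4.9, 4.6, 5.0, 4.3, 4.7, 4.2, 4.8, 4.5, 5.0]
MU0 = 0.0  # hypothetical mean return under the null hypothesis
T_CRIT_99_ONE_SIDED_DF9 = 2.821

t_value = one_sample_t(returns, MU0)
reject_null = t_value > T_CRIT_99_ONE_SIDED_DF9
print(round(t_value, 2), reject_null)
```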

Real investment recommendation
    A good simulation leads to the decision to apply the recommendation model in a real investment scenario
    On-line test


Experiments Setup

Four experiments comparing the performance with time offsets τ = 1, 2, 3, and 5 minutes
From 3 July to 3 September 2013 (3 months)
Classifier results: F-Measure and G-Mean
Cumulative return of the simulation, with null hypothesis test
Comparison of results with the state of the art
Discussion of good practices

Experiments Classifier results

Table: G-Mean and F-Measure per stock symbol for τ=3, τ=2, and τ=1
Stock symbols: AA, AXP, BA, DD, PFE, TRV, VZ, XOM, MMM, MRK, GE, MCD
Cell values in extraction order (row and column positions were lost):
69.61 (0.0), 68.09 (0.0), 62.87 (13.4), 59.29 (0.6), 54.95 (0.5), 63.03 (0.1), 60.57 (1.0), 68.92 (0.3), 57.63 (0.6), 55.34 (2.7),
59.45 (7.7), 60.90 (5.5), 87.85 (4.6), 63.79 (0.2), 59.47 (4.5), 71.25 (0.3), 60.70 (3.7), 68.98 (8.9), 67.26 (4.6), 60.50 (0.1),
63.25 (0.5), 65.86 (1.9), 77.04 (6.6), 70.27 (0.1), 61.54 (16.9), 63.04 (3.7), 75.31 (1.3), 51.95 (0.4), 72.12 (3.8), 57.94 (1.3),
92.66 (7.8), 60.93 (6.3), 85.52 (0.2), 65.65 (2.4), 59.59 (4.5), 76.00 (3.0), 69.75 (11.1), 59.80 (2.2)


Experiments Classifier results

Experiments Investment simulation

Origin of Predictions          τ=3             τ=2             τ=1
TradeMiner                     4.61 (0.38)     5.29 (1.49)     21.47 (0.13)
Random                         -0.22 (1.11)    0.86 (2.03)     0.17 (2.71)
p-value (one sample t-test)    39.46           9.34            504.93


Experiments Investment simulation

Experiments Comparison with the state of art – Classifier results


Experiments Comparison with the state of art – Investment simulation

Experiments Discussion

The current work outperformed the existing results in terms of classifier results and cumulative return, with the exception of one work with AUC 70.30 that used cross validation
11% of the reviewed works used cross validation on time series problems
    Cross validation disrupts the time series and leaks information from the future
27% of the reviewed works did not provide any information about model selection
Accuracy is not a good classification measure for imbalanced class problems
    In a problem with 100 examples, 98 are negative and 2 are positive. If an algorithm classifies all of them as negative, its Accuracy will be 98%
~50% of the reviewed works published results only in Accuracy
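The 100-example case from this slide can be checked directly in Python. The sketch uses the standard definitions of Accuracy, sensitivity, specificity, and G-Mean; the variable names are illustrative.

```python
import math

# 2 positive and 98 negative examples; the classifier predicts all negative.
y_true = [1] * 2 + [0] * 98
y_pred = [0] * 100

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)
tn = sum(1 for t, p in pairs if t == 0 and p == 0)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)

accuracy = (tp + tn) / len(y_true)
sensitivity = tp / (tp + fn)  # share of positives that were found
specificity = tn / (tn + fp)  # share of negatives that were found
g_mean_value = math.sqrt(sensitivity * specificity)

print(f"Accuracy: {accuracy:.2%}, G-Mean: {g_mean_value:.2%}")
```

Accuracy reaches 98% although no positive example is found, while the G-Mean is 0 because the sensitivity is 0.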


Conclusions

This work presented a computational process that uses data and text mining to forecast the intraday price movements of 30 stocks listed on the DJIA
An extensive survey of TMFP was conducted, and problems such as incorrect use of classification measures and invalid model evaluation were identified
Classifier measures reached: Accuracy (99.77), Precision (99.88), Recall (92.74), AUC (67.87), G-Mean (92.66), and F-Measure (76.00)
Investment simulation with a cumulative return of 21.47% in three months
This is credited to precise workflow development, proper use of classification measures, and the new algorithms KNN-Und and CATS proposed in this work

Conclusions

These results provide evidence that stock price movements can be effectively predicted using text mining
Stock prices start to be affected by news articles within a few minutes of their publication
A loss of signal was observed when the news articles are accumulated over a wider time offset


Future work

The raw data used in this work is available for download (https://osf.io/gc6u6/)
The CATS algorithm
    Sliding window training, normalized count values, new measurements and attributes
    Apply new ensemble strategies to CATS
    Apply CATS to other time series problems
Apply the TMFP process to an on-line test
    Use high-capacity hardware or Hadoop if necessary
Develop an appropriate news alignment algorithm for wider time offsets

Future work

Make more use of t-SNE and unsupervised learning to visualize and explore the data
Use deep learning algorithms
Use the improvements above to model the 20 companies with low performance in this work
TMFP could be applied to ForEx, sentiment detection, automatic content interpretation for fundamental analysis, risk, and the forecast of other economic events
