Predicting the direction of stock market prices using Ensemble Learning

By Luckyson Khaidem, Snehanshu Saha, Sudeepa Roy Dey

PROBLEM STATEMENT

Investments in stock markets involve very high risk because of the markets' complexity and dynamic nature.

Many variables influence the market value on a particular day, such as economic conditions and investor sentiment. Because of this, stock markets are susceptible to quick changes, which cause random fluctuations in the stock price.

Market risk is positively correlated with forecasting error, so forecasting error needs to be minimized to keep investment risk minimal.

Errors in forecasting can be minimized by treating stock forecasting as a classification problem, that is, by predicting the direction of the price movement.

The goal is to design an intelligent system using Machine Learning techniques that learns from market data and proposes an optimized trading strategy to investors.
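Framing the forecast as a classification problem can be illustrated with a small labeling sketch. The function name `direction_labels`, the `horizon` parameter, the +1/-1 encoding, and the choice to count an unchanged price as -1 are all assumptions made for illustration; the slides do not specify them.

```python
def direction_labels(closes, horizon):
    """Label each day +1 if the close `horizon` days later is higher, else -1.

    Assumption: an unchanged price is labeled -1 (the slides do not say
    how ties are handled).
    """
    if horizon < 1:
        raise ValueError("horizon must be at least 1")
    return [
        1 if closes[t + horizon] > closes[t] else -1
        for t in range(len(closes) - horizon)
    ]


# Hypothetical closing prices, for illustration only.
closes = [1.0, 2.0, 1.0, 3.0]
labels_1d = direction_labels(closes, horizon=1)
labels_2d = direction_labels(closes, horizon=2)
```

The last `horizon` days get no label because their future price is not yet known.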

EXISTING APPROACHES AND RESULTS

Researchers have used a wide variety of approaches to stock forecasting.

The major methodologies are: 1) Technical Analysis, 2) Time Series Forecasting, 3) Machine Learning, and 4) Modeling and Predicting the volatility of stocks using differential equations.

Machine learning algorithms that have been used include SVM, Neural Networks, Linear Discriminant Analysis, Linear Regression, KNN, and the Naive Bayes classifier.

Some existing approaches do not take the nonlinearity of the problem into consideration, so their use of linear-discriminant-type machine learning algorithms is ineffective.

These algorithms have achieved accuracies in the range of 60-70%.

PROPOSED APPROACH

Figure 1: Proposed Methodology (Data Collection -> Exponential Smoothing -> Feature Extraction -> Random Forest -> Stock Market Prediction)
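The exponential smoothing step of the methodology could be sketched as follows. This is a minimal illustration assuming simple (single) exponential smoothing; the function name and the smoothing factor `alpha` are assumptions, since the slides do not state which variant or parameter value was used.

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing: S_0 = Y_0, S_t = alpha*Y_t + (1-alpha)*S_{t-1}.

    A larger alpha weights recent observations more heavily; alpha = 1
    returns the series unchanged.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    if not series:
        return []
    smoothed = [series[0]]
    for y in series[1:]:
        smoothed.append(alpha * y + (1 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical closing prices, for illustration only.
prices = [10.0, 20.0, 20.0]
smoothed = exponential_smoothing(prices, alpha=0.5)
```

Smoothing before feature extraction damps day-to-day noise, so the features computed afterwards reflect the underlying trend more than isolated fluctuations.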

RESULTS ACHIEVED

Figure 2: Output from Apple Inc. Data set

Figure 3: Output from GE Data set


Figure 4: ROC curve corresponding to Apple dataset


Figure 5: ROC curve corresponding to GE Data set


Figure 6: Time Window vs Accuracy for 3M stock data

WHY RANDOM FOREST?

Figure 7: Test for Linear Separability


Stock data is inherently nonlinear.

Random Forests can learn highly irregular data.

Random Forests can classify large amounts of data with high accuracy.

Random Forests are natural candidates for parallelization, since they consist of highly de-correlated decision trees.

The error of a Random Forest converges as the number of trees in the ensemble increases.

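ERROR BOUND

Random Forests have an upper bound on their generalization error. Define the margin function

\[ mr(X, Y) = P_\theta(h(X, \theta) = Y) - \max_{j \neq Y} P_\theta(h(X, \theta) = j) \tag{1} \]

The strength of the forest is defined as the expected value of the margin:

\[ s = E_{X,Y}\, mr(X, Y) \tag{2} \]

For s > 0, the generalization error is bounded above by Chebychev's inequality:

\[ \text{Error} = P_{X,Y}(mr(X, Y) \leq 0) \leq \frac{\operatorname{var}(mr)}{s^2} \tag{3} \]

Chebychev's inequality: let X be any random variable and C > 0. Then

\[ P(|X - E(X)| \geq C) \leq \frac{\operatorname{var}(X)}{C^2} \tag{4} \]

Bound (3) follows from (4) applied to the margin with C = s, since mr ≤ 0 implies |mr − s| ≥ s.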
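As a numerical illustration of the margin, the strength, and bound (3), the sketch below estimates them from per-sample vote fractions. The helper names and the margin values are hypothetical, chosen only to make the arithmetic easy to follow; they are not results from the presentation.

```python
def margin(vote_fractions, true_label):
    """Margin (1): vote share of the true class minus the largest other share.

    `vote_fractions` maps each class label to the fraction of trees voting for it.
    """
    p_true = vote_fractions.get(true_label, 0.0)
    p_other = max(
        (p for label, p in vote_fractions.items() if label != true_label),
        default=0.0,
    )
    return p_true - p_other


def error_bound(margins):
    """Return (strength s, var(mr), var(mr) / s^2) estimated from sample margins."""
    n = len(margins)
    s = sum(margins) / n
    if s <= 0:
        raise ValueError("the bound requires positive strength s")
    var = sum((m - s) ** 2 for m in margins) / n  # population variance
    return s, var, var / s ** 2


# Hypothetical margins for six samples; one sample (margin <= 0) is misclassified.
margins = [0.6, 0.2, -0.2, 0.4, 0.8, 0.6]
s, var, bound = error_bound(margins)
empirical_error = sum(1 for m in margins if m <= 0) / len(margins)
```

In this toy case the empirical error (1 of 6 samples) lies well below the bound, which illustrates that the Chebychev bound can be loose: it guarantees an upper limit but does not predict the error itself.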

OOB ERROR AND CONVERGENCE

Figure 8: OOB error rate vs Number of Estimators
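How an out-of-bag (OOB) error curve like the one in Figure 8 is computed can be sketched in plain Python. To keep the example short and dependency-free, bagged decision stumps stand in for the full decision trees of a Random Forest, and the data is synthetic; both are assumptions for illustration and not the setup used in the presentation.

```python
import random


def fit_stump(X, y):
    """Pick the (feature, threshold, polarity) with the fewest training errors."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            for pol in (1, -1):
                errs = sum(
                    1 for row, lab in zip(X, y)
                    if (pol if row[f] > t else -pol) != lab
                )
                if best is None or errs < best[0]:
                    best = (errs, f, t, pol)
    return best[1:]


def predict_stump(stump, row):
    f, t, pol = stump
    return pol if row[f] > t else -pol


def oob_error_curve(X, y, n_estimators, seed=0):
    """OOB error after each added estimator.

    Each sample is voted on only by estimators whose bootstrap sample
    did not contain it, so no separate validation set is needed.
    """
    rng = random.Random(seed)
    n = len(X)
    votes = [0] * n        # running sum of OOB votes (+1 / -1) per sample
    seen = [False] * n     # has the sample been out of bag at least once?
    curve = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        in_bag = set(idx)
        stump = fit_stump([X[i] for i in idx], [y[i] for i in idx])
        for i in range(n):
            if i not in in_bag:
                votes[i] += predict_stump(stump, X[i])
                seen[i] = True
        scored = [i for i in range(n) if seen[i]]
        wrong = sum(1 for i in scored if (1 if votes[i] >= 0 else -1) != y[i])
        curve.append(wrong / len(scored) if scored else None)
    return curve


# Synthetic two-feature data (hypothetical): the label follows a noisy feature 0.
rng = random.Random(42)
X = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(60)]
y = [1 if row[0] + rng.gauss(0, 0.2) > 0 else -1 for row in X]
curve = oob_error_curve(X, y, n_estimators=30)
print(curve[0], curve[-1])
```

Plotting `curve` against the number of estimators gives the kind of OOB error plot shown in the figure; the curve typically flattens once additional estimators stop changing the majority vote.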

OUTCOME

A paper on this topic, co-authored by Dr. Snehanshu Saha and Mrs. Sudeepa Roy Dey, has been submitted to the journal Applied Mathematical Finance.