Using Statistical Techniques for Data Mining
Muhammad Maad
Department of Computer and Information Services
Institute of Business Management, Karachi
1. Introduction
Businesses need to understand the information they hold or receive about their activities. For management to make informed decisions, that information must be analyzed and processed to provide the required answers.
Historically, businesses relied on statistics to obtain the answers they required when analyzing and comparing components of their data. More recently, data warehousing and data mining have taken hold in corporate decision making, and businesses now rely heavily on both.
The stock market provides ample opportunity to experiment with various data mining techniques. For the mid-term paper of our academic course on Data Mining, we reviewed several of these techniques.
The stock market is essentially a non-linear, non-parametric system that is extremely hard to model with any reasonable accuracy [i]. This appears true, as market behavior is at times difficult to understand or predict. Even so, a pattern can still be developed or estimated from the performance of a particular stock.
There are many techniques that can be applied to stock market prediction, such as decision trees, the rough set approach, and neural networks [ii]. We can employ data mining techniques to extract the desired information from a large data set for a stock. Data mining is somewhat similar to statistics, as it is also used to discover concept or class descriptions, association and correlation, classification, prediction, clustering, trend analysis, outlier and deviation analysis, and similarity analysis [iii].
One method of data classification is the Decision Tree. A Decision Tree is a graphical representation of all possible outcomes and the paths by which they may be reached. It requires a set of rules or decision parameters to work. Defining such rules for an environment as erratic as the stock market is an uphill task. However, we have based our work on the assumption that there exists a certain pattern in stock behavior and operations.
For this paper, we used recent data from the New York Stock Exchange for a few companies only. The data is “End of Day” data, which provides the opening, closing, high, and low prices and the volume for a particular stock. The technique can also be applied to a larger data set; however, doing so would require significant changes to the algorithm and greater processing capability.
This term paper is primarily based on literature available on the internet. The major focus has been on the following two papers:
Similarities and Differences in Statistics and Data Mining (S. M. Aqil Burney, Adnan Manzoor, Fahad Burney)
Predicting Stock Prices using Data Mining Techniques (Qasem A. Al-Radaideh, Adel Abu Assaf, Eman Alnagi)
2. Understanding the available data
To get the most relevant data, we downloaded the information from EODData (http://www.eoddata.com), a website specializing in the provision of data feeds in various formats. Its objective is to provide data that researchers and analysts can use in their own analysis and modeling.
The website offers data for a number of exchanges, formats, and periods. For a small test, data was selected from the New York Stock Exchange (NYSE) for the following symbols/companies, chosen at random:
AA (Alcoa Inc.)
BSX (Boston Scientific Corporation)
CLR (Continental Resources)
ECA (Encana Corp.)
EXC (Exelon Corp.)
F (Ford Motor Company)
GE (General Electric)
GM (General Motors)
HPQ (Hewlett-Packard Company)
JPM (JP Morgan Chase & Co.)
The downloaded data has the following attributes:
Data Field   Description                          Characteristics
Symbol       The stock symbol used for trading    String
Date         Trading date relating to the data    Date
Open         Opening price                        Numeric, 2 decimal
High         Highest price reached                Numeric, 2 decimal
Low          Lowest price reached                 Numeric, 2 decimal
Close        Closing price                        Numeric, 2 decimal
Volume       Number of shares traded              Numeric
Table 1: Downloaded data characteristics
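The attributes in Table 1 map naturally onto a typed record. The following is a minimal sketch only: it assumes the download is CSV text with a header row naming the Table 1 fields and dates in ISO format (YYYY-MM-DD), neither of which is confirmed by the source. The names `EodRecord`, `parse_eod`, and `SAMPLE`, and the sample values, are illustrative and made up.

```python
import csv
import io
from dataclasses import dataclass
from datetime import date, datetime


@dataclass(frozen=True)
class EodRecord:
    """One end-of-day row, mirroring the fields of Table 1."""
    symbol: str
    trading_date: date
    open: float
    high: float
    low: float
    close: float
    volume: int


def parse_eod(text: str) -> list[EodRecord]:
    """Parse CSV text whose header names the Table 1 fields (assumed layout)."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        records.append(EodRecord(
            symbol=row["Symbol"],
            # Assumed date format; adjust to the format actually downloaded.
            trading_date=datetime.strptime(row["Date"], "%Y-%m-%d").date(),
            open=float(row["Open"]),
            high=float(row["High"]),
            low=float(row["Low"]),
            close=float(row["Close"]),
            volume=int(row["Volume"]),
        ))
    return records


# Made-up sample rows, for illustration only (not real market data).
SAMPLE = """Symbol,Date,Open,High,Low,Close,Volume
AA,2013-01-02,8.50,8.75,8.40,8.60,1000
GE,2013-01-02,21.00,21.40,20.90,21.30,2500
"""

records = parse_eod(SAMPLE)
```

Keeping prices as floats is sufficient for the two-decimal values described in Table 1; a decimal type could be substituted if exact arithmetic were required.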
The data can also be represented in the form of an Online Analytical Processing (OLAP) cube. A pictorial representation is as follows:
3. Use of statistical techniques
To analyze the data, we applied various statistical techniques, which are described in the following sections.
3.1 Time series plot for all symbols over the total period
The first and most basic technique that we can apply to the data is the time-series plot, which depicts the performance of each stock over a period of time. Performance can be measured through the volume or through the closing, highest, or opening price. For the purpose of this report, we have used the Volume as an example.
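Before plotting, the rows have to be arranged into one chronological series per symbol. The sketch below shows one way this could be done with the standard library alone; the function name `volume_series` and the sample rows are hypothetical, and dates are assumed to be ISO strings so that string order equals chronological order.

```python
from collections import defaultdict


def volume_series(rows):
    """Group (symbol, iso_date, volume) rows into per-symbol series.

    Returns a dict mapping each symbol to a list of (iso_date, volume)
    pairs sorted by date. ISO dates (YYYY-MM-DD) sort chronologically
    as plain strings, which is the assumption made here.
    """
    series = defaultdict(list)
    for symbol, iso_date, volume in rows:
        series[symbol].append((iso_date, volume))
    return {symbol: sorted(points) for symbol, points in series.items()}


# Made-up rows, deliberately out of order.
rows = [
    ("GE", "2013-01-03", 2600),
    ("AA", "2013-01-02", 1000),
    ("GE", "2013-01-02", 2500),
    ("AA", "2013-01-03", 1200),
]
by_symbol = volume_series(rows)
```

Each resulting list can then be handed to any plotting tool, with the date on the horizontal axis and the volume on the vertical axis.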
3.2 Correlation between ‘Open’ and ‘Volume’ of all the symbols
Correlations
                                                   Open          Volume
Open     Pearson Correlation                       1             -.207**
         Sig. (2-tailed)                                         .000
         Sum of Squares and Cross-products         134885.157    -2.403E10
         Covariance                                211.088       -3.760E7
         N                                         640           640
Volume   Pearson Correlation                       -.207**       1
         Sig. (2-tailed)                           .000
         Sum of Squares and Cross-products         -2.403E10     9.976E16
         Covariance                                -3.760E7      1.561E14
         N                                         640           640
**. Correlation is significant at the 0.01 level (2-tailed).
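The Pearson coefficient reported above is the cross-product sum divided by the square root of the product of the two sums of squares. The sketch below is an illustrative standard-library implementation, not the procedure used by the statistical package; the function names are hypothetical. It also recomputes the coefficient from the summary figures in the table.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return pearson_from_sums(sxy, sxx, syy)


def pearson_from_sums(sxy, sxx, syy):
    """Pearson r from a cross-product sum and two sums of squares."""
    return sxy / math.sqrt(sxx * syy)


# Summary figures taken from the correlation table for Open and Volume.
r_open_volume = pearson_from_sums(-2.403e10, 134885.157, 9.976e16)
```

Rounded to three decimals, `r_open_volume` agrees with the reported coefficient of -.207, which indicates a weak negative linear relationship between the opening price and the volume.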
3.3 Non-parametric correlation between ‘Open’ and ‘Volume’ of all the symbols
Correlations
                                                      Open       Volume
Kendall's tau_b   Open     Correlation Coefficient    1.000      -.134**
                           Sig. (2-tailed)            .          .000
                           N                          640        640
                  Volume   Correlation Coefficient    -.134**    1.000
                           Sig. (2-tailed)            .000       .
                           N                          640        640
Spearman's rho    Open     Correlation Coefficient    1.000      -.203**
                           Sig. (2-tailed)            .          .000
                           N                          640        640
                  Volume   Correlation Coefficient    -.203**    1.000
                           Sig. (2-tailed)            .000       .
                           N                          640        640
**. Correlation is significant at the 0.01 level (2-tailed).
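Both non-parametric coefficients work on the ordering of the observations rather than their values. As a rough sketch (hypothetical function names, standard library only, quadratic-time Kendall computation that is adequate for a few hundred rows), Spearman's rho can be computed as the Pearson correlation of average ranks, and Kendall's tau_b from counts of concordant, discordant, and tied pairs:

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation applied to average ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)


def kendall_tau_b(xs, ys):
    """Kendall's tau_b with the usual correction for ties."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(xs)
    for i in range(n):
        for j in range(i + 1, n):
            dx = xs[i] - xs[j]
            dy = ys[i] - ys[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded entirely
            if dx == 0:
                ties_x += 1
            elif dy == 0:
                ties_y += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / math.sqrt(
        (untied + ties_x) * (untied + ties_y))


# Made-up toy data, for illustration only.
rho = spearman_rho([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
tau = kendall_tau_b([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])
```

On the toy data the two coefficients differ (0.8 for rho, 0.6 for tau_b), which mirrors the pattern in the table, where tau_b (-.134) is smaller in magnitude than rho (-.203).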
3.4 Regression analysis with ‘Volume’ as the dependent variable and ‘Open’ and ‘High’ as predictors
Model Summary (b)
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .224 (a)   .050       .047                1.220E7

Change Statistics
R Square Change   F Change   df1   df2   Sig. F Change   Durbin-Watson
.050              16.861     2     637   .000            .729
a. Predictors: (Constant), High, Open
b. Dependent Variable: Volume
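The figures in the model summary are linked by standard formulas, so they can be cross-checked from R, the number of observations (n = 640), and the number of predictors (k = 2). The following is a small sketch with hypothetical function names. Because it starts from R rounded to three decimals, the F value it produces is only approximately equal to the reported 16.861.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 for n observations and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


def f_statistic(r2, n, k):
    """Overall F statistic with (k, n - k - 1) degrees of freedom."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))


n, k = 640, 2            # observations and predictors (High, Open)
r2 = 0.224 ** 2          # R as reported, rounded to three decimals
adj = adjusted_r_squared(r2, n, k)
f_value = f_statistic(r2, n, k)
df2 = n - k - 1
```

The adjusted value rounds to .047 and the residual degrees of freedom come to 637, both matching the table. An R Square of .050 means that the two price variables explain only about 5% of the variation in volume, even though the model is statistically significant.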
4. The Decision Tree model
Reviewing the existing work by Al-Radaideh et al. [iv], we replicated the following model:
Previous   Open       High       Low        Close      Action
Positive   Positive   Positive   Negative   Negative   Sell
Negative   Positive   Positive   Negative   Negative   Buy
Negative   Negative   Equal      Negative   Negative   Buy
Negative   Negative   Equal      Negative   Negative   Sell
Negative   Equal      Positive   Negative   Positive   Buy
Positive   Negative   Positive   Negative   Positive   Buy
Positive   Positive   Positive   Positive   Positive   Buy
Positive   Equal      Positive   Negative   Negative   Buy
Negative   Positive   Positive   Negative   Negative   Sell
Table 2: Decision Tree Model
This model is based upon the last closing price for any stock. We can express the model in pseudo-code as follows:
Filter the entire data set to select the required symbol
Sort the extracted data on the trading date in ascending order
Pick up the last closing price (A) of the symbol
Scenario 1: If the previous day’s closing = positive AND today’s open = positive AND today’s high = positive AND today’s low = negative AND today’s close = negative THEN “SELL”
Scenario 2: If the previous day’s closing = negative AND today’s open = positive AND today’s high = positive AND today’s low = negative AND today’s close = negative THEN “SELL”
Scenario 3: If the previous day’s closing = negative AND today’s open = negative AND today’s high = equal AND today’s low = negative AND today’s close = negative THEN “BUY”
...
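The scenarios above amount to a lookup of a five-part pattern against the rows of Table 2. The sketch below is one possible rendering, not the original authors' implementation. Two points are assumptions: that each value is labelled Positive, Negative, or Equal by comparing it with a reference price (taken here to be the last closing price A), and that the first matching row wins when Table 2 lists the same pattern with different actions. The names `label`, `RULES`, and `decide` are hypothetical.

```python
def label(value, reference):
    """Label a price relative to a reference price (assumed convention)."""
    if value > reference:
        return "Positive"
    if value < reference:
        return "Negative"
    return "Equal"


# Rows of Table 2, in order: (Previous, Open, High, Low, Close) -> Action.
RULES = [
    (("Positive", "Positive", "Positive", "Negative", "Negative"), "Sell"),
    (("Negative", "Positive", "Positive", "Negative", "Negative"), "Buy"),
    (("Negative", "Negative", "Equal", "Negative", "Negative"), "Buy"),
    (("Negative", "Negative", "Equal", "Negative", "Negative"), "Sell"),
    (("Negative", "Equal", "Positive", "Negative", "Positive"), "Buy"),
    (("Positive", "Negative", "Positive", "Negative", "Positive"), "Buy"),
    (("Positive", "Positive", "Positive", "Positive", "Positive"), "Buy"),
    (("Positive", "Equal", "Positive", "Negative", "Negative"), "Buy"),
    (("Negative", "Positive", "Positive", "Negative", "Negative"), "Sell"),
]


def decide(previous, open_, high, low, close):
    """Return the action of the first Table 2 row matching the pattern.

    Returns None when no row matches. First-match-wins is an assumption
    made here to resolve patterns that appear with conflicting actions.
    """
    state = (previous, open_, high, low, close)
    for pattern, action in RULES:
        if pattern == state:
            return action
    return None


action = decide("Positive", "Positive", "Positive", "Positive", "Positive")
```

Patterns that are absent from Table 2 return None, so a caller can treat them as "no recommendation" rather than forcing a buy or sell.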
Pictorially, we can describe an example of the model as follows:
5. Conclusion
A decision rule based on the past performance of a stock may offer a quick or automated way to decide whether to buy or sell. However, given the behavior of the stock market, the predictions can never be perfect. Various other factors are linked with the performance of a particular stock or of the whole market. Such factors are not available in any structured form, and it is therefore very difficult to predict whether a stock should be acquired or disposed of.
References
[i] Wang, Y. F. (2003) “Mining Stock price using fuzzy rough set system”
[ii] Wang, Y. F. (2002) “Predicting stock prices using fuzzy grey prediction system”
[iii] Han, J., Kamber, M., Jian P. (2011) “Data Mining Concepts and Techniques”
[iv] Al-Radaideh, Q. A., Abu Assaf, A., Alnagi, E. (2013) “Predicting stock prices using data mining techniques”