Using Statistical Techniques for Data Mining

Muhammad Maad
Department of Computer and Information Services
Institute of Business Management, Karachi

1. Introduction

Businesses need to understand the information they hold or receive about their activities. For management to make informed decisions, that information must be analyzed and processed to provide the required answers.

Historically, businesses relied upon the field of statistics to obtain the answers they required when analyzing and comparing the various components of their data. In recent times, data warehousing and data mining have taken hold in corporate decision making, and businesses now rely heavily on both.

The stock market provides ample opportunity to experiment with and apply various data mining techniques. For the mid-term paper of our academic course on Data Mining, a review of these techniques was undertaken.

The stock market is essentially a non-linear, non-parametric system that is extremely hard to model with any reasonable accuracy [i]. This appears to be true, as market behavior is at times difficult to understand or predict. Having said this, a pattern can still be developed or estimated from the performance of a particular stock.

There are many techniques that can be applied to stock market prediction, such as decision trees, the rough set approach, and neural networks [ii]. We can employ data mining techniques to extract the desired information from a large data set for a stock. Data mining is somewhat similar to statistics, as it is also used to discover concept or class descriptions, associations and correlations, classification, prediction, clustering, trend analysis, outlier and deviation analysis, and similarity analysis [iii].

One method of data classification is the Decision Tree. A Decision Tree is a graphical representation of all possible outcomes and the paths by which they may be reached. It requires a set of rules or decision parameters to work. Defining such rules is an uphill task for an environment as erratic as the stock market. However, we have based our work on the assumption that a certain pattern exists in stock behavior and operations.

For this paper, we have used recent data from the New York Stock Exchange for a few companies only. The data used is the “End of Day” data, which provides the opening, closing, high, and low prices and the volume for a particular stock. The technique can also be applied to a larger data set; however, doing so would require significant changes to the algorithm and greater processing capability.

This term paper is primarily based on literature available on the internet. The major focus has been on the following two papers:

• Similarities and Differences in Statistics and Data Mining (S. M. Aqil Burney, Adnan Manzoor, Fahad Burney)
• Predicting Stock Prices using Data Mining Techniques (Qasem A. Al-Radaideh, Adel Abu Assaf, Eman Alnagi)

2. Understanding the available data

To get the most relevant data, we downloaded the information from EODData (http://www.eoddata.com), a website specializing in the provision of data feeds in various formats. Its objective is to provide data that researchers and analysts can use in their own analysis and modeling.

The website offers data from a number of exchanges, in a number of formats, and for a number of periods. To perform a small test, data was selected from the New York Stock Exchange (NYSE) for the following symbols/companies, chosen at random:

• AA (Alcoa Inc.)
• BSX (Boston Scientific Corporation)
• CLR (Continental Resources)
• ECA (Encana Corp.)
• EXC (Exelon Corp.)
• F (Ford Motor Company)
• GE (General Electric)
• GM (General Motors)
• HPQ (Hewlett-Packard Company)
• JPM (JP Morgan Chase & Co.)

The downloaded data has the following attributes:

| Data Field | Description | Characteristics |
|---|---|---|
| Symbol | The stock symbol used for trading | String |
| Date | Trading date relating to the data | Date |
| Open | Opening price | Numeric, 2 decimal |
| High | Highest price reached | Numeric, 2 decimal |
| Low | Lowest price reached | Numeric, 2 decimal |
| Close | Closing price | Numeric, 2 decimal |
| Volume | Number of shares traded | Numeric |

Table 1: Downloaded data characteristics
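As an illustration of the attributes in Table 1, the following minimal Python sketch reads end-of-day rows into typed records. The CSV layout, the header names, and the `YYYY-MM-DD` date format are assumptions made for this example (the actual EODData file format is not described here), `EodRecord` and `parse_eod` are hypothetical names, and the sample rows are made up.

```python
import csv
import io
from dataclasses import dataclass
from datetime import date, datetime
from decimal import Decimal


@dataclass(frozen=True)
class EodRecord:
    """One end-of-day row, typed according to Table 1."""
    symbol: str
    trade_date: date
    open: Decimal   # prices carry 2 decimals, so Decimal avoids float rounding
    high: Decimal
    low: Decimal
    close: Decimal
    volume: int


def parse_eod(text: str) -> list[EodRecord]:
    """Parse CSV text with a header row into typed records (assumed layout)."""
    records = []
    for row in csv.DictReader(io.StringIO(text)):
        records.append(
            EodRecord(
                symbol=row["Symbol"],
                trade_date=datetime.strptime(row["Date"], "%Y-%m-%d").date(),
                open=Decimal(row["Open"]),
                high=Decimal(row["High"]),
                low=Decimal(row["Low"]),
                close=Decimal(row["Close"]),
                volume=int(row["Volume"]),
            )
        )
    return records


# Made-up sample rows, for illustration only (not actual NYSE data).
SAMPLE = """Symbol,Date,Open,High,Low,Close,Volume
AA,2013-01-02,9.00,9.20,8.90,9.10,1000000
F,2013-01-02,13.00,13.30,12.90,13.20,2000000
"""

records = parse_eod(SAMPLE)
for record in records:
    print(record.symbol, record.trade_date, record.close, record.volume)
```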

The data can also be represented in the form of an Online Analytical Processing (OLAP) cube; a pictorial representation is as follows:

3. Use of statistical techniques

To analyze the data, we applied various statistical techniques, which are described in the following sections.

3.1 Time Series plot for all symbols over the total period

The first and most basic technique that we can apply to the data is the time-series plot, which depicts the performance of each stock over a period of time. Performance can be measured through the volume or through the closing, highest, or opening price. For the purpose of this report, we have used the Volume as an example.
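A minimal sketch of how the data might be arranged for such a plot: rows are grouped by symbol and ordered by trading date, which yields one volume series per stock. The function name `volume_series` and the sample rows are assumptions for illustration only; the figures are made up.

```python
from collections import defaultdict

# Made-up (symbol, ISO date, volume) rows, deliberately out of order.
ROWS = [
    ("F", "2013-01-03", 2100000),
    ("AA", "2013-01-02", 1000000),
    ("F", "2013-01-02", 2000000),
    ("AA", "2013-01-03", 1200000),
]


def volume_series(rows):
    """Group rows by symbol and sort each group by trading date."""
    series = defaultdict(list)
    for symbol, day, volume in rows:
        series[symbol].append((day, volume))
    # ISO-formatted dates sort chronologically when compared as strings.
    return {symbol: sorted(points) for symbol, points in series.items()}


series = volume_series(ROWS)
for symbol, points in sorted(series.items()):
    print(symbol, points)
```

Each resulting list of (date, volume) points can then be passed to any plotting tool, with one line per symbol.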

3.2 Correlation between ‘Open’ and ‘Volume’ of all the symbols

Correlations

| | | Open | Volume |
|---|---|---|---|
| Open | Pearson Correlation | 1 | -.207** |
| | Sig. (2-tailed) | | .000 |
| | Sum of Squares and Cross-products | 134885.157 | -2.403E10 |
| | Covariance | 211.088 | -3.760E7 |
| | N | 640 | 640 |
| Volume | Pearson Correlation | -.207** | 1 |
| | Sig. (2-tailed) | .000 | |
| | Sum of Squares and Cross-products | -2.403E10 | 9.976E16 |
| | Covariance | -3.760E7 | 1.561E14 |
| | N | 640 | 640 |

**. Correlation is significant at the 0.01 level (2-tailed).
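The quantities in this output are related in a simple way, which the following Python sketch illustrates on made-up numbers: the cross-product sum divided by N − 1 gives the covariance, and the cross-product sum divided by the square root of the product of the two sums of squares gives the Pearson coefficient. The function name `pearson_summary` and the sample values are assumptions; this is not the procedure used to produce the figures above.

```python
import math


def pearson_summary(x, y):
    """Return Pearson r, the cross-product sum, the covariance, and N."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of squares and cross-products of deviations from the means.
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return {
        "r": sxy / math.sqrt(sxx * syy),
        "cross_products": sxy,
        "covariance": sxy / (n - 1),  # sample covariance
        "n": n,
    }


# Made-up opening prices and volumes, for illustration only.
opens = [10.0, 12.0, 14.0, 16.0]
volumes = [400.0, 300.0, 350.0, 100.0]

result = pearson_summary(opens, volumes)
print(result)
```

The same relation can be checked against the reported figures: dividing the sum of squares for Open, 134885.157, by N − 1 = 639 gives approximately 211.088, the reported covariance.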

3.3 Non-parametric Correlation between ‘Open’ and ‘Volume’ of all the symbols

Correlations

| | | | Open | Volume |
|---|---|---|---|---|
| Kendall's tau_b | Open | Correlation Coefficient | 1.000 | -.134** |
| | | Sig. (2-tailed) | . | .000 |
| | | N | 640 | 640 |
| | Volume | Correlation Coefficient | -.134** | 1.000 |
| | | Sig. (2-tailed) | .000 | . |
| | | N | 640 | 640 |
| Spearman's rho | Open | Correlation Coefficient | 1.000 | -.203** |
| | | Sig. (2-tailed) | . | .000 |
| | | N | 640 | 640 |
| | Volume | Correlation Coefficient | -.203** | 1.000 |
| | | Sig. (2-tailed) | .000 | . |
| | | N | 640 | 640 |

**. Correlation is significant at the 0.01 level (2-tailed).
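For readers unfamiliar with the two rank-based coefficients, the sketch below computes them on made-up numbers. Spearman's rho is taken as the Pearson correlation of the ranks (tied values share an average rank), and Kendall's tau-b is computed from concordant and discordant pairs with the usual correction for ties. All function names and sample values are assumptions for illustration; the quadratic pair loop is adequate only for small samples.

```python
import math


def average_ranks(values):
    """Rank values from 1, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall's tau-b from pairwise comparisons, corrected for ties."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied in both variables: excluded from every count
            if dx == 0:
                only_x_tied += 1
            elif dy == 0:
                only_y_tied += 1
            elif dx * dy > 0:
                concordant += 1
            else:
                discordant += 1
    untied = concordant + discordant
    denominator = math.sqrt((untied + only_x_tied) * (untied + only_y_tied))
    return (concordant - discordant) / denominator


# Made-up opening prices and volumes, for illustration only.
opens = [10.0, 12.0, 14.0, 16.0]
volumes = [400.0, 300.0, 350.0, 100.0]

rho = spearman_rho(opens, volumes)
tau = kendall_tau_b(opens, volumes)
print(rho, tau)
```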

3.4 Regression analysis with ‘Volume’ as the Dependent variable and ‘Open’ and ‘High’ as Predictors

Model Summary (b)

| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate | R Square Change | F Change | df1 | df2 | Sig. F Change | Durbin-Watson |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | .224 (a) | .050 | .047 | 1.220E7 | .050 | 16.861 | 2 | 637 | .000 | .729 |

a. Predictors: (Constant), High, Open
b. Dependent Variable: Volume
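The summary figures can be cross-checked with the standard formulas for the adjusted R Square and the F statistic, as in the following sketch. It assumes N = 640 observations (the N reported in the correlation output) and two predictors, and it starts from the rounded value R = .224, so the F statistic it produces differs slightly from the reported 16.861. The function names are hypothetical.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R Square for n observations and k predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


def f_statistic(r2, n, k):
    """F statistic for the overall regression, on (k, n - k - 1) degrees of freedom."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))


N_OBS = 640        # assumed: the N reported in the correlation tables
N_PREDICTORS = 2   # two predictors, per the table footnote
R = 0.224          # rounded value taken from the Model Summary

r2 = R ** 2
df2 = N_OBS - N_PREDICTORS - 1
adj_r2 = adjusted_r_squared(r2, N_OBS, N_PREDICTORS)
f_value = f_statistic(r2, N_OBS, N_PREDICTORS)

print(round(r2, 3), round(adj_r2, 3), df2, f_value)
```

An R Square of about .050 means that the two predictors together account for roughly 5% of the variation in Volume, so the fit is weak even though it is statistically significant.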

4. The Decision Tree model

After reviewing the existing work by Al-Radaideh et al. [iv], the following model was replicated:

| Previous | Open | High | Low | Close | Action |
|---|---|---|---|---|---|
| Positive | Positive | Positive | Negative | Negative | Sell |
| Negative | Positive | Positive | Negative | Negative | Buy |
| Negative | Negative | Equal | Negative | Negative | Buy |
| Negative | Negative | Equal | Negative | Negative | Sell |
| Negative | Equal | Positive | Negative | Positive | Buy |
| Positive | Negative | Positive | Negative | Positive | Buy |
| Positive | Positive | Positive | Positive | Positive | Buy |
| Positive | Equal | Positive | Negative | Negative | Buy |
| Negative | Positive | Positive | Negative | Negative | Sell |

Table 2: Decision Tree Model

This model is based upon the last closing price of a stock. We can prepare pseudo-code for the model as follows:

• Filter the entire data set to select the required symbol.
• Sort the extracted data on the trading date in ascending order.
• Pick up the last closing price (A) of the symbol.
• Scenario 1: If the previous day’s closing = positive AND today’s open = positive AND today’s high = positive AND today’s low = negative AND today’s close = negative THEN “SELL”
• Scenario 2: If the previous day’s closing = negative AND today’s open = positive AND today’s high = positive AND today’s low = negative AND today’s close = negative THEN “SELL”
• Scenario 3: If the previous day’s closing = negative AND today’s open = negative AND today’s high = equal AND today’s low = negative AND today’s close = negative THEN “BUY”
• ...
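The three scenarios spelled out above can be sketched in Python as a lookup on a tuple of labels. The paper does not state how a price is labelled positive, negative, or equal, so this sketch makes an assumption: each of today's prices is compared with the last closing price (A), and "Previous" compares A with the close before it. Scenario 1 is encoded with close = negative, following the first row of Table 2. The names `sign_label`, `RULES`, and `decide` are hypothetical, and the prices are made up.

```python
def sign_label(value, reference):
    """Label a price relative to a reference price (assumed interpretation)."""
    if value > reference:
        return "Positive"
    if value < reference:
        return "Negative"
    return "Equal"


# Keys are (Previous, Open, High, Low, Close); only the three scenarios
# written out in the pseudo-code are encoded here.
RULES = {
    ("Positive", "Positive", "Positive", "Negative", "Negative"): "SELL",
    ("Negative", "Positive", "Positive", "Negative", "Negative"): "SELL",
    ("Negative", "Negative", "Equal", "Negative", "Negative"): "BUY",
}


def decide(older_close, last_close, today):
    """Return "BUY", "SELL", or None when no encoded scenario matches.

    older_close: the close before the last one
    last_close:  the last closing price (A in the pseudo-code)
    today:       dict with "open", "high", "low", and "close" prices
    """
    key = (
        sign_label(last_close, older_close),
        sign_label(today["open"], last_close),
        sign_label(today["high"], last_close),
        sign_label(today["low"], last_close),
        sign_label(today["close"], last_close),
    )
    return RULES.get(key)


# Made-up prices, for illustration only.
buy_case = decide(10.0, 9.5, {"open": 9.4, "high": 9.5, "low": 9.0, "close": 9.2})
sell_case = decide(9.0, 9.5, {"open": 9.6, "high": 9.8, "low": 9.3, "close": 9.4})
no_match = decide(9.0, 9.5, {"open": 9.6, "high": 9.8, "low": 9.6, "close": 9.7})
print(buy_case, sell_case, no_match)
```

A dictionary lookup requires every combination of labels to map to a single action, so each additional row of Table 2 would need a distinct key before it could be added.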

Pictorially, we can describe an example of the model as follows:

5. Conclusion

A decision rule based on the past performance of a stock may offer a quick or automated way to arrive at a decision to buy or sell. However, given the behavior of the stock market, the predictions can never be perfect. Various other factors are linked with the performance of a particular stock or of the whole market. Such factors are not available in any structured form, and it is therefore very difficult to predict whether a stock should be acquired or disposed of.

References

[i] Wang, Y. F. (2003) “Mining Stock price using fuzzy rough set system”
[ii] Wang, Y. F. (2002) “Predicting stock prices using fuzzy grey prediction system”
[iii] Han, J., Kamber, M., Pei, J. (2011) “Data Mining Concepts and Techniques”
[iv] Al-Radaideh, Q. A., Abu Assaf, A., Alnagi, E. (2013) “Predicting stock prices using data mining techniques”