DEVELOPING A MIXTURE DECISION TREE ALGORITHM WITH K-MEANS CLUSTERING TO IMPROVE CLASSIFICATION ACCURACY AND ITS APPLICATIONS
A Thesis Submitted to the Mawlana Bhashani Science and Technology University for the Partial Fulfillment of the Requirements for the Degree of Master of Science (MS) in Statistics
SUBMITTED BY MD. ASHRAFUL ISLAM EXAMINATION ROLL NO: ST-12222 SESSION: 2015-2016
DEPARTMENT OF STATISTICS MAWLANA BHASHANI SCIENCE AND TECHNOLOGY UNIVERSITY SANTOSH, TANGAIL-1902 OCTOBER, 2018
CERTIFICATE
This thesis contains the research on "Developing a Mixture Decision Tree Algorithm with K-Means Clustering to Improve Classification Accuracy and its Applications". This research work has been carried out by Md. Ashraful Islam under our supervision. It is certified that the work included in this thesis is original and forms a distinct contribution to knowledge. The thesis contains work well worth consideration for the award of the Master of Science (MS) degree.
……………………………………………. Md. Deluar Jahan Moloy Supervisor Assistant Professor, Department of Statistics Mawlana Bhashani Science and Technology University Santosh, Tangail-1902
…………………………………………… Md. Sifat Ar Salan Co-supervisor Lecturer, Department of Statistics Mawlana Bhashani Science and Technology University Santosh, Tangail-1902
DEDICATED TO MY BELOVED PARENTS
Thanks for your support, love and inspiration…
Acknowledgement
First of all, the author expresses his sincere gratitude to Almighty Allah, the Supreme Ruler of the universe, for His never-ending blessings for the successful completion of the present research work and the preparation of this manuscript. I am greatly indebted to my respected supervisor Md. Deluar Jahan Moloy, Assistant Professor, Department of Statistics, and co-supervisor Md. Sifat Ar Salan, Lecturer, Department of Statistics, Mawlana Bhashani Science and Technology University, Santosh, Tangail-1902, for their dynamic guidance, compassionate help, precious suggestions, affectionate feelings, constructive criticisms and inspiration in all phases of the study, in other activities, and in the preparation of this report. I express my heartiest gratitude and sincere thanks to all of my respected teachers of the Statistics Department for their affectionate and kind co-operation, and to all the members of the department. Special thanks to all my friends, who have helped me a lot during my work and whose valuable comments, suggestions and criticisms helped to enrich it. Above all, I thank Allah, the Almighty, the most beneficent, the most merciful, for enabling me to complete my work in time.
Md. Ashraful Islam October, 2018
Abstract
Statistical research based on data is one of the major keys to the development of the world, but the volume of data is increasing very rapidly. As a result, statistical data analysis is becoming more difficult, and the existing data mining techniques sometimes fail to provide significant and satisfactory results. Moreover, the results of traditional classification techniques are not always statistically significant. In this research, a Mixture Decision Tree algorithm is proposed as a more powerful classification technique that may provide better results than the existing methods. In this Mixture Decision Tree, the Classification and Regression Tree (CART) algorithm is combined with k-means clustering, so that the primary data are filtered in two stages. Applying this method to some predefined datasets, significant improvements in classification accuracy were found with respect to the existing classification techniques. The Mixture Decision Tree algorithm was also applied to time series data, where it likewise performed well in terms of misclassification error. The results revealed that the Mean Absolute Percentage Error (MAPE) for the Mixture Decision Tree is lower than the MAPE of the usual Classification and Regression Tree (CART). The Mixture Decision Tree algorithm is slightly more complex than the existing methods, but with computer programming it can be implemented easily.
Keywords: Data Mining, Decision Tree, Classification, Clustering, Time Series Data
Contents Acknowledgement ................................................................................................................... III Abstract .................................................................................................................................... IV List of Tables ........................................................................................................................ VIII List of Figures ........................................................................................................................... IX
Chapter One: Introduction 1.1 Introduction ........................................................................................................................ 10 1.2 Problem statement .............................................................................................................. 12 1.3 Objectives of the study....................................................................................................... 12 1.4 Scope of the study .............................................................................................................. 13 1.5 Justification of the study .................................................................................................... 13 1.6 Limitation of the study ....................................................................................................... 14 1.7 Conclusion ......................................................................................................................... 14
Chapter Two: Literature Review 2.1 Introduction ........................................................................................................................ 15 2.2 Previous works ................................................................................................................... 15 2.3 Conclusion ......................................................................................................................... 26
Chapter Three: Methodology 3.1 Introduction ........................................................................................................................ 27 3.2 Data Mining ....................................................................................................................... 27 3.2.1 The key properties of data mining .............................................................................. 27 3.2.2 The Scope of Data Mining .......................................................................................... 27 3.2.3 Tasks of Data Mining.................................................................................................. 27 3.2.4 Tasks in Data Mining .................................................................................................. 28 3.2.5 Clustering High-Dimensional Data............................................................................. 28 3.3 Classification and Prediction ............................................................................................. 29 3.3.1 Preparing the Data for Classification and Prediction .................................................. 29 3.3.2 Comparing Classification and Prediction Methods Accuracy .................................... 30 3.4 Classification by Decision Tree Induction ......................................................................... 31 3.4.1 Decision tree ............................................................................................................... 31 3.4.2 Algorithm for Decision Tree Induction ...................................................................... 32 3.4.3 Overfitting ................................................................................................................... 33 3.4.4 Decision Tree Algorithm Advantages and Disadvantages.......................................... 33 3.5 Decision Trees Inducers ..................................................................................................... 34
3.5.1 ID3 .............................................................................................................. 34 3.5.2 C4.5 ............................................................................................................. 36 3.5.3 CART .......................................................................................................... 37 3.5.4 CHAID ........................................................................................................ 38 3.5.5 QUEST........................................................................................................ 39 3.6 Bayesian Classification ...................................................................................... 39 3.6.1 Bayes' Theorem .......................................................................................... 39 3.6.2 Naïve Bayesian Classification .................................................................... 40 3.7 A Multilayer Feed-Forward Neural Network .................................................... 40 3.7.1 Classification by Backpropagation ............................................................. 41 3.7.2 Advantages of Backpropagation ................................................................. 42 3.7.3 Process ........................................................................................................ 42 3.8 K-Nearest-Neighbor Classifier .......................................................................... 44 3.9 Other Classification Methods ............................................................................ 45 3.9.1 Genetic Algorithms ..................................................................................... 45 3.9.2 Regression Analysis .................................................................................... 45 3.9.3 Linear Regression ....................................................................................... 46 3.9.4 Multiple Linear Regression......................................................................... 46 3.9.5 Nonlinear Regression .................................................................................. 47 3.9.6 Transformation of a polynomial regression model to a linear regression model ........ 47 3.9.7 Classifier Accuracy ..................................................................................... 47 3.10 Cluster Analysis ............................................................................................... 48 3.10.1 Applications of clustering ......................................................................... 48 3.10.2 Typical Requirements of Clustering in Data Mining ................................ 49 3.10.3 Major Clustering Methods ........................................................................ 50 3.10.5 Classical Partitioning Methods ................................................................. 53 3.10.6 Agglomerative hierarchical clustering ...................................................... 55 3.10.7 Divisive hierarchical clustering ................................................................ 55 3.11 STL decomposition .......................................................................................... 56 3.12 Data analysis tools ........................................................................................... 56 3.13 Conclusion ....................................................................................................... 57
Chapter Four: Mixture Decision Tree and its Application on Classification 4.1 Introduction ........................................................................................................................ 58 4.2 Development of a Mixture Decision Tree Algorithm ........................................................ 58 4.2.1 The classification and clustering methods used in the proposed Algorithm ............... 58
4.2.2 The datasets that are used in the proposed algorithm ................................................. 59 4.2.3 Theoretical Framework ............................................................................................... 60 4.5 Experimental analysis ........................................................................................................ 61 4.6 Experimental Results ......................................................................................................... 72 4.7 Comparison of the results .................................................................................................. 73 4.7.1 Comparison of the results among usual methods and Mixture Decision Tree............ 73 4.7.2 Comparison of the performances among usual methods and Mixture Decision Tree 75 4.8 Conclusion ......................................................................................................................... 76
Chapter Five: Application of Mixture Decision Tree on Time Series Data 5.1 Introduction ........................................................................................................................ 77 5.2 Related features of time series and Mixture Decision Tree ............................................... 77 5.3 Experimental Analysis ....................................................................................................... 78 5.3.1 Outline of the analysis ................................................................................................ 79 5.3.2 Sources of data ............................................................................................................ 79 5.3.3 Trend analysis ............................................................................................................. 79 5.3.4 Constructing features to model for double sessional time series data......................... 80 5.3.5 Classification and Regression Tree (CART)/RPART................................................. 83 5.3.6 Identifying important variables ................................................................................... 83 5.3.6 Plot of fitted values from CART tree .......................................................................... 84 5.3.7 Mixture Decision Tree (MDT).................................................................................... 86 5.3.8 Comparison between CART and Mixture Decision Tree ........................................... 87 5.4 Conclusion ......................................................................................................................... 89
Chapter Six: Conclusion 6.1 Conclusion ......................................................................................................................... 90 References ................................................................................................................................ 91
List of Tables
Table 1: Datasets Descriptions ........................................................................................ 59
Table 2: Iris Plants dataset ............................................................................................... 61
Table 3: Modified Iris Plants Dataset .............................................................................. 66
Table 4: Results of CART decision tree .......................................................................... 72
Table 5: Results of Random Forest decision tree ............................................................ 72
Table 6: Results of Mixture decision tree algorithm ....................................................... 73
Table 7: Comparison of the performances among usual methods and Mixture Decision Tree ..... 75
List of Figures
Figure 1: A typical Decision Tree ................................................................................... 31
Figure 2: A multilayer feed-forward neural network ...................................................... 41
Figure 3: Typical Neural Network ................................................................................... 43
Figure 4: K-means clustering algorithm .......................................................................... 60
Figure 5: Mixture Decision Tree Algorithm .................................................................... 72
Figure 6: Comparison between CART and Mixture Decision Tree for Iris Plants ......... 73
Figure 7: Comparison between CART and Mixture Decision Tree for ToothGrowth dataset ..... 74
Figure 8: Comparison between CART and Mixture Decision Tree for beaver2 dataset ..... 74
Figure 9: Performance of different types of decision trees ............................................. 75
Figure 10: Trend analysis of the dataset .......................................................................... 79
Figure 11: Time series decomposing by STL .................................................................. 80
Figure 12: Features comparison ....................................................................................... 82
Figure 13: Forecasting the trend part ............................................................................... 82
Figure 14: Decision tree created by CART ..................................................................... 83
Figure 15: Fitted values from CART ............................................................................... 84
Figure 16: Decision tree created by CART with CP ....................................................... 84
Figure 17: Fitted values from CART with CP ................................................................. 85
Figure 18: Forecast from CART ...................................................................................... 85
Figure 19: Forecast from CART with and without trend ................................................ 86
Figure 20: Fitted values from Mixture Decision Tree ..................................................... 87
Figure 21: Fitted values from Mixture Decision Tree with trend ................................... 87
Figure 22: Forecasting from CART and Mixture Decision Tree .................................... 88
Figure 23: Comparison of the obtained MAPE values .................................................... 88
Chapter One
Introduction
1.1 Introduction
Nowadays the world runs on modern science, and big data is an important part of the modern age. Big data refers to the growth in the volume of structured and unstructured data, the speed at which it is created and collected, and the scope of how many data points are covered. Big data often comes from multiple sources and arrives in multiple formats. Big data analytics examines large amounts of data to uncover hidden patterns, correlations and other insights. With today's technology, it is possible to analyze data and get answers from it almost immediately, an effort that is slower and less efficient with more traditional business intelligence solutions.
Data mining is not a new term, but for many people, especially those who are not involved in IT activities, the term is confusing: in most cases, those who hear it think about miners digging and looking for gold and diamonds. In data mining, classification, clustering and association rules are created by analyzing data for frequent patterns, then using the support and confidence criteria to locate the most important relationships within the data. Support is how frequently the items appear in the database, while confidence is the number of times the statements are found to be accurate. In this research, classification rules are an important part. Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends. Classification is a data mining function that assigns items in a collection to target classes or categories.
There are several machine learning methods and tools for solving classification problems. One of these is the decision tree, one of the most popular and powerful decision support tools of machine learning for classification problems. Decision trees have multidimensional applications, including many real-life uses such as weather prediction, credit approval, medical diagnosis, fraud detection, document categorization, image recognition and signal classification; multiclass classification in particular is an important issue in pattern recognition applications. A decision tree has several advantages: for example, it is very easy to understand and can deal with large volumes of data. A DT model can be combined with other machine learning models, and a DT can be constructed from a dataset with many attributes, each attribute having many
attribute values. Once the decision tree construction is complete, it can be used to classify seen or unseen instances. To make a decision using a DT, start at the root node and follow the tree down the branches until a leaf node representing the class is reached. There have been many decision tree algorithms, such as ID3, C4.5, CART and SPRINT, but optimal decision tree construction for large volumes of data is still a problem.
The decision tree, also known as a classification tree or regression tree, is a technique most commonly used in data mining and machine learning. The objective is to create a model that predicts the value of a target variable based on several input variables. DT building algorithms may initially build the tree and then prune it for more effective classification. With pruning, portions of the tree may be removed or combined to reduce the overall size of the tree. The time and space complexity of constructing a decision tree depends on the size of the dataset, the number of attributes in the dataset, and the shape of the resulting tree.
Decision trees are used to classify data with common attributes. Each decision tree represents a rule set which categorizes data according to these attributes. A decision tree consists of nodes, leaves, and edges. A node of a tree specifies an attribute by which the data is to be partitioned. Each node has a number of edges, which are labelled according to the possible values of the attribute in the parent node. An edge connects either two nodes or a node and a leaf. Leaves are labelled with a decision value for categorization of the data.
Decision trees can also be used in time series forecasting. This research re-examines the benefits and limitations of decomposition and combination techniques in the area of forecasting, and also contributes to the field by offering a new forecasting method. The new method is based on the disaggregation of time series components through the STL decomposition procedure, the extrapolation of linear combinations of the disaggregated sub-series, and the re-aggregation of the extrapolations to obtain estimates for the global series. Applying the forecasting method to data from the NN3 and M1 Competition series, the results suggest that it can perform well relative to four other standard statistical techniques from the literature, namely the ARIMA, Theta, Holt-Winters' and Holt's Damped Trend methods. The relative advantages of the new method are then investigated further relative to a simple combination of the four statistical methods and a Classical Decomposition forecasting method. The strength of the method lies in its ability to predict long lead times with relatively high levels of
accuracy, and to perform consistently well for a wide range of time series, irrespective of the characteristics, underlying structure and level of noise of the data.
1.2 Problem statement
Analysis of data sets can find new correlations to spot business trends, prevent diseases, combat crime and so on. Scientists, business executives, practitioners of medicine, advertising and governments alike regularly meet difficulties with large data-sets in areas including Internet search, financial technology, urban informatics, and business informatics. Scientists encounter limitations in e-Science work, including meteorology, genomics, complex physics simulations, biology and environmental research. There are several methods to solve these types of problems. One of these is the data mining technique, in which classification is one of the most important methods for handling big data. However, some established classification techniques cannot provide classification accuracy as high as expected. For this reason, it is necessary to build hybrid methods that reduce the misclassification errors. Some previous work proposed a hybrid classification system based on a C4.5 decision tree classifier and a one-against-all method to improve the classification accuracy for multi-class classification problems. The one-against-all method constructed M binary C4.5 decision tree classifiers, each of which separated one class from all of the rest (Polat and Gunes, 2009). The i-th C4.5 decision tree classifier was trained with all the training instances of the i-th class with positive labels and all the others with negative labels. In this research, it is shown that clustering before classification reduces the misclassification errors.
1.3 Objectives of the study
The main objective of the study is "to develop a mixture algorithm by combining the decision tree algorithm and k-means clustering to improve the accuracy of classification, to apply the mixture algorithm to different datasets, and to compare the results with existing powerful data classification techniques."
The specific objectives are:
1. To develop a mixture method based on the decision tree algorithm and the k-means clustering algorithm.
2. To apply the mixture method to classification problems.
3. To compare the results of the mixture method with those of the usual methods.
4. To apply the mixture method to time series data.
1.4 Scope of the study
There are many applications in data mining fields. Classification and association are the most common problems in data mining for knowledge extraction and machine learning. Regression and classification are also important tools for estimation and prediction. Because humans have a very limited capacity for intuitive and visual understanding of problems with large dimension or huge databases, the visualization of data mining has recently been emphasized in practice. Some special-purpose forms of data mining are currently pursued, such as text mining and web mining for new search techniques on the World Wide Web, multimedia or texture mining for image processing, and spatial mining for time-series analysis. Text mining in particular is a good approach for natural language processing.
Iris Plants Database: This is one of the best known datasets in the pattern recognition literature. It contains 3 class values (Iris Setosa, Iris Versicolor, and Iris Virginica), where each class refers to a type of iris plant. There are 150 instances (50 in each of the three classes) and 4 attributes (sepallength, sepalwidth, petallength, and petalwidth). One class is linearly separable from the other 2 classes. This Iris database was used to determine the results; a short illustrative sketch using it is given at the end of Section 1.5.
1.5 Justification of the study
I hope that the findings of this study will fill the gap caused by the lack of sufficient information on the improvement of classification accuracy. The mixture method will be helpful for researchers solving many data mining tasks, including image detection, anomaly detection and economic forecasting. In a word, the results of the study are likely to influence further scholarly research by others who may be interested in this field of knowledge and to stimulate appropriate further investigation.
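To make the two-stage idea concrete, the following minimal R sketch clusters the Iris measurements with k-means and then feeds the cluster label to a CART tree as an extra feature. This is only an illustration of the general pipeline, not the exact Mixture Decision Tree algorithm developed in Chapter Four; the choice of k = 3, the random seed and the use of the rpart package are assumptions made for the example.

library(rpart)                       # CART implementation in R

set.seed(42)                         # reproducible k-means assignment
data(iris)

# Stage 1: k-means clustering of the four measurements (k = 3 assumed)
km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
iris$cluster <- factor(km$cluster)   # cluster label becomes an extra feature

# Stage 2: CART classification tree on the augmented data
fit  <- rpart(Species ~ ., data = iris, method = "class")
pred <- predict(fit, iris, type = "class")

cm <- table(actual = iris$Species, predicted = pred)  # confusion matrix
sum(diag(cm)) / sum(cm)              # resubstitution accuracy

On this dataset the k-means labels roughly track the species structure, which is the intuition behind filtering the data with clustering before growing the tree.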
1.6 Limitation of the study
Missing values and outliers are two important obstacles in any statistical analysis. In this research, however, it is assumed that there were no missing values in the datasets and that all observations are numeric.
1.7 Conclusion
The application of data mining is increasing very rapidly in the current modern era. Almost every sector that has to deal with big data is handled with data mining techniques. That is why it is impossible to imagine research work in the present world without knowledge of data mining.
Chapter Two
Literature Review
2.1 Introduction
A literature review is a critical analysis of published sources, or literature, on a particular topic. It is an assessment of the literature and provides a summary, classification, comparison and evaluation. At postgraduate level, literature reviews can be incorporated into an article, a research report or a thesis. Literature reviews are a secondary resource, as they do not report any new or original experimental work. Most literature reviews are linked with academic-oriented literature, such as theses or peer-reviewed articles, and typically take the form of a research proposal with a results section. The major objective is to locate the present study within the body of literature and to provide reference points for a specific reader. A review draws together research from nearly every academic area: it relies on a research question, and it evaluates, selects and analyzes the high-quality research contributions related to that question. Such analysis is a system of statistical methods which efficiently links the information from all chosen studies to develop effective outcomes.
2.2 Previous works
A tree has many analogies in real life, and it turns out to have influenced a wide area of machine learning, covering both classification and regression. In decision analysis, a decision tree can be used to visually and explicitly represent decisions and decision making. As the name suggests, it uses a tree-like model of decisions. Though commonly used in data mining for deriving a strategy to reach a particular goal, it is also widely used in classification problems, which are the main focus of this work. Some of the related works are as follows.
One study proposed a partition conditional independent component analysis (PC-ICA) method for naive Bayes classification in microarray data analysis, further extending the class-conditional independent component analysis (CCICA) method. PC-ICA split the small data samples into different partitions so that independent component analysis (ICA) could be done within each partition. PC-ICA also attempted ICA-based feature extraction within each partition, which may consist of several classes (Polat, K., & Gunes, S., 2006). They also obtained 91.84% and 92.94% classification
accuracies using a combination of a C4.5 decision tree with fuzzy weighted pre-processing and a combination of a C4.5 decision tree with k-NN based weighted pre-processing on the diagnosis of erythemato-squamous diseases, respectively.
Several studies have focused on the classification of image segmentation. Among these, Tin and Kwork obtained 83% classification accuracy using a support vector machine (SVM) on the classification of image segmentation (Tin, J., & Kwork, Y., 1999). Another study achieved an 85.2% success rate with a k-NN (k-nearest neighbor) classifier on the same dataset (Tolson, E., 2001). Classification accuracies of 87.6% and 86.7% were obtained using a PNN (probabilistic neural network) and a GRNN (generalized regression neural network). Another author achieved 88.2% and 90.00% classification accuracies using AIRS (artificial immune recognition system) and Fuzzy-AIRS on the classification of the image segmentation dataset, and 83.138% and 90.00% classification accuracies using AIRS and Fuzzy-AIRS on the classification of the lymphography dataset (Polat, K., & Gunes, S., 2006). The erythemato-squamous diseases database comes from Gazi University and Bilkent University and was supplied by Nilsel Ilter, M.D., Ph.D., and H. Altay Guvenir, Ph.D. The dataset contains 34 attributes, 33 of which are linear valued and one of which is nominal.
Another study proposed a hybrid classification system based on a C4.5 decision tree classifier and a one-against-all method to improve the classification accuracy for multiclass classification problems. The one-against-all method constructed M binary C4.5 decision tree classifiers, each of which separated one class from all of the rest. The i-th C4.5 decision tree classifier was trained with all the training instances of the i-th class with positive labels and all the others with negative labels (Polat, K., & Gunes, S., 2009).
A further paper presented an associative classification tree (ACT) that combined the advantages of both associative classification and decision trees. The ACT tree was built using a set of associative classification rules with high classification predictive accuracy
(Chen, Y.-L., & Hung, L. T.-H., 2009). ACT followed a simple heuristic which selected the attribute with the highest gain measure as the splitting attribute.
Another author proposed a fuzzy decision tree Gini-Index based (G-FDT) algorithm to fuzzify the decision boundary without converting the numeric attributes into fuzzy linguistic terms. The G-FDT tree used the Gini Index as the split measure to choose the most appropriate splitting attribute for each node in the decision tree. For the construction of the decision tree, the Gini Index was computed using the fuzzy membership values of the attribute corresponding to a split value and the fuzzy membership values of the instances (Chandra, B., & Varghese, P. P., 2009). The split-points were chosen as the midpoints of attribute values where the class information changed.
Another paper presented a co-evolving decision tree method for datasets in which a large number of variables were considered. The authors proposed a novel combination of DTs and evolutionary methods, such as the bagging approach of a DT classifier and a backpropagation neural network method, to improve the classification accuracy (Aitkenhead, M. J., 2008). Such methods evolved the structure of a decision tree and also handled a comparatively wider range of values and data types.
Other authors applied a hidden naive Bayes (HNB) classifier to a network intrusion detection system (NIDS) to classify network attacks; it significantly improved the accuracy of detection of denial-of-service (DoS) attacks in particular (Koc, L., Mazzuchi, T. A., & Sarkani, S., 2012). The HNB classifier is an extended version of the basic NB classifier that relaxes the conditional independence assumption imposed on the basic NB classifier.
Another study proposed a robust naive Bayes classifier (R-NBC) to overcome two major limitations, underflow and over-fitting, in the classification of gene expression datasets. R-NBC used logarithms of probabilities rather than multiplying probabilities to handle the underflow problem, and employed an estimation approach to provide solutions to over-fitting problems. It did not require any prior feature selection approaches in the field of microarray data analysis, where a large number of attributes are considered (Chandra and Gupta, 2011).
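For reference, the crisp Gini impurity that such fuzzy variants generalize is the standard CART split measure. In the notation below, D is a set of instances, m is the number of classes and p_i is the fraction of D belonging to class i; the split on attribute A yielding the smallest weighted impurity is preferred. This is the textbook definition, not the fuzzy-membership version used by G-FDT.

\[
\mathrm{Gini}(D) = 1 - \sum_{i=1}^{m} p_i^{2}, \qquad
\mathrm{Gini}_{A}(D) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)
\]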
Another paper presented a classification method called extended naive Bayes (ENB) for the classification of mixed types of data, including categorical and numeric data. ENB used the normal NB algorithm to calculate the probabilities of categorical attributes (Hsu, C.-C., Huang, Y.-P., & Chang, K.-W., 2008). When handling numeric attributes, it adopted statistical theory to discretize the numeric attributes into symbols by considering the average and variance of the numeric values.
A naive Bayes classifier is a simple probability-based method which can predict class membership probabilities. It has several advantages: (a) it is easy to use, and (b) only one scan of the training data is required for probability generation (Chen, J., Huang, H., Tian, S., & Qu, Y., 2009). An NB classifier can easily handle missing attribute values by simply omitting the corresponding probabilities for those attributes when calculating the likelihood of membership for each class. The NB classifier also assumes class conditional independence, i.e., that the effect of an attribute on a given class is independent of the effects of the other attributes (Farid, D. M., & Rahman, M. Z., 2010).
The past three decades have seen continuing developments in the area of pattern recognition. Research into algorithmic aspects of pattern recognition has proceeded alongside the development of instruments capable of producing high volumes of data, including images with increasingly finer spatial and spectral resolution (Mather, P. M., 1999). After 30 years of satellite remote sensing of the Earth's land surface, users of remotely sensed data now have access to sophisticated statistical and neural/connectionist algorithms for both fuzzy and hard classifications of their data (Schowengerdt, R. A., 1997). The statistical and neural/connectionist approaches have limitations. Statistical methods rely on the assumption that the probabilities of class membership can be modelled by a specific probability density function (Foody, G. M., & Arora, M. K., 1997). In most cases, the Gaussian distribution is chosen, as it is characterized by first- and second-order statistics, that is, the class mean vectors and class covariance matrices. If the training set size is fixed, then the precision of the estimates of the elements of the sample class mean vector and sample class covariance matrix declines as the number of features (dimensions) increases, so one might expect the performance of the classifier to degrade as the number of features increases (Kavzoglu, T., 2001). The assumption that the data in each class follow a multivariate normal model restricts the
analysis to interval or ratio scale data. Neural/connectionist methods appear to work well with training data sets that are smaller than those required by statistical procedures. On the other hand, network training times can be lengthy, while the choice of network architecture (in terms of the numbers of hidden layers and neurons per layer) and of the values of the learning rate parameters is not straightforward (Wilkinson, G. G., 1997). Unlike statistical methods, the neural/connectionist approach makes no assumptions concerning the statistical frequency distribution of the data or the measurement scales of the features used in the analysis. The most commonly used neural/connectionist algorithm is the back-propagating multi-layer perceptron (Wilkinson, 1997), which is used in this study.
Decision tree (DT) classifiers have not been as widely used within the remote sensing community as either the statistical or the neural/connectionist methods (Friedl, M. A., & Brodley, C. E., 1997). The advantages that decision trees offer include an ability to handle data measured on different scales, a lack of any assumptions concerning the frequency distributions of the data in each of the classes, flexibility, and an ability to handle non-linear relationships between features and classes. In contrast to neural networks, decision trees can be trained quickly and are rapid in execution. They can be used for feature selection/reduction as well as for classification purposes (Gahegan, M., & West, G., 1998). Finally, the analyst can interpret a decision tree: it is not a 'black box' like the neural network, whose hidden workings are concealed from view (Borak, J. S., & Strahler, A. H., 1995).
Overall classification accuracy is used to measure the performance of the different methods. The level of classification accuracy achieved in a particular case depends on a number of factors, including the nature of the classification problem in terms of the complexity of the decision boundaries that separate the classes in feature space (assuming that the classes are separable), the training sample size, the adequacy of the training data in characterizing the properties of the chosen classes, the dimensionality of the data, and the properties of the classifier used (Raudys, S., & Pikelis, V., 1980).
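In practice, overall accuracy is simply the trace of the confusion matrix divided by the total number of cases. A minimal R illustration, where the two label vectors are hypothetical stand-ins for any classifier's output:

actual    <- factor(c("a", "a", "b", "b", "c", "c"))  # true classes (made up)
predicted <- factor(c("a", "b", "b", "b", "c", "a"))  # classifier output (made up)

cm <- table(actual, predicted)        # confusion matrix
sum(diag(cm)) / sum(cm)               # overall accuracy: 4/6 here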
Mining stock market trends is a challenging task due to the market's high volatility and noisy environment. Many factors influence the performance of a stock market, including political events, general economic conditions, and traders' expectations (Abu-Mostafa, Y. S., & Atiya, A. F., 1996). Although stock and futures traders have relied heavily upon various types of intelligent systems to make trading decisions, their performance has often been disappointing. Many attempts have been made to predict the financial markets, ranging from traditional time series approaches to artificial intelligence techniques such as fuzzy systems and artificial neural network (ANN) methodologies (Abraham, A., Baikunth, N., & Mahanti, P. K., 2001). However, the main drawback of ANNs and other black-box techniques is the tremendous difficulty of interpreting the results. They do not provide insight into the nature of the interactions between the technical indicators and the stock market fluctuations (Chi, S. C., Chen, H. P., & Cheng, C. H., 1999). Thus, there is a need to develop methodologies that provide an increased understanding of market processes. A further issue is that the dimensionality of financial time series data creates another challenge for ANN approaches (Zhang, Y.-Q., Akkaladevi, S., Vachtsevanos, G., & Lin, T. Y., 2002).
The development of a timely and accurate trading decision-making tool is the key for stock traders to make profits, since the stock price series is affected by a mixture of deterministic and random factors. New tools and techniques are needed to deal with noise and nonlinearity in stock price prediction (Zhang, Y.-Q., Akkaladevi, S., Vachtsevanos, G., & Lin, T. Y., 2002). White was the first to use neural networks (NNs) for market forecasting: he used a feed-forward NN (FFNN) to study the IBM daily common stock returns, and found that his training results were over-optimistic because of over-fitting and irrelevant features (Chi, S. C., Chen, H. P., & Cheng, C. H., 1999). In general, there are two different methodologies for stock price prediction using ANNs as a research tool (White, H., 1988). The first method is to consider the stock price variations as a time series and predict the future price based on its past values. In this approach, artificial neural networks (ANNs) have been employed as the predictor (Lee, J. W., 2001). These prediction models, however, have their limitations owing to the tremendous noise and high dimensionality
of the stock price data. Therefore, the performances of the existing models are not satisfactory (Zadeh, L. A., 1965). The second approach takes technical indices and qualitative factors, such as political effects, into account in stock market forecasting and trend analysis. Some authors used technical indicators (%K and %D) along with price information to predict future price values (Yao, J., & Poh, H. L., 1995). They achieved good returns, and found that their models performed better using daily data rather than weekly data. Another author predicted the prices of stocks based on the fluctuations in the rest of the market for the same day; although the investment is done in a frictionless environment, they show consistently high rates of return, though paying commissions on the large number of trades instigated would certainly take away much of the benefit of the proposed trading strategy (Hobbs, A., & Bourbakis, N. G., 1995). Austin and Looney developed a neural network that predicts the proper time to move money into and out of the stock market. They used two valuation indicators, two monetary policy indicators, and four technical indicators to predict the four-week forward excess return on the dividend-adjusted S&P 500 stock index, and the results significantly outperformed the buy-and-hold strategy (Kim, K. J., & Han, I., 2000). Backpropagation ANNs have been applied to predict future elements of the price time series of the KOSPI. Other authors used time delay connections in enhanced neural networks (that is, the addition of time-dependent information in each weight) to forecast IBEX-35 (Spanish Stock Index) index close prices one day ahead. Stochastic neural networks have been applied to forecasting the volatility of index returns for TUNINDEX (Tunisian Stock Index), with the out-of-sample neural network results found to be superior to traditional GARCH models (Mingo, Diaz, Palencia, & Jimenez, 2002). Another paper presents a trading approach based on one-step-ahead profit estimates created by combining neural networks with particle swarm optimization algorithms (Nenortaite, J., & Simutis, R., 2004); the method is profitable given small commission costs, but does not exceed the S&P 500 returns when realistic commissions are introduced. Others train ANNs using both technical analysis variables and inter-market data to predict one-day changes in the NIKKEI index (Jaruszewicz, M., & Mandziuk, J., 2004). They
achieve good results using MACD, Williams, and two averages, along with related market data from the NASDAQ and DAX.
The fuzzy decision tree is similar to the standard decision tree methods based on a recursive binary partitioning algorithm. At each node during the construction of a fuzzy decision tree, the most stable splitting region is selected and the boundary uncertainty is estimated based on an iterative resampling algorithm (Janikow, C. Z., 1998). The boundary uncertainty estimate is used within the region's fuzzy membership function to direct new samples to each resulting partition with a quantified confidence (Quinlan, J. R., 1986). The fuzzy membership function is used to recover those samples that lie within the uncertainty of the splitting regions. Many attempts have been made in the past to introduce this technology into stock prediction. Sorensen et al. (2000) used CART to partition assets into outperforming and underperforming assets (Mugambi, E. M., Hunter, A., Oatley, G., & Kennedy, L., 2004), with the portfolio composed of uniformly weighted outperforming assets. There has been a new tendency to combine the soft computing (SC) technologies of NNs, fuzzy logic (FL) and genetic algorithms (GAs), which may significantly improve an analysis (Abraham, A., Baikunth, N., & Mahanti, P. K., 2001). In general, NNs are used for learning and curve fitting, FL is used to deal with imprecision and uncertainty, and GAs are used for search and optimization. As Zadeh (1965) pointed out, merging these technologies results in a tolerance for imprecision, uncertainty, and partial truth, achieving tractability, robustness, and low solution cost.
Forecast accuracy has been a critical issue in the areas of financial, economic and scientific modeling, and has motivated the growth of a vast body of literature on the development and empirical application of forecasting models (De Gooijer, J. G., & Hyndman, R. J., 2006). Nevertheless, these models are just "intentional abstractions of a much more complicated reality" (Diebold, F., & Lopez, J., 1996), and rely on historical data to draw conclusions about the future. Consequently, they are always prone to estimation error due to model misspecification. Combination techniques have been developed to address this issue of misspecification by exploiting the capabilities of the various forecasting models to capture specific aspects of the data. Even though
decomposition methods were not primarily developed to serve as prediction tools, the intuition behind their application to forecasting is nonetheless very appealing (Cleveland, R. B., Cleveland, W. S., McRae, J., & Terpenning, I., 1990). Disaggregating the various components in the data and predicting each one individually can be viewed as a process of isolating smaller parts of the overall process which are governed by a strong and persistent element, thus separating them from any 'noise' and inconsistent variability. These processes are then easier to extrapolate, due to their more deterministic nature. It should therefore be possible to obtain more accurate forecasts for the individual components than one is likely to obtain for the global series. This becomes important in the case of time series with high levels of noise.
Combination techniques operate by pooling forecasts from various models in order to enhance and robustify prediction accuracy (Koning, A., Franses, P., Hibon, M., & Stekler, H., 2005). The integration of information from different models into one forecast can reduce the estimation error of the prediction significantly. Nonetheless, the restrictive complexity of some existing combination methods and the lack of comprehensive guidelines for their application have been acknowledged as flaws in the literature (Armstrong, J. S., 1989). Decomposition procedures, on the other hand, can facilitate the analysis by disaggregating the time series into feature-based sub-series. The isolation of the more important features of the data into distinct sub-series can enhance the forecasting performance of the models used for their estimation. As a consequence, the estimation error obtained from the aggregation of the extrapolated sub-series is lower than the estimation error obtained for the series as a whole. The improvement in accuracy is due mainly to the elimination of any residual variability within the sub-series, which may otherwise affect the structure of the individual components and consequently the performance of the forecasting method.
In this approach, a forecasting method is developed which extrapolates the global series through the individual extrapolations of linear combinations of the sub-series returned by a decomposition procedure, including the residual error component. The new forecasting method makes use of both decomposition procedures and combination techniques.
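As a small illustration of this decompose-forecast-recombine idea, the R sketch below splits a series into trend, seasonal and remainder sub-series with STL and forecasts the seasonally adjusted part before re-adding the seasonal component. The forecast package and the built-in AirPassengers series are assumptions for the example; this is not the exact method applied to the NN3 and M1 data.

library(forecast)

y   <- log(AirPassengers)               # classic monthly series shipped with R
dec <- stl(y, s.window = "periodic")    # seasonal / trend / remainder sub-series
plot(dec)                               # inspect the disaggregated components

# stlf() wraps the pattern: forecast the seasonally adjusted series
# (here with ETS) and re-seasonalize the result.
fc <- stlf(y, h = 24, method = "ets")
plot(fc)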
A decomposition procedure from the literature is employed to disaggregate the data into three dominant components, namely the trend, seasonality and residual error, while a linear combination technique is used to obtain an estimate for the global series. The main underlying idea of the method is that better prediction accuracy can be achieved if the forecasting problem is subdivided into smaller parts, so that the complexity of the problem is also segregated (Makridakis, S., & Hibon, M., 2000). Those parts are then easier to extrapolate, contributing to higher prediction accuracy than is obtained from the direct forecast of the global series using a single model. The new forecasting method is applied to the NN3 Competition datasets. The results obtained are benchmarked against the results of four forecasting methods, namely the ARIMA, Theta, Holt's Damped Trend (hereafter HDT) and Holt-Winters' (hereafter HW) methods, as well as a simple combination of the forecasts obtained from these methods and a classical decomposition forecasting method (Hyndman, R. J., 2013).
After more than 50 years of widespread use, exponential smoothing is still one of the most practically relevant forecasting methods available. This is because of its simplicity and transparency, as well as its ability to adapt to many different situations. It also has a solid theoretical foundation in ETS state space models (Hyndman, R. J., 2014). The acronym ETS stands both for ExponenTial Smoothing and for Error, Trend, and Seasonality, the three components that define a model within the ETS family. Exponential smoothing methods obtained competitive results in the M3 forecasting competition, and an implementation in the programming language R (R Core Team, 2014) means that fully automated software for fitting ETS models is available (Goodwin, P., 2010). Thus, ETS models are both usable and highly relevant in practice, and have a solid theoretical foundation, which makes any attempt to improve their forecast accuracy a worthwhile endeavor (Koning, A., Franses, P., Hibon, M., & Stekler, H., 2005).
Bootstrap aggregating (bagging), as proposed by Breiman, is a popular method in machine learning for improving the accuracy of predictors by addressing potential instabilities. These instabilities typically stem from sources such as data uncertainty, parameter uncertainty, and model selection uncertainty (Hastie, T., Tibshirani, R., & Friedman, J., 2009). An ensemble of predictors is estimated on bootstrapped versions of the input data, and the output of the ensemble is calculated by combining the individual outputs (using the median, mean, trimmed mean, or weighted mean, for example), often yielding better
point predictions. In this work, we propose a bagging methodology for exponential smoothing methods and evaluate it on the M3 data. As our input data are non-stationary time series, both serial dependence and non-stationarity have to be taken into account. We resolve these issues by applying a seasonal-trend decomposition based on loess (STL) and a moving block bootstrap (MBB) to the residuals of the decomposition. Specifically, our proposed method of bagging is as follows. After applying a Box-Cox transformation to the data, the series is decomposed into trend, seasonal and remainder components (Cleveland, R. B., Cleveland, W. S., McRae, J., & Terpenning, I., 1990). The remainder component is then bootstrapped using the MBB, the trend and seasonal components are added back in, and the Box-Cox transformation is inverted. In this way, we generate a random pool of similar bootstrapped time series. For each of these bootstrapped time series, we choose a model from among several exponential smoothing models, using the bias-corrected AIC. Then, point forecasts are calculated using each of the different models, and the resulting forecasts are combined using the median (Cordeiro, C., & Neves, M., 2009). A small illustrative sketch of this procedure is given at the end of this chapter.
The only related work that we are aware of is the study by Cordeiro and Neves, who use a sieve bootstrap to perform bagging with ETS models. They use ETS to decompose the data, then fit an AR model to the residuals and generate new residuals from this AR process. Finally, they fit the ETS model that was used for the decomposition to all of the bootstrapped series. They also test their method on the M3 dataset, and have some success for quarterly and monthly data, but overall the results are not promising: the bagged forecasts are often not as good as the original forecasts applied to the original time series. Our bootstrapping procedure works differently and yields better results: STL is used for the time series decomposition, the MBB is used to bootstrap the remainder, and an ETS model is chosen for each bootstrapped series. Using this procedure, we are able to outperform the original M3 methods, for monthly data in particular.
Starting from this basic idea, exponential smoothing has been expanded to the modelling of different components of a series, such as the trend, seasonality, and remainder components, where the trend captures the long-term direction of the series, the seasonal part captures repeating components of a series with a known periodicity, and the remainder captures unpredictable components (Hyndman, R. J., 2013). The trend component is a combination of a level term and a growth term. There is a whole family of ETS models, which can be distinguished by the type of error, trend, and seasonality
each uses. In general, the trend can be non-existent, additive, multiplicative, damped additive, or damped multiplicative. The seasonality can be non-existent, additive, or multiplicative. The error can be additive or multiplicative; however, distinguishing between these two options is only relevant for prediction intervals, not point forecasts. Thus, there are a total of 30 models with different combinations of error, trend and seasonality. For more detailed descriptions, see Hyndman (2014).
2.3 Conclusion
The previous works suggest different directions for future work. Some suggest combining two or more methods to find a mixture method that may perform better than the others. After observing the results of previous works, a method combining classification and clustering can be proposed to improve classification accuracy, and the method can also be applied to in-sample forecasting of time series data.
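As noted in Section 2.2, a compact sketch of the STL-plus-MBB bagging procedure is given here. It omits the Box-Cox step for brevity, hand-rolls a naive moving block bootstrap of the STL remainder, lets ets() pick a model by bias-corrected AIC, and median-combines the point forecasts. The block length, the number of replicates and the AirPassengers series are assumptions for illustration, not the published procedure's settings.

library(forecast)

set.seed(1)
y   <- log(AirPassengers)
dec <- stl(y, s.window = "periodic")
sea <- dec$time.series[, "seasonal"]
tre <- dec$time.series[, "trend"]
rem <- as.numeric(dec$time.series[, "remainder"])

# Naive moving block bootstrap of the remainder (block length assumed)
mbb <- function(r, block = 24) {
  n      <- length(r)
  starts <- sample(1:(n - block + 1), ceiling(n / block), replace = TRUE)
  unlist(lapply(starts, function(s) r[s:(s + block - 1)]))[1:n]
}

h   <- 24
B   <- 20                               # bootstrap replicates (assumed)
fcs <- replicate(B, {
  yb <- tre + sea + mbb(rem)            # rebuilt bootstrapped series
  as.numeric(forecast(ets(yb), h = h)$mean)  # ets() selects by AICc
})

bagged <- apply(fcs, 1, median)         # median-combined point forecasts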
Chapter Three
Methodology
3.1 Introduction
Research methodology refers to the approach by which data are extracted so as to be clearly understood. The main objective of this report is to develop a Mixture Decision Tree algorithm that may improve classification accuracy by using clustering and classification algorithms.
3.2 Data Mining
Data mining refers to extracting or mining knowledge from large amounts of data. The term is actually a misnomer; data mining should more appropriately have been named knowledge mining, which emphasizes mining from large amounts of data. It is the computational process of discovering patterns in large data sets, involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use.
3.2.1 The key properties of data mining
1. Automatic discovery of patterns
2. Prediction of likely outcomes
3. Creation of actionable information
4. Focus on large datasets and databases
3.2.2 The Scope of Data Mining
Data mining derives its name from the similarities between searching for valuable business information in a large database, for example finding linked products in gigabytes of store scanner data, and mining a mountain for a vein of valuable ore. Both processes require either sifting through an immense amount of material or intelligently probing it to find exactly where the value resides. Given databases of sufficient size and quality, data mining technology can generate new business opportunities.
3.2.3 Tasks of Data Mining
Data mining involves six common classes of tasks:
1. Anomaly detection (outlier/change/deviation detection) – the identification of unusual data records that might be interesting, or of data errors that require further investigation.
2. Association rule learning (dependency modelling) – searches for relationships between variables. For example, a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
3. Clustering – the task of discovering groups and structures in the data that are in some way "similar", without using known structures in the data.
4. Classification – the task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".
5. Regression – attempts to find a function which models the data with the least error.
6. Summarization – providing a more compact representation of the data set, including visualization and report generation.

3.2.4 Tasks in Data Mining
1. Clustering high-dimensional data
2. Constraint-based clustering

3.2.5 Clustering High-Dimensional Data
Clustering high-dimensional data is a particularly important task in cluster analysis, because many applications require the analysis of objects containing a large number of features or dimensions. For example, text documents may contain thousands of terms or keywords as features, and DNA microarray data may provide information on the expression levels of thousands of genes under hundreds of conditions. The task is challenging due to the curse of dimensionality: many dimensions may not be relevant, and as the number of dimensions increases, the data become increasingly sparse, so that the distance between pairs of points becomes nearly meaningless and the average density of points anywhere in the data is likely to be low (the short simulation below illustrates this effect). Therefore, different clustering methodologies need to be developed for high-dimensional data. CLIQUE and PROCLUS are two influential subspace clustering methods, which search for clusters in subspaces of the data rather than over the entire data space.
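The following short Python simulation (numpy and scipy assumed available) illustrates this distance concentration: for uniformly random points, the relative gap between the largest and smallest pairwise distances shrinks as the dimensionality grows, so distance-based cluster definitions lose their discriminating power.

import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.random((300, d))          # 300 uniform random points in d dimensions
    dist = pdist(X)                   # all pairwise Euclidean distances
    contrast = (dist.max() - dist.min()) / dist.min()
    print(f"d={d:4d}  relative distance contrast={contrast:.3f}")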
Frequent pattern–based clustering, another clustering methodology, extracts distinct frequent patterns among subsets of dimensions that occur frequently, and uses such patterns to group objects and generate meaningful clusters.

3.3 Classification and Prediction
Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends. Classification predicts categorical (discrete, unordered) labels, whereas prediction models continuous-valued functions. For example, we can build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to predict the expenditures of potential customers on computer equipment given their income and occupation. A predictor estimates a continuous-valued function, or ordered value, as opposed to a categorical label. Regression analysis is a statistical methodology that is most often used for numeric prediction. Many classification and prediction methods have been proposed by researchers in machine learning, pattern recognition, and statistics. Many of these algorithms are memory resident, typically assuming a small data size. Recent data mining research has built on such work, developing scalable classification and prediction techniques capable of handling large, disk-resident data. There are several issues regarding classification and prediction, including the following.

3.3.1 Preparing the Data for Classification and Prediction
The following preprocessing steps may be applied to the data to help improve the accuracy, efficiency, and scalability of the classification or prediction process.

3.3.1.1 Data Cleaning
This refers to the preprocessing of data in order to remove or reduce noise (by applying smoothing techniques) and to treat missing values (e.g., by replacing a missing value with the most commonly occurring value for that attribute, or with the most probable value based on statistics). Although most classification algorithms have some mechanism for handling noisy or missing data, this step can help reduce confusion during learning.

3.3.1.2 Relevance Analysis
Many of the attributes in the data may be redundant for the analysis. Correlation analysis can be used to identify whether any two given attributes are statistically related. For example, a strong correlation between attributes A1 and A2 would suggest that one of the two could be removed from further analysis. A database may also contain irrelevant attributes. Attribute subset selection can be used in these cases to find a reduced set of attributes such that the resulting probability distribution of the data classes is as close as possible to the original distribution obtained using all attributes. Hence, relevance analysis, in the form of correlation analysis and attribute subset selection, can be used to detect attributes that do not contribute to the classification or prediction task. Such analysis can help improve classification efficiency and scalability. A small sketch of these two preprocessing steps is given below.
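The following minimal Python sketch (pandas assumed available) applies both steps to a small, hypothetical data set; the attribute names and the 0.9 correlation threshold are illustrative choices only.

import numpy as np
import pandas as pd

# hypothetical loan-application data with missing entries
df = pd.DataFrame({
    "income": [42.0, 55.0, np.nan, 61.0, 48.0, 55.0],
    "debt":   [21.0, 27.0, 24.0, np.nan, 23.0, 27.5],
    "region": ["north", "south", "south", np.nan, "north", "south"],
})

# data cleaning: replace each missing value with the attribute's mode
for col in df.columns:
    df[col] = df[col].fillna(df[col].mode().iloc[0])

# relevance analysis: flag strongly correlated numeric attribute pairs
corr = df.select_dtypes("number").corr().abs()
for a in corr.columns:
    for b in corr.columns:
        if a < b and corr.loc[a, b] > 0.9:
            print(f"{a} and {b} are strongly correlated; one could be dropped")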
3.3.1.3 Data Transformation and Reduction
The data may be transformed by normalization, particularly when neural networks or methods involving distance measurements are used in the learning step. Normalization involves scaling all values of a given attribute so that they fall within a small specified range, such as -1 to +1 or 0 to 1 (see the sketch below). The data can also be transformed by generalizing them to higher-level concepts. Concept hierarchies may be used for this purpose, which is particularly useful for continuous-valued attributes. For example, numeric values of the attribute income can be generalized to discrete ranges, such as low, medium, and high. Similarly, categorical attributes, like street, can be generalized to higher-level concepts, like city. Data can also be reduced by applying many other methods, ranging from wavelet transformation and principal components analysis to discretization techniques such as binning, histogram analysis, and clustering.
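A minimal numpy sketch of min-max normalization; the income values are hypothetical, and the attribute is assumed not to be constant.

import numpy as np

def min_max_scale(x, lo=0.0, hi=1.0):
    # scale attribute values linearly into the range [lo, hi]
    x = np.asarray(x, dtype=float)
    return lo + (x - x.min()) * (hi - lo) / (x.max() - x.min())

income = np.array([12_000.0, 45_000.0, 73_000.0, 98_000.0])
print(min_max_scale(income))            # values now fall within [0, 1]
print(min_max_scale(income, -1, 1))     # or within [-1, +1]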
3.3.2 Comparing Classification and Prediction Methods
Accuracy
The accuracy of a classifier refers to the ability of a given classifier to correctly predict the class label of new or previously unseen data (i.e., tuples without class label information). The accuracy of a predictor refers to how well a given predictor can guess the value of the predicted attribute for new or previously unseen data.
Speed
This refers to the computational costs involved in generating and using the given classifier or predictor.
Robustness
This is the ability of the classifier or predictor to make correct predictions given noisy data or data with missing values.
Scalability
This refers to the ability to construct the classifier or predictor efficiently given large amounts of data.
Interpretability
This refers to the level of understanding and insight that is provided by the classifier or predictor. Interpretability is subjective and therefore more difficult to assess.

3.4 Classification by Decision Tree Induction
3.4.1 Decision Tree
A decision tree is a classifier expressed as a recursive partition of the instance space. The decision tree consists of nodes that form a rooted tree, meaning it is a directed tree with a node called the "root" that has no incoming edges. All other nodes have exactly one incoming edge. A node with outgoing edges is called an internal or test node; all other nodes are called leaves (also known as terminal or decision nodes). In a decision tree, each internal node splits the instance space into two or more sub-spaces according to a certain discrete function of the input attribute values. In the simplest and most frequent case, each test considers a single attribute, so that the instance space is partitioned according to that attribute's value. In the case of numeric attributes, the condition refers to a range. Each leaf is assigned to one class representing the most appropriate target value. Alternatively, the leaf may hold a probability vector indicating the probability of the target attribute having a certain value. Instances are classified by navigating them from the root of the tree down to a leaf, according to the outcomes of the tests along the path. Each node is labeled with the attribute it tests, and its branches are labeled with the corresponding values. A typical decision tree is shown in Figure 1, followed by a small illustrative example.
Figure 1: A typical decision tree.
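To illustrate, the following sketch builds and prints a small decision tree with scikit-learn (assumed available); the toy loan data, attribute names, and depth limit are hypothetical choices.

from sklearn.tree import DecisionTreeClassifier, export_text

# toy training tuples: [income (thousands), debt (thousands)] -> class label
X = [[25, 22], [48, 10], [52, 30], [80, 12], [31, 28], [95, 40]]
y = ["risky", "safe", "risky", "safe", "risky", "safe"]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["income", "debt"]))  # the learned tests
print(tree.predict([[60, 15]]))   # navigate a new tuple from root to leaf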
3.4.2 Algorithm for Decision Tree Induction
Algorithm: Generate_decision_tree. Generate a decision tree from the training tuples of data partition D.
Input:
▪ Data partition, D, which is a set of training tuples and their associated class labels;
▪ attribute_list, the set of candidate attributes;
▪ Attribute_selection_method, a procedure to determine the splitting criterion that "best" partitions the data tuples into individual classes. This criterion consists of a splitting_attribute and, possibly, either a split point or a splitting subset.
Output: A decision tree.
Method:
(1) create a node N;
(2) if tuples in D are all of the same class, C, then
(3) return N as a leaf node labeled with the class C;
(4) if attribute_list is empty then
(5) return N as a leaf node labeled with the majority class in D; // majority voting
(6) apply Attribute_selection_method(D, attribute_list) to find the "best" splitting_criterion;
(7) label node N with splitting_criterion;
(8) if splitting_attribute is discrete-valued and multiway splits allowed then // not restricted to binary trees
(9) attribute_list ← attribute_list − splitting_attribute; // remove splitting attribute
(10) for each outcome j of splitting_criterion // partition the tuples and grow subtrees for each partition
(11) let Dj be the set of data tuples in D satisfying outcome j; // a partition
(12) if Dj is empty then
(13) attach a leaf labeled with the majority class in D to node N;
(14) else attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
endfor
(15) return N;

3.9.1 Genetic Algorithms
In genetic algorithms, an initial population is created consisting of randomly generated rules, each of which can be represented by a string of bits. If an attribute has k values, where k > 2, then k bits may be used to encode the attribute's values. Classes can be encoded in a similar fashion. Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules in the current population, as well as offspring of these rules. Typically, the fitness of a rule is assessed by its classification accuracy on a set of training samples. Offspring are created by applying genetic operators such as crossover and mutation. In crossover, substrings from pairs of rules are swapped to form new pairs of rules; in mutation, randomly selected bits in a rule's string are inverted. The process of generating new populations based on prior populations of rules continues until a population P evolves in which each rule satisfies a prespecified fitness threshold. Genetic algorithms are easily parallelizable and have been used for classification as well as for other optimization problems. In data mining, they may also be used to evaluate the fitness of other algorithms. A toy sketch of this evolutionary loop is given below.
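The following toy Python sketch illustrates the loop just described: a rule "IF A1 = a1 AND A2 = a2 THEN class = c" is encoded as three bits, fitness is the rule's accuracy on the training tuples it covers, and each new generation is formed from the fittest rules plus offspring produced by crossover and mutation. The data, encoding, and parameters are all hypothetical.

import random

random.seed(0)

# hypothetical training tuples: two Boolean attributes and a class bit
DATA = [((0, 0), 0), ((0, 1), 1), ((1, 0), 0), ((1, 1), 1)]

def fitness(rule):
    # accuracy of rule (a1, a2, c) over the training tuples it covers
    a1, a2, c = rule
    hits = [int(cls == c) for (x1, x2), cls in DATA if (x1, x2) == (a1, a2)]
    return sum(hits) / len(hits) if hits else 0.0

def crossover(r1, r2):
    # swap substrings of two bit strings at a random cut point
    cut = random.randint(1, 2)
    return r1[:cut] + r2[cut:], r2[:cut] + r1[cut:]

def mutate(rule, p=0.1):
    # invert randomly selected bits
    return tuple(b ^ 1 if random.random() < p else b for b in rule)

population = [tuple(random.randint(0, 1) for _ in range(3)) for _ in range(8)]
for _ in range(20):                          # generate successive populations
    population.sort(key=fitness, reverse=True)
    survivors = population[:4]               # survival of the fittest
    children = []
    for i in (0, 2):
        c1, c2 = crossover(survivors[i], survivors[i + 1])
        children += [mutate(c1), mutate(c2)]
    population = survivors + children
print(sorted(set(population), key=fitness, reverse=True)[:2])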
3.9.2 Regression Analysis
Regression analysis can be used to model the relationship between one or more independent (predictor) variables and a dependent (response) variable that is continuous-valued. In the context of data mining, the predictor variables are the attributes of interest describing the tuple (i.e., making up the attribute vector); in general, their values are known. The response variable is what we want to predict.
3.9.3 Linear Regression
Straight-line regression analysis involves a response variable, y, and a single predictor variable, x. It is the simplest form of regression and models y as a linear function of x:

y = b + wx,

where the variance of y is assumed to be constant, and b and w are regression coefficients specifying the Y-intercept and the slope of the line.
1. The regression coefficients, w and b, can also be thought of as weights, so that we can equivalently write y = w_0 + w_1 x.
2. These coefficients can be solved for by the method of least squares, which estimates the best-fitting straight line as the one that minimizes the error between the actual data and the estimate of the line.
3. Let D be a training set consisting of values of the predictor variable, x, for some population and their associated values of the response variable, y. The training set contains |D| data points of the form (x_1, y_1), (x_2, y_2), \ldots, (x_{|D|}, y_{|D|}).
4. The regression coefficients can be estimated using this method with the following equations:

w_1 = \frac{\sum_{i=1}^{|D|} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{|D|} (x_i - \bar{x})^2}, \qquad w_0 = \bar{y} - w_1 \bar{x},

where \bar{x} is the mean value of x_1, x_2, \ldots, x_{|D|} and \bar{y} is the mean value of y_1, y_2, \ldots, y_{|D|}. The coefficients w_0 and w_1 often provide good approximations to otherwise complicated regression equations. A small worked example follows.
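The estimates above translate directly into code; a minimal numpy sketch with hypothetical (x, y) data:

import numpy as np

# hypothetical data points (x_i, y_i)
x = np.array([3.0, 8.0, 9.0, 13.0, 3.0, 6.0, 11.0, 21.0, 1.0, 16.0])
y = np.array([30.0, 57.0, 64.0, 72.0, 36.0, 43.0, 59.0, 90.0, 20.0, 83.0])

# least-squares estimates of the slope w1 and the intercept w0
w1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
w0 = y.mean() - w1 * x.mean()
print(f"y = {w0:.2f} + {w1:.2f} x")     # the fitted straight line
print(w0 + w1 * 10.0)                   # predicted response at x = 10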
3.9.4 Multiple Linear Regression
Multiple linear regression is an extension of straight-line regression that involves more than one predictor variable. It allows the response variable y to be modeled as a linear function of, say, n predictor variables or attributes A_1, A_2, \ldots, A_n describing a tuple X. An example of a multiple linear regression model based on two predictor attributes, A_1 and A_2, is

y = w_0 + w_1 x_1 + w_2 x_2,

where x_1 and x_2 are the values of attributes A_1 and A_2, respectively, in X. Multiple regression problems are instead commonly solved with the use of statistical software packages, such as SAS, SPSS, and S-Plus.
3.9.5 Nonlinear Regression
Nonlinear relationships can be modeled by adding polynomial terms to the basic linear model. By applying transformations to the variables, we can convert the nonlinear model into a linear one that can then be solved by the method of least squares. Polynomial regression is a special case of multiple regression: the addition of high-order terms like x^2, x^3, and so on, which are simple functions of the single variable x, can be considered equivalent to adding new independent variables.

3.9.6 Transformation of a Polynomial Regression Model to a Linear Regression Model
Consider a cubic polynomial relationship given by

y = w_0 + w_1 x + w_2 x^2 + w_3 x^3.

To convert this equation to linear form, we define new variables x_1 = x, x_2 = x^2, and x_3 = x^3. Applying these assignments yields the equation

y = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_3,

which is easily solved by the method of least squares using software for regression analysis, as the sketch below illustrates.
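A minimal numpy sketch of this transformation, fitting the cubic model as a multiple linear regression on the derived variables; the coefficients and the noise-free data are hypothetical.

import numpy as np

x = np.linspace(-3.0, 3.0, 50)
y = 1.0 - 2.0 * x + 0.5 * x**2 + 0.2 * x**3   # synthetic cubic data

# define x1 = x, x2 = x^2, x3 = x^3 and solve the resulting linear model
X = np.column_stack([np.ones_like(x), x, x**2, x**3])   # design matrix
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)          # method of least squares
print(coeffs)   # recovers approximately [1.0, -2.0, 0.5, 0.2]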
3.9.7 Classifier Accuracy
The accuracy of a classifier on a given test set is the percentage of test-set tuples that are correctly classified by the classifier. The confusion matrix is a useful tool for analyzing how well a classifier can recognize tuples of different classes. True positives are the positive tuples that were correctly labeled by the classifier, true negatives are the negative tuples that were correctly labeled, and false positives are the negative tuples that were incorrectly labeled. To describe how well the classifier recognizes each class, the sensitivity and specificity measures can be used; accuracy is a function of sensitivity and specificity:

sensitivity = \frac{t\_pos}{pos}, \qquad specificity = \frac{t\_neg}{neg}, \qquad precision = \frac{t\_pos}{t\_pos + f\_pos},

accuracy = sensitivity \cdot \frac{pos}{pos + neg} + specificity \cdot \frac{neg}{pos + neg},

where t_pos is the number of true positives, pos the number of positive tuples, t_neg the number of true negatives, neg the number of negative tuples, and f_pos the number of false positives. These measures are computed numerically in the sketch below.
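A minimal plain-Python sketch computing these measures from hypothetical confusion-matrix counts:

# hypothetical counts for a binary classifier on a test set
t_pos, f_neg = 90, 10      # the 100 actually positive tuples
t_neg, f_pos = 940, 60     # the 1000 actually negative tuples
pos, neg = t_pos + f_neg, t_neg + f_pos

sensitivity = t_pos / pos
specificity = t_neg / neg
precision = t_pos / (t_pos + f_pos)
accuracy = sensitivity * pos / (pos + neg) + specificity * neg / (pos + neg)
print(sensitivity, specificity, precision, accuracy)   # 0.9 0.94 0.6 0.936...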
3.10 Cluster Analysis
The process of grouping a set of physical or abstract objects into classes of similar objects is called clustering. A cluster is a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters. A cluster of data objects can be treated collectively as one group and so may be considered a form of data compression. Cluster analysis tools based on k-means, k-medoids, and several other methods have been built into many statistical analysis software packages or systems, such as S-Plus, SPSS, and SAS.

3.10.1 Applications of Clustering
1. Cluster analysis has been widely used in numerous applications, including market research, pattern recognition, data analysis, and image processing.
2. In business, clustering can help marketers discover distinct groups in their customer bases and characterize customer groups based on purchasing patterns (a small k-means sketch of this kind of segmentation is given after this list).
3. In biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functionality, and gain insight into structures inherent in populations.
4. Clustering may also help in the identification of areas of similar land use in an earth observation database, in the identification of groups of houses in a city according to house type, value, and geographic location, and in the identification of groups of automobile insurance policy holders with a high average claim cost.
5. Clustering is also called data segmentation in some applications, because clustering partitions large data sets into groups according to their similarity.
6. Clustering can also be used for outlier detection. Applications of outlier detection include the detection of credit card fraud and the monitoring of criminal activities in electronic commerce.
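A minimal customer-segmentation sketch using k-means, assuming numpy and scikit-learn are available; the three synthetic customer groups and the two features (annual spend, visits per month) are hypothetical.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# three synthetic customer groups: [annual spend, visits per month]
X = np.vstack([
    rng.normal([20, 2], [3, 0.5], (50, 2)),
    rng.normal([55, 6], [4, 1.0], (50, 2)),
    rng.normal([90, 12], [5, 1.5], (50, 2)),
])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)          # one centroid per discovered segment
print(km.predict([[60.0, 7.0]]))    # assign a new customer to a segment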
3.10.2 Typical Requirements of Clustering in Data Mining
3.10.2.1 Scalability
Many clustering algorithms work well on small data sets containing fewer than several hundred data objects; however, a large database may contain millions of objects, and clustering on a sample of a given large data set may lead to biased results. Highly scalable clustering algorithms are therefore needed.

3.10.2.2 Ability to Deal with Different Types of Attributes
Many algorithms are designed to cluster interval-based (numerical) data. However, applications may require clustering other types of data, such as binary, categorical (nominal), and ordinal data, or mixtures of these data types.

3.10.2.3 Discovery of Clusters with Arbitrary Shape
Many clustering algorithms determine clusters based on Euclidean or Manhattan distance measures. Algorithms based on such distance measures tend to find spherical clusters with similar size and density. However, a cluster could be of any shape, so it is important to develop algorithms that can detect clusters of arbitrary shape.

3.10.2.4 Minimal Requirements for Domain Knowledge to Determine Input Parameters
Many clustering algorithms require users to input certain parameters (such as the number of desired clusters), and the clustering results can be quite sensitive to these parameters. Parameters are often difficult to determine, especially for data sets containing high-dimensional objects. This not only burdens users, but also makes the quality of clustering difficult to control.

3.10.2.5 Ability to Deal with Noisy Data
Most real-world databases contain outliers or missing, unknown, or erroneous data, and clustering algorithms should be robust to such data.

3.10.2.6 Incremental Clustering and Insensitivity to the Order of Input Records
Some clustering algorithms cannot incorporate newly inserted data (i.e., database updates) into existing clustering structures and, instead, must determine a new clustering from scratch. Some clustering algorithms are also sensitive to the order of the input data.
That is, given a set of data objects, such an algorithm may return dramatically different clusterings depending on the order in which the input objects are presented. It is important to develop incremental clustering algorithms and algorithms that are insensitive to the order of input.

3.10.2.7 High Dimensionality
A database or a data warehouse can contain several dimensions or attributes. Many clustering algorithms are good at handling low-dimensional data, involving only two or three dimensions; human eyes are good at judging the quality of clustering for up to three dimensions. Finding clusters of data objects in high-dimensional space is challenging, especially considering that such data can be sparse and highly skewed.

3.10.2.8 Constraint-Based Clustering
Real-world applications may need to perform clustering under various kinds of constraints. Suppose that your job is to choose the locations for a given number of new automatic banking machines (ATMs) in a city. To decide upon this, you may cluster households while considering constraints such as the city's rivers and highway networks and the type and number of customers per cluster. A challenging task is to find groups of data with good clustering behavior that satisfy the specified constraints.

3.10.2.9 Interpretability and Usability
Users expect clustering results to be interpretable, comprehensible, and usable. That is, clustering may need to be tied to specific semantic interpretations and applications. It is important to study how an application goal may influence the selection of clustering features and methods.

3.10.3 Major Clustering Methods
▪ Partitioning Methods
▪ Hierarchical Methods
▪ Density-Based Methods
▪ Grid-Based Methods
▪ Model-Based Methods
3.10.3.1 Partitioning Methods
A partitioning method constructs k partitions of the data, where each partition represents a cluster and k