Adapting Aerial Root Classifier Missing Data Processor in Data ...

Cite this paper as: Lachiheb O., Gouider M.S. (2014) Adapting Aerial Root Classifier Missing Data Processor in Data Stream Decision Tree Classification. In: Ait ...
Adapting Aerial Root Classifier Missing Data Processor in Data Stream Decision Tree Classification

Oussama Lachiheb and Mohamed Salah Gouider

Laboratoire SOIE, Institut Superieur De Gestion, Tunis University

Abstract. This work contributes a classification method that can deal with missing data. The method, called ARC-CVFDT, adapts the Aerial Root Classifier (ARC) missing data processor to data stream decision tree classification. It offers a higher level of accuracy and adapts to most Data Stream Mining (DSM) challenges, such as concept drift.

Keywords: Missing data, Classification, Decision Tree, Data Stream, Machine Learning.

1 Introduction

In recent years, new technologies and hardware advances have made it easy to collect data continuously: credit card transactions, road sensors, mobile calls and web browsing all lead to a large and continuous flow of data. Data Stream Mining (DSM) is a kind of data mining that handles such continuous data streams. Among the many DSM methods, classification with a real-time decision tree is one of the most popular because it is useful in applications that require instant decision-making.

Sometimes a sensor malfunction or an interruption in data communication signals gives rise to missing data in the input of the miner, which can affect the classification results. Traditional data mining offers several techniques for dealing with missing values, such as eliminating records that have missing attributes or using a statistical approach to estimate those values. These techniques do not work with DSM, because training and testing in DSM are done dynamically on moving data rather than on a complete dataset.

In this work we propose an approach called ARC-CVFDT, which adapts the aerial root classifier missing data processor to the Concept-adapting Very Fast Decision Tree. The approach is evaluated in an experimental study implemented in Java.

Y. Ait Ameur et al. (Eds.): MEDI 2014, LNCS 8748, pp. 92–99, 2014. © Springer International Publishing Switzerland 2014

2 Related Works

In traditional data mining algorithms, the presence of missing data may not affect the quality of the output: missing values can simply be ignored or replaced by a mean value. In data stream mining, however, an algorithm scans the data only once because of the high speed of data streams, so the problem of incomplete data has more impact.

Several previous works have dealt with the incomplete data problem. In [4], the authors proposed a method called WARM (Window Association Rule Mining), which estimates missing values that result from sensor errors or malfunctions. This method was extended in 2007 into a new approach called FARM [12].

In 2011, Hang, Fong and Chen proposed in [7] a solution for predicting missing values in data stream mining, especially for the classification task. This method is called ARC-HTA, where HTA denotes the Hoeffding tree classifier; it can perform data stream mining in the presence of missing values and offers a high level of classification accuracy.

In 2013, the authors of [6] proposed an approach that performs data interpolation in the presence of missed readings. Their method is based on a novel probabilistic interpolating method and three novel deterministic interpolating methods, and their experimental results demonstrate its feasibility and effectiveness.

3 Background

3.1 Traditional Decision Tree Classification

Decision trees are a simple yet successful technique for supervised classification learning. Their aim is to predict the class of an object given known values of its attributes. Traditional decision tree construction algorithms such as ID3 [11] and C4.5 [5] need to scan the whole data set each time they select the best attribute to split on, and they need to scan the data set again in order to update the obtained model, which makes them unsuitable for data streams.

3.2 Data Streams Decision Tree Classification

The classification of data streams faces several challenges, such as the high speed of the stream, concept drift, the unbounded memory requirement and the tradeoff between accuracy and efficiency. In recent years, researchers have proposed many classification methods and models that can deal with these challenges, such as the Hoeffding Tree Algorithm (HTA) [7], the Very Fast Decision Tree (VFDT) [8] and the Concept-adapting Very Fast Decision Tree (CVFDT) [9].

3.3 Methods for Dealing with Missing Values

The problem of incomplete data arises with most real-world data sources. As shown in [10], several methods have been proposed to deal with this problem. In this section, we give a brief overview of existing methods for dealing with missing data and comment on their suitability for equipment maintenance databases such as those described in the previous section.

– Ignoring and Discarding Incomplete Records and Attributes
– Parameter Estimation in the Presence of Missing Data

4 Predicting Missing Values within CVFDT Classification

Data stream mining algorithms handle a continuous flow of data and classify target values on the fly. Sometimes examples include incomplete data due to a sensor malfunction or a generator error, which directly affects the quality and accuracy of the output. As detailed above, missing data has more impact in data stream mining, so we have to include a method that predicts incomplete data with a high level of accuracy.

This section presents the new approach, which predicts missing values before running the main CVFDT classifier. We first give some motivations for developing this method, then define the new structure of the data stream within the sliding window model, and finally report the algorithm of our approach.

4.1 Sliding Window Model

A sliding window is a model that stores a part of the data stream; because the sensor data stream is unbounded, each task can scan only a part of the data. The main issue is which part of the data stream is selected for processing. There are different data stream processing models, such as the fixed model, the landmark model and the sliding window model. The fixed model scans the data stream between a fixed start time and a fixed end time, the landmark model captures the historical behavior of the data stream, and the sliding window model always contains the most recent data, either within an interval of time or for the last N records.
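As an illustration of the count-based variant (the last N records), the following is a minimal Java sketch, not the implementation used in this work. The class name SlidingWindow and its methods are hypothetical, and records are assumed to be non-null.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Count-based sliding window that keeps only the N most recent records (illustrative). */
final class SlidingWindow<T> {
    private final int capacity;
    private final Deque<T> buffer = new ArrayDeque<>();

    SlidingWindow(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /** Adds the newest record and returns the evicted oldest record, or null if none was evicted. */
    T add(T record) {
        T evicted = null;
        if (buffer.size() == capacity) {
            evicted = buffer.removeFirst(); // forget the oldest record
        }
        buffer.addLast(record);             // ArrayDeque rejects null records
        return evicted;
    }

    /** Returns the current window content, oldest record first. */
    List<T> snapshot() {
        return new ArrayList<>(buffer);
    }

    int size() {
        return buffer.size();
    }
}
```

Returning the evicted record is a design choice of this sketch: a learner that must stay consistent with the window can use it to discount the statistics of the forgotten example.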

4.2 Adapting Aerial Root Classifier to CVFDT

Initialization Step. In the beginning, a sequence of data is loaded from the stream in order to construct the data set that will be used in the classification process. Because of the huge amount of data generated by sensors, it is sufficient to use only a limited number of samples to choose the split attribute at each decision node; the statistical tool that justifies this is the Hoeffding bound. Let us consider

$$\varepsilon = \sqrt{\frac{R^{2}\,\ln(1/\delta)}{2N}}$$

for N independent observations, where R is the range of the observed random variable and 1 − δ is the desired confidence. Let G be the information gain function. The difference between the two highest values of G must be higher than ε in order to split on the attribute Xi at the decision node.
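The split test above can be sketched in a few lines of Java. This is a hedged illustration: the class and method names are hypothetical, and the caller is assumed to supply the range R that matches the gain function in use.

```java
/** Illustrative Hoeffding-bound split test; names are hypothetical. */
final class HoeffdingBound {

    /** Computes epsilon = sqrt(R^2 * ln(1/delta) / (2N)). */
    static double epsilon(double range, double delta, long n) {
        if (range <= 0 || delta <= 0 || delta >= 1 || n <= 0) {
            throw new IllegalArgumentException("need range > 0, 0 < delta < 1 and n > 0");
        }
        return Math.sqrt(range * range * Math.log(1.0 / delta) / (2.0 * n));
    }

    /** Returns true when the gap between the two highest gains exceeds epsilon. */
    static boolean shouldSplit(double bestGain, double secondBestGain,
                               double range, double delta, long n) {
        return (bestGain - secondBestGain) > epsilon(range, delta, n);
    }
}
```

For example, with R = 1, δ = 0.05 and N = 1000 the bound is about 0.039, so gains of 0.30 and 0.10 would trigger a split while gains of 0.30 and 0.29 would not.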

Estimating Missing Values. In this work we use the Aerial Root Classifier (ARC), a classification model that works in parallel with the main classifier and predicts all the missing values that the dataset may contain. The method checks all attributes in the loaded dataset; if an attribute has missing values, it builds a classification model, parallel to the main classifier, that treats this attribute as the class attribute and predicts its value from the values of the other attributes. Let X = {X1, X2, ..., Xn} be the n attributes of each sample; the number N of ARCs that our method builds satisfies 0 ≤ N ≤ n. More generally, if Xk has missing values in the sliding window, an ARC for Xk is built from all attributes except Xk.

CVFDT Classification. The last step of our approach is to run the main classifier. As mentioned above, several ARCs work in parallel to predict all missing values; the completed dataset is then used to build the main decision tree, a Concept-adapting Very Fast Decision Tree. The CVFDT system keeps the model consistent with the sliding window of examples, which resolves the concept drift problem: at any time, the model is not outdated.
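The estimation step can be illustrated with a small Java sketch. It keeps the ARC structure described above: an attribute with a missing value becomes the class attribute and is predicted from the remaining attributes. The learner is swapped for a simple nearest-neighbour vote instead of a decision tree, so this is not the ARC model itself. The class ArcSketch, the use of null for a missing nominal value, and equal row lengths are all assumptions of the example.

```java
import java.util.ArrayList;
import java.util.List;

/** ARC-style imputation sketch; a nearest-neighbour vote stands in for the decision tree. */
final class ArcSketch {

    /** Returns a copy of the window in which null (missing) nominal values are predicted. */
    static List<String[]> impute(List<String[]> window) {
        List<String[]> completed = new ArrayList<>();
        for (String[] row : window) {
            String[] copy = row.clone();
            for (int k = 0; k < row.length; k++) {
                if (row[k] == null) {
                    // X_k is treated as the class attribute; the others are inputs.
                    copy[k] = predict(window, row, k);
                }
            }
            completed.add(copy);
        }
        return completed;
    }

    /** Predicts attribute k of target from the most similar row whose X_k is observed. */
    static String predict(List<String[]> window, String[] target, int k) {
        String best = null;
        int bestOverlap = -1;
        for (String[] row : window) {
            if (row == target || row[k] == null) {
                continue; // only rows with an observed X_k can serve as training data
            }
            int overlap = 0;
            for (int j = 0; j < target.length; j++) {
                if (j != k && target[j] != null && target[j].equals(row[j])) {
                    overlap++;
                }
            }
            if (overlap > bestOverlap) {
                bestOverlap = overlap;
                best = row[k];
            }
        }
        return best; // stays null when no row has an observed X_k
    }
}
```

Predictions are computed from the original window and written to a copy, so an imputed value is never reused as training data for another prediction.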

4.3 Reducing the Computation Cost of the Algorithm

ARC-CVFDT offers high classification accuracy, but in the experimental studies we observed that the method makes several passes when building the ARC models and estimating missing values, so its computational cost has to be reduced. Several methods can reduce this cost, such as the ID4 algorithm and feature selection.

The ID4 algorithm constructs a decision tree incrementally: it updates the decision tree when new instances are loaded into the sliding window instead of rebuilding it, which reduces the complexity of the ARC missing values processor.

The second method is feature selection. It gives a rank to each attribute while computing the information gain function; we then calculate E as an ARC performance indicator. If E exceeds a predefined bound, we keep using the current ARC.

$$\omega_i = \frac{\mathrm{Rank}_i}{\sum_{j=1}^{n} \mathrm{Rank}_j} \qquad (1)$$

$$E = \sum_{i=1}^{n} \omega_i \, e_i \qquad (2)$$

$$e = \frac{\mathrm{CorrectClassifiedInstances}}{\mathrm{TotalInstances}} \qquad (3)$$
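The rank weights and the indicator E can be computed as in the following Java sketch. It assumes that each e_i in equation (2) is an accuracy of the form given in equation (3); the class and method names are hypothetical.

```java
/** Illustrative helper for equations (1)-(3); all names are hypothetical. */
final class ArcPerformance {

    /** Equation (1): each rank divided by the sum of all ranks. */
    static double[] weights(double[] ranks) {
        double sum = 0.0;
        for (double r : ranks) {
            sum += r;
        }
        if (sum <= 0.0) {
            throw new IllegalArgumentException("ranks must sum to a positive value");
        }
        double[] w = new double[ranks.length];
        for (int i = 0; i < ranks.length; i++) {
            w[i] = ranks[i] / sum;
        }
        return w;
    }

    /** Equation (3): fraction of correctly classified instances. */
    static double accuracy(long correct, long total) {
        if (total <= 0 || correct < 0 || correct > total) {
            throw new IllegalArgumentException("need 0 <= correct <= total and total > 0");
        }
        return (double) correct / total;
    }

    /** Equation (2): weighted sum of the accuracy terms. */
    static double indicator(double[] ranks, double[] accuracies) {
        if (ranks.length != accuracies.length) {
            throw new IllegalArgumentException("one accuracy per rank is required");
        }
        double[] w = weights(ranks);
        double e = 0.0;
        for (int i = 0; i < w.length; i++) {
            e += w[i] * accuracies[i];
        }
        return e;
    }

    /** The current ARC is kept only while E exceeds the predefined bound. */
    static boolean keepCurrentArc(double indicator, double bound) {
        return indicator > bound;
    }
}
```

With ranks 1, 2, 3 and accuracies 0.6, 0.9, 1.0 the weights are 1/6, 2/6 and 3/6, which gives E = 0.1 + 0.3 + 0.5 = 0.9.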

ARC-CVFDT Algorithm

    VAR
        ARC: Aerial Root Classifier;
        S: a sequence of instances;
        WS: window size (the number of instances);
        X: a set of attributes;
    begin
        Repeat
            Repeat
                If (ARC is empty) then
                    BuildARC(X, WS);
                else
                    UpdateARC(X, WS);
                end If
            Until xi = n
            Use ARC to predict the missing values in Xi;
        Until no attribute has missing values in X;
        Run CVFDT to build decision tree;
        UpdateWeight();
    end
    Return CVFDT

Build ARC procedure

    Procedure BuildARC(X, WS)
        Repeat
            If (xk has missing value) then
                Use WS instances in S excluding xk to build ARC;
            end If
        Until xi = X.Length in S
        Return ARC;

5 Performance Study and Experiments

Implementing our method for predicting missing values within CVFDT classification gives us an idea of its performance. In addition to the implemented programs, simulation and test results are shown; these results were obtained on real databases taken from the UCI repository. In this section, we detail our experimental setup before describing the simulation and test results.

6 Experimental Setup

Our algorithm was written in Java and compiled using the Eclipse integrated development environment. All experiments were performed on a PC equipped with a 3.2 GHz Intel Core i3 processor and 4 GB of memory, running Windows 7. All experiments were run without any other user on the machine.

6.1 Simulations and Results

Simulations on Synthetic Datasets. In order to illustrate the ability of our proposed algorithm, synthetic datasets are generated to test the scalability and the efficiency of the method. The generated datasets are described in Table 1.

Table 1. Synthetic datasets

Name     Attributes Nbr   Att values   Class Nbr   Instance Nbr
LED7     7                Nominal      10          100,000
LED24    24               Nominal      10          100,000

In this experiment, we want to see the direct impact of missing data on the accuracy of CVFDT classification; for this reason, missing data are randomly added near the end of the data stream (Fig. 1).

Simulations on Real Datasets. In this experiment, we used the KDD Cup 98 dataset provided by the Paralyzed Veterans of America. This learning dataset has 481 attributes, both numeric and nominal, and contains 95412 instances. Using this dataset, we compared the classification accuracy of our proposed approach ARC-CVFDT, of the complete data stream and of the WEKA method that replaces missing values with the mean. The results are shown in Fig. 2.
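For reference, the mean-replacement baseline can be sketched as follows for a single numeric attribute. This is a plain Java illustration of the idea and not the WEKA implementation; the class name MeanImputation and the use of NaN to mark a missing value are assumptions of the example.

```java
/** Mean-replacement baseline for one numeric attribute (illustrative, not the WEKA code). */
final class MeanImputation {

    /** Returns a copy in which every NaN is replaced by the mean of the observed values. */
    static double[] fillWithMean(double[] column) {
        double sum = 0.0;
        int count = 0;
        for (double v : column) {
            if (!Double.isNaN(v)) {
                sum += v;
                count++;
            }
        }
        double[] out = column.clone();
        if (count == 0) {
            return out; // nothing observed, so nothing can be estimated
        }
        double mean = sum / count;
        for (int i = 0; i < out.length; i++) {
            if (Double.isNaN(out[i])) {
                out[i] = mean;
            }
        }
        return out;
    }
}
```

Unlike an ARC, this baseline ignores the other attributes of the instance, so every missing value of an attribute receives the same estimate.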


Fig. 1. Accuracy comparison of missing values in LED7 dataset

Fig. 2. CUP98 dataset comparing classification accuracy

7 Conclusion

Because of unexpected errors and malfunctions, datasets containing incomplete data are becoming common, which can affect the quality of any mining and knowledge discovery process; however, existing methods in data stream mining have not given enough consideration to this problem. Our method, called ARC-CVFDT, can perform data stream classification in the presence of missing readings.

In order to evaluate our approach, we carried out an experimental study based on both real and simulated data. We obtained promising results that encourage us to continue our research with other real datasets. As future work, this approach can be extended by adapting the ARC missing data processor to classification models other than decision trees in order to compare the results. ARC can also be adapted to clustering techniques in order to estimate missing values, which would increase the quality of the results.

References

1. Domingos, P., Hulten, G.: Mining high-speed data streams. Journal of Computer Science and Technology (2000)
2. May, P., Ehrlich, H.C., Steinke, T.: ZIB Structure Prediction Pipeline: Composing a Complex Biological Workflow through Web Services. In: Nagel, W.E., Walter, W.V., Lehner, W. (eds.) Euro-Par 2006. LNCS, vol. 4128, pp. 1148–1158. Springer, Heidelberg (2006)
3. Foster, I., Kesselman, C.: The Grid: Blueprint for a New Computing Infrastructure. Morgan Kaufmann, San Francisco (1999)
4. Halatchev, M., Gruenwald, L.: Estimating missing values in related sensor data streams. In: The University of Oklahoma, School of Computer Science (2005)
5. Quinlan, J.R.: C4.5: Programs for machine learning. Morgan Kaufmann series in machine learning. Kluwer Academic Publishers (1993)
6. Xiao, Y., Jiang, T., Li, Y., Xu, G.: Data Interpolating over RFID Data Streams for Missed Readings. In: Gao, Y., Shim, K., Ding, Z., Jin, P., Ren, Z., Xiao, Y., Liu, A., Qiao, S. (eds.) WAIM 2013 Workshops 2013. LNCS, vol. 7901, pp. 257–265. Springer, Heidelberg (2013)
7. Yang, M., Simong, F., Wei, C.: Aerial root classifiers for predicting missing values in data stream decision tree classification. Journal of Emerging Technologies in Web Intelligence (2011)
8. Domingos, P., Hulten, G.: Mining high-speed data streams. Journal of Computer Science and Technology (2000)
9. Domingos, P., Hulten, G., Spencer, L.: Mining time-changing data streams. In: Knowledge Discovery and Data Mining (2001)
10. Kamashki, L., Steven, A., Samad, T.: Imputation of missing data in industrial databases. Applied Intelligence Archive 11(3) (1999)
11. Quinlan, J.R.: Induction of decision trees. In: Machine Learning, vol. 1, pp. 81–106 (1986)
12. Gruenwald, L., Chok, H., AbouKhamis, M.: Using Data mining to estimate missing sensor data. In: Seventh IEEE International Conference on Data Mining Workshops (ICDMW 2007), pp. 207–212 (2007)