
Performance evaluation of incremental decision tree learning under noisy data streams

Hang Yang* and Simon Fong
Department of Computer and Information Science, Faculty of Science and Technology, University of Macau, Av. Padre Tomás Pereira, Taipa, Macau, China
Email: [email protected]
Email: [email protected]
*Corresponding author

Abstract: Big data has become a significant challenge for software applications. Extracting a classification model from such data requires an incremental learning process: the model should be updated when new data arrive, without re-scanning historical data. A single-pass algorithm suits an environment where data arrive continuously. However, one practical and important aspect that has gone relatively unstudied is noisy data streams, which are inevitable in real-world applications. This paper presents a new classification model based on a single decision tree, called the incrementally Optimised Very Fast Decision Tree (iOVFDT), which embeds multi-objective incremental optimisation and a functional tree leaf. In the performance evaluation, noisy values were added to synthetic data to investigate performance under a noisy data scenario. The results show that iOVFDT outperforms the existing algorithms.

Keywords: big data; data streams; classification; decision tree; noisy data.

Reference to this paper should be made as follows: Yang, H. and Fong, S. (2013) 'Performance evaluation of incremental decision tree learning under noisy data streams', Int. J. Computer Applications in Technology, Vol. 47, Nos. 2/3, pp.206–214.

Biographical notes: Hang Yang was a PhD candidate in Software Engineering at the University of Macau. He obtained an MSc degree from the University of Macau in 2009. His bachelor background was in electronic commerce and he obtained a Bachelor degree in Economics in 2007. He has published over 30 peer-reviewed international conference and journal papers, mostly in the areas of e-commerce technology, business intelligence and data mining.

Simon Fong graduated from La Trobe University in Australia with a First Class Honours BEng Computer Systems degree and a PhD Computer Science degree in 1993 and 1998 respectively. He is now working as an Assistant Professor in the Computer and Information Science Department of the University of Macau. He is a founding member of the Data Analytics and Collaborative Computing Research Group at the University of Macau. He previously worked as an Assistant Professor in the School of Computer Engineering at Nanyang Technological University in Singapore. He has published over 180 peer-reviewed international conference and journal papers, mostly in data mining.

This paper is a revised and expanded version of a paper entitled 'Incrementally optimized decision tree for noisy big data' presented at the BigMine Workshop of ACM SIGKDD 2012, Beijing, China, 12–16 August 2012.

1 Introduction

Owing to the intuitive way in which it presents the results of knowledge discovery and data mining, the decision tree has become one of the most widely used methods in practice. Imperfect data, such as noisy data, missing values and imbalanced class distributions, are unavoidable in real applications. In the big data era, the input data streams are always evolving and unbounded. However, pre-processing methods, such as the ETL step of data warehousing and sampling techniques, are unsuitable for real-time data mining systems.


The Very Fast Decision Tree (VFDT) system is a well-known pioneering decision tree learning algorithm used in data stream mining; its lightweight design is capable of progressively building up a decision tree from scratch in real time by accumulating sufficient statistics from the streams (Hulten et al., 2001; Domingos and Hulten, 2011; Yang and Fong, 2011). VFDT learns by incrementally updating the tree while scanning the data stream on the fly. This powerful concept is in contrast to a traditional decision tree, which requires reading the full dataset for tree induction. VFDT's obvious advantage is its real-time data mining capability, which frees it from the need to store all of the data to retrain the decision tree, because the moving data streams are infinite. One challenge for data stream mining is the quality of the streams, which is generally what renders a data stream 'imperfect' in this context. Contamination of data streams by noisy values is a common phenomenon caused by network malfunctions. Moreover, in the authors' opinion, incremental decision tree learning will become prevalent in future real-time classification applications, but so far several problems of VFDT still obstruct its application in online scenarios:





•	The tie-breaking parameter is a user-configured value used to reduce the detrimental effect of tree size explosion. However, no single default value works well in all tasks. It is impossible to find the best tie value unless all combinations are tried, which is not feasible in real-time applications such as stream mining. Finding a well-defined tie-breaking threshold that suits most applications is therefore a challenge.

•	A pruning mechanism is useful for addressing the tree size explosion problem. Post-pruning and pre-pruning are the two distinct approaches: the former is applied after a full decision tree has grown, while the latter is applied during tree growth. Because tree learning must keep pace with the continuously arriving data streams, there is no time to prune branches after a full tree has been built. Hence, pre-pruning is preferable in stream mining, but few studies have discussed it for VFDT.

•	Imperfect data are unavoidable in the real world. How to handle such data in data stream mining is an important task.

A contribution of our previous paper (Yang and Fong, 2012) is two novel extensions of the VFDT model: optimising the tree-growing process via incremental optimisation, balancing accuracy and tree size, in a model called the incrementally Optimised VFDT (iOVFDT); and further enhancing prediction accuracy via an embedded Naïve Bayes classifier used as a Functional Tree Leaf (Gama, 2003). The new decision tree model achieves unprecedentedly good performance in terms of high prediction accuracy and compact tree size, even when the data streams are perturbed by noisy values. So far, however, it has not been known how well the iOVFDT algorithm performs when dealing with noisy data. In this paper, we apply the new tree induction to several classification problems that include noise. The evaluation results show that iOVFDT outperforms VFDT. The rest of this paper is organised as follows. Section 2 formulates the problem to be solved for incremental decision tree learning. Section 3 presents the Functional Tree Leaf mechanisms. Section 4 proposes the incremental optimisation for the decision tree. Section 5 evaluates the performance of the iOVFDT and original VFDT algorithms. Section 6 concludes the paper and discusses future work.
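To make the Functional Tree Leaf concrete, the sketch below shows how a leaf could issue a Naïve Bayes prediction from its accumulated counts instead of a plain majority vote. This is an illustrative reconstruction rather than the authors' implementation: the class name LeafStats, the method names and the Laplace smoothing constant are assumptions.

```python
from collections import defaultdict

class LeafStats:
    """Sufficient statistics kept at one leaf of the tree."""

    def __init__(self):
        self.class_counts = defaultdict(int)    # n_k: instances of class y_k seen here
        self.value_counts = defaultdict(int)    # n_ijk keyed by (attribute index, value, class)
        self.seen_values = defaultdict(set)     # distinct values observed per attribute

    def update(self, X, y):
        """Single-pass update with one instance (X, y) routed to this leaf."""
        self.class_counts[y] += 1
        for i, x in enumerate(X):
            self.value_counts[(i, x, y)] += 1
            self.seen_values[i].add(x)

    def predict_majority(self):
        """Plain VFDT-style leaf: predict the majority class."""
        return max(self.class_counts, key=self.class_counts.get)

    def predict_naive_bayes(self, X):
        """Functional Tree Leaf: Naive Bayes over the stored counts (Laplace smoothing)."""
        total = sum(self.class_counts.values())
        best_class, best_score = None, float("-inf")
        for y, n_k in self.class_counts.items():
            score = n_k / total                              # class prior
            for i, x in enumerate(X):
                # conditional probability estimated from the n_ijk counters
                score *= (self.value_counts[(i, x, y)] + 1) / (n_k + len(self.seen_values[i]))
            if score > best_score:
                best_class, best_score = y, score
        return best_class
```

The appeal of this design is that the counts used by the Naïve Bayes estimate are the same per-leaf statistics the tree already maintains for split evaluation, so the functional leaf adds predictive power at little extra memory cost.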

2 Problems

2.1 Variables and parameters

In this section, the variables and parameters for the tree learning problem are listed in Table 1. They are also used in the following sections throughout the paper.

Table 1	Variables and parameters list

X        A vector of multiple attributes
x        Value of an attribute
D        An instance in the data stream, D = (X, y)
y        Value of an actual class label
ŷ        Value of a predicted class label
δ        Confidence used in the Hoeffding bound
i        Index of the attribute
j        Index of the value of one certain attribute
k        Index of the class label
t        Index of the timestamp
I        Total number of attributes
J        Total number of different values for an attribute
K        Total number of classes
T        Total number of timestamps. For limited data T is a positive integer; for infinite data T → ∞
Dt       A piece of the data stream collected at timestamp t
Xi       The attribute with index i
xij      The jth value of attribute Xi
TR       A decision tree model
nmin     Interval number of instances between splitting tests
τ        A user pre-defined value for the tie-breaking threshold
nijk     Statistic counting how many instances with attribute Xi taking value xij belong to class yk
H(xij)   Heuristic function to evaluate xij

2.2 Big data

Each instance Dt collected at timestamp t contains an attribute vector Xt and an actual class yt, as defined in (1):

$D_t = (X_t, y_t)$    (1)
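As a minimal illustration of equation (1), a data stream can be treated as an iterator that yields one (X_t, y_t) pair at a time, in arrival order. The function name and record layout below are illustrative assumptions, not part of the paper.

```python
from typing import Iterator, Sequence, Tuple

def read_stream(source) -> Iterator[Tuple[Sequence, object]]:
    """Yield instances D_t = (X_t, y_t) one at a time, in arrival order."""
    for record in source:
        X_t, y_t = record[:-1], record[-1]   # attribute vector and actual class label
        yield X_t, y_t
```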

For a limited number of data instances, suppose D^T is the data block collected so far, where T is the total number of timestamps and is a positive integer, T > 0. The full data stream collected up to T is defined as:

$D^T = \bigcup_{t=1}^{T} D_t = \begin{bmatrix} X_1 & y_1 \\ \vdots & \vdots \\ X_T & y_T \end{bmatrix}$    (2)

In the limited data volume scenario, tree induction chooses a heuristic function H(·) and uses a greedy search to train a globally optimal decision model TR_GLOBAL from the full data D^T. The mechanism used to choose a splitting-attribute as an internal node is the heuristic function, e.g. entropy (Quinlan, 1986), information gain (Quinlan, 1993) or GINI (Breiman, 1984). The value x_ij is the splitting-value of attribute X_i, where x_ij = arg max H(x_ij). This value of the splitting-attribute X_i is computed by a greedy search over the full attribute-value statistics, for which all values from x_i1 to x_iJ are known. In short, the globally optimal decision tree is obtained by:

$\text{Maximize} \sum_{i=1}^{I} \sum_{j=1}^{J} H(x_{ij})$    (3)
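As an illustration of the heuristic H(·) in equation (3), the sketch below computes information gain for one candidate split from class-count statistics; entropy or GINI could be substituted in the same place. The function names and the count layout are assumptions made for this example, not the paper's implementation.

```python
import math

def entropy(class_counts):
    """Shannon entropy of a class distribution given as a list of counts."""
    total = sum(class_counts)
    if total == 0:
        return 0.0
    ent = 0.0
    for c in class_counts:
        if c > 0:
            p = c / total
            ent -= p * math.log2(p)
    return ent

def information_gain(parent_counts, partition_counts):
    """H(x_ij) as information gain: the entropy reduction obtained by splitting
    the parent node into the partitions induced by one attribute-value test.

    parent_counts    : list of class counts before the split
    partition_counts : list of lists, class counts in each branch after the split
    """
    total = sum(parent_counts)
    remainder = sum(
        (sum(branch) / total) * entropy(branch) for branch in partition_counts
    )
    return entropy(parent_counts) - remainder

# Example: a binary split over a node holding three classes
gain = information_gain([40, 30, 30], [[35, 5, 10], [5, 25, 20]])
```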

This trained model assigns a predicted class to a new instance X_t. The optimisation goal in the limited data scenario is to obtain the minimum error cost, as follows:

$TR_{GLOBAL} = \text{Train}(D^T, H(\cdot))$
$\hat{y}_k^t = \text{Test}(TR_{GLOBAL}, X_t)$
$Error_k^t = \begin{cases} 1, & \text{if } \hat{y}_k^t \neq y_k^t \\ 0, & \text{otherwise} \end{cases}$    (4)

subject to $\text{Minimize} \sum_{t=1}^{T} \sum_{k=1}^{K} Error_k^t$
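A minimal sketch of the 0/1 error count in equation (4): the trained model labels each incoming instance and every misclassification adds one to the error total. The method name model.predict is a placeholder for whatever Test(TR, X_t) is in use.

```python
def total_error(model, stream):
    """Sum of Error_k^t over a stream of (X_t, y_t) pairs, as in equation (4)."""
    errors = 0
    for X_t, y_t in stream:
        y_hat = model.predict(X_t)      # Test(TR, X_t)
        if y_hat != y_t:                # Error^t = 1 when the prediction is wrong
            errors += 1
    return errors
```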

Suppose a tree model TR_GLOBAL has been built from the full data D^T up to timestamp T. When new data D_{T+1} arrive at timestamp T+1, the model must be re-computed over the data (D^T + D_{T+1}) to update TR_GLOBAL. This update takes a long time because it requires re-loading both the historical data D^T and the new data D_{T+1}. For modern decision models, the data may appear in an unstructured format and are generated at a huge scale day by day. Keeping a model up to date is therefore an open problem for software system designers: when the model is updated frequently, re-computing the historical data is not feasible if that data is large. Incremental learning of a single decision tree TR_INCR, the so-called VFDT, was proposed to resolve this trade-off (Domingos and Hulten, 2000). Its lightweight design is capable of progressively building up a decision tree from scratch in real time by accumulating sufficient statistics from the data streams. Incremental tree induction learns by incrementally updating the decision model while scanning the data. VFDT is an any-time algorithm that reads each new instance only once. The node-splitting criterion uses heuristic methods but frees the algorithm from loading the full dataset. For an unlimited data volume, it is impossible to update the model by re-loading the whole of the historical and new data whenever a new instance arrives; the decision model must instead be refreshed from each new piece of the data stream. Sufficient statistics are used to record the counts n_ijk of each value x_ij of attribute X_i belonging to class y_k.
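The incremental procedure described above can be outlined as follows: each instance is read exactly once, routed to a leaf of the current tree, the leaf's n_ijk counters are updated, and the more expensive split evaluation is only attempted every n_min arrivals. This is a hedged sketch of a VFDT-style loop; the helper names sort_to_leaf and try_split are placeholders, not a published API.

```python
def train_incrementally(tree, stream, n_min=200):
    """Single-pass induction: update sufficient statistics per instance and
    run the split test only every n_min instances accumulated at a leaf."""
    for X_t, y_t in stream:              # each instance is read exactly once
        leaf = tree.sort_to_leaf(X_t)    # route the instance down the current tree
        leaf.stats.update(X_t, y_t)      # increment the n_ijk counters
        leaf.seen += 1
        if leaf.seen % n_min == 0:       # periodic node-splitting attempt
            tree.try_split(leaf)         # uses H(.) and the Hoeffding bound below
    return tree
```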

The solution is a node-splitting estimation using the Hoeffding bound (HB):

$HB = \sqrt{\dfrac{R^2 \ln(1/\delta)}{2n}}$    (5)

where R is the number of distinct classes and n is the number of instances that have fallen into a leaf. To evaluate a splitting-value for attribute X_i, the algorithm considers the best two values. Suppose x_ia is the value with the best heuristic score, x_ia = arg max H(x_ij); suppose x_ib is the value with the second-best heuristic score, x_ib = arg max H(x_ij), ∀j ≠ a; and let ΔH(X_i) be the difference between the two best heuristic values for attribute X_i, ΔH(X_i) = H(x_ia) − H(x_ib). Let n be the observed number of instances. HB is used to compute a high-confidence interval for the true mean r_true of attribute x_ij with respect to class y_k, such that $\bar{r} - HB \leq r_{true} \leq \bar{r} + HB$, where $\bar{r} = \frac{1}{n}\sum_{i} r_i$. If, after observing n_min examples, the inequality $\bar{r} + HB$
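Putting equation (5) together with the split test described above, the sketch below computes the Hoeffding bound and decides whether a leaf should split on its best candidate attribute. The tie-breaking rule shown (split when HB falls below the user threshold τ even though the two best candidates are nearly tied) is the convention commonly attributed to VFDT and is stated here as an assumption rather than as the exact iOVFDT rule.

```python
import math

def hoeffding_bound(R, delta, n):
    """Equation (5): HB = sqrt(R^2 * ln(1/delta) / (2 * n))."""
    return math.sqrt((R ** 2) * math.log(1.0 / delta) / (2.0 * n))

def choose_split(heuristic_scores, n, R, delta, tau):
    """Pick a splitting attribute from candidate scores, or return None.

    heuristic_scores : dict mapping candidate attribute -> H value (e.g. info gain)
    n                : number of instances observed at the leaf (at least two candidates assumed)
    R, delta         : inputs to the Hoeffding bound in equation (5)
    tau              : user pre-defined tie-breaking threshold
    """
    hb = hoeffding_bound(R, delta, n)
    ranked = sorted(heuristic_scores.items(), key=lambda kv: kv[1], reverse=True)
    (x_a, h_a), (_, h_b) = ranked[0], ranked[1]
    delta_h = h_a - h_b                  # difference between the two best candidates
    if delta_h > hb or hb < tau:         # confident winner, or tie broken by tau
        return x_a
    return None

# Example: 300 instances at the leaf, two classes, delta = 1e-6, tau = 0.05
best = choose_split({"X1": 0.42, "X2": 0.31, "X3": 0.12}, n=300, R=2, delta=1e-6, tau=0.05)
```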
