Investigating the Impact of Bursty Traffic on Hoeffding Tree Algorithm in Stream Mining over Internet
Yang Hang
Simon Fong
Faculty of Science and Technology University of Macau Macau SAR, China
[email protected]
Faculty of Science and Technology University of Macau Macau SAR, China
[email protected]
Abstract— Stream data are continuous and ubiquitous in nature, and can be found in many Web applications operating on the Internet. Some instances of stream data are web logs, online users’ click-streams, online media streaming and Web transaction records. Stream mining was proposed as a relatively new data analytic solution for handling such streams. It has been widely acclaimed for its usefulness in real-time decision-support applications, for example web recommenders. The Hoeffding Tree Algorithm (HTA) is one of the popular choices for implementing a Very Fast Decision Tree in stream mining. Its theoretical aspects have been studied extensively by researchers. However, the data streams fed into HTA are usually assumed in the literature to arrive at a constant rate. HTA has yet to be tested under bursty traffic such as that of the Internet environment. This paper sheds some light on the impact of bursty traffic on the performance of HTA in stream mining.
Keywords- stream mining; Hoeffding tree algorithm; bursty stream; real-time application
I. INTRODUCTION
Stream mining systems could potentially be implemented across a wide range of applications where data flow in rapidly and the total data size may be infinitely large. Such applications include financial analysis in stock markets, network intrusion detection, web personalization, online click-stream analysis, etc. [1]. Most of these applications operate on decoupled and distributed platforms, of which the Internet is an epitome. Traditional decision trees were not designed to handle data streams, and a new generation of stream mining algorithms was developed for this purpose. Among them, the Hoeffding Tree Algorithm (HTA) [5] dynamically constructs a decision tree in real time as the data stream continuously arrives. HTA has been studied extensively by many researchers in the past decade. Many of them assume that the stream data arrive at a regular, though rapid, rate and that the stream carries a uniform pattern. On the Internet, however, data streams may be irregular and bursty. Singh et al. suggested that Internet traffic possesses bursty patterns [2] and is even self-similar, like fractals [3], and can be modeled by Fast Fourier Transforms. The patterns of Internet traffic are not uniform, owing to a very wide mix of heterogeneous sources and destination addresses. Looking into the building blocks of the Internet, such as routers or Asynchronous Transfer Mode (ATM) switches, we find that they take in bursty traffic from the Internet that can generally be characterized by three major influences, depicted in Figure 1.
Figure 1. Three major influences on the performance of an ATM switch
Figure 2. Examples of synthesized self-similar traces with different H parameters
When data streams multiplex and mix along the switching routes of the Internet, they bundle up into some form of self-similar traffic. Research in network traffic measurement at Bellcore indicated that Ethernet LAN traffic is “self-similar” or “fractal” in nature (“self-similar” describes a phenomenon that displays structural similarities across a wide range of time scales) [4]. This suggests that results obtained under traditional arrival traffic models may no longer hold under self-similar traffic and need to be revised. Some samples of such traffic patterns, with various Hurst parameters (H), are shown in Figure 2. The traces can be viewed as the result of a large number of bursty traffic patterns being multiplexed together. So far the issue of “bursty” vs. “self-similar” traffic arrival patterns has been investigated only with respect to ATM multiplexers, and there is a great deal of debate and research on how this will affect performance modeling studies. In this project we aim to explore the effects of bursty Internet traffic on the performance of HTA. Our previous study documented the mapping between real-time constraints and the performance of HTA. Here we take a step further and investigate the patterns of Internet traffic in relation to HTA. Figure 3 depicts the composition of a Web application powered by a real-time decision-support system (such as an online recommender or an intelligent trader agent) that implements HTA. We want to see how HTA performs upon the arrival of bursty traffic patterns from the Internet.
Figure 3. Abstract model of real-time Web application
The paper is organized as follows. Section 2 provides the foundations of HTA and its implementations, such as Very Fast Decision Trees and other variants. Section 3 presents the experiments we conducted to evaluate the performance of HTA, both with bursty traffic synthesized from an analytical model and with real-life web click-stream data that resemble self-similar traffic. A brief discussion and conclusion follow at the end.
II. BACKGROUND OF HTA IN STREAM MINING
A. Decision tree in predictive application
In general, a decision tree model is applied to predict an event or situation that may happen in a particular future period, based on past events. The predictive model is built by mining historical data collected over time. As seen in Figure 4, an online application computes a predictive consequence for the decision maker, in an automated process, after collecting enough observations. The predictive consequence corresponds to a class label in a leaf node of the tree, and the observations correspond to the attributes in the splitting nodes of the tree. The predictive consequence derived from the tree model provides decision-making support for the decision maker. Observations without consequence information cannot update the predictive model; the model has enough information to be updated only once the actual (de facto) consequence is known. This is the test-then-train approach of stream mining, which differs from the train-then-test approach of traditional data mining.
Figure 4. Decision tree application flow
Most current applications of decision tree classifiers are updated at fixed time intervals (e.g. every couple of hours, or just after midnight). Although fresh data may be fed in continuously, the output of the trained model is based on the old data from the last update. There is therefore some latency between the arrival of new data and the time of the next model update. This is one of the known limitations of decision trees.
B. HTA used in VFDT system
The VFDT (Very Fast Decision Tree) system [5] constructs a decision tree using constant memory and constant time per sample. It is a pioneering predictive technique that utilizes the Hoeffding bound. The tree is built by recursively replacing leaves with decision nodes. The sufficient statistics of attribute values are stored in each leaf, and a heuristic evaluation function is used to determine the split attribute when a leaf is converted into a node. Nodes contain the split attributes and leaves contain only the class labels; a leaf represents the class with which a sample is labeled. When a sample enters, it traverses the tree from the root to a leaf, evaluating the relevant attribute at every node. After the sample reaches a leaf, the sufficient statistics are updated. At this time the system evaluates each possible condition based on attribute values; if the statistics are sufficient to support one test over the others, the leaf is converted to a decision node. The decision node has as many branches as there are possible values of the attribute chosen for the installed split test. The main elements of VFDT are as follows. First, the tree starts with a single leaf, the root of the tree. Second, the heuristic evaluation function (denoted by G(.)) is defined; it builds the decision tree with Information Gain, as in ID3 [6]. Information Gain measures the amount of information that is necessary to classify a sample that reaches the node, as given in Equation 1.
$$G(A_j) = \mathrm{info}(\mathrm{samples}) - \mathrm{info}(A_j) \tag{1}$$
The sufficient statistics needed to estimate the merit of a discrete attribute are the counts n_ijk, representing the number of samples of class k that reach the leaf where attribute j takes the value i. The information of attribute j is given by (2), where P_ik in Equation 3 is the probability of observing value i of the attribute given class k, and P_i in Equation 4 is the probability of observing value i of the attribute.
$$\mathrm{info}(A_j) = \sum_i P_i \left( \sum_k -P_{ik} \log(P_{ik}) \right) \tag{2}$$
$$P_{ik} = \frac{n_{ijk}}{\sum_a n_{ajk}} \tag{3}$$
$$P_i = \frac{\sum_a n_{ija}}{\sum_a \sum_b n_{ajb}} \tag{4}$$
For n independent observations of a real-valued random variable r whose range is R, the Hoeffding bound is calculated as in (5). It states that, with confidence level 1 − δ, the true mean of r is at least r̄ − ε, where r̄ is the observed mean of the samples. For a probability the range R is 1, and for an information gain the range R is log2 Class#.
$$\varepsilon = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}} \tag{5}$$
An important part of VFDT is the use of the Hoeffding bound to choose a split attribute for the decision node. Let x_a be the attribute with the highest G(.) and x_b the attribute with the second-highest G(.). Then ∆G = G(x_a) − G(x_b) is the difference between the two top-quality attributes. If ∆G > ε with n samples observed in the leaf, the Hoeffding bound states with probability 1 − δ that x_a is the attribute with the highest value of G(.). The leaf is then converted into a decision node that splits on x_a.
C. Other HTAs
In a large-volume, continuously changing data stream, the phenomenon called concept drift may occur. VFDT is built on the assumption that samples are drawn at random from a stationary distribution, so it does not suit learning from time-changing data. CVFDT (Concept-adapting Very Fast Decision Tree) [7] applies VFDT with a sliding window technique. As new samples arrive, they are inserted at the beginning of the window, and a corresponding number of samples are removed from the end of the window so that the learner stays up to date. Additionally, CVFDT introduces a parameter γ, where ∆G < ε < γ; γ is a user-defined threshold that reduces the computation of ∆G. CVFDT chooses split attributes in the same way as VFDT, both using Information Gain. CVFDTNBC [8] adopts naïve-Bayes classifiers in the leaf nodes of a decision tree induced by CVFDT so as to detect concept drift. Both CVFDT and CVFDTNBC generate alternative sub-trees when concept drift is detected. If the accuracy of the alternative sub-tree is higher than that of the old one, the alternative sub-tree replaces the old sub-tree rooted at that node. However, since nodes close to the root store a large number of samples, it is difficult to detect concept drift quickly when the drift is abrupt. VFDTc [9] aims to bring the performance of the Hoeffding tree close to that of traditional decision tree algorithms such as C4.5. Besides large data sets, VFDTc also suits medium-sized data, so that the system has the any-time property. It uses two classification strategies at the leaves: a majority class classifier and a naïve-Bayes classifier. For continuous attributes, the naïve-Bayes estimates are derived efficiently from the tree structure used to store numeric attribute values. The naïve-Bayes classifier, however, incurs overhead relative to the majority class classifier, because it requires the estimation of many more probabilities.
III. SIMULATING HTA IN INTERNET TRAFFIC
A. Hoeffding bound and traffic change rates
The definitions of the real-time constraints and HTA parameters are described in detail in [4]; a summary of the parameters is given in Table II. In order to study the different conditions of a real-time application operating on the Internet, we wrote a Java program to simulate the data collection of an online application. The full sample# for updating each decision tree is 10,000, with 20 different classes, and the confidence is 0.999. If ∆T ≥ 0, the Hoeffding bound gives an error of 0.0967 for node splitting during decision tree generation. If ∆T < 0, the sample# changes over time because the collection time changes, and the error rate differs under different conditions. In our experiment, TR is 5 seconds; ∆t is randomly selected between 0 and 0.5; ∆C is randomly chosen from 0 to 10; ∆r is randomly varied from -0.5 to 0.5. These values are chosen to resemble typical cases.
[Figure 5 plot: (Hoeffding Bound) Error % versus Data Rate Change %, for Class# = 20, 50, 80 and 150. Variance: Class# 20 = 0.0041, Class# 50 = 0.0070, Class# 80 = 0.0088, Class# 150 = 0.0114.]
Figure 5. Comparison of Data Rate change% and error
In Figure 5, we simulate the trend of the Hoeffding bound error while varying the data rate change percentage. Four cases with different class numbers (class#) are chosen in this experiment: 20, 50, 80 and 150. We used different class numbers to show the situations in which class# increases with the streaming data. Without a class# change, the maximum necessary sample# is 10,000 and the confidence is 0.999. As the data rate change percentage increases from -85% to +85%, the trends observed are: (1) As the data rate increases (from negative to positive change), the error decreases because more samples are collected. (2) Cases with a positive change percentage (data rate increase) show a narrower range of error change, and are thus more stable, than cases with a negative change percentage (data rate decrease). (3) According to the variances of the different cases, the one with the smallest class# has the best stability.
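The trends listed above follow from the Hoeffding bound formula of Equation (5) and can be checked with a short Python sketch. It is not the authors' Java simulator: the helper names are hypothetical, and it assumes that, within a fixed collection time, the number of collected samples scales linearly with the data rate, n = N(1 + ∆r), which the text does not state explicitly. The sketch takes δ as an explicit argument. With δ = 1 − 0.999 = 0.001 the formula gives roughly 0.080 for 20 classes and 10,000 samples, while the 0.0967 quoted in the text appears to be matched numerically (as 0.0967%) when 0.999 itself is substituted for δ.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Equation (5): epsilon = sqrt(R^2 * ln(1/delta) / (2 * n))."""
    if n <= 0:
        raise ValueError("n must be positive")
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def bound_after_rate_change(num_classes, delta, base_samples, rate_change):
    """Hoeffding bound after the data rate changes by `rate_change`
    (a fraction: -0.85 means -85%).

    Assumption (not stated in the paper): with a fixed collection time,
    the sample count scales linearly with the data rate.
    """
    n = base_samples * (1.0 + rate_change)
    value_range = math.log2(num_classes)  # R = log2(class#) for information gain
    return hoeffding_bound(value_range, delta, n)


if __name__ == "__main__":
    # One row per class#, for rate changes of -85%, 0% and +85%.
    for classes in (20, 50, 80, 150):
        errors = [bound_after_rate_change(classes, 0.001, 10_000, change)
                  for change in (-0.85, 0.0, 0.85)]
        print(classes, [round(e, 4) for e in errors])
```

Because ε is proportional to 1/sqrt(n), the bound falls as the rate rises, changes less on the positive side than on the negative side, and grows with class# through R, which is consistent with trends (1) and (2) and with the ordering of the four curves.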
B. Generation of Bursty streams
A sporadic source (burst-silence source) is regarded as a realistic source of data stream traffic. It characterizes workloads that need to be delivered across a data network. For example, a user sporadically clicks on an HTTP link that requests delivery of a web page, types over an MSN communication channel one sentence at a time, or starts a file download with a mouse click. All these are typical Internet activities that can be characterized as bursty.
Figure 6. A two-state Idle/Active Markov chain
TABLE I. HOEFFDING ERROR IN DIFFERENT BURSTY STREAM
Avg. load (p)     0.2     0.4     0.5     0.6     0.8
Euclidean Dis.    0.170   0.176   0.244   0.150   0.148
Samples of bursty streams’ average load (p) and Euclidean distance results
The two-state Markov chain model in Figure 6 uses two variables to define a data stream: p and L. p is the average load of one bursty source, defined as the fraction of time slots that the source spends in the active state. L is the mean length of a burst of the data stream. This length can be perceived as the average unit packet size of the TCP/IP protocol on the Internet. The bursty model is a fundamental data model for generating Internet traffic streams. In the following experiment we simulated HTA with bursty data streaming into a web application. Following Markov chain theory, we simulated the bursty Internet data stream with a Java program. Figure 7 shows a series of bursty simulation results extracted from snapshots of the simulator. In this experiment, the average load p is 0.6, and the mean length L of the active period is 50, while that of the idle period is 5. The load of the bursty traffic is taken to be the data traffic rate, while the expected traffic is the data rate at the average load. In this condition, the Hoeffding bound error is 0.0967, which serves as an ambient benchmark when the traffic arrival is uniform (in other words, when the data change rate is zero). For the bursty streams displayed in the boxes on the left of Figure 7, and with no change in class number ∆C, the corresponding errors observed when choosing splitting attributes in HTA are shown in the boxes on the right of Figure 7.
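A bursty source of this kind can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' Java simulator: the function name and the per-slot 0/1 encoding are hypothetical, burst and idle lengths are assumed to be geometrically distributed, and the mean idle length is derived from p and L as L(1 − p)/p, rather than taken as a separate input, so that the long-run fraction of active slots equals p.

```python
import random


def bursty_source(avg_load, mean_burst, num_slots, seed=0):
    """Return a list of 0/1 slots (1 = active) from a two-state Markov chain.

    avg_load   : p, long-run fraction of slots spent in the active state
    mean_burst : L, mean number of consecutive active slots
    Assumption: the mean idle length is L * (1 - p) / p, so that the
    stationary probability of the active state is exactly p.
    """
    if not 0.0 < avg_load < 1.0:
        raise ValueError("avg_load must be strictly between 0 and 1")
    mean_idle = mean_burst * (1.0 - avg_load) / avg_load
    if mean_burst < 1.0 or mean_idle < 1.0:
        raise ValueError("mean burst and idle lengths must be at least 1 slot")

    p_leave_active = 1.0 / mean_burst  # Active -> Idle transition probability
    p_leave_idle = 1.0 / mean_idle     # Idle -> Active transition probability

    rng = random.Random(seed)
    active = rng.random() < avg_load   # start from the stationary distribution
    trace = []
    for _ in range(num_slots):
        trace.append(1 if active else 0)
        if active:
            if rng.random() < p_leave_active:
                active = False
        elif rng.random() < p_leave_idle:
            active = True
    return trace


if __name__ == "__main__":
    trace = bursty_source(avg_load=0.6, mean_burst=50, num_slots=100_000)
    print("measured load:", sum(trace) / len(trace))
```

Feeding such a trace to a learner, one sample per active slot, gives the alternation of collection and pause periods that the experiment examines.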
[Figure 7: paired panels for each simulated bursty stream. Left: Load versus Unit Length (with time passing). Right: (Hoeffding bound) Error % versus Sample# (Unit Length) with time passing.]
Figure 7. Bursty traffic and HB error
The Euclidean distance between points p and q, given in Equation 6, is the length of the line segment pq. Here the q_i are the Hoeffding bound of S0 and the p_i are those of the simulated bursty stream.
$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \tag{6}$$
Comparing the error rates obtained for the different bursty streams with the expected Hoeffding bound error of 0.0967, we calculated the Euclidean distance for each stream, as shown in Table I. The result is that, when a bursty traffic data stream is applied to HTA, the error rate approaches the ambient Hoeffding bound (HB) error as the data rate increases, which is a preferable condition in web applications, because a higher data rate means that more samples are collected within a given time. From the experimental results above, we see that the error rate of HTA's split selection is influenced by the real-time constraints. The Internet traffic is reflected by the data rate change. For bursty traffic, the error rate fluctuates between the idle period and the active period. Table II lists the parameters of the real-time web application and of HTA used for mapping the real-time constraints.
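The Euclidean distance of Equation 6 translates directly into code. The Python sketch below is illustrative only: the function names are hypothetical, and comparing a whole error trace against a constant baseline is an assumption about how the distance was applied. No values from Table I are reproduced.

```python
import math


def euclidean_distance(p, q):
    """Length of the segment between points p and q (equal-length sequences)."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def distance_to_baseline(errors, baseline):
    """Distance between an observed error trace and a constant baseline error.

    Assumption: the baseline (the uniform-traffic Hoeffding bound error) is
    repeated once per observation so that both points have the same dimension.
    """
    return euclidean_distance(errors, [baseline] * len(errors))


if __name__ == "__main__":
    # Hypothetical error trace, for illustration only.
    print(distance_to_baseline([0.0010, 0.0011, 0.0009], 0.000967))
```

A smaller distance means that the error trace of a bursty stream stays closer to the uniform-traffic benchmark.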
[Figure 8 plot: Number of log records versus Time (every hour per day).]
Figure 8. Visit# from Feb 1st to 11th 2004
[Figure 9 plot: Data Rate Change % versus Time (each hour per day).]
Figure 9. Data rate estimated from web logs
C. Applying Click-stream Web logs
In addition to bursty traffic generated from a mathematical model, we applied live web log data, which represent the fluctuating nature of Internet traffic, in our experiment. The log files are click-stream data downloaded from the ECML/PKDD 2005 Discovery Challenge, with a total size of over 60Mb. Figure 8 shows the raw visit numbers extracted (the number of web log entries in the file) at hourly intervals from Feb 1st to 11th 2004. A study of man-computer interaction [10] states that “1.0 second is about the limit for the user's flow of thought to stay uninterrupted, even though the user will notice the delay”. Following this suggestion, we calculated the average hourly data rate and took the ideal data rate that fits a 1.0-second response time to be 1.0 event per second. Compared with the expected data rate of 1.0 event per second, the data rate change percentage (data rate minus expected data rate) is shown in Figure 9. The experimental environment is configured as follows: the full sample# for updating each decision tree is 10,000, with 20 different classes, and the confidence is 0.999. According to the Hoeffding bound, the error should be 0.0967, assuming the data rate is unchanged. The HB error rate is calculated on the fly by Equation (5) in the experiment using the set of real-world log files. The corresponding HB errors under the influence of the data rate changes are shown in Figure 10. It is observed that the errors follow an oscillation trend that is similar in shape but opposite in direction to that of the data rate changes.
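The data rate change percentage plotted in Figure 9 can be sketched as follows in Python. The function name is hypothetical, and two assumptions are made that the text leaves implicit: the hourly count of log entries is converted to events per second by dividing by 3,600, and the difference from the expected rate is normalized by the expected rate before being expressed as a percentage.

```python
SECONDS_PER_HOUR = 3600.0


def data_rate_change_percent(hourly_count, expected_rate=1.0):
    """Percentage change of the observed data rate against the expected rate.

    hourly_count  : number of web log entries recorded in one hour
    expected_rate : expected events per second (1.0 in the text)
    Assumption: change % = (rate - expected) / expected * 100.
    """
    if expected_rate <= 0:
        raise ValueError("expected_rate must be positive")
    rate = hourly_count / SECONDS_PER_HOUR  # events per second
    return (rate - expected_rate) / expected_rate * 100.0


if __name__ == "__main__":
    # Hypothetical hourly counts, for illustration only.
    for count in (0, 1800, 3600, 7200):
        print(count, data_rate_change_percent(count))
```

Under these assumptions an hour with no log entries maps to -100%, and an hour with exactly 3,600 entries maps to 0%.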
[Figure 10 plot: (Hoeffding bound) Error % versus Time (each hour per day).]
Figure 10. The HB errors as being affected by the Data rate changes
Consequently, similar to the phenomenon observed in Figure 5, the errors in Figure 10 change inversely with the data rates. That is, when the data rate picks up, the error drops; when the data rate slows down, the error rises. We believe that an increasing data rate helps HTA collect sufficient samples for constructing the trees. With sufficient samples, node splitting progresses with few errors. Likewise, when the data rate slows, it throttles the tree-building progress and the errors grow.
TABLE II. REFERENCE OF REAL-TIME CONSTRAINTS AND HTA PARAMETERS
Web Application QoS
Network: data arrival rate (DR)
Available memory: computation influences mining time (TM)
Response time: time of data collecting (TC) and mining (TM)
Real time constraint: maximum acceptable waiting time of response time in web application (TR)
Real Time Constraint
∆T : real time constraint
TC : data collect time
TR : required max response time
∆C : class# change
∆t : data collect time change %
∆r : data rate change %
TM : negligible
HTA Parameters
ε : error rate of choosing split-attr
1 − δ : confidence
N : necessary sample number
R : range; for a probability R is 1, for class# R is log2 Class#
C : class#
IV. CONCLUDING REMARKS AND FUTURE WORK
This paper investigated the impact of Internet traffic on the Hoeffding bound error. In particular, we examined the effect of burstiness on the Hoeffding bound error, which is one of the key performance indicators in stream mining. In our earlier work [4] we showed that fluctuation of the data rate in real time makes the Hoeffding bound error oscillate, which causes frequent reconstruction of the HTA tree and in turn has an indirect effect on the overall prediction accuracy. A similar phenomenon is observed in the experiments of this paper. We found from the experiments that: (1) For Web applications built on the Internet, HTA is sensitive to data rate changes; (2) Increasing the data rate lowers the Hoeffding bound error rate, and vice versa, because a higher data rate allows the system to collect more data for building the decision tree per unit time; (3) The error rate of a dataset with a smaller class# is lower than that of a dataset with a larger class#; (4) The error rate is sensitive to the average load of a bursty stream; (5) The length of a burst has practically no effect on HTA, because when the bursty stream goes idle, HTA simply pauses its tree-building operation; (6) It is the data rate that has the most impact on HTA. Given the nature of Internet traffic as observed in the live click-stream logs, data rate fluctuations range in scale from seconds to hours and days. This is the same phenomenon as observed in self-similar traffic, where the fluctuation patterns have about the same shape at different levels of temporal resolution. In future, our research will concentrate on finding methods that guarantee a relatively stable error rate under different Internet conditions with ever-changing traffic data rates. We will also look for means to stabilize the traffic patterns and pace the data rate steadily, possibly by adding a preprocessor buffer between the traffic input and HTA. As our results in this experiment suggest, evening out the data rate will help contain the HTA errors, which will effectively keep the overall predictive accuracy/error under control in stream mining over Internet data.
ACKNOWLEDGMENT
The authors are grateful that this research project, titled Real-time Data Stream Mining, is supported by the Research Committee, University of Macau. Grant number: RG070/0910S/FCC/FST.
REFERENCES
[1] Yang H., and Fong S., “Real-time Business Intelligence System Architecture with Stream Mining”, The 5th International Conference on Digital Information Management (ICDIM 2010), July 2010, Thunder Bay, Canada, Accepted for Publication.
[2] Fong S., Atiquzzaman M., and Singh S., “An Analytical Model and Performance Analysis of Shared Buffer ATM Switches under Nonuniform Traffic”, International Journal of Computer Systems Science and Engineering, Special Issue on ATM Networks, March 1997, pp. 81-94.
[3] Fong S., and Singh S., “Performance Evaluation of Shared-Buffer ATM Switches under Self-Similar Traffic”, IEEE International Conference on Performance, Computing, and Communications (IPCCC 1997), Arizona, USA, 5-7 February 1997, pp. 252-258.
[4] Huang, C., Devetsikiotis, M., Lambadaris, I., and Kaye, A. R. “Modeling and simulation of self-similar variable bit rate compressed video: a unified approach”. SIGCOMM Comput. Commun. Rev. 25, 4 (Oct. 1995), pp. 114-125.
[5] Domingos, P. and Hulten, G. “Mining high-speed data streams”. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York, 2000, pp. 71-80.
[6] Quinlan, J. R. “Induction of decision trees”. Machine Learning, 1, 1986, pp. 81-106.
[7] Hulten, G., Spencer, L., and Domingos, P. “Mining time-changing data streams”. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, New York, 2001, pp. 97-106.
[8] Nishimura, S., Terabe, M., Hashimoto, K., and Mihara, K. “Learning Higher Accuracy Decision Trees from Concept Drifting Data Streams”. In Proceedings of the 21st International Conference on Industrial, Engineering and Other Applications of Applied Intelligent Systems, vol. 5027. Springer-Verlag, Heidelberg, 2008, pp. 179-188.
[9] Gama, J., Rocha, R., and Medas, P. “Accurate decision trees for mining high-speed data streams”. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (Washington, D.C., August 24 - 27, 2003). KDD '03. ACM, New York, NY, 2003, pp. 523-528.
[10] Miller, R. B. “Response time in man-computer conversational transactions”. Proc. AFIPS Fall Joint Computer Conference, Vol. 33, 1968, pp. 267-277.