Distributed Classification of Data Streams: An Adaptive Technique

Alfredo Cuzzocrea (1), Mohamed Medhat Gaber (2), and Ary Mazharuddin Shiddiqi (3)

(1) DIA Department, University of Trieste and ICAR-CNR, Trieste, Italy
    [email protected]
(2) School of Computing Science and Digital Media, Robert Gordon University, Aberdeen, UK
    [email protected]
(3) Sepuluh Nopember Institute of Technology, Surabaya, Indonesia
    [email protected]
Abstract. Mining data streams is a critical task of modern Big Data applications. Data stream mining algorithms usually run in resource-constrained environments, which impose novel requirements such as adaptivity to the availability of resources. Following this trend, in this paper we propose a distributed data stream classification technique that has been tested on a real sensor network platform, namely Sun SPOT. The proposed technique introduces several points of research innovation, which are confirmed by the effectiveness and efficiency assessed in our experimental campaign.
1 Introduction
Mining data streams in wireless sensor networks has many important scientific and security applications [19,20]. However, the realization of such applications faces two main constraints. The first is that sensor nodes are battery powered, which requires the running applications to have a low battery footprint; consequently, in-network data processing is the accepted solution. The second is the resource constraints of each node in the network, including memory and processing power [16]. Many applications in wireless sensor networks require event detection and classification. The use of unsupervised learning techniques has recently been proposed for this problem [24,29]. Despite the applicability of the proposed methods, they have not addressed the problem of running on resource-constrained computing environments by adapting to the availability of resources; the problem has rather been addressed by proposing lightweight techniques. However, this may still cause a sensor node to shut down due to low availability of resources: experimental results have shown that typical stream mining algorithms can cause the device to shut down, as reported in [25]. Also, the use of unsupervised learning may fail to detect events of interest due to the possibility of producing impure clusters that contain instances of two or more classes.

© Springer International Publishing Switzerland 2015. S. Madria and T. Hara (Eds.): DaWaK 2015, LNCS 9263, pp. 296–309, 2015. DOI: 10.1007/978-3-319-22729-0_23
In this paper, we propose the use of distributed classification of data streams in wireless sensor networks for event detection and classification. The proposed technique can adapt to the availability of resources and work in a distributed setting using ensemble classification. The technique is coined RA-Class in reference to its resource-awareness capabilities. The experimental results show both adaptability and high accuracy on real datasets. This paper is an extended version of the short paper [10], where we presented the basic ideas that inspire our research. The rest of the paper is organized as follows. Section 2 focuses on the granularity-based approach to data stream mining, the conceptual framework that provides the global umbrella for adaptive stream mining. The proposed technique is given in Sect. 3. Section 4 discusses the experimental results. Section 5 briefly reviews related work. Finally, the paper is concluded in Sect. 6.
2 Mining Data Streams via Granularity-Based Approaches
The granularity-based approach to data stream mining is an adaptive, resource-aware framework that can change the algorithm settings to cope with the velocity of the incoming streaming data [14,27]. It is basically a heuristic technique that periodically assesses the availability of memory, battery charge and processor utilisation and, in response, changes some parameters of the algorithm, ensuring the continuity of the running algorithm. Accordingly, the approach has three main components:

1. Algorithm Input Granularity (AIG) represents the process of changing the data rates that feed the algorithm. Examples include sampling, load shedding, and creating data synopses. This is a common solution in many data stream mining techniques.

2. Algorithm Output Granularity (AOG) is the process of changing the output size of the algorithm in order to preserve the limited memory space. In the case of data mining, we refer to this output as the number of knowledge structures, for example the number of clusters or rules. The output size can also be changed via the level of output granularity: the less detailed the output, the higher the granularity, and vice versa.

3. Algorithm Processing Granularity (APG) is the process of changing the algorithm parameters in order to consume less processing power. Randomisation and approximation techniques represent the potential solution strategies in this category.

The Algorithm Granularity approach requires continuous monitoring of the computational resources. This is done over fixed time intervals/frames, denoted TF. According to this periodic resource monitoring, the mining algorithm changes its parameters/settings to cope with the current consumption patterns of resources. These parameters are the AIG, APG and AOG settings discussed briefly
above. Note that setting the value of TF is critical to the success of the running technique: the higher TF is, the lower the adaptation overhead will be, but at the expense of risking high resource consumption during the long time frame, possibly causing one or more of the computational resources to run out.

Using Algorithm Granularity as a general approach for mining data streams requires some formal definitions and notation, which we will use in our discussion:

R: the set of computational resources, R = {r1, r2, ..., rn}.
TF: the time interval for resource monitoring and adaptation.
ALT: the application lifetime.
ALT′: the time left of the application lifetime.
NoF(ri): the number of time frames needed to exhaust resource ri, assuming that the consumption of ri will follow the same pattern as in the last time frame.
AGP(ri): the algorithm granularity parameter that affects resource ri.

According to the above, the main rule of the algorithm granularity approach is as follows:
IF ALT′/TF < NoF(ri) THEN SET AGP(ri)+ ELSE SET AGP(ri)−

where AGP(ri)+ achieves higher accuracy at the expense of higher consumption of resource ri, and AGP(ri)− achieves lower accuracy with the advantage of lower consumption of resource ri. For example, when dealing with clustering, it is computationally cheaper to let incoming data instances join an existing cluster, with randomisation applied to the choice of which cluster the instance joins; ideally, the point should instead join a cluster with sufficient proximity, or a new cluster should be created to accommodate the new instance. This simplified rule can take different forms according to the monitored resource and the algorithm granularity parameter applied to control the consumption of that resource. The Algorithm Granularity approach has been successfully applied to a number of data stream mining techniques, which have been packaged in a Java-based toolkit coined Open Mobile Miner [15,21].
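The rule above can be sketched in runnable form as follows. This is a minimal illustration, not the paper's implementation: the resource bookkeeping (names such as `frames_to_exhaust`) and the unit step on the parameter are our assumptions.

```python
def frames_remaining(alt_left, tf):
    """Number of monitoring frames left in the application lifetime (ALT'/TF)."""
    return alt_left / tf

def adapt(resources, alt_left, tf):
    """Apply the granularity rule once per time frame TF.

    resources: dict mapping resource name -> dict with
      'frames_to_exhaust' (NoF(ri), estimated from the last frame's
      consumption pattern) and 'agp' (the parameter AGP(ri)).
    """
    for r in resources.values():
        if frames_remaining(alt_left, tf) < r["frames_to_exhaust"]:
            # Resource will outlast the application: spend it for accuracy.
            r["agp"] += 1   # AGP(ri)+  (assumed unit step)
        else:
            # Resource would run out first: trade accuracy for lifetime.
            r["agp"] -= 1   # AGP(ri)-
    return resources

state = {"battery": {"frames_to_exhaust": 50, "agp": 5},
         "memory": {"frames_to_exhaust": 10, "agp": 5}}
state = adapt(state, alt_left=1200, tf=60)  # 20 frames remain
```

With 20 frames remaining, the battery (50 frames to exhaust) can afford a more accurate setting, while memory (10 frames to exhaust) must be traded down.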
3 Classification Methods for Wireless Sensor Networks: The RA-Class Approach
RA-Class follows a procedure similar to LWClass, proposed by Gaber et al. [16,18]. However, RA-Class extends LWClass in two respects:

– RA-Class uses all the algorithm granularity settings (input, processing and output), whereas LWClass uses only the algorithm output granularity.
– RA-Class works in a distributed environment, whereas LWClass is designed for centralized processing.

The algorithm starts by examining each incoming data record. Based on the proximity of the record to the existing entries, it determines whether the record will be assigned to a specific stored entry or stored as a new entry. A stored entry is basically the mean value of all the records that have been assigned to it, where the mean value of a record is the mean value of all of its attributes. An update function informs the algorithm when its settings need to change. In the proposed RA-Class, three settings can be adjusted: the sampling interval, the randomization factor and the threshold value. The sampling interval is adjusted in response to the availability of battery charge, the randomization factor changes according to the CPU utilization, and the threshold value is adjusted according to the remaining available memory.

In case of low availability of resources, the algorithm granularity settings are adjusted. If the battery level drops below a preset level, the sampling interval is increased, which reduces the energy consumed for processing incoming data streams. If CPU utilization reaches a preset critical level, the randomization factor is reduced, so that when assigning an incoming record to an existing entry, the algorithm does not examine all of the currently stored entries to obtain the minimum distance but instead checks randomly selected entries. Lastly, if the remaining memory decreases significantly, the threshold value is increased to discourage the creation of new entries, which slows down memory consumption.

The output of the RA-Class algorithm is a list of entries, each associated with a class label and a weight, where the weight represents the number of data stream records represented by that entry. The resource-aware algorithm is shown in Algorithm 1.
In a distributed environment, one of the nodes may run out of battery charge. Therefore, there must be a mechanism to handle this scenario and preserve the recent list of stored entries produced during the data stream mining process. The mechanism should enable the dying node to migrate its stored entries to one of its neighbors. The migration process is preceded by the selection of the best neighbor: the dying node queries the current resources of its neighbors and then calculates the best node to migrate its results to, based on those resources. In addition, there has to be a mechanism to predict whether the dying node's resources are sufficient to migrate all of the stored entries before it actually dies; this is done by setting a critical limit before entering the migration stage. After selecting the best neighbor, the dying node starts transferring its stored entries until all of them are transferred completely. In the destination node, the migrating entries are merged with the destination node's current entries. The merging process uses the same mechanism as that used by RA-Class when processing incoming data streams. The only difference between an incoming data stream record and a migrating entry is the weight: an incoming record contributes only 1 to weight calculations. On the
Algorithm 1. RA-Class
 1: while data arrives do
 2:   for each incoming data record do
 3:     for each stored entry do
 4:       currentDistance = |newEntryValue − currentStoredEntryValue|
 5:       if currentDistance < threshold then
 6:         if currentDistance < lowestDistance then
 7:           lowestDistance = currentDistance
 8:           lowestDistanceStoredEntryIndex = currentStoredEntryIndex
 9:         end if
10:       end if
11:       currentStoredEntryIndex = currentStoredEntryIndex + 1
12:     end for
13:     if lowestDistance < threshold then
14:       if newEntryLabel = lowestDistanceStoredEntryLabel then
15:         store the weighted average of the records
16:         increment the weight of this entry
17:       else
18:         if weight of the stored entry > 1 then
19:           decrement the weight of this entry
20:         else
21:           release the stored entry from memory
22:         end if
23:       end if
24:     else
25:       store the incoming record as a new stored entry
26:       set the weight of the new stored entry to 1
27:     end if
28:   end for
29:   if timer > updateTimer then
30:     updateGranularitySettings
31:   end if
32: end while
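The main loop of Algorithm 1 can be sketched in runnable form as follows. This is a simplified single-pass version with Euclidean distance and without the timer and randomisation machinery; function and variable names are ours, and folding a record into an entry's weighted mean is our reading of "store the weighted average of the records".

```python
import math

def distance(a, b):
    """Euclidean distance between two attribute vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def process_record(entries, record, label, threshold):
    """One iteration of the RA-Class main loop (cf. Algorithm 1).

    entries: list of dicts {'mean': [...], 'label': str, 'weight': int}.
    """
    best, best_dist = None, float("inf")
    for e in entries:
        d = distance(record, e["mean"])
        if d < threshold and d < best_dist:
            best, best_dist = e, d
    if best is not None:
        if label == best["label"]:
            # Same class: fold the record into the entry's weighted mean.
            w = best["weight"]
            best["mean"] = [(m * w + x) / (w + 1)
                            for m, x in zip(best["mean"], record)]
            best["weight"] += 1
        elif best["weight"] > 1:
            best["weight"] -= 1          # conflicting class: decay the entry
        else:
            entries.remove(best)         # release the stored entry from memory
    else:
        entries.append({"mean": list(record), "label": label, "weight": 1})
    return entries
```

Note how raising `threshold` makes the "new entry" branch rarer, which is exactly the output-granularity adjustment used when memory runs low.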
other hand, a migrating entry can contribute more than 1, according to the weight it has already accumulated. When deciding a class label for an incoming entry in a distributed environment, each RA-Class node finds the label using its own knowledge. The node labeling technique needs an efficient way to determine the closest entry; in this research, we use the K-NN algorithm with K = 2. The choice of K = 2 is based on the need for fast classification of streaming inputs. The algorithm searches for the two stored entries closest to the incoming unlabeled record, and the result is the label of the stored entry with the higher weight among the two nearest neighbors. The result of this labeling process from each node is used in the labeling process of the distributed RA-Class.
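The node-level labelling step can be sketched as follows, assuming the same entry structure as in Algorithm 1 (a hedged illustration of the K = 2 rule; the tie-breaking by weight follows the description above):

```python
import math

def distance(a, b):
    """Euclidean distance between two attribute vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(entries, record):
    """Label a record by its two nearest stored entries, returning the
    label of the heavier of the two neighbours (K-NN with K = 2)."""
    ranked = sorted(entries, key=lambda e: distance(record, e["mean"]))
    two = ranked[:2]                      # the two closest entries
    return max(two, key=lambda e: e["weight"])["label"]

entries = [{"mean": [0.0], "label": "a", "weight": 5},
           {"mean": [1.0], "label": "b", "weight": 2},
           {"mean": [9.0], "label": "c", "weight": 9}]
```

For a record at 0.4, the two nearest entries are "a" (distance 0.4, weight 5) and "b" (distance 0.6, weight 2), so the heavier "a" wins.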
We then use an ensemble approach to classify any incoming unlabeled record. The ensemble algorithm works as an election system: each node contributes to the election by giving a vote for a class label, while also providing an error rate. The error rate states the assurance level of a vote, so a vote with a lower error rate has a higher chance of winning. In normal conditions, where all of the nodes are functioning properly, all of them contribute to the voting process. However, if one of the three nodes runs out of resources, it is not considered in the voting process, as all of its results have already been migrated.
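The election scheme can be sketched as a vote weighted by each node's confidence. The text does not fix the exact weighting, so weighting each vote by 1 − error rate is our assumption, chosen so that lower error rates have a higher chance of winning:

```python
from collections import defaultdict

def ensemble_vote(votes):
    """votes: list of (label, error_rate) pairs, one per live node.

    Each node's vote counts with weight 1 - error_rate; a node that has
    died and migrated its results simply submits no vote."""
    tally = defaultdict(float)
    for label, err in votes:
        tally[label] += 1.0 - err
    return max(tally, key=tally.get)

# Three nodes vote; "setosa" wins with tally 0.95 + 0.85 = 1.80 vs 0.80.
winner = ensemble_vote([("setosa", 0.05), ("versicolor", 0.20),
                        ("setosa", 0.15)])
```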
4 Experimental Assessment and Analysis
We run our experiments on Sun SPOT sensor nodes from Sun Microsystems using the Sun SPOT API version 3.0. To evaluate the performance of our algorithms, we have conducted a set of experiments assessing adaptability and accuracy. The adaptability evaluation assesses the ability of the algorithm to adapt to changes in resource availability, while the accuracy evaluation assesses the accuracy of RA-Class. We have used two real datasets from the UCI Machine Learning Repository [2], namely iris and vehicle. The iris dataset contains 150 entries with 4 attributes, while the vehicle dataset contains 376 entries with 18 attributes. The choice of datasets was based on their low number of instances, which made simulating the data streams easier.

We first conducted a set of experiments on a single sensor node, evaluating both the adaptability and the accuracy of the proposed framework and algorithm. These experiments were an important step to ensure the applicability of the technique before assessing its distributed version. To evaluate the ability of the algorithm to adapt to varying conditions of resource availability, we used the iris dataset for the first set of experiments. Figure 1(a) shows the number of entries formed without enabling the resource-awareness module. The distance thresholds used lie in the interval [0.1, 1.0]. The figure shows that the number of entries produced declines as the threshold value increases, because a higher threshold discourages the formation of new entries. When we enabled the resource adaptation, the number of entries stabilized at around 60 entries, as shown in Fig. 1(b). The memory adaptation is shown in Figs. 1(c) and (d): the figures show how the distance threshold discourages the creation of new entries that consume memory.
Figures 1(e) and (f) show how the algorithm adapts itself to different CPU loads. We set the criticality threshold to 40 % and randomly generated different CPU loads, as shown in Fig. 1(e). Increasing the load beyond the set threshold resulted in the algorithm decreasing its randomization factor, as shown in Fig. 1(f). However, the factor has a set lower bound of 10 %, which represents the minimum acceptable value. Similarly, Figs. 1(g) and (h) demonstrate how the algorithm adapts itself by changing the sampling interval when the battery charge reaches its critical point, which has
Fig. 1. Algorithm adaptability evaluation using the iris dataset: (a) number of entries produced over time without adaptation; (b) number of entries produced over time with adaptation; (c) threshold value over time; (d) available memory over time; (e) CPU utilization over time; (f) randomization factor adaptation against CPU utilization; (g) energy level over time; (h) sampling interval over time.
been set to 80 % in this experiment. Figure 1(h) shows that the sampling interval decreases when the battery charge level reaches its critical point of 80 %, as shown in Fig. 1(g). Since memory adaptation is mostly affected by the size of the dataset, we repeated the experiments using the larger dataset, vehicle, with the memory critical point set to 85 %. The experiments, depicted in Figs. 2(a) and (b), clearly show that as soon as the memory reached its critical point, the distance threshold was increased to discourage the creation of new entries. This is
also evident in Fig. 2(c), which shows the number of entries produced over time. The stabilization of the number of entries at 880 is due to periodically releasing outliers from memory.
Fig. 2. Algorithm adaptability evaluation using the vehicle dataset: (a) available memory over time; (b) threshold value over time; (c) number of entries produced over time.
The accuracy assessment on the iris dataset was done using 10-fold cross validation and resulted in 92 % accuracy. Similarly, the vehicle dataset produced 83 % accuracy.

In a distributed computational model, the main goal is, given a user-specified lifetime and a task such as data classification, to complete the preset lifetime and produce results as accurate as possible. The other objective is to minimize the accuracy loss in case a few nodes die or stop working due to low availability of resources, such as a drained battery, exhausted memory, and/or saturated CPU. Our approach is to migrate current results from a nearly-dead node to its best neighbor. Three tasks have to be performed before migrating the stored entries: (1) selecting the neighbor to migrate to; (2) determining the time to migrate; and (3) deciding how to migrate. The migration scenario is shown in Fig. 3.

To examine the accuracy of the distributed RA-Class, we randomly divided the iris dataset into three disjoint subsets of equal size. To simulate the streaming environment, we drew from each of these subsets a set ten times larger than the original size. After running RA-Class on each subset, we ran the ensemble classification by voting among the results of the three classifiers using 10-fold cross validation, which resulted in an average accuracy of 91.33 %. The accuracy of each experiment is reported in Table 1.
Fig. 3. Flowchart of data migration
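The neighbor-selection step of the migration procedure can be sketched as a simple score over the queried resource reports. The text does not fix a scoring function, so the resource attributes and the equal weighting below are our assumptions, purely illustrative:

```python
def best_neighbor(neighbors):
    """Pick the migration target from queried neighbour resource reports.

    neighbors: dict mapping node id -> dict of fractions in [0, 1]:
      'battery', 'free_memory', and 'cpu_idle'.
    Scores each neighbour by its spare capacity and returns the best one.
    """
    def score(r):
        # Equal weighting of the three resources is an assumption.
        return r["battery"] + r["free_memory"] + r["cpu_idle"]
    return max(neighbors, key=lambda n: score(neighbors[n]))

target = best_neighbor({
    "node-2": {"battery": 0.9, "free_memory": 0.4, "cpu_idle": 0.5},
    "node-3": {"battery": 0.6, "free_memory": 0.8, "cpu_idle": 0.7},
})
```

Here node-3 wins with a spare-capacity score of 2.1 against node-2's 1.8, and would receive the dying node's stored entries.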
The experiments were repeated with result migration and merging. The reported average accuracy was 86.67 %; the accuracy of each experiment is reported in Table 2. It is worth noting that the accuracy did not drop much for the ensemble choice, which provides evidence of the applicability of our resource-adaptive framework.

The main goal of the next set of experiments is to test the validity of RA-Class in a distributed environment on real Sun SPOT devices. As in the simulation experiments, we use three nodes that run RA-Class, and then use the ensemble approach for the classification process. We used the iris dataset for this experiment, divided into three disjoint subsets of equal size; we then simulated a stream of 1500 records drawn randomly from each subset to feed each node. We use the Sun SPOT LEDs to indicate the ongoing process, as shown in Fig. 4.

Table 1. Distributed RA-Class without migration

E1      E2      E3      E4      E5     E6      E7      E8      E9      E10
93.3 %  93.3 %  86.7 %  93.3 %  100 %  86.7 %  93.3 %  93.3 %  86.7 %  86.7 %
After performing the classification process, we have tested the accuracy of RA-Class using 15 randomly selected records from the iris dataset. This test is done on a single node functioning as a testing node. The final entries list
Table 2. Distributed RA-Class with migration

E1      E2     E3      E4      E5      E6     E7      E8      E9      E10
86.7 %  80 %   86.7 %  93.3 %  86.7 %  80 %   86.7 %  86.7 %  86.7 %  93.3 %
Fig. 4. Distributed RA-Class on Sun SPOT: (a) classification process; (b) critical situation and neighbor selection; (c) migration process; (d) sleeping mode.
of the remaining two nodes is transferred to the testing node, and then the accuracy test is performed using the ensemble algorithm. Figure 5 illustrates this scenario. This setup is adopted purely for convenience, so that the accuracy testing is easier to observe. The entries used are the same ones as in the previous experiment. Repeating the experiment ten times, the distributed RA-Class produced 88.0 % accuracy. The results show that in a real distributed system environment, RA-Class keeps producing better results than a single RA-Class node, due to the use of the ensemble algorithm, which elevates the accuracy level.

The above experiment did not consider the migration and merging processes, which clearly affect the accuracy. To measure the accuracy when migration and merging take place, we conducted the same experiment with two nodes contributing to the classification and one dying node, as shown in Fig. 6; the results showed a high average accuracy of 84.67 % over ten different runs.
Fig. 5. Accuracy testing mechanism without migration
Fig. 6. Accuracy testing mechanism with migration
5 Related Work
In the classical data stream mining literature, there is considerable attention towards extending popular approaches (e.g., clustering, classification, association rule mining, and so forth) to make them work under the hard conditions defined by wireless sensor networks (e.g., [1]). Energy efficiency plays a central role in this context (e.g., [1]), as mining algorithms must rigorously consider energy consumption in their core layers. Based on these motivations, several research efforts have investigated data stream mining problems in wireless sensor networks; we briefly outline some of these approaches in the following. Zhuang et al. [32] focus on the problem of removing noise from data streams as they arrive at each sensor of the target network, by exploiting both temporal and spatial information in order to reduce the stream-value-to-sink rate, hence improving the mining capabilities of given algorithms. Sheng et al. [26] and Subramaniam et al. [28] focus on the problem of detecting outliers in wireless sensor networks. In the first case, the authors propose that each sensor represent and maintain a histogram-based summary of the data kept in neighboring sensors, which
is then propagated to the sink, which uses this information for query optimization. In the second case, the authors propose that each sensor represent and maintain a specialized tree-based communication topology, and outliers are detected by estimating the Probability Distribution Function (PDF) built on the distributed topologies. Zhuang and Chen [31] provide a wavelet-based method for repairing erroneous data streams from a fixed population of sensors (e.g., due to transmission faults), identifying them as anomalous with respect to a specialized distance metric (Dynamic Time Warping, DTW) applied to data streams originated by sensors spatially close to that population. Cuzzocrea et al. investigate several approaches for supporting data stream mining in limited-memory environments, for both dense [22] and sparse [4] data streams; in this case, the idea is to exploit popular time-fading and landmark models and adapt them to the specialized case of limited memory resources. Finally, a related line of work is represented by approaches that propose compression paradigms over data streams (e.g., [6,8,9]), because compression is a way of achieving better runs of typical data mining algorithms.

The resource-adaptive framework was first proposed by Gaber and Yu in [17]. Our research uses this framework for adapting to variations of resource availability on a single node. The framework proposed by Gaber and Yu [17] uses three settings that are adjusted in response to resource availability during the mining process. The input settings are termed Algorithm Input Granularity (AIG); the output and processing settings are termed Algorithm Output Granularity (AOG) and Algorithm Processing Granularity (APG), respectively. The input settings include sampling, load shedding, and data synopsis techniques. The output settings include the number of knowledge structures created, or the level of output granularity.
Changing the error rate of approximation algorithms or using randomization represents the processing granularity. The three Algorithm Granularity settings are collectively named the Algorithm Granularity Settings (AGS). The framework has been applied to a data stream clustering algorithm termed RA-Cluster. An important step towards proving the applicability of RA-Cluster in wireless sensor networks has been the implementation of ERA-Cluster by Phung et al. [23].
6 Conclusions and Future Work
This paper explored the validity of our adaptive classification technique, which we termed Resource-Aware Classification (RA-Class), for processing data streams in wireless sensor networks. The proposed RA-Class was evaluated with regard to accuracy and resource-awareness using real datasets. The results show that RA-Class can effectively adapt to resource availability and improve resource consumption patterns in both single-node and distributed computing environments. The algorithm has been tested in a real testbed using Sun SPOT sensor nodes. The results also show that the accuracy loss due to the adaptation process is limited. Future work includes applying RA-Class to a dense wireless sensor network with a large number of nodes, and testing the resource-awareness framework with other data stream mining techniques. In addition, we plan to
study further aspects of our framework, inspired by similar approaches in the literature: (i) fragmentation methods (e.g., [3,7]) to gain efficiency; (ii) privacy preservation issues (e.g., [11,12]), which are relevant for published data; (iii) big data challenges (e.g., [5,13,30]), which are now really emerging.
References

1. Akyildiz, I.F., Su, W., Sankarasubramaniam, Y., Cayirci, E.: Wireless sensor networks: a survey. IEEE Trans. Syst. Man Cybern. Part B 38, 393–422 (2002)
2. Asuncion, A., Newman, D.J.: UCI Machine Learning Repository. University of California, School of Information and Computer Science, Irvine, CA (2007). http://www.ics.uci.edu/~mlearn/MLRepository.html
3. Bonifati, A., Cuzzocrea, A.: Efficient fragmentation of large XML documents. In: Wagner, R., Revell, N., Pernul, G. (eds.) DEXA 2007. LNCS, vol. 4653, pp. 539–550. Springer, Heidelberg (2007)
4. Cameron, J.J., Cuzzocrea, A., Jiang, F., Leung, C.K.-S.: Mining frequent itemsets from sparse data streams in limited memory environments. In: Proceedings of the 14th International Conference on Web-Age Information Management, pp. 51–578 (2013)
5. Cuzzocrea, A.: Analytics over big data: exploring the convergence of data warehousing, OLAP and data-intensive cloud infrastructures. In: Proceedings of COMPSAC 2013, pp. 481–483 (2013)
6. Cuzzocrea, A., Chakravarthy, S.: Event-based lossy compression for effective and efficient OLAP over data streams. Data Knowl. Eng. 69(7), 678–708 (2010)
7. Cuzzocrea, A., Darmont, J., Mahboubi, H.: Fragmenting very large XML data warehouses via K-means clustering algorithm. Int. J. Bus. Intell. Data Min. 4(3/4), 301–328 (2009)
8. Cuzzocrea, A., Furfaro, F., Mazzeo, G.M., Saccà, D.: A grid framework for approximate aggregate query answering on summarized sensor network readings. In: Meersman, R., Tari, Z., Corsaro, A. (eds.) OTM-WS 2004. LNCS, vol. 3292, pp. 144–153. Springer, Heidelberg (2004)
9. Cuzzocrea, A., Furfaro, F., Masciari, E., Saccà, D., Sirangelo, C.: Approximate query answering on sensor network data streams. In: Stefanidis, A., Nittel, S. (eds.) GeoSensor Networks, pp. 53–72. CRC Press, Boca Raton (2004)
10. Cuzzocrea, A., Gaber, M.M., Shiddiqi, A.M.: Adaptive data stream mining for wireless sensor networks. In: Proceedings of IDEAS 2014, pp. 284–287 (2014)
11. Cuzzocrea, A., Russo, V., Saccà, D.: A robust sampling-based framework for privacy preserving OLAP. In: Song, I.-Y., Eder, J., Nguyen, T.M. (eds.) DaWaK 2008. LNCS, vol. 5182, pp. 97–114. Springer, Heidelberg (2008)
12. Cuzzocrea, A., Saccà, D.: Balancing accuracy and privacy of OLAP aggregations on data cubes. In: Proceedings of DOLAP 2010, pp. 93–98 (2010)
13. Cuzzocrea, A., Saccà, D., Ullman, J.D.: Big data: a research agenda. In: Proceedings of IDEAS 2013, pp. 198–203 (2013)
14. Gaber, M.M.: Data stream mining using granularity-based approach. In: Abraham, A., Hassanien, A.E., de Leon F. de Carvalho, A.P., Snášel, V. (eds.) Foundations of Computational Intelligence Volume 6. Studies in Computational Intelligence, vol. 206, pp. 47–66. Springer, Berlin (2009)
15. Gaber, M.M.: Advances in data stream mining. Wiley Interdisc. Rev.: Data Min. Knowl. Discov. 2(1), 79–85 (2012)
16. Iordache, O.: Methods. In: Iordache, O. (ed.) Polystochastic Models for Complexity. UCS, vol. 4, pp. 17–61. Springer, Heidelberg (2010)
17. Gaber, M.M., Yu, P.S.: A holistic approach for resource-aware adaptive data stream mining. J. New Gener. Comput. 25(1), 95–115 (2006)
18. Gaber, M.M., Zaslavsky, A., Krishnaswamy, S.: A survey of classification methods in data streams. In: Aggarwal, C.C. (ed.) Data Streams: Models and Algorithms. Advances in Database Systems, pp. 39–59. Springer, Heidelberg (2007)
19. Gama, J., Gaber, M.M.: Learning from Data Streams: Processing Techniques in Sensor Networks. Springer, Berlin (2007)
20. Ganguly, A., Gama, J., Omitaomu, O., Gaber, M.M., Vatsavai, R.R.: Knowledge Discovery from Sensor Data. CRC Press, Boca Raton (2008). ISBN 1420082329, 9781420082326
21. Krishnaswamy, S., Gama, J., Gaber, M.M.: Advances in data stream mining for mobile and ubiquitous environments. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pp. 2607–2608 (2011)
22. Leung, C.K.-S., Cuzzocrea, A., Jiang, F.: Discovering frequent patterns from uncertain data streams with time-fading and landmark models. In: Hameurlain, A., Küng, J., Wagner, R., Cuzzocrea, A., Dayal, U. (eds.) TLDKS VIII. LNCS, vol. 7790, pp. 174–196. Springer, Heidelberg (2013)
23. Phung, N.D., Gaber, M.M., Röhm, U.: Resource-aware online data mining in wireless sensor networks. In: Proceedings of the 2007 IEEE Symposium on Computational Intelligence and Data Mining, pp. 139–146 (2007)
24. Rodrigues, P.P., Gama, J., Lopes, L.: Clustering distributed sensor data streams. In: Daelemans, W., Goethals, B., Morik, K. (eds.) ECML PKDD 2008, Part II. LNCS (LNAI), vol. 5212, pp. 282–297. Springer, Heidelberg (2008)
25. Shah, R., Krishnaswamy, S., Gaber, M.M.: Resource-aware very fast k-means for ubiquitous data stream mining. In: Proceedings of the Second International Workshop on Knowledge Discovery in Data Streams, held in conjunction with ECML/PKDD 2005, Porto, Portugal (2005)
26. Sheng, B., Li, Q., Mao, W., Jin, W.: Outlier detection in sensor networks. In: Proceedings of the 8th ACM International Symposium on Mobile and Ad Hoc Networking and Computing, pp. 219–228 (2007)
27. Stahl, F., Gaber, M.M., Bramer, M.: Scaling up data mining techniques to large datasets using parallel and distributed processing. In: Rausch, P., Sheta, A.F., Ayesh, A. (eds.) Business Intelligence and Performance Management. Advanced Information and Knowledge Processing, pp. 243–259. Springer, London (2013)
28. Subramaniam, S., Palpanas, T., Papadopoulos, D., Kalogeraki, V., Gunopulos, D.: Online outlier detection in sensor data using non-parametric models. In: Proceedings of the 32nd International Conference on Very Large Databases, pp. 187–198 (2006)
29. Yin, J., Gaber, M.M.: Clustering distributed time series in sensor networks. In: Proceedings of the Eighth IEEE International Conference on Data Mining, pp. 678–687, Pisa, Italy, 15–19 December 2008
30. Yu, B., Cuzzocrea, A., Jeong, D.H., Maydebura, S.: On managing very large sensor-network data using Bigtable. In: Proceedings of CCGRID 2012, pp. 918–922 (2012)
31. Zhuang, Y., Chen, L.: In-network outlier cleaning for data collection in sensor networks. In: Proceedings of the 1st International VLDB Workshop on Clean Databases, pp. 678–687 (2006)
32. Zhuang, Y., Chen, L., Wang, X., Lian, J.: A weighted average-based approach for cleaning sensor data. In: Proceedings of the 27th International Conference on Distributed Computing Systems, pp. 678–687 (2007)