
IEEE SENSORS JOURNAL, VOL. 15, NO. 1, JANUARY 2015

Prediction With Uncertainty: A Novel Framework for Analyzing Sensor Data Streams

Ashfaqur Rahman, Senior Member, IEEE, John McCulloch, and Quazi Mamun

Abstract: In this paper, we present a novel framework to predict events through time-series analysis of sensor data streams. The framework is capable of producing and visualizing event prediction probabilities, uncertainties around the predictions, and the actual decision being taken based on the prediction. We have tested the analytical framework on predicting closure events in shellfish farms in Tasmania. Reasonably high prediction accuracy is achieved. The visualization was able to capture the prediction, the uncertainty, and the actual decision being taken (i.e., three-in-one).

Index Terms: Sensor data analytics, time series prediction, prediction with uncertainty.

Manuscript received June 10, 2014; accepted July 28, 2014. Date of publication July 31, 2014; date of current version November 7, 2014. This work was supported in part by the Autonomous Systems (AS) Program of Commonwealth Scientific and Industrial Research Organization, in Hobart, and in part by the Tasmanian node of the Australian Centre for Broadband Innovation under a grant from the Tasmanian Government which is administered by the Tasmanian Department of Economic Development, Tourism and the Arts. The associate editor coordinating the review of this paper and approving it for publication was Prof. Okyay Kaynak.
A. Rahman and J. McCulloch are with the Commonwealth Scientific and Industrial Research Organization, Charles Sturt University, Bathurst, NSW 2795, Australia (e-mail: [email protected]; [email protected]).
Q. Mamun is with the School of Computing and Mathematics, Charles Sturt University, Bathurst, NSW 2795, Australia (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/JSEN.2014.2344683

I. INTRODUCTION

WE HAVE developed a framework to predict events based on time-series analysis of sensor data streams. Sensor readings carry uncertainty, and this uncertainty should be reflected in the decisions produced. The system has two types of users: (i) administrators who use the decision support system to assist in decision making, and (ii) end users who are affected by the decision. The administrators sometimes do not accept the automated decisions produced by the tool, mainly because of low confidence (or high uncertainty) in the decision. The end users need to be aware of this too. The proposed framework is capable of producing and visualizing the event prediction, the associated uncertainty, and the decision finally made by the administrator. This framework can augment the capabilities of standard data processing frameworks [1] for sensor networks.

We have applied and validated the framework on predicting closure events in shellfish farms. A closure event occurs when the water quality at a shellfish growing farm poses a health risk to human consumers and the farmers cannot sell the shellfish. If this event can be predicted early, the farmers can move the shellfish trays to a different part of the waterways where the water is not contaminated. The farmers in that case will still be able to sell the shellfish.

The conceptual diagram of the sensor network and prediction platform is presented in Fig. 1. Sensor data relevant to water quality (for example, salinity, rainfall, and river flow) is recorded and predictions are produced based on time-series analysis. The administrators from TSQAP (Tasmanian Shellfish Quality Assurance Program) use the decision support tool. The end users are the farmers, who close the shellfish farms based on TSQAP’s decision. The proposed framework produces closure decisions, the associated uncertainty, and the decision from TSQAP.

We have previously developed a number of machine learning methods and validated them with TSQAP data sets. However, those studies were conducted on flat data sets as opposed to time series data. Our previous work with TSQAP focused on investigating the class imbalance problem [2], an ensemble approach to deal with missing sensor values [3], problems related to relocating models to locations where we do not have sufficient closure examples [4], and identifying the importance of features for detecting closure events [5]. The framework presented in this paper is validated on a new time series data set from TSQAP.

The paper is organized as follows: Section II presents the proposed framework. All modules within the framework are described in Section III. The evaluation of the proposed framework on the shellfish farm closure problem is presented in Section IV. Finally, Section V concludes the paper.

II. PROPOSED FRAMEWORK

The event prediction framework is presented in Fig. 2. Statistical features are extracted from historical time series sensor data. The feature vectors are associated with event labels by the domain experts, and a supervised machine learning algorithm (e.g., a classifier) is trained on the feature vectors. Live sensor data streams (i.e., time series data) are classified into events by this classifier.
Uncertainty is reflected by altering the input data stream into a set of modified data streams. Each modified data stream is classified by the previously trained classifier, which produces probabilities for the different possible events on each modified data stream. A fusion module combines them into a cumulative class (event) probability and also produces the uncertainty around this decision. The administrator takes the actual decision. The visualization module produces a graphical representation of the event probability, the uncertainty around that decision, and the final decision taken by the administrator.

Fig. 1. Conceptual diagram of sensor network and prediction platform for shellfish quality assurance [3].

1530-437X © 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

III. METHODOLOGY

In this section we describe the different methods in the proposed prediction framework. The key blocks in the framework are: introduction of sensor uncertainty, feature extraction, prediction, fusion, and visualization. Each of these modules can be implemented in different ways and the framework is open to adopting any of them. In this paper, we present the algorithms that we found most suitable for predicting shellfish farm closures.

A. Introduction of Sensor Uncertainty

For practical reasons, sensor readings carry some level of uncertainty. The reasons for this uncertainty include the inherent inaccuracies of sensor operation, the influence of human activity in sensor deployment locations, and the effects of bio-fouling and calibration drift upon sensors over time [6]. The proposed framework captures this uncertainty so that it is reflected in the prediction probability.

Let the uncertainty be $\pm\delta$. This implies that the true value behind a reading $x$ produced by the sensor is likely to fall within the range $x - \delta$ to $x + \delta$. As the sensor readings are used for prediction purposes, changes in $x$ are likely to produce changes in the prediction probability $p$. Given the time series of sensor readings $(x_1, x_2, \ldots, x_n)$, the uncertainty $\pm\delta$ is introduced to each sensor reading. Given a set of sensor readings at $n$ time stamps, a total of $3^n$ combinations of raw values are computed, considering only the boundary values (i.e., $x - \delta$, $x$, $x + \delta$) for each reading. Time series features (as described next) are extracted from each combination. A classifier produces prediction probabilities for each combination. The fusion module combines these prediction probabilities.
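The enumeration of modified data streams described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `perturbed_series` and the sample readings are hypothetical, and the sketch simply takes the Cartesian product of the three candidate values for each reading.

```python
from itertools import product
from typing import Iterator, Sequence, Tuple


def perturbed_series(readings: Sequence[float], delta: float) -> Iterator[Tuple[float, ...]]:
    """Yield every combination of (x - delta, x, x + delta) across the readings.

    For n readings this yields 3**n modified series.
    """
    # Each reading contributes its lower bound, raw value, and upper bound.
    choices = [(x - delta, x, x + delta) for x in readings]
    # The Cartesian product enumerates all 3**n modified series.
    return product(*choices)


window = [2.0, 0.0, 5.0]  # hypothetical sensor readings, n = 3
variants = list(perturbed_series(window, delta=1.0))
print(len(variants))  # 3**3 = 27
```

Because the number of combinations grows as 3^n (n = 14 already gives 4,782,969 series), a practical implementation would probably need short windows or sampling. The sketch also does not clip physically impossible values such as negative rainfall; whether to do so is a design choice the text leaves open.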

B. Feature Extraction

A number of time series feature extraction methods, including cluster profiles, curve fitting parameters, Piecewise Aggregate Approximations, and Fourier and wavelet transforms, were evaluated for their effectiveness. The frequency domain representation of the univariate time series using the Discrete Fourier Transform (DFT) was found to provide the best results empirically. The reason is that high variations in the time series are represented by high-frequency components whereas steady states are represented by low-frequency components. This is evident from Fig. 3, where two segments of the time series rainfall data are presented. Where a transition between the ‘open’ and ‘close’ states occurs [Fig. 3(a)], the frequency domain magnitudes are high, whereas they are low when no transition occurs [Fig. 3(b)].

Given the finite list of equally spaced successive time series samples $x_1, x_2, \ldots, x_n$, the DFT [7], [8] produces a list of $n$ coefficients $f_1, f_2, \ldots, f_n$ of a finite combination of complex sinusoids, ordered by their frequencies. The magnitude $|f_i|$ of each coefficient is used as a feature, where $1 \le i \le n$. The time series is thus represented by a total of $n$ features, and the feature vector is $(|f_1|, |f_2|, \ldots, |f_n|)$.

C. Prediction

A classifier is trained on the historical time series data and the prediction labels produced by domain experts. This is an incremental learning process: as new data comes in and we obtain the opinion of the administrator, the classifier is retrained on the newly generated training set. We have utilized nearest neighbor classification for time series prediction [9]. In k-NN classification, the distance between a test pattern and every pattern in the training set is computed. The distance can be calculated using the Euclidean distance or the Manhattan distance. Each of the k patterns closest to the test pattern casts a vote for its class, and the class that obtains the most votes is taken to be the class of the test pattern. The k responses obtained from the classifier can also be used to compute the prediction probabilities of the different classes.
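As a rough illustration of the feature extraction and prediction steps, the sketch below computes DFT magnitudes directly from the definition and turns the k nearest-neighbor votes into class probabilities. It is an assumption-laden toy, not the WEKA-based setup used in the experiments: the function names, the Euclidean distance choice, the vote-fraction probability estimate, and the tiny training set are all hypothetical.

```python
import cmath
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def dft_magnitudes(series: Sequence[float]) -> List[float]:
    """Return the magnitudes |f_1|, ..., |f_n| of the DFT of n equally spaced samples."""
    n = len(series)
    feats = []
    for k in range(n):
        # Direct O(n^2) evaluation of the k-th DFT coefficient.
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(series))
        feats.append(abs(coeff))
    return feats


def knn_probabilities(train: Sequence[Tuple[Sequence[float], str]],
                      query: Sequence[float], k: int = 3) -> Dict[str, float]:
    """Estimate class probabilities as the fraction of the k nearest neighbors per class."""
    # Sort training patterns by Euclidean distance to the query and keep the k closest.
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return {label: count / len(nearest) for label, count in votes.items()}


# Hypothetical four-sample rainfall windows: steady ones labeled 'open', bursty ones 'close'.
train = [
    (dft_magnitudes([1, 1, 1, 1]), "open"),
    (dft_magnitudes([2, 2, 2, 2]), "open"),
    (dft_magnitudes([0, 0, 8, 0]), "close"),
    (dft_magnitudes([0, 9, 0, 0]), "close"),
]
probs = knn_probabilities(train, dft_magnitudes([0, 0, 7, 0]), k=3)
print(probs)
```

In this toy data the steady windows concentrate their energy in the first (zero-frequency) coefficient while the bursty windows spread it across all coefficients, which mirrors the behavior described for Fig. 3.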

D. Fusion

The fusion module combines the $3^n$ class probabilities produced by the prediction module. Given a set of probabilities $p_1, p_2, \ldots, p_N$ (here $N = 3^n$), we compute the arithmetic mean of these probabilities to produce the cumulative class probability $p_\mu$ as

$$p_\mu = \frac{1}{N}\sum_{i=1}^{N} p_i. \tag{1}$$


Fig. 2. Conceptual diagram of event prediction framework.

Fig. 3. Performance of Fourier Transformation features on rainfall variation. (a) High variation in rainfall. (b) Low variation in rainfall.

The standard deviation computed from the class probabilities gives the uncertainty in the prediction as

$$p_\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (p_i - p_\mu)^2}. \tag{2}$$
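Equations (1) and (2) amount to the mean and the population standard deviation of the per-combination probabilities. A minimal sketch using Python's standard library follows; the function name `fuse` and the input values are illustrative assumptions.

```python
from statistics import fmean, pstdev
from typing import Sequence, Tuple


def fuse(probabilities: Sequence[float]) -> Tuple[float, float]:
    """Return (p_mu, p_sigma) for a set of class probabilities."""
    # Eq. (1): arithmetic mean of the probabilities.
    p_mu = fmean(probabilities)
    # Eq. (2): population standard deviation (divides by N, not N - 1).
    p_sigma = pstdev(probabilities, mu=p_mu)
    return p_mu, p_sigma


# Hypothetical closure probabilities from four modified data streams.
p_mu, p_sigma = fuse([0.2, 0.4, 0.6, 0.8])
print(p_mu, p_sigma)
```

The population form (`pstdev`) is used rather than the sample form (`stdev`) because Eq. (2) normalizes by N.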

TABLE I. INFORMATION ON DATA GATHERED AND USED IN THE RESEARCH

E. Visualization

The visualization module takes into consideration the prediction probability and the uncertainty around the prediction. It produces predictions for a certain number of days ahead and plots them on a graph. The x-axis of the graph represents time and the y-axis represents prediction probability. The administrator, who takes the final decision, can decide on a different outcome and ignore the predictions. We use color on the curves to represent the decision of the administrator. The results section details the implementation.
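One way to prepare the data behind such a plot is sketched below. It is only an illustration under stated assumptions: the `PlotPoint` structure and `build_plot_points` function are hypothetical, the bar is taken to span one standard deviation on either side of the fused probability, bounds are clamped to [0, 1], and the green/red coding for open/closure follows the color convention reported in the results.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class PlotPoint:
    day: int            # position on the x-axis (time)
    probability: float  # fused prediction probability (y-axis)
    lower: float        # lower end of the uncertainty bar
    upper: float        # upper end of the uncertainty bar
    color: str          # administrator's decision: green = open, red = closure


def build_plot_points(p_mu: Sequence[float], p_sigma: Sequence[float],
                      admin_open: Sequence[bool]) -> List[PlotPoint]:
    """Combine per-day probability, uncertainty, and administrator decision."""
    points = []
    for day, (mu, sigma, is_open) in enumerate(zip(p_mu, p_sigma, admin_open), start=1):
        points.append(PlotPoint(
            day=day,
            probability=mu,
            lower=max(0.0, mu - sigma),  # assumed bar: one standard deviation below
            upper=min(1.0, mu + sigma),  # assumed bar: one standard deviation above
            color="green" if is_open else "red",
        ))
    return points


# Hypothetical two-day forecast; on day 2 the administrator overrides the prediction.
points = build_plot_points([0.75, 0.5], [0.25, 0.75], [True, False])
for point in points:
    print(point)
```

Keeping the administrator's decision as a separate field makes it easy to show cases where the decision departs from the prediction, which is the situation the color coding is meant to expose.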

Fig. 4. Performance of class balancing methods.

TABLE II. PERFORMANCE SCORES OBTAINED USING FOURIER TRANSFORMATION FEATURES AND BAYESIAN NETWORK CLASSIFIER

IV. RESULTS AND DISCUSSION

We have evaluated the performance of the proposed framework on the shellfish farm closure prediction problem. TSQAP authorities are responsible for ensuring that shellfish growing areas are free of harmful contaminants. Microbial contaminants in particular pose a major risk to public health. Rainfall washes harmful contaminants from the surrounding land into the water. The salinity level of the water influences the growth of microbial contaminants, and rainfall changes the salinity of the water. Rainfall measures are thus used by the TSQAP authorities to decide on the closure of shellfish farming operations.

We have collected time series rainfall data and farm closure status for six different shellfish farms in Tasmania (Table I). Rainfall data is obtained from SILO [10] and Bureau of Meteorology sensors [11]. We utilized the WEKA [12] implementations of different classifiers with their default parameter settings in the experiments.

Results are compared in this paper using the Matthews Correlation Coefficient (MCC) [13]. Given the true positive rate TP (percentage of correctly classified instances of the True or Closure class), the true negative rate TN (percentage of correctly classified instances of the False or Open class), the false positive rate FP (percentage of instances classified as True but actually False), and the false negative rate FN (percentage of instances classified as False but actually True), the MCC is computed as

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}. \tag{3}$$
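Equation (3) can be evaluated with a few lines of code. The sketch below assumes raw confusion-matrix counts as inputs (the usual form of the MCC) and returns 0 when the denominator vanishes, which is a common convention rather than something stated in the paper; the function name and example counts are hypothetical.

```python
import math


def mcc(tp: float, tn: float, fp: float, fn: float) -> float:
    """Matthews Correlation Coefficient from confusion-matrix entries, as in Eq. (3)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: an empty row or column of the confusion matrix gives 0.
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical imbalanced case: 10 closures and 90 open days,
# half of the closures missed and every open day classified correctly.
score = mcc(tp=5, tn=90, fp=0, fn=5)
print(round(score, 2))  # about 0.69, although overall accuracy is 95%
```

The example shows why MCC is preferred over plain accuracy on imbalanced data: missing half of the rare class is penalized noticeably even though 95 of 100 instances are classified correctly.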

MCC attains its maximum score (+1) when both the true positive and true negative rates are 100%, and its minimum score (−1) when the false positive and false negative rates are 100%. MCC is thus a good measure for balancing the accuracies of multiple classes.

It can be observed from Table I that the data sets are imbalanced, i.e., the percentage of the ‘Open’ class is much higher than that of the ‘Close’ class. We thus applied class balancing methods to address this problem. We tried both under-sampling and over-sampling approaches to class balancing [14].
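The two balancing strategies can be illustrated with simple random resampling. The paper does not specify which over-sampling and under-sampling algorithms were used, so the sketch below is an assumption: it duplicates randomly chosen minority samples (over-sampling) or randomly discards majority samples (under-sampling) until every class has the same size. All names and data are hypothetical.

```python
import random
from typing import Dict, List, Sequence, Tuple

Sample = Tuple[Sequence[float], str]  # (feature vector, class label)


def _group_by_class(data: Sequence[Sample]) -> Dict[str, List[Sample]]:
    groups: Dict[str, List[Sample]] = {}
    for sample in data:
        groups.setdefault(sample[1], []).append(sample)
    return groups


def oversample(data: Sequence[Sample], rng: random.Random) -> List[Sample]:
    """Randomly duplicate samples of smaller classes up to the largest class size."""
    groups = _group_by_class(data)
    target = max(len(items) for items in groups.values())
    balanced: List[Sample] = []
    for items in groups.values():
        balanced.extend(items)
        # Draw the missing samples with replacement from the same class.
        balanced.extend(rng.choices(items, k=target - len(items)))
    return balanced


def undersample(data: Sequence[Sample], rng: random.Random) -> List[Sample]:
    """Randomly keep only as many samples per class as the smallest class has."""
    groups = _group_by_class(data)
    target = min(len(items) for items in groups.values())
    balanced: List[Sample] = []
    for items in groups.values():
        # Draw without replacement so no sample is repeated.
        balanced.extend(rng.sample(items, target))
    return balanced


rng = random.Random(0)
data = [([float(i)], "Open") for i in range(8)] + [([float(i)], "Close") for i in range(2)]
over = oversample(data, rng)
under = undersample(data, rng)
print(len(over), len(under))  # 16 and 4
```

Over-sampling keeps every original majority-class sample, whereas under-sampling discards some of them; this is the trade-off behind the preference for over-sampling in the results.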

Fig. 4 presents the performance of the different balancing methods. Note that both balancing algorithms improve the MCC score in almost all locations. This is because of the improvement in classification accuracy of the ‘Close’ class after balancing. The improvement with balancing is relatively smaller in Montagu. This is because the percentage of the ‘Close’ class is relatively higher in Montagu than in other regions (Table I). On a head-to-head performance comparison, over-sampling and under-sampling are almost equally capable. The over-sampling process, however, preserves the underlying distribution of the majority class. We present results based on over-sampling only for the rest of the paper without loss of generality.

The MCC score, accuracy of the ‘Open’ class, accuracy of the ‘Close’ class, and mean accuracy are presented in Table II. As can be observed from Table II, both the ‘Close’ and ‘Open’ classes were recognized with high accuracy. The Fourier transform represents the time series well in this case. The reason is that the transition state (‘Close’) is represented by high-frequency components whereas a steady state, like ‘Open’, is represented mostly by low-frequency components. As the Fourier transform captures the transitions in terms of high-frequency components, it performs better than the other features.

The visualization component of the framework presents event prediction probabilities, uncertainties around the predictions, and the actual decision being taken by the administrator based on the prediction. The uncertainty around rainfall data is measured to be ±1 according to [15], and we used this as $\pm\delta$ in this experiment. Fig. 5 presents the prediction for Hastings for fourteen days. The red line in the middle separates open status (above) from closed status (below). The curve represents the probabilities (scaled to the range −1 to +1) of open/closure: 0 to +1 represents an open prediction and 0 to −1 represents a closure prediction. The vertical bar around a point on the curve represents the uncertainty in the decision. For example, on day 3 the prediction probability is +0.25, indicating an open decision. However, the uncertainty bar on day 3 shows that the lower bound of the probability is negative. This indicates that if rainfall predictions change, there is a minor possibility that it may lead to closure. The color of the curve indicates the actual decision taken by the administrator: green for open and red for closure. On day 8, the classifier predicts a positive probability but the administrator has decided to keep the farm closed. Thus the color changes to red (i.e., closure).

Fig. 5. Visualization of prediction in the proposed framework.

V. CONCLUSION

In this paper we have presented a novel framework to predict events through time series analysis of sensor data streams. The novelty of the framework lies in introducing quantitative uncertainty to sensor readings and translating it into prediction probabilities and confidences. We have evaluated the proposed framework on the shellfish farm closure prediction problem. The framework is capable of producing and visualizing event prediction probabilities, uncertainties around the predictions, and the actual decision with high accuracy.

REFERENCES

[1] R. R. Jitendra, Multi-Sensor Data Fusion With MATLAB. Boca Raton, FL, USA: CRC Press, 2010.
[2] C. D’Este, A. Rahman, and A. Turnbull, “Predicting shellfish farm closures with class balancing methods,” in Advances in Artificial Intelligence, vol. 7691, Lecture Notes in Computer Science. Berlin, Germany: Springer-Verlag, 2012, pp. 39–48.
[3] A. Rahman, C. D’Este, and G. Timms, “Dealing with missing sensor values in predicting shellfish farm closure,” in Proc. IEEE Int. Conf. Intell. Sensors, Sensor Netw. Inf. Process. (ISSNIP), Apr. 2013, pp. 351–356.
[4] C. D’Este and A. Rahman, “Similarity weighted ensembles for relocating models of rare events,” in Multiple Classifier Systems, vol. 7872, Lecture Notes in Computer Science. Berlin, Germany: Springer-Verlag, 2013, pp. 25–36.
[5] A. Rahman, C. D’Este, and J. McCulloch, “Ensemble feature ranking for shellfish farm closure cause identification,” in Proc. Workshop Mach. Learn. Sensory Data Anal., 2013, pp. 13–18, doi: 10.1145/2542652.2542655.

[6] A. Rahman, D. V. Smith, and G. Timms, “A novel machine learning approach toward quality assessment of sensor data,” IEEE Sensors J., vol. 14, no. 4, pp. 1035–1047, Apr. 2014.
[7] L. Xingye and T. Tian, “Time series recognition based on wavelet transform and Fourier transform,” in Proc. IEEE Symp. Ind. Electron. Appl., Oct. 2010, pp. 722–726.
[8] Y.-L. Wu, D. Agrawal, and A. El Abbadi, “A comparison of DFT and DWT based similarity search in time-series databases,” in Proc. Int. Conf. Inf. Knowl. Manage., New York, NY, USA, 2000, pp. 488–495.
[9] Z. Xing, J. Pei, and P. S. Yu, “Early prediction on time series: A nearest neighbor approach,” in Proc. Int. Joint Conf. Artif. Intell. (IJCAI), 2009, pp. 1297–1302.
[10] SILO Climate Data, Queensland Government, Queensland, Australia, Oct. 2012.
[11] Bureau of Meteorology, Australian Government, Oct. 2012.
[12] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten, “The WEKA data mining software: An update,” ACM SIGKDD Explorat. Newslett., vol. 11, no. 1, pp. 10–18, 2009.
[13] D. M. W. Powers, “Evaluation: From precision, recall and F-factor to ROC, informedness, markedness and correlation,” J. Mach. Learn. Technol., vol. 2, no. 1, pp. 37–63, 2011.
[14] Q. Gu, Z. Cai, L. Zhu, and B. Huang, “Data mining on imbalanced data sets,” in Proc. Int. Conf. Adv. Comput. Theory Eng., Dec. 2008.
[15] S. J. Jeffrey, J. O. Carter, K. B. Moodie, and A. R. Beswick, “Using spatial interpolation to construct a comprehensive archive of Australian climate data,” Environ. Model. Softw., vol. 16, no. 4, pp. 309–330, 2001.

Ashfaqur Rahman (SM’12) received the Ph.D. degree in information technology from Monash University, Gippsland, VIC, Australia. He has been a machine learning researcher for more than ten years. He is a Research Scientist with the Commonwealth Scientific and Industrial Research Organisation (CSIRO), Hobart, Tasmania, Australia. He has worked on specific machine learning problems, including ensemble learning and fusion, feature selection/weighting methods, genetic algorithm-based optimization, and image segmentation and classification. He is the leader of the Computational Intelligence Team at CSIRO. He has authored around 50 peer-reviewed journal articles, book chapters, and conference papers. He serves as a reviewer for prestigious conferences and journals. He was the Program Committee Chair of the 2013 International Conference on Digital Image Computing: Techniques and Applications.

John McCulloch is currently a Research Engineer with the Tasmania ICT Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Hobart, Tasmania, Australia. He is currently leading a collaborative project developing smart sensor systems for the aquaculture industry. He also has a keen interest in robotics and has played an active role in its promotion in Tasmania, being a Founding Member of both Robotics Tasmania and Robocup Junior Tasmania. In these roles, he has enhanced both community awareness and student interest in the field by managing and promoting the Tasmanian Robocup Junior State finals. He regularly supervises graduate-level students working on an autonomous catamaran project. Prior to commencing work at CSIRO, he was with the University of Tasmania, Hobart, TAS, Australia, for nearly a decade, where he was involved in teaching, research, and commercialization projects in the mechatronics and data acquisition fields.

Quazi Mamun received the B.Sc. (Eng.) degree in computer science and engineering from the Bangladesh University of Engineering and Technology, Dhaka, Bangladesh, the M.Sc. degree (by research) in global information and telecommunication studies from Waseda University, Tokyo, Japan, and the Ph.D. degree from Monash University, Clayton, VIC, Australia. He is a Lecturer with the School of Computing and Mathematics, Charles Sturt University, Wagga Wagga, NSW, Australia. In his 12 years of academic and research career, he was a Researcher and an Academic with Monash University, Waseda University, and Asia Pacific University, Dhaka. He has authored more than 50 research papers in distinguished journals, international conferences, and workshops. For his distinguished research paper, he was awarded the Best Paper Award at the ARCHAR 2010 Conference. His research interests include, but are not limited to, distributed systems, ad hoc and sensor networks, wireless networks, and network and information security. He served as the Chair and a TPC Member in many international flagship conferences, including the IEEE International Conference on Communications, the IEEE Wireless Communications and Networking Conference, the IEEE Global Communications Conference, the IEEE Securecomm, and the IEEE Tencon. He has served as an Editor of the International Journal of Computer and Information Technology. He is a Founding Member of the Advanced Networks Research Laboratory and the ICT Security Group of Charles Sturt University. He is also associated with the IEEE, IEICE, ACS, and EAI.