7th International Conference on Hydroinformatics HIC 2006, Nice, FRANCE
TIME SERIES DATA MINING: TECHNIQUES FOR ANOMALIES DETECTION IN WATER SUPPLY NETWORK ANALYSIS

R. GUELI, M. MONGIOVI
Research and Development Department, Proteo S.p.A., via S. Sofia, 65 Catania, 95123, Italy

A. FERRO, R. GIUGNO, A. PULVIRENTI
Dept. of Mathematics and Computer Science, University of Catania, viale A. Doria, 6 Catania, 95125, Italy

G. MARATI
G.O.R.I. S.p.A., via Casa Rosa, 33 Piano di Sorrento (NA), 80063, Italy

The growing attention to water supply system security calls for new tools to control water system vulnerability. Water system security depends, among other factors, on the capability of recognizing anomalous states of plants as soon as they occur. To improve this capability we propose a tool, based on Data Mining techniques, that detects faults during the remote sensing of complex water supply networks. This software builds on previous work [1] in which APriori and Episode Mining techniques were applied to recognize faults and malfunctions of water plants. In this paper we present an extension of these ideas based on a low-support/high-correlation data mining algorithm (Min-Hashing), in order to deal with time series analysis instead of simple discrete event analysis. The algorithm, which is applicable to larger databases, allows the analysis of smooth processes that are not represented by discrete events, making it possible to recognize causal relations among time-variant processes. To perform this task on a table of real values, we introduce a new similarity measure among columns. We experimentally show its good behaviour with respect to classical correlation. Moreover, by making use of randomization, Min-Hashing [2] is applied to compute a compressed signature matrix. The key point is that the "continuous" similarity of the original matrix is mapped into the "discrete" similarity of its signature [2].
The proposed algorithm has been experimentally analyzed using historical data acquired from the remote sensing of a real water supply network.

INTRODUCTION

The advent of distributed computation, communication, and sensing systems has begun to create an environment in which it is possible to access an enormous number of devices and a huge amount of data, through which complex systems can be controlled [8]. The availability of these technologies is producing a significant effect both on supervisory IT applications and on supervised physical systems. Remote sensing and control is an increasingly essential element in building such interconnected systems, providing high performance and reconfigurable operation in the presence of uncertainties (adaptability). The challenge is to move from traditional centralized control systems to a distributed heterogeneous array of devices, with complex logical and physical interactions [9]. A side effect of this novel approach is the growing need for tools able to guarantee the coherence and consistency of data and services distributed among the remote locations.

Commercial supervisory systems provide monitoring and automatic protection functions, but they are usually not able to provide early fault detection and in-depth fault diagnosis. Fault diagnostics is a key issue in the safety of drinking water networks [10]. Therefore advanced supervision methods are necessary to overcome the classical limits of value-based supervision methods [11]. Several innovative approaches and methods for supervision, early warning, fault detection and diagnosis have been recently developed [12]. Usually these methods are specialized to solve a specific kind of problem, and they are based on different knowledge bases about the system to be supervised [13].

This paper presents a novel approach based on Data Mining techniques, which allows early fault diagnosis through the analysis of the set of monitored data, and thus without the need to base the analysis on a prebuilt model of the system. The proposed algorithm is able to recognize the propagation of anomalies with respect to the baseline of the monitored water system, and in this way it is able to recognize cause-effect relations among time-variant processes.
The novelty of the algorithm lies in its capability to analyze smooth trends together with discrete events and to provide causal relations among processes represented by time series.

A real application of the proposed algorithm has been selected to illustrate the usefulness of the research results. The test case has been provided by G.O.R.I. S.p.A., which is one of the leading suppliers of water in Southern Italy. It manages the whole water cycle of one of the optimal territorial units (OTU) in the Campania Region (Southern Italy). This OTU is named ATO3 Sarnese Vesuviano and it includes all the water plants located within the wide area around Mt. Vesuvio (Naples, Italy). G.O.R.I. S.p.A. serves 76 towns and 1,425,429 inhabitants, managing 4,000 km of drinking water pipes and 2,200 km of sewer lines. The water network supplies 475,853 industrial, commercial and domestic customers.

DATA MINING

Association Rules Mining is a basic step in knowledge discovery. Typical applications require high-support association rules mining [3]. However, systems for low-support/high-correlation rules mining are needed in many fields. In computational biology, a low-support mining algorithm (i.e. Locality-Sensitive Hashing [4]) has been applied to find similarity between large collections of nucleotide and amino acid sequences [5, 6]. Another important application is intrusion detection in a network, in which rare events associated with attacks must be detected. Finally, correlations among rare events are often required in large sensor networks. In this paper a method to detect associations among physical parameters of a water system is presented. The experimental analyses are based on historical data gathered by SCADA systems.

Min-Hashing Algorithm

Let M(m,n) be a large, very sparse matrix in which columns may represent objects and rows transactions (e.g. user accesses to objects at specific times). The aim of the Min-Hashing algorithm is to find pairs of columns (objects) whose similarity is higher than a given threshold s. The similarity between two columns C_i and C_j, each seen as the set of rows in which it contains a 1, is defined by Eq. (1).

$$\mathrm{Sim}(C_i, C_j) = \frac{|C_i \cap C_j|}{|C_i \cup C_j|} \qquad (1)$$
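As a minimal illustration of Eq. (1), the following Python sketch treats each column of a boolean matrix as the set of row indices holding a 1. The function names and the toy matrix are illustrative assumptions and are not taken from the original software, which was developed in C++.

```python
def column_as_set(matrix, j):
    """Return the set of row indices i for which matrix[i][j] == 1."""
    return {i for i, row in enumerate(matrix) if row[j] == 1}


def jaccard_similarity(matrix, i, j):
    """Eq. (1): size of the intersection over size of the union."""
    ci, cj = column_as_set(matrix, i), column_as_set(matrix, j)
    union = ci | cj
    if not union:
        return 0.0  # assumption: two empty columns are treated as dissimilar
    return len(ci & cj) / len(union)


# Toy matrix (hypothetical data): 4 rows (transactions), 3 columns (objects).
M = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 0],
    [1, 1, 1],
]

# Columns 0 and 1 share rows 1 and 3; their union covers rows 0..3.
print(jaccard_similarity(M, 0, 1))  # 0.5
```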
The Min-Hashing algorithm [2] proceeds as follows. It generates k random permutations $p_j : \{1, \dots, m\} \to \{1, \dots, m\}$ of the row indices of M, where $p_j^i$ denotes the i-th element of the permutation $p_j$. Let S(k,n) be the corresponding signature matrix of M. Each entry S[i,j] is the smallest index t such that $M[p_i^t, j] = 1$, that is, the position of the first row, in the order given by the permutation $p_i$, in which column j contains a 1. In [2] the authors show that the similarity of two columns can be approximated by the similarity of the corresponding columns in the signature matrix. Consequently, for column similarity search, the signature matrix can be considered as a compressed representation of M. Even so, the signature matrix may have a huge number of columns, so computing the similarity of all pairs of columns may be unfeasible. To overcome this problem, methods to filter pairs of candidate columns may be used. Given a large matrix M(m,n), Locality-Sensitive Hashing (LSH) finds similar columns by executing two tasks: filtering and similarity computation [2].

Min-Hashing Algorithm on real values

In this section we discuss an application of Min-Hashing to real values. More details can be found in [14]. We suppose that the entries of the matrix M*(m,n) are real values in the range [0,1]. The above algorithm is adapted to this kind of input by making use of fuzzy logic in the following way. Let A be a set of elements drawn from a universe U, and let $\varphi_A : U \to [0,1]$ be a membership function. Given $x \in U$, $\varphi_A(x)$ is the degree of membership of x in A [7]. Using the function $\varphi_A$ we can see a column of the input matrix as a membership function. If M is a matrix of boolean values, we can define $C_j$ as the set of row indices i such that $M_{i,j} = 1$. Extending this concept to M*, for each element of a column $C_j^*$ of M* we define $\varphi_{C_j^*}(i) = M^*_{i,j}$. In fuzzy logic the min, max and sum operators are used to compute the intersection, the union and the cardinality, respectively. By using these operators we define in Eq. (2) the similarity of two columns $C_i^*$ and $C_j^*$ of M*(m,n) as follows:

$$\mathrm{Sim}(C_i^*, C_j^*) = \frac{\sum_{l=0}^{m} \min\left(M^*_{l,i},\, M^*_{l,j}\right)}{\sum_{l=0}^{m} \max\left(M^*_{l,i},\, M^*_{l,j}\right)} \qquad (2)$$
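The fuzzy similarity of Eq. (2) can be sketched in a few lines of Python. The function name `fuzzy_similarity` and the sample columns are illustrative assumptions. The treatment of two all-zero columns is also an assumption, since the text does not cover that case.

```python
def fuzzy_similarity(col_a, col_b):
    """Eq. (2): sum of element-wise minima over sum of element-wise maxima."""
    if len(col_a) != len(col_b):
        raise ValueError("columns must have the same number of rows")
    numerator = sum(min(a, b) for a, b in zip(col_a, col_b))
    denominator = sum(max(a, b) for a, b in zip(col_a, col_b))
    # Assumption: two all-zero columns are given similarity 0.
    return numerator / denominator if denominator > 0 else 0.0


# Hypothetical columns with values in [0, 1].
a = [0.0, 0.5, 1.0, 0.25]
b = [0.0, 0.25, 1.0, 0.75]
print(fuzzy_similarity(a, b))  # minima sum to 1.5, maxima sum to 2.25
```

On boolean columns the min and max operators reduce to set intersection and union, so the function returns the same value as Eq. (1).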
This similarity function can be used to detect relationships between physical parameters of a water system. Next we introduce a new version of the Min-Hashing algorithm that efficiently detects high-similarity pairs of columns in a matrix of continuous values in the range [0,1]. Analogously to the classical version of the algorithm, we compute k random permutations of the matrix's rows. For each column we compute the k values of the signature by permuting the values of the column according to the previously defined permutations and choosing the first row index at which the running sum of the values of the permuted column is greater than or equal to 1. Once the signature matrix is computed, we apply the classical Min-Hashing techniques to find the high-similarity column pairs in the signature matrix.

Experimental results show that this algorithm works very well when applied to continuous values in the range [0,1]. We compared the real similarity of 1000 pairs of trends with the approximated similarity computed by the algorithm. Figure 1 shows the experimental relationship between the real values and the approximated values obtained from the signature matrix. We can notice that the approximated values are typically lower than the real ones. In order to improve the quality of the algorithm, the computed similarity should be corrected by applying the inverse of the experimentally detected relation reported in Figure 1. The high value of R-square (0.93) for a quadratic regression curve shows that the error introduced by the proposed algorithm is low and can be controlled.
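The signature rule described above can be sketched as follows. This is a minimal Python illustration, not the authors' C++ implementation: the function names, the fixed random seed, the sample columns and the sentinel used when a column never sums to 1 are all assumptions.

```python
import random


def make_permutations(m, k, seed=0):
    """Draw k random permutations of the row indices 0..m-1."""
    rng = random.Random(seed)
    perms = []
    for _ in range(k):
        perm = list(range(m))
        rng.shuffle(perm)
        perms.append(perm)
    return perms


def real_minhash_signature(column, perms):
    """For each permutation, record the first position at which the running
    sum of the permuted column reaches 1."""
    signature = []
    for perm in perms:
        running = 0.0
        position = len(column)  # assumption: sentinel if the sum never reaches 1
        for pos, row in enumerate(perm):
            running += column[row]
            if running >= 1.0:
                position = pos
                break
        signature.append(position)
    return signature


def signature_similarity(sig_a, sig_b):
    """Fraction of signature entries on which two columns agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


# Hypothetical columns with values in [0, 1].
perms = make_permutations(m=6, k=100, seed=42)
x = [0.0, 0.9, 0.1, 0.0, 0.8, 0.3]
y = [0.1, 0.8, 0.0, 0.0, 0.9, 0.2]
print(signature_similarity(real_minhash_signature(x, perms),
                           real_minhash_signature(y, perms)))
```

For columns that contain only 0 and 1 the running sum reaches 1 exactly at the first 1, so the rule reduces to the classical Min-Hashing signature.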
Figure 1. Comparison between exact similarity and approximated similarity computed by the Min-Hashing algorithm (k = 1000). The polynomial approximation is $y = (x^2 + x)/2$. The coefficient of determination is 0.93.
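The correction mentioned in the text amounts to inverting the fitted curve of Figure 1. Assuming that x is the exact similarity and y the approximated one (which agrees with the observation that approximated values are typically lower), solving $x^2 + x - 2y = 0$ for the non-negative root gives $x = (-1 + \sqrt{1 + 8y})/2$. The Python sketch below uses hypothetical function names.

```python
import math


def approx_from_exact(x):
    """Fitted relation of Figure 1: y = (x^2 + x) / 2."""
    return (x * x + x) / 2.0


def exact_from_approx(y):
    """Inverse of the fitted relation on [0, 1]: x = (-1 + sqrt(1 + 8y)) / 2."""
    return (-1.0 + math.sqrt(1.0 + 8.0 * y)) / 2.0


# An approximated similarity of 0.375 maps back to 0.5,
# since (0.5**2 + 0.5) / 2 = 0.375.
print(exact_from_approx(0.375))  # 0.5
```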
TIME SERIES ANALYSIS

The proposed technique to analyze the trends of a water system is based on the observation that the physical parameters of a water system follow a cyclic trend. By analyzing these parameters it is possible to compute a daily curve, which represents the typical behavior over a day. If we look for correlations among parameters with typical trends, we may find high correlations even if the parameters are not physically or logically connected. For instance, if the typical trends of the levels of two unrelated tanks are similar, the algorithm will find a high correlation between these levels. This similarity is a "false positive": there is no association between the two tanks. In fact, if on one particular day an anomaly occurs in the first tank, it will not propagate to the second one, because the tanks are uncoupled.

To minimize these false positives we consider deviations from the typical trend. A high deviation from the typical trend can be a symptom of a fault or, in general, may represent an extraordinary condition. If a deviation from the typical value is repeatedly observed at the same time on another physical parameter, we can assert with a certain confidence that the two physical parameters are correlated. We therefore search for analogies between the variations of different physical parameters from their typical trends. In order to quantify this kind of analogy we could consider the correlation coefficient that takes the daily curve into account, shown in Eq. (3).

$$\mathrm{Corr}(x, y) = \frac{\sum_{l=0}^{m} (x_l - \bar{x}_l) \cdot (y_l - \bar{y}_l)}{\sigma_x \cdot \sigma_y} \qquad (3)$$
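A possible reading of Eq. (3) is sketched below in Python. The typical value $\bar{x}_l$ is taken as the average of all samples acquired at the same hour of the day. The text does not define $\sigma_x$ and $\sigma_y$, so the sketch assumes they are the root of the summed squared deviations from the daily curve, which keeps the result in [-1, 1]. Function names and sample data are hypothetical.

```python
import math


def daily_curve(series):
    """Typical value for each hour of the day: mean of all samples taken at
    that hour. `series` holds one value per hour, starting at hour 0."""
    sums, counts = [0.0] * 24, [0] * 24
    for l, value in enumerate(series):
        sums[l % 24] += value
        counts[l % 24] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]


def deviations(series):
    """Difference between each sample and the typical value at its hour."""
    curve = daily_curve(series)
    return [value - curve[l % 24] for l, value in enumerate(series)]


def daily_curve_correlation(x, y):
    """Eq. (3), with sigma assumed to be sqrt(sum of squared deviations)."""
    dx, dy = deviations(x), deviations(y)
    sigma_x = math.sqrt(sum(d * d for d in dx))
    sigma_y = math.sqrt(sum(d * d for d in dy))
    if sigma_x == 0.0 or sigma_y == 0.0:
        return 0.0  # assumption: no deviation means no measurable correlation
    return sum(a * b for a, b in zip(dx, dy)) / (sigma_x * sigma_y)


# Two days of hypothetical hourly data with a simultaneous anomaly at hour 30.
x = [float(l % 24) for l in range(48)]
y = [2.0 * (l % 24) for l in range(48)]
x[30] += 10.0
y[30] += 5.0
print(daily_curve_correlation(x, y))
```

Without the anomaly both series coincide with their daily curves, all deviations are zero, and the sketch returns 0 even though the raw series are perfectly correlated. This is the false-positive suppression described above.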
Here $\bar{x}_l$ and $\bar{y}_l$ represent the values of the respective daily curves at hour l. According to this definition, the deviation is computed as the difference between the value of the trend and the typical value, instead of the average value. The typical value is computed for each hour of the day as the average of the raw data acquired at that hour.

The main problem with this approach is that it is very expensive, especially when applied to a large number of physical parameters. In fact the number of pairs is proportional to the square of the number of physical parameters considered, so considering all the possible pairs may be unfeasible. Another problem is that we cannot find causal relations when the effect occurs some hours after the cause.

In order to avoid such problems, we use the similarity measure defined above instead of the correlation coefficient. For each trend the variation from the typical value is computed and then normalized by making use of a Gauss function. The normalized value is 0 when the value of the physical parameter is equal to the typical value. The greater the difference from the typical value, the closer the normalized value is to 1. The Min-Hashing algorithm is then applied to the normalized trends to discover the high-similarity pairs. The similarity between two trends defined in this way is shown in Eq. (4).

$$\mathrm{Sim}(x, y) = \frac{\sum_{l=0}^{m} \min\left(1 - e^{-\frac{(x_l - \bar{x}_l)^2}{2\sigma_x^2}},\; 1 - e^{-\frac{(y_l - \bar{y}_l)^2}{2\sigma_y^2}}\right)}{\sum_{l=0}^{m} \max\left(1 - e^{-\frac{(x_l - \bar{x}_l)^2}{2\sigma_x^2}},\; 1 - e^{-\frac{(y_l - \bar{y}_l)^2}{2\sigma_y^2}}\right)} \qquad (4)$$
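Eq. (4) can be sketched in Python as a Gauss normalization followed by the fuzzy similarity. The inputs are assumed to be deviations from the daily curve that have already been computed, and the two sigma values are passed in by the caller because the text does not state how they are estimated. All names and sample values are hypothetical.

```python
import math


def gauss_normalize(deviation, sigma):
    """Map a deviation from the typical value into [0, 1): 0 for no
    deviation, approaching 1 as the deviation grows."""
    return 1.0 - math.exp(-(deviation ** 2) / (2.0 * sigma ** 2))


def trend_similarity(dev_x, dev_y, sigma_x, sigma_y):
    """Eq. (4): fuzzy similarity of the Gauss-normalized deviations."""
    nx = [gauss_normalize(d, sigma_x) for d in dev_x]
    ny = [gauss_normalize(d, sigma_y) for d in dev_y]
    numerator = sum(min(a, b) for a, b in zip(nx, ny))
    denominator = sum(max(a, b) for a, b in zip(nx, ny))
    # Assumption: trends with no deviation at all are given similarity 0.
    return numerator / denominator if denominator > 0 else 0.0


# Simultaneous deviations of opposite sign, each three sigma wide.
print(trend_similarity([0, 0, 3, 0], [0, 0, -6, 0], sigma_x=1.0, sigma_y=2.0))  # 1.0
```

Because the deviation is squared, the measure ignores the sign of the deviation: simultaneous anomalies in opposite directions still count as similar.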
We give experimental evidence that this similarity is a good measure to evaluate the correlation between two trends. We computed the similarity and the correlation coefficient of 7600 pairs of trends and plotted the results in Figure 2. We notice that high values of the correlation coefficient are generally associated with high values of similarity, and vice versa.
Figure 2. Comparison between similarity and correlation coefficient. High values and low values of correlation coefficient are associated with high values of similarity.

In order to detect causal relations that are shifted in time, we apply a temporal window of ten hours and slide it along each trend. At each step the window is shifted by 1 hour and the maximum value inside the window is taken. The result is a sequence in which high values are extended for ten hours. High deviations of different trends that occur within the same temporal window therefore overlap partially and increase the similarity of the trends.
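The windowing step can be illustrated with a short Python sketch. The text does not say how the window is aligned, so a trailing window (the current hour and the nine before it) is assumed here. The function names and the two sample trends are hypothetical.

```python
def sliding_window_max(values, window=10):
    """out[t] is the maximum over the last `window` samples up to t
    (assumption: trailing window, truncated at the start of the series)."""
    return [max(values[max(0, t - window + 1):t + 1]) for t in range(len(values))]


def fuzzy_similarity(col_a, col_b):
    """Sum of element-wise minima over sum of element-wise maxima."""
    numerator = sum(min(a, b) for a, b in zip(col_a, col_b))
    denominator = sum(max(a, b) for a, b in zip(col_a, col_b))
    return numerator / denominator if denominator > 0 else 0.0


# Normalized deviations: a cause at hour 5 and its effect three hours later.
cause = [0.0] * 24
effect = [0.0] * 24
cause[5] = 1.0
effect[8] = 1.0

print(fuzzy_similarity(cause, effect))  # 0.0: the peaks never coincide
print(fuzzy_similarity(sliding_window_max(cause), sliding_window_max(effect)))
```

After windowing, the cause is high during hours 5 to 14 and the effect during hours 8 to 17. They overlap on 7 of the 13 hours covered, so the similarity rises from 0 to 7/13.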
EXPERIMENTAL RESULTS

The proposed algorithm, developed in C++, has been tested with a dataset of 4 GB of raw data on a Pentium IV with 512 MB of RAM running Windows XP Professional. The data, which cover a period of one year, contain 681 trends representing the processes of the delivery system of G.O.R.I. S.p.A. The dataset was first preprocessed by computing the average behavior of each trend over intervals of one hour. This gives a smaller dataset of 80 MB. The Min-Hashing algorithm computed the similarity of all the pairs of the preprocessed dataset in 20 seconds, and it was able to find interesting associations among parameters.

The algorithm found 2802 binary associations between physical parameters of the water system with a similarity greater than 0.7. Several of these associations (382) showed a similarity greater than 0.99. We recall that a higher similarity means a stronger relation between the parameters. Associations of this kind are usually due to physical or logical relations among parameters. For instance, the absorbed current of pump 2 in plant number 139 (sIIPM2_139) is strongly connected with the flow of the same pump (sFQIRPM2_139); in this case the similarity was 0.9913. Some of these associations relate different plants that are not strongly linked by physical relations. For example, the association (cLIRmin_086, cLIRPR1stop_139) has a similarity of 0.9968. In this case the algorithm recognized the association because plants 86 and 139 are managed by the same automation logic (i.e. the associated parameters represent two different commands, belonging to two faraway plants, that are launched simultaneously by the same procedure running in the SCADA system).

Weaker associations, with similarity ranging from 0.7 to 0.8, have also been detected. Associations of this kind often represent links between faraway plants. The analysis of these associations makes it possible to build the graph of dependencies among the plants.
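One way to derive such a dependency graph from the detected associations is sketched below in Python. It assumes that the plant number is the suffix after the last underscore of a parameter name, as the two examples quoted above suggest. The function names are hypothetical, and the list contains only the two associations reported in the text.

```python
from collections import defaultdict


def plant_of(parameter):
    """Assumption: the plant number is the suffix after the last underscore."""
    return parameter.rsplit("_", 1)[1]


def plant_graph(associations, threshold=0.7):
    """Link two plants when a parameter pair above the threshold spans them."""
    graph = defaultdict(set)
    for param_a, param_b, similarity in associations:
        plant_a, plant_b = plant_of(param_a), plant_of(param_b)
        if similarity > threshold and plant_a != plant_b:
            graph[plant_a].add(plant_b)
            graph[plant_b].add(plant_a)
    return dict(graph)


associations = [
    ("sIIPM2_139", "sFQIRPM2_139", 0.9913),     # same plant: no edge
    ("cLIRmin_086", "cLIRPR1stop_139", 0.9968),  # links plants 086 and 139
]
print(plant_graph(associations))
```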
Furthermore, the algorithm pointed out a strong connection between plants 86 and 88 by giving as output several associations between their parameters with a similarity greater than 0.7. These plants are, respectively, the downstream pumping station and an intermediate pumping station of a pipeline filling tank number 84. Further associations among plants number 17, 20, 45 and 83 revealed the scheduled maintenance activity carried out on these plants on 24 May 2004.

In conclusion, the selected test case has demonstrated the capability of the algorithm to analyze large datasets of time series in order to recognize the propagation of anomalies with respect to the baseline of the monitored water system.

ACKNOWLEDGMENTS

This work is supported by the PIA INNOVAZIONE program of the Ministero delle Attivita Produttive (MAP), under contract N. E01/0431/P81247-12, Decreto N. 127.370 of 05/08/2003.
REFERENCES

[1] A. Ferro, R. Giugno, A. Pulvirenti, R. Gueli, M. Mongiovi, A. Elia and D. Paparone, "Probabilistic A-Priori and Episode Mining Techniques for Intelligent Management of Water Supply Networks", Proc. of Hydroinformatics, Vol. 2, (2004), pp 1727-1734.
[2] E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J. D. Ullman and C. Yang, "Finding Interesting Associations without Support Pruning", Knowledge and Data Engineering, Vol. 13, (2001), pp 64-78.
[3] S. Brin, R. Motwani, J. D. Ullman and S. Tsur, "Dynamic itemset counting and implication rules for market basket data", Proceedings of the ACM SIGMOD Conference on Management of Data, (1997).
[4] P. Indyk and R. Motwani, "Approximate nearest neighbors: Towards removing the curse of dimensionality", Proceedings of the 30th ACM Symposium on Theory of Computing, (1998).
[5] J. Buhler, "Efficient large-scale sequence comparison by locality-sensitive hashing", Bioinformatics, Vol. 17, No. 5, (2001), pp 419-428.
[6] E. Halperin, J. Buhler, R. Karp, R. Krauthgamer and B. Westover, "Detecting protein sequence conservation via metric embeddings", Bioinformatics, Vol. 19, No. 1, (2003), pp 122-129.
[7] L. A. Zadeh, J. Yen and R. Langari, "Industrial Applications of Fuzzy Logic and Intelligent Systems", (1995).
[8] National Research Council, "Embedded, Everywhere: A Research Agenda for Networked Systems of Embedded Computers", National Academy Press, (2001).
[9] R. M. Murray, K. J. Astrom, S. P. Boyd, R. W. Brockett and G. Stein, "Future directions in control in an information-rich world", Control Systems Magazine, IEEE, Vol. 23, No. 2, (2003), pp 20-33.
[10] European Commission, "Workshop on future and emerging control systems", (2000). Available at ftp://ftp.cordis.lu/pub/ist/docs/ka4/report_controlws.pdf.
[11] R. Isermann, "Model based fault detection and diagnosis - Status and Application", 16th IFAC Symposium on Automatic Control in Aerospace, St. Petersburg, Russia, (2004).
[12] R. Isermann, "Supervision, fault-detection and fault diagnosis methods. An introduction", Control Eng. Practice, Vol. 5, No. 5, (1997), pp 639-652.
[13] G. Gallone, R. Gueli, A. Patti and A. Tropea, "Improving automatic control of complex water systems, using AI techniques: design of an expert component for the alarms analysis", Water Supply, IWA Publishing, Vol. 4, No. 5-6, (2005), pp 375-381.
[14] A. Ferro, R. Giugno, A. Pulvirenti, R. Gueli and M. Mongiovi, "Advanced Low-support/high-correlation algorithms on real values", preprint 2006.