International Symposium on Artificial Intelligence and Signal Processing (AISP)

Clustering of Multivariate Time Series Data Using Particle Swarm Optimization

Abbas Ahmadi

Atefeh Mozafarinia

Azadeh Mohebi

Department of Industrial Engineering and Management Systems Amirkabir University of Technology Tehran, Iran Email: [email protected]

Department of Industrial Engineering and Management Systems Amirkabir University of Technology Tehran, Iran Email: [email protected]

Iranian Research Institute of Information Science and Technology (IRANDOC) Tehran, Iran Email: [email protected]

Abstract— Particle swarm optimization (PSO) is a practical and effective optimization approach that has recently been applied for data clustering in many applications. While various non-evolutionary optimization and clustering algorithms have been applied to cluster multivariate time series in applications such as customer segmentation, they usually provide poor results due to their dependency on initial values and their poor performance in handling multiple objectives. In this paper, a particle swarm optimization algorithm is proposed for clustering multivariate time series data. Since time series sometimes differ in length and usually contain missing data, the regular Euclidean distance and dynamic time warping cannot be applied to such data to measure similarity. Therefore, a hybrid similarity measure based on principal component analysis and the Mahalanobis distance is applied in order to handle these limitations. A comparison between the results of the proposed method and similar methods in the literature shows the superiority of the proposed method.

I. INTRODUCTION

Clustering time series data is an effective way to discover hidden patterns, identify similarities and predict future values in temporal data [1]. However, due to the high dimensionality of such data, the clustering task usually requires a lot of memory, and this becomes a serious challenge when dealing with multivariate time series (MVTS) data. Many real-world applications deal with MVTS data; for instance, in customer behavior tracking and analysis, the purchasing behavior of a customer during a specific period of time is given in the form of a multivariate time series. MVTS data are also observed in health applications, such as tracking the status of a patient with a specific disease over a period of time. Many researchers have applied algorithms such as k-means, hierarchical clustering and expectation maximization to cluster time series data in order to distinguish similarities between the data [2], [3], [4]. However, such algorithms, and other non-evolutionary algorithms, usually provide poor results due to their high dependency on the initial solution, convergence to local optima, unsatisfactory performance in handling multiple objectives, and impractical behavior on large datasets [5], [6].

A suitable approach to overcome the limitations of non-evolutionary clustering methods is to use the notion of collective intelligence. Particle swarm optimization (PSO) is one of the most powerful evolutionary algorithms based on swarm intelligence and has been successfully applied for clustering [7], [8], [9], [10], [11], [12], [13]. Although numerous studies have clustered static data using PSO, there is no specific research on clustering MVTS data with PSO. One challenge in applying PSO to MVTS clustering is choosing a suitable metric to measure the similarity between series. The most common similarity measures for time series are based on the Euclidean distance and dynamic time warping (DTW) [14], [15]. However, the Euclidean distance requires the time series to have the same length and exactly the same dimensions, which does not hold for MVTS data. While DTW can be used to align time series of different lengths, it does not provide appropriate results in the presence of missing data. Moreover, the correlation between time series is another issue that is mostly ignored when using the Euclidean distance. A few studies have focused on defining a similarity measure or a new algorithm for clustering MVTS data [15], [16], [17]; however, to the best of our knowledge, the two issues have not been addressed together yet. Therefore, in this article we try to eliminate the limitations of conventional MVTS clustering algorithms by applying PSO for this purpose, together with a hybrid similarity measure [15] based on principal component analysis (PCA). This paper is organized as follows. Section II gives a brief introduction to the PSO algorithm. Section III describes the proposed approach for clustering MVTS data based on PSO and defines the hybrid similarity measure. Section IV evaluates the proposed method, compares it with similar methods and presents the results, and the conclusion comes at the end.

© IEEE

II. PARTICLE SWARM OPTIMIZATION

PSO is an iterative, population-based algorithm for optimization problems that originated from the flocking behavior of birds. In PSO, a population is composed of a set of particles, where each particle represents a potential solution to the optimization problem. Two properties are considered for each particle: the position, showing the situation of the particle in the solution space, and the velocity, indicating how the particle moves to a new position within each iteration. We define $x_i^{(t)}$ and $v_i^{(t)}$ as the position and velocity of particle $i$ at iteration $t$, respectively:

$x_i^{(t)} = (x_{i1}^{(t)}, \dots, x_{id}^{(t)})$,   (1)

$v_i^{(t)} = (v_{i1}^{(t)}, \dots, v_{id}^{(t)})$,   (2)

where $d$ is the dimension of the solution space. Within each iteration, the position and velocity of each particle are updated based on the following equations:

$v_i^{(t+1)} = w\, v_i^{(t)} + c_1 r_1 \big(p_i^{(t)} - x_i^{(t)}\big) + c_2 r_2 \big(p^{*(t)} - x_i^{(t)}\big)$,   (3)

$x_i^{(t+1)} = x_i^{(t)} + v_i^{(t+1)}$,   (4)

where $w$ indicates the inertia parameter controlling the effect of previous velocities. The parameters $c_1$ and $c_2$ are the cognitive and social components, respectively, and $r_1$ and $r_2$ are randomly sampled from a uniform distribution on the interval [0, 1]. Then the personal best of particle $i$ at every iteration is updated as:

$p_i^{(t+1)} = x_i^{(t+1)}$ if $F(p_i^{(t)}) \geq F(x_i^{(t+1)})$, and $p_i^{(t+1)} = p_i^{(t)}$ otherwise.   (5)

The personal best of each particle, $p_i^{(t)}$, is the best position of the particle among all passed iterations. Also, we define the global best $p^{*(t)}$ as the best position obtained so far among all particles. The PSO procedure starts with an initial population of $P$ particles. The aim is to optimize a given objective function $F$, also known as the fitness function. The initial position and velocity of each particle are chosen randomly. Obviously, the personal best of each particle at the beginning is equal to its initial position, i.e. $p_i^{(1)} = x_i^{(1)}$. Consequently, the global best solution for a minimization problem is the best of the personal best solutions, i.e. $p^{*(t)} = \arg\min_{i \in \{1, \dots, P\}} F(p_i^{(t)})$.

Algorithm 1 PSO algorithm
  Initialize a swarm of $P$ particles
  $t \leftarrow 1$
  repeat
    for $i = 1$ to $P$ do
      Update the velocity and position of particle $i$ using Eqs. (3) and (4)
      Update the personal best of particle $i$ at iteration $t+1$ using Eq. (5)
    end for
    Update the global best as $p^{*(t)} = \arg\min_{i \in \{1, \dots, P\}} F(p_i^{(t)})$
    $t \leftarrow t + 1$
  until stopping criteria are satisfied

The PSO procedure is summarized in Algorithm 1. PSO has been successfully applied for data clustering in many applications such as image segmentation, document clustering and speech recognition [5], [11], [12]. A PSO clustering algorithm starts with an initial population of $P$ particles, where each particle contains $K$ (the number of clusters) elements. Then, as the velocity and position of each particle are updated at each iteration, the algorithm converges to the final solution, i.e. the $K$ cluster centers. PSO-based clustering is less sensitive to the initial solution due to its population-based nature; it performs a global search of the solution space and is more likely to find a near-optimal solution. Moreover, it performs well on multi-objective optimization problems. In order to apply PSO to MVTS clustering, we need to define the particle representation and the fitness function. In this paper, we apply the single-swarm clustering algorithm [5] with a hybrid similarity measure proposed by Singhal et al. [15]. The next section describes the proposed approach in detail.

III. PROPOSED APPROACH FOR MVTS CLUSTERING

In every clustering approach based on PSO, three issues should be addressed: representing the particles, defining the similarity measure, and defining the fitness function. In the following, each of these issues is described for MVTS data clustering based on PSO.
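As a concrete illustration of the update rules (3)-(5), the following sketch minimizes a toy objective. It is a minimal, generic PSO in Python; the objective function and the parameter values (`w`, `c1`, `c2`, swarm size, iteration count) are our own example choices, not the paper's implementation.

```python
import random

def pso_minimize(f, dim=2, n_particles=10, iters=100,
                 w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = random.Random(seed)
    # Random initial positions in [-1, 1]^dim; zero initial velocities.
    x = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(n_particles)]
    v = [[0.0] * dim for _ in range(n_particles)]
    p = [xi[:] for xi in x]          # personal bests, p_i(1) = x_i(1)
    g = min(p, key=f)[:]             # global best = best personal best
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity update, Eq. (3): inertia + cognitive + social terms.
                v[i][d] = (w * v[i][d]
                           + c1 * r1 * (p[i][d] - x[i][d])
                           + c2 * r2 * (g[d] - x[i][d]))
                # Position update, Eq. (4).
                x[i][d] += v[i][d]
            # Personal-best update, Eq. (5) (minimization).
            if f(x[i]) < f(p[i]):
                p[i] = x[i][:]
        g = min(p, key=f)[:]         # global-best update
    return g

# Minimize the sphere function as a toy example.
best = pso_minimize(lambda z: sum(t * t for t in z))
```

Because the swarm keeps the best personal best ever seen, the returned solution can only improve across iterations.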

A. Particle representation

At the beginning, an initial population is formed, consisting of $P$ distinct particles. Each particle includes $K$ elements, where $K$ denotes the number of clusters. Therefore, the $i$th particle at iteration $t$ can be represented as

$x_i^{(t)} = (C_{i1}^{(t)}, \dots, C_{iK}^{(t)})$,   (6)

where $C_{ik}^{(t)}$ is the $k$th cluster center and $i = 1, \dots, P$. Each $C_{ik}^{(t)}$ is actually an MVTS of dimension $m \times n$, where $m$ represents time and $n$ indicates the number of variables observed at a specific time. It is notable that the length in time, $m$, may not be the same for all MVTS data. Since we want to cluster MVTS data, each particle consists of $K$ multivariate time series, and consequently there are $K \times P$ time series in each population.
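The representation of Eq. (6) can be sketched as a plain data structure: a particle is a list of $K$ cluster centers, each an $m \times n$ array whose time length $m$ may differ per center. All sizes and names below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n_vars = 4      # n: variables observed at each time step
K, P = 2, 10    # K clusters per particle, P particles in the swarm

def random_center():
    """A random MVTS cluster center; its time length m is deliberately variable."""
    m = int(rng.integers(40, 61))
    return rng.normal(size=(m, n_vars))

# The swarm holds K*P time series in total, as noted in the text.
swarm = [[random_center() for _ in range(K)] for _ in range(P)]
```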

B. Hybrid similarity measure

The proposed approach to clustering MVTS data is based on the single-swarm clustering algorithm [5] and a hybrid similarity measure. The hybrid measure combines a PCA-based metric and a distance measure, both proposed in [15]. The PCA-based metric, $S_{PCA}$, measures the similarity between two MVTS based on their first $r$ principal components. For two given MVTS $y$ and $z$, it is defined as follows [15]:

$S_{PCA}(y, z) = \dfrac{\sum_{i=1}^{r} \sum_{j=1}^{r} \lambda_i^{(y)} \lambda_j^{(z)} \cos^2 \theta_{ij}}{\sum_{i=1}^{r} \lambda_i^{(y)} \lambda_i^{(z)}}$,   (7)

where $\lambda_i^{(y)}$ and $\lambda_j^{(z)}$ are the eigenvalues of $y$ and $z$, respectively, and $\theta_{ij}$ is the pairwise angle between the principal components of $y$ and $z$. The distance measure is computed as follows [15]:

$\Phi_{y,z} = \sqrt{(\bar{y} - \bar{z})^{T} \Sigma^{-1} (\bar{y} - \bar{z})}$,   (8)

$S_{dist}(y, z) = \dfrac{2}{\sqrt{2\pi}} \int_{\Phi_{y,z}}^{\infty} e^{-u^2/2}\, du$,   (9)

where $\bar{y}$ and $\bar{z}$ are obtained by averaging over the rows (time dimension) of the MVTS matrices $y$ and $z$, and $\Sigma$ is the covariance matrix. The hybrid similarity measure, $S_H$, between two MVTS is then computed by combining the two metrics above:

$S_H(y, z) = \alpha\, S_{PCA}(y, z) + \beta\, S_{dist}(y, z)$,   (10)

where $\alpha$ and $\beta$ reflect the contribution of each metric to the final similarity measure and are chosen such that $\alpha + \beta = 1$.

C. Fitness function

One common way to define the fitness function in single-swarm-based clustering is to use the compactness of each cluster [5]. The compactness of a cluster reflects the degree of similarity between the cluster members and the cluster center. Here, we define the fitness function as the aggregation of the compactness of all clusters. By minimizing this fitness function in the PSO algorithm, the dissimilarity between each cluster member and its corresponding center is decreased. Thus, the distance between each data point $y_j^{(k)}$ in the $k$th cluster and its corresponding cluster center $C^{(k)}$ is calculated, and the compactness of each cluster is the sum of these distances. The fitness function $F$ is then calculated by summing over the compactness of all clusters. Thus, at a given iteration, the governing fitness function is given by

$F = \sum_{k=1}^{K} \sum_{j=1}^{n_k} d\big(y_j^{(k)}, C^{(k)}\big)$,   (11)

where $n_k$ indicates the number of data in cluster $k$ and $K$ is the total number of clusters. It is notable that the iteration index $t$ is not shown explicitly in the above formula, for the sake of simplicity.

D. PSO algorithm for MVTS clustering

The PSO algorithm for clustering MVTS data is presented in Algorithm 2. The algorithm is based on the fitness function $F$ defined in Eq. (11) and the hybrid similarity measure described in Eq. (10).

Algorithm 2 PSO-based clustering algorithm for MVTS data
  Initialize a swarm of $P$ particles
  $t \leftarrow 1$
  repeat
    for $i = 1$ to $P$ do
      Update the velocity and position of particle $i$ using Eqs. (3) and (4)
      Update the personal best of particle $i$ at iteration $t+1$ using Eq. (5)
    end for
    Update the global best as $p^{*(t)} = \arg\min_{i \in \{1, \dots, P\}} F(p_i^{(t)})$
    $t \leftarrow t + 1$
  until stopping criteria are satisfied

IV. RESULTS AND EVALUATION

We have applied the proposed approach to two different datasets obtained from the Physionet repository. Fig. 1 shows one selected sample from each dataset. It is clear from the figure that each sample is an MVTS with different features observed over a period of time.
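Eqs. (7)-(10) can be sketched in Python/NumPy as follows. This is an illustrative reading of the similarity factors, not the authors' code: we take principal components from each series' own covariance matrix, use `z`'s covariance in the Mahalanobis distance with a pseudo-inverse as a safe fallback, and evaluate the Gaussian tail integral of Eq. (9) via the identity $\frac{2}{\sqrt{2\pi}}\int_{\Phi}^{\infty} e^{-u^2/2}du = \mathrm{erfc}(\Phi/\sqrt{2})$. All function names are our own.

```python
import numpy as np
from math import erfc, sqrt

def s_pca(y, z, r=2):
    """PCA similarity factor, Eq. (7): eigenvalue-weighted squared cosines
    between the first r principal components of two (time x vars) MVTS."""
    def top_components(a):
        lam, vec = np.linalg.eigh(np.cov(a, rowvar=False))
        order = np.argsort(lam)[::-1][:r]        # largest-eigenvalue components
        return lam[order], vec[:, order]
    lam_y, v_y = top_components(y)
    lam_z, v_z = top_components(z)
    cos2 = (v_y.T @ v_z) ** 2                    # cos^2 of pairwise angles
    return (np.outer(lam_y, lam_z) * cos2).sum() / (lam_y * lam_z).sum()

def s_dist(y, z):
    """Distance similarity factor, Eqs. (8)-(9): one-sided Gaussian tail of
    the Mahalanobis distance between the series means."""
    diff = y.mean(axis=0) - z.mean(axis=0)       # averages over the time axis
    sigma_inv = np.linalg.pinv(np.cov(z, rowvar=False))
    phi = sqrt(float(diff @ sigma_inv @ diff))   # Eq. (8)
    return erfc(phi / sqrt(2))                   # Eq. (9)

def s_hybrid(y, z, alpha=0.6, beta=0.4):
    """Hybrid similarity, Eq. (10); default weights as tuned in Section IV."""
    return alpha * s_pca(y, z) + beta * s_dist(y, z)

rng = np.random.default_rng(0)
a = rng.normal(size=(60, 4))   # two MVTS with different time lengths
b = rng.normal(size=(45, 4))
sim = s_hybrid(a, b)
```

Note that the two series need not have the same number of rows, which is exactly the property that makes the hybrid measure suitable for MVTS of unequal length.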



The first dataset, ICU, contains the values of four different features affecting blood pressure. The features are observed over 60 seconds for 38 persons. The dataset contains two clusters, reflecting persons with a drop in their blood pressure and persons with normal blood pressure. The second dataset, called GIAT, contains eight features related to Parkinson's disease. The value of each feature is measured over 60 seconds through a sensor attached to a specific location on the body. The GIAT dataset contains 47 samples, consisting of normal persons and persons with Parkinson's disease.

[Fig. 1. MVTS sample of each dataset: (a) ICU dataset, (b) GIAT dataset. Axes: time vs. feature value.]

The algorithm was applied ten times to each dataset, and the average cluster purity [22] for the ICU and GIAT datasets is 0.6841 and 0.6537, respectively. To study the performance of the proposed approach and tune the parameters $\alpha$ and $\beta$, we applied the PSO algorithm for several values of these two parameters and measured the average cluster purity for each case. Table I shows the average cluster purity corresponding to different values of $\alpha$ and $\beta$. As can be seen from this table, the best cluster purity is obtained when $\alpha = 0.6$ and $\beta = 0.4$.

TABLE I. AVERAGE CLUSTER PURITY FOR DIFFERENT VALUES OF $\alpha$ AND $\beta$

$\alpha$ | $\beta$ | Cluster purity
0.1 | 0.9 | 0.5512
0.2 | 0.8 | 0.6134
0.3 | 0.7 | 0.5
0.4 | 0.6 | 0.5996
0.5 | 0.5 | 0.6841
0.6 | 0.4 | 0.6901
0.7 | 0.3 | 0.5832
0.8 | 0.2 | 0.6267
0.9 | 0.1 | 0.5349

To evaluate the performance of the proposed approach, we compared its results with three clustering approaches: 1) K-means with the Euclidean distance, 2) K-means with the hybrid similarity measure, and 3) PSO with the Euclidean distance, using the two above-mentioned datasets. The comparison results in terms of average cluster purity are presented in Table II. In Fig. 2, the average cluster purity of the proposed clustering approach is compared with the three above-mentioned approaches.

TABLE II. CLUSTER PURITY OF THE PROPOSED APPROACH COMPARED WITH OTHER APPROACHES

Dataset | K-means, Euclidean dist. | K-means, hybrid measure | PSO, Euclidean dist. | PSO, hybrid measure (proposed)
ICU | 0.5315 | 0.4368 | 0.5263 | 0.6841
GIAT | 0.5511 | 0.3957 | 0.5499 | 0.6537

We have also studied the distribution of the data in each cluster for the proposed approach. As we approximately know the number of data in each cluster of each dataset, we compared the output of the clustering approaches with the original distribution of the data within the clusters of each dataset. Fig. 3 shows the obtained results. According to this figure, the proposed approach is more successful in clustering the data than the other approaches. In other words, the proposed approach exhibits hidden, unknown patterns within the time series data more effectively.
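The cluster purity used for evaluation above can be computed as follows: for each cluster, count the members carrying its majority true label, then divide the total by the number of samples. This is a standard implementation consistent with the measure cited as [22]; the labels below are made up for illustration, not the paper's data.

```python
from collections import Counter

def cluster_purity(assigned, true_labels):
    """Fraction of data agreeing with the majority true label of their cluster."""
    clusters = {}
    for c, t in zip(assigned, true_labels):
        clusters.setdefault(c, []).append(t)
    # Sum the size of the majority label group in each cluster.
    majority = sum(Counter(members).most_common(1)[0][1]
                   for members in clusters.values())
    return majority / len(true_labels)

# Hypothetical assignment of 6 samples into 2 clusters (not the paper's data):
purity = cluster_purity([0, 0, 0, 1, 1, 1], ["a", "a", "b", "b", "b", "a"])
```

Purity lies in (0, 1], and equals 1 exactly when every cluster is label-homogeneous.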

V. CONCLUSION

In this paper we have proposed a clustering approach based on particle swarm optimization for clustering multivariate time series data. Several non-evolutionary algorithms have been applied for multivariate data clustering; however, they usually provide unsatisfactory results since they depend highly on initial values. Rather than using the conventional Euclidean distance to measure the similarity between data, we proposed a hybrid similarity measure. This measure combines two components: a PCA-based similarity measure that captures the correlation between the time series, and a distance measure proposed in [15]. The hybrid measure is very practical for multivariate time series clustering since it does not require the time series to have the same length. We applied the proposed approach to two different health datasets and examined the results using the cluster purity measure. The proposed clustering approach was also compared with similar approaches in terms of average cluster purity and the distribution of the data in the clusters, and the comparison results show that the proposed approach outperforms them.

[Fig. 2. Comparing the cluster purity of the proposed approach with the other approaches (K-means, PSO, K-means with hybrid measure, PSO with hybrid measure) for the ICU and GIAT datasets.]

[Fig. 3. Comparing the proposed method in terms of the number of data in each cluster with the other methods for the ICU and GIAT datasets: (a) original distribution, (b) K-means with hybrid measure, (c) proposed approach.]

REFERENCES

[1] J. Lin, M. Vlachos, E. Keogh, D. Gunopulos, Iterative Incremental Clustering of Time Series, 9th Conference on Extending Database Technology, 2004, pp. 106-122.
[2] Y. Xiong, D. Y. Yeung, Mixtures of ARMA Models for Model-Based Time Series Clustering, IEEE International Conference on Data Mining, 2002, pp. 717-720.
[3] X. Wang, K. A. Smith, R. J. Hyndman, Dimension Reduction for Clustering Time Series Using Global Characteristics, Computational Science - ICCS, Springer, 2005, pp. 792-795.
[4] F. Pattarin, S. Paterlini, T. Minerva, Clustering financial time series: an application to mutual funds style analysis, Computational Statistics and Data Analysis, Vol. 47, No. 2, 2004, pp. 353-372.
[5] A. Ahmadi, F. Karray, S. K. Mohamed, Flocking based approach for data clustering, Natural Computing, Vol. 9, No. 3, 2010, pp. 767-791.
[6] S. Rani, G. Sikka, Recent Techniques of Clustering of Time Series Data: A Survey, International Journal of Computer Applications, Vol. 52, No. 15, 2012.
[7] X. Xiao, E. R. Dow, R. Eberhart, Z. B. Miled, R. J. Oppelt, A hybrid self-organizing maps and particle swarm optimization approach, Concurrency and Computation: Practice and Experience, Vol. 16, No. 9, 2003, pp. 895-915.
[8] X. Xiao, E. R. Dow, R. Eberhart, Z. B. Miled, R. J. Oppelt, Gene clustering using self-organizing maps and particle swarm optimization, International Parallel Processing Symposium, Nice, France, 2003, pp. 22-26.
[9] M. O'Neill, Self-Organizing Swarm (SOSwarm): A Particle Swarm Algorithm for Unsupervised Learning, IEEE Congress on Evolutionary Computation, 2006, pp. 634-639.
[10] A. O. Sharma, Determining Cluster Boundaries using Particle Swarm Optimization, World Academy of Science, Engineering and Technology, 2008, pp. 1106-1110.
[11] X. Cui, L. Gao, T. E. Potok, A flocking based algorithm for document clustering analysis, Journal of Systems Architecture, Vol. 52, No. 8-9, 2006, pp. 505-515.
[12] M. Omran, A. P. Engelbrecht, A. Salman, Particle swarm optimization method for image clustering, International Journal of Pattern Recognition and Artificial Intelligence, Vol. 19, No. 3, 2005, pp. 297-321.
[13] T. Oates, L. Firoiu, P. Cohen, Clustering time series with hidden Markov models and dynamic time warping, IJCAI Workshop on Neural, Symbolic and Reinforcement Learning Methods for Sequence Learning, 1999, pp. 17-21.
[14] K. Kalpakis, D. G. Puttagunta, Distance Measures for Effective Clustering of ARIMA Time Series, IEEE International Conference on Data Mining, San Jose, CA, 2001, pp. 273-280.
[15] A. Singhal, D. E. Seborg, Clustering multivariate time series data, Journal of Chemometrics, Vol. 19, No. 8, 2005, pp. 427-438.
[16] Y. Huang, T. J. McAvoy, J. Gertler, Fault isolation in nonlinear systems with structured partial principal component analysis and clustering analysis, Canadian Journal of Chemical Engineering, Vol. 78, No. 3, 2000, pp. 569-577.
[17] X. Wang, A. Wirth, L. Wang, Structure-based Statistical Features and Multivariate Time Series Clustering, 7th IEEE International Conference on Data Mining, 2007, pp. 351-360.
[18] C. Li, P. Zhai, S. Zheng, B. Prabhakaran, Segmentation and recognition of multi-attribute motion sequences, 12th Annual ACM International Conference on Multimedia, ACM, 2004, pp. 836-843.
[19] K. Yang, C. Shahabi, A PCA-based similarity measure for multivariate time series, 2nd ACM International Workshop on Multimedia Databases, Washington, DC, USA, 2004, pp. 65-74.
[20] K. Yang, H. Yoon, C. Shahabi, A supervised feature subset selection technique for multivariate time series, Workshop on Feature Selection for Data Mining: Interfacing Machine Learning with Statistics, 2005, pp. 92-101.
[21] K. Yang, C. Shahabi, On the stationarity of multivariate time series for correlation-based data analysis, Fifth IEEE International Conference on Data Mining, 2005, pp. 805-808.
[22] Y. Zhao, G. Karypis, Evaluation of hierarchical clustering algorithms for document datasets, Eleventh International Conference on Information and Knowledge Management, ACM, 2002, pp. 515-524.