Integrating Time Alignment and Self-Organizing Maps for Classifying Curves
Elvira Romano and Germana Scepi
Dipartimento di Matematica e Statistica – Università “Federico II” di Napoli, Via Cintia, Monte S. Angelo – 80126 Napoli
Keywords: Classification, functional data, time series, dissimilarity.
1. Introduction
Clustering time series has become in recent years a topic of great interest in a wide range of fields. The various approaches differ mainly in their notion of similarity (for a review see Focardi, 2001). Most studies use the Euclidean distance or some variation of it because it is easy to implement. However, there are many applications where Euclidean distances between raw data have been shown to fail to capture the notion of similarity. The principal reason is that the Euclidean distance is very sensitive to small distortions in the time axis, as in the case of two sequences that have approximately the same overall shape but are not aligned in time. A method that allows an elastic shifting of the time axis is therefore desirable in order to detect similar shapes with different phases. For this purpose, the Dynamic Time Warping (DTW) distance has been introduced (Berndt, Clifford, 1994), a technique that was already known in the speech processing community (Sakoe, Chiba, 1978; Rabiner, Juang, 1993). Nevertheless, the DTW algorithm can produce incorrect results in the presence of salient features or noise in the data, and its time complexity is a problem, in that “…performance on very large databases may be a limitation”. Morlini et al. (2005) propose a modification of this algorithm that works on a smoothed version of the data and show that their approach yields points that are less noisy and that depend on the overall shape of the series. The clustering algorithms used in that approach are hierarchical clustering and K-means. The current paper proposes a new approach based on the use of the DTW distance in a Self-Organizing Map algorithm (Kohonen, 2001) with the aim of classifying a set of curves.
To show the results of this approach, we illustrate an application of our method to simulated data; in the extended version of the paper we will present an application to real topographic data.
2. A new approach for classifying curves
Suppose we have several time series. Let us consider, for example, two time series Q and C of length n and m respectively:
$$Q = q_1, q_2, \ldots, q_i, \ldots, q_n$$
$$C = c_1, c_2, \ldots, c_j, \ldots, c_m$$
The first step of our approach consists in smoothing each series by a piecewise linear or cubic spline. Therefore our starting data are a set of curves, in the example:
$$Q' = q'_1, q'_2, \ldots, q'_i, \ldots, q'_n$$
$$C' = c'_1, c'_2, \ldots, c'_j, \ldots, c'_m$$
To align the two obtained sequences using DTW, we construct an n-by-m matrix where the (i, j)-th element contains the Euclidean distance $d(q'_i, c'_j)$ between the two points $q'_i$ and $c'_j$. Each matrix element (i, j) corresponds to the alignment between these two points. A warping path, W, is a contiguous set of matrix elements that defines a mapping between Q' and C'. The k-th element of W is defined as $w_k = (i, j)_k$, so we have:
$$W = w_1, w_2, \ldots, w_k, \ldots, w_K$$
max(m, n) ≤ K ≤ m + n -1
The warping path is typically subject to several constraints:
- Boundary conditions: w_1 = (1,1) and w_K = (n,m). Simply stated, this requires the warping path to start and finish in diagonally opposite corner cells of the matrix.
- Continuity: Given w_k = (a,b), then w_{k-1} = (a',b'), where a − a' ≤ 1 and b − b' ≤ 1. This restricts the allowable steps in the warping path to adjacent cells (including diagonally adjacent cells).
- Monotonicity: Given w_k = (a,b), then w_{k-1} = (a',b'), where a − a' ≥ 0 and b − b' ≥ 0. This forces the points in W to be monotonically spaced in time.
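The three constraints can be checked mechanically for any candidate path. The following is a minimal sketch, assuming the path is given as a list of 1-based (i, j) cells of the n-by-m matrix and therefore ends in cell (n, m); the function name `is_valid_warping_path` is hypothetical and not part of the original method.

```python
def is_valid_warping_path(path, n, m):
    """Check boundary, continuity and monotonicity for a list of 1-based (i, j) cells."""
    # Boundary conditions: start in (1, 1), finish in the opposite corner (n, m).
    if not path or path[0] != (1, 1) or path[-1] != (n, m):
        return False
    for (a_prev, b_prev), (a, b) in zip(path, path[1:]):
        da, db = a - a_prev, b - b_prev
        if da < 0 or db < 0:       # monotonicity: never move backwards in time
            return False
        if da > 1 or db > 1:       # continuity: only adjacent cells are allowed
            return False
        if da == 0 and db == 0:    # each step must reach a new cell
            return False
    return True
```

Any path accepted by this check has a length K between max(m, n) and m + n − 1, in line with the bound stated above.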
We are interested only in the path that minimizes the warping cost:
$$DTWC(Q', C') = \min \left\{ \frac{1}{K} \sum_{k=1}^{K} w_k \right\} \qquad (1)$$
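As an illustration, the warping cost can be approximated with the usual dynamic programming recursion. This is a sketch under stated assumptions rather than the authors' implementation: the series are scalar-valued, so the Euclidean distance between two points reduces to an absolute difference; w_k in the sum of (1) is read as the distance stored in the k-th cell of the path; and the code minimizes the summed distance first and then divides by the length K of the recovered path, which is a common simplification and need not coincide with the exact minimizer of (1). The name `dtw_cost` is hypothetical.

```python
def dtw_cost(q, c):
    """Return (normalised cost, path) for two numeric sequences q and c."""
    n, m = len(q), len(c)
    inf = float("inf")
    # acc[i][j]: minimal summed distance of a path from cell (1, 1) to cell (i, j)
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(q[i - 1] - c[j - 1])  # distance between two scalar points
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    # Backtrack from (n, m) to (1, 1) to recover the path and its length K.
    path = [(n, m)]
    i, j = n, m
    while (i, j) != (1, 1):
        candidates = [(acc[i - 1][j - 1], i - 1, j - 1),
                      (acc[i - 1][j], i - 1, j),
                      (acc[i][j - 1], i, j - 1)]
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return acc[n][m] / len(path), path
```

For two series with the same shape but a shift in time, such as `[0, 0, 1, 2]` and `[0, 1, 2]`, the recovered path absorbs the shift and the cost is zero, whereas a pointwise Euclidean comparison is not even defined for unequal lengths.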
Therefore, in our approach, the data are the smoothed values of the sequences and the dissimilarity between two elements is the Dynamic Time Warping Cost (DTWC). The clustering method is based on an adaptation of Kohonen’s SOM algorithm for dissimilarity data (Golli et al., 2004). The SOM algorithm consists of neurons organized on a regular low-dimensional map. More formally, the map is described by a graph (N,Γ). N is a set of interconnected
neurons having a discrete topology defined by Γ. For each pair of neurons on the map, the distance is defined as the length of the shortest path between them on the graph. This distance imposes a neighbourhood relation between neurons. The Dissimilarity SOM algorithm (DSOM) is an adaptation of Kohonen’s SOM algorithm for dissimilarity data. It is a batch iterative algorithm in which the whole data set (Ω) is initially presented to the map. We denote by zl (l=1,…,N) the generic element of Ω; zl is also the representation of this element in the representation space D on which the dissimilarity (denoted d) is defined. Each neuron x is represented by a set of M elements of Ω, m1,…,mg,…,mM, called prototypes, where mg is a vector of zl elements. In DSOM the prototypes associated with the neurons, as well as the neighbourhood function, evolve with the iterations. The algorithm starts with an initialization phase, in which the value of M is randomly chosen. It then alternates assignment (affectation) phases and representation phases until convergence. In the assignment phase each observation is assigned to the winning prototype according to the following assignment function:
$$f(z_l) = \arg\min_{g \in M} d^T(z_l, m_g) \qquad (2)$$
where the adequacy function is:
$$d^T(z_l, m_g) = \sum_{r \in M} K^T(\vartheta(g, r)) \sum_{z_s \in m_r} d^2(z_l, z_s) \qquad (3)$$
Here $K^T(\vartheta(g, r))$ is the neighbourhood kernel around the neuron r, and zl, zs are the representations of the elements in the space D. At the generic h-th iteration we assign each observation to the winning prototype using (2) and thereby define the cluster associated with this prototype at iteration h. The main drawback of the DSOM algorithm is the cost induced by the representation phase. A fast version of the DSOM algorithm, which considerably reduces its theoretical cost, has been proposed by Conan-Guez et al. (2005). In our approach we aim to classify a set of curves by using the described clustering algorithm together with the DTWC. Therefore, in our algorithm the smoothed time series are classified by substituting the DTWC (1) for the distance d in (3). This approach gives an easy visualization of the data, it is computationally more efficient than the classical clustering algorithms, and it can deal with time series drawn from large data sets. The visualization of time series is very important for detecting their characteristics and gives us some information for representing each class.
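The assignment phase defined by (2) and (3) can be sketched as follows. Several ingredients are assumptions made only for illustration: the kernel is taken to be Gaussian, K^T(δ) = exp(−δ²/T²), which the text does not specify; the pairwise dissimilarities (for instance DTWC values) are held in a precomputed matrix `dissim`; each prototype is a list of observation indices; and `map_dist` holds the graph distances ϑ between neurons. All function and variable names are hypothetical.

```python
import math

def neighbourhood_kernel(delta, T):
    """K^T(delta); the Gaussian form is an assumption of this sketch."""
    return math.exp(-(delta ** 2) / (T ** 2))

def adequacy(l, g, dissim, prototypes, map_dist, T):
    """d^T(z_l, m_g) following equation (3)."""
    total = 0.0
    for r, members in enumerate(prototypes):
        # Inner sum: squared dissimilarities between z_l and the elements of m_r.
        inner = sum(dissim[l][s] ** 2 for s in members)
        total += neighbourhood_kernel(map_dist[g][r], T) * inner
    return total

def assign(l, dissim, prototypes, map_dist, T):
    """Index of the winning neuron for observation l, following equation (2)."""
    return min(range(len(prototypes)),
               key=lambda g: adequacy(l, g, dissim, prototypes, map_dist, T))

# Toy data: observations 0 and 1 are close to each other, as are 2 and 3.
dissim = [[0, 1, 5, 6],
          [1, 0, 4, 5],
          [5, 4, 0, 1],
          [6, 5, 1, 0]]
prototypes = [[0], [3]]        # one prototype element per neuron
map_dist = [[0, 1], [1, 0]]    # two neurons joined by a single edge
winners = [assign(l, dissim, prototypes, map_dist, T=0.5) for l in range(4)]
```

With a small T the kernel is almost zero for distinct neurons, so the rule behaves like a nearest-prototype assignment; larger values of T let neighbouring neurons influence the choice, which is what produces the topological ordering of the map.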
3. Experimental Results
For a first evaluation of our approach, we propose a simulation study on a small data set of 130 time series. We have generated 130 time series (Fig.1) of length 100, and in particular i) 60 time series with increasing trend, ii) 30 time series with a seasonal component only and iii) 40 time series with decreasing trend.
Fig.1 The simulated time series
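A data set with the same structure can be generated along the following lines. Only the group sizes and the series length come from the text; the slopes, the seasonal period and amplitude, the noise level and the random phase are assumed values chosen for illustration, and `simulate_series` is a hypothetical name.

```python
import math
import random

def simulate_series(n_increasing=60, n_seasonal=30, n_decreasing=40,
                    length=100, seed=0):
    """Return (series, labels) for three groups of noisy synthetic series."""
    rng = random.Random(seed)
    # (label, count, slope, seasonal amplitude): all numeric values are assumed.
    groups = [("increasing", n_increasing, 0.05, 0.0),
              ("seasonal", n_seasonal, 0.0, 1.0),
              ("decreasing", n_decreasing, -0.05, 0.0)]
    series, labels = [], []
    for label, count, slope, amplitude in groups:
        for _ in range(count):
            phase = rng.uniform(0.0, 2.0 * math.pi)  # random shift along the time axis
            values = [slope * t
                      + amplitude * math.sin(2.0 * math.pi * t / 25.0 + phase)
                      + rng.gauss(0.0, 0.2)
                      for t in range(length)]
            series.append(values)
            labels.append(label)
    return series, labels

series, labels = simulate_series()
```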
Without warping, the k-means algorithm is often unable to distinguish class i) from class ii), with an overall misclassification rate of 53%. We wrote a Matlab program for generating the time series, smoothing each series with a cubic spline and implementing the DTW algorithm. Finally, we clustered the series on the basis of the DTWC with the DSOM algorithm. We repeated the simulation study 150 times with different values of the smoothing parameter λ, ranging from 0.05 to 0.20. The results (Fig.2) show that only 10% of the time series are misclassified (for λ=0.1) and that there are very few cases of confusion between class i) and class ii).
Fig.2 The classification results with the smoothed time series
4. Conclusions
In this short version of the paper, we have introduced a new approach for clustering smoothed time series, based on the joint use of the Dynamic Time Warping distance and the Dissimilarity SOM algorithm. This algorithm seems particularly promising for data mining problems, and it can be applied to series that are not aligned in time while giving a good visualisation of the results. The forthcoming paper includes a more detailed analysis of our approach and, in particular, an application to a large set of real data, which is needed to investigate the robustness of the proposed approach in the presence of irregularly sampled data. A comparison, on the same data, with the algorithm proposed by Morlini et al. (2005) will be performed. In further research we aim to define a non-parametric model for characterizing each cluster obtained. In other words, for each cluster of smoothed time series we will search for a non-parametric function that summarizes its elements.
Main References
Berndt, D., Clifford, J. (1994). Using dynamic time warping to find patterns in time series, AAAI-94 Workshop on Knowledge Discovery in Databases, 229–248.
Conan-Guez B., Rossi F., Golli A.E. (2005). A Fast Algorithm for the Self-Organizing Map on Dissimilarity Data, WSOM’05 Proceedings.
Focardi S.M. (2001). Clustering delle serie storiche economiche: applicazioni e questioni computazionali, Technical Report, Supercalcolo in Economia e in Finanza, Milano.
Golli A.E., Conan-Guez B., Rossi F. (2004). Self-organizing maps and symbolic data, CLUEB, Journal of Symbolic Data Analysis, 2, n.1, ISSN 1723–5081.
Kohonen T. (2001). Self-Organizing Maps, Springer Series in Information Sciences, Springer.
Morlini I. (2005). On the Dynamic Time Warping for Computing the Dissimilarity Between Curves, Vichi et al. eds., New Developments in Classification and Data Analysis, Proceedings of the Meeting of the Classification and Data Analysis Group, Università di Bologna, Settembre 2003.
Rabiner, L., Juang, B. (1993). Fundamentals of speech recognition, Englewood Cliffs, N.J., Prentice Hall.
Sakoe, H., Chiba, S. (1978). Dynamic programming algorithm optimization for spoken word recognition. IEEE Trans. Acoustics, Speech, and Signal Processing, Volume 26, pp. 143-165.