International Journal of Fuzzy Systems, Vol. 13, No. 4, December 2011
Traffic Flow Data Mining and Evaluation Based on Fuzzy Clustering Techniques

Hu Chunchun, Luo Nianxue, Yan Xiaohong, and Shi Wenzhong

Abstract

Effective mining technology can extract the spatial distribution pattern of road network traffic flow. In this paper, the similarity between traffic flow objects with spatio-temporal characteristics is measured by combining Dynamic Time Warping (DTW) with shortest path analysis. We propose a new fuzzy clustering algorithm for road network traffic flow data, in which traffic flow data objects with similar properties and spatial correlation are clustered into one group, revealing the spatial distribution pattern of road traffic flow. The experimental results show that the method is valid and effective. The road network is classified reasonably, and the classification results can provide decision support for traffic zone division.

Keywords: Fuzzy clustering, traffic flow, similarity measure, cluster validity.
1. Introduction

Traffic flow on a road network exhibits different spatial distribution patterns, such as a linear pattern for major-road traffic flow and a surface pattern for thriving roads. Dynamic traffic zone partition based on the spatial distribution characteristics of traffic flow is one of the research hotspots of intelligent transportation systems. The partition also changes as traffic flow moves between peak, normal and off-peak periods. We can apply effective mining technology to extract the spatial distribution pattern on the road network. This helps to partition traffic zones, manage and control road network traffic, increase road network capacity and ease traffic pressure.

Clustering analysis is one of the most useful methods in knowledge acquisition, and is used to discover underlying clusters and interesting distribution patterns from the data itself. Cluster analysis techniques can in general be divided into two categories, namely crisp and fuzzy clustering. Crisp clustering produces non-overlapping partitions: according to some proximity measure and clustering criterion, an object either belongs to a class or does not. Crisp clustering can therefore cut off the link between objects and introduce deviation into the clustering results. The need to support uncertainty in the clustering task led to algorithms that use fuzzy logic concepts in the clustering procedure [1]. A common fuzzy clustering algorithm is the Fuzzy C-Means (FCM) [2]. It attempts to find the most characteristic point in each cluster, which can be considered the "center" of the cluster, and the grade of membership of each object in the clusters [1]. Spatial clustering analysis of traffic flow can find the spatial distribution patterns on a road network, so that traffic flow data objects with similar properties and spatial association are clustered into one group.

To achieve this goal, we first review related research on traffic flow data. Because of the complexity of the traffic network, comparing only the differences between time series is not sufficient for measuring the similarity of traffic flow. Next, we define a similarity measure between traffic flows for discovering object groups according to proximity in time and on the road network. In Section 4 we develop an algorithm for clustering traffic flow that uses this similarity measure. Finally, we evaluate the proposed algorithm by conducting experiments on real traffic flow data.

Corresponding Author: Hu Chunchun is with the School of Geodesy and Geomatics, Wuhan University, 129 Luoyu Road, Wuhan 430079, China. E-mail: [email protected]
Luo Nianxue is with the School of Geodesy and Geomatics, Wuhan University, 129 Luoyu Road, Wuhan 430079, China. E-mail: [email protected]
Yan Xiaohong is with the School of Resource and Environmental Science, Wuhan University, 129 Luoyu Road, Wuhan 430079, China.
Shi Wenzhong is with the Department of Land Surveying and Geo-Informatics, The Hong Kong Polytechnic University.
Manuscript received 15 Nov. 2011; revised 1 Dec. 2011; accepted 22 Dec. 2011.
© 2011 TFSA

2. Related work

Previous work on mining traffic flow data covers traffic data models, clustering algorithms and similarity measures of traffic flow. In [3], Gaussian mixture models (GMM) are used with speed, flow and occupancy together in the cluster analysis of traffic flow data. Chen et al. studied multi-dimensional traffic flow time series analysis with self-organizing maps [4]. For mining urban traffic flow, the discrete wavelet transform has been adopted for flow feature extraction because it is insensitive to disturbance and scaling and can zoom into multiple finer granularities [5]. For high-dimensional traffic data clustering, a two-stage fuzzy clustering method was proposed: the optimal partition identified in the first stage is used as the initial partition of a second-stage clustering on the full-dimensional data, which effectively reduces the possibility of a local optimum [6]. A similarity measure for traffic flow time series was studied in [7] and achieved an effective separation of traffic flow time series. In [8], traffic flow sequences were clustered with a partition clustering technique, which could identify time-of-day (TOD) intervals of road traffic flow according to the different flows.

However, the research above considers only the time attribute of road traffic flow and ignores its spatial distribution characteristics on the road network. Traffic flow on a road network is related to both temporal information and road segments, so its spatial clustering differs from common clustering methods. A similarity measure of traffic flow can be defined by exploiting time series data with dynamic characteristics together with the topological relations between road segments on the road network. A new clustering algorithm can then be proposed that measures the similarity of traffic flow objects well and finds potential traffic patterns in a large number of road network traffic flow series.
3. Similarity Measure between Traffic Flow Objects

A good spatial clustering algorithm groups road segments according to the data characteristics of the road network. Traffic flow series on a road network are multidimensional, so a suitable distance function is required to express the similarity between two road sections. The Lp norm [9] and Dynamic Time Warping (DTW) [10] are distance functions for time series similarity analysis. A similarity distance based on the Lp norm is simple and easy to implement, but it can only handle time sequences of equal length. DTW, in contrast, is a dynamic programming method for measuring the similarity of time series, and it places no restriction on their lengths.

A. DTW

Given two time sequences $Q = (q_1, \dots, q_n)$ and $C = (c_1, \dots, c_m)$ of lengths n and m respectively, we can calculate the distance between them in order to compare their similarity. A smaller distance expresses a greater similarity. In the distance matrix of the two series, a set of contiguous matrix elements that defines the dissimilarity relation between the series is called a curved (warping) path. The aim of the DTW method is to find the curved path with the minimum total length, which can be calculated by dynamic programming using formula (1). If the point (i, j) is located on the optimal path, then the sub-path from (1, 1) to (i, j) is also a locally optimal solution. The best path can therefore be obtained by recursively searching for the locally optimal solution between the starting point (1, 1) and the end point (n, m).

$$S(1,1) = d(q_1, c_1), \qquad S(i,j) = d(q_i, c_j) + \min\{S(i-1,j),\ S(i-1,j-1),\ S(i,j-1)\} \tag{1}$$
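Recurrence (1) can be sketched in a few lines of Python. This is only an illustration: the text does not specify the local distance d, so the absolute difference used here is an assumption, and the function name `dtw_distance` is hypothetical.

```python
import math


def dtw_distance(q, c, dist=lambda a, b: abs(a - b)):
    """Minimum cumulative cost S(n, m) of recurrence (1).

    q and c are sequences of length n and m; dist is the local distance d
    (absolute difference by default, which is an assumption).
    """
    n, m = len(q), len(c)
    # s[i][j] stores S(i, j). Row 0 and column 0 form an infinite border,
    # with s[0][0] = 0 so that S(1, 1) reduces to d(q_1, c_1).
    s = [[math.inf] * (m + 1) for _ in range(n + 1)]
    s[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = min(s[i - 1][j], s[i - 1][j - 1], s[i][j - 1])
            s[i][j] = dist(q[i - 1], c[j - 1]) + step
    return s[n][m]


# Sequences of different lengths can be compared directly.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0: the repeated 2 is absorbed
print(dtw_distance([1, 3], [1, 2, 3]))        # 1.0
```

The table has (n+1) x (m+1) entries, so the cost of one comparison grows with the product of the two series lengths.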
B. The Shortest Path

Spatio-temporal properties are the main part of the road traffic flow information obtained in real time, and the spatial relations between the road segments of the road network are also significant. The similarity measure of traffic flow objects therefore also considers the topological relations of the road network: the spatial similarity is higher between road sections that are connected and reachable. The connectivity and accessibility of roads can be measured by shortest path analysis on the road network, and the dynamic shortest path length between two road sections is defined as the spatial similarity measurement function.

C. Defining Similarity Measures

Let the graph $G = \{V, E\}$ correspond to the road network, where V is the set of road nodes and E is the set of road edges. Each edge $E_i = \langle v_{i0}, v_{i1} \rangle$, with beginning node $v_{i0}$ and ending node $v_{i1}$, generates a traffic flow series $T_i = \{T_{i1}, T_{i2}, \dots, T_{in}\}$, where n denotes the nth time period of the series. The spatio-temporal similarity measure function of the traffic flow is defined as follows:

$$\mathrm{TFSM}(E_i, E_j) = \mathrm{DTW}(T_i, T_j) + \mathrm{ShortestPath}(v_{i0}, v_{j1}) \tag{2}$$

In (2), $\mathrm{DTW}(T_i, T_j)$ expresses the similarity distance between the traffic flow series of the two roads $E_i$ and $E_j$, and $\mathrm{ShortestPath}(v_{i0}, v_{j1})$ is the spatial similarity distance between the beginning node of road $E_i$ and the ending node of road $E_j$. A small value of $\mathrm{TFSM}(E_i, E_j)$ means that $E_i$ and $E_j$ are more similar.
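A minimal sketch of (2) follows, assuming the road network is given as a weighted adjacency dictionary and each edge is identified by its (beginning node, ending node) pair. The names `shortest_path_length`, `dtw` and `tfsm`, the absolute-difference local distance, and the toy network are all illustrative assumptions; the two terms are simply added, as in (2), without any rescaling.

```python
import heapq
import math


def shortest_path_length(graph, source, target):
    """Dijkstra shortest-path length; graph maps node -> {neighbor: weight}."""
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == target:
            return dist
        if dist > best.get(node, math.inf):
            continue  # stale queue entry
        for neighbor, weight in graph.get(node, {}).items():
            candidate = dist + weight
            if candidate < best.get(neighbor, math.inf):
                best[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return math.inf  # target is not reachable from source


def dtw(q, c):
    """Cumulative DTW cost of recurrence (1); absolute difference assumed for d."""
    s = [[math.inf] * (len(c) + 1) for _ in range(len(q) + 1)]
    s[0][0] = 0.0
    for i in range(1, len(q) + 1):
        for j in range(1, len(c) + 1):
            step = min(s[i - 1][j], s[i - 1][j - 1], s[i][j - 1])
            s[i][j] = abs(q[i - 1] - c[j - 1]) + step
    return s[len(q)][len(c)]


def tfsm(edge_i, edge_j, series, graph):
    """Eq. (2): DTW of the two series plus the shortest path v_i0 -> v_j1."""
    temporal = dtw(series[edge_i], series[edge_j])
    spatial = shortest_path_length(graph, edge_i[0], edge_j[1])
    return temporal + spatial


# Toy network (hypothetical): nodes 1 -> 2 -> 3 with costs 1.0 and 2.0.
graph = {1: {2: 1.0}, 2: {3: 2.0}, 3: {}}
series = {(1, 2): [1.0, 2.0, 3.0], (2, 3): [1.0, 2.0, 2.0, 3.0]}
print(tfsm((1, 2), (2, 3), series, graph))  # 3.0 = 0.0 (DTW) + 3.0 (path 1 -> 3)
```

On a directed network the measure is not symmetric in general, because the path runs from the beginning node of one edge to the ending node of the other.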
4. Traffic Flow Clustering and Evaluation

A. The Fuzzy Algorithm

A common fuzzy clustering algorithm is the Fuzzy C-Means (FCM), an extension of the classical C-Means algorithm for fuzzy applications [11]. It uses fuzzy techniques to cluster data, so that an object can belong to more than one cluster, which is compatible with the nature of real data. The FCM clustering algorithm has been widely used to obtain the fuzzy c-partition. It is a fuzzy clustering algorithm based on an objective function. Given a dataset $X = \{X_1, X_2, \dots, X_n\}$ of dimension s, the objective of FCM is to partition X into c homogeneous fuzzy clusters by minimizing the function $J_m$:

$$J_m = \sum_{i=1}^{c} \sum_{j=1}^{n} (u_{ij})^m \, d^2(X_j, V_i) \tag{3}$$

In (3), c is the number of clusters, n is the number of data points and $u_{ij}$ is the membership degree of data point $X_j$ in the fuzzy cluster $C_i$. $V_i$ is the ith cluster centroid, and m is the weighting exponent, which controls the fuzziness of the membership of each datum [12]. $d^2(X_j, V_i)$ is the squared Euclidean distance between $X_j$ and $V_i$.

The FCM solution is a mathematical programming problem: the data set X is divided into c categories by minimizing the objective function. The constraint on $J_m$ is that the membership degrees $u_{ij}$ of each $X_j$ over all clusters $C_i$ sum to 1:

$$\sum_{i=1}^{c} u_{ij} = 1, \quad 1 \le j \le n, \quad 0 \le u_{ij} \le 1 \tag{4}$$

The FCM algorithm is carried out in the following steps:

• Initialize the threshold ε and the cluster centroids V(0), and set k = 0.
• Take a predefined number of clusters c and a chosen value of m.
• Compute the membership degree matrix $U(k) = [u_{ij}]$ for i = 1, 2, …, c using (5).

$$u_{ij} = \frac{\left(d^2(X_j, V_i)\right)^{-\frac{1}{m-1}}}{\sum_{l=1}^{c} \left(d^2(X_j, V_l)\right)^{-\frac{1}{m-1}}} \tag{5}$$

• Update the fuzzy cluster centroids $V_i(k+1)$ for i = 1, 2, …, c using (6).

$$V_i(k+1) = \frac{\sum_{j=1}^{n} u_{ij}^m(k) \, X_j}{\sum_{j=1}^{n} u_{ij}^m(k)} \tag{6}$$

• If (7) is met, the iteration halts; otherwise return to the third step.

$$\| V(k) - V(k+1) \| < \varepsilon \tag{7}$$

Through this iteration the FCM algorithm converges to a local minimum of $J_m$ [2].

B. New Fuzzy Clustering Algorithm

In order to extract meaningful road traffic distribution patterns, formula (2) is used as the fuzzy similarity measure function, and the new objective function of FCM is defined as follows:

$$J_m = \sum_{i=1}^{c} \sum_{j=1}^{n} (u_{ij})^m \, \mathrm{TFSM}(E_i, E_j) \tag{8}$$

In (8), $\mathrm{TFSM}(E_i, E_j)$ is the similarity measure function of the traffic flow on the road network, where $E_i$ is the road segment acting as the center of cluster i and $E_j$ is the jth road segment. The new fuzzy clustering algorithm is described as follows:

• Step 1: Build the topological structure of the road network and calculate the shortest path between the start node and the end node of each pair of road segments with the Dijkstra algorithm.
• Step 2: Randomly select c road traffic flow sequences from the road network as the initial c cluster centers.
• Step 3: Calculate the membership degree matrix U(k) according to (5). To complete this step, calculate the minimum dynamic bending path between each road segment and each cluster center traffic flow sequence with the DTW algorithm according to (1), and compute $d^2(X_j, V_i)$.
• Step 4: Adjust the fuzzy cluster centers according to (6). For each cluster center, find the road segment whose traffic flow sequence has the minimal dynamic curved path to the cluster center under the DTW algorithm. These road segments, with their traffic flow series, become the new fuzzy cluster centers.
• Step 5: Repeat Step 3 and Step 4 until the maximum number of iterations $t_{max}$ is reached or (7) is met.

C. Evaluation Methods

Since fuzzy clustering is an unsupervised machine learning technique, there is no set of correct answers against which the results can be compared, and the input parameters have to be tuned in some way to obtain the optimal cluster results [13]. It is therefore necessary to validate the clustering results produced by the fuzzy clustering algorithm. Measuring clustering results and identifying the optimal partition or the optimal number of clusters is called cluster validity evaluation. The objective of cluster validity indices is to seek optimal clustering schemes in which most of the data present a high degree of membership within one cluster [14].

The Xie-Beni index involves the membership values and the dataset itself. It is a compactness and separation fuzzy validity function [14] and is defined as
$$v_{XB} = \frac{\sum_{j=1}^{n} \sum_{i=1}^{c} u_{ij}^{2} \, \| x_j - v_i \|^2}{n \cdot \min_{i \neq k} \| v_i - v_k \|^2} \tag{9}$$
Equation (9) is the ratio of the total compactness to the separation of the fuzzy c-partition. Compact and well-separated clusters are expected to give small values of $v_{XB}$. Therefore, the optimal number of clusters c is obtained by finding the fuzzy c-partition with the smallest value of $v_{XB}$.

In addition to cluster validity evaluation, we measure the running times of the new fuzzy clustering algorithm for given parameter values. The scalability of the algorithms is tested by measuring the running times of the same fuzzy clustering for different numbers of objects being clustered.
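The steps of the new fuzzy clustering algorithm in Section 4.B can be sketched as follows. This is one possible reading, not the authors' implementation: it assumes the TFSM values of (2) have been precomputed into a matrix `dissim`, replaces $d^2$ in (5) by that dissimilarity, and interprets Step 4 as choosing, for each cluster, the road segment with the smallest membership-weighted dissimilarity (a fuzzy c-medoids style update). All function names, the `init` parameter and the toy data are hypothetical.

```python
import random


def memberships(dissim, centers, m=2.0, tiny=1e-12):
    """Eq. (5) with d^2 replaced by dissim[center][segment]; returns u[i][j]."""
    n = len(dissim)
    power = -1.0 / (m - 1.0)
    u = [[0.0] * n for _ in centers]
    for j in range(n):
        # tiny avoids division by zero when a segment coincides with a center
        weights = [(dissim[k][j] + tiny) ** power for k in centers]
        total = sum(weights)
        for i, w in enumerate(weights):
            u[i][j] = w / total
    return u


def update_centers(dissim, u, m=2.0):
    """Step 4 (assumed reading): per cluster, pick the segment that
    minimizes the membership-weighted dissimilarity to all segments."""
    n = len(dissim)
    centers = []
    for row in u:
        costs = [sum((row[j] ** m) * dissim[k][j] for j in range(n))
                 for k in range(n)]
        centers.append(min(range(n), key=costs.__getitem__))
    return centers


def fuzzy_cluster(dissim, c, m=2.0, t_max=40, init=None, seed=0):
    """Steps 2 to 5 on a precomputed dissimilarity matrix (Step 1)."""
    rng = random.Random(seed)
    centers = list(init) if init else rng.sample(range(len(dissim)), c)  # Step 2
    for _ in range(t_max):                           # Step 5: at most t_max rounds
        u = memberships(dissim, centers, m)          # Step 3
        new_centers = update_centers(dissim, u, m)   # Step 4
        if new_centers == centers:                   # stand-in for criterion (7)
            break
        centers = new_centers
    return centers, memberships(dissim, centers, m)


# Toy dissimilarities (hypothetical): four segments forming two obvious groups.
points = [0.0, 1.0, 10.0, 11.0]
dissim = [[abs(a - b) for b in points] for a in points]
centers, u = fuzzy_cluster(dissim, c=2, init=[0, 2])
print(centers)
```

Because the centers are restricted to actual road segments, the result depends on the initial selection, and the sketch has no safeguard against two clusters choosing the same segment.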
5. Experimental Results

The experimental data set includes 3648 road segments of the road network of a certain city. Traffic flow data were collected at 5-minute intervals. The experiment uses the traffic flow series of the morning peak (between 7:00 and 9:30), so the series of each road segment contains thirty traffic flow values. Table 1 describes the traffic flow series data.

To obtain reliable experimental results, the weighting exponent of FCM is set to m = 2, which is a common choice within the range [1.5, 2.5] [15]. The maximum number of iterations $t_{max}$ is 40 and the iteration termination threshold ε is 0.0001. Because fuzzy clustering is an unsupervised classification method, we took the cluster numbers c = 3, 4, 5, 6 to observe the results of the experiment.
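To make these settings concrete, the sketch below runs the standard FCM iteration (5) to (7) with m = 2, $t_{max}$ = 40 and ε = 0.0001 as defaults. It is a plain illustration on toy data, not the experimental code: the function names, the `init` parameter and the sample values are assumptions.

```python
import random


def sq_dist(x, v):
    """Squared Euclidean distance d^2(X_j, V_i)."""
    return sum((a - b) ** 2 for a, b in zip(x, v))


def fcm(data, c, m=2.0, t_max=40, eps=1e-4, init=None, seed=0):
    """Standard FCM; returns the centroids and the last membership matrix."""
    rng = random.Random(seed)
    centroids = [list(v) for v in (init or rng.sample(data, c))]
    n, dim = len(data), len(data[0])
    power = -1.0 / (m - 1.0)
    u = []
    for _ in range(t_max):
        # Eq. (5): memberships from the current centroids
        u = [[0.0] * n for _ in range(c)]
        for j, x in enumerate(data):
            w = [(sq_dist(x, v) + 1e-12) ** power for v in centroids]
            total = sum(w)
            for i in range(c):
                u[i][j] = w[i] / total
        # Eq. (6): membership-weighted mean of the data
        updated = []
        for i in range(c):
            um = [u[i][j] ** m for j in range(n)]
            denom = sum(um)
            updated.append([sum(um[j] * data[j][d] for j in range(n)) / denom
                            for d in range(dim)])
        # Eq. (7): stop when no centroid moved by eps or more
        shift = max(sq_dist(a, b) ** 0.5 for a, b in zip(centroids, updated))
        centroids = updated
        if shift < eps:
            break
    return centroids, u


# Toy one-dimensional data (hypothetical) with two well-separated groups.
data = [[0.0], [0.2], [10.0], [10.2]]
centroids, u = fcm(data, c=2, init=[[0.0], [10.0]])
print(centroids)
```

With m = 2 the exponent in (5) is -1, so each membership is simply the normalized inverse squared distance to the centroid.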
A. Cluster Validity Results

We compared FCM and the new fuzzy clustering algorithm according to the XB index. The results for a time series length of 10 are listed in Table 2. The XB index of FCM points to the optimal cluster number c=6 or c=5, while that of the new fuzzy clustering algorithm points to c=3 or c=5.

Table 2. XB index values (length of time series = 10).

C    FCM            New fuzzy clustering algorithm
3    0.19785713     0.214135881
4    0.203094451    0.233541914
5    0.182707857    0.232201256
6    0.171682671    0.251876422
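The selection of c from Table 2 can be reproduced with a short sketch. The function `xie_beni` is an illustrative implementation of (9) for data with Euclidean centroids, and the dictionary simply copies the values of Table 2; the names are hypothetical.

```python
def xie_beni(data, centroids, u):
    """Eq. (9): total compactness over n times the minimum centroid separation."""
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    n = len(data)
    compactness = sum(
        (u[i][j] ** 2) * sq_dist(x, v)
        for i, v in enumerate(centroids)
        for j, x in enumerate(data)
    )
    separation = min(
        sq_dist(vi, vk)
        for i, vi in enumerate(centroids)
        for k, vk in enumerate(centroids)
        if i != k
    )
    return compactness / (n * separation)


# XB values copied from Table 2 (length of time series = 10).
table2 = {
    "FCM": {3: 0.19785713, 4: 0.203094451, 5: 0.182707857, 6: 0.171682671},
    "new": {3: 0.214135881, 4: 0.233541914, 5: 0.232201256, 6: 0.251876422},
}

# Rank the candidate cluster numbers by increasing XB value (smaller is better).
ranking = {name: sorted(values, key=values.get) for name, values in table2.items()}
print(ranking["FCM"])  # [6, 5, 3, 4]
print(ranking["new"])  # [3, 5, 4, 6]
```

The two best-ranked values of c in each row match the cluster numbers quoted in the text for Table 2.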
In addition, traffic flow data with time series lengths of 20 and 30 were run in the experiment in order to test the validity of the new algorithm. The results in Table 3 and Table 4 indicate that the optimal cluster number is 3 or 5 when using the new algorithm, and 3 or 6 when using FCM.

Table 3. XB index values (length of time series = 20).

C    FCM            New fuzzy clustering algorithm
3    0.154611607    0.224557453
4    0.176140361    0.281728994
5    0.182708834    0.247241881
6    0.171681434    0.277330461
Table 4. XB index values (length of time series = 30).

C    FCM            New fuzzy clustering algorithm
3    0.154611523    0.225565906
4    0.203093245    0.252538231
5    0.182716977    0.246220492
6    0.171685031    0.278026107

Table 1. Traffic flow sequence data.

Road segment ID    Starting node    Ending node    Time                      Average cost (time)
10001001           1000             1001           2009-02-24 07:00-07:05    0.9375
10001001           1000             1001           2009-02-24 07:05-07:10    0.9375
10001001           1000             1001           2009-02-24 07:10-07:15    1.2
…                  …                …              …                         …
10981049           1098             1049           2009-02-24 07:05-07:10    1.287
10981049           1098             1049           2009-02-24 07:15-07:20    1.29
10981049           1098             1049           2009-02-24 07:25-07:30    1.4
…                  …                …              …                         …
Figure 1 shows part of the cluster results of the FCM algorithm when c=6, and Figure 2 presents the results of the new fuzzy clustering algorithm when c=5.

Figure 1. The clustering results of FCM (c=6).

Figure 2. The clustering results of new algorithm (c=5).

The cluster centers are rendered as bold red lines in Figure 1 and Figure 2, and the road segments are rendered in different colors. Each color indicates a group of traffic flow data with high similarity. Comparing the two figures, two of the cluster centers in Figure 1 are too close to each other. In the real traffic situation this area is a dense traffic zone, and the cluster centers in Figure 2 fit the real traffic state better. To sum up, the experimental results show that the new algorithm is valid.

B. Clustering efficiency results

The running time of the clustering algorithm is mainly related to the length of the time series and the number of clusters used in the experiment. In order to estimate the efficiency of the new clustering algorithm, we measure its running time and compare it with that of the FCM algorithm. The measured running times are shown in Figure 3, Figure 4 and Figure 5. The running times of the new algorithm are longer than those of the FCM algorithm, and they increase with the number of clusters and the length of the series. The calculation of the minimum dynamic bending path and of the shortest path accounts for the additional time cost. Nevertheless, the running times of the new algorithm remain short, which indicates its efficiency.

Figure 3. Running times comparison (the length of series=10).

Figure 4. Running times comparison (the length of series=20).

Figure 5. Running times comparison (the length of series=30).

6. Conclusions

Fuzzy cluster analysis is an important data analysis tool and an unsupervised classification method. Building on the ideas of fuzzy clustering and taking the distribution of traffic flow on the road network into account, this paper proposes a new clustering algorithm that measures the similarity between traffic flow objects well and finds potential traffic patterns in a large number of road network traffic flow series. The experimental results show that the new algorithm is valid and efficient. In particular, the cluster results can provide decision support for traffic zone division. Clustering analysis is largely a data-driven tool [16].
Further work is therefore needed to tune the parameters of the algorithm in order to obtain better cluster results. Visualizing the clustering process and improving the similarity measure and the performance are also important future work.

Acknowledgment

This research work was supported by grants from the National Natural Science Foundation of China under Grant number 91024032.

References

[1] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, “On clustering validation techniques,” Journal of Intelligent Information Systems, vol. 17, no. 2-3, pp. 107-145, 2001.
[2] J. C. Bezdek, Pattern recognition with fuzzy objective function algorithms, Plenum Press, New York, 1981.
[3] Sun Lu, Zhang Huimin, et al., “Gaussian mixture models for clustering and classifying traffic flow in real-time for traffic operation and management,” Journal of Southeast University, vol. 27, no. 2, pp. 174-179, 2011.
[4] C. Yudong, Z. Yi, and H. Jianming, “Multi-dimensional traffic flow time series analysis with self-organizing maps,” Tsinghua Science and Technology, vol. 13, no. 2, pp. 220-228, 2008.
[5] C. Yudong, Z. Yi, H. Jianming, et al., “Mining for similarities in urban traffic flow using wavelets,” In Proc. of the IEEE International Conference on Intelligent Transportation Systems, pp. 119-124, Seattle, Washington, USA, 2007.
[6] Z. Pengjun and M. Mike, “An algorithm for high-dimensional traffic data clustering,” In Proc. of the International Conference of Fuzzy System and Knowledge Discovery, pp. 59-68, Xi'an, China, 2006.
[7] R. Jiangtao, X. Qiongqiong, and Y. Jian, “Traffic flow time series separation methods,” Computer Applications, vol. 25, no. 4, pp. 937-939, 2005.
[8] A. Hauser Trisha and T. Scherer William, “Data mining tools for real-time traffic signal decision support & maintenance,” In Proc. of the IEEE International Conference on Systems, Man and Cybernetics, pp. 1471-1477, Tucson, Arizona, USA, 2001.
[9] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos, “Fast subsequence matching in time-series database,” In Proc. of ACM SIGMOD Conference, pp. 419-429, Minneapolis, Minnesota, USA, 1994.
[10] D. J. Berndt and J. Clifford, Finding Patterns in time series: A dynamic programming approach, Menlo Park, CA: AAAI Press, 1996.
[11] J. C. Bezdek, R. Ehrlich, and W. Full, “FCM: Fuzzy c-means algorithm,” Computers and Geosciences, vol. 10, pp. 191-203, 1984.
[12] D. W. Kim, K. H. Lee, and D. Lee, “Fuzzy cluster validation index based on inter-cluster proximity,” Pattern Recognition Letters, vol. 24, no. 15, pp. 2561-2574, 2003.
[13] M. Kim and R. S. Ramakrishna, “New Indices for Cluster Validity Assessment,” Pattern Recognition Letters, vol. 26, no. 15, pp. 2353-2363, 2005.
[14] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, “Clustering Algorithms and Validity Measures,” In Proc. of the International Working Conference on Scientific and Statistical Database Management, pp. 3-22, Fairfax, VA, USA, 2001.
[15] N. R. Pal and J. C. Bezdek, “On cluster validity for the fuzzy c-means model,” IEEE Transactions on Fuzzy Systems, vol. 3, no. 3, pp. 370-379, 1995.
[16] J. Han and M. Kamber, Data Mining: Concepts and Techniques. San Francisco: Morgan Kaufmann Publishers, 2001.

Hu Chunchun received her PhD degree in photogrammetry and remote sensing from Wuhan University. Her research interests include geographic information systems, spatial data analysis, spatial databases and spatial data mining.