TRAJECTORY REPRESENTATION OF DYNAMIC TEXTURE VIA MANIFOLD LEARNING

Yang Liu, Yan Liu, Shenghua Zhong, and Keith C.C. Chan
Department of Computing, The Hong Kong Polytechnic University

ABSTRACT
This paper proposes a novel framework to explore the motion trajectories of dynamic texture videos via manifold learning. First, we partition the high-dimensional dataset into a set of data clusters. Second, we construct intra-cluster neighborhood graphs using visible neighbors, based on the individual character of each data cluster. Third, we construct the inter-cluster graph by analyzing the interrelations among these isolated data clusters. Then we compute the shortest paths over the whole graph of the dataset. Finally, we embed the dataset into a unique low-dimensional space, trying to maintain the pairwise distances between all pairs of high-dimensional data points. Experiments on a standard dynamic texture database show that the proposed framework represents the motion characteristics of dynamic textures very well.

Index Terms—Dimensionality reduction, dynamic textures, manifold learning, trajectory representation

1. INTRODUCTION

Dynamic textures, e.g., smoke and fire, or flags and leaves blowing in the wind, can be simply defined as textures with motion [1]. Dynamic texture analysis, also referred to as temporal texture analysis, plays an important role in video-related fields such as video synthesis and recognition. The methods for dynamic texture analysis can be roughly divided into four categories [2]. The most widely used methods are based on optic flow estimation [3], [4], which can characterize the local dynamics in the temporal dimension efficiently. The model-based methods usually define a dynamic model of the textures and then estimate the parameters of this predefined model [5]. The spatiotemporal-filtering-based methods can be divided into local methods and global methods: the local methods extract local spatiotemporal features of the dynamic textures for subsequent segmentation or recognition tasks [6], while the global methods use wavelet technologies to decompose the motion into different scales [7].
Geometry-based methods, which assume that a trajectory surface is embedded in the dynamic texture video, represent dynamic textures by moving contours [8]. In this paper, we propose a novel framework to study dynamic textures by discovering the motion trajectory
978-1-4244-4296-6/10/$25.00 ©2010 IEEE
Table 1. Procedure of proposed framework
Input: High-dimensional dataset I = {x1, x2, ..., xn}
1. Cluster dataset I by the k-nearest-neighbor relationship;
2. Construct intra-cluster graphs using visible neighbors;
3. Construct the inter-cluster graph using representative points;
4. Compute shortest paths in the whole graph;
5. Construct the low-dimensional embedding;
Output: Low-dimensional representation O = {y1, y2, ..., yn}
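Step 1 of Table 1 can be sketched in code. The paper uses KNBC [12], [13]; the sketch below is a simplified stand-in (the function name and the toy data are illustrative, not from the paper) that clusters points as the connected components of a symmetrized k-nearest-neighbor graph, which captures the same idea of detecting isolated data blocks.

```python
import numpy as np

def knn_cluster(X, k=2):
    """Cluster points as the connected components of the symmetrized
    k-nearest-neighbor graph (a simplified stand-in for KNBC)."""
    n = len(X)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # exclude self-neighbors
    nbrs = np.argsort(d, axis=1)[:, :k]         # k nearest neighbors of each point
    adj = np.zeros((n, n), dtype=bool)
    for i in range(n):
        adj[i, nbrs[i]] = True
    adj |= adj.T                                # i ~ j if either lists the other
    labels = -np.ones(n, dtype=int)             # component labels via DFS
    c = 0
    for s in range(n):
        if labels[s] >= 0:
            continue
        stack = [s]
        labels[s] = c
        while stack:
            u = stack.pop()
            for v in np.flatnonzero(adj[u]):
                if labels[v] < 0:
                    labels[v] = c
                    stack.append(v)
        c += 1
    return labels

# two parallel chains far apart: the k-NN graph splits into two components
X = np.array([[float(i), 0.0] for i in range(10)]
             + [[float(i), 100.0] for i in range(10)])
labels = knn_cluster(X, k=2)
```

Because no point's nearest neighbors cross the large gap, the two chains fall into two separate components, mirroring the "isolated data blocks" the framework detects.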
through unsupervised dimensionality reduction. We construct the original feature space from features that are directly available in the raw video data of dynamic textures: each dimension represents the luminance value of one pixel in the video frame. For example, if the frame resolution is 352×288, the original feature space has 101,376 dimensions. In this high-dimensional feature space, each frame is represented as one data point, and the dynamic texture video forms a time-ordered data sequence. The expected output is one low-dimensional trajectory, or a set of low-dimensional trajectories, carrying rich motion information such as direction, velocity, and periodicity.

Under this framework, we explore new manifold learning techniques for dimensionality reduction. The assumption of manifold learning, that the high-dimensional input data lie on or close to an intrinsically low-dimensional manifold [9], [10], matches the task of motion trajectory extraction well [11]. Many manifold learning algorithms, such as locally linear embedding (LLE) [9] and isometric feature mapping (Isomap) [10], have demonstrated impressive results on nonlinear dimensionality reduction, especially for single-manifold projection. However, conventional manifold learning algorithms are not suitable for embedding a dataset that contains several isolated manifolds, because they impose a density requirement on the dataset when constructing the neighborhood graph [12]. To overcome this problem, we utilize new graph construction methods to embed multiple manifolds, which often exist in dynamic texture videos, into a unique low-dimensional space.

2. PROPOSED FRAMEWORK AND ALGORITHMS

Table 1 describes the procedure of the proposed framework. I is the D-dimensional input dataset and O is the d-dimensional (d < D) output dataset.

First, we use k-nearest-neighbor based clustering (KNBC) [12], [13] to detect the existence of multiple manifolds, which generally appear as isolated data blocks in the original feature space; if such blocks exist, we group them into different clusters. We choose KNBC because it is stable for high-dimensional data clustering. Furthermore, KNBC can directly reuse the k-NN information of all data points, an intermediate result that already exists in most manifold learning algorithms, to cluster the high-dimensional data with little additional work.

ICASSP 2010

Table 2. Algorithm of intra-cluster graph construction
Input: S1, S2, …, Sm: subsets with connected edges
for i = 1:m do
  for all xa, xb, xc ∈ Si that are pairwise connected do
    if (xa − xc) · (xb − xc) < 0 then
      remove the edge between xa and xb;
    end if
  end for
end for
Output: G1, G2, …, Gm: intra-cluster graphs

In the second step, we construct an intra-cluster neighborhood graph for each cluster/subset using "visible" neighbors [14], and use these graphs to estimate the geodesic distances between data points within the same manifold. If data points xa and xb are neighbors, xb is a visible neighbor of xa when no other point xc separates xa from xb; geometrically, any edge subtending an obtuse angle at a third point should be removed from the graph. This visible-neighbor rule makes the edge connections more reasonable, which is of great importance for manifold structure reconstruction [15]. Table 2 describes the algorithm of intra-cluster graph construction; the geodesic distances between data points can then be estimated by finding shortest paths in a graph whose edges connect visible neighboring points. Here Gi denotes the ith intra-cluster graph corresponding to the subset Si.

In the third step, we select representative points from each subset and use these points to construct the inter-cluster graph. For each subset, the representative points are selected based on a locally linear subspace model: we use the ratio of Euclidean distance to geodesic distance to judge the linearity of a subspace [16].
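The visible-neighbor test of Table 2 can be sketched as follows. Note one simplification relative to the table: this sketch checks every edge against the original edge set rather than updating the graph in place as edges are removed (for the simple example both give the same result); the function name is illustrative.

```python
import numpy as np

def prune_invisible_edges(X, edges):
    """Visible-neighbor rule of Table 2 (sketched): drop edge (a, b) when a
    third point c, connected to both a and b, sees them at an obtuse angle,
    i.e. (xa - xc) . (xb - xc) < 0, so that c 'separates' a from b."""
    E = set(frozenset(e) for e in edges)
    kept = []
    for a, b in edges:
        blocked = any(
            frozenset((a, c)) in E and frozenset((b, c)) in E
            and np.dot(X[a] - X[c], X[b] - X[c]) < 0
            for c in range(len(X)) if c not in (a, b)
        )
        if not blocked:
            kept.append((a, b))
    return kept

# three collinear points: the long edge (0, 2) is blocked by the middle point
X = np.array([[0.0], [1.0], [2.0]])
edges = [(0, 1), (0, 2), (1, 2)]
```

Here the angle at point 1 between points 0 and 2 is 180°, so its cosine is negative and the shortcut edge (0, 2) is pruned, leaving the chain edges that follow the manifold.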
The ratio is defined as follows:

    R_EG = (Euclidean distance between two points) / (Geodesic distance between two points).    (1)

If one or several R_EG values in a subspace are smaller than a threshold λ, we consider that the current subspace does not satisfy the linearity property, and we divide it into more subspaces. After modeling the locally linear subspaces of each subset, we select the representative points of each subset from its subspaces. For each subspace, we first compute the mean μ of all its data points, and then select the data point xi that minimizes the distance ||xi − μ|| as the representative point of this subspace. Then, we construct the inter-cluster graph by connecting the representative points
Table 3. Algorithm of inter-cluster graph construction
Input: S1, …, Sm: subsets with connected edges
for i = 1:m do
  j = 1; Mj = ∅; Xtemp = Si;
  while Xtemp ≠ ∅ do
    Mj = {xj1}; Xtemp = Xtemp − {xj1};
    for all xjk ∈ Mj do
      if xn ∈ Njk ∩ Xtemp and R_EG(xjl, xn) ≥ λ for all xjl ∈ Mj then
        Mj = Mj ∪ {xn}; Xtemp = Xtemp − {xn};
      end if
    end for
    xij_center = argmin_{x ∈ Mj} || x − Σ_{k=1}^{nj} xjk / nj ||;
    j = j + 1; Mj = ∅;
  end while
end for
connect all selected points from different subsets;
Output: Ginter: inter-cluster graph
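The greedy grouping of Table 3 together with the representative-point selection can be sketched as below. The threshold λ = 0.95, the toy L-shaped chain, and the hand-built geodesic matrix are all illustrative assumptions; in the framework the geodesic distances would come from shortest paths on the intra-cluster graph.

```python
import numpy as np

def grow_linear_subspaces(D_geo, D_euc, nbrs, lam=0.95):
    """Greedy grouping of Table 3 (sketched): grow the current group Mj from
    a seed, admitting a neighbor xn only if the ratio R_EG = D_euc / D_geo
    of Eq. (1) stays >= lam against every member already in Mj."""
    remaining = set(range(len(D_geo)))
    groups = []
    while remaining:
        seed = min(remaining)
        Mj = [seed]
        remaining.discard(seed)
        frontier = [seed]
        while frontier:
            xjk = frontier.pop()
            for xn in nbrs[xjk]:
                if xn in remaining and all(
                        D_euc[xjl, xn] / D_geo[xjl, xn] >= lam for xjl in Mj):
                    Mj.append(xn)
                    remaining.discard(xn)
                    frontier.append(xn)
        groups.append(sorted(Mj))
    return groups

def representative(X, group):
    """Representative point of a subspace: the member closest to the mean."""
    mu = X[group].mean(axis=0)
    return group[int(np.argmin(np.linalg.norm(X[group] - mu, axis=1)))]

# an L-shaped chain of five points: the bend splits it into two linear groups
X = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [2.0, 1.0], [2.0, 2.0]])
nbrs = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
D_euc = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
# geodesic along the chain: unit-length edges, so graph distance = index gap
D_geo = np.abs(np.arange(5)[:, None] - np.arange(5)[None, :]).astype(float)
groups = grow_linear_subspaces(D_geo, D_euc, nbrs)
reps = [representative(X, g) for g in groups]
```

Point 3 is rejected from the first group because its straight-line distance to point 0 (√5) is well short of the geodesic distance (3), so R_EG ≈ 0.75 < λ; the chain therefore splits at the bend, and one representative is picked per group.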
from different subsets. Note that connecting points from the same subset may destroy the manifold structure of that subset. Table 3 describes the algorithm of inter-cluster graph construction. Here, Xtemp is a temporary set that is initialized to the ith subset Si and finally reduced to the empty set in the ith iteration; Mj is the jth locally linear subspace of the current subset; xj1 denotes the first point selected into Mj; and xij_center denotes the representative point selected from the jth locally linear subspace of the ith subset. Njk = {x | x is one of the nearest neighbors of xjk}.

After intra- and inter-cluster graph construction, all data points in the input dataset are connected by a unique graph G. Similar to Isomap [10], we calculate the shortest path between each pair of data points and construct the low-dimensional embedding, trying to maintain all pairwise distances of the high-dimensional data points.
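The final distance-preserving embedding step can be sketched with classical multidimensional scaling, the same procedure Isomap applies to its geodesic distance matrix (the function name and the unit-square example are illustrative):

```python
import numpy as np

def classical_mds(D, d=2):
    """Distance-preserving embedding, as in Isomap's final step:
    double-center the squared distance matrix to obtain a Gram matrix B,
    then use its top-d eigenvectors scaled by sqrt(eigenvalue)."""
    n = len(D)
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                  # Gram matrix of centered points
    w, V = np.linalg.eigh(B)                     # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:d]                # pick the top-d components
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

# the pairwise distances of four corners of a unit square are reproduced
P = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
D = np.linalg.norm(P[:, None] - P[None, :], axis=-1)
Y = classical_mds(D, d=2)
```

For a distance matrix that is exactly Euclidean in d dimensions, the reconstruction is exact up to rotation and translation; for graph-based geodesic distances it is the least-squares approximation that "tries to maintain all pairwise distances," as described above.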
3. EXPERIMENTS

In this section, we demonstrate the performance of the proposed framework on a standard dynamic texture dataset, DynTex, which includes 656 video shots containing different kinds of dynamic textures [17]. The original feature space for the dynamic texture videos has 101,376 (352×288) dimensions, where each dimension is the luminance value of one pixel in the video frame. The low-dimensional feature spaces for trajectory representation have two dimensions, the x-axis and the y-axis; to make the illustrations clearer, we add a temporal dimension (temporal axis) for demonstration.
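The feature-space construction described above amounts to flattening each frame into one vector. A minimal sketch, using placeholder pixel data rather than an actual DynTex clip:

```python
import numpy as np

# Hypothetical stand-in for a DynTex clip: 250 grayscale frames at 352x288.
# Each frame's luminance values are flattened into one point in R^101376,
# so the clip becomes a time-ordered sequence of high-dimensional points.
frames = np.zeros((250, 288, 352), dtype=np.uint8)   # placeholder video data
points = frames.reshape(len(frames), -1).astype(np.float64)
```

The row order of `points` preserves the temporal order of the frames, which is what lets the low-dimensional embedding be read as a trajectory over time.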
3.1. Video shots with different dynamic textures

First we compare our method with a classical linear dimensionality reduction algorithm, principal component analysis (PCA) [18]. Fig. 1(a) is one frame from the dynamic texture of a sea anemone with dithering. Fig. 1(b) is one frame from a video of couch grass with
Figure 1. Trajectory representation of different dynamic textures: (a) sea anemone with dithering; (b) couch grass with camera motion; (c) trajectories generated by PCA; (d) trajectories generated by our method with neighbor size k = 8.

Figure 2, panels (a)–(e): (a) large scale and low rotation speed; (b) middle scale and high rotation speed; (c) small scale and various rotation speeds; (d) trajectories generated by LLE with neighbor size k = 8; (e) trajectories generated by our method with neighbor size k = 8.
linear camera motion. Fig. 1(c) shows the three-dimensional trajectories of these two video shots generated by PCA, and Fig. 1(d) shows the trajectories generated by our method with neighbor size k = 8. As shown in the left picture of Fig. 1(d), the three-dimensional trajectory generated by our method clearly describes the dithering of the sea anemone in the original video. The right picture in Fig. 1(d) not only captures the back-and-forth motion of the camera but also preserves the slight swing of the couch grass, which can be seen when the figure is sufficiently enlarged.
3.2. Video shots with similar dynamic texture but different scales and periodicities

Next we compare our method with a typical local manifold learning algorithm, LLE [9]. As shown in Fig. 2, there are three video shots of peg-top rotation with different scales and periodicities. Fig. 2(d) and Fig. 2(e) show the trajectories of these three video shots generated by LLE and by our method with k = 8, respectively. Our method generates more faithful and understandable trajectories in the low-dimensional space than LLE. First, the similar shapes of the three trajectories in Fig. 2(e) indicate the similarity of the dynamic textures in the original video shots. Second, the scales of the circles in the different trajectories correctly reflect the scales of the peg-tops in the original videos. Third, the rotation periodicities of the three peg-tops are also accurately represented: the trajectories in Fig. 2(e) (from left to right) contain 8, 33, and 11 circles over 250 video frames, respectively (according to the temporal axis), and the peg-tops in video 1 (Fig. 2(a)), video 2 (Fig. 2(b)), and video 3 (Fig. 2(c)) indeed rotate exactly 8, 33, and 11 cycles in 10 seconds.
Figure 2. Trajectory representation of similar dynamic textures but with different scales and periodicities: (f) trajectories for video 1 generated by LLE with k = 6, 8, and 10; (g) trajectories for video 1 generated by our method with k = 6, 8, and 10.
Furthermore, the generated trajectory is capable of reflecting frequency changes of the dynamic textures within a video shot. In the right picture of Fig. 2(e), we can see that the rotation frequency slows down as the frame number increases, which correctly reflects the decreasing rotation speed in video 3. In addition, we demonstrate that the proposed method is robust to different neighbor sizes k. Fig. 2(f) and Fig. 2(g) show the trajectories for the peg-top video of large scale and low rotation speed (video 1) generated by LLE and by our method, respectively. With different neighbor sizes, LLE produces very different low-dimensional embeddings, while our method remains stable.
3.3. Video shot with multiple dynamic textures

Finally, we compare our method with a representative global manifold learning algorithm, Isomap [10], on a video shot with multiple dynamic textures. As shown in Fig. 3, this is a 500-frame video of glittering caution lights.
Fig. 3(a) shows the trajectory generated by Isomap, which embeds only 231 data points into the low-dimensional space. Because this video forms multiple clusters in the original feature space, Isomap encounters a disconnected graph and discards the remaining 269 data points as outliers. Fig. 3(b) shows the embedded trajectory generated by our method: all data points are embedded into the low-dimensional space successfully. Moreover, both the characteristics of the individual manifolds and the interaction of the multiple manifolds are well extracted and represented in the low-dimensional space. Fig. 3(c) shows the trajectory in the t-x plane generated by our method. This trajectory has 32 wave crests, whose corresponding frame numbers are marked in the figure. In the original video, these frames correspond to the moments when the middle bulb of the red caution light (the left light) brightens (Fig. 3(e)), and the middle bulb of the red caution light brightens exactly 32 times in the whole video. This means that the x-axis of the trajectory correctly reflects the flashing periodicity of the red caution light. Similarly, in the t-y plane (Fig. 3(d)), there are 30 wave crests, corresponding to the 30 times the middle bulb of the white caution light brightens in the original video (Fig. 3(f)).
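Reading off flash counts from a trajectory coordinate reduces to counting wave crests. A small sketch on a synthetic stand-in signal (not the paper's data), where a crest is a strict interior local maximum:

```python
import numpy as np

def count_crests(signal):
    """Count wave crests (strict interior local maxima) of a 1-D trajectory
    coordinate, as used above to read off the flash counts."""
    s = np.asarray(signal, dtype=float)
    return int(np.sum((s[1:-1] > s[:-2]) & (s[1:-1] > s[2:])))

# synthetic stand-in: a coordinate oscillating 32 times over 500 frames
t = np.linspace(0.0, 32 * 2 * np.pi, 500)
x = np.sin(t)
```

With roughly 15 samples per period, each oscillation contributes exactly one strict local maximum, so the crest count recovers the number of flashes; on real trajectories a small amount of smoothing would likely be needed first.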
Figure 3. Trajectory representation of the 500-frame video with glittering caution lights: (a) trajectory in t-x-y space generated by Isomap; (b) trajectory in t-x-y space generated by our method; (c) trajectory in t-x plane generated by our method; (d) trajectory in t-y plane generated by our method; (e) frames corresponding to the wave crests in the t-x plane; (f) frames corresponding to the wave crests in the t-y plane.
4. CONCLUSION

This paper proposes a novel framework to represent the video trajectories of dynamic textures via manifold learning. Relying on k-nearest-neighbor clustering, intra-cluster graph construction using visible neighbors, and inter-cluster graph construction using representative points, the proposed framework preserves faithful structure both globally and locally and, more importantly, successfully embeds multiple manifolds into a unique low-dimensional space. With these attractive characteristics, our method outperforms several representative dimensionality reduction algorithms on the trajectory representation task for real videos from a standard dynamic texture dataset.
5. REFERENCES
[1] G. Doretto, A. Chiuso, Y.N. Wu, and S. Soatto, "Dynamic textures," IJCV, vol. 51, no. 2, pp. 91-109, 2003.
[2] D. Chetverikov and R. Péteri, "A brief survey of dynamic texture description and recognition," in Proc. 4th International Conference on Computer Recognition Systems, Poland, pp. 17-26, 2005.
[3] C.H. Peh and L.-F. Cheong, "Synergizing spatial and temporal texture," IEEE Trans. Image Processing, vol. 11, pp. 1179-1191, 2002.
[4] R. Fablet and P. Bouthemy, "Motion recognition using nonparametric image motion models estimated from temporal and multiscale co-occurrence statistics," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, pp. 1619-1624, 2003.
[5] P. Saisan, G. Doretto, Y. Wu, and S. Soatto, "Dynamic texture recognition," in Proc. CVPR, vol. 2, pp. 58-63, 2001.
[6] R.P. Wildes and J.R. Bergen, "Qualitative spatiotemporal analysis using an oriented energy representation," in Proc. ECCV, pp. 768-784, 2000.
[7] J.R. Smith, C.-Y. Lin, and M. Naphade, "Video texture indexing using spatiotemporal wavelets," in Proc. ICIP, pp. 437-440, 2002.
[8] K. Otsuka, T. Horikoshi, S. Suzuki, and M. Fujii, "Feature extraction of temporal texture based on spatiotemporal motion trajectory," in Proc. ICPR, vol. 2, pp. 1047-1051, 1998.
[9] S.T. Roweis and L.K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, pp. 2323-2326, 2000.
[10] J.B. Tenenbaum, V. de Silva, and J.C. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, pp. 2319-2323, 2000.
[11] R. Pless, "Image spaces and video trajectories: using Isomap to explore video sequences," in Proc. ICCV, pp. 1433-1440, 2003.
[12] Y. Liu, Y. Liu, and K.C.C. Chan, "Dimensionality reduction for heterogeneous dataset in rushes editing," Pattern Recognition, vol. 42, no. 2, pp. 229-242, 2009.
[13] Y. Liu, Y. Liu, and K.C.C. Chan, "Multiple video trajectories representation using double-layer isometric feature mapping," in Proc. ICME, pp. 129-132, 2008.
[14] T. Lin and H. Zha, "Riemannian manifold learning," IEEE Trans. Pattern Anal. Mach. Intell., vol. 30, no. 5, pp. 796-809, May 2008.
[15] M. Balasubramanian, E.L. Schwartz, J.B. Tenenbaum, V. de Silva, and J.C. Langford, "The Isomap algorithm and topological stability," Science, vol. 295, no. 5552, p. 9, 2002.
[16] R. Wang, S. Shan, X. Chen, and W. Gao, "Manifold-manifold distance with application to face recognition based on image set," in Proc. CVPR, pp. 1-8, 2008.
[17] DynTex dynamic texture database, http://www.cwi.nl/projects/dyntex/index.html.
[18] H. Hotelling, "Analysis of a complex of statistical variables into principal components," J. Educ. Psychol., vol. 24, pp. 417-441, 498-520, 1933.