Optimal Operator Space Pursuit: A Framework for Video Sequence Data Analysis
Xiao Bian, Hamid Krim
North Carolina State University
Abstract. High dimensional data sequences, such as video clips, can be modeled as trajectories in a high dimensional space, and usually exhibit a low dimensional structure intrinsic to each distinct class of data sequence [1]. In this paper, we exploit a fiber bundle formalism to model various realizations of each trajectory, and characterize these high dimensional data sequences by an optimal operator space. Each operator is calculated as a matched filter corresponding to a standard Gaussian output with the data as input. The low dimensional structure intrinsic to the data is further explored by minimizing the dimension of the operator space under data-driven constraints. The dimension minimization problem is reformulated as a convex nuclear norm minimization problem, and an associated algorithm is proposed. Moreover, a fast method with superior performance for video-based human activity classification is implemented by searching for an optimal operator space adapted to the data. Illustrative examples demonstrating the performance of this approach are presented.
1 Introduction
High dimensional data sequence analysis is an attractive, yet challenging problem, both for its wide application to video content retrieval and pattern recognition and for its intrinsic complexity. One key challenge is the high dimensionality of the data, which is increasingly common in applications such as image and video sequences, and which often conceals and blurs crucial information. When processing video and image data, the first step often includes the extraction of critical features [2], dimension reduction [3] [1] [4], or both [5] [6]. To a large extent, both kinds of approaches aim at efficiently and accurately estimating the low dimensional subspace in which the hidden states of the data sequences lie. Low dimensional subspace learning techniques, such as manifold learning, however, usually require dense samples [6] [5] [4]. On the one hand, dense sampling is not always available; on the other hand, a large data set incurs a high computational cost. In this paper, we propose a novel framework for high dimensional data analysis. We describe image sequences using the formalism of fiber bundles, and construct an operator space H which is homeomorphic to the manifold of hidden states of the sequences. Instead of working on an a priori unknown data
space (space of images), we exploit the corresponding operator space to categorize and classify different sequences, by first developing an algorithm to find the optimal low dimensional operator space where the discriminating information is compactly stored. Finally, we illustrate the viability of this strategy with a fast method for classifying human activity video sequences which requires no background subtraction and which, by using the trained optimal operator space, is invariant to pace or speed.
2 Operators on Image Sequences
2.1 Geometric Space of Image Sequences
Let the space of all images be E; then each image sequence may be viewed as a sampled curve in E. Even if E is treated as a metric space, there is no guarantee that curves from the same class will be close to each other, since a given class of image sequences may have different realizations. Taking human activity analysis as an example, different people may perform the same activity quite differently (different clothes, gestures and speed). Although these video sequences share the same activity, simply clustering in E will not be revealing. We therefore need more detailed structures for data in E.
Fig. 1. Sample frames of human activity video clips (Upper: running; Lower: Siding)
Let each frame in a video sequence be an observation of a hidden state which lies on a manifold B embedded in E. These hidden states may be seen as control variables of a certain activity. We can then invoke the fiber bundle formalism to describe the data space (Fig. 2). E is the global space of all image frames; in the case of m × n gray-level images, E is the space of m × n matrices. B is the base space of the bundle, containing all control variables. The fiber F over $p \in B$ is the space of different realizations of the control variable p. $\pi : E \to B$ is a continuous surjection such that, for a neighborhood $U \subset B$, $\pi^{-1}(U)$ is homeomorphic to the product space $U \times F$. Each image sequence is then a high dimensional curve in E corresponding to a low dimensional curve lying on the base manifold B. Under this setting, different
realizations of the same activity may vary in E, but should follow the same trajectory in B up to some noise term. The problem may now be stated as follows: given sampled high dimensional curves in E, and information about the similarities of the trajectories of their control variables in B, extract features of the base manifold B and categorize the unknown sequences.
Fig. 2. Fiber bundle structure for data space
2.2 Operator Space Construction
In human activity analysis, or in video sequence analysis in general, we neither have the explicit form of the base manifold B, nor the ability to approximate the tangent space of this manifold. The idea is then to use the given samples in E to construct a space H which is homeomorphic to B, and to investigate H instead of the unknown manifold B. The key step in establishing a homeomorphism is a bijection: we need a one-to-one and onto mapping between B and H. Since the data are sampled curves in E, with no data sampled directly from B, we view B as the set of equivalence classes of the elements of each fiber $\pi^{-1}(p)$, the fiber F over the point $p \in B$: $B = \{[x] : x \in \pi^{-1}(p)\}$. To map all points in one fiber $F_p$ to a unique element in H, inspired by our previous operator-based approach to video sequence analysis, we can empirically establish a homeomorphism between the operator space H and B as follows:

$$h^* = \arg\min_h \sum_i |x_i h - g|, \tag{1}$$

where $x_i$, $i = 1, \dots, m$, are samples on the fiber $F[p] = \pi^{-1}(p)$ and g is a constant function.
Since the objective function is convex, we have a unique solution $h\{F[p]\}$ for each fiber $F[p]$. Consequently, there exists a one-to-one and onto mapping between the operator space $H = \{h\{F[p]\}\}$ and B. Therefore, instead of working on the unknown structures of B, we use sequences of operators in H to categorize and analyze image sequences.
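As an illustration of how one operator per fiber could be computed, the sketch below fits a single filter to a set of frames in the Fourier domain. It is not the authors' implementation: it assumes a squared $\ell_2$ loss in place of the absolute value in Eq. (1), a small ridge term `reg` for numerical stability, and a centered Gaussian as the target output g. The function names `gaussian_target`, `fit_operator` and `response` are hypothetical.

```python
import numpy as np

def gaussian_target(shape, sigma=2.0):
    """Centered 2-D Gaussian used as the desired output g (assumed form)."""
    rows, cols = shape
    r = np.arange(rows) - rows // 2
    c = np.arange(cols) - cols // 2
    rr, cc = np.meshgrid(r, c, indexing="ij")
    return np.exp(-(rr ** 2 + cc ** 2) / (2.0 * sigma ** 2))

def fit_operator(frames, g, reg=1e-3):
    """Least-squares surrogate of Eq. (1), solved per frequency.

    Minimizes sum_i ||X_i * H - G||^2 + reg * ||H||^2, where X_i and G are
    the 2-D Fourier transforms of frame i and of the target g.
    Returns the frequency response H of the filter.
    """
    G = np.fft.fft2(g)
    num = np.zeros_like(G)                    # accumulates conj(X_i) * G
    den = np.full(G.shape, reg, dtype=float)  # accumulates |X_i|^2 + reg
    for x in frames:
        X = np.fft.fft2(x)
        num += np.conj(X) * G
        den += np.abs(X) ** 2
    return num / den

def response(frame, H):
    """Output of the filter H applied to one frame, back in the image domain."""
    return np.real(np.fft.ifft2(np.fft.fft2(frame) * H))
```

With a single frame and a very small `reg`, the response of the fitted filter reproduces g almost exactly; with several frames from the same fiber, the filter is a compromise across all of them.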
Fig. 3. Homeomorphism between H and B
3 Optimal Operator Space Pursuit
3.1 Problem formulation
Since the operator space H, like the base manifold B, lies on a low dimensional subspace, it is natural to find the optimal subspace by solving the following constrained dimension minimization problem:

$$\min \dim(H) \quad \text{s.t.} \quad \|h_i(X_i) - g\|_2 \le C, \; h_i \in H. \tag{2}$$
Here $\{X_i\}$, $i = 1, \dots, m$, are the frames of a given image sequence. In practice, minimizing the dimension of the operator space means finding the lowest-rank matrix H under the constraints in Eq. (2). The rank minimization problem is generally NP-hard; this problem may, however, be treated as a constrained
nuclear norm minimization, which can be seen as the tightest convex relaxation of Eq. (2). Therefore, we can replace the objective function in Eq. (2) by $\|H\|_*$, yielding the nuclear norm minimization problem

$$\min \|H\|_* \quad \text{s.t.} \quad \|X_i h_i - g\|_2 \le C, \; H = [h_1 | \cdots | h_m]. \tag{3}$$
Here $\{X_i\}$, $i = 1, \dots, m$, are diagonal matrices with the Fourier coefficients of each frame on the diagonal, and $h_i$ is the corresponding filter.
3.2 Solution by singular value thresholding
Eq. (3) can be rewritten in a more general form:
$$\min \|H\|_* \quad \text{s.t.} \quad \|A_i(H) - g\|_2 \le C, \quad \text{for } i = 1, \dots, m. \tag{4}$$
$A(\cdot) : \mathbb{R}^n \to \mathbb{R}^n$ is a linear operator. Since the singular value thresholding operator has been used successfully for large scale nuclear norm minimization problems, we build upon [7] and develop a modified version of the singular value thresholding algorithm adapted to our problem, Eq. (4). For a matrix X with singular value decomposition $X = U \Sigma V^*$, the singular value thresholding operator is defined as

$$D_\tau(X) = U S_\tau(\Sigma) V^*, \quad S_\tau(\Sigma) = \mathrm{diag}\big((\sigma_i - \tau)_+\big), \tag{5}$$

where $(t)_+ = \max(t, 0)$.
We have the following theorem [7].

Theorem 1 For each $\tau \ge 0$ and $Y \in \mathbb{R}^{m \times n}$, the singular value thresholding operator (Eq. 5) obeys

$$D_\tau(Y) = \arg\min_X \left\{ \frac{1}{2} \|X - Y\|_F^2 + \tau \|X\|_* \right\}. \tag{6}$$
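A minimal NumPy sketch of the operator in Eq. (5), together with the objective of Eq. (6) so that Theorem 1 can be checked numerically on small matrices. The function names `svt` and `prox_objective` are chosen here for illustration only.

```python
import numpy as np

def svt(Y, tau):
    """Singular value thresholding D_tau(Y): shrink each singular value by tau."""
    U, sigma, Vh = np.linalg.svd(Y, full_matrices=False)
    shrunk = np.maximum(sigma - tau, 0.0)  # (sigma_i - tau)_+
    return (U * shrunk) @ Vh               # U diag(shrunk) V^*

def prox_objective(X, Y, tau):
    """Objective of Eq. (6): 0.5 * ||X - Y||_F^2 + tau * ||X||_*."""
    return 0.5 * np.linalg.norm(X - Y, "fro") ** 2 + tau * np.linalg.norm(X, "nuc")
```

For any Y, the value `prox_objective(svt(Y, tau), Y, tau)` should not exceed the objective at any other matrix X, and a threshold larger than the largest singular value of Y returns the zero matrix.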
From Thm. 1, we can see that the singular value thresholding operator is closely connected to the nuclear norm minimization problem. As $\tau \uparrow \infty$, the solution of Eq. (6) converges to that of a nuclear norm minimization. Therefore, by selecting a large $\tau$, we work on the approximate optimization problem:
$$\min \; \tau \|H\|_* + \frac{1}{2} \|H\|_F^2 \quad \text{s.t.} \quad \|A_i(H) - g\|_2 \le C, \quad \text{for } i = 1, \dots, m. \tag{7}$$
Using Uzawa's algorithm to find the saddle point, we have

$$\begin{aligned} H^k &= \arg\min_H L(H, y^{k-1}, s^{k-1}), \\ y^k &= y^{k-1} + \delta_k \, \partial_y L(H^k, y, s), \\ s^k &= s^{k-1} + \delta_k \, \partial_s L(H^k, y, s). \end{aligned} \tag{8}$$
The generalized Lagrangian of Eq. (7) is given by

$$L(H, y, s) = \tau \|H\|_* + \frac{1}{2} \|H\|_F^2 + \sum_i \big( \langle y_i, g - A_i(H) \rangle - s_i C \big). \tag{9}$$
This gives

$$\arg\min_H L(H, y, s) = \arg\min_H \Big\{ \tau \|H\|_* + \frac{1}{2} \|H\|_F^2 - \sum_i \langle y_i, A_i(H) \rangle \Big\}. \tag{10}$$
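Considering the adjoint $A_i^*$ of $A_i$, we have $\langle y_i, A_i(H) \rangle = \langle A_i^*(y_i), H \rangle$. Then Eq. (10) can be rewritten as

$$\arg\min_H L(H, y, s) = \arg\min_H \Big\{ \tau \|H\|_* + \frac{1}{2} \Big\| H - \sum_i A_i^*(y_i) \Big\|_F^2 \Big\}. \tag{11}$$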
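The step from Eq. (10) to Eq. (11) is a completion of the square. Writing $Z = \sum_i A_i^*(y_i)$ as shorthand (a symbol introduced only for this derivation), the terms that depend on H satisfy:

```latex
\begin{aligned}
\tau \|H\|_* + \tfrac{1}{2}\|H\|_F^2 - \sum_i \langle y_i, A_i(H) \rangle
  &= \tau \|H\|_* + \tfrac{1}{2}\|H\|_F^2 - \langle Z, H \rangle \\
  &= \tau \|H\|_* + \tfrac{1}{2}\|H - Z\|_F^2 - \tfrac{1}{2}\|Z\|_F^2 .
\end{aligned}
```

The last term does not depend on H, so it can be dropped from the arg min, and Theorem 1 then applies with Y = Z.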
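According to Thm. 1, and letting $P_K$ be the orthogonal projection onto the second-order cone K, we have the following iteration:

$$\begin{aligned} H^k &= D_\tau\Big( \sum_i A_i^*(y_i^{k-1}) \Big), \\ \begin{bmatrix} y_i^k \\ s_i^k \end{bmatrix} &= P_K\left( \begin{bmatrix} y_i^{k-1} \\ s_i^{k-1} \end{bmatrix} + \delta_k \begin{bmatrix} g - A_i(H^k) \\ -C \end{bmatrix} \right). \end{aligned} \tag{12}$$

The projection is given by [7]

$$P_K(y, s) = \begin{cases} (y, s), & \|y\| \le s, \\ \dfrac{\|y\| + s}{2\|y\|} \, (y, \|y\|), & -\|y\| \le s \le \|y\|, \\ (0, 0), & s \le -\|y\|. \end{cases}$$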
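The iteration above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the authors' code: all quantities are real-valued, each $X_i$ is stored as the column `X[:, i]` holding its diagonal, so that $A_i(H) = X_i h_i$ is an elementwise product and $\sum_i A_i^*(y_i)$ is the matrix whose i-th column is `X[:, i] * Y[:, i]`, and the step size `delta` is held constant. The names `project_soc` and `operator_space_pursuit` are hypothetical.

```python
import numpy as np

def svt(Z, tau):
    """Singular value thresholding operator D_tau of Eq. (5)."""
    U, sigma, Vh = np.linalg.svd(Z, full_matrices=False)
    return (U * np.maximum(sigma - tau, 0.0)) @ Vh

def project_soc(y, s):
    """Orthogonal projection of (y, s) onto the cone K = {(y, s): ||y|| <= s}."""
    norm_y = np.linalg.norm(y)
    if norm_y <= s:               # already inside the cone
        return y, s
    if s <= -norm_y:              # inside the polar cone: project to the apex
        return np.zeros_like(y), 0.0
    scale = (norm_y + s) / (2.0 * norm_y)
    return scale * y, scale * norm_y

def operator_space_pursuit(X, g, C, tau, delta, n_iter=200):
    """Sketch of the iteration in Eq. (12) for the constraints of Eq. (3).

    X : (n, m) array, column i is the diagonal of X_i (real-valued here).
    g : (n,) target output.  Returns H of shape (n, m) with columns h_i.
    """
    n, m = X.shape
    Y = np.zeros((n, m))          # dual variables y_i, one column per frame
    s = np.zeros(m)               # dual variables s_i
    H = np.zeros((n, m))
    for _ in range(n_iter):
        H = svt(X * Y, tau)       # H^k = D_tau(sum_i A_i^*(y_i^{k-1}))
        for i in range(m):
            residual = g - X[:, i] * H[:, i]          # g - A_i(H^k)
            Y[:, i], s[i] = project_soc(Y[:, i] + delta * residual,
                                        s[i] - delta * C)
    return H
```

A simple sanity check: when C is at least $\|g\|_2$, the zero matrix is feasible, the dual variables are projected back to zero at every step, and the iteration returns H = 0.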
4 Algorithm for human activities analysis and Experimental results
Video-based human activity analysis is a natural test case because of its intrinsic low dimensional structure [3], represented in a very high dimensional data space. The high dimensionality is due to the variability among different individuals (appearance, gestures, etc.) and the highly nonlinear deformation among frames. We next carry out experiments on human activity classification to demonstrate the performance of our algorithm. To compare with the state of the art, we use the database from [2] (also used by [3] [8]), which consists of 188 × 144, 25 fps low-resolution video sequences of different human activities, such as walking, bending, running and jumping, performed by 9 individuals. Sample frames are shown in Fig. 4.
4.1 Preprocessing: difference image
Most, if not all, state-of-the-art work [3] [2] [6] includes a background subtraction step in preprocessing, so that only the shapes or silhouettes of the human body in each frame are used. The argument is that textural and background information is irrelevant to the activities. Under this setting, the quality of the background subtraction affects the performance of human activity classification. The computational complexity of background subtraction may also limit the potential of these algorithms to carry out the analysis in real time. In our work, the variability of textural information is already accounted for by the fiber bundle formalism, and the ability of the operator-based approach to cope with noise terms [9] allows us to use a coarser preprocessing without affecting the classification performance. Since the variation among the gestures in each frame is more intrinsic to human activities, we use the centered difference image of two neighboring frames as the input to optimal operator space pursuit, without any other preprocessing. This dramatically reduces the burden of preprocessing and, as we note in the next section, gives us the potential to classify high dimensional data sequences in real time.
4.2 Video sequences similarity measure
Consider two video sequences $X^1$ and $X^2$, with corresponding optimal operator spaces $H^1$ and $H^2$, respectively. Given the constraints in Eq. (3), we have $\|X_i^q h_i^q - g\|_2 \le C$, $H^q = [h_1^q | \cdots | h_m^q]$, for $q = 1, 2$. Essentially, the two sequences impose two different sets of constraints on Eq. (3), which lead to two different convex optimization problems. Therefore, if the two sequences $X^1$ and $X^2$ are from different classes (which means they follow substantially different trajectories in the base manifold B), since the optimal operator space $H^q$, $q = 1, 2$, is uniquely defined and optimized for every different trajectory,
Fig. 4. Examples of human activity video sequences and difference images (upper: running; lower: walking)
at least one of the sequences will break the constraints under the other's optimal operator space, such that

$$\min_{j = 1, \dots, m} \|X_i^p h_j^q - g\|_2 \ge C, \quad H^q = [h_1^q | \cdots | h_m^q]. \tag{13}$$
Generally, $X^1$ and $X^2$ from different classes will not meet the constraints under each other's optimal operator space. In the special case where $X^1$ is part of $X^2$, $X^1$ will meet the constraints under the optimal operator space of $X^2$. However, if the part of $X^2$ complementary to $X^1$ is substantially different from $X^1$, then that part will not meet the constraints under the optimal operator space of $X^1$. Mathematically, we can reformulate the above ideas to define the frame-to-sequence distance and, consequently, the sequence-to-sequence distance by introducing the following definition.

Definition 1 Frame-to-sequence distance. For two sequences $X^p = \{X_i^p\}_{i=1}^{n}$ and $X^q = \{X_j^q\}_{j=1}^{m}$, and for any frame $X_i^p \in X^p$,

$$d(X_i^p, X^q) = \min_{j = 1, \dots, m} \|X_i^p h_j^q - g\|_2. \tag{14}$$
Intuitively, we find the operator in the optimal operator space that gives the minimum deviation from the ideal output. This is the best operator in the space for characterizing the given image, and the distance measures how good it is. Furthermore, we calculate the mean deviation of the entire sequence $X^p$ with respect to $H^q$, and likewise of $X^q$ with respect to $H^p$. Then, as discussed above, to cover the case where one sequence is part of the other, we pick the larger of the two.
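The two distances described above can be sketched as follows. The sketch assumes that each frame is stored as the diagonal of its matrix $X_i$ (one column per frame), that the operator space of a sequence is a matrix whose columns are its filters, and that g is a vector of the same length; the function names are hypothetical.

```python
import numpy as np

def frame_to_sequence(x, H_q, g):
    """Eq. (14): smallest deviation ||X_i^p h_j^q - g||_2 over all filters h_j^q.

    x   : (n,) diagonal of one frame matrix X_i^p.
    H_q : (n, m) operator space of the other sequence, one filter per column.
    """
    residuals = x[:, None] * H_q - g[:, None]   # column j holds X_i^p h_j^q - g
    return np.min(np.linalg.norm(residuals, axis=0))

def sequence_to_sequence(X_p, H_p, X_q, H_q, g):
    """Larger of the two mean frame-to-sequence distances."""
    d_pq = np.mean([frame_to_sequence(X_p[:, i], H_q, g)
                    for i in range(X_p.shape[1])])
    d_qp = np.mean([frame_to_sequence(X_q[:, j], H_p, g)
                    for j in range(X_q.shape[1])])
    return max(d_pq, d_qp)
```

Taking the maximum makes the measure symmetric in its two arguments, and keeps it large when only one of the two sequences is well explained by the other's operator space.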
Definition 2 Sequence-to-sequence distance. For two sequences $X^p = \{X_i^p\}_{i=1}^{n}$ and $X^q = \{X_j^q\}_{j=1}^{m}$,

$$D(X^p, X^q) = \max\big( \mathrm{mean}\{d(X_i^p, X^q)\}, \; \mathrm{mean}\{d(X_j^q, X^p)\} \big). \tag{15}$$

4.3 Experimental results
Instead of using the entire video sequence of each sample, we further segment every sequence into 10-frame segments. This setting, similar to what has been used in [8] [2], has two advantages compared to [6] [1] [3]. First, for real applications, it is unrealistic to wait until a target finishes its activity before doing the analysis; it is more convincing and useful to obtain an acceptable performance from a small amount of data. Second, using a small input reduces the computational complexity dramatically [7], since the computational cost here is $O(N \times L^2)$ [10], where N is the resolution and L is the length of the video sequence. Thus, keeping L on the order of 10, rather than using a longer sequence, is of substantial practical importance, as it gives us the potential to do the analysis in real time.
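The preprocessing and segmentation steps are simple enough to sketch directly. The sketch below makes two assumptions that the text does not settle: the difference image is taken as the plain difference of consecutive frames, with no spatial centering on the subject, and the 10-frame segments are non-overlapping, with any leftover frames dropped. The function names are hypothetical.

```python
import numpy as np

def difference_images(frames):
    """Difference of each pair of neighboring frames.

    frames : (T, rows, cols) array. Returns a (T - 1, rows, cols) array.
    Spatial centering on the moving subject is not attempted here.
    """
    frames = np.asarray(frames, dtype=float)
    return frames[1:] - frames[:-1]

def segment(frames, length=10):
    """Split a sequence into non-overlapping segments of `length` frames."""
    n_segments = len(frames) // length
    return [frames[k * length:(k + 1) * length] for k in range(n_segments)]
```

For example, a 25-frame clip yields 24 difference images and therefore two 10-frame segments, each of which would be passed on to the operator space pursuit.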
Fig. 5. Classification rate vs. percentage of data used as training set
The database contains human activities performed by 9 people. To demonstrate the performance of our framework, we randomly pick the samples of 1, 3, 5 and 7 people, respectively, as the training set and use the others as the testing set. Each input (a 10-frame segment from the testing set) is assigned to the class of its nearest neighbor under the measure of Def. 2. The results are shown in Fig. 5. Note that we intentionally use smaller training sets to illustrate the ability to generalize. With 7 people used as the training set, which is about 77.78% of
activity sequences, we obtain a classification rate of 97.92%, which is higher than most previous publications [8] [6] [3] using the same database, and comparable to [2], which uses the leave-one-out test for classification. Moreover, our algorithm does not need to perform alignment as in [6] or to use the entire video sequence as in [3]; in either case, the processing time increases because the entire input sequence must be buffered and preprocessed. Additionally, from Fig. 5, we can see that the classification rate does not decrease severely for a smaller training set. With 5 people (about 55.55% of the data) as the training set, the classification rate is 95.57%; it is 91.53% with 3 people and 92.08% with 1. Considering that very few works have shown the robustness of their method to the size of the training set, this indicates a stronger generalization ability and faster learning than other state-of-the-art works.
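The nearest-neighbor assignment used in these experiments can be sketched as follows, assuming the sequence-to-sequence distances of Def. 2 between every testing segment and every training segment have already been collected in a matrix. The function names and the layout of the distance matrix are assumptions made for illustration.

```python
import numpy as np

def classify_1nn(dist_matrix, train_labels):
    """Assign each test segment the label of its nearest training segment.

    dist_matrix[i, j] is the distance D between test segment i and
    training segment j.
    """
    nearest = np.argmin(np.asarray(dist_matrix), axis=1)
    return np.asarray(train_labels)[nearest]

def classification_rate(predicted, truth):
    """Fraction of test segments whose predicted label matches the truth."""
    return float(np.mean(np.asarray(predicted) == np.asarray(truth)))
```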
5 Conclusion
In this paper, we proposed a novel geometry-based framework for high dimensional data sequence analysis. Instead of exploring the unknown high dimensional data space, we use optimal operator spaces to compactly represent the information in the high dimensional trajectories. The problem is reformulated as a convex minimization problem, and an associated algorithm is developed. Moreover, we implement a fast method for video-based human activity classification, and carry out a series of experiments showing its high classification rate and its robustness to the size of the training data. Future research may include the analysis of more complicated scenarios, such as multiple object interaction, and applications involving video retrieval and indexing.
References
1. Abdelkader, M.F., Abd-Almageed, W., Srivastava, A., Chellappa, R.: Silhouette-based gesture and action recognition via modeling trajectories on Riemannian shape manifolds. Computer Vision and Image Understanding 115 (2011) 439–455
2. Gorelick, L., Blank, M., Shechtman, E., Irani, M., Basri, R.: Actions as space-time shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence 29 (2007) 2247–53
3. Blackburn, J., Ribeiro, E.: Human motion recognition using Isomap and dynamic time warping. In: Proceedings of the 2nd conference on Human motion: understanding, modeling, capture and animation, Springer-Verlag (2007) 285–298
4. Donoho, D.: Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Academy of Sciences of the United (2003) 1–15
5. Belkin, M., Niyogi, P., Sindhwani, V.: Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples. Journal of Machine Learning Research 7 (2006) 2399–2434
6. Yi, S., Krim, H., Norris, L.: Human Activity Modeling as Brownian Motion on Shape Manifold. Scale Space and Variational Methods in Computer Vision (2012) 628–639
7. Cai, J.F., Candès, E.J., Shen, Z.: A singular value thresholding algorithm for matrix completion. SIAM J. on Optimization 20 (2010) 1956–1982
8. Bian, X., Krim, H.: Video-based human activities analysis: an operator-based approach. 20-th International Conference in Central Europe on Computer Graphics, Visualization and Computer Vision (2012)
9. Savvides, M., Kumar, B.V.K.V., Khosla, P.K.: Cancelable biometric filters for face recognition. In: ICPR (3)'04. (2004) 922–925
10. Lin, Z., Chen, M., Ma, Y.: The augmented Lagrange multiplier method for exact recovery of corrupted low-rank matrices. UIUC Technical Report UILU-ENG-092214 (2011)