An Improved Clonal Selection Algorithm for Articulated Human Motion Tracking

Yi Li, Zhengxing Sun*
State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, 210093, China
[email protected], *[email protected]

Abstract—In this paper, we present a novel generative method for human motion tracking. The principal contribution is the development of a clonal selection algorithm for pose analysis in the latent space of human motion. Firstly, we use Isomap to learn the low-dimensional latent space of the pose state, and we propose a manifold reconstruction method to establish smooth mappings between the latent space and the original space. Pose analysis is performed in this latent space, which makes it more efficient and accurate. Secondly, we apply a new evolutionary approach, the clonal selection algorithm (CSA), to pose optimization, and we design a CSA-based method for pose estimation that achieves viewpoint-invariant 3D pose reconstruction from static images. Thirdly, to make CSA suitable for motion tracking, we propose a sequential CSA (S-CSA) framework that incorporates temporal continuity information into the traditional CSA. Our methods are demonstrated on different motion types and different image sequences. Experimental results show that our method achieves better results than state-of-the-art methods.

Keywords-Human motion tracking; Pose estimation; Isomap; Manifold learning; Clonal selection algorithm

I. INTRODUCTION

Estimating articulated human pose from images or image sequences is an important problem in computer vision. Applications of this technology span scientific and consumer domains, such as smart environments, human-computer interfaces, intelligent visual surveillance and assisted living. Although many researchers have attacked it, the problem remains open, mainly because of the complicated nature of 3D human motion and the incomplete information that 2D images provide for 3D human motion analysis.

In general, approaches to vision-based human motion analysis can be broadly divided into two categories: generative and discriminative [1]. Discriminative methods attempt to learn a direct mapping from image features to 3D pose from training data. The mapping is often approximated using nearest neighbor [2] or regression models [3]. While effective and fast, these methods are inherently limited by the amount and quality of the training data. Moreover, the relationship between image features and human pose is often multimodal, which makes it difficult to build the mapping accurately. In contrast, generative methods exploit the fact that although the mapping from visual features to poses is complex and multimodal, the reverse mapping is often well-posed. Therefore, pose recovery is tackled by optimizing an objective function that encodes the pose-feature correspondence [4], or by sampling posterior pose probabilities [5]. Compared with discriminative methods, generative methods are usually more accurate, but they carry a heavy computational load, especially in high-dimensional state spaces. Moreover, the optimization method and the initialization are also bottlenecks of the approach, especially in tracking.

In previous work, several strategies have been proposed for reducing the dimensionality of the configuration space, including motion models [6], hierarchical search [7] and dimensionality reduction [4, 8]. Of these, dimensionality reduction has attracted the most interest, as it reduces the dimensionality and extracts prior knowledge of human motion at the same time. However, linear methods such as PCA [4] are often inadequate, since the mapping between the original pose space and the latent space is in general non-linear. Although non-linear methods such as manifold learning [8] can learn this non-linear latent space, they do not provide mappings between the two spaces. How to build smooth bidirectional mappings remains an open problem.

The optimization method is another key research problem in generative motion tracking, and it is typically tackled with either deterministic or stochastic methods. Deterministic methods, such as gradient descent search [9], are usually computationally efficient but easily become trapped in local minima. In contrast, stochastic methods, such as the particle filter [5], are usually more robust but carry a heavy computational load, especially in high-dimensional state spaces. In recent years, evolutionary computing methods such as genetic algorithms [4] and particle swarm optimization [7] have received increasing attention. Despite this considerable body of work, a more effective optimization method is still needed for robust visual tracking.

In this paper, we propose a novel generative approach in the framework of evolutionary computing. The main contributions of this paper are as follows. Firstly, we learn the latent space of the pose state and establish smooth mappings by manifold reconstruction. Motion analysis in this latent space is more efficient and accurate. Secondly, we apply a new evolutionary approach, the Clonal Selection Algorithm (CSA) [10], to pose optimization, and we design a CSA-based method for pose estimation from a static image, which can be used to initialize motion tracking. Thirdly, since the tracking process is a dynamic optimization problem, we propose a sequential CSA algorithm that incorporates temporal continuity information into the traditional CSA. To the best of our knowledge, the proposed algorithm is new in the human motion tracking literature. Experimental results show that our method achieves better results than state-of-the-art methods.

978-1-4673-2430-4/12/$31.00 ©2012 IEEE


II. STATE SPACE ANALYSIS

Step 1 (preparation):
(1) Use the Isomap algorithm to compute the low-dimensional vectors $\{y_i \mid y_i \in Y, i = 1, \ldots, l\}$ for the original input vectors $\{x_i \mid x_i \in X, i = 1, \ldots, l\}$.
(2) Construct the matrices $X_i = (x_{i_1} - x_i, \ldots, x_{i_{l_i}} - x_i) \in R^{n \times l_i}$ and $Y_i = (y_{i_1} - y_i, \ldots, y_{i_{l_i}} - y_i) \in R^{d \times l_i}$, where $\{x_{i_j} \mid j = 1, \ldots, l_i\}$ are the $\varepsilon$-neighbors of $x_i$.
(3) Compute $Q_i = X_i Y_i^T (Y_i Y_i^T)^G \in R^{n \times d}$, where $(Y_i Y_i^T)^G$ is the generalized inverse of $Y_i Y_i^T$.

Step 2 (manifold reconstruction):
(1) Mapping from the original space to the latent space, $g: X \to Y$, $y = g(x)$. Given a high-dimensional pose vector $x_0$, the corresponding low-dimensional vector $y$ is computed as follows:
(1.1) Find the nearest neighbor of $x_0$ in $\{x_i \mid i = 1, \ldots, l\}$ and denote it $x_s$;
(1.2) Compute $y = \tilde{g}(x_0) = y_s + Q_s^T (x_0 - x_s)$;
(1.3) Output $y$.
(2) Mapping from the latent space to the original space, $f: Y \to X$, $x = f(y)$. Given a low-dimensional pose vector $y_0$, the corresponding high-dimensional vector $x$ is computed as follows:
(2.1) Find the nearest neighbor of $y_0$ in $\{y_i \mid i = 1, \ldots, l\}$ and denote it $y_s$;
(2.2) Compute $x = \tilde{f}(y_0) = x_s + Q_s (y_0 - y_s)$;
(2.3) Output $x$.

Using the above Isomap-based manifold reconstruction method, we can generate smooth mappings between the original pose space and the latent space, which enables us to track human motion in the latent space.
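The two steps above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: it assumes the Isomap embedding `Y` has already been computed by some other routine, it uses `numpy.linalg.pinv` as the generalized inverse, and the function names `fit_local_maps`, `to_latent` and `to_pose` are hypothetical.

```python
import numpy as np

def fit_local_maps(X, Y, eps):
    """Step 1: one local linear map Q_i (n x d) per training point.

    X: (l, n) training poses; Y: (l, d) embedding of the same points,
    assumed to come from a separately run Isomap. eps: neighbor radius.
    """
    Q = []
    for i in range(len(X)):
        dist = np.linalg.norm(X - X[i], axis=1)
        nbrs = np.flatnonzero(dist <= eps)
        nbrs = nbrs[nbrs != i]                # epsilon-neighbors, excluding x_i itself
        Xi = (X[nbrs] - X[i]).T               # n x l_i
        Yi = (Y[nbrs] - Y[i]).T               # d x l_i
        Q.append(Xi @ Yi.T @ np.linalg.pinv(Yi @ Yi.T))
    return Q

def to_latent(x0, X, Y, Q):
    """Step 2 (1): g(x0) = y_s + Q_s^T (x0 - x_s), with x_s the nearest training pose."""
    s = int(np.argmin(np.linalg.norm(X - x0, axis=1)))
    return Y[s] + Q[s].T @ (x0 - X[s])

def to_pose(y0, X, Y, Q):
    """Step 2 (2): f(y0) = x_s + Q_s (y0 - y_s), with y_s the nearest latent point."""
    s = int(np.argmin(np.linalg.norm(Y - y0, axis=1)))
    return X[s] + Q[s] @ (y0 - Y[s])
```

At the training points both mappings return the stored pairs. Away from them, $Q_s^T$ as written undoes $Q_s$ exactly only where $Q_s^T Q_s$ is the identity, so a round trip through `to_pose` and `to_latent` is in general approximate.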

Tracking in a low-dimensional subspace requires three components [11]: learning the non-linear low-dimensional latent space of human motion, establishing smooth mappings between the latent space and the original state space, and defining how tracking within the subspace occurs. In this section, we learn the low-dimensional latent space using Isomap. Then, a manifold reconstruction method is proposed to establish smooth mappings between the latent space and the original space.

A. Isomap based manifold learning

As in [4], we use motion capture data from CMU [12] for latent space learning. The subspace learned by Isomap [13] is shown in Fig.1. Training sequences that belong to the same type of motion but are performed by different subjects yield similar low-dimensional subspaces, whereas training sequences of different motion types produce different subspaces.

Figure 1. Isomap-based dimensionality reduction results. Panels (a) and (b) show the manifolds of a walking sequence and a running sequence, respectively, in a 3D subspace.

Isomap not only reduces the dimensionality of the high-dimensional input space, but also finds meaningful low-dimensional structures hidden in the high-dimensional observations. As a result, infeasible solutions, namely absurd poses, are avoided naturally during optimization, which makes motion tracking in this subspace more efficient and accurate.

In the following section, we show how tracking within the latent space occurs.

III. CLONAL SELECTION ALGORITHM FOR POSE OPTIMIZATION

The clonal selection algorithm (CSA) [10], a relatively recent evolutionary method, has attracted attention after genetic algorithms and particle swarm optimization because of its success in pattern recognition and multimodal optimization problems. In this paper, we apply CSA to pose optimization.

B. Manifold reconstruction method

Based on the intrinsic mechanism of Isomap, we propose an Isomap-based manifold reconstruction method to establish the mappings between the low- and high-dimensional states. Suppose the pose state space is $X \subset R^n$ and the low-dimensional state space is $Y \subset R^d$. Denote the mappings as $f: Y \to X$, $x = f(y)$ and $g: X \to Y$, $y = g(x)$, where $x$ and $y$ are the high- and low-dimensional vectors, respectively. The set of input instances is $\{x_i \mid x_i \in X, i = 1, \ldots, l\}$ and their corresponding points in the embedding space learned by Isomap are $\{y_i \mid y_i \in Y, i = 1, \ldots, l\}$. Assume $\{x_{i_j} \mid j = 1, \ldots, l_i\}$ are the $\varepsilon$-neighbors of point $x_i$, where $l_i$ is the number of neighbors, and their corresponding points in the embedding space are $\{y_{i_j} \mid j = 1, \ldots, l_i\}$. Then the Isomap-based manifold reconstruction method can be described as follows:

A. The CSA Algorithm

Our modified CSA for pose optimization can be summarized as follows.
(1) Initialization. CSA starts by generating an initial population, usually by spreading $N$ random points in the search space. We represent the antibody population as $A = \{a_1, a_2, \ldots, a_N\}$, where $a_i$ is an antibody, $1 \le i \le N$.
(2) Selection. We calculate the affinity of each antibody in population $A$ using the affinity measure. The $h$ individuals with the highest affinities are selected for cloning.
(3) Clone. Each antibody receives $q_i$ copies, where $q_i$ depends on the affinity and density of $a_i$: antibodies with high affinity and low density receive a large clone size.
(4) Mutation. The clones, not the original individual, then undergo the maturation (hypermutation) process. A given individual and its matured clones form a subpopulation of points in the search space.
(5) Update.

Algorithm 1: Isomap-based manifold reconstruction
Input: The training data set $\{x_i \mid x_i \in X, i = 1, \ldots, l\}$.
Output: The mappings $g: X \to Y$, $y = g(x)$ and $f: Y \to X$, $x = f(y)$.


The matured clones are evaluated with the affinity function, and the best individual of each subpopulation passes to the next generation. Finally, CSA preserves population diversity by replacing the individuals not selected for cloning in a given generation with new random points.
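The selection, clone, mutation and update steps can be sketched as a short generic optimizer. This is a minimal illustration under stated assumptions rather than the authors' implementation: the clone size is taken from affinity rank only (the density term is omitted), the mutation is Gaussian with a step that grows with rank, and `n - h - r` of the unselected individuals are kept while `r` fresh random ones are injected. The names `csa_maximize`, `max_clones` and `sigma` are hypothetical.

```python
import random

def csa_maximize(affinity, bounds, n=60, h=40, r=5, max_clones=5,
                 generations=50, sigma=0.1, seed=0):
    """Maximize `affinity` over a box given as a list of (low, high) pairs."""
    rng = random.Random(seed)

    def random_antibody():
        return [rng.uniform(lo, hi) for lo, hi in bounds]

    def mutate(a, scale):
        # Gaussian hypermutation, clipped to the search box (an assumed operator).
        return [min(hi, max(lo, v + rng.gauss(0.0, scale * (hi - lo))))
                for v, (lo, hi) in zip(a, bounds)]

    population = [random_antibody() for _ in range(n)]        # (1) initialization
    for _ in range(generations):
        ranked = sorted(population, key=affinity, reverse=True)
        selected = ranked[:h]                                  # (2) selection
        survivors = []
        for rank, parent in enumerate(selected):
            q = max(1, round(max_clones * (h - rank) / h))     # (3) clone size by rank
            scale = sigma * (rank + 1) / h                     # better rank, smaller step
            clones = [mutate(parent, scale) for _ in range(q)] # (4) mutation
            survivors.append(max(clones + [parent], key=affinity))
        # (5) update: best of each subpopulation, some unselected, r new random points
        population = (survivors + ranked[h:n - r]
                      + [random_antibody() for _ in range(r)])
    return max(population, key=affinity)
```

For example, `csa_maximize(lambda a: -sum((v - 0.3) ** 2 for v in a), [(-1.0, 1.0)] * 2)` searches a two-dimensional box for the point closest to (0.3, 0.3). Because each parent competes with its own clones, the best affinity in the population never decreases from one generation to the next.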

defined operators $T_s$, $T_c$ and $T_m$;
(2.2) Update: update $A(k)$ with the matured antibodies and randomly generated individuals according to the defined operator $T_u$, which generates the next population $A(k + 1)$;
(2.3) $k = k + 1$.
End while
(3) Output: output the converged population $A(K)$.

In general, CSA is implemented as the following evolution process:

$$A(k) \xrightarrow{T_s} C(k) \xrightarrow{T_c} Y(k) \xrightarrow{T_m} Z(k) \xrightarrow{T_u} A(k + 1)$$

C. CSA for pose optimization

Pose estimation is the process of estimating articulated human pose from a single image, which can be formulated as an optimization process. We design a CSA-based human pose estimation algorithm as follows.

where $A(k)$, $C(k)$, $Y(k)$ and $Z(k)$ are the antibody populations at the different stages of a single generation, and $k$ is the iteration step. $T_s$, $T_c$, $T_m$ and $T_u$ are the select, clone, mutate and update operators, respectively. A detailed introduction to CSA can be found in [10]. In this paper, we set the number of antibodies to $N = 60$, the number of selected individuals to $h = 40$, and the number of newly generated individuals to $r = 5$.

B. Apply CSA for pose optimization

In this section, we apply CSA to human pose optimization. Some details of our implementation are discussed as follows.

Encoding and initialization: In CSA, each antibody represents a potential solution in the search space. Since we perform human motion analysis in the latent space, an antibody corresponds to a pose vector in the latent space. In this paper, we represent the full 3D pose vector as $x = \{x_r, Y\}$, where the 3D vector $x_r = (r_x, r_y, r_z)$ represents the root joint rotations and $Y = (y_1, \ldots, y_d)$ is the pose vector in the latent space, with $d = 6$. We use real-valued encoding, so an antibody is represented as $x = (x_r, Y)$, and the antibody population is $A = \{x_1, x_2, \ldots, x_N\}$, where $N$ is the population size. In standard CSA, the initial population is usually generated at random, which can lead to very long run times or even failure to converge. In this paper, we restrict every dimension $y_i$ ($i = 1, \ldots, d$) of the subspace pose to the range $[\min(y_i), \max(y_i)]$, where the bounds $\min(y_i)$ and $\max(y_i)$ are learned from the motion training data.

Affinity measure: For each antibody, an affinity measure must be computed to estimate how well a given antibody (pose) matches the observed images. Here we use the bidirectional likelihood proposed in [14].
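The bounded initialization described under encoding and initialization can be sketched as follows. This is an illustrative sketch only: the root rotation range of [-180, 180] degrees is an assumption that the text does not state, and the function names `latent_bounds` and `init_population` are hypothetical.

```python
import random

def latent_bounds(training_latent):
    """Per-dimension [min(y_i), max(y_i)] learned from the training latent vectors."""
    return [(min(col), max(col)) for col in zip(*training_latent)]

def init_population(training_latent, n=60, root_range=(-180.0, 180.0), seed=0):
    """Sample n antibodies x = (x_r, Y) with Y restricted to the learned bounds.

    root_range is an assumed range for the three root rotations (r_x, r_y, r_z).
    """
    rng = random.Random(seed)
    bounds = latent_bounds(training_latent)
    population = []
    for _ in range(n):
        x_r = [rng.uniform(*root_range) for _ in range(3)]
        y = [rng.uniform(lo, hi) for lo, hi in bounds]
        population.append(x_r + y)    # real-valued encoding: 3 root angles, then d latent values
    return population
```

With $d = 6$ latent dimensions, each antibody produced this way is a list of nine real numbers.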

Figure 2. The process of human pose estimation: (a) one video frame, (b) the initialized poses, (c)-(e) results after different numbers of iterations.

To both reduce the search space and roughly determine the motion direction, we incorporate the global motion processing step [4] into the CSA framework. The main idea is to search only for the optimal global motion in the first round of state evolution; in the remaining rounds, the antibodies evolve normally as described in Algorithm 2. With the proposed CSA pose optimization algorithm, the antibody with the highest affinity in the population $A(K) = \{x_1(K), x_2(K), \ldots, x_N(K)\}$ is selected as the estimated pose. Fig.2 shows the process of pose estimation.

IV. SEQUENTIAL CLONAL SELECTION ALGORITHM FOR HUMAN MOTION TRACKING

In tracking applications, the data is typically a time sequence, so the task is essentially a dynamic optimization problem, which distinguishes it from traditional optimization problems. We propose a sequential CSA (S-CSA) framework for human motion tracking. The flowchart of the S-CSA framework is shown in Fig.3. There are three major stages: automatic initialization, next-frame propagation and CSA-based optimization.

Based on the design above, the CSA-based pose optimization algorithm can be described as follows.

Algorithm 2: CSA based pose optimization algorithm

Initialization of tracking: We achieve the automatic initialization by determining the pose of the first frame.

Input: total number of antibodies $N$, number of selected individuals $h$, number of newly generated individuals $r$, maximum number of generations $K$.
(1) Initialization: generate the initial population of $N$ antibodies, represented as $A(k) = \{a_1, a_2, \ldots, a_N\}$, $k = 0$.
(2) Repeat: While ($k < K$) do
(2.1) Immunity process: perform the select, clone and mutate operations on the current population $A(k)$ according to the

Next-frame propagation: Given the converged antibodies at frame $t$, the antibodies in the next frame are initialized by sampling a Gaussian distribution [16] centered on the current best antibodies. This random propagation method is in effect a first-order Gauss-Markov dynamical model. We did not incorporate any motion model here, for two reasons: generality and the effectiveness of our CSA-based pose optimization.
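A minimal sketch of this propagation step is given below. It assumes an isotropic Gaussian with a single hypothetical standard deviation `sigma` and cycles through the supplied best antibodies as centers; the text does not specify either choice, and the function name `propagate` is hypothetical.

```python
import random

def propagate(best_antibodies, n=60, sigma=0.05, seed=0):
    """Initialize the next frame's population around the current best antibodies.

    Each new antibody is drawn from a Gaussian centered on one of the best
    antibodies of frame t (no motion model, i.e. a random-walk prior).
    """
    rng = random.Random(seed)
    population = []
    for k in range(n):
        center = best_antibodies[k % len(best_antibodies)]
        population.append([rng.gauss(v, sigma) for v in center])
    return population
```

The resulting population would then be passed to the CSA-based optimization for the new frame, with `sigma` controlling how far a pose is allowed to drift between consecutive frames.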


CSA-based optimization: Estimate the pose of the current frame from the initialized antibodies using the CSA pose optimization algorithm.

B. The convergence of CSA

The number of antibodies $N$ and the number of iterations $K$ influence convergence. We run a pose estimation experiment on a single image and report the affinity of the best antibody during iteration. Fig.4 shows the convergence process. Different lines correspond to different numbers of antibodies. The x-axis is the number of iterations and the y-axis is the affinity of the best antibody in the population. As shown in Fig.4, the affinities converge as the number of iterations increases, which indicates that our CSA-based pose optimization algorithm converges in this experiment.

Figure 3. Overview of the sequential CSA algorithm. Flowchart blocks: initialization of first frame; next-frame propagation; CSA-based pose optimization; convergence check (yes/no); antibodies converged at time t; t = t + 1.

The S-CSA framework is in effect a "sample-and-refine" search strategy. Firstly, the initial antibodies are sampled from a Gaussian distribution. Then, in each CSA iteration, the antibodies are updated according to the newest observations. Through the CSA iterations, the antibodies move towards the region where the observation likelihood is larger, and are finally relocated to the dominant modes of the likelihood. From a Bayesian inference view, the CSA iterations are essentially a multi-layer importance sampling strategy that incorporates the new observations into the sampling stage, and thus avoids the sample impoverishment problem suffered by the particle filter [5].

Figure 4. The convergence process

C. CSA based pose estimation results

As mentioned in Section II, we first learn the subspaces of walking and running. To extract the walking subspace, a data set of motion capture data from a single subject was used; it contains 425 frames. For the running subspace, 186 frames were used.

V. EXPERIMENTAL RESULTS

A. Experimental data and evaluation measures

Experimental data: The data for latent space training are from the CMU database [12]. We quantitatively evaluate our method on synthesized image sequences as in [3]. We also give experimental results on real image sequences from the CMU database [12], HumanEva [14] and [15].

Evaluation measures: In this paper, we use the evaluation measures proposed in [14]. The average error over all joint angles (in degrees) is defined as:

$$D(x, \hat{x}) = \sum_{m=1}^{M} \frac{|x_m - \hat{x}_m|}{M} \qquad (1)$$

where $x = (x_1, x_2, \ldots, x_M)$ and $\hat{x} = (\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_M)$ are the ground truth pose and the estimated pose data, respectively. For a sequence of $T$ frames, the average performance and the standard deviation of the performance are computed as:

$$\mu_{seq} = \frac{1}{T} \sum_{t=1}^{T} D(x_t, \hat{x}_t) \qquad (2)$$

$$\sigma_{seq} = \sqrt{\frac{1}{T} \sum_{t=1}^{T} [D(x_t, \hat{x}_t) - \mu_{seq}]^2} \qquad (3)$$
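The three measures can be computed directly from lists of joint angles. The sketch below is illustrative: it takes the plain absolute difference of each angle pair and does not handle wrap-around at 360 degrees, and the function names `joint_angle_error` and `sequence_stats` are hypothetical.

```python
import math

def joint_angle_error(x, x_hat):
    """Average absolute error over the M joint angles of one frame, in degrees."""
    return sum(abs(a - b) for a, b in zip(x, x_hat)) / len(x)

def sequence_stats(truth, estimates):
    """Mean and standard deviation of the per-frame error over T frames."""
    errors = [joint_angle_error(x, xh) for x, xh in zip(truth, estimates)]
    mu = sum(errors) / len(errors)
    sigma = math.sqrt(sum((e - mu) ** 2 for e in errors) / len(errors))
    return mu, sigma
```

As a worked example, for two frames with ground truth (10, 20) and (30, 40) and estimates (12, 19) and (30, 46), the per-frame errors are 1.5 and 3.0 degrees, giving a mean of 2.25 and a standard deviation of 0.75.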

Figure 5. The mean errors of individual joint angle for different sequences.

We test our method on 100 frames of images for all three types of motions, and the mean errors of joint angle are reported, which are shown in Fig.5. From Fig.5 we can see that:


except for some particular joints, the mean errors of most joints for the three sequences are less than 5 degrees. The mean errors of some joint angles are larger than others because they have a wider range of variation or are less observable from 2D image features. Our results are competitive with others reported in the related literature. The experimental results demonstrate that our CSA-based pose estimation method is effective for analyzing articulated human pose from a single image.

Figure 6. Pose estimation results on images of different motion types.

We also test our method on real image sequences to assess how well it copes with limb occlusion, left-right ambiguity and viewpoint changes. The results are shown in Fig.6. On most frames, occlusion and left-right confusion are resolved by searching for the optimal pose in the extracted subspace, because this subspace contains prior knowledge about the motion. The pose estimator is view invariant, mainly because of the dedicated step for searching the global motion. In addition, the results on walking and running sequences show that our algorithm works for different types of motion. In principle, the method can be generalized to other types of motion as long as the corresponding subspace can be properly extracted from training data.

Figure 7. Comparison of different tracking methods

The mean errors of the different methods over all joint angles of the test sequences are shown in Fig.7, and Table I lists the average errors and the standard deviations. Fig.7 and Table I show that our method achieves better results. In general, the average errors and the standard deviations over all frames are near 3° and 1°, respectively. The mean error of our method also changes little over the whole sequence, which indicates that our method achieves stable tracking of 3D human pose.

D. S-CSA based motion tracking results

We demonstrate our tracking algorithm on walking and running image sequences. We then compare S-CSA quantitatively with other tracking methods, including the particle filter (PF) method [5], particle swarm optimization (PSO) [7] and motion tracking in a linear subspace using an annealing genetic algorithm (PCA+GA) [4].

TABLE I. RESULTS OF DIFFERENT TRACKING METHODS (errors in degrees)

Method     Walking                            Running
           Mean error   Standard deviation    Mean error   Standard deviation
PF         4.6713       2.5157                4.4669       2.0188
PSO        4.4369       1.5181                4.3949       0.9821
PCA+GA     3.5705       1.5651                4.1494       1.4779
S-CSA      2.5217       1.2377                2.8313       1.2941

Figure 8. Human tracking results on real image sequences


Fig.8 shows the tracking results on walking and running image sequences. These results show that our CSA-based pose estimation method can successfully be used to initialize tracking, and that the next-frame propagation method is effective in generating the initial distribution of antibodies for the next frame. Moreover, our S-CSA method is effective on different types of motion.

China (69903006, 60373065, 61021062 and 61100110), Program for New Century Excellent Talents in University of China (NCET-04-04605), Natural Science Foundation of Jiangsu Province (BK2009230 and BK2010375), Key Technology R&D Program of Jiangsu Province (BE2010072 and BE2011058).

Experimental results demonstrate that our S-CSA based tracking method can achieve accurate and stable tracking of 3D human pose. However, our method has some shortcomings, as discussed below. Firstly, although pose optimization in the latent space makes our method more efficient and accurate, it also makes the method unsuitable for more complicated motion analysis. In future work, we will therefore extend our algorithm to cover a wider class of human motions and explore a switching mechanism between different subspaces. Secondly, in generative tracking approaches, the time taken by an algorithm depends mostly on the number of likelihood evaluations. In our CSA pose optimization method, the time complexity is $O((N + N_c) K)$, which prevents it from running in real time. In addition, our method depends on silhouette detection from video, which is difficult, especially in uncontrolled environments. More robust human silhouette detection methods and a more sophisticated image likelihood function will be considered in our future work.

VI. CONCLUSIONS

In this paper, we presented a novel generative approach to reconstruct 3D human pose from a single monocular image as well as from monocular image sequences. We first use Isomap to learn the latent space of human motion and then establish smooth mappings between the latent space and the original space. Pose analysis is performed in the learned low-dimensional subspace, which makes it more efficient and accurate. As the search strategy, we apply the clonal selection algorithm to pose estimation. A sequential CSA framework is proposed for motion tracking by incorporating temporal continuity information into the traditional CSA. Experimental results on different motion types and image sequences demonstrate that our CSA-based method for pose estimation deals effectively with occlusion, left-right ambiguity and the viewpoint problem. The sequential CSA method achieves stable and accurate motion tracking. Quantitative comparisons with other state-of-the-art methods show that our methods achieve better results.

In future work, we will extend our algorithm to cover a wider class of human motions and explore a switching mechanism between different subspaces, which is important for dealing with more complicated human motion scenarios. In addition, we will consider more sophisticated image likelihoods and ways to reduce the computation time. Various applications based on pose analysis will also be considered.

REFERENCES

[1] C. Sminchisescu, “3D Human Motion Analysis in Monocular Video: Techniques and Challenges,” chapter in Human Motion: Understanding, Modeling, Capture and Animation, R. Klette, D. Metaxas and B. Rosenhahn, Eds., Springer-Verlag, 2007.
[2] G. Mori and J. Malik, “Recovering 3D human body configurations using shape contexts,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 7, 2006, pp. 1052-1062.
[3] A. Agarwal and B. Triggs, “Recovering 3-D human pose from monocular images,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 1, 2006, pp. 44-58.
[4] X. Zhao and Y. C. Liu, “Generative tracking of 3D human motion by hierarchical annealed genetic algorithm,” Pattern Recognition, vol. 41, no. 8, 2008, pp. 2470-2483.
[5] M. Isard and A. Blake, “Condensation: conditional density propagation for visual tracking,” Int. J. Comput. Vis., vol. 29, no. 1, 1998, pp. 5-28.
[6] J. M. Rincón, D. Makris, C. O. Uruñuela and J. C. Nebel, “Tracking Human Position and Lower Body Parts Using Kalman and Particle Filters Constrained by Human Biomechanics,” IEEE Trans. on Sys., Man, and Cyb. Part B: Cyb., vol. 41, no. 1, 2011, pp. 26-37.
[7] V. John, E. Trucco and S. Ivekovic, “Markerless human articulated tracking using hierarchical particle swarm optimization,” Image and Vis. Comput., vol. 28, no. 11, 2010, pp. 1530-1547.
[8] A. Elgammal and C. Lee, “Inferring 3D body pose from silhouettes using activity manifold learning,” Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 04), IEEE Press, 2004, pp. 681-688.
[9] N. R. Howe, M. E. Leventon and W. T. Freeman, “Bayesian Reconstruction of 3D human motion from single-camera video,” Proc. Advances in Neural Information Processing Systems (NIPS 00), IEEE Press, 2000, pp. 820-826.
[10] L. N. de Castro and F. J. Von Zuben, “Learning and Optimization Using the Clonal Selection Principle,” IEEE Trans. on Evolutionary Computation, vol. 6, no. 3, 2002, pp. 239-251.
[11] R. Poppe, “Vision-based human motion analysis: An overview,” Comput. Vis. Image Und., vol. 108, no. 1, 2007, pp. 4-18.
[12] CMU database: http://mocap.cs.cmu.edu/
[13] J. B. Tenenbaum, V. de Silva and J. C. Langford, “A global geometric framework for nonlinear dimensionality reduction,” Science, vol. 290, no. 22, 2000, pp. 2319-2323.
[14] L. Sigal and M. J. Black, “HumanEva: Synchronized video and motion capture dataset for evaluation of articulated human motion,” Int. J. Comput. Vis., vol. 87, no. 1, 2010, pp. 4-27.
[15] D. Ormoneit, H. Sidenbladh, M. J. Black and T. Hastie, “Learning and tracking cyclic human motion,” Proc. Advances in Neural Information Processing Systems (NIPS 01), IEEE Press, 2001, pp. 894-900.
[16] X. Q. Zhang, W. M. Hu, S. Maybank and L. Xi, “Sequential particle swarm optimization for visual tracking,” Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 08), IEEE Press, 2008, pp. 23-28.

ACKNOWLEDGMENTS

This work is supported by The National High Technology Research and Development Program of China (2007AA01Z334), National Natural Science Foundation of
