Non-rigid structure from motion from 2-D images using Markov chain Monte Carlo

Huiyu Zhou, Xuelong Li, Senior Member, IEEE, and Abdul Sadka, Senior Member, IEEE

H. Zhou is with ECIT, Queen's University Belfast, Belfast, BT3 9DT, United Kingdom. E-mail: H.Zhou@ecit.qub.ac.uk. X. Li is with the Center for OPTical IMagery Analysis and Learning (OPTIMAL), State Key Laboratory of Transient Optics and Photonics, Xi'an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi'an 710119, Shaanxi, P. R. China. E-mail: xuelong_li@opt.ac.cn. A. Sadka is with the School of Engineering and Design, Brunel University, Uxbridge, Middlesex, United Kingdom. E-mail: Abdul.Sadka@brunel.ac.uk. Manuscript received 17 Jan. 2011; revised 30 May 2011.
Abstract—In this paper we present a new method for simultaneously determining the three-dimensional (3-D) shape and motion of a non-rigid object from uncalibrated two-dimensional (2-D) images, without assuming particular distribution characteristics. A non-rigid motion can be treated as a combination of a rigid rotation and a non-rigid deformation. To recover deformable structures accurately, we estimate the probability distribution function of the corresponding features through random sampling, incorporating an established probabilistic model. The fit between the observations and the projection of the estimated 3-D structure is evaluated using a Markov chain Monte Carlo based expectation maximisation algorithm. Applications of the proposed method to both synthetic and real image sequences are demonstrated with promising results.

Index Terms—Non-rigid, structure from motion, Markov chain Monte Carlo, uncalibrated.
I. INTRODUCTION
Digital video cameras are widely used, and the quantity of digital video has increased dramatically in recent years. For reuse and storage purposes, consumers have to retrieve videos from a large number of multimedia resources. To recognise specific events within a given database, information retrieval systems have been established with promising search accuracy and efficiency, e.g. [27]. Many of these systems attempt to search for events that have been annotated with metadata a priori (e.g. [11]). Nevertheless, there is still a significant amount of footage that has been shot but never used [1]. Such footage has normally not been properly annotated, and hence retrieval can only be carried out according to the video content rather than the annotation. Among generic video content, the ability to retrieve 3-D models from databases or the Internet has gained prominence. Content-based 3-D model retrieval remains an active research area and has found numerous applications in information retrieval, medical imaging, and security. To extract a 3-D object, shape-based 3-D modelling (e.g. [29]) and similarity or dissimilarity (distance) computation (e.g. [22]) are two of the main research schemes.

Recovery of 3-D structure (or shape) and motion from 2-D images is a critical problem in computer vision [3]. As a solution, studies on camera calibration have been carried out in order to relate image coordinates to 3-D coordinates (e.g. [23]). This strategy offers promising accuracy and consistency but is time consuming and requires explicit camera calibration. To remove the need for camera calibration, recent research has focused on non-metric reconstruction from uncalibrated image data [34]. Technically, different camera models, e.g. orthographic and perspective projection, have been investigated so that structure and motion recovery can be achieved effectively [18]. In spite of promising performance, many problems remain open. One of the unsolved problems is that most of these algorithms cannot be directly applied to non-rigid deforming objects (or scenes), as they rely on the assumption of rigid motion [31]. As a result, appropriate approaches to the problem of non-rigid structure and motion need to be explored [34], [18].

A number of solutions to this particular problem have been reported in the literature. For example, Szeliski and Kang [28] proposed a batch analysis of image streams in the sense of non-linear least squares in order to simultaneously recover 3-D structure and motion. Carceroni and Kutulakos [7] stated that structure and motion estimation was equivalent to the recovery of differential properties of the spatio-temporal curve manifold that describes the curve's trace in space-time. To quantify a shape of rank 1 or 2, Xiao and Kanade [37] presented a linear approach that imposed a positive semi-definite constraint to determine the desired solution for this approximation. Similarly, Torresani et al. modelled shape motion as a rigid rotation plus a non-rigid deformation [32], [33]; to handle arbitrary deformations, the object shape at each time instant was drawn from a Gaussian distribution. Prior knowledge of shape and motion can be used to improve the accuracy of shape and motion recovery. For example, Shaji and Chandran [26] introduced a Riemannian Newton algorithm while using a variant of the non-rigid factorisation schemes for non-rigid structure from motion. Seo and Hong [25] applied an expectation maximisation algorithm based on an extended Kalman smoother to impose time-continuity of the motion parameters. In [13], a smoothing penalty was added to the camera trajectory in order to apply prior knowledge of camera pose to the structure recovery. Yan and Pollefeys [38] proposed an approach based on modelling the articulated non-rigid motion as a set of intersecting motion
subspaces, where a motion subspace is the linear subspace of the trajectories of an object. Bartoli et al. [4] proposed a model based on a coarse-to-fine low-rank shape model and a Euclidean transformation accounting for the global displacement.

Only a limited number of publications have addressed the non-linear and non-Gaussian problem in shape and motion recovery, which commonly arises in the presence of multiple feature correspondences and heterogeneous noise. This problem can deteriorate the performance of shape and motion recovery. For example, Fig. 1 illustrates a case where the approach introduced in [32] is applied to reconstruct shape and motion from corresponding features with additive random noise (Gaussian, Poisson and uniform). In the Gaussian and Poisson cases the classical algorithm handles the noise properly; however, performance degrades in the case of uniform noise. To resolve this problem, Qian and Chellappa [24], for example, reported a statistical structure from motion method using noisy sparse feature correspondences from calibrated video sequences under perspective projection.

In this paper, we introduce a new strategy for tackling the structure from motion problem in the presence of non-linear motion and non-Gaussian distributions using a Markov chain Monte Carlo (MCMC) technique. MCMC has been commonly adopted in computer vision to address problems such as image segmentation (e.g. [35]), structure from motion (SfM) (e.g. [12]), event recognition (e.g. [10]) and object tracking (e.g. [20]). For example, as an extension of the factorisation method reported in [31], Forsyth et al. [15] utilised an MCMC sampler to pursue a Bayesian solution to the SfM problem while handling large tracking errors in the factorisation. The main challenge in SfM is to seek an optimal estimate of object motion and scene geometry. After the feature correspondences have been obtained, SfM is used to obtain a maximum likelihood estimate [12]. In classical SfM approaches, MCMC is usually used to compute the posterior probability, and our approach plays a similar role. The probability of a shape over its parameters is calculated through MCMC. The proposed technique can be considered an extension of the established shape recovery algorithms presented in [12], [32], [24], in the sense that an MCMC sampler is utilised within a well-defined conditional probability in order to seek a maximum likelihood estimate of shape and motion (this makes it significantly different from the schemes reported in [39], [40]). Furthermore, we apply a feature correspondence step before the shape and motion recovery, as this helps achieve fast convergence of the optimisation process. In summary, the main contributions of the proposed approach are as follows: (1) an MCMC technique is applied to optimally reduce the residuals between the estimated and the expected shapes; (2) outliers in the correspondences are simultaneously removed from the computation so that the motion and shape parameters can be estimated accurately; and (3) a Markov chain Monte Carlo expectation-maximisation (MCMC-EM) algorithm is presented to handle missing correspondences.

The rest of the paper is organised as follows. Section
II introduces a generalised 3-D structure recovery method. This is followed by a description of the minimisation of the fitting residuals in Section III. In Section IV, we introduce the individual mathematical models for the Gaussian and non-Gaussian cases. Experimental work in Section V demonstrates the performance of the proposed algorithm against classical approaches on synthetic and real data; an application of the proposed algorithm to event recognition is also reported. Finally, conclusions and future work are given in Section VI.

II. GENERALISED 3-D STRUCTURE RECOVERY

It is assumed that a number of image feature points have been tracked throughout a 2-D image sequence. These tracked features are clearly visible but may contain outliers due to noise and multiple candidates in the same image. Let P be the observed projection, S the time-varying shape with S = {S_0, ..., S_t}, where S_t is the shape at time t, R a 2x3 matrix combining rotation with the orthographic projection [31], and T a 2x1 translation vector. These parameters should, in fact, carry time stamps; for convenience we omit the time stamps hereafter. The projection obtained after rotating and translating the shape S, with additive noise, is then

P = R(S + T) + N,   (1)

where N is the noise. Making similar assumptions to [24], [32], we consider that a non-rigid shape S can be represented as an average shape S̄ plus a deformation of the following form:

S = S̄ + Σ_{i=1}^{k} w_i s_i,   (2)

where i indexes the deformation components, the w_i are scalars (weights) that determine the contribution of each deformation component to the shape, S̄ and the s_i are regarded as the shape basis, and k is the order of the model. It is recognised that the space of shapes is spanned by linear combinations of basis shapes [6].

Suppose that a shape S is drawn from a probability distribution p(S|α), given its parameter α. Knowing the projection P, our intention is to estimate the motion parameters R and T, the shape parameter α and the parameter of the probability distribution, σ². This can be achieved by maximising the following conditional probability according to Bayes' rule:

p(R, T, α, σ²|P) ∝ ∫ p(P|R, T, S, σ²) p(S|α) dS.   (3)

Eq. (3) reveals that the estimation of motion and shape can be performed by deducing the parameters of p(S|α) over the unknown shapes S. An optimal solution is found when the probability distribution of the observations fits that of the mean shape S̄ over a historic period, with a globally minimal variance between the two probability estimates. Alternatively, an optimal solution to the 3-D reconstruction of the non-rigid shape can be obtained by minimising the following weighted sum of squares:

min_{R,T,S} Σ_{j=1}^{m} (P − R(S + T))^T Φ⁻¹ (P − R(S + T)),   (4)

where m is the number of image points, j indexes the image feature points, and

Φ = cov(P, P) + R cov(S + T, S + T) R^T − R cov(S + T, P) − cov(S + T, P)^T R^T,   (5)

where cov(·, ·) is the variance-covariance matrix. To simplify the computation, Φ can be approximated as Φ ≈ E[(P − R(S + T))(P − R(S + T))^T]. Eqs. (4)-(5) describe an optimisation procedure used to search for an appropriate solution to the shape and motion recovery problem. Unfortunately, this process cannot guarantee a globally optimal solution; the optimisation can be biased by local minima in the residuals. To prevent this situation, we take a closer look at Eq. (3).
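To make the shape model of Eqs. (1)-(2) concrete, the following minimal NumPy sketch builds a low-rank non-rigid shape and projects it orthographically with additive noise. The dimensions, the random basis, the rotation used to form R, and the treatment of T as a 3-D translation broadcast over points are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: k basis shapes, m 3-D points (not taken from the paper).
k, m = 2, 50

S_bar = rng.normal(size=(3, m))      # mean shape S-bar (3 x m)
basis = rng.normal(size=(k, 3, m))   # deformation basis s_1 .. s_k
w = rng.normal(size=k)               # per-frame deformation weights w_i

# Eq. (2): S = S_bar + sum_i w_i * s_i
S = S_bar + np.tensordot(w, basis, axes=1)

# A 2x3 matrix R combining a rotation with orthographic projection:
# here simply the first two rows of a rotation about the z-axis.
theta = np.deg2rad(10.0)
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0,            0.0,           1.0]])
R = Rz[:2, :]                        # 2 x 3
T = np.array([[1.0], [2.0], [0.0]])  # translation, broadcast over the m points

# Eq. (1): P = R(S + T) + N, with additive noise N.
N = 0.5 * rng.normal(size=(2, m))
P = R @ (S + T) + N
print(P.shape)                       # (2, m) observed projections
```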
Fig. 1. Illustration of applying the approach of [32] to the "Shark" sequence in the presence of additive noise (mean = 0, variance = 5): (a) clean corresponding features; (b) with Gaussian noise; (c) with Poisson noise; and (d) with uniform noise. Blue dots: ground truth; red circles: reconstruction.

III. MINIMIZATION OF RESIDUALS
If the residual of the shape estimation is non-Gaussian, then the posterior probability p(S|α) is also non-Gaussian [19], and vice versa. A variance (also the dominant element of a covariance matrix) is normally used to evaluate the discrepancy between the distributions of the estimated and the true shapes. The residual (P − R(S + T)) and its variance are further investigated in this section, where a smaller variance indicates a better fit between the estimated and the true shape. Since the true shape may never be recoverable in a real setting, we consider the mean of the estimated shapes over a fixed period as an approximation of the real shape when minimising the shape fitting errors in the optimisation. In the following sub-sections, the components of the right-hand side of Eq. (3) are analysed individually.

A. Derivation of p(P|R, T, S, σ²)

The conditional probability p(P|R, T, S, σ²) is associated with the Euclidean distance between the projections obtained with the estimated R and T and the observations. This probability takes the form of an average D̄ of the calculated Euclidean distances, which differs from the spherical Gaussian form of [32]:

p(P|R, T, S, σ²) ∼ D̄⁻¹_{|P − R(S + T)|}.   (6)

Eq. (4) shows that the variance-covariance matrix is fully independent of the residual characteristics (i.e. Gaussian or non-Gaussian). In such a generalised case, an unbiased and consistent estimate of the variance σ² is obtained using the pooled variance estimator [36], which is expressed as

σ² ∝ Σ_k (n_k − n_0) (P − R(S + T))^T (P − R(S + T)) / Σ_k (n_k − n_0),   (7)

where n_k is the number of samples collected from the image space to simulate the shape's PDF, and n_0 is the number of unknown components of the motion and shape estimation. Eq. (7) indicates that a larger (n_k − n_0) leads to a smaller variance if the numerator of Eq. (7) is unchanged. Our aim is to search for the smallest variance during the optimisation, which corresponds to the maximisation of p(R, T, α, σ²|P). To prevent a large (n_i − n_0) from affecting the estimation of the variance, a relatively small number of image points n_i needs to be chosen for computing a variance. Nevertheless, an estimate based on such a group of points is unstable due to noise or correspondence errors: one group of image points may lead to a very small residual, while other groups cause very large residuals. To avoid this, a robust estimate of the variance is sought using a random sampling scheme, e.g. [14]. In other words, an ideal variance is chosen from a group of estimates computed with Eq. (7), where the image points are randomly sampled each time. The reader can refer to [39] for the intermediate procedures of calculating the variance σ². The final variance is the mean of the individual estimates σ_i² over the corresponding features, σ² = Σ_{i=0}^{l} σ_i² / l, where l is the number of estimated variances (usually l ≥ 20). This averaging operation is necessary: it helps the optimisation escape "traps" caused by biases or local minima, and the mean of the variances within a neighbourhood can be used as a robust indicator without making any assumption about the characteristics of the residuals.
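The following sketch illustrates, under stated assumptions, how the pooled variance of Eq. (7) can be combined with random sampling of feature groups and averaged over l estimates as described above. The group size, the value of l and the way the residuals are supplied are illustrative choices; `pooled_variance` and `robust_variance` are hypothetical helper names, not routines from the paper.

```python
import numpy as np

def pooled_variance(residuals, n0):
    """Pooled variance in the spirit of Eq. (7), applied literally.

    residuals: list of (2 x n_k) residual arrays P - R(S + T), one per group.
    n0: number of unknown motion/shape components.
    """
    num = sum((r.shape[1] - n0) * float(np.sum(r * r)) for r in residuals)
    den = sum(r.shape[1] - n0 for r in residuals)
    return num / den

def robust_variance(residual_all, n0, group_size=50, l=20, seed=0):
    """Average of l pooled-variance estimates, each computed from a random
    subset of the corresponding features (cf. the random sampling scheme [14])."""
    rng = np.random.default_rng(seed)
    m = residual_all.shape[1]
    estimates = []
    for _ in range(l):
        idx = rng.choice(m, size=min(group_size, m), replace=False)
        estimates.append(pooled_variance([residual_all[:, idx]], n0))
    return float(np.mean(estimates))
```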
B. Derivation of p(S|α)

Now, we analyse the conditional density function p(S|α). Again, we start the derivation by decomposing the joint probability into the multiplication of a prior and a posterior:

p(S, α) = p(S) p(α|S),   (8)

where the prior can be determined by investigating the correlation of the same shape in two neighbouring image frames:

p(S) = Π_{t=1}^{t_p} p(S_t | S_{t−1}),   (9)
where t_p is the sample size. Eq. (9) reveals that the probability of a shape depends on the historic estimates of that shape. The shape can move or deform in several possible directions over time. Therefore, assuming that the shapes with their parameters at different time instants are independent of each other, we have p(α|S) = Π_{t=1}^{t_p} p(α|S_t). According to Bayes' theorem, the posterior density can be re-formed as follows:

p(S|α) = p(S, α) / ∫ p(S, α) dS = p(S) p(α|S) / ∫ p(S) p(α|S) dS.   (10)

Without loss of generality, we use a functional g(S_t) to replace the shape S. This replacement is important because it is essential to obtain a description of the shape (e.g. a shape model) in the optimisation. We then have the expectation of the conditional probability:

E(g(S_t)|α) = ∫ g(S_t) p(S|α) dS = ∫ g(S_t) p(S) p(α|S) dS / ∫ p(S) p(α|S) dS,   (11)

where g(S_t) can be S or [(S − S̄)(S − S̄)^T] (S̄ is a mean shape). If the latter is used, the expectation on the left-hand side of Eq. (11) is residual-driven and the covariance is used to "weight" the prior. This expectation can be approximated using the established Monte Carlo simulation.

C. Monte Carlo simulation

Monte Carlo simulation is used as a random process to approximate the expectation shown in Eq. (11). Unfortunately, it is very difficult to directly generate random draws of α from p(S|α), because p(S|α) is itself an estimate (see the previous sub-section for details). Assume that β is a sample from p(S|α), whose probability needs to be determined in the estimation. Therefore, to apply the Gibbs sampler, we derive the conditional density function of β (given S and α) as follows:

p(β|S, α) = p(β) p(S|β) p(α|β, S) / ∫ p(S) p(α|S) dβ ∝ p(S) p(α|S) / ∫ p(S) p(α|S) dβ.   (12)

When the optimisation iterates a number of times, the density function becomes proportional to the numerator of the right-most side of Eq. (12). For mathematical tractability, we utilise the Metropolis-Hastings (M-H) algorithm in the random sampling process. In the M-H algorithm, let q(z|x) be the proposal density (also called the sampling density), where x ∈ R^d (d is the dimension) and z is a sample. For this proposal density q(z|x), we define the acceptance probability Q(x, z) as follows:

Q(x, z) = min( p(z|S, α) q(x|z) / (p(x|S, α) q(z|x)), 1 ) if p(x|S, α) q(z|x) > 0, and Q(x, z) = 1 otherwise.

That is to say, the random samples whose probability is larger than Q(x, z) will have the same effect. Using a Kalman filter (e.g. [2]), initial shape and motion parameters can be obtained. This provides an approximation of the final solution to the problem of optimal structure and motion recovery. The M-H algorithm is then used to generate a random draw β from the distribution p(S|α). This process is summarised in Algorithm 1 [8].

Algorithm 1 A summary of the M-H algorithm.
1. Repeat for i = 1, 2, ..., N.
2. Generate v from q(z|x_i) (e.g. a Gaussian proposal) and u from U(0, 1).
3. If u ≤ Q(x_i, v)
4.   set x_{i+1} = v.
5. Else
6.   set x_{i+1} = x_i.
7. Return the values {x_1, x_2, ..., x_N}.
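A minimal sketch of the Metropolis-Hastings procedure in Algorithm 1 is given below, assuming a symmetric Gaussian random-walk proposal (so the ratio q(x|z)/q(z|x) cancels in the acceptance probability). The target density, the step size and the toy usage are illustrative and do not correspond to the paper's p(S|α).

```python
import numpy as np

def metropolis_hastings(log_target, x0, n_samples=1000, proposal_scale=0.1, seed=0):
    """Generic Metropolis-Hastings sampler following Algorithm 1.

    log_target: function returning log p(x | S, alpha) up to an additive constant.
    A symmetric Gaussian random-walk proposal q(z | x) is assumed.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    samples = [x.copy()]
    for _ in range(n_samples):
        z = x + proposal_scale * rng.normal(size=x.shape)   # draw v from q(z | x_i)
        u = rng.uniform(0.0, 1.0)                           # draw u from U(0, 1)
        log_ratio = log_target(z) - log_target(x)
        accept = np.exp(min(0.0, log_ratio))                # acceptance probability Q(x_i, v)
        if u <= accept:                                     # accept: x_{i+1} = v
            x = z
        samples.append(x.copy())                            # else keep x_{i+1} = x_i
    return np.array(samples)

# Toy usage: sample from a 1-D standard normal.
chain = metropolis_hastings(lambda x: -0.5 * float(x @ x), x0=np.zeros(1))
print(chain.mean(), chain.std())
```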
Fig. 2. Illustration of applying the method of [32] to the "Shark" sequence with additive Poisson random noise. Mean = 0 and variance is (a) 1, (b) 5, and (c) 10. (Horizontal axes: feature index; vertical axes: errors in percentages.)

IV. SOLUTIONS OF SHAPE AND MOTION RECOVERY

A. Generalised estimation

In the previous sections, the estimation of non-rigid shape and motion has been introduced without assuming the properties of the PDF. In the presence of clutter or occlusions, we may observe multiple peaks in the estimated residuals, or tilted normal distributions. For example, Fig. 2 illustrates the residual distributions of 50 randomly selected features, where the technique of [32] was used to recover the shape and motion from the "Shark" sequence under different levels of additive Poisson noise. This non-Gaussian problem cannot be effectively handled using classical Gaussian-based algorithms (e.g. [32]), as saddle points in the optimisation hold local minimum/maximum energy that may drive the search towards erroneous solutions. To estimate the motion and deformation model simultaneously, a generalised expectation maximisation (EM) algorithm reported in [16], [32] is adopted in this section. This EM algorithm enables us to reconstruct a 3-D structure from missing data caused by, e.g., occlusions and lighting changes in our applications.

An EM algorithm consists of two processing steps, namely the E- and M-stages. The generalised EM approach starts from the E-stage. In this stage, the probability of each component that contributes to the shape estimation is formulated. Unfortunately, it is extremely difficult to calculate the posterior in closed form. A simplified posterior can be formed as follows:

p_i(w|F, φ) = p_i(F|w, φ) p_i(w|φ) / p_i(F|φ) ∝ ζ exp( − F̃^T Ω⁻¹ F̃ / Σ_{i=1}^{N} F̃^T Ω⁻¹ F̃ ),   (13)
where F̃ = F − RS − D, ζ is a scalar, the superscript T denotes the transpose, F is the vectorised form of P, w represents the weights, the diagonal elements of the variance-covariance matrix Ω are σ, N is the number of points in the neighbourhood of the feature point, and D_i is the estimated projection.

In the M-stage, we design an objective function that is optimised in order to find the minimum variance between the target and the estimate. Using a pseudo-linear form, the objective function is described as follows:

Q(P, φ) = Σ_i { E_{p_i(w|F,φ)}[λ_1] + λ_2 },   (14)

where φ is the estimation variance, E_{p_i(w|F,φ)} denotes the expectation with respect to p_i(w|F, φ), and λ_1 and λ_2 are two coefficients. Of these parameters, λ_1 can be determined using the residual between the expectation and the projection with the updated motion parameters:

λ_1 = ε_i^T Ω⁻¹ ε_i,   (15)

where the error residual ε_i = F_i − F̄_i − D_i and F̄ is the vectorised form of RS̄. Also, λ_2 refers to the difference between the estimates at two consecutive times:

λ_2 = Q(P, φ)|_γ − Q(P, φ)|_{γ−1},   (16)

where γ is the iteration number. Eq. (16) clearly shows the Markovian property of the residual minimisation.

B. Gaussian distributions

This is a special case of the generalised form introduced in the previous sub-section. In this special case, Eq. (13) can be further simplified by considering the following objective function:

min Q(P, φ) = min Σ_i ( W_i ||F − vec(RS) − D||² ),   (17)

where the W_i are weight coefficients, which can be determined using the Euclidean distance between the estimated and the averaged projections. Assuming that N is Gaussian, we have the following forms [32]:

p(F|φ) = N(F | D + F̄; M M^T + σ² I),   (18)

p(w|F, φ) = N([w, 1]^T − F⁻¹ θ^T; σ² I),   (19)

where φ encapsulates R, T, α and σ², the displacement D = RS, M is the vector formed by the multiplication of R and s, I is the identity matrix, and θ = [F_{i−1} − F̄_{i−1} − D_{i−1}, 1]. The EM algorithm is applied here because some correspondences from the matching stage may be missing due to occlusions and illumination change. To ensure efficiency, this EM algorithm can be further simplified. In the E-stage, the parameters to be used in the M-stage are roughly optimised, given initial motion and shape estimates; in particular, we pay attention to the probability of the weights w ([w_1, ..., w_i]) over the available F and φ. In the M-stage, we have a motion estimation form similar to Eq. (14), but the corresponding parameters are estimated using λ_1 = ||F_i − vec(R_i S_i) − D_i||² / (2σ²) and λ_2 = 2ab log √(2πσ²), where a and b are two intermediate variables [16]. To update the estimated parameters, one can compute the partial derivative of the log likelihood with respect to each parameter and then set it to zero before solving. Fig. 3 demonstrates such a process, which starts from an approximate estimate of the structure and then iterates within the proposed MCMC scheme to reach the final settlement.

Fig. 3. Demonstration of the proposed MCMC scheme in structure recovery: (a) initial ("o": real structure, "+": initial estimate); (b) final settlement (squares: final estimate of the structure).
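As a worked illustration of Eqs. (14)-(16), the sketch below computes λ_1, λ_2 and the objective Q(P, φ) for given residuals. The vector shapes and the placement of λ_2 inside the sum over features follow our reading of Eq. (14) and are assumptions rather than the authors' implementation.

```python
import numpy as np

def lambda1(F_i, F_bar_i, D_i, Omega_inv):
    """Eq. (15): lambda_1 = eps_i^T Omega^{-1} eps_i with eps_i = F_i - F_bar_i - D_i."""
    eps = F_i - F_bar_i - D_i
    return float(eps @ Omega_inv @ eps)

def lambda2(Q_current, Q_previous):
    """Eq. (16): difference of the objective at two consecutive iterations."""
    return Q_current - Q_previous

def objective(expected_lambda1_per_feature, lam2):
    """Eq. (14), read as a sum over features of the expected lambda_1 plus lambda_2."""
    return float(np.sum([e + lam2 for e in expected_lambda1_per_feature]))
```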
C. Algorithmic summary

The proposed algorithm starts with corner feature detection in the first frame of an image sequence, followed by feature tracking using a standard tracker reported in [30]. Here, 150 corner features are extracted from the first image in order to guarantee that there are sufficient features to be tracked throughout the entire sequence; features are gradually lost during tracking as some of them leave the field of view. The proposed algorithm, applied to two neighbouring images, is presented in Algorithm 2.

It is worth pointing out that EM is an iterative procedure that is very sensitive to initial conditions. Evidence shows that a poor initialisation of the structure/motion parameters leads to significantly large errors when the solution settles. Therefore, we need a good and efficient initialisation procedure. In the literature, a number of approaches have been proposed to deal with this issue. For example, Coleman and Woodruff [9] introduced a classification technique that started from a random partition of a subsample and finished with a clustering of the subsamples, which formed the starting point for EM. Similarly, Biernacki et al. [5] reported a scheme that used the parameters from a sequence of a classification EM algorithm, a stochastic EM algorithm and previous iterations of EM itself. As stated in [21], a good initial point for EM targets a compromise between accuracy and computational cost. In our application, the Markov random sampling process over randomly selected features provides well initialised structure/motion parameters before the generalised EM scheme is applied to the overall corresponding features.
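The feature detection and tracking step described above can be realised, for example, with OpenCV's corner detector and pyramidal Lucas-Kanade tracker, a common implementation of the KLT tracker of [30]. The sketch below is one such realisation; the quality and distance thresholds are illustrative assumptions, while the corner budget of 150 follows the text.

```python
import cv2

def track_features(frame0_gray, frame1_gray, max_corners=150):
    """Detect corners in the first frame and track them into the next frame
    using OpenCV's pyramidal Lucas-Kanade tracker."""
    pts0 = cv2.goodFeaturesToTrack(frame0_gray, maxCorners=max_corners,
                                   qualityLevel=0.01, minDistance=7)
    pts1, status, _err = cv2.calcOpticalFlowPyrLK(frame0_gray, frame1_gray, pts0, None)
    ok = status.ravel() == 1                     # keep only successfully tracked points
    return pts0[ok].reshape(-1, 2), pts1[ok].reshape(-1, 2)
```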
Algorithm 2 Outline of the proposed non-rigid shape and motion recovery algorithm.
1. Feature correspondence across two neighbouring images.
2. Initialisation of structure and motion parameters.
3. Randomly choose 100 groups of feature points (each group has 50 features) from the overall correspondences.
repeat
  3.1. Employ the Metropolis-Hastings algorithm with Gibbs sampling.
  3.2. Maximisation of p(R, T, α, σ²|P) using Eq. (3) and the associated equations.
until
4. Choose the feature group, with its recovered structure and motion parameters, that leads to the minimum mean error.
5. Apply the generalised EM procedure to the overall feature correspondences using the outcomes of Step 4.
repeat
  5.1. In the E-step, compute the expectation E(w) of the residuals using Eq. (13).
  5.2. In the M-step, Eq. (14) is deployed with the updated shape basis, error variance, translation and rotation.
until
6. Stop the iteration if the average variance is less than a pre-defined threshold. Otherwise, go back to Step 3.
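A minimal sketch of Steps 3-4 of Algorithm 2 (random feature-group sampling followed by selection of the group with the smallest mean error) is given below. The functions `recover_fn` and `reproj_error_fn` are hypothetical placeholders for the MCMC-based recovery and the reprojection-error evaluation, respectively.

```python
import numpy as np

def select_best_group(correspondences, recover_fn, reproj_error_fn,
                      n_groups=100, group_size=50, seed=0):
    """Sample random feature groups, run the recovery on each, and keep the
    group whose recovered parameters give the smallest mean error."""
    rng = np.random.default_rng(seed)
    m = len(correspondences)
    best = None
    for _ in range(n_groups):
        idx = rng.choice(m, size=min(group_size, m), replace=False)
        params = recover_fn([correspondences[i] for i in idx])
        err = float(np.mean(reproj_error_fn(params, correspondences)))
        if best is None or err < best[0]:
            best = (err, idx, params)
    return best
```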
Fig. 4. Four frames of a synthetic sequence (“face sequence”). Image courtesy of Heriot-Watt University, UK.
D. Computational complexity analysis

In this subsection, we use the following conventions for the analysis: n_im is the number of image frames in a sequence, n_F is the number of features used to establish correspondences between two neighbouring images, n_g is the number of feature groups used within MCMC, and n_f is the number of features used within MCMC.

With reference to Algorithm 2, the computational costs of the proposed MCMC scheme for an entire video sequence are as follows: (1) the MCMC step takes O((n_im − 1) n_g n_f²), and (2) the original EM stage takes O((n_im − 1) n_F²). The proposed algorithm therefore adds the extra MCMC effort on top of the original EM and reaches an overall complexity of O((n_im − 1) n_g n_f² + (n_im − 1) n_F²). Choosing smaller n_g and n_f dramatically reduces the computational cost; however, this feature reduction may negatively affect the optimality of the structure and motion parameters. In general, the more features and image frames are involved, the higher the computational cost.

Fig. 5. Some clips of a chess-board sequence, superimposed with the detected/tracked feature points (red squares).
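The complexity expression above can be evaluated directly; the snippet below uses the group and feature counts from Algorithm 2 (100 groups of 50 features, 150 tracked corners), while the sequence length is an assumed value.

```python
def mcmc_em_cost(n_im, n_g, n_f, n_F):
    """Order-of-magnitude operation count from Section IV-D:
    (n_im - 1) * n_g * n_f^2 for the MCMC step plus (n_im - 1) * n_F^2 for EM."""
    return (n_im - 1) * n_g * n_f ** 2 + (n_im - 1) * n_F ** 2

# Example: 30 frames is an assumed sequence length.
print(mcmc_em_cost(n_im=30, n_g=100, n_f=50, n_F=150))
```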
V. EXPERIMENTAL WORK

We compare the proposed algorithm against the state-of-the-art techniques developed by Torresani et al. [32] and Zhou et al. [39], [40], respectively. The former has achieved successful performance for Gaussian estimates, while the latter pursued a solution for the case of non-Gaussian distributions but did not apply any MCMC procedure. To seek a consistent evaluation, the experimental work described in this section is organised as follows: (1) motion estimation and 3-D shape recovery are performed on a synthetic sequence of a deforming shark provided in [32] and on a synthetic face sequence created by ourselves; (2) using publicly accessible image sequences, we evaluate the performance of the proposed algorithm and of Zhou's and Torresani's approaches.
A. Synthetic images

The motion of the shark is a rigid rotation combined with deformations generated by K = 2 basis shapes [32]. We report average errors, defined as the mean distance between the reconstructed and the expected points divided by the size of the shape. To evaluate the proposed method in a noisy situation, we add two noise sources to the original data: one is Gaussian noise (specified by its mean and variance), and the other is uniformly distributed noise. In the first case, the noise variance ranges from 0 to 16 in steps of 2 while all means are zero (units: intensity levels). In the second case, we use signal-to-noise ratios to describe the degree of added noise. Tables I and II show that at smaller noise levels the three algorithms possess comparable performance. However, as the noise levels increase (and occlusions also occur), the proposed 3-D recovery algorithm has much smaller errors than Torresani's and Zhou's approaches. For example, the proposed method reaches 3.48% against 5.41% for Torresani's model and 3.95% for Zhou's model in the case of uniform noise. Therefore, the proposed method is more robust than the others under different noise conditions.

The second synthetic sequence, namely the face sequence, was created via a ray-tracing technique. Some clips are shown in Fig. 4. The motion of the face is known a priori. In order to evaluate the proposed shape recovery algorithm, we add Gaussian and uniformly distributed noise to the facial images individually; the noise levels are the same as those in Tables I and II. From the experiments, the results can be summarised as follows: (1) in the case of Gaussian noise, the proposed method has an error range of (2.01, 15.77)%, while Zhou's method has an error range of (2.12, 16.40)% and Torresani's method (2.09, 16.82)%; and (2) in the case of uniformly distributed noise, the proposed technique results in an error range of (2.02, 9.04)%. In comparison, Zhou's technique leads to an error range of (2.12, 9.56)%, while Torresani's method has a larger error range of (2.09, 13.48)%. In summary, the proposed algorithm has better accuracy than these two classical techniques in this experiment.
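For reproducibility, the sketch below shows one way to compute the average error used in these synthetic tests; normalising by the largest extent of the expected shape is our reading of "the size of the shape" and is therefore an assumption.

```python
import numpy as np

def average_error(reconstructed, expected):
    """Mean distance between reconstructed and expected 3-D points, normalised
    by the size of the expected shape and expressed as a percentage."""
    dists = np.linalg.norm(reconstructed - expected, axis=1)
    size = np.ptp(expected, axis=0).max()       # largest bounding-box extent
    return 100.0 * dists.mean() / size
```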
TABLE I
ERROR STATISTICS FOR DIFFERENT GAUSSIAN NOISE LEVELS. UNIT OF VARIANCE: intensity-level².

Variance          | [32] (%) | [39] (%) | Proposed (%)
no additive noise | 1.46     | 1.47     | 1.41
2                 | 2.98     | 3.17     | 2.9
4                 | 5.68     | 5.17     | 4.94
6                 | 7.72     | 7.05     | 6.53
8                 | 8.84     | 8.11     | 7.46
10                | 10.05    | 9.11     | 8.66
12                | 11.24    | 9.98     | 9.23
14                | 12.01    | 10.73    | 9.5
16                | 12.99    | 11.48    | 10.22

TABLE II
ERROR STATISTICS FOR DIFFERENT UNIFORM NOISE LEVELS. UNIT OF SNR: dB.

SNR               | [32] (%) | [39] (%) | Proposed (%)
no additive noise | 1.46     | 1.47     | 1.43
35.35             | 1.86     | 1.84     | 1.68
29.33             | 2.50     | 2.41     | 2.25
25.84             | 3.13     | 2.92     | 2.36
23.31             | 3.65     | 3.24     | 2.79
21.12             | 4.27     | 3.37     | 2.94
20.03             | 4.52     | 3.45     | 3.17
19.17             | 4.87     | 3.69     | 3.24
18.26             | 5.41     | 3.95     | 3.48
B. Real images

Ground truth is not available for the real image sequences, so the evaluation can only be based on subjective judgement. To provide a comprehensive evaluation of the algorithms involved, we categorise the test results into two classes: the first corresponds to the 2-D projections of the recovered structure, and the second relates to the depth estimation. Three image sequences are tested using the three approaches, but we use image pairs to demonstrate the performance of the shape recovery. Note that missing data occur in two of the image sequences because feature points exit the camera view. The first sequence shows a chess-board pattern printed on a subject's back (Fig. 5) while he repeatedly rotates his trunk.
Fig. 6. Performance comparison of the three methods on the chess-board sequence: (a)-(d) 3-D recovery by [32] using the correspondences between frames 1 (or 2) and 2 (or 3); (b)-(e) 3-D recovery by the proposed method using the same correspondences; (c)-(f) 3-D recovery by [39] using the same correspondences.
Fig. 7. Performance comparison of depth estimation by different algorithms on the chess-board image sequence: (a) Torresani's, (b) Zhou's, (c) proposed.
This sequence was kindly provided by the authors of [17]. The chess-board actually undergoes non-rigid motion, as the shirt it is printed on is soft and flexible. The experimental results, depicted in Fig. 6, show that the proposed algorithm generates 3-D points of larger amplitude along the z-axis (perpendicular to the image plane) than Torresani's and Zhou's methods. The feature points collected from the central areas of frames 1 and 2 actually lie at different depths along the viewing direction of the camera. However, the estimates of Torresani's and Zhou's approaches do not reflect these depth variations, and hence end up with relatively large errors. Fig. 7 compares the depth estimation from images 2 and 3 using the different algorithms on this chess-board sequence (the colour bar shows the depth, here and hereafter). The chess-board is nearly a plane, and therefore the recovered structure should be planar. In Fig. 7, the outcome of the proposed MCMC-EM algorithm shows gradual changes from blue to red, resulting in small errors in structure from motion, while the other methods show significant peaks in the recovered plane.

The second example is a hand sequence from an unknown provider. The hand is first tightly clenched and then released. Fig. 8 shows three example frames of the image sequence, and the corresponding 3-D recovery is illustrated in Fig. 9.

Fig. 8. Some clips of a hand sequence, superimposed with the detected/tracked feature points (red squares).

As in the previous example, the proposed method yields correct feature locations. Interestingly, the recovered 3-D points shown in Figs. 9(a)-(d), (b)-(e) and (c)-(f) differ significantly from each other in their positions. In fact, Figs. 9(b)-(e) appear to follow the correct movements of the hand, whereas the other sub-figures do not. Therefore, the proposed method has better accuracy in shape recovery than Torresani's and Zhou's methods. Fig. 10 compares the depth estimation from images 2 and 3 using the different algorithms on this hand sequence. In some areas the hand appears to be concave, so the recovered structure should reflect this shape. In Fig. 10, the result of the proposed MCMC-EM algorithm reproduces this hand shape, where the light blue colour indicates the concave region of the hand. Unfortunately, the other two algorithms are less accurate in the depth estimation, although Zhou's method is better than Torresani's in the sense that a concave region is observed in its outcome.

The third example is a plastic sheet sequence provided by Salzmann et al., EPFL, Lausanne, Switzerland (see Fig. 11). The sheet is stretched and tilted to form different shapes at different angles. Fig. 12 shows the reconstruction comparison
between our new method and Torresani's model. Comparing Figs. 12(a) and (b), one can observe that the recovered surface shown in Fig. 12(b) is flatter and wider than that shown in Fig. 12(a). As for Figs. 12(d) and (e), the former is tilted at a totally different angle from the latter, which matches the true rotation. This confirms that the proposed method correctly reflects the angle variation of the object, whereas Torresani's and Zhou's approaches perform worse in this example. Fig. 13 compares the depth estimation from images 2 and 3 using the three algorithms on this plastic sheet sequence. In most circumstances the plastic sheet is planar, and the recovered structure should therefore be planar. In Fig. 13, the results of the proposed MCMC-EM and Zhou's algorithms share a very similar structure, while Torresani's algorithm shows larger variations on the recovered surface. Looking closely at these resulting images, Zhou's method has a larger error range than the proposed MCMC-EM algorithm, indicating that the proposed algorithm has the best capability among the three algorithms in this example.

C. Applications in event recognition

In this subsection, we address the potential applications of the proposed structure and motion recovery scheme in event recognition. Event recognition aims to recognise which events are occurring in a video stream. As an example, we focus on the hand sequence, where the state changes during hand holding and releasing need to be detected. In a 2-D based classification system, one has to design a classifier to measure the changes of the marked hand area; however, it is not easy to identify whether these changes are due to hand folding/releasing, mutual occlusion or pose change. In a 3-D based classification system, this uncertainty can be resolved by following the variations of the angle ψ formed by three points extracted from the marked area. The 3-D solution does not require any prior knowledge of the hand motion (e.g. planar or non-planar); it only needs to monitor the state change of the angle ψ during the entire session (i.e. from holding to releasing or vice versa). Fig. 14 illustrates an example of event detection (hand holding/releasing). After the hand has been detected using the well-known OpenCV package, the proposed structure from motion algorithm is used to provide the angle ψ measurements using the area centre and two other surrounding feature points. The detected events match the true situations well, except in the case where the structure/motion scheme becomes less efficient at catching the transition from holding to releasing; this is because it is extremely difficult to identify which of frames 16-18 is the turning point, given the unchanged angle.
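The angle ψ used for hold/release detection can be computed from three recovered 3-D points as in the sketch below; the choice of the three points and any decision threshold on ψ are application-specific assumptions.

```python
import numpy as np

def vertex_angle(center, p1, p2):
    """Angle (in degrees) at `center` formed by the two surrounding points,
    i.e. the angle psi monitored for hold/release detection."""
    v1, v2 = p1 - center, p2 - center
    cos_psi = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
    return float(np.degrees(np.arccos(np.clip(cos_psi, -1.0, 1.0))))
```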
Fig. 9. Performance comparison of the three methods on the hand sequence: (a)-(d) 3-D recovery by [32] using the correspondences between frames 1 (or 2) and 2 (or 3); (b)-(e) 3-D recovery by the proposed method using the same correspondences; (c)-(f) 3-D recovery by [39] using the same correspondences.
Fig. 10. Performance comparison of depth estimation by different algorithms on the hand image sequence: (a) Torresani's, (b) Zhou's, (c) proposed.
D. Further discussion

One of the main contributions of the proposed MCMC-EM algorithm is its capability of effectively handling 2-D observations without assuming particular distribution characteristics. In the synthetic cases, we ran the proposed algorithm over image data with additive Gaussian or non-Gaussian noise. In the real datasets, missing data and non-Gaussian distributions continuously challenged the proposed algorithm. For example, in the hand sequence, the marks on the hand gradually disappeared and re-appeared as the hand folded and then released. In the chess-board sequence, since all regions of the chess-board have similar texture, the tracking system struggled to find correct correspondences among multiple candidates in the matching stage. Despite these difficult circumstances, the proposed algorithm achieved better performance than the other two state-of-the-art approaches. This is attributable to the fact that the proposed approach uses an MCMC procedure plus EM refinements, which applies to general distributions and does not require a log-concavity condition for the final solution.
Fig. 11. Some clips of a plastic sheet sequence, superimposed with the detected/tracked feature points (red squares).
Fig. 12. Performance comparison of the three methods on the plastic sheet sequence: (a)-(d) 3-D recovery by [32] using the correspondences between frames 1 (or 2) and 2 (or 3); (b)-(e) 3-D recovery by the proposed method using the same correspondences; (c)-(f) 3-D recovery by [39] using the same correspondences.

Fig. 13. Performance comparison of depth estimation by different algorithms on the plastic sheet image sequence: (a) Torresani's, (b) Zhou's, (c) proposed.
VI. CONCLUSIONS AND FUTURE WORK
We have presented a solution to the non-rigid shape and motion recovery problem. The new algorithm allows us to perform more accurate 3-D structure recovery than two classical approaches under different distributions. A novel MCMC-EM algorithm was proposed that does not assume particular statistical properties. By addressing the orthogonal reconstruction of the structure in a constrained minimisation function, we can directly render the 3-D shape, motion parameters and variance values, building on the rigid motion obtained with Tomasi-Kanade's rigid structure-from-motion factorisation. Numerical experiments have demonstrated that the proposed method produces better results on clean and noisy synthetic data as well as on real image sequences. An application of the proposed algorithm to event recognition was also introduced. In spite of its success, this approach has an obvious limitation: the Gibbs sampling in the computation requires considerable computational effort. To address this weakness in future work, we are working towards an optimisation technique that reduces the amount of random sampling, which may in turn require a compromise between convergence accuracy and efficiency.

ACKNOWLEDGMENTS

This work was in part supported by the National Basic Research Program of China (973 Program) (Grant No. 2012CB316404), the National Natural Science Foundation of
Fig. 14. Event recognition using the calculated angle ψ by the proposed MCMC-EM algorithm for the hand sequence (upper row: estimated events are compared to the ground truth; lower row: image samples).
China (Grant No. 61072093) and the Open Project of State Key Laboratory of Industrial Control Technology (Zhejiang University) (Grant No. ICT1105). The authors would like to thank the handling editor and the anonymous reviewers for their valuable and constructive comments. The unknown provider of the hand sequence and the other image creators in the experimental work are also appreciated.

REFERENCES

[1] "Rushes project deliverable d5, requirement analysis and use-cases definition for professional content creators or providers and home-users," http://www.rushesproject.eu/upload/Deliverables/D5 WP1 ETB v04.pdf, August 2007.
[2] A. Azarbayejani and A. Pentland, "Recursive estimation of motion, structure, and focal length," IEEE Trans. Pattern Anal. Mach. Intell., vol. 17, no. 6, pp. 562–575, 1995.
[3] A. Bartoli, "Maximizing the predictivity of smooth deformable image warps through cross-validation," Journal of Mathematical Imaging and Vision, vol. 31, no. 2-3, pp. 133–145, 2008.
[4] A. Bartoli, V. Gay-Bellile, U. Castellani, J. Peyras, S. Olsen, and P. Sayd, "Coarse-to-fine low-rank structure-from-motion," in Proc. of IEEE Conf. on Comput. Vis. and Patt. Recogn., 2008, pp. 1–8.
[5] C. Biernacki, G. Celeux, and G. Govaet, "Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models," Comput. Stat. and Data Analysis, vol. 41, no. 3-4, pp. 561–575, 2003.
[6] C. Bregler, A. Hertzmann, and H. Biermann, "Recovering non-rigid 3D shape from image streams," in Proc. of IEEE Conf. on Comput. Vis. and Patt. Recogn., 2000, pp. 690–696.
[7] R. Carceroni and K. Kutulakos, "Multi-view 3D shape and motion recovery on the spatio-temporal curve manifold," in Proc. of IEEE Conf. on Comput. Vis., 1999, pp. 520–527.
[8] S. Chib and E. Greenberg, "Understanding the Metropolis-Hastings algorithm," The American Statistician, vol. 49, no. 4, pp. 327–335, 1995.
[9] D. Coleman and D. Woodruff, "Cluster analysis for large datasets: an effective algorithm for maximizing the mixture likelihood," Journal of Comput. and Graph. Stat., vol. 9, no. 4, pp. 672–688, 2000.
[10] D. Damen and D. Hogg, "Recognizing linked events: Searching the space of feasible explanations," in Proc. of IEEE Conf. on Comput. Vis. and Patt. Recogn., 2009, pp. 927–934.
[11] M. Davis, "An iconic visual language for video annotation," in Proc. of IEEE Symp. on Visual Language, 1993, pp. 196–202.
[12] F. Dellaert, S. Seitz, C. Thorpe, and S. Thrun, "EM, MCMC, and chain flipping for structure from motion with unknown correspondence," Machine Learning, vol. 50, no. 1-2, pp. 45–71, 2003.
[13] M. Farenzena, A. Bartoli, and Y. Mezouar, "Efficient camera smoothing in sequential structure-from-motion using approximate cross-validation," in Proc. of European Conference on Computer Vision, 2008, pp. 196–209.
[14] M. Fischler and R. Bolles, "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography," Communic. ACM, vol. 24, pp. 381–395, 1981.
[15] D. Forsyth, S. Ioffe, and J. Haddon, "Bayesian structure from motion," in Proc. of Int'l Conf. Comput. Vis., 1999, pp. 660–665.
[16] Z. Ghahramani and G. Hinton, "The EM algorithm for mixtures of factor analyzers," University of Toronto, Tech. Rep. CRG-TR-96-1, 1996.
[17] I. Guskov and L. Zhukov, "Direct pattern tracking on flexible geometry," in Winter School of Computer Graphics, 2002, pp. 203–208.
[18] R. Hartley and R. Vidal, "Perspective nonrigid shape and motion recovery," in Proc. of European Conf. on Comput. Vis., 2008, pp. 276–289.
[19] M. Isard and A. Blake, "Condensation - conditional density propagation for visual tracking," Int. J. of Computer Vision, vol. 29, no. 1, pp. 5–28, 1998.
[20] Z. Khan, T. Balch, and F. Dellaert, "MCMC-based particle filtering for tracking a variable number of interacting targets," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 11, pp. 1805–1918, 2005.
[21] M. Meilă and D. Heckerman, "An experimental comparison of model-based clustering methods," Mach. Learn., vol. 42, pp. 9–29, January 2001.
[22] R. Ohbuchi and J. Kobayashi, "Unsupervised learning from a corpus for shape-based 3D model retrieval," in Proc. of the 8th ACM International Workshop on Multimedia Information Retrieval, New York, NY, USA, 2006, pp. 163–172.
[23] M. Pollefeys, R. Koch, and L. Van Gool, "Self-calibration and metric reconstruction in spite of varying and unknown internal camera parameters," Int. J. of Comput. Vis., vol. 32, pp. 7–25, 1999.
[24] G. Qian and R. Chellappa, "Structure from motion using sequential Monte Carlo methods," Int. J. Comput. Vision, vol. 59, no. 1, pp. 5–31, 2004.
[25] Y. Seo and K. Hong, "Structure and motion estimation with expectation maximization and extended Kalman smoother for continuous image sequences," in Proc. of IEEE Conf. on Comput. Vis. Pattern Recog., 2001, pp. 1148–1154.
[26] A. Shaji and S. Chandran, "Riemannian manifold optimisation for non-rigid structure from motion," in Proc. of First Workshop on Non-Rigid Shape Analysis and Deformable Image Alignment, IEEE Conf. on Comput. Vis. Pattern Recog., 2008, pp. 1–6.
[27] J. Sivic and A. Zisserman, "Video Google: a text retrieval approach to object matching in videos," in Proc. of IEEE Int'l Conf. on Comput. Vis., 2003, pp. 1470–1477.
[28] R. Szeliski and S. Kang, "Recovering 3D shape and motion from image streams using non-linear least squares," Journal of Visual Communication and Image Representation, vol. 5, no. 1, pp. 10–28, 1994.
[29] J. Tangelder and R. Veltkamp, "A survey of content based 3D shape retrieval methods," in Proc. of International Conference on Shape Modeling, 2004, pp. 145–156.
[30] C. Tomasi and T. Kanade, "Detection and tracking of point features," Carnegie Mellon University, Tech. Rep. CMU-CS-91-132, 1991.
[31] C. Tomasi and T. Kanade, "Shape and motion from image streams under orthography: a factorization method," Int. J. of Computer Vision, vol. 9, pp. 137–154, 1992.
[32] L. Torresani, A. Hertzmann, and C. Bregler, "Learning non-rigid 3D shape from 2D motion," in Neural Information Processing Systems (NIPS), 2003, pp. 1555–1562.
[33] L. Torresani, A. Hertzmann, and C. Bregler, "Non-rigid structure-from-motion: estimating shape and motion with hierarchical priors," IEEE Trans. Pattern Anal. and Mach. Intell., vol. 30, no. 5, pp. 878–892, 2008.
[34] L. Torresani, D. Yang, G. Alexander, and C. Bregler, "Tracking and modeling non-rigid objects with rank constraints," in Proc. of IEEE Conf. on Comput. Vis. and Patt. Recogn., 2001, pp. 493–500.
[35] Z. Tu and S.-C. Zhu, "Image segmentation by data-driven Markov chain Monte Carlo," IEEE Trans. Pattern Anal. Mach. Intell., vol. 24, no. 5, pp. 657–673, 2002.
[36] E. Vonesh and R. Carter, "Efficient inference for random-coefficient growth curve models with unbalanced data," Biometrics, vol. 43, pp. 617–628, 1987.
[37] J. Xiao and T. Kanade, "Non-rigid shape and motion recovery: Degenerate deformations," in Proc. of IEEE Conf. on Comput. Vis. and Patt. Recogn., 2004, pp. 668–675.
[38] J. Yan and M. Pollefeys, "A factorization-based approach for articulated nonrigid shape, motion and kinematic chain recovery from video," IEEE Trans. on Pattern Anal. and Mach. Intell., vol. 30, no. 5, pp. 865–877, May 2008.
[39] H. Zhou, X. Li, T. Liu, F. Lin, Y. Pang, J. Wu, J. Dong, and J. Wu, "Recovery of nonrigid structures from 2D observations," International Journal of Pattern Recognition and Artificial Intelligence, vol. 22, no. 2, pp. 279–294, 2008.
[40] H. Zhou, G. Schaefer, T. Liu, and F. Lin, "3-D structure recovery from 2-D observations," in Proc. of Int'l Conf. Image Proc., 2009, pp. 4309–4312.
Huiyu Zhou received his BEng degree in radio technology from Huazhong University of Science and Technology, China, in 1990. He was awarded a Master of Science degree in biomedical engineering from the University of Dundee, and a PhD degree in computer vision from Heriot-Watt University, Edinburgh, United Kingdom. He currently holds a permanent post in ECIT, Queen's University Belfast, United Kingdom. His research interests include computer vision, human motion analysis, intelligent systems and human-computer interfaces. He has published widely on these topics.
Xuelong Li (M’02-SM’07) is a professor with the Center for OPTical IMagery Analysis and Learning (OPTIMAL), State Key Laboratory of Transient Optics and Photonics, Xi’an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi’an 710119, Shaanxi, P. R. China.
Abdul H. Sadka (SM'07) is chair and head of Electronic and Computer Engineering and the director of the Centre for Media Communications Research at Brunel University, West London, with 15 years' experience in academic leadership and excellence. He received his PhD in Electrical and Electronic Engineering from Surrey University, UK, in 1997. He is an internationally renowned expert in visual media processing and communications with an extensive track record of scientific achievements and peer-recognised research excellence. He has so far attracted nearly 2.5M worth of research grants and contracts in his capacity as principal investigator. He has published widely in international journals and conferences and is the author of a highly regarded book on "Compressed Video Communications" published by Wiley in 2002. He holds three patents in the video transport and compression area. He acts as a scientific advisor and consultant to several key companies in the international telecommunications sector. He is a Fellow of the IET and a Senior Member of the IEEE.