Sparse Representation for 3D Shape Estimation: A Convex Relaxation Approach
arXiv:1509.04309v1 [cs.CV] 14 Sep 2015
Xiaowei Zhou, Menglong Zhu, Spyridon Leonardos, and Kostas Daniilidis
The authors are with the Computer and Information Science Department and the GRASP Laboratory, University of Pennsylvania, Philadelphia, PA 19104. E-mail: [email protected]
Abstract—We investigate the problem of estimating the 3D shape of an object defined by a set of 3D landmarks, given their 2D correspondences in a single image. To alleviate the reconstruction ambiguity, a widely used approach is to assume that the unknown shape is a linear combination of predefined basis shapes, and a sparse representation is usually adopted to capture complex shape variability. While this approach has proven successful in many applications, a challenging issue remains: the joint estimation of shape and viewpoint requires solving a nonconvex optimization problem. Previous methods often adopt an alternating minimization scheme to alternately update the shape and viewpoint, so the solution depends on initialization and may get stuck at a local optimum. In this paper, we propose a convex approach to address this issue and develop an efficient algorithm to solve the proposed convex program. Moreover, we propose a robust model to handle gross errors in the 2D correspondences. We demonstrate the exact recovery property of the proposed model, its merits compared to alternative methods, and its applicability to recovering 3D human poses and car shapes from real images.
1 INTRODUCTION
Recognizing 3D objects from 2D images is a central problem in computer vision. In recent years, there has been an emerging trend towards analyzing the 3D geometry of objects, including shapes and poses, instead of merely providing bounding boxes [1], [2]. 3D geometric reasoning can not only provide richer information about the scene for subsequent high-level tasks such as scene understanding, augmented reality and human-computer interaction, but also improve object detection [3], [4]. 3D reconstruction is a well-studied problem with many practically applicable techniques, such as structure from motion, multiview stereo and depth sensors, but these techniques are limited in some scenarios. For example, a large number of acquisitions from multiple views are often required to obtain a complete 3D model, which is impractical in some real-time applications; depth sensors generally cannot work outdoors and have a limited sensing range; and reconstructing a dynamic scene is still an open problem. In this paper, we investigate the possibility of estimating the 3D shape of an object from a single 2D image, which may potentially address the aforementioned issues.
Estimating the 3D geometry of an object from a single view is an ill-posed problem. But it is a possible task for a human observer, since humans can leverage visual memory of object shapes. Inspired by this idea, more and more efforts have been made towards 3D model-based analysis, leveraging the increasing availability of online 3D model databases, such as the Google 3D Warehouse, which includes millions of CAD models of all kinds of objects, and the growing number of publicly available shape-scan and motion-capture datasets that model the 3D shape and pose of human bodies. To address intra-class variability or nonrigid deformation and to avoid enumerating all possibilities, many works
(e.g., [5], [6]) have adopted a shape-space approach originating from the "active shape model" [7] to represent shapes, where each shape is defined by a set of ordered landmarks and the shape to be estimated is assumed to be a linear combination of predefined basis shapes. For human pose reconstruction, sparse representation was proposed to handle the relatively complex variation of human poses, which cannot be represented by the conventional active shape model [8], [9]. In model inference, the 3D shape model is fitted to the landmarks annotated or detected in images. In this way, the problem becomes a 3D-to-2D shape fitting problem, where the shape parameters (weights of the linear combination) and the pose parameters (viewpoint) have to be estimated simultaneously. Figure 1 gives an illustration of the fitting problem.

Fig. 1. Illustration of the problem studied in this paper. The unknown 3D model is defined by a set of landmarks and assumed to be a linear combination of some predefined basis shapes with sparse coefficients. Given the 2D correspondences of the landmarks in an image, the computational problem is to estimate the coefficients of the sparse representation as well as the viewpoint of the camera.

While this approach has achieved promising results in various applications, model inference remains challenging, since the subproblems of shape and pose estimation are coupled: the pose needs to be known to deform the 3D model to match the 2D landmarks, while the exact 3D model is required to estimate the pose. The joint estimation of shape and pose parameters usually results in a nonconvex optimization problem, and the orthogonality constraint on the pose parameters makes the problem more complicated. Previous methods often adopt an alternating scheme to alternately update the shape and pose parameters until convergence, which is sensitive to initialization and may get stuck at locally optimal solutions. As noted in many works (e.g., [8], [5]), most failure cases are attributed to bad initialization. Some heuristics have been used to alleviate this issue, such as trying multiple initializations [9] or using a viewpoint-aware detector for pose initialization [6], but there is still no guarantee of global optimality.
In this paper, we propose a convex relaxation approach to address the aforementioned issue. We use an augmented shape-space model, in which a shape is represented as a linear combination of rotatable basis shapes. This model gives a linear representation of both intrinsic shape deformation and extrinsic viewpoint changes. Next, we use the convex relaxation of the orthogonality constraint to convert the entire problem into a spectral-norm regularized least-squares problem, which is a convex program. We then develop an efficient algorithm to solve the proposed convex program based on the alternating direction method of multipliers (ADMM). To achieve fast computation, we derive a closed-form solution for the proximal operator of the spectral norm. Furthermore, we extend the proposed model to handle outliers in the input 2D correspondences.
This paper extends our earlier conference version [10]. The extension includes the postprocessing for rotation synchronization, outlier modeling, more experiments, and real examples with learned detectors.
The remainder of this paper is organized as follows. We first review related work in Section 2 and formulate the problem in Section 3. We then introduce the proposed methods, including the convex relaxation, optimization algorithms, outlier modeling and other technical details, in Section 4. We report experimental results in Section 5 and conclude the paper with discussions in Section 6.
2 RELATED WORK
The most closely related works are those that fit a 3D deformable model to features in a 2D image. A popular application is human face modeling for tasks such as recognition [11], feature tracking [12] and facial animation [13]. Recently, there has been an increasing number of works on 3D object modeling. For example, Hejrati and Ramanan [5] used a deformable model for 3D car modeling. They produced 2D landmarks with a variant of deformable part models [14] and then lifted the 2D model to 3D. Lin et al. [15] proposed a method for joint 3D model fitting and fine-grained classification for cars. Zia et al. [6] developed a probabilistic framework to simultaneously localize 2D landmarks and recover 3D object models. While the conventional active shape model describes the shape variability of rigid objects well, it can hardly handle the structural variability of nonrigid objects such as human bodies. To address this issue, Ramakrishna et al. [8] proposed a sparse-representation-based approach to reconstructing 3D human pose from annotated joints in a still image. Wang et al. [9] adopted a 2D human pose detector to automatically locate the joints and used a robust estimator to handle inaccurate joint locations. Fan et al. [16] proposed to improve the performance of [8] by enforcing
locality when building the pose dictionary. Zhou and De la Torre [17] formulated human pose estimation as a matching problem, where a learned spatio-temporal pose model is matched to trajectories extracted from a video. Akhter and Black [18] integrated a joint-angle constraint into the sparse representation to reduce the possibility of invalid reconstructions. A common component or intermediate step in the works mentioned above is fitting the 3D deformable model to 2D correspondences. As we mentioned in the introduction, these works relied on nonconvex optimization and adopted alternating algorithms, which may be sensitive to initialization. The convex formulation proposed in this paper can potentially serve as a building block or provide a good initialization to improve the performance of the existing methods.
Recently, there has been growing interest in reconstructing category-specific object models from a collection of single images that include different instances of the same category [19], [20], [21], [22]. These works also adopted the deformable shape model and fitted it to either visual hulls in pre-segmented images or landmarks annotated in the images, and the basis shapes were learned from images during model inference. The problem studied in this paper can be regarded as an essential component of these works. Another line of work solves single-image reconstruction by finding the best match in a shape collection followed by a refinement step [23], [24]. The initial shape and viewpoint are found by enumerating all instances and viewpoints and comparing the rendered image with the given image. Then, the estimate is refined by optimizing the rigid transform and nonrigid deformation to fit the image contours. This approach can produce very detailed reconstructions and is applicable to a wide range of object categories, but it is computationally expensive and requires accurate image segmentation. There are also many alternative approaches for 3D human pose recovery from single images, such as using a known articulated skeleton [25], [26], probabilistic graphical models [27], [28], and explicit regression [29], [30], to name a few. But these approaches are customized for human pose modeling and do not straightforwardly generalize to other objects.
Our work is also closely related to nonrigid structure from motion (NRSfM), where a deformable shape is recovered from multi-frame 2D-2D correspondences. The low-rank shape-space model has been frequently used in NRSfM, but the basis shapes are unknown. The joint estimation of shape/pose variables and basis shapes is typically solved via matrix factorization followed by metric rectification [31], [32]. In some recent works, iterative algorithms were employed for better precision [33], [34] or sequential processing [35], and the problem studied in this paper is analogous to the step of fixing the basis shapes and updating the remaining variables in those works.
3 PROBLEM STATEMENT
The problem studied in this paper can be described by the following linear system:
$$W = \Pi S, \qquad (1)$$

where $S \in \mathbb{R}^{3\times p}$ denotes the unknown 3D shape, which is represented by the 3D locations of $p$ points, and $W \in \mathbb{R}^{2\times p}$ denotes their projections in a 2D image. $\Pi$ is the camera calibration matrix. To simplify the problem, the weak-perspective camera model is usually used, which is a good approximation when the object depth is much smaller than the distance from the camera. With this assumption, the calibration matrix has the following simple form:

$$\Pi = \begin{bmatrix} \alpha & 0 & 0 \\ 0 & \alpha & 0 \end{bmatrix}, \qquad (2)$$

where $\alpha$ is a scalar depending on the focal length and the distance to the object.
There are always more variables than equations in (1). To make the problem well-posed, a widely used assumption is that the unknown shape can be represented as a linear combination of predefined basis shapes, which originates from the active shape model [7]:

$$S = \sum_{i=1}^{k} c_i B_i, \qquad (3)$$

where $B_i \in \mathbb{R}^{3\times p}$ for $i \in [1,k]$ represents a basis shape learned from training data, and $c_i$ denotes the weight of each basis shape. In this way, the reconstruction problem is turned into a problem of estimating a few coefficients by fitting the model (3) to the landmarks in an image, which greatly reduces the number of unknowns. Since the basis shapes are predefined, the relative rotation and translation between the camera frame and the frame in which the basis shapes are defined need to be taken into account, and the 3D-to-2D projection is given by

$$W = \Pi \left( R \sum_{i=1}^{k} c_i B_i + T \mathbf{1}^T \right), \qquad (4)$$

where $R \in \mathbb{R}^{3\times 3}$ and $T \in \mathbb{R}^{3}$ correspond to the rotation matrix and the translation vector, respectively. $R$ should lie in the special orthogonal group

$$SO(3) = \{ R \in \mathbb{R}^{3\times 3} \mid R^T R = I_3, \; \det R = 1 \}. \qquad (5)$$
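To make the weak-perspective model in (2)-(4) concrete, the following is a minimal numpy sketch of projecting a 3D shape onto the image plane. It is an illustration only; the variable names (alpha, R, T) mirror the symbols above and the random shape is purely synthetic.

import numpy as np

def weak_perspective_project(S, R, T, alpha):
    """Project a 3D shape S (3 x p) using the weak-perspective model of Eq. (4).

    R: 3x3 rotation, T: translation (3,), alpha: scale from Eq. (2).
    Returns a 2 x p matrix of image coordinates.
    """
    Pi = alpha * np.eye(2, 3)            # calibration matrix of Eq. (2)
    return Pi @ (R @ S + T[:, None])     # Eq. (4), with T broadcast over landmarks

# illustrative usage with a random shape and a rotation about the z-axis
p = 15
S = np.random.randn(3, p)
theta = 0.3
R = np.array([[np.cos(theta), -np.sin(theta), 0],
              [np.sin(theta),  np.cos(theta), 0],
              [0, 0, 1]])
W = weak_perspective_project(S, R, T=np.array([0.1, -0.2, 5.0]), alpha=2.0)
print(W.shape)  # (2, p)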
Equation (4) can be further simplified as

$$W = \bar{R} \sum_{i=1}^{k} c_i B_i, \qquad (6)$$

where $\bar{R} \in \mathbb{R}^{2\times 3}$ denotes the first two rows of the rotation matrix, and the translation $T$ has been eliminated by centralizing the data, i.e., subtracting from each row of $W$ and $B_i$ its mean. Note that the scalar $\alpha$ in the calibration matrix has been absorbed into $c_1, \cdots, c_k$.
In the conventional active shape model, the principal components of the training samples are used as the basis shapes, which assumes that the unknown shape lies in a low-dimensional linear space. Many recent works (e.g., [8], [36], [37], [38]) have shown that a low-dimensional linear space is insufficient to model complex shape variation, e.g., human poses, and that a promising alternative is to use an over-complete dictionary and represent an unknown shape as a sparse combination of atoms in the dictionary. Such a sparse representation implicitly encodes the assumption that the unknown shape should lie in a union of subspaces that approximates a nonlinear shape manifold. The representabilities of principal component analysis (PCA) and sparse representation for 3D human pose modeling are experimentally compared in Section 5.2.
Based on the sparse representation of shapes, the following type of optimization problem is often considered to estimate an unknown shape:

$$\min_{c,\bar{R}} \; \frac{1}{2}\left\| W - \bar{R}\sum_{i=1}^{k} c_i B_i \right\|_F^2 + \lambda \|c\|_1, \quad \text{s.t. } \bar{R}\bar{R}^T = I_2, \qquad (7)$$

where $c = [c_1, \cdots, c_k]^T$ and $\|c\|_1$ represents the $\ell_1$-norm of $c$, which is the convex surrogate of the cardinality, and $\|\cdot\|_F$ denotes the Frobenius norm of a matrix. The two terms in the cost function of (7) correspond to the reprojection error and the sparsity of the representation, respectively.
The optimization in (7) is nonconvex and involves an orthogonality constraint. A commonly used strategy is an alternating minimization scheme with two steps: (1) fix $\bar{R}$ and update $c$ with any $\ell_1$ minimization solver, e.g., the methods summarized in [39]; (2) fix $c$ and update $\bar{R}$ using a rotation representation such as quaternions, the exponential map or a manifold representation. In some previous works (e.g., [8], [18]), $\bar{R}$ is updated by the Procrustes method, i.e., projecting the least-squares solution onto the orthogonal group via singular value decomposition (SVD). But this method can only give an approximate solution, since $\bar{R}$ is not a full rotation matrix and generally no closed-form solution exists [40]. The alternating algorithm is summarized in Algorithm 1. Due to the nonconvexity of (7), Algorithm 1 may get stuck at local minima when the initialization is far from the true solution.

Input: W. Output: c and R̄.
initialize c and R̄;
while not converged do
  update c using any ℓ1 minimization solver;
  update R̄ using any local optimization solver over SO(3) or the Procrustes method;
end
Algorithm 1: Alternating minimization to solve (7).
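For concreteness, the following is a minimal numpy sketch of this alternating baseline, not the authors' implementation: the c-step uses a few iterations of ISTA (proximal gradient with soft-thresholding) as the ℓ1 solver, and the R̄-step projects the data-fitting solution onto matrices with orthonormal rows via SVD, one common reading of the Procrustes update described above. The iteration counts and λ are illustrative.

import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def altern_fit(W, B, lam=0.1, n_outer=50, n_ista=100):
    """Alternating minimization for problem (7).

    W: 2 x p centered landmarks; B: list of k basis shapes, each 3 x p (centered).
    Returns (c, R_bar) with R_bar of size 2 x 3 having orthonormal rows.
    """
    k = len(B)
    c = np.zeros(k)
    R = np.eye(2, 3)                       # initialize R_bar with the identity viewpoint
    for _ in range(n_outer):
        # c-step: lasso on vec(W) = A c with A[:, i] = vec(R @ B_i), solved by ISTA
        A = np.stack([(R @ Bi).ravel() for Bi in B], axis=1)
        y = W.ravel()
        L = np.linalg.norm(A, 2) ** 2      # Lipschitz constant of the gradient
        for _ in range(n_ista):
            grad = A.T @ (A @ c - y)
            c = soft_threshold(c - grad / L, lam / L)
        # R-step: project a Procrustes-style solution onto orthonormal rows via SVD
        S = sum(ci * Bi for ci, Bi in zip(c, B))                # current 3D shape estimate
        U, _, Vt = np.linalg.svd(W @ S.T, full_matrices=False)  # W S^T is 2 x 3
        R = U @ Vt                          # 2 x 3 with R R^T = I_2
    return c, R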
4 PROPOSED METHODS

4.1 Convex Relaxation

We propose to use the following shape-space model:

$$S = \sum_{i=1}^{k} c_i R_i B_i, \qquad (8)$$

in which there is a rotation for each basis shape. The model in (8) implicitly accounts for the viewpoint variability, and the projected 2D model is

$$W = \Pi \sum_{i=1}^{k} c_i R_i B_i = \sum_{i=1}^{k} M_i B_i, \qquad (9)$$

where $M_i \in \mathbb{R}^{2\times 3}$ is the product of $c_i$ and the first two rows of $R_i$, which satisfies

$$M_i M_i^T = c_i^2 I_2. \qquad (10)$$

The motivation for using the models in (8) and (9) is to achieve a linear representation of shape variability in 2D, such that we can get rid of the bilinear form in (6), which is a necessary step towards a convex formulation. The model in (9) is equivalent to the affine-shape model in the literature [41], [42], which uses an augmented linear space to represent the shape variation in 2D caused by both intrinsic shape deformation and extrinsic viewpoint changes. This representation also appears in most of the NRSfM literature (e.g., [31], [33]). As mentioned in [42], the augmented linear space can represent any 2D shape produced by the 3D shape model projected into the image plane, but the increased degrees of freedom may result in invalid shapes. In this work, we reduce the possibility of invalid cases by enforcing the orthogonality constraint on the $M_i$'s and the sparsity constraint on the number of activated basis shapes. We will show that these constraints can be conveniently imposed by minimizing a convex regularizer.
Next, we consider replacing the orthogonality constraint in (10) by its convex counterpart. The following lemma has been proven in the literature [43, Section 3.4]:

Lemma 1. The convex hull of the Stiefel manifold $\mathcal{Q} = \{X \in \mathbb{R}^{m\times n} \mid X^T X = I_n\}$ equals the unit spectral-norm ball: $\mathrm{conv}(\mathcal{Q}) = \{X \in \mathbb{R}^{m\times n} \mid \|X\|_2 \le 1\}$. Here $\|X\|_2$ denotes the spectral norm (a.k.a. the induced 2-norm) of a matrix $X$, which is defined as the largest singular value of $X$.

Based on Lemma 1, we have the following proposition:

Proposition 1. Given a scalar $s$, the convex hull of $\mathcal{S} = \{Y \in \mathbb{R}^{m\times n} \mid Y^T Y = s^2 I_n\}$ equals the spectral-norm ball with a radius of $|s|$: $\mathrm{conv}(\mathcal{S}) = \{Y \in \mathbb{R}^{m\times n} \mid \|Y\|_2 \le |s|\}$.

Proof. Since $\mathcal{S} = \{Y \mid Y = |s|X, \; X \in \mathcal{Q}\}$, we have

$$\mathrm{conv}(\mathcal{S}) = \left\{ \sum_{i=1}^{k} \theta_i Y_i \,\middle|\, Y_i \in \mathcal{S}, \; \theta_i \ge 0, \; \sum_{i=1}^{k}\theta_i = 1 \right\} = \left\{ \sum_{i=1}^{k} \theta_i |s| X_i \,\middle|\, X_i \in \mathcal{Q}, \; \theta_i \ge 0, \; \sum_{i=1}^{k}\theta_i = 1 \right\} = |s| \cdot \mathrm{conv}(\mathcal{Q}).$$

Consequently, the tightest convex relaxation of the constraint in (10) is given by $\|M_i\|_2 \le |c_i|$.
Finally, with the modified shape model, the relaxed orthogonality constraint and the assumption of sparse representation, we propose to minimize the $\ell_1$-norm of the coefficient vector for shape recovery in the noiseless case:

$$\min_{c_1,\cdots,c_k,\,M_1,\cdots,M_k} \; \sum_{i=1}^{k} |c_i|, \quad \text{s.t. } W = \sum_{i=1}^{k} M_i B_i, \;\; \|M_i\|_2 \le |c_i|, \; \forall i \in [1,k], \qquad (11)$$

which is obviously equivalent to the following problem:

$$\min_{M_1,\cdots,M_k} \; \sum_{i=1}^{k} \|M_i\|_2, \quad \text{s.t. } W = \sum_{i=1}^{k} M_i B_i. \qquad (12)$$

The formulation in (12) is a linear inverse problem, where we estimate a set of orthogonal matrices by minimizing their spectral norms. Interestingly, the conditions for exact recovery using such a convex program have been theoretically analyzed in [44]. We will provide numerical results to demonstrate the exact recovery property in Section 5.1.
Considering noise in real applications, we can instead solve

$$\min_{M_1,\cdots,M_k} \; \frac{1}{2}\left\| W - \sum_{i=1}^{k} M_i B_i \right\|_F^2 + \lambda \sum_{i=1}^{k} \|M_i\|_2. \qquad (13)$$
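Because (13) is convex, it can be prototyped directly with a generic conic solver before resorting to the specialized ADMM of Section 4.3. Below is a minimal sketch assuming the CVXPY package (not used by the authors); the random data is purely illustrative.

import numpy as np
import cvxpy as cp

k, p = 10, 15
B = [np.random.randn(3, p) for _ in range(k)]   # basis shapes (illustrative)
W = np.random.randn(2, p)                        # centered 2D landmarks
lam = 1.0

M = [cp.Variable((2, 3)) for _ in range(k)]
residual = W - sum(M[i] @ B[i] for i in range(k))
objective = 0.5 * cp.sum_squares(residual) + lam * sum(cp.sigma_max(M[i]) for i in range(k))
prob = cp.Problem(cp.Minimize(objective))
prob.solve()                                     # sigma_max makes this SDP-representable
M_hat = [M[i].value for i in range(k)]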
The problem (13) is our final formulation. It is a penalized least-squares problem. We have the following remarks:

1. The problem in (13) is a convex program, which can be solved globally. We will provide an efficient algorithm to solve it in Section 4.3.
2. Notice that $\|\cdot\|_2$ in the above formulations denotes the spectral norm of a matrix instead of the $\ell_2$-norm of a vector. As we will show in Section 4.2, minimizing the spectral norm of a matrix is equivalent to minimizing the $\ell_\infty$-norm of the vector of its singular values, which simultaneously shrinks the norm of the matrix towards zero and enforces its singular values to be equal. Therefore, by spectral-norm minimization, we can not only minimize the number of activated basis shapes but also enforce each transformation matrix $M_i$ to be orthogonal (an orthogonal matrix has equal singular values).
3. In practice, we may estimate the $M_i$'s by only considering the reprojection errors of a subset of landmarks, i.e., by including a binary weight matrix in the first term of (13). The other landmarks can be hallucinated from the reconstructed shape model, as their locations are known on the basis shapes.

4.2 Proximal operator of the spectral norm

Before deriving the specific algorithm to solve (13), we first prove the following proposition, which will serve as an important building block in our algorithm.

Proposition 2. The solution to the problem

$$\min_{X} \; \frac{1}{2}\|Y - X\|_F^2 + \lambda \|X\|_2 \qquad (14)$$

is given by $X^* = \mathcal{D}_\lambda(Y)$, where

$$\mathcal{D}_\lambda(Y) = U_Y \,\mathrm{diag}\!\left[\sigma_Y - \lambda P_{\ell_1}(\sigma_Y/\lambda)\right] V_Y^T, \qquad (15)$$

$U_Y$, $V_Y$ and $\sigma_Y$ denote the left singular vectors, the right singular vectors and the singular values of $Y$, respectively, and $P_{\ell_1}$ is the projection of a vector onto the unit $\ell_1$-norm ball.

Proof. The problem in (14) is a proximal problem [45]. The proximal problem associated with a function $F$ is defined as

$$\mathrm{prox}_{\lambda F}(Y) = \arg\min_{X} \; \frac{1}{2}\|Y - X\|_F^2 + \lambda F(X), \qquad (16)$$

with the solution denoted by $\mathrm{prox}_{\lambda F}(Y)$ and named the proximal operator of $F$. For the problem (14), $F(X) = \|X\|_2 = \|\sigma_X\|_\infty$, where $\|\cdot\|_\infty$ denotes the $\ell_\infty$-norm; that is, $F$ is a spectral function operating on the singular values of a matrix. Based on the property of spectral functions [45, Section 6.7.2], we have

$$\mathrm{prox}_{\lambda F}(Y) = U_Y \,\mathrm{diag}\!\left[\mathrm{prox}_{\lambda f}(\sigma_Y)\right] V_Y^T, \qquad (17)$$

where $f$ is the $\ell_\infty$-norm. The proximal operator of the $\ell_\infty$-norm can be computed by Moreau decomposition [45, Section 6.5]:

$$\mathrm{prox}_{\lambda f}(\sigma_Y) = \sigma_Y - \lambda P_{\ell_1}(\sigma_Y/\lambda), \qquad (18)$$

given that the $\ell_1$-norm is the dual norm of the $\ell_\infty$-norm.

The solution for the case with two singular values is illustrated in Figure 2, where $[\Delta_1, \Delta_2]^T$ corresponds to $\sigma_Y - \lambda P_{\ell_1}(\sigma_Y/\lambda)$ in (15). From the illustration we have two observations: (1) if the projection lies on the edge of the $\ell_1$-norm ball, then $\Delta_1 = \Delta_2$ and $X^*$ turns out to be an orthogonal matrix; (2) if $\sigma_Y$ is small and lies inside the $\ell_1$-norm ball, then the projection is the point itself, $\Delta_1 = \Delta_2 = 0$, and $X^* = 0$. These two observations illustrate the effects of the spectral-norm regularization in (13): (1) enforcing the $M_i$'s to be orthogonal; (2) shrinking small $M_i$'s towards zero to achieve a sparse representation.

Fig. 2. Illustration of the proximal operator of the spectral norm.

4.3 Optimization

We now present the algorithm to solve (13). The noiseless case (12) can be solved similarly, and the details are provided in Appendix A. Our algorithm is based on ADMM [46] and the proximal operator of the spectral norm derived above. We first introduce an auxiliary variable $Z$ and rewrite (13) as

$$\min_{\widetilde{M}, Z} \; \frac{1}{2}\left\| W - Z\widetilde{B} \right\|_F^2 + \lambda \sum_{i=1}^{k} \|M_i\|_2, \quad \text{s.t. } \widetilde{M} = Z, \qquad (19)$$

where we concatenate $M_1, \cdots, M_k$ as the column-triplets of $\widetilde{M}$ and $B_1, \cdots, B_k$ as the row-triplets of $\widetilde{B}$.
The augmented Lagrangian of (19) is

$$\mathcal{L}_\mu\!\left(\widetilde{M}, Z, Y\right) = \frac{1}{2}\left\| W - Z\widetilde{B} \right\|_F^2 + \lambda \sum_{i=1}^{k} \|M_i\|_2 + \left\langle Y, \widetilde{M} - Z \right\rangle + \frac{\mu}{2}\left\| \widetilde{M} - Z \right\|_F^2, \qquad (20)$$

where $Y$ is the dual variable and $\mu$ is a parameter controlling the step size in optimization. Then, ADMM alternates the following steps until convergence:

$$\widetilde{M}^{t+1} = \arg\min_{\widetilde{M}} \; \mathcal{L}_\mu\!\left(\widetilde{M}, Z^t, Y^t\right); \qquad (21)$$
$$Z^{t+1} = \arg\min_{Z} \; \mathcal{L}_\mu\!\left(\widetilde{M}^{t+1}, Z, Y^t\right); \qquad (22)$$
$$Y^{t+1} = Y^t + \mu\left(\widetilde{M}^{t+1} - Z^{t+1}\right). \qquad (23)$$

For the step in (21), we have

$$\min_{\widetilde{M}} \; \mathcal{L}_\mu\!\left(\widetilde{M}, Z^t, Y^t\right) = \min_{\widetilde{M}} \; \frac{1}{2}\left\| \widetilde{M} - Z^t + \frac{1}{\mu}Y^t \right\|_F^2 + \frac{\lambda}{\mu}\sum_{i=1}^{k} \|M_i\|_2 = \min_{M_1,\cdots,M_k} \; \sum_{i=1}^{k} \frac{1}{2}\left\| M_i - Q_i^t \right\|_F^2 + \frac{\lambda}{\mu}\|M_i\|_2, \qquad (24)$$

where $Q_i^t$ is the $i$-th column-triplet of $Z^t - \frac{1}{\mu}Y^t$. Therefore, we can update each $M_i$ by solving a proximal problem based on Proposition 2:

$$M_i^{t+1} = \mathcal{D}_{\frac{\lambda}{\mu}}\!\left(Q_i^t\right), \quad \forall i \in [1, k]. \qquad (25)$$

For the step in (22), $\mathcal{L}_\mu\!\left(\widetilde{M}^{t+1}, Z, Y^t\right)$ is a quadratic form of $Z$ and admits the following closed-form solution:

$$Z^{t+1} = \left( W\widetilde{B}^T + \mu \widetilde{M}^{t+1} + Y^t \right)\left( \widetilde{B}\widetilde{B}^T + \mu I \right)^{-1}. \qquad (26)$$

The steps are summarized in Algorithm 2. Following the standard proof in [46], it can be shown that the sequence of iterates produced by Algorithm 2 converges to the optimal solution of problem (19), which is also the optimal solution of the original problem (13). In our implementation, we adopt the convergence criterion and the adaptive scheme for tuning the parameter $\mu$ from [46, Section 3].

Input: W, λ. Output: M_1, ···, M_k.
initialize Z^0 = Y^0 = 0, µ > 0;
while not converged do
  for i = 1 to k do
    Q_i^t = the i-th column-triplet of Z^t − (1/µ)Y^t;
    M_i^{t+1} = D_{λ/µ}(Q_i^t);
  end
  Z^{t+1} = (W B̃^T + µ M̃^{t+1} + Y^t)(B̃ B̃^T + µ I)^{−1};
  Y^{t+1} = Y^t + µ(M̃^{t+1} − Z^{t+1});
end
Algorithm 2: ADMM to solve (13).
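A compact numpy sketch of Algorithm 2 follows, intended as a reference implementation of the updates (24)-(26) rather than the authors' released code: the ℓ1-ball projection uses the standard sort-based routine, the stopping rule is a fixed iteration count, and λ, µ are illustrative.

import numpy as np

def project_l1_ball(v):
    """Project a nonnegative vector (e.g., singular values) onto the unit l1 ball."""
    if v.sum() <= 1.0:
        return v.copy()
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def prox_spectral(Y, lam):
    """Proximal operator of lam * spectral norm, Eq. (15)."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s_new = s - lam * project_l1_ball(s / lam)
    return U @ np.diag(s_new) @ Vt

def admm_solve(W, B, lam=1.0, mu=1.0, n_iter=500):
    """ADMM for problem (13). W: 2 x p; B: list of k basis shapes, each 3 x p."""
    k = len(B)
    Btil = np.vstack(B)                      # 3k x p, row-triplets
    M = np.zeros((2, 3 * k))                 # column-triplets hold M_1, ..., M_k
    Z = np.zeros((2, 3 * k))
    Y = np.zeros((2, 3 * k))
    G = np.linalg.inv(Btil @ Btil.T + mu * np.eye(3 * k))   # constant factor in Eq. (26)
    for _ in range(n_iter):
        Q = Z - Y / mu
        for i in range(k):                   # Eq. (25): per-block proximal update
            M[:, 3*i:3*i+3] = prox_spectral(Q[:, 3*i:3*i+3], lam / mu)
        Z = (W @ Btil.T + mu * M + Y) @ G    # Eq. (26)
        Y = Y + mu * (M - Z)                 # Eq. (23)
    return [M[:, 3*i:3*i+3] for i in range(k)]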
4.4 Reconstruction

After solving (13), we propose two ways to reconstruct the 3D shape from the estimated $M_i$'s.

4.4.1 Direct Reconstruction

We recover $c_i$ and $R_i$ from the estimated $M_i$ and reconstruct the 3D shape by the relaxed shape model (8). The steps are summarized in Algorithm 3, where $m_i^{(j)}$ and $r_i^{(j)}$ denote the $j$-th row vectors of $M_i$ and $R_i$, respectively. Note that $c_i = -\|M_i\|_2$ is also a feasible solution. To eliminate this ambiguity, we assume that $c_i \ge 0$ and impose this constraint when training the shape dictionary. Also note that the reconstructed shape is already in the camera frame.

Input: M_1, ···, M_k. Output: S.
for i = 1 to k do
  c_i = ||M_i||_2;
  r_i^(1) = m_i^(1) / c_i;
  r_i^(2) = m_i^(2) / c_i;
  r_i^(3) = r_i^(1) × r_i^(2);
  R_i = [r_i^(1), r_i^(2), r_i^(3)]^T;
end
S = Σ_{i=1}^{k} c_i R_i B_i;
Algorithm 3: Direct reconstruction.

4.4.2 Rotation Synchronization and Refinement

In the direct reconstruction introduced above, the recovered rotations $R_i$ are likely to differ from each other, which is not preferred in some applications. To enforce that the basis shapes share the same rotation, we can average the rotations by solving the following optimization problem:

$$\min_{c_1,\cdots,c_k,\bar{R}} \; \sum_{i=1}^{k} \left\| M_i - c_i\bar{R} \right\|_F^2, \quad \text{s.t. } \bar{R}\bar{R}^T = I_2, \qquad (27)$$

where $\bar{R}$ denotes the first two rows of a rotation matrix. The problem in (27) has been studied in [33] and [34]; its solution can be obtained exactly by semidefinite programming [33] or computed approximately by a more efficient algorithm proposed in [34, Section 4.4.1]. The solution from the synchronization described above is not necessarily the solution to the original problem, but it can serve as a good initial point for the alternating minimization algorithm to solve (7). In the experiments, we will demonstrate that such a better initialization improves the performance of alternating minimization. After rotation averaging and refinement, we can simply reconstruct the 3D shape from the optimized shape coefficients. The full procedure is summarized in Algorithm 4.

Input: M_1, ···, M_k. Output: S.
initialize c and R̄ by solving (27) using the algorithm in [34, Section 4.4.1];
refine c and R̄ by Algorithm 1;
S = Σ_{i=1}^{k} c_i B_i;
Algorithm 4: Refinement and reconstruction.
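A short numpy sketch of the per-basis recovery in Algorithm 3 above (direct reconstruction); it assumes the M_i's returned by the convex program and skips the c_i = 0 corner case, which would otherwise need special handling.

import numpy as np

def direct_reconstruction(M_list, B, eps=1e-8):
    """Recover c_i, R_i from each M_i (2 x 3) and rebuild S via Eq. (8)."""
    S = np.zeros_like(B[0])
    for M_i, B_i in zip(M_list, B):
        c_i = np.linalg.norm(M_i, 2)          # spectral norm, as in Algorithm 3
        if c_i < eps:                          # inactive basis shape
            continue
        r1 = M_i[0] / c_i
        r2 = M_i[1] / c_i
        r3 = np.cross(r1, r2)                  # complete the rotation
        R_i = np.vstack([r1, r2, r3])          # 3 x 3
        S += c_i * R_i @ B_i
    return S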
4.5 Outlier modeling

In real applications where the 2D correspondences are given by detectors, there are likely to be gross errors in the correspondences. In order to correct them, we explicitly model the outliers by a sparse matrix $E$ and modify the formulation in (13) as the following problem:

$$\min_{M_1,\cdots,M_k,E,T} \; \frac{1}{2}\left\| W - \sum_{i=1}^{k} M_i B_i - E - T\mathbf{1}^T \right\|_F^2 + \lambda \sum_{i=1}^{k} \|M_i\|_2 + \beta\|E\|_1. \qquad (28)$$

Note that in this case we cannot eliminate the translation $T$ by centralizing the data at the beginning. The problem in (28) can be solved by ADMM as well; we present the algorithm in Appendix B.

4.6 Dictionary learning

In general, one can simply use all existing or a few representative 3D models as the basis shapes. But when the collection is very large, e.g., motion capture datasets often contain thousands of sequences of human skeletons, we need to make use of dictionary learning techniques to learn a dictionary of basis shapes that can concisely summarize the variability in the training data and enable a sparse representation.
Our dictionary learning problem can be formulated as the following optimization problem:

$$\min_{B_1,\cdots,B_k,C} \; \frac{1}{2}\sum_{j=1}^{n} \left\| S_j - \sum_{i=1}^{k} C_{ij} B_i \right\|_F^2 + \beta \sum_{i,j} C_{ij}, \quad \text{s.t. } C_{ij} \ge 0, \; \|B_i\|_F \le 1, \; \forall i \in [1,k], \, j \in [1,n], \qquad (29)$$

where the $S_j$'s denote the training shapes, the $B_i$'s are the basis shapes to be learned, and $C_{ij}$ represents the $i$-th coefficient for the $j$-th training shape. The two terms in the cost function correspond to the reconstruction error and the sparsity of the representation, respectively. The nonnegativity constraint is introduced to remove the ambiguity mentioned in Section 4.4.1. We locally solve the problem in (29) by alternately updating $C$ and the $B_i$'s with projected gradient descent, a method widely used in dictionary learning [47]. The algorithm is summarized in Appendix C.

5 EXPERIMENTS

5.1 Simulation

First of all, we use simulation to investigate whether the spectral-norm minimization in (12) can exactly solve the ill-posed inverse problem based on the prior knowledge of sparsity and orthogonality. More specifically, we synthesize data with the following model:

$$W = \sum_{i=1}^{k} M_i B_i, \qquad (30)$$

where $B_1, \cdots, B_k \in \mathbb{R}^{3\times p}$ ($p$ is varying, $k = 50$) are randomly generated bases with entries sampled independently from the normal distribution $\mathcal{N}(0,1)$. Each $M_i \in \mathbb{R}^{2\times 3}$ is generated as $M_i = c_i\bar{R}_i$, where $c_i$ is sampled from a uniform distribution $\mathcal{U}(0,1)$ and $\bar{R}_i$ denotes the first two rows of a rotation matrix with angles uniformly sampled from $[0, 2\pi]$. For $c_1, \cdots, c_k$, we randomly select $z$ of them as nonzero coefficients and set the rest to zero. We use $W$ as the input and solve (12) to estimate the $M_i$'s. The recovery error is evaluated by the relative error

$$\text{relative error} = \|\widehat{M} - \widetilde{M}\|_F / \|\widetilde{M}\|_F, \qquad (31)$$

where we concatenate the $M_i$'s in $\widetilde{M}$, and $\widehat{M}$ is the algorithm estimate. A solution is regarded as exact if the error is below $10^{-3}$. Note that the original shape model (6) is a special case of (30) in which all $M_i$'s are scaled versions of each other.
Figure 3 reports the frequency of exact recovery with varying $p$ (number of landmarks) and $z$ (sparsity of the underlying coefficients), evaluated over 100 randomly generated instances for each setting. Note that the number of unknowns ($6k$) is much larger than the number of equations ($2p$). The proposed convex program exactly solves the problem with a frequency equal to 1 in the lower-triangular area, where the number of landmarks is sufficiently large and the coefficients are truly sparse. This demonstrates the power of convex relaxation, which has proven successful in various inverse problems, e.g., compressed sensing [48] and matrix completion [49]. The performance drops in the more difficult cases in the upper-triangular area. This observation is analogous to the phase transition in compressive sensing, where the recovery probability also depends on the number of observations and the underlying signal sparsity [50].

Fig. 3. The frequency of exact recovery on synthesized data.
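The synthetic setup in (30)-(31) can be reproduced with a few lines of numpy. The sketch below only generates the data and evaluates the relative-error criterion for a given estimate (here, trivially, the ground truth); the solver of Section 4.3 or any convex solver is plugged in where indicated. It draws random rotations via QR for brevity, whereas the paper samples rotation angles uniformly.

import numpy as np
rng = np.random.default_rng(0)

def synthesize(p=30, k=50, z=6):
    """Generate W, {B_i}, ground-truth {M_i} following Eq. (30)."""
    B = [rng.standard_normal((3, p)) for _ in range(k)]
    M_true = [np.zeros((2, 3)) for _ in range(k)]
    for i in rng.choice(k, size=z, replace=False):   # z activated bases
        c_i = rng.uniform(0, 1)
        Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
        Q *= np.sign(np.linalg.det(Q))               # make it a proper rotation
        M_true[i] = c_i * Q[:2, :]                   # first two rows of a rotation
    W = sum(M_i @ B_i for M_i, B_i in zip(M_true, B))
    return W, B, M_true

def relative_error(M_hat, M_true):
    """Eq. (31): error on the concatenated M matrices."""
    A, T = np.hstack(M_hat), np.hstack(M_true)
    return np.linalg.norm(A - T) / np.linalg.norm(T)

W, B, M_true = synthesize()
M_hat = M_true                 # placeholder: replace with the output of the convex solver
print(relative_error(M_hat, M_true) < 1e-3)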
5.2 Human pose estimation

The applicability of sparse representation to 3D human pose recovery has been thoroughly studied in previous work [8], [9], [16]. In this paper, we aim to illustrate the advantage of the proposed convex program compared to the alternating minimization algorithm widely used in previous work. We carry out the evaluation on the CMU motion capture datasets [51]. The datasets include thousands of sequences of 3D human skeletons, and we select a part of them for evaluation. More specifically, we select sequences from six motion categories, including walk, run, jump, climb, box and dance. For each motion, we select six sequences from six different subjects and use half of them for dictionary learning and the rest for testing. In total, there are 18 sequences each for training and testing, and the selected data covers the variation across both subjects and motions. We use the 15-joint model to represent a human skeleton.
To build the shape dictionary, we collect all training shapes and align them by the Procrustes method. The size of the dictionary is set to k = 128, which is overcomplete, as the dimension of the original space is 45 (15 joints in 3D). We initialize the dictionary by uniformly sampling k shapes from the training data and update the dictionary by the algorithm given in Section 4.6. We evaluate the representability of the learned dictionary by the reconstruction errors in 3D, i.e., we solve the sparse coding problem for each training shape and calculate the reconstruction error.
We vary the number of activated bases by tuning the weight of the regularization in sparse coding and compare the reconstruction error to that of PCA. The result is shown in Figure 4, which indicates that the sparse representation can achieve a given reconstruction error with many fewer activated bases than PCA.

Fig. 4. Reconstruction errors of PCA and sparse representation for 3D human poses.

To obtain 2D testing data, we synthesize an orthographic camera and project the 3D markers to 2D with the synthesized camera model, simulating the observation from a camera rotating around the subject by 360 degrees per sequence. We compare the following methods for reconstructing 3D models from the 2D data:

• Convex: the proposed convex program with direct reconstruction (Algorithm 2 + Algorithm 3).
• Convex+refine: the proposed convex program followed by refinement (Algorithm 2 + Algorithm 4). The local update of the rotation is implemented by manifold optimization over the orthogonal group with the Manopt toolbox [52].
• Altern: the alternating algorithm with the mean shape as initialization (Algorithm 1). The rotation estimation is solved by the Procrustes method widely used in previous work (e.g., [8], [16], [17]).
• PMP: Projected Matching Pursuit from Ramakrishna et al. [8], which solves sparse coding by greedily selecting basis shapes instead of ℓ1 minimization. The dictionary for PMP is constructed by the method provided in the original paper with the same training data as for our method.

Fig. 5. (a) The average estimation errors on the CMU Mocap datasets for different motions. (b) The comparison of objective values achieved by the alternating algorithm initialized by the mean shape and by the convex result. (c) The sensitivity to Gaussian noise. (d) The sensitivity to outliers.
The reconstruction error is evaluated by the average distance between the estimated joint locations and their true locations in 3D, up to a similarity transformation (estimated by the Procrustes method). The mean errors for different motions are shown in Figure 5(a). The convex method consistently outperforms the nonconvex methods for all motions. The reconstruction errors after refinement are slightly different from those of direct reconstruction. The difference might be attributed to the different representability of the original shape model in (3) and the relaxed shape model in (8). By allowing basis shapes to have independent rotations, the relaxed shape model offers more degrees of freedom than the original model but consequently has more parameters to estimate. Therefore, it may perform better for relatively complex motions such as dancing and climbing but worse for simpler motions such as walking. Comparing the results of "convex+refine" and "altern", we find that, for the same model (7), the error of alternating minimization is clearly decreased by using the better initialization (the convex solution). To further verify that the alternating algorithm depends on initialization, we compare the objective values achieved by the alternating algorithm initialized by the convex solution and by the mean shape, as shown in Figure 5(b). Except for several outliers, the objective values with convex initialization are always smaller. Some selected visual examples of reconstructed poses for the six motions are shown in Figure 6.
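For reference, the similarity alignment used in this error metric can be computed with the standard Umeyama/Procrustes solution; the sketch below is a generic numpy implementation of that alignment (not the authors' evaluation script) and returns the mean per-joint distance after alignment.

import numpy as np

def aligned_error(S_est, S_gt):
    """Mean joint distance between S_est and S_gt (both 3 x p) after the best
    similarity transform (scale, rotation, translation) mapping S_est to S_gt."""
    mu_e, mu_g = S_est.mean(axis=1, keepdims=True), S_gt.mean(axis=1, keepdims=True)
    Xe, Xg = S_est - mu_e, S_gt - mu_g
    U, D, Vt = np.linalg.svd(Xg @ Xe.T)            # 3 x 3 cross-covariance
    sign = np.sign(np.linalg.det(U @ Vt))
    R = U @ np.diag([1.0, 1.0, sign]) @ Vt         # proper rotation
    s = np.trace(np.diag([1.0, 1.0, sign]) @ np.diag(D)) / np.sum(Xe ** 2)
    S_aligned = s * R @ Xe + mu_g
    return np.linalg.norm(S_aligned - S_gt, axis=0).mean()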
Fig. 6. Examples of human pose estimation given 2D poses. The columns from left to right correspond to the input 2D landmarks, the ground-truth 3D pose, and the reconstructions from the proposed method, the alternating minimization, and the PMP method [8], respectively.
We can see that all methods perform well in the “walk” and “run” examples, where the poses are close to the mean shape (standing upright). For more complex poses, however, the alternating algorithm may converge to wrong solutions, as shown in the remaining four examples, while our convex algorithm still obtains appealing reconstructions.
We also test the robustness of the proposed method against Gaussian noise. We add Gaussian noise with a varying standard deviation to the 2D data and plot the estimation accuracy versus the standard deviation of the simulated noise. As shown in Figure 5(c), the performance of the convex algorithm followed by refinement is consistently better than that of the alternating algorithm for all noise levels. Furthermore, we test the robustness of the extended model in (28) against outliers. We randomly select a proportion of landmarks, change their 2D locations to random values within a proper range (roughly the rectangle that covers the skeleton), and solve (28). For comparison, we also implement an alternating algorithm that alternately minimizes the robust version of (7) with an additional outlier term as in (28). The curves of estimation error versus outlier proportion are shown in Figure 5(d). The robust models are clearly more robust than the original models, and the convex algorithm still achieves better performance than the alternating algorithm.
To illustrate the applicability of the proposed method in practice, we apply the 2D pose detector developed by Yang and Ramanan [53] to the PARSE dataset [54] and then use the proposed method to lift the detected 2D poses to 3D. Some selected results are shown in Figure 7.

5.3 Car model estimation
We demonstrate the applicability of the proposed method to car model estimation using the Fine-Grained 3D Car dataset [15], which provides images of cars, 2D landmark annotations and corresponding 3D models reconstructed by the authors. We concatenate the 3D models of 15 cars as the shape dictionary and reconstruct the 3D models of other cars from the visible landmarks annotated in the images (∼40 points per image). Some examples of reconstruction are shown in Figure 8.
Fig. 7. Examples of human pose estimation with detected 2D poses. The left panel shows the image with the 2D pose estimated by the detector [53]. The 3D pose estimated by our method is visualized from two viewpoints. The last row shows two failed examples.
For comparison, we also show the results of the algorithm proposed in the original paper [15], which adopts a perspective camera model and locally updates the camera and shape parameters with a Jacobian system initialized by the mean shape and predefined camera parameters. The nonlinear optimization performs well in the sedan example but relatively poorly in the SUV and truck examples, where the models deviate far from the mean shape. Similar results were reported in the original paper [15], and the authors proposed to use the class-specific mean shape for better initialization. In contrast, our method achieves appealing results with arbitrary initialization.
Unlike in human pose reconstruction, we observed that the alternating minimization in Algorithm 1 also performs very well for car model reconstruction. The reason is that the shape variation among cars is relatively small and the mean-shape initialization is very close to the true solution. But when there are outliers in the observation, the initial
estimate of the viewpoint might be unreliable and the alternating algorithm may fail. To validate this, we simulate outliers in the observation. More specifically, we select 20 annotated 2D landmarks and change their locations to random values in the image plane. Since there is no ground truth for the 3D models (the 3D models provided in the dataset are estimated from the 2D annotations by the authors), we evaluate the performance by comparing the objective values achieved by the convex algorithm followed by refinement ("convex+refine") and by the alternating algorithm initialized by the mean shape ("altern"). We also evaluate the 2D shape estimation error by comparing the recovered 2D shape, i.e., the projection of the reconstructed 3D model, to the annotated 2D landmarks, to see whether the synthesized outliers can be corrected. The quantitative comparison is shown in Figure 9.
Fig. 8. Examples of car reconstruction with given 2D correspondences. The first two columns show the original image and the input 2D landmarks; the third and fourth columns correspond to the 2D fitted model and the 3D reconstruction of the proposed method; the last two columns correspond to the results of the nonlinear optimization method from [15]. Only visible landmarks (∼40 per image) are used for shape fitting. The 3D models are visualized in novel views. The car models from top to bottom are the BMW 5 Series 2011 (sedan), the Nissan Xterra 2005 (SUV) and the Dodge Ram 2003 (pick-up truck), respectively.
Fig. 9. The quantitative comparison between the proposed method ("convex+refine") and the alternating minimization ("altern"). The left plot shows the 2D landmark localization errors, while the right plot shows the objective values achieved by the two algorithms.
While the estimation errors are identical for most examples, there are still many cases in which the convex result is clearly better than that of the alternating algorithm. In terms of objective values, the proposed method is always comparable to or better than the alternating algorithm.
Finally, we demonstrate the real-world applicability of the proposed method with learned landmark detectors. For a simple demonstration, we build detectors based on SVMs with HOG features and learn a mixture of classifiers for each landmark (64 in total) to capture the variety in 2D appearance. For a test image, we scan the image with the learned detectors and select the locations with the maximum classifier responses as the landmark locations. The detection results are shown in the first column of Figure 10. The false detections are mostly due to self-occlusion. In this demonstration, we assume that the visibility is unknown and fit the model with all landmarks. The fitting results and reconstructions are shown in the next two columns, which demonstrate that the proposed method can robustly fit the model in the presence of many outliers, even without knowing the visibility. The last row shows two failed examples, in which more than half of the landmarks are false detections and are hard to correct.
5.4 Computational Time
Our algorithm is implemented in MATLAB and tested on a desktop with an Intel i7 3.4 GHz CPU and 8 GB RAM. In our experiments, the ADMM algorithm generally converges within 500 iterations to reach a stopping tolerance of 10^{-4}. In the experiments on human pose estimation, for example, the average computation time of our algorithm is 1.91 s per frame, while those of the alternating algorithm and the PMP algorithm [8] are 1.34 s and 3.70 s, respectively. Note that our algorithm can be easily accelerated by parallelizing the for-loops in Algorithm 2.
6 DISCUSSION
In summary, we proposed a method for aligning a 3D deformable shape model (in a sparse representation) to 2D correspondences by solving a convex program, which guarantees global optimality. Intuitively, we adopted an augmented 3D shape space to achieve a linear representation of both shape and viewpoint variability and proposed to use spectral-norm regularization to penalize invalid cases caused by the augmentation. We also extended our model to handle outliers in the 2D correspondences. We experimentally demonstrated the advantage of the proposed convex method compared to alternative nonconvex methods, especially when the shape variability is large and outliers are present.
The exactness of convex relaxation for linear inverse problems with various assumptions, e.g., sparsity and orthogonality, has been theoretically analyzed in the literature [44]. We have numerically demonstrated the exact recovery property of our relaxed model; the theoretical analysis of the recovery conditions is a direction for future work.
In this paper, we demonstrated the applicability of the proposed method to reconstructing human poses and cars. The point correspondences across training shapes are provided in the training datasets. With more and more promising results on 3D shape maps (e.g., [55], [56]) and 3D datasets with rich annotations (e.g., PASCAL3D+ [57] and 3D ShapeNet [58]), we expect to see applications of the proposed framework to more object categories in the future.
Fig. 10. Examples of car reconstruction with detected 2D landmarks. For each example, the figures from left to right correspond to the detected landmarks superposed on the original image, the 2D fitted model and 3D reconstruction, respectively. The 2D landmarks are located by learned detectors based on HOG and SVM. The green and red dots correspond to the inliers and outliers (whose distance to the manual annotation is larger than 30 pixels), respectively. Note that the inlier/outlier classification is used only for the purpose of illustration and not for model fitting, i.e., the models are fitted with all landmarks. The last row shows two failed examples.
APPENDIX A: ALGORITHM TO SOLVE THE NOISELESS MODEL

We present the algorithm to solve (12). The problem can be rewritten as

$$\min_{\widetilde{M}, Z} \; \sum_{i=1}^{k} \|M_i\|_2, \quad \text{s.t. } W = Z\widetilde{B}, \; \widetilde{M} = Z. \qquad (32)$$

The augmented Lagrangian is

$$\mathcal{L}_\mu\!\left(\widetilde{M}, Z, Y\right) = \sum_{i=1}^{k} \|M_i\|_2 + \left\langle Y, \widetilde{M} - Z \right\rangle + \frac{\mu}{2}\left\| \widetilde{M} - Z \right\|_F^2, \quad \text{s.t. } W = Z\widetilde{B}. \qquad (33)$$

Then, the following steps are iterated to update the variables until convergence:

$$\widetilde{M}^{t+1} = \arg\min_{\widetilde{M}} \; \mathcal{L}_\mu\!\left(\widetilde{M}, Z^t, Y^t\right); \qquad (34)$$
$$Z^{t+1} = \arg\min_{Z} \; \mathcal{L}_\mu\!\left(\widetilde{M}^{t+1}, Z, Y^t\right); \qquad (35)$$
$$Y^{t+1} = Y^t + \mu\left(\widetilde{M}^{t+1} - Z^{t+1}\right). \qquad (36)$$

The whole procedure is very similar to Algorithm 2; the only difference is the Z-step, since minimizing $\mathcal{L}_\mu\!\left(\widetilde{M}, Z, Y\right)$ over $Z$ with the constraint in (33) is an equality-constrained quadratic program, which admits the following closed-form solution:

$$Z^{t+1} = Z_p + \left( \widetilde{M}^{t+1} + \frac{1}{\mu}Y^t - Z_p \right) N N^T, \qquad (37)$$

where $Z_p$ is a particular solution to $W = Z\widetilde{B}$ and $N$ denotes an orthonormal basis of the left null space of $\widetilde{B}$.
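A brief numpy sketch of this constrained Z-step follows; it assumes, as reconstructed above, that the dual variable enters as Y/µ and that N stacks an orthonormal basis of the left null space of B̃ as its columns.

import numpy as np
from scipy.linalg import null_space

def z_step_noiseless(W, Btil, M, Y, mu):
    """Closed-form Z-update of Eq. (37): project M + Y/mu onto {Z : Z @ Btil = W}."""
    Zp = W @ np.linalg.pinv(Btil)          # a particular solution (valid for consistent W)
    N = null_space(Btil.T)                 # columns n satisfy n^T Btil = 0 (left null space)
    V = M + Y / mu
    if N.size == 0:                        # Btil has full row rank: the constraint fixes Z
        return Zp
    return Zp + (V - Zp) @ N @ N.T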
APPENDIX B: ALGORITHM TO SOLVE THE ROBUST MODEL

We present the algorithm to solve (28). We first introduce an auxiliary variable $Z$ and rewrite the problem as

$$\min_{\widetilde{M}, Z, E, T} \; \frac{1}{2}\left\| W - Z\widetilde{B} - E - T\mathbf{1}^T \right\|_F^2 + \lambda \sum_{i=1}^{k} \|M_i\|_2 + \beta\|E\|_1, \quad \text{s.t. } \widetilde{M} = Z. \qquad (38)$$

The augmented Lagrangian of (38) is

$$\mathcal{L}_\mu\!\left(\widetilde{M}, Z, E, T, Y\right) = \frac{1}{2}\left\| W - Z\widetilde{B} - E - T\mathbf{1}^T \right\|_F^2 + \lambda \sum_{i=1}^{k} \|M_i\|_2 + \beta\|E\|_1 + \left\langle Y, \widetilde{M} - Z \right\rangle + \frac{\mu}{2}\left\| \widetilde{M} - Z \right\|_F^2. \qquad (39)$$

Then, the following steps are iterated to update the variables until convergence:

$$\widetilde{M}^{t+1} = \arg\min_{\widetilde{M}} \; \mathcal{L}_\mu\!\left(\widetilde{M}, Z^t, E^t, T^t, Y^t\right); \qquad (40)$$
$$Z^{t+1} = \arg\min_{Z} \; \mathcal{L}_\mu\!\left(\widetilde{M}^{t+1}, Z, E^t, T^t, Y^t\right); \qquad (41)$$
$$E^{t+1} = \arg\min_{E} \; \mathcal{L}_\mu\!\left(\widetilde{M}^{t+1}, Z^{t+1}, E, T^t, Y^t\right); \qquad (42)$$
$$T^{t+1} = \arg\min_{T} \; \mathcal{L}_\mu\!\left(\widetilde{M}^{t+1}, Z^{t+1}, E^{t+1}, T, Y^t\right); \qquad (43)$$
$$Y^{t+1} = Y^t + \mu\left(\widetilde{M}^{t+1} - Z^{t+1}\right). \qquad (44)$$

The whole procedure is very similar to Algorithm 2; the only difference is that there are two more steps to update $E$ and $T$. The steps in (40) and (41) can be computed in the same way as in Algorithm 2:

$$M_i^{t+1} = \mathcal{D}_{\frac{\lambda}{\mu}}\!\left(Q_i^t\right), \quad \forall i \in [1, k], \qquad (45)$$

where $Q_i^t$ is the $i$-th column-triplet of $Z^t - \frac{1}{\mu}Y^t$, and

$$Z^{t+1} = \left( (W - E^t - T^t\mathbf{1}^T)\widetilde{B}^T + \mu \widetilde{M}^{t+1} + Y^t \right)\left( \widetilde{B}\widetilde{B}^T + \mu I \right)^{-1}. \qquad (46)$$

For the step in (42), the problem is a proximal problem associated with the $\ell_1$-norm and can be solved by elementwise soft-thresholding:

$$E^{t+1} = \mathcal{S}_\beta\!\left( W - Z^{t+1}\widetilde{B} - T^t\mathbf{1}^T \right), \qquad (47)$$

where $\mathcal{S}_\beta(X)_{ij} = \mathrm{sign}(X_{ij})(|X_{ij}| - \beta)_+$ refers to the elementwise soft-thresholding operator. For the step in (43), the solution is simply given by

$$T^{t+1} = \text{mean of each row of } \left[ W - Z^{t+1}\widetilde{B} - E^{t+1} \right]. \qquad (48)$$

APPENDIX C: ALGORITHM TO SOLVE DICTIONARY LEARNING

Here we present the algorithm to solve the dictionary learning problem in (29). Denote

$$f(\widetilde{B}, C) = \frac{1}{2}\sum_{j=1}^{n} \left\| S_j - \sum_{i=1}^{k} C_{ij} B_i \right\|_F^2 + \beta \sum_{i,j} C_{ij}, \qquad (49)$$

where $\widetilde{B}$ is the concatenation of the $B_i$'s. The algorithm is summarized in Algorithm 5.

Input: S_1, ···, S_n. Output: B_1, ···, B_k.
initialize B̃, C, and step sizes δ1 and δ2;
while not converged do
  /* nonnegative sparse coding */
  while not converged do
    C ← C − δ1 ∇_C f(B̃, C);
    for i = 1 to k do
      for j = 1 to n do
        if C_ij < 0 then C_ij ← 0;
      end
    end
  end
  /* update dictionary */
  B̃ ← B̃ − δ2 ∇_B̃ f(B̃, C);
  for i = 1 to k do
    if ||B_i||_F > 1 then B_i ← B_i / ||B_i||_F;
  end
end
Algorithm 5: Dictionary learning.
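A minimal numpy sketch of this projected gradient scheme, vectorizing each 3 x p shape into a column so the two gradients take a simple matrix form; the step sizes and iteration counts are illustrative, and a descent (minus-gradient) step is assumed as in Algorithm 5 above.

import numpy as np

def learn_dictionary(shapes, k=128, beta=0.1, d1=1e-3, d2=1e-3, n_outer=100, n_inner=50):
    """Projected gradient descent for problem (29).

    shapes: list of n training shapes, each 3 x p (requires n >= k). Returns (D, C)
    where the i-th column of D is vec(B_i) and C is k x n with nonnegative entries."""
    S = np.stack([s.ravel() for s in shapes], axis=1)      # 3p x n
    n = S.shape[1]
    rng = np.random.default_rng(0)
    D = S[:, rng.choice(n, size=k, replace=False)].copy()  # init with sampled shapes
    C = np.zeros((k, n))
    for _ in range(n_outer):
        for _ in range(n_inner):                           # nonnegative sparse coding
            grad_C = -D.T @ (S - D @ C) + beta
            C = np.maximum(C - d1 * grad_C, 0.0)           # project onto C >= 0
        grad_D = -(S - D @ C) @ C.T                        # dictionary update
        D = D - d2 * grad_D
        norms = np.linalg.norm(D, axis=0)
        D = D / np.maximum(norms, 1.0)                     # project onto ||B_i||_F <= 1
    return D, C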
ACKNOWLEDGMENTS
We gratefully acknowledge Dr. Xiaoyan Hu from Beijing Normal University, who helped us process the Mocap data.
REFERENCES

[1] Y. Xiang and S. Savarese, "Estimating the aspect layout of object categories," in IEEE Conference on Computer Vision and Pattern Recognition, 2012.
[2] M. Aubry, D. Maturana, A. Efros, B. Russell, and J. Sivic, "Seeing 3d chairs: exemplar part-based 2d-3d alignment using a large dataset of cad models," in IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[3] S. Fidler, S. Dickinson, and R. Urtasun, "3d object detection and viewpoint estimation with a deformable 3d cuboid model," in Advances in Neural Information Processing Systems, 2012.
[4] E. Simo-Serra, A. Quattoni, C. Torras, and F. Moreno-Noguer, "A joint model for 2d and 3d pose estimation from a single image," in IEEE Conference on Computer Vision and Pattern Recognition, 2013.
[5] M. Hejrati and D. Ramanan, "Analyzing 3d objects in cluttered images," in Advances in Neural Information Processing Systems, 2012.
[6] M. Z. Zia, M. Stark, B. Schiele, and K. Schindler, "Detailed 3d representations for object recognition and modeling," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 11, pp. 2608–2623, 2013.
[7] T. Cootes, C. Taylor, D. Cooper, and J. Graham, "Active shape models – their training and application," Computer Vision and Image Understanding, vol. 61, no. 1, pp. 38–59, 1995.
[8] V. Ramakrishna, T. Kanade, and Y. Sheikh, "Reconstructing 3d human pose from 2d image landmarks," in European Conference on Computer Vision, 2012.
[9] C. Wang, Y. Wang, Z. Lin, A. L. Yuille, and W. Gao, "Robust estimation of 3d human poses from a single image," in IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[10] X. Zhou, S. Leonardos, X. Hu, and K. Daniilidis, "3D shape estimation from 2D landmarks: A convex relaxation approach," in IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[11] V. Blanz and T. Vetter, "Face recognition based on fitting a 3D morphable model," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 9, pp. 1063–1074, 2003.
[12] L. Gu and T. Kanade, "3D alignment of face in a single image," in IEEE Conference on Computer Vision and Pattern Recognition, 2006.
[13] C. Cao, Y. Weng, S. Lin, and K. Zhou, "3d shape regression for real-time facial animation," ACM Transactions on Graphics, vol. 32, no. 4, p. 41, 2013.
[14] P. Felzenszwalb, D. McAllester, and D. Ramanan, "A discriminatively trained, multiscale, deformable part model," in IEEE Conference on Computer Vision and Pattern Recognition, 2008.
[15] Y.-L. Lin, V. I. Morariu, W. Hsu, and L. S. Davis, "Jointly optimizing 3d model fitting and fine-grained classification," in European Conference on Computer Vision, 2014.
[16] X. Fan, K. Zheng, Y. Zhou, and S. Wang, "Pose locality constrained representation for 3d human pose reconstruction," in European Conference on Computer Vision, 2014.
[17] F. Zhou and F. De la Torre, "Spatio-temporal matching for human detection in video," in European Conference on Computer Vision, 2014.
[18] I. Akhter and M. J. Black, "Pose-conditioned joint angle limits for 3D human pose reconstruction," in IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[19] T. J. Cashman and A. W. Fitzgibbon, "What shape are dolphins? Building 3d morphable models from 2d images," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 232–244, 2013.
[20] S. Vicente, J. Carreira, L. Agapito, and J. Batista, "Reconstructing pascal voc," in IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[21] J. Carreira, A. Kar, S. Tulsiani, and J. Malik, "Virtual view networks for object reconstruction," in IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[22] A. Kar, S. Tulsiani, J. Carreira, and J. Malik, "Category-specific object reconstruction from a single image," in IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[23] H. Su, Q. Huang, N. J. Mitra, Y. Li, and L. Guibas, "Estimating image depth using shape collections," ACM Transactions on Graphics, vol. 33, no. 4, p. 37, 2014.
[24] Q. Huang, H. Wang, and V. Koltun, "Single-view reconstruction via joint analysis of image and shape collections," ACM Transactions on Graphics, vol. 34, no. 4, p. 87, 2015.
[25] C. J. Taylor, "Reconstruction of articulated objects from point correspondences in a single uncalibrated image," in IEEE Conference on Computer Vision and Pattern Recognition, 2000.
[26] P. Guan, A. Weiss, A. O. Bălan, and M. J. Black, "Estimating human shape and pose from a single image," in International Conference on Computer Vision, 2009.
[27] L. Sigal and M. J. Black, "Predicting 3d people from 2d pictures," in Articulated Motion and Deformable Objects. Springer, 2006, pp. 185–195.
[28] M. Andriluka, S. Roth, and B. Schiele, "Monocular 3d pose estimation and tracking by detection," in IEEE Conference on Computer Vision and Pattern Recognition, 2010.
[29] A. Elgammal and C.-S. Lee, "Inferring 3d body pose from silhouettes using activity manifold learning," in IEEE Conference on Computer Vision and Pattern Recognition, 2004.
[30] A. Agarwal and B. Triggs, "Recovering 3d human pose from monocular images," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 28, no. 1, pp. 44–58, 2006.
[31] C. Bregler, A. Hertzmann, and H. Biermann, "Recovering nonrigid 3d shape from image streams," in IEEE Conference on Computer Vision and Pattern Recognition, 2000.
[32] J. Xiao, J. Chai, and T. Kanade, "A closed-form solution to nonrigid shape and motion recovery," International Journal of Computer Vision, vol. 67, no. 2, pp. 233–246, 2006.
[33] M. Paladini, A. Del Bue, J. Xavier, L. Agapito, M. Stošić, and M. Dodig, "Optimal metric projections for deformable and articulated structure-from-motion," International Journal of Computer Vision, vol. 96, no. 2, pp. 252–276, 2012.
[34] A. Del Bue, J. Xavier, L. Agapito, and M. Paladini, "Bilinear modeling via augmented lagrange multipliers (balm)," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 8, pp. 1496–1508, 2012.
[35] A. Agudo, L. Agapito, B. Calvo, and J. Montiel, "Good vibrations: A modal analysis approach for sequential non-rigid structure from motion," in IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[36] S. Zhang, Y. Zhan, M. Dewan, J. Huang, D. Metaxas, and X. Zhou, "Sparse shape composition: A new framework for shape prior modeling," in IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[37] S. Zhu, L. Zhang, and B. M. Smith, "Model evolution: an incremental approach to non-rigid structure from motion," in IEEE Conference on Computer Vision and Pattern Recognition, 2010, pp. 1165–1172.
[38] Y. Zhu, D. Huang, F. De la Torre Frade, and S. Lucey, "Complex non-rigid motion 3d reconstruction by union of subspaces," in IEEE Conference on Computer Vision and Pattern Recognition, 2014.
[39] F. Bach, R. Jenatton, J. Mairal, and G. Obozinski, "Optimization with sparsity-inducing penalties," Foundations and Trends in Machine Learning, vol. 4, no. 1, pp. 1–106, 2012.
[40] A. Edelman, T. A. Arias, and S. T. Smith, "The geometry of algorithms with orthogonality constraints," SIAM Journal on Matrix Analysis and Applications, vol. 20, no. 2, pp. 303–353, 1998.
[41] A. Blake and M. Isard, Active Contours. Springer, 2000.
[42] J. Xiao, S. Baker, I. Matthews, and T. Kanade, "Real-time combined 2d+3d active appearance models," in IEEE Conference on Computer Vision and Pattern Recognition, 2004.
[43] M. Journée, Y. Nesterov, P. Richtárik, and R. Sepulchre, "Generalized power method for sparse principal component analysis," The Journal of Machine Learning Research, vol. 11, pp. 517–553, 2010.
[44] V. Chandrasekaran, B. Recht, P. A. Parrilo, and A. S. Willsky, "The convex geometry of linear inverse problems," Foundations of Computational Mathematics, vol. 12, no. 6, pp. 805–849, 2012.
[45] N. Parikh and S. Boyd, "Proximal algorithms," Foundations and Trends in Optimization, vol. 1, no. 3, pp. 123–231, 2013.
[46] S. Boyd, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2010.
[47] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, "Online learning for matrix factorization and sparse coding," The Journal of Machine Learning Research, vol. 11, pp. 19–60, 2010.
[48] E. J. Candès and M. B. Wakin, "An introduction to compressive sampling," IEEE Signal Processing Magazine, vol. 25, no. 2, pp. 21–30, 2008.
[49] E. J. Candès and T. Tao, "The power of convex relaxation: Near-optimal matrix completion," IEEE Transactions on Information Theory, vol. 56, no. 5, pp. 2053–2080, 2010.
[50] D. Donoho and J. Tanner, "Observed universality of phase transitions in high-dimensional geometry, with implications for modern data analysis and signal processing," Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, vol. 367, no. 1906, pp. 4273–4293, 2009.
[51] "Mocap: Carnegie Mellon University motion capture database. http://mocap.cs.cmu.edu/."
[52] N. Boumal, B. Mishra, P.-A. Absil, and R. Sepulchre, "Manopt, a Matlab toolbox for optimization on manifolds," Journal of Machine Learning Research, vol. 15, pp. 1455–1459, 2014.
[53] Y. Yang and D. Ramanan, "Articulated pose estimation with flexible mixtures-of-parts," in IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[54] D. Ramanan, "Learning to parse images of articulated bodies," in Advances in Neural Information Processing Systems, 2006.
[55] Q.-X. Huang and L. Guibas, "Consistent shape maps via semidefinite programming," Computer Graphics Forum, vol. 32, no. 5, pp. 177–186, 2013.
[56] Q. Huang, F. Wang, and L. Guibas, "Functional map networks for analyzing and exploring large shape collections," ACM Transactions on Graphics, vol. 33, no. 4, p. 36, 2014.
[57] Y. Xiang, R. Mottaghi, and S. Savarese, "Beyond pascal: A benchmark for 3d object detection in the wild," in IEEE Winter Conference on Applications of Computer Vision, 2014.
[58] "http://shapenet.cs.stanford.edu/."