Joint Shape Classification and Labeling of 3-D Objects Using the Energy Minimization Framework

Alexander Zouhar, Dmitrij Schlesinger, and Siegfried Fuchs
Dresden University of Technology
Abstract. We propose a combination of multiple Conditional Random Field (CRF) models with a linear classifier. The model is used for the semantic labeling of 3-D surface meshes with large variability in shape. It employs multiple CRFs of low complexity for surface labeling, each of which models the distribution of labelings for a group of surfaces with a similar shape. Given a test surface, the classifier exploits the MAP energies of the inferred CRF labelings to determine the shape class. We discuss the associated recognition and learning tasks and demonstrate the capability of the joint shape classification and labeling model on the object category of human outer ears.
This paper addresses the problem of labeling 3-D objects with large variability in shape. A 3-D object is represented by a 2-D surface mesh embedded in R³, defining its boundary. The labeling densely covers the surface, capturing the part structure of the underlying object. Organic shapes such as categories of human teeth (e.g., the canine), categories of human bones (e.g., the femur), the outer ear anatomy of humans, and many others are typical examples of object categories with large variability in shape across individuals. It is extremely challenging to label surfaces of such object categories consistently. To achieve this, our approach employs multiple CRFs [14], each of which models the label distribution for a group of surfaces with a similar shape. The partitioning of a category of objects into sets of similar shapes is performed prior to learning the CRFs, for example via manual selection by domain experts. A linear classifier takes the MAP energies of the CRF labelings of a surface as input and determines the shape class. The labeling model associated with this shape class produces the best quality labeling of the surface. The use of multiple, shape specific labeling models has several advantages. First, each model may be kept simple. No additional shape prior is needed to ensure consistency of labeling across all objects. Second, no model assumption about the nature of the shape variation is required. The shape information is captured in terms of classes of similar shapes for which the CRFs are learned. Third, the MAP energy of the inferred labelings may be used for classification. This is a key aspect of our work, since the energy value associated with the optimal labeling cannot normally be regarded as a readily useful quantity.

J. Weickert, M. Hein, and B. Schiele (Eds.): GCPR 2013, LNCS 8142, pp. 71–80, 2013. © Springer-Verlag Berlin Heidelberg 2013
Related Work and Contribution

Numerous tasks in geometric modeling and manufacturing of 3-D meshes rely on their segmentation into parts. CRF based approaches exploit local spatial interactions between the parts and allow the use of rich overlapping descriptors without the need to model the possibly complex dependencies between them. Superior labeling performance compared to previous mesh segmentation methods [16,8,9] has recently been reported in [11] for the Princeton Segmentation Benchmark (PSB) [4]. However, the proposed model is complex and lacks interpretability. Specifically, the choice of the features and their geometric relationship are not well founded. For example, the distinctiveness of the difference of neighboring feature vectors tends to be sensitive to variations of the mesh resolution and tessellation. The latter drawback also holds for the model in [20]. In this work we employ CRFs with pairwise Potts interactions between neighboring labels, together with 3-D shape contexts [12] as local observations of the mesh vertices. 3-D shape contexts yield distinct local representations of regional or global shape, except for symmetries. Other distribution based descriptor schemes, such as spin images [10], intrinsic shape contexts [17] and multi-scale surface descriptors [5], form histograms of ambiguous surface attributes and are therefore less discriminative on our data. Model-based object recognition methods for 2-D images often combine object specific segmentation models with shape priors in order to cope with visual variability and other non-ideal conditions; see, for example, [13,6,19] and the references therein. However, models which describe the desired form of the segmentation usually tend to be complex, resulting in high computational costs of learning and recognition. Moreover, the nature of the underlying shape variation of an object category may be unknown or difficult to model.
Instead of using a single labeling model we employ a linear classifier that sits on top of multiple, shape specific labeling models of low complexity. The shape specific labeling problem may also be formulated within the framework of Structured Support Vector Machines (SSVM); see, for example, [15] and the references therein. However, from a practical point of view this may be inconvenient, especially when larger data sets are involved. For example, relearning an SSVM classifier involves all training instances each time a novel observation is added to the training data. Our approach only requires a single shape specific labeling model to be relearned, together with a few additional classifier parameters, and this involves only the data members of the shape class to which the novel observation is assigned. The capability of our model is demonstrated on the object category of human outer ears. Mesh labeling is highly significant in digital hearing aid manufacturing, where it serves as a prerequisite for automated surface manipulation in order to reduce the amount of human intervention in the manufacturing process [18]. In the next section we derive the joint shape classification and labeling model. Section 2 covers the resulting learning and recognition tasks. Experimental results are presented in section 3. A brief discussion concludes the paper.
Fig. 1. Example shape classes of the ear population, arranged column by column (shape classes 1–3, two surfaces each). The ear geometry is composed of 6 non-overlapping parts whose anatomical interpretation is color-coded.
1 Joint Shape Classification and Labeling Model
We consider the following model. A surface mesh X = (V, E, F) consists of vertices V, edges E and faces F. A labeling Y : V → H of X assigns a part label Y_v ∈ H to each vertex v ∈ V. For example, figure 1 shows 6 surfaces capturing the left or right ear geometry of different individuals. The human outer ear is composed of |H| = 6 parts whose anatomical interpretation is denoted by the colors (see [20] for details). Moreover, there exist K distinct subsets of surfaces with a similar shape which we refer to as shape classes. The partitioning of the data was performed prior to learning using a known clustering algorithm (see section 3 for details). This resulted in a pre-labeled and pre-classified set of surfaces with reduced variability inside the shape classes compared to the variability in the set of all meshes. Each column in figure 1 depicts two representative examples of three shape classes of the ear population. We continue with a model for the joint probability over elementary events (X, Y, k), i.e.,

p(X, Y, k) ∝ p(k|X) p(Y|X, k).    (1)

The labeling model associated with class k is the conditional probability

p(Y|X, k) = exp{−U(X, Y, k)} / Z(X, k),    (2)

where U(X, Y, k) denotes the energy of a labeling Y of X under the k-th model and Z(X, k) denotes the observation specific partition function of the k-th model. The distribution p(k|X) on the right hand side of equation (1) indicates the confidence for a surface X being a member of class k based on its shape. The energy term U(X, Y, k) is given by
U(X, Y, k) = Σ_{v∈V} φ_v(Y_v, X) + Σ_{{v,w}∈E} ψ_{v,w}(Y_v, Y_w),    (3)
where the unary potentials φ_v(·, ·) use descriptors of regional or global surface geometry characterizing the shape of the neighborhood around each vertex v ∈ V. Surface descriptors, including 3-D shape contexts [12], normally reside in a high dimensional space. This is why we use randomized decision trees for the unary potentials, similar to [19]. For each vertex v ∈ V a decision tree returns a distribution over the part labels in H. The pairwise potentials ψ_{v,w}(·, ·) incur a constant positive cost for neighboring labels being different and zero cost otherwise. We define the classification of a surface X as the problem of maximizing equation (1) jointly with respect to the variables k and Y, i.e.,

f(X) = argmax_k max_Y p(k|X) p(Y|X, k),    (4)

or equivalently

f(X) = argmax_k max_Y [log p(k|X) + log p(Y|X, k)]    (5)
     = argmax_k max_Y [β(X, k) − U(X, Y, k)],    (6)

where

β(X, k) = log p(k|X) − log Z(X, k).    (7)

The two terms in equation (7) have similar qualitative properties. When the shape classes form compact clusters the posterior probability p(k|X) is peaked, that is, if X belongs to class k then the first term assumes a large value and a small value otherwise. Likewise, for a given X the quantity Z(X, k) assumes a large value when X belongs to class k and a small value otherwise because, in the former case, there should exist labelings with both high and low energies. It is therefore reasonable to assume that equation (7) may be approximated by a sum of two univariate functions, say,

β(X, k) ≈ β(X) + ε_k,    (8)

where β(X) depends only on X and ε_k depends only on k. In general this may not be true. Note that this assumption is weaker than, e.g., assuming the decomposability of Z(X, k). We provide empirical evidence in section 3. Equation (6) then simplifies to

f(X) = argmax_k [ε_k − min_Y U(X, Y, k)].    (9)

To further simplify the notation we set

q_k(X) = − min_Y U(X, Y, k)    (10)

and obtain

f(X) = argmax_k [q_k(X) + ε_k].    (11)
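The randomized decision trees return, for each vertex, a distribution over the part labels in H. A common way to turn such outputs into the unary potentials φ_v is the negative log of the forest-averaged probability; the following sketch uses made-up numbers and is not necessarily the authors' exact choice:

```python
import math

# Hypothetical tree outputs: one label distribution per tree, for one vertex,
# over |H| = 3 part labels. In the paper these come from trees grown on
# 3-D shape context descriptors.
tree_outputs = [
    [0.7, 0.2, 0.1],
    [0.5, 0.4, 0.1],
    [0.6, 0.3, 0.1],
]

def unary_potential(dists, y):
    """phi_v(y, X): -log of the averaged tree probability of label y."""
    p = sum(d[y] for d in dists) / len(dists)
    return -math.log(p)

phis = [unary_potential(tree_outputs, y) for y in range(3)]
print(min(range(3), key=lambda y: phis[y]))  # -> 0, the most probable label
```

Low potentials then correspond to labels the trees consider likely, so minimizing the energy (3) prefers them.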
Joint Shape Classification and Labeling of 3-D Objects
75
The set of free parameters of the resulting classifier f(X) comprises the unary and pairwise potential parameters of the K energy functions given in equation (3), along with the class specific constants ε = (ε_1, ..., ε_K).
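To make the pieces of this section concrete, the following self-contained sketch (toy sizes, invented numbers, not the authors' implementation) evaluates the energy (3) by brute force and applies the recognition rule (11); in practice min_Y U(X, Y, k) is obtained via alpha expansion rather than enumeration:

```python
import itertools

# Toy "mesh": 3 vertices, 2 edges, |H| = 2 part labels, K = 2 shape classes.
# phi[k][v][y] stands in for the decision-tree unaries of the k-th CRF;
# gamma is the Potts cost of equation (3); all numbers are made up.
V, E, H = [0, 1, 2], [(0, 1), (1, 2)], [0, 1]
gamma = 0.5
phi = [
    [[0.2, 1.0], [0.1, 0.9], [1.2, 0.3]],   # class 0
    [[0.9, 0.4], [1.1, 0.2], [0.3, 1.0]],   # class 1
]
eps = [0.0, -0.2]                            # class specific constants

def energy(Y, k):
    """U(X, Y, k) of equation (3): unary plus pairwise Potts terms."""
    return (sum(phi[k][v][Y[v]] for v in V)
            + sum(gamma for v, w in E if Y[v] != Y[w]))

def q(k):
    """q_k(X) = -min_Y U(X, Y, k); brute force stands in for alpha expansion."""
    return -min(energy(Y, k) for Y in itertools.product(H, repeat=len(V)))

# Recognition rule (11): the class maximizing q_k(X) + eps_k.
k_star = max(range(len(phi)), key=lambda k: q(k) + eps[k])
print(k_star)  # -> 0
```

The class whose CRF attains the lowest MAP energy, adjusted by its constant ε_k, wins.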
2 Learning and Inference
Given a pre-classified and pre-labeled set of training surfaces T we learn the unary and pairwise potential parameters for each of the K labeling models in equation (3) using a supervised algorithm similar to [19,20]. Learning involves growing decision trees for the unary potentials and learning the pairwise potential parameters via cross-validation. Alternatively, the maximum-likelihood method may be applied for parameter learning, as well as the techniques described in [15]. Approximate MAP inference of the part labels may be carried out efficiently using the alpha expansion algorithm [3], after which the quantities q_k(X) in equation (10) are computed. In the remainder of this section we describe the procedure for learning the class specific constants ε. For a training surface X ∈ T with known class association k the classifier f(X) correctly decides for k if

q_k(X) + ε_k > q_{k′}(X) + ε_{k′},  ∀k′ ≠ k.    (12)

Thus, for each X ∈ T there are K − 1 constraints of the form (12), yielding a total of (K − 1)|T| constraints for the training set T. We follow the Support Vector Machine (SVM) approach and minimize an upper bound of the empirical risk, i.e.,

L(ε) = Σ_X Σ_{k′≠k} max{0, 1 − q_k(X) − ε_k + q_{k′}(X) + ε_{k′}} → min_ε,    (13)

where k ≤ K denotes the true class of a surface X. Equation (13) is sometimes referred to as the hinge loss function. Since L(ε) is convex, a minimizer ε* = argmin_ε L(ε) can be obtained globally. Moreover, L(ε) is subdifferentiable [2] and can be minimized iteratively by a subgradient method. A typical subgradient method iterates

ε^(l+1) = ε^(l) − α_l g^(l),    (14)

where g^(l) denotes the subgradient of L(ε^(l)) at ε^(l), α_l denotes the step-size and l ≥ 0 denotes the iteration index. The subdifferential of equation (13) is given by

∂L(ε) = Σ_i ∂L_i(ε),    (15)

where the sum is over all inequalities in equation (12) and all X. If for the current ε^(l) and for the i-th example X we have q_k(X) + ε_k^(l) − q_{k′}(X) − ε_{k′}^(l) ≤ 1, then g_k^(l) = −1 and g_{k′}^(l) = 1, with g_k^(l) and g_{k′}^(l) denoting the k-th and k′-th components of the subgradient of L_i(ε^(l)). Otherwise the subgradient of L_i(ε^(l)) is equal to zero. The step-size α_l in equation (14) is determined prior to running the iteration. A classical step-size rule is given by

α_l ≥ 0,  Σ_{l=0}^∞ α_l² < ∞,  Σ_{l=0}^∞ α_l = ∞,    (16)

for example α_l = 1/l for l > 0. Algorithm 1 summarizes the proposed subgradient method for solving the problem ε* = argmin_ε L(ε).

Algorithm 1. Subgradient method for solving the problem ε* = argmin_ε L(ε).
Input: Training set T, class alphabet {1, ..., K}, number of iterations n ≥ 1
Output: A minimizer ε* of min_ε L(ε)
ε^(0) ← 0
for l ← 0 to n − 1 do
  g^(l) ← 0                                   // initialize subgradient of L(ε^(l)) with zero
  for i ← 1 to |T| do                         // T = {X_1, ..., X_i, ..., X_|T|}
    for k′ ← 1 to K with k′ ≠ k do            // k is the true class of X_i ∈ T
      if q_k(X_i) + ε_k^(l) − q_{k′}(X_i) − ε_{k′}^(l) ≤ 1 then
        g_k^(l) ← g_k^(l) − 1                 // update k-th component of g^(l)
        g_{k′}^(l) ← g_{k′}^(l) + 1           // update k′-th component of g^(l)
      end
    end
  end
  ε^(l+1) ← ε^(l) − g^(l)/(l + 1)
end
ε* ← ε^(n)

Given a test surface X, recognition is conducted by first running the α-expansion algorithm for each of the K labeling models. The energy values returned by the algorithm are then used to compute the quantities q_k(X), after which equation (11) can be solved. In the next section we show some experimental results.
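As a sketch, the subgradient update of Algorithm 1 translates almost line for line into Python; here q[i][k] stands for q_k(X_i) and labels[i] for the true class of X_i, with toy values invented for illustration:

```python
# Subgradient minimization of the hinge loss (13), following Algorithm 1.
def learn_eps(q, labels, K, n_iter=100):
    eps = [0.0] * K
    for l in range(n_iter):
        g = [0.0] * K                          # subgradient of L(eps)
        for qi, kt in zip(q, labels):          # kt: true class of X_i
            for k in range(K):
                if k == kt:
                    continue
                # margin constraint from (12) violated or active
                if qi[kt] + eps[kt] - qi[k] - eps[k] <= 1.0:
                    g[kt] -= 1.0
                    g[k] += 1.0
        if all(gi == 0.0 for gi in g):         # all margins satisfied: stop
            break
        eps = [e - gi / (l + 1) for e, gi in zip(eps, g)]  # alpha_l = 1/(l+1)
    return eps

# Toy data: two classes; with eps = 0 the second surface violates its margin.
q = [[-5.0, -9.0], [-8.0, -7.5]]
labels = [0, 1]
eps = learn_eps(q, labels, K=2)
assert all(q[i][labels[i]] + eps[labels[i]] >
           max(q[i][k] + eps[k] for k in range(2) if k != labels[i])
           for i in range(2))
```

After a couple of iterations every training surface is separated by the required margin, mirroring the fast convergence reported in section 3.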
3 Experiments
We have experimented with our model using a data set of 200 human outer ear impressions, which were laser scanned to reconstruct 3-D triangular mesh representations. The resulting meshes are composed of roughly 20,000 vertices each. Each surface was labeled by an expert along anatomical lines using a CAD software system. In this way 6 compact, non-overlapping regions were obtained, as illustrated in figure 1. These regions play a significant role in the design of personalized hearing aid devices [18]. We randomly picked 90% of the surfaces for training while setting the other 10% aside for testing. The whole data set was then partitioned
Fig. 2. Example segmentations of four test surfaces: ground truth (left), a single labeling model (middle), and the joint shape classification and labeling model (right). Accuracies (single model vs. proposed method): 3.40 vs. 5.53, 3.25 vs. 5.54, 2.54 vs. 5.48, and 3.43 vs. 5.56.
into K = 5 shape classes via clustering using the algorithm in [7]. As a measure of pairwise shape distance we chose the 3-D shape context matching score under a bipartite matching model similar to [1]. This resulted in a pre-classified and pre-labeled set of training examples with reduced anatomical variability inside the clusters (the shape classes). Three example classes are depicted in figure 1. Next, each of the K labeling models was learned as described in section 2 using the class members as input. Prior to learning the class specific constants ε, the quantities q_k(X) were computed using equation (10). The class specific constants were then learned using algorithm 1, which converged after a few iterations. For a test surface the solver returns the estimated class and the labeling. If ε is set to zero, i.e., when ε_k is removed from equation (11), then 70% of the training data and 30% of the test data were assigned to the correct class, i.e., the learned class, while at the same time the labeling model of this class generated the best
Fig. 3. Labeling results for four test surfaces using the joint shape classification and labeling model with ε = 0 (middle) and with the learned ε* ≠ 0 (right), next to the ground truth (left). Accuracies (ε = 0 vs. learned ε*): 3.42 vs. 4.29, 3.11 vs. 4.05, 3.21 vs. 5.38, and 3.01 vs. 5.37. The labeling model associated with the learned shape class of a test candidate performs best (right).
quality labeling. On the other hand, for the learned vector ε* ≠ 0 a correct class assignment was achieved for 92% of the training data and for 78% of the test data, while at the same time the labeling models of these classes performed best. Figure 2 illustrates the results for four test candidates. The first column shows the ground truth labeling of the surfaces. The second column depicts the labeling result achieved as the MAP estimate using a single model of the form given in equation (3). Note how various regions are over- and under-segmented. The third column shows the labeling result using the learned joint classification and labeling model, for which we observe the best agreement with the ground truth. The quality of a labeling is indicated by the quantity below the surfaces. For each label, the Dice coefficient of the estimated region and the corresponding ground truth region was computed as a measure of labeling accuracy per part. The labeling accuracy of a test candidate is defined as the sum of the part scores, with 6 being the
maximum score. The results suggest that the joint model copes better with the shape variability than a single labeling model of comparable complexity. Figure 3 depicts the labeling results of four test surfaces when the estimated shape classes differ from the learned classes (middle) and when the surfaces are assigned to their learned classes (right). The results show that the labeling model of the learned class achieves the best agreement with the ground truth. The outcome of this experiment provides empirical evidence for the assumption in equation (8).
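The accuracy measure reported below the surfaces in figures 2 and 3 can be sketched as follows; treating a part that is absent in both labelings as a perfect match is our assumption for this sketch, since the corner case is not spelled out above:

```python
# Per-part Dice coefficients between estimated and ground-truth label
# regions, summed over all parts (maximum 6 for |H| = 6 parts).
def labeling_accuracy(pred, truth, labels):
    score = 0.0
    for h in labels:
        p = {v for v, y in enumerate(pred) if y == h}    # estimated region
        t = {v for v, y in enumerate(truth) if y == h}   # ground-truth region
        if p or t:
            score += 2.0 * len(p & t) / (len(p) + len(t))  # Dice coefficient
        else:
            score += 1.0   # part absent in both labelings (our assumption)
    return score

# Tiny example with 4 vertices and 2 parts; a perfect labeling scores 2.0.
print(labeling_accuracy([0, 0, 1, 1], [0, 1, 1, 1], [0, 1]))
```

A perfect labeling of an ear mesh thus scores 6, matching the maxima observed in the figures.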
4 Discussion and Future Work
We introduced a joint shape classification and labeling model using the energy minimization framework. The model integrates shape information in terms of multiple shape specific labeling models each of which is learned using a training set of surfaces with a similar shape. As demonstrated in the experiments the labeling accuracy greatly improves over using a single labeling model of comparable complexity. Moreover, the best performance was achieved when the labeling model of the estimated shape class was used. The preliminary experiments are promising given the simplicity of the labeling model. In the future we plan to investigate the capability of the method when the labeling model includes more complex constraints about the object structure.
References

1. Belongie, S., Malik, J., Puzicha, J.: Shape matching and object recognition using shape contexts. IEEE PAMI 24(4) (2002)
2. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press (2004)
3. Boykov, Y., et al.: Efficient approximate energy minimization via graph cuts. IEEE PAMI (2001)
4. Chen, X., Golovinskiy, A., Funkhouser, T.: A benchmark for 3D mesh segmentation. ACM Transactions on Graphics (Proc. SIGGRAPH) 28(3) (August 2009)
5. Cipriano, G., Phillips Jr., G.N., Gleicher, M.: Multiscale surface descriptors. IEEE Transactions on Visualization and Computer Graphics (Proc. Visualization 2009) (October 2009)
6. Flach, B., Schlesinger, D.: Combining shape priors and MRF-segmentation. In: da Vitoria Lobo, N., Kasparis, T., Roli, F., Kwok, J.T., Georgiopoulos, M., Anagnostopoulos, G.C., Loog, M. (eds.) S+SSPR 2008. LNCS, vol. 5342, pp. 177–186. Springer, Heidelberg (2008)
7. Frey, B.J., Dueck, D.: Clustering by passing messages between data points. Science 315, 972–976 (2007)
8. Golovinskiy, A., Funkhouser, T.: Randomized cuts for 3D mesh analysis. ACM Transactions on Graphics (Proc. SIGGRAPH Asia) 27 (2008)
9. Golovinskiy, A., Funkhouser, T.: Consistent segmentation of 3D models. Computers and Graphics (Shape Modeling International 09) 33(3), 262–269 (2009)
10. Johnson, A.E., Hebert, M.: Using spin-images for efficient multiple model recognition in cluttered 3-D scenes. IEEE PAMI 21(5), 433–449 (1999)
11. Kalogerakis, E., Hertzmann, A., Singh, K.: Learning 3D mesh segmentation and labeling. In: SIGGRAPH 2010 (2010)
12. Koertgen, M., Park, G.J., Novotni, M., Klein, R.: 3D shape matching with 3D shape contexts. In: Proceedings of the 7th Central European Seminar on Computer Graphics (2003)
13. Kumar, M.P., Torr, P.H.S., Zisserman, A.: OBJ CUT. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Diego, vol. 1, pp. 18–25 (2005)
14. Lafferty, J.D., McCallum, A., Pereira, F.: Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: ICML (2001)
15. Nowozin, S., Lampert, C.H.: Structured learning and prediction in computer vision. Foundations and Trends in Computer Graphics and Vision 6(3-4), 185–365 (2011)
16. Shapira, L., Shamir, A., Cohen-Or, D.: Consistent mesh partitioning and skeletonisation using the shape diameter function. The Visual Computer 24, 249–259 (2008)
17. Shi, Y., et al.: Direct mapping of hippocampal surfaces with intrinsic shape context. NeuroImage (2007)
18. Slabaugh, G., Fang, T., McBagonluri, F., Zouhar, A., Melkisetoglu, R., Xie, H., Unal, G.: 3-D shape modeling for hearing aid design. IEEE Signal Processing Magazine (2008)
19. Winn, J., Shotton, J.: The layout consistent random field for recognizing and segmenting partially occluded objects. In: CVPR (2006)
20. Zouhar, A., Baloch, S., Tsin, Y., Fang, T., Fuchs, S.: Layout consistent segmentation of 3-D meshes via conditional random fields and spatial ordering constraints. In: Jiang, T., Navab, N., Pluim, J.P.W., Viergever, M.A. (eds.) MICCAI 2010, Part III. LNCS, vol. 6363, pp. 113–120. Springer, Heidelberg (2010)