2012 IEEE International Conference on Multimedia and Expo
Learning Global and Reconfigurable Part-based Models for Object Detection

Xi Song∗†, Tianfu Wu†, Yi Xie∗†, and Yunde Jia∗
∗ Beijing Lab of Intelligent Information Technology, School of Computer Science, Beijing Institute of Technology, Beijing 100081, P. R. China
† Lotus Hill Research Institute, Ezhou 436000, P. R. China
Email: [email protected], {tfwu.lhi, yxie.lhi}@gmail.com, [email protected]
Abstract—This paper presents a method of learning global and reconfigurable part-based models (RPM) for object detection. Recently, the deformable part-based model (DPM) has been widely used. A DPM consists of a root node and a collection of part nodes, and is learned under the latent SVM formulation [1] by treating the part nodes as hidden variables. Although the configuration of parts (i.e., the shapes, sizes and locations of parts) plays a major role in improving object detection performance, it has not been addressed well in the literature. In this paper, we propose the RPM to tackle it. A dictionary of part types is defined by enumerating rectangular shapes of different aspect ratios and sizes given the whole lattice (often at twice the resolution of the root node), and each part type has a set of part instances when placed in the lattice. So, the configuration space of parts is quantized by the part types and part instances, and then organized into a hierarchical And-Or directed acyclic graph (AOG). The AOG consists of three types of nodes: terminal nodes (i.e., part instances), And-nodes (representing decompositions of a part instance into two smaller ones) and Or-nodes (representing alternative ways of decomposition). The globally optimal configuration in the AOG is solved for using dynamic programming (DP), where the classification error rates of terminal nodes and And-nodes are used as their figures of merit. In experiments, we test our method on the 20 object categories in the PASCAL VOC2007 dataset and obtain performance comparable with state-of-the-art methods.

Keywords: Deformable Part-based Model; Part Configuration; Latent SVM; Dynamic Programming

978-0-7695-4711-4/12 $26.00 © 2012 IEEE. DOI 10.1109/ICME.2012.32

I. INTRODUCTION

In the recent literature on object detection, the deformable part-based model (DPM) [1] has shown surprisingly promising performance improvements on challenging benchmark datasets such as the PASCAL VOC datasets [2]. A DPM consists of a root node (representing the whole object at a coarse scale) and a collection of part nodes (each representing a different portion of the object at a finer scale). The learning of the DPM is formulated under the latent SVM framework and uses partially annotated data (i.e., only the bounding boxes of whole objects are labelled) [1], [3]. Although, as pointed out in [1], the performance improvement mainly lies in incorporating the deformable parts, the configuration of parts (i.e., their shapes, sizes and locations) has not been addressed well in existing work, where parts are often defined by rectangular shapes of fixed size, and their configurations are initialized either by a greedy search based on the learnt root template [1] or by assuming a simple predefined quad-tree partition of the image lattice [3].

In this paper, we present a method of learning global and reconfigurable part-based models (RPM) for object detection. An RPM first quantizes the configuration space of parts explicitly, and then organizes it into an AOG. In learning, the globally optimal configuration is solved for by DP in the AOG in terms of the classification error rates of terminal nodes and And-nodes. An RPM has the following three properties.

(i) Flexible Part Types. Instead of using rectangular shapes of fixed size as parts [1], we define a dictionary of part types by enumerating rectangular shapes of different aspect ratios and sizes given the whole lattice (at twice the resolution of the root node), as illustrated in Fig. 1 (a); each part type has a set of part instances when placed in the lattice. For example, the part type B has 64 instances when placed in the lattice, as shown in Fig. 1 (b). So, the configuration space of parts is quantized by the defined part types and the placed part instances.

(ii) Selecting Parts Jointly Using the AOG Organization of the Configuration Space of Parts. By incorporating more flexible part types, the number of configurations of part instances of different part types increases substantially, and a good organization is needed. As illustrated in the middle-bottom of Fig. 1, different part types and part instances can be organized in terms of an And-Or structure; e.g., the part type D in Fig. 1 (a) can either directly terminate or decompose into part types A and B in two alternative ways when placed in the lattice (see Fig. 1 (c)). So, the configuration space of parts is organized into a hierarchical And-Or directed acyclic graph (see Fig. 1 (d) and Sec. II for details on the construction) consisting of three types of nodes: terminal nodes (i.e., part instances), And-nodes (representing decompositions of a big part instance into two smaller ones) and Or-nodes (representing alternative ways of decomposing part instances). The AOG enables the joint selection of the parts comprising the configuration, in contrast to the greedy initialization.

(iii) Learning Global and Reconfigurable Configurations of Parts by DP. After the AOG is constructed, the globally optimal configuration in the AOG is found using dynamic programming (DP), where the error rates of terminal nodes and And-nodes are used as their figures of merit; the learnt configurations are thus
Figure 1. Illustration of the RPM. It quantizes the configuration space of parts by the part types (a) and part instances (b), and then organizes them into an And-Or directed acyclic graph (AOG) (d). For clarity, only a portion of the AOG is shown. (c) illustrates a basic And-Or unit for the part type D, which can either directly terminate or decompose into part types A and B in two "Or" ways. (e) shows an example of the learned RPM. See text for details.
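To make the quantization in Fig. 1 (a)-(b) concrete, the following sketch enumerates the part-type dictionary and counts the instances of one type; the 8 × 8 cell lattice and the size bounds w0, h0, w1, h1 are illustrative values, not the paper's actual settings:

```python
W, H = 8, 8              # lattice size in cells (illustrative)
w0 = h0 = 2              # minimal part width/height (illustrative)
w1, h1 = W - w0, H - h0  # maximal part width/height

# Dictionary of part types: every rectangular shape (w, h) within the bounds.
part_types = [(w, h) for w in range(w0, w1 + 1) for h in range(h0, h1 + 1)]

def num_instances(w, h):
    """A w x h part type used as a sliding window in the W x H lattice
    yields (W - w + 1) * (H - h + 1) part instances."""
    return (W - w + 1) * (H - h + 1)

print(len(part_types))      # 25 part types for these bounds
print(num_instances(4, 4))  # a 4 x 4 type has 25 placements
```

The product of the type count and the per-type placement counts is what makes the raw configuration space large, motivating the AOG organization below.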
adaptive for different object categories (see Sec. V). Fig. 1 (e) shows one component of the learnt model for persons in the PASCAL VOC2007 dataset. In experiments, we test the RPM on the 20 object categories in the PASCAL VOC2007 dataset [2] and obtain performance comparable with the state-of-the-art method [4].

In the literature on object detection, compositional hierarchies [5]–[8] and deformable templates [1], [9] are widely used. Our method is built on the DPM [1] and extends the part types and the approach for selecting part configurations. Our method is also related to [3], but extends their predefined structure of part configurations (i.e., the quad-tree) to reconfigurable configurations. The AOG organization is related to the tangram model proposed for modeling scene configurations [10].

The main contribution of this paper is three-fold: (i) quantizing the part configuration space by defining a flexible dictionary of part types; (ii) jointly selecting parts by using the AOG organization of the configuration space of parts to find a reasonable part configuration; and (iii) learning global and reconfigurable configurations of parts by DP.

II. REPRESENTATION: RPM ORGANIZED INTO AN AOG

In this section, we present the method of quantizing the part configuration space by defining a set of part types and part instances, and then organizing them into an AOG for searching the part configuration. Denote by Λ the image lattice on which images of objects are defined. Assume Λ is divided into W × H cells of the same size (such as 8 × 8 pixels).

Part Types. We use rectangular shapes for simplicity. A part type is defined as a rectangular shape of w × h cells, where w0 ≤ w ≤ w1 and h0 ≤ h ≤ h1, with w0 and h0 being the allowable minimal width and height (such as w0 = h0 = 2) and w1 and h1 the allowable maximal width and height (such as w1 = W − w0 and h1 = H − h0). The dictionary of part types is then obtained by enumerating all pairs (w, h), as illustrated in Fig. 1 (a).

Part Instances. For a given part type such as B in Fig. 1 (a), we can generate all of its instances by placing it at different locations in the lattice Λ, i.e., using B as a sliding window in Λ, as the instances B1 to B64 shown in Fig. 1 (b).

And-Or Structure among Part Instances. As illustrated in Fig. 1 (c), a sub-image domain of part type D can either directly terminate as an instance of part type D (i.e., a terminal node, represented by a gray rectangle) or decompose into an instance of part type A and an instance of part type B in two alternative ways (i.e., Or switching), i.e., two And-nodes represented by solid circles. This basic And-Or structure can be recursively explored among all the part instances of all the part types, leading to the hierarchical And-Or graph organization illustrated in Fig. 1 (d).

Constructing the AOG. As illustrated in Fig. 1 (c), it is straightforward to organize all the part instances into a hierarchical AOG. Concretely, starting from the root Or-node, which represents the lattice Λ, it can either directly terminate as a terminal node (the leftmost gray rectangle node in Fig. 1 (d)) or decompose into two parts by enumerating all possible partitions, i.e., dividing the lattice either vertically (the red lines) or horizontally (the blue lines), provided the two resulting parts are valid part instances. This procedure is applied recursively to all the Or-nodes encountered, layer by layer from top to bottom. So, the AOG represents all possible configurations of parts for the given lattice Λ and forms an over-complete representation. Furthermore, the AOG is a directed acyclic graph in which DP can be used to search for the globally optimal configurations.

To learn global and reconfigurable configurations of parts in the AOG, we need to define a figure of merit for the nodes in the AOG. In the following, we first present the RPM interpreted as an object detection grammar, with its appearance models and deformation models specified, and then propose the DP algorithm in Sec. V.
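The recursive construction described above can be sketched in a few lines; the lattice size, minimal part size, and node encoding below are illustrative choices, not the paper's implementation:

```python
from functools import lru_cache

W, H = 6, 4   # lattice size in cells (illustrative)
w0 = h0 = 2   # minimal part width/height (illustrative)

@lru_cache(maxsize=None)
def or_node(x, y, w, h):
    """Or-node for the part instance occupying cells [x, x+w) x [y, y+h).

    Children: one terminal node, plus one And-node per valid binary
    split (vertical or horizontal) whose two halves are themselves
    valid part instances.
    """
    children = [("Terminal", (x, y, w, h))]
    for cut in range(w0, w - w0 + 1):        # vertical cuts
        children.append(("And", or_node(x, y, cut, h),
                                or_node(x + cut, y, w - cut, h)))
    for cut in range(h0, h - h0 + 1):        # horizontal cuts
        children.append(("And", or_node(x, y, w, cut),
                                or_node(x, y + cut, w, h - cut)))
    return ("Or", (x, y, w, h), tuple(children))

root = or_node(0, 0, W, H)
print(or_node.cache_info().currsize)  # number of distinct Or-nodes built
```

Memoizing on (x, y, w, h) is what turns the And-Or tree into a directed acyclic graph: a sub-lattice shared by several decompositions becomes a single Or-node.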
Figure 2. Learnt AOT for cars.
III. THE MODEL: RPM-EMBEDDED OBJECT DETECTION GRAMMAR

In this section, we interpret the RPM as an object detection grammar by following the framework of [6], [11].

Mixture of RPMs as an And-Or Template (AOT). Fig. 2 shows a mixture of RPMs for cars in the PASCAL VOC2007 dataset, with three components, which is equivalent to an AOT representation: the mixture model is the root structural Or-node, the mixture components (e.g., defined based on aspect ratios as in [1]) are decomposition And-nodes, and the deformable parts are terminal nodes. The configuration of parts for each component is learnt from the AOG defined above. Following the framework of [6], [11], an AOT embeds an object detection grammar by defining one root symbol S representing an object category; a set of nonterminal symbols VN, including a set of Or-nodes VOr accounting for structural variations of the category (i.e., mixture models) and a set of And-nodes VAnd representing the decomposition of an object into parts; a set of terminal symbols VT representing the building blocks for objects, such as deformable parts; a set of scored production rules R defining the derivations for nonterminal and terminal symbols; and a set of feature evaluation models, each associated with a symbol to compute its score when the symbol is placed in an image. Formally, an AOT is specified by a 6-tuple

G = (S, VN, VT, R, Θ^app, Θ^def),    (1)

where Θ^app represents the parameter (row) vectors of the appearance models for nonterminal and terminal symbols, so we have θ_v^app ∈ Θ^app (∀v ∈ VN ∪ VT). Since terminal symbols represent deformable parts in this paper, they also have the parameter (row) vectors of deformation models θ_v^def ∈ Θ^def (∀v ∈ VT). The scored production rules are represented by the edges in the AOT.

A. Appearance features and deformation models

In this paper, we adopt the sliding-window schema for detection, extracting an appearance feature (column) vector Φ^app(I, ω, v) for each symbol v ∈ VN ∪ VT. Note that the appearance feature for a terminal part symbol is extracted at twice the spatial resolution of its parent symbol to account for more detail, as in [1], [3]. We use the histogram of oriented gradients (HOG) descriptor [12] to compute Φ^app(I, ω, v), as in previous work [1], [3]. The deformation model for a terminal symbol v ∈ VT is defined by a quadratic function of its displacement, with Φ^def(v, δ) = [dx², dx, dy², dy].

B. Generating parse trees by placing the AOT in images

Let Λ be the image lattice and IΛ an image. In detection, each symbol in the AOT can be placed at different locations in Λ at different scales (we do not consider orientation in this paper). In practice, an image pyramid I = (IΛ0, ..., IΛL) is used to handle the scale space (where Λ0 = Λ). Denote by Ω = {(l, x, y) : 0 ≤ l ≤ L, (x, y) ∈ Λl} the location space for symbols in the AOT (i.e., each element corresponds to a sliding window in detection). In addition, a terminal symbol v ∈ VT is deformable when placed at an anchor point ω, with a displacement denoted by δ = (dx, dy). The displacement operation, denoted by ⊕, is defined by Ω ⊕ Δ → Ω (i.e., ω ⊕ δ = (l, x + dx, y + dy) ∈ Ω). When placing a terminal symbol v ∈ VT at an anchor location, we need to infer its hidden actual location by maximizing its score function (defined below) over the displacement. When placing a nonterminal symbol v ∈ {S} ∪ VN at a location ω ∈ Ω, we can derive its parse tree, denoted by Pt(v, ω; G, I), by recursively applying the production rules for nonterminal symbols and directly terminating to image data for terminal symbols. This can be treated as a procedure that traverses the AOT starting from v, selecting the best branch (i.e., the one with the highest score) at each encountered Or-node, expanding to all the children at each encountered And-node, and terminating at terminal nodes. The score of a placed terminal symbol v is

Score(v, ω) = max_{δ∈Δ} { θ_v^app · Φ^app(I, ω ⊕ δ, v) − θ_v^def · Φ^def(v, δ) }.    (2)

The hidden actual location of v is then given by z_v(ω) = ω ⊕ δ*, where δ* is the maximizing displacement in Eqn. 2. In practice, the inference of part locations is made efficient by the generalized distance transform [1], [8], owing to the quadratic deformation function. The score of the parse tree Pt(v, ω; G, I) for a nonterminal node v is computed recursively by

Score(Pt(v, ω; G, I)) = θ_v^app · Φ^app(I, ω, v)
    + max_{u∈ch(v)} Score(Pt(u, ω; G, I))   if v is an Or-node,
    + Σ_{u∈ch(v)} Score(Pt(u, ω; G, I))     if v is an And-node,    (3)

where ch(v) represents the set of child nodes of v in the AOT. Note that if u is a terminal node, Score(Pt(u, ω; G, I)) = Score(u, ω) as defined by Eqn. 2. This recursion leads to dynamic programming in inference. In detection, after computing and thresholding Score(Pt(S, ω; G, I)) (∀ω ∈ Ω), we can backtrack the component used (i.e., the And-node selected in the parse tree). In practice, we also need to use non-maximum suppression.

IV. LEARNING THE AOT OF RPMS

In this paper, the learning of the AOT of RPMs is based on partially labelled data (i.e., only object labels and object bounding boxes are given) and adopts the latent SVM framework [1]. The learning procedure consists of two stages: AOT structure assignment (i.e., hidden variable assignments, including structural Or-nodes and the configuration of parts) and AOT parameter estimation (i.e., solving an SVM optimization problem). In this section, we briefly specify how to estimate parameters for a given AOT structure.

Let D+ = {(I1, B1), ..., (IN, BN)} be the set of N positive training images for a given object category, where Bi represents the labelled bounding box (without loss of generality, we assume each image Ii contains only one object instance in Bi). Denote by D− = {J1, ..., JM} a set of M negative images (in which no object instances of the given class appear). For notational simplicity, we will use D = D+ ∪ D− = {(xi, yi); i = 1, ..., n} to represent the set of n = N + M training examples, where yi ∈ {+1, −1}. Then, for each training example, we have a vector of hidden variables

h = (a, ω(S), {(t_j, ω(t_j))}_{j=1}^{P}),    (4)

where a ∈ VAnd, ω(S) is the placed location of the AOT, t_j ∈ VT and ω(t_j) its anchor location.

A. Estimating parameters for a given AOT

For a given AOT, we can score each training example by using Eqn. 3 with dynamic programming. We have

f(x; θ) = max_{h∈H(x)} θ · Φ(x, h) = max_{ω∈Ω} Score(Pt(S, ω; G, I)),    (5)

where H(x) represents the assignment space of hidden variables in training example x, θ is the concatenation of the parameter vectors of all the nodes in the AOT, and Φ(x, h) the sparse feature vector with non-zero entries corresponding to the nodes specified by h. Under the latent SVM framework [1], we estimate the parameter θ by minimizing the regularized loss function

L_D(θ) = (1/2) ||θ||² + λ Σ_{i=1}^{n} max(0, 1 − y_i f(x_i; θ)).    (6)

The objective function L_D(θ) is convex for negative examples (y_i = −1) but concave for positive examples (y_i = +1), i.e., the so-called semi-convexity in [1]. We refer to [1] for more details on how to solve this optimization problem. In experiments, we reuse the efficient implementation in [4].

V. LEARNING PART CONFIGURATIONS USING DP

In this section, we specify how to assign the AOT structure of RPMs by using the AOG of configurations of parts defined in Sec. II.

A. Creating VAnd and assigning a ∈ h

For initialization, we perform k-means clustering with increasing integer k (k ≥ 1) based on the aspect ratios of the Bi's, with an equal number of examples per cluster, and stop at the cluster number k for which the variance of the aspect ratios within each cluster falls below a given threshold. Each cluster is then represented by an And-node (i.e., a mixture component) as a child of the root structural Or-node, and ai is initialized for each positive training image. So, the positive training dataset is divided into D+ = ∪_{a∈VAnd} D_a^+. Given the mean aspect ratio of each And-node, its model size is chosen such that it overlaps with all the assigned positive examples by more than a threshold (80% in this paper), as done in [1]. The positive examples are then warped and cropped out in terms of the bounding box Bi. We first train a linear SVM (not latent) for each node a ∈ VAnd separately, and then merge them to train a latent SVM (with only one hidden variable, for assigning the structural Or-node) that refines the parameters with a predefined number of iterations. During the learning procedure, each positive training example is assigned to one of the And-nodes in terms of the inference result (note that the inference is done in the image pyramid of the original image, without warping).

B. Learning Global and Reconfigurable Configurations

Denote by Λa the image lattice of size Wa × Ha cells for each And-node a ∈ VAnd. We first construct the AOG of the configurations of parts for each Λa as stated in Sec. II. Denote the learnt linear SVM classifier for each And-node a by

f_a(x) = ⟨θ_a, Φ_a^app(x)⟩ + b_a,    (7)

where θ_a is the learnt weight parameter vector of length Wa × Ha × d (d is the length of the HOG feature; d = 32 in this paper, adopting the HOG implementation in [4]) and b_a the bias term. Due to linearity, we can factorize the weight parameter vector into each cell, indexed by c = (i, j) (where 1 ≤ i ≤ Wa and 1 ≤ j ≤ Ha), and then we have

f_a(x) = Σ_c ⟨θ_a(c), Φ_a^app(x_c)⟩ + b_a,    (8)

where θ_a(c) and Φ_a^app(x_c) represent the weight parameters and HOG features corresponding to cell c, respectively.

Calculating the error rate of each terminal node. For each terminal node t (i.e., a part instance) in the AOG, we can obtain its weight parameters by concatenating those of the cells belonging to the part instance. Then, given the positive and negative training datasets D+ and D− (including the hard negatives mined while learning the classifier for the root node), we have a set of positive scores, denoted by s_t^+ = {s_t(x_i(t)); x_i ∈ D+}, and a set of negative scores, denoted by s_t^− = {s_t(x_i(t)); x_i ∈ D−}, where the score of each training example (positive or negative) is computed by s_t(x_i(t)) = ⟨θ_t, Φ^app(x_i(t))⟩, without using the bias term. The minimal error rate for the terminal node can then be calculated from s_t^+ and s_t^−.

Calculating the error rate of each And-node. For each And-node a, its positive and negative scores s_a^+ and s_a^− are computed by treating the two child part instances as deformable parts and running inference with this small "RPM" on each training example.

Finding the optimal configuration by top-down traversal of the AOG. After the error rates are computed for every terminal node and And-node in the AOG, the optimal configuration is found by traversing the AOG from the root node and, at each encountered Or-node, selecting the branch whose error rate is minimal. Fig. 2 shows the traversed configurations for the car category.

VI. EXPERIMENTS

In experiments, we test our method on the 20 object categories in the PASCAL VOC2007 dataset [2]. For all 20 categories, we use the trainval datasets for training models and evaluate our models on the test datasets. It takes about 5 hours to train a model for each category using a 5-computer Matlab cluster, and about 3 hours to test a learned model on thousands of images (i.e., about 2 seconds per image) using a single-threaded process. Some examples of the detection results of our learnt models on the PASCAL VOC2007 test dataset are shown in Fig. 4, where bounding boxes in red indicate the locations of objects and bounding boxes in other colors indicate the locations of the different parts. We compare our learnt models with the state-of-the-art methods [1], [3] and [4] in Table I, where the average precision (AP) of the four methods on the 20 categories is summarized. Note that all the results are obtained without post-processing such as bounding-box prediction or context rescoring. Our models outperform the others on three categories (aeroplane, boat and sheep) and obtain comparable performance on the other categories. By comparing our learned models with the ones in [4] (see Fig. 3 for an illustration), we note that: (i) our method selects parts and their configuration simultaneously, using the classification error rates of parts as the figure of merit; in contrast, the parts in [4] are selected sequentially and greedily in terms of the squared norm of the positive template filter weights. For example, the part in the blue box in (a) carries information for detecting an aeroplane, although it has low energy (i.e., it is not covered by the parts in (b)). (ii) Our method allows more flexible part types and thus potentially has more power to adapt to different object categories. For instance, the part in the yellow box in (a) interprets the same portion of an aeroplane as the two parts in the brown and yellow boxes in (b), while the former is intuitively more stable. These factors might explain the improvement of our models on the three categories.
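The error-rate-guided selection of Sec. V can be sketched on a toy AOG; the node encoding and the hand-picked error rates below are purely illustrative, not learnt values:

```python
# A tiny hand-built AOG fragment: Or-nodes list alternative children,
# And-nodes decompose into two children and carry their own error rate,
# terminal nodes ("T") are part instances with precomputed error rates.
aog = {
    "root":    ("Or",  ["whole", "split"]),
    "whole":   ("T",   0.30),                     # terminate: one big part
    "split":   ("And", ["left", "right"], 0.22),  # decompose into two parts
    "left":    ("Or",  ["left_t"]),
    "left_t":  ("T",   0.25),
    "right":   ("Or",  ["right_t"]),
    "right_t": ("T",   0.28),
}

def error(node):
    """Figure of merit: a node's own error rate, or the best branch's
    error rate for an Or-node."""
    entry = aog[node]
    if entry[0] == "T":
        return entry[1]
    if entry[0] == "And":
        return entry[2]
    return min(error(c) for c in entry[1])

def select(node):
    """Top-down retrieval: take the minimal-error branch at each Or-node,
    keep both children of an And-node, stop at terminal nodes."""
    entry = aog[node]
    if entry[0] == "T":
        return [node]
    if entry[0] == "And":
        return [t for c in entry[1] for t in select(c)]
    return select(min(entry[1], key=error))

print(select("root"))  # the selected part configuration
```

At each Or-node the minimal-error branch is taken, so here the decomposition (error 0.22) beats direct termination (error 0.30) and the returned terminal nodes form the selected configuration.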
Figure 3. Illustration of the parts and their configuration learned by our method (a) and by [4] (b).
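For reference, the terminal-symbol placement of Eqn. 2 in Sec. III can be sketched by brute force over displacements (the paper instead uses the generalized distance transform [1], [8] for efficiency); the score map and deformation weights below are illustrative random values:

```python
import numpy as np

rng = np.random.default_rng(0)

# Appearance scores over a small window of candidate displacements,
# standing in for theta_app . Phi_app(I, omega (+) delta, v).
R = 3  # maximum |dx| and |dy| (illustrative)
app = rng.standard_normal((2 * R + 1, 2 * R + 1))

# Quadratic deformation weights for theta_def . [dx^2, dx, dy^2, dy].
theta_def = np.array([0.05, 0.0, 0.05, 0.0])

def place_terminal(app, theta_def):
    """Eqn. 2 by brute force: maximize appearance score minus
    deformation penalty over all displacements delta = (dx, dy)."""
    best, best_delta = -np.inf, (0, 0)
    for dy in range(-R, R + 1):
        for dx in range(-R, R + 1):
            phi_def = np.array([dx * dx, dx, dy * dy, dy])
            s = app[dy + R, dx + R] - theta_def @ phi_def
            if s > best:
                best, best_delta = s, (dx, dy)
    return best, best_delta

score, (dx, dy) = place_terminal(app, theta_def)
print(score, (dx, dy))
```

The quadratic form of the penalty is exactly what permits replacing this O(|Δ|) scan per anchor with the generalized distance transform in practice.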
VII. CONCLUSION

This paper presents a global and reconfigurable part-based model (RPM) for object detection. Our method obtains performance comparable with state-of-the-art methods on the PASCAL VOC2007 dataset. In ongoing work, we are exploring how to learn parts shared among the different And-nodes (i.e., mixture components) of a single object category and across multiple object categories.

ACKNOWLEDGMENT

This work was supported in part by the Natural Science Foundation of China (NSFC) under grants No. 90920009 and No. 60832004.

Figure 4. Examples of detection results of our learnt model on the PASCAL VOC2007 test dataset, where parts are plotted in different colors. (Best viewed in color and under magnification.)

Table I
PERFORMANCE COMPARISON USING AVERAGE PRECISION FOR THE 20 OBJECT CATEGORIES IN THE PASCAL VOC2007 DATASET. ALL RESULTS ARE OBTAINED WITHOUT POST-PROCESSING SUCH AS BOUNDING BOX PREDICTION OR CONTEXT RESCORING.

       aero  bike  bird  boat  bttle bus   car   cat   chair cow
[3]    .294  .558  .094  .143  .286  .440  .513  .213  .200  .193
[1]    .290  .546  .006  .134  .262  .394  .464  .161  .163  .165
[4]    .289  .595  .100  .152  .255  .496  .579  .193  .224  .252
Ours   .327  .561  .097  .155  .242  .492  .548  .181  .212  .235

       tble  dog   hrse  mbik  pers  plant sheep sofa  train tv
[3]    .252  .125  .504  .384  .366  .151  .197  .251  .368  .393
[1]    .245  .050  .436  .378  .350  .088  .173  .216  .340  .390
[4]    .233  .111  .568  .487  .419  .122  .178  .336  .451  .416
Ours   .241  .111  .558  .473  .409  .118  .197  .329  .438  .397

REFERENCES

[1] P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, "Object detection with discriminatively trained part-based models," PAMI, vol. 32, no. 9, pp. 1627–1645, 2010.
[2] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The pascal visual object classes (voc) challenge," IJCV, vol. 88, no. 2, pp. 303–338, 2010.
[3] L. Zhu, Y. Chen, A. Yuille, and W. Freeman, "Latent hierarchical structural learning for object detection," in CVPR, 2010, pp. 1062–1069.
[4] P. Felzenszwalb, R. Girshick, and D. McAllester, "Discriminatively trained deformable part models, release 4," http://people.cs.uchicago.edu/~pff/latent-release4/.
[5] S. Geman, D. Potter, and Z. Y. Chi, "Composition systems," Quart. Appl. Math., vol. 60, no. 4, pp. 707–736, 2002.
[6] S.-C. Zhu and D. Mumford, "A stochastic grammar of images," Found. Trends. Comput. Graph. Vis., vol. 2, no. 4, pp. 259–362, 2006.
[7] M. Fischler and R. Elschlager, "The representation and matching of pictorial structures," IEEE Trans. Comput., vol. 22, pp. 67–92, 1973.
[8] P. Felzenszwalb and D. Huttenlocher, "Pictorial structures for object recognition," IJCV, vol. 61, no. 1, pp. 55–79, 2005.
[9] Y. N. Wu, Z. Z. Si, H. F. Gong, and S.-C. Zhu, "Learning active basis model for object detection and recognition," IJCV, vol. 90, no. 2, pp. 198–235, 2010.
[10] J. Zhu, T. Wu, S.-C. Zhu, X. Yang, and W. Zhang, "Learning reconfigurable scene representation by tangram model," in WACV, 2012.
[11] P. Felzenszwalb and D. McAllester, "Object detection grammars," University of Chicago, Computer Science TR-2010-02, Tech. Rep., 2010.
[12] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in CVPR, 2005, pp. 886–893.