
Constellation Models for Recognition of Generic Objects

Wei Zhang
School of Electrical Engineering and Computer Science
Oregon State University
[email protected]

Abstract— Recognition of generic classes of objects is one of the most challenging problems in computer vision. Constellation models have been proposed to address this problem; they represent an object as a composition of object parts under spatial constraints. This paper surveys the leading families of constellation models and evaluates them according to criteria defined with respect to the task setting. The paper concludes with a discussion of the insect identification problem and some promising future research directions in this area.

Index Terms— computer vision, constellation model, object recognition.

1. INTRODUCTION

1.1 Recognition of generic objects

Recognition of generic classes of objects, such as distinguishing dogs from cats, is one of the most fundamental problems of computer vision. Although it is quite natural for humans to recognize thousands of categories, generic object recognition continues to be quite a challenge for machine vision systems. Object recognition is referred to as object detection in many computer vision articles. Here we will not differentiate between these two terms; both refer to the generic classification of an object in an image, usually determining whether or not an input image contains an instance of the object class. Two other problems are related to object recognition: object identification and object localization. Object identification is the step after the recognition of the generic category of an object, that is, the recognition of a specific individual or subclass within a class. For example, after recognizing an object as a car, object identification may try to tell you whether it is a "Ford" or a "Toyota". Object localization refers to the positioning of the object within an image. This task is also often called object detection in the literature, but for clarity, object localization will be used for it in this paper.

Significant progress has been made in generic object recognition during the past ten years. Researchers used to be interested in the recognition and identification of relatively rigid objects photographed in strictly controlled environments, for example, front-view human faces photographed against a purely black background, with little illumination, translation, scale or rotation variation and limited view change. For these tasks, state-of-the-art techniques can recognize very accurately (more than 95% accuracy for human face recognition). A much more challenging task is the recognition of generic objects in natural scenes, which is currently the central task studied by the computer vision community. What makes the recognition of generic objects in natural scenes so difficult? Consider the following images, which have been studied by several research groups.

Fig. 1. Typical generic object classes ([11])

From Fig. 1, we can observe that the difficulties of generic object recognition in natural scenes come from the following aspects: (1) Translation, rotation and scale variation of objects: for example, the second motorbike is much smaller than the others. (2) Viewing conditions: there can be great illumination variation and viewpoint change in images; for example, the spotted cat images are taken from different viewing angles. (3) Cluttered background: highly complex backgrounds often confuse a machine's analysis of object features. (4) Partial occlusion: part of the object to be recognized can be blocked, as in the third face image. (5) Intra-class variation: this is the main difficulty for the recognition of generic object classes. For all the generic object classes shown in Fig. 1, there is significant variation in the appearance and shape of objects within one class. How can a generic object class be modeled in a way that is robust to intra-class variations while retaining the ability to distinguish this class from others? This is an important but difficult open problem for computer vision researchers.

1.2 Constellation models for object recognition

It has been commonly accepted by the computer vision community that objects are best represented as compositions of object parts constrained by the spatial relations between them. For example, a human face is composed of eyes, ears, nose, mouth, hair and skin, with the nose lying between the eyes and the mouth. This representation is consistent with the perception of the human vision system. Therefore, it is natural to model generic objects as constellations of parts, i.e. constellation models. Each part has its own appearance and shape properties, and parts are tightly or loosely bound together by class-specific spatial configuration constraints. With constellation models, the recognition of generic objects can be done using their class-characteristic individual parts or the spatial relations between parts.

1.3 Evaluation criteria

According to the formulation of the recognition task and its difficulties introduced in Section 1.1, the constellation models surveyed in this paper are evaluated based on the following criteria:

(1) Representation of objects: whether the model is built on the objects' appearance information, their geometric information, or a combination of appearance and shape. A combination is usually preferred, to make maximal use of the image information.

(2) Learning: unsupervised learning (model trained on unlabeled and unsegmented images); weakly supervised learning (only the example images are labeled, but objects are not marked or segmented); supervised learning (objects are segmented into parts, and each part is labeled). Unsupervised or weakly supervised learning is usually preferred, to minimize human intervention.

(3) Assuming a relatively fixed spatial configuration: if the model assumes relatively fixed spatial relations between object parts, i.e. relatively rigid objects, it can only be guaranteed to work well on object classes whose parts are visually similar and occur in a similar spatial configuration (e.g. cars), and will probably work poorly on deformable objects such as human bodies.

(4) Invariance to translation.

(5) Invariance to rotation.

(6) Invariance to scale variation.

(7) Invariance to affine transformation. Criteria (4)-(7) derive from the fact that objects are usually not normalized or registered on some predetermined grid, and the viewpoint can change during photographing. Therefore, a constellation model that is invariant to these transformations is better.

(8) Robustness to partial occlusion: oftentimes, the object we are interested in is partially blocked from view. A powerful model should be able to infer the generic class of the object from its visible parts.

(9) Robustness to background clutter: in natural scenes, backgrounds are not controlled; a good approach is expected to discriminate between features from the object (foreground) and those from the background.

(10) Robustness to large intra-class variations: many common object classes (e.g. cars, human bodies) have large variation in appearance or geometric configuration across different instances. A robust constellation model should generalize well to instances of one class while retaining its discriminative power against other classes.

(11) Computational efficiency: computational efficiency during both training and testing is evaluated. It is difficult to evaluate computational efficiency across all models because of differing experimental settings; in this paper, efficiency evaluation is only meaningful for closely related approaches within one family.

All the evaluation results given in this paper are qualitative.

1.4 Families of constellation models (summary)

A variety of constellation models have been proposed to deal with the generic object class recognition problem. The approaches differ in their formulation of the problem, the construction of the model, and the learning algorithms. In this paper, we categorize the leading approaches into three families, primarily according to their modeling principles and learning algorithms.

The first family of constellation models mainly comes from the work of Pietro Perona's group [11, 12, 16, 17, 34, 35] at CalTech, David Lowe's group [14] at UBC, and Cordelia Schmid's group [5] at INRIA. These models are introduced in Section 2.1. Some related work from William T. Freeman's group [24, 31, 32] at MIT, and a recently proposed weakly supervised extension [13], are also covered in that section. This family is characterized by flexible probabilistic constellation models of parts constructed from unlabeled and unsegmented images; the models are trained with unsupervised learning. The major advantage of this family is its unsupervised setting: little human intervention is required. In addition, the flexible probabilistic structures make these models robust to background clutter and partial occlusion. However, the early models suffer from high computational cost, and most of them assume a relatively fixed spatial configuration of parts, which makes them sensitive to large intra-class variations.

The second family is a group of approaches from Shimon Ullman's group [8, 33] in Israel and from Cordelia Schmid's group [7, 20] at INRIA, introduced in Section 2.2. These constellation models are constructed by first learning a part detector or part classifier for each part, and then adding structural information to the learned parts. The learning algorithms can be supervised or weakly supervised. These models are much more robust to large intra-class variations, and they can perform well for both recognition and localization of generic objects in highly cluttered scenes. However, their performance depends heavily on the performance of the part detectors or part classifiers: if a part detector or classifier works poorly or is hard to construct, the entire system will not perform well.

The last family of constellation models is similar to the second family, but these approaches do not learn part detectors or classifiers explicitly. Instead, they cluster the image features and use the feature clusters to build representations of the images; object recognition and localization are conducted based on these representations. This family includes work from Dan Roth's group [1] at Illinois, Cordelia Schmid's group [26, 27] at INRIA, R. C. Nelson's group [28, 29, 30] at the University of Rochester, and the paper of J. Bi et al. [4]. It is introduced in Section 2.3. These models are robust to clutter, occlusion, and limited affine variations. The major work of these approaches is focused on the formulation of the image representations.

The evaluation of the leading approaches is summarized in Section 3. A discussion of the Insect Identification problem and some promising future research directions is presented in Section 4. Finally, Section 5 concludes with some open research problems in this area.

2. CONSTELLATION MODEL FAMILIES

2.1 Probabilistic constellation models and unsupervised learning using EM

The research group led by Pietro Perona at CalTech has proposed a series of constellation models [11, 12, 13, 16, 17, 34, 35] for object recognition during the past five years. In these approaches, a generic object class is modeled as sharing characteristic features or parts; each part has a distinctive appearance and spatial position; probabilistic constellations of these parts are constructed from unlabeled and unsegmented images; and the models are trained in an unsupervised setting (with the EM algorithm). To validate the performance of the models, extensive experiments have been conducted on a range of datasets including geometrically constrained classes (e.g. cars, bicycles) and flexible objects (e.g. wild cats) in cluttered scenes. Performance has been compared using ROC curves, ROC equal error rates and precision-recall curves.

2.1.1 Shape-based probabilistic constellation model and unsupervised learning

The first model in this series was proposed by M. Weber et al. in [34] (denoted Weber00 for clarity); a more detailed version of the paper is [35]. In this work, objects are represented as flexible constellations of rigid parts. Three problems are solved simultaneously: segmentation of the training images (acquiring parts), part selection, and estimation of the model parameters. The model is formulated as:

    p(X^o, x^m, h, n, b) = p(X^o, x^m \mid h, n) \, p(h \mid n, b) \, p(n) \, p(b)    (1)

where X^o are the observable parts in an image and x^m are the missing ones; h is a hypothesis, i.e. a possible matching between model parts and image features; b encodes the occlusion information of the model parts; and n represents the number of background detections. p(n) is modeled as a Poisson distribution; p(b) as an explicit table of the joint distribution; p(h \mid n, b) as the fraction of hypotheses consistent with n and b; and finally p(X^o, x^m \mid h, n) = p_{fg}(z) \, p_{bg}(x_{bg}), where p_{fg}(z) is a joint Gaussian distribution and p_{bg}(x_{bg}) has uniform density.

Training of the model: first, parts are detected in each image using an interest operator, and K-means clustering is run on the detected parts, so that feature clusters (patterns) are obtained and noise features are discarded. Then a greedy model-configuration search is performed; this procedure runs the EM algorithm, iterating between two steps: (1) select the part candidates to be used in the model; (2) learn the parameters of the modeled probability densities.

Classification using the model:

    \frac{p(C_1 \mid X^o)}{p(C_0 \mid X^o)} \propto \frac{\sum_h p(X^o, h \mid C_1)}{p(X^o, h_0 \mid C_0)}    (2)

Experiments were conducted to classify rear views of cars, and human faces, against cluttered backgrounds. The approach shows good performance: 93.5% recognition accuracy for faces and 86.5% for cars (equal error rates). There are, however, some limitations: (1) the system is sensitive to rotation and scale variation; generic object classes are modeled primarily on their shape information, and the appearance information of the parts is not incorporated explicitly into the model; (2) the model assumes a relatively fixed spatial configuration of the object parts, so it is sensitive to large appearance and geometric variations within an object class; (3) the approach requires an exhaustive search over all possible matchings between model parts and image features, so the computational cost of training is exponential in the number of parts. This prohibitive expense limits the model to relatively few parts, typically 3-7. For objects which naturally have many parts, for example stoneflies, a great deal of information has to be ignored.

Nevertheless, this model offered a purely probabilistic model for unsegmented and unlabeled images; its training is unsupervised; it is robust to background clutter and partial occlusion; and it performs well for the recognition of generic objects with little human intervention. The approach has attracted attention from many computer vision researchers, and successful approaches have been built upon it.
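To make the source of the exponential training cost concrete, the following is a minimal sketch (not the authors' implementation; the feature locations and the fitted shape density are placeholders) of scoring every matching hypothesis under a joint Gaussian shape model:

```python
import itertools
import numpy as np
from scipy.stats import multivariate_normal

def best_hypothesis(feature_xy, shape_mean, shape_cov, n_parts):
    """Score every hypothesis h that assigns the model's n_parts parts
    to distinct detected features, under the joint Gaussian shape
    density p_fg(z).  The number of ordered matchings is N!/(N - P)!,
    i.e. O(N^P), which is why Weber00-style training is limited to
    roughly 3-7 parts."""
    density = multivariate_normal(mean=shape_mean, cov=shape_cov)
    feature_xy = np.asarray(feature_xy, dtype=float)
    best_h, best_logp = None, -np.inf
    for h in itertools.permutations(range(len(feature_xy)), n_parts):
        z = feature_xy[list(h)].ravel()   # stacked (x, y) part locations
        logp = density.logpdf(z)          # joint Gaussian shape term
        if logp > best_logp:
            best_h, best_logp = h, logp
    return best_h, best_logp
```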

2.1.2 Combining appearance with shape and scale information

In [11], R. Fergus et al. proposed an extension (denoted Fergus03) of the Weber00 model which overcomes the first limitation mentioned above. In this model, the appearance, shape, relative scale and occlusion of the object are modeled together, and their parameters are learned simultaneously using the EM algorithm. The model is formulated as:

    p(X, S, A \mid \theta) = \sum_{h \in H} p(A \mid X, S, h, \theta) \, p(X \mid S, h, \theta) \, p(S \mid h, \theta) \, p(h \mid \theta)    (3)

In this formula, only the features selected by a hypothesis h are evaluated. For each feature, the appearance density p(A \mid X, S, h, \theta) and the relative scale density p(S \mid h, \theta) are Gaussian densities, independent of those of the other features; the shape density p(X \mid S, h, \theta) is a joint Gaussian density over the locations of the features within a hypothesis; and the occlusion term p(h \mid \theta) is modeled with a Poisson distribution. Training of the model involves the greedy search and the EM algorithm, similar to the learning algorithm for Weber00. Classification is performed in a Bayesian manner:

    R = \frac{p(\mathrm{Object} \mid X, S, A)}{p(\mathrm{NoObject} \mid X, S, A)} = \frac{\int p(X, S, A \mid \theta, \mathrm{Object}) \, p(\theta) \, d\theta \cdot p(\mathrm{Object})}{\int p(X, S, A \mid \theta_{bg}, \mathrm{NoObject}) \, p(\theta_{bg}) \, d\theta_{bg} \cdot p(\mathrm{NoObject})} \approx \frac{p(X, S, A \mid \theta_{ML}) \, p(\mathrm{Object})}{p(X, S, A \mid \theta_{bg,ML}) \, p(\mathrm{NoObject})}    (4)

where \theta_{ML} and \theta_{bg,ML} are the maximum likelihood values of \theta and \theta_{bg}. A threshold is set on the value of R to make the object/non-object decision.
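As a small illustration of this decision rule (a sketch, assuming the per-hypothesis log terms and the two fitted models are already available; the function names are ours, not from [11]):

```python
import numpy as np
from scipy.special import logsumexp

def log_class_likelihood(per_hypothesis_log_terms):
    """log p(X, S, A | theta): marginalize Eq. (3) over hypotheses h by
    summing the per-hypothesis log terms (appearance + shape + scale +
    occlusion) in the log domain to avoid underflow."""
    return logsumexp(np.asarray(per_hypothesis_log_terms))

def is_object(log_lik_obj, log_lik_bg, log_prior_ratio=0.0, log_threshold=0.0):
    """Eq. (4) with the parameter integrals replaced by their ML point
    estimates: threshold the log of the ratio R."""
    log_R = log_lik_obj - log_lik_bg + log_prior_ratio
    return log_R > log_threshold
```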

By modeling the generic object class in this way, the system is not sensitive to scale variation, and the appearance information of the parts is modeled explicitly. Moreover, by using more efficient interest operators, the model can be made less sensitive to rotational variation. It thus overcomes the first limitation of the Weber00 model, although it is still quite sensitive to large intra-class variation.

The Fergus03 model provides a probabilistic way to combine the appearance and geometric information of parts; it is robust to occlusion, clutter, rotation and scale variation; and it is a generic model that performs well on various datasets. R. Fergus et al. reported experiments on the recognition of several object classes in natural scenes, with results evaluated using ROC equal error rates and recall-precision curves. The experiments demonstrated the power of constellation models for the recognition of generic objects and encouraged researchers to develop a series of papers [5, 12, 13] following this direction; they are covered in the next section.

2.1.3 Extensions of the Weber00 and Fergus03 models and related work

In [12], the representation of the features is extended to D = [A, G], called "heterogeneous parts": both the appearance of image regions (A) and the shape of curves (G, geometric information) are represented. A robust learning algorithm based on RANSAC is proposed for training the model, and the feature type (A or G) is selected automatically based on the variances of the scoring function. This model was applied to Google image filtering (i.e. re-ranking the outputs of the Google image search engine), and the experimental results were evaluated using precision at 15% recall. A significant improvement in performance was achieved under both the unsupervised setting and the supervised setting.

In [5], instead of modeling the generic object class as a one-layer collection of parts, it is modeled as a hierarchy of parts and subparts, with objects at the top and image features extracted by interest operators at the bottom. Spatial transformations (translation and scaling) softly relate the parts and sub-trees to their parents. The model is initialized by clustering on the most discriminative images; it is trained using the EM algorithm; and recognition is conducted by bottom-up voting. This model inherits the basic ideas of the Weber00 and Fergus03 models, while its flexible hierarchical structure combines some of the advantages of "bag of features" approaches with the value of spatial relation information; state-of-the-art interest point detectors and descriptors are used to achieve strong performance and fast training for the recognition of generic object classes. However, this approach assumes that object parts are stable and distinctive in their identities and positions, which may be too strong an assumption for some highly complex generic object classes. This model will be revisited later in this paper.

The research group led by Kevin P. Murphy and William T. Freeman developed a series of object recognition and localization techniques [24, 31, 32] related to the idea of probabilistic constellation models (Weber00 and Fergus03). The innovation is the use of contextual information to provide strong priors for object recognition in natural scenes. In [31] and [32], coarse descriptions of scenes (the "gist") are combined with local object features to detect objects in video clips, with a tree-structured graphical model formulating the dependencies. In [24], the relations between the objects within a scene are modeled using densely connected random fields with boosting. Although this work differs somewhat from models that encode the relations between object parts, it suggests a way to construct constellation models with more elegant structures (i.e. CRFs).

2.1.4 More computationally efficient models

In spite of this encouraging progress, computational cost still limits the application of the Weber00 and Fergus03 models. Fortunately, several methods [13, 14, 16, 17] have been proposed to address this problem.

In Equation (4), the integral over all possible parameters \theta is approximated by the maximum likelihood value \theta_{ML}. This implies that sufficient training data is required to make the estimation accurate, which gives rise to the high computational cost. A novel learning method (denoted Li03) was proposed by F.F. Li et al. in [16] and [17] to train the model with only a few images. The basic idea is to incorporate general knowledge about object categories as a prior probability density function over \theta. The basic model is identical to the Fergus03 model except for several simplifications: only appearance and shape are considered; scale information from the features is used to transform them into a scale-invariant space; shape is independent of appearance; and no occlusion is considered. Given a few training images \{X_t, A_t\}, the posterior model is:

    p(\theta \mid X_t, A_t) = p(\pi) \prod_{\omega} p(\Gamma_\omega^X) \, p(\mu_\omega^X \mid \Gamma_\omega^X) \, p(\Gamma_\omega^A) \, p(\mu_\omega^A \mid \Gamma_\omega^A)    (5)

with \theta = \{\pi, \mu^X, \mu^A, \Gamma^X, \Gamma^A\} composed of \Omega components. Prior knowledge is represented by the priors on \pi (symmetric Dirichlet) and on (\mu_\omega^X, \Gamma_\omega^X) and (\mu_\omega^A, \Gamma_\omega^A) (both Normal-Wishart). The estimation of p(\theta \mid X_t, A_t) is implemented by a Variational Bayesian EM (VBEM) method. Given a test image \{X, A\}, the likelihood p(X, A \mid \theta) is calculated by accumulating over the posterior models, and Equation (4) can then be used for recognition.

The approach introduced above is a batch learning method, meaning that all training examples are used simultaneously, so the learning speed is still not satisfactory. Therefore, an incremental learning algorithm, based on Neal and Hinton's adaptation of EM, was proposed in [17] to further improve computational efficiency. The incremental approach was tested on the '101 categories' problem and showed strong performance. Compared with the Weber00 and Fergus03 models, the main advantage of the Li03 model is its fast learning without much loss in performance: even when trained with only a few images, the generic prior distribution becomes much more class-specific. On the other hand, the simplifications made in this model can cause it to work poorly for the recognition of partially occluded objects; the incremental learning algorithm turns out to be worse for large training sets; and the model is still not robust to large intra-class variation.
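The following deliberately simplified sketch conveys the core idea with a conjugate Gaussian prior on a single part-appearance mean (known isotropic noise, without the Normal-Wishart priors or VBEM machinery of Li03, so it is only an analogy):

```python
import numpy as np

def posterior_part_appearance(prior_mu, prior_var, descriptors, noise_var):
    """Conjugate update of a Gaussian prior N(prior_mu, prior_var * I) on
    a part-appearance mean, given a handful of observed descriptors with
    assumed known isotropic noise variance.  Even 2-3 examples shift the
    generic prior toward a class-specific posterior, which is the effect
    Li03 obtains with its richer priors."""
    obs = np.asarray(descriptors, dtype=float)
    n = obs.shape[0]
    post_var = 1.0 / (1.0 / prior_var + n / noise_var)
    post_mu = post_var * (np.asarray(prior_mu, dtype=float) / prior_var
                          + obs.sum(axis=0) / noise_var)
    return post_mu, post_var
```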

S. Helmer et al. proposed a new model (denoted Helmer04) in [14] based on the Fergus03 model, which greatly improves computational efficiency while maintaining the advantages of the traditional model. In this approach, instead of learning all the parts simultaneously, the parts are learned incrementally, and the expensive exhaustive search over all possible matchings is replaced by a search over only a few dominant matchings, so the computational cost is reduced to almost linear in the number of parts. The model is a slightly different version of the Fergus03 model:

    p(F \mid \theta) = \sum_{h \in H} p(X, S, A, h \mid \theta) = \sum_{h \in H} p(A \mid h, \theta) \, p(X \mid h, \theta) \, p(S \mid X, h, \theta) \, p(h \mid \theta)    (6)

In this model, the appearance (A) and the locations (X) of the parts are independent of each other given the hypothesis h. The construction of the model starts from a few parts selected based on their appearance densities. Then a new part is added that maximally increases the likelihood of the training data under the new model, and the parameters of the new model are estimated using the EM algorithm. This procedure is repeated until enough parts have been added or the desired accuracy is achieved. Before a new part is added, the dominant matchings for each image are found with the current model by maximizing the appearance likelihood (A*); each candidate feature is then added temporarily to the model as a new part, and one only needs to consider whether the new part improves the few previous dominant matchings. Thus the search space is reduced greatly. Although this approach sometimes suffers from getting stuck in local minima and from poor performance due to random initialization, the experimental results show that it outperforms the Fergus03 model in both computational speed and recognition accuracy.

In [13], R. Fergus et al. proposed a heterogeneous star model (denoted Fergus05) based on the Fergus03 model, which not only significantly reduces the computational cost of the traditional model but also makes it less sensitive to large intra-class variation. The computational cost is reduced by introducing a partially connected "star" model to replace the traditional fully connected model. As illustrated in Fig. 2, the simplified dependency structure shrinks the search space significantly, reducing the learning complexity from O(N^P) to O(N^2 P) or O(NP). Better tolerance to large intra-class variation is achieved by using the "heterogeneous parts" proposed in [12].

Fig. 2. Fully-connected model (left) and "star" model (right)
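A minimal sketch of why the star factorization collapses the search space (the scores and array layouts are invented for illustration): in the fully connected model all P part assignments must be scored jointly, while here each non-landmark part is optimized independently given the landmark.

```python
import numpy as np

def best_star_assignment(landmark_scores, pairwise_scores):
    """Max-product inference in a star model.  Given a candidate landmark
    feature l, every non-landmark part picks its best feature
    independently, so the total cost is O(P * N^2) rather than the
    O(N^P) joint search of the fully connected model.

    landmark_scores : (N,) appearance score of each feature as landmark.
    pairwise_scores : (P-1, N, N) score of part p taking feature j given
                      landmark feature l (appearance + relative shape)."""
    landmark_scores = np.asarray(landmark_scores, dtype=float)
    pairwise_scores = np.asarray(pairwise_scores, dtype=float)
    best_total, best_l = -np.inf, None
    for l in range(len(landmark_scores)):             # candidate landmark
        total = landmark_scores[l] + sum(
            pairwise_scores[p, l].max()               # each part independently
            for p in range(pairwise_scores.shape[0]))
        if total > best_total:
            best_total, best_l = total, l
    return best_l, best_total
```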

2.2 Constellation models constructed from learned parts

As introduced in Section 2.1, the Fergus03 model is formulated as a purely probabilistic model over the parts; the parameters for the parts (A, S) and for the structure (X) are learned simultaneously in an unsupervised fashion. However, there are also state-of-the-art constellation models [7, 8, 20, 33] built in a different way: first, a class-specific part detector or part classifier is learned for each object part; then structural information is added to the learned parts to construct the final constellation model. Compared with the models surveyed in Section 2.1, these models are much more robust to large intra-class variations; they are easy to implement; and they can perform well for both recognition and localization of generic object classes in cluttered scenes. But the performance of these models relies heavily on the quality of their individual part detectors or part classifiers. Therefore, the major effort of these approaches goes into formulating and learning the part detectors or part classifiers.

2.2.1 Recognition with discriminative "active" fragments

In [33], S. Ullman et al. proposed a model (denoted Ullman01) in which class-specific fragments are used as building blocks to compose objects in a generic class. The spatial relations between the fragments (parts) are not explicitly modeled, but the co-occurrence relations between parts can be modeled in a Bayesian manner. First, class-specific candidate fragments are segmented from images; the candidate fragments are then selected on the basis of their merit and used on the basis of their distinctiveness. Merit is defined as I(C, F) = H(C) - H(C|F), and distinctiveness as p(F|C) / p(F|NC). Thus, the fragments that are common in positive images but rare in negative images are most significant for the final decision. Finally, for new images, recognition is performed by first finding 'active' fragments in each image via similarity matching, and then classifying the images by accumulating the evidence of their 'active' fragments. Two schemes were explored: (1) naive Bayes, which assumes independence between different fragment types and uses only the occurrence information of fragments; (2) dependence-tree combination, which uses both the occurrence and the co-occurrence information of fragments.

This approach automates the part selection procedure, and it is tolerant to intra-class variation and cluttered backgrounds. Its high performance on the recognition of faces and cars in cluttered scenes demonstrated the value of models built from learned parts. However, two limitations make it suboptimal: (1) the exhaustive search for best matches makes the computational cost very high; oftentimes, human intervention is required to finish recognition in a reasonable time; (2) the detection of corresponding fragments (e.g. eyes, legs) is a difficult task for some objects.

The Ullman01 model relies on similarity-based part detectors to find corresponding object parts in images. For some images, however, the appearance of an object part can vary greatly, which makes appearance-similarity-based detectors perform poorly. Recently, a new method was proposed by B. Epshtein et al. in [8] to detect semantically equivalent parts in images, and it proves valuable when combined with the Ullman01 model. In this approach, equivalent parts F are detected by using their context fragments C, defined as fragments that co-occur with F consistently in a relatively fixed spatial configuration. This approach can be applied to the construction of part-based constellation models to improve their recognition performance.
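As a concrete reading of the merit criterion (a sketch for a binary fragment-present variable; the fragment detection itself is assumed given):

```python
import numpy as np

def entropy(p):
    """Binary entropy in bits."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def fragment_merit(class_labels, fragment_present):
    """Mutual information I(C;F) = H(C) - H(C|F) between the class label
    C and the binary 'fragment detected' variable F (both 0/1 arrays).
    Fragments common in positives but rare in negatives score highest."""
    c = np.asarray(class_labels, dtype=float)
    f = np.asarray(fragment_present, dtype=float)
    h_c = entropy(c.mean())
    p_f = f.mean()
    h_c_given_f = (p_f * entropy(c[f == 1].mean() if f.any() else 0.0)
                   + (1 - p_f) * entropy(c[f == 0].mean()
                                         if (f == 0).any() else 0.0))
    return h_c - h_c_given_f
```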

2.2.2 Recognition by probabilistic assembly of part detectors

In [20], a new method (denoted KM04) was proposed by K. Mikolajczyk et al. to learn part detectors using AdaBoost and assemble the parts into a probabilistic model. The assumption of this approach is that segmented training images are available for each individual object part, for example head images and leg images. All the training images are represented in a way similar to SIFT [18, 19], that is, multi-scale orientation groups registered in a local coordinate frame. The detector for each object part is trained separately on its image features. Each detector is a cascade of strong classifiers; each strong classifier is a linear combination of weak classifiers; and each weak classifier is the log likelihood ratio of the probability of feature occurrence and co-occurrence on objects against that on non-objects. The learned part detectors can then be applied to new images to detect individual object parts, i.e. to find the local maxima of their outputs. To exploit the spatial relations between the parts, individual part detectors are combined using a Gaussian model G(x_1 - x_2, y_1 - y_2, \sigma_1/\sigma_2), where \{x_1, y_1, \sigma_1\} and \{x_2, y_2, \sigma_2\} are the locations and scales of the part pair. A joint part detector can be built incrementally. Suppose a head (H) is detected by its individual detector; this detection can be used to raise the confidence of the upper-body detection (U). The joint part detector detects the upper body by thresholding the following score:

    D_{U \mid H}(x, y, \sigma) = D_U(x, y, \sigma) + G(x_H - x, \, y_H - y, \, \sigma_H/\sigma) \cdot D_H(x_H, y_H, \sigma_H)    (7)

Thus both the appearance of the object parts and their relative spatial configurations are used for part detection. Object recognition can then be performed using the following formula, which is similar to Equation (4):

    \frac{p(\mathrm{Object} \mid R, F)}{p(\mathrm{NoObject} \mid R, F)} = \frac{p(\mathrm{Object})}{p(\mathrm{NoObject})} \cdot \frac{p(R \mid F, \mathrm{Object})}{p(R \mid F, \mathrm{NoObject})} \cdot \frac{p(F \mid \mathrm{Object})}{p(F \mid \mathrm{NoObject})}    (8)

where R represents the geometric information of the parts and F represents the image features. The term over R is given by the joint part detectors and the term over F by the individual part detectors. This approach provides robust detectors for object parts; both appearance and shape information of the parts are used; and it can recognize generic object classes in highly complex scenes. However, it assumes segmented, normalized images for each part, which is very expensive or impossible to provide for many object recognition tasks.
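A sketch of the chaining rule in Equation (7), with an invented Gaussian compatibility term standing in for G (the individual detector outputs D_U and D_H are assumed given):

```python
import numpy as np

def gaussian_compat(dx, dy, scale_ratio, var_xy=25.0, var_log_s=0.25):
    """Crude stand-in for G(x1-x2, y1-y2, sigma1/sigma2): penalize
    displacement and log scale ratio; the variances are placeholders."""
    return np.exp(-(dx ** 2 + dy ** 2) / (2.0 * var_xy)
                  - np.log(scale_ratio) ** 2 / (2.0 * var_log_s))

def joint_score(d_upper, upper_pos, d_head, head_pos):
    """Eq. (7): a confident head detection (d_head at head_pos) raises
    the upper-body score d_upper at upper_pos.  Positions are
    (x, y, sigma) tuples."""
    (x, y, s), (xh, yh, sh) = upper_pos, head_pos
    return d_upper + gaussian_compat(xh - x, yh - y, sh / s) * d_head
```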

2.2.3 Recognition by combining discriminative part classifiers

A feature selection method (Dorko04) was proposed by G. Dorko et al. in [7] which can easily be developed into a generic object class recognition system. This approach is similar to Ullman01 and KM04 in that it also first learns a part classifier for each part and then makes the object/non-object decision based on the learned part classifiers. But the 'parts' here are not necessarily real object parts; they are just feature clusters, and the geometric relations between them are very loose. The procedure is quite simple: first, image features are extracted by interest operators and descriptors; then EM clustering is used to build feature clusters (parts), and a part classifier is constructed for each cluster; the part classifiers are then selected according to their discriminative power, and the selected classifiers can be used to detect parts in new images; finally, object recognition is performed by simply thresholding the number of selected features in an image. The advantages of this approach are threefold: (1) a weakly supervised setting with automatic feature selection; (2) robustness to occlusion and cluttered backgrounds; (3) a natural way to combine different interest operators. On the other hand, the geometric information of the object is little used, and the overall performance relies heavily on the quality of feature detection and description.

2.3 Feature-cluster-based representations and weakly supervised learning

Another branch of approaches [1, 26, 27] constructs constellation models in yet another way. First, interest regions are extracted in each image using interest operators, and each region is described by some descriptor, giving a collection of features. These features are then clustered to form feature clusters that are more robust to variations. Finally, each image is represented in terms of these feature clusters, and recognition is performed by learning a classifier directly from the image representations. The major work in these approaches is the construction of the image representation. These approaches are robust to intra-class variations, partial occlusions, and background clutter.

2.3.1 Sparse, part-based representation of generic objects

A sparse, part-based representation (Agarwal04) of objects was proposed by S. Agarwal et al. in [1]. Based on this representation, a discriminative classifier can be learned for the recognition and localization of generic objects. The procedure of Agarwal04 can be summarized as follows. First, interest operators find interest regions in representative images, and image patches are cropped from the selected regions. Then, clustering is performed on the patches based on a similarity measure, constructing a part vocabulary. Next, each image is represented by a high-dimensional sparse binary feature vector built from the vocabulary; both the part labels and the spatial relations among the parts are incorporated into the representation. Finally, a classifier is learned from these representations in a weakly supervised setting; to take advantage of the sparseness of the features, the SNoW algorithm is used for learning. The learned classifier can make the object/non-object decision (recognition) by accumulating confidence, or localize the specific object class in images using an activation map or activation pyramid. The method has been demonstrated to be robust to partial occlusion and cluttered backgrounds, and computationally efficient in both training and testing. However, the spatial relations between parts are assumed to be relatively fixed, which makes the approach perform poorly on some 'textured' deformable objects. The overall performance of Agarwal04 relies heavily on the interest operators finding meaningful parts of the object; and, of course, a great deal of effort has to be spent on the construction of the part vocabulary in order to 'cover' the appearance variations among the instances of a generic class.
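A minimal sketch of such a sparse binary encoding (the pairwise-relation quantization below is our illustrative choice, not the exact encoding of [1]):

```python
import numpy as np

def sparse_binary_representation(part_ids, part_xy, vocab_size, n_angle_bins=20):
    """Agarwal04-style image encoding: one bit per vocabulary part that
    occurs in the image, plus one bit per (ordered part pair, quantized
    relative direction).  Returns the sorted indices of the active bits,
    the natural input for a sparse learner such as SNoW."""
    part_xy = np.asarray(part_xy, dtype=float)
    active = set(int(p) for p in part_ids)          # part-occurrence bits
    for i in range(len(part_ids)):
        for j in range(len(part_ids)):
            if i == j:
                continue
            dx, dy = part_xy[j] - part_xy[i]
            abin = int((np.arctan2(dy, dx) % (2 * np.pi))
                       / (2 * np.pi) * n_angle_bins) % n_angle_bins
            # relation bits live past the first vocab_size occurrence bits
            bit = (vocab_size
                   + (part_ids[i] * vocab_size + part_ids[j]) * n_angle_bins
                   + abin)
            active.add(int(bit))
    return sorted(active)
```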

2.3.2 Two-layer representations and weakly supervised learning

In [26] and [27], C. Schmid proposed a method (Schmid01) to represent positive and negative images by two-layer descriptors. In these descriptors, the geometric relations between parts are not modeled rigidly (as in Agarwal04); instead, a kind of 'spatial-frequency' description (the second layer) adds flexible spatial constraints to the appearance-based description (the first layer). Therefore, there is no assumption of relatively fixed spatial relations between parts, and the approach can work well on some highly textured deformable objects, such as zebras.

For each pixel p_l with gray-value descriptor d_l, the construction of its first-layer representation ('generic' descriptor) is simple: find the most probable cluster for d_l, and use the cluster index C*(p_l) as the 'generic' descriptor. The second-layer representation ('spatial-frequency' cluster) is constructed from the conditional joint probability distribution of the 'neighborhood-frequency' descriptor v_l with respect to the center, p(V_i) = p(v_l \mid C^*(p_l) = C_i), which is multi-normal. The index of the most probable 'spatial-frequency' cluster, V^*(v_l \wedge d_l) = V_{ij}, is used as the second-layer descriptor of the pixel. For each second-layer descriptor V_{ij}, its significance Sig(V_{ij}) is evaluated by its mutual information with the class label, similar to the method used in Dorko04. The score of a pixel is then

    p(M \mid p_l) = p(C^*(p_l) \mid d_l) \cdot p(V^*(p_l) \mid v_l \wedge d_l) \cdot \mathrm{Sig}(V^*(p_l) \mid M)

and recognition is performed by simply accumulating the scores of all the pixels in an image.

The beauty of Schmid01 is its two-layer description and weakly supervised learning, which make it perform well for the recognition of highly textured objects without any region segmentation or feature extraction. But the spatial relations in this approach are quite loose, and the system is constructed mainly from appearance information; if the objects do not have very characteristic texture patterns (e.g. zebra stripes), it probably will not work very well. Moreover, during recognition the pixels in an image are assumed to be independent, which is problematic for many common objects.
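A minimal sketch of this accumulation step (the per-pixel scores are assumed to be computed already from the two-layer descriptors):

```python
import numpy as np

def image_score(pixel_scores):
    """Accumulate per-pixel evidence p(M | p_l) over the image in the
    log domain (pixels treated as independent, the assumption noted
    above); clipping avoids log(0)."""
    scores = np.clip(np.asarray(pixel_scores, dtype=float), 1e-12, 1.0)
    return float(np.log(scores).sum())
```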

3. EVALUATION RESULTS

According to the criteria defined in Section 1.3, the evaluation of the leading constellation model approaches can be summarized as follows:

TABLE 1
EVALUATION OF CONSTELLATION MODELS

| Model     | Representation of objects | Learning          | Assuming relatively fixed spatial configuration |
|-----------|---------------------------|-------------------|-------------------------------------------------|
| Weber00   | Shape                     | Unsupervised      | Yes                                             |
| Fergus03  | Combination               | Unsupervised      | Yes                                             |
| Li03      | Combination               | Unsupervised      | Yes                                             |
| Helmer04  | Combination               | Unsupervised      | No                                              |
| Fergus05  | Combination               | Weakly supervised | Yes                                             |
| Ullman01  | Appearance                | Supervised        | No                                              |
| KM04      | Combination               | Supervised        | Yes                                             |
| Dorko04   | Appearance                | Weakly supervised | No                                              |
| Agarwal04 | Combination               | Weakly supervised | Yes                                             |
| Schmid01  | Appearance                | Weakly supervised | No                                              |

| Model     | Translation invariant | Rotation invariant | Scale invariant | Affine invariant |
|-----------|-----------------------|--------------------|-----------------|------------------|
| Weber00   | Yes                   | No                 | No              | No               |
| Fergus03  | Yes                   | Somewhat           | Yes             | No               |
| Li03      | Yes                   | Somewhat           | Yes             | No               |
| Helmer04  | Yes                   | Yes                | Yes             | No               |
| Fergus05  | Yes                   | Yes                | Yes             | No               |
| Ullman01  | Yes                   | Yes                | Yes             | Somewhat         |
| KM04      | Yes                   | Yes                | Yes             | No               |
| Dorko04   | Yes                   | Yes                | Yes             | Somewhat         |
| Agarwal04 | Yes                   | Yes                | Yes             | Somewhat         |
| Schmid01  | Yes                   | Yes                | Yes             | Somewhat         |

| Model     | Robust to partial occlusion | Robust to cluttered background | Robust to large intra-class variations | Efficiency (train) | Efficiency (test) |
|-----------|-----------------------------|--------------------------------|----------------------------------------|--------------------|-------------------|
| Weber00   | Yes                         | Yes                            | No                                     | No                 | Yes               |
| Fergus03  | Yes                         | Yes                            | No                                     | No                 | Yes               |
| Li03      | No                          | Yes                            | No                                     | Yes                | Yes               |
| Helmer04  | Yes                         | Yes                            | Somewhat                               | Yes                | Yes               |
| Fergus05  | Yes                         | Yes                            | Somewhat                               | Yes                | Yes               |
| Ullman01  | Somewhat                    | Yes                            | Yes                                    | No                 | No                |
| KM04      | Yes                         | Yes                            | Yes                                    | Yes                | Yes               |
| Dorko04   | Yes                         | Yes                            | Yes                                    | No                 | Yes               |
| Agarwal04 | Yes                         | Yes                            | Somewhat                               | Yes                | Yes               |
| Schmid01  | Yes                         | Yes                            | Yes                                    | Not sure           | Not sure          |


Somewhat: the model has the property only to some extent. For example, the Ullman01 model is invariant to affine transformations of objects only within a limited angle. Not sure: the model is hard to evaluate according to this criterion, often because the authors did not give enough information about the property. Again, it is worth noting that some results are only meaningful for comparisons between closely related approaches within one family.

4. DISCUSSION OF THE INSECT IDENTIFICATION PROBLEM AND FUTURE RESEARCH

As presented in the previous sections, various elegant constellation models have been proposed for the object recognition task defined in Section 1.1, that is, the recognition of generic objects in natural scenes. Insect identification is an important direction in the field of ecosystem informatics. The object recognition problem in this project is to recognize the generic categories of stoneflies in images photographed through a microscope. This problem is essentially similar to the central task studied by the computer vision community, but it differs in several respects. Fig. 3 shows some examples of the insect images we are studying; the rows correspond to the categories of stoneflies that we want to recognize. From these images, we can see that most of the difficulties described in Section 1.1 are present in this problem: there are significant translation, rotation, scale and viewpoint variations in these images; there is large intra-class variation in appearance and shape for each class, comparable to or even worse than the images studied by the community; and stoneflies are semi-rigid objects whose spatial configurations are usually not consistent.


Fig 3. Examples of stonefly images

The stonefly images also have some characteristics that differ from the popular task setting: (1) Different classes are not distinctive from each other. The central task studied by the community focuses on the recognition of generic object classes, so one object class is usually quite distinct from the others in either appearance or shape; for example, human faces are quite different from cars and airplanes in both contour and interior texture. But stonefly classes are quite similar to each other in both shape and texture; there are few distinctive class-specific patterns on stonefly bodies, and some examples are difficult to recognize even with our own eyes. (2) Each insect sample currently has 20 images photographed from different views. This multi-view information is not available in most of the problems studied by the community, and it can be used to improve recognition performance. (3) There is much less background clutter and occlusion in stonefly images: all the images are photographed against a blue background (although sometimes with bubbles), and the objects are never blocked from view.

Due to these characteristics of the Insect Identification problem, most of the models surveyed in this paper can hardly be applied to it directly: the models that are not very robust to large intra-class variations (Weber00, Fergus03, and Li03), the models that assume relatively fixed spatial relations between parts (Weber00, Fergus03, Li03, Fergus05, KM04 and Agarwal04), and the models that assume distinct characteristic texture patterns (Schmid01). These models may, however, become applicable to the Insect Identification problem when modified or combined with other techniques.

One motivation derives from the properties of the Insect Identification problem and the emergence of various elegant feature selection methods for object recognition, including the work of A. Opelt et al. [25] and the work of G. Dorko et al. [7]; the latter was introduced in Section 2.2.3. The stonefly images are characterized by large intra-class variation and similarity across classes. Therefore, even with state-of-the-art interest operators and descriptors, most of the extracted features are noise features: they contribute little to recognition and instead overwhelm the truly discriminative features and degrade performance. It is therefore promising to build a feature selection procedure into the construction of constellation models. Presumably, models built on a few consistent and discriminative features (parts) will work much better than ones constructed without feature selection.

Another idea is inspired by the work of G. Bouchard et al. [5] and the work of Shimon Ullman's group [9] in Israel. When we look at stonefly images, we do not observe at a single scale or resolution; we usually recognize in a global-to-local manner. For example, we first look at the overall shape of the object, and then focus on local texture patterns, geometric patterns, and so on. This procedure can be simulated to some extent by a multi-scale, global-to-local, part-based hierarchical model of the object class. Image segmentation techniques, interest point detectors and feature selection methods can also be incorporated into the construction of the model. Hopefully, such a model will better capture informative features and in turn improve recognition performance.

In order to make use of the multi-view information available in stonefly images, models for 3D object recognition can be explored and tailored to the Insect Identification problem. These techniques include the work of A. Selinger et al. [28, 29, 30], among others.
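As a simple illustration of one way to exploit the 20 views (a naive log-odds sum under an assumed independence of views; the per-view classifier is hypothetical):

```python
import numpy as np

def fuse_views(per_view_log_odds):
    """Fuse the outputs of a (hypothetical) per-view classifier for one
    specimen by summing log-odds over its ~20 views, i.e. a naive-Bayes
    combination that treats the views as independent evidence."""
    total = float(np.sum(per_view_log_odds))
    return total > 0.0, total
```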

5. OPEN PROBLEMS

In the research on constellation models for generic object recognition, several substantial problems remain open for the computer vision community. One is the extraction and description of object parts: how can consistent and discriminative features be found, and how should the appearance and shape of parts be described so as to achieve invariance to translation, rotation, scale and affine transformations? Another problem is the structure of the model: should the structure of the parts be purely probabilistic (soft) or deterministic; fully connected or partially connected; one-layer or hierarchical? A final interesting problem is how to construct a truly generic constellation model that generalizes well to hundreds or even thousands of object classes, for example, the "101 Object Classes" problem being studied by several groups [17].

CONCLUSION

Current state-of-the-art constellation models have been grouped into three families according to their modeling methods and learning algorithms. Each family has been surveyed and evaluated to identify its strengths and weaknesses for the recognition of generic objects. The probabilistic constellation models (the first family) have the advantage of their unsupervised setting and their robustness to occlusion and clutter, but suffer from high computational cost and sensitivity to large intra-class variations. The constellation models in the second family are constructed directly from learned parts; high-performance part detectors or classifiers can provide these models with robustness to intra-class variations, occlusion and clutter. This is also the main advantage of the constellation models in the third family, which form feature-cluster-based representations of images and train corresponding classifiers with weakly supervised learning. But some models in the second and third families are also computationally expensive, and the supervised setting of some models can be troublesome for some object recognition problems.

The Insect Identification problem differs from the central task studied by the computer vision community in several respects. Its characteristics imply that most of the models surveyed in this paper cannot be applied to it directly. Some possible solutions have been proposed and are left for future research.

REFERENCES

[1] S. Agarwal, A. Awan, and D. Roth. "Learning to detect objects in images via a sparse, part-based representation". IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 26(11):1475-1490, 2004.
[2] Y. Amit and D. Geman. "A computational model for visual selection". Neural Computation, 1998.
[3] Y. Amit. "2D Object Detection and Recognition". MIT Press, 2002.
[4] J. Bi and Y. Chen. "A sparse support vector machine approach to region-based image categorization". Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR) 2005, San Diego, California, 2005.
[5] G. Bouchard and B. Triggs. "Hierarchical part-based visual object categorization". Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR) 2005, San Diego, California, 2005.
[6] D. Crandall, P. Felzenszwalb, and D. Huttenlocher. "Spatial priors for part-based recognition using statistical models". Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR) 2005, San Diego, California, 2005.
[7] G. Dorko and C. Schmid. "Object class recognition using discriminative local features". IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), submitted, 2004.
[8] B. Epshtein and S. Ullman. "Identifying semantically equivalent object fragments". Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR) 2005, San Diego, California, 2005.
[9] B. Epshtein and S. Ullman. "Feature hierarchies for object classification". Unpublished, 2005.
[10] P. Felzenszwalb and D. Huttenlocher. "Pictorial structures for object recognition". International Journal of Computer Vision, 61(1), 2005.
[11] R. Fergus, P. Perona, and A. Zisserman. "Object class recognition by unsupervised scale-invariant learning". Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR) 2003, Madison, Wisconsin, USA, 2003.
[12] R. Fergus, P. Perona, and A. Zisserman. "A visual category filter for Google images". 8th European Conference on Computer Vision (ECCV), 2004.
[13] R. Fergus, P. Perona, and A. Zisserman. "A sparse object category model for efficient learning and exhaustive recognition". Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR) 2005, San Diego, California, 2005.
[14] S. Helmer and D. G. Lowe. "Object recognition with many local features". Workshop on Generative Model Based Vision (GMBV) 2004, Washington, D.C., 2004.
[15] A. Holub and P. Perona. "A discriminative framework for modeling object classes". Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR) 2005, San Diego, California, 2005.
[16] F.F. Li, R. Fergus, and P. Perona. "A Bayesian approach to unsupervised one-shot learning of object categories". Proc. of 9th Int'l Conf. on Computer Vision (ICCV) 2003, Nice, France, pages 1134-1141, 2003.
[17] F.F. Li, R. Fergus, and P. Perona. "Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories". Computer Vision and Image Understanding, in press.
[18] D. G. Lowe. "Object recognition from local scale-invariant features". International Conference on Computer Vision (ICCV), Corfu, Greece, pages 1150-1157, 1999.
[19] D. G. Lowe. "Distinctive image features from scale-invariant keypoints". International Journal of Computer Vision, 60(2):91-110, 2004.
[20] K. Mikolajczyk, C. Schmid, and A. Zisserman. "Human detection based on a probabilistic assembly of robust part detectors". European Conference on Computer Vision (ECCV), 2004.
[21] P. Moreels, M. Maire, and P. Perona. "Recognition by probabilistic hypothesis construction". European Conference on Computer Vision (ECCV), 2004.
[22] P. Moreels and P. Perona. "Common-frame model for object recognition". Advances in Neural Information Processing Systems (NIPS), 2004.
[23] E. N. Mortensen, H. Deng, and L. Shapiro. "A SIFT descriptor with global context". Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR) 2005, San Diego, California, 2005.
[24] K. Murphy, A. Torralba, and W. Freeman. "Using the forest to see the trees: a graphical model relating features, objects and scenes". Neural Information Processing Systems (NIPS), 2003.
[25] A. Opelt, M. Fussenegger, A. Pinz, and P. Auer. "Weak hypotheses and boosting for generic object detection and recognition". Proceedings of the European Conference on Computer Vision (ECCV), pages 71-84, 2004.
[26] C. Schmid. "Constructing models for content-based image retrieval". Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR) 2001, volume 2, pages 39-45, 2001.
[27] C. Schmid. "Weakly supervised learning of visual models and its application to content-based retrieval". International Journal of Computer Vision, 2004.
[28] A. Selinger and R. C. Nelson. "A perceptual grouping hierarchy for appearance-based 3D object recognition". Computer Vision and Image Understanding, 76(1):83-92, 1999.
[29] A. Selinger and R. C. Nelson. "Minimally supervised acquisition of 3D recognition models from cluttered images". Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR) 2001, Kauai, Hawaii, vol. 1, pages 213-220, 2001.
[30] A. Selinger and R. C. Nelson. "Appearance-based object recognition using multiple views". Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR) 2001, Kauai, Hawaii, vol. 1, pages 905-911, 2001.
[31] A. Torralba, K. Murphy, W. Freeman, and M. Rubin. "Context-based vision system for place and object recognition". Proc. of 9th Int'l Conf. on Computer Vision (ICCV), 2003.
[32] A. Torralba, K. Murphy, and W. Freeman. "Contextual models for object detection using boosted random fields". Neural Information Processing Systems (NIPS), 2004.
[33] S. Ullman, E. Sali, and M. Vidal-Naquet. "A fragment-based approach to object representation and classification". 4th International Workshop on Visual Form, Capri, Italy, 2001.
[34] M. Weber, M. Welling, and P. Perona. "Unsupervised learning of models for recognition". Proceedings of the 6th European Conference on Computer Vision (ECCV), Dublin, Ireland, pages 18-32, 2000.
[35] M. Weber. "Unsupervised learning of models for object recognition". PhD thesis, California Institute of Technology, Pasadena, CA, 2000.
