Minimally Supervised Acquisition of 3D Recognition Models from Cluttered Images

Andrea Selinger and Randal C. Nelson
Department of Computer Science
University of Rochester
Rochester, NY 14627
{selinger,nelson}@cs.rochester.edu
Abstract

Appearance-based object recognition systems rely on training from imagery, which allows the recognition of objects without requiring a 3D geometric model. It has been little explored whether such systems can be trained from imagery that is unlabeled, or from imagery that is not trivially segmentable. In this paper we present a method for minimally supervised training of a previously developed recognition system from unlabeled and unsegmented imagery. We show that the system can successfully extend an object representation extracted from one black-background image to contain object features extracted from unlabeled cluttered images, and can use the extended representation to improve recognition performance on a test set.
1. Introduction

Appearance-based systems have proven quite successful in recognizing 3D objects. They typically rely on training from labeled imagery, which allows the recognition of objects without the requirement of constructing a 3D geometric model. It has been little explored whether such systems can be trained from imagery that is unlabeled, or from imagery that is not trivially segmentable. A recognition system that could be trained from either unlabeled or unsegmented imagery would be valuable for reducing the effort required to obtain a training set. Of greater practical impact, a 3D recognition system that could be trained from cluttered imagery would be useful for automatic, object-level labeling of image databases, which is an important outstanding problem.
1.1. Appearance-Based Object Recognition Systems
One of the first appearance-based systems was developed by Poggio to recognize wire objects [?]. Rao and Ballard [?] describe an approach based on the memorization of the responses of a set of steerable filters. Mel's SEEMORE system [?] uses a database of stored feature channels representing multiple low-level cues describing contour shape, color and texture. Schiele and Crowley [?] use histograms of the responses of a vector of local linear neighborhood operators. Murase and Nayar [?] find the major principal components of an image dataset, and use the projections of unknown images onto these as indices into a recognition memory. This approach was extended by Huang and Camps [?] to appearance-based parts and relationships among them. Wang and Ben-Arie [?] perform generic object detection using vectorial eigenspaces derived from a small set of model shapes that are affine transformed over a wide parameter range. The approach taken by Schmid and Mohr [?] is based on the combination of differential invariants computed at keypoints with a robust voting algorithm and semilocal constraints.
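As an illustration of the eigenspace indexing idea attributed to Murase and Nayar above, the following is a minimal sketch; the function names, the use of a plain SVD, and nearest-neighbor lookup are our assumptions, not their implementation:

```python
import numpy as np

def build_eigenspace(train_images, k=8):
    """Stack flattened training images, find the top-k principal components,
    and store each image's projection as its index into recognition memory."""
    X = np.stack([im.ravel().astype(float) for im in train_images])
    mean = X.mean(axis=0)
    # SVD of the centered data gives the principal axes as rows of Vt.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    basis = Vt[:k]                     # k x D projection matrix
    coords = (X - mean) @ basis.T      # each training view as a k-vector
    return mean, basis, coords

def recognize(query, mean, basis, coords, labels):
    """Project the unknown image and return the label of the nearest stored view."""
    q = (query.ravel().astype(float) - mean) @ basis.T
    return labels[np.argmin(np.linalg.norm(coords - q, axis=1))]
```

The low-dimensional projection serves the same role as an index: matching happens in the k-dimensional eigenspace rather than in raw pixel space.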
1.2. Previous Work on Unsupervised Training of 3D Recognition Systems

A system that is trained from unlabeled images has to be able to perform unsupervised clustering of multiple views into multiple object classes. Ando et al. [?] observed that although the input dimension of such an image set is very high, the view data of an object often resides in a low-dimensional subspace. Their strategy is to identify multiple non-linear subspaces, each of which contains the views of one object class. A similar approach was taken by Basri et al. [?]. Their method examines the space of all images and partitions the images into sets that form smooth and parallel surfaces in this space. Nearby images are grouped into surface patches that form the nodes of a graph, and further grouping becomes a standard graph clustering problem. In both cases good results are obtained only if a very large number of clean, segmented images, or even sequences of images, are considered. This is not surprising.
If the number of images is high, clustering is affected by a phase transition phenomenon: when the parameters of the image set reach a certain value, the topology of the network suddenly changes from small isolated clusters to a giant one containing very many nodes [?, ?]. The unsupervised learning methods discussed above were based on computing distances between images. Basri et al. [?] use a similarity measure based on the distortion of salient features between images. Gdalyahu and Weinshall [?] use a curve dissimilarity measure. The disadvantage of such similarity measures is that they generally require full object segmentation and cannot deal with scale changes. Weber et al. [?] developed a system that learns object class models from unsegmented cluttered scenes. Their method automatically identifies distinctive parts in the training set by applying a clustering algorithm to patterns selected by an interest operator, and then learns the statistical shape model using expectation maximization. However, the method requires images to be labeled as to the object that they represent.
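The distance-based view clustering described above can be illustrated with a minimal sketch; it is ours, not any cited author's, and it uses raw pixel distance where the cited methods use far richer similarity measures:

```python
import numpy as np

def cluster_views(images, threshold):
    """Group views whose pairwise L2 image distance is below `threshold`
    by taking connected components of the resulting similarity graph."""
    n = len(images)
    vecs = [im.ravel().astype(float) for im in images]
    # Adjacency list: an edge joins views that are close in image space.
    adj = [[j for j in range(n) if j != i
            and np.linalg.norm(vecs[i] - vecs[j]) < threshold]
           for i in range(n)]
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        stack, comp = [start], []
        while stack:                   # depth-first flood fill of one component
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            comp.append(v)
            stack.extend(adj[v])
        clusters.append(sorted(comp))
    return clusters
```

This toy version also exhibits the phase-transition behavior noted above: as the threshold grows past a critical value, the small components abruptly merge into one giant cluster.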
1.3. Current Work on Training from Unlabeled Cluttered Images

While there has been work on unsupervised training of recognition systems from clean, segmented images, and on supervised training from cluttered images, there has been, to our knowledge, no work on unsupervised training of object recognition systems from cluttered images. In this paper we present a method for minimally supervised training of a recognition system from unlabeled and unsegmented imagery. We use an object recognition system developed previously [?], and train it on one labeled black-background image of an object and a set of unlabeled cluttered images. We show that the system can successfully classify the majority of cluttered images containing the seed object and can extend the object's representation using features from these images. Using this representation, recognition performance becomes significantly better than the performance obtained by training the system only on the black-background seed image.

2. The Underlying Object Recognition System

The recognition system we adapt is based on a hierarchy of perceptual grouping processes [?]. A 3D object is a fourth-level group (see Figure 1) consisting of a topologically structured set of flexible 2D views, each derived from a training image. In these views, which represent third-level perceptual groups, the visual appearance of an object is represented as a geometrically consistent cluster of several overlapping local context regions. These local context regions represent high-information second-level perceptual groups, and are essentially windows centered on and normalized by key first-level features, containing a representation of all first-level features that intersect the window. The first-level features are the result of first-level grouping processes run on the image, typically representing connected contour fragments.

[Figure 1 diagram: image -> LEVEL I: curves -> LEVEL II: context patches -> LEVEL III: 2D views -> LEVEL IV: representation of 3D object]

Figure 1: The perceptual grouping hierarchy.

In more detail, distinctive local features called keys, selected from the first-level groups in our hierarchy, seed and normalize keyed context regions, the second-level groups. In the current system, the keys are contours automatically extracted from the image. The second level of grouping into keyed context patches amplifies the power of the key features by providing a means of verifying whether the key is likely to be part of a particular object. Even these high-information local context regions are generally consistent with several object/pose hypotheses; hence we use the third-level grouping process to organize the context patches into globally consistent clusters that represent hypotheses about object identity and pose. This is done through a hypothesis database that maintains a probabilistic estimate of the likelihood of each third-level group (cluster) based on statistics about the frequency of matching context patches in the primary database. The idea is similar to a multi-dimensional Hough transform without the space problems that arise in an explicit decomposition of the parameter space. In our case, since 3D objects are represented by a set of views, the clusters represent two-dimensional rigid transforms of specific views. The use of keyed contexts rather than first-level groups gives the voting features sufficient power to substantially ameliorate well-known problems with false positives in Hough-like voting schemes.

The system obtains a recognition rate of 97% when trained on images of 24 objects taken against a clean black background over the whole viewing sphere, and tested on images taken between the training views, under the same good conditions. The test objects range from sports cars and fighter planes to snakes and lizards. Some of them can be seen in Figure 2. Performance remains relatively good in the case of clutter and partial occlusion [?].

Figure 2: Some of the objects used in testing the system.

The feature-based nature of the algorithm provides some immunity to the presence of clutter and occlusion in the scene; this, in fact, was one of the primary design goals. This is in contrast to appearance-based schemes that use the structure of the full object and require good prior segmentation.

3. Minimally Supervised Training

The basic idea behind our current work is that if the recognition system is trained on one view of an object, it will be able to recognize views that are topologically close to the original view. After adding these views to the object representation, the system will be able to recognize additional views in an iterative process. This leads to the development of clusters of views characterizing each object, clusters that ideally will cover the entire viewing sphere.

3.1. Training from Clean Images

As a validation experiment, we used a corpus of clean, black-background images of objects (some of them seen in Figure 2). The images were taken about 20 degrees apart over the viewing sphere. We seeded the recognition system with a single black-background image of an object, and then iteratively found the best match to the current representation over the entire corpus and rebuilt the object representation incorporating the new image. We stopped the procedure the first time an incorrect image was attracted. The overall procedure is essentially a minimum spanning tree algorithm, which is a standard clustering technique. In practice, the complexity of this algorithm, which would be a problem with large databases, can be avoided by modifying the representation as soon as a "good enough" match is found. Using this algorithm we attracted around 50% of the images of each object to the representation before making the first incorrect classification. Figure 3 shows the seed image for the sports-car and some of the sports-car images attracted to the representation through this method. It also shows the non-sports-car image attracted at the last step, which stopped the growth process. The image is actually an odd view of the toy-rabbit, one that, interestingly, looks a bit like the sports-car.

Figure 3: Top: Seed image of sports-car for propagation experiment and terminating non-car image. Bottom: Car images attracted to the representation during the experiment.

To improve the performance of the system we experimented with a denser image set. Such a set increases the percentage of images attracted to object representations by reducing the number of isolated views that are very different from the other views. Many of our objects, including the car and the aircraft, have a locus of relatively "pathological" views around the equator, where appearance changes very rapidly and recognition is more difficult. This is due to a "flattened" axis in the 3D shape, which is a common general property. To investigate the effect, we acquired an additional image set for the car and aircraft, with the distance between views decreasing in an adaptive fashion, reaching 5 degrees at the equator. Learning performance improved significantly in this case, and more than 90% of the images were attracted to object representations before an incorrect classification was made (94.6% of the sports-car images and 91.9% of the fighter images were attracted). Figure 4 shows the tree by which the sports-car views attracted each other to the growing representation. The 371 sports-car images in the corpus represent one hemisphere, and are represented by squares on the polar coordinate system in the figure. Dark squares represent images attracted to the representation prior to the first false match. Arrows show the topology of the growth process. The attraction process generally operated between close geometric neighbors, with the exception of some views separated by 180 degrees that were matched due to the symmetrical shape of the car. The images not attracted to the representation are pathological views that could not be matched to any other views of the object.
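The greedy growth procedure of Section 3.1 can be sketched as follows; the `match_score` and `true_label` callables are hypothetical stand-ins for the real recognizer and for the experimenter's ground truth, not part of the actual system:

```python
def grow_representation(seed, corpus, match_score, true_label):
    """Greedy, minimum-spanning-tree-style growth: repeatedly attract the
    corpus image that best matches any view in the current representation,
    stopping the first time the best match is not a view of the seed object."""
    representation = [seed]
    remaining = list(corpus)
    attracted = []
    while remaining:
        # Score every remaining image against every view already attracted.
        best = max(remaining,
                   key=lambda im: max(match_score(view, im)
                                      for view in representation))
        if true_label(best) != true_label(seed):
            break                    # first incorrect attraction ends the run
        remaining.remove(best)
        representation.append(best)
        attracted.append(best)
    return attracted
```

The `true_label` check here only plays the role of the evaluation stopping rule described in the text; during actual unsupervised growth no labels are available, which is why the experiment measures how far growth proceeds before the first false attraction.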
3.2. Training from Cluttered Images

When training from cluttered images, we must separate the features belonging to the object from the features coming from clutter. This is very important, since features arising from clutter could be matched to features of other objects, producing false positives. We start by seeding the system with a clean, black-background image of the object. The recognition system will be able to recognize topologically close views of the same object even if they are taken against cluttered backgrounds. The difficulty is to extract the features belonging to the object from the image and add only those to the object representation. Adding clutter features would corrupt the representation, and the object model would no longer be useful. An obvious way of extracting the object features is to find the object's occluding contour and extract all the features inside it. The difficulty, of course, is finding this contour in our dataset. The occluding contour of the object in the seed image is easy to obtain: since the seed image is taken against a clean, black background, we can simply threshold the image and extract the contour of the white blob. Subsequent images, however, are cluttered, and thus more difficult. The position of the object is known from the output of the object recognition system, but as the object changes its appearance in these images, the shape of the contour also changes. The transformation that morphs the model view into the new view can be used to find the contour of the new view of the object. We can find this transformation using a deformable template algorithm. To do this, we use a relatively generic algorithm adapted from the method of Jain et al. [?]. In this method the prior shape of the object of interest is specified as a template containing edge/boundary information in the form of a bitmap. Deformed templates are obtained by applying parametric transforms to the prototype, and the variability in the shape is achieved by imposing a probability distribution on the admissible mappings.
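The seed-image step above, thresholding a black-background image and tracing the boundary of the white blob, is simple enough to sketch directly; the threshold value and function name here are our assumptions:

```python
import numpy as np

def seed_contour_mask(gray, thresh=30):
    """Threshold a black-background seed image and mark the occluding
    contour: object pixels that have at least one background 4-neighbour."""
    obj = gray > thresh                       # object = bright pixels
    padded = np.pad(obj, 1, constant_values=False)
    # True where all four 4-neighbours are also object pixels (interior).
    interior = (padded[:-2, 1:-1] & padded[2:, 1:-1] &
                padded[1:-1, :-2] & padded[1:-1, 2:])
    return obj & ~interior                    # boundary = object minus interior
```

For the cluttered follow-up images this simple rule no longer applies, which is what motivates the deformable template machinery described next in the text.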
Figure 4: Tree by which training images were attracted to the representation during the growth process.

The goal is to find, among all such admissible transformations, the one that minimizes the Bayesian objective function

    \mathcal{E}(\xi) = \mathcal{E}_{edge}(I, T_\xi) + \gamma \sum_{m=1}^{M} \sum_{n=1}^{N} \left[ (\xi^x_{mn})^2 + (\xi^y_{mn})^2 \right]    (1)

where \mathcal{E}_{edge}(I, T_\xi) is the potential energy linking the edge positions and gradient directions in the input image to the object boundary specified by the deformable template, and the second term penalizes the various deformations of the template. The deformation is described by two displacement functions from the space spanned by the following orthogonal bases:

    e^x_{mn}(x, y) = (2 \sin(\pi n x) \cos(\pi m y), 0)
    e^y_{mn}(x, y) = (0, 2 \cos(\pi m x) \sin(\pi n y)),    m, n = 1, 2, \ldots
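Given deformation coefficients, the resulting displacement field can be evaluated pointwise; this is a sketch following the trigonometric parameterization of Jain et al., with illustrative coefficient values and a function name of our choosing:

```python
import math

def displacement(x, y, xi_x, xi_y):
    """Evaluate the displacement (D_x, D_y) at a normalized point (x, y) in
    [0, 1]^2 from coefficient dicts xi_x[(m, n)] and xi_y[(m, n)], using the
    orthogonal bases e^x_mn = 2 sin(pi n x) cos(pi m y) and
    e^y_mn = 2 cos(pi m x) sin(pi n y)."""
    dx = sum(c * 2 * math.sin(math.pi * n * x) * math.cos(math.pi * m * y)
             for (m, n), c in xi_x.items())
    dy = sum(c * 2 * math.cos(math.pi * m * x) * math.sin(math.pi * n * y)
             for (m, n), c in xi_y.items())
    return dx, dy
```

Note that the sine factors force the x-displacement to vanish on the template edges x = 0 and x = 1 (and similarly for y), so the deformation smoothly warps the interior of the template.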