Semi-autonomous Learning of Objects

Hyundo Kim¹, Erik Murphy-Chutorian¹, Jochen Triesch¹,²
¹ Complex Systems & Cognition Lab, UCSD, 9500 Gilman Drive MC 0515, La Jolla, CA, USA
² Frankfurt Institute for Advanced Studies, Max-von-Laue-Str. 1, 60438 Frankfurt am Main, Germany
{hyundo, erikmc, triesch}@ucsd.edu
Abstract
This paper presents a robotic vision system that can be taught to recognize novel objects in a semi-autonomous manner that does not require manual labeling or segmentation of any individual training images. Instead, unfamiliar objects are simply shown to the system in varying poses and scales against cluttered backgrounds, and the system automatically detects, tracks, segments, and builds representations for these objects. We demonstrate the feasibility of our approach by training the system to recognize one hundred household objects, each presented to the system for about a minute. Our method resembles the way that biological organisms learn to recognize objects, and it paves the way for a wealth of applications in robotics and other fields.
1. Introduction

Humans can effortlessly recognize thousands of objects despite background clutter, viewpoint changes, varying lighting conditions, and partial occlusions. Interestingly, humans learn to recognize objects with little to no explicit supervision. This is in sharp contrast to most computer vision systems, which rely on various forms of supervised training paradigms. It is becoming increasingly clear that learning to recognize many objects despite background clutter and changes in pose, scale, lighting, etc. requires massive amounts of training data. Manually labeling such large sets of training images is not very practical, however, even if only a binary label has to be provided for each image [19]. Compared to how children learn about new objects, such methods seem like a very poor approach, because they rely on substantial amounts of human expert knowledge. Semi-supervised learning, in which massive amounts of unlabeled training data supplement a smaller set of labeled training examples, is one way to alleviate the problem, but it typically still requires a large number of hand-labeled images. In this paper, we present a robotic vision system that learns representations for new objects without requiring labor-intensive labeling of any individual training images.
Figure 1. Semi-autonomous object learning: the robot (top left) learns a model for an object while actively tracking and segmenting it through pose and scale changes (top right, bottom left). Later, the learned model is used to recognize the object in a new context (bottom right).
Instead, the system automatically builds representations for objects while actively tracking and segmenting them as a human teacher presents them in different poses and scales in a cluttered, natural environment (see Fig. 1). We propose to call this style of learning semi-autonomous learning [15]. This new learning paradigm is closer to the way that human infants and other biological organisms learn to recognize objects than traditional training approaches are. We reserve the term fully autonomous learning for the situation where the vision system actively explores its environment and forms categories for the objects in it without any intervention by a teacher, i.e., it has to operate in a completely unsupervised manner. Human infants and other animals are capable of this most difficult form of learning.¹

¹ We make a distinction between unsupervised and fully autonomous learning because the latter may require the agent to actively seek out and/or manipulate objects, while in standard unsupervised learning stimuli are presented to the system automatically.
Our approach to semi-autonomous object learning integrates a number of components, including an active tracking system, a stereo-based segmentation engine, and an object recognition system. We stress that the novelty of our work lies not so much in our choice of these individual components, but rather in their integration into a system that represents a qualitative change in the way that object recognition systems can be taught.

There are a few works that have similarities to our approach. Fei-Fei et al. trained a system to accurately classify objects given a single training image per class, but their approach was limited to four possible category labels, whereas our system assigns a label from any of 100 different objects [9]. Loos and von der Malsburg developed a recognition system that used motion detection to segment the object from the background [10], but unlike our system, the learned object representations were derived from just a single video frame and cannot be used for pose-invariant recognition. Recent work by Kirstein et al. overcame this limitation in a similar system, but their approach relied on a carefully controlled environment in which the teacher wore black gloves and the objects were presented against a black background [7]. Sivic and Zisserman [16] proposed a method to retrieve objects and scenes from short video sequences using a combination of viewpoint-invariant descriptors; still, their method requires the user to outline the objects and works only for quasi-planar objects. In a different context, Frey and Jojic [1] learned generative models of videos containing humans, but these models were used for “editing” the video sequences (e.g., removing an object from the scene) rather than for recognition.

The remainder of the paper is organized as follows. Section 2 introduces our semi-autonomous learning method, Section 3 describes the multiple object recognition system, and Section 4 presents a systematic experimental evaluation of the method. Finally, Section 5 contains a brief discussion of our approach.
2. Semi-autonomous Object Learning

Our goal is to enable an active vision system to automatically recognize objects without the need for manual segmentation or labeling of individual training images. Instead, a human teacher simply shows objects to the system in various poses in a cluttered, uncontrolled environment. We implemented such a system on an anthropomorphic robot head [6] with 9 degrees of freedom (DoFs): 4 DoFs for horizontal and vertical movement of the two eyes, 2 DoFs for the neck, and 3 DoFs for facial expressions (not used in this study). Our semi-autonomous learning framework is divided into three stages. During object detection, the system detects the presence of objects that it should learn about based on their stereo disparity. Once an object is detected, the system actively tracks it for an extended time, keeping it fixated with both cameras as it undergoes pose and scale changes. Finally, representations for new objects are learned in an offline model learning stage. The learned representations can then be used for object detection and recognition. In the following, we describe each component of our system in detail.

2.1. Object Detection Based on Stereo Segmentation

Figure 2. Block diagram of the stereo computation used in the object detection and model learning stages.

We employ stereo disparity information to coarsely segment the training objects from the background. Since our object representations consist of measurements at interest point locations, we only need to establish whether an interest point lies roughly in the current plane of fixation (likely belonging to the object) or not (likely belonging to the background or an occluding object). In our approach, the two eyes of the robot head initially fixate a position about 30 cm in front of it. The system extracts Harris corner points [3] from both the left and the right image. We compare Gabor-jets, e.g. [8], from all interest points found in the left image to all interest points found in the right image, and each interest point in the left image is associated with the best matching interest point in the right image if the similarity between the two jets is above a threshold (0.93 in our implementation). For each match we calculate the horizontal and vertical disparity between the positions of the points in the left and right image.² If the disparity exceeds certain bounds, the point is not considered part of the object. The process is illustrated in Figure 2. If enough interest points (we use 20) are detected in the plane of fixation for at least 20 video frames, a “learning situation” is triggered and the tracking and learning stages begin. The initial position of the detected object, which we call the target position, is estimated as the median position of all the matched interest point pairs in the latest frame used for detection. The target position is used to initialize the subsequent active tracking stage.

² In this process, we make no attempt to rectify the images as is typically done for fixed stereo rigs.
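As an illustration of the disparity-based gating described in this subsection, the following is a minimal Python/NumPy sketch. It assumes that Harris corner extraction and Gabor-jet computation are provided elsewhere (only their outputs appear below); the 0.93 similarity threshold is taken from the text, while the disparity bounds in MAX_DISPARITY are illustrative values not given in the paper.

```python
import numpy as np

SIM_THRESHOLD = 0.93        # minimum normalized jet similarity for a match (value from the text)
MAX_DISPARITY = (40, 10)    # assumed horizontal/vertical disparity bounds in pixels (not specified)

def jet_similarity(jet_a, jet_b):
    """Normalized inner product between two Gabor-jet magnitude vectors."""
    return float(np.dot(jet_a, jet_b) /
                 (np.linalg.norm(jet_a) * np.linalg.norm(jet_b) + 1e-12))

def points_in_fixation_plane(left_pts, left_jets, right_pts, right_jets):
    """Keep the left-image interest points whose best-matching right-image point
    lies within the disparity bounds, i.e. roughly in the plane of fixation."""
    kept = []
    for (x, y), jet in zip(left_pts, left_jets):
        sims = [jet_similarity(jet, jr) for jr in right_jets]
        best = int(np.argmax(sims))
        if sims[best] < SIM_THRESHOLD:
            continue                      # no sufficiently similar point in the right image
        dx = right_pts[best][0] - x
        dy = right_pts[best][1] - y
        if abs(dx) <= MAX_DISPARITY[0] and abs(dy) <= MAX_DISPARITY[1]:
            kept.append((x, y))
    return kept

def target_position(kept_points):
    """Median position of the in-plane points; used to initialize active tracking."""
    return np.median(np.asarray(kept_points, dtype=float), axis=0)
```

In the full system, a learning situation would only be triggered once at least 20 such in-plane points persist for at least 20 consecutive frames, as described above.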
2.2. Active Tracking and Vergence Control

A central requirement of our approach is the ability to track an object successfully over an extended period of time in order to gather data about its appearance. This is difficult, however, because we cannot rely on having a model of the object's appearance – this is exactly what we are trying to learn. Our solution is to bootstrap a very simple model of the object in real time, while saving image frames for the offline creation of a detailed model for recognition. We employ a multi-cue tracking system based on the concept of Democratic Integration [18, 5], since it provides a fast and robust way to track unknown objects in real time. In this approach, the system combines information from a number of simple cues. Each cue has a weighting coefficient that is adapted according to the cue's agreement with the other cues. This allows the system to adapt the cues to the current situation; a motion cue, for example, would quickly lose influence against a moving background. At the same time, each cue constantly tries to estimate a model of the object's current appearance. Our current implementation uses five simple cues: a motion cue based on difference images, a prediction cue based on a Kalman filter, a template matching cue using an 11 × 11 pixel gray level template, a color histogram cue based on kernel tracking [2], and a contrast-based kernel cue, where contrast is the local standard deviation of gray values computed at two spatial scales.

Figure 3. Block diagram of vergence control.

To maintain proper coordination of the eyes, we adopt and extend the vergence control system of Theimer and Mallot [17]. To estimate the image displacement, we extract a complex Gabor-jet [8] from the target position in the left image and from a set of grid points centered around the same target position in the right image. We find the best matching Gabor-jet using the normalized inner product, and if this value exceeds a threshold, we compute the phase-based displacement estimate as described in [17]. We extended the original method to use the grid points because the disparities were often larger than the limit the original method can handle, which is about ±8 pixels depending on the filter sizes. Here we used 7 grid points spaced ±12 pixels apart in the horizontal direction. The estimated disparity is passed to the motor control system, where a partitioned controller scheme with a cascade of two PD controllers effectively resolves the redundancy between the DoFs of the eyes and the neck [5].
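To illustrate the cue-fusion idea behind Democratic Integration [18] described at the start of this subsection, here is a simplified sketch. The saliency-map representation, the quality measure, and the adaptation time constant TAU are assumptions made for the example rather than the exact formulation of the original papers.

```python
import numpy as np

TAU = 10.0  # assumed adaptation time constant in frames (not taken from the paper)

def fuse_and_adapt(saliency_maps, weights):
    """One tracking step: fuse the cues' saliency maps with the current weights,
    locate the target at the fused maximum, and adapt each cue's weight toward
    its agreement with that joint estimate."""
    weights = np.array(weights, dtype=float)          # copy so the caller's array is untouched
    fused = sum(w * s for w, s in zip(weights, saliency_maps))
    target = np.unravel_index(int(np.argmax(fused)), fused.shape)

    # A cue that responds strongly at the jointly estimated target is "in agreement".
    qualities = np.array([s[target] / (s.max() + 1e-12) for s in saliency_maps])
    qualities /= qualities.sum() + 1e-12

    # Leaky integration: the weights relax toward the normalized qualities.
    weights += (qualities - weights) / TAU
    weights /= weights.sum()
    return target, weights
```

In the real system each cue would also re-estimate its appearance model (template, color histogram, etc.) at the fused target position; that step is omitted here.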
3. Multiple Object Recognition

To build an object representation suited for fast recognition, we employ a method similar to that of Murphy-Chutorian et al. [15], which shares local features from a quantized vocabulary between different object representations. During training, a set of weighted associations is learned between the vocabulary features and the set of objects. During recognition, features detected in a video frame are matched to the vocabulary and used to cast weighted votes for the presence of each of the objects. In the following sections we describe these steps in more detail.

Figure 4. Two examples of successful multiple object detection with shared features. The system has correctly found all objects that it was trained on.
3.1. Feature Vocabulary

The recognition system uses a vocabulary of local features that quantize a potentially high-dimensional feature space. Our implementation uses 40-dimensional Gabor-jets [8], consisting of the magnitude responses of Gabor wavelets at 5 scales and 8 orientations. To learn the feature vocabulary, we extract Gabor-jet features at the locations of objects in a large training set of video frames and cluster them with a K-means algorithm, using the cluster centers as the vocabulary features. In our experiments, we use a vocabulary of 64,000 features. Given any feature, finding the vocabulary feature that best represents it requires a nearest-neighbor search in a 40-dimensional space, which an approximate kd-tree algorithm accomplishes efficiently [14].
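A hedged sketch of the vocabulary construction follows. Here scikit-learn's MiniBatchKMeans stands in for the K-means step and a SciPy kd-tree with a nonzero eps stands in for the approximate nearest-neighbor search of the ANN library [14]; the jets array, function names, and batch parameters are ours.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from scipy.spatial import cKDTree

def build_vocabulary(jets, n_words=64000):
    """Cluster 40-dimensional Gabor jets; the cluster centers form the vocabulary.
    Assumes the training set is much larger than the vocabulary size."""
    km = MiniBatchKMeans(n_clusters=n_words, batch_size=10000,
                         init_size=3 * n_words, n_init=3)
    km.fit(np.asarray(jets))          # jets: (N, 40) array of training descriptors
    return km.cluster_centers_

def build_index(vocabulary):
    """kd-tree over the vocabulary for (approximate) nearest-neighbor queries."""
    return cKDTree(vocabulary)

def quantize(index, jets, eps=0.5):
    """Map each jet to the id of its (approximately) nearest vocabulary word;
    eps > 0 trades exactness for speed, mimicking an approximate search."""
    _, ids = index.query(np.asarray(jets), k=1, eps=eps)
    return ids
```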
3.2. Feature Associations

Initially, we develop a sparse set of associations between the features and the objects. If an object and a feature are both present in a training frame, the system creates an association between the two. This association is labeled with the displacement vector between the location of the feature and the center of the object, discretized at the level of the bin spacing of a two-dimensional Hough transform [4, 11]. In this paper, the center of the object is identical to the target position of the active tracking system. Duplicate associations (i.e., same feature, same object, same displacement) are disallowed. Once all of the training frames have been processed in this way, the system makes a second pass through the training video to learn a weight for each of the associations. Assuming conditional independence between the features given the objects, Bayesian probability theory dictates that the optimum weights are given by the log-likelihood ratios.
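The two-pass training described above might look roughly as follows. The bin spacing, the add-one smoothing, the per-(feature, object) granularity of the weights, and the frame representation are our assumptions for the sake of a concrete example.

```python
import numpy as np
from collections import defaultdict

BIN = 20  # assumed Hough-bin spacing in pixels

def discretize(dx, dy):
    """Quantize a displacement vector to the Hough-bin grid."""
    return (int(round(dx / BIN)), int(round(dy / BIN)))

def collect_associations(frames):
    """First pass: record unique (feature, object, displacement-bin) triples.
    frames: iterable of (object_id, object_center, [(feature_id, (x, y)), ...])."""
    assoc = set()
    for obj, (cx, cy), feats in frames:
        for fid, (x, y) in feats:
            assoc.add((fid, obj, discretize(cx - x, cy - y)))
    return assoc

def learn_weights(frames, objects):
    """Second pass: log-likelihood-ratio weight for each (feature, object) pair,
    estimated from feature-occurrence counts with add-one smoothing."""
    pos = defaultdict(lambda: defaultdict(int))   # pos[obj][fid]: frames of obj containing fid
    neg = defaultdict(lambda: defaultdict(int))   # neg[obj][fid]: frames of other objects with fid
    n_pos, n_neg = defaultdict(int), defaultdict(int)
    for obj, _, feats in frames:
        present = {fid for fid, _ in feats}
        for o in objects:
            table, count = (pos, n_pos) if o == obj else (neg, n_neg)
            count[o] += 1
            for fid in present:
                table[o][fid] += 1
    weights = {}
    for o in objects:
        for fid in set(pos[o]) | set(neg[o]):
            p = (pos[o][fid] + 1.0) / (n_pos[o] + 2.0)
            q = (neg[o][fid] + 1.0) / (n_neg[o] + 2.0)
            weights[(fid, o)] = float(np.log(p / q))
    return weights
```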
3.3. Pose Invariance

To create pose-invariant representations, we simply agglomerate, or superimpose, different views of an object into a single representation, which has proven quite successful in similar work [15]. This is in stark contrast to recognition approaches that minimize the amount of labeled training data through careful selection of invariant features and training on only very few views [11, 12, 13]. Instead, our semi-autonomous learning paradigm embraces the opportunity to train with massive amounts of data covering the range of transformations relevant for the application at hand.
3.4. Optimal Detection Thresholds

During recognition, all of the detected features cast weighted votes to suggest the presence of the objects. If any Hough transform bin receives enough activation, the system decides that an object is in the scene at the position corresponding to that bin. As a detection criterion, we use the optimal threshold from the maximum a posteriori (MAP) estimator, assuming that the maximum activation can be modeled with a two-component Gaussian mixture model [15].
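Tying the previous sketch to recognition, voting could proceed roughly as follows, with the associations reorganized into a per-feature dictionary. The accumulator layout is ours, and the MAP-optimal thresholds are assumed to have been fitted beforehand and passed in as a dictionary.

```python
from collections import defaultdict

BIN = 20  # Hough-bin spacing, as in the previous sketch

def cast_votes(features, associations, weights):
    """features: [(feature_id, (x, y)), ...] detected in the current frame.
    associations: {feature_id: [(object_id, (bx, by)), ...]} learned displacement bins.
    weights: {(feature_id, object_id): log-likelihood weight}.
    Returns an accumulator mapping (object_id, bin) to its summed activation."""
    acc = defaultdict(float)
    for fid, (x, y) in features:
        for obj, (bx, by) in associations.get(fid, ()):
            cell = (int(round(x / BIN)) + bx, int(round(y / BIN)) + by)
            acc[(obj, cell)] += weights.get((fid, obj), 0.0)
    return acc

def recognize(acc, thresholds):
    """Recognition-only task used in the experiments: return the object whose
    peak activation, normalized by its detection threshold, is largest."""
    peak = {}
    for (obj, _), activation in acc.items():
        peak[obj] = max(peak.get(obj, float("-inf")), activation)
    return max(peak, key=lambda o: peak[o] / thresholds[o])
```

For detection proper, one would instead report every object whose peak activation exceeds its threshold, together with the position of the winning bin.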
3.5. Learning Object Models

After image sequences for all objects have been recorded as described above, the models for the individual objects are learned completely automatically. To segment the object of interest from the background, we use the same stereo disparity method described in Section 2.1. The segmented frames are, of course, not 100% accurate. They will usually contain errors such as mislabeling parts of the experimenter or the background as the object, or missing large object parts – the latter occurring during fast movements of the object in depth for which the vergence control system cannot compensate quickly enough. We hypothesize that these problems are not critical as long as a sufficient number of valid training images is used. This hypothesis is also supported by the success of supervised training methods that learn object models directly from unsegmented images [19].
4. Experiments

To test the performance of our semi-autonomous learning framework, we trained the system with 100 objects. We captured 500 stereo video frames at 10 Hz (about one minute) of each object. We then moved the robot head to a different location and repeated the capture. The objects were presented initially in “frontal” views and were then transformed through up to 360 degrees of in-plane rotation, out-of-plane rotations of up to ±90 degrees, and scale changes of up to a factor of 6. Most of the objects in the database are rigid, with a few notable exceptions such as magazines. Figure 5 shows a typical training sequence of the object “PhDThesis”. One should note that the pose variations in the two data sets are not perfectly identical, i.e., one data set will contain views of an object that are not present in the other. This effect is inherently difficult to quantify due to the somewhat uncontrolled way in which objects are shown to the system. We divided the video into training and test frames such that, out of every 10 frames, frames I(t), t = 1, 2, 3 are used for training and frame I(t), t = 6 is used for testing (see the sketch below). We used I(1) for developing the feature vocabulary and learning the set of associations between the features and the objects, I(2) for estimating the log-likelihood weights of these feature-object associations, and I(3) for learning the optimal detection thresholds. In this application we are not concerned with detection (is an object there?) but rather with recognition (which object is present?). In this simpler task, the system is forced to choose a single label for each frame, and it chooses the object with the highest activation normalized by its optimal threshold, even if that activation does not exceed the threshold.

Overall, the object detection and tracking performed quite well, with some exceptions when the objects were very small or lacked texture. The tracking system occasionally lost an object when it moved too quickly, since the kernel-based cues, which play a significant role in the tracking system, are based on a local search and are therefore unreliable if the motion between consecutive frames is too large. In these cases, we simply repeated the training of the object. The tracking system also tended to lose track when an out-of-plane rotation of more than ±90 degrees revealed a completely different view of the object from the one presented initially. Alternatively, one could record several training sequences for an object with different initial views. Online disparity estimation during tracking could improve tracking robustness and alleviate these problems.
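For concreteness, the following is a minimal Python sketch of the frame split described at the start of this section; the grouping of frames into roles follows the text, while the function name and data representation are ours.

```python
def split_frames(frames):
    """Split a recorded sequence according to the scheme in the text:
    within every group of 10 consecutive frames, frame 1 is used for the
    vocabulary and associations, frame 2 for the log-likelihood weights,
    frame 3 for the detection thresholds, and frame 6 for testing."""
    vocab, weights, thresholds, test = [], [], [], []
    for i, frame in enumerate(frames):
        r = i % 10          # position within the current group of 10 (0-based)
        if r == 0:
            vocab.append(frame)
        elif r == 1:
            weights.append(frame)
        elif r == 2:
            thresholds.append(frame)
        elif r == 5:        # 0-based index 5 corresponds to frame 6
            test.append(frame)
    return vocab, weights, thresholds, test
```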
Figure 5. Sequence of training images for object “PhDThesis”. Only every 60th image of the left camera is shown.

4.1. Results

The recognition test is performed on both the left and the right images, using 100 image pairs to test each object. In each scene, the recognition system outputs the object with the highest normalized activation. The object recognition system runs at about 1-2 Hz. The average recognition performance over all 100 objects was 78.53%. Table 1 shows the recognition rates for all 100 objects; these rates represent the percentage of images of an object that were correctly labeled. Figure 6 shows a histogram of the recognition rates. As can be seen, the typical correct identification rate is centered around 80%, although a small fraction of objects suffer from poorer performance. These results demonstrate the feasibility of our approach, although they also leave room for future improvements. It should be noted that a number of aspects make our dataset quite difficult. First, it includes sets of very similar objects to explore the limitations of the system, e.g., six soda cans and six water bottles. Furthermore, the image quality provided by the miniature cameras in the robot head is poor in comparison to professional equipment. In addition to the pose and scale variation, some frames contain artifacts such as specular reflections and motion blur that make it very difficult, even for humans, to recognize the object.
Figure 6. Histogram of recognition performances (number of objects vs. recognition rate [%]).
CDCase 87.0% | RemoteControl 87.0% | PotatoSnack 79.0% | IntelEthernetPlasticCase 58.5%
CherryCoke 76.0% | Magazine ImprobableResearch 90.0% | WhiteboardEraser 84.0% | Book Pthreads 78.0%
FilmBox 85.5% | Magazine Wired 79.0% | IcedTeaBottle 84.5% | Mousepad Bochum 92.5%
DVCBox 81.0% | Magazine Spectrum 79.0% | MountainDew 87.0% | RoboSapien 83.5%
CokeCan 82.0% | Magazine NewScientist 85.5% | SoyMilk 71.0% | IcedCappuccino 76.5%
DrPepper 64.5% | TropicanaGrapefruit 59.0% | BlueSkyCan 70.5% | FujiFinepixBox 83.5%
Sprite 61.0% | ChineseDoll 72.5% | DriverSet 77.5% | SpaldingSoccerBall 68.0%
DasaniWaterbottle 73.5% | ButterRingCookie 87.5% | CoffeeCup 36.5% | Multimeter 73.5%
AquafinaWaterbottle 74.5% | Book PracticalC++Programming 82.0% | DisketteBox 76.5% | MultimeterCase 83.0%
EggCookie 73.5% | Box FirewireCard 84.0% | HomeRunBallCookie 87.5% | HeinekenBottle 96.0%
SweetRedbeanCake 79.0% | Book EffectiveC++ 80.0% | StarBucksCoffeeBottle 91.5% | SweetPeaCan 67.5%
Videotape 83.0% | Brain 80.0% | PlasticMugCup 73.0% | SonyBox 80.0%
ComputerVisionBook 85.0% | PlumbersGoop 79.5% | KirklandWaterbottle 80.0% | CampbellSoup 79.5%
ChampagneBottle 90.0% | ChaiTeaBox 73.5% | PalomarWaterbottle 75.0% | HitecServoBox 71.5%
ScotchtapeBox 63.5% | FishermansFriendBox 70.5% | EssentiaWaterbottle 78.5% | SunpakRapidCharger 57.0%
PostIt 85.5% | AltoidsSpearmintBox 62.0% | GatoradeBottle 68.0% | NeoguriNoodle 79.0%
MixedNutsCan 69.0% | PCBBoard 61.0% | GerberSpringWaterbottle 55.0% | CafeAccentCreamer 85.5%
PepsiCan 66.5% | SuisseMochaCan 86.0% | SpeakerWire 80.0% | WoodSculptureHawaii 72.0%
GreenTeaBox 79.5% | Wordmaster 78.0% | FirstAidKitBox 85.0% | Yoot 83.0%
PhDThesis 93.5% | MousePad 83.0% | SpritePlasticBottle 77.5% | DellSupportBox 90.0%
Windex 96.0% | ArrowheadWaterbottle 78.0% | DellMousepad 76.0% | CloroxPaperTowelCase 89.0%
CD USDigital 82.5% | IntelGigabitEthernetBox 87.0% | MemorexDVD+RDL 86.5% | SonyVideotape 95.5%
BecksBeerBottle 80.0% | Book VisualCognition 81.5% | EpsonScannerManual 83.0% | OrangeBasketballMug 85.0%
TennisBall 87.5% | CD ICRA2004 91.5% | TropicanaBottle 62.0% | OreoCookie 71.5%
LinuxBook 86.5% | FloppyDisk 71.5% | Acrobat5 78.5% | UPSPackageCover 98.0%

Table 1. Recognition rates for all 100 objects.

5. Discussion

We have presented a robot vision system that learns about novel objects in a semi-autonomous fashion, where the teacher's only role is to “show” the object to the system in different views and to provide a name for it. Since no manual segmentation or labeling of individual training images is necessary, this learning approach makes it feasible to train a vision system to recognize large numbers of objects in very little time. Training can occur in a cluttered natural environment and the teacher requires no special skills or equipment. This paves the way for countless applications in robotics and other fields. Note that while we have used an active stereo vision head in this study, we expect that similar results can be obtained with a standard stereo rig, so our approach should be widely applicable. More generally, the specific set of components we use for tracking, segmentation, and recognition is just one of many possible alternatives, and we stress that our approach could likely be realized with other components.

Currently we make no attempt to incorporate stereo information into recognition; we test left and right images independently. However, our setup makes it easy to integrate stereo information into the recognition process by looking for objects only in the plane of fixation, discarding misleading information from the background. Problems associated with motion blur and lighting could be addressed by exploiting the temporal structure of the video: since we recognize objects in video frames, it could be as simple as accumulating recognition results over several frames.

Our work goes significantly beyond previous efforts to bestow vision systems with such “natural” learning abilities. We have demonstrated that the learned object models support near real-time recognition at reasonable performance levels, although there is certainly substantial room for improvement. For example, the current system requires objects to have a minimum amount of texture; the incorporation of other features is necessary to deal with textureless objects. Current limitations of computing power prompted a number of design decisions related to tracking and segmentation that also leave room for considerable improvements in future years.

Acknowledgments. The early stages of this research were supported by NSF under grant IIS-0208451. Jochen Triesch acknowledges support from the Hertie Foundation.
References

[1] V. Cheung, B. Frey, and N. Jojic. Video epitomes. In Proc. IEEE Conf. Comp. Vis. and Patt. Recog., 2005.
[2] D. Comaniciu, V. Ramesh, and P. Meer. Kernel-based object tracking. IEEE Trans. Patt. Anal. and Machine Intel., 25(5):564–577, 2003.
[3] C. Harris and M. Stephens. A combined corner and edge detector. In Proc. Alvey Vision Conference, 1988.
[4] P. Hough. Method and means for recognizing complex patterns, 1962. U.S. Patent 3069654.
[5] H. Kim, B. Lau, and J. Triesch. Adaptive object tracking with an anthropomorphic robot head. In Proc. Int. Conf. Sim. of Adaptive Behaviors, 2004.
[6] H. Kim, G. York, G. Burton, E. Murphy-Chutorian, and J. Triesch. Design of an anthropomorphic robot head for studying development and learning. In Proc. IEEE Int. Conf. Rob. and Auto., 2004.
[7] S. Kirstein, H. Wersing, and E. Körner. Rapid online learning of objects in a biologically motivated recognition architecture. In Deutsche Arbeitsgemeinschaft für Mustererkennung, pages 301–308, 2005.
[8] M. Lades, J. Vorbrüggen, J. Buhmann, J. Lange, C. von der Malsburg, R. Würtz, and W. Konen. Distortion invariant object recognition in the dynamic link architecture. IEEE Trans. Computers, 42:300–311, 1993.
[9] F. Li, R. Fergus, and P. Perona. A Bayesian approach to unsupervised one-shot learning of object categories. In Proc. IEEE Int. Conf. on Comp. Vis., 2003.
[10] H. Loos and C. von der Malsburg. 1-click learning of object models for recognition. In Proc. Biologically Motivated Computer Vision, 2002.
[11] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110, 2004.
[12] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide baseline stereo from maximally stable extremal regions. In Proc. British Machine Vision Conference, 2002.
[13] K. Mikolajczyk and C. Schmid. Scale and affine invariant interest point detectors. IJCV, 60(1):63–86, 2004.
[14] D. Mount and S. Arya. ANN: A library for approximate nearest neighbor searching, version 1.1, 2005. (http://www.cs.umd.edu/~mount/ANN/).
[15] E. Murphy-Chutorian, S. Aboutalib, and J. Triesch. Analysis of a biologically-inspired system for real-time object recognition. Cognitive Science Online, 3(2):1–14, 2005.
[16] J. Sivic and A. Zisserman. Video Google: A text retrieval approach to object matching in videos. In Proc. IEEE Int. Conf. on Comp. Vis., 2003.
[17] W. M. Theimer and H. A. Mallot. Phase-based binocular vergence control and depth reconstruction using active vision. CVGIP: Image Understanding, 60(3), 1994.
[18] J. Triesch and C. von der Malsburg. Democratic integration: Self-organized integration of adaptive cues. Neural Computation, 13(9):2049–2074, 2001.
[19] M. Weber, M. Welling, and P. Perona. Towards automatic discovery of object categories. In Proc. IEEE Conf. Comp. Vis. and Patt. Recog., 2000.