Online learning for object identification by a mobile robot

Nicolas Bredeche∗
Inst. of Computing Technology, Chinese Academy of Sciences
www-poleia.lip6.fr/~bredeche
Jean-Daniel Zucker†
LIMBIO, Université Paris-Nord
www.limbio-paris13.org
Shi Zhongzhi
Inst. of Computing Technology, Chinese Academy of Sciences
www.intsci.ac.cn/shizz
Abstract

Object identification for a situated robot is a first step towards many relevant behaviors such as human-robot communication, object tracking, object detection, etc. However, the dynamic and unpredictable nature of the world makes it very difficult to design such algorithms. Our goal is to endow a Pioneer 2DX autonomous mobile robot with the ability to learn how to identify objects from its environment, and to maintain this ability through time. In order to do so, we propose an architecture that continuously looks for relevant visual invariant properties related to target objects thanks to online learning techniques.
1 Introduction
In real-world situated robotics, many tasks such as imitation, learning by demonstration, map building, navigation, communication and target tracking require object identification. The goal of object identification is to classify or to name an object based on the robot's current sensory data. For an autonomous robot, the ability to identify objects is a first step towards more complex tasks and may be built by regularly checking for the object. In this paper, we are concerned with a practical task, where a Pioneer 2DX mobile robot has to rely on its limited visual sensors to learn how to identify objects such as human beings, mobile robots or fire extinguishers that it encounters while navigating in the corridors of our laboratory. Providing such an autonomous robot, living in a changing environment such as our laboratory, with the identification ability described above is difficult to program by hand. As such, it is a good candidate for a Machine Learning approach, since it may easily be recast as a classical concept learning task. However, learnt anchors may become outdated due to the dynamic and unpredictable nature of the world, especially regarding possible concept drifts or non-representative examples used for learning. The main contribution of this paper is to show how it is possible to provide the robot with an efficient, ever up-to-date object identification ability. To this end, we present an architecture for object identification that relies on combining classifiers learnt using different representations, where each classifier identifies only a specific set of visual invariant properties related to the target object. These classifiers are then updated from time to time depending on their relevance.

∗ Supported by a LAVOISIER post-doctorate research grant.
† Supported by a délégation from CNRS.
Figure 1: Two snapshots taken by the robot. Left: "extinguisher", "door"; right: "human", "door".

In the following, we start with a description of our problem setting along with the initial framework we used for object identification. Then, we briefly review related works from the robotics and machine learning viewpoints. In section 4, we present our approach to online learning for object identification. Finally, we describe a set of real-world experiments for long-term object identification using a Pioneer 2DX autonomous mobile robot with different target objects ("fire extinguisher", "box" and "human" identification).
2 Problem setting

2.1 Context: situated mobile robots
The practical task we are concerned with takes place in a wider project called MICRobES, a collective robotics experiment started in 1999 and involving more than 10 people. This project aims at studying the long-term adaptation of a micro-society of autonomous mobile robots in an environment populated by humans: the LIP6 laboratory in Paris. The robots, ten Pioneer 2DX, have to "survive" in this environment as well as cohabit harmoniously with its inhabitants. Inside the MICRobES project, we are concerned with providing the robot with the ability to perform robot-human communication about objects in the world. However, from the robot's point of view, using a shared lexicon of human symbols requires some prerequisites, such as grounding these symbols so that they make sense in the world. We aim at providing each Pioneer 2DX autonomous mobile robot with the ability to identify¹ objects and/or living beings encountered in its environment thanks to learning capabilities. The Pioneer 2DX mobile robot provides images through its CCD video camera while navigating in the corridors. The images are 160 × 120 pixels, with 24-bit color information per pixel. Humans, robots, doors, extinguishers, ashtrays and other possible targets can be seen in the images, as shown in figure 1. All these possible targets, as they appear in the images, are of different shape, size and orientation, and are sometimes partially occluded. Finally, each image is labeled with the names of the occurring targets.

¹ Object identification consists in classifying or naming an object (no object model or complex scene reconstruction needed).

Figure 2: The experimental setup.

Figure 3: MI-learning and object identification.
2.2 Elements of object identification
A key aspect of the problem lies in the definition of the learning examples (i.e. the set of descriptions extracted from the images) used by the robot during the anchoring process. Indeed, a first step in any anchoring process is to extract relevant information out of raw sensory data in order to reduce the complexity of the learning task. In order to acquire a learning set, each robot navigates in the environment during the day and takes snapshots of its field of vision with its video camera, according to three possible behaviors: a wander behavior (random snapshots, bias-free), an attention behavior (snapshots taken at the supervisor's request, supervisor-induced bias), and an active learner behavior (snapshots taken according to memory, knowledge-induced bias).
Figure 4: The PLIC architecture.
At the end of each day, the robot may report to a supervisor and "ask" her/him which objects (whose symbols may or may not belong to a pre-defined lexicon) are to be identified on a subset of the pictures taken (without the supervisor pointing at them). It then performs a learning task in order to create or update the connection between sensory data and symbols, which is referred to as the anchoring process. From a machine learning point of view, the learning task produces classifiers that are then used to identify symbols from the sensory data. Figure 2 describes this process.

The learning task is therefore characterized by a set of image descriptions and attached labels. In practice, this means that given a set of positive examples (images with the target object) and negative examples (the same number of images, but without the target object), the goal is to learn how to classify a new image based on relevant invariant properties that can be found in at least one part of the image. The main point of our approach is that it relies on iterative reformulation in order to discover relevant invariant properties among many different feature sets, using positive and negative examples for each target object (the learning set). This is a batch learning task, since the reformulation algorithm requires a fixed learning set (the same learning set is used several times and must be consistent).

As a first step towards learning how to identify objects, we developed and validated the Plic system. Plic encapsulates both a reformulation tool and a learning algorithm (RipperMi, a multiple-instance rule-based learning algorithm [2]) which produces a set of classifiers that rely on visually invariant features. These features are learned at different scales (granularity) and may have different structural configurations (structure). For a given granularity and structure, the learning task provides classifiers such that object identification is based on a single part of the image, as shown in figure 3. The granularity corresponds to the resolution at which blocks are considered as "initial parts" of the image. A structure then corresponds to a set of contiguous initial parts and forms a kind of "high-level part". As a consequence, each part may embed one to many contiguous blocks. Such a multiple-instance representation has also been used for image classification and content-based image retrieval [12].

In order to complete the full learning task, Plic evaluates different granularities and structures according to the architecture synthesized in figure 4. As shown in this figure, Plic computes these classifiers thanks to two embedded wrappers [7] that explore granularity and structure. An initial granularity and structure are chosen and the images are reformulated into a multiple-instance representation; then, the concepts are learnt using this representation. Based on the results of the learning algorithm, a new granularity and structure are chosen and learning starts again. In the end, Plic provides many different classifiers that operate on different granularities and structures. The global predictor for object identification is built by choosing a limited number of these classifiers according to their accuracies computed during learning. As a consequence, it is possible to combine some of these classifiers for online ensemble learning, where the global prediction accuracy relies on the identification of various invariant features.
Such a combination makes it possible to avoid small mistakes from a specific classifier while taking into account the overall accuracies of all classifiers.
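To make the reformulation step concrete, here is a minimal sketch of how an image can be turned into a multiple-instance bag for a given granularity and structure. This is an illustration under simplifying assumptions (grayscale pixel values, mean intensity as the only feature); the function names are hypothetical stand-ins, not the actual Plic interface.

```python
# Minimal sketch of the multiple-instance reformulation described above.
# Assumptions: images are 2D lists of grayscale values, and each instance
# is summarized by its mean intensity; Plic actually uses richer color
# features. All names here are illustrative, not the actual Plic API.
from itertools import product

def split_into_blocks(image, granularity):
    """Cut an image into square blocks ("initial parts") of the given size,
    keyed by block coordinates so contiguous blocks can be grouped later."""
    h, w = len(image), len(image[0])
    blocks = {}
    for by, bx in product(range(0, h, granularity), range(0, w, granularity)):
        blocks[(by // granularity, bx // granularity)] = [
            row[bx:bx + granularity] for row in image[by:by + granularity]
        ]
    return blocks

def reformulate(image, granularity, structure):
    """Turn one image into a bag of instances.

    `structure` is a list of relative block offsets, e.g. [(0, 0), (0, 1)]
    for a horizontal pair of contiguous blocks (a "high-level part"). Every
    placement of that shape over the image yields one instance."""
    blocks = split_into_blocks(image, granularity)
    bag = []
    for (y, x) in blocks:
        try:
            part = [blocks[(y + dy, x + dx)] for (dy, dx) in structure]
        except KeyError:
            continue  # the shape falls outside the image at this position
        pixels = [p for blk in part for row in blk for p in row]
        bag.append(sum(pixels) / len(pixels))  # one feature per instance
    return bag
```

The two embedded wrappers would then simply loop over candidate (granularity, structure) pairs, call such a reformulation, run the multiple-instance learner on the resulting bags, and keep the best-scoring configurations.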
Target concept    Estimated accuracy    "Real" accuracy
extinguisher      79.45% ± 1.13         79.2% ± 3.9
box               79.3%  ± 0.75         65.3% ± 0.85
human             73.41% ± 1.1          57.7% ± 1.5

Table 1: Average estimated and real accuracies.
Figure 5: Example: "human" identification with combined classifiers.

Figure 6: Examples of learnt invariant properties.

As an example, figure 5 shows a typical "human" identification obtained with the ten best combined classifiers devised by our system (Identification_global = prediction_classifier_1 + ... + prediction_classifier_n). Note that these classifiers rely on various granularities and structures and do not always cover the same invariant features. In the scope of this paper, we shall not describe the Plic system any further, since we are concerned with long-term object identification issues. For further information, please refer to [1], which offers an in-depth description as well as experiments showing that Plic endows the robot with a reliable object identification capability for objects such as fire extinguishers, boxes or even humans.
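As a minimal sketch, the combination rule above amounts to summing individual votes. Assuming each classifier is a callable returning +1 (target present) or -1 (target absent), the global identification could look as follows (illustrative code, not the actual Plic interface):

```python
# Sketch of the combination rule: the global identification is the sign of
# the summed predictions of the n selected classifiers (here unweighted).
def identify(classifiers, image):
    """Return True if the combined vote says the target object is present."""
    total = sum(clf(image) for clf in classifiers)  # each clf returns +1/-1
    return total > 0
```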
2.3 Online learning pitfalls
Once this learning session is completed, each robot is thus endowed with an object identification capability that has been built from the robot's experience in a specific context. As a consequence, the robot should be able to exhibit behaviors such as tracking, target following or other behaviors where object identification is involved (e.g. docking, target finding, etc.). However, this is only partly true, because of important pitfalls related to long-life situated robotics and online learning with hidden contexts. In fact, there are two common pitfalls to be avoided:

Pitfall 1: Depending on the complexity of the environment and of the object to identify, the learning set's distribution may not be representative of the actual concepts to be learned [5]. In other words, the sampling behavior used to acquire images in the first place may result in an object identification ability based on a biased learning set (e.g. the robot learned to identify a "human" from examples showing only the blue-dressed supervisor).

Pitfall 2: Unknown objects may be introduced in the robot's environment that are not identified while they should be (e.g. the robot learned to identify a "chair" from wooden chairs, but starting today all of these chairs are replaced with brand new rocking-chairs because the whole laboratory retires). This is known as concept drift, i.e. a change in the target concept [6].

As a matter of fact, we performed several
experiments using three different targets: "fire extinguisher", "box" and "human". For each target, the robot acquired 50 images labeled as positive examples while wandering in our laboratory. Then, decision lists were built from these 50 positive examples and 50 negative examples. The attributes used for image description are hue, saturation and value for each pixel, and hue, saturation, value and the corresponding standard deviations for each high-level percept. Table 1 shows object identification accuracies, each corresponding to the average accuracy computed from the ten best classifiers found by our system (as illustrated in figure 5). The second column gives the accuracy estimated during the learning task using cross-validation². The third column gives the real-world accuracy computed on 100 new images acquired by the robot. While the estimated and real accuracies for "fire extinguisher" identification are nearly the same, there are strong differences for the "box" and "human" concepts. On the one hand, a fire extinguisher is quite easy to identify since, in this environment, it always hangs on a white wall and is always red. On the other hand, the great diversity of humans and boxes in the environment of the robot makes it very difficult to build a reliable identification ability. As we said earlier, our system builds classifiers that are relevant in the short term but very sensitive to the dynamic nature of the world in the long term. For example, figure 6 shows two positive "human" identifications with a specific configuration selected by Plic. It is clear that some of the learnt invariant properties used for identification may not hold for different clothing or skin colors (left image), and that some other learnt properties clearly depend on the learning set distribution (right image).
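The gap between the second and third columns of table 1 can be made explicit with a small sketch: the estimated accuracy is obtained by cross-validation on the learning set, while the "real" accuracy is measured on images acquired after learning. The `learn` argument below is a hypothetical stand-in for the RipperMi-based learning step; this is an illustration, assuming the learning set holds at least k examples.

```python
import random

def cross_val_accuracy(learn, examples, k=10):
    """Estimated accuracy: k-fold cross-validation on the learning set.
    `learn` maps a list of (features, label) pairs to a classifier callable."""
    examples = examples[:]
    random.shuffle(examples)
    folds = [examples[i::k] for i in range(k)]
    scores = []
    for i, test in enumerate(folds):
        train = [e for j, fold in enumerate(folds) if j != i for e in fold]
        clf = learn(train)
        scores.append(sum(clf(x) == y for x, y in test) / len(test))
    return sum(scores) / len(scores)

def real_world_accuracy(clf, fresh_examples):
    """'Real' accuracy: evaluation on images acquired after learning.
    A large gap with the estimate signals a biased learning set or drift."""
    return sum(clf(x) == y for x, y in fresh_examples) / len(fresh_examples)
```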
3 Related works
In this section, we first briefly review work on anchoring in robotics. Anchoring is the problem of how to create, and to maintain in time, the connection between the symbol-level and signal-level representations of the same physical object. From the robot's viewpoint, anchoring is closely related to the symbol grounding problem and can be seen as a first step towards object identification. Then, we review major works in machine learning that address the problem of learning in hidden contexts, i.e. online learning.

² A widely used technique in Machine Learning that consists in setting aside a small part of the learning set for evaluation purposes only; this can be seen as a prediction of real-world accuracy iff the learning set and real-world distributions match.
3.1 Anchoring and object identification
To sum up, works in the field of anchoring can be related to one of these two classes [3]:

Explicit anchoring: the goal is to endow the robot with the ability to identify one or several objects. The robot must be able to track its anchors over time (i.e. localization and detection). A good example is the "ball" tracking used during the Robocup football competitions. As a matter of fact, most approaches focus on the creation of an initial anchor with little ambiguity within a controlled environment (e.g. a target with a color that cannot be found elsewhere in the environment, a limited number of symbolic attributes that can efficiently describe the target, etc.).

Implicit anchoring: other works do not use explicit symbols, but merely rely on relevant visual patterns that can be useful for other behaviors. Indeed, many behaviors such as imitation, learning by demonstration or even navigation in a complex environment require the ability to identify specific scenes thanks to discriminative features. Since anchoring is not performed using explicit symbols, this approach usually requires a more complex anchoring mechanism than the former one, as no a priori knowledge of the world is used.

As far as we are concerned, these approaches do not address the problem of a robot building an online object identification ability, since they either rely on a controlled environment or on non-explicit symbols.
3.2 Online learning algorithms
Within the machine learning community, the ability to learn from a continuous flow of examples is referred to as online learning. These algorithms try to forget irrelevant information instead of synthesizing all available information (as opposed to classic batch learning algorithms). Important theoretical results show that online algorithms are thus able to cope with insufficient information and/or concept drift [6]. Neural networks, SVMs or incremental decision tree learning algorithms are natural online learners. There are also numerous techniques to turn a classic batch learning algorithm into an online learner [11]. A known drawback of all these algorithms is that it is very difficult to perform learning on several examples at once; for example, our reformulation tool described in section 2 cannot be used. In order to solve this problem, some algorithms rely on windowing techniques [10], which consist in storing the n last examples and performing a learning task whenever a new example is encountered. While providing a relevant framework for integrating batch and online learning techniques, this approach requires costly computations that a mobile robot is unlikely to afford whenever an object is encountered (learning cannot be deferred).
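For illustration, a minimal sketch of such a window-based learner is given below, with relearning deferred until the window is full (as a mobile robot would require) rather than triggered on every example. The `batch_learn` parameter is a hypothetical stand-in for any batch learning algorithm.

```python
from collections import deque

class WindowedLearner:
    """Sketch of the windowing technique (after [10]): keep the n most
    recent labeled examples and relearn from that window. Relearning is
    deferred until the window is full, since a mobile robot cannot afford
    a costly learning task every time an object is encountered."""
    def __init__(self, batch_learn, window_size=100):
        self.batch_learn = batch_learn        # any batch learning algorithm
        self.window = deque(maxlen=window_size)
        self.classifier = None

    def observe(self, features, label):
        """Store a newly labeled example; cheap enough to run online."""
        self.window.append((features, label))

    def relearn_if_ready(self):
        """Costly batch step, triggered only when the window is full."""
        if len(self.window) == self.window.maxlen:
            self.classifier = self.batch_learn(list(self.window))
            self.window.clear()               # the window is then emptied
```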
4 Our approach to online object identification
We saw previously that PLIC requires a fixed number of examples during its reformulation process. As a consequence, the embedded batch learning algorithm builds efficient classifiers as long as the examples are representative of the world. Some of the best-estimated
classifiers are then combined into the object identification module we saw earlier in figure 5. Starting from this, the robot still has to face problems related to the dynamic and unpredictable nature of the world.

As a first step towards online object identification, an interesting approach is to attach a weight to each combined classifier. This makes it possible to evaluate the relevance of a specific classifier (or weak learner) for later updating. Moreover, using combined weighted classifiers to provide a global prediction is known to reduce, to some extent, the impact of biased classifiers [9]. In this setup, a classifier's weight is decreased whenever its prediction is wrong. Many algorithms exist to provide a weighted prediction, such as the weighted majority algorithm [8]. However, there is a strong difference between known applications of such algorithms and our approach, in that the robot considers classifiers that were learnt from different learning sets; this corresponds to a multiple-representation weighted majority algorithm. From the robot's viewpoint, it is now possible to evaluate the real-world accuracy of each embedded classifier in order to replace it if needed.

As shown in figure 7 (the X-axis stands for images acquired in chronological order), the core of our approach lies in the use of a windowing-based online learning algorithm that relies on two interrelated sessions: the learning session and the exploitation session.

The learning session: a batch learning algorithm, along with a reformulation algorithm, provides a set of classifiers that are built on positive and negative examples as described in section 2. Once learning is completed, the window is emptied. This session, which requires the robot to "go to sleep" for costly computation, takes place once in a while, when the window contains enough examples. To begin with, the global predictor is initialised, or bootstrapped (see figure 7), i.e. a learning task has to be performed before the robot is endowed with an object identification capability. During this bootstrap session, the robot just gathers images and labels to build a first learning set. Then, learning session "0" provides the first set of classifiers for object identification. Note that this learning session "0" is the same as in the experiments discussed in section 2, some results of which were shown in table 1.

The exploitation session: during this session, the robot may perform any behavior that requires an object identification ability. Identification is provided by the combined classifiers learnt during the learning session. At the same time, classifiers' weights are adjusted according to their accuracy, and new examples are stored within the window (without starting a learning session). A classifier's weight is used to determine whether it has to be replaced during the next learning session. This exploitation session requires only identification and can be computed quickly (in less than 100 milliseconds).

On the one hand, the learning session may be triggered by the robot whenever the window is full and no behavior is being conducted. On the other hand, the window is filled with new examples during the exploitation session whenever new images are labeled by a human supervisor (i.e. the robot is not continuously pestering the people around it for help).
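A minimal sketch of such a weighted combination is shown below, following the weighted majority scheme of [8]: a classifier that errs on a labeled image has its weight multiplied by a factor beta < 1. The class name and the choice of beta are illustrative assumptions, not the exact WMplic implementation.

```python
class WeightedEnsemble:
    """Sketch of a weighted-majority-style combination (after [8]).
    Each classifier keeps a weight; the weights both drive the global
    vote and flag poorly performing classifiers for later replacement."""
    def __init__(self, classifiers, beta=0.9):
        self.classifiers = list(classifiers)
        self.weights = [1.0] * len(self.classifiers)
        self.beta = beta                      # penalty factor, 0 < beta < 1

    def predict(self, image):
        """Weighted vote; each classifier is assumed to return +1 or -1."""
        score = sum(w * clf(image)
                    for w, clf in zip(self.weights, self.classifiers))
        return 1 if score > 0 else -1

    def update(self, image, true_label):
        """Penalize the classifiers that got this labeled image wrong."""
        for i, clf in enumerate(self.classifiers):
            if clf(image) != true_label:
                self.weights[i] *= self.beta
```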
Our approach makes it possible to use a fixed learning set for complex data reformulation while coping with both biased learning sets and concept drifts, thanks to the updating of irrelevant classifiers.

Figure 7: Experimental scenario.

The global prediction is computed from the n selected classifiers, as already discussed in section 2. Apart from learning session "0", classifiers are updated at every new learning session. Old classifiers' accuracies are evaluated on every image acquired since the previous learning session, while new classifiers' accuracies are computed by cross-validation on the current learning set; then, the k (k ≤ n) worst-performing old classifiers are replaced with the k best-performing new classifiers (a sketch of this update is given below).

In order to implement an onboard object identification component on our Pioneer 2DX mobile robot, we have developed the WMplic system, which embeds both the short-term anchoring algorithm Plic and the approach described in this section. As a consequence, WMplic provides the robot with an up-to-date object identification ability in a real-world environment with real-time constraints.
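A sketch of the replacement rule follows, building on the `WeightedEnsemble` sketch above. For simplicity it ranks old classifiers by their current weights rather than by explicitly recomputed accuracies; this is an illustrative assumption.

```python
def replace_worst(ensemble, new_classifiers, new_accuracies, k=3):
    """Replace the k worst old classifiers with the k best new ones.
    New classifiers are ranked by their cross-validated accuracy on the
    current learning set; old ones by their weight (an accuracy proxy)."""
    worst = sorted(range(len(ensemble.weights)),
                   key=lambda i: ensemble.weights[i])[:k]
    best_new = sorted(zip(new_classifiers, new_accuracies),
                      key=lambda pair: pair[1], reverse=True)[:k]
    for slot, (clf, _) in zip(worst, best_new):
        ensemble.classifiers[slot] = clf
        ensemble.weights[slot] = 1.0   # a fresh classifier gets a fresh weight
    return ensemble
```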
5 Experiments

5.1 Experimental setup
To evaluate our online object identification architecture through time, a number of different experiments have been carried out. We consider three different target objects to be identified: "fire extinguisher" (they can be found in the corridors of our lab), "box" (various boxes, standing alone or piled up) and "human" (a single person wearing different kinds of clothes). The experiments are based on images acquired by a Pioneer 2DX mobile robot wandering in the corridors of the Computer Science Laboratory (LIP6) of the University of Paris 6. The objects, as they appear in the images, differ in shape, size and orientation, and are sometimes partially occluded. For a given learning set, half of the examples are labeled with the target object, but other objects may also be present (approx. 50% of the images show only one object, be it the target or not, 15% show two objects and 5% show three objects). During a learning session, the Plic system described in section 2 builds the classifiers that will be used for object identification.
5.2 Experiments with online object identification

In order to evaluate our approach, we have performed two sets of experiments. Firstly, the robot performs only the initial learning session "0" and combines the 10 best-performing classifiers (n = 10) for the next 300 images. We shall refer to this as "initial learning", since no further updating is performed to replace classifiers (n = 10 and k = 0).
Secondly, the robot performs learning session "0" as previously described, and then a learning session whenever 100 new examples have been gathered (this defines the window size for online learning, half of the examples being positive ones, the other half being negative ones randomly sampled, bias-free). This corresponds to learning sessions "1" and "2" in figure 7. During each of these learning sessions, the 3 worst-performing classifiers are replaced by the 3 newly learned ones with the best estimated accuracies, in order to keep ten up-to-date classifiers. We shall refer to this as "online learning", since classifiers are regularly updated according to their accuracies (n = 10 and k = 3).

Figures 8, 9 and 10 show the identification accuracies during the robot's wandering behavior, according to the scenario described in figure 7. The X-axis shows images as they are acquired and labeled in chronological order. The identification accuracies are updated whenever a new image is considered (which explains the early variability). These figures show that our approach is clearly relevant in this context and provides very good results given the complexity of the task at hand (from 75% to 86% identification accuracy depending on the object).

Online learning for "fire extinguisher" and "box" identification tends to perform better after two learning sessions, while online learning for "human" identification performs better as early as the first learning session. The "fire extinguisher" and "box" targets correspond to relatively static objects, while the "human" target is much more difficult to define (e.g. changing clothes, attitudes, etc.). Thus, one learning session is nearly enough to provide efficient classifiers for the two simpler targets. However, the final accuracies for these targets show that online learning performs better in the long term, because classifiers tend to be less biased (online learning accuracies are more stable). As for the "human" target, a single learning task may not be enough to build relevant classifiers; as a matter of fact, classifiers learned online perform 9 points better than the others.

In conclusion, an in-depth analysis of the results shows that combining and updating classifiers is useful on three main points:

• Combining classifiers naturally improves the identification accuracy, even compared to the best embedded classifier's accuracy (results not shown). This is a known result from the literature [4].

• Online learning makes it possible to recover from concept drifts. For example, if the shape and properties related to the "box" symbol are no longer valid, the identification ability will be completely recovered after a few learning sessions.

• Online learning reduces the impact of distribution-related issues. Classifiers are learnt on different learning sets with different distributions. Thus, combining them endows the robot with a more universal identification ability and reduces the influence of distribution-related bias.
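As an aside, the curves of figures 8, 9 and 10 correspond to a running accuracy that is re-estimated after each new labeled image, which is why they are highly variable at the start; a minimal sketch of this bookkeeping (illustrative, assuming a chronological stream of prediction/label pairs) is:

```python
def running_accuracy(stream):
    """Cumulative identification accuracy over a chronological stream of
    (predicted, actual) label pairs; early values vary widely because the
    denominator is still small."""
    correct, curve = 0, []
    for t, (predicted, actual) in enumerate(stream, start=1):
        correct += (predicted == actual)
        curve.append(correct / t)
    return curve
```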
6 Conclusion
In this paper, we addressed the problem of object identification by a Pioneer 2DX autonomous mobile robot. To begin with, we have shown that efficient object identification cannot be achieved through an ad hoc identification system alone. In order to endow the robot with
such an ability, we built an approach to object identification that relies on an online learning architecture. Our system makes it possible to learn how to identify any object that can be found in the robot's real-world environment and to update this knowledge over time. As a consequence, the robot is able to cope with the dynamic and unpredictable nature of the world (namely, insufficient or non-representative object examples and concept drifts). Experiments showed that our approach remains relevant through time in the very peculiar context of situated robotics, thanks to online learning techniques and classifier combination. This work shows that, for learning how to identify objects, an approach that periodically searches for the most accurate classifiers, given the examples at hand, is a promising direction.
Figure 8: "Fire extinguisher" identification accuracies.

Figure 9: "Box" identification accuracies.

Figure 10: "Human" identification accuracies.

References

[1] N. Bredèche, Y. Chevaleyre, J.-D. Zucker, A. Drogoul, and G. Sabah. A meta-learning approach to ground symbols from visual percepts. Robotics and Autonomous Systems, special issue on Anchoring, 2003.

[2] Y. Chevaleyre, N. Bredeche, and J.-D. Zucker. Learning rules from multiple instance data: Issues and algorithms. In Proc. of the 9th Int. Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, 2002.

[3] S. Coradeschi and A. Saffiotti, editors. Proceedings of the AAAI Fall Symposium on Anchoring Symbols to Sensor Data in Single and Multiple Robot Systems. AAAI Technical Report FS-01-01, 2001.

[4] T. G. Dietterich. Ensemble methods in machine learning. Lecture Notes in Computer Science, 1857, 2000.

[5] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Proc. of the European Conference on Computational Learning Theory, pages 23-37, 1995.

[6] D. P. Helmbold and P. M. Long. Tracking drifting concepts by minimizing disagreements. Machine Learning, 14(1):27-45, 1994.

[7] R. Kohavi and G. John. The wrapper approach. In H. Liu and H. Motoda, editors, Feature Selection for Knowledge Discovery and Data Mining, pages 33-50. Kluwer Academic Publishers, 1998.

[8] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. In IEEE Symposium on Foundations of Computer Science, 1989.

[9] R. E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197-227, 1990.

[10] G. Widmer and M. Kubat. Learning in the presence of concept drift and hidden contexts. Machine Learning, 23(1):69-101, 1996.

[11] D. R. Wilson and T. R. Martinez. Reduction techniques for instance-based learning algorithms. Machine Learning, 38(3):257-286, 2000.

[12] Q. Zhang, W. Yu, S. Goldman, and J. Fritts. Content-based image retrieval using multiple-instance learning. In Proceedings of the 19th International Conference on Machine Learning, 2002.