Machine-Learning Issues in Model-Based Object Recognition
M. Kubat and W. Burger
Johannes Kepler University, Dept. of Systems Science, A-4040 Linz, Austria
Abstract: For decades, machine learning and computer vision evolved separately, largely ignorant of each other. Only recently, the need for automating some costly tasks in vision has intensified to the point that many specialists in the field asked whether machine-learning algorithms can be of help. With a focus on model-based object recognition, the task of this paper is to survey intersecting items in the two disciplines so as to set the stage for a more rigorous study of vision learning.
1 Introduction
Computer systems capable of scene analysis, navigation, object recognition, and other vision tasks are expensive in terms of the programming effort needed to encode the necessary internal representations of features, object and scene models, as well as algorithms for their detection. This furnishes the attractiveness of the idea that some of the existing machine-learning algorithms be harnessed to automate this process. The last couple of years have witnessed intensified effort toward this end. Apart from the many journal and conference articles that have appeared so far, specialized workshops [1, 5, 10] have been organized and a recent special issue of the IEEE Transactions on Pattern Analysis and Machine Intelligence [3] has been devoted to this topic. In this paper we investigate the use of machine-learning techniques specifically in the context of model-based object recognition. Our goal is to embed learning capabilities in a recognition framework based on a traditional hypothesize-and-test approach, with special emphasis on the grouping, indexing, and model instantiation modules. This work was supported in part by the Austrian Science Foundation (FWF) under grant S7002.
2 Model-Based Structural Object Recognition
Structural object recognition is based on the assumption that the objects of interest can be described and identified as a collection of structural primitives, often called "features" (see [15] or [12] for a review). As a framework for recognition we use the traditional hypothesize-and-test approach, which consists of four main steps: primitive extraction, model-base indexing, perceptual grouping, and model instantiation. These steps operate in a bootstrap fashion, i.e., the process starts in a bottom-up mode by extracting primitives and combining them in a meaningful way up to the point when a plausible object hypothesis can be made. Then the recognition process turns into a top-down, model-directed search and verification process. The key design issues are (a) the choice of the structural primitives and the corresponding extraction methods, (b) the representation of objects, and (c) efficient matching strategies. The hope is that learning can be useful for all three of these. The task of grouping is to assemble simple structural features into more expressive, complex features. The idea is to have a set of universal, perceptually motivated rules that guide the grouping process. These rules are domain-independent and based on general geometrical properties, such as locality, continuation, parallelism, symmetry, etc. For certain classes of primitives, perceptual grouping functions can be inferred from perspective geometry, i.e., based on "first principles". For other classes of structural primitives, and particularly for representations based on multiple types of primitives (polymorphic feature sets), the grouping functions may be considerably more complex, and in practice ad hoc grouping criteria are often used; see [11] and the references therein. The purpose of indexing is to select a promising object model using only partial evidence and without performing an exhaustive search through the entire model base.
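As an illustrative sketch (not taken from the paper), a minimal perceptual-grouping rule of the kind described above, testing parallelism between extracted line segments, might look as follows. The segment representation and tolerance are our own assumptions:

```python
import math

def segment_angle(seg):
    """Orientation of a line segment ((x1, y1), (x2, y2)) in radians."""
    (x1, y1), (x2, y2) = seg
    return math.atan2(y2 - y1, x2 - x1)

def are_parallel(seg_a, seg_b, tol=0.05):
    """Parallelism test: orientations agree within `tol` radians (mod pi)."""
    diff = abs(segment_angle(seg_a) - segment_angle(seg_b)) % math.pi
    return min(diff, math.pi - diff) < tol

def group_parallel(segments, tol=0.05):
    """Greedily assemble segments into groups of mutually parallel members,
    comparing each new segment against the first member of each group."""
    groups = []
    for seg in segments:
        for g in groups:
            if are_parallel(seg, g[0], tol):
                g.append(seg)
                break
        else:
            groups.append([seg])
    return groups
```

A real system would combine several such domain-independent criteria (proximity, continuation, symmetry) before passing the resulting feature groups to the indexing stage.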
A typical solution involves an indexing map x ↦ Idx(x) = {(C_j, a_j)}, which associates a given feature vector x with a set of object categories C_j and corresponding indexing coefficients a_j = P(C_j | x), i.e., the probability that the object C_j is present when x is observed. Each map entry indicates a set of model entries that could have produced the observed feature or feature group. An index constitutes an object hypothesis, i.e., a possible partial match between the image features and the object description, which must be further evaluated to determine the quality of the hypothesis. The best indices are those that are unique to a particular object and occur reliably in conjunction with that object. In general, however, a feature will index to several model entries, and a feature may not always be observed with a single object. Different variations of the principal indexing scheme have been proposed, using a large variety of features and indexing maps [2, 14]. The main learning task here is the creation of the indexing map from examples, which has to be done in synchronization with the model base. The task of model instantiation is to verify a given object hypothesis (delivered by the indexing process) with respect to the full object description in the model base, determine the viewpoint, measure the match quality, and eventually make a decision as to either accept or reject the given hypothesis. In general, this process may revert to additional knowledge sources and features, predict the existence of features, and invoke specific search processes to verify them. Graph structures are a common representation formalism in structural recognition, where primitive features are usually associated with graph nodes and the spatial relations correspond to the graph edges. Consequently, graph matching methods are at the core of many recognition approaches [4, 9]. The problem is complicated by the need to search for inexact matches and matches between subgraphs of the model and image structures. Usually, suboptimal matches are accepted to preclude combinatorial explosion. A hierarchical organization of the model base [8, 13] has advantages with respect to indexing and matching in certain applications. Naturally, the specification of the model base and the matching mechanism that uses it are intimately coupled. The key learning problem in the context of model instantiation is the creation and maintenance of the model base from observed examples.
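The learning of an indexing map from examples, as discussed above, can be sketched as frequency estimation over quantized feature vectors. This is a minimal illustration under our own assumptions (the quantization function and the probability estimate by relative frequency are not prescribed by the paper):

```python
from collections import defaultdict

def build_index(examples, quantize):
    """Estimate an indexing map Idx(x) = {(C_j, a_j)} from labeled
    (feature_vector, category) examples, where a_j approximates
    P(C_j | x) by the relative frequency of category C_j among all
    training examples falling into the same quantization bucket as x."""
    counts = defaultdict(lambda: defaultdict(int))
    for x, category in examples:
        counts[quantize(x)][category] += 1
    index = {}
    for key, by_cat in counts.items():
        total = sum(by_cat.values())
        index[key] = {c: n / total for c, n in by_cat.items()}
    return index

def lookup(index, x, quantize):
    """Return candidate object hypotheses with indexing coefficients;
    an empty result means the feature indexes no known model entry."""
    return index.get(quantize(x), {})
```

Each lookup result is a set of object hypotheses that must still be verified by the model instantiation stage; in general several categories share a bucket, reflecting the fact that a feature rarely indexes a single model entry.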
3 Applicable Machine-Learning Techniques
Supervised Learning. The task of supervised learning is defined as follows: given a set of positive and negative examples of a concept, described by fixed-length vectors x = (x1, ..., xn), derive some internal representation of the concept so that future examples can be correctly recognized. The vector x consists of features (attributes) describing the examples. The user's responsibility is to select relevant attributes.
Among the techniques whose performance has been demonstrated by many experimental studies, perhaps the most popular are divide-and-conquer algorithms, usually applied for the induction of decision trees [6]. A decision tree is a partially ordered set of tests identified with the internal nodes of the tree and class labels placed at its leaves. A decision test questions the value of some attribute, and the outcome points to the next test or leaf. Whenever an unseen example is to be classified, the system propagates it down the decision tree, starting at the root and ending up in a leaf containing the class label. Unsupervised Learning. In many conceivable applications, the ultimate objective is not so much to properly classify the object as to predict some of its properties that cannot be directly 'read' from the image. This is especially the case for systems whose task is to extract information that will facilitate the formulation of hypotheses about possible behavior of the environment. The utility of feature prediction is perhaps best illustrated by biological taxonomies. Being told that the body surface of an animal is hairy, we conclude that the animal is a representative of the class of mammals and can immediately predict the principles of its nervous, breathing, or digestion systems, and the like. Being told that the mammal is herbivorous, we will speculate that it has more than one stomach and can hypothesize about its teeth structure.
Obviously, objects can be taxonomized by many characteristics, and different taxonomies will vary in their ability to predict properties, so that the process of taxonomy generation must be guided by properly selected heuristics. In the realm of machine learning, construction of taxonomies from observations is usually referred to as concept formation.
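The decision-tree classification described above, i.e., propagating an unseen example from the root down to a leaf, can be sketched as follows. The nested-dictionary representation and the toy taxonomy mirroring the animal illustration are our own assumptions:

```python
def classify(tree, example):
    """Propagate an example down a decision tree to a leaf.

    Internal nodes: {"attribute": name, "branches": {value: subtree}}.
    Leaves:         {"label": class_name}.
    """
    node = tree
    while "label" not in node:
        value = example[node["attribute"]]
        node = node["branches"][value]
    return node["label"]

# A hypothetical tree, echoing the biological-taxonomy illustration.
animal_tree = {
    "attribute": "body_surface",
    "branches": {
        "hairy": {"label": "mammal"},
        "feathered": {"label": "bird"},
        "scaly": {
            "attribute": "habitat",
            "branches": {
                "water": {"label": "fish"},
                "land": {"label": "reptile"},
            },
        },
    },
}
```

An induction algorithm such as CART [6] would construct such a tree from labeled examples by recursively choosing the most discriminating attribute; the sketch shows only the classification pass.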
4 Specificity of the Learning Issues in Vision
Contemporary machine-learning techniques have been developed without the peculiarities of vision in mind. On the other hand, vision professionals themselves often lack deeper experience with the products of the learning domain. Perhaps the best way to reconcile the two fields is by trying to answer the question of 'what makes vision-learning tasks specific'. A typical image is described by many numeric attributes. The fact that the attributes tend to be strongly interdependent somewhat reduces the applicability of simple machine-learning techniques, such as decision trees, in their basic form. To facilitate the search for hidden interrelations among features, multivariate trees were introduced [7], where the internal nodes test linear combinations of the features. An unpleasant complication is that one rarely knows which of the features or attributes (often organized in complex hierarchies) really matter for the particular task and, hence, the system is bound to learn from examples described by very long vectors. In the process of feature extraction, the agent is not always capable of saying where each feature belongs in the description vector. While the vast majority of the existing learning systems assume that each attribute has a fixed position in the vector, vision systems are not always capable of delivering such rigid information. Hence, the learner faces the necessity of matching descriptions of flexible structure. In general, matching is understood as the search for common properties of two images in the absence of any additional information about their contents. In the search for appropriate representation schemes, some authors attempt to make use of existing algorithms for conceptual clustering. The objective of the clustering is to find groups in the images of an object that represent identical or similar aspects of the object.
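The multivariate-tree test mentioned above can be illustrated with a single oblique split. The weights and threshold below are hypothetical; the point is that one node testing a linear combination w · x > t can separate strongly interdependent attributes that an ordinary, axis-parallel test would need many nodes to handle:

```python
def multivariate_test(weights, threshold, x):
    """Decision test of a multivariate (oblique) tree node: follow the
    'yes' branch iff the linear combination of features exceeds the
    threshold, i.e., sum_i w_i * x_i > t."""
    return sum(w * v for w, v in zip(weights, x)) > threshold
```

For instance, the single test x1 + x2 > 1 defines a diagonal decision boundary that a univariate tree could only approximate with a staircase of axis-parallel splits.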
In many vision applications, the descriptions delivered by machine-learning techniques tend to be too rigid. [13] describes a system generating decision trees where the decision tests are based on a similarity measure and several options are allowed at each moment. The example is propagated simultaneously along several branches by way of a beam-search strategy. As the description of real-world objects is typically polymorphic, the recognition agent should be able to accumulate evidence about some particular classification. This is what current logic-based learners usually neglect. In vision, this is usually referred to as indexing. Once a tentative classification of the object to some category has been made, it should be corroborated by other evidence. Ron Sun [16] exploits Michalski's idea of two-tier representations and suggests an ingenious scheme where one layer contains concepts and one layer contains features. Both layers are interconnected by synaptic connections, and the system combines symbolic and neural-network reasoning to decide about the object's category. Two major problems arise: first, the performance (and thus the improvement) of any module cannot be evaluated individually, but only in the context of the entire system, causing what is called a credit assignment problem in AI, in both the structural and the temporal sense. Reinforcement learning [17] appears to be a viable approach to handle this sort of problem, but little work has been done in this particular direction. Secondly, learning has to cope with the fact that the knowledge structures involved in this process are also not decoupled, but exhibit complex interdependencies. This problem has hardly been addressed at all and raises the question of whether the traditional task decomposition is suitable from the learning viewpoint. The computational costs of vision learning necessitate that the learner acquire knowledge incrementally, meaning that the arrival of a new learning example does not require that the learning process be re-run over all earlier examples. Rather, the new example should only have a small 'tuning' effect on the previous knowledge structure. One of the key premises of machine learning is that the learner abstracts from the examples a compact and understandable representation. However, in vision learning, compactness and interpretability are not always strictly required. For illustration, consider a module performing character recognition within some larger system designed for document processing.
Obviously, the primary criterion will be the ability to recognize many characters per unit of time, rather than to provide explanations for each particular character.
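The incremental-learning requirement discussed above, i.e., that a new example should only 'tune' the existing knowledge structure rather than trigger re-learning over all earlier examples, can be sketched with a running-mean class prototype. The prototype representation is our own assumption, chosen only because its update rule makes the tuning effect explicit:

```python
class IncrementalPrototype:
    """Maintain a class prototype as the running mean of its feature
    vectors. Each update touches only the stored mean and a counter;
    earlier examples never need to be revisited."""

    def __init__(self, dim):
        self.n = 0
        self.mean = [0.0] * dim

    def update(self, x):
        """Incorporate one new example: mean += (x - mean) / n, so the
        influence of each new example shrinks as evidence accumulates."""
        self.n += 1
        self.mean = [m + (v - m) / self.n for m, v in zip(self.mean, x)]
```

The same principle, small local adjustments instead of global re-computation, is what distinguishes incremental learners from batch algorithms that must store and reprocess the entire example set.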
5 Conclusions
Today's upsurge of interest in vision learning has not been accompanied by a systematic study of the specificity of some challenges posed to learning by computer-vision tasks. In writing this contribution, our ambition was to draw the researchers' attention to various peculiarities that have, so far, been neglected by the machine-learning community and remain open as a research area. The breadth of the field necessitates that we concentrate only on the issues related to object recognition and model learning. In this paper we have tried to show that more systematic attention should be directed toward such issues as the incorporation of feature detectors into learners, learning from vectors of variable length, building on flexible representation schemes facilitating matching procedures, and context-sensitive learning, to name just a few. A more detailed examination of learners possessing these properties is necessary if vision applications are to escape ad hoc solutions and actually start to profit from available machine-learning technology.
References
[1] NSF/ARPA Workshop on Machine Vision and Learning, October 1992.
[2] J.S. Beis and D.G. Lowe. Learning indexing functions for 3-D model-based object recognition. In Proc. Conf. on Computer Vision and Pattern Recognition, pages 275-280, 1994.
[3] B. Bhanu and T. Poggio. Special section on learning in computer vision. IEEE Trans. on Pattern Analysis and Machine Intelligence, 16(9):865-919, September 1994.
[4] R.C. Bolles and R.A. Cain. Recognizing and locating partially visible objects: The local-feature-focus method. International Journal of Robotics Research, 1(3):57-82, 1982.
[5] K.W. Bowyer, L.O. Hall, P. Langley, B. Bhanu, and B. Draper. Report of the AAAI fall symposium on machine learning and computer vision: What, why and how? In Proc. DARPA Image Understanding Workshop, pages 727-731, 1994.
[6] L. Breiman, J. Friedman, R. Olshen, and C.J. Stone. Classification and Regression Trees. Wadsworth Int. Group, 1984.
[7] C.E. Brodley and P.E. Utgoff. Multivariate decision trees. Machine Learning, 1995.
[8] G.J. Ettinger. Large hierarchical object recognition using libraries of parameterized model sub-parts. In Proc. Conf. on Computer Vision and Pattern Recognition, pages 32-41, 1988.
[9] D.G. Lowe. Three-dimensional object recognition from single two-dimensional images. Artificial Intelligence, 31:355-395, 1987.
[10] R.S. Michalski, A. Rosenfeld, and Y. Aloimonos. Machine vision and learning: Research issues and directions. NSF/ARPA Workshop GMU-5-25010-1, George Mason Univ., October 1994.
[11] R. Mohan and R. Nevatia. Perceptual organization for scene segmentation and description. IEEE Trans. on Pattern Analysis and Machine Intelligence, 14:616-635, 1992.
[12] A.R. Pope. Model-based object recognition: A survey of recent research. Technical Report 94-04, University of British Columbia, Vancouver, CA, January 1994.
[13] K. Sengupta and K.L. Boyer. Information theoretic clustering of large structural model bases. In Proc. Conf. on Computer Vision and Pattern Recognition, pages 174-179, 1993.
[14] F. Stein and G. Medioni. Structural indexing: Efficient 2-D object recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence, 14(12):1198-1204, 1992.
[15] P. Suetens, P. Fua, and A.J. Hanson. Computational strategies for object recognition. Computing Surveys, 24(1):5-61, 1992.
[16] R. Sun. A two-level hybrid architecture for structuring knowledge for commonsense reasoning. In R. Sun and L.A. Bookman, editors, Computational Architectures Integrating Neural and Symbolic Processing. Kluwer, Boston, 1995.
[17] R.S. Sutton, editor. Reinforcement Learning. Kluwer, 1992.