Experiments on (Intelligent) Brute Force Methods for Appearance-Based Object Recognition

Randal C. Nelson and Andrea Selinger
Department of Computer Science, University of Rochester, Rochester, NY 14627
(nelson, selinger)@cs.rochester.edu

March 27, 1997
Abstract
It has long been recognized that, in principle, object recognition problems can be solved by simple, brute-force methods. However, the approach has generally been held to be completely impractical. We argue that by combining a few more or less standard tricks with computational resources that are historically large, but completely feasible by recent standards, dramatic results can be achieved for a number of recognition problems. In particular, we describe a resource-intensive, appearance-based method that utilizes intermediate-level features to provide normalized keys into a large, memorized feature database, and Bayesian evidence combination coupled with a Hough-like indexing scheme to assemble object hypotheses from the memory. This system demonstrates robust recognition of a variety of 3-D shapes, ranging from sports cars and fighter planes to snakes and lizards, over full spherical or hemispherical viewing ranges (and planar scale, translation, and rotation). We report the results of various large-scale performance tests involving, altogether, over 2000 separate test images. These include performance scaling with database size, robustness against clutter, and generic ability. The result of 97% forced-choice accuracy with full orthographic invariance for 24 complex curved 3-D objects over full viewing spheres or hemispheres is the best we are aware of for this type of problem.

Key Words: Object recognition, Appearance-based representations, Visual learning.
1 Introduction
Object recognition and indexing problems have been among the most intensely studied in the field of machine vision. Until recently, however, recognition systems, especially three-dimensional ones, were quite limited in their abilities, both in the types of objects they could handle and in the conditions under which the methods would work. It has been recognized from the beginning that, in principle, object recognition can be solved using a brute-force approach: just compare "all" possible appearances of an object directly against an image. It can even be argued that the complexity of such algorithms is linear in the number of objects. However, the constant factors in this approach are so large, e.g., about 10^22 operations per rigid object using no smarts (10,000 pixels x 100 intervals per degree of freedom for 6 rigid and 3 lighting freedoms, i.e., 10^4 x (10^2)^9 = 10^22), that the approach was dismissed as completely infeasible, and work concentrated on the development of efficient algorithms. Much has been learned from this work, but what has not emerged are algorithms that are efficient enough to solve object recognition problems in any but the most limited contexts, with the sort of computer power available on, say, a 1990 desktop.

One possible conclusion is that such efficient recognition algorithms do not exist. Though there could be some undiscovered technique that will dramatically improve the situation, this seems unlikely, given the effort focused on the problem in the last three decades. Nor is there particular evidence for the existence of such efficient algorithms in the human brain. Granted, it performs certain recognition tasks extremely well, but estimates of the computational resources it would take to simulate the relevant neural processes range from 1 to 1000+ teraflops, with 1 to 1000 terabytes of stored information. Such estimates are extremely uncertain, but they are all large compared to a 1990 desktop, though small compared to 10^22.

In the field of computer science, the most dramatic change has been the phenomenal increase in the computational power of machines, which, for a fixed price, has doubled about every 18 months for three decades. This phenomenon, sometimes referred to as Moore's Law, has repeatedly exceeded all expectations, and continues to the current date, with only a few signs of abatement. The resulting million-fold increase has made certain algorithms, once thought impractical, practical to run. In the case of recognition, a question that has not been adequately addressed experimentally is whether near-term increases in computational resources, coupled with modest algorithmic "smarts", can achieve, for certain problems, what algorithmic improvement alone has failed to do, making recognition, in one sense, a relatively "easy" problem.
We think the answer to this question is yes, and furthermore that a threshold that allows interesting results to be obtained from such experiments without exclusive use of a supercomputer has recently been crossed. In support of the position that resource-intensive methods are worth looking at closely with today's power, we present some results for an appearance-based system which we believe represent the best 3-D recognition results reported anywhere for general rigid objects. The system uses only a modest amount of force (relatively speaking) and only a little of the available algorithmic cleverness, yet the results suggest not only scalable general rigid-object recognition, but good performance in the presence of clutter and some surprising generic ability.
2 Background
The most successful visual recognition work to date has used model-based systems; notable recent examples are [11, 10, 9, 6]. The 3-D geometric models on which these systems are based are both their strength and their weakness [8, 7]. On the one hand, explicit models provide a framework that allows powerful geometric constraints to be utilized to good effect. On the other, model schemas are generally severely limited in the sorts of objects they can represent, and obtaining the models is typically a difficult and time-consuming process. There has been a fair amount of work on automatic acquisition of geometric models, mostly with range sensors, e.g., [17, 19, 2], but also visually, for various representations [20, 3, 1, 5]. However, these techniques are limited to a particular geometric schema, and even within their domain, especially with visual techniques, their performance is often unsatisfactory.

Appearance-based object recognition methods are resource-intensive algorithms that have been proposed in order to make recognition systems more general and more easily trainable from visual data. Most of them essentially operate by comparing an image-like representation of object appearance against many prototype representations stored in a memory, and finding the closest match. They have the advantage of being fairly general, and often easily trainable. In recent work, Poggio has recognized wire objects and faces [15, 4]. Rao and Ballard [16] describe an approach based on the memorization of the responses of a set of steerable filters. Mel [12] takes a somewhat similar approach using a database of stored feature vectors representing multiple low-level cues. Murase and Nayar [13] find the major principal components of an image dataset, and use the projections of unknown images onto these as indices into a recognition memory. Schmid and Mohr [18] have recently reported good results for an appearance-based system with a local-feature approach similar in spirit to ours, though with different features and a much simpler evidence-combination scheme.

In general, appearance-based methods have proven useful; however, because matches are generally made to representations of complete objects, these methods tend to be more sensitive to clutter and occlusion than is desirable, and require good global segmentation for success. The Hough transform
and other voting methods allow evidence from disconnected parts to be effectively combined, but the size of the voting space increases exponentially with the number of degrees of visual freedom. This makes it difficult to apply such techniques directly when more than about 3 DOF are involved, limiting their use for 3-D object recognition, which generally involves at least 6 DOF.

We have implemented a prototype system that, by combining a large appearance database of semi-local, intermediate-level key features with a Hough-like evidence-combination technique, resolves both the clutter and occlusion sensitivity of traditional memory-based methods and the space problems of voting methods for high-DOF problems. This system demonstrates robust recognition of a variety of 3-D shapes, ranging from sports cars and fighter planes to snakes and lizards, over full spherical or hemispherical viewing ranges (and planar scale, translation, and rotation). It is also robust against clutter, and demonstrates some generic ability. This is in contrast to some recent results, e.g., Murase and Nayar [13], where essentially only one of the two out-of-plane rotational degrees of freedom is spanned, and clutter is a significant problem.
3 The Method

3.1 Overview
The basic notion is to represent the visual appearance of an object as a structured combination of a number of semi-local features, or fragments. The idea is that, under different conditions (e.g., lighting, background, changes in orientation, etc.), the feature extraction process will find some of these, but in general not all of them. However, we show that the fraction that is found is frequently sufficient to identify objects in the scene. This addresses one of the principal problems of object recognition: in any but rather artificial conditions, it has so far proved impossible to reliably segment whole objects on a bottom-up basis. In this paper, local features based on automatically extracted boundary fragments are used to represent multiple 2-D views of rigid 3-D objects, but the basic idea could be applied to other features and other representations.

In more detail, we make use of semi-invariant local objects we call keys. A key is any robustly extractable part or feature that has sufficient information content to specify a configuration of an associated object, plus enough additional parameters to provide efficient indexing and meaningful verification. The basic idea is to utilize a database (here viewed as an associative memory) organized so that access via a key feature evokes associated hypotheses for the identity and configuration of all objects that could have produced it. These hypotheses are fed into a second-stage associative memory, keyed by the configuration, which maintains a probabilistic estimate of the likelihood of each hypothesis based on statistics about the occurrence of the keys in the primary database. In our case, since 3-D objects are represented by a set of views, the configurations represent two-dimensional transforms. Efficient access to the associative memories is achieved using a hashing scheme.
One step that we do not take in the current system is whole-object verification of the top hypotheses. Unlike systems based on whole-object appearance, the structure of our representation is such that this could be performed to advantage, and such a step has the potential to significantly improve the performance of the system as a whole. The results given should thus be interpreted as representing the power of an initial hypothesis generator or indexing system.
3.2 Key Features
The recognition technique is based on the assumption that robustly extractable, semi-invariant keys can be efficiently recovered from image data. More specifically, the keys must possess the following characteristics. First, they must be complex enough not only to specify the configuration of the object, but to have parameters left over that can be used for indexing. Second, the keys must have a substantial probability of detection if the object containing them occupies the region of interest (robustness). Third, the index parameters must change relatively slowly as the object configuration changes (semi-invariance).

We currently make use of a single key-feature type consisting of curve-orientation templates normalized by robust boundary fragments. We call these features curve patches. Specifically, a curve-finding algorithm is run on an image, producing a set of segmented contour fragments broken at points of high curvature. The longest curves are selected as key curves, and a fixed-size template (21 x 21) is constructed, with a base segment determined by the endpoints (or the diameter in the case of closed or nearly closed curves) of the key curve occupying a canonical position in the template. All image curves that intersect the normalized template are mapped into it with a code specifying their orientation relative to the base segment. Matching a candidate template involves taking the model patch curve points and verifying that a curve point with similar orientation lies nearby in the candidate template. Essentially this amounts to directional correlation.
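What follows is a minimal sketch of such a directional correlation, under an assumed data layout: a patch is a 21 x 21 array holding, at each cell a curve passes through, that curve's orientation relative to the base segment (in radians), and NaN elsewhere. The neighborhood radius and the orientation tolerance are illustrative values, not parameters from our implementation.

```python
import numpy as np

TEMPLATE_SIZE = 21  # fixed template size given in the text

def directional_correlation(model, candidate, radius=1, ang_tol=np.pi / 8):
    """Fraction of model curve points that have a similarly oriented
    candidate curve point nearby (a sketch, not the paper's exact code)."""
    hits, total = 0, 0
    rows, cols = np.nonzero(~np.isnan(model))      # model curve points only
    for r, c in zip(rows, cols):
        total += 1
        r0, r1 = max(r - radius, 0), min(r + radius + 1, TEMPLATE_SIZE)
        c0, c1 = max(c - radius, 0), min(c + radius + 1, TEMPLATE_SIZE)
        window = candidate[r0:r1, c0:c1]           # nearby candidate cells
        d = np.abs(window - model[r, c]) % np.pi
        d = np.minimum(d, np.pi - d)               # undirected tangents wrap at pi
        if np.any(d < ang_tol):                    # NaN comparisons are False
            hits += 1
    return hits / total if total else 0.0
```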
3.3 Recognition Procedure
In order to recognize objects, we must first prepare a database against which the matching takes place. To do this, we first take a number of images of each object, covering the region of the viewing sphere over which the object may be encountered. The exact number of images per object may vary depending on the features used and any symmetries present, but for the patch features we use, obtaining training images about every 20 degrees is sufficient. To cover the entire sphere at this sampling requires about 100 images. For every image so obtained, the boundary-extraction procedure is run, and the best 25 or so boundaries are selected as keys, from which patches are generated and stored in the database. With each patch is associated the identity of the object that produced it, the viewpoint it was taken from, and three geometric parameters specifying the 2-D size, location, and orientation of the image of the object relative to the key curve. This information permits a hypothesis about the identity, viewpoint, size, location, and orientation of an object to be made from any match to the patch feature. A sketch of such a database entry appears below.
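This is a minimal sketch of one stored patch and the hash-keyed primary memory; the field names and the bucketing function are placeholders, since the text specifies only what information is associated with each patch and that a hashing scheme is used for access.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple
import numpy as np

@dataclass
class PatchEntry:
    object_id: str            # identity of the object that produced the patch
    view_id: int              # which of the ~100 training views on the sphere
    rel_size: float           # 2-D object size relative to the key curve
    rel_location: Tuple[float, float]   # object location relative to the key curve
    rel_orientation: float    # object orientation relative to the base segment
    patch: np.ndarray         # the 21 x 21 curve-orientation template

# primary associative memory: hash key -> all patches in that bucket
primary_memory: Dict[int, List[PatchEntry]] = {}

def store(hash_key: int, entry: PatchEntry) -> None:
    primary_memory.setdefault(hash_key, []).append(entry)
```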
The basic recognition procedure consists of four steps. First, potential key features are extracted from the image using low- and intermediate-level visual routines. In the second step, these keys are used to access the database memory and retrieve information about what objects could have produced them, and in what relative configuration. The third step uses this information to produce hypotheses about the identity and configuration of potential objects. Finally, these hypotheses are themselves used as keys into a second associative memory, where evidence for them is accumulated. After all features have been so processed, the hypothesis with the highest evidence score is selected. Secondary hypotheses can also be reported.
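A compact sketch of this four-step loop follows, with the stage implementations supplied as callables; `extract_keys`, `lookup`, `hypothesize_pose`, and `match_weight` are hypothetical stand-ins for the stages described above (an `entry` here is a database record like the `PatchEntry` sketched earlier).

```python
from collections import defaultdict

def recognize(image, extract_keys, lookup, hypothesize_pose, match_weight):
    evidence = defaultdict(float)                 # secondary associative memory
    for key in extract_keys(image):               # step 1: key feature extraction
        for entry in lookup(key):                 # step 2: primary-memory access
            pose = hypothesize_pose(key, entry)   # step 3: identity/config hypothesis
            evidence[(entry.object_id, pose)] += match_weight(key, entry)  # step 4
    if not evidence:
        return None
    # forced choice: the hypothesis with the highest accumulated evidence;
    # secondary hypotheses are simply the runners-up in this table
    return max(evidence.items(), key=lambda kv: kv[1])
```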
3.4 Evidence Combination
In the final step described above, an important issue is the method of combining evidence. The simplest technique is an elementary voting scheme in which each piece of evidence contributes equally to the total. This is clearly not well founded, as a feature that occurs in many different situations is not as good an indicator of the presence of an object as one that is unique to it. For example, with 24 3-D objects stored in the database, comprising over 30,000 patches, we find that some image features match 1000 or more database features, while others match only one or two. An evidence scheme that takes this into account would probably display improved performance.

An obvious approach in our case is to use statistics computed over the information contained in the associative memory to evaluate the quality of a piece of information. The optimal quality measure, which would rely on the full joint probability distribution over keys, objects, and configurations, is infeasible to compute, and thus we must use some approximation. A simple choice is the first-order feature-frequency distribution over the entire database, and this is what we do. The actual algorithm is to accumulate evidence, for each match supporting a pose, proportional to

    F log(k/m),

where m is the number of matches to the image feature in the whole database, and k is a proportionality constant that attempts to make m/k represent the actual geometric probability that some image feature matches a particular patch in the pose model by accident. It can be shown that maximizing the summed reciprocal log terms is equivalent to Bayesian maximum-likelihood evidence combination using the match frequency as an estimate of the prior probability of the feature type, and assuming independence of observations. F represents an additional empirical factor, proportional to the square root of the size of the feature in the image and to the 4th root of the number of key features in the model. These modifications capture certain aspects that seem important to the recognition process but are difficult to model using formal probability (essentially, that bigger features are better, and that the simplest explanation is preferred).
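The equivalence can be sketched as follows, omitting the empirical factor F and taking, as stated above, m_i/k as the accidental-match probability p_i of the i-th supporting feature:

```latex
% Evidence for a pose hypothesis h supported by matches i = 1..n,
% where m_i is the number of database matches to image feature i:
\sum_{i=1}^{n} \log\frac{k}{m_i}
  \;=\; -\sum_{i=1}^{n} \log p_i
  \;=\; -\log \prod_{i=1}^{n} p_i,
\qquad p_i \approx \frac{m_i}{k}.
% Maximizing the summed log terms therefore minimizes \prod_i p_i, the
% probability (under independence) that the supporting matches arose by
% accident, i.e., the maximum-likelihood choice of h.
```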
The above measure allows us to combine evidence for all feature matches associated with a given pose hypothesis and a set of evidence. We now want to find the maximum of this over all possible poses. Clearly, we cannot directly evaluate all pose hypotheses: there are too many of them (e.g., 20 objects x 100 viewpoints x 100 image locations x 20 orientations x 10 sizes = 40,000,000 poses to check). In our algorithm, the indexing into the secondary associative memory functions as an efficient way of accumulating the evidence for all poses that have any evidence associated with them at all (most possible poses have none, for a given set of evidence). This is the basic Hough transform idea, and it permits the pose with maximum evidence to be found in time proportional to the number of pieces of evidence times a database-lookup factor, rather than in time proportional to the number of possible poses.
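A minimal sketch of this accumulation follows. Only poses that receive at least one vote ever appear in the table, so cost scales with the number of matches rather than the ~4 x 10^7 possible poses. The value of the constant k, the literal form of F (the text gives only proportionalities), and the quantization of poses into bins are all assumptions here.

```python
import math
from collections import defaultdict

K = 1.0e4  # illustrative value for the constant k only

def match_weight(m, feature_size, num_model_keys):
    """Evidence for one match: F * log(k / m)."""
    # F proportional to sqrt(image feature size) and to the 4th root of
    # the number of key features in the model, as stated in the text
    F = math.sqrt(feature_size) * num_model_keys ** 0.25
    return F * math.log(K / m)

pose_evidence = defaultdict(float)   # secondary memory, keyed by pose bin

def vote(pose_bin, m, feature_size, num_model_keys):
    pose_evidence[pose_bin] += match_weight(m, feature_size, num_model_keys)

# After all votes are cast, the winning pose is the argmax over the
# table, whose size is bounded by the number of votes, not of poses.
```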
3.5 Implementation
Using the principles described above, we implemented a recognition system for rigid 3-D objects. The system needs a particular shape or pattern to index on, and does not work well for objects whose character is statistical, such as generic trees or pine cones. Component boundaries were extracted by modifying a stick-growing method for finding segments, developed recently at Rochester [14], so that it could follow curved boundaries. The system is trained using images taken approximately every 20 degrees around the sphere, amounting to about 100 views for a full sphere and 50 for a hemisphere; this spacing reflects the difficulty of making the templates sufficiently flexible to match between views. For objects entered into the database, the best 25 key features were selected to represent the object in each view. The thresholds on the distance metrics between features were adjusted so that they would tolerate approximately 15-20 degrees of deviation in the appearance of a frontal plane (less for oblique ones). The system can be used both for "recognition" (what is this?) and "finding" (where is object X in this scene?) operations. Preliminary experiments on the finding operation indicate good performance in large, complex scenes, but we have not yet acquired even moderate test databases for this problem, so the experiments reported below all involve recognition tasks.

3.6 Resource Requirements

The resource requirements scale more or less linearly with the size of the database. Memory is about 3 Mbytes per hemisphere, and overall times on a single-processor UltraSPARC are about 20 seconds for the 6-object database and about 2 minutes for the 24-object database. These numbers could almost certainly be improved by pushing on the indexing and data replication, which we have not done as yet.
4 Experiments
4.1 Variation in Performance with Size of Database
One measure of the performance of an object recognition system is how performance changes as the number of classes increases. To test this, we obtained test and training images for a number of objects, and built 3-D recognition databases using different numbers of objects. The objects were chosen to be "different" in that they were easy for people to distinguish on the basis of shape. Data was acquired for
24 different objects (34 hemispheres). The objects are shown in Figure 1. The number of hemispheres is not equal to twice the number of objects because a number of the objects were either unrealistic or painted
flat black on the bottom, which made getting training data against a black background difficult. Clean image data was obtained automatically using a combination of a robot-mounted camera and a computer-controlled turntable covered in black velvet. Training data consisted of 53 images per hemisphere, spread fairly uniformly, with approximately 20 degrees between neighboring views. The test data consisted of 24 images per hemisphere, positioned in between the training views, and taken under the same good conditions. Note that this is essentially a test of invariance under out-of-plane rotations, the most difficult of the 6 orthographic freedoms. The planar invariances are guaranteed by the representation, once above the level of feature extraction, and experiments testing this have shown no degradation due to translation, rotation, and scaling up to 50%. Larger changes in scale have been accommodated using a multi-resolution feature finder, which gives us 4 or 5 octaves at the cost of doubling the size of the database.

We ran tests with databases built for 6, 12, 18, and 24 objects, shown in Figure 1, and obtained overall success rates (correct classification on forced choice) of 99.6%, 98.7%, 97.4%, and 97.0% respectively. (To find out which objects are in which database, just count the images left to right, top to bottom.) The results are summarized in Table 1. The worst cases were the horse and the wolf in the 24-object test, with 19/24 and 20/24 correct respectively. On inspection, some of these pictures were difficult for human subjects. None of the other examples had more than 2 misses out of the 24 (hemisphere) or 48 (full sphere) test cases. Overall, the performance is fairly good. In fact, we believe this represents the best results presented anywhere for this sort of problem.

  num. of   num. of       num. of       num.      percent
  objects   hemispheres   test images   correct   correct
     6         11            264          263       99.6
    12         18            408          403       98.7
    18         26            576          561       97.4
    24         34            768          745       97.0

Table 1: Performance of forced-choice recognition for databases of different sizes
4.2 Performance in the Presence of Clutter
The feature-based nature of the algorithm provides some immunity to clutter in the scene, in contrast to appearance-based schemes that use the structure of the full object and require good global segmentation. For modest dark-field clutter, the method is quite robust. To test this, we acquired test sets of the six objects used in the previous 6-object case in the presence of non-occluding clutter. Examples of the test images are shown in Figure 2. Out of 264 test cases, 252 were classified correctly, which gives
Figure 1: The objects used in testing the system
a recognition rate of about 96%, compared to 99% for uncluttered test images. A confusion matrix is shown in Figure 3.

  class  | ref | num |  0   1   2   3   4   5
  -------+-----+-----+------------------------
  cup    |  0  |  48 | 47   0   1   0   0   0
  bear   |  1  |  48 |  2  46   0   0   0   0
  car    |  2  |  24 |  0   0  24   0   0   0
  rabbit |  3  |  48 |  0   0   1  47   0   0
  plane  |  4  |  48 |  0   0   2   1  45   0
  fightr |  5  |  48 |  0   0   1   0   4  43
  -------+-----+-----+------------------------
  Hypoths. for class |  49  46  29  48  49  43
Figure 3: Error matrix for object classification experiment with clutter. Columns contain counts of classification results for test images of each type.

In a second experiment, we took pictures of the objects against a light background. Clutter in these images arises from shadows, from wrinkles in the fabric, and from a substantial shading discontinuity between the turntable and the background. Unlike the dark-field pictures, the objects in many of these pictures are not trivially segmentable. Examples of the test images are shown in Figure 4, and the boundaries found in Figure 5. Note that some of the images produce substantial numbers of clutter curves. (All the images shown were classified correctly.) Out of 264 test cases, 236 were classified correctly, which gives an overall recognition rate of about 90%, not as good as some of our other results. However, almost half the errors were due to instances of the toy bear, the reason being that the gray level of the bear's body was so close to the upper background in low-level shots that many of the main boundaries could not be found (people had trouble with these shots too). If this case is excluded, the rate is about 94%. A confusion matrix is shown in Figure 6.

  class  | ref | num |  0   1   2   3   4   5
  -------+-----+-----+------------------------
  cup    |  0  |  48 | 44   2   0   1   1   0
  bear   |  1  |  48 |  3  32   1   5   2   5
  car    |  2  |  24 |  0   0  24   0   0   0
  rabbit |  3  |  48 |  1   0   0  47   0   0
  plane  |  4  |  48 |  0   0   0   0  45   3
  fightr |  5  |  48 |  0   0   1   0   3  44
  -------+-----+-----+------------------------
  Hypoths. for class |  48  34  26  53  51  52
Figure 6: Error matrix for light-field classification experiment. Columns contain counts of classification results for test images of each type.
4.3 Experiments on "Generic" Recognition
This set of experiments was suggested when we tried showing our coffee mugs to an early version of the system that had been trained on the creamer cup in the previous database (among other objects), and noticed that the system was making the "correct" generic call a significant percentage of the time. Moreover, the features that were keying the classification were the "right" ones, i.e., boundaries derived from the handle and the circular sections, even though there was no explicit part model of a cup in the system.

The notion of generic visual classes is ill defined scientifically. What we have are human subjective impressions that certain objects look alike and belong in the same group (e.g., airplanes, sports cars, spiders, teapots, etc.). Unfortunately, human visual classes tend to be confounded with functional classes, and biased by experience and other factors, to an extent that makes formalizing such classes, even phenomenologically, pretty tough. On the other hand, the subjective intuition is so strong, and the early evidence of correct "generalization" so intriguing, that the matter seemed worth looking into.

For the test, we gathered multiple examples of objects from several classes, which an (informal) sample of human volunteers agreed looked pretty much alike (our rough criterion was that you could tell at a glance what class an object was in, but had to take a "second look" to determine which member of the class it was). We ended up with five classes consisting of 11 cups, 6 "normal" airplanes, 6 fighter jets, 9 sports cars, and 8 snakes. The recognition system was trained on a subset of each class and tested on the remaining elements. The training sets consisted of 4 cups, 3 airplanes, 3 fighter jets, 4 sports cars, and 4 snakes. These classes are shown in Figure 7, with the training objects on the left of each picture and the test objects on the right. The training and test views were taken according to the same protocol as in the previous experiment. The cups, planes, and fighter jets were sampled over the full sphere; the cars and snakes over the top hemisphere (the bottom sides were not realistically sculpted).

Overall performance on forced-choice classification for 792 test images was 737 correct, or 93.0%. If we average performance over the groups, so that the best group, the cups, is not weighted more heavily because we had more samples, we get 92% (91.96%). The error matrix is shown in Figure 8. The performance is best for the cups at about 98%; the planes, sports cars, and snakes came in around 92%-94%. The fighter planes were the worst by a significant factor, at about 83%. The reason seems to be that there is quite a bit of difference between the exemplars in some views in terms of armament carried, which tends to break up some of the lines in a way the current boundary finder does not handle. Two of the test cases also have camouflage patterns painted on them. The snakes were actually a bit of a surprise, given the degree of flexibility, and the fact that none of the curves are actually the same (this is supposedly a rigid-object recognition system).
Figure 2: Examples of test images with modest dark-field clutter
Figure 4: Examples of test images on light background, with shadows and minor texture
Figure 5: Curves found by boundary extraction algorithm in light background images
Figure 7: Test sets used in the generic recognition experiment. The training objects are on the left side of each image (4 cups, 3 planes, 3 fighters, 4 cars, 4 snakes) and the test objects are on the right.

  class  | ref | num |   0    1    2    3    4
  -------+-----+-----+-------------------------
  cup    |  0  | 288 | 282    0    6    0    0
  fightr |  1  | 144 |   0  120    7   16    1
  snake  |  2  |  96 |   5    0   88    1    2
  plane  |  3  | 144 |   0    2    7  135    0
  car    |  4  | 120 |   1    0    6    1  112
  -------+-----+-----+-------------------------
  Hypoths. for class | 288  122  114  153  115
Figure 8: Error matrix for generic classification experiment. Columns contain counts of classification results for test images of each type.

The key seems to be the generic "S" shape, which recurs in various ways in all the exemplars, and is quite rare in general scenes. These results do not say anything conclusive about the nature of "generic" recognition, but they do suggest a route by which generic capability could arise in an appearance-based system that was initially targeted at recognizing specific objects, but needed enough flexibility to deal with inter-pose variability and environmental lighting effects. They also suggest that one way of viewing generic classes is that they correspond to clusters in a (relatively) spatially uniform metric space defined by a general, context-free classification process. Finer distinctions would make use of this context.
5 Conclusions and Future Work
In this paper we have described a framework for keyed appearance-based 3-D recognition, which
avoids some of the problems of previous appearance-based schemes. We ran various large-scale performance tests and found good performance for full-sphere/hemisphere recognition of up to 24 complex, curved objects, robustness against clutter, and some intriguing generic recognition behavior. Future plans include adding enough additional objects to push the performance below 75%, both to better observe the functional form of the error dependence on scale, and to provide a basis for substantial improvement. We also want to see how the performance can be improved by adding a final verification stage, since we have observed that even when the system provides the wrong answer, the "right" one is generally in the top few hypotheses. In another direction, we have some preliminary results indicating that the system, when coupled with a simple memory-constraint protocol, functions very well for finding particular objects in large, highly cluttered scenes. We plan to gather enough data for this problem to generate statistically significant performance data. Finally, we want to experiment with adapting the system to allow fine discrimination of similar objects (same generic class) using directed processing driven by the generic classification.
References
[1] Nicholas Ayache and Olivier Faugeras. HYPER: A new approach for the recognition and positioning of two-dimensional objects. IEEE Trans. PAMI, 8(1):44-54, January 1986.

[2] Aaron F. Bobick and Robert C. Bolles. Representation space: An approach to the integration of visual information. In Proc. CVPR, pages 492-499, San Diego CA, June 1989.
[3] Robert C. Bolles and R. A. Cain. Recognizing and localizing partially visible objects: The local-features-focus method. International Journal of Robotics Research, 1(3):57-82, Fall 1982.

[4] R. Brunelli and Tomaso Poggio. Face recognition: Features versus templates. IEEE Trans. PAMI, 15(10):1042-1062, 1993.

[5] F. Stein and Gerard Medioni. Efficient 2-dimensional object recognition. In Proc. ICPR, pages 13-17, Atlantic City NJ, June 1990.

[6] W. E. L. Grimson. Object Recognition by Computer: The Role of Geometric Constraints. The MIT Press, Cambridge, 1990.

[7] W. E. L. Grimson and Daniel P. Huttenlocher. On the sensitivity of the Hough transform for object recognition. IEEE Trans. PAMI, 12(3):255-274, 1990.

[8] W. E. L. Grimson and Daniel P. Huttenlocher. On the sensitivity of geometric hashing. In 3rd International Conference on Computer Vision, pages 334-338, 1990.

[9] Daniel P. Huttenlocher and Shimon Ullman. Recognizing solid objects by alignment with an image. International Journal of Computer Vision, 5(2):195-212, 1990.

[10] Y. Lamdan and H. J. Wolfson. Geometric hashing: A general and efficient model-based recognition scheme. In Proc. International Conference on Computer Vision, pages 238-249, Tampa FL, December 1988.

[11] David G. Lowe. Three-dimensional object recognition from single two-dimensional images. Artificial Intelligence, 31:355-395, 1987.

[12] Bartlett Mel. Object classification with high-dimensional vectors. In Proc. Telluride Workshop on Neuromorphic Engineering, Telluride CO, July 1994.

[13] Hiroshi Murase and Shree K. Nayar. Learning and recognition of 3D objects from appearance. In Proc. IEEE Workshop on Qualitative Vision, pages 39-50, 1993.

[14] Randal C. Nelson. Finding line segments by stick growing. IEEE Trans. PAMI, 16(5):519-523, May 1994.

[15] Tomaso Poggio and Shimon Edelman. A network that learns to recognize three-dimensional objects. Nature, 343:263-266, 1990.

[16] Rajesh P. N. Rao. Top-down gaze targeting for space-variant active vision. In Proc. ARPA Image Understanding Workshop, pages 1049-1058, Monterey CA, November 1994.
[17] Ruud M. Bolle, R. Kjeldsen, and Daniel Sabbah. Primitive shape extraction from range data. In Proc. IEEE Workshop on Computer Vision, pages 324-326, Miami FL, Nov-Dec 1989.

[18] C. Schmid and R. Mohr. Combining greyvalue invariants with local constraints for object recognition. In Proc. CVPR96, pages 872-877, San Francisco CA, June 1996.

[19] F. Solina and Ruzena Bajcsy. Recovery of parametric models from range images. IEEE Trans. PAMI, 12:131-147, February 1990.

[20] Shimon Ullman and R. Basri. Recognition by linear combinations of models. IEEE Trans. PAMI, 13(10), 1991.