Proc. CVPR '93

Knowledge-Based Image Understanding Using Incomplete and Generic Models

Ellen L. Walker
Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY 12180
[email protected]

Abstract†

Many applications of computer vision require interpreting incomplete or noisy image data in environments where complete models are impractical. This paper presents a unified representation for image understanding that includes generic models of both objects and sensors. This representation is ideal for combining knowledge from multiple sources, taking advantage of available information without knowing in advance which information will be provided.

† This research was supported by NSF Grant IRI-9011631.

1. Introduction

For computer vision to be a viable part of a household robot or an outdoor explorer, images from an unconstrained and incompletely modeled environment must be interpreted. Image data, often noisy or incomplete, must be augmented with additional constraints before the environment can be understood. Two sources of such constraints are models of the image formation process and models of the objects in the scene. When the environment is unconstrained and therefore cannot be completely modeled, it is difficult to determine in advance which knowledge will be available, so a flexible reasoning system that can combine constraints from different sources is desirable.

2. Knowledge representation for image understanding

Knowledge representation is a key issue in knowledge-based image understanding. Along with the images and their extracted features, the information needed for interpretation includes the current scene description, as well as a priori knowledge about the sensors and scene that can be used in reconstruction. This knowledge can be expressed by either exact or generic models. Exact object models and sensor models have been compiled into decision trees [1] for object recognition. Exact sensor models are also implicitly included in algorithms for sensor-specific tasks such as camera calibration. However, in an unstructured environment, exact scene information cannot be obtained in advance, so generic models [2, 3] that describe features and constraints common to an entire class of objects are required. This paper describes generic models of objects, which include geometric relationships among the objects' parts, and introduces generic models of sensors¹, which include geometric relationships among image features, scene objects, and the sensor geometry. The generic sensor models are represented identically to generic object models for easy combination using previously developed reasoning techniques [4].

¹ The term "sensor" in this paper refers not only to physical measuring devices, but also to segmentation algorithms. This definition is consistent with Ikeuchi's treatment of an edge detector as a sensor.

3. Representing images within 3D FORM

All models are represented within the Three-Dimensional Frame-based Object Recognition and Modeling (3D FORM) system [4, 5]. The 3D FORM system represents generic object models and specific object instances using frames (see [4] for a more complete description). It also includes geometric reasoning to generate hypotheses to complete objects, to select an appropriate generic model to describe an object instance, and to match partial objects from different sources. These capabilities are called computation, specialization, and matching, respectively. Further description of this reasoning and an extended example are presented in [5].

An example of a generic sensor model is the representation of a 2-D image, shown in Figure 1. The model includes the camera location, the image plane, and 2-D features from image segmentation, such as vertices. Each 2-D feature is represented both in image-plane coordinates (2D-PTS) and in world coordinates (IMG-PTS). A correspondence frame links each image-plane feature, its world-coordinate representation, and the object it depicts. Finally, all images of the same scene are linked to a common world frame. Each frame includes geometric relationships among the objects in its slots.
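The paper gives no code for 3D FORM, but a minimal Python sketch can make the frame-and-slot organization concrete. Everything here (the Frame class, the propagate method) is an illustrative assumption rather than the actual 3D FORM interface; only the slot names are taken from Figure 1:

```python
# Illustrative sketch only -- not the actual 3D FORM code. A frame holds
# named slots, an IS-A link to its generic model, and constraints that
# relate (and can fill in) slot values.

class Frame:
    def __init__(self, name, isa=None):
        self.name = name          # e.g. "image", "vertex", "corresp"
        self.isa = isa            # generic model this frame specializes
        self.slots = {}           # slot name -> value (often another Frame)
        self.constraints = []     # callables that check/complete slots

    def propagate(self):
        """Apply every attached constraint; each may fill missing slots."""
        for constraint in self.constraints:
            constraint(self)

# A generic IMAGE frame with the slots shown in Figure 1.
image = Frame("image")
for slot in ("observer", "img-plane", "X-scale", "Y-scale",
             "unit-X", "unit-Y", "center", "img-pts", "2d-pts"):
    image.slots[slot] = None
```

Computation, specialization, and matching would then operate over networks of such frames.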

Figure 1: Part of intensity image model. [Frame network diagram not reproduced. Recoverable labels: a world frame (basic-fusion-prob) with object and images (list) slots; an IMAGE frame with slots observer, img-plane, X-scale, Y-scale, unit-X, unit-Y, center, img-pts (list), and 2d-pts (list); plane, vertex, and 2dvertex frames; and corresp frames with w-obj, i-obj, and 2d-obj slots; frames connected by IS-A and INST links.]
For example, the IMAGE frame includes the constraint (not shown in the figure) that "every point in IMG-PTS is collinear with its corresponding world point and the camera." To interpret the image(s), a network of frames representing the scene is created, and the known information is filled in. Using the relationships encoded in the models, incomplete objects are completed and new objects are hypothesized. After all relationships have been applied, the interpretation is the resulting object model network. For example, if the only available information is a collection of 2-D points from one image with known camera parameters, and there is no prior knowledge about the scene, then the resulting interpretation is a collection of point frames, each constrained to lie on a line connecting the camera and the image point. These frames represent the locus of points that could have formed the image. Given a second image of the same points and the correspondences, the locus of possibilities is further constrained to a set of fully specified 3-D points.
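As a concrete illustration of this locus constraint, here is a minimal Python sketch (not the paper's implementation) that back-projects one image point into its locus ray, assuming the camera center and the image point's world-coordinate position (its IMG-PTS entry) are both known:

```python
import numpy as np

def back_project(camera_center, img_pt_world):
    """Locus ray for one image point: every world point that could have
    produced it lies on camera_center + t * direction, t > 0."""
    origin = np.asarray(camera_center, dtype=float)
    direction = np.asarray(img_pt_world, dtype=float) - origin
    return origin, direction / np.linalg.norm(direction)

# One hypothesized point frame: camera at the origin, image point whose
# world-coordinate position (IMG-PTS entry) is (0.1, 0.2, 1.0).
origin, direction = back_project([0.0, 0.0, 0.0], [0.1, 0.2, 1.0])
```

With a second camera, the same construction yields a second ray, and intersecting the rays pins down the 3-D point, as the next section exploits.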

4. Knowledge-based stereo matching

Typically, object recognition and stereo are performed sequentially: stereo processing yields a 3-D description that is then used for model-based recognition. As an example application of our representation, this section describes how stereo correspondence can instead be determined in conjunction with object recognition. The input data is a set of unoccluded 2-D vertices from two calibrated views of a rectangular prism. The system is given the generic class of the target object (a rectangular prism), but no metric information, and only one point correspondence. First, the two image frames are independently computed. The world coordinates for each image point are determined from the given camera information, and a world point is hypothesized on the line containing each image point and its camera position. For the matched point, two different world points, one from each camera, are matched, yielding its 3-D location at the intersection of the two specified rays. For each of the other image points, two independent world point hypotheses remain.
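The paper does not spell out how the ray intersection is computed. A common choice, assumed in this sketch, is the midpoint of the shortest segment between the two back-projected rays, which tolerates the small amount of noise that keeps real rays from meeting exactly:

```python
import numpy as np

def intersect_rays(o1, d1, o2, d2):
    """Midpoint of the shortest segment between rays o1 + s*d1 and
    o2 + t*d2; returns None for (near-)parallel rays."""
    o1, d1, o2, d2 = (np.asarray(v, dtype=float) for v in (o1, d1, o2, d2))
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    d, e = d1 @ (o1 - o2), d2 @ (o1 - o2)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        return None
    s = (b * e - c * d) / denom        # parameter along the first ray
    t = (a * e - b * d) / denom        # parameter along the second ray
    return ((o1 + s * d1) + (o2 + t * d2)) / 2.0
```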


Next, correspondences must be found among the remaining world point hypotheses using only the knowledge that the resulting points must be vertices of a rectangular prism. By collecting the points from each view into a single object and specializing it to a rectangular prism, additional hypotheses are created for occluded points, and all points are constrained as much as possible. The new hypotheses ensure that every point in one image has a correspondent in the other, and the additional constraints force incorrect matches to fail faster. By matching higher-level parts of the hypothesized prisms, such as edges and vertices, the correct correspondences are found even more quickly. After matching, the result will be a single rectangular prism, including its position, orientation, and the lengths of its sides. The same specialization and matching method would work equally well if the initial information were different, for example if some metric information for the prism were provided.
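The full specialization and matching machinery is described in [4, 5] and is not reproduced here; the following Python fragment is only a rough sketch of the pruning idea, enumerating assignments of the two world-point hypotheses per vertex and keeping those consistent with one illustrative rectangular-prism constraint (edges meeting at a corner are orthogonal). The helper names and the tolerance are assumptions:

```python
import itertools
import numpy as np

def right_angle(p, q, r, tol=1e-3):
    """True if edges p->q and p->r are (nearly) orthogonal."""
    u = np.asarray(q, dtype=float) - np.asarray(p, dtype=float)
    v = np.asarray(r, dtype=float) - np.asarray(p, dtype=float)
    return abs(u @ v) < tol * np.linalg.norm(u) * np.linalg.norm(v)

def consistent_assignments(candidates, corner_triples):
    """candidates: one (hypothesis_a, hypothesis_b) pair of world points
    per image vertex. corner_triples: index triples (p, q, r) whose edges
    must meet at right angles in a rectangular prism."""
    for choice in itertools.product(*candidates):
        if all(right_angle(choice[p], choice[q], choice[r])
               for p, q, r in corner_triples):
            yield choice
```

In the actual system, such constraints come from the generic prism model's frames rather than a hand-coded check, and matching higher-level parts prunes the search further.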

5. Conclusion

This paper presented a knowledge representation for image understanding, including generic models for both objects and sensors. Image understanding tasks such as model-based stereo fusion are performed by applying computation, specialization, and matching to this representation. Although specialized techniques for a given task are more efficient than the generalized method described here, the general formulation allows unforeseen combinations of a priori knowledge and images.

References

1. Ikeuchi, K. and Kanade, T., "Automatic Generation of Object Recognition Programs," Proceedings of the IEEE, vol. 76, no. 8, pp. 1016-1035, Aug. 1988.
2. Nguyen, V.D., Mundy, J.L., and Kapur, D., "Modeling Generic Polyhedral Objects with Constraints," in Proceedings IEEE Conference on Computer Vision and Pattern Recognition, pp. 479-485, IEEE Computer Society, June 1991.
3. Stansfield, S.A., "Representing Generic Objects for Exploration and Recognition," in Proceedings 1988 International Conference on Robotics and Automation, pp. 1090-1095, IEEE Computer Society, 1988.
4. Walker, E.L., Herman, M., and Kanade, T., "A Framework for Representing and Reasoning about Three-dimensional Objects for Vision," AI Magazine, vol. 9, no. 2, pp. 47-58, Summer 1988.
5. Walker, E.L., "Exploiting Geometric Relationships for Object Modeling and Recognition," in Intelligent Robots and Computer Vision IX: Neural, Biological, and 3-D Methods, vol. 1382, pp. 353-363, SPIE, Nov. 1990.
