MORAL - A Vision-based Object Recognition System for Autonomous Mobile Systems*

Stefan Lanser, Christoph Zierl, Olaf Munkelt, Bernd Radig
Technische Universität München
Forschungsgruppe Bildverstehen (FG BV), Informatik IX
Orleansstr. 34, 81667 München, Germany

email: {lanser,zierl,munkelt,[email protected]

Abstract. One of the fundamental requirements for an autonomous mobile system (AMS) is the ability to navigate within an a priori known environment and to recognize task-specific objects, i.e., to identify these objects and to compute their 3D pose relative to the AMS. For the accomplishment of these tasks the AMS has to survey its environment by using appropriate sensors. This contribution presents the vision-based 3D object recognition system MORAL¹, which performs a model-based interpretation of single video images of a CCD camera. Using appropriate parameters, the system can be adapted dynamically to different tasks. The communication with the AMS is realized transparently using remote procedure calls. As a whole this architecture enables a high level of flexibility with regard to the used hardware (computer, camera) as well as to the objects to be recognized.

1 Introduction

In the context of autonomous mobile systems (AMS) the following tasks can be solved using a vision-based object recognition system: recognition of task-specific objects (e.g., doors, trashcans, or workpieces), localization of objects (estimation of the three-dimensional pose relative to the AMS) in order to support a manipulation task, and navigation in an a priori known environment (estimation of the three-dimensional pose of the AMS in the world). These tasks can be formalized by an interpretation

\[
I = \big\langle\, obj,\; \{(I_{j_1}, M_{i_1}), \ldots, (I_{j_k}, M_{i_k})\},\; (R, T) \,\big\rangle \tag{1}
\]

with obj the object hypothesis, (I_{j_l}, M_{i_l}) the correspondence between image feature I_{j_l} and model feature M_{i_l}, and (R, T) the estimated 3D pose of the object.
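Read as a data structure, an interpretation bundles exactly these three parts. The following C++ sketch is purely illustrative; none of the type or member names are taken from the MORAL sources:

#include <utility>
#include <vector>

// Illustrative transcription of Eq. (1); all names are assumptions.
struct Pose3D {
    double R[3][3];  // orientation of the object
    double T[3];     // position of the object
};

struct Interpretation {
    int obj;                                   // object hypothesis
    std::vector<std::pair<int, int>> matches;  // correspondences (I_jl, M_il)
    Pose3D pose;                               // estimated 3D pose (R, T)
};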

* This work was supported by Deutsche Forschungsgemeinschaft within the Sonderforschungsbereich 331, "Informationsverarbeitung in autonomen, mobilen Handhabungssystemen", project L9.
¹ Munich Object Recognition And Localization

Fig. 1. (a) Overview of MORAL: the AMS task control, a user interface (dynamic configuration), the CCD camera, the odometry, and the generalized environmental model are connected via RPC to the MORAL modules feature detection, calibration, recognition, localization, and model prediction. (b) A typical RPC sequence:

ok = moral_init(&ID)
ok = moral_load_param(ID,Configuration)
ok = moral_load_param(ID,CameraParameter)
ok = moral_load_object(ID,Object)
...
ok = moral_rec_object(ID,&Object,&Pose)
...
ok = moral_finish(ID)

The object recognition system MORAL presented in this contribution accomplishes these tasks by comparing the predicted model features with the features extracted from a video image [LMZ95]. The underlying 3D models are polyhedral approximations of the environment provided by a hierarchical world model. This contribution mainly deals with rigid objects; the extension of this work to the recognition of articulated objects, i.e., objects consisting of multiple rigid components connected by joints, is shown briefly in Sec. 3. Related work on the recognition task can be found in [IK88], [Gri90], [DPR92], and [Pop94]. The problem of the self-localization of an AMS is considered in [FHR+90], [CK94], and [KP95]. Common to both tasks are the use of a geometric model and the basic localization procedure, i.e., the determination of the 6 DOF pose of the AMS relative to the world or to an object, respectively.

2 System Architecture

The presented object recognition system (see Fig. 1 (a)) is implemented as an RPC server (remote procedure call), which can be called by any client process, especially by the AMS task control system. The standardized RPCs allow a hardware-independent use of the MORAL system; therefore MORAL can easily be applied on different platforms. By using the same mechanism, MORAL communicates (optionally) with other components, e.g., the generalized environmental model. With the help of specific RPCs MORAL can be dynamically configured, and can thus be flexibly adapted to modified tasks. The internal structure of MORAL essentially consists of five modules, which are implemented in ANSI-C and C++, respectively.
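To make the calling convention concrete, the C++ sketch below mimics the client side of the RPC sequence in Fig. 1 (b). Only the call names come from the figure; the signatures, the Pose layout, and the stub bodies are illustrative assumptions (in the real system the client stubs would be generated from the RPC specification):

#include <cstdio>

struct Pose { double rot[3]; double trans[3]; };  // assumed layout

// Stubs standing in for the generated RPC client code (assumed signatures).
static int moral_init(int* id) { *id = 1; return 1; }
static int moral_load_param(int id, const char* param) {
    std::printf("[%d] load parameter set %s\n", id, param); return 1;
}
static int moral_load_object(int id, const char* object) {
    std::printf("[%d] load object model %s\n", id, object); return 1;
}
static int moral_rec_object(int id, const char* object, Pose* pose) {
    std::printf("[%d] recognize %s\n", id, object);
    *pose = Pose{};  // a real call would return the estimated 3D pose
    return 1;
}
static int moral_finish(int id) { std::printf("[%d] finish\n", id); return 1; }

int main() {
    int id = 0;
    Pose pose{};
    int ok = moral_init(&id);
    ok = ok && moral_load_param(id, "Configuration");
    ok = ok && moral_load_param(id, "CameraParameter");
    ok = ok && moral_load_object(id, "TrashCan");
    ok = ok && moral_rec_object(id, "TrashCan", &pose);
    ok = ok && moral_finish(id);
    return ok ? 0 : 1;
}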

2.1 Calibration

In order to obtain the 3D object pose from the grabbed video image, the internal camera parameters (mapping the 3D world into pixels) as well as the external camera parameters (pose of the CCD camera relative to the manipulator or the vehicle) have to be determined with sufficient accuracy.
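For reference, one common formulation of such a camera model is a pinhole with a single radial distortion coefficient κ, in the spirit of [LT88]; the exact parameterization used by MORAL is not reproduced here:

\[
\begin{pmatrix} x \\ y \\ z \end{pmatrix} = R\, P_w + T, \qquad
\tilde u = f\,\frac{x}{z}, \quad \tilde v = f\,\frac{y}{z}, \qquad
u = \tilde u\,(1 + \kappa r^2), \quad v = \tilde v\,(1 + \kappa r^2),
\]

with r² = ũ² + ṽ², followed by the scaling to pixel coordinates; the internal parameters are then f, κ, the pixel scales, and the principal point.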

Fig. 2. (a) Estimation of the camera pose based on known relative movements of the robot manipulator; (b) each triangle of the tesselated Gaussian sphere defines a 2D view of an object.

Internal Camera Parameters. The proposed approach uses the model of a pinhole camera with radial distortions to map 3D points in the scene into 2D pixels of the video image [LT88]. It includes the internal parameters as well as the external parameters R, a matrix describing the orientation, and T, a vector describing the position of the camera in the world. In the first stage of the calibration process the internal camera parameters are computed by simultaneously evaluating images showing a 2D calibration table with N circular marks P_i taken from K different viewpoints. This multiview calibration [LZ95] minimizes the distances between the projected 3D midpoints of the marks and the corresponding 2D points in the video images. The 3D pose (R, T) of the camera is estimated during the minimization process. Thus, only the model of the calibration table itself has to be known a priori.

Hand-Eye Calibration. Once the internal camera parameters have been determined, the 3D pose of the camera relative to the tool center point is estimated in the second stage of the calibration process (hand-eye calibration). In the case of a camera mounted on the manipulator of a mobile robot, the 3D pose of the camera (R, T) is the composition of the pose of the robot (R_V, T_V), the relative pose of the manipulator (R_M, T_M), and the relative pose of the camera (R_C, T_C), see Fig. 2 (a). The unknown pose (R_C, T_C) is determined by performing controlled movements (R_M^k, T_M^k) of the manipulator similar to [Wan92]; for details see [LZ95]. Since the used 2D calibration table is mounted on the mobile robot itself, the manipulator can move to the different viewpoints for the multiview calibration automatically. Thus, the calibration can be accomplished in only a few minutes.
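Writing each pose (R, T) as the mapping x ↦ R x + T from local into parent coordinates, the chain of Fig. 2 (a) composes as follows (the explicit formula is our reconstruction; the paper states only that the poses compose):

\[
R = R_V\, R_M\, R_C, \qquad T = R_V\,(R_M\, T_C + T_M) + T_V .
\]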

Fig. 3. (a) Detected image features; (b-c) the two highest ranked hypotheses according to the image features in (a) and the object model in Fig. 2 (b).

2.2 Model Prediction

The recognition of rigid objects is based on a set of characteristic 2D views (multiview representation). Here, a set of 320 perspective 2D views is determined using the tesselated Gaussian sphere. These views contain the model features used for establishing the correspondence part (I_{j_l}, M_{i_l}) of I in Eq. (1). This model prediction is provided by the Generalized Environmental Model (GEM) [HS96] based on the boundary representation of a polyhedral 3D object. The tesselated Gaussian sphere and three 2D views of the object TrashCan are shown in Fig. 2 (b). Referring to Eq. (1), obj is not limited to a specific object; in the case of vision-based navigation it means the environment of the AMS. Instead of using a set of characteristic 2D views, the expected 2D model features are then provided by GEM according to the expected position.
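The number 320 is consistent with subdividing each of the 20 faces of an icosahedron twice (20 · 4² = 320 triangles). The C++ sketch below generates viewing directions this way; it is a plausible reconstruction of such a tesselation, not code from MORAL:

#include <array>
#include <cmath>
#include <cstdio>
#include <vector>

using Vec3 = std::array<double, 3>;

static Vec3 normalize(const Vec3& v) {
    const double n = std::sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
    return {v[0] / n, v[1] / n, v[2] / n};
}

// Split a spherical triangle into four; at depth 0 emit the normalized
// centroid as the viewing direction of one characteristic 2D view.
static void subdivide(const Vec3& a, const Vec3& b, const Vec3& c,
                      int depth, std::vector<Vec3>& views) {
    if (depth == 0) {
        views.push_back(normalize({a[0] + b[0] + c[0], a[1] + b[1] + c[1],
                                   a[2] + b[2] + c[2]}));
        return;
    }
    const Vec3 ab = normalize({a[0] + b[0], a[1] + b[1], a[2] + b[2]});
    const Vec3 bc = normalize({b[0] + c[0], b[1] + c[1], b[2] + c[2]});
    const Vec3 ca = normalize({c[0] + a[0], c[1] + a[1], c[2] + a[2]});
    subdivide(a, ab, ca, depth - 1, views);
    subdivide(ab, b, bc, depth - 1, views);
    subdivide(ca, bc, c, depth - 1, views);
    subdivide(ab, bc, ca, depth - 1, views);
}

int main() {
    const double p = (1.0 + std::sqrt(5.0)) / 2.0;  // golden ratio
    std::vector<Vec3> v = {{-1, p, 0}, {1, p, 0}, {-1, -p, 0}, {1, -p, 0},
                           {0, -1, p}, {0, 1, p}, {0, -1, -p}, {0, 1, -p},
                           {p, 0, -1}, {p, 0, 1}, {-p, 0, -1}, {-p, 0, 1}};
    for (auto& x : v) x = normalize(x);
    const int faces[20][3] = {
        {0, 11, 5}, {0, 5, 1},  {0, 1, 7},   {0, 7, 10}, {0, 10, 11},
        {1, 5, 9},  {5, 11, 4}, {11, 10, 2}, {10, 7, 6}, {7, 1, 8},
        {3, 9, 4},  {3, 4, 2},  {3, 2, 6},   {3, 6, 8},  {3, 8, 9},
        {4, 9, 5},  {2, 4, 11}, {6, 2, 10},  {8, 6, 7},  {9, 8, 1}};
    std::vector<Vec3> views;
    for (const auto& t : faces)  // two subdivision levels: 20 * 4^2 = 320
        subdivide(v[t[0]], v[t[1]], v[t[2]], 2, views);
    std::printf("%zu viewing directions\n", views.size());
    return 0;
}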

2.3 Feature Detection

The detection of image features is based on the image analysis system HORUS [ES97]. HORUS provides a large number of different image operators which are controlled by MORAL according to the performed task. Object-specific knowledge is used to control the image segmentation process. This knowledge comprises different feature types, e.g., lines, faces, and arcs, and attributes like the color of the foreground or background (Fig. 3 (a)). The feature detection module can be parameterized either by the modules of MORAL itself or by the clients which call services of MORAL via the RPC mechanism, e.g., the AMS task control or the user interface.

2.4 Object Recognition

The aim of the object recognition module is to identify objects and to determine their rough 3D pose by searching for the appropriate 2D model view matching the image. This is done by establishing correspondences between image features extracted from the CCD image and 2D model features of an object. Obviously, if a rough 3D pose is a priori known, the recognition module is "bypassed" and the localization module is called directly (e.g., for continuous self-localization).

Building Associations. The first step in the object recognition process is to build a set of associations. An association is defined as a quadruple (I_j, M_i, v, c_a), where I_j is an image feature, M_i is a model feature, v is one of the characteristic 2D model views of an object, and c_a is a confidence value of the correspondence between I_j and M_i. This value is obtained by traversing aspect-trees [Mun95] or by a geometrical comparison of the features incorporating constraints based on the topological relations Vertex, OvlPar, and Cont [LZ96].

Building Hypotheses. In order to select the "correct" view of an object, the associations are used to build hypotheses {(obj, A_i, v_i, c_i)}. For each 2D view v_i all corresponding associations with sufficient confidence are considered. From this set of associations the subset A_i with the highest rating forming a consistent labeling of image features is selected. That means that the geometric transformations in the image plane between model and image features in A_i are similar, i.e., they form a cluster in the transformation space. Furthermore, the image features have to fulfil the same topological constraints as their corresponding model features [LZ96]. The confidence value c_i depends on the confidence values of the included associations and the percentage of mapped model features.

The result of the described recognition process is a ranked list of possible hypotheses (see Fig. 3 (b-c)) which are verified and refined by the localization module. A view in a hypothesis determines one translational and two rotational degrees of freedom of the object pose. Additionally, while building the hypotheses a scale factor and a translational vector in the image plane are computed. Using this weak perspective, a rough 3D position of the object is derived. Note that some of the hypotheses from the object recognition module may be "incorrect"; these hypotheses are rejected by the subsequent verification and refinement procedure performed by the localization module.
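A schematic of this bookkeeping in C++ follows. It is our simplification, not MORAL's actual code: the clustering in transformation space is reduced to a coarse grid over a 2D similarity transform, and the topological checks are omitted:

#include <algorithm>
#include <map>
#include <tuple>
#include <utility>
#include <vector>

// One association (I_j, M_i, v, c_a) plus the 2D similarity transform
// (scale s, rotation phi, shift tx, ty) mapping model onto image feature.
struct Association {
    int image_feature, model_feature, view;
    double confidence;
    double s, phi, tx, ty;
};

struct Hypothesis {
    int view = 0;
    double confidence = 0.0;             // c_i of the hypothesis
    std::vector<Association> support;    // consistent subset A_i
};

// Build hypotheses per view: bin the transforms on a coarse grid and treat
// each well-supported bin as a consistent labeling (a stand-in for the
// clustering in transformation space described in the text).
std::vector<Hypothesis> build_hypotheses(const std::vector<Association>& assocs) {
    std::map<std::tuple<int, int, int, int, int>, Hypothesis> bins;
    for (const auto& a : assocs) {
        const auto key = std::make_tuple(a.view, int(a.s * 10), int(a.phi * 10),
                                         int(a.tx / 20), int(a.ty / 20));
        Hypothesis& h = bins[key];
        h.view = a.view;
        h.confidence += a.confidence;    // crude rating: summed confidences
        h.support.push_back(a);
    }
    std::vector<Hypothesis> result;
    for (auto& kv : bins) result.push_back(std::move(kv.second));
    std::sort(result.begin(), result.end(),
              [](const Hypothesis& x, const Hypothesis& y) {
                  return x.confidence > y.confidence;  // ranked list
              });
    return result;
}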

2.5 Localization

The aim of the localization module is the determination of all 6 DOF of the CCD camera relative to an object or to the environment. The input for this module is a rough 3D pose and a set of 3D model lines. This input can be either a model of the environment (provided by GEM) or of a specific object (the result of the previous object recognition). Based on known uncertainties of the pose and/or the position of the model lines, specific search spaces for the extraction of image lines are computed [LL95]. Final correspondences are established using an interpretation-tree approach [Gri90]. The traversal of the tree is dynamically adapted according to topological constraints of the currently involved correspondences. During this process model and image lines are aligned, and thereby the input pose is refined. Similar to [Low91], this is performed by a weighted least squares technique minimizing

\[
\sum_{k=1}^{m} e\big(M_{i_k}, I_{j_k}; (R, T)\big)^2 \tag{2}
\]

Fig. 4. AMS used for the experiments.

using an appropriate error function e. For example, if e measures the 3D distance between a 3D model line M_{i_k} = ⟨M_{i_k}^1, M_{i_k}^2⟩ and the plane which is spanned by the corresponding image line I_{j_k} = ⟨I_{j_k}^1, I_{j_k}^2⟩ and the focus of the camera (similarly to [KH94]), Eq. (2) transforms to

\[
\sum_{k=1}^{m} \sum_{l=1}^{2} \Big( \hat n_{j_k}^{T} \cdot \big( R \cdot M_{i_k}^{l} - T \big) \Big)^{2},
\qquad
\hat n_{j_k} = \frac{n_{j_k}}{\|n_{j_k}\|}, \qquad
n_{j_k} = \begin{pmatrix} I_{j_k}^{1} \\ f \end{pmatrix} \times \begin{pmatrix} I_{j_k}^{2} \\ f \end{pmatrix},
\]

with f the focal length of the used CCD camera. If only coplanar features are visible, which are seen from a large distance compared to the size of the object, the 6 DOF estimation is quite unstable because some of the pose parameters are highly correlated. In this case a priori knowledge of the orientation of the camera with respect to the ground plane of the object might be used to determine two angular degrees of freedom. Naturally, this approach decreases the flexibility of the system: tilted objects can no longer be handled. A more flexible solution is the use of a second image of the scene taken from a different viewpoint with known relative movement of the camera (motion stereo). By simultaneously aligning the model to both images, the flat minimum of the 6 DOF estimation can be avoided. Note that for well-structured objects with some non-coplanar model features, a 6 DOF estimation based on a single video image yields good results as well.
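A direct C++ transcription of this error term follows, assuming the convention reconstructed in Eq. (2), where a model point maps into camera coordinates as R · M − T; the weighting and the minimization itself are omitted:

#include <array>
#include <cmath>

using Vec3 = std::array<double, 3>;
using Mat3 = std::array<Vec3, 3>;

static Vec3 cross(const Vec3& a, const Vec3& b) {
    return {a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]};
}
static double dot(const Vec3& a, const Vec3& b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}
static Vec3 transform(const Mat3& R, const Vec3& x, const Vec3& T) {
    // pose convention assumed from Eq. (2): x_cam = R * x - T
    return {dot(R[0], x) - T[0], dot(R[1], x) - T[1], dot(R[2], x) - T[2]};
}

// Squared line-to-plane error for one correspondence: the image line
// endpoints (u, v) are lifted to rays (u, v, f), their cross product is the
// normal of the interpretation plane, and both transformed model endpoints
// are measured against that plane (cf. Eq. (2)).
double line_error(const Vec3& M1, const Vec3& M2,             // 3D model line
                  double u1, double v1, double u2, double v2,  // image line
                  const Mat3& R, const Vec3& T, double f) {
    const Vec3 n = cross({u1, v1, f}, {u2, v2, f});
    const double len = std::sqrt(dot(n, n));
    const Vec3 nhat = {n[0] / len, n[1] / len, n[2] / len};
    const double e1 = dot(nhat, transform(R, M1, T));
    const double e2 = dot(nhat, transform(R, M2, T));
    return e1 * e1 + e2 * e2;
}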

3 Applications

In the following, several applications of MORAL in the context of autonomous mobile systems within the joint research project SFB 331 are described. The easy integration of MORAL into these different platforms (see Fig. 4) with different cameras² and tasks shows the flexibility of our system. A typical sequence of RPC calls to MORAL is outlined in Fig. 1 (b).

² Note that in the AMS shown on the right the camera is mounted at the robot hand.

Fig. 5. Vision-based navigation, model-specific search spaces, and determination of the rotational joint state of the object Door.

Fig. 6. Examples of the vision-based 3D object recognition with MORAL.

Fig. 7. Recognition of an articulated object: guided by the hierarchical structure of the object model, the unknown joint configuration of the drawer cabinet is determined recursively to enable a robot to open the upper drawer.

Fig. 8. The pose estimation of workpieces enables grasping by a mobile robot.

An example of the vision-based navigation of an AMS in a known environment with MORAL is shown in Fig. 5. Using the input pose (given by the odometry of the AMS), the predicted 3D model features, and the image grabbed from the CCD camera, MORAL determines the exact pose of the AMS relative to the world. Thereby, the uncertainty of the AMS pose leads to model-specific search spaces; the second image in Fig. 5 shows an example for three selected model lines with a pose uncertainty of 10° for the roll angle. On a standard SPARC 10 workstation the whole navigation task (image preprocessing and pose estimation) takes less than 2 seconds.

Some results for the object recognition are presented in Fig. 6: On the left side the refinement of the rough 3D pose in Fig. 3 (b-c) is shown. This refinement is the result of the 3D localization module described in Sec. 2.5. The other images in Fig. 6 show the 3D object recognition of the objects ToolCarriage and DrawerCabinet. All of these tasks are performed in approx. 3 to 10 seconds on a SPARC 10, depending on the number of image and model features.

Articulated objects are handled using a hierarchical representation and recognition scheme [HLZ97]. The object model consists of rigid 3D subcomponents linked by joints. Each joint state within the joint configuration of an object indicates the pose of a subcomponent relative to its parent component. On the right side of Fig. 5 the result of the identification of a rotational joint state is shown: after recognizing the door-frame, the aim of this task is to determine the opening angle (rotational joint state) of the door-wing. Using this result, the AMS can determine whether it can pass the door or has to open it. A further example of the recognition of an articulated object is shown in Fig. 7: after recognizing the static component of the object DrawerCabinet, the translational joint states corresponding to the two drawers are computed following the object hierarchy.

A further application of MORAL, the grasping of a workpiece by a mobile robot, is presented in Fig. 8. Since the exact pose of the object is a priori unknown, the robot guiding system calls MORAL to determine the object pose. Using the CCD camera mounted in the gripper exchange system, MORAL computes the requested pose. In this case the motion stereo approach described in Sec. 2.5 is used to simultaneously align the model to two images.

4 Conclusion

This paper presented the object recognition system MORAL, which is suitable for handling vision-based tasks in the context of autonomous mobile systems. Based on a priori known 3D models, MORAL recognizes objects in single video images and determines their 3D pose. Alternatively, the 3D pose of the camera relative to the environment can be computed, too. The flexibility of the approach has been demonstrated by several online experiments on different platforms. Future work will focus on the following topics: using an indexing scheme, the object recognition module should be enabled to handle a larger model database; in addition to the use of CAD models, the 3D structure of objects should be reconstructed from images taken from different viewpoints; finally, curved lines have to be integrated as new features, both during the model prediction and the feature detection.

References

[CK94] H. Christensen and N. Kirkeby. Model-driven vision for in-door navigation. Robotics and Autonomous Systems, 12:199-207, 1994.
[DPR92] S. Dickinson, A. Pentland, and A. Rosenfeld. From Volumes to Views: An Approach to 3-D Object Recognition. CVGIP: Image Understanding, 55(2):130-154, March 1992.
[ES97] W. Eckstein and C. Steger. Architecture for Computer Vision Application Development within the HORUS System. Journal of Electronic Imaging, 6(2):244-261, April 1997.
[FHR+90] C. Fennema, A. Hanson, E. Riseman, J. R. Beveridge, and R. Kumar. Model-Directed Mobile Robot Navigation. IEEE Trans. on Systems, Man, and Cybernetics, 20(6):1352-1369, November 1990.
[Gri90] W. E. L. Grimson. Machine Vision for Three Dimensional Scenes, chapter Object Recognition by Constrained Search, pages 73-108, 1990.
[HLZ97] A. Hauck, S. Lanser, and C. Zierl. Hierarchical Recognition of Articulated Objects from Single Perspective Views. In Computer Vision and Pattern Recognition. IEEE Computer Society Press, 1997. To appear.
[HS96] A. Hauck and N. O. Stöffler. A Hierarchical World Model with Sensor- and Task-Specific Features. In International Conference on Intelligent Robots and Systems, pages 1614-1621, 1996.
[IK88] K. Ikeuchi and T. Kanade. Automatic Generation of Object Recognition Programs. Proceedings of the IEEE, 76(8):1016-1035, August 1988.
[KH94] R. Kumar and A. R. Hanson. Robust Methods for Pose Determination. In NSF/ARPA Workshop on Performance versus Methodology in Computer Vision, pages 41-57. University of Washington, Seattle, 1994.
[KP95] A. Kosaka and J. Pan. Purdue Experiments in Model-Based Vision for Hallway Navigation. In Workshop on Vision for Robots in IROS'95, pages 87-96. IEEE Press, 1995.
[LL95] S. Lanser and T. Lengauer. On the Selection of Candidates for Point and Line Correspondences. In International Symposium on Computer Vision, pages 157-162. IEEE Computer Society Press, 1995.
[LMZ95] S. Lanser, O. Munkelt, and C. Zierl. Robust Video-based Object Recognition using CAD Models. In IAS-4, pages 529-536. IOS Press, 1995.
[Low91] D. G. Lowe. Fitting Parameterized Three-Dimensional Models to Images. IEEE Trans. on PAMI, 13(5):441-450, 1991.
[LT88] R. Lenz and R. Y. Tsai. Techniques for Calibration of the Scale Factor and Image Center for High Accuracy 3D Machine Vision Metrology. IEEE Trans. on Pattern Analysis and Machine Intelligence, 10(5):713-720, 1988.
[LZ95] S. Lanser and C. Zierl. Robuste Kalibrierung von CCD-Sensoren für autonome, mobile Systeme. In Autonome Mobile Systeme, Informatik aktuell, pages 172-181. Springer-Verlag, 1995.
[LZ96] S. Lanser and C. Zierl. On the Use of Topological Constraints within Object Recognition Tasks. In 13th ICPR, volume 1, pages 580-584, 1996.
[Mun95] O. Munkelt. Aspect-Trees: Generation and Interpretation. Computer Vision and Image Understanding, 61(3):365-386, May 1995.
[Pop94] A. R. Pope. Model-Based Object Recognition. Technical Report TR-94-04, University of British Columbia, January 1994.
[Wan92] C. C. Wang. Extrinsic Calibration of a Vision Sensor Mounted on a Robot. IEEE Transactions on Robotics and Automation, 8(2):161-175, April 1992.
