Autonomous Robots 7, 143–158 (1999)
© 1999 Kluwer Academic Publishers. Manufactured in The Netherlands.

Visual Recognition of Workspace Landmarks for Topological Navigation

PANOS E. TRAHANIAS, SAVVAS VELISSARIS AND STELIOS C. ORPHANOUDAKIS
Institute of Computer Science, Foundation for Research and Technology—Hellas (FORTH), P.O. Box 1385, Heraklion, 711 10 Crete, Greece
Department of Computer Science, University of Crete, P.O. Box 1470, Heraklion, 714 09 Crete, Greece
[email protected]
[email protected]
[email protected]

Abstract. In this work, robot navigation is approached using visual landmarks. Landmarks are not preselected or otherwise defined a priori; they are extracted automatically during a learning phase. To facilitate this, a saliency map is constructed, on the basis of which potential landmarks are highlighted. This is used in conjunction with a model-driven segregation of the workspace to further delineate search areas for landmarks in the environment. For the sake of robustness, no semantic information is attached to the landmarks; they are stored as raw patterns, along with information readily available from the workspace segregation. This subsequently facilitates their accurate recognition during a navigation session, when similar steps are employed to locate landmarks, as in the learning phase. The stored information is used to transform a previously learned landmark pattern according to the current position of the observer, thus achieving accurate landmark recognition. Results obtained using this approach demonstrate its validity and applicability in indoor workspaces.

Keywords: topological navigation, workspace landmarks, view invariant landmark recognition, model-driven scene segregation, saliency map

1. Introduction

Space perception is a basic cognitive capability of biological systems. The enhancement of robotic platforms with analogous capabilities is a challenging research goal, the achievement of which will constitute a major step towards autonomous robots. This paper studies the problem of space perception using visual landmarks (Shah and Aggarwal, 1995; Lazanas and Latombe, 1992; Yeh and Kriegman, 1995). Automated space perception is of significant practical importance in any attempt at providing robotic platforms with autonomous navigation capabilities. With current techniques of space representation, which are mainly based on accurate measurements and knowledge of the environment, autonomous navigation is only possible in known environments. In such cases, effective approaches have been presented for

model-based directed sensing, localization and navigation (Durrant-Whyte and Leonard, 1989; Leonard and Durrant-Whyte, 1991). However, when the environment is unknown or changing, some sort of cognitive space representation is needed to facilitate localization and the determination of the next appropriate motion(s) to reach a navigation target. Dynamic map building has been proposed and studied to model such environments (Leonard et al., 1990; Cox and Leonard, 1994). Navigation in this case is usually performed by observing the environment, and using the information from these observations to improve our estimate of the robot’s position (Leonard and Durrant-Whyte, 1991; Kosaka and Pan, 1995). An alternative approach in space representation for navigation uses visual landmarks, i.e. distinctive visual events in the workspace, which seem to offer a promising tool in this endeavor (Levitt and Lawton, 1990).


Landmarks are also routinely used by biological systems as reference points during navigation. Their employment in robotic navigation requires that the problem of landmark selection and recognition be tackled satisfactorily. This problem has been investigated for more than a decade; still, due to inherent difficulties in the general case, a direct approach to the solution of this problem is usually avoided, by assuming the availability of a geometric representation of space, which facilitates robot motion planning (Hwang and Ahuja, 1992; Latombe, 1991). In a few cases, space perception using visual landmarks has been based on simple landmark patterns. This is due to certain difficulties encountered when realistic landmarks are used (Shah and Aggarwal, 1995). A brief review of representative approaches follows.

• Predesigned landmarks are used for simplicity, constituting one of the most primitive approaches. In this case, the environment is enriched with distinguishable patterns, such as simple geometrical patterns (Lazanas and Latombe, 1992; Magee and Aggarwal, 1995), bar codes (Taylor and Kriegman, 1994), etc. Clearly, such approaches are inflexible and are not particularly useful in unknown or in dynamically changing environments.
• Selected landmarks. When the robot workspace is known in advance, it may be possible to manually select objects that will constitute the space landmarks (Nasr and Bhanu, 1988; Bouguet and Perona, 1995). This approach is inherently confined to known environments. Furthermore, it uses human-oriented criteria for landmark extraction, which are not necessarily the most appropriate for automated processes.
• Straight lines may be employed to describe the workspace (Sugihara, 1988; Nelson, 1988; Yeh and Kriegman, 1995; Kosaka and Pan, 1995). Although some promising results have been reported with this approach, it cannot be accepted for the description of general environments due to the uncertainties involved in such a description.
• Panoramic views focus on long distance navigation, using a visual memory of the route covered in the learning phase (Zheng and Tsuji, 1989). Segments from this representation are selected as landmarks, according to a measure of their "distinctiveness" (Zheng et al., 1991).
• Entire images. While mechanisms for driving visual attention—at least in nature—usually select

distinctive objects in a scene as visual landmarks, in some cases researchers have used entire images as iconic landmarks by storing in memory the images collected during a training phase (Cassinis et al., 1996; Nielsen and Sandini, 1994). In addition to the above, attempts have been reported to develop “selection functions” that, given a set of visible landmarks, return a subset that can usually be found correctly (Greiner and Isukapalli, 1996). In this case, however, the initial set of landmarks is still determined at specification time. Thus, it is fair to say that progress to-date in landmark-based space perception and autonomous navigation has been rather limited. This is primarily due to difficulties inherent in the process of reliably recognizing segments of realistic workspaces. Moreover, in most cases, one is faced with uncertainties due to sensor inaccuracies, dynamic changes in the environment and imperfect robot control. Even in cases where landmark recognition is possible, the implementation of corresponding methods on real robotic systems may not be feasible due to heavy computational requirements imposed by the need for exhaustive pattern search in consecutive image frames. In this paper, we address the problem of space perception using visual landmarks, by employing an approach for selective landmark extraction and recognition. Therefore, an exhaustive pattern search is avoided and recognition is confined to a few, salient image areas. By employing qualitative features for the detection of these areas, our approach is robust with respect to measurement inaccuracies. Furthermore, this focus of attention mechanism effectively bypasses the task of image segmentation for object delineation, thus resulting in computational savings and an improvement in the robustness of the method. Information obtained using this approach is in turn used to facilitate the automatic extraction of landmarks. All approaches to robot navigation require certain assumptions to be made about the workspace (Shah and Aggarwal, 1995). Otherwise, the set of landmarks that the system is required to learn and use is prohibitively large and complex. In our approach, it is assumed that we are dealing with indoor environments, and the following two phases are considered: (a) extraction (learning) of landmarks, and (b) recognition of landmarks during navigation. Qualitative, a priori knowledge of the robot workspaces allows us to combine the bottom-up generation of attention cues (saliency map) with a top-down,


model-driven search for landmark patterns. This further allows effective landmark recognition, since it facilitates landmark transformation—and therefore correct matching—according to the current robot position. In what follows, Section 2 presents our approach to landmark extraction and Section 3 focuses on landmark recognition, putting emphasis on the pattern transformation proposed. Section 4 presents experimental results that demonstrate our approach and Section 5 concludes the paper with a brief discussion.

2. Landmark Extraction

In this phase, landmark patterns and associated information regarding their topological relationships (topological, non-metric map) are extracted and stored. Here we focus on the issue of landmark extraction; topological maps have been treated in work by others (Taylor and Kriegman, 1994; Kosaka and Pan, 1995). In order to extract landmark patterns that would facilitate their own robust recognition, we employ a focus of attention technique on salient image areas. This complies fully with the modern theories of active and purposive vision (Aloimonos, 1990; Pahlavan et al., 1993), since we avoid an exhaustive search and recognition of all image patterns, but rather attempt to purposively process the image data. Such focus of attention techniques have also been used by other authors (Madsen et al., 1997; Milanese, 1993) for landmark selection. In our approach, however, we combine the focus of attention process (bottom-up, data-driven mechanism) with a top-down, model-driven workspace segregation, which amounts to the extraction of the major scene

areas (walls, floor, ceiling, etc.) from the image data. As a result, the search for landmark patterns is confined to "meaningful" areas, and the selection of spurious patterns is avoided.

Figure 1. (a) Expected 2D image structure of a 3D workspace scene, (b) measurable quantities on the image structure.

2.1. Model-Driven Workspace Segregation

In addition to the assumption of indoor environments, we consider that the robot workspaces are orthogonal parallelepipeds. This implies that the expected 2D image structure of a 3D workspace scene, from the point of view of the autonomous robot facing towards one of the workspace sides, would have the form shown in Fig. 1(a). By means of "model-driven workspace segregation" we denote the extraction of this structure from the image data.

The structure of Fig. 1(a) partitions the image space into five distinct areas. Although we avoid labeling these areas, they represent distinct areas in an indoor workspace: side walls, ceiling, floor, far end. This is exploited in our approach by confining the search for landmarks to the "wall" regions. We completely avoid detecting landmarks in the ceiling, floor and far end regions, since they represent "unreliable" areas for the detection of potential landmarks (e.g. an object may not be present at all times on the floor, the far end region is viewed with low resolution, etc.). The ceiling is included in the unreliable areas in our approach due to the lights, which are not always in the same state (on or off), thus complicating their own recognition, as well as the recognition of nearby objects. The exclusion of the above areas prevents us from identifying randomly existing objects as landmarks (Kuhnert, 1990), while objects on the walls tend to be permanent and,


therefore, appropriate as workspace landmarks. Moreover, the above image partitioning yields additional information that is used in the recognition phase. This will become apparent later in the description of landmark recognition.

The method developed for workspace segregation employs simple image processing techniques. The following steps are employed to extract the expected line features [shown in Fig. 1(a)], using information readily available from the image data (Trahanias et al., 1997); a possible realization of these steps is sketched after the list.

• The edges in the image are detected and small ones are rejected from further processing.
• The Hough transform (HT), applied selectively in ρ, θ, defines the lines that correspond to the two edges between the walls and the floor. This result is further refined with the use of the adaptive Hough transform (AHT) (Illingworth and Kittler, 1987). The point of intersection between these two lines establishes the z-axis vanishing point.
• By combining information from the above extracted lines and the image edges, we selectively apply AHT to obtain the lines that correspond to the far end of the workspace. These lines, in conjunction with the vanishing point, facilitate the establishment of the lines (edges) between the walls and the ceiling.
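The sketch below is a minimal, hypothetical realization of the first two steps using standard OpenCV primitives; it is not the authors' implementation. The Canny thresholds, the minimum edge length and the θ bands used to isolate the wall/floor boundaries are illustrative assumptions.

```python
import cv2
import numpy as np

def segregate_workspace(gray):
    """Rough sketch of the model-driven segregation: detect the two
    wall/floor boundary lines with a restricted Hough search and
    intersect them to obtain the z-axis vanishing point."""
    edges = cv2.Canny(gray, 50, 150)                      # assumed thresholds

    # Reject small edge chains before the Hough step.
    n, labels, stats, _ = cv2.connectedComponentsWithStats(edges)
    for i in range(1, n):
        if stats[i, cv2.CC_STAT_AREA] < 30:               # assumed minimum edge length
            edges[labels == i] = 0

    # Hough transform restricted in (rho, theta): only roughly diagonal lines,
    # which is where the wall/floor boundaries are expected to project.
    lines = cv2.HoughLines(edges, 1, np.pi / 180, threshold=80)
    candidates = []
    if lines is not None:
        for rho, theta in lines[:, 0]:
            if 0.3 < theta < 1.2 or 1.9 < theta < 2.8:    # assumed theta bands
                candidates.append((rho, theta))

    low = [l for l in candidates if l[1] < np.pi / 2]     # one boundary per band
    high = [l for l in candidates if l[1] >= np.pi / 2]
    if not low or not high:
        return None
    left, right = low[0], high[0]

    # Intersect the two lines (x cos(theta) + y sin(theta) = rho form)
    # to obtain the z-axis vanishing point.
    A = np.array([[np.cos(left[1]),  np.sin(left[1])],
                  [np.cos(right[1]), np.sin(right[1])]])
    b = np.array([left[0], right[0]])
    vanishing_point = np.linalg.solve(A, b)
    return left, right, vanishing_point
```

The same restricted-Hough idea, seeded by these lines and the vanishing point, would then be applied to recover the far-end and wall/ceiling boundaries.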

2.2. Dynamic Segregation

Since the robot is not moving very fast, adjacent image frames should not differ substantially in their structure. Therefore, we can exploit the segregation result extracted for one frame F (as described above) to extract the same information in the next frame F′. The lines of interest in these two frames should be very close spatially. To detect them we establish an area around each one of the two bottom lines in F. This area is expected to contain the corresponding lines in F′. Its width wa depends on the robot velocity (see Note 1), which is roughly known. Therefore, if we expect the robot to move with varying velocity during a navigation session, wa can be dynamically adjusted. Otherwise, an upper limit for wa can be easily determined and used at all times. The latter has been adopted in our implementation to date due to its simplicity. Similarly, the length la of the above area has been set to a fixed value. Having established this area, differentiation in a direction normal to the lines in frame F can detect the edge points in frame F′. This differentiation is more accurately performed using sub-pixel computations. The result of this is a set of points that ideally belong to the corresponding line in F′. In order to avoid the inclusion of erroneous such points, we extract this line by applying the Least Median of Squares robust detector (Rousseuw, 1984) to this set of points.

The dynamic segregation may fail if the two consecutive frames differ substantially. Such a case may be encountered, for example, during a rotational motion of the robot (e.g. going around a corner). In order to detect such failures, the number of points in each line that result from differentiation, Pd, is examined. When a reasonable similarity between the two frames exists, then Pd should be very close to la, where the latter is measured in pixels. Experimentation has revealed that in such cases Pd was always greater than 0.7 la, which has been employed as a working threshold for Pd. If this threshold criterion is not met, then it is assumed that the scene has changed substantially and the model-driven workspace segregation procedure described above is applied. Therefore, the dynamic segregation procedure is applied only when strong evidence supports the similarity assumption between the two consecutive frames. This represents a "safe approach" to the workspace segregation task, in the sense that the fast, dynamic segregation procedure is only invoked when the data strongly suggest it; otherwise, the slower model-based segregation procedure is preferred.
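A possible sketch of the dynamic segregation and the fall-back test follows. The previous boundary is assumed to be available in (ρ, θ) form; the band width wa, the band length la and the 0.7·la threshold follow the text, while the gradient threshold and the Least-Median-of-Squares sampling scheme are illustrative choices of our own.

```python
import numpy as np

def update_line(frame_curr, rho, theta, wa=10, la=200):
    """Re-detect one boundary line in the next frame by searching, within a
    band of width wa around the previous line, for the strongest intensity
    transition normal to that line.  Returns None (caller reverts to the
    model-driven segregation) if the similarity test P_d > 0.7*la fails."""
    h, w = frame_curr.shape
    direction = np.array([-np.sin(theta), np.cos(theta)])   # along the line
    normal = np.array([np.cos(theta), np.sin(theta)])       # across the line
    foot = rho * normal                                      # a point on the line

    points = []
    for s in np.linspace(-la / 2, la / 2, la):               # la samples along the line
        base = foot + s * direction
        best, best_grad = None, 0.0
        for d in np.arange(-wa, wa, 0.5):                    # scan across the band
            p = base + d * normal
            q = base + (d + 1.0) * normal
            if not (0 <= p[0] < w - 1 and 0 <= p[1] < h - 1 and
                    0 <= q[0] < w - 1 and 0 <= q[1] < h - 1):
                continue
            grad = abs(float(frame_curr[int(q[1]), int(q[0])]) -
                       float(frame_curr[int(p[1]), int(p[0])]))
            if grad > best_grad:
                best_grad, best = grad, p
        if best is not None and best_grad > 20:              # assumed edge-strength threshold
            points.append(best)

    if len(points) < 0.7 * la:                               # scene changed too much
        return None
    return fit_line_lmeds(np.array(points))

def fit_line_lmeds(pts, trials=200, rng=np.random.default_rng(0)):
    """Least Median of Squares line fit: repeatedly fit a line through two
    random points and keep the one minimizing the median squared residual."""
    best_line, best_med = None, np.inf
    for _ in range(trials):
        i, j = rng.choice(len(pts), size=2, replace=False)
        p, q = pts[i], pts[j]
        n = np.array([q[1] - p[1], p[0] - q[0]], dtype=float)  # normal of line p-q
        norm = np.linalg.norm(n)
        if norm < 1e-9:
            continue
        n /= norm
        residuals = (pts - p) @ n
        med = np.median(residuals ** 2)
        if med < best_med:
            best_med, best_line = med, (p, n)
    return best_line
```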

2.3. Saliency Map

Following recent theories of active and purposive vision, potential landmark locations are detected using a “focus-of-attention” mechanism (Andersen, 1996; Madsen et al., 1997). Locally, workspace objects that can be characterized as landmarks should form distinctive patterns, easily extracted from the rest of the environment. The mechanism that we employ in order to detect such distinctive patterns is the saliency map (Clark and Ferrier, 1993; Koch and Ullman, 1985). This represents a map of the scene image where distinctive areas are given larger values and areas that correspond to smooth regions are given smaller values. As suggested by a number of authors (Koch and Ullman, 1985; Milanese et al., 1994; Clark and Ferrier, 1993; Madsen et al., 1997), we compute the saliency map using appropriate image features. The determination of the exact set of features is application dependent (Clark and Ferrier, 1993). In our case we deal with indoor environments, and the observed scenes consist of large flat areas (i.e. walls, floor), being disturbed by


smaller objects of various shapes. Such distinctive objects can be detected in the image data by employing texture features that discriminate between uniform and non-uniform image areas; three such features are employed. Assuming a window W centered at position N, these features are:

• Area correlation R_N, computed as

  $R_N = \dfrac{\mu_{MN} - \mu_M\,\mu_N}{\sigma_N\,\sigma_M}$   (1)

  where μ_N denotes the mean value of pixels in W at position N, μ_MN is the mean value of the products of all pixel pairs from two different window positions M and N, and σ_N = (μ_NN − μ_N μ_N)^{1/2} is the standard deviation in W at position N.

• Image entropy H_N, calculated as

  $H_N = -\sum_i P(h_i)\,\log_2 P(h_i)$   (2)

  where P(h_i) denotes the probability of occurrence of the value h_i within W.

• Standard deviation of the intensity histogram over W.

Finally, since we assume a moving observer (robotic platform), differentiation in time can also be used to identify areas of interest on the image. The corresponding feature employed is the standard deviation of pointwise differences in W in successive image frames.

It should be noted at this point that in the literature on attention mechanisms, a variety of features have been proposed, including line ends, spatial frequency, line orientation, color and binocular disparity (Milanese et al., 1994; Clark and Ferrier, 1993). Such features can meaningfully be used in applications where the features' perceptual characteristics are useful. In our case, where the objects of interest (landmarks) have no specific shape, orientation or color, and the most important characteristic to drive attention to them is their distinctiveness from the environment, the employment of texture features that actually mark such patterns seems appropriate. At the same time, the computational complexity of the employed features is very low, which guarantees their efficient calculation at run time.

The conjunction of the adopted features to form a saliency map can be performed in various ways. Simple averaging (Milanese et al., 1994) may perform satisfactorily in cases where all features consent to the


presence or absence of an object deserving attention. Linear combinations of the features may be used (Clark and Ferrier, 1993), with weights proportional to our confidence in, and/or the significance of, the corresponding feature. When all weights are equal, the averaging scheme is actually implemented. Weights may change over time, with an extreme case being the assignment of zero weights to all features but the currently largest one; the latter corresponds to the winner-take-all approach, which rewards the feature with the largest value (Koch and Ullman, 1984). Relaxation labeling approaches (Milanese et al., 1994) can also be used to render coherent areas of interest in the image. However, such approaches tax the overall efficiency of the system, due to their high computational requirements.

The adopted features in our case express similar perceptual characteristics, suggesting the employment of a simple scheme for the computation of the saliency map. Since we do not want to bias the final result towards any of the features, we have chosen the averaging scheme, with the features being normalized to the same dynamic range before averaging. Consequently, the features contribute equally to the computation of the saliency map; information from the latter can be used for a more selective search for landmarks in a scene.
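One way the adopted cues and the equal-weight averaging could be combined is sketched below. The window size, step, number of histogram bins, and the omission of the area-correlation cue of Eq. (1) (which would be added as one more normalized column) are our own simplifications, not the paper's settings.

```python
import numpy as np

def window_entropy(w, bins=16):
    """Image entropy of the grey values in window W (cf. Eq. (2))."""
    hist, _ = np.histogram(w, bins=bins, range=(0, 256))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def saliency_map(frame, prev_frame, win=15, step=5):
    """Slide a window over the image, compute per-window texture/temporal
    cues, normalize each cue to [0, 1] and average them with equal weights
    (the combination scheme adopted in the text)."""
    h, w = frame.shape
    ys = list(range(0, h - win, step))
    xs = list(range(0, w - win, step))
    cues = np.zeros((len(ys), len(xs), 3))

    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            win_now = frame[y:y + win, x:x + win].astype(float)
            win_prev = prev_frame[y:y + win, x:x + win].astype(float)
            hist, _ = np.histogram(win_now, bins=16, range=(0, 256))
            cues[i, j, 0] = window_entropy(win_now)       # image entropy
            cues[i, j, 1] = np.std(hist)                  # std of intensity histogram
            cues[i, j, 2] = np.std(win_now - win_prev)    # temporal differentiation cue

    # Normalize every cue to the same dynamic range, then average (equal weights).
    lo = cues.min(axis=(0, 1), keepdims=True)
    hi = cues.max(axis=(0, 1), keepdims=True)
    cues = (cues - lo) / np.maximum(hi - lo, 1e-9)
    return cues.mean(axis=2)                              # one saliency value per window
```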

2.4. Area of Interest Estimation

The saliency map, in conjunction with the workspace segregation results, indicates where to look (Schiele and Crowley, 1996) for potential landmarks. In other words, points within the boundaries of the "walls" with large saliency values are good such candidates. A set of heuristic rules is applied in this case to assure the extraction of valid landmark objects. That is, such objects are required to exhibit saliency values above a threshold value; moreover, isolated large saliency values are excluded from further consideration. It is required that large saliency values are formed at a certain number of points before such points can be considered to belong to a landmark. Both the saliency value threshold and the threshold for the number of salient points imply trade-offs in the detection of useful landmarks. Low values allow many objects to be considered as landmarks, whereas high values limit such objects to very prominent fixtures. In our implementation we have determined these values through a supervised, iterative procedure. Starting with low values for both thresholds, we have repeatedly increased them until the extracted


landmarks were independently confirmed by two individuals.

In order to extract the image pattern of a detected landmark, a window is initialized as the minimum enclosing rectangle of the salient points, and an area expansion algorithm undertakes to estimate the optimal (rectangular) window that encloses the landmark. Optimality is considered here in terms of a criterion that controls when expansion towards one side of the window should be terminated. This criterion is formulated as

$|\sigma_n' - \sigma_n| < \epsilon$   (3)

where σ_n′ and σ_n represent the standard deviations in the pattern window before and after expansion, respectively, and ε is a threshold value.

Landmarks extracted as described above are stored in a visual memory as images. Attached to each such image we also save the values of associated parameters that are used later for its own recognition (see OPT, Eqs. (16) and (17)). Moreover, we save references to navigational actions selectable at each landmark, and spatial relations between landmarks, building in effect a topological map which can be used at runtime for decisions concerning the current navigational task.
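The sketch below illustrates one reading of the area-expansion step: each side of the minimum enclosing rectangle grows until a one-pixel expansion changes the window's standard deviation by less than ε, i.e. until the criterion of Eq. (3) is met. Parameter values and function names are illustrative assumptions.

```python
import numpy as np

def landmark_window(image, salient_pts, eps=0.5, max_iter=500):
    """Start from the minimum enclosing rectangle of the salient points
    (given as (row, col) pairs) and grow each side until the change in the
    window's standard deviation falls below eps (cf. Eq. (3))."""
    ys, xs = salient_pts[:, 0], salient_pts[:, 1]
    top, bottom = ys.min(), ys.max()
    left, right = xs.min(), xs.max()
    h, w = image.shape

    def std_of(t, b, l, r):
        return float(np.std(image[t:b + 1, l:r + 1]))

    growing = {"top": True, "bottom": True, "left": True, "right": True}
    for _ in range(max_iter):
        if not any(growing.values()):
            break
        for side in ("top", "bottom", "left", "right"):
            if not growing[side]:
                continue
            t, b, l, r = top, bottom, left, right
            if side == "top" and t > 0:
                t -= 1
            elif side == "bottom" and b < h - 1:
                b += 1
            elif side == "left" and l > 0:
                l -= 1
            elif side == "right" and r < w - 1:
                r += 1
            else:
                growing[side] = False       # image border reached
                continue
            if abs(std_of(t, b, l, r) - std_of(top, bottom, left, right)) < eps:
                growing[side] = False       # expansion no longer changes the statistics
            else:
                top, bottom, left, right = t, b, l, r
    return top, bottom, left, right
```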

3. Landmark Recognition

During a navigation session, a procedure similar to the one described above is followed to detect potential landmarks. Then, landmark recognition is achieved by successfully matching a potential landmark to one of the previously extracted (learned) landmarks. However, observed and stored landmarks may have been extracted from different locations of the observer. Therefore, before pattern matching, an appropriate transformation should be applied to facilitate their comparison. A novel such transformation has been developed, called Observer Position Transformation (OPT); it is applied to the stored landmark so that it can be transformed to its view from the current position of the observer.

3.1. Observer Position Transformation

OPT is a viewing transformation; it uses information about the observer's position during landmark extraction in the learning phase and the new position of the observer during navigation to transform the stored landmark pattern to its view from the observer's current position. Its formulation below follows simple geometrical manipulations.

Figure 2. Top view of robot workspace.

Based on earlier assumptions, a top view of the workspace yields the layout shown in Fig. 2. Let Oxyz be a reference coordinate system aligned with the workspace (Fig. 2). Without loss of generality, we assume that the origin of Oxyz is midway between the left and right sides (M′N′, MN) of the workspace, and it is at the same "height" (y-axis) and "depth" (z-axis) as the camera (observer) C. The camera coordinate system is parallel to the reference coordinate system; in other words, the camera coordinate system and the reference coordinate system are related by a simple translation along the x-axis. Using similar triangles one can easily show that:

$x_1 + x_1' = B\,\dfrac{Z_B - f}{Z_B}$   (4a)

$x_1' = B\,\dfrac{Z_F - f}{Z_F}$   (4b)

From Eq. (4), x_1 is obtained as

$x_1 = B\,f\left(\dfrac{1}{Z_F} - \dfrac{1}{Z_B}\right)$   (5)

Similarly, x_2 can be computed as

$x_2 = (A - B)\,f\left(\dfrac{1}{Z_F} - \dfrac{1}{Z_B}\right)$   (6)

where x_1 and x_2 are quantities measured on the image, as shown in Fig. 1(b), Z_F and Z_B represent the depths indicated in Fig. 2, A is the workspace width, B the distance of the observer from the right side of the workspace, and f the camera focal length. The ratio x_1/x_2 can be obtained from Eqs. (5) and (6) as

$\dfrac{x_1}{x_2} = \dfrac{B}{A - B}$   (7)

Let λ = B/A; then (7) can be rewritten as

$\dfrac{x_1}{x_2} = \dfrac{\lambda}{1 - \lambda}$   (8)

or, equivalently, λ is given as

$\lambda = \dfrac{x_1}{x_1 + x_2}$   (9)

Equation (9) gives B as a fraction of the workspace width A. With respect to the reference coordinate system, the observer's x-coordinate, C_x, is given as

$C_x = \dfrac{A}{2} - \lambda A$   (10)

Figure 3. A landmark point observed from different camera positions.

Let a landmark be observed from the position C_x and consider a point on the landmark with reference coordinates X_L, Y_L, Z_L (Fig. 3). Then

$X_L = \dfrac{x_l Z_L}{f} - C_x$   (11)

where x_l is the x-coordinate of the same point on the image plane. Assuming that, during navigation, the robot encounters the same landmark, but from a different position C_x′ (Fig. 3), X_L can be expressed as

$X_L = \dfrac{x_l' Z_L'}{f} - C_x'$   (12)

where, similar to C_x, C_x′ is computed as

$C_x' = \dfrac{A}{2} - \lambda' A$   (13)

Equating (11) and (12) we obtain

$x_l' = \dfrac{x_l Z_L - (C_x - C_x')\,f}{Z_L'}$   (14)

We can get independent estimations regarding the depth Z_L (and Z_L′) using the width α of the workspace that is measured on the image at that depth (Fig. 1(b)):

$Z_L = \dfrac{A f}{\alpha}$   (15a)

$Z_L' = \dfrac{A f}{\alpha'}$   (15b)

By substituting Eqs. (10), (13) and (15) into Eq. (14), and letting t = x_2/x_1 (resp. t′ = x_2′/x_1′), we finally obtain

$x_l' = \dfrac{\alpha'}{\alpha}\,x_l - \alpha'\left[\dfrac{1}{1 + t'} - \dfrac{1}{1 + t}\right]$   (16)

and similarly, for the y-coordinate on the image plane,

$y_l' = \dfrac{\alpha'}{\alpha}\,y_l$   (17)

Equations (16) and (17) constitute OPT. According to them, a landmark pattern can be transformed


to obtain its image pattern as viewed from a different position of the observer. Therefore, OPT facilitates view-invariant landmark recognition, which is a prerequisite for landmark-based, topological navigation.

It is interesting to observe that the quantities involved in (16) and (17) are readily available from the image data and the results of the workspace segregation. Therefore, no high level reasoning is required and all visual operations are performed at the lowest level, i.e. the image data level. Moreover, OPT explicitly avoids the estimation and use of the world position of the observer, thus resulting in increased pattern recognition accuracy and robustness by dissociating landmark recognition from localization errors. The unavailability of the robot's world coordinates is by no means an obstacle for its navigation, since our approach relies on the relative positions of the landmarks in the workspace. In other words, the robot uses the qualitative knowledge of its position provided by a recognition event, and knowledge of the landmarks' topology, to plan for its next move(s), i.e. keep going straight, or turn right at the next corner, etc. Furthermore, assuming that more than one landmark is visible at the same time, qualitative navigation approaches may also be employed to plan the robot's motion (Levitt and Lawton, 1990).

It should be noted at this point that the above analysis is only valid in the case that the camera's z-axis is parallel to the z-axis of the reference coordinate system. However, since we consider an active observer, having full control of the image acquisition process, this can be easily achieved by a simple rotation β of the camera, or of the autonomous system if the camera is assumed fixed on the robot. The criterion for this rotation can be derived from the segregation result. If the two z-axes are parallel, then no vanishing point should be observed on the x-axis, as illustrated in Fig. 4(a); in the opposite case, the location of the x-axis vanishing point shows the direction of rotation (see Note 2) [Fig. 4(b)].
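Since Eqs. (16) and (17) use only quantities measured on the two images, OPT reduces to a few arithmetic operations per landmark point. A minimal sketch follows; the function and argument names are ours, and the (x1, x2, α) triples are assumed to come from the workspace segregation of the corresponding frames.

```python
def opt(xl, yl, x1, x2, alpha, x1_new, x2_new, alpha_new):
    """Observer Position Transformation (Eqs. (16) and (17)).

    (xl, yl):  image coordinates of a landmark point in the stored (learning) view
    x1, x2:    image quantities of Fig. 1(b) in the learning view
    alpha:     apparent workspace width at the landmark depth, learning view
    *_new:     the same quantities measured in the current (navigation) view
    Returns the predicted image coordinates of the same point in the current view.
    """
    t = x2 / x1                     # t  = x2 / x1   (learning view)
    t_new = x2_new / x1_new         # t' = x2'/ x1'  (current view)
    scale = alpha_new / alpha       # alpha'/alpha
    xl_new = scale * xl - alpha_new * (1.0 / (1.0 + t_new) - 1.0 / (1.0 + t))   # Eq. (16)
    yl_new = scale * yl                                                         # Eq. (17)
    return xl_new, yl_new

def transform_pattern_coords(coords, view, view_new):
    """Apply OPT to every pixel coordinate of a stored landmark pattern.
    `view` and `view_new` are (x1, x2, alpha) tuples for the two views."""
    return [opt(xl, yl, *view, *view_new) for (xl, yl) in coords]
```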

3.2. Bilinear Interpolation

The result of OPT is a set of 2D points (x_l′, y_l′). However, these points do not necessarily have integer coordinates. In order to derive the transformed landmark as an image pattern, we should compute the image function values on the integer-coordinate (image) grid that corresponds to the set of 2D points (x_l′, y_l′). To accomplish this, we employ a bilinear interpolation and compute the image intensity at the integer grid points using the available values within a circle of radius R around each grid point. Experimentally, we have estimated a value of the circle radius, R = √2. Thus, the intensity Φ at a grid point is computed as

$\Phi = \dfrac{\sum_i w_i \Phi_i}{\sum_i w_i}$   (18)

where w_i = √2 − d_i is the weight assigned to a point within the circle, at distance d_i from the grid point, and Φ_i is the intensity of this point.

The application of OPT, followed by the bilinear interpolation, to a stored landmark facilitates matching of the latter with the newly encountered landmark. In our implementation, this is performed using the area correlation coefficient in Eq. (1), and the value returned is taken as the similarity coefficient between the two landmark patterns. Eq. (1) quantifies the similarity between two image patterns as a figure in the range [−1, +1]; values close to +1 indicate strong similarity, whereas values close to −1 characterize very dissimilar patterns.
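The resampling of Eq. (18) and the matching with the correlation coefficient of Eq. (1) might be prototyped as follows. The scatter-style accumulation is equivalent to gathering, for each grid point, the transformed points that fall within a radius of √2; everything else is written directly from the two equations, with names of our own choosing.

```python
import numpy as np

def resample_to_grid(points, intensities, shape):
    """Scatter the OPT-transformed (non-integer) points onto the integer
    image grid using the radius-sqrt(2) weighted average of Eq. (18)."""
    R = np.sqrt(2.0)
    num = np.zeros(shape)
    den = np.zeros(shape)
    for (x, y), phi in zip(points, intensities):
        for gy in range(int(np.floor(y - R)), int(np.ceil(y + R)) + 1):
            for gx in range(int(np.floor(x - R)), int(np.ceil(x + R)) + 1):
                if not (0 <= gy < shape[0] and 0 <= gx < shape[1]):
                    continue
                d = np.hypot(gx - x, gy - y)
                if d <= R:
                    w = R - d                      # w_i = sqrt(2) - d_i
                    num[gy, gx] += w * phi
                    den[gy, gx] += w
    return np.divide(num, den, out=np.zeros(shape), where=den > 0)

def similarity(pattern_a, pattern_b):
    """Area correlation of Eq. (1): a value in [-1, +1], close to +1 for
    strongly similar patterns."""
    a = pattern_a.astype(float).ravel()
    b = pattern_b.astype(float).ravel()
    num = np.mean(a * b) - a.mean() * b.mean()
    den = a.std() * b.std()
    return num / den if den > 0 else 0.0
```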

4. Experimental Results

The proposed approach has been implemented and experimentally verified in an indoor environment, namely the corridors of FORTH's main building. TALOS, the mobile robotic platform available at the Computer Vision and Robotics Laboratory of FORTH, has been used as a testbed in all our experiments. TALOS is equipped with sonar, IR and tactile sensors, and an active vision head. Its mobility is provided by four synchro-drive motors. All computations are performed on-board using two Pentium processors. One processor, running at 200 MHz, is responsible for the image processing operations and control of the active head, whereas the other, running at 133 MHz, controls the motors and the robot sensors. Communication of the two processors is achieved via the TCX server. A Fast Screen Machine II has been used for image digitization.

Figure 4. Criterion for checking the observer's z-axis orientation with respect to the z-axis of the reference coordinate system: (a) the two axes are parallel; (b) the two axes are not parallel.

4.1. Illustrative Results

Several experiments have been conducted to test the proposed approach for landmark recognition. In the following we present sample results that demonstrate its applicability in indoor environments. In Figs. 5 through 7 the model-driven workspace segregation method is illustrated. In Fig. 5, two consecutive frames taken during the learning phase are shown. The steps followed for the segregation of the first frame [Fig. 5(a)] are shown in Fig. 6. In Fig. 6(a) the edge detection results, after removing small edges, are shown. The selective application of HT on Fig. 6(a) has produced the result presented in Fig. 6(b). Further application of AHT has detected the lines shown in Fig. 6(c), and implicitly the z-axis vanishing point [point "x" in Fig. 6(d)]. The final segregation result is shown in Fig. 6(d), where the lines forming the boundaries between walls and ceiling have been obtained using the vanishing point and points on these lines that can be detected on the edge map [Fig. 6(a)].

Figure 5. Two consecutive frames of the workspace encountered in the learning phase.

Figure 6. (a) Edge detection, (b) Hough transform, (c) adaptive Hough transform, (d) scene segregation.

Figure 7. Dynamic segregation results: (a) differentiation in selected areas; (b) Least Median of Squares robust detector; (c) scene segregation.

Dynamic segregation has been applied to the second frame [Fig. 5(b)], using the segregation results of the first frame. These results have established the areas shown in black color in Fig. 7(a), in which differentiation in a direction normal to the lines of the previous frame [Fig. 6(d)] is applied. The result of differentiation is marked as the set of white points on the above mentioned areas. Each set of points was given as input to the Least Median of Squares robust detector, to obtain the lines shown in Fig. 7(b). From this, the segregation result has been extracted and is shown in Fig. 7(c).

The computation of the saliency map for the frame of Fig. 5(b) is presented in Fig. 8(a). In this map, shades of gray are used to code the image saliency values. The segregation result is also shown in Fig. 8(a), superimposed on the saliency map, to delineate search areas for potential landmarks. Points with large values in the saliency map, within the search areas, are denoted on Fig. 8(a) with an "x". Isolated such points are rejected from further consideration. The points in the upper-left corner of the image plane have been used by the area expansion algorithm to obtain the landmark pattern, which is shown in Fig. 8(b) enclosed in the depicted window.

Figure 8. (a) Saliency map, (b) landmark extraction.

During a navigation session, the same landmark may be seen by the autonomous agent from a different position. This is illustrated in Fig. 9(a), and Fig. 9(b) shows the results of workspace segregation and landmark extraction superimposed on the raw image data of Fig. 9(a). To facilitate the visual observation of the landmark, Fig. 10(a) and (b) show its image patterns as extracted in the learning [Fig. 8(b)] and the navigation [Fig. 9(b)] phases, respectively. The result of the application of the OPT and bilinear interpolation on the landmark pattern of Fig. 10(a) is illustrated in Fig. 10(c). Matching of the two patterns in Fig. 10(b) and (c) using Eq. (1) has resulted in a similarity coefficient value of 0.83. The effect of the OPT can be demonstrated by noting that matching of the two raw patterns [Fig. 10(a) and (b)] has produced a value of 0.72.

Figure 9. (a) A frame from the workspace encountered in the navigation phase, (b) scene segregation and landmark extraction.

Figure 10. Landmark pattern extracted in: (a) learning phase; and (b) navigation phase; (c) pattern in (a) after the OPT and bilinear interpolation.

Figure 11 shows another result that refers to a different workspace. The landmark pattern extracted during the learning phase is illustrated in Fig. 11(a) as the pattern marked by the rectangular window. For visualization purposes, the workspace segregation results are also superimposed on the image data. During a navigation session, the same landmark pattern has been imaged from a different observer position, as shown in Fig. 11(b). Again, the landmark extraction and workspace segregation results are superimposed on the raw image data. OPT has been applied to the landmark pattern of Fig. 11(a) to facilitate its comparison against the landmark pattern of Fig. 11(b). The two patterns, as well as the result of OPT, are presented in Fig. 12. As can be observed, the transformed image pattern (Fig. 12(c)) can be readily matched with the one viewed from the current robot position (Fig. 12(b)).

Figure 11. Results of workspace segregation and landmark extraction during (a) learning phase, and (b) navigation phase.

Figure 12. Landmark pattern extracted in: (a) learning phase; and (b) navigation phase; (c) pattern in (a) after the OPT and bilinear interpolation.

A result taken from a workspace with no distinctive patterns present is shown in Fig. 13. Fig. 13(a) shows the initial frame, and Fig. 13(b) illustrates the segregation result. The saliency map is given in Fig. 13(c). The values obtained from the saliency map in the two areas under consideration (left and right walls) do not exceed the thresholds imposed, and therefore do not suggest the presence of a landmark pattern.

Figure 13. (a) Initial frame, (b) workspace segregation, (c) saliency map.

4.2. Quantitative Evaluation

In order to quantitatively evaluate our approach, we have set up an experiment where the workspace consisted of four corridors forming a rectangle. Landmark extraction has resulted in acquiring and storing six distinct landmark patterns, named L1 through L6 in the sequel; the icons of these patterns are depicted in Fig. 14. Subsequently, TALOS was instructed to navigate in the same workspace, going around it for fifty


times, and to report all landmark recognitions it arrived at. Information from previous landmark recognitions was not used in the next ones, in order to obtain an unbiased assessment of the approach. The results obtained are summarized in Table 1.

Figure 14. Icons of the six landmarks identified during the quantitative evaluation.

Table 1. Landmark recognition results.

              Correct recognitions    Misrecognitions    False positives    False negatives
Landmark         #         %             #       %          #       %          #       %
L1              47        94             2       4          1       2          1       2
L2              45        90             3       6          0       0          2       4
L3              47        94             2       4          2       4          1       2
L4              48        96             0       0          2       4          2       4
L5              44        88             5      10          1       2          1       2
L6              47        94             1       2          0       0          2       4
Total           46.33     92.66          2.17    4.33       1.00    2.00       1.50    3.00

The rows in this table indicate for each landmark the number and percentage of: (a) correct recognitions, (b) mis-recognitions, i.e. landmark Li was reported as Lj, (c) false positives, i.e. landmark Li was erroneously reported, and (d) false negatives, i.e. landmark Li was missed. The results referring to all landmarks indicate a 92.66% correct recognition rate.

Detailed investigation of the cases where errors have occurred has revealed that these were mostly due to either extreme placement of the robot towards the wall containing the landmark, or due to inaccuracies caused by the preprocessing modules (workspace segregation, saliency map). In the former case, landmark patterns on the walls tend to project on the image plane as vertical (thick) lines, which are very difficult to recognize or may even be missed if very few salient points are detected on them. Inaccuracies in the workspace segregation manifest themselves as erroneous parameters involved in OPT, which may result in incorrect pattern

transformations and hence incorrect matching. Finally, small variations in the saliency map are responsible for either false- or no alarms, giving rise to false positive and false negative errors. The overall recognition rate of 92.66% is a promising result, considering also the potential for further improvements in our approach. Utilization of the information in the topological map may confine pattern matching against only a few stored landmarks, which is expected to reduce recognition errors. Moreover, other sources of information may additionally be used to direct the motion of the robot; for example, proximity sensors can be employed to prevent the robot from approaching closely the side-walls, with obvious effects in the recognition accuracy. All experiments presented in this section have been conducted using the on-board processing power of TALOS. Since the processing requirements of our approach are very limited, we have been able to obtain good performance even with the marginal processing facilities of TALOS. With a small image acquisition rate (5 f/sec) and image size 256 × 256, TALOS was able to recognize landmarks “on-the-fly”, when the database of learned landmarks contained a small number of patterns (e.g. 4–6).3 It should be noted, however,


that in a real application, not all patterns in the database need to be examined for correct localization. Information from the previous landmark recognition and the topological map can be used to confine the search to a small number of candidate landmarks.
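As a toy illustration of this confinement, a stored topological map could simply list, for each landmark, its possible successors; matching is then attempted only against the previous landmark and these successors. The map contents and field names below are purely hypothetical.

```python
# Hypothetical topological map: each stored landmark lists the landmarks
# that can be encountered next, together with an associated action.
TOPO_MAP = {
    "L1": {"next": ["L2"], "action": "follow corridor"},
    "L2": {"next": ["L3"], "action": "turn right at corner"},
    # ...
}

def candidate_landmarks(last_recognized, topo_map=TOPO_MAP):
    """Restrict matching to the last landmark and its topological successors,
    instead of scanning the whole database of stored patterns."""
    if last_recognized is None:
        return list(topo_map)                 # no prior: consider all landmarks
    return [last_recognized] + topo_map[last_recognized]["next"]
```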

5. Discussion

In this work, an approach has been presented for (a) automated landmark extraction in indoor environments during an initial learning phase, and (b) landmark recognition during robot navigation. For landmark extraction, the approach employs a selective search for landmark patterns, relying both on the workspace structure and the distinctiveness of the environment objects present. For recognition purposes, a viewing transformation has been developed that transforms a stored pattern according to the current (new) position of the observer. This facilitates accurate recognition, as has been demonstrated by experimental results in indoor environments.

Landmark recognition has been approached to-date using various techniques from the fields of computer vision and pattern recognition. A key issue, common to many recognition tasks, is that of finding landmark representations that are invariant under viewing transformations. Such representations do exist, for example straight, vertical edge segments, but typically suffer from lack of descriptive power. Two-dimensional patterns, such as wall signs, posters, etc. are very salient and offer descriptive power, but do not in general offer any kind of invariance under viewing transformations. The observer position transformation proposed in this approach addresses this problem and facilitates the use of such objects as landmark patterns, without sacrificing recognition accuracy. It may, therefore, be useful for autonomous robotic navigation in indoor workspaces. The assumptions underlying OPT hold particularly true for the corridors and hallways of buildings, workspaces for which our approach seems very appropriate. Current research efforts in our laboratory focus on relaxing (some of) the assumptions made in this work regarding the structure of the environment workspace.

Acknowledgments

This work was supported in part by EC Contract No ERBFMRX-CT96-0049 (VIRGO) under the TMR Programme and the General Secretariat for Research

and Technology, Greece, under Grant No. 6060. The authors would like to thank Vassileia Liontaki and Thodoris Garavelos for their help in the system implementation of the approach. We would also like to express our appreciation to the referees for their valuable suggestions which helped to improve the quality of this paper.

Notes

1. wa depends also on the scene structure; however, we ignore this dependency for simplicity and work with "upper bounds" for wa instead.
2. It is assumed that the camera's z-axis is always parallel to the world xz-plane. Therefore, it is not necessary to perform analogous tests for the y-axis vanishing point. In the opposite case, however, the procedure would be exactly the same as in the case of the x-axis vanishing point.
3. The time needed for landmark recognition is proportional to the number of stored patterns, which determines the number of OPT applications and template matching operations.

References

Aloimonos, Y. 1990. Purposive and qualitative active vision. In DARPA Image Understanding Workshop, pp. 816–828.
Andersen, C.S. 1996. A Framework for Control of a Camera Head. Ph.D. Thesis, Laboratory of Image Analysis, Aalborg University, Denmark.
Bouguet, J.Y. and Perona, P. 1995. Visual navigation using a single camera. In Proc. Intl. Conf. on Computer Vision, pp. 645–652.
Cassinis, R., Grana, D., and Rizzi, A. 1996. Self-localization using an omni-directional image sensor. In 4th Intl. Symposium on Intelligent Robotic Systems, Lisbon, Portugal, pp. 215–222.
Clark, J. and Ferrier, N. 1993. Attentive visual servoing. In Active Vision, A. Blake and A. Yuille (Eds.), Artificial Intelligence, MIT Press: Cambridge, MA, chap. 9, pp. 137–154.
Cox, I. and Leonard, J. 1994. Modeling a dynamic environment using a multiple hypothesis approach. Artificial Intell., 66:311–344.
Durrant-Whyte, H. and Leonard, J. 1989. Navigation by correlating geometric sensor data. In IEEE Int. Workshop on Intelligent Robots and Systems, IROS-89.
Greiner, R. and Isukapalli, R. 1996. Learning to select useful landmarks. IEEE Trans. Systems, Man, Cybern.—Part B: Cybernetics, 26(3):437–449.
Hwang, Y.K. and Ahuja, N. 1992. Gross motion planning—a survey. ACM Computing Surveys, 24(3):221–291.
Illingworth, J. and Kittler, J. 1987. The adaptive Hough transform. IEEE Trans. Pattern Anal. Mach. Intell., 9(5):690–698.
Koch, C. and Ullman, S. 1984. Selecting one among the many: A simple network implementing shifts in selective visual attention. Technical Report, MIT AI Laboratory.
Koch, C. and Ullman, S. 1985. Shifts in selective visual attention: Towards the underlying neural circuitry. Hum. Neurobiol., 4:219–227.
Kosaka, A. and Pan, J. 1995. Purdue experiments in model-based vision for hallway navigation. In Workshop on Vision for Robots in IROS'95, pp. 87–96.


Kuhnert, K.-D. 1990. Fusing dynamic vision and landmark navigation for autonomous driving. In IEEE Int. Workshop on Intelligent Robots and Systems, IROS '90, pp. 113–119.
Latombe, J.C. 1991. Robot Motion Planning. Kluwer Academic Publishers: Boston, MA.
Lazanas, A. and Latombe, J.-C. 1992. Landmark-based robot navigation. In 10th National Conference on Artificial Intelligence, San Jose, CA, pp. 697–702.
Leonard, J. and Durrant-Whyte, H. 1991. Mobile robot localization by tracking geometric beacons. IEEE Trans. Robotics and Autom., 7(3):376–382.
Leonard, J., Cox, I., and Durrant-Whyte, H. 1990. Dynamic map building for an autonomous mobile robot. In IEEE Int. Workshop on Intelligent Robots and Systems, IROS-90.
Levitt, T. and Lawton, D. 1990. Qualitative navigation for mobile robots. Artificial Intell., 44:305–360.
Madsen, C.B., Andersen, C.S., and Sorensen, J.S. 1997. A robustness analysis of triangulation based robot self-positioning. In 5th Intl. Symposium on Intelligent Robotic Systems, Stockholm, Sweden, pp. 195–204.
Magee, M. and Aggarwal, J.K. 1995. Robot self-location using visual reasoning relative to a single target object. Pattern Recognition, 28(2):125–134.
Milanese, R. 1993. Detecting salient regions in an image: From biological evidence to computer implementation. Ph.D. Thesis, Department of Computer Science, University of Geneva, Switzerland.
Milanese, R., Wechsler, H., Gil, S., Bost, J.-M., and Pun, T. 1994. Integration of bottom-up and top-down cues for visual attention using non-linear relaxation. In IEEE Conf. on Comp. Vision and Pattern Rec.
Nasr, H. and Bhanu, B. 1988. Landmark recognition for autonomous mobile robots. In IEEE Intl. Conf. on Robotics and Autom., pp. 1218–1223.
Nelson, R.C. 1988. Visual navigation. Ph.D. Dissertation, University of Maryland.
Nielsen, J. and Sandini, G. 1994. Learning mobile robot navigation. In IEEE Conf. on Systems, Man and Cybernetics, San Antonio, TX.
Pahlavan, K., Uhlin, T., and Eklundh, J.O. 1993. Active vision as a methodology. In Active Perception, Y. Aloimonos (Ed.), Lawrence Erlbaum Associates, chap. 1.
Rousseuw, P.J. 1984. Least median of squares regression. J. American Stat. Ass., 79:871–880.
Schiele, B. and Crowley, J.L. 1996. Where to look next and what to look for. In IEEE/RSJ Intl. Conf. on Intell. Robotics and Syst. (IROS'96), pp. 1249–1255.
Shah, S. and Aggarwal, J.K. 1995. Modeling structured environments using robot vision. In 1995 Asian Conf. on Computer Vision.
Sugihara, K. 1988. Some location problems for robot navigation using a single camera. Computer Vision, Graphics, Image Proc., 42:112–129.
Taylor, C.J. and Kriegman, D.J. 1994. Vision-based motion planning and exploration algorithms for mobile robots. In Workshop on the Algorithmic Foundations of Robotics.
Trahanias, P.E., Velissaris, S., and Garavelos, T. 1997. Visual landmark extraction and recognition for autonomous robot navigation. In IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems, IROS'97, Grenoble, France.
Yeh, E. and Kriegman, D.J. 1995. Toward selecting and recognizing natural landmarks. In IEEE/RSJ Intl. Conf. on Intelligent Robots and Systems (IROS'95), Pittsburgh, PA.
Zheng, J.Y. and Tsuji, S. 1989. Spatial representation and analysis of temporal visual events. In IEEE Intl. Conf. on Image Proc., pp. 775–779.
Zheng, J.Y., Barth, M., and Tsuji, S. 1991. Autonomous landmark selection for route recognition by a mobile robot. In 1991 IEEE Intl. Conf. on Robotics and Automation, pp. 2004–2009.

Panos Trahanias is an Associate Professor with the Department of Computer Science, University of Crete, Greece, and a Senior Researcher with the Institute of Computer Science, Foundation for Research and Technology—Hellas (ICS-FORTH). He received his Ph.D. in Computer Science from the National Technical Univ. of Athens, Greece, in 1988. From 1985 to 1989 he served as a Research Assistant at the Institute of Informatics & Telecommunications, National Center for Scientific Research “Demokritos”, Athens, Greece, and from 1990 to 1991 he worked at the same Institute as a Postdoctoral Research Associate. From 1991 to 1993 he was with the Department of Electrical & Computer Engineering, University of Toronto, Toronto, Canada, as a Postdoctoral Research Associate. He has participated in many R & D programs in image processing and analysis at the University of Toronto and has been a consultant to SPAR Aerospace Ltd., Toronto, in a program regarding the analysis of Infrared images. Since 1993 he is with the University of Crete and ICS-FORTH. Currently, he is the coordinator of the Computer Vision & Robotics Laboratory at ICS-FORTH where he is engaged in research and R & D programs in vision-based robot navigation. He has published over 45 papers in technical journals and conference proceedings and has contributed in two books.

Savvas Velissaris holds a B.Sc. in Computer Science and an M.Sc. in Computer Science with specialization in Computer Vision and Robotics, both from the Department of Computer Science, University of Crete. His interests include Computer Vision, Artificial Intelligence, Programming Languages and Information Systems. Since October 1998 he has been with the IT department of Deloitte & Touche, Athens, Greece.

Stelios Orphanoudakis received the B.A. degree in engineering sciences from Dartmouth College, Hanover, NH, in 1971, the M.S. degree in electrical engineering from M.I.T., Cambridge, MA, in 1973, and the Ph.D. degree in electrical engineering from Dartmouth

College in 1976. He is Director of the Institute of Computer Science, Foundation for Research and Technology—Hellas, and Professor of Computer Science, University of Crete, Greece. He held a faculty appointment in the Departments of Diagnostic Radiology and Electrical Engineering at Yale University, USA, from 1975 until 1991. He is a Senior Member of the Institute of Electrical and Electronics Engineers (IEEE). Prof. Orphanoudakis has many years of academic and research experience in the fields of computer vision and robotics, intelligent image management and retrieval by content, and medical imaging. He has served on various committees and working groups of the European Commission and has been active in European R&D programs. He currently serves on the Board of Directors and is Vice President of the European Research Consortium for Informatics and Mathematics (ERCIM). He is also a member of the National Telecommunications Commission, the National Advisory Research Council, and the Board of Directors of the Hellenic Foundation for Culture of Greece.
