Tutorial TT3: A Tutorial on Visual Servo Control

Gregory D. Hager

Department of Computer Science Yale University New Haven, CT 06520-8285 Phone: (203) 432-6432 Email: [email protected]

Seth Hutchinson

Electrical and Computer Engineering Department University of Illinois Urbana, IL 61801 Phone: (217) 244-5570 [email protected]

Peter Corke

CSIRO Division of Manufacturing Technology P.O. Box 883, Kenmore. Australia, 4069. [email protected]


Preface

Two years ago at the 1994 IEEE International Conference on Robotics and Automation, we organized a workshop on visual servo control. The workshop was an overwhelming success, with over 60 participants interested in hearing about the current state of vision-based control. A short poll of participants indicated that most were viewing the workshop as a "tutorial" of sorts. Unfortunately, the workshop format did not best serve their interests in this regard. For this reason, we have organized this tutorial with the goal of presenting some of the basic ideas behind vision-based control in a carefully prepared manner.

This tutorial is also well-timed from the standpoint of visual servo development. Until recently, the hardware costs for a real-time visual servo system were extremely high (including a real-time vision system, real-time robot control interfaces, etc.). Building a visual servoing system required expertise in the areas of control, real-time systems, and visual tracking. However, technology has advanced to the point that the visual processing and control calculations required for visual servoing can be performed on workstations or PCs. This, combined with increased scientific understanding of visual servo control, has made it possible to construct modular visual servoing systems in software that operate at or near camera frame rate. Hence, we believe the time is ripe to capitalize on the potential of visual servoing for commercial and scientific applications.


Schedule

8:30 - 9:00    Overview and Background Material (Greg Hager)
9:00 - 10:00   Vision-Based Control (Seth Hutchinson)
10:00 - 10:15  Coffee Break
10:15 - 11:00  Vision Processing (Greg Hager)
11:00 - 11:30  Systems Issues (Peter Corke)
11:30 - 12:00  State of the Art Systems and Outlook (Peter Corke)


Contents

1 A Tutorial on Visual Servo Control
  1.1 Introduction
  1.2 Background and Definitions
  1.3 Servoing Architectures
  1.4 Position-Based Visual Servo Control
  1.5 Image-Based Control
  1.6 Image Feature Extraction and Tracking
  1.7 Related Issues
  1.8 Conclusion

2 Systems Issues in Visual Servo Control
  2.1 A brief history of visual servoing
  2.2 Architectural issues
  2.3 Control design and performance

3 X Vision: A Portable Substrate for Real-Time Vision
  3.1 Introduction
  3.2 Tracking System Design and Implementation
  3.3 Applications
  3.4 Conclusions
  A Programming Environment

Chapter 1

A Tutorial on Visual Servo Control

Seth Hutchinson

Department of Electrical and Computer Engineering The Beckman Institute for Advanced Science and Technology University of Illinois at Urbana-Champaign 405 N. Mathews Avenue Urbana, IL 61801

Greg Hager

Department of Computer Science Yale University New Haven, CT 06520-8285

Peter Corke

CSIRO Division of Manufacturing Technology P.O. Box 883, Kenmore. Australia, 4069.

1.1 Introduction

Today there are over 800,000 robots in the world, mostly working in factory environments. This population continues to grow, but robots are excluded from many application areas where the work environment and object placement cannot be accurately controlled. This limitation is due to the inherent lack of sensory capability in contemporary commercial robot systems. It has long been recognized that sensor integration is fundamental to increasing the versatility and application domain of robots, but to date this has not proven cost effective for the bulk of robotic applications, which are in manufacturing. The `new frontier' of robotics, which is operation in the everyday world, provides new impetus for this research.


Unlike the manufacturing application, it will not be cost effective to re-engineer `our world' to suit the robot.

Vision is a useful robotic sensor since it mimics the human sense of vision and allows for non-contact measurement of the environment. Since the seminal work of Shirai and Inoue[1] (who describe how a visual feedback loop can be used to correct the position of a robot to increase task accuracy), considerable effort has been devoted to the visual control of robot manipulators. Robot controllers with fully integrated vision systems are now available from a number of vendors. Typically visual sensing and manipulation are combined in an open-loop fashion, `looking' then `moving'. The accuracy of the resulting operation depends directly on the accuracy of the visual sensor and the robot end-effector.

An alternative to increasing the accuracy of these subsystems is to use a visual feedback control loop, which will increase the overall accuracy of the system: a principal concern in any application. Taken to the extreme, machine vision can provide closed-loop position control for a robot end-effector; this is referred to as visual servoing. This term appears to have been first introduced by Hill and Park[2] in 1979 to distinguish their approach from earlier `blocks world' experiments where the system alternated between picture taking and moving. Prior to the introduction of this term, the less specific term visual feedback was generally used. For the purposes of this article, the task in visual servoing is to use visual information to control the pose of the robot's end-effector relative to a target object or a set of target features.

Since the first visual servoing systems were reported in the early 1980s, progress in visual control of robots has been fairly slow, but the last few years have seen a marked increase in published research. This has been fueled by personal computing power crossing the threshold which allows analysis of scenes at a sufficient rate to `servo' a robot manipulator. Prior to this, researchers required specialized and expensive pipelined pixel processing hardware. Applications that have been proposed or prototyped span manufacturing (grasping objects on conveyor belts and part mating), teleoperation, missile tracking cameras and fruit picking, as well as robotic ping-pong, juggling, balancing, car steering and even aircraft landing. A comprehensive review of the literature in this field, as well as the history and applications reported to date, is given by Corke[3] and includes a large bibliography.

Visual servoing is the fusion of results from many elemental areas including high-speed image processing, kinematics, dynamics, control theory, and real-time computing. It has much in common with research into active vision and structure from motion, but is quite different from the often described use of vision in hierarchical task-level robot control systems. Many of the control and vision problems are similar to those encountered by active vision researchers who are building `robotic heads'. However, the task in visual servoing is to control a robot to manipulate its environment using vision, as opposed to passively or actively observing it.

Given the current interest in this topic it seems both appropriate and timely to provide a tutorial introduction. We hope that this tutorial will assist researchers by providing a consistent terminology and nomenclature, and assist others in creating visually servoed systems and gaining an appreciation of possible applications. The growing literature contains solutions and promising approaches to many of the theoretical and technical problems involved. We have attempted here to present the most significant results in a consistent way in order to provide a comprehensive view of the area. Another difficulty we faced was that the topic spans many disciplines. Some issues that arise, such as the control problem (which is fundamentally nonlinear and for which there is no complete established theory) and visual recognition, tracking, and reconstruction (which are fields unto themselves), cannot be adequately addressed in a single article. We have thus concentrated on certain fundamental aspects of the topic, and a large bibliography is provided to assist the reader who seeks greater detail than can be provided here. Our preference is always to present those ideas and techniques which have been found to function well in practice in situations where high control and/or vision performance is not required, and which appear to have some generic applicability. In particular, we will describe techniques which can be implemented using a minimal amount of vision hardware, and which make few assumptions about the robotic hardware.

The remainder of this article is structured as follows. Section 1.2 establishes a consistent nomenclature and reviews the relevant fundamentals of coordinate transformations, pose representation, and image formation. In Section 1.3, we present a taxonomy of visual servo control systems (adapted from [4]). The two major classes of systems, position-based visual servo systems and image-based visual servo systems, are discussed in Sections 1.4 and 1.5 respectively. Since any visual servo system must be capable of tracking image features in a sequence of images, Section 1.6 describes some approaches to visual tracking that have found wide applicability and can be implemented using a minimum of special-purpose hardware. Finally, Section 1.7 presents a number of observations about the current directions of the research field of visual servo control.

1.2 Background and Definitions

Visual servo control research requires some degree of expertise in several areas, particularly robotics and computer vision. Therefore, in this section we provide a very brief overview of these subjects, as relevant to visual servo control. We begin by defining the terminology and notation required to represent coordinate transformations and the velocity of a rigid object moving through the workspace (Sections 1.2.1 and 1.2.2). Following this, we briefly discuss several issues related to image formation, including the image formation process (Sections 1.2.3 and 1.2.4), and possible camera/robot configurations (Section 1.2.5). The reader who is familiar with these topics may wish to proceed directly to Section 1.3.


1.2.1 Coordinate Transformations

In this paper, the task space of the robot, represented by $\mathcal{T}$, is the set of positions and orientations that the robot tool can attain. Since the task space is merely the configuration space of the robot tool, the task space is a smooth $m$-manifold (see, e.g., [5]). If the tool is a single rigid body moving arbitrarily in a three-dimensional workspace, then $\mathcal{T} = SE^3 = \mathbb{R}^3 \times SO^3$, and $m = 6$. In some applications, the task space may be restricted to a subspace of $SE^3$. For example, for pick and place, we may consider pure translations ($\mathcal{T} = \mathbb{R}^3$, for which $m = 3$), while for tracking an object and keeping it in view we might consider only rotations ($\mathcal{T} = SO^3$, for which $m = 3$).

Typically, robotic tasks are specified with respect to one or more coordinate frames. For example, a camera may supply information about the location of an object with respect to a camera frame, while the configuration used to grasp the object may be specified with respect to a coordinate frame at the end-effector of the manipulator. We represent the coordinates of point $P$ with respect to coordinate frame $x$ by the notation ${}^x\mathbf{P}$. Given two frames, $x$ and $y$, the rotation matrix that represents the orientation of frame $y$ with respect to frame $x$ is denoted by ${}^xR_y$. The location of the origin of frame $y$ with respect to frame $x$ is denoted by the vector ${}^x\mathbf{t}_y$. Together, the position and orientation of a frame are referred to as its pose, which we denote by the pair ${}^x\mathbf{x}_y = ({}^xR_y, {}^x\mathbf{t}_y)$. If $x$ is not specified, the world coordinate frame is assumed.

If we are given ${}^y\mathbf{P}$ (the coordinates of point $P$ relative to frame $y$), and ${}^x\mathbf{x}_y = ({}^xR_y, {}^x\mathbf{t}_y)$, we can obtain the coordinates of $P$ with respect to frame $x$ by the coordinate transformation

$$\begin{aligned}
{}^x\mathbf{P} &= {}^xR_y\,{}^y\mathbf{P} + {}^x\mathbf{t}_y & (1.1)\\
               &= {}^x\mathbf{x}_y \cdot {}^y\mathbf{P}. & (1.2)
\end{aligned}$$

Often, we must compose multiple poses to obtain the desired coordinates. For example, suppose that we are given poses ${}^x\mathbf{x}_y$ and ${}^y\mathbf{x}_z$. If we are given ${}^z\mathbf{P}$ and wish to compute ${}^x\mathbf{P}$, we may use the composition of transformations

$$\begin{aligned}
{}^x\mathbf{P} &= {}^x\mathbf{x}_y \cdot {}^y\mathbf{P} & (1.3)\\
               &= {}^x\mathbf{x}_y \cdot {}^y\mathbf{x}_z \cdot {}^z\mathbf{P} & (1.4)\\
               &= {}^x\mathbf{x}_z \cdot {}^z\mathbf{P} & (1.5)
\end{aligned}$$

where

$${}^x\mathbf{x}_z = \left({}^xR_y\,{}^yR_z,\ {}^xR_y\,{}^y\mathbf{t}_z + {}^x\mathbf{t}_y\right). \qquad (1.6)$$
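As a concrete illustration of (1.1) through (1.6), the short sketch below composes two poses and transforms a point, checking that composing first and chaining the individual transformations agree. The frame names and numeric values are purely illustrative and are not taken from the tutorial.

```python
import numpy as np

def rot_z(theta):
    """Rotation matrix for a rotation of theta radians about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

def transform(pose, P):
    """Apply a pose (R, t) to a point, as in (1.1): xP = xR_y yP + xt_y."""
    R, t = pose
    return R @ P + t

def compose(pose_xy, pose_yz):
    """Compose two poses, as in (1.6): xx_z = (xR_y yR_z, xR_y yt_z + xt_y)."""
    R_xy, t_xy = pose_xy
    R_yz, t_yz = pose_yz
    return (R_xy @ R_yz, R_xy @ t_yz + t_xy)

# Illustrative poses: frame y relative to x, and frame z relative to y.
pose_xy = (rot_z(np.pi / 2), np.array([1.0, 0.0, 0.0]))
pose_yz = (rot_z(-np.pi / 4), np.array([0.0, 2.0, 0.0]))

P_z = np.array([0.5, 0.0, 0.0])                              # point in frame z
via_composition = transform(compose(pose_xy, pose_yz), P_z)  # (1.5)-(1.6)
via_chaining = transform(pose_xy, transform(pose_yz, P_z))   # (1.3)-(1.4)
assert np.allclose(via_composition, via_chaining)
```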

Thus, we will represent the composition of two poses by ${}^x\mathbf{x}_z = {}^x\mathbf{x}_y \cdot {}^y\mathbf{x}_z$. We note that the operator $\cdot$ is used to represent both the coordinate transformation of a single point and the composition of two coordinate transformations. The particular meaning should always be clear from the context.

In much of the robotics literature, poses are represented by homogeneous transformation matrices, which are of the form

$${}^xT_y = \begin{bmatrix} {}^xR_y & {}^x\mathbf{t}_y \\ \mathbf{0} & 1 \end{bmatrix}. \qquad (1.7)$$

To simplify notation throughout the paper, we will represent poses and coordinate transformations as defined in (1.1). Some coordinate frames that will be needed frequently are referred to by the following superscripts/subscripts:

e   The coordinate frame attached to the robot end-effector
0   The base frame for the robot
c   The camera coordinate frame

When $\mathcal{T} = SE^3$, we will use the notation $\mathbf{x}_e \in \mathcal{T}$ to represent the pose of the end-effector coordinate frame relative to the world frame. In this case, we often prefer to parameterize a pose using a translation vector and three angles (e.g., roll, pitch and yaw [6]). Although such parameterizations are inherently local, it is often convenient to represent a pose by a vector $\mathbf{r} \in \mathbb{R}^6$, rather than by $\mathbf{x}_e \in \mathcal{T}$. This notation can easily be adapted to the case where $\mathcal{T} \subset SE^3$. For example, when $\mathcal{T} = \mathbb{R}^3$, we will parameterize the task space by $\mathbf{r} = [x, y, z]^T$. In the sequel, to maintain generality we will assume that $\mathbf{r} \in \mathbb{R}^m$.

For SSD (sum-of-squared-differences) window tracking, the displacement of an image window is found by minimizing

$$O(\mathbf{d}) = \sum_{\mathbf{x}\in X} \left(R(\mathbf{x}, \mathbf{c}+\mathbf{d}, t+\tau) - R(\mathbf{x}, \mathbf{c}, t)\right)^2 w(\mathbf{x}), \qquad w(\mathbf{x}) > 0, \qquad (1.59)$$

where $w(\cdot)$ is a weighting function over the image region. The aim is to find the displacement, $\mathbf{d}$, that minimizes $O(\mathbf{d})$. Since images are inherently discrete, a natural solution is to select a finite range of values $D$ and compute

$$\hat{\mathbf{d}} = \arg\min_{\mathbf{d}\in D} O(\mathbf{d}).$$

The advantage of a complete discrete search is that the true minimum over the search region is guaranteed to be found. However, the larger the area covered, the greater the computational burden. This burden can be reduced by performing the optimization starting at low resolution and proceeding to higher resolution, and by ordering the candidates in $D$ from most to least likely and terminating the search once a candidate with an acceptably low SSD value is found [15]. Once the discrete minimum is found, the location can be refined to subpixel accuracy by interpolation of the SSD values about the minimum. Even with these improvements, [15] reports that a special signal processor is required to attain frame-rate performance.

It is also possible to solve (1.59) using continuous optimization methods [29, 3, 33, 21]. The solution begins by expanding $R(\mathbf{x}, \mathbf{c}, t)$ in a Taylor series about $(\mathbf{c}, t)$, yielding

$$R(\mathbf{x}, \mathbf{c}+\mathbf{d}, t+\tau) \approx R(\mathbf{x}, \mathbf{c}, t) + R_x(\mathbf{x})\,d_x + R_y(\mathbf{x})\,d_y + R_t(\mathbf{x})\,\tau,$$

where $R_x$, $R_y$ and $R_t$ are the spatial and temporal derivatives of the image computed using convolution as follows:

$$R_x(\mathbf{x}) = \left(R * \begin{bmatrix} 1 & -1 \\ 1 & -1 \end{bmatrix}\right)(\mathbf{x}), \qquad
R_y(\mathbf{x}) = \left(R * \begin{bmatrix} 1 & 1 \\ -1 & -1 \end{bmatrix}\right)(\mathbf{x}),$$

$$R_t(\mathbf{x}) = \left(\left(R(\cdot, \mathbf{c}, t+\tau) - R(\cdot, \mathbf{c}, t)\right) * \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}\right)(\mathbf{x}).$$

Substituting into (1.59) yields

$$O(\mathbf{d}) \approx \sum_{\mathbf{x}\in X} \left(R_x(\mathbf{x})\,d_x + R_y(\mathbf{x})\,d_y + R_t(\mathbf{x})\,\tau\right)^2 w(\mathbf{x}). \qquad (1.60)$$

Define

$$\mathbf{g}(\mathbf{x}) = \begin{bmatrix} R_x(\mathbf{x})\sqrt{w(\mathbf{x})} \\ R_y(\mathbf{x})\sqrt{w(\mathbf{x})} \end{bmatrix}
\qquad\text{and}\qquad
h(\mathbf{x}) = R_t(\mathbf{x})\sqrt{w(\mathbf{x})}.$$

Expression (1.60) can now be written more concisely as

$$O(\mathbf{d}) \approx \sum_{\mathbf{x}\in X} \left(\mathbf{g}(\mathbf{x}) \cdot \mathbf{d} + h(\mathbf{x})\,\tau\right)^2. \qquad (1.61)$$

Notice $O$ is now a quadratic function of $\mathbf{d}$. Computing the derivatives of $O$ with respect to the components of $\mathbf{d}$, setting the result equal to zero, and rearranging yields a linear system of equations:

$$\left[\sum_{\mathbf{x}\in X} \mathbf{g}(\mathbf{x})\,\mathbf{g}(\mathbf{x})^T\right] \mathbf{d} = -\tau \sum_{\mathbf{x}\in X} h(\mathbf{x})\,\mathbf{g}(\mathbf{x}). \qquad (1.62)$$

Solving for $\mathbf{d}$ yields an estimate, $\hat{\mathbf{d}}$, of the offset that would cause the two windows to have maximum correlation. We then compute $\mathbf{c}_{t+\tau} = \mathbf{c}_t + \hat{\mathbf{d}}$, yielding the updated window location for the next tracking cycle. This is effectively a proportional control algorithm for "servoing" the location of the acquired window to maintain the best match with the reference window over time. In practice this method will only work for small motions (it is mathematically correct only for a fraction of a pixel). This problem can be alleviated by first performing the optimization at low levels of resolution, and using the result as a seed for computing the offset at higher levels of resolution. For example, reducing the resolution by a factor of two by summing groups of four neighboring pixels doubles the maximum displacement between two images. It also speeds up the computations, since fewer operations are needed to compute $\hat{\mathbf{d}}$ for the smaller low-resolution image. Another drawback of this method is that it relies on an exact match of the gray values: changes in contrast or brightness can bias the results and lead to mistracking. Thus, it is common to normalize the images to have zero mean and consistent variance. With these modifications, it is easy to show that solving (1.62) is equivalent to maximizing the correlation between the two windows.
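As a minimal sketch of this continuous-optimization step, the function below forms and solves the normal equations for a small grayscale window under the simplifying assumptions of uniform weighting ($w(\mathbf{x}) = 1$, $\tau = 1$) and NumPy central-difference gradients in place of the 2x2 kernels above; it illustrates the structure of (1.62), not the implementation of any system cited here.

```python
import numpy as np

def ssd_update(ref, cur):
    """One gradient-based SSD update for a tracked window (w(x) = 1).

    ref: reference window R(., c, t); cur: window at the same location at
    time t + tau.  Solves (sum g g^T) d = -sum g Rt for the displacement d.
    """
    ref = ref.astype(float)
    cur = cur.astype(float)
    Ry, Rx = np.gradient(ref)                 # spatial derivatives of the reference
    Rt = cur - ref                            # temporal difference
    g = np.stack([Rx.ravel(), Ry.ravel()])    # 2 x N stack of gradient vectors
    A = g @ g.T                               # sum over the window of g g^T
    b = -(g @ Rt.ravel())                     # -sum over the window of g Rt
    dx, dy = np.linalg.solve(A, b)            # estimated offset d-hat
    return dx, dy

# Synthetic check: a bright blob shifted by one pixel should give d near (1, 0).
yy, xx = np.mgrid[0:32, 0:32]
ref = np.exp(-((xx - 15.0) ** 2 + (yy - 15.0) ** 2) / 20.0)
cur = np.exp(-((xx - 16.0) ** 2 + (yy - 15.0) ** 2) / 20.0)
print(ssd_update(ref, cur))
```

In a tracking loop, the returned offset would be added to the window location, exactly as in the proportional update described above.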


Continuous optimization has two principal advantages over discrete optimization. First, a single updating cycle is usually faster to compute. For example, (1.62) can be computed and solved in less than 5 ms on a Sparc II computer [21]. Second, it is easy to incorporate other window parameters such as rotation and scaling into the system without greatly increasing the computation time [33, 21]. It is also easy to show that including parameters for contrast and brightness in (1.60) makes SSD tracking equivalent to finding the maximum correlation between the two image regions [29]. Thus, SSD methods can be used to perform template matching as well as tracking of image regions.

1.6.3 Filtering and Feedforward

Window-based tracking implicitly assumes that the interframe motions of the tracked feature do not exceed the size of the search window, or, in the case of SSD tracking, a few pixels from the expected location of the image region. In the simplest case, the previous location of the image feature can be used as a predictor of its current location. Unfortunately, as feature velocity increases the search window must be enlarged, which adversely affects computation time. The robustness and speed of tracking can be significantly increased with knowledge about the dynamics of the observed features, which may be due to motion of the camera or target. For example, given knowledge of the image feature location $\mathbf{f}_t$ at time $t$, the Jacobian $J_v$, the end-effector velocity $\mathbf{u}_t$, and the interframe time $\tau$, the expected location of the search window can be computed by the prediction

$$\mathbf{f}_{t+\tau} = \mathbf{f}_t + J_v\,\mathbf{u}_t\,\tau.$$

Likewise, if the dynamics of a moving object are known, then it is possible to use this to enhance prediction. For example, Rizzi [54] describes the use of a Newtonian flight dynamics model to make it possible to track a ping-pong ball during flight. Predictors based on alpha-beta tracking filters and Kalman filters have also been used [53, 37, 72]. Multiresolution techniques can be used to provide further performance improvements, particularly when a dynamic model is not available and large search windows must be used.
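A sketch of the feedforward prediction above follows; the image Jacobian and end-effector velocity are assumed to be supplied by the rest of the system, and the numeric values here are placeholders for illustration only.

```python
import numpy as np

def predict_feature(f_t, J_v, u_t, tau):
    """Feedforward prediction of a feature location one visual sample ahead:
    f(t + tau) = f(t) + J_v u_t * tau, compensating for known camera motion."""
    return f_t + tau * (J_v @ u_t)

# Placeholder values: a point feature (u, v) in pixels, a 2x6 image Jacobian,
# and an end-effector velocity screw (vx, vy, vz, wx, wy, wz).
f_t = np.array([320.0, 240.0])
J_v = np.array([[-500.0, 0.0, 0.0, 0.0, 0.0, 0.0],
                [0.0, -500.0, 0.0, 0.0, 0.0, 0.0]])   # placeholder Jacobian
u_t = np.array([0.02, 0.0, 0.0, 0.0, 0.0, 0.0])
print(predict_feature(f_t, J_v, u_t, tau=1.0 / 30.0))  # next search window center
```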

1.6.4 Discussion

Prior to executing or planning visually controlled motions, a specific set of visual features must be chosen. Discussion of the issues related to feature selection for visual servo control applications can be found in [19, 36]. The "right" image feature tracking method to use is extremely application dependent. For example, if the goal is to track a single special pattern or surface marking that is approximately planar and moving at slow to moderate speeds, then SSD tracking is appropriate. It does not require special image structure (e.g., straight lines), it can accommodate a large set of image distortions, and for small motions it can be implemented to run at frame rates. In comparison to the edge detection methods described above, SSD tracking is extremely sensitive to background changes or occlusions. Thus, if a task requires tracking several occluding contours of an object with a changing background, edge-based methods are clearly faster and more robust.

In many realistic cases, neither of these approaches by itself yields the robustness and performance desired. For example, tracking occluding edges in an extremely cluttered environment is sure to distract edge tracking as "better" edges invade the search window, while the changing background would ruin the SSD match for the region. Such situations call for the use of more global task constraints (e.g., the geometry of several edges), more global tracking (e.g., extended contours or snakes [80]), or improved or specialized detection methods.

To illustrate these tradeoffs, suppose a visual servoing task relies on tracking the image of a circular opening over time. In general, the opening will project to an ellipse in the camera. There are several candidate algorithms for detecting this ellipse and recovering its parameters:

1. If the contrast between the interior of the opening and the area around it is high, then binary thresholding followed by a calculation of the first and second central moments can be used to localize the feature [54] (a minimal sketch follows this list).

2. If the ambient illumination changes greatly over time, but the brightness of the opening and the brightness of the surrounding region are roughly constant, a circular template could be localized using SSD methods augmented with brightness and contrast parameters. In this case, (1.59) must also include parameters for scaling and aspect ratio [70].

3. The opening could be selected in an initial image, and subsequently located using SSD methods. This differs from the previous method in that this calculation does not compute the center of the opening, only its correlation with the starting image. Although useful for servoing a camera to maintain the opening within the field of view, this approach is probably not useful for manipulation tasks that need to attain a position relative to the center of the opening.

4. If the contrast and background are changing, the opening could be tracked by performing edge detection and fitting an ellipse to the edge locations. In particular, short edge segments could be located using the techniques described in Section 1.6.1. Once the segments have been fit to an ellipse, the orientation and location of the segments would be adjusted for the subsequent tracking cycle using the geometry of the ellipse.
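A minimal sketch of option 1 in the list above: it assumes a dark opening against a brighter background that separates cleanly with a fixed threshold (the threshold value is illustrative), and recovers the centroid and major-axis orientation from the first and second central moments.

```python
import numpy as np

def ellipse_from_moments(img, threshold=128):
    """Localize a dark elliptical region by binary thresholding followed by
    first and second central moments (centroid and orientation)."""
    ys, xs = np.nonzero(img < threshold)       # pixels belonging to the opening
    if xs.size == 0:
        return None                            # feature not found
    cx, cy = xs.mean(), ys.mean()              # first moments: centroid
    mu20 = np.mean((xs - cx) ** 2)             # second central moments
    mu02 = np.mean((ys - cy) ** 2)
    mu11 = np.mean((xs - cx) * (ys - cy))
    theta = 0.5 * np.arctan2(2.0 * mu11, mu20 - mu02)   # major-axis angle
    return (cx, cy), theta

# Synthetic example: a dark filled ellipse on a bright background.
yy, xx = np.mgrid[0:200, 0:200]
img = np.full((200, 200), 255.0)
inside = ((xx - 100) / 40.0) ** 2 + ((yy - 80) / 20.0) ** 2 <= 1.0
img[inside] = 0.0
print(ellipse_from_moments(img))   # centroid near (100, 80), theta near 0
```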


During task execution, other problems arise. The two most common problems are occlusion of features and visual singularities. Solutions to the former include intelligent observers that note the disappearance of features and continue to predict their locations based on dynamics and/or feedforward information [54], or redundant feature specifications that can perform even with some loss of information. Solutions to the latter require some combination of intelligent path planning and/or intelligent acquisition and focus-of-attention to maintain the controllability of the system. It is probably safe to say that image processing presents the greatest challenge to general-purpose hand-eye coordination. As an effort to help overcome this obstacle, the methods described above and other related methods have been incorporated into a publicly available "toolkit." The interested reader is referred to [70] for details.

1.7 Related Issues

In this section, we briefly discuss a number of related issues that were not addressed in the tutorial.

1.7.1 Image-Based versus Position-Based Control

The taxonomy of visual servo systems introduced in Section 1.3 has four major architectural classes. Most systems that have been reported fall into the dynamic position-based or image-based look-and-move structure. That is, they employ axis-level feedback, generally of position, for reasons outlined earlier. No reports of an implementation of the position-based direct visual servo structure are known to the authors.

Weiss's proposed image-based direct visual servoing structure does away entirely with axis sensors: dynamics and kinematics are controlled adaptively based on visual feature data. This concept has a certain appeal, but in practice it is overly complex to implement and appears to lack robustness (see, e.g., [81] for an analysis of the effects of various image distortions on such control schemes). The concepts have only ever been demonstrated in simulation for up to 3-DOF, and then with simplistic models of axis dynamics which ignore `real world' effects such as Coulomb friction and stiction. Weiss showed that even when these simplifying assumptions were made, sample intervals of 3 ms were required. This would necessitate significant advances in sensor and processing technology, and the usefulness of controlling manipulator kinematics and dynamics this way must be open to question.

Many systems based on image-based and position-based architectures have been demonstrated, and the computational costs of the two approaches are comparable and readily achieved. The often cited advantage of the image-based approach, reduced computational burden, is doubtful in practice. Many reports are based on using a constant image Jacobian, which is computationally efficient, but valid only over a small region of the task space. The general problem of Jacobian update remains, and in particular there is the difficulty that many image Jacobians are a function of target depth, z (a common point-feature form is sketched below). This necessitates a partial pose estimation, which is the basis of the position-based approach. The cited computational disadvantages of the position-based approach have been ameliorated by recent research: photogrammetric solutions can now be computed in a few milliseconds, even using iteration.
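To make the depth dependence concrete, the sketch below evaluates one commonly used form of the image Jacobian (interaction matrix) for a single point feature in normalized image coordinates. Sign and ordering conventions vary between authors, so this particular form is an assumption for illustration rather than the one used in any specific system cited above.

```python
import numpy as np

def point_feature_jacobian(x, y, Z):
    """Image Jacobian of a point feature at normalized coordinates (x, y)
    with depth Z, mapping the camera velocity screw (vx, vy, vz, wx, wy, wz)
    to image-plane velocity (xdot, ydot).  Note the explicit dependence on Z."""
    return np.array([
        [-1.0 / Z, 0.0,      x / Z, x * y,       -(1.0 + x * x),  y],
        [0.0,      -1.0 / Z, y / Z, 1.0 + y * y, -x * y,         -x],
    ])

# The same feature at two depths: the translational columns scale with 1/Z.
print(point_feature_jacobian(0.1, -0.2, Z=0.5))
print(point_feature_jacobian(0.1, -0.2, Z=2.0))
```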

1.7.2 Dynamic Issues in Closed-loop Systems

A visual servo system is a closed-loop discrete-time dynamical system. The sample rate is the rate at which images can be processed and is ultimately limited by the frame rate of the camera, though many reported systems operate at a sub-multiple of the camera frame rate due to limited computational ability. Negative feedback is applied to a plant which generally includes a significant time delay. The sources of this delay include: charge integration time within the camera, serial pixel transport from the camera to the vision system, and computation time for feature parameter extraction. In addition, most reported visual servo systems employ a relatively low bandwidth communications link between the vision system and the robot controller, which introduces further latency. Some robot controllers operate with a sample interval which is not related to the sample rate of the vision system, and this introduces still further delay. A good example of this is the common Unimate Puma robot, whose position loops operate at a sample interval of 14 or 28 ms while vision systems operate at sample intervals of 33 or 40 ms for RS 170 or CCIR video respectively [20].

It is well known that a feedback system including delay will become unstable as the loop gain is increased. Many visual closed-loop systems are tuned empirically, increasing the loop gain until overshoot or oscillation becomes intolerable. While such closed-loop systems will generally converge to the desired pose with zero error, the same is not true when tracking a moving target. The tracking performance is a function of the closed-loop dynamics, and for simple proportional controllers will exhibit a very significant time lag or phase delay. If the target motion is constant then prediction (based upon some assumption of target motion) can be used to compensate for the latency, but combined with a low sample rate this results in poor disturbance rejection and long reaction time to target `maneuvers'. Predictors based on autoregressive models, Kalman filters, and alpha-beta and alpha-beta-gamma tracking filters have been demonstrated for visual servoing.

In order for a visual servo system to provide good tracking performance for moving targets, considerable attention must be paid to modelling the dynamics of the robot and vision system and designing an appropriate control system. Other issues for consideration include whether or not the vision system should `close the loop' around robot axes which are position, velocity or torque controlled. A detailed discussion of these dynamic issues in visual servo systems is given by Corke [20, 82].
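As a toy illustration of these points (not a model of any reported system), the simulation below closes a proportional loop around a target moving at constant velocity, with one visual sample of latency. Low gains track with a lag inversely proportional to the gain, while higher gains go unstable; the sample interval, gains and target speed are all made-up values.

```python
import numpy as np

T = 0.04            # visual sample interval, roughly CCIR frame rate (s)
v_target = 0.1      # target velocity (m/s)

for k_p in (0.3, 0.8, 1.5, 2.2):
    x = 0.0             # controlled position
    err_delayed = 0.0   # error measured one visual sample ago
    errs = []
    for k in range(200):
        target = v_target * k * T
        err = target - x
        x += k_p * err_delayed      # proportional action on the delayed measurement
        err_delayed = err
        errs.append(err)
    print(f"gain {k_p}: error after 8 s = {errs[-1]:.4g} "
          f"(expected lag ~ {v_target * T / k_p:.4g} if stable)")
```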


1.7.3 Mobile robots

The discussion above has assumed that the moving camera is mounted on an arm-type robot manipulator. For mobile robots the pose of the robot is generally poorly known and can be estimated from the relative pose of known fixed objects or landmarks. Most of the techniques described above are directly applicable to the mobile robot case. Visual servoing can be used for navigation with respect to landmarks or obstacles and to control docking (see, e.g., [83]).

1.7.4 A Light-Weight Tracking and Servoing Environment

The design of many task-specific visual tracking and vision-based feedback systems used in visual servoing places a strong emphasis on system modularity and reconfigurability. This has motivated the development of a modular, software-based visual tracking system for experimental vision-based robotic applications [84]. The system design emphasizes flexibility and efficiency on standard scientific workstations and PCs. The system is intended to be a portable, inexpensive tool for rapid prototyping and experimentation for teaching and research.

The system is written as a set of classes in C++. The use of object-oriented methods hides the details of how specific methods are implemented, and structures applications through a pre-specified set of generic interfaces. It also enhances the portability of the system by supporting device abstraction. The current system runs on several framegrabbers, including models available from most common manufacturers. More information on the system and directions for retrieving it can be found at http://www.cs.yale.edu/HTML/YALE/CS/AI/VisionRobotics/YaleAI.html.

The same design philosophy is currently being used to develop a complementary hand-eye coordination toolkit which is to be available in the near future.

1.7.5 Current Research Problems

There are many open problems in visual servo control, too numerous to describe here. These include control issues, such as adaptive visual servo control [85, 14], hybrid control (e.g., hybrid vision/position control [12], or hybrid force/vision control), and multi-rate system theory [86]; issues related to automatic planning of visually controlled robot motions [87, 88]; applications in mobile robotics, including nonholonomic systems [83]; and feature selection [33, 36]. Many of these are described in the proceedings of a recent workshop on visual servo control [89].


1.7.6 The future

The future for applications of visual servoing should be bright. Cameras are relatively inexpensive devices and the cost of image processing systems continues to fall. Most visual servo systems make use of cameras that conform to broadcast television standards, and this is now a significant limiting factor toward achieving high-performance visual servoing. Those standards were developed over 50 years ago with very specific design aims and to their credit still serve. Advances in digital image processing technology over the last decade have been rapid and vision systems are now capable of processing far more pixels per second than a camera can provide. Breaking this bottleneck would allow the use of higher resolution images, higher frame rates, or both. The current push for HDTV will have useful spinoffs for higher resolution visual servoing. New standards (such as those promoted by the AIA) for digital output cameras are spurring the development of new cameras that are not tied to the frame rates and interlacing of the old broadcast standards.

Visual servoing also requires high-speed image processing and feature extraction. The increased performance and falling cost of computer vision systems, and computing systems in general, is a consequence of Moore's Law. This law, originally postulated in 1965, predicts that computer circuit density will double every year, and 30 years later it still holds true.

Robust scene interpretation is perhaps the greatest current limitation to the wider application of visual servoing. Considerable progress is required if visual servo systems are to move out of environments lined with black velvet.

1.8 Conclusion

This paper has presented, for the first time, a tutorial introduction to robotic visual servo control. Since the topic spans many disciplines, we have concentrated on certain fundamental aspects of the topic. However, a large bibliography is provided to assist the reader who seeks greater detail than can be provided here. The tutorial covers, using consistent notation, the relevant fundamentals of coordinate transformations, pose representation, and image formation. Since no standards yet exist for terminology or symbols we have attempted, in Section 1.2, to establish a consistent nomenclature. Where necessary we relate this to the notation used in the source papers. The two major approaches to visual servoing, image-based and position-based control, were discussed in detail in Sections 1.5 and 1.4. The topics have been discussed formally, using the notation established earlier, and illustrated with a number of realistic examples. An important part of any visual servo system is image feature parameter extraction. Section 1.6 discussed two broad approaches to this problem with an emphasis on methods that have been found to function well in practice and that can be implemented without specialized image processing hardware. Section 1.7 presented a number of related issues that are relevant to image-based or position-based visual servo systems. These included closed-loop dynamics, relative pros and cons of the different approaches, open problems and the future.

References

[1] Y. Shirai and H. Inoue, "Guiding a robot by visual feedback in assembling tasks," Pattern Recognition, vol. 5, pp. 99-108, 1973.
[2] J. Hill and W. T. Park, "Real time control of a robot with a mobile camera," in Proc. 9th ISIR, (Washington, DC), pp. 233-246, Mar. 1979.
[3] P. Corke, "Visual control of robot manipulators - a review," in Visual Servoing (K. Hashimoto, ed.), vol. 7 of Robotics and Automated Systems, pp. 1-31, World Scientific, 1993.
[4] A. C. Sanderson and L. E. Weiss, "Image-based visual servo control using relational graph error signals," Proc. IEEE, pp. 1074-1077, 1980.
[5] J. C. Latombe, Robot Motion Planning. Boston: Kluwer Academic Publishers, 1991.
[6] J. J. Craig, Introduction to Robotics. Menlo Park: Addison Wesley, second ed., 1986.
[7] B. K. P. Horn, Robot Vision. MIT Press, Cambridge, MA, 1986.
[8] W. Jang, K. Kim, M. Chung, and Z. Bien, "Concepts of augmented image space and transformed feature space for efficient visual servoing of an 'eye-in-hand robot'," Robotica, vol. 9, pp. 203-212, 1991.
[9] J. Feddema and O. Mitchell, "Vision-guided servoing with feature-based trajectory generation," IEEE Trans. Robot. Autom., vol. 5, pp. 691-700, Oct. 1989.
[10] B. Espiau, F. Chaumette, and P. Rives, "A New Approach to Visual Servoing in Robotics," IEEE Transactions on Robotics and Automation, vol. 8, pp. 313-326, 1992.
[11] M. L. Cyros, "Datacube at the space shuttle's launch pad," Datacube World Review, vol. 2, pp. 1-3, Sept. 1988. Datacube Inc., 4 Dearborn Road, Peabody, MA.
[12] A. Castano and S. A. Hutchinson, "Visual compliance: Task-directed visual servo control," IEEE Transactions on Robotics and Automation, vol. 10, pp. 334-342, June 1994.
[13] K. Hashimoto, T. Kimoto, T. Ebine, and H. Kimura, "Manipulator control with image-based visual servo," in Proc. IEEE Int. Conf. Robotics and Automation, pp. 2267-2272, 1991.


[14] N. P. Papanikolopoulos and P. K. Khosla, "Adaptive Robot Visual Tracking: Theory and Experiments," IEEE Transactions on Automatic Control, vol. 38, no. 3, pp. 429-445, 1993.
[15] N. P. Papanikolopoulos, P. K. Khosla, and T. Kanade, "Visual Tracking of a Moving Target by a Camera Mounted on a Robot: A Combination of Vision and Control," IEEE Transactions on Robotics and Automation, vol. 9, no. 1, pp. 14-35, 1993.
[16] S. Skaar, W. Brockman, and R. Hanson, "Camera-space manipulation," Int. J. Robot. Res., vol. 6, no. 4, pp. 20-32, 1987.
[17] S. B. Skaar, W. H. Brockman, and W. S. Jang, "Three-Dimensional Camera Space Manipulation," International Journal of Robotics Research, vol. 9, no. 4, pp. 22-39, 1990.
[18] J. T. Feddema, C. S. G. Lee, and O. R. Mitchell, "Weighted selection of image features for resolved rate visual feedback control," IEEE Trans. Robot. Autom., vol. 7, pp. 31-47, Feb. 1991.
[19] A. C. Sanderson, L. E. Weiss, and C. P. Neuman, "Dynamic sensor-based control of robots with visual feedback," IEEE Trans. Robot. Autom., vol. RA-3, pp. 404-417, Oct. 1987.
[20] R. L. Andersson, A Robot Ping-Pong Player: Experiment in Real-Time Intelligent Control. MIT Press, Cambridge, MA, 1988.
[21] M. Lei and B. K. Ghosh, "Visually-Guided Robotic Motion Tracking," in Proc. Thirtieth Annual Allerton Conference on Communication, Control, and Computing, pp. 712-721, 1992.
[22] B. Yoshimi and P. K. Allen, "Active, uncalibrated visual servoing," in Proc. IEEE International Conference on Robotics and Automation, (San Diego, CA), pp. 156-161, May 1994.
[23] B. Nelson and P. K. Khosla, "Integrating Sensor Placement and Visual Tracking Strategies," in Proc. IEEE International Conference on Robotics and Automation, pp. 1351-1356, 1994.
[24] I. E. Sutherland, "Three-dimensional data input by tablet," Proc. IEEE, vol. 62, pp. 453-461, Apr. 1974.
[25] R. Tsai and R. Lenz, "A new technique for fully autonomous and efficient 3D robotics hand/eye calibration," IEEE Trans. Robot. Autom., vol. 5, pp. 345-358, June 1989.
[26] R. Tsai, "A versatile camera calibration technique for high accuracy 3-D machine vision metrology using off-the-shelf TV cameras and lenses," IEEE Trans. Robot. Autom., vol. 3, pp. 323-344, Aug. 1987.
[27] P. I. Corke, High-Performance Visual Closed-Loop Robot Control. PhD thesis, University of Melbourne, Dept. Mechanical and Manufacturing Engineering, July 1994.


[28] D. E. Whitney, "The mathematics of coordinated control of prosthetic arms and manipulators," Journal of Dynamic Systems, Measurement and Control, vol. 122, pp. 303-309, Dec. 1972.
[29] S. Chieaverini, L. Sciavicco, and B. Siciliano, "Control of robotic systems through singularities," in Proc. Int. Workshop on Nonlinear and Adaptive Control: Issues in Robotics (C. C. de Wit, ed.), Springer-Verlag, 1991.
[30] S. Wijesoma, D. Wolfe, and R. Richards, "Eye-to-hand coordination for vision-guided robot control applications," International Journal of Robotics Research, vol. 12, no. 1, pp. 65-78, 1993.
[31] N. Hollinghurst and R. Cipolla, "Uncalibrated stereo hand eye coordination," Image and Vision Computing, vol. 12, no. 3, pp. 187-192, 1994.
[32] G. D. Hager, W.-C. Chang, and A. S. Morse, "Robot hand-eye coordination based on stereo vision," IEEE Control Systems Magazine, Feb. 1995.
[33] C. Samson, M. Le Borgne, and B. Espiau, Robot Control: The Task Function Approach. Oxford, England: Clarendon Press, 1992.
[34] G. Franklin, J. Powell, and A. Emami-Naeini, Feedback Control of Dynamic Systems. Addison-Wesley, 2nd ed., 1991.
[35] G. D. Hager, "Six DOF visual control of relative position," DCS RR-1038, Yale University, New Haven, CT, June 1994.
[36] T. S. Huang and A. N. Netravali, "Motion and structure from feature correspondences: A review," Proc. IEEE, vol. 82, no. 2, pp. 252-268, 1994.
[37] W. Wilson, "Visual servo control of robots using Kalman filter estimates of robot pose relative to work-pieces," in Visual Servoing (K. Hashimoto, ed.), pp. 71-104, World Scientific, 1994.
[38] C. Fagerer, D. Dickmanns, and E. Dickmanns, "Visual grasping with long delay time of a free floating object in orbit," Autonomous Robots, vol. 1, no. 1, 1994.
[39] C. Lu, E. J. Mjolsness, and G. D. Hager, "Online computation of exterior orientation with application to hand-eye calibration," DCS RR-1046, Yale University, New Haven, CT, Aug. 1994. To appear in Mathematical and Computer Modeling.
[40] M. A. Fischler and R. C. Bolles, "Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography," Communications of the ACM, vol. 24, pp. 381-395, June 1981.
[41] R. M. Haralick, C. Lee, K. Ottenberg, and M. Nolle, "Analysis and solutions of the three point perspective pose estimation problem," in Proc. IEEE Conf. Computer Vision Pat. Rec., pp. 592-598, 1991.
[42] D. DeMenthon and L. S. Davis, "Exact and approximate solutions of the perspective-three-point problem," IEEE Trans. Pat. Anal. Machine Intell., no. 11, pp. 1100-1105, 1992.


[43] R. Horaud, B. Canio, and O. Leboullenx, "An analytic solution for the perspective 4-point problem," Computer Vis. Graphics Image Process., no. 1, pp. 33-44, 1989.
[44] M. Dhome, M. Richetin, J. Lapreste, and G. Rives, "Determination of the attitude of 3-D objects from a single perspective view," IEEE Trans. Pat. Anal. Machine Intell., no. 12, pp. 1265-1278, 1989.
[45] G. H. Rosenfield, "The problem of exterior orientation in photogrammetry," Photogrammetric Engineering, pp. 536-553, 1959.
[46] D. G. Lowe, "Fitting parametrized three-dimensional models to images," IEEE Trans. Pat. Anal. Machine Intell., no. 5, pp. 441-450, 1991.
[47] R. Goldberg, "Constrained pose refinement of parametric objects," Intl. J. Computer Vision, no. 2, pp. 181-211, 1994.
[48] R. Kumar, "Robust methods for estimating pose and a sensitivity analysis," CVGIP: Image Understanding, no. 3, pp. 313-342, 1994.
[49] S. Ganapathy, "Decomposition of transformation matrices for robot vision," Pattern Recognition Letters, pp. 401-412, 1989.
[50] M. Fischler and R. C. Bolles, "Random sample consensus: A paradigm for model fitting and automatic cartography," Commun. ACM, no. 6, pp. 381-395, 1981.
[51] Y. Liu, T. S. Huang, and O. D. Faugeras, "Determination of camera location from 2-D to 3-D line and point correspondences," IEEE Trans. Pat. Anal. Machine Intell., no. 1, pp. 28-37, 1990.
[52] A. Gelb, ed., Applied Optimal Estimation. Cambridge, MA: MIT Press, 1974.
[53] P. K. Allen, A. Timcenko, B. Yoshimi, and P. Michelman, "Automated Tracking and Grasping of a Moving Object with a Robotic Hand-Eye System," IEEE Transactions on Robotics and Automation, vol. 9, no. 2, pp. 152-165, 1993.
[54] A. Rizzi and D. Koditschek, "An active visual estimator for dexterous manipulation," in Proc. IEEE International Conference on Robotics and Automation, 1994.
[55] J. Pretlove and G. Parker, "The development of a real-time stereo-vision system to aid robot guidance in carrying out a typical manufacturing task," in Proc. 22nd ISRR, (Detroit), pp. 21.1-21.23, 1991.
[56] B. K. P. Horn, H. M. Hilden, and S. Negahdaripour, "Closed-form solution of absolute orientation using orthonormal matrices," J. Opt. Soc. Amer., vol. A-5, pp. 1127-1135, 1988.
[57] K. S. Arun, T. S. Huang, and S. D. Blostein, "Least-squares fitting of two 3-D point sets," IEEE Trans. Pat. Anal. Machine Intell., vol. 9, pp. 698-700, 1987.
[58] B. K. P. Horn, "Closed-form solution of absolute orientation using unit quaternions," J. Opt. Soc. Amer., vol. A-4, pp. 629-642, 1987.


[59] G. D. Hager, G. Grunwald, and G. Hirzinger, "Feature-based visual servoing and its application to telerobotics," DCS RR-1010, Yale University, New Haven, CT, Jan. 1994. To appear at the 1994 IROS Conference.
[60] G. Agin, "Calibration and use of a light stripe range sensor mounted on the hand of a robot," in Proc. IEEE Int. Conf. Robotics and Automation, pp. 680-685, 1985.
[61] S. Venkatesan and C. Archibald, "Realtime tracking in five degrees of freedom using two wrist-mounted laser range finders," in Proc. IEEE Int. Conf. Robotics and Automation, pp. 2004-2010, 1990.
[62] J. Dietrich, G. Hirzinger, B. Gombert, and J. Schott, "On a unified concept for a new generation of light-weight robots," in Experimental Robotics 1 (V. Hayward and O. Khatib, eds.), vol. 139 of Lecture Notes in Control and Information Sciences, pp. 287-295, Springer-Verlag, 1989.
[63] J. Aloimonos and D. P. Tsakiris, "On the mathematics of visual tracking," Image and Vision Computing, vol. 9, pp. 235-251, Aug. 1991.
[64] R. M. Haralick and L. G. Shapiro, Computer and Robot Vision. Addison Wesley, 1993.
[65] F. W. Warner, Foundations of Differentiable Manifolds and Lie Groups. New York: Springer-Verlag, 1983.
[66] G. D. Hager, "Calibration-free visual control using projective invariance," DCS RR-1046, Yale University, New Haven, CT, Dec. 1994. To appear in Proc. ICCV '95.
[67] D. Kim, A. Rizzi, G. Hager, and D. Koditschek, "A 'robust' convergent visual servoing system." Submitted to Intelligent Robots and Systems 1995, 1994.
[68] W. Jang and Z. Bien, "Feature-based visual servoing of an eye-in-hand robot with improved tracking performance," in Proc. IEEE Int. Conf. Robotics and Automation, pp. 2254-2260, 1991.
[69] R. L. Anderson, "Dynamic sensing in a ping-pong playing robot," IEEE Transactions on Robotics and Automation, vol. 5, no. 6, pp. 723-739, 1989.
[70] G. D. Hager, "The 'X-Vision' system: A general purpose substrate for real-time vision-based robotics." Submitted to the 1995 Workshop on Vision for Robotics, Feb. 1995.
[71] E. Dickmanns and V. Graefe, "Dynamic monocular machine vision," Machine Vision and Applications, vol. 1, pp. 223-240, 1988.
[72] O. Faugeras, Three-Dimensional Computer Vision. Cambridge, MA: MIT Press, 1993.
[73] J. Foley, A. van Dam, S. Feiner, and J. Hughes, Computer Graphics. Addison Wesley, 1993.
[74] D. Ballard and C. Brown, Computer Vision. Englewood Cliffs, NJ: Prentice-Hall, 1982.


[75] J. Canny, "A computational approach to edge detection," IEEE Trans. Pattern Anal. Mach. Intell., pp. 679-698, Nov. 1986.
[76] B. D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," in Proc. International Joint Conference on Artificial Intelligence, pp. 674-679, 1981.
[77] P. Anandan, "A computational framework and an algorithm for the measurement of structure from motion," International Journal of Computer Vision, vol. 2, pp. 283-310, 1989.
[78] J. Shi and C. Tomasi, "Good features to track," in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 593-600, IEEE Computer Society Press, 1994.
[79] J. Huang and G. D. Hager, "Tracking tools for vision-based navigation," DCS RR-1046, Yale University, New Haven, CT, Dec. 1994. Submitted to IROS '95.
[80] M. Kass, A. Witkin, and D. Terzopoulos, "Snakes: active contour models," International Journal of Computer Vision, vol. 1, no. 1, pp. 321-331, 1987.
[81] B. Bishop, S. A. Hutchinson, and M. W. Spong, "Camera modelling for visual servo control applications," Mathematical and Computer Modelling - Special issue on Modelling Issues in Visual Sensing.
[82] P. Corke and M. Good, "Dynamic effects in visual closed-loop systems," submitted to IEEE Transactions on Robotics and Automation, 1995.
[83] S. B. Skaar, Y. Yalda-Mooshabad, and W. H. Brockman, "Nonholonomic camera-space manipulation," IEEE Transactions on Robotics and Automation, vol. 8, pp. 464-479, Aug. 1992.
[84] G. D. Hager, S. Puri, and K. Toyama, "A framework for real-time vision-based tracking using off-the-shelf hardware," DCS RR-988, Yale University, New Haven, CT, Sept. 1993.
[85] A. C. Sanderson and L. E. Weiss, "Adaptive visual servo control of robots," in Robot Vision (A. Pugh, ed.), pp. 107-116, IFS, 1983.
[86] N. Mahadevamurty, T.-C. Tsao, and S. Hutchinson, "Multi-rate analysis and design of visual feedback digital servo control systems," ASME Journal of Dynamic Systems, Measurement and Control, pp. 45-55, Mar. 1994.
[87] R. Sharma and S. A. Hutchinson, "On the observability of robot motion under active camera control," in Proc. IEEE International Conference on Robotics and Automation, pp. 162-167, May 1994.
[88] A. Fox and S. Hutchinson, "Exploiting visual constraints in the synthesis of uncertainty-tolerant motion plans," IEEE Transactions on Robotics and Automation, vol. 11, pp. 56-71, 1995.
[89] G. Hager and S. Hutchinson, eds., Proc. IEEE Workshop on Visual Servoing: Achievements, Applications and Open Problems. Inst. of Electrical and Electronics Eng., Inc., 1994.

Chapter 2

Systems Issues in Visual Servo Control

Peter Corke

CSIRO Division of Manufacturing Technology P.O. Box 883, Kenmore. Australia, 4069.

2.1 A brief history of visual servoing

The history of visual servoing dates back nearly 30 years to the early work of Wichman[112] ("optical feedback") in 1967 and Shirai and Inoue[92] ("visual feedback") in 1973. The latter described how a visual feedback loop could be used to correct the position of a robot to increase the accuracy of a task which involved placing a square prism in a box. Edge extraction and line fitting were used to determine the position and orientation of the box, the camera was fixed, and a servo cycle time of 10 s was reported. Due to technological limitations of the time these, and some other significant early work, fail to meet the strict definition of visual servoing that is common today: they would now be classed as look-then-move robot control. However, progress has been rapid, and by the end of the 1970s systems had been demonstrated which were capable of 10 Hz servoing and 3D position control for tracking, seam welding and grasping moving targets.

Considerable work on the use of visual servoing was conducted at SRI International during the late 1970s. Early work[85, 86] describes the use of visual feedback for bolt insertion and picking moving parts from a conveyor. Hill and Park[51] describe visual servoing of a Unimate robot in 1979. Binary image processing is used for speed and reliability, providing planar position as well as simple depth estimation based on the apparent distance between known features. Experiments were also conducted using a projected light stripe to provide more robust depth determination as well as surface orientation. These experiments demonstrated planar and 3D visually-guided motion, as well as tracking and grasping of moving parts. They also investigated some of the dynamic issues involved in closed-loop visual control. Similar work on a Unimate-based visual servo system is discussed later by Makhlin[68].

Prajoux[80] demonstrated visual servoing of a 2-DOF mechanism for following a swinging hook. The system used a predictor to estimate the future position of the hook, and achieved settling times of the order of 1 s. Coulon and Nougaret[27] address similar issues and also provide a detailed imaging model for the vidicon sensor's memory effect. They describe a digital video processing system for determining the location of one target within a processing window, and use this information for closed-loop position control of an XY mechanism to achieve a settling time of around 0.2 s to a step demand.

Simple hand-held light stripers of the type proposed by Agin[2] have been used in planar applications such as connector acquisition[71], weld seam tracking[19], and sealant application[90]. The last lays a bead at 400 mm/s with respect to a moving car body, and shows a closed-loop bandwidth of 4.5 Hz. More recently Venkatesan and Archibald[101] describe the use of two hand-held laser scanners for real-time 5-DOF robot control.

Gilbert[44] describes an automatic rocket-tracking camera which keeps the target centered in the camera's image plane by means of pan/tilt controls. The system uses video-rate image processing hardware to identify the target and update the camera orientation at 60 Hz. Dzialo and Schalkoff[34] discuss the effects of perspective on the control of a pan-tilt camera head for tracking.

Weiss[109] proposed the use of adaptive control for the non-linear time-varying relationship between robot pose and image features in image-based servoing. Detailed simulations of image-based visual servoing are described for a variety of manipulator structures of up to 3-DOF.

Weber and Hollis[108] developed a high-bandwidth planar-position controlled micromanipulator. It is required to counter room and robot motor vibration effects with respect to the workpiece in a precision manufacturing task. Correlation is used to track workpiece texture. To achieve a high sample rate of 300 Hz, yet maintain resolution, two orthogonal linear CCDs are used to observe projections of the image. Since the sample rate is high, the image shift between samples is small, which reduces the size of the correlation window needed. Image projections are also used by Kabuka[60]. Fourier phase differences in the vertical and horizontal binary image projections are used for centering a target in the image plane and determining its rotation. This is applied to the control of a two-axis camera platform[60] which takes 30 s to settle on a target. An extension to this approach[61] uses adaptive control techniques to minimize performance indices on grey-scale projections. The approach is presented generally but with simulations for planar positioning only.

An application to road vehicle guidance is described by Dickmanns[30].


feature tracking and gaze controlled cameras guide a 5 tonne experimental road vehicle at speeds of up to 96 km=h along a test track. Later work by Dickmanns[32] investigates the application of dynamic vision to aircraft landing. Control of underwater robots using visual reference points has been proposed by Negahdaripour and Fox[72]. Visually guided machines have been built to emulate human skills at ping-pong[10, 35], juggling[83], inverted pendulum balancing [30, 7], catching[89, 17], and controlling a labyrinth game[7]. The latter is a wooden board mounted on gimbals on which a ball bearing rolls, the aim being to move the ball through a maze and not fall into a hole. The ping-pong playing robot[10] does not use visual servoing, rather a model of the ball's trajectory is built and input to a dynamic path planning algorithm which attempts to strike the ball. Visual servoing has also been proposed for catching ying objects on Earth or in space. Bukowski et al.[16] report the use of a Puma 560 to catch a ball with an ende ector mounted net. The robot is guided by a xed-camera stereo-vision system and a 386 PC. Skofteland et al.[94] discuss capture of a free- ying polyhedron in space with a vision guided robot. Skaar et al.[93] use as an example a 1-DOF robot to catch a ball. Lin et al.[67] propose a two-stage algorithm for catching moving targets; coarse positioning to approach the target in near-minimum time and ` ne tuning' to match robot acceleration and velocity with the target. There have been several reports of the use of visual servoing for grasping moving targets. The earliest work appears to have been at SRI in 1978[86]. Recently Zhang et al.[118] presented a tracking controller for visually servoing a robot to pick items from a fast moving conveyor belt (300 mm=s). The camera is hand-held and the visual update interval used is 140 ms. Allen et al.[6] use a 60 Hz xed-camera stereo vision system to track a target moving at 250 mm=s. Later work[5] extends this to grasping a toy train moving on a circular track. Houshangi[53] uses a xed overhead camera and a visual sample interval of 196 ms to enable a Puma 600 robot to grasp a moving target. Fruit picking is a non-manufacturing application of visually guided grasping where the target may be moving. Harrell[46] describes a hydraulic fruit-picking robot which uses visual servoing to control 2-DOF as the robot reaches toward the fruit prior to picking. The visual information is augmented by ultrasonic sensors to determine distance during the nal phase of fruit grasping. The visual servo gains are continuously adjusted to account for changing camera target distance. This last point is signi cant but mentioned by few authors[22, 34]. Part mating has also been investigated using visual servoing. Geschke[42] described a bolt-insertion task using stereo vision and a Stanford arm. The system features automated threshold setting, software image feature searching at 10 Hz, and setting of position loop gains according to the visual sample rate. Stereo vision is achieved with a single camera and a novel mirror arrangement. Ahluwalia and { 49 {


Fogwell[3] describe a system for mating two parts, each held by a robot and observed by a xed camera. Only 2-DOF for the mating are controlled and a Jacobian approximation is used to relate image-plane corrections to robot joint-space actions. On a larger scale, visually servoed robots have been proposed for aircraft refuelling[66] and demonstrated for mating an umbilical connector to the US Space Shuttle from its service gantry[29]. Westmore and Wilson[110] demonstrate 3-DOF planar tracking and achieve a settling time of around 0.5 s to a step input. This is extended[106] to full 3D target pose determination using extended Kalman ltering and then to 3D closed-loop robot pose control[114]. Papanikolopoulos et al.[75] demonstrate tracking of a target undergoing planar motion with the CMU DD-II robot system. Later work[76] demonstrates 3D tracking of static and moving targets, and adaptive control is used to estimate the target distance. The use of visual servoing in a telerobotic environment has been discussed by Yuan et al.[117], Papanikolopoulos et al.[76] and Tendick et al.[98]. Visual servoing can allow the task to be speci ed by the human operator in terms of selected visual features and their desired con guration. Approaches based on neural networks[69, 47, 65] and general learning algorithms[70] have also been used to achieve robot hand-eye coordination. A xed camera observes objects and the robot within the workspace and can learn the relationship between robot joint angles and the 3D end-e ector pose. Such systems require training, but the need for complex analytic relationships between image features and joint angles is eliminated.

2.1.1 Applications of position-based visual servoing Position-based visual-servoing requires determination of object pose. This can be achieved by analyzing projected image features from one or more cameras, or by direct 3D sensing. Determining object pose from measured known features is a problem in closerange photogrammetry[115], a discipline in its own right and with considerable associated literature. 6-DOF visual servoing based on an analytic solution has been demonstrated by Ganapathy[41]. Yuan[116] describes a general iterative solution independent of the number or distribution of feature points. For tracking moving targets the previous solution can be used as the initial estimate for iteration. Wang and Wilson[106] use an extended Kalman lter to update the pose estimate given measured image plane feature locations. The lter convergence is analogous to the iterative solution. The commonly cited drawbacks of the feature-based, or photogrammetric, approach are the complex computation, and the necessity for camera calibration and a { 50 {


model of the target. None of these objections are overwhelming and, as mentioned above, a number of systems based on these approaches have been demonstrated. Incorporating information from additional views allows the distance of image features to be determined without apriori knowledge of object geometry. Commonly, stereo vision is used and this requires the location of feature points in one view to be matched with the location of the same feature points in the other view. Matching may be done on a few feature points such as region centroids or corner features, or on ne feature detail such as surface texture. In the absence of signi cant surface texture a random texture pattern could be projected onto the scene. This matching, or correspondence, problem is not trivial and is subject to error. Another diculty is the missing parts problem where a feature point is visible in only one of the views and therefore its depth cannot be determined. Implementations of 60 Hz stereo-vision systems have been described by Andersson[10], Rizzi et al.[83], Allen et al.[6] and Bukowski et al.[16]. The rst two operate in a simple contrived environment with a single white target against a black background. The last two use optical ow or image di erencing to eliminate static background detail. All use xed rather than end-e ector-mounted cameras. Closely related to stereo vision is monocular or motion stereo[73] also known as depth from motion. Sequential monocular views, taken from di erent viewpoints, are interpreted to derive depth information. Such a sequence may be obtained from a robot hand-mounted camera during robot motion. It must be assumed that targets in the scene do not move signi cantly between the views. The AIS visual-servoing scheme of Jang et al.[59] uses motion stereo to determine depth of feature points. Self motion, or egomotion, produces rich depth cues from the apparent motion of features, and is important in biological vision[102]. Research into insect vision[95] indicates that insects use self-motion to infer distances to targets for navigation and obstacle avoidance. Compared to mammals, insects have e ective but simple visual systems and may o er a practical alternative model upon which to base future robotvision systems[96]. Dickmanns[31] proposes an integrated spatio-temporal approach to analyzing scenes with relative motion so as to determine depth and structure. Based on tracking features between sequential frames which he terms 4D vision. All the above approaches are based on emulating animal visual systems, however `non-anthropomorphic' approaches to sensing may o er some advantages. Active range sensors project a controlled energy beam, generally ultrasonic or optical, and detect the re ected energy. Commonly a pattern of light is projected on the scene which a vision system interprets to determine depth and orientation of the surface. Such sensors usually determine depth along a single stripe of light, multiple stripes or a dense grid of points. If the sensor is small and mounted on the robot[2, 33] the depth and orientation information can be used for servoing[101]. Such sensors, driven by developments in solid-state lasers, are becoming increasingly availabl. Besl[13] provides a comprehensive, but somewhat dated, survey which includes the operation { 51 {


and capability of many commercial active range sensors.

2.1.2 Applications of image-based servoing The earliest work in this area was by Weiss who proposed the \image-based visualservo" structure in which robot joint angles are controlled directly by measured image features. The non-linearities include the manipulator kinematics and dynamics as well as the perspective imaging model. Adaptive control is proposed since the gain depends on the relative pose which is not measured. The changing relationship between robot pose and image feature change is learned during the motion. Weiss uses independent single-input single-output (SISO) model-reference adaptive control (MRAC) loops for each DOF, citing the advantages of modularity and reduced complexity compared to multi-input multi-output (MIMO) controllers. The proposed SISO MRAC requires one feature to control each joint and no coupling between features, and a scheme is introduced to select features so as to minimize coupling. In practice this last constraint is dicult to meet since camera rotation inevitably results in image feature translation. Weiss[109] presents detailed simulations of various forms of image-based visual servoing with a variety of manipulator structures of up to 3-DOF. Sample intervals of 33 ms and 3 ms are investigated, as is control with measurement delay. With nonlinear kinematics (revolute robot structure) the SISO MRAC scheme has diculties. Solutions proposed, but not investigated, include MIMO control and a higher sample rate, or the dynamic-look-and-move structure. Weiss found that even for a 2-DOF revolute mechanism a sample interval less than 33 ms was needed to achieve satisfactory plant identi cation. For manipulator control Paul[77] suggests that the sample rate should be at least 15 times the link structural frequency. Since the highest sample frequency achievable with standard cameras and image processing hardware is 60 Hz, the IBVS structure is not currently practical for visual servoing. The so called dynamic look and move structure is more suitable for control of 6-DOF manipulators, by combining high-bandwidth joint level control in conjunction with a lower rate visual position control loop. The need for such a control structure is hinted at by Weiss and was subsequently used by Feddema[37] and others[25, 48, 58]. Feddema extends the work of Weiss in many important ways, particularly by experimentation[37, 38, 36]. Due to the low speed feature extraction achievable (every 70 ms) an explicit trajectory generator operating in feature space is used rather than the pure control loop approach of Weiss. Feature velocities from the trajectory generator are resolved to manipulator con guration space for individual closed-loop joint PID control. Feddema[38, 37] describes a 4-DOF servoing experiment where the target was a { 52 {


gasket containing a number of circular holes. Binary image processing and Fourier descriptors of perimeter chain codes were used to describe each hole feature. From the two most unique circles four features are derived; the coordinates of the midpoint between the two circles, the angle of the midpoint line with respect to image coordinates, and the distance between circle centers. The experimental system could track the gasket moving on a turntable at up to 1 rad=s. The actual position lags the desired position, and `some oscillation' is reported due to time delays in the closedloop system. A similar experimental setup[37] used the centroid coordinates of three gasket holes as features. Rives et al.[82, 18] describe an approach that computes the camera velocity screw as a function of feature values based on the task function approach. The task is de ned as the problem of minimizing the \task function" which is written in terms of image features which are in turn a function of robot pose. They refer to the image Jacobian as the \interaction matrix" denoted by LT . As previously it is necessary to know the model of the interaction matrix for the visual features selected and the cases of point clusters, lines and circles are derived. Experimental results for robot positioning using four point features are presented[18]. Frequently the feature Jacobian can be formulated in terms of features plus depth. Hashimoto et al.[48] estimate depth explicitly based on analysis of features. Papanikolopoulos[76] estimates depth of each feature point in a cluster using an adaptive control scheme. Rives et al.[82, 18] set the desired distance rather than update or estimate it continuously. Feddema describes an algorithm[36] to select which three of the seven measurable features gives best control. Features are selected so as to achieve a balance between controllability and sensitivity with respect to changing features. The generalized inverse of the feature Jacobian[48, 58, 82] allows more than 3 features to be used, and has been shown to increase robustness, particularly with respect to singularities[58]. Jang et al.[59] introduce the concepts of augmented image space (AIS) and transformed feature space (TFS). AIS is a 3D space whose coordinates are image plane coordinates plus distance from camera, determined from motion stereo. In a similar way to Cartesian space, trajectories may be speci ed in AIS. A Jacobian may be formulated to map di erential changes from AIS to Cartesian space and then to manipulator joint space. The TFS approach appears to be very similar to the image-based servoing approach of Weiss and Feddema. Bowman and Forrest[15] describe how small changes in image plane coordinates can be used to determine di erential change in Cartesian camera position and this is used for visual servoing a small robot. No feature Jacobian is required, but the camera calibration matrix is needed. Most of the above approaches require analytic formulation of the feature Jacobian given knowledge of the target model. This process could be automated, but there is { 53 {


attraction in the idea of a system that can `learn' the non-linear relationship automatically as originally envisaged by Sanderson and Weiss. Some recent results[57, 52] demonstrate the feasibility of online image Jacobian estimation. Skaar et al.[93] describe the example of a 1-DOF robot catching a ball. By observing visual cues such as the ball, the arm's pivot point, and another point on the arm, the interception task can be specified even if the relationship between camera and arm is not known a priori. This is then extended to a multi-DOF robot where cues on each link and the payload are observed. After a number of trajectories the system `learns' the relationship between image-plane motion and joint-space motion, effectively estimating a feature Jacobian. Tendick et al.[98] describe the use of a vision system to close the position loop on a remote slave arm with no joint position sensors. A fixed camera observes markers on the arm's links and a numerical optimization is performed to determine the robot's pose. Miller[70] presents a generalized learning algorithm based on the CMAC structure proposed by Albus[4] for complex or multi-sensor systems. The CMAC structure is table driven, indexed by sensor value to determine the system command. The modified CMAC is indexed by sensor value as well as the desired goal state. Experimental results are given for control of a 3-DOF robot with a hand-held camera. More than 100 trials were required for training, and good positioning and tracking capability were demonstrated. Artificial neural techniques can also be used to learn the nonlinear relationships between features and manipulator joint angles as discussed by Kuperstein[65], Hashimoto[47] and Mel[69].
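Online image Jacobian estimation of the kind cited above[57, 52] is frequently realized as a rank-one (Broyden-style) update from observed joint and feature displacements. The following Python sketch illustrates the idea generically; it is written for this exposition only, is not code from the cited work, and the function names, gains and initial guess are illustrative assumptions.

import numpy as np

def broyden_update(J, dq, df, eps=1e-9):
    """Rank-one (Broyden) update of an estimated image Jacobian.

    J  -- current estimate, shape (n_features, n_joints)
    dq -- most recent joint-space displacement, shape (n_joints,)
    df -- observed change in the image features, shape (n_features,)
    """
    denom = float(dq @ dq)
    if denom < eps:                  # ignore negligibly small motions
        return J
    residual = df - J @ dq           # prediction error of the current model
    return J + np.outer(residual, dq) / denom

def feature_servo_step(J, f, f_desired, gain=0.5):
    """One resolved-rate style step: dq = gain * pinv(J) * feature error."""
    return gain * np.linalg.pinv(J) @ (f_desired - f)

# toy loop skeleton (measurements would come from the vision system)
J_hat = np.eye(2)                    # crude initial guess, 2 features x 2 joints
f_prev, q_prev = np.zeros(2), np.zeros(2)
for f_meas, q_meas in [(np.array([1.0, 0.5]), np.array([0.1, 0.0]))]:
    J_hat = broyden_update(J_hat, q_meas - q_prev, f_meas - f_prev)
    f_prev, q_prev = f_meas, q_meas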

2.1.3 Visual task speci cation Many of the visual servo systems reported are capable of performing only a single task. There has been little work on more general approaches to task description in terms of visual features. Jang[58] and Skaar et al.[93] have shown how tasks such as edge following or catching can be expressed in terms of image plane features and their desired trajectories which are described algorithmically. Commercial robot/vision systems have inbuilt language support for both visual feature extraction and robot motion; however, these are limited to look-then-move operation since motion is based on nite duration trajectories to known or computed destination points. Geschke[43] describes the Robot Servo System (RSS) software which facilitates the development of applications based on sensory input. The software controls robot position, orientation, force and torque independently, and speci cations for control of each may be given. The programming facilities are demonstrated with applications for vision-based peg in hole insertion and crank turning. Haynes et al.[16] describe a set of library routines to facilitate hand-eye programming for a PC connected to a Unimate robot controller and stereo-camera system. { 54 {


The use of a table-driven state machine is proposed by Adsit[1] to control a visually-servoed fruit-picking robot. The state machine was seen to be advantageous in coping with the variety of error conditions possible in such an unstructured work environment. Another approach to error recovery is to include a human operator in the system. The operator could select image features and indicate their desired con guration allowing the task itself to be performed under closed-loop visual control. This would have particular advantage in situations, such as space or undersea, where there is considerable communications time delay. Teleoperation applications have been described by Yuan et al.[117], Papanikolopoulos et al.[76] and Tendick et al.[98].

2.2 Architectural issues Progress in visual servoing is related to technological advances in diverse areas including sensors, image processing and robot control. This section summarizes these areas from a visual servoing perspective and gives relevant implementation details from the literature.

2.2.1 Cameras The earliest reports used thermionic tube, or vidicon, image sensors. These devices had a number of undesirable characteristics such as physical weight and bulk, fragility, poor image stability and memory e ect[27]. Since the mid 1980s most researchers have used some form of solid state camera based on an NMOS, CCD or CID sensor. The only reference to color vision for visual servoing is the fruit picking robot[46] where color is used to di erentiate fruit from the leaves. Given real-time constraints the advantages of color vision for object recognition may be o set by the increased cost and high processing requirements of up to three times the monochrome data rate. Almost all reports are based on the use of area sensors, but line-scan sensors have been used by Webber et al.[108] for very high-rate visual servoing. Commonly available cameras generally conform to either of the two main video standards; RS170 (USA, Canada, Japan) or CCIR (Europe, Australia) with frame rates of 30 Hz and 25 Hz respectively. These video formats are interlaced, which means that each frame is transmitted as two sequential half-vertical-resolution elds, displaced vertically by one line. This can complicate the interpretation of full video frames, but many visual servo systems treat the elds as high-rate low-resolution `frames'[10, 25]. When tracking fast moving objects motion blur can be a signi cant problem. The object will appear to elongated in the direction of travel, and the centroid of the { 55 {


[Figure 2.1: two plots against time (s). Top: measured target area (pixels); bottom: camera pan angle, theta (radl).]
Figure 2.1. Measured target area due to motion blur (top plot) for an exposure interval of 20 ms. The lower plot shows the camera pan angle in response to a step change in target position.

blurred object will lag the real object's centroid. Figure 2.1 shows very clearly that with an exposure interval of 20 ms the apparent area of the LED target is increased during the high-velocity phase of the camera motion. The apparent area of the target has increased by a factor of 2.6. The pixels in the blurred object image will also be less bright (the same reflected energy is spread over more pixels), so a simple binary vision system with a fixed threshold may `lose sight' of a rapidly moving target[10, 25]. Most modern CCD cameras now have an electronic shuttering facility which can reduce the blur effect by using a short exposure time. However scene illumination must be increased as exposure time is reduced in order to maintain image brightness. Due to field interlacing there are also some subtle, but important, differences in the way that various cameras implement shuttering, viz. field-shuttering and frame-shuttering. Many other camera characteristics are relevant in visual servoing, including sensitivity, dynamic range, noise, pixel shape and image sharpness. Cameras can be either fixed or mounted on the robot's end-effector. The benefits of an end-effector-mounted camera include the ability to avoid occlusion, resolve ambiguity and increase accuracy by directing its attention. All reported stereo-based systems use fixed cameras, although there is no reason a stereo camera cannot be mounted on the end-effector, apart from practical considerations such as payload limitation or lack of camera system robustness. Zhang et al.[118] observe that, for most useful tasks, an overhead camera will be obscured by the gripper, and a gripper-mounted camera will be out of focus during the final phase of part acquisition. This may imply the necessity for switching between


several views of the scene, or using hybrid control strategies which utilize vision and conventional joint-based control for different phases of the task. Nelson et al.[63] and Pretlove[81] describe a camera mounted on a `looking' robot which moves in order to best observe another robot which is performing the task using visual servoing.

2.2.2 Image feature extraction Image interpretation, or scene understanding, is the problem of describing physical objects in a scene given an image, or images, of that scene. Two broad approaches to image feature extraction have been used for visual servoing applications: whole-scene segmentation, and feature tracking.

Whole scene segmentation Segmentation is the process of dividing an image into meaningful segments, generally homogeneous with respect to some characteristic. In a simple or contrived scene the segments may correspond directly to objects in the scene, but for a complex scene this is rarely the case. The problem of robustly segmenting a scene is of key importance in computer vision, and much has been written about the topic and many methods have been described in the literature. Haralick[45] provides a survey of techniques applicable to static images but unfortunately many of the algorithms are iterative and timeconsuming and thus not suitable for real-time applications. The principle processing steps involved are: 1. Classi cation, where pixels are classi ed into spatial sets according to `low-level' pixel characteristics. 2. Representation. The spatial sets are represented in a form suitable for further computation, generally as either connected regions or boundaries. 3. Description. The sets are described in terms of scalar or vector valued features. Pixel values may be scalar or vector, and can represent intensity, color, range, velocity or any other measurable scene property. The classi cation may take into account the neighbouring pixels, global pixel statistics, and even temporal change in pixel value. General scenes have too much `clutter' and are dicult to interpret at video rates unless pixels of `interest' can be distinguished. Harrell[46] describes the use of color classi cation to segment citrus fruit from the surrounding leaves in a fruit picking visual servo application. Haynes[16] proposes sequential frame di erencing or background subtraction to eliminate static background detail. Allen[6, 5] uses { 57 {


optical ow calculation to classify pixels as moving or not moving with respect to the background, which is assumed stationary. More commonly in laboratory situations, where the lighting and environment can be contrived (for instance using dark backgrounds and white objects) to yield high contrast, a simple intensity threshold can be used to distinguish foreground objects from the background. Many reported real-time vision systems for juggling[83], pingpong[10, 35] or visual servoing [51, 38, 25] use this simple, though non-robust, approach. Selection of an appropriate threshold is however a signi cant issue, and many automated approaches to threshold selection have been described[111, 88, 64]. Once the pixels of interest have been identi ed they must be represented in some form that allows features such as position and shape to be determined. Two basic representations of image segments are natural: two-dimensional regions, and boundaries. The rst involves grouping contiguous pixels with similar characteristics. Edges represent discontinuities in pixel characteristics that often correspond to object boundaries. These two representations are `duals' and one may be converted to the other, although the descriptions used are quite di erent. Edges may be represented by tted curves, chain code or crack code. Crack code represents the edge as a series of horizontal and vertical line segments following the `cracks' between pixels around the boundary of the pixel set. Chain code represents the edge by direction vectors linking the centers of the edge pixels. Feddema[37] describes the use of chain code for a visual servoing application. Crack codes are represented by 2-bit numbers giving the crack direction as 90i, while chain code is represented by 3-bit numbers giving the next boundary point as 45i. Rosenfeld and Kak[87] describe a single-pass algorithm for extracting crack-codes from run-length encoded image data. The dual procedure to boundary tracing is connected component analysis (also connectivity analysis or region growing), which determines contiguous regions of pixels. Pixels may be 4 way, 6 way or 8 way connected with their neighbours[87, 105]. This analysis involves one pass over the image data to assign region labels to all pixels. During this process it may be found that two regions have merged, so a table records their equivalence[28, 12] or a second pass over the data may be performed[87]. While edge representations generally have fewer points than contained within the component, computationally it is advantageous to work with regions. Boundary tracking cannot commence until the frame is loaded, requires random-access to the image memory, and on average 4 memory accesses to determine the location of the next edge pixel. Additional overhead is involved in scanning the image for boundaries, and ensuring that the same boundary is traced only once. Next each segmented region must be described, the process of feature extraction. The regions, in either edge or connected component representation, can be analyzed to determine area, perimeter, extent and `shape'. The principle role of feature extraction { 58 {


is to reduce the data rate to something manageable by a conventional computer, that is, extracting the `essence' of the scene. A feature is defined generally as any measurable relationship in an image; examples include moments, relationships between regions or vertices, polygon face areas, or local intensity patterns. A particularly useful class of image features are moments. Moments are easy to compute at high speed using simple hardware, and can be used to find the location of an object (centroid), and ratios of moments may be used to form invariants for recognition of objects irrespective of position and orientation, as demonstrated by Hu[54] for planar objects. The (p + q)th order moment for a digitized image is

m_{pq} = \sum_{(x,y) \in R} x^p y^q I(x, y)   (2.1)

Moments can also be determined from the vertices of a polygon or the perimeter points of a boundary representation[113]. For n boundary points labelled 1 \ldots n, where point P_0 \equiv P_n,

m_{pq} = \frac{1}{p+q+2} \sum_{\ell=1}^{n} A_\ell \sum_{i=0}^{p} \sum_{j=0}^{q} \frac{(-1)^{i+j}}{i+j+1} \binom{p}{i} \binom{q}{j} x_\ell^{p-i} y_\ell^{q-j} \Delta x_\ell^{i} \Delta y_\ell^{j}   (2.2)

where A_\ell = x_\ell \Delta y_\ell - y_\ell \Delta x_\ell, \Delta x_\ell = x_\ell - x_{\ell-1} and \Delta y_\ell = y_\ell - y_{\ell-1}. Moments can be given a physical interpretation by regarding the image function as a mass distribution. Thus m_{00} is the total mass of the region and the centroid of the region is given by

x_c = \frac{m_{10}}{m_{00}}, \qquad y_c = \frac{m_{01}}{m_{00}}   (2.3)

A very important operation in most visual servoing systems is determining the coordinate of an image feature point, frequently the centroid of a region. The centroid can be determined to sub-pixel accuracy, even in a binary image. The calculation of centroid can be achieved by software or by specialized video-rate moment generation hardware[9, 104, 39, 49, 103]. Andersson's system[9] computed grey-scale second-order moments but was incapable of connected-region analysis, making it unsuitable for scenes with more than one object. Hatamian's system[49] computed grey-scale third-order moments using cascaded single-pole digital filters but was also incapable of connected-region analysis. Off-the-shelf hardware[103] is available that combines single-pass connectivity with computation of moments up to second order, perimeter and bounding box for each connected region in a binary scene. The centroid of a region can also be determined using multi-spectral spatial, or pyramid, decomposition of the image[8]. This reduces the complexity of the search problem, allowing the computer to localize the region in a coarse image and then refine the estimate by limited searching of progressively higher resolution images. The central moments \mu_{pq} are computed about the centroid

\mu_{pq} = \sum_{(x,y) \in R} (x - x_c)^p (y - y_c)^q I(x, y)   (2.4)

and are invariant to translation. They may be computed from the moments m_{pq} by

\mu_{10} = 0   (2.5)
\mu_{01} = 0   (2.6)
\mu_{20} = m_{20} - \frac{m_{10}^2}{m_{00}}   (2.7)
\mu_{02} = m_{02} - \frac{m_{01}^2}{m_{00}}   (2.8)
\mu_{11} = m_{11} - \frac{m_{10} m_{01}}{m_{00}}   (2.9)

A commonly used, but simple, shape metric is circularity, defined as

\rho = \frac{4 \pi m_{00}}{p^2}   (2.10)

where p is the region's perimeter. Circularity has a maximum value of \rho = 1 for a circle, and a square can be shown to have \rho = \pi/4. The second moments of area \mu_{20}, \mu_{02} and \mu_{11} may be considered the moments of inertia about the centroid

I = \begin{bmatrix} \mu_{20} & \mu_{11} \\ \mu_{11} & \mu_{02} \end{bmatrix}   (2.11)

The eigenvalues are the principal moments of the region, and the eigenvectors of this matrix are the principal axes of the region, the directions about which the region has maximum and minimum moments of inertia. From the eigenvector corresponding to the maximum eigenvalue we can determine the orientation of the principal axis as

\tan \theta = \frac{\mu_{02} - \mu_{20} + \sqrt{(\mu_{20} - \mu_{02})^2 + 4\mu_{11}^2}}{2 \mu_{11}}   (2.12)

or more simply

\tan 2\theta = \frac{2 \mu_{11}}{\mu_{20} - \mu_{02}}   (2.13)

Many machine vision systems compute these so-called `equivalent ellipse' parameters. These are the major and minor radii of an ellipse with the same area moments as the region. The principal moments are given by the eigenvalues of (2.11)

\lambda_{1,2} = \frac{\mu_{20} + \mu_{02} \pm \sqrt{(\mu_{20} - \mu_{02})^2 + 4\mu_{11}^2}}{2}   (2.14)

and the area moments of an ellipse about the major and minor axes are given respectively by

I_{maj} = \frac{A a^2}{4}, \qquad I_{min} = \frac{A b^2}{4}   (2.15)

where A is the area of the region, and a and b are the major and minor radii. This can be rewritten in the form

a = 2 \sqrt{\frac{\lambda_1}{m_{00}}}, \qquad b = 2 \sqrt{\frac{\lambda_2}{m_{00}}}   (2.16)

The normalized moments

\eta_{pq} = \frac{\mu_{pq}}{\mu_{00}^{\gamma}}, \qquad \gamma = \frac{p+q}{2} + 1 \quad \text{for } p + q = 2, 3, \ldots   (2.17)

are invariant to scale. Third-order moments allow for the creation of quantities that are invariant with respect to translation, scaling and orientation within a plane[54].
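As a concrete illustration of equations (2.1)-(2.16) for the binary-image case (I(x, y) = 1 inside the region), the following Python sketch computes area, centroid, principal-axis orientation, equivalent-ellipse radii and, given a perimeter estimate, circularity. It is a minimal example written for this exposition, not code from any of the systems cited; NumPy is assumed to be available and the function and variable names are illustrative.

import numpy as np

def region_features(binary, perimeter=None):
    """Moment-based features of a single binary region (cf. eqs 2.1-2.16).

    binary    -- 2D array, nonzero pixels belong to the region
    perimeter -- region perimeter in pixels (needed only for circularity)
    """
    ys, xs = np.nonzero(binary)          # pixel coordinates of the region
    m00 = xs.size                        # zeroth moment = area for a binary image
    m10, m01 = xs.sum(), ys.sum()
    xc, yc = m10 / m00, m01 / m00        # centroid, eq (2.3)

    dx, dy = xs - xc, ys - yc            # central moments, eq (2.4)
    mu20, mu02, mu11 = (dx * dx).sum(), (dy * dy).sum(), (dx * dy).sum()

    # principal-axis orientation, eq (2.13)
    theta = 0.5 * np.arctan2(2.0 * mu11, mu20 - mu02)

    # equivalent-ellipse radii from the principal moments, eqs (2.14), (2.16)
    root = np.sqrt((mu20 - mu02) ** 2 + 4.0 * mu11 ** 2)
    lam1, lam2 = (mu20 + mu02 + root) / 2.0, (mu20 + mu02 - root) / 2.0
    a, b = 2.0 * np.sqrt(lam1 / m00), 2.0 * np.sqrt(lam2 / m00)

    feats = dict(area=m00, centroid=(xc, yc), theta=theta, a=a, b=b)
    if perimeter is not None:            # circularity, eq (2.10)
        feats["circularity"] = 4.0 * np.pi * m00 / perimeter ** 2
    return feats

# toy usage: a bright blob segmented by a fixed intensity threshold
img = np.zeros((120, 160))
img[40:80, 60:130] = 200.0               # synthetic rectangular target
features = region_features(img > 100)
print(features["centroid"], features["theta"])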

Feature tracking Software computation of image moments is one to two orders of magnitude slower than specialized hardware. However the computation time can be greatly reduced if only a small image window, whose location is predicted from the previous centroid, is processed[42, 110, 83, 38, 31, 46]. The task of locating features in sequential scenes is relatively easy since there will be only small changes from one scene to the next[31, 73] and total scene interpretation is not required. This is the principle of verification vision proposed by Bolles[14], in which the system has considerable prior knowledge of the scene, and the goal is to verify and refine the location of one or more features in the scene. Determining the initial location of features requires the entire image to be searched, but this need only be done once. Papanikolopoulos et al.[75] use a sum-of-squared differences approach to match features between consecutive frames. Features are chosen on the basis of a confidence measure computed from the feature window, and the search is performed in software. The TRIAX system[11] is an extremely high-performance multiprocessor system for low-latency six-dimensional object tracking. It can determine the pose of a cube by searching short check lines normal to the expected edges of the cube. When the software feature search is limited to only a small window into the image, it becomes important to know the expected position of the feature in the image. This is the target tracking problem: the use of a filtering process to generate target state estimates and predictions based on noisy observations of the target's position and a dynamic model of the target's motion. Target maneuvers are generated by acceleration controls unknown to the tracker. Kalata[62] introduces tracking filters and discusses the similarities to Kalman filtering. Visual servoing systems have been reported using tracking filters[6], Kalman filters[110, 31], AR (auto-regressive) or ARX (auto-regressive with exogenous inputs) models[37, 53]. Papanikolopoulos et al.[75] use an ARMAX model and consider tracking as the design of a second-order controller of image plane position. The distance to the target is assumed constant


and a number of different controllers such as PI, pole-assignment and LQG are investigated. For the case where target distance is unknown or time-varying, adaptive control is proposed[74]. The prediction used for search window placement can also be used to overcome latency in the vision system and robot controller. Dickmanns[31] and Inoue[55] have built multiprocessor systems where each processor is dedicated to tracking a single distinctive feature within the image. More recently, Inoue[56] has demonstrated the use of a specialized VLSI motion estimation device for fast feature tracking.
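The tracking filters mentioned above are often realized as a simple constant-velocity predictor, for example an alpha-beta filter of the kind discussed by Kalata[62] and used by Allen et al.[6]. The following sketch is a generic illustration of such a filter used to place the search window one frame ahead; the class name, gains and sample rate are illustrative assumptions, not values taken from the cited systems.

class AlphaBetaTracker:
    """Constant-velocity alpha-beta filter for one image-plane coordinate.

    Predicts the feature position one frame ahead so that only a small
    search window need be processed.
    """

    def __init__(self, x0, dt=1/60.0, alpha=0.85, beta=0.005):
        self.x, self.v = float(x0), 0.0   # position (pixels), velocity (pixels/s)
        self.dt, self.alpha, self.beta = dt, alpha, beta

    def predict(self):
        """Predicted position at the next sample (use to centre the window)."""
        return self.x + self.v * self.dt

    def update(self, z):
        """Fold in a new (noisy) centroid measurement z and return the estimate."""
        xp = self.predict()
        r = z - xp                        # innovation
        self.x = xp + self.alpha * r
        self.v = self.v + (self.beta / self.dt) * r
        return self.x

# usage: one tracker per coordinate of the tracked centroid
track_u, track_v = AlphaBetaTracker(320), AlphaBetaTracker(240)
for meas_u, meas_v in [(322, 241), (326, 243), (331, 246)]:
    track_u.update(meas_u), track_v.update(meas_v)
    window_centre = (track_u.predict(), track_v.predict())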

2.2.3 Communications It is an unfortunate reality that most `o -the-shelf' robot controllers have low-bandwidth communications facilities. In many reported visual servo systems the communications path between vision system and robot is typically a serial data communications link[101, 53, 37, 114]. The Unimate VAL-II controller's ALTER facility used by some researchers[101, 76] allows trajectory modi cation requests to be received via a serial port at intervals of 28 ms. Tate[97] identi ed the transfer function between this input and the manipulator motion, and found the dominant dynamic characteristic below 10 Hz was a time delay of 0.1s. Bukowski et al.[16] describe a novel analog interconnect between a PC and the Unimate controller, so as to overcome the inherent latency associated with the Unimate ALTER facility. To circumvent this communications problem it is necessary to follow the more dicult path of reverse engineering the existing controller or implementing a custom controller. One of the more common such approaches is to use RCCL[50] or similar[25, 24] to connect a `foreign controller' to a Puma robot's servo system, bypassing the native VAL based control system. This provides direct application access to the Unimate joint servo controllers, at high sample rate, and with reduced communications latency. Visual servo systems based on RCCL have been described by Allen et al.[6], Houshangi[53] and Corke[26]. Allen et al. use a `high speed interface' between the PIPE vision system and RCCL, while Corke uses an Ethernet to link a Datacube based vision system to a Cartesian velocity server based on RCCL. Houshangi uses RCCL as a platform for adaptive axis control, but uses a low visual sample rate and a vision system connected by serial link. Higher communications bandwidth can be achieved by means of a common computer backplane. Such an approach is used by Papanikolopoulos[75] with the Chimera multi-processor control architecture. Later work by Corke[25] is based on a VMEbus robot controller and Datacube vision system with a shared backplane. A very recent report by Urban et al.[100] describes a similar visual servoing architecture. Less tightly coupled systems based on transputer technology have been described by Hashimoto et al.[48], Rizzi et al.[83] and Westmore and Wilson[110]. The vision system and robot actuators are connected directly to elements of the computational network. Hashimoto's system closes the joint velocity control loops of a Puma 560 at 1 kHz, but the visual servo sample interval varies from { 62 {


85 ms to 150 ms depending upon the control strategy used. Wilson's transputer-based vision system performs feature tracking and pose determination, but communicates with the robot via a serial link.

2.3 Control design and performance Within the visual servoing literature there has, to date, been little emphasis on dynamic performance. Most reports use proportional control and low loop gains to achieve stable, but low-speed, closed-loop visual control. Machine vision has a number of significant disadvantages when used as a feedback sensor: a relatively low sample rate, significant latency (one or more sample intervals) and coarse quantization. While these characteristics present a challenge for the design of high-performance motion control systems they are not insurmountable. Latency is the most significant dynamic characteristic and has many sources, which include: transport delay of pixels from camera to vision system, image processing algorithms, communications between vision system and control computer, control algorithm software, and communications with the robot. In fact problems of delay in visual servo systems were first noted over 15 years ago[51]. If the target motion is constant then prediction can be used to compensate for latency, but combined with a low sample rate this results in poor disturbance rejection and a long reaction time to target `maneuvers', that is, unmodeled motion. Grasping objects on a conveyor belt[118] or from a moving train[5] is, however, an ideal application for prediction. Predictors can be based on Kalman filters, α–β filters[6], or autoregressive models[80]. Franklin[40] suggests that the sample rate of a digital control system be between 4 and 20 times the desired closed-loop bandwidth. For the case of a 50 Hz vision system this implies that a closed-loop bandwidth between 2.5 Hz and 12 Hz is achievable. That so few reported systems achieve this leads to the conclusion that the whole area of dynamic modelling and control design has to date been largely overlooked. The earliest report relevant to visual servo dynamics appears to be Hill and Park's[51] 1979 paper, which describes visual servoing of a Unimate robot and some of the dynamics of visual closed-loop control. Stability, accuracy and tracking speed for another Unimate-based visual-servo system are discussed by Makhlin[68]. Coulon and Nougaret[27] describe closed-loop position control of an XY mechanism to achieve a settling time of around 0.2 s to a step demand. Andersen[7] describes the control of a labyrinth game: a wooden board mounted on gimbals on which a ball bearing rolls, the aim being to move the ball through a maze and not fall into a hole. The ball's position is observed at 40 ms intervals and a Kalman filter is used to reconstruct the ball's state. State-feedback control gives a closed-loop bandwidth of 1.3 Hz.
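To make the sample-rate and latency arguments concrete, the short sketch below applies the 4 to 20 times rule of thumb cited above and computes the phase lag contributed by a pure time delay at a given target frequency. The numbers used in the example are illustrative only and are not drawn from any particular system surveyed here.

def achievable_bandwidth(sample_rate_hz, factor_lo=4.0, factor_hi=20.0):
    """Rule of thumb: closed-loop bandwidth between sample_rate/20 and sample_rate/4."""
    return sample_rate_hz / factor_hi, sample_rate_hz / factor_lo

def delay_phase_lag_deg(delay_s, target_freq_hz):
    """Phase lag (degrees) introduced by a pure time delay at a given frequency."""
    return 360.0 * target_freq_hz * delay_s

# a 50 Hz field-rate vision system with two fields of latency (40 ms)
lo, hi = achievable_bandwidth(50.0)
print(f"achievable bandwidth roughly {lo:.1f} to {hi:.1f} Hz")   # about 2.5 to 12.5 Hz
print(delay_phase_lag_deg(0.040, 1.1), "deg of lag at 1.1 Hz")   # about 16 deg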


2.3.1 Performance metrics It is currently difficult to compare the temporal performance of various visual servo systems due to the lack of any agreed performance measures. At best only qualitative assessments can be made from examining the time axis scaling of published results or video tape presentations. Traditionally the performance of a closed-loop system is measured in terms of bandwidth. While `high bandwidth' is desirable to reduce error it can also lead to a system that is sensitive to noise and unmodeled dynamics. However the notion of bandwidth is not straightforward since it implies both magnitude (3 dB down) and phase (45° lag) characteristics. Time delay introduces a linear increase in phase with frequency, and for tracking applications phase is a particularly meaningful performance measure. In fact one of the few examples of a performance specification for a visual servoing system is for a citrus picking system[79]: the static error tolerance was 6 pixels and a phase lag of 10° for a fruit swinging at 1.1 Hz. Other commonly used performance metrics such as settling time, overshoot and polynomial signal following errors are appropriate. Note though that a feedback system which is able to settle accurately over a static target may well show poor ability to follow a sinusoid. Simulated tracking of sinusoidal motion in Figure 2.2 shows a robot lag of approximately 30° and the centroid error peaking at over 80 pixels. Many visual servo systems implement a fixation or tracking task, and it is thus appropriate to quantitatively evaluate the quality of the tracking motion in terms of image plane error, either peak-to-peak or RMS. The RMS error over the time interval [t_1, t_2] is given by

\epsilon = \sqrt{ \frac{\int_{t_1}^{t_2} \left( {}^iX(t) - {}^iX_d(t) \right)^2 dt}{t_2 - t_1} }   (2.18)

An image-plane error measure is appropriate since the fixation task itself is defined in terms of image plane error, and is ideally zero. Many reports do not include information about the image plane error performance, but this should be given if progress is to be made by quantitatively comparing different control approaches. The choice of appropriate target motion with which to evaluate performance depends upon the application. For instance if tracking a falling or flying object it would be appropriate to use a parabolic input. A sinusoid is particularly challenging since it is a non-polynomial with significant and persistent acceleration that can be readily created experimentally using a pendulum or turntable. Sinusoidal excitation clearly reveals phase error, which is a consequence of open-loop latency.
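A discrete approximation of the RMS error (2.18) is easily computed from logged feature trajectories. The sketch below is a minimal illustration written for this exposition (it assumes uniformly sampled data and that NumPy is available; the signal used in the usage example is synthetic).

import numpy as np

def rms_image_plane_error(x_actual, x_demand):
    """Discrete approximation of eq (2.18) for uniformly sampled data:
    the time integral divided by (t2 - t1) reduces to the sample mean."""
    e = np.asarray(x_actual, dtype=float) - np.asarray(x_demand, dtype=float)
    return np.sqrt(np.mean(e ** 2))

# usage with a logged fixation experiment (demand is a constant set-point)
t = np.linspace(0.0, 3.0, 181)                               # 60 Hz samples over 3 s
demand = np.full_like(t, 256.0)                              # pixels
actual = 256.0 + 80.0 * np.sin(2 * np.pi * t / 1.5 - 0.5)    # lagging response
print(rms_image_plane_error(actual, demand), "pixels RMS")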


[Figure 2.2: two plots against time (s). Top: response (radl), demand (dashed) and response (solid); bottom: image-plane error (pixels).]
Figure 2.2. Simulated tracking performance of visual feedback controller, where the target is moving sinusoidally with a period of 1.5 s (0.67 Hz). Note the considerable lag of the response (solid) to the demand (dashed) which leads to the image plane error.

[Figure 2.3: block diagram. The target position x_t(z) enters as a disturbance; the reference input iX_d(z) and the vision output iX(z) form the image-plane error iX~(z), which drives the compensator D(z), the robot R(z) and the vision system V(z).]
Figure 2.3. System block diagram of a visual feedback control system showing target motion as disturbance input.

2.3.2 Visual servoing as a control systems problem Two useful transfer functions can be written for the system of Figure 2.3. The response of image plane centroid to demanded centroid is given by the transfer function

\frac{{}^iX(z)}{{}^iX_d(z)} = \frac{V(z) R(z) D(z)}{1 + V(z) R(z) D(z)}   (2.19)

Jang[58] has shown how tasks can be expressed in terms of image plane feature trajectories, {}^iX_d(t), for which this image plane reference-following performance is significant. For a fixation task, where {}^iX_d is constant, {}^i\tilde{X}(z)/X_t(z) describes the image plane error response to target motion, and is given by

\frac{{}^i\tilde{X}(z)}{X_t(z)} = \frac{V(z)}{1 + V(z) R(z) D(z)}   (2.20)

where {}^i\tilde{X} = {}^iX_d - {}^iX is the image plane error and {}^iX_d is assumed, for convenience, to be zero. The motion of the target, x_t, can be considered a disturbance input and ideally the disturbance response, |{}^i\tilde{X}(z)/X_t(z)|, would be zero. However with the typical dynamics (incorporating delay) it is non-zero and rises with frequency. To obtain better tracking performance it is necessary to reduce |{}^i\tilde{X}(z)/X_t(z)| over the target motion's frequency range. The steady-state value of the image plane error for a given target motion, X_t(z), can be found by the final-value theorem

\lim_{t \to \infty} {}^i\tilde{X}(t) = \lim_{z \to 1} (z - 1)\, {}^i\tilde{X}(z)   (2.21)
= \lim_{z \to 1} (z - 1) \frac{V(z)}{1 + V(z) R(z) D(z)} X_t(z)   (2.22)

which may be evaluated with the target motion chosen as one of

X_t(z) = \begin{cases} \dfrac{z}{z - 1} & \text{for a step input} \\ \dfrac{T z}{(z - 1)^2} & \text{for a ramp input} \\ \dfrac{T^2 z (z + 1)}{2 (z - 1)^3} & \text{for a parabolic input} \end{cases}   (2.23)

For the steady-state tracking error to be zero, the numerator of (2.22) must cancel the poles at z = 1 of X_t(z) and retain a factor of (z - 1). This numerator comprises the poles of the robot and compensator transfer functions; the latter may be selected to give the desired steady-state error response by inclusion of an appropriate number of integrator terms, or (z - 1) factors. The number of such open-loop integrators is referred to as the Type of the system in classical control literature. Equation (2.22) has one open-loop integrator and is thus of Type 1, giving zero steady-state error to a step input. From (2.23) compensator poles will appear as additional closed-loop zeros, and additional open-loop integrators will appear as closed-loop differentiators. Classical approaches to achieving improved tracking performance are:
1. Increasing the loop gain, K_p, which minimizes the magnitude of the ramp-following error. However for systems incorporating open-loop delay this gain is severely constrained unless additional compensation is introduced.
2. Increasing the Type of the system, by adding open-loop integrators. Type 1 and 2 systems will have zero steady-state error to a step and ramp demand respectively.
3. Introducing feedforward of the signal to be tracked. This is a commonly used technique in machine-tool control and has been discussed, for instance, by Tomizuka[99]. For the visual servo system, the signal to be tracked is the target position, which is not directly measurable. This approach is investigated in [20].


[Figure 2.4: image-plane error (pixels) against time (s) for feedforward, pole-placement, PID and proportional compensators.]
Figure 2.4. Comparison of image plane error for various visual feedback compensators from above. Target motion in all cases is the sinusoid 0.5 sin(4.2t) radm/s.

2.3.3 Axis control modes for visual servoing Most reported visual servo systems[37, 58, 6, 75, 91, 107], are based on an approach in which the visual servo is built `on top of' an underlying position-controlled manipulator. This is probably for the pragmatic reason that most robot controllers present the abstraction of a robot as a position-controlled device. Exceptions to this approach are the citrus-picking robot[79] which closes the vision loop about a hydraulic actuator which is a `natural' velocity `source', and the work of Weiss[109] who proposes closing a visual-feature control loop around a torque servo. The simulated response of a visual servo system with position, velocity and torquecontrolled actuators to a unit step target motion is shown in Figure 2.5. The controller based on a position-mode axis has a large steady-state error since it is only of Type 0. { 67 {


[Figure 2.5: robot response against time step for torque-, velocity-, position- and position-plus-integrator-controlled axes.]
Figure 2.5. Simulation of time responses to target step motion for visual servo systems based on torque, velocity, and position-controlled axes. Proportional control with gain selected to achieve a damping factor of 0.7.

The velocity-mode axis control results in a significantly faster step response than that for torque-mode. If an integrator is added to the position-mode controller then the response becomes similar to, but slightly faster than, the velocity-mode controller as shown in Figure 2.5. The integrator has turned the position-mode controller into an `ideal' velocity source, just as used in the experimental system. From a control systems perspective however there are some subtle differences between a velocity-controlled axis and a position-controlled axis plus integrator. In particular the inner position loop will generally be very `tight', that is high-gain, in order to minimize position error. For an ideal robot this is not a problem, but it will exacerbate resonance problems due to structural and/or drive compliance. It has been shown[84] that for the case of significant structural compliance it is preferable to use a `soft' velocity loop in order to increase system damping. Experiments and more detailed non-linear simulations indicate that the torque-mode controller does not behave as expected, yielding very lightly damped oscillatory responses. This was found to be due to a combination of non-linear friction, integral action and the relatively low sample rate. The unmodeled dynamics, particularly stick/slip friction, are complex and have time constants that are shorter than the visual sample interval. It is a `rule of thumb' that the sample rate should be 4 to 20 times the natural frequency of the modes to be controlled. Improved torque-mode control would require one or more of the following: a higher sample rate, friction measurement and feedforward, adaptive, or robust control. A more straightforward


option is the use of axis velocity control. Velocity feedback is effective in linearizing the axis dynamics and eliminating much of the parameter variation.

2.3.4 Requirements for high-performance visual control Important prerequisites for high-performance visual servoing are:

- a vision system capable of a high sample rate with low latency;
- a high-bandwidth communications path between the vision system and the robot controller.

Most reports make use of standard video sensors and formats and are thus limited in sample rate to at most 60 Hz (the RS170 field rate). For visual control[51, 27] and force control[78] it has been observed that latency from sensed error to action is critical to performance. Many reports are based on the use of slow vision systems where the sample interval or latency is significantly greater than the video frame time. If the target motion is constant then prediction can be used to compensate for the latency, but the low sample rate results in poor disturbance rejection and a long reaction time to target `maneuvers'. The control requirements can be summarized as:
1. Good tracking performance entirely by feedback control necessitates a Type 2 system with fast closed-loop dynamics.
2. The design space for such compensators has been found to be very small[20].
3. Control of actuator torque using only the vision sensor leads to poor control due to the low sample rate and non-linear axis dynamics. Compensators designed to increase performance and raise system Type are not robust with respect to the aforementioned effects.
4. Closing a velocity loop around the actuator linearizes the axis and considerably reduces the uncertainty in axis parameters.
5. Velocity and position mode control can be used to achieve satisfactory tracking.
6. Position-mode control requires additional complexity at the axis control level, for no significant performance improvement under visual servoing.
7. Particular implementations of axis position controllers may introduce additional sample rates which will degrade the overall closed-loop performance.


The approaches reported in the literature are interesting when considered in the light of these observations. Most reports have closed the visual position loop around an axis position loop which generally operates at a high sample rate (sample intervals of the order of a few milliseconds). There are fewer reports on the use of a visual position loop around an axis velocity loop. Pool[79] used a hydraulic actuator which is a `natural' velocity source, albeit with some severe stiction effects. Hashimoto[48] implemented a digital velocity loop on the axes of a Puma robot, as did Corke[23, 21]. Weiss[109] proposed visual control of actuator torque, but even with ideal actuator models found that sample intervals as low as 3 ms were required.

References [1] P. Adsit. Real-Time Intelligent Control of a Vision-Servoed Fruit-Picking Robot. PhD thesis, University of Florida, 1989. [2] G. Agin. Calibration and use of a light stripe range sensor mounted on the hand of a robot. In Proc. IEEE Int. Conf. Robotics and Automation, pages 680{685, 1985. [3] R. Ahluwalia and L. Fogwell. A modular approach to visual servoing. In Proc. IEEE Int. Conf. Robotics and Automation, pages 943{950, 1986. [4] J. S. Albus. Brains, behavior and robotics. Byte Books, 1981. [5] P. K. Allen, A. Timcenko, B. Yoshimi, and P. Michelman. Real-time visual servoing. In Proc. IEEE Int. Conf. Robotics and Automation, pages 1850{1856, 1992. [6] P. K. Allen, B. Yoshimi, and A. Timcenko. Real-time visual servoing. In Proc. IEEE Int. Conf. Robotics and Automation, pages 851{856, 1991. [7] N. Andersen, O. Ravn, and A. Srensen. Real-time vision based control of servomechanical systems. In Proc. 2nd International Symposium on Experimental Robotics, Toulouse, France, June 1991. [8] C. H. Anderson, P. J. Burt, and G. S. van der Wal. Change detection and tracking using pyramid transform techniques. In Proceeding of SPIE, volume 579, pages 72{78, Cambridge, Mass., September 1985. SPIE. [9] R. L. Andersson. Real-time gray-scale video processing using a momentgenerating chip. IEEE Trans. Robot. Autom., RA-1(2):79{85, June 1985. [10] R. Andersson. Real Time Expert System to Control a Robot Ping-Pong Player. PhD thesis, University of Pennsylvania, June 1987. [11] R. Andersson. A low-latency 60Hz stereo vision system for real-time visual control. Proc. 5th Int.Symp. on Intelligent Control, pages 165{170, 1990. [12] D. H. Ballard and C. M. Brown. Computer Vision. Prentice Hall, 1982. [13] P. Besl. Active, optical range imaging sensors. Machine Vision and Applications, 1:127{152, 1988. { 70 {


[14] R. C. Bolles. Veri cation vision for programmable assembly. In Proc 5th International Joint Conference on Arti cial Intelligence, pages 569{575, Cambridge, MA, 1977. [15] M. Bowman and A. Forrest. Visual detection of di erential movement: Applications to robotics. Robotica, 6:7{12, 1988. [16] R. Bukowski, L. Haynes, Z. Geng, N. Coleman, A. Santucci, K. Lam, A. Paz, R. May, and M. DeVito. Robot hand-eye coordination rapid prototyping environment. In Proc. ISIR, pages 16.15{16.28, October 1991. [17] G. Buttazzo, B. Allotta, and F. Fanizza. Mousebuster: a robot system for catching fast moving objects by vision. In Proc. IEEE Int. Conf. Robotics and Automation, pages 932{937, 1993. [18] F. Chaumette, P. Rives, and B. Espiau. Positioning of a robot with respect to an object, tracking it and estimating its velocity by visual servoing. In Proc. IEEE Int. Conf. Robotics and Automation, pages 2248{2253, 1991. [19] W. F. Clocksin, J. S. E. Bromley, P. G. Davey, A. R. Vidler, and C. G. Morgan. An implementation of model-based visual feedback for robot arc welding of thin sheet steel. Int. J. Robot. Res., 4(1):13{26, Spring 1985. [20] P. I. Corke. High-Performance Visual Closed-Loop Robot Control. PhD thesis, University of Melbourne, Dept. Mechanical and Manufacturing Engineering, July 1994. [21] P. Corke. Experiments in high-performance robotic visual servoing. In Proc. International Symposium on Experimental Robotics, pages 194{200, Kyoto, October 1993. [22] P. Corke and M. Good. Dynamic e ects in high-performance visual servoing. In Proc. IEEE Int. Conf. Robotics and Automation, pages 1838{1843, Nice, May 1992. [23] P. Corke and M. Good. Controller design for high-performance visual servoing. In Proc. IFAC 12th World Congress, pages 9{395 to 9{398, Sydney, 1993. [24] P. Corke and R. Kirkham. The ARCL robot programming system. In Proc.Int.Conf. of Australian Robot Association, pages 484{493, Brisbane, July 1993. Australian Robot Association, Mechanical Engineering Publications (London). [25] P. Corke and R. Paul. Video-rate visual servoing for robots. In V. Hayward and O. Khatib, editors, Experimental Robotics 1, volume 139 of Lecture Notes in Control and Information Sciences, pages 429{451. Springer-Verlag, 1989. [26] P. Corke and R. Paul. Video-rate visual servoing for robots. Technical Report MS-CIS-89-18, GRASP Lab, University of Pennsylvania, February 1989. [27] P. Y. Coulon and M. Nougaret. Use of a TV camera system in closed-loop position control of mechnisms. In A. Pugh, editor, International Trends in { 71 {

Tutorial TT3: Visual Servo Control

[28] [29] [30] [31] [32] [33] [34] [35] [36] [37] [38] [39] [40] [41]

Hager/Hutchinson/Corke

Manufacturing Technology ROBOT VISION, pages 117{127. IFS Publications, 1983. R. Cunningham. Segmenting binary images. Robotics Age, pages 4{19, July 1981. M. L. Cyros. Datacube at the space shuttle's launch pad. Datacube World Review, 2(5):1{3, September 1988. Datacube Inc., 4 Dearborn Road, Peabody, MA. E. Dickmanns and V. Graefe. Applications of dynamic monocular machine vision. Machine Vision and Applications, 1:241{261, 1988. E. Dickmanns and V. Graefe. Dynamic monocular machine vision. Machine Vision and Applications, 1:223{240, 1988. E. Dickmanns and F.-R. Schell. Autonomous landing of airplanes by dynamic machine vision. In Proc. IEEE Workshop on Applications of Computer Vision, pages 172{179. IEEE Comput. Soc. Press, November 1992. J. Dietrich, G. Hirzinger, B. Gombert, and J. Schott. On a uni ed concept for a new generation of light-weight robots. In V. Hayward and O. Khatib, editors, Experimental Robotics 1, volume 139 of Lecture Notes in Control and Information Sciences, pages 287{295. Springer-Verlag, 1989. K. A. Dzialo and R. J. Schalko . Control implications in tracking moving objects using time-varying perspective-projective imagery. IEEE Trans. Ind. Electron., IE-33(3):247{253, August 1986. H. Fassler, H. Beyer, and J. Wen. A robot pong pong player: optimized mechanics, high performance 3D vision, and intelligent sensor control. Robotersysteme, 6:161{170, 1990. J. T. Feddema, C. S. G. Lee, and O. R. Mitchell. Weighted selection of image features for resolved rate visual feedback control. IEEE Trans. Robot. Autom., 7(1):31{47, February 1991. J. Feddema. Real Time Visual Feedback Control for Hand-Eye Coordinated Robotic Systems. PhD thesis, Purdue University, 1989. J. Feddema and O. Mitchell. Vision-guided servoing with feature-based trajectory generation. IEEE Trans. Robot. Autom., 5(5):691{700, October 1989. J. P. Foith, C. Eisenbarth, E. Enderle, H. Geisselmann, H. Ringschauser, and G. Zimmermann. Real-time processing of binary images for industrial applications. In L. Bolc and Z. Kulpa, editors, Digital Image Processing Systems, pages 61{168. Springer-Verlag, Germany, 1981. G. Franklin and J. Powell. Digital Control of dynamic systems. Addison-Wesley, 1980. S. Ganapathy. Real-time motion tracking using a single camera. Technical Memorandum 11358-841105-21-TM, AT&T Bell Laboratories, November 1984. { 72 {

Tutorial TT3: Visual Servo Control

Hager/Hutchinson/Corke

[42] C. Geschke. A robot task using visual tracking. Robotics Today, pages 39{43, Winter 1981. [43] C. C. Geschke. A system for programming and controlling sensor-based robot manipulators. IEEE Trans. Pattern Anal. Mach. Intell., PAMI-5(1):1{7, January 1983. [44] A. Gilbert, M. Giles, G. Flachs, R. Rogers, and H. Yee. A real-time video tracking system. IEEE Trans. Pattern Anal. Mach. Intell., 2(1):47{56, January 1980. [45] R. M. Haralick and L. G. Shapiro. Survey: Image segmentation techniques. Computer Vision, Graphics, and Image Processing, 29:100{132, 1985. [46] R. C. Harrell, D. C. Slaughter, and P. D. Adsit. A fruit-tracking system for robotic harvesting. Machine Vision and Applications, 2:69{80, 1989. [47] H. Hashimoto, T. Kubota, W.-C. Lo, and F. Harashima. A control scheme of visual servo control of robotic manipulators using arti cial neural network. In Proc. IEEE Int.Conf. Control and Applications, pages TA{3{6, Jerusalem, 1989. [48] K. Hashimoto, T. Kimoto, T. Ebine, and H. Kimura. Manipulator control with image-based visual servo. In Proc. IEEE Int. Conf. Robotics and Automation, pages 2267{2272, 1991. [49] M. Hatamian. A real-time two-dimensional moment generating algorithm and its single chip implementation. IEEE Trans. Acoust. Speech Signal Process., 34(3):546{553, June 1986. [50] V. Hayward and R. P. Paul. Robot manipulator control under UNIX | RCCL: a Robot Control C Library. Int. J. Robot. Res., 5(4):94{111, 1986. [51] J. Hill and W. T. Park. Real time control of a robot with a mobile camera. In Proc. 9th ISIR, pages 233{246, Washington, DC, March 1979. [52] K. Hosoda and M. Asada. Versatile visual servoing without knowledge of true Jacobian. In Proc. IROS, September 1994. [53] N. Houshangi. Control of a robotic manipulator to grasp a moving target using vision. In Proc. IEEE Int. Conf. Robotics and Automation, pages 604{609, 1990. [54] M. K. Hu. Visual pattern recognition by moment invariants. IRE Trans. Info. Theory, 8:179{187, February 1962. [55] H. Inoue, T. Tachikawa, and M. Inaba. Robot vision server. In Proc. 20th ISIR, pages 195{202, 1989. [56] H. Inoue, T. Tachikawa, and M. Inaba. Robot vision system with a correlation chip for real-time tracking, optical ow, and depth map generation. In Proc. IEEE Int. Conf. Robotics and Automation, pages 1621{1626, 1992. { 73 {

Tutorial TT3: Visual Servo Control

Hager/Hutchinson/Corke

[57] M. Jagersand, O. Fuentes, and R. Nelson. Experimental evaluation of uncalibrated visual servoing for precision manipulation. In Proc. IEEE Int. Conf. Robotics and Automation, page to appear, 1996. [58] W. Jang and Z. Bien. Feature-based visual servoing of an eye-in-hand robot with improved tracking performance. In Proc. IEEE Int. Conf. Robotics and Automation, pages 2254{2260, 1991. [59] W. Jang, K. Kim, M. Chung, and Z. Bien. Concepts of augmented image space and transformed feature space for ecient visual servoing of an \eye-in-hand robot". Robotica, 9:203{212, 1991. [60] M. Kabuka, J. Desoto, and J. Miranda. Robot vision tracking system. IEEE Trans. Ind. Electron., 35(1):40{51, February 1988. [61] M. Kabuka, E. McVey, and P. Shironoshita. An adaptive approach to video tracking. IEEE Trans. Robot. Autom., 4(2):228{236, April 1988. [62] P. R. Kalata. The tracking index: a generalized parameter for ; and ; ; target trackers. IEEE Trans. Aerosp. Electron. Syst., AES-20(2):174{ 182, March 1984. [63] P. Khosla, N. Papanikolopoulos, and B. Nelson. Dynamic sensor placement using controlled active vision. In Proc. IFAC 12th World Congress, pages 9.419{ 9.422, Sydney, 1993. [64] R. Kohler. A segmentation system based on thresholding. Computer Graphics and Image Processing, pages 319{338, 1981. [65] M. Kuperstein. Generalized neural model for adaptive sensory-motor control of single postures. In Proc. IEEE Int. Conf. Robotics and Automation, pages 140{143, 1988. [66] M. Leahy, V. Milholen, and R. Shipman. Robotic aircraft refueling: a concept demonstration. In Proc. National Aerospace and Electronics Conf., pages 1145{ 50, May 1990. [67] Z. Lin, V. Zeman, and R. V. Patel. On-line robot trajectory planning for catching a moving object. In Proc. IEEE Int. Conf. Robotics and Automation, pages 1726{1731, 1989. [68] A. G. Makhlin. Stability and sensitivity of servo vision systems. Proc 5th Int Conf on Robot Vision and Sensory Controls - RoViSeC 5, pages 79{89, October 1985. [69] B. Mel. Connectionist Robot Motion Planning. Academic Press, 1990. [70] W. Miller. Sensor-based control of robotic manipulators using a general learning algorithm. IEEE Trans. Robot. Autom., 3(2):157{165, April 1987. [71] J. Mochizuki, M. Takahashi, and S. Hata. Unpositioned workpieces handling robot with visual and force sensors. IEEE Trans. Ind. Electron., 34(1):1{4, February 1987. { 74 {

Tutorial TT3: Visual Servo Control

Hager/Hutchinson/Corke

[72] S. Negahdaripour and J. Fox. Undersea optical stationkeeping: Improved methods. J. Robot. Syst., 8(3):319{338, 1991. [73] R. Nevatia. Depth measurement by motion stereo. Computer Graphics and Image Processing, 5:203{214, 1976. [74] N. Papanikolopoulos, P. Khosla, and T. Kanade. Adaptive robot visual tracking. In Proc. American Control Conference, pages 962{967, 1991. [75] N. Papanikolopoulos, P. Khosla, and T. Kanade. Vision and control techniques for robotic visual tracking. In Proc. IEEE Int. Conf. Robotics and Automation, pages 857{864, 1991. [76] N. Papanikolopoulos and P. Khosla. Shared and traded telerobotic visual control. In Proc. IEEE Int. Conf. Robotics and Automation, pages 878{885, 1992. [77] R. P. Paul. Robot Manipulators: Mathematics, Programming, and Control. MIT Press, Cambridge, Massachusetts, 1981. [78] R. P. Paul and H. Zhang. Design of a robot force/motion server. In Proc. IEEE Int. Conf. Robotics and Automation, volume 3, pages 1878{83, Washington , USA, 1986. [79] T. Pool. Motion Control of a Citrus-Picking Robot. PhD thesis, University of Florida, 1989. [80] R. E. Prajoux. Visual tracking. In D. Nitzan et al., editors, Machine intelligence research applied to industrial automation, pages 17{37. SRI International, August 1979. [81] J. Pretlove and G. Parker. The development of a real-time stereo-vision system to aid robot guidance in carrying out a typical manufacturing task. In Proc. 22nd ISRR, pages 21.1{21.23, Detroit, 1991. [82] P. Rives, F. Chaumette, and B. Espiau. Positioning of a robot with respect to an object, tracking it and estimating its velocity by visual servoing. In V. Hayward and O. Khatib, editors, Experimental Robotics 1, volume 139 of Lecture Notes in Control and Information Sciences, pages 412{428. Springer-Verlag, 1989. [83] A. Rizzi and D. Koditschek. Preliminary experiments in spatial robot juggling. In Proc. 2nd International Symposium on Experimental Robotics, Toulouse, France, June 1991. [84] M. Roberts. Control of resonant robotic systems. Master's thesis, University of Newcastle, Australia, March 1991. [85] C. Rosen et al. Machine intelligence research applied to industrial automation. Sixth report. Technical report, SRI International, 1976. [86] C. Rosen et al. Machine intelligence research applied to industrial automation. Eighth report. Technical report, SRI International, 1978. [87] A. Rosenfeld and A. C. Kak. Digital Picture Processing. Academic Press, 1982. { 75 {

Tutorial TT3: Visual Servo Control

Hager/Hutchinson/Corke

[88] P. K. Sahoo, S. Soltani, and A. K. C. Wong. A survey of thresholding techniques. Computer Vision, Graphics, and Image Processing, 41:233{260, 1988. [89] T. Sakaguchi, M. Fujita, H. Watanabe, and F. Miyazaki. Motion planning and control for a robot performer. In Proc. IEEE Int. Conf. Robotics and Automation, pages 925{931, 1993. [90] S. Sawano, J. Ikeda, N. Utsumi, H. Kiba, Y. Ohtani, and A. Kikuchi. A sealing robot system with visual seam tracking. In Proc. Int. Conf. on Advanced Robotics, pages 351{8, Tokyo, September 1983. Japan Ind. Robot Assoc., Tokyo, Japan. [91] P. Sharkey, I. Reid, P. McLauchlan, and D. Murray. Real-time control of a reactive stereo head/eye platform. Proc. 29th CDC, pages CO.1.2.1{CO.1.2.5, 1992. [92] Y. Shirai and H. Inoue. Guiding a robot by visual feedback in assembling tasks. Pattern Recognition, 5:99{108, 1973. [93] S. Skaar, W. Brockman, and R. Hanson. Camera-space manipulation. Int. J. Robot. Res., 6(4):20{32, 1987. [94] G. Skofteland and G. Hirzinger. Computing position and orientation of a free ying polyhedron from 3D data. In Proc. IEEE Int. Conf. Robotics and Automation, pages 150{155, 1991. [95] M. Srinivasan, M. Lehrer, S. Zhang, and G. Horridge. How honeybees measure their distance from objects of unknown size. J. Comp. Physiol. A, 165:605{613, 1989. [96] G. Stange, M. Srinivasan, and J. Dalczynski. Range nder based on intensity gradient measurement. Applied Optics, 30(13):1695{1700, May 1991. [97] A. R. Tate. Closed loop force control for a robotic grinding system. Master's thesis, Massachusetts Institute of Technology, Cambridge, Massachsetts, 1986. [98] F. Tendick, J. Voichick, G. Tharp, and L. Stark. A supervisory telerobotic control system using model-based vision feedback. In Proc. IEEE Int. Conf. Robotics and Automation, pages 2280{2285, 1991. [99] M. Tomizuka. Zero phase error tracking algorithm for digital control. Journal of Dynamic Systems, Measurement and Control, 109:65{68, March 1987. [100] J. Urban, G. Motyl, and J. Gallice. Real-time visual servoing using controlled illumination. Int. J. Robot. Res., 13(1):93{100, February 1994. [101] S. Venkatesan and C. Archibald. Realtime tracking in ve degrees of freedom using two wrist-mounted laser range nders. In Proc. IEEE Int. Conf. Robotics and Automation, pages 2004{2010, 1990. [102] A. Verri and T. Poggio. Motion eld and optical ow: Qualitative properties. IEEE Trans. Pattern Anal. Mach. Intell., 11(5):490{498, May 1989. [103] Vision Systems Limited, Technology Park, Adelaide. APA-512MX Area Parameter Accelerator User Manual, October 1987. { 76 {

Tutorial TT3: Visual Servo Control

Hager/Hutchinson/Corke

[104] P. Vuylsteke, P. Defraeye, A. Oosterlinck, and H. V. den Berghe. Video rate recognition of plane objects. Sensor Review, pages 132{135, July 1981. [105] J. Wang and G. Beni. Connectivity analysis of multi-dimensional multi-valued images. In Proc. IEEE Int. Conf. Robotics and Automation, pages 1731{1736, 1987. [106] J. Wang and W. J. Wilson. Three-D relative position and orientation estimation using Kalman lter for robot control. In Proc. IEEE Int. Conf. Robotics and Automation, pages 2638{2645, 1992. [107] A. Wavering, J. Fiala, K. Roberts, and R. Lumia. Triclops: A high-performance trinocular active vision system. In Proc. IEEE Int. Conf. Robotics and Automation, pages 410{417, 1993. [108] T. Webber and R. Hollis. A vision based correlator to actively damp vibrations of a coarse- ne manipulator. RC 14147 (63381), IBM T.J. Watson Research Center, October 1988. [109] L. Weiss. Dynamic Visual Servo Control of Robots: an Adaptive Image-Based Approach. PhD thesis, Carnegie-Mellon University, 1984. [110] D. B. Westmore and W. J. Wilson. Direct dynamic control of a robot using an end-point mounted camera and Kalman lter position estimation. In Proc. IEEE Int. Conf. Robotics and Automation, pages 2376{2384, 1991. [111] J. S. Weszka. A survey of threshold selection techniques. Computer Graphics and Image Processing, 7:259{265, 1978. [112] W. Wichman. Use of optical feedback in the computer control of an arm. AI memo 55, Stanford AI project, August 1967. [113] J. Wilf and R. Cunningham. Computing region moments from boundary representations. JPL 79-45, NASA JPL, November 1979. [114] W. Wilson. Visual servo control of robots using Kalman lter estimates of relative pose. In Proc. IFAC 12th World Congress, pages 9{399 to 9{404, Sydney, 1993. [115] P. Wolf. Elements of Photogrammetry. McGraw-Hill, 1974. [116] J.-C. Yuan. A general photogrammetric method for determining object position and orientation. IEEE Trans. Robot. Autom., 5(2):129{142, April 1989. [117] J.-C. Yuan, F. Keung, and R. MacDonald. Telerobotic tracker. Patent EP 0 323 681 A1, European Patent Oce, Filed 1988. [118] D. B. Zhang, L. V. Gool, and A. Oosterlinck. Stochastic predictive control of robot tracking systems with dynamic visual feedback. In Proc. IEEE Int. Conf. Robotics and Automation, pages 610{615, 1990.

{ 77 {

Chapter 3

X Vision: A Portable Substrate for Real-Time Vision Applications

Gregory D. Hager and Kentaro Toyama
Department of Computer Science
Yale University, P.O. Box 208285
New Haven, CT 06520

3.1 Introduction

Real-time visual feedback is an essential tool for implementing systems that interact dynamically with the world. The challenge in providing visual feedback is to recover visual motion quickly and robustly. We distinguish between two types of motion processing: full-field motion processing, such as optical flow, and feature-based motion processing, such as edge tracking. While both have wide applicability, their data-processing requirements differ considerably. Vision techniques such as those based on optical flow or region segmentation tend to emphasize full-frame, iterative processing which is either performed offline or accelerated using specialized hardware. Off-line solutions are unsatisfactory for real-time applications, and hardware solutions tend to be expensive and inflexible. Feature tracking, on the other hand, concentrates on spatially localized areas of the image. Since image processing is local, high data bandwidth between the host and the framegrabber is not needed. Likewise, the amount of data that must be processed is relatively low and can be handled efficiently by off-the-shelf hardware. Such systems are cost-effective and, since most of the tracking algorithms reside in software, extremely flexible. Furthermore, as the speed of PCs and workstations continues to increase, so does the complexity of real-time vision applications that can be run on them. These advances anticipate the day when even full-frame applications requiring moderate processing can be run on standard hardware.

Feature tracking has already found wide applicability in the vision and robotics literature. One of the most common applications is in determining structure from motion. Structure-from-motion algorithms attempt to recover the three-dimensional structure of objects by observing their movement in multiple camera frames. Most often, this research
involves observation of line segments [10, 26, 36, 38, 45], point features [34, 37], or both [12, 35], as they move in the image. As with stereo vision research, a basic necessity for recovering structure accurately is a solution to the correspondence problem: three-dimensional structure cannot be accurately determined without knowing which features correspond to the same physical point in space in successive image frames. In this sense, precise local feature tracking is essential for the accurate recovery of three-dimensional structure.

Robotic hand-eye applications also make heavy use of visual tracking. Robots often operate in environments rich with edges, corners and textures, making feature-based tracking a natural choice for providing visual input. Specific applications include calibration of cameras and robots [7, 24], visual servoing and hand-eye coordination [8, 16, 20, 23, 44], mobile robot navigation and map-making [38, 43], pursuit of moving objects [8, 22], grasping [1], and telerobotics [19]. Robotic applications most often require the tracking of objects more complex than line segments or point features, and they frequently require the ability to track multiple objects. Thus, a tracking framework for robotic applications must include a framework for composing simple features to track objects such as rectangles, wheels, and grippers in a variety of environments. At the same time, the fact that vision is in a servo loop implies the tracking must be fast, accurate, and highly reliable.

A third category of tracking applications are those which track modeled objects. Models may be anything from weak assumptions about the form of the object as it projects to the camera image (e.g., contour trackers which assume simple, closed contours) to full-fledged three-dimensional models with variable parameters (such as a model for an automobile which allows for turning wheels, opening doors, etc.). Automatic road-following can be accomplished by tracking the edges of the road [30]. Various snake-like trackers are used to track objects in 2D as they move across the camera image [2, 6, 9, 25, 39, 41, 42]. Three-dimensional models, while more complex, allow for more precise pose estimation [14, 27]. Model-based tracking requires the ability to integrate simple features into a coherent whole, both to predict the configuration of features in the future and to evaluate the efficacy of any single feature.

While the list of tracking applications dependent on simple feature tracking is long, the features themselves are variations on a very small set of primitives: "edgels" or line segments [10, 26, 36, 38, 45, 14, 27, 41], corners based on line segments [19, 34], small patches of texture [11], and easily detectable highlights [4, 32]. Although the basic principles of recovering these visual features have been known for some time, experience has shown that tracking them is most effective when strong geometric, physical, and temporal constraints from the surrounding task can be brought to bear on the tracking problem. In most cases, the natural abstraction is a multi-level framework where geometric constraints are imposed "top-down" while geometric information about the world is computed "bottom-up." Although tracking is a necessary function for most of the research listed above, it is generally not a focus of the work and is often solved in an ad-hoc fashion for the purposes of a single demonstration.
This has led to a proliferation of tracking techniques which, although effective for particular experiments, are not practical solutions in general. Many tracking systems, for example, are only applied to pre-stored video sequences and do not operate in real time [33]. The implicit assumption is that speed will come, in time, with better technology (perhaps a reasonable assumption, but one which does not help those seeking real-time applications today). Other tracking systems require specialized hardware
[1], making it difficult for researchers without such resources to replicate results. Finally, most, if not all, existing tracking methodologies lack modularity and portability, forcing tracking modules to be re-invented for every application.

Based on these observations, we believe that the availability of a fast, portable, reconfigurable tracking system, an "X Vision" system, would greatly accelerate progress in the application of real-time vision tools, just as the X Window system made graphical user interfaces a common feature of desktop workstations. We have constructed such a system, largely for experimental vision-based robotic applications. Experience from several teaching and research applications suggests that this system reduces the startup time for new vision applications, makes real-time vision accessible to "non-experts," and demonstrates that interesting research utilizing real-time vision can be performed with minimal hardware.

This article describes some of the essential features of X Vision, focusing particularly on how geometric warping and geometric constraints are used to achieve high performance. We also present timing data for various tracking primitives, and several demonstrations of X Vision-based systems. The remainder of the article is organized into three sections: Section 3.2 describes X Vision in some detail, Section 3.3 shows several examples of its use, and Section 3.4 discusses current and future research directions.

3.2 Tracking System Design and Implementation

It has often been said that "vision is inverse graphics." In many ways, X Vision embodies this analogy. Common graphics systems implement a few simple primitives, e.g., lines and arcs, and subsequently define complex objects in terms of these primitives. So, for example, a polyhedron may be decomposed into its polygonal faces, which are further decomposed into constituent lines. Given an object-viewer relationship, these lines are projected into the screen coordinate system and displayed. In a good graphics system, defining these types of geometric relationships is simple and intuitive [13].

X Vision provides this functionality and its converse. The system is organized around a small set of image-level primitives referred to as basic features. Each of these features is described in terms of a small set of parameters, referred to as a state vector, which completely specifies the feature's position and appearance. Complex features or objects carry their own state vector, which is computed by defining functions or constraints on a collection of simpler state vectors. These complex features may themselves participate in the construction of yet more complex features. Conversely, given the state vector of a complex feature, constraints are imposed on the state of its constituent features, and the process recurses until image-level primitives are reached. The image-level primitives search for features in the neighborhood of their expected locations, which produces a new state vector, and the cycle repeats.

The primitive feature tracking algorithms of X Vision are optimized to satisfy two goals: efficiency on scalar processors and high accuracy. The latter implies accuracy on a quantitative scale (exact feature state computation) as well as qualitative accuracy of feature matching. These goals are met largely through two important attributes of X Vision. First, efficiency and robustness are attained through the use of highly optimized predictive image processing for spatially localized features. In particular, we rely strongly on the idea of image warping to reduce image-level computations to local perturbations on a nominal feature
appearance. Second, the state-based representation for features supports hierarchical imposition of geometric and/or physical constraints on feature evolution in a straightforward and logical manner. Thus, tracking systems can be composed from existing components and combined with application-specific constraints quickly and cleanly.

In addition to efficiency and accuracy, X Vision has been constructed to be modular and to simplify the process of embedding vision into applications. To this end, X Vision incorporates a data abstraction mechanism that disassociates the information carried in the feature state from the tracking mechanism used to acquire it. This facilitates the construction of application-independent tracking packages which can be quickly connected to special-purpose tracking mechanisms.
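To make this organization concrete, the following is a minimal C++ sketch of the basic/composite feature abstraction. It is not the actual X Vision interface; all class and method names here are hypothetical.

    #include <vector>

    // A state vector is simply a list of real-valued parameters that
    // completely describes a feature's position and appearance.
    using StateVector = std::vector<double>;

    // Hypothetical base class: anything that owns a state vector.
    class Feature {
    public:
        virtual ~Feature() = default;
        const StateVector& state() const { return state_; }
    protected:
        StateVector state_;
    };

    // A basic feature searches the image near its predicted location
    // and updates its own state from pixel data.
    class BasicFeature : public Feature {
    public:
        virtual void detect(/* const Image& frame */) = 0;
    };

    // A composite feature computes its state from subsidiary features
    // ("bottom-up") and may impose constraints on them ("top-down").
    class CompositeFeature : public Feature {
    public:
        explicit CompositeFeature(std::vector<Feature*> children)
            : children_(std::move(children)) {}
        virtual void computeState() = 0;      // from the children's states
        virtual void imposeConstraints() = 0; // onto the children's states
    protected:
        std::vector<Feature*> children_;
    };

A complete cycle then consists of pushing constraints down such a hierarchy, running detection in the basic features, and recomputing composite states on the way back up, as detailed in Section 3.2.2.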

3.2.1 Image-Level Feature Tracking

Any tracking application can be considered to be a control system for maintaining a "focus of attention" or "region of interest." Tracking a feature means that the region of interest maintains a fixed, pre-defined relationship (e.g., containment) to the feature. In X Vision, a region of interest is referred to as a window. Fundamentally, the goal of low-level processing is to process the pixels within a window using a minimal number of addressing operations, bus transfer cycles, and arithmetic operations.

The key idea for increasing the performance of low-level image processing is the use of warped images. Warped images are windows geometrically distorted using feature state information so as to present a feature in a canonical configuration. In X Vision, windows are parameterized as affine, possibly subsampled, distortions of a rectangular sub-image. We define acquiring a window to be the process of transferring and warping the pixels of a window. Window acquisition can be implemented quickly using ideas borrowed from graphics for fast rendering of lines and boxes [13]. For example, Figure 3.1 (left) shows a rotated rectangular window in a larger image, and Figure 3.1 (right) shows the warped image. Note that the line appears roughly horizontal in window coordinates.

The principal advantage of image warping is that it provides a conceptual and computational framework for propagating high-level temporal and geometric information to the image-processing level, where it is exploited to accelerate image-level computations. For example, consider locating a simple straight edge with known orientation within an image region. Previous feature-tracking methods essentially apply standard edge detection followed by a matching step [5]. This approach involves convolving the region with two oriented filters, computing local gradient directions, and finally locating a set of image locations having gradient direction "close" to the expected orientation and which form a straight line consistent with that orientation. Note that the orientation and linearity constraints appear relatively late in the detection process. A much more effective approach is to exploit these constraints at the outset by utilizing an oriented detector for straight edges, as described later in this section. In general, geometric and temporal constraints simplify image processing by supplying a change of coordinates on the image regions. Image warping can be optimized to perform this change of coordinates quickly, and to apply it exactly once for each pixel.

An additional advantage of separating the change of coordinates from the image processing is modularization. On one hand, the same types of coordinate
transforms, e.g., rigid transformations, occur over and over again, so the same warping primitives can be reused. On the other hand, various types of warping can be used to normalize features so that the same accelerated image processing can be applied over and over again. For example, quadratic warping could be used to locally "straighten" a curved edge so that the optimized straight-edge detection strategy can be applied.

Window-based image processing succeeds because of reliable, if approximate, predictions for feature state evolution. Provided that we know bounds on feature dynamics, feature tracking is more a problem of accurate configuration adjustment than of repeated feature detection. Our low-level feature trackers are based on this idea: our edge "detector," for example, detects only edges which are within the search window and which are also within a narrow range of orientations. In a sense, the principal difficulty of real-time vision (processing temporally adjacent images) also provides constraints which simplify computations.

The final step in determining a new state configuration for a feature is to determine if a "match" occurs between the previous appearance of the feature and the most recently acquired window. In order to ensure bounded real-time performance, this matching operation must be fast and consume a bounded amount of time. Our philosophy has been to assume that feature state prediction specifies the conditions for a match. If these predictions are confirmed uniquely, a match is declared. If the match is ambiguous, the ambiguity is reported and no match is declared. This follows the design criteria of maximizing modularity and minimizing the risk of reporting incorrect information to the application utilizing tracking.

The low-level features currently available in X Vision include solid or broken contrast edges detected using several variations on standard edge detection, general grey-scale patterns tracked using SSD methods [3, 21, 40], and a variety of color and motion-based primitives used for initial detection of objects and subsequent match disambiguation. The remainder of this section describes how edge tracking and correlation-based tracking have been incorporated into X Vision. In the sequel, all timing figures were taken on an SGI Indy workstation equipped with a 175 MHz R4400SC processor and an SGI VINO digitizing system.
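As an illustration of window acquisition, the sketch below warps a rotated rectangular region into a canonical, axis-aligned window by visiting each output pixel exactly once. The types and names are illustrative assumptions, not X Vision's API, and nearest-neighbour sampling stands in for whatever interpolation a real implementation would use.

    #include <algorithm>
    #include <cmath>
    #include <vector>

    struct Image {
        int width = 0, height = 0;
        std::vector<unsigned char> pixels;      // row-major grey values
        unsigned char at(int x, int y) const {  // clamped pixel access
            x = std::max(0, std::min(x, width - 1));
            y = std::max(0, std::min(y, height - 1));
            return pixels[y * width + x];
        }
    };

    // Acquire a w x h window centred at (cx, cy) and rotated by 'angle',
    // optionally subsampled by an integer factor.
    Image acquireWindow(const Image& src, double cx, double cy,
                        double angle, int w, int h, int subsample = 1) {
        Image out;
        out.width = w / subsample;
        out.height = h / subsample;
        out.pixels.resize(out.width * out.height);
        const double c = std::cos(angle), s = std::sin(angle);
        for (int v = 0; v < out.height; ++v) {
            for (int u = 0; u < out.width; ++u) {
                // window coordinates relative to the window centre
                double wx = (u * subsample) - w / 2.0;
                double wy = (v * subsample) - h / 2.0;
                // rotate into framebuffer coordinates
                double ix = cx + c * wx - s * wy;
                double iy = cy + s * wx + c * wy;
                out.pixels[v * out.width + u] =
                    src.at(static_cast<int>(std::lround(ix)),
                           static_cast<int>(std::lround(iy)));
            }
        }
        return out;
    }

An edge predicted to lie at orientation 'angle' in the image then appears in a canonical orientation in the returned window, which is exactly the situation the edge detector of the next subsection assumes.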

Edges

Occluding contours and contrast edges are the basis of many tracking applications. X Vision provides a tracking mechanism for linear edge segments of arbitrary length. The state of an edge segment is characterized by its position and orientation in framebuffer coordinates as well as its filter response. Given prior state information, edge tracking proceeds along the lines described above. Image warping is used to acquire a window which, if the prior estimate is correct, contains an edge which is vertical within the warped window.

Detecting a straight, vertical contrast step edge can be thought of as a series of one-dimensional detection problems. Assuming the edge is vertical in the window, convolving each row of the window with a derivative-based kernel will produce an aligned series of response peaks. These responses can be superimposed by summing down the columns of the window. Finding the maximum value of this response function localizes the edge. Performance can be improved by noting that the order of the convolution and summation steps can be commuted.


Figure 3.1. On the left, a sample image showing a reference line on the waist of the robot, and on the right the image associated with the window as it appears in window local coordinates.

Thus, in an n × m window, edge localization with a convolution mask of width k can be performed with just m(n + k) additions and mk multiplications. In fact, we often use an FIR filter composed of a series of negative ones, one or more zeros, and a series of positive ones, which can be implemented using only m(n + 4) additions.

The detection scheme described above requires orientation information to function correctly. If this information cannot be supplied from "higher-level" geometric constraints, it is estimated as follows. As the orientation of the acquisition window rotates relative to the edge, the response of the filter drops sharply. Thus, edge orientation can be computed by sampling at least three orientations and interpolating the filter responses to determine the orientation maximizing the response. Implementing this scheme directly would be wasteful because the three acquisition windows would overlap, causing many pixels to be transferred and warped three times. Instead, an expanded window at the predicted orientation is acquired, and the summation step is performed along the columns and along two diagonal paths corresponding to two bracketing orientation offsets. For small perturbation angles, this closely mirrors the effect of performing the convolution at three different orientations. Quadratic interpolation of the three curves is used to estimate the orientation of the underlying edge. In the ideal case, if the convolution template is symmetric and the response function after superposition is unimodal, the horizontal displacement of the edge should agree between all three filters. In practice, the estimate of edge location will be biased. For this reason, edge location is computed as the weighted average of the edge locations of all three peaks. If additional localization accuracy is required, a second-derivative operator can be applied in the local neighborhood of the detected edge at the computed orientation, and the zero-crossing used to compute edge position to sub-pixel accuracy.

Even though the edge detector described above is quite selective, as the edge segment moves through clutter, we can expect multiple local maxima to appear in the convolution output. The "best" treatment of edge ambiguity is typically dependent on both the application and the environment. By default, X Vision declares a match if a unique local maximum exists within an interval about the response value stored in the state. If a match is found, the interval for the next tracking cycle is chosen as a fraction of the difference between the matched response value and the next closest response.


    Line (Length, Width)    Length Sampling
                            Full     1/2     1/4
    20, 20                  0.44    0.31    0.26
    40, 20                  0.73    0.43    0.29
    40, 40                  1.33    0.69    0.49

Figure 3.2. Timing in milliseconds for one iteration of tracking an edge segment on an SGI Indy with a 175 MHz R4400SC processor.

This scheme makes it extremely unlikely that mistracking due to incorrect matching will occur. Such an event could happen only if some distracting edge of the correct orientation and response moved into the tracking window just as the desired edge changed response or moved out of the tracking window. The value of the threshold determines how selective the filter is. A narrow match band implicitly assumes that the edge response remains constant over time. In environments with changing backgrounds this may be a bad assumption. Other possibilities include matching on the brightness of the "foreground" object as described in Section 3.3, or matching based on nearness to an expected location passed from a higher-level object. Experimental results on line tracking using various match functions can be found in [41].

Tracking robustness can be increased by making edge segments as long as possible. Long segments are less likely to become completely occluded, and changes in the background tend to affect a smaller proportion of the segment, with a commensurately lower impact on the filter response. On long edge segments, speed is maintained by subsampling the window in the direction of the edge segment. Within our warping scheme, subsampled windows can be acquired as efficiently as full-resolution windows. Likewise, the maximum edge motion between images can be increased by subsampling in the horizontal direction. In this case, the accuracy of edge localization drops and the possibility of an ambiguous match increases.

Figure 3.2 shows timings for simple edge tracking that were obtained during test runs. Because of the way edge detection is performed, processing time grows sublinearly with edge length. For example, moving from 20-pixel edges to 40-pixel edges results in only a 65% increase in time. Also, tracking a 40-pixel segment at half resolution takes the same amount of time as a 20-pixel segment at full resolution, verifying that subsampling does not decrease processing performance. Finally, if we consider 20-pixel lines at a fixed resolution, we see that it is possible to track 33.33/0.44 ≈ 75 edge segments simultaneously at frame rate.
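The column-summation trick described above reduces edge localization to a one-dimensional problem. The sketch below assumes the warped window has already been acquired with the edge nominally vertical; the function name, data layout, and the single-zero kernel centre are illustrative choices, and orientation bracketing and sub-pixel interpolation are omitted.

    #include <cstddef>
    #include <cstdlib>
    #include <vector>

    // window: m rows x n columns of grey values, indexed window[row][col],
    // with the edge nominally vertical.  Returns the column at which the
    // step-edge response peaks.
    std::size_t locateVerticalEdge(const std::vector<std::vector<int>>& window,
                                   std::size_t halfKernel) {
        const std::size_t rows = window.size();
        const std::size_t cols = window.empty() ? 0 : window[0].size();

        // 1. Sum down each column first (summation and convolution commute).
        std::vector<long> colSum(cols, 0);
        for (std::size_t r = 0; r < rows; ++r)
            for (std::size_t c = 0; c < cols; ++c)
                colSum[c] += window[r][c];

        // 2. Convolve the 1-D profile with [-1, ..., -1, 0, +1, ..., +1]
        //    and keep the column of maximum absolute response.
        std::size_t best = halfKernel;
        long bestResponse = 0;
        for (std::size_t c = halfKernel; c + halfKernel < cols; ++c) {
            long response = 0;
            for (std::size_t k = 1; k <= halfKernel; ++k)
                response += colSum[c + k] - colSum[c - k];
            if (std::labs(response) > std::labs(bestResponse)) {
                bestResponse = response;
                best = c;
            }
        }
        return best;
    }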

Region-Based Tracking In region-based tracking, we consider matching a pre-defined "reference" window to a region of the image. The reference is either a region taken from the scene itself, or a pre-supplied "target" template. It is assumed throughout that the surface patch corresponding to the region of interest is roughly planar and that its projection is relatively small compared to the image as a whole, so that perspective effects are minimal. Under these circumstances, the geometric distortions of a region can be modeled using an affine transformation. The
state vector for our region tracker includes these six geometric parameters, two additional parameters that describe the brightness and contrast change between the reference image and the most recent image, and a value which indicates how well the reference image and the current image region match.

The tracking cycle for a region is to first use the state information of the reference region to acquire and warp a prospective image region. Once this transformation is performed, computing the remaining difference between the reference and the prospective image is posed as an optimization problem similar to that originally proposed by Lucas and Kanade for stereo matching [29]. Although other authors have utilized optimization-based SSD for tracking, most have computed just translation between adjacent frames, essentially the local optical flow. However, for accurate tracking, it is essential that the initial reference frame be used throughout the image sequence. Tomasi and Shi [33] describe a scheme where the affine structure is computed in an iterative manner for the purposes of feature monitoring; however, only inter-frame translation is actually used for tracking.

The primary difficulty in working with affine structure lies in the fact that most images do not fully determine all six geometric parameters. Consider, for example, a window placed on a right-angle corner. A pure translation of the corner can be accounted for as either translation, scaling, or a linear combination of both, effectively an extension of the well-known aperture problem to scale and translation. The solution implemented in X Vision is to represent the affine system as discrete parameter groups representing translation, rotation, scaling and shear, and to solve for the parameter groups incrementally. This essentially establishes "preferences" for interpreting image changes. These preferences can be changed as desired for a particular application. Although less accurate than a simultaneous solution, the small amount of distortion between temporally adjacent images makes this solution method sufficiently precise for most applications. Once the affine distortion has been estimated, the results are integrated to provide a new estimate of the patch geometry and appearance for the next tracking cycle.

The process proceeds as follows. Let I(x, t) denote the value of the pixel at location x = (x, y)^T at time t in an image sequence. Consider a planar surface patch undergoing rigid motion observed under orthographic projection. At time t, the surface projects to an image region with a spatial extent represented as a set of image locations, W. At some later point, t + \tau, the region projects to an affine transformation of the original region. If illumination remains constant, the geometric relationship between the projections can be recovered by minimizing the following objective function:

    O(A, d) = \sum_{x \in W} ( I(Ax + d, t + \tau) - I(x, t) )^2 w(x), \quad \tau > 0,        (3.1)

where d = (u, v)^T, x = (x, y)^T, A is an arbitrary positive definite 2 × 2 matrix, and w(\cdot) is an arbitrary weighting function. A common problem with this approach is that the image brightness and contrast are unlikely to remain constant, which may bias the results of the optimization. The solution is to normalize images to have zero first moment and unit second moment. We note that with these modifications, solving (3.1) for rigid motions (translation and rotation) is equivalent to maximizing normalized correlation.

Suppose that a solution at time t + \tau, (A_\tau, d_\tau), is known. If we define the warped image J(x, t) = I(A_\tau x + d_\tau, t + \tau), subsequent deviations can then be expressed as small
variations on the warped image. Recall that a linear transformation can be decomposed into a rotation, two orthogonal scaling components and a shear component. For small changes, we can write these components as a differential rotation I + \mathrm{offdiag}(-\theta, \theta), a scaling matrix I + \mathrm{diag}(s_x, s_y), and a shear matrix I + \mathrm{offdiag}(\gamma, 0), so that A = (I + \mathrm{offdiag}(-\theta, \theta))(I + \mathrm{diag}(s_x, s_y))(I + \mathrm{offdiag}(\gamma, 0)). Multiplying and dropping higher-order terms allows us to write A \approx (I + \delta A), where \delta A = \mathrm{offdiag}(-\theta, \theta) + \mathrm{diag}(s_x, s_y) + \mathrm{offdiag}(\gamma, 0). Applying this observation to (3.1), we have

    O(d, \theta, s, \gamma) = \sum_{x \in W} ( J((I + \delta A)x + d, t + \delta t) - I(x, t) )^2 w(x), \quad \delta t > 0.        (3.2)

Linearizing J at the point (x, t) yields

    O(d, \theta, s, \gamma) = \sum_{x \in W} ( J(x, t) + (J_x(x, t), J_y(x, t)) \cdot (\delta A\, x + d) + J_t(x, t)\,\delta t - I(x, t) )^2 w(x), \quad \delta t > 0,        (3.3)

where (J_x, J_y)^T and J_t are the image spatial and temporal derivatives, respectively. If the solution at \tau was the correct one, then J(x, t) = I(x, t) and J_t(x, t)\,\delta t \approx J(x, t + \delta t) - J(x, t) = J(x, t + \delta t) - I(x, t). With these observations, we can simplify (3.3) and rewrite it in terms of the spatial derivatives of the reference image, yielding

    O(d, \theta, s, \gamma) = \sum_{x \in W} ( (I_x(x, t), I_y(x, t)) \cdot (\delta A\, x + d) + J_t(x, t)\,\delta t )^2 w(x), \quad \delta t > 0.        (3.4)

The parameter groups are then solved for individually. First, we ignore \delta A and solve (3.4) for d by taking derivatives and setting up a linear system in the unknowns. This system can be compactly expressed if we define I_x and I_y to be the spatial derivatives of the reference image arranged as vectors indexed by x. Then we define

    g_x(x) = I_x(x)\sqrt{w(x)},        (3.5)
    g_y(x) = I_y(x)\sqrt{w(x)},        (3.6)
    h_0(x) = J_t(x)\sqrt{w(x)},        (3.7)

and the linear system for computing translation is

    \begin{bmatrix} g_x \cdot g_x & g_x \cdot g_y \\ g_x \cdot g_y & g_y \cdot g_y \end{bmatrix} d
    = \begin{bmatrix} h_0 \cdot g_x \\ h_0 \cdot g_y \end{bmatrix}.        (3.8)

Since the spatial derivatives are computed only from the original reference image, g_x, g_y, and the inverse of the matrix on the left-hand side are constant over time and can be computed offline. Thus, during tracking, computing the solution to this system requires a vector difference and two inner products. In terms of computing time, these operations dominate the time needed to compute a solution. Once d is known, the image differences are adjusted according to the vector equation

    h_1 = h_0 - g_x u - g_y v.        (3.9)
If the image distortion arises from pure translation and no noise is present, then we expect that h_1 = 0 after this step. Any remaining residual can be attributed to linearization error, noise, or other geometric distortions. The solutions for the remaining parameters operate on h_1 in a completely analogous fashion. For image rotation, define g_r(x) = g_x(x)\,y - g_y(x)\,x and compute

    \theta = (h_1 \cdot g_r)/(g_r \cdot g_r),        (3.10)
    h_2 = h_1 - g_r\,\theta.        (3.11)

For image scaling, define

    g_{sx}(x) = g_x(x)\,x,        (3.12)
    g_{sy}(x) = g_y(x)\,y,        (3.13)

and solve

    \begin{bmatrix} g_{sx} \cdot g_{sx} & g_{sx} \cdot g_{sy} \\ g_{sx} \cdot g_{sy} & g_{sy} \cdot g_{sy} \end{bmatrix} s
    = \begin{bmatrix} h_2 \cdot g_{sx} \\ h_2 \cdot g_{sy} \end{bmatrix}.        (3.14)

The image differences are adjusted as

    h_3 = h_2 - g_{sx}\,s_x - g_{sy}\,s_y.        (3.15)

Finally, for image shear define g_\gamma(x) = g_x(x)\,y and compute

    \gamma = (h_3 \cdot g_\gamma)/(g_\gamma \cdot g_\gamma),        (3.16)
    h_4 = h_3 - g_\gamma\,\gamma.        (3.17)

In the last two cases, the left-hand side can remain constant only if the image warping operation includes fractional scaling. Including image scaling necessitates a resampling of the image, which is based on bilinear interpolation. After all relevant stages of processing are complete, h_4 \cdot h_4 is stored as the match value in the tracker state vector.

If we consider the complexity of tracking in terms of image vector operations (these calculations dominate any other operations, with the exception of fractional scaling and shear), we see that there is a fixed overhead of one vector difference and multiply to compute the weighted temporal derivatives. Each parameter computed requires an inner product, a multiply and an addition. Computing the final residual or "match value" consumes an additional image inner product. By comparison, solving for all parameters simultaneously would require only six inner products. However, calculating the match value would require an additional six multiplies and sums, and so the two methods are computationally equivalent. In addition to parameter estimation, the initial brightness and contrast compensation consume three vector sums and two vector multiplies. Computing the weighted temporal derivative consumes another vector sum and multiply. Image fractional scaling, if required, consumes effectively three image multiplies and six image additions.

In order to guarantee tracking of motions larger than a fraction of a pixel, these calculations must be carried out at varying levels of resolution.
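To make the per-frame arithmetic concrete, here is a sketch of the translation step, equations (3.5)-(3.9), in C++. It assumes the weighted reference gradients g_x and g_y and the inverse of the 2 × 2 matrix in (3.8) have been precomputed offline, so the only online work is a vector difference and a few inner products. The names are illustrative, not X Vision's.

    #include <cstddef>
    #include <vector>

    using Vec = std::vector<double>;   // an image region flattened to a vector

    static double dot(const Vec& a, const Vec& b) {
        double s = 0.0;
        for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
        return s;
    }

    struct TranslationStep {
        Vec gx, gy;          // weighted reference gradients (precomputed)
        double inv[2][2];    // inverse of the 2x2 matrix in (3.8) (precomputed)

        // h enters as h0, the weighted temporal difference between the warped
        // current window and the reference.  Computes (u, v) per (3.8) and
        // reduces h to h1 per (3.9).
        void solve(Vec& h, double& u, double& v) const {
            const double bx = dot(h, gx), by = dot(h, gy);
            u = inv[0][0] * bx + inv[0][1] * by;
            v = inv[1][0] * bx + inv[1][1] * by;
            for (std::size_t i = 0; i < h.size(); ++i)
                h[i] -= gx[i] * u + gy[i] * v;   // h1 = h0 - gx*u - gy*v
        }
    };

The rotation, scale, and shear groups are handled in the same way on the successively reduced residuals h_1, h_2, and h_3, as in (3.10)-(3.17).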

                          40 × 40         60 × 60         80 × 80         100 × 100
    Size / Reduction      4      2        4      2        4      2        4       2
    Trial A             0.8    1.7      1.7    3.9      3.2    7.3      5.0    12.7
    Trial B             3.2    7.2      7.3   13.0     13.0   23.6     20.3    37.0
    Trial C             1.5    5.6      3.2    9.5      6.1   17.7      9.4    28.3
    Trial D             2.9    4.1      6.4    9.2     11.1   16.9     17.5    27.9
    Trial E             2.8    3.8      6.2    8.6     10.8   15.8     17.5    26.0
    Trial F             3.7    8.4      8.1   15.6     14.4   28.5     22.5    43.1

Figure 3.3. The time in milliseconds consumed by one cycle of tracking for various instantiations of the SSD tracker. Trial A computed translation with aligned, unscaled images. Trial B computed translation with unaligned, scaled images. Trial C computed rotation with rotated images. Trial D computed scale with scaled images. Trial E computed shear with sheared images. Trial F computed all affine parameters under affine distortions.

For this reason, a software reduction of resolution is carried out at the time of window acquisition. All of the above calculations except for image scaling are computed at the reduced resolution. Resolution reduction effectively consumes a vector multiply and k image additions, where k is the reduction factor. The tracking algorithm changes the resolution adaptively based on image motion. If the inter-frame motion exceeds 0.25k, the resolution for the subsequent step is halved. If the inter-frame motion is less than 0.1k, the resolution is doubled. This leads to a fast algorithm for tracking fast motions and a slower but more accurate algorithm for tracking slower motions.

To get a sense of the time consumed by these operations, we consider several test cases as shown in Figure 3.3. The first two rows of this table show the timings for various sizes and resolutions when computing only translation. The trials differ in that the second required full linear warping to acquire the window. The subsequent rows present the time to compute translation and rotation while performing rotational warping, the time to compute scale while performing scale warping, the time to compute shear while performing shear warping, and the time to compute all affine parameters while performing linear warping.

There are a few interesting points to notice in the data. First, it is clear that, for large images, image warping is the dominant factor in timing. As expected, reducing resolution increases the tracking speed. With the exception of two cases under 100 × 100 images at half resolution, all updates require less than one frame time (33.33 ms) to compute. To get a sense of the effectiveness of image warping, Figure 3.4 shows several images of a box as a 100 × 100 region on its surface was tracked at one-fourth resolution. The lower series of images is the warped image which is the input to the SSD updating algorithm. We see that except for minor variations, the warped images are identical despite the radically different poses of the box.
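The adaptive resolution rule is simple enough to state directly in code; the thresholds below are the ones quoted above, while the function name and the cap on the reduction factor are assumptions.

    // k is the current reduction factor (1 = full resolution).
    // motion is the inter-frame displacement in full-resolution pixels.
    int updateReduction(int k, double motion, int maxReduction = 8) {
        if (motion > 0.25 * k && k < maxReduction)
            return k * 2;   // moving fast: coarser window, faster tracking
        if (motion < 0.10 * k && k > 1)
            return k / 2;   // moving slowly: finer window, more accuracy
        return k;
    }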


Figure 3.4. Several images of a planar region and the corresponding warped image used by the tracker. The image at the left is the initial reference image.

3.2.2 Networks of Features

We define composite features to be features that compute their state from other basic and composite features. We consider two types of feature composition. In the first case, information flow is purely "bottom-up." Features are composed solely in order to compute information from their state vectors, without altering their tracking behavior. For example, given two point features it may be desirable to present them as the line feature passing through both. A feature (henceforth, feature refers to both basic and composite features) can participate in any number of such constructions.

In the second case, the point of performing feature composition is to exploit higher-level geometric constraints in tracking as well as to compute a new state vector. In this case, information flows both "upward" and "downward." To illustrate, recall that the geometric portion of the state of an edge tracker is the vector L = (x, y, \theta)^T describing the location of a window centered on the contour and oriented along it. The low-level feature detection methods described in Section 3.2.1 compute an offset t normal to the edge and an orientation offset \delta\theta. Given these values, the state of the contour tracker is updated according to the following equation:

    L = L^- + \begin{bmatrix} -t\,\sin(\theta + \delta\theta) \\ t\,\cos(\theta + \delta\theta) \\ \delta\theta \end{bmatrix}.        (3.18)

There is an aperture problem here: the state vector L is not fully determined by the information returned from feature detection. There is nothing to keep the window from creeping "along" the contour that it is tracking. For this reason, the edge-tracking primitive almost always participates in a composite feature that imposes constraints on its state.

One example is a feature tracker for the intersection of two non-collinear contours. This composite feature has a state vector C = (x, y, \theta, \phi)^T describing the position of the intersection point, the orientation of one contour, and the orientation difference between the two contours. From image contours with state L_1 = (x_1, y_1, \theta_1)^T and L_2 = (x_2, y_2, \theta_2)^T,
the distance from the center of each tracking window to the point of intersection of the two contours can be computed as

    \lambda_1 = ((x_2 - x_1)\sin\theta_2 - (y_2 - y_1)\cos\theta_2) / \sin(\theta_2 - \theta_1),
    \lambda_2 = ((x_2 - x_1)\sin\theta_1 - (y_2 - y_1)\cos\theta_1) / \sin(\theta_2 - \theta_1).

The state of a corner C = (x_c, y_c, \theta_c, \phi_c) is calculated as:

    x_c = x_1 + \lambda_1 \cos\theta_1,        (3.19)
    y_c = y_1 + \lambda_1 \sin\theta_1,
    \theta_c = \theta_1,
    \phi_c = \theta_2 - \theta_1.

Given a fixed intersection point, we can now choose "setpoints" \lambda_1 and \lambda_2 describing where to position the contour windows relative to the intersection point. With this information, the states of the individual contours can be adjusted as follows:

    x_i = x_c - \lambda_i \cos\theta_i, \qquad y_i = y_c - \lambda_i \sin\theta_i,        (3.20)

for i = 1, 2. Choosing \lambda_1 = \lambda_2 = 0 defines a cross pattern. If the window extends h pixels along the contour, choosing \lambda_1 = \lambda_2 = h/2 defines a corner. Choosing \lambda_1 = 0 and \lambda_2 = h/2 defines a tee junction, and so forth.

A complete tracking cycle for this system starts by imposing the constraints of (3.20) "downward" in order to make the initial state of the contours consistent. Image-level feature detection is then performed, and finally information is propagated "upward" by computing (3.18) followed by (3.19).

More generally, we define a feature network to be a set of nodes connected by two types of directed arcs, referred to as up-links and down-links. Nodes represent basic and composite features. Up-links represent the information dependency between a composite feature and the features used to compute its state. Thus, if a node is a source node with respect to up-links, it must be a basic feature. If a node has incoming up-links, it must be a composite feature. If a node n has incoming up-links from nodes m_1, m_2, ..., m_k, the latter are called subsidiary nodes of n. Down-links represent the imposition of constraints or other high-level information on features. A node that is a source for down-links is a top-level node. Each node may have no more than one incoming down-link. The node from which this link originates is referred to as the supersidiary node of that feature. All directed paths along up-links or down-links in a feature graph must be acyclic. We also require that every top-level feature that is path-connected to some basic feature by down-links must be path-connected to the same basic feature via up-links.

For example, a corner is a graph with three nodes. The corner feature is a top-level feature. The two contours which compose it are subsidiary features. There are both up-links and down-links between the corner node and the contour nodes.

Given this terminology, we can now define a complete tracking cycle to consist of: 1) traversing the down-links from each top-level node, applying state constraints until basic
features are reached; 2) applying low-level detection in every basic feature; and 3) traversing the up-links of the graph, computing the state of composite features. State prediction can be added to this cycle by including it in the downward propagation. Thus, a feature tracking system is completely characterized by the topology of the network, the identity of the basic features, and a state computation and constraint function for each non-basic feature node. Composite features that have been implemented within this scheme range from simple edge intersections as described above, to Kalman snakes (Section 3.3), to three-dimensional model-based tracking using pose estimation [28], as well as a variety of more specialized object trackers, some of which are described in Section 3.3.
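A sketch of this cycle in C++ follows. The node structure and function names are hypothetical; it exploits the fact that each node has at most one incoming down-link, so the down-link structure is a forest, and it assumes (as required above) that the subsidiaries a composite reads in phase 3 are the same nodes it constrains in phase 1.

    #include <vector>

    struct Node {
        bool isBasic = false;
        std::vector<Node*> upLinks;    // subsidiary features feeding this node
        std::vector<Node*> downLinks;  // features this node constrains

        void imposeConstraints() { /* write constraints into subsidiaries */ }
        void detect()            { /* image-level search (basic nodes only) */ }
        void computeState()      { /* combine subsidiary states */ }
    };

    // One complete tracking cycle, starting from the top-level nodes.
    void trackCycle(const std::vector<Node*>& topLevel) {
        // Collect nodes in a top-down order over the down-link forest.
        std::vector<Node*> stack(topLevel.begin(), topLevel.end());
        std::vector<Node*> order;
        while (!stack.empty()) {
            Node* n = stack.back(); stack.pop_back();
            order.push_back(n);
            for (Node* child : n->downLinks) stack.push_back(child);
        }
        // 1. Propagate constraints (and predictions) down to basic features.
        for (Node* n : order) n->imposeConstraints();
        // 2. Run low-level detection in every basic feature.
        for (Node* n : order)
            if (n->isBasic) n->detect();
        // 3. Recompute composite states bottom-up (reverse of the traversal).
        for (auto it = order.rbegin(); it != order.rend(); ++it)
            if (!(*it)->isBasic) (*it)->computeState();
    }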

3.2.3 Feature Typing

In order to make feature composition simpler and more generic, we have included polymorphic type support in the tracking system. Briefly, each feature, basic or composite, carries a type. This type identifies the geometric or physical information contained in the state vector of the feature. For example, there are point features which carry location information and line features which carry orientation information. Any composite feature can specify the type of its subsidiary features and can itself carry a type. In this way, the construction becomes independent of the manner in which its subsidiary nodes compute information. So, for example, a line feature can be constructed from two point features by computing the line that passes through the features, and a point feature can be computed by intersecting two line features. An instance of the intersection-based point feature can be instantiated either from edges detected in images or from line features that are themselves computed from point features.

Feature typing is also polymorphic. For example, tracking two corresponding points in two images yields a stereo point feature. Likewise, tracking two line features in two images yields a stereo line feature. More generally, tracking two x's in two images yields a stereo x.
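As a small illustration of typed, bottom-up composition, the sketch below builds a line state from two point states and a point state from two line states; the structs and the implicit-line representation are assumptions made for the example.

    struct PointState { double x, y; };
    struct LineState  { double a, b, c; };   // line a*x + b*y + c = 0

    // Line feature through two point features.
    LineState lineFromPoints(const PointState& p, const PointState& q) {
        LineState l;
        l.a = q.y - p.y;
        l.b = p.x - q.x;
        l.c = -(l.a * p.x + l.b * p.y);
        return l;
    }

    // Point feature at the intersection of two line features.
    PointState pointFromLines(const LineState& l1, const LineState& l2) {
        const double det = l1.a * l2.b - l2.a * l1.b;   // assumed non-zero
        return { (l1.b * l2.c - l2.b * l1.c) / det,
                 (l2.a * l1.c - l1.a * l2.c) / det };
    }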

3.3 Applications

In this section, we present several applications of X Vision which illustrate how the tools it provides (particularly image warping, image subsampling, constraint propagation, and typing) can be used to build fast and effective tracking systems. Additional results can be found in [15, 17, 18, 21, 28, 41].

3.3.1 Pure Tracking

Face Tracking A frontal view of a human face is sufficiently planar to be tracked as a

single SSD region. However, tracking a face as a single region necessitates computation of full affine transformations at the image level. Figure 3.5 shows several image pairs illustrating poses of a face and the warped image resulting from tracking. Despite the fact that the face is nonplanar, resulting for example in a stretching of the nose as the face is turned, the tracking is quite effective. However, it is somewhat slow (about 40 milliseconds per iteration), and it can be confused if the face undergoes nonrigid distortions which cannot be


Figure 3.5. Above, several images of a face; below, the corresponding warped images used by the tracking system.

Figure 3.6. The tracking network used for face tracking (a FACE node composed of EYES and MOUTH; EYES composed of two EYE trackers; each EYE and the MOUTH are MultiSSD trackers, each built from two SSD trackers).

easily captured using SSD methods, and it is sensitive to lighting variations and shadowing across the face. Also, many areas of the face contain no strong gradients, which implies that they do not contribute substantially to the SSD state computation. This suggests that a more efficient and robust method would be to concentrate on the areas of high contrast. We have constructed a face tracker that relies on using SSD trackers at the regions of high contrast (the eyes and mouth), which provides much higher performance as well as the ability to "recognize" changes in the underlying features. The tracker is organized as shown in Figure 3.6. We first create a MultiSSD composite tracker which performs an SSD computation for multiple reference images. The state vector of the MultiSSD tracker is the state of the subsidiary tracker with the lowest (best) match value, together with the identity of that tracker. The constraint function propagates this state to all subsidiary features, with the result that the "losing" SSD trackers follow the "winner." In effect, the MultiSSD feature is a tracker which includes an n-ary switch. From MultiSSD we derive an Eye tracker which modifies the display function of MultiSSD


Figure 3.7. The "clown face" tracker.

to show an open or closed eye based on the state of the binary switch. We also derive Mouth, which similarly displays an open or closed mouth. Two Eye's are organized into the Eyes tracker. The state computation function computes the location of the center of the line joining the eyes and its orientation. The constraint function propagates the orientation to the low-level trackers, obviating the need to compute orientation from image-level information. Thus, the low-level Eye trackers only solve for translation. The Mouth tracker computes both translation and orientation. The Face tracker is then a combination of Eyes and Mouth. It imposes no constraints on them; however, its state computation function computes the center of the line joining the Eyes tracker and the Mouth, and its display function paints a nose there. The tracker is initialized by clicking on the eyes and mouth, and memorizing their appearance when they are closed and open. When run, the net effect is a graphical display of a "clown face" that mimics the antics of the underlying human face: the mouth and eyes follow those of the operator and open and close as the operator's do, as shown in Figure 3.7. This system requires less than 10 milliseconds per iteration, disregarding graphics.
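The "n-ary switch" behavior of MultiSSD can be summarized with a small sketch. The interfaces below are hypothetical and differ from the actual MultiSSD class; they simply mirror the description above.

    // Sketch of the "n-ary switch" behavior described above (hypothetical
    // interfaces; the real MultiSSD class differs).  Each subsidiary SSD
    // tracker matches one reference image; the composite adopts the state of
    // the best match and forces the losers to follow it.
    #include <vector>
    #include <limits>

    struct SSDTracker {
        double x = 0, y = 0;        // window position (state)
        double residual = 0;        // match value from the last SSD computation
        int    id = 0;              // which reference image this tracker uses
    };

    struct MultiSSD {
        std::vector<SSDTracker> refs;
        double x = 0, y = 0;
        int winner = -1;            // identity of the best-matching reference

        // Up-link: state = state of the tracker with the lowest match value.
        void compute_state() {
            double best = std::numeric_limits<double>::max();
            for (const auto& t : refs)
                if (t.residual < best) { best = t.residual; x = t.x; y = t.y; winner = t.id; }
        }
        // Down-link: the "losing" trackers follow the "winner".
        void state_propagate() {
            for (auto& t : refs) { t.x = x; t.y = y; }
        }
    };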

Disk Tracking One important application for any tracking system is model-based tracking of objects for applications such as hand-eye coordination or virtual reality. While a generic model-based tracker for three-dimensional objects can be constructed within this system [28], X Vision makes it possible to gain additional speed and robustness by customizing the tracking loop using object-specific geometric information. In several of our hand-eye experiments we use small floppy disks as test objects. The most straightforward disk tracker (actually, a rectangle tracker) is simply a composite


    line length   sampling   tracking speed        σ of position
    (pixels)      rate       (msec/cycle)          (pixels)
                               A       B             A       B
       24            1        9.3     7.7          0.09    0.01
       12            2        5.5     4.5          0.10    0.00
        8            3        3.7     3.3          0.07    0.03
        6            4        3.0     2.7          0.05    0.04
        2           12        1.9     1.6          0.05    0.04
        1           24        1.5     1.3          0.05    0.07

Figure 3.8. Speed and accuracy of tracking rectangles with various spatial sampling rates. The figures in column A are for a tracker based on four corners computing independent orientation. The figures in column B are for a tracker which passes orientation down from the top-level composite feature.

tracker which tracks four corners, which in turn are composite trackers which track two lines each, as described in Section 3.2.1. The four corners are then simply tracked one after the other to track the whole rectangle with no additional object information. This method, while simple to implement, has two disadvantages. First, in order to track quickly, only a small region of the occluding contour of the disk near the corners is processed. This makes the tracking prone to losing a corner through chance occlusion and makes the match value sensitive to local changes in the background. Second, each of the line computations independently computes orientation from image information. However, given the locations of two corners, the line orientation can be easily computed using object geometry. Thus, increased speed and robustness can be attained by subsampling the windows along the contour, and by passing the line orientation "from above." As shown in Figure 3.8, there is no loss of precision in determining the location of the corners with reasonable sampling rates, since the image is sampled less frequently only in a direction roughly perpendicular to the lines. At the same time, we see a 10% to 20% speedup by not computing line orientations at the image level, and a nearly linear speedup with image subsampling level.
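Passing the orientation "from above" amounts to a one-line geometric computation; a hedged sketch with hypothetical names:

    // Sketch of passing line orientation "from above" (hypothetical names).
    // Given the current locations of two adjacent corners of the rectangle,
    // the orientation of the side joining them follows from object geometry,
    // so the low-level line trackers need not estimate it from the image.
    #include <cmath>

    struct Corner { double x, y; };

    // Orientation to impose on the edge tracker spanning corners a and b.
    double side_orientation(const Corner& a, const Corner& b) {
        return std::atan2(b.y - a.y, b.x - a.x);
    }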

Distraction Resistant Tracking The tracking package includes a contour tracker similar to the widely reported "Kalman snake" algorithms [2, 9, 25, 39, 42]. The snake is a composite feature which computes its state by tracking small edges as discussed in Section 3.2.1. The upward propagation computes the state of the snake from the states of the edges using a simple spline fit. To determine the position of new search windows, we compute a weighted combination of temporal prediction and spatial interpolation to arrive at the line parameters, L, of the search windows for low-level edge tracking. Figure 3.9 displays images of a contour tracker. Figure 3.9(b) shows the cubic splines determined by the knot points shown in Figure 3.9(a). By using the typing system, the snake tracker is written to expect subsidiary nodes that are generic line features. This makes it simple to apply it to different types of boundary detectors. For example, one difficulty with using short edge segments detected as described in Section 3.2.1 is that the match value is extremely sensitive to background changes, which makes the snakes easily distracted as shown in Figure 3.9(a) and (b). Distraction occurs because


Figure 3.9. Distraction resistant contour tracking: (a) and (b) show distracted contours; (c) and (d) show contour tracking without distraction. The short line segments in (b) represent the search window of each tracking component, with estimated edge location (and consequently, spline knot points) at their midpoints.

individual components are simply high-gradient edge finders, and depend upon high-level models to correct them in the case that they stray. While many model- or template-based contour trackers might not be distracted by the high-contrast edges in Figures 3.9(a) and (b), they would still fail in cases such as (d), where the change in the shape of the contour is small and/or gradual. We have implemented an edge detector that combines elements of feature detection with some of the temporal correlation aspects of SSD tracking [41]. The spline algorithm can be instantiated using these edge trackers because they carry the same type as the default edge tracker. Figure 3.9(c) and (d) illustrate the resulting distraction-free tracking. By simply maintaining some information about the kind of edges that are tracked (i.e., what intensities are observed inside the contour), tracking is improved greatly.
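To illustrate the idea (this is not the algorithm of [41]), the following sketch ranks candidate edge locations by combining gradient strength with similarity to an intensity value remembered from inside the contour; all names and the scoring rule are hypothetical.

    // Illustrative only: one simple way to use a remembered interior intensity
    // to rank candidate edge locations, in the spirit of the idea above.
    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Candidate: gradient magnitude and mean intensity just inside the contour.
    struct EdgeCandidate { double position, gradient, inside_intensity; };

    double score(const EdgeCandidate& c, double remembered_inside, double lambda) {
        // Strong gradient is good; large deviation from the remembered
        // interior intensity is penalized (lambda weights the penalty).
        return c.gradient - lambda * std::fabs(c.inside_intensity - remembered_inside);
    }

    std::size_t best_candidate(const std::vector<EdgeCandidate>& cs,
                               double remembered_inside, double lambda = 1.0) {
        std::size_t best = 0;
        for (std::size_t i = 1; i < cs.size(); ++i)
            if (score(cs[i], remembered_inside, lambda) >
                score(cs[best], remembered_inside, lambda))
                best = i;
        return best;   // caller must ensure cs is non-empty
    }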

3.3.2 An Embedded Application

As an illustration of the use of X Vision embedded within a larger system, we briefly describe some results of using X Vision within a hand-eye coordination system we have recently developed [18, 17]. The system relies on image-level feedback from two cameras to control the relative pose between an object held in a robot end-effector and a static object in the environment.

Figure 3.10. The results of performing point-to-point positioning to observable features (left) and to a setpoint defined in the plane (right).

The typing capabilities of X Vision make it possible to abstract the function of the hand-eye coordination primitives from their visual inputs. The hand-eye system implements a set of primitive skills, which are vision-based regulators for attaining a particular geometric constraint between the pose of a robot-held object and a target object. For example, two primitive skills are point-to-point positioning and point-to-line positioning. Each of these skills must acquire inputs from a stereo pair of trackers for point features or line features as appropriate. Hence, they are written in terms of the point-type features and the line-type features as defined by the X Vision typing system. In this way the same feedback methods can be used with a variety of application-specific tracking configurations without change.

For example, positioning or orienting the robot in a plane is practically useful for systems which use a table or other level surface as a workspace. In order to relate task coordinates to image coordinates, the following planar invariant can be used [31]: given four planar points, no three of which are collinear, the coordinates of a fifth point can be constructed using ratios of determinants. In X Vision, this construction can be implemented as a composite feature with a state vector typed as a point feature. Hence, it can be coupled directly with a point positioning skill in order to perform planar positioning as shown in Figure 3.10 (right).

It can be shown that the accuracy of primitive skills depends only on the accuracy of feature location in the image [17]. Hence, the physical accuracy of hand-eye experiments can be used to directly determine the accuracy of our feature localization algorithms. For example, we have performed several hundred point-to-point positioning experiments with a camera baseline of approximately 30 cm at distances of 80 to 100 cm. Accuracy is typically within a millimeter of position. For example, Figure 3.10 (left) shows the accuracy achieved when attempting to touch the corners of two floppy disks. For reference, the width of the disks is 2.5 mm. Simple calculations show that positioning accuracy of a millimeter at one meter of depth with a baseline of 30 cm, using the physical parameters of our cameras, yields a corner localization accuracy of 0.15 pixels.

Primitive skills are often combined to form more complex kinematic constraints. For example, two point-to-line constraints define a collinearity constraint useful for alignment. Figure 3.11 (right) shows an example of a problem requiring alignment: placing a screwdriver onto a screw.
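Returning to the planar construction mentioned above, the following is a minimal sketch of a determinant-ratio invariant computation for five coplanar points. The particular index pattern below is one valid choice made for illustration; it is not necessarily the one used in [31] or in the hand-eye system.

    // Sketch of a determinant-ratio (projective) invariant computation for
    // five coplanar points.  p[0..3] are the basis points (no three
    // collinear) and p[4] is the point being constructed/verified.  The
    // index pattern is one valid choice, chosen here for illustration.
    #include <array>

    struct Pt { double x, y; };

    // Determinant of the 3x3 matrix of homogeneous coordinates of three points.
    static double det3(const Pt& a, const Pt& b, const Pt& c) {
        return a.x * (b.y - c.y) - a.y * (b.x - c.x) + (b.x * c.y - b.y * c.x);
    }

    // Two independent projective invariants of five coplanar points.  Each
    // point index appears equally often in numerator and denominator, so the
    // ratios are unchanged by any plane projective transformation
    // (e.g., a change of camera).
    std::array<double, 2> plane_invariants(const std::array<Pt, 5>& p) {
        double i1 = (det3(p[0], p[1], p[3]) * det3(p[0], p[2], p[4])) /
                    (det3(p[0], p[1], p[4]) * det3(p[0], p[2], p[3]));
        double i2 = (det3(p[1], p[2], p[3]) * det3(p[0], p[1], p[4])) /
                    (det3(p[1], p[2], p[4]) * det3(p[0], p[1], p[3]));
        return {i1, i2};
    }

Because each fixed invariant yields an equation that is linear in the homogeneous coordinates of the fifth point, measuring the invariants in one view allows the corresponding point to be recovered in another view of the same plane by solving two linear equations.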


Figure 3.11. Left, the structure of the tracker for the screw (nodes labeled SCREW, MIDLINE, LINE, and Intersection in the original diagram). Right, the system in operation.

The diagram to the left summarizes the tracker for the screw. The tracker is composed from four edge segments. Two edge segments track the sides of the screw. Information from these segments is combined to compute a central axis. The orientation and position of the axis constrain the position and orientation of two short edge segments which detect the head of the screw and the point of contact with the wood surface. The position and length of the axis are determined by the positions of the trackers for the head of the screw and the wood. The screwdriver is tracked using a similar construction, with the exception that only one short edge segment is used (at the tip of the screwdriver) and the edge trackers along the axis are of fixed length. For the purposes of hand-eye coordination, the screw tracker is augmented with two composite features which compute the intersection point of the central axis with each of the short line segments. These point features are used as input to the visual servoing algorithm. This is a common use of such geometric constructions: to pull out information from an object which has its own internal structure and constraints.

3.4 Conclusions

We have presented the X Vision system for fast visual tracking. The main features of this system are the strong use of image warping, highly optimized low-level tracking methods, and a simple notion of feature combination. Our experience has shown that X Vision makes the use of vision in robotics applications cheap, simple, and even fun. Naive users tend to become proficient after a short "startup" period. Experts can easily develop and debug complex applications in a few hours' time. The modularity of the system has made it an ideal framework for comparative studies. It is straightforward to add other types of tracking primitives to the system and benchmark them against existing methods on real images or canned sequences. It is possible to attain extremely high levels of performance by properly introducing geometric constraints into the tracking system.

On the other hand, simplified systems like this cannot perform every vision-based task. In particular, window-based processing of features will only succeed when the observed system is sufficiently well-behaved to be able to predict its motion through time. However, recent successes in domains such as juggling
suggest that window-based techniques can accommodate highly dynamic systems [32]. Since the system is almost entirely software, it benefits from every increase in commercial processor speed. It is also extremely portable. The tracking system runs on an SGI Indigo with internal or external video, Sun systems equipped with a variety of framegrabbers, and PC compatibles equipped with digitizer boards. For example, the entire hand-eye system was recently ported to a robot at the DLR in Oberpfaffenhofen, Germany, where it was used in experiments in space telerobotics [19].

In summary, we believe that this paradigm will have a large impact on real-time vision applications. We are currently continuing to advance the state of the art by considering how to build tracking methods that are faster, more robust to occlusion and distraction [41], and capable of automatic initialization. We are also continuing to extend the capabilities of the system toward a complete vision-based programming environment. Information on the current version is available at http://www.cs.yale.edu/HTML/YALE/CS/AI/VisionRobotics/YaleAI.html.

Acknowledgments

This research was supported by ARPA grant N00014-93-1-1235, Army DURIP grant DAAH04-95-1-0058, by National Science Foundation grant IRI-9420982, and by funds provided by Yale University.


Figure .12. The organization of X Vision. (The original diagram groups the system's classes under the headings Realizations, Tools, Tracking, Typing, and Interfaces; the classes shown include Edge, Line, Corner, Cross, Tee, Target, SSD-trans, SSD-trans-scale, Absmedge, GILine, BasicFeature, CompFeature, FeatureGroup, BaseType, Scalable, Point-Type, Line-Type, Image, Pattern, Video, CWindow, Galileo, IndyCam, XWindow, and ITFG_101.)

.1 Programming Environment

Just as graphics systems are amenable to object-oriented techniques, we have found that object-oriented programming is well-suited to tracking systems. The use of object-oriented methods allows us to hide the details of how specific methods are implemented and to interact with the system through a pre-specified set of generic interfaces. Modularization through object-oriented programming also makes it easy to conceptualize the distinct stages of multi-level tracking. The application of high-level constraints should not require an overhaul of low-level procedures, nor should low-level implementation affect the overall framework provided by higher-level algorithms. Finally, such modularization also enhances the portability of the system.

We have constructed X Vision as a set of classes in C++. Briefly, all features are derived from a base class called BasicFeature. Basic features are directly derived from this class, and are characterized by their state vector and by functions which compute state information and display the feature graphically. There are two types of composite feature, which are also derived from BasicFeature. CompFeature describes a composite feature which has both upward and downward links. FeatureGroup is a composite feature with only upward links; that is, it does not impose any constraints on its subsidiary features. Any feature may participate in only one CompFeature, but many FeatureGroup's. Both CompFeature and FeatureGroup maintain and manage an internal queue of their subsidiary features. Information is propagated up and down the feature network using two functions: compute state, which computes a composite feature's state from the states of its subsidiary nodes, and state propagate, which adjusts the states of subsidiary nodes based on the state of their supersidiary node. The default update cycle for a CompFeature is to call its own state propagate function, to call the update function of the children, and then to call compute state. A FeatureGroup is similar, except there is no state propagate function.
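The default update cycle just described can be summarized as follows. The method signatures (written here with underscores) are hypothetical; the actual X Vision declarations differ.

    // Sketch of the default CompFeature update cycle described above
    // (hypothetical signatures; the actual X Vision declarations differ).
    #include <vector>

    class BasicFeature {
    public:
        virtual void update() = 0;          // low-level detection for basic features
        virtual ~BasicFeature() {}
    };

    class CompFeature : public BasicFeature {
    public:
        void update() override {
            state_propagate();              // impose constraints on subsidiaries
            for (BasicFeature* f : children)
                f->update();                // recursively update subsidiary features
            compute_state();                // recompute this feature's state from theirs
        }
    protected:
        virtual void state_propagate() {}   // down-link step
        virtual void compute_state() {}     // up-link step
        std::vector<BasicFeature*> children;
    };

    // A FeatureGroup would be identical except that it omits state_propagate().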


Figure .13. Schematic of the initialization example, showing the desired corners and the search line.

The tracking cycle is combined into a single function track() callable only from a top-level feature. Calling it sweeps information down the network to the set of basic features, updates the state of all basic features, and sweeps updated state information back up the network.

We have found that this programming environment greatly facilitates the development of tracking applications, and leads to clear, compact program semantics. As an example, consider a simple program to locate and track the corners of the disk shown in Figure .13 using the fiducial marks located near one edge.

    Video v(1);                          // A video device
    Edge e;                              // The pattern to track
    Target t1(Sig1);                     // A specialized pattern finder
                                         // with a known signature

    Line l1(&e,&v), l2(&e,&v);           // Two line trackers tracking edges in v
    Corner c1(&e,&v,UL), c2(&e,&v,UR);   // Two corner trackers operating in v

    if (!(t1.search()))                  // Search globally for the target
      exit(1);                           // This takes about 0.5 seconds

    l1.set_state(t1.x(), t1.y() + t1.sizey()/2, 0);    // Initialize two
    l2.set_state(t1.x(), t1.y() + t1.sizey()/2, M_PI); // search lines

    if (!(l1.search(c1) && l2.search(c2)))  // Search along the lines for
      exit(1);                              // the corners

    CompFeature p;                       // A generic composite feature
    p += c1; p += c2;                    // Add the corners to it.

    while (...) {                        // Go into a main tracking loop
      p.track();                         // which combines tracking with
      ... other user code ...            // other useful functions.
    }

After picking a video device and edge pattern to track, the remainder of the code locates the specialized target, uses it as a basis to find the horizontal line, searches along the line to


find the specified corners, and then goes into a basic tracking cycle to which other user code can be added. Naturally, any other user code impacts the speed of tracking and so must be limited to operations that can be performed in a small fraction of a second. In most of our applications, this is a feedback control computation or a broadcast of information to another processor.

References

[1] P. Allen, A. Timcenko, B. Yoshimi, and P. Michelman. Automated tracking and grasping of a moving object with a robotic hand-eye system. IEEE Trans. on Rob. and Autom., 9(2):152-165, 1993.
[2] A. A. Amini, S. Tehrani, and T. E. Weymouth. Using dynamic programming for minimizing the energy of active contours in the presence of hard constraints. In Proc. 2nd Int'l Conf. on Comp. Vision, pages 95-99, 1988.
[3] P. Anandan. A computational framework and an algorithm for the measurement of structure from motion. Int'l Journal of Computer Vision, 2:283-310, 1989.
[4] R. L. Anderson. Dynamic sensing in a ping-pong playing robot. IEEE Trans. on Rob. and Autom., 5(6):723-739, 1989.
[5] Nicholas Ayache and Olivier D. Faugeras. Building, registrating, and fusing noisy visual maps. Int. J. Robot. Res., 7(6):45-65, 1988.
[6] A. Blake, R. Curwen, and A. Zisserman. Affine-invariant contour tracking with automatic control of spatiotemporal scale. In Proc. Int'l Conf. on Comp. Vision, pages 421-430, Berlin, Germany, May 1993.
[7] Y. L. Chang and P. Liang. On recursive calibration of cameras for robot hand-eye systems. In Proc. 1989 IEEE Int. Conf. on Robotics and Automation, volume 2, pages 838-843, 1989.
[8] F. Chaumette, P. Rives, and B. Espiau. Positioning of a robot with respect to an object, tracking it, and estimating its velocity by visual servoing. In Proc. IEEE Int'l Conf. on Rob. and Autom., pages 2248-2253, Sacramento, CA, April 1991.
[9] L. D. Cohen. On active contour models and balloons. CVGIP: Image Understanding, 53(2):211-218, March 1991.
[10] James L. Crowley, Patrick Stelmaszyk, Thomas Skordas, and Pierre Puget. Measurement and integration of 3-D structures by tracking edge lines. Int'l Journal of Computer Vision, 8(1):29-52, 1992.
[11] Mark W. Eklund, Gopalan Ravichandran, Mohan M. Trivedi, and Suresh B. Marapane. Adaptive visual tracking algorithm and real-time implementation. In Proc. 1995 IEEE Conf. on Rob. and Autom., pages 2657-2662, Nagoya, Japan, May 1995.
[12] O. D. Faugeras, F. Lustman, and G. Toscani. Motion and structure from point and line matches. In Proc. Int'l Conference on Computer Vision, pages 25-33, June 1987.
[13] J. D. Foley, A. van Dam, S. K. Feiner, and J. F. Hughes. Computer Graphics. Addison-Wesley, 1993.


[14] Donald B. Gennery. Visual tracking of known three-dimensional objects. Int'l Journal of Computer Vision, 7(3):243-270, 1992.
[15] G. D. Hager. Real-time feature tracking and projective invariance as a basis for hand-eye coordination. In Proc. IEEE Conf. Comp. Vision and Patt. Recog., pages 533-539. IEEE Computer Society Press, 1994.
[16] G. D. Hager. Calibration-free visual control using projective invariance. In Proceedings of the ICCV, pages 1009-1015, 1995. Also available as Yale CS-RR-1046.
[17] G. D. Hager. A modular system for robust hand-eye coordination. DCS RR-1074, Yale University, New Haven, CT, June 1995.
[18] G. D. Hager, W.-C. Chang, and A. S. Morse. Robot hand-eye coordination based on stereo vision. IEEE Control Systems Magazine, 15(1):30-39, February 1995.
[19] G. D. Hager, G. Grunwald, and K. Toyama. Feature-based visual servoing and its application to telerobotics. In Intelligent Robots and Systems. Elsevier, 1995.
[20] Heikkila, Matsushita, and Sato. Planning of visual feedback with robot-sensor cooperation. In Proc. 1988 IEEE Int. Workshop on Intelligent Robots and Systems, Tokyo, 1988.
[21] J. Huang and G. D. Hager. Tracking tools for vision-based navigation. DCS RR-1060, Yale University, New Haven, CT, December 1994.
[22] Eric Huber and David Kortenkamp. Using stereo vision to pursue moving agents with a mobile robot. In Proc. 1995 IEEE Conf. on Rob. and Autom., pages 2340-2346, Nagoya, Japan, May 1995.
[23] I. Inoue. Hand eye coordination in rope handling. In 1st Int. Symp. on Robotics Research, Bretton Woods, USA, 1983.
[24] A. Izaguirre, P. Pu, and J. Summers. A new development in camera calibration: Calibrating a pair of mobile cameras. Int. J. Robot. Res., 6(3):104-116, 1987.
[25] M. Kass, A. Witkin, and D. Terzopoulos. Snakes: Active contour models. Int'l Journal of Computer Vision, 1:321-331, 1987.
[26] Yuncai Liu and Thomas S. Huang. A linear algorithm for motion estimation using straight line correspondences. In Int'l Conference on Patt. Recog., pages 213-219, 1988.
[27] D. G. Lowe. Robust model-based motion tracking through the integration of search and estimation. Int'l Journal of Computer Vision, 8(2):113-122, 1992.
[28] C.-P. Lu. Online Pose Estimation and Model Matching. PhD thesis, Yale University, 1995.
[29] B. D. Lucas and T. Kanade. An iterative image registration technique with an application to stereo vision. In Proc. Int. Joint Conf. Artificial Intelligence, pages 674-679, 1981.
[30] A. D. Morgan, E. L. Dagless, D. J. Milford, and B. T. Thomas. Road edge tracking for robot road following: a real-time implementation. Image and Vision Computing, 8(3):233-240, August 1990.


[31] J. Mundy and A. Zisserman. Geometric Invariance in Computer Vision. MIT Press, Cambridge, Mass., 1992.
[32] A. A. Rizzi and D. E. Koditschek. An active visual estimator for dexterous manipulation. Paper presented at the 1994 Workshop on Visual Servoing, 1994.
[33] J. Shi and C. Tomasi. Good features to track. In Computer Vision and Patt. Recog., pages 593-600. IEEE Computer Society Press, 1994.
[34] D. Sinclair, A. Blake, S. Smith, and C. Rothwell. Planar region detection and motion recovery. Image and Vision Computing, 11(4):229-234, May 1993.
[35] M. E. Spetsakis. A linear algorithm for point and line-based structure from motion. CVGIP: Image Understanding, 56(2):230-241, September 1992.
[36] Minas E. Spetsakis and John (Yiannis) Aloimonos. Structure from motion using line correspondences. Int'l Journal of Computer Vision, 4:171-183, 1990.
[37] T. N. Tan, K. D. Baker, and G. D. Sullivan. 3D structure and motion estimation from 2D image sequences. Image and Vision Computing, 11(4):203-210, May 1993.
[38] Camillo J. Taylor and David J. Kriegman. Structure and motion from line segments in multiple images. In Proc. 1992 IEEE Conf. on Rob. and Autom., pages 1615-1620, 1992.
[39] D. Terzopoulos and R. Szeliski. Tracking with Kalman snakes. In A. Blake and A. Yuille, editors, Active Vision. MIT Press, Cambridge, MA, 1992.
[40] C. Tomasi and T. Kanade. Shape and motion from image streams: a factorization method, full report on the orthographic case. CMU-CS 92-104, CMU, 1992.
[41] K. Toyama and G. D. Hager. Distraction-proof tracking: Keeping one's eye on the ball. In IEEE Int. Workshop on Intelligent Robots and Systems, pages 354-359. IEEE Computer Society Press, 1995. Also available as Yale CS-RR-1059.
[42] D. J. Williams and M. Shah. A fast algorithm for active contours and curvature estimation. CVGIP: Image Understanding, 55(1):14-26, January 1992.
[43] Yasushi Yagi, Kazuya Sato, and Masahiko Yachida. Evaluating effectivity of map generation by tracking vertical edges in omnidirectional image sequence. In Proc. 1995 IEEE Conf. on Rob. and Autom., pages 2334-2339, Nagoya, Japan, May 1995.
[44] Billibon H. Yoshimi and Peter K. Allen. Active, uncalibrated visual servoing. In Proc. 1994 IEEE Int'l Conf. on Rob. and Autom., volume 4, pages 156-161, San Diego, CA, May 1994.
[45] Z. Zhang and O. Faugeras. Determining motion from 3D line segment matches: a comparative study. Image and Vision Computing, 9(1):10-19, February 1991.
