Computer-vision-based Human-Computer Interaction with a Back Projection Wall Using Arm Gestures

Christian Leubner, Christian Brockmann, Heinrich Müller
Informatik VII (Computer Graphics)
University of Dortmund, Germany
[email protected]

Abstract

A computer-vision-based interaction system for back-projection walls is presented. The user controls the projected graphical user interface of an application by pointing with the arm. The mouse cursor follows the motion of the arm. Further instructions, corresponding e.g. to a mouse button click, can be given by voice commands. The combination of the interaction system with an application program does not require any modifications of the application software, so that the system can be used to control any software running on the back-projection wall. On the image processing and recognition level, special emphasis has been laid on coping with images disturbed by noise and by varying illumination of the environment.

1. Introduction

Video projection is in widespread use for multimedia presentations in classrooms and at conferences. It also plays an important role in group meetings for visualization purposes. Usually, interaction is performed at a standard keyboard/mouse computer whose screen content is additionally directed to a video beamer. This type of interaction limits the possibilities of group meetings because the interaction has to be performed at the computer, although it would be more natural to interact directly at the back-projection wall. For that purpose, special displays augmented with sensors have been developed, e.g. the SmartBoard [11]. Another recent development is to use classical laser pointers by capturing the laser point on the projection screen with video cameras. Versions for front- and back-projection have been implemented based on that idea [8, 13]. A step further is to let the user point directly at the projection wall, without any additional pointing tool. The idea is to observe the user with video cameras in order to recognize his arm and to calculate the three-dimensional pointing direction of the arm.

There is an increasing number of projects concerning tracking of the human body for human-computer interaction, e.g. the Pfinder and Spfinder of the MIT [1]. Most of these projects emphasize the recognition of symbolic static or motion gestures, while the precise determination of locations or directions is treated less often. Examples concerning pointing are Visualization Space and Dream Space by IBM [6, 7]. These systems and others define the pointing direction by a line connecting the head and the hand of the pointing arm. According to our experience, the postures that have to be performed in order that this line hits the desired target are somewhat unnatural as far as pointing is concerned, and thus inconvenient. Another approach is to use a computer-internal kinematic 3D model of the human body, which the user controls by mapping body features recognized in the camera images to features of the model. The arm direction of the model can then be taken as the desired pointing direction [5]. The advantage of the model-based approach is its robustness. Its precision, however, still depends on the precision of the 3D information acquired from the images. In this contribution we focus on that aspect.

One basic idea of our solution is to define a slightly restricted scenario of pointing at the wall. In our opinion this scenario is still natural and does not restrict the user too much. Our system uses two standard video cameras, one under the ceiling and one to the side, which observe the space in front of the back-projection wall. The user directly interacts with the graphical user interface by pointing with his straight arm at the wall. The arm has to be in a defined region of about 1 to 1.5 meters of depth in front of the wall. No other dynamic objects, e.g. other parts of the body of the user, are allowed in this region. Typically, the cursor of the application is displayed at the intersection of the pointing line with the wall and can be moved freely by moving the arm. Further instructions, for example initiating a mouse button click, are given by natural voice commands using a wireless headset microphone.

Figure 1. Computer-vision-based interaction in front of a back-projection wall.

The advantage of the restricted scenario is that the typical difficulties for image processing caused by an unfavorable background or by occlusion are reduced. Furthermore, a relatively precise calculation of the 3D location of the arm is possible. The special configuration yields a more reliable result of the image segmentation phase than completely free interaction would. Nevertheless, because of noise and varying illumination it is still possible that some images do not yield a correct segmentation result. This observation is taken into consideration in the phase of interpretation and calculation of the pointing direction.

The system has been implemented on PC hardware with standard color frame grabbing boards and video cameras, under the Microsoft Windows NT operating system. Image processing is performed on a separate PC that communicates with the application PC over a local network.

In the following sections we focus on the technical realization of the system: Section 2 explains image segmentation. In Section 3 a procedure for extracting the arm from the images is described. In Section 4 the calculation of the three-dimensional pointing direction is presented. Section 5 is devoted to the compensation of missing or wrong information delivered by the image processing level. Section 6 reports on the experience gathered with the system so far.

2. Segmentation

The aim of segmentation is to distinguish between foreground and background objects [4].

In our case the user, and in particular his arm, is the foreground object, whereas all other objects of the environment that are seen by the cameras and assumed to be static are denoted as background objects. Examples of background objects are the floor and the wall, but not the back-projection wall. If the back-projection wall is seen in an image, the region covered by it is excluded from segmentation. However, the cameras can be arranged in such a way that the image area covered by the projection screen is small or even zero.

Our solution to segmentation is divided into an initial learning phase and an application phase. In the learning phase the background without any foreground objects, i.e. without an interacting person, is presented to the system for a few seconds. Learning means that the system collects, for every pixel, information about its background color and about its property of being a background edge within this period. The information is stored in a so-called knowledge base that is implemented as an array. Every array element corresponds to a pixel and contains the information about colors and edges, respectively, detected during the learning phase.

In the application phase the system has sufficient knowledge of the background to determine the contour pixels of foreground objects. For this purpose the current color and edge information is compared to the acquired background knowledge. The combination of color and edge segmentation leads to more stable segmentation results than the use of only one of the two approaches would provide. In order to cope with slight changes in the background, for example varying illumination conditions, background learning analogous to the learning phase is continued outside of the convex hull of the determined foreground object pixels. Due to this continuous updating of the knowledge base, segmentation adapts to such changes. Nevertheless, segmentation might partly fail or provide faulty results if color or edge segmentation do not work properly. In order to cope with this problem, fuzzy logic techniques are employed to finally report those pixels which describe the contour of the foreground object [2]. For details of the segmentation method outlined here we refer to [10]. Figure 2 shows an input image and the corresponding contour pixels of the arm, obtained by the segmentation algorithm.
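As an illustration of the knowledge-base idea, the following Python/NumPy sketch maintains per-pixel color statistics learned from background-only frames and marks strongly deviating pixels as foreground. It is a minimal sketch only: the actual system additionally stores edge information, combines both cues with fuzzy logic [2, 10], and restricts continued learning to the region outside the convex hull of the foreground pixels; the threshold and update rate below are assumptions.

```python
import numpy as np

class BackgroundModel:
    """Per-pixel background 'knowledge base' (simplified: color statistics only).
    Threshold and update rate are illustrative assumptions."""

    def __init__(self, height, width, threshold=30.0, update_rate=0.02):
        self.mean = np.zeros((height, width, 3), dtype=np.float32)
        self.threshold = threshold
        self.update_rate = update_rate
        self._n = 0

    def learn(self, frame):
        """Learning phase: accumulate background colors over a few seconds
        of frames that contain no foreground object."""
        frame = frame.astype(np.float32)
        self._n += 1
        self.mean += (frame - self.mean) / self._n   # running mean per pixel

    def segment(self, frame):
        """Application phase: mark pixels whose color deviates strongly
        from the learned background as foreground."""
        frame = frame.astype(np.float32)
        deviation = np.linalg.norm(frame - self.mean, axis=2)
        foreground = deviation > self.threshold

        # Continue learning in the background region so the model adapts
        # to slow illumination changes.
        background = ~foreground
        self.mean[background] += self.update_rate * (frame[background] - self.mean[background])
        return foreground
```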

3. Arm Extraction and Detection of Projected Pointing Direction

Usually more of the user's body than just the arm can be seen in an image. These parts have to be excluded, so a region in the image has to be determined which covers the arm or at least a significant part of it. From the specific arrangement of the user, the cameras, and the back-projection wall we know on which side of the image the hand is located, and how the arm traverses the image.

Figure 2. An input image (left) and the corresponding image of contour pixels of the arm, calculated by the segmentation algorithm (right).

Dependent on the direction of traversal, overlapping horizontal or vertical rectangular stripes covering the image are defined. The stripes are parallel to the projection wall (figure 3). In each sub-image defined by one of the stripes, the convex hull of the contour pixels detected by the segmentation algorithm is calculated. In the next step, the stripes are processed in order of increasing distance from the projection wall, starting with the one closest to the wall. The first stripe that contains a convex hull is considered to contain the hand of the user. Each neighboring stripe containing a hull polygon of similar size is assumed to belong to the arm. If a large difference between the sizes of two neighboring hull polygons is detected, the algorithm concludes that it has reached the body. In this case stripe processing is terminated. For each hull polygon that was classified as belonging to the arm, its centroid is calculated. A straight line is fitted to the set of centroids. This straight line is taken as the pointing direction in this image (figure 4).
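The stripe processing can be summarized by the following sketch. The number of stripes, the size-jump ratio used to detect the arm/body boundary, the use of the contour-pixel count as a stand-in for the convex-hull size, and the assumption that the hand enters the image on the left side are all simplifications made for illustration; the system itself works with the actual hull polygons.

```python
import numpy as np

def fit_pointing_line(contour_pixels, image_width, num_stripes=12, size_jump=3.0):
    """Walk over vertical stripes from the wall side towards the body,
    collect one centroid per stripe, stop at a large size jump (assumed
    arm/body boundary), and fit a straight line to the centroids.

    contour_pixels: (N, 2) array of (x, y) contour coordinates from segmentation.
    Assumes the hand enters the image at x = 0 (illustrative assumption)."""
    stripe_width = image_width / num_stripes
    centroids, prev_size = [], None

    for s in range(num_stripes):                       # closest to the wall first
        lo, hi = s * stripe_width, (s + 1) * stripe_width
        in_stripe = contour_pixels[(contour_pixels[:, 0] >= lo) &
                                   (contour_pixels[:, 0] < hi)]
        if len(in_stripe) == 0:
            continue
        size = len(in_stripe)                          # proxy for hull-polygon size
        if prev_size is not None and max(size, prev_size) > size_jump * min(size, prev_size):
            break                                      # large size difference: body reached
        centroids.append(in_stripe.mean(axis=0))
        prev_size = size

    pts = np.array(centroids)
    if len(pts) < 2:
        raise ValueError("not enough centroids to fit a pointing line")
    # Least-squares line fit y = a*x + b through the centroids.
    a, b = np.polyfit(pts[:, 0], pts[:, 1], 1)
    return a, b, pts
```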

4. Determination of the Three-Dimensional Pointing Direction

The next step is to transfer the two-dimensional views of the pointing direction into space. For that purpose, a three-dimensional world coordinate system is introduced in the interaction space of the user. The position of the cameras and the size and position of the projection wall are expressed in this coordinate system. In order to obtain a relationship between the two-dimensional image coordinates and points in space, a camera calibration technique with a look-up table is applied.

The look-up table represents a function that assigns to every point of the image plane the three-dimensional world coordinates of a point in space. The look-up table is determined with a calibration pattern composed of black discs (figure 5). The calibration pattern is placed in front of the camera, and the coordinates of the centers of the discs are measured (by hand) with respect to the world coordinate system (figure 5). The corresponding two-dimensional coordinates of the centers of the discs are detected with subpixel precision in the image by simple image processing operations (figure 6). The look-up table consists of the pairs of two-dimensional and three-dimensional coordinates of every center. The centers of the discs define a canonical quadrilateral mesh obtained by connecting them along rows and columns.

The spatial point corresponding to an arbitrary point in the image plane is calculated by first determining its relative location within the quadrilateral in which it lies. The relative location is expressed in barycentric coordinates. The coordinates of the corresponding three-dimensional point are then interpolated, using the barycentric coordinates, from the three-dimensional coordinates that correspond to the vertices of the quadrilateral and are stored in the look-up table.

Up to now we have assigned exactly one point in space to every image point. However, there are infinitely many points in space that map to the same image point. The set of these points depends on the optical mapping of the camera. As is often done in applications, we approximate the real camera by the pinhole model [3] illustrated in figure 7. In the pinhole camera model, the points in space that map to the same image point are located on a ray through the optical center of the camera. The location of the optical center is estimated from the construction of the cameras used.
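The interpolation step can be illustrated by the following sketch. Since barycentric coordinates are defined with respect to a triangle, the sketch splits the containing quadrilateral into two triangles, which is one possible realization; the paper only states that barycentric coordinates are used, so this split, the function names, and the way the containing quadrilateral is supplied are assumptions.

```python
import numpy as np

def barycentric(p, a, b, c):
    """Barycentric coordinates of 2D point p with respect to triangle (a, b, c)."""
    m = np.column_stack([b - a, c - a])
    u, v = np.linalg.solve(m, p - a)
    return np.array([1.0 - u - v, u, v])

def image_point_to_world(p, quad_2d, quad_3d):
    """Map an image point p to a 3D world point by barycentric interpolation
    inside the calibration quadrilateral that contains it.

    quad_2d: (4, 2) image coordinates of the quadrilateral's corners (disc centers).
    quad_3d: (4, 3) measured world coordinates of the same corners,
             taken from the calibration look-up table."""
    p = np.asarray(p, dtype=float)
    quad_2d = np.asarray(quad_2d, dtype=float)
    quad_3d = np.asarray(quad_3d, dtype=float)

    # Split the quadrilateral into the triangles (0,1,2) and (0,2,3).
    for tri in ((0, 1, 2), (0, 2, 3)):
        w = barycentric(p, quad_2d[tri[0]], quad_2d[tri[1]], quad_2d[tri[2]])
        if np.all(w >= -1e-9):                      # p lies inside this triangle
            return sum(w[k] * quad_3d[tri[k]] for k in range(3))
    raise ValueError("point lies outside the given quadrilateral")
```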

Figure 3. Contour approximation of the user in the images of the sideward camera (left) and the ceiling camera (right). The contour of the arm is approximated by a union of convex hull polygons, which are visualized as line drawings in the images. The polygons are calculated in stripe-shaped sub-images, which are arranged horizontally and vertically, respectively. For the identified background regions (indicated by darkened colors), the learning of color information is continued.

Figure 4. In addition to figure 3, the detected boundary between the arm and the body is indicated by a vertical and a horizontal line, respectively. Furthermore, the approximated pointing directions are depicted by straight lines fitted to the centroids of the convex hull polygons of the stripes (indicated by white crosses).

Figure 5. Camera calibration using a calibration pattern with known real world position.

Figure 6. Calibration pattern in front of the camera (left) and the detected centers of the discs (right).

Figure 7. Pinhole model of a camera.

Figure 8. The pointing direction is determined as the straight line resulting from the intersection of the planes spanned by the optical center and the pointing line in the image of every camera.

The desired ray is obtained by connecting the optical center with the corresponding three-dimensional point of the image point, obtained from the look-up table as described above. We have preferred this technique to classical camera calibration techniques like that of Tsai [12], or improved techniques like that of Zhang [14], because of its simplicity and its sufficient accuracy for our application. Our approach, however, relies on a correct assumption about the location of the optical center of the camera used.

Based on this method, the three-dimensional pointing direction of the arm is determined as follows. For every camera, the plane spanned by the optical center of the camera and the pointing line that has been fitted in the image is determined. The intersection of the two planes yields a straight line in space that is taken as the desired pointing direction (figure 8). The position on the projection wall at which the user points is calculated as the intersection point of the plane of the projection wall and the just calculated pointing line. The coordinates of the intersection point on the projection wall are expressed with respect to the screen coordinate system. They are forwarded to the Windows system, which updates the cursor position to this location.
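The geometric construction can be sketched as follows, assuming that for each camera two image points on the fitted pointing line have already been mapped to world points via the look-up table; all function names are illustrative.

```python
import numpy as np

def plane_from_camera(optical_center, p_a, p_b):
    """Plane spanned by the optical center and two 3D points lying on the
    pointing line seen by that camera (points assumed to come from the
    calibration look-up table). Returns (unit normal n, offset d) with n . x = d."""
    n = np.cross(p_a - optical_center, p_b - optical_center)
    n = n / np.linalg.norm(n)
    return n, np.dot(n, optical_center)

def intersect_planes(n1, d1, n2, d2):
    """Line of intersection of two planes: one point on the line and its direction."""
    direction = np.cross(n1, n2)
    # Solve n1.x = d1 and n2.x = d2; lstsq returns the minimum-norm point.
    A = np.vstack([n1, n2])
    b = np.array([d1, d2])
    point, *_ = np.linalg.lstsq(A, b, rcond=None)
    return point, direction / np.linalg.norm(direction)

def intersect_line_with_wall(point, direction, wall_n, wall_d):
    """Intersection of the pointing line with the wall plane wall_n . x = wall_d."""
    t = (wall_d - np.dot(wall_n, point)) / np.dot(wall_n, direction)
    return point + t * direction
```

Note that the intersection becomes numerically ill-conditioned when the two planes are nearly parallel, which the camera arrangement (ceiling plus side view) is chosen to avoid.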

5. Compensation of Disturbed Information

First experiments showed that the position of the cursor is quite unsteady. Even if the user did not move his arm, the mouse cursor rapidly jumped around the intended position, by up to several centimeters in the worst case and still by about one centimeter under good illumination conditions. In order to cope with this problem, we have looked for the reasons and have developed corresponding approaches of correction.

A main reason is that segmentation may fail in one or more stripes. A particularly undesirable effect is that, because of wrong segmentation, the convex hull in a stripe changes its shape discontinuously. We correct this effect by assigning a weight to every stripe that describes the degree to which the convex hull centroid of this stripe is considered in the calculation of the fitted line of pointing direction. The weight is updated over time. It is determined from a certain number of points obtained for the stripe in the past, and it reduces the priority of the stripe in the calculation in the case of erratic movement.

Another effect is that the boundary line between the arm and the body jumps rapidly and considerably. Depending on the location of the boundary line, the number of points involved in calculating the pointing line may differ strongly and, in particular, may be low, thus leading to quite different and unreliable pointing lines. This effect is severe if the user holds his arm still in order to fix the cursor at a particular point on the wall. An indicator for this situation is that the fingertip, which usually is delivered quite reliably, shows no or only minor motion. If this situation is identified, the pointing line is not determined by line fitting, but by taking the line through the fingertip with the same direction as in preceding frames in which the number of reliable convex hull centroids on the arm was high.

Because in our application the final subject of interest is the position of the cursor on the screen, the temporal behavior of this position can also be used for correction. As before, the correction depends on the given situation. We distinguish between fixed pointing (the user points at the projection wall and does not move), correcting pointing (the user moves his arm slowly to adjust the position of the cursor within a small area), and dynamic pointing (the user moves his arm quickly to aim at a new position). The type of situation is identified by applying thresholds to the differences of cursor positions between successive frames. In all three cases the mouse cursor position is obtained as a weighted average of a number of past cursor positions, thus achieving a smoothing effect. For fixed pointing, the weights are chosen so that smoothing is stronger than for correcting pointing. For dynamic pointing, only minor smoothing is performed.

Measurements for the important situation of keeping the cursor at a desired fixed position show that these techniques reduce the error to about half of its original value. Under good illumination conditions a desired cursor location can be hit with a deviation between 0.5 and 2 cm. An implication of this observation is that selectable items, like icons or menu fields, should have at least this size on the projection wall.
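The following sketch illustrates the three-mode classification and the mode-dependent smoothing of the cursor position. The thresholds, the length of the history, and the concrete weight profiles are assumptions, since the paper does not report numerical values; only the overall scheme (classify by frame-to-frame displacement, then apply a mode-dependent weighted average) follows the description above.

```python
import numpy as np
from collections import deque

FIXED_THRESHOLD = 5.0      # pixels per frame; below this: fixed pointing (assumed value)
DYNAMIC_THRESHOLD = 40.0   # pixels per frame; above this: dynamic pointing (assumed value)
HISTORY = 10               # number of past positions considered (assumed value)

def classify(prev, cur):
    step = np.linalg.norm(cur - prev)
    if step < FIXED_THRESHOLD:
        return "fixed"
    if step > DYNAMIC_THRESHOLD:
        return "dynamic"
    return "correcting"

class CursorSmoother:
    def __init__(self):
        self.history = deque(maxlen=HISTORY)

    def update(self, raw_pos):
        """Return the smoothed cursor position for the current frame."""
        raw_pos = np.asarray(raw_pos, dtype=float)
        mode = "fixed" if not self.history else classify(self.history[-1], raw_pos)
        self.history.append(raw_pos)
        pts = np.array(self.history)

        if mode == "fixed":          # strong smoothing: near-uniform weights
            w = np.ones(len(pts))
        elif mode == "correcting":   # moderate smoothing: favor recent frames
            w = np.linspace(0.2, 1.0, len(pts))
        else:                        # dynamic: only minor smoothing over few frames
            w = np.zeros(len(pts))
            recent = min(3, len(pts))
            w[-recent:] = np.linspace(0.2, 1.0, recent)

        return (pts * w[:, None]).sum(axis=0) / w.sum()
```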

6. Experiences and Future Work

The presented gesture-based interaction system works reliably and enables the user to interact comfortably with the application system. The satisfactory performance of the system is based on the awareness that preliminary processing steps may yield imperfect or even faulty intermediate results. Nevertheless, the displayed interaction elements, like menus or icons, should not be too small in order that they can be selected reliably. The software runs on a dual Pentium III 800 MHz system at approximately 8 frames per second. The system has been applied extensively by different test persons and has been accepted as intuitive.

Future work will be devoted to incorporating more than two cameras. This would enable a larger space of interaction in front of the projection wall. Furthermore, it is planned to augment the possibilities of interaction by computer-vision-based hand gesture recognition, enabling additional ways of interaction. For that purpose, our existing system Zyklop [9] can be used.

References

[1] A. Azarbayejani, C. Wren, and A. Pentland, "Real-time 3-D tracking of the human body", Proceedings of IMAGE'COM 96, Bordeaux, May 1996.
[2] J.C. Bezdek (ed.), J. Keller, R. Krishnapuram, and M. Pal, Fuzzy Models and Algorithms for Pattern Recognition and Image Processing, Kluwer Academic Publishing, 1999.
[3] O.D. Faugeras, Three-Dimensional Computer Vision, Artificial Intelligence, MIT Press, 3rd edition, 1999.
[4] R.C. Gonzalez and R.E. Woods, Digital Image Processing, Addison-Wesley Publishing, 1992.
[5] M. Hoch, Intuitive Interface - a new computer environment for design in visual media (in German), Verlag infix, 1999.
[6] IBM, Dream Space, http://www.research.ibm.com/natural/dreamspace/
[7] IBM, Visualization Space, http://www.research.ibm.com/imaging/vizspace.html
[8] C. Kirstein and H. Müller, "Interaction with a Projection Screen Using a Camera-Tracked Laser Pointer", Proceedings of the International Conference on Multimedia Modeling (MMM'98), IEEE Computer Society Press, 1998.
[9] M. Kohler, New Contributions to Vision-Based Human-Computer Interaction in Local and Global Environments, Infix-Verlag, 1999.
[10] C. Leubner, "Adaptive Color- and Edge-Based Image Segmentation Using Fuzzy Techniques", Proceedings of the 7th Fuzzy Days, 2001.
[11] N.A. Streitz, J. Geißler, J.M. Haake, and J. Hol, "DOLPHIN: Integrated Meeting Support across LiveBoards, Local and Remote Desktop Environments", Proceedings of the 1994 ACM Conference on Computer-Supported Cooperative Work (CSCW'94), 1994, pp. 345-358.
[12] R.Y. Tsai, "An Efficient and Accurate Camera Calibration Technique for 3D Machine Vision", IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 1986.
[13] M. Wissen, M.A. Wischy, and J. Ziegler, "Realisierung einer laserbasierten Interaktionstechnik für Projektionswände", Mensch & Computer 2001, Eds.: H. Oberquelle, R. Oppermann, and J. Krause, B.G. Teubner Verlag, 2001.
[14] Z. Zhang, A Flexible New Technique for Camera Calibration, Technical Report MSR-TR-98-71, Microsoft Research, 1998.