COLOR-BASED INITIALIZATION FOR HUMAN TRACKING WITH A TRINOCULAR CAMERA SYSTEM

J. BARILE, M. BISHAY, M. CAMBRON, R. WATSON, R. A. PETERS II, K. KAWAMURA
Intelligent Robotics Laboratory, Center for Intelligent Systems
Department of Electrical and Computer Engineering
Vanderbilt University, Nashville, TN 37235
{jbbarile,magued,mec,watson,rap2,kawamura}@vuse.vanderbilt.edu

Abstract: This paper addresses the problem of initializing a vision-based human tracking system. Most active binocular systems built for tracking require manual initialization. Our approach uses color skin-tone segmentation to initialize monochrome binocular tracking, carried out on a trinocular camera head whose mechanical design simplifies the integration. We demonstrate the results of our color segmentation algorithms.

Keywords: color processing, active vision, fixation, human tracking.

1. Introduction

In the robotics community, several active binocular systems have been built for tracking [1,2,3,4,5,6]. Most of these systems assume that the target to be tracked has already been fixated (i.e., that tracking has been initialized). Our approach uses a third, color camera to initialize a monochrome stereo tracking algorithm. We have achieved this integration on a trinocular camera head (Figure 1) that has been mechanically designed to simplify the integration.

The tracking system, outlined in Figure 2, is composed of a camera head, CATCH [7], carrying three cameras (one center-mounted color camera and two monochrome cameras); a DSP-based monochrome image acquisition and processing board developed in-house (RIVS: Real-time Intelligent Vision Sensor [7]); and a Pentium PC for color image processing. A Windows NT kernel-mode device driver was developed to allow users to display images processed on RIVS. The visual system is part of ISAC, a dual-arm humanoid robot being developed at the Intelligent Robotics Lab [8].

[Figure 1 - CATCH, a Cost-effective Active Trinocular Camera Head]

[Figure 2 - Outline of the tracking system: the color image feeds color initialization on the Pentium PC, which drives the pan/tilt axes and issues the "start verging" signal; the L/R mono images feed mono initialization on RIVS, which drives the L/R verge axes through the CATCH controller]

2. CATCH and the Kinematics of Fixation

The mechanical design of the active trinocular camera head, called CATCH, simplifies the process of centering a target point in all three images. CATCH has four degrees of freedom: pan, tilt, left verge, and right verge. Figure 3 shows a simple schematic of the axes of movement of the camera head. The two monochrome cameras are mounted on servomotors with parallel verge axes GL and GR normal to their optical axes ZL and ZR; thus the two monochrome cameras can turn independently about their verge axes. The color camera and the two monochrome cameras are attached to a crossbar that tilts about the horizontal TILT axis, and the entire assembly rotates about the PAN axis. The optical axis of the color camera lies in the same plane as those of the monochrome cameras. The plane formed by the optical axes of the two verging cameras is called the horopter plane [3]. The horopter is the region of space containing objects that project onto the image planes with no stereo disparity (zero disparity). The process of centering each camera on a specified target is known as fixation. When fixating the two monochrome cameras on a target point, one of two cases arises: 1) the target point lies within the horopter plane, or 2) the target point is displaced from the horopter plane. It has been shown that the first case is preferable to the second [9]: case 1 requires only a single inverse-tangent calculation, while case 2 involves several projections and rotations. With this in mind, CATCH was designed so that the central color camera may be used to control pan and tilt, reducing the job of the monochrome cameras to simply verging on a point in the horopter plane.
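As an illustration of the case-1 computation, the sketch below derives the symmetric verge angle for a target that has been centered by pan/tilt and lies in the horopter plane. The paper states only that this case reduces to a single inverse tangent; the baseline value, function name, and geometry below are our own assumptions.

```python
import math

def symmetric_verge_angle(baseline_m: float, depth_m: float) -> float:
    """Verge angle (degrees) for each monochrome camera, assuming the
    target sits on the color camera's optical axis (pan/tilt already set)
    and lies in the horopter plane.

    Each monochrome camera sits baseline_m / 2 from the center camera,
    so its verge angle is a single inverse tangent -- the "case 1"
    computation favored by the CATCH design.
    """
    half_baseline = baseline_m / 2.0
    return math.degrees(math.atan2(half_baseline, depth_m))

# Example (values assumed): 0.3 m baseline, target 1.5 m away
print(f"{symmetric_verge_angle(0.3, 1.5):.1f} degrees per camera")  # ~5.7
```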

[Figure 3 - Camera head axes: verge axes GL and GR, optical axes zL and zR, and the TILT and PAN axes]

[Figure 4 - Skin-tone pixel volume in RGB space: the "skin tone" region inside the RGB cube (axes 0-255)]

[Figure 5 - Ellipse slices for skin-tone model]

3. Color-Based Initialization

The goal of color-based initialization is to drive the pan/tilt camera head so that skin-tone regions are centered in the center camera's view. The first step in this process is skin-tone segmentation. Next, the pan and tilt axes of CATCH are moved to center an object of interest within the view of the color camera.

3.1. Skin-tone Pixel Segmentation

The first step in color-based initialization is to separate skin-colored pixels from background pixels. A popular method of color segmentation is to work in a chromaticity space [10, 11]. Since most visual sensors do not produce data in these spaces, captured images must be converted from the 3-dimensional RGB space to a 2-dimensional chromaticity space. Depending on the chromaticity space used, this conversion can be quite computationally expensive. The proposed color segmentation algorithm works directly in RGB space to avoid these costly and error-prone transformations.
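For reference, one common 2-dimensional chromaticity space (an example of the transformations the algorithm avoids, not necessarily the spaces used in [10, 11]) is the normalized rg space:

$$r' = \frac{R}{R+G+B}, \qquad g' = \frac{G}{R+G+B}$$

Every pixel then requires two additions and two divisions before segmentation can even begin, and the divisions are undefined for black pixels; working directly in RGB space sidesteps both problems.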

3.1.1. Skin-tone Model

A database of skin-tone pixels is created off-line by hand-segmenting several images of various people. An example of the resulting distribution of skin-tone pixels in RGB space is illustrated in Figure 4. This "skin volume" is modeled with a layered-slice representation in order to facilitate real-time segmentation. Figure 5 illustrates the parametric model of the data. To build the slices, the G axis is first split into 32 equal divisions. Next, the pixels falling within each slice are analyzed to extract the parameters of a 2-dimensional joint normal distribution over the R-B plane belonging to that slice. The distribution is modeled by equation (1):

$$\lambda^2\,(1-\rho^2) \;\ge\; \frac{(r-\bar{r})^2}{\sigma_r^2} \;-\; 2\rho\,\frac{(r-\bar{r})(b-\bar{b})}{\sigma_r\,\sigma_b} \;+\; \frac{(b-\bar{b})^2}{\sigma_b^2} \qquad (1)$$

where $\bar{r}$ and $\bar{b}$ are the mean values of the red and blue coordinates, respectively, $\sigma_r^2$ is the variance along the R axis, $\sigma_b^2$ is the variance along the B axis, $\rho$ is the correlation coefficient between the two axes (and corresponds to the rotation of the ellipse), and $\lambda$ determines the number of standard deviations from the mean included in the distribution. By extruding this ellipse along the G axis within the slice, a cylindrical volume scaled to the dimensions of the ellipse is formed. Each such volume is used to segment pixels falling within the corresponding range of values along the G axis.
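A minimal sketch of the off-line slice fitting described above, assuming the hand-segmented database is available as an N x 3 array of skin RGB values; the array layout, sample-count threshold, and function name are our own:

```python
import numpy as np

NUM_SLICES = 32  # the G axis is split into 32 equal divisions

def fit_slice_models(skin_pixels: np.ndarray) -> list:
    """Fit a 2-D joint-normal (R-B) model per G slice.

    skin_pixels: (N, 3) uint8 array of hand-segmented skin RGB values.
    Returns, per slice, (r_mean, b_mean, sigma_r, sigma_b, rho) or None.
    """
    models = [None] * NUM_SLICES
    slice_idx = skin_pixels[:, 1].astype(int) * NUM_SLICES // 256
    for s in range(NUM_SLICES):
        rb = skin_pixels[slice_idx == s][:, [0, 2]].astype(float)
        if len(rb) < 16:                      # threshold assumed, not from paper
            continue
        r_mean, b_mean = rb.mean(axis=0)
        sigma_r, sigma_b = rb.std(axis=0)
        if sigma_r == 0 or sigma_b == 0:      # degenerate slice; skip it
            continue
        rho = np.corrcoef(rb[:, 0], rb[:, 1])[0, 1]
        models[s] = (r_mean, b_mean, sigma_r, sigma_b, rho)
    return models
```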

3.1.2. Color Segmentation

When an image is captured, a 3x3 spatial blur filter is first applied to suppress CCD noise from the camera. Next, the algorithm examines each pixel in the image to determine which model slice it falls within, by checking the value of the green byte in the 24-bit RGB tuple. Only the inner slices are used to model the skin-tone region, as the uppermost and lowermost slices tend to include bright specular pixels and dark hair pixels inadvertently captured in the skin-tone database (see Figure 5). Once the proper slice is found, equation (1) is used to test whether the point defined by the red and blue components of the pixel falls within the elliptical region. If so, the corresponding pixel location in the binary mask image is set to 1; otherwise, it is set to 0. After all color pixels have been examined, the mask image is eroded with a 3x3 cross to remove spurious disconnected pixels.
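Continuing the fit_slice_models sketch above, the per-pixel test of equation (1) might look like the following; the value of lambda and the use of SciPy for the blur and erosion are assumptions, not details from the paper:

```python
from scipy import ndimage
import numpy as np

LAMBDA = 2.0  # standard deviations included in the model (value assumed)

def segment_skin(image: np.ndarray, models: list, lam: float = LAMBDA) -> np.ndarray:
    """Return a binary skin mask for an (H, W, 3) uint8 RGB image."""
    # 3x3 spatial blur to suppress CCD noise
    blurred = ndimage.uniform_filter(image.astype(float), size=(3, 3, 1))
    r, g, b = blurred[..., 0], blurred[..., 1], blurred[..., 2]
    slice_idx = g.astype(int) * NUM_SLICES // 256
    mask = np.zeros(image.shape[:2], dtype=bool)
    # Inner slices only; how many outer slices are dropped is not
    # specified in the paper, so one at each end is assumed here.
    for s in range(1, NUM_SLICES - 1):
        if models[s] is None:
            continue
        r_mean, b_mean, sigma_r, sigma_b, rho = models[s]
        dr = (r - r_mean) / sigma_r
        db = (b - b_mean) / sigma_b
        quad = dr**2 - 2 * rho * dr * db + db**2      # right side of eq. (1)
        mask |= (slice_idx == s) & (quad <= lam**2 * (1 - rho**2))
    # Erode with a 3x3 cross to remove spurious disconnected pixels
    cross = np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]], dtype=bool)
    return ndimage.binary_erosion(mask, structure=cross).astype(np.uint8)
```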

3.2. Skin-tone Region Centering

In order to drive the pan and tilt of CATCH to center a face or hand, the centroid of all skin-tone pixels is found. The mask image is divided into horizontal and vertical zones for controlling pan and tilt, respectively. These zones are illustrated in Figure 6. CATCH is moved 2 degrees if the centroid lies in the outer zone (more than 25% of the image extent from the image center), 1 degree if it lies within the middle zone (between 5% and 25% from the center), or zero degrees if it lies within 5% of the image center (the dead zone).
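A sketch of the zone-based step selection; the decomposition into per-axis fractional offsets is ours, while the 25%/5% thresholds and the 2/1/0-degree steps come from the paper:

```python
import numpy as np

def axis_step_degrees(centroid_px: float, extent_px: int) -> float:
    """Signed pan or tilt step (degrees) for one image axis."""
    offset = (centroid_px - extent_px / 2.0) / extent_px  # fraction of extent
    if abs(offset) > 0.25:      # outer zone
        step = 2.0
    elif abs(offset) > 0.05:    # middle zone
        step = 1.0
    else:                       # dead zone
        return 0.0
    return step if offset > 0 else -step

def centering_command(mask: np.ndarray) -> tuple:
    """Pan/tilt steps (degrees) toward the skin-pixel centroid of a mask."""
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return 0.0, 0.0         # no skin-tone pixels: hold position
    h, w = mask.shape
    return axis_step_degrees(xs.mean(), w), axis_step_degrees(ys.mean(), h)
```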

4. Verge Lock

Once the color-based initialization process is complete, the target lies on the color camera's optical axis. The next step is to initialize the two monochrome cameras such that the target lies in the center of all three images. This process, called verge lock, is simplified because both the PAN and TILT axes have already been set correctly during color initialization. The color camera's optical axis is perpendicular to the horizontal crossbar (which carries all three cameras). Therefore, the verge angles of the two monochrome cameras will be identical when the target to be tracked is centered in both the left and right images. (Note that the target is centered in the color image as well.) Figure 7 shows the optical axes of the two monochrome cameras as the verge angles change from 5 degrees to 25 degrees. The points where all three camera axes meet (P1, P2, ..., Ptarget, ..., Pn) are potential target locations. A disparity filter can be used to find the correct verge angle; at this angle, the target to be tracked is centered in all three images. The disparity filter is a simple subtraction filter that computes the difference between the central (20x20) regions of the left and right images. A minimum of the difference occurs when both cameras are fixated on the same object. Figure 8 shows the output of the disparity filter as the verge angles of the two monochrome cameras sweep from 5 degrees to 25 degrees. The location with the smallest disparity corresponds to the vergence angle that centers the target in all three images.
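A sketch of the verge sweep, assuming hypothetical set_verge_angles, grab_left, and grab_right hardware hooks (the actual implementation runs on the RIVS board):

```python
import numpy as np

def central_patch(img: np.ndarray, size: int = 20) -> np.ndarray:
    """The central size x size region of a monochrome image."""
    h, w = img.shape
    return img[(h - size) // 2:(h + size) // 2,
               (w - size) // 2:(w + size) // 2]

def verge_lock(set_verge_angles, grab_left, grab_right,
               angles=np.arange(5.0, 25.5, 0.5)):
    """Sweep both verge angles together and return the angle of minimum
    disparity between the central 20x20 regions of the left/right images.

    The 0.5-degree sweep step is assumed; the paper gives only the
    5-to-25-degree range.
    """
    best_angle, best_disparity = None, float("inf")
    for theta in angles:
        set_verge_angles(theta, theta)   # identical angles: symmetric vergence
        left = central_patch(grab_left().astype(float))
        right = central_patch(grab_right().astype(float))
        disparity = np.abs(left - right).sum()   # simple subtraction filter
        if disparity < best_disparity:
            best_angle, best_disparity = theta, disparity
    return best_angle
```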

[Figure 6 - Motion zones for controlling pan/tilt: dead zone, middle zone, and outer zone around the image center, with an example segmentation]

[Figure 7 - Verge lock procedure: optical axes of the left, color, and right cameras intersecting at candidate points P1, P2, P3, ..., Ptarget, ..., Pn along the color camera's optical axis]

[Figure 8 - Disparity values during verge lock: disparity (0-30000) plotted against verge angle (5-25 degrees), with the minimum marking the target]

5. Results

Figure 9 illustrates an example color camera snapshot and the corresponding skin-tone segmentation after erosion with a 3x3 pixel cross. The cross in the center of the image marks the center of gravity of all skin-tone pixels after the pan and tilt axes of CATCH have been directed toward the centroid. Note that both the face and arms were segmented, as were a few background pixels from a wooden table in the image. The person in this image was not included in the original skin-tone model, yet was correctly segmented. Lighting variations across the face are apparent in the mask image; the current algorithm does not directly address the color constancy problem.

[Figure 9 - Color camera view (a) and corresponding segmentation (b)]

6. Conclusions

The strategic mounting of a color camera on a stereo monochrome camera head simplifies the problem of tracking initialization. This technique is currently being applied to ISAC, a service robot designed to work directly with humans. While the color segmentation algorithm described above was used only to locate skin tones, it may be applied to other colors simply by creating other slice models off-line. Other applications for automatic initialization include video conferencing and surveillance.

References:
[1] D. Coombs and C. Brown, Real-time binocular smooth pursuit, International Journal of Computer Vision, 11, 1993, 147-164.
[2] H.K. Nishihara, Practical real-time imaging stereo matcher, Optical Engineering, 23(5), 1984, 536-545.
[3] C.F.R. Weiman, Log-polar binocular vision system, Technical Report, Transition Research Corporation, 1994.
[4] E. Krotkov, Active vision for reliable ranging: cooperating focus, stereo, and vergence, International Journal of Computer Vision, 11(2), 1993, 187-203.
[5] M. Jenkin, Using stereomotion to track binocular targets, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1991.
[6] N. Papanikolopoulos, P.K. Khosla, and T. Kanade, Vision and control techniques for robotic visual tracking, IEEE International Conference on Robotics and Automation, 1991, 857-864.
[7] S. Asaad, M. Bishay, D.M. Wilkes, and K. Kawamura, A low-cost, DSP-based, intelligent vision system for robotic applications, Proceedings of the 1996 IEEE International Conference on Robotics and Automation, 1996, 1656-1661.
[8] K. Kawamura, D.M. Wilkes, T. Pack, M. Bishay, and J. Barile, Humanoids: future robots for home and factory, Proceedings of the International Symposium on Humanoid Robots, 1996, 53-62.
[9] M. Bishay, Hand-eye coordination with an active camera head, Dissertation, Vanderbilt University, May 1997.
[10] D. Sanger, H. Haneishi, and Y. Miyake, Method for light source discrimination and facial pattern detection from negative color film, Journal of Imaging Science and Technology, 39, 1995, 166-175.
[11] T.C. Chang, T.S. Huang, and C. Novak, Facial feature extraction from color images, Proceedings of the IAPR International Conference on Pattern Recognition, 2, October 1994, 39-43.
