S. E. Ghobadi, O. E. Loepprich, K. Hartmann, O. Loffeld, 'Hand Segmentation using 2D/3D Images', Proceedings of Image and Vision Computing New Zealand 2007, pp. 64–69, Hamilton, New Zealand, December 2007.
Hand Segmentation Using 2D/3D Images

S. E. Ghobadi, O. E. Loepprich, K. Hartmann, and O. Loffeld
Center for Sensor Systems (ZESS), University of Siegen, Paul-Bonatz-Str. 9-11, Siegen, Germany.
Email: {Ghobadi, Loepprich, Hartmann, [email protected]}
Abstract

This paper describes a fast and robust segmentation technique based on the fusion of 2D/3D images for gesture recognition. These images are provided by a novel 3D Time-of-Flight (TOF) camera which has been developed at our research center (ZESS). Using modulated infrared lighting, this camera generates an intensity image together with range information for each pixel of a Photonic Mixer Device (PMD) sensor. The intensity and range data are fused to serve as the input to the segmentation algorithm. Our proposed segmentation technique is based on the combination of two unsupervised clustering approaches, K-means and Expectation Maximization (EM), both of which attempt to find the centers of the natural clusters in the fused data. The experimental results show that the proposed gesture segmentation technique can successfully segment the hand from the user's body, face, arm or other objects in the scene under varying illumination conditions and in real time.

Keywords: Segmentation, 2D/3D Images, Gesture, K-Means, Expectation Maximization
1 Introduction
A fast and robust segmentation is the first challenge in vision-based gesture recognition for real-time man-machine interaction. Gesture segmentation approaches based on intensity information or color images (2D images) suffer from low efficiency and lack of robustness in cluttered backgrounds as well as under varying lighting conditions. For this reason, in 2D gesture segmentation, the problem is usually simplified by introducing assumptions or applying constraints either to the scene or to the user. These limitations include having the user wear special gloves [1], [2], [3], controlling the illumination in the scene, avoiding objects with skin color in the image and using markers [2].

To overcome these difficulties and improve the robustness of the segmentation, much research has been done in recent years based on range images [1], [4], [5], [6], [7]. The range information can be provided by different kinds of cameras, such as a laser range camera [4], a stereo camera [5], a Coded Light Approach (CLA) camera [1] and a Time-of-Flight (TOF) camera [6], [7], [8]. No matter which kind of camera system is used, the performance of gesture segmentation for real-time applications relies on how fast and how robust it is. These requirements depend on the frame rate of the camera and the quality of the range images it delivers.

Although the CLA and laser range cameras provide highly accurate range images, they suffer from a low acquisition rate. In [1] Caplier et al. used a CLA camera with a frame rate of 12 Hz, and Heisele and Ritter used a laser range camera with a frame rate of 7 Hz [4].

The stereo vision camera also has a low frame rate, due to the computation time needed to calculate the disparity map from the right and left images. In [5] Nefian et al. used a stereo camera with a frame rate of 11 Hz. Another open issue with the stereo vision camera is that stereo range images are heavily affected by texture and environmental illumination conditions. We have already discussed this problem in [8]. The 3D Time-of-Flight (TOF) sensors, on the other hand, are becoming very attractive in the man-machine interaction field [6], [7], [8] because they provide gray-level images with reliable range data for each pixel. While the range images of a TOF camera are independent of texture and lighting conditions, they are somewhat affected by the color of the object. This is because the range in a TOF camera is calculated from the phase difference between the modulated infrared light transmitted to the object and the infrared light received back from it. As colors have different reflection factors, the range image of a TOF camera is affected by the color of the object, i.e., two objects with different colors at the same distance might yield different range data in a TOF image.
This paper, on the one hand, addresses this problem through a robust gesture segmentation that fuses the intensity and range information into a 2D feature space and, on the other hand, proposes a fast segmentation technique using our novel 3D TOF camera. We have used the combination of the two following unsupervised clustering techniques for segmentation:

• K-Means Clustering
• Expectation Maximization

The paper continues as follows: Section 2 presents an overview of the camera system we have used. In Section 3 the data fusion is discussed. Section 4 introduces the clustering techniques which have been used for segmentation in this paper. Section 5 summarizes our experimental results, while Section 6 concludes this work.

2 3D Time-of-Flight Camera

Range imaging in a 3D Time-of-Flight camera is the fusion of a distance measurement technique with the imaging aspect. The principle of range measurement in a TOF camera is based on measuring the time the light needs to travel from one point to another. This time, the so-called time of flight t, is directly proportional to the distance d the light travels:

d = c · t / 2    (1)

where c denotes the speed of light.

Our 3D non-scanning Time-of-Flight (TOF) camera system consists of an infrared lighting source, a Photonic Mixer Device (PMD) sensor [9], an FPGA-based processing unit and a communication unit.

The lighting source illuminates the scene with a modulated infrared light signal, generated by a MOSFET-based driver and a bank of high-speed infrared emitting diodes at a frequency of 20 MHz. The illuminated scene is observed by an intelligent pixel array (PMD) through a focusing lens, where each pixel on the PMD sensor samples the amount of modulated light and determines the turnaround time of the modulated light [8]. Typically this is done by using continuous modulation and measuring the phase delay in each pixel [10].

Assuming continuous sinusoidal or rectangular modulation, the distance is calculated as follows [10]:

d = c · ∆ϕ / (4π · fmod)    (2)

where fmod denotes the modulation frequency and ∆ϕ = 2π · fmod · t represents the phase delay.

At a modulation frequency of 20 MHz, the unambiguous distance is 15 meters, i.e., the maximum distance to the target is 7.5 meters, because the illumination has to cover the distance twice: from the sender to the target and back to the sensor chip.

To calculate the phase delay, the autocorrelation function of the electrical and optical signals is analyzed by a phase-shift algorithm. Using four samples A1, A2, A3 and A4, each shifted by 90 degrees, the phase delay ∆ϕ can be calculated using the following equation [8]:

∆ϕ = arctan((A1 − A3) / (A2 − A4))    (3)

In addition to the phase shift of the signal, the strength a of the received signal and the gray-scale value b are formulated respectively as follows [8]:

a = √((A1 − A3)² + (A2 − A4)²) / 2    (4)

b = (A1 + A2 + A3 + A4) / 4    (5)

The TOF camera, unlike the stereo vision camera, is texture independent, and since the range is calculated directly in each pixel with minimal processing, a very high frame rate, dependent on the exposure time, can be obtained.

In our work, we have used a PMD sensor with a resolution of 3K (64x48 pixels). The exposure time is set to 5 ms at a modulation frequency of 20 MHz. Under these conditions, and using a USB 2.0 communication protocol, a frame rate of 50 images per second is obtained, which is suitable for real-time applications. The range accuracy of this camera under the mentioned conditions is about ±1 cm.
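To make the measurement model concrete, the following NumPy sketch demodulates the four 90-degree-shifted samples of a PMD pixel array according to equations (2)-(5). It is an illustrative reconstruction under our own assumptions rather than the camera's actual processing chain; in particular, arctan2 is used in place of the plain arctan of equation (3) so that the phase is resolved over the full unambiguous range.

```python
import numpy as np

C = 299_792_458.0  # speed of light [m/s]
F_MOD = 20e6       # modulation frequency [Hz], as used in this paper

def pmd_demodulate(a1, a2, a3, a4):
    """Phase delay, distance, signal strength and gray value from the
    four 90-degree-shifted samples A1..A4 (equations (2)-(5))."""
    a1, a2, a3, a4 = (np.asarray(a, dtype=float) for a in (a1, a2, a3, a4))
    # Equation (3), with arctan2 to resolve the quadrant of the phase.
    phase = np.arctan2(a1 - a3, a2 - a4) % (2.0 * np.pi)
    # Equation (2): distance from the phase delay.
    distance = C * phase / (4.0 * np.pi * F_MOD)
    # Equation (4): strength of the received modulated signal.
    amplitude = np.sqrt((a1 - a3) ** 2 + (a2 - a4) ** 2) / 2.0
    # Equation (5): gray value as the mean of the four samples.
    gray = (a1 + a2 + a3 + a4) / 4.0
    return phase, distance, amplitude, gray

# Sanity check of the unambiguous range: the light covers the distance
# twice, so the maximum one-way distance at 20 MHz is c / (2 * f_mod) = 7.5 m.
assert abs(C / (2 * F_MOD) - 7.5) < 0.01
```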
An example of a TOF image, including the intensity and range images, is shown in Figure 1. The range image is color coded such that pixels in the foreground, which represent objects closer to the camera, are brighter. These two images are fused to provide the input data for the segmentation algorithm.

Figure 1: An example of a TOF image (left: range image, right: intensity image)
3 Range and Intensity Image Fusion

The TOF camera delivers three data items for each pixel at each time step: intensity, range and the amplitude of the received modulated light. The intensity image of the TOF camera, comparable to the intensity images of CCD or CMOS cameras, depends on the environmental lighting conditions, whereas the range image and the amplitude of the received modulated light are mutually dependent. They both depend on the reflection factor of the object, i.e., the constitution of the object's surface.

None of these individual data items can be used on its own for a robust segmentation under varying lighting and color conditions. Fusing them provides new feature information which improves the performance of the segmentation technique. In this paper we use a basic technique for fusing the range and intensity data which has already been used in other fields such as SAR imaging. The range data in a TOF image depend on the reflection factor of the object's surface (how much light is reflected back from the object). Therefore, there is a correlation between the intensity and range vector sets in a TOF image. These two vector sets are fused to derive a new data set, the so-called "phase", which indicates the angle between the intensity and range vector sets and is derived as follows.

First, using the intensity and range data of each image, a new set of complex numbers C is derived:

Crc = grc + i · drc    (6)

where grc corresponds to the normalized gray value and drc represents the normalized range information for the pixel in row r and column c of the intensity and range images respectively. Next, the phase ϕ of each complex number is calculated in the polar coordinate system for the whole pixel array:

ϕrc = arctan(drc / grc)    (7)

The phase of the complex value and the range data are then combined into a 2D feature space where each pixel is described by a feature vector frc containing range and phase information:

frc = (drc, ϕrc)    (8)

Here, range denotes the position of the object in the Z direction of the world coordinate system, which is aligned with the optical axis. The 2D information in the XY coordinate system is neglected, because in our gesture segmentation problem pixels with similar XY coordinates do not necessarily belong to the same physical object.
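The following sketch computes the feature vectors of equation (8) with NumPy. The min-max normalization of the gray and range channels is our own assumption; the paper only states that normalized values are used.

```python
import numpy as np

def fuse_range_intensity(gray, distance):
    """Per-pixel feature vectors f_rc = (d_rc, phi_rc) of equations (6)-(8).
    `gray` and `distance` are 2D images of identical shape."""
    gray = np.asarray(gray, dtype=float)
    distance = np.asarray(distance, dtype=float)
    # Normalize both channels to [0, 1] (min-max scaling assumed here).
    g = (gray - gray.min()) / max(np.ptp(gray), 1e-12)
    d = (distance - distance.min()) / max(np.ptp(distance), 1e-12)
    # Equation (7): phase of the complex number C_rc = g_rc + i * d_rc.
    phi = np.arctan2(d, g)  # equals arctan(d / g) for g > 0
    # Equation (8): stack into an N x 2 array, one row per pixel.
    return np.stack([d.ravel(), phi.ravel()], axis=1)
```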
4 Segmentation

Segmentation is the first step of image processing in computer vision applications such as gesture recognition. It is the process of distinguishing the object of interest from the background as well as from surrounding non-interesting objects. In other words, image segmentation aims at a better recognition of objects by grouping image pixels or finding the boundaries between the objects in an image.

Gesture segmentation in this paper is treated as a clustering problem. Clustering is an unsupervised learning technique for identifying groups of unlabeled data based on some similarity measure. Each group of unlabeled data, a so-called cluster, corresponds to an image region, while each data point is a feature vector which represents a pixel of the image.

Two techniques have been combined and employed for clustering in this paper; they are discussed in the next sections.

4.1 K-Means Technique

K-means is one of the simplest unsupervised learning algorithms that solves the well-known clustering problem by partitioning the data set {x1, x2, ..., xN} into some number K of clusters. For each data point xn, a binary membership function is defined as:

rnk = 1 if xn is assigned to cluster k, and rnk = 0 otherwise.

K-means aims at minimizing the objective function given by [12]:

J = Σ(n=1..N) Σ(k=1..K) rnk ‖xn − µk‖²    (9)

where ‖xn − µk‖² is the squared distance between the data point xn and the cluster center µk. The goal is to find the values of {rnk} and {µk} that minimize J. This is done through an iterative procedure in which each iteration involves two successive steps, corresponding to successive optimization with respect to the rnk and the µk [12]. The main advantages of this algorithm are its simplicity and speed. The computational cost of K-means is O(KN), which allows it to run on large data sets. However, K-means is a data-dependent algorithm: although it can be proved that the procedure always terminates, it does not necessarily achieve a global minimum.
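As a concrete illustration of the two alternating steps, here is a minimal K-means sketch in NumPy; it is a simplified stand-in for the textbook algorithm of [12], not the exact implementation used in our experiments.

```python
import numpy as np

def kmeans(x, k, n_iter=100, seed=0):
    """Minimize the objective J of equation (9) for data x of shape N x D.
    Returns the hard labels (as integer indices) and the centers mu_k."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)]  # random init
    labels = np.zeros(len(x), dtype=int)
    for _ in range(n_iter):
        # Step 1: optimize r_nk with the centers fixed (nearest center wins).
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 2: optimize mu_k with the assignments fixed (cluster means).
        new_centers = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # assignments stable, J can no longer decrease
        centers = new_centers
    return labels, centers
```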
Since K-means is a distance-based, hard-membership algorithm, every data point, at each iteration, is assigned uniquely to one, and only one, of the clusters. For data points which lie roughly midway between cluster centers, the hard assignment to the nearest cluster might not be the most appropriate one. By adopting probabilistic approaches, like Expectation Maximization (EM), soft assignments of data points can be obtained.

4.2 Expectation Maximization Technique

Expectation Maximization (EM) is a powerful method for finding the maximum likelihood solution for models with latent variables. This approach can be used for image segmentation, where each segment (cluster) is mathematically represented by a parametric Gaussian distribution. The entire data set is therefore modeled by a mixture of these distributions.

EM consists of iterations of two steps: the E-step and the M-step. After initialization, these two steps are repeated until the algorithm converges and gives a maximum likelihood estimate. The implementation of EM has been done as follows [12]:

• Initialization: the parameters we want to learn are initialized. These consist of the means µk, covariances Σk and mixing coefficients πk.

• Expectation: in the expectation step the expected value E(znk) for each data point xn is calculated. It is the probability that xn was generated by the kth distribution:

E(znk) = πk N(xn | µk, Σk) / Σ(j=1..K) πj N(xn | µj, Σj)    (10)

• Maximization: once the expected values E(znk) have been calculated, the parameters are re-estimated:

µk^new = (1/Nk) Σ(n=1..N) E(znk) xn    (11)

Σk^new = (1/Nk) Σ(n=1..N) E(znk) (xn − µk^new)(xn − µk^new)ᵀ    (12)

πk^new = Nk / N    (13)

where

Nk = Σ(n=1..N) E(znk)    (14)

• Evaluation: in this step the log likelihood is evaluated and the convergence of the parameters or of the log likelihood is checked. If the convergence criterion is not satisfied, the algorithm returns to the expectation step.
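The four steps above translate almost line by line into code. The sketch below is a compact NumPy/SciPy version of EM for a Gaussian mixture, written to mirror equations (10)-(14); the small diagonal term added to each covariance is our own numerical safeguard, not part of the paper's formulation.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(x, means, n_iter=100, tol=1e-6):
    """EM for a Gaussian mixture on data x (N x D), started from the
    given initial means (K x D). Returns responsibilities and parameters."""
    n, dim = x.shape
    k = len(means)
    covs = np.array([np.cov(x.T) + 1e-6 * np.eye(dim)] * k)  # initialization
    pis = np.full(k, 1.0 / k)
    prev_ll = -np.inf
    for _ in range(n_iter):
        # E-step, equation (10): responsibilities E(z_nk).
        dens = np.column_stack([
            pis[j] * multivariate_normal.pdf(x, means[j], covs[j])
            for j in range(k)
        ])
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step, equations (11)-(14): re-estimate all parameters.
        nk = resp.sum(axis=0)                    # equation (14)
        means = (resp.T @ x) / nk[:, None]       # equation (11)
        for j in range(k):                       # equation (12)
            diff = x - means[j]
            covs[j] = (resp[:, j, None] * diff).T @ diff / nk[j] \
                      + 1e-6 * np.eye(dim)
        pis = nk / n                             # equation (13)
        # Evaluation: stop once the log likelihood has converged.
        ll = np.log(dens.sum(axis=1)).sum()
        if abs(ll - prev_ll) < tol:
            break
        prev_ll = ll
    return resp, means, covs, pis
```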
4.3 K-Means Expectation Maximization (KEM) Technique

As already mentioned, the K-means algorithm has a hard membership function, and a small shift of a data point can flip it to a different cluster. The solution to this problem is to replace the hard clustering of K-means with soft probabilistic assignments [12]. In our paper this is done by the EM algorithm, because EM has no strict boundaries between clusters and a data point is assigned to each cluster with a certain probability.

However, techniques such as EM might yield poor clusters if the parameters are not initialized properly. To solve this problem we have proposed a technique which combines K-means with EM, so-called KEM. This technique is similar to that presented in [13]. It employs K-means as the initial clustering step to find the initial cluster centers. This reduces the sensitivity to the initial points and gives centers which are widely spread within the data. These centers are used as the initial parameters for EM, which then iterates to find the local maxima.
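Given the two building blocks sketched earlier, KEM itself reduces to a short pipeline. A minimal version, reusing the kmeans and em_gmm sketches above, could look as follows; the hard argmax over the final responsibilities is our own choice for producing one label per pixel.

```python
def kem_segment(features, k=6):
    """KEM as described above: K-means supplies widely spread initial
    centers, EM then refines them into soft probabilistic assignments."""
    _, centers = kmeans(features, k)       # initial hard clustering
    resp, *_ = em_gmm(features, centers)   # soft refinement by EM
    return resp.argmax(axis=1)             # final cluster index per pixel
```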
5 Experiments and Results

All experiments have been done in real time. The range and intensity images are taken directly in each snapshot of a TOF camera based on the PMD sensor. The resolution of the camera we have used is 3K (64x48 pixels). The modulation frequency and the exposure time have been set to 20 MHz and 5 ms respectively. Under these conditions, the frame rate of the camera is about 50 images per second, including the intensity and range images. Using K-Means Expectation Maximization (KEM), which was discussed in the last section, each image is segmented. The frame rate of the segmented images is above video frame rate, which is suitable for real-time gesture tracking and recognition. We evaluate our segmentation technique for the three following cases:

• The gesture is posed in the foreground of a simple scene.

Figure 2 depicts some images of this case. In these images the gesture is posed at a distance of more than 10 cm from the background, torso or face. The first row in Figure 2 shows the intensity images. The second row shows the coded range images, such that the pixels of the background are darker than the pixels of the gesture in the foreground. The third row shows the results of segmentation using the KEM technique with six clusters. Images 1 to 3 show the gesture against a plain background, while images 4 to 8 show the gesture in scenes where the user's body, face or arm are observed as well. As can be seen from the segmented images, the pixels related to the gesture have been grouped into one cluster very well, without any error. In this case, since the gesture's distance from the background, torso or face (10 cm) exceeds the statistical noise of TOF range images (about 4 cm), the range information, without fusion with the intensity data, can be used as a single feature for the segmentation algorithm, and it yields the same results as when the fusion of range and phase is employed as the input feature vector.

Figure 2: Results of gesture segmentation in a simple scene (first row: intensity images, second row: range images, third row: segmented images)

• The gesture is posed in the foreground of a cluttered and complex scene.

Some images of this case are shown in Figure 3. Here the gesture is posed in a cluttered and complex scene where the lighting conditions as well as the colors of the objects affect the TOF images and make the problem more complicated. In this case, we have segmented the images once using just the range data as a single feature and once using the fusion of the range and phase data. The first and second rows of Figure 3 show the intensity and range images respectively. The third row shows the results of segmentation using the range data, while the last row shows the segmented images using the fused data. As can be seen from the results, the segmented images using just the range information contain too many errors, and the pixels belonging to the gesture are not isolated from the pixels of the other objects very well. In images 1 and 4 the range data are affected by color, i.e., black (the color of the shirt in image 1 and of the hair in image 4) does not reflect much infrared light, and therefore the range data take wrong values for these objects. This is one of the problems of the TOF camera which we have already discussed in this paper. In range images 2 and 6, since the face is not illuminated very well by the lighting system, we get some errors in the range data. Also, the range data in images 3 and 5 contain errors because the gesture's distance from the torso, face, arm or other objects in these images is smaller than the statistical error of the TOF image (about 4 cm). These images show that TOF range images cannot be used on their own to build a robust gesture segmentation under such conditions. Fusing the range data with the intensity images, as proposed in this paper, solves this problem. As the segmentation results based on the fused data in the last row of Figure 3 show, the pixels related to the gesture have been grouped into one cluster and the gesture has been segmented very well from the face, torso and other objects in the complex scene.

Figure 3: Results of gesture segmentation in a complex scene (first row: intensity images, second row: range images, third row: segmented images using the range feature, fourth row: segmented images using the fusion of range and phase features)

• A sequence of gesture movement from the foreground to the background.

Figure 4 shows a sequence of the gesture moving from the foreground to the background in steps of 15 cm. As in the previous figures, the first and second rows show the intensity and range images respectively, while the third row shows the segmented images. The hand gesture is segmented from the user's body, face and arm very well in all of the sequence except image number 4, where the hand gesture and the face are posed at the same distance from the camera and both have the same intensity values. This is in fact the case in which the segmentation fails. This problem can be solved using our novel 2D/3D multi-camera [14], which employs two sensors to provide a high-resolution color image together with the range information. In our ongoing project these images are used for gesture recognition.

Figure 4: Results of gesture segmentation in a sequence of movement from the foreground to the background (first row: intensity images, second row: range images, third row: segmented images)
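For reference, a per-frame loop tying the earlier sketches together might look as follows. Here grab_frame() is a hypothetical stand-in for the PMD camera driver, which is not part of this paper.

```python
def segment_stream(grab_frame, k=6):
    """Yield one segmented label image per TOF frame (hypothetical driver)."""
    while True:
        gray, distance = grab_frame()                  # 64x48 intensity/range
        features = fuse_range_intensity(gray, distance)
        labels = kem_segment(features, k)
        yield labels.reshape(gray.shape)               # one region id per pixel
```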
6 Conclusion

This paper has described a fast and robust gesture segmentation technique based on the fusion of the range and intensity images of a TOF camera. This camera provides these images at a frame rate of 50 images per second, which is suitable for real-time gesture tracking and recognition applications. Two unsupervised clustering techniques, K-means and Expectation Maximization, are combined to segment the images.

The results show that the proposed technique is, on the one hand, fast enough to deliver the segmented images at video frame rate and, on the other hand, robust enough to be used as the first processing step in 2D/3D gesture tracking and gesture recognition.

We have also shown that in the case where the gesture has the same range and intensity information as the face, the segmentation fails. The solution to this problem is to use our novel 2D/3D camera, which delivers a high-resolution color image together with the range information.

7 Acknowledgments

This research has been supported by the DFG Dynamisches 3D Sehen - Multicam project and the DAAD IPP program at the Center for Sensor Systems (ZESS) in Germany.

References

[1] A. Caplier, L. Bonnaud, S. Malassiotis, and M. G. Strintzis, "Comparison of 2D and 3D analysis for automated cued speech gesture recognition," in 9th International Conference on Speech and Computer, SPECOM'04, 2004.

[2] C. Keskin, O. Aran, and L. Akarun, "Real time gestural interface for generic applications," in European Signal Processing Conference, EUSIPCO'05, 2005.

[3] T. Burger, A. Caplier, and S. Mancini, "Cued speech hand gestures recognition tool," in European Signal Processing Conference, EUSIPCO'05, 2005.

[4] B. Heisele and W. Ritter, "Segmentation of range and intensity image sequences by clustering," in IEEE International Conference on Information Intelligence and Systems, 1999.

[5] A. V. Nefian, R. Grzeszczuk, and V. Eruhimov, "A statistical upper body model for 3D static and dynamic gesture recognition from stereo sequences," in IEEE International Conference on Image Processing, 2001.

[6] B. S. Goektuerk and C. Tomasi, "3D head tracking based on recognition and interpolation using a time-of-flight depth sensor," in IEEE Conference on Computer Vision and Pattern Recognition, 2004.

[7] Z. Mo and U. Neumann, "Real-time hand pose recognition using low-resolution depth images," in IEEE Conference on Computer Vision and Pattern Recognition, 2006.

[8] S. Ghobadi, K. Hartmann, W. Weihs, C. Netramai, O. Loffeld, and H. Roth, "Detection and classification of moving objects - stereo or time-of-flight images," in Computational Intelligence and Security, IEEE, 2006, pp. 11-16.

[9] PMD, "PhotonICs PMD 3k-S 3D video sensor array with active SBI," www.pmdtec.com, 2007.

[10] T. Moeller, H. Kraft, and J. Frey, "Robust 3D measurement with PMD sensors," PMD Technologies GmbH, www.pmdtec.com.

[11] S. Gokturk, H. Yalcin, and C. Bamji, "A time of flight depth sensor - system description, issues and solutions," in IEEE Workshop on Real-Time 3D Sensors, 2004.

[12] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, 2006.

[13] S. Nasser, R. Alkhaldi, and G. Vert, "A modified fuzzy K-means clustering using expectation maximization," in IEEE World Congress on Computational Intelligence, 2006.

[14] T. Prasad, K. Hartmann, W. Weihs, S. Ghobadi, and A. Sluiter, "First steps in enhancing 3D vision technique using 2D/3D sensors," in Computer Vision Winter Workshop, 2006.