MOTION RECOVERY BASED ON FEATURE EXTRACTION FROM 2D IMAGES

Jianhui Zhao1, Ling Li2 and Kwoh Chee Keong3

1,3 School of Computer Engineering, Nanyang Technological University, Singapore, 639798
2 School of Computing, Curtin University of Technology, Perth, Australia, 6102

Abstract:

This paper presents a method for motion recovery from monocular images containing human motions. Image processing techniques, such as spatial filtering, linear prediction, cross correlation, and least squares matching, are applied to extract feature points from 2D human figures with or without markers. A 3D skeletal human model with encoded angular constraints is adopted. An Energy Function is defined to represent the residuals between the extracted feature points and the corresponding points obtained by projecting the human model onto the projection plane. A procedure for motion recovery is then developed, which makes it feasible to generate realistic human animations.

Key words:

Posture Reconstruction, Human Animation, Energy Function

1.

INTRODUCTION

There are two basic methods in classical computer animation: kinematics and dynamics approaches [1]. The disadvantage of these methods is their inability to filter out erroneous movements. If real images containing human motions are used to drive the virtual human body, more faithful motions and variations of dynamic scenes can be generated in the virtual world. This understanding leads us to a source from which a great amount of motion information can be obtained: monocular images containing human movements. This approach can be used in many fields [2,3], e.g. virtual reality, choreography, rehabilitation, communication, surveillance systems, movie production, the game industry, image coding, and gait analysis. However, due to the lack of information in the third dimension and the fact that the human body is an extremely complex object, the problem of generating 3D human motion from 2D images taken by a single camera is quite difficult.

It is mathematically straightforward to describe the process of projection from a 3D scene to a 2D image, but the inverse process is typically an ill-posed problem. Chen and Lee [4] presented a method to determine the 3D locations of human joints from a film recording of walking motion. In this method, geometric projection theory, physiological and motion-specific knowledge, and graph search theory are used. Another approach is the divide-and-conquer technique reported by Holt et al. [5] for human gait. Although the simplicity of this approach is attractive, it is unsatisfying since it does not exploit the fact that the different components belong to the same model. The reconstruction method proposed by Camillo J. Taylor [6] does not assume that the images are acquired with a calibrated camera, but the user is required to specify which end of each segment is closer to the observer. Barron and Kakadiaris [7] estimated both the human's anthropometric measurements and pose from a single image. Their approach requires the user to mark the segments whose orientation is almost parallel to the image plane. The novelty of our approach is that it is able to deal with human motions from 2D images without camera calibration or user interference. It provides an alternative way to produce human animation by low-cost motion capture, while avoiding many limitations of current motion tracking equipment.

2.

EXTRACTION OF FEATURE POINTS

2.1

Extraction from image with markers

Markers of different colors are attached to the tight clothes of a human subject where the joints are located. Motions of the subject are recorded by a digital camcorder at 30 frames per second. The video sequence is composed of m monocular images in JPEG format, and each frame can be represented as a discrete two-dimensional function $f_j(x,y)$. Suppose there are n markers on the human figure, and each marker is represented as $M_i(r,g,b)$, where r, g, and b are the color values of the marker in its Red, Green, and Blue planes respectively. Given a threshold value u, whether a pixel $p(x,y)$ belongs to the marker can be determined by:

$$p(x,y) \in M_i \iff (M_i - u \le p(x,y) \le M_i + u) \qquad (1)$$
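For illustration, Eq. (1) maps directly onto a per-channel threshold test. The following is a minimal numpy sketch; the function name and the per-channel interpretation of the ±u band are assumptions:

```python
import numpy as np

def marker_mask(frame, marker_rgb, u):
    """Eq. (1): boolean mask of pixels within threshold u of the
    marker color M_i(r, g, b)."""
    # frame: H x W x 3 image; marker_rgb: (r, g, b) of marker M_i
    diff = np.abs(frame.astype(int) - np.array(marker_rgb, dtype=int))
    # A pixel belongs to M_i if all three channels lie within +/- u
    return np.all(diff <= u, axis=2)
```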

The procedure for feature extraction from monocular images with markers is:


Step 1. The 2D image $f_j(x,y)$ is used as input.
Step 2. A 3×3 Low Pass Spatial Filter (LPSF) is applied to reduce noise, i.e., the intensity value of each point is replaced by the average of the pixels around it.
Step 3. The markers are tracked throughout $f_j(x,y)$ by Equation (1), and the pixels belonging to $M_i(r,g,b)$ are selected.
Step 4. The tracking result may be irregular, or composed of several discrete regions; the main continuous part is therefore selected as the result, and the other parts are discarded. As illustrated in Figure 1, the region in the blue circle is the main part, while those in the green circles are the discarded parts.

Figure 1. Processing of the tracked results.

Step 5. To extract the feature points, the following averaging method is applied:

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i \qquad (2)$$

where $\bar{x}$ and $\bar{y}$ are the averages of the coordinates of all pixels in the tracked result.
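Steps 1-5 can be condensed into a short Python sketch. The use of scipy's uniform_filter for the 3×3 LPSF and of scipy.ndimage.label to isolate the main continuous region are implementation assumptions; the paper does not prescribe a library:

```python
import numpy as np
from scipy.ndimage import uniform_filter, label

def extract_marker_point(frame, marker_rgb, u):
    """Steps 1-5: smooth, threshold (Eq. 1), keep the largest
    connected region, and return its centroid (Eq. 2)."""
    # Step 2: 3x3 low pass spatial filter on each color plane
    smoothed = uniform_filter(frame.astype(float), size=(3, 3, 1))
    # Step 3: pixels within +/- u of the marker color in every plane
    mask = np.all(np.abs(smoothed - np.array(marker_rgb)) <= u, axis=2)
    # Step 4: keep only the main (largest) connected region
    labels, count = label(mask)
    if count == 0:
        return None
    sizes = np.bincount(labels.ravel())[1:]
    main = labels == (np.argmax(sizes) + 1)
    # Step 5: centroid of the tracked pixels (Eq. 2)
    ys, xs = np.nonzero(main)
    return xs.mean(), ys.mean()
```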

2.2

Extraction from image without markers

Suppose there are m monocular frames in a video sequence without markers, and n feature points on the human body. Each frame is represented as $f_j(x,y)$, each feature point as $P_{i,j}(x,y)$ (the ith feature point in the jth frame), and each template window centered on a feature point as $w_{i,j}(x,y)$. Feature points in the first frame are picked manually, and the procedure for feature extraction from the other frames is:

Step 1. The 2D image $f_j(x,y)$ is used as input.
Step 2. A 3×3 LPSF is applied to reduce noise.
Step 3. Linear Prediction is used to predict $P'_{i,j+1}(x,y)$ based on the corresponding feature points in previous frames as follows:


$$P'_{i,j+1} = \begin{cases} P_{i,j}, & j = 1 \\ P_{i,j} + (P_{i,j} - P_{i,j-1}), & 1 < j \le m \end{cases} \qquad (3)$$
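Eq. (3) is a constant-velocity prediction; a minimal sketch follows, in which the function name and the track-history representation are illustrative:

```python
def predict_next(track):
    """Eq. (3): linearly predict the next position from the
    history [P_{i,1}, ..., P_{i,j}] of one feature point."""
    if len(track) == 1:            # j = 1: no velocity information yet
        return track[-1]
    x1, y1 = track[-2]             # P_{i,j-1}
    x2, y2 = track[-1]             # P_{i,j}
    # P_{i,j} + (P_{i,j} - P_{i,j-1})
    return (2 * x2 - x1, 2 * y2 - y1)
```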

Step 4. Normalized Cross Correlation (NCC) is utilized to find matches of the template image $w_{i,j}(x,y)$ within the search image $s_{i,j+1}(x,y)$ centered at $P'_{i,j+1}(x,y)$, by:

$$c(r,t) = \frac{\sum_x \sum_y s_{i,j+1}(x,y)\, w_{i,j}(x-r,\,y-t)}{\sqrt{\sum_x \sum_y s_{i,j+1}^2(x,y) \sum_x \sum_y w_{i,j}^2(x-r,\,y-t)}} \qquad (4)$$

The position where the maximum value of $c(r,t)$ appears is selected as $P''_{i,j+1}(x,y)$.
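A direct, brute-force rendering of Eq. (4) might look as follows; the sliding-window convention and the exhaustive search over (r, t) are assumptions of this sketch:

```python
import numpy as np

def ncc_match(search, template):
    """Eq. (4): slide the template w over the search window s and
    return the offset (r, t) maximizing the normalized correlation."""
    H, W = search.shape
    h, w = template.shape
    t_energy = np.sum(template.astype(float) ** 2)
    best, best_rt = -np.inf, (0, 0)
    for r in range(H - h + 1):
        for t in range(W - w + 1):
            patch = search[r:r + h, t:t + w].astype(float)
            denom = np.sqrt(np.sum(patch ** 2) * t_energy)
            if denom == 0:
                continue
            c = np.sum(patch * template) / denom
            if c > best:
                best, best_rt = c, (r, t)
    return best_rt
```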

Step 5. The Least Squares Matching method is applied to find the accurate $P_{i,j+1}(x,y)$ starting from the initial point $P''_{i,j+1}(x,y)$, during which affine transformations (i.e. rotation, shearing, scaling, and translation) are considered as follows:

$$x_{new} = a_0 + a_1 x + a_2 y, \qquad y_{new} = b_0 + b_1 x + b_2 y \qquad (5)$$
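The six parameters of Eq. (5) can be estimated from point correspondences by ordinary least squares; the numpy sketch below is a simplification, since the paper's Least Squares Matching iterates on image intensities rather than fitting pre-matched points:

```python
import numpy as np

def fit_affine(src, dst):
    """Least squares estimate of (a0, a1, a2) and (b0, b1, b2) in
    Eq. (5) from matched points src -> dst, each an (n, 2) array."""
    n = len(src)
    # Design matrix [1, x, y] for each source point
    A = np.column_stack([np.ones(n), src[:, 0], src[:, 1]])
    a, *_ = np.linalg.lstsq(A, dst[:, 0], rcond=None)  # x_new coefficients
    b, *_ = np.linalg.lstsq(A, dst[:, 1], rcond=None)  # y_new coefficients
    return a, b
```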

3.

MOTION RECOVERY

The employed 3D skeletal human model consists of 17 joints and 16 segments; the joints are Hip, Abdomen, Chest, Neck, Head, Luparm, Ruparm, Llowarm, Rlowarm, Lhand, Rhand, Lthigh, Rthigh, Lshin, Rshin, Lfoot, and Rfoot. Kinematic analysis resolves any motion into one or more of six possible components: rotation about and translation along the three mutually perpendicular axes. The rotational ranges of the joints [8] are utilized as the geometrical constraints on human motion. An Energy Function (EF) is defined to express the deviations between the image features and the corresponding projection features as

$$EF_i = Scale(1)\,\Delta\_angle_i + Scale(2)\,\Delta\_length_i + Scale(3)\,\Delta\_position_i \qquad (6)$$

where $\Delta\_angle_i$ is the deviation of orientation, $\Delta\_length_i$ is the deviation of length, and $\Delta\_position_i$ is the deviation of position, while $Scale(1)$, $Scale(2)$, and $Scale(3)$ are weighting parameters.
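In code, the per-segment energy of Eq. (6) could be sketched as below; the exact forms of the three deviation measures are not specified in the paper, so simple absolute and Euclidean differences are assumed:

```python
import numpy as np

def energy(image_feat, proj_feat, scale=(1.0, 1.0, 1.0)):
    """Eq. (6): weighted sum of orientation, length, and position
    residuals between an image segment and its model projection.
    Each feature is a tuple (angle, length, (x, y) position)."""
    d_angle = abs(image_feat[0] - proj_feat[0])
    d_length = abs(image_feat[1] - proj_feat[1])
    d_position = np.linalg.norm(np.subtract(image_feat[2], proj_feat[2]))
    return scale[0] * d_angle + scale[1] * d_length + scale[2] * d_position
```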


The procedure for motion recovery is as follows (a schematic sketch of the adjustment loop is given after the list):

Step 1. Take a series of monocular images as input.
Step 2. Extract the feature points.
Step 3. Calculate the initial projection value of the 3D model for every body part.
Step 4. Translate joint Hip in the plane parallel to the image so as to place the projected point of Hip at the accurate position.
Step 5. Rotate joint Hip based on Eq. (6); the descendant joints of Hip are Abdomen, Lthigh, Lshin, Lfoot, Rthigh, Rshin, and Rfoot.
Step 6. Adjust the other joints, each with respect to its immediate descendant, in the order: Abdomen, Lthigh (and Rthigh), Lshin (and Rshin), Chest, Neck, Luparm (and Ruparm), Llowarm (and Rlowarm).
Step 7. Translate joint Hip a second time, along the line defined by the camera position and the extracted Hip point, so that the projected posture has the same size as the human figure in the 2D image.
Step 8. Rotate joint Hip by Eq. (6) again, with reference to all the other joints of the human model.
Step 9. Adjust the other joints, each with respect to all of its descendants, in the same order as in Step 6.
Step 10. Display the recovered 3D postures.
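The two-pass hierarchical adjustment of Steps 4-9 can be summarized schematically; every name here (model, rotate_joint, the scope argument, etc.) is hypothetical, standing in for the paper's unspecified implementation:

```python
# Order in which the non-Hip joints are adjusted (Steps 6 and 9)
ORDER = ["Abdomen", "Lthigh", "Rthigh", "Lshin", "Rshin",
         "Chest", "Neck", "Luparm", "Ruparm", "Llowarm", "Rlowarm"]

def recover_pose(model, features, camera):
    # Step 4: place Hip's projection onto the extracted Hip point
    model.translate_hip_in_image_plane(features["Hip"], camera)
    # Step 5: rotate Hip, minimizing Eq. (6) over its descendants
    model.rotate_joint("Hip", features, camera, scope="descendants")
    # Step 6: adjust each joint against its immediate descendant
    for joint in ORDER:
        model.rotate_joint(joint, features, camera, scope="child")
    # Step 7: translate Hip along the camera-Hip ray to match scale
    model.translate_hip_along_view_ray(features["Hip"], camera)
    # Step 8: rotate Hip again, against all other joints
    model.rotate_joint("Hip", features, camera, scope="all")
    # Step 9: adjust each joint against all of its descendants
    for joint in ORDER:
        model.rotate_joint(joint, features, camera, scope="descendants")
    return model
```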

4.

EXPERIMENTAL RESULTS

The adopted method for human motion reconstruction from monocular images has been tested on several video sequences of human motions, as shown in Figure 2 and Figure 3. There are 8 frames in the kicking sequence of Figure 2, of which 3 (the 1st, 3rd, and 5th) are displayed; 3 of the 8 frames (the 2nd, 4th, and 6th) of the farewell sequence are illustrated in Figure 3. Figures in the 1st column are 2D frames of the video sequence; figures in the 2nd column show the extracted feature points (red, dotted) and the animated results (black, solid) from the same viewpoint; figures in the 3rd and 4th columns show the results from side and top views respectively.


Figure 2. Recovered motion from a kicking sequence.

Figure 3. Recovered motion from a farewell sequence.

5.

CONCLUSION

An approach for the reconstruction of human postures and motions from monocular images has been presented. The advantage of this method is that neither camera calibration nor user interaction is needed. Experiments show that the


reconstructed results are encouraging, while some improvements are still needed. Future work includes the automatic and accurate picking of occluded feature points, further study of the Energy Function and the biomechanical constraints, etc.

REFERENCES

1. Yahya Aydin and Masayuki Nakajima, Database Guided Computer Animation of Human Grasping Using Forward and Inverse Kinematics, Computers & Graphics, 23 (1999), pp. 145-154.
2. D. M. Gavrila, The Visual Analysis of Human Movement: A Survey, Computer Vision and Image Understanding, Vol. 73, No. 1, January 1999, pp. 82-98.
3. Thomas B. Moeslund and Erik Granum, A Survey of Computer Vision-Based Human Motion Capture, Computer Vision and Image Understanding, 81, 2001, pp. 231-268.
4. Zen Chen and Hsi-Jian Lee, Knowledge-Guided Visual Perception of 3-D Human Gait from a Single Image Sequence, IEEE Transactions on Systems, Man, and Cybernetics, Vol. 22, No. 2, March/April 1992, pp. 336-342.
5. Robert J. Holt, Arun N. Netravali, Thomas S. Huang, and Richard J. Qian, Determining Articulated Motion from Perspective Views: A Decomposition Approach, IEEE Workshop on Motion of Non-Rigid and Articulated Objects, 1994, pp. 126-137.
6. Camillo J. Taylor, Reconstruction of Articulated Objects from Point Correspondences in a Single Uncalibrated Image, Computer Vision and Image Understanding, 80, 2000, pp. 349-363.
7. Carlos Barron and Ioannis A. Kakadiaris, On the Improvement of Anthropometry and Pose Estimation from a Single Uncalibrated Image, IEEE Workshop on Human Motion, 2000, pp. 53-60.
8. Jianhui Zhao and Ling Li, Human Motion Reconstruction from Monocular Images Using Genetic Algorithms, Computer Animation and Virtual Worlds, 15 (2004), pp. 407-414.