Action Recognition Unrestricted by Location and Viewpoint Variation

IEEE 8th International Conference on Computer and Information Technology Workshops

Feiyue Huang, Guangyou Xu
Department of Computer Science and Technology, Tsinghua University, 100084, Beijing, China
[email protected], [email protected]

Abstract

Action recognition is a popular research topic in computer vision. So far, most proposed algorithms assume a fixed location and viewpoint of the subject, assumptions that are usually not valid in practical environments where the subject may roam freely. To address the difficulties of action recognition under location and view angle variation, we propose an approach based on the "Adapted Envelop Shape", a view-insensitive posture representation that extends to multi-camera environments. The Adapted Envelop Shape is then used as the input vector of a Hidden Markov Model for action training and recognition. Our method has the following desirable properties: 1) exact camera calibration is not needed; 2) action recognition is viewpoint and location invariant; 3) automatic switching of cameras according to the subject's location widens the visible area; 4) partial occlusion or partial invisibility of the human body is tolerated. Experimental results demonstrate the effectiveness of our method.

1. Introduction

Human action recognition is an active research area in computer vision. Several surveys have summarized and classified existing approaches in this area [1, 2, 3], and during the last few years many approaches for action recognition have been proposed [4, 5]. However, most action recognition systems only work in restricted scenarios; in particular, they usually depend on a fixed area and viewpoint. Our research aims to remove this kind of dependency and restriction. In this paper, we propose a general and practical solution for unrestricted human action recognition.

In our research work, we divide action recognition into two separate modules: posture representation and recognition. A posture is a representation of the human body for a single frame, for example the vector of distances from boundary pixels to the centroid [6]. The recognition module uses the pose estimates of the frames to classify actions. Our research focuses on finding a posture representation suitable for action recognition unrestricted by location and view angle variation, which implies that an actor can move as he likes, without any contrived restriction, in a scenario. Compared to a common action recognition system, an unrestricted action recognition system should have the following extra properties:
- viewpoint invariance
- location invariance
- occlusion tolerance

For recognition of human action, a good representation should tolerate variations in viewpoint, human location, identity, background, illumination and so on. Among them, viewpoint and location invariance are the most important: we can perform training and recognition in given environments and with specialized persons, but for unrestricted action recognition we cannot restrict the human body's displacement and rotation at any time, which inevitably leads to variable location and viewpoint.

There have been several approaches to viewpoint-invariant action recognition. Seitz and Dyer [7] described an approach to detect cyclic motion that is affine invariant. Cen Rao et al. [8] used the affine invariance of trajectories to recognize actions. Parameswaran and Chellappa [9] chose six joints of the body and calculated their 3D invariants for each posture, so that each posture can be represented by a parametric surface in a 3D invariance space. Weinland et al. [10] introduced Motion History Volumes as a free-viewpoint representation for action recognition, which requires multiple calibrated cameras.



Though there has been research on viewpoint-invariant action recognition, some problems remain to be solved. Most approaches depend on robust detection of meaningful features or on point correspondences, which are often hard to obtain. There is also a trade-off: to be insensitive to viewpoint, some information useful for discriminating different actions is often eliminated at the same time. Most posture representations are location invariant when the observation of the human is complete. However, in an unrestricted scenario, if the actor moves out of the camera's sight or part of the actor is occluded, the usual representations fail because only part of the body is available.

Huang and Xu [11] introduced the "Envelop Shape" as a view-insensitive posture representation, which is easy to acquire from silhouettes captured by two orthogonal cameras and contains enough discriminating features for action recognition. However, it is still deficient for unrestricted action recognition: entire silhouettes are required and it cannot deal with occlusion, which is what we address in this paper.

In this paper, we describe a multiple-camera deployment scheme. With the help of the cameras' homography and a camera switching scheme, we propose an improved representation, the "Adapted Envelop Shape", which removes the previous disadvantages: it is occlusion tolerant and does not need entire silhouettes. Experimental results on video sequences also demonstrate the effectiveness, efficiency, and robustness of our algorithm.

The remainder of this paper is organized as follows. In Section 2 we present a system overview. We propose the adaptation of the "Envelop Shape" in Section 3 and introduce the multiple-camera deployment scheme in Section 4. We give experimental results in Section 5 and conclude in Section 6.

2. System overview

Figure 1 shows the system flowchart. Temporally aligned videos from multiple cameras are the input data. After human detection and silhouette extraction, the body's position on the ground is located with the help of the cameras' homography. After looking up the static "Location-Pair" table, the system automatically switches to the active camera pair and uses the silhouettes of this pair. Once the location is determined, we align the silhouettes to a uniform vertical scale using each camera's location-scale data, which is acquired in advance. After the scale alignment, the silhouette pair is used to generate the Adapted Envelop Shape as the input vector of each frame. PCA dimensionality reduction is then performed, and a continuous Hidden Markov Model is used for action training and recognition.

Figure 1. System flowchart
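To make the flow concrete, the following schematic sketch (in Python) wires the stages of Figure 1 together. Every stage implementation here is a deliberately trivial stand-in (a threshold, a fixed ground point, a fixed camera pair); the real versions are the ones described in Sections 3 and 4, so the sketch only illustrates how the pieces connect, not how any of them is actually implemented.

    import numpy as np

    # Stand-in stages; see Sections 3 and 4 for the real methods.
    def extract_silhouette(frame):            # stand-in: threshold a grayscale frame
        return (frame > 0.5).astype(np.uint8)

    def locate_ground_point(silhouettes):     # stand-in for the homography-based location
        return (300.0, 420.0)

    def lookup_camera_pair(ground_point):     # stand-in for the static "Location-Pair" table
        return ("cam1", "cam3")

    def scale_align(sil, ground_point):       # stand-in for the location-scale alignment
        return sil

    def adapted_envelop_shape(sil_a, sil_b):  # stand-in: combine per-row widths of the pair
        x = sil_a.sum(axis=1).astype(float)
        y = sil_b.sum(axis=1).astype(float)
        return np.sqrt(x ** 2 + y ** 2)

    def frame_feature(frames_by_camera):
        sils = {c: extract_silhouette(f) for c, f in frames_by_camera.items()}
        gp = locate_ground_point(sils)
        cam_a, cam_b = lookup_camera_pair(gp)
        return adapted_envelop_shape(scale_align(sils[cam_a], gp), scale_align(sils[cam_b], gp))

    frames = {c: np.random.rand(120, 160) for c in ("cam1", "cam2", "cam3")}
    print(frame_feature(frames).shape)   # per-frame vector; PCA + a continuous HMM follow per sequence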

3. Posture representation

Posture representation is a key issue in human action recognition. We propose an adaptation of the Envelop Shape as a posture representation for unrestricted action recognition that can deal with situations in which the human body is partially occluded.

3.1 Envelop Shape

The "Envelop Shape" [11] is a view-insensitive posture representation: one vector is computed for each frame. Figure 2 shows Envelop Shapes of "Fetch" action samples. The third and fourth rows show the silhouettes of the real images from the two orthogonally placed cameras, and the fifth row shows the generated "Envelop Shape" vectors. The Envelop Shape changes little when the human rotates about the vertical axis, which is the most frequently occurring viewpoint variation.

Figure 2. Envelop Shape examples

To acquire the Envelop Shape, we set up two cameras whose image planes are both parallel to the vertical axis Y and whose optical axes are orthogonal; we can simply regard them as the front and profile views. We then extract silhouettes from the two cameras' views. After a height scale normalization of the silhouettes, the Envelop Shape is obtained using the following definition:

r = \sqrt{x^2 + y^2}    (1)

Expression (1) calculates the "r" value of the Envelop Shape at each height, where x and y are the corresponding widths of the two normalized silhouettes at the same height. As proved in [11], the Envelop Shape representation is view insensitive. Compared to previous approaches it is also easier to acquire and carries more distinguishing information for action recognition applications.

However, there are still some shortcomings. To acquire the Envelop Shape, we must establish the two camera views' correspondence along the vertical axis, that is, find the corresponding x and y used to calculate r in expression (1). When there is no occlusion in either view, we can regard the top and bottom points of the silhouettes as corresponding points and perform height scale normalization. But when part of the human body is occluded, which often happens, for example when the legs or feet are hidden, the correspondence along the vertical axis is lost and the Envelop Shape cannot be obtained. Another consideration is that in unrestricted scenarios a moving actor may enter or leave the two cameras' best sight area; sometimes the body is visible in only one camera's view, while the Envelop Shape requires silhouettes in both views. What is more, an affine camera model is assumed for the Envelop Shape; if the actor moves too close to a camera, the affine model may fail. How can we make use of the "Envelop Shape" in unrestricted application scenarios, when the human body is occluded or outside the cameras' best sight? We solve this problem with a multiple-camera deployment scheme.

3.2 Adapted Envelop Shape

In order to make use of the Envelop Shape, we need to be able to set up the body's vertical correspondence. When an actor moves around, the height of the body in the images varies with its distance to the camera. From the camera model we know that, if the distance is much larger than the focal length, the size of an object in the image can be assumed to be inversely proportional to the distance from the object to the camera lens. We can therefore write formula (2):

h = \frac{k}{Z} h_w    (2)

In this formula, k is a constant for a specific fixed camera, Z is the distance from the human to the camera, h_w is the actual height of the human body and h is the body height in the image. Z can be represented using the distance from the body's ground point to the camera. In the camera's coordinate system, let P = (X_c, Y_c, Z_c) be the body's point on the ground, so that Z equals Z_c. Since P is on the ground plane, it satisfies equation (3):

a X_c + b Y_c + c Z_c + d = 0    (3)

Here a, b, c and d are constants for a fixed camera. Let (u, v) be the pixel coordinates of the body's ground point P. From the camera model we have

u = \frac{f X_c}{Z_c} + u_0, \quad v = \frac{f Y_c}{Z_c} + v_0    (4)

Combining the two formulas above, we can write

Z_c = a' u + b' v + c'    (5)

where a', b' and c' are again constant parameters for a specific, fixed camera. Hence we obtain

h = \frac{1}{k_1 u + k_2 v + k_3} h_w    (6)

For a specific fixed camera, we can manually select some frames of a human body at a few different locations to solve for these constant parameters in advance. In this way, if the body's ground point can be located in the image, formula (6) lets us align the body height to its actual height scale using the pixel coordinates (u, v), regardless of the distance to the camera. For all cameras in the scenario we obtain this Location-Scale data in advance, that is, we solve for the parameters k_1, k_2 and k_3.
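As a concrete illustration, the Location-Scale parameters k_1, k_2, k_3 can be fitted by ordinary least squares from a handful of manually selected frames. This is only a sketch under the assumptions stated above: the ground-point coordinates and pixel heights below are made-up placeholder values, the calibration person's real height is taken as the unit, and the paper does not prescribe a particular fitting method.

    import numpy as np

    # Hypothetical calibration samples: ground-point pixel coordinates (u, v) and the
    # measured silhouette height h (in pixels) for a few manually selected frames of
    # the same person standing at different locations.
    samples_uv = np.array([[120.0, 400.0], [300.0, 350.0], [520.0, 420.0], [260.0, 480.0]])
    samples_h = np.array([180.0, 140.0, 200.0, 230.0])

    # Formula (6): h = h_w / (k1*u + k2*v + k3).  Taking the (constant but unknown) real
    # height h_w as the unit gives k1*u + k2*v + k3 = 1/h, which is linear in (k1, k2, k3).
    A = np.hstack([samples_uv, np.ones((len(samples_uv), 1))])
    k, *_ = np.linalg.lstsq(A, 1.0 / samples_h, rcond=None)

    def aligned_height(u, v, h_pixels):
        # Rescale an observed pixel height to the person's (relative) actual height,
        # independently of the distance to the camera.
        return h_pixels * (k[0] * u + k[1] * v + k[2])

    print(aligned_height(300.0, 350.0, 140.0))   # close to 1.0 for the calibration person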


We present the method for locating the body's ground point in the next section; for now, assume the ground point is already located, so the height scale alignment can be performed. This means we can set up a uniform correspondence of the body along the vertical axis of the images at arbitrary locations and for all cameras. We denote the Adapted Envelop Shape by R and define it in expression (7), where x_i and y_i are the widths of the two height-scale-aligned silhouettes at the same height i. If the body part at height i is occluded or invisible in either view, it is regarded as absent and r_i is set to zero:

R = [r_1, r_2, \ldots, r_N]^T, \quad r_i = \sqrt{x_i^2 + y_i^2}, \ \text{or} \ r_i = 0 \ \text{if} \ x_i \ \text{or} \ y_i \ \text{is absent}    (7)

Figure 3 compares the Envelop Shape and the Adapted Envelop Shape. The first row shows representations of entire silhouettes using either the Envelop Shape or the Adapted Envelop Shape. The second and third rows show the results for an occluded silhouette using the Envelop Shape and the Adapted Envelop Shape respectively. The first and fifth columns are the silhouettes of the two views, the second and fourth columns are the scaled silhouettes (the x and y vectors), and the third column is the R vector; the gray part marks what is occluded. Because of the wrong correspondence, the Envelop Shape fails in the presence of occlusion, whereas the Adapted Envelop Shape uses the cameras' Location-Scale information to align the body height and can therefore deal with occlusion.

Figure 3. Representations' comparison
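A minimal sketch of formula (7) follows. It assumes binary silhouettes that are already aligned to a common vertical scale, takes the number of foreground pixels in each row as the width (the horizontal extent of the row could be used instead), and takes per-row visibility masks marking occluded or out-of-view parts as given inputs; none of these choices is prescribed by the paper.

    import numpy as np

    def width_profile(silhouette):
        # Width of a binary silhouette at each (already scale-aligned) row.
        return (silhouette > 0).sum(axis=1).astype(float)

    def adapted_envelop_shape(sil_front, sil_profile, visible_front=None, visible_profile=None):
        # Formula (7): r_i = sqrt(x_i^2 + y_i^2) where both views provide a width at
        # height i, and r_i = 0 where the part is occluded or invisible in either view.
        x, y = width_profile(sil_front), width_profile(sil_profile)
        n = min(len(x), len(y))
        ok = np.ones(n, dtype=bool)
        if visible_front is not None:
            ok &= visible_front[:n]
        if visible_profile is not None:
            ok &= visible_profile[:n]
        return np.where(ok, np.sqrt(x[:n] ** 2 + y[:n] ** 2), 0.0)

    # Toy example: 6-row silhouettes whose two lowest rows are occluded in the profile view.
    front = np.zeros((6, 8), int); front[1:6, 2:6] = 1
    profile = np.zeros((6, 8), int); profile[1:4, 3:5] = 1
    vis_profile = np.array([True, True, True, True, False, False])
    print(adapted_envelop_shape(front, profile, visible_profile=vis_profile))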

4. Multiple cameras' deployment

As mentioned above, we deploy multiple cameras to perform unrestricted action recognition. Figure 4 shows our experimental scenario. Each pair of orthogonal cameras is configured according to the requirement of [11]; such a "camera pair" can be used to obtain the Adapted Envelop Shape. There are six cameras in total and five camera pairs: 1-3, 1-6, 3-5, 5-6 and 2-4. The Adapted Envelop Shape requires only one pair to perform action recognition, but in an actual scenario one pair of cameras has a limited field of view, so more pairs are configured to obtain a wider visible area. We switch camera pairs when the human body moves out of the sight of one pair. In real scenarios, a suitable number of camera pairs can be deployed according to the sight requirements and the area in which the body moves.

Figure 4. Experiment scenario

4.1. Homography recovery

For two cameras, there exist geometric constraints relating corresponding points in the two views to the 3D camera geometry. For a set of coplanar points, the constraints take the form of a homography. In our scenario, the different views share a common ground plane. We define the matrix H as

H = \begin{pmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & 1 \end{pmatrix}    (8)

Let (u_i, v_i) and (u_i', v_i') be a pair of corresponding points on the ground plane in the two views. They are related by H, up to a scale factor:

(u_i', v_i', 1)^T = H (u_i, v_i, 1)^T    (9)

The homography H can be recovered from a set of correspondence points. In this paper, the ground plane homography is computed using several landmarks on the ground plane.
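For illustration, the ground-plane homography of equations (8) and (9) can be recovered from a few landmark correspondences with a standard routine such as OpenCV's findHomography; the landmark coordinates below are made-up placeholders and the choice of library is ours, not the paper's.

    import numpy as np
    import cv2

    # Pixel coordinates of the same ground-plane landmarks in two camera views
    # (hypothetical values; at least four non-degenerate correspondences are needed).
    pts_view1 = np.array([[100, 400], [500, 410], [480, 200], [120, 210]], dtype=np.float32)
    pts_view2 = np.array([[80, 380], [460, 420], [430, 230], [90, 190]], dtype=np.float32)

    H, _ = cv2.findHomography(pts_view1, pts_view2)   # the 3x3 matrix of equation (8)

    def map_to_view2(u, v):
        # Equation (9): transfer a ground-plane point from view 1 to view 2
        # (divide by the third homogeneous coordinate).
        p = H @ np.array([u, v, 1.0])
        return p[0] / p[2], p[1] / p[2]

    print(map_to_view2(300, 300))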


4.2. Body location and best pair switch

We use the principal axis to locate the human body as it moves in the room [12]. The ground point of the human body is the intersection of the principal axis with the ground plane. The body's ground points in two cameras satisfy the homography correspondence, so we can transform the principal axis from the other camera's view into this view using the ground plane homography. The ground point of the human body is then the intersection of this view's principal axis with the transformed principal axis. Figure 5 shows samples.

Figure 5. Body location using principal axis

Now we can locate the body's ground point even when part of the body is occluded. When an actor moves in the room, our system is able to locate the human body and track his location on the ground. Figure 6 shows the body's location tracks for a sequence in camera 1 and camera 2.
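A small sketch of the intersection step is given below, using homogeneous coordinates (the line through two points and the intersection of two lines are both cross products). The principal-axis endpoints and the homography are made-up values for illustration only.

    import numpy as np

    def line_through(p, q):
        # Homogeneous representation of the 2-D line through points p and q.
        return np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])

    def intersect(l1, l2):
        x = np.cross(l1, l2)
        return x[:2] / x[2]

    # Hypothetical principal axes (two points each) of the body in camera 1 and camera 2,
    # and the ground-plane homography mapping camera-2 pixels into camera 1.
    axis_cam1 = (np.array([320.0, 120.0]), np.array([318.0, 430.0]))
    axis_cam2 = (np.array([200.0, 100.0]), np.array([205.0, 400.0]))
    H21 = np.array([[0.9, 0.02, 30.0], [0.01, 1.1, -20.0], [0.0001, 0.0, 1.0]])

    def to_cam1(p):
        q = H21 @ np.array([p[0], p[1], 1.0])
        return q[:2] / q[2]

    # The ground point in camera 1: intersection of camera 1's principal axis with the
    # transformed principal axis of camera 2 (Section 4.2).
    l1 = line_through(*axis_cam1)
    l2 = line_through(to_cam1(axis_cam2[0]), to_cam1(axis_cam2[1]))
    print(intersect(l1, l2))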

Figure 6. Body's location tracks

With the human body's location track, we can perform the cameras' best pair switch. We establish a "Location-Pair" look-up table in advance according to the cameras' placement and the room layout, then look up and switch to the best pair according to the actor's location online. In Figure 4, the shaded areas show the Location-Pair areas: each type of shading represents one pair. When the actor enters an area, the system automatically switches to the corresponding pair and uses it to extract the Adapted Envelop Shape.

5. Action recognition

Using the Adapted Envelop Shape and HMM models, we carried out unrestricted action recognition experiments on our own database of action video sequences. There are eight different actors, and each actor performs nine natural actions: "Point To", "Raise Hand", "Wave Hand", "Touch Head", "Communication", "Bow", "Pick Up", "Kick" and "Walk". In this scenario, the actors move and perform actions at arbitrary locations. Each action is performed by every actor about 10 times. Sometimes part of the body is occluded, so we obtain a partial Adapted Envelop Shape. We assume that only the lower part is occluded and the upper part is visible, and use the following policy to deal with partial Adapted Envelop Shapes (a small sketch of steps 4 and 5 is given below):
1. Select entire Adapted Envelop Shapes which are not occluded as training data.
2. Choose 1/3, 1/2 and 2/3 as baseline ratios.
3. Use the top part of the Adapted Envelop Shapes, cut according to each baseline ratio, as training data to train HMM models respectively.
4. For an action with a partial Adapted Envelop Shape to recognize, select the baseline ratio that is closest to, but not bigger than, the actual ratio of the visible part.
5. Use the selected baseline ratio to cut the partial Adapted Envelop Shape and use the corresponding HMM for recognition.
Figure 7 shows sample sequences of partial Adapted Envelop Shapes at the above baseline ratios; from top to bottom, the ratios are 1, 1/3, 1/2 and 2/3. For clarity, the occluded part is drawn in gray, although it is in fact absent.
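Steps 4 and 5 of the policy amount to a simple look-up and truncation, sketched below; the vector length and visible ratio are arbitrary example values.

    import numpy as np

    BASELINE_RATIOS = (1.0, 2 / 3, 1 / 2, 1 / 3)   # one HMM is trained per ratio

    def select_baseline_ratio(visible_ratio):
        # Step 4: the largest baseline ratio not exceeding the visible fraction
        # of the (top-aligned) body.
        candidates = [r for r in BASELINE_RATIOS if r <= visible_ratio + 1e-9]
        return max(candidates) if candidates else None

    def cut_to_ratio(r_vector, ratio):
        # Step 5: keep only the top part of the Adapted Envelop Shape vector.
        return r_vector[: int(round(len(r_vector) * ratio))]

    r = np.arange(30, dtype=float)          # a stand-in Adapted Envelop Shape vector
    ratio = select_baseline_ratio(0.6)      # 60% visible -> use the 1/2-ratio model
    print(ratio, cut_to_ratio(r, ratio).shape)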

Figure 7. Partial Adapted Envelop Shape

In short, since the Adapted Envelop Shape carries aligned information, we make the most of the partial vector for recognition. However, because of occlusion, a partial vector loses some of the information needed to discriminate certain actions; for example, if the legs and feet are occluded, it is hard to recognize "Kick" actions.

We perform subject-independent action recognition experiments. For each type of action, we train HMM models over all actors in the training set and use these models for recognition. The full Adapted Envelop Shape and its baseline-ratio parts are used respectively; that is, we train four HMM models using the ratios 1/3, 1/2, 2/3 and 1. For each action we pick forty unoccluded sequences as the training set and the remaining sequences as the test set. Table 1 shows the correct recognition rates: each row gives one action's recognition rate on the training and test sets when using the four kinds of HMM models. Since a partial vector is less discriminative, some actions are excluded from the action set at some baseline ratios; these cases are marked with a dash.
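The per-action HMM training and maximum-likelihood recognition could look roughly like the sketch below. The paper does not name an implementation; hmmlearn and scikit-learn are our choices here, and the sequences are random stand-in data rather than real Adapted Envelop Shape features.

    import numpy as np
    from sklearn.decomposition import PCA
    from hmmlearn import hmm

    # Stand-in training data: a few (frames x feature-length) sequences per action.
    rng = np.random.default_rng(0)
    train = {"wave": [rng.random((40, 60)) for _ in range(5)],
             "bow": [rng.random((35, 60)) + 0.5 for _ in range(5)]}

    # PCA dimensionality reduction fitted on all training frames.
    pca = PCA(n_components=10).fit(np.vstack([s for seqs in train.values() for s in seqs]))

    models = {}
    for action, seqs in train.items():
        X = np.vstack([pca.transform(s) for s in seqs])
        lengths = [len(s) for s in seqs]
        m = hmm.GaussianHMM(n_components=4, covariance_type="diag", n_iter=20)
        m.fit(X, lengths)                      # one continuous HMM per action
        models[action] = m

    def recognize(sequence):
        X = pca.transform(sequence)
        return max(models, key=lambda a: models[a].score(X))   # highest log-likelihood wins

    print(recognize(rng.random((30, 60)) + 0.5))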

Table 1. Action recognition results using Adapted Envelop Shape (correct recognition rates, %)

Action          | Train sets: 1 |  2/3 |  1/2 |  1/3 | Test sets: 1 |  2/3 |  1/2 |  1/3
Point To        | 98.4          | 95.6 | 91.3 | 97.2 | 95.3         | 95.3 | 83.3 | 91.1
Raise Hand      | 98.8          | 95.6 | 83.4 | 94.1 | 93.2         | 92.2 | 82.3 | 92.1
Wave Hand       | 94.1          | 93.4 | 84.7 | 96.9 | 90.5         | 91.2 | 78.1 | 86.5
Touch Head      | 97.5          | 93.6 | 94.1 |  -   | 92.3         | 94.1 | 79.5 |  -
Communication   | 87.8          | 95   | 91.3 |  -   | 86.6         | 87.2 | 82.1 |  -
Bow             | 99.1          | 92.3 | 93.8 |  -   | 98.2         | 91.1 | 82.3 |  -
Pick Up         | 96.3          | 94.7 | 89.1 |  -   | 93.5         | 83.2 | 78.1 |  -
Kick            | 97.8          | 88.1 |  -   |  -   | 95.5         | 82.2 |  -   |  -
Walk            | 91.3          | 82.8 |  -   |  -   | 89.1         | 81.5 |  -   |  -

6. Conclusion

To perform unrestricted action recognition, we propose a practical solution based on a multiple-camera deployment that deals with view and location variation as well as occlusion. We adapt the Envelop Shape representation and use it for action recognition. Experiments show that our system has good discriminating ability for action recognition under arbitrary body movement and occlusion. Further work remains to realize a fully automatic action recognition system: our system currently relies on manual temporal action segmentation, and developing an automatic segmentation algorithm is an important next step.

Acknowledgements

This work is supported by the National Science Foundation of China under grants No. 60673189 and No. 60433030.

References

[1] J.K. Aggarwal, Q. Cai, Human motion analysis: a review, Computer Vision and Image Understanding, 73(3), 1999, pp. 428-440.
[2] T.B. Moeslund, E. Granum, A survey of computer vision based human motion capture, Computer Vision and Image Understanding, 81(3), 2001, pp. 231-268.
[3] Liang Wang, Weiming Hu, Tieniu Tan, Recent developments in human motion analysis, Pattern Recognition, 36(3), 2003, pp. 585-601.
[4] Y. Sheikh, M. Shah, Exploring the space of an action for human action recognition, International Conference on Computer Vision, 2005.
[5] M. Blank, L. Gorelick, E. Shechtman, M. Irani, R. Basri, Actions as space-time shapes, International Conference on Computer Vision, 2005.
[6] Liang Wang, Tieniu Tan, Huazhong Ning, Weiming Hu, Silhouette analysis-based gait recognition for human identification, IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(12), Dec. 2003, pp. 1505-1518.
[7] Steven M. Seitz, Charles R. Dyer, View-invariant analysis of cyclic motion, International Journal of Computer Vision, 1997.
[8] Cen Rao, A. Yilmaz, M. Shah, View-invariant representation and recognition of actions, International Journal of Computer Vision, 50(2), 2002.
[9] V. Parameswaran, R. Chellappa, Human action recognition using mutual invariants, Computer Vision and Image Understanding, 2005.
[10] D. Weinland, R. Ronfard, E. Boyer, Free viewpoint action recognition using motion history volumes, Computer Vision and Image Understanding, 2006.
[11] Feiyue Huang, Guangyou Xu, Viewpoint insensitive action recognition using envelop shape, The 8th Asian Conference on Computer Vision, 2007.
[12] Weiming Hu, Min Hu, Xue Zhou, Tieniu Tan, Jianguang Lou, Steve Maybank, Principal axis-based correspondence between multiple cameras for people tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(4), April 2006.

