Recognizing action events from multiple viewpoints

Tanveer Syeda-Mahmood
IBM Almaden Research Center
650 Harry Road, San Jose, CA 95120
[email protected]

A. Vasilescu
Dept. of Computer Science
University of Toronto, Canada
[email protected]

S. Sethi
Dept. of Computer Science
Boston University
[email protected]

Abstract
A first step towards an understanding of the semantic content in a video is the reliable detection and recognition of actions performed by objects. This is a difficult problem due to the enormous variability in an action's appearance when seen from different viewpoints and/or at different times. In this paper we address the recognition of actions by taking a novel approach that models actions as special types of 3d objects. Specifically, we observe that any action can be represented as a generalized cylinder, called the action cylinder. Reliable recognition is achieved by recovering the viewpoint transformation between reference (model) and given action cylinders. A set of 8 corresponding points from time-wise corresponding cross-sections is shown to be sufficient to align the two cylinders under perspective projection. A surprising conclusion from visualizing actions as objects is that rigid, articulated, and non-rigid actions can all be modeled in a uniform framework.

1 Introduction

Action events are a common occurrence in many real-world applications such as surveillance, human-computer interfaces, interactive environments, and content-based retrieval. For example, in surveillance, both usual and unusual activities may be useful to monitor. In distributed learning, the emphasis on a topic can be learned from the pointing actions made by the speaker. Automatic interpretation of sign language requires the analysis of actions. Finally, observing user actions or gestures is important in human-computer interaction and the design of HCI systems.

Despite the progress made in image and video content analysis, detecting and recognizing action events has remained an outstanding problem for several reasons. First, since each action is associated with an object, it requires object detection. Secondly, the segmentation of actions continuously performed by an object into distinct actions poses challenging problems. Thirdly, an action can show enormous variability in image appearance when seen at different time intervals: changes in imaging conditions such as illumination changes, viewpoint changes, and pure variations in execution style can considerably alter the appearance of an action in an image from its original description. Finally, an action may only be partially executed or partially visible in the image during its execution, making its automatic detection and recognition difficult. Although the problem of action recognition has been actively addressed, particularly in the area of gesture recognition, the above problems still remain outstanding.

Existing work in this area can be broadly classified into region-based, trajectory-based, and part-based approaches. The region-based approaches analyze spatio-temporal patterns of actions. These can be two-dimensional shapes, as in the case of temporal templates [7] or view models [6], or three-dimensional shapes, as in the case of the XYT cube [14] or image solid [15]. The trajectory-based approaches analyze temporal trajectories of features for properties such as velocity, direction and speed, joint angles, and spatio-temporal curvature in order to classify actions [13, 17]. More complex spatial models of human action treat the body as a set of connected parts [2, 4, 10]. There has also been work on the recognition of activities using the temporal evolution of the parameters of such models [1, 8]. From our understanding of work in this area, all approaches to action recognition are sensitive to changes in viewpoint. For example, it is common knowledge
that the recognition accuracy can drop considerably for actions such as sitting when seen sideways versus frontally. Further, 3d spatio-temporal shapes such as the XYT cube [14] and image solid [15], being formed from the entire image sequence, are useful mainly for detecting high-level patterns such as those in periodic actions. In addition, the trajectory-based approaches require accurate tracking, which is often difficult for people due to clothing and self-occlusions. While tolerance to variations such as local changes of speed is not explicitly addressed, most approaches do take some of the execution-style differences into account during matching, using techniques such as dynamic time warping [5]. Finally, not many approaches have addressed the recognition of actions from their partial appearance.

In this paper, we address the problem of robust recognition of action events. Specifically, we focus on the recognition of an action performed by an object when seen from a different viewpoint. In particular, we observe that the shape formed from the successive perspective projections of an object executing an action can be visualized as a generalized cylinder, called the action cylinder. Using this representation, action recognition involves finding a set of corresponding features between model and given action cylinders and verifying the correspondence through projection of the model cylinder. The paper makes several novel contributions to advance the state of the art in action recognition. First, by expressing actions as shapes, it allows the entire machinery of object recognition techniques to be brought to bear on the action recognition problem. It also allows all complex action types, rigid, articulated or non-rigid, to be modeled in a uniform framework. Next, it shows that a set of eight corresponding features from corresponding cross-sections is sufficient to recover the viewpoint transformation as well as the time correspondence needed for recognizing the action. Finally, by combining the effect of shape and motion in a 3d shape, it allows their relative contributions to be explicitly modeled.

The rest of the paper is organized as follows. In Section 2, we model actions as shapes through action cylinders. In Section 3, we discuss the recognition of actions using action cylinders. In Section 4, we present our action recognition algorithm. Finally, in Section 5, we present results on the performance of our algorithm on a large action data set.

2 Modeling actions as shapes

Figure 1: Illustration of a generalized cylinder. The third dimension here is time.

For the purposes of this paper, we define an action to be the motion sequence executed by a 3D object. We now analyze the shape formed by an action using the following assumptions. Assumptions: 1) The imaging situation depicts a single object undergoing an action. 2) It is viewed using a single fixed camera under full perspective projection. 3) The object can be separated from the background by some means (several methods of background subtraction are available [8]). 4) A single action has been segmented from the previous and subsequent actions done by the object. The last assumption is critical to the identity of an action as an object. Under the above assumptions, as an object undergoes motion, a point $P(t) = (X(t), Y(t), Z(t))$ on the 3d object at time instant $t$ is projected into the image at location $p(t) = (x(t), y(t))$. If we consider the spatio-temporal solid formed from the time-varying projections of object features $(x(t), y(t))$ as a function of time $t$, the resulting 3D shape formed in image-time space is a cylinder, which we call the action cylinder.

Observation: The successive perspective projections of every isolated action of a 3d object moving in space can be represented as a generalized cylinder in image-time space.

As shown in Figure 1, a generalized cylinder is represented by $(A(s), E(s, \theta, \alpha), \Phi(s))$, where $A(s)$ is the axis of revolution of the cylinder parameterized by $s$, $(\theta, \alpha)$ specify the cross-section in the cylinder's coordinate frame, and $\Phi(s)$ is the change in orientation of the cross-sections with respect to the axis of revolution. Action cylinders are generalized cylinders by the following equivalence. $E(s, \theta, \alpha)$ represents the 2d projections of the 3d object as it undergoes motion and is clearly affected by the shape of the object. The change in the cross-sections over time is due to motion as well as the effects of perspective projection. $\Phi(s)$ represents the change in the apparent orientation of the 2d shapes (projections) due to both motion and the effect of projection. Finally, $A(s)$ is the axis of revolution that generates the resulting shape from the successive cross-sections and is affected by shape, motion, and projection. Thus the action cylinder is a representation for capturing the combined shape and motion information of an object. Because it is formed from the projections, it also accounts for the effect of projection. Other variations of the action cylinder are also possible, such as a feature-only version using corners within bounding and interior contours of the cross-sections. In this paper, we use a feature-based representation of the action cylinder in all our analysis.

Action cylinder vs. other action representations

It is interesting to compare and contrast the action cylinder with other action representations. Action cylinders are similar to temporal templates [7] and XYT cubes [14, 15] in that they are all region-based descriptors. Unlike temporal templates, action cylinders are 3d shapes that preserve the order with respect to time. Unlike other 3d spatio-temporal shapes that form cubes from the entire image, the action cylinder represents only the object undergoing motion, thus capturing the action in a precise fashion. Also, in other spatio-temporal approaches, the recognition of the action is achieved by extracting and classifying attributes. In contrast, we perform a direct geometric match of the entire shape of the action cylinders for recognizing actions. Since action cylinders are unique to a particular action and a specific object, they can distinguish the same action done by a different object, unlike temporal trajectories [13, 17]. The temporal trajectories of different points on an object can appear quite different and may not represent the action unless the entire 3d image-time shape is considered. Further, action cylinders allow detailed capture of the action without requiring complex tracking of the entire object. When they are represented using corner features, action cylinders are relatively insensitive to illumination changes. Action cylinders can also be sub-sampled for slow actions without affecting the overall perception of the cylinder's shape and hence the action. Finally, because of the built-in redundancy in an action cylinder due to appearing and disappearing features from the same object location, they are relatively insensitive to feature detection errors.
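To make the feature-based representation concrete, the sketch below shows one plausible way to store an action cylinder as a time-ordered list of cross-sections, each holding the 2d corner features of one frame. The class and method names (CrossSection, ActionCylinder, subsample) are illustrative assumptions rather than the authors' data structures.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class CrossSection:
    """One cross-section of an action cylinder: the 2d corner features
    (x, y) detected in a single frame at time instant t."""
    t: int
    corners: List[Tuple[float, float]] = field(default_factory=list)

@dataclass
class ActionCylinder:
    """An action cylinder as a time-ordered list of cross-sections, so the
    third (time) dimension of the generalized cylinder is preserved."""
    label: str
    cross_sections: List[CrossSection] = field(default_factory=list)

    def add_frame(self, t: int, corners: List[Tuple[float, float]]) -> None:
        self.cross_sections.append(CrossSection(t, corners))

    def subsample(self, step: int) -> "ActionCylinder":
        # Slow actions can be temporally sub-sampled without changing the
        # overall perception of the cylinder's shape (see discussion above).
        return ActionCylinder(self.label, self.cross_sections[::step])
```

A stored model library is then simply a collection of such objects, one per isolated model action.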
3 Recognizing actions using action cylinders

Using the action cylinder representation, the problem of action recognition can now be posed as a model-based object recognition problem. That is, the problem is to determine whether a given action cylinder of an isolated action matches one of a set of stored action cylinders (called model cylinders). To address the matching in a manner analogous to model-based object recognition, we need to take into account the effect of various imaging conditions, such as illumination changes, viewpoint changes, occlusions, and noise, that can cause the two cylinders to appear different. In addition, since the action cylinders represent motion information, we need to consider differences produced by changes in execution style, which include velocity and acceleration variations. We now describe a way to match two action cylinders that addresses some of the above issues. Specifically, we focus on recognizing actions when seen from multiple viewpoints.

3.1 Recognizing actions under changes in viewpoint

With cross-sections coming from perspective projections of the 3d object, an action cylinder is clearly not viewpoint invariant. It is, however, possible to recover the viewpoint transformation by a match of feature points across the two action cylinders. To see this, consider an action cylinder $S(t) = (x(t), y(t))$ seen from one viewpoint $V_1$. Let the stored representation be of the same action but seen from a different viewpoint $V_2$, and denote it by the model cylinder $\tilde{S}(\tilde{t}) = (\tilde{x}(\tilde{t}), \tilde{y}(\tilde{t}))$. Let there be a pair of corresponding cross-sections $\tilde{S}(\tilde{t}_0) = (\tilde{x}(\tilde{t}_0), \tilde{y}(\tilde{t}_0))$ on the model cylinder and $S(t_0) = (x(t_0), y(t_0))$ on the image cylinder. These are corresponding in the sense that $\tilde{S}(\tilde{t}_0)$ is the 2d projection of the object seen from viewpoint $V_2$ at the same time instant as the 2d projection $S(t_0)$. In general, the time profiles $\tilde{t}_0$ and $t_0$ are identical only in the special case when the action is viewed simultaneously at the same frame rate.

To recognize an action under changes in viewpoint, we use a well-known relation between two views of an object under perspective projection, derived in several papers including [16]. We incorporate time information into the notation to specifically address the shapes induced by actions. Using matrix notation, and choosing the viewpoint of the model cylinder as the reference viewpoint, a point on the 3d object $P(\tilde{t}_0) = (X(\tilde{t}_0), Y(\tilde{t}_0), Z(\tilde{t}_0))^T$ projects to the points $p(t_0) = (x(t_0), y(t_0))^T \in S(t_0)$ and $\tilde{p}(\tilde{t}_0) = (\tilde{x}(\tilde{t}_0), \tilde{y}(\tilde{t}_0))^T \in \tilde{S}(\tilde{t}_0)$. Using homogeneous coordinates $\hat{P}(\tilde{t}_0) = (X(\tilde{t}_0)\; Y(\tilde{t}_0)\; Z(\tilde{t}_0)\; 1)^T$, $\hat{p}(t_0) = (x(t_0)\; y(t_0)\; 1)^T$ and $\hat{\tilde{p}}(\tilde{t}_0) = (\tilde{x}(\tilde{t}_0)\; \tilde{y}(\tilde{t}_0)\; 1)^T$, we have

$$k\,\hat{p}(t_0) = A\,[R\;\;T]\,\hat{P}(\tilde{t}_0) \qquad (1)$$
where $k$ is a scale factor called the projective depth, and the camera position and orientation of the new
viewpoint is encoded in the 3x3 rotation matrix $R$ and the translation $T$ representing the rigid transformation that aligns the camera reference frames, and $A$ is the camera parameter matrix given by

$$A = \begin{bmatrix} \alpha & \gamma & u_0 \\ 0 & \beta & v_0 \\ 0 & 0 & 1 \end{bmatrix} \qquad (2)$$

where $(u_0, v_0)$ are the coordinates of the principal point, $\alpha = -f n_u$ and $\beta = -f n_v$ are the focal lengths in horizontal and vertical pixels, $f$ is the focal length in mm, $n_u$ and $n_v$ are the effective number of pixels per millimeter along the image axes, and $\gamma$ is the skew of the image axes. If we take the first viewpoint reference frame as the world reference frame, then the two views of the same action at corresponding time instants are related by $\kappa\,\hat{\tilde{p}}(\tilde{t}_0) = A\,[I\;\;0]\,\hat{P}(\tilde{t}_0)$ and $k\,\hat{p}(t_0) = A'\,[R\;\;T]\,\hat{P}(\tilde{t}_0)$, or

$$k\,\hat{p}(t_0) = A'\,[R\;\;T]\,\hat{P}(\tilde{t}_0) = \kappa A' R A^{-1}\,\hat{\tilde{p}}(\tilde{t}_0) + A'T = \kappa H\,\hat{\tilde{p}}(\tilde{t}_0) + e' \qquad (3)$$

where $e' = A'T$ and $H = A'RA^{-1}$. Here we allow the internal parameters of the two cameras to be different. The above equation captures the constraint that the corresponding point $\hat{p}(t_0)$ lies on the line going through the epipole $A'T$ and the point $H\hat{\tilde{p}}(\tilde{t}_0)$. Expressing this collinearity in projective coordinates with the external product $\hat{p}^T(t_0)\,\big(e' \wedge H\hat{\tilde{p}}(\tilde{t}_0)\big) = 0$, or

$$\hat{p}^T(t_0)\, F\, \hat{\tilde{p}}(\tilde{t}_0) = 0 \qquad (4)$$

where $F$ is called the fundamental matrix. In the most general case, the only geometric information that can be computed from pairs of images is the fundamental matrix. Since equation 4 is homogeneous, the 3x3 matrix $F$ can only be recovered up to a scale factor. In fact, eight corresponding points in general position are sufficient to solve for $F$ up to a unique scale factor, as shown earliest by Longuet-Higgins [11] and popularized by Tsai and Huang [16]. Thus, to recover the components of $F$ we expand equation 4 and set $f_{33} = 1$ to form the linear system of equations

$$x_i(t_0)\tilde{x}_i(\tilde{t}_0) f_{11} + x_i(t_0)\tilde{y}_i(\tilde{t}_0) f_{12} + x_i(t_0) f_{13} + y_i(t_0)\tilde{x}_i(\tilde{t}_0) f_{21} + y_i(t_0)\tilde{y}_i(\tilde{t}_0) f_{22} + y_i(t_0) f_{23} + \tilde{x}_i(\tilde{t}_0) f_{31} + \tilde{y}_i(\tilde{t}_0) f_{32} = -1, \quad i = 1, \ldots, 8 \qquad (5)$$

Although several ways of computing the fundamental matrix have evolved in the literature since [3, 12], the 8-point algorithm based on the above equations has been shown to be stable with simple normalization of the coordinate points [9].

Traditionally, the recovery of $F$ has led to the decomposition of $F$ into rotation and translation components to recover motion and hence the structure of the 3d object. In the context of action recognition, though, equation 4 states the following:

Claim 1: A set of eight corresponding points between corresponding cross-sections is sufficient to recover the viewpoint change between two action cylinders.

Recognizing actions after recovering viewpoint

Does the recovery of the viewpoint automatically enable the recognition of the action? Ordinarily, this would require knowledge of the 3d shape of the object. Since no 3d shape information is known for either the model or the image action cylinder, the recovery of viewpoint alone is not sufficient to enable recognition. Notice, however, that Equation 4 is essentially a re-statement of the epipolar line constraint. That is, a point on the reference or model action cylinder that corresponds to a point on the given action cylinder is only known up to a line. If a given action cylinder is indeed depicting the same action from a different viewpoint, and the chosen 8-point correspondence is correct, then we should expect a large number of features on the model action cylinder to lie on or near the epipolar lines corresponding to features on the given action cylinder. That is, if we collect all the features from the model ($M$ features) and the given ($N$ features) action cylinders at the corresponding time instants $\tilde{t}_0$ and $t_0$ into matrices, we have

$$\hat{S}(t_0) = \begin{bmatrix} \hat{p}_1(t_0)^T \\ \vdots \\ \hat{p}_N(t_0)^T \end{bmatrix} \qquad (6)$$

and

$$\hat{\tilde{S}}(\tilde{t}_0) = \begin{bmatrix} \hat{\tilde{p}}_1(\tilde{t}_0) & \cdots & \hat{\tilde{p}}_M(\tilde{t}_0) \end{bmatrix} \qquad (7)$$

Then the matrix product

$$\hat{S}(t_0)\, F\, \hat{\tilde{S}}(\tilde{t}_0) \qquad (8)$$

can be used to form a verification measure, since it will have close-to-zero entries for all candidate corresponding points $(x_i(t_0), y_i(t_0))$ and $(\tilde{x}_j(\tilde{t}_0), \tilde{y}_j(\tilde{t}_0))$. Let $n_{zero}$ be the number of such entries. Since a single point may match more than one point by this measure, a simple way to avoid this multiple count is to use $\min(n_{zero}, N, M)$ to obtain a normalized verification score

$$v(t_0, \tilde{t}_0) = \frac{\min(n_{zero}, N, M)}{\min(N, M)} \qquad (9)$$
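Equations 5, 8, and 9 translate almost line-for-line into code. The sketch below, using NumPy, solves the eight-equation linear system with $f_{33} = 1$ for one pair of corresponding cross-sections and then counts near-zero entries of the bilinear form as the verification measure. The tolerance eps and the exact normalization by min(N, M) are illustrative assumptions rather than constants from the paper.

```python
import numpy as np

def fundamental_from_8_points(pts_given, pts_model):
    """Solve Equation 5 for F with f33 fixed to 1.

    pts_given, pts_model: (8, 2) arrays of corresponding points
    (x_i(t0), y_i(t0)) and (x~_i(t~0), y~_i(t~0)).
    """
    A = np.zeros((8, 8))
    b = -np.ones(8)
    for i, ((x, y), (xt, yt)) in enumerate(zip(pts_given, pts_model)):
        # One row per correspondence, unknowns ordered f11..f32.
        A[i] = [x * xt, x * yt, x, y * xt, y * yt, y, xt, yt]
    f = np.linalg.solve(A, b)
    return np.append(f, 1.0).reshape(3, 3)

def verification_score(F, feats_given, feats_model, eps=1e-2):
    """Normalized verification for one cross-section pair (Equations 8-9).

    feats_given: (N, 2) features of the given cylinder at t0.
    feats_model: (M, 2) features of the model cylinder at t~0.
    """
    N, M = len(feats_given), len(feats_model)
    P = np.hstack([feats_given, np.ones((N, 1))])     # N x 3, homogeneous rows
    Q = np.hstack([feats_model, np.ones((M, 1))]).T   # 3 x M, homogeneous columns
    residuals = np.abs(P @ F @ Q)                     # entry (i, j) = |p_i^T F q_j|
    n_zero = int((residuals < eps).sum())             # near-zero entries
    return min(n_zero, N, M) / float(min(N, M))
```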
Thus, given a pair of corresponding cross-sections on the two action cylinders, the viewpoint transformation can be recovered, and the cross-sections can be verified by projection using equation 9 above. Since we focused on a single pair of cross-sections here, this can only recognize the shape (of the object undergoing motion) part of the action. If the time correspondence $(t_i, \tilde{t}_i)$ for all cross-sections were known, however, this process could be repeated to verify the entire action cylinder, one cross-section at a time. The final verification is then simply the sum of the pair-wise verifications, and the action is said to be recognized if the following normalized verification score exceeds a chosen threshold $\tau$:

$$V = \frac{1}{T}\sum_{i=1}^{T} v(t_i, \tilde{t}_i) > \tau \qquad (10)$$

where $T$ is the number of corresponding cross-section pairs considered.

Alternately, when only the verified features of the model cylinder are retained and the action sequence is re-synthesized, we should expect to see similarity with the original reference or model action sequence. In the results section, we show that this is indeed the case for correct recognition.

Discussion

The above result is significant, since it implies that, even without full 3d reconstruction, we can recognize the action through a simple verification based on the epipolar line constraint. While accidental alignment of points is possible, such repeated alignment of projected points with present features throughout all cross-sections is unlikely for an incorrect pose. Also, unlike object recognition, where only the visible points in two views can be verified, points that disappear can re-appear at a later time and can still be recognized during the evolution of an action. Also, the approach taken here is more general than the one in [5], where the transformation recovered from two video sequences was a homography.

Reducing search in correspondence through tracking

Note that the corresponding points used in equation 5 need not come from a single pair of cross-sections. Equation 5 will hold for all corresponding points on corresponding cross-sections. This observation is significant because it implies that the search for 8 corresponding features can be reduced by tracking. In the extreme case, we can pick one starting feature pair from cross-sections $S(t_0)$ and $\tilde{S}(\tilde{t}_0)$ and derive the remaining 7 features by noting the position of the starting feature in successive corresponding cross-sections through tracking. Several other ways of picking corresponding features are also possible, such as 4 in one cross-section and 3 in the other, all 8 in one cross-section, etc. In general, if the tracked features are well separated, an accurate recovery of $F$ can be ensured. Thus, while tracking is not essential to our formulation for purposes of recognition, it can no doubt help in search reduction during recognition.

Effect of occlusions and noise

Occlusions are of two types in the case of action cylinder matching. First, the object undergoing motion may itself appear occluded (either due to actual occlusion or to object detection errors). Secondly, a partial execution of an action can also be thought of as an occlusion. The former manifests as a change in the shape of the individual cross-sections or, equivalently, a loss of features in those cross-sections, causing the individual verification scores to be affected. The latter causes a loss of cross-sections in the cylinder, affecting the combined verification score. In general, since we initiate the search for correspondence from the given action cylinder but initiate verification from the model cylinder, tolerance to occlusions is achieved. Large occlusions, however, can still be a problem.
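Building on the single-pair verification sketched above, a whole-cylinder check can accumulate the per-cross-section scores under the assumed one-to-one time correspondence and compare the result against the threshold tau, in the spirit of Equation 10. Averaging over the number of pairs is one assumed reading of the normalization; ActionCylinder and verification_score come from the earlier sketches.

```python
import numpy as np

def verify_action_cylinder(F, given_cyl, model_cyl, offset=0, tau=0.6, eps=1e-2):
    """Accumulate per-cross-section verifications over the whole cylinder.

    given_cyl, model_cyl: ActionCylinder objects from the earlier sketch.
    The time correspondence is assumed one-to-one up to the integer shift
    `offset`, i.e. cross-section t of the given cylinder is compared with
    cross-section t + offset of the model cylinder.
    """
    scores = []
    for cs in given_cyl.cross_sections:
        j = cs.t + offset
        if 0 <= j < len(model_cyl.cross_sections):
            model_cs = model_cyl.cross_sections[j]
            if cs.corners and model_cs.corners:
                scores.append(verification_score(
                    F,
                    np.asarray(cs.corners, dtype=float),
                    np.asarray(model_cs.corners, dtype=float),
                    eps=eps))
    # Normalize the summed pair-wise verifications by the number of pairs
    # (an assumed reading of Equation 10) and compare against tau.
    total = sum(scores) / len(scores) if scores else 0.0
    return total, total > tau
```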
3.2 Handling non-rigid and articulated object motion

Non-rigid and articulated object motions are characterized by separate frame-to-frame transformations. Since the viewpoint transformation is a function of the camera geometry and not the object motion, the above analysis carries through even when the object is non-rigid, provided a pair of corresponding cross-sections can be found. Also, since the matching and verification of the cross-sections is based on the local shape at corresponding time instants, the non-rigidity of the object or its motion does not affect the verification process.
4 The action recognition algorithm
Putting the above results together, the overall action recognition algorithm consists of a pre-processing step followed by matching.

Action cylinder generation and pre-processing

To generate an action cylinder, the moving object is first extracted from the video. Since we assume a single moving object, simple background subtraction using a frame difference with a background frame was found to be sufficient. We have also experimented with other ways of segmenting objects, including color-based segmentation and simple frame-by-frame differencing. In general, we expect some segmentation errors, so that a portion of the background may be included or a portion of the object may be missing in the extracted bit maps. Some of the noisy portions of segmented objects can be merged through dilation. Since the verification is performed model-to-image, cylinder-wise, the spurious features do not affect the final verification score as much. An edge detector is then applied over each frame image and the edges in the object-containing regions are retained. Curves are assembled from the edges, and corner features are extracted from the curves at places where there is a sharp change in curvature. Each action is now represented by the curves and corners in each cross-section at successive time instants, to serve as the action cylinder. We use the curves mainly to pick features for correspondence in an orderly manner, and also to statically display an action. This processing is repeated for all model actions and the resulting model cylinders are stored in a library.

In addition to cylinder creation, we also extract salient features during the pre-processing stage to facilitate the search for corresponding features. The saliency is based on three factors: 1) the sharpness of the included angle, 2) the length of the incoming lines, and 3) proximity to other features. The salient features are then tracked using a conventional optical flow algorithm. For each feature tracked, we note the cross-sections in which the tracked feature was also verified by an actually detected corner feature at the indicated location. This gives rise to a set of tracked salient features $F = \{F_i(t_j)\}$, where $F_i(t_j)$ is the $i$th tracked feature in the $t_j$th (time instant) cross-section. These features are stored as part of the model cylinder descriptions in the library.
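A minimal sketch of the cylinder-generation step just described, using OpenCV. Absolute differencing against a background frame, dilation, and a Shi-Tomasi corner detector stand in for the paper's background subtraction and curvature-based corner extraction (curve assembly from edges is omitted here); the thresholds and the function name build_action_cylinder are illustrative assumptions, and the ActionCylinder container is the one sketched in Section 2.

```python
import cv2
import numpy as np

def build_action_cylinder(frames, background, label="action"):
    """Extract per-frame corner features to form an action cylinder.

    frames: list of grayscale images of the isolated action.
    background: grayscale background frame used for subtraction.
    """
    cylinder = ActionCylinder(label)          # container from the Section 2 sketch
    kernel = np.ones((5, 5), np.uint8)
    for t, frame in enumerate(frames):
        # Background subtraction by frame differencing, then dilation to
        # merge noisy fragments of the segmented object.
        diff = cv2.absdiff(frame, background)
        _, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
        mask = cv2.dilate(mask, kernel, iterations=2)

        # Corner-like features restricted to the object region (a stand-in
        # for curvature extrema along curves assembled from edges).
        corners = cv2.goodFeaturesToTrack(frame, maxCorners=200,
                                          qualityLevel=0.01, minDistance=5,
                                          mask=mask)
        pts = [] if corners is None else [tuple(c.ravel()) for c in corners]
        cylinder.add_frame(t, pts)
    return cylinder
```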
Matching algorithm:

Given a new action, we generate its action cylinder and the set of tracked salient features as described above. We then match it against each model cylinder in the library using the following algorithm (a code sketch follows the complexity analysis below).

1. Pick a starting time instant $t_0$ on the given action cylinder. We usually choose a cross-section that has 8 or more salient features. To achieve tolerance to action segmentation errors, though, different starting time instants may have to be tried.

2. Pick a corresponding time $\tilde{t}_0$ on the model cylinder. For the first iteration $\tilde{t}_0 = 1$.

3. Define the correspondence for all cross-sections of the two cylinders. Since we are not considering velocity and acceleration changes here, we assume the time correspondence to be one-to-one up to a translation, i.e., we have $t - t_0 = \tilde{t} - \tilde{t}_0$.

4. Pick three salient features $F_1(t_0), F_2(t_0), F_3(t_0)$ in cross-section $S(t_0)$ on the given action cylinder. For each of these features, pick two other features at fixed (but possibly different) offsets in the tracked feature set. We retain 8 of these features as $F_1(t_0), F_1(t_2), F_1(t_3), F_2(t_0), F_2(t_4), F_2(t_5), F_3(t_0), F_3(t_6)$.

5. Pick a set of 3 matching candidate features $\tilde{F}_1(\tilde{t}_0), \tilde{F}_2(\tilde{t}_0), \tilde{F}_3(\tilde{t}_0)$ from cross-section $\tilde{S}(\tilde{t}_0)$ on the model cylinder. Pick the remaining 5 features from their tracked versions at corresponding times as $\tilde{F}_1(\tilde{t}_2), \tilde{F}_1(\tilde{t}_3), \tilde{F}_2(\tilde{t}_4), \tilde{F}_2(\tilde{t}_5), \tilde{F}_3(\tilde{t}_6)$. Since we are not considering velocity and acceleration changes here, we have $\tilde{t}_i - \tilde{t}_0 = t_i - t_0$.

6. Use the corresponding features to solve for the fundamental matrix using Equation 5.

7. Verify the correctness of the solution using Equation 10. If the verification score is above the acceptable threshold, declare the action recognized. If not, repeat steps 4 through 6 until all valid feature pairings have been tried.

8. If these fail, repeat steps 2 through 7 with another corresponding time instant on the model cylinder. Declare mis-recognition if all else fails.

Complexity Analysis

To analyze the complexity of the above action recognition algorithm, let $m_i, i = 1, \ldots, M$ be the number of features in cross-section $i$ of the model cylinder (of $M$ cross-sections), and let $n_j, j = 1, \ldots, N$ be the number of features in cross-section $j$ of the image cylinder. The search for matching features within a pair of cross-sections takes $O(m_i^3 n_j^3)$, since the remaining 5 features are chosen from fixed cross-sections via the tracked salient features. The overall search, therefore, is $O\big(n_{t_0}^3 \sum_{i=1}^{M} m_i^3\big)$ for a fixed choice of the starting cross-section $t_0$ in the given action cylinder.
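The eight-step procedure above can be summarized as a compact, deliberately brute-force loop. Steps 4 and 5 are simplified here to trying 8-point subsets drawn from a single pair of cross-sections rather than from tracked offsets, so this is a sketch of the control flow under stated assumptions, not the authors' exact search; fundamental_from_8_points and verify_action_cylinder are the helpers sketched earlier, and tau is an assumed threshold.

```python
import numpy as np
from itertools import combinations

def match_action(given_cyl, model_cyl, tau=0.6):
    """A brute-force rendering of steps 1-8 of the matching algorithm."""
    # Step 1: starting cross-sections of the given cylinder with >= 8 features.
    for cs in given_cyl.cross_sections:
        if len(cs.corners) < 8:
            continue
        # Step 2: candidate corresponding cross-sections on the model cylinder.
        for model_cs in model_cyl.cross_sections:
            if len(model_cs.corners) < 8:
                continue
            offset = model_cs.t - cs.t        # step 3: one-to-one up to a shift
            # Steps 4-5 (simplified): try 8-point subsets from this pair of
            # cross-sections instead of the tracked-offset selection.
            given_pts = np.asarray(cs.corners[:8], dtype=float)
            for idx in combinations(range(len(model_cs.corners)), 8):
                model_pts = np.asarray([model_cs.corners[i] for i in idx],
                                       dtype=float)
                try:
                    F = fundamental_from_8_points(given_pts, model_pts)   # step 6
                except np.linalg.LinAlgError:
                    continue
                score, accepted = verify_action_cylinder(                 # step 7
                    F, given_cyl, model_cyl, offset=offset, tau=tau)
                if accepted:
                    return True, score
    return False, 0.0                                                     # step 8
```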
5 Results

To test the recognition of actions, we experimented with several 3d objects, ranging from synthetic rigid objects under known viewpoints, to articulated objects (hands) executing dance-like motions, to completely non-rigid objects such as flower stalks and clothes moving in the wind. There are currently over 60 objects in our collection. Each object was made to exhibit between 5-20 actions. For each action type, different instances were generated by observing the action at different times and under different viewpoints, for a total of 600 actions in the database. Since object and action segmentation were not the focus of this paper, we arranged for the experimental setup in most cases to have a white or dark background. The cameras were always fixed, and the object was made to undergo the action. In all these experiments, the illumination setup for the different instances was more or less the same. However, illumination variations could still be seen in the different videos of the same action due to camera and viewpoint differences.

Examples

We now illustrate the results of action recognition through a few examples. Some of these were based on an earlier implementation of the matching algorithm in which we experimented with several valid feature selection schemes. First, consider a rigid object rotating around the y-axis whose action is seen from two different but unknown viewpoints. Figures 2a and 2b show their respective action cylinders. Note that due to object segmentation errors, these cylinders have a portion of the background mixed in. Even with imprecision in feature detection, and loss of features due to segmentation, since an action has a large number of features spanning multiple time instants, it was always possible to find features for correspondence as well as pose verification. For this example, we used an earlier version of the algorithm that chose the set of eight corresponding points on the model and image action cylinders from a single cross-section at time instant 5. The positions of these points that gave rise to a valid fundamental matrix, verified by the model action cylinder features, are depicted by circles in Figures 2c and 2d. For the fundamental matrix computed through this correspondence, nearly 12,000 of the 15,000 features from 60 cross-sections of the action sequence (the first 2 seconds of the sequence was used) were verified, as seen from Figures 2e and 2f. Figure 2e shows the original model features on the model action cylinder, while Figure 2f shows the verified model features retained. By animating the verified features in their respective frames, we were able to reconstruct an action sequence that resembled the model action sequence.

Next, we illustrate the performance of action recognition in the case of non-rigid objects undergoing non-rigid motion. A sample dance sequence illustrating this is shown in video dance-view.mpg. A sequence in the database that resembles this action but seen from a different viewpoint is shown in dance-model.mpg. Their respective action cylinders are shown in Figures 3a and 3b. The corresponding features that aligned the two sequences were found from frames 15 and 36, as shown in Figures 3c-d and 3e-f. In this case, we used a version of the recognition algorithm where the corresponding features were derived from tracking salient features in at most two frames. The original features on the model cylinder are shown in Figure 3g projected into a single image. Notice that a large amount of background is captured in this representation due to segmentation errors. We expect these features to find matches during the verification stage based on the epipolar lines. This can also be seen in the verified features shown projected as a single image in Figure 3h. Even so, the action was recovered, as could be seen by playing the reconstructed and original feature-based action sequences.

Action recognition performance

In the next set of experiments, we tested the accuracy of action recognition against the database of model actions. Specifically, for each of the 600 actions in the database, we noted the number of actions that were perceptually similar to the given action under changes in viewpoint. We then used each of these actions as a test action, and noted the matching actions returned in the top n. The value of n varied based on the action. That is, for actions which were known to have n similar actions in the database, the top n matches were retained. If all the n matches returned were correct, then this was considered 100% recall. The number of mismatches in the top n matches was taken as an indication of precision. The results are indicated in Table 1 for some representative objects and their actions tested. As can be seen, the action recognition algorithm possesses good recall for the actions tested.

Object   Action type   # viewpoint difference   # correct   Recall
toy      rigid         8                        8           100%
hand     articulated   7                        6           85%
flower   non-rigid     5                        3           60%
pencil   rigid         13                       11          84.6%
hand     articulated   4                        4           100%
hand     rigid         7                        6           85%
bottle   rigid         5                        4           80%
toy      non-rigid     10                       9           90%

Table 1: Illustration of the effectiveness of action recognition on the action database. For each model action, it indicates the number of similar actions under changes in viewpoint, the number correctly recognized, and the resulting recall.
Errors in recognition
The recognition errors can be accounted for by four factors: (a) tracking errors, (b) time correspondence errors, (c) lack of common corresponding points under changes in viewpoint, and finally, (d) object segmentation errors. Since we have used a sparse feature-based representation of the action, tracking can sometimes cause problems where missing features occur in successive frames, hindering the generation of corresponding points. In such cases, however, we predict the positions of features using the motion computed in previous frames. Next, the lack of common corresponding points between the two cylinders is an obvious limitation. The prediction in tracking can sometimes account for these missing or occluded features. Fortunately, there are plenty of cross-sections to try for an action sequence. Finally, object segmentation errors can cause a wrong action to be verified when the verification picks up more features from the background than from the object region in the epipolar-lines-based test. Often, though, we have observed the spurious features to affect the verification score without changing the result on the best matching action. A case where this happens
is shown in Figure 3, where the background features are also matched in the verification step. Better object segmentation methods, such as color-based object selection, can alleviate this problem.

Conclusions

In this paper we have analyzed the effect of viewpoint changes on action representation and recognition. The action cylinder representation was shown to be computationally simple as well as versatile in handling actions of rigid and non-rigid objects.

References

[1] M. Black and A. Jepson. Eigentracking: Robust matching and tracking of articulated objects using a view-based representation. In Proceedings of the European Conference on Computer Vision, pages 63-84, 1998.

[2] M. Black and Y. Yacoob. Tracking and recognizing rigid and non-rigid facial motions using local parametric models of image motion. In Proceedings of the International Conference on Computer Vision, pages 374-381, June 1995.

[3] B. Boufama and R. Mohr. Epipole and fundamental matrix estimation using virtual parallax. In Proceedings of the International Conference on Computer Vision, pages 1030-1036, 1995.

[4] C. Bregler, A. Hertzmann, and H. Biermann. Recovering non-rigid 3d shape from image streams. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition, pages 13-15, 2000.

[5] Y. Caspi and M. Irani. A step towards sequence-to-sequence alignment. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition, pages 682-689, 2000.

[6] T. Darrell and A. Pentland. Space-time gestures. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition, pages 335-340, 1993.

[7] J. Davis and A. Bobick. The representation and recognition of action using temporal templates. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition, pages 928-934, 1997.

[8] I. Haritaoglu, D. Harwood, and L. Davis. W4: Real-time surveillance of people and their activities. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):809-830, 2000.

[9] R. I. Hartley. In defense of the eight-point algorithm. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 580-593, 1997.

[10] S. Ju, M. Black, and Y. Yacoob. Cardboard people: A parameterized model of articulated image motion. In Proc. IEEE Int. Conf. on Automatic Face and Gesture Recognition, pages 38-44, 1996.

[11] H. C. Longuet-Higgins. A computer algorithm for reconstructing a scene from two projections. Nature, 293:133-135, 1981.

[12] Q.-T. Luong and O. Faugeras. The fundamental matrix: theory, algorithms, and stability analysis. International Journal of Computer Vision, pages 43-76, 1996.

[13] A. Nishikawa, A. Ohnishi, and F. Miyazaki. Description and recognition of human gestures based on the transition of curvature from motion images. In Proc. IEEE Int. Conf. on Automatic Face and Gesture Recognition, pages 552-557, 1998.

[14] S. Niyogi and E. H. Adelson. Analyzing and recognizing walking figures in XYT. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition, pages 469-474, 1994.

[15] R. Polana and R. C. Nelson. Detecting activities. Journal of Visual Communication and Image Representation, 5:172-180, 1994.

[16] R. Tsai and T. Huang. Uniqueness and estimation of three-dimensional motion parameters of rigid objects with curved surfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 13-27, 1984.

[17] M. Yang and N. Ahuja. Extracting gestural motion trajectories. In Proc. IEEE Int. Conf. on Automatic Face and Gesture Recognition, pages 10-15, 1998.
Figure 2: Illustration of action recognition under changes in viewpoint (rigid object case). (a) Given action cylinder of a rotating toy. (b) Model cylinder of the rotating toy seen from a different viewpoint. (c)-(d) Matching salient features on the two cylinders. (e) Features on the model action cylinder projected into the image plane. (f) Features on the model action cylinder that found a match to features on the given cylinder in successful verification using the corresponding features of (c)-(d).

Figure 3: Illustration of action recognition under changes in viewpoint (case of articulated/non-rigid motion). (a) Given action cylinder of a dancing hand. (b) Model cylinder of the dancing action seen from a different viewpoint. (c)-(d) Tracked salient features from two cross-sections on the given cylinder. (e)-(f) Tracked salient features on the model cylinder that found a match. (g) Features on the model action cylinder projected into the image plane. (h) Features on the model action cylinder that found a match to features on the given cylinder in successful verification using the corresponding features of (c)-(f).