Proceedings of 14th International Conference on Computer and Information Technology (ICCIT 2011) 22-24 December, 2011, Dhaka, Bangladesh

An Optical Flow Based Approach for Action Recognition

Upal Mahbub*, Hafiz Imtiaz* and Md. Atiqur Rahman Ahad†, Member, IEEE
*Bangladesh University of Engineering and Technology, Dhaka-1000, Bangladesh
E-mail: [email protected]; [email protected]
†Kyushu Institute of Technology, Kitakyushu, Japan
E-mail: [email protected]

Abstract-A new approach for motion-based representation on the basis of optical flow analysis and the random sample consensus (RANSAC) method is proposed in this paper. Optical flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer (an eye or a camera) and the scene. It is intuitive that an action can be characterized by the frequent movement of the optical flow points, or interest points, at different areas of the human figure. Additionally, RANSAC, an iterative method to estimate the parameters of a mathematical model from a set of observed data that contains inliers and outliers, can be used to filter out unwanted interest points scattered around the scene and keep only those related to the particular human's motion. In this manner, the area of the human body within the frame is estimated, and this rectangular area is segmented into a number of smaller regions or blocks. The percentage of change of interest points in each block from frame to frame is then recorded. A similar procedure is repeated for different persons performing the same action, and the corresponding values are averaged for the respective blocks. A matrix constructed by this strategy is used as a feature vector for that particular action. Afterwards, for the purpose of recognition using the extracted feature vectors, a distance-based similarity measure and a support vector machine (SVM)-based classification technique have been exploited. From extensive experimentation upon a standard motion database, it is found that the proposed method offers not only a very high degree of accuracy but also computational savings.

Index Terms-Motion-based representation, action recognition, optical flow, RANSAC, SVM.

I. INTRODUCTION

Recognizing the identity of individuals, as well as the actions, activities and behaviors performed by one or more persons in video sequences, is very important for various applications. Surveillance, robotics, rehabilitation, video indexing, biomechanics, medicine, sports analysis, film, games, mixed reality, etc. are among the key application arenas of human motion recognition [1]. The importance of human motion classification is evident from the increasing requirement for machines to interact intelligently and effortlessly with a human-inhabited environment. However, most of the information extracted by machines from human movement has come from static events, such as key presses. In order to improve machine capabilities in real time, it is desirable to represent motion. However, due to various limitations and constraints, no single approach seems to be sufficient for wider applications in action understanding and recognition.

Present methods can be classified into: view/appearance-based, model-based, space-time volume-based, or direct motion-based [2]. Template-matching approaches [3] are simpler and faster algorithms for motion analysis or recognition that can represent an entire video sequence in a single image format. Recently, approaches related to Spatio-Temporal Interest Points (STIP) have become prominent for action representation [4]. In this paper, a novel action clustering-based human action recognition algorithm is presented, which employs optical flow and RANSAC for determining the apparent motion of a human. In order to detect the presence and direction of motion, optical flow is employed. RANSAC is used for further localization and identification of the most prominent motions within the frame. From the density of optical flow interest points, the probable position of the person along the horizontal direction in the frame is determined. Further localization is then done based on the mean and standard deviation of the positions of the interest points, both horizontally and vertically. A small rectangular area is thus obtained within which the person performs his/her actions. This area is divided into a number of small blocks, and the percentage of change in the number of interest points within each block is calculated frame by frame. All the matrices formed this way from similar actions are averaged and used as a feature for that respective action. Finally, simple classifiers are utilized for the classification task.
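As a concrete illustration of the classification stage mentioned above, the following sketch compares a query feature matrix against stored action templates by Euclidean distance and, alternatively, trains an SVM on flattened feature matrices. It assumes scikit-learn and NumPy are available and uses random 4 x 4 matrices as stand-ins for the real block-change features; it is a minimal sketch under those assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.svm import SVC  # assumes scikit-learn is installed

def nearest_template(feature, templates):
    """Match a query feature matrix to the closest stored action template
    by Euclidean distance; `templates` maps action name -> n x n matrix."""
    flat = feature.ravel()
    return min(templates, key=lambda a: np.linalg.norm(flat - templates[a].ravel()))

def train_svm(train_features, train_labels):
    """Train a multi-class SVM on flattened n x n block-change matrices."""
    X = np.stack([f.ravel() for f in train_features])
    clf = SVC(kernel="linear")  # kernel choice is an assumption, not from the paper
    clf.fit(X, train_labels)
    return clf

# Toy usage with random 4 x 4 feature matrices (placeholders for real data).
rng = np.random.default_rng(0)
templates = {"walk": rng.random((4, 4)), "run": rng.random((4, 4))}
query = templates["run"] + 0.05 * rng.random((4, 4))
print(nearest_template(query, templates))      # expected: "run"

clf = train_svm([templates["walk"], templates["run"]], ["walk", "run"])
print(clf.predict(query.ravel()[None, :]))     # expected: ["run"]
```

A linear kernel is used here purely for simplicity; the paper does not specify the kernel or training setup.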

II. RELATED WORKS

Human action recognition from video sequences has been a major field of research in recent years for different real-life applications. The motion history image (MHI) method was initially used to recognize various actions [3]. This method was later used for the recognition of human movements and for moving-object tracking by various groups (e.g., [5], [6]). Bradski and Davis [6] and Davis [7] improved the MHI in various ways for recognizing various gestures. Various interactive systems have been successfully constructed using the motion history template as a primary sensing mechanism. Nguyen et al. [8] introduced the concept of a motion swarm, a swarm of particles that moves in response to the field representing a motion history image. They created interactive art that can be enjoyed by groups, such as audiences at


Fig. 1. The main components of the feature extraction and the human action recognition system. [Figure: feature extraction (locating the position of the person in the frame; body-area localization from statistical properties; percentage change of interest points in each segment, frame by frame) produces a template feature space; matching and classification are performed by a. Euclidean distance and b. support vector machine (SVM), yielding the result, e.g., "Person: Running".]

public events. Another interactive art demonstration has been constructed from the motion templates in [9]. On the other hand, some researchers in the computer vision community have used bag-of-words models for various recognition problems. Fei-Fei and Perona [10] use a variant of LDA for natural scene categorization, while Sivic et al. [11] use pLSI for unsupervised object class recognition and segmentation. Optical flow based human action detection has also been investigated, mainly because of the simplicity of optical flow-based algorithms [12]. Especially for real-time surveillance scenes [13][14], optical flow-based algorithms have proved to be fruitful. Various other approaches have been extensively applied to action recognition and understanding, and some of these, as above, are employed in computer vision-based computer games and interactive systems, which are real-time or pseudo-real-time and can significantly reduce the system cost through simple approaches. This paper addresses a novel action representation technique based on optical flow, RANSAC and simple statistical evaluation, and thereby smartly recognizes various actions.



III. PROPOSED METHOD

A. Feature Extraction and Training

In general, any recognition technique consists of two major parts, training and testing. The training phase can be divided into two sub-sections, feature extraction and learning or feature vector formation. Feature extraction is the most crucial part of any recognition system, as it directly dictates the overall accuracy. Figure 1 shows the overall flow diagram of the proposed system, which consists of feature extraction, learning and recognition phases. The objective of the proposed method is to extract the variations that are present in different human actions by developing a successful measure to follow the movement of different body parts in different directions during an action. During an action, not all the body parts move significantly. Any significant movement anywhere within a frame can be detected by optical flow analysis. Optical flow or optic flow is the pattern of apparent motion of objects, surfaces, and edges in a visual scene caused by the relative motion between an observer (an eye or a camera) and the scene. In computer vision, optical flow means tracking specific features (points) in an image across multiple frames. It is widely used in computer vision to find moving objects from one frame to another, to calculate the speed and direction of motion, and to determine the structure of the environment. In principle, optical flow techniques can be developed using four methods, namely, phase correlation methods, block-based methods, differential methods and discrete optimization methods [15]. Among these, there is a widely used differential method for optical flow estimation called the Lucas-Kanade (L. K.) method, developed by Bruce D. Lucas and Takeo Kanade [16][17], which stands out for its simplicity and lack of assumptions about the underlying image. It assumes that the flow is essentially constant in a local neighborhood of the point p under consideration. Thus, the optical flow equation can be assumed to hold for all pixels within a window centered at p. Namely, the local image flow (velocity) vector $(V_x, V_y)$ must satisfy

$$\begin{aligned}
I_x(s_1)V_x + I_y(s_1)V_y &= -I_t(s_1) \\
I_x(s_2)V_x + I_y(s_2)V_y &= -I_t(s_2) \\
&\;\;\vdots \\
I_x(s_d)V_x + I_y(s_d)V_y &= -I_t(s_d)
\end{aligned} \qquad (1)$$

where $s_1, s_2, \ldots, s_d$ are the pixels inside the window, and $I_x(s_i)$, $I_y(s_i)$, $I_t(s_i)$ are the partial derivatives of the image $I$ with respect to position $x$, $y$ and time $t$, evaluated at the point $s_i$ and at the current time.

These equations can be written in matrix form $Av = b$, where

$$A = \begin{bmatrix} I_x(s_1) & I_y(s_1) \\ I_x(s_2) & I_y(s_2) \\ \vdots & \vdots \\ I_x(s_d) & I_y(s_d) \end{bmatrix}, \qquad v = \begin{bmatrix} V_x \\ V_y \end{bmatrix}, \qquad (2)$$

and

$$b = \begin{bmatrix} -I_t(s_1) \\ -I_t(s_2) \\ \vdots \\ -I_t(s_d) \end{bmatrix}. \qquad (3)$$
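To make this construction concrete, the following sketch, which is an illustration rather than the authors' code, stacks the per-pixel constraints of a window into $A$ and $b$ as in eqs. (2)-(3) and solves the over-determined system in the least-squares sense, anticipating eqs. (4)-(5) below; the 5 x 5 window of synthetic gradients is an arbitrary example.

```python
import numpy as np

def lk_window_flow(Ix, Iy, It):
    """Solve the over-determined system A v = b (eqs. (1)-(3)) for one window
    by least squares, i.e. v = (A^T A)^{-1} A^T b as in eqs. (4)-(5).
    Ix, Iy, It are the spatial and temporal derivatives inside the window."""
    A = np.column_stack([Ix.ravel(), Iy.ravel()])   # d x 2 matrix of eq. (2)
    b = -It.ravel()                                 # d-vector of eq. (3)
    v, *_ = np.linalg.lstsq(A, b, rcond=None)       # [V_x, V_y]
    return v

# Synthetic example: for a true flow (1.0, 0.5) the brightness-constancy
# linearization gives I_t = -(I_x * V_x + I_y * V_y), so the solver recovers it.
rng = np.random.default_rng(1)
Ix = rng.standard_normal((5, 5))
Iy = rng.standard_normal((5, 5))
true_v = np.array([1.0, 0.5])
It = -(Ix * true_v[0] + Iy * true_v[1])
print(lk_window_flow(Ix, Iy, It))   # approximately [1.0, 0.5]
```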

TABLE I
THE RANSAC ALGORITHM

1: Select randomly the minimum number of points required to determine the model parameters.
2: Solve for the parameters of the model.
3: Determine how many points from the set of all points fit with a predefined tolerance ε.
4: If the fraction of the number of inliers over the total number of points in the set exceeds a predefined threshold τ, re-estimate the model parameters using all the identified inliers and terminate.
5: Otherwise, repeat steps 1 through 4 (maximum of N times).
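The loop in Table I can be sketched generically as follows; the model (a 2-D line), the tolerance ε, the threshold τ and the iteration count are illustrative assumptions, since the paper does not fix them here.

```python
import numpy as np

def ransac_line(points, n_iter=100, eps=1.0, tau=0.6, rng=None):
    """Generic RANSAC for a line y = m*x + c, following the steps of Table I."""
    rng = rng or np.random.default_rng(0)
    best_params, best_inliers = None, np.zeros(len(points), dtype=bool)
    for _ in range(n_iter):                                    # at most N iterations
        i, j = rng.choice(len(points), size=2, replace=False)  # 1: minimal sample
        (x1, y1), (x2, y2) = points[i], points[j]
        if x1 == x2:
            continue
        m = (y2 - y1) / (x2 - x1)                              # 2: solve model params
        c = y1 - m * x1
        resid = np.abs(points[:, 1] - (m * points[:, 0] + c))
        inliers = resid < eps                                  # 3: points within tolerance eps
        if inliers.sum() > best_inliers.sum():
            best_params, best_inliers = (m, c), inliers
        if inliers.mean() > tau:                               # 4: enough inliers: refit, stop
            m, c = np.polyfit(points[inliers, 0], points[inliers, 1], 1)
            return (m, c), inliers
    return best_params, best_inliers                           # 5: best model after N tries

# Toy usage: noisy points on y = 2x + 1 plus a few gross outliers.
rng = np.random.default_rng(2)
x = np.linspace(0, 10, 50)
pts = np.column_stack([x, 2 * x + 1 + 0.1 * rng.standard_normal(50)])
pts[:5, 1] += 30                                               # outliers
print(ransac_line(pts)[0])                                     # close to (2.0, 1.0)
```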

This system has more equations than unknowns and thus it is usually over-determined. The L. K. method obtains a compromise solution by the least squares principle. Namely, it solves the 2 × 2 system

$$A^T A\, v = A^T b, \qquad (4)$$

where $A^T$ is the transpose of matrix $A$. That is, it computes

$$\begin{bmatrix} V_x \\ V_y \end{bmatrix} = \begin{bmatrix} \sum_i I_x(s_i)^2 & \sum_i I_x(s_i) I_y(s_i) \\ \sum_i I_x(s_i) I_y(s_i) & \sum_i I_y(s_i)^2 \end{bmatrix}^{-1} \begin{bmatrix} -\sum_i I_x(s_i) I_t(s_i) \\ -\sum_i I_y(s_i) I_t(s_i) \end{bmatrix} \qquad (5)$$

with the sums running from $i = 1$ to $d$. Thus, by combining information from several nearby pixels, the L. K. method can often resolve the inherent ambiguity of the optical flow equation. It is also less sensitive to image noise than point-wise methods [15]. However, the optical flow method is prone to the slightest background or camera movement. So, the RAndom SAmple Consensus (RANSAC) algorithm is next applied for further purification of the motion detection. A basic assumption of RANSAC is that the data consist of inliers, i.e., data whose distribution can be explained by some set of model parameters, and outliers, which are data that do not fit the model. In addition to this, the data can be subject to noise. The outliers can come, e.g., from extreme values of the noise or from erroneous measurements or incorrect hypotheses about the interpretation of the data. RANSAC also assumes that, given a (usually small) set of inliers, there exists a procedure which can estimate the parameters of a model that optimally explains or fits these data. The RANSAC algorithm possesses a robust capacity to remove outliers. As pointed out by Fischler and Bolles [18], unlike conventional sampling techniques that use as much of the data as possible to obtain an initial solution and then proceed to prune outliers, RANSAC uses the smallest set possible and proceeds to enlarge this set with consistent data points. The basic algorithm of RANSAC is summarized in Table I. The number of iterations, $N$, is chosen high enough to ensure, with probability $P_{in}$ (usually set to 0.99), that at least one of the sets of random samples does not include an outlier. Let $u$ represent the probability that any selected data point is an inlier and $v = 1 - u$ the probability of observing an outlier. $N$ iterations of the minimum number of points, denoted $m$, are required, where

$$1 - P_{in} = \left(1 - u^{m}\right)^{N}, \qquad (6)$$

and thus, with some manipulation,

$$N = \frac{\log\left(1 - P_{in}\right)}{\log\left(1 - (1 - v)^{m}\right)}. \qquad (7)$$
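As a rough, hedged illustration of how such interest points might be obtained in practice, the sketch below tracks corner features between consecutive grayscale frames with OpenCV's pyramidal Lucas-Kanade routine, keeps the points that actually moved (assuming a static camera), and then fits an affine motion model with RANSAC to discard stray detections that are inconsistent with the dominant motion; the thresholds and the choice of cv2.estimateAffine2D as the RANSAC model are assumptions, not the authors' exact procedure.

```python
import cv2
import numpy as np

def person_interest_points(prev_gray, gray, min_disp=1.0):
    """Track corners with pyramidal Lucas-Kanade flow and keep moving points
    that are consistent with the dominant motion according to RANSAC.
    One plausible reading of the optical-flow + RANSAC step, not the
    authors' exact procedure."""
    p0 = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                                 qualityLevel=0.01, minDistance=5)
    if p0 is None:
        return np.empty((0, 2))
    p1, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, gray, p0, None)
    ok = status.ravel() == 1
    p0, p1 = p0.reshape(-1, 2)[ok], p1.reshape(-1, 2)[ok]
    # Assume a static camera: points with noticeable displacement belong to
    # the moving person (or to spurious detections).
    moving = np.linalg.norm(p1 - p0, axis=1) > min_disp
    if moving.sum() < 3:
        return np.empty((0, 2))
    # RANSAC-fit an affine motion model to the moving correspondences and
    # keep only the inliers, discarding stray points scattered in the scene.
    _, inliers = cv2.estimateAffine2D(p0[moving], p1[moving],
                                      method=cv2.RANSAC,
                                      ransacReprojThreshold=3.0)
    if inliers is None:
        return p1[moving]
    return p1[moving][inliers.ravel() == 1]
```

Applied to consecutive grayscale frames (e.g., obtained with cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)), this yields the filtered point set on which the localization described next operates.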

This way, some interest points are obtained, which seem to be tagged with the moving object and move towards whatever direction the object moves. In this case, the objects are nothing but different body parts of a human; for example, during a running action, the legs and the hands move significantly, so most of the interest points, after performing RANSAC, are gathered around the hand and leg areas of the human body. Also, it is intuitive that all the interest points gather around the whole body, which provides the scope to detect the position of the human subject in the scene by calculating the density of the horizontal-axis values of the points. To do that, a window of a fixed length is taken along the X-axis (preferably a window wider than the width of the human body in the scene), and the number of interest points inside the window is calculated. Then the window is shifted by a pixel and the same calculation is repeated. This process, if done for the entire X-axis, will return the position of the window with the maximum number of interest points, which obviously is the position of the human subject. Next, the mean ($\mu_y$) and standard deviation ($\sigma_y$) of the Y-axis values of all the points in this window are calculated. Thus, we now know the vertical distribution of the interest points along the human body. The Y-axis is divided into $n$ segments starting from $\mu_y - \sigma_y - \delta_y$ to $\mu_y + \sigma_y + \delta_y$. The value $\delta_y$ is an adjustment constant chosen empirically based on the height of the frame and the probable height of the human subject. Within the window along the X-axis and the limit imposed along the Y-axis, the mean ($\mu_x$) and standard deviation ($\sigma_x$) of the X-axis values of all the points are calculated. Then the X-axis is also divided into $n$ segments starting from $\mu_x - \sigma_x - \delta_x$ to $\mu_x + \sigma_x + \delta_x$. The value $\delta_x$ is another adjustment constant chosen empirically based on the width of the frame and the probable width of the human subject. Thus, the human subject is now encapsulated within a rectangular area which is divided into $n \times n$ smaller blocks or segments, as sketched below.
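The localization and segmentation just described can be sketched as follows; the window width, the number of segments n, and the adjustment constants δx and δy are illustrative values chosen here arbitrarily, and the function simply returns the per-block interest-point counts from which the frame-to-frame changes are computed next. It is an illustrative sketch, not the authors' code.

```python
import numpy as np

def localize_and_count(points, frame_w, win_w=60, n=4, dx=10.0, dy=15.0):
    """Locate the person from interest-point density, bound the body using
    mean +/- standard deviation (plus adjustments dx, dy), and count the
    points falling in each of the n x n blocks. All parameter values here
    are illustrative assumptions."""
    if len(points) == 0:
        return np.zeros((n, n))
    xs = points[:, 0]
    # Slide a fixed-width window along the X-axis one pixel at a time and
    # keep the position containing the most interest points.
    starts = np.arange(0, max(1, int(frame_w) - win_w))
    counts = [np.sum((xs >= s) & (xs < s + win_w)) for s in starts]
    x0 = starts[int(np.argmax(counts))]
    inside = points[(xs >= x0) & (xs < x0 + win_w)]
    # Vertical bounds from the mean and standard deviation of the Y values.
    my, sy = inside[:, 1].mean(), inside[:, 1].std()
    y_lo, y_hi = my - sy - dy, my + sy + dy
    band = inside[(inside[:, 1] >= y_lo) & (inside[:, 1] <= y_hi)]
    # Horizontal bounds from the mean and standard deviation of the X values.
    mx, sx = band[:, 0].mean(), band[:, 0].std()
    x_lo, x_hi = mx - sx - dx, mx + sx + dx
    # Count the interest points in each of the n x n blocks of the rectangle.
    blocks, _, _ = np.histogram2d(band[:, 0], band[:, 1], bins=n,
                                  range=[[x_lo, x_hi], [y_lo, y_hi]])
    return blocks
```

Calling this on the RANSAC-filtered points of each frame yields the per-frame block counts from which the percentage changes described next can be computed.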

Fig. 2 shows the original image sequence, the optical flow output, the RANSAC output and, finally, the output of the localization and segmentation operation performed on the action sequences. The procedure is repeated in each frame. If the number of interest points within block $k$ at the $i$-th frame is $p_k^i$, then the change in the number of interest points in each block is calculated frame by frame using the following equation,