
Machine Vision and Applications DOI 10.1007/s00138-010-0298-4

ORIGINAL PAPER

Motion history image: its variants and applications Md. Atiqur Rahman Ahad · J. K. Tan · H. Kim · S. Ishikawa

Received: 4 January 2010 / Revised: 21 May 2010 / Accepted: 10 September 2010 © Springer-Verlag 2010

Abstract The motion history image (MHI) approach is a view-based temporal template method which is simple but robust in representing movements and is widely employed by various research groups for action recognition, motion analysis and other related applications. In this paper, we provide an overview of MHI-based human motion recognition techniques and applications. Since the inception of the MHI template for motion representation, various approaches have been adopted to improve this basic MHI technique. We present all important variants of the MHI method. This paper also points out some areas for further research based on the MHI method and its variants. Keywords MHI · MEI · Motion recognition · Action analysis · Computer vision

Md. A. R. Ahad (B) · J. K. Tan · H. Kim · S. Ishikawa
Faculty of Engineering, Kyushu Institute of Technology, 1-1, Sensui-cho, Tobata, Kitakyushu, Fukuoka 804-0012, Japan
e-mail: [email protected]

1 Introduction There are excellent surveys on human motion recognition and analysis [1–3,7,13,37,61,67,69,85,102–104,111,114,118,131,144,145,157,169]. These papers cover many detailed approaches and issues, and most of them have cited the motion history image (MHI) method [31] as one of the important methods. This paper surveys human motion and behavior analysis based on the MHI and its variants for various applications. Action recognition approaches can be categorized into one of three groups: (i) template matching, (ii) state-space approaches and (iii) semantic description of human behaviors [2,144]. The MHI method is a template matching

approach. Approaches based on template matching first convert an image sequence into a static shape pattern (e.g., an MHI or MEI) and then compare it to pre-stored action prototypes during recognition [144]. Template matching approaches are easy to implement and require a low computational load, though they are more prone to noise and more susceptible to variations in the time interval of the movements. Some template matching approaches are presented in Refs. [6,31,96,117,123]. Moreover, recognition approaches can be divided into (i) appearance- or view-based approaches, (ii) generic human model recovery, and (iii) direct motion-based recognition approaches [31]. Appearance-based motion recognition is one of the most practical ways of recognizing a gesture without attaching any sensors to the human body or its surroundings. The MHI is a view-based or appearance-based template-matching approach. In the MHI, the silhouette sequence is condensed into gray-scale images, while dominant motion information is preserved. Therefore, it can represent a motion sequence in a compact manner. The MHI template is also not very sensitive to silhouette noise, such as holes, shadows, and missing parts. These advantages make these templates a suitable candidate for motion and gait analysis [89]. The MHI keeps a history of temporal changes at each pixel location, which then decays over time [159]. The MHI expresses the motion flow or sequence by using the intensity of every pixel in a temporal manner. Because the motion history captures general patterns of movement, it can be implemented with cheap cameras and low-powered CPUs [33,34]. It can also be used in low-light areas where structure cannot be easily detected. The paper is organized as follows: Sect. 2 introduces the basic MHI approach, and then we sum up the variants of the MHI method in Sect. 3. Section 4 presents various applications based on these approaches. Section 5 discusses some issues related to the MHI approach and its variants for



Fig. 1 Development of the MHI images for two different actions. The produced MHI images are shown under the actions sequentially

future research perspectives. Finally, Sect. 6 concludes the paper.

2 Overview of the motion history image method This section presents an overview of the MHI method. The importance of various parameters is analyzed. Finally, several limitations of the basic MHI method are pointed out.

2.1 MHI and MEI templates Bobick and Davis [30] first proposed a representation and recognition theory that decomposes motion-based recognition into first describing where there is motion (the spatial pattern) and then describing how the object is moving. They [30,31] present the construction of a binary motion energy image (MEI), or binary motion region (BMR) [47], which represents where motion has occurred in an image sequence. The MEI describes the motion shape and spatial distribution of a motion. Next, an MHI is generated. The intensity of each pixel in the MHI is a function of the motion density at that location. One of the advantages of the MHI representation is that a range of times may be encoded in a single frame, and in this way, the MHI spans the time scale of human gestures [33]. Taken together, the MEI and the MHI can be considered as a two-component version of a temporal template, a vector-valued image where each component of each pixel is

some function of the motion at that pixel position [31]. These view-specific templates are matched against stored models of views of known movements. The incorporation of both the MHI and the MEI templates constitutes the MHI method. The MHI Hτ(x, y, t) can be computed from an update function Ψ(x, y, t):

$$
H_\tau(x, y, t) =
\begin{cases}
\tau & \text{if } \Psi(x, y, t) = 1 \\
\max\bigl(0,\ H_\tau(x, y, t-1) - \delta\bigr) & \text{otherwise}
\end{cases}
$$


Here, (x, y) and t show the position and time, Ψ(x, y, t) signals the object's presence (or motion) in the current video image, the duration τ decides the temporal extent of the movement (e.g., in terms of frames), and δ is the decay parameter. This update function is called for every new video frame analyzed in the sequence. The result of this computation is a scalar-valued image where more recently moving pixels are brighter and vice versa [31,93]. Figure 1 presents the development of the MHI images for two different actions sequentially. These illustrate clearly that a final MHI image records the temporal history of the motion. Some possible image processing techniques for defining this update function Ψ(x, y, t) are background subtraction, image differencing and optical flow [165]. More details on this issue will be presented below. Usually, the MHI is generated from a binarized image, obtained from frame subtraction [162], using a threshold ξ:


$$
\Psi(x, y, t) =
\begin{cases}
1 & \text{if } D(x, y, t) \geq \xi \\
0 & \text{otherwise}
\end{cases}
$$

where D(x, y, t) is defined with the difference distance Δ as:

$$
D(x, y, t) = \lvert I(x, y, t) - I(x, y, t \pm \Delta)\rvert
$$

Here, I(x, y, t) is the intensity value of the pixel at coordinate (x, y) in the tth frame of the image sequence. We get the final MHI template as Hτ(x, y, τ). Now we define the MEI. The MEI is the cumulative binary motion image that describes where a motion is in the video sequence, computed from the start frame to the final frame. The moving object's sequence sweeps out a particular region of the image, and the shape of that region (where there is motion, instead of how as in the MHI concept) can be used to suggest the region in which the movement occurred [57]. As the update function Ψ(x, y, t) represents a binary image sequence indicating regions of motion, the MEI Eτ(x, y, t) can be defined as:

$$
E_\tau(x, y, t) = \bigcup_{i=0}^{\tau-1} D(x, y, t - i)
$$

The MEI can also be deduced from the MHI by thresholding the MHI above zero [31]:

$$
E_\tau(x, y, t) =
\begin{cases}
1 & \text{if } H_\tau(x, y, t) \geq 1 \\
0 & \text{otherwise}
\end{cases}
$$

A benefit of using the gray-scale MHI is that it is sensitive to the direction of motion, unlike the MEI; hence the MHI is better suited for discriminating between actions of opposite directions (e.g., 'sitting down' versus 'standing up') [47]. However, both the MHI and the MEI images are important for representing motion information. The two images together provide better discrimination than either alone [31].
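As a concrete illustration of the update rules above, the following minimal NumPy sketch (not the authors' implementation; parameter values are illustrative) binarizes frame differences with the threshold ξ, updates the MHI with duration τ and decay δ, and derives the MEI by thresholding the MHI above zero.

```python
import numpy as np

def update_mhi(mhi, prev_frame, frame, tau=30.0, delta=1.0, xi=30):
    """One MHI update step following H_tau(x, y, t) defined above."""
    # Psi(x, y, t): 1 where the absolute frame difference exceeds xi
    diff = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16))
    psi = diff >= xi
    # H_tau = tau where motion is present, otherwise decay by delta (floored at 0)
    return np.where(psi, tau, np.maximum(0.0, mhi - delta))

def mei_from_mhi(mhi):
    """MEI: binary image marking every pixel whose motion history is non-zero."""
    return (mhi >= 1).astype(np.uint8)

# usage on a list of H x W uint8 gray frames called `frames` (assumed given):
# mhi = np.zeros(frames[0].shape, dtype=np.float32)
# for prev, cur in zip(frames, frames[1:]):
#     mhi = update_mhi(mhi, prev, cur)
# mei = mei_from_mhi(mhi)
```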

Fig. 2 Example of the MHI and the MEI: First four columns are some sequential frames; images in 5th column are the corresponding MHIs; and images in the right-most column show the MEI images for both actions ('hand-waving and body-bending': upwards using left-hand (top row) and downwards using right-hand (bottom row))

Figure 2 shows typical MHIs and MEIs for two mirrored actions of hand-waving and sideways body-bending.

2.2 Dependence on τ and δ

Figure 3 shows the dependence on τ in producing the MHI. For this action of waving up the left hand (with 26 frames), we produce different MHIs with different τ values. If the τ value is smaller than the number of frames, then we lose prior information of the action in its MHI. For example, when τ = 15 for an action having 26 frames, we lose the motion information of the first frame after 15 frames if the value of the decay parameter (δ) is 1. On the other hand, if the temporal duration is set to a very high value compared to the number of frames (e.g., 250 in this case for an action with 26 frames), then the changes of pixel values in the MHI template are less significant. Therefore, this point should be considered while producing MHIs. Figure 4 shows the dependence on the decay parameter (δ) while calculating the MHI image. In the basic MHI method [31], δ is set to 1. While loading the frames, if there is no change (or no presence) of motion at a specific pixel where earlier there was motion, the pixel value is reduced by δ. However, different δ values may provide slightly different information; hence the value can be chosen empirically. Researchers need to consider this parameter while working with the MHI. The top row of Fig. 4 shows final MHI images for the same action (as shown in Fig. 1 (top row)) with different δ values (i.e., 1, 3, 5 and 10). We notice that higher values of δ remove the earlier trail of the motion sequence. The second row presents a running action. The first two images are for δ = 1, and the latter two for δ = 3, while the 1st and 3rd images are taken mid-way and the 2nd and 4th images are taken at the end of the sequence. We note that when δ = 3, part of the earlier motion information is


Fig. 3 Dependence on τ to develop MHI images

Fig. 4 Dependence on δ in calculating the MHI template

missing. Similarly, the 3rd row shows the MHIs for a walking action. The bottom row presents MHIs (1st and 3rd) and MEIs (2nd and 4th) for a walking action when τ is set to 250 instead of the number of frames, 100. The first two images use δ = 3, while the last two images use δ = 5. This information is important: based on the demands of the action set, we can modulate the values of δ and τ while producing the MHI and the MEI. Regarding the parameters, a question may arise: 'Under what circumstances does one want a faster versus slower decay?' From the above discussion, it is clear that the values of τ and δ together determine how long it takes for a motion to decay to 0, thus determining the temporal window size. However, different settings can lead to the same temporal window (e.g., τ = 10 and δ = 1 leads to the same temporal window as τ = 100 and δ = 10). The joint


effect of τ and δ is to determine how many levels of quantization the MHI will have; thus the combination of a large τ and a small δ yields a slowly-changing continuous gradient, whereas a large τ and a large δ provide a more step-like, discrete quantization of motion. This provides insight not only into what parameters and design choices one has, but also into the impact of choosing different parameters or designs.

2.3 Selection of the update function Ψ(x, y, t) for motion segmentation

Many vision-based human motion analysis systems start with human detection [144]. Human detection aims at segmenting regions of interest corresponding to people from the rest of an image. It is a significant issue in a human motion analysis system since the subsequent processes such


as tracking and action recognition are greatly dependent on the performance and proper segmentation of the region of interest. Background subtraction, frame differencing, optical flow, and statistical subtraction methods are well-known approaches for motion segmentation. The performance and choice of background subtraction method vary depending on whether the background is static or dynamic. For a static background (i.e., with no background motion), it is trivial to subtract the background when other factors such as outdoor or cluttered scenes are absent. A few dominant methods are listed in Refs. [22,56,65,74,95,101,135,136,158,160]. Some of these methods employ various statistical approaches, adaptive background models (e.g., [74,135]) and the incorporation of other features (e.g., color and gradient information in [95]) with an adaptive model, in order to handle dynamic backgrounds or other complex issues related to background subtraction. For MHI generation, background subtraction was employed initially by [31,47]. Frame-to-frame differencing methods are also widely used for motion segmentation [21,28,43,76,88,146]. These temporal differencing methods, employed between two [21,43,88,146] or three consecutive frames [28,76], are adaptive to dynamic environments, though they generally extract the relevant feature pixels poorly. Unless the thresholds are defined properly, the generation of holes (see Fig. 6) inside moving objects is a major concern. Temporal differencing methods are also employed to generate the MHI and the MEI (e.g., by [8]). Optical flow methods [27,29,66,91,94,113,138,149,153] can be used for the generation of the MHI and for motion segmentation for various purposes. Ahad et al. [6,8,10] employed optical flow in their variants of the MHI for motion segmentation to extract the moving object. Computing quality optical flow from consecutive image frames is a challenging task. To produce better results on a motion's presence and its directions from optical flow, the RANSAC (RANdom Sample Consensus) method [60] can be employed to reduce outliers. Based on these refined optical flow vectors, the MHI can be constructed, thus providing better direction information and a clearer picture of the motion's presence. Ahad et al. [6] employed the four channels of optical flow to compute the MHIs. In this case, instead of background or frame subtraction, a gradient-based optical flow vector [42] is computed between two consecutive frames and split into four channels (as depicted in Fig. 5). This is based on the concept of motion descriptors built on smoothed and aggregated optical flow measurements [55]. Though optical flow can produce good results even in the presence of slight camera motion, it is computationally complex and very sensitive to noise and the presence of texture. Moreover, from a real-time perspective, specialized optical flow methods can be tried to ascertain whether they can achieve better results without the incorporation of special hardware.
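As a rough illustration of two common choices for Ψ(x, y, t), the sketch below (illustrative only; the simple running-average model merely stands in for the adaptive background models cited above) produces a binary motion mask either by two-frame differencing or by subtraction against a slowly updated background.

```python
import numpy as np

def psi_frame_diff(prev_frame, frame, xi=30):
    """Binary motion mask Psi from two-frame differencing."""
    diff = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16))
    return (diff >= xi).astype(np.uint8)

def psi_background_sub(frame, background, xi=30, alpha=0.05):
    """Binary motion mask Psi from a running-average background model.

    Returns the mask and the updated background (float image)."""
    diff = np.abs(frame.astype(np.float32) - background)
    mask = (diff >= xi).astype(np.uint8)
    # update the background only where no motion was detected
    background = np.where(mask == 0,
                          (1.0 - alpha) * background + alpha * frame,
                          background)
    return mask, background
```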

Fig. 5 Optical flow is split into four different channels, which are used to calculate directional MHIs

Beauchemin and Barron [27] and McCane et al. [94] have presented various optical flow methods. Seven different methods are tested in the optical flow benchmarking work of [94]. Several real-time optical flow methods [29,138,149] have been developed for various motion segmentation and computation tasks. Changes in weather, illumination variation, repetitive motion, and the presence of camera motion or cluttered environments hinder the performance of motion segmentation approaches. Therefore, choosing a proper approach based on the dataset or environment is crucial, especially for outdoor environments. Extracting shadows and removing them from the motion region is another concern in computer vision, and most importantly in generating the MHI template. As pointed out above, one important concern is the selection of the update function Ψ(x, y, t) for motion segmentation and its threshold value (ξ). Figure 6 demonstrates typical examples of the selection of threshold values for the frame subtraction method. The top row presents MHIs for an action with different threshold values (i.e., 30, 50, 75 and 150 from left to right). We note the presence of a noisy background when the threshold is set to 30 (first image of the 1st row). However, if we increase ξ, we also miss some part of the motion information (note the hole in the right-most image of the top row). In another example (shown in the bottom-row images), we use a walking motion in a different environment and at a different depth. The noisy MHI and MEI images (first two images) for a walking action use ξ = 12, whereas the next two images (without any noisy background but with missing information) use ξ = 150. Therefore, the selection of the update function and its threshold ξ is very important for calculating motion history/energy templates.
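Since ξ trades background noise (too low) against holes inside the moving region (too high), one cheap sanity check is to sweep a few candidate thresholds and inspect the fraction of pixels marked as moving; the helper below is a purely illustrative way to do this.

```python
import numpy as np

def foreground_fraction(prev_frame, frame, thresholds=(12, 30, 50, 75, 150)):
    """Fraction of pixels marked as moving for each candidate threshold xi.

    A value near 1.0 suggests a noisy mask; a value near 0.0 suggests that
    most of the genuine motion is being discarded (holes)."""
    diff = np.abs(frame.astype(np.int16) - prev_frame.astype(np.int16))
    return {xi: float(np.mean(diff >= xi)) for xi in thresholds}
```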


Fig. 6 Importance of the selection of the threshold value (ξ) for the update function Ψ(x, y, t). Note the presence of noise or holes in various images

Fig. 7 Changes in the standing position of a person (top row) make the MHIs (1st and 3rd images) and MEIs (2nd and 4th images) wider, as shown in the bottom row (the first two images are computed at 45 frames and the other two at the final frame)

Another issue is a change in the standing position of a person while executing an action that is supposed to occur at one specific location. For example, Fig. 7 depicts a person shifting his/her standing position; hence the final MHI becomes wider. Therefore, if an action is not supposed to incorporate movement away from its initial position, tracking of the central point of the moving body is required to handle this kind of position change. Another useful option is to normalize the size of the entire moving body and then create the MHI based on the normalized moving portion, or to normalize the MHI and the MEI for further processing. This is crucial for recognition purposes employing the MHI images, because the MHI method is computed globally over the image, and hence a change in position makes the final MHI wider than the object of interest and incorporates unwanted regions.

2.4 Feature vector analysis and classification

Figure 8 shows the system flow of the basic MHI approach for motion classification and recognition. According to the basic MHI method [31], feature vectors are calculated using the seven Hu moments [68] from the MHI and MEI images. Hu moments are widely used for shape representation [6,8,10,11,15,31,33,34,47,48,122,123,170]. There are other approaches to obtaining shape descriptors for calculating feature vectors from the templates. Figure 9 shows various other options for shape representation based on [68,80,81,168]. Though


Fig. 8 Typical system flow diagram of the MHI method for action recognition

Hu invariants are widely employed for the MHI and related methods, other approaches (e.g., Zernike moments [15,16], global geometric shape descriptors [15], the Fourier transform [143]) are also utilized for creating feature vectors. Several researchers [15,16,38,39,122] employ PCA to reduce the dimensions of the feature vectors. After feature vectors are developed, classification is performed and unknown motions are recognized. These steps are shown in the system flow diagram of the MHI method (Fig. 8). For classification, the support vector machine (SVM) [15,16,39,96–99], K-nearest neighbor (KNN) [6,11,23,24,112,122,141], multi-class nearest neighbor [15,16], Mahalanobis distance [31,33,34,47] and maximum likelihood (ML) [110] classifiers are employed. For the partitioning scheme, one could employ (i) the re-substitution method (training and test sets are the same); (ii) the holdout method (half the data is used for training and the remaining data for testing); (iii) the leave-one-out method; (iv) the rotation method or

Fig. 9 Numerous approaches for shape representations. Region-based global Hu moments [68] are considered by many researchers [6,8,10,11,15,33,34,47,48,122,123,170], including Bobick and Davis [31]

N-fold cross validation (a compromise between the leave-one-out and holdout methods, which divides the samples into P disjoint subsets, 1 ≤ P ≤ N, uses (P − 1) subsets for training and the remaining subset for testing); and (v) the bootstrap method [70]. In most cases, the leave-one-out cross validation scheme is used as the partitioning scheme (e.g., [6,11,110]). This means that, out of N samples from each of the c classes per database, N − 1 of them are used to train (design) the classifier and the remaining one to test it [81]. This process is repeated N times, each time leaving a different sample out, so that all of the samples are ultimately used for testing, and the resulting recognition rate is averaged. Usually, this estimate is unbiased.
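A minimal sketch of this evaluation pipeline is shown below (illustrative only; the original works also use Mahalanobis distance, SVMs and other classifiers): seven Hu moments are extracted from each precomputed MHI/MEI template with OpenCV, and a 1-nearest-neighbour classifier is scored with leave-one-out cross validation.

```python
import numpy as np
import cv2

def hu_features(template):
    """Seven Hu moment invariants of a gray-scale template (e.g., an MHI or MEI)."""
    hu = cv2.HuMoments(cv2.moments(template.astype(np.float32))).flatten()
    # log-scale the moments, since they span many orders of magnitude
    return -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)

def leave_one_out_1nn(features, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbour (Euclidean) classifier."""
    features = np.asarray(features, dtype=np.float64)
    labels = np.asarray(labels)
    correct = 0
    for i in range(len(labels)):
        dists = np.linalg.norm(features - features[i], axis=1)
        dists[i] = np.inf            # hold out the i-th sample
        correct += int(labels[np.argmin(dists)] == labels[i])
    return correct / len(labels)
```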

2.5 Limitations of the basic MHI method

Though successful in constrained situations, the basic MHI method has a few limitations. The MHI method is not suitable for dynamic backgrounds in its basic representation (which is based on background subtraction or image differencing) [129]. However, by employing approaches that can segment motion information from a dynamic background, the MHI method can be useful in dynamic cases too. Occlusion of the body, or improper implementation of the update function Ψ(x, y, t), results in serious recognition failures [2,31]. The MHI method does not need trajectory analysis [46]. However, its non-trajectory nature can be a problem for cases where tracking is necessary to analyze a moving car or person [8]. The MHI representation has been combined with tracking information for some applications (e.g., by [123]). It is also limited to label-based (token) recognition, where it could not yield any information other than specific identity matches (e.g., it could not report that 'upward' motion was

occurring at a particular image location) [24,49,50]. This is because the moment features are generated (and matched) holistically, being computed from the entire template [49]. Another limitation of this method is the requirement of having stationary objects, and the insufficiency of the representation to discriminate among similar motions [123]. The MHI method is an appearance-based method. However, by employing several cameras from different directions and combining moment features from these directions, action recognition can still be achieved; yet, because different actions seen from different camera views can produce similar representations, this may lead to false recognition of an action. Another key problem of this method is its failure to separate the motion information when there is motion self-occlusion or overwriting [8,19,73,96,112,141]. In this case, if an action contains opposite directions among its atomic actions (e.g., from sitting down to a standing position), then the previous motion information (e.g., sitting down) is deleted or overwritten by the later motion information (e.g., standing up) (Fig. 10). Therefore, if a person sits down and then stands up, the final MHI image contains brighter pixels only in the upper part of the image, representing the stand-up motion; it cannot vividly distinguish the direction of the motion. This self-occlusion of the moving object or person overwrites the prior information (a toy reproduction of this effect appears below). Like any template matching approach, the MHI also has the drawback that it is sensitive to the variance of movement duration [145].
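The overwriting effect is easy to reproduce in a few lines; the toy sketch below (illustrative values only) moves a block downward and then back upward over the same pixels, so the final MHI retains the brightness gradient of the later, upward sweep only.

```python
import numpy as np

def toy_overwrite_demo(tau=60.0, delta=1.0, size=40):
    """Simulate a block sweeping down and then back up over the same pixels.

    The stamps written during the downward phase are overwritten by the
    upward phase, so the direction of the earlier motion is lost."""
    mhi = np.zeros((size, size), dtype=np.float32)
    rows = list(range(5, 30)) + list(range(30, 5, -1))   # down, then up again
    for r in rows:
        psi = np.zeros_like(mhi, dtype=bool)
        psi[r:r + 5, 15:25] = True                        # current block position
        mhi = np.where(psi, tau, np.maximum(0.0, mhi - delta))
    return mhi
```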

3 Motion history image-based approaches

In this section, MHI-based approaches are presented. We start with direct implementations of the MHI method for numerous applications, and afterwards cover implementations with some modifications.



Fig. 10 Motion overwriting problem (due to self-occlusion) of the MHI method

We also categorize and analyze important developments of the MHI in 2D and 3D domains. 3.1 Various approaches employing the MHI method 3.1.1 Direct implementation of the MHI Due to its simple representation of an action, the MHI method is employed by different researchers without any modification for their respective demonstrations. Rosales [122] and Rosales and Sclaroff [123] employ the MHI method with seven Hu moments and Rosales [122] uses principal components analysis (PCA) to reduce the dimensionality of this representation. The system is trained using different subjects performing a set of examples of every action to be recognized. Given these samples, K-nearest neighbor, Gaussian, and Gaussian mixture classifiers are used to recognize new actions. Experiments are conducted using instances of eight


human actions performed by seven different subjects and good recognition results are achieved. Rosales and Sclaroff [123] propose a trajectory-guided recognition method. It tracks an action by employing an extended Kalman filter and then uses the MHI for action recognition via a mixture-of-Gaussians classifier. They test the system on recognizing different dynamic outdoor activities. Jan [71] hypothesizes that a suspicious person in a restricted parking lot would display erratic patterns in his/her walking trajectories (to inspect vehicles and their contents for possible malicious attempts). To this aim, trajectory information is collected and its MHI (based on profiles of changes in velocity and acceleration) is computed. The highest, average and median MHI values are profiled for each individual in the scene. Though it is a simple hypothesis, collecting real-time data from surveillance devices seems challenging. Nonetheless, it is an initial attempt to analyze this information, which can be exploited to detect possibly suspicious behaviors. Apparently there can be far more features than just trajectories (and their velocities and accelerations). Alahari and Jawahar [18] model action characteristics by MHIs for hand gesture recognition and four different actions (i.e., jumping, squatting, limping and walking). They introduce discriminative actions, which describe the usefulness of the fundamental units in distinguishing between events. They achieve an average 30.29% reduction in error for some event pairs. Shan et al. [129] employ the MHI for hand gesture recognition considering the trajectories of the motion. They employ the mean shift embedded particle filter, which enables a robot to robustly track natural hand motion in real time. Then, an MHI for a hand gesture is created based on the hand tracking results. In this manner, spatial trajectories are retained in a static image, and the trajectories are called temporal template-based trajectories (TTBT). Hand gestures are recognized based on statistical shape and orientation analysis of the TTBT. By applying this hand tracking algorithm and gesture recognition approach in a wheelchair, they have realized a real-time hand control interface for the robot. Meng et al. [100] developed a simple system based on an SVM classifier and MHI representations, which is implemented on a reconfigurable embedded computer vision architecture for real-time gesture recognition. In another work, by Vafadar and Behrad [140], the MHI is employed for gesture recognition for interaction with handicapped people. In this approach, after constructing the MHI for each gesture, a motion orientation histogram vector is extracted. These vectors are then used for training a hidden Markov model (HMM) and for hand gesture recognition. Yau et al. [162,163] decompose MHIs into wavelet sub-images using the stationary wavelet transform (SWT). The motivation for using the MHI in visual speech recognition is


the ability of the MHI to remove static elements from the sequence of images and preserve the short-duration, complex mouth movements. The MHI is also invariant to the skin color of the speakers due to the frame differencing and image subtraction involved in generating the MHI. Here, the SWT is used to denoise and to minimize the variations between different MHIs of the same consonant. Three moment-based features are extracted from the SWT sub-images to classify three consonants only. The MHI is also used to produce input images for the line fitter, a system for fitting lines to a video sequence that describe its motion [58]. It uses the MHI method for summarizing the motion depicted in video clips; however, it fails with rotational motion. Rotations are not encoded in the MHIs because the moving objects occupy the same pixel locations from frame to frame, and new information overwrites old information. Another failure example is that of curved motion; obviously, the straight-line model is inadequate there, and a more flexible model is needed to improve performance. Orrite et al. [110] propose silhouette-based action modeling for recognition, where they employ the MHI directly as the input feature of the actions. These 2D templates are then projected into a new subspace by means of the Kohonen self-organizing feature map (SOM). Action recognition is accomplished by a maximum likelihood (ML) classifier. In another experiment, Tan and Ishikawa [139] employ the MHI method and their proposed method to compare six different actions. Their results show a poor recognition rate for the MHI. After analyzing the datasets, it appears that actions inside the dataset have motion overwriting; hence, it is understandable that the MHI method may have a poor recognition rate for this type of dataset. Also, Meng et al. [96–99] and Ahad et al. [8] compare the recognition performance of the MHI method with their HMHH and DMHI methods, respectively, for several different datasets (one with a radio-aerobics dataset and another with the KTH dataset). These datasets have motion overwriting due to self-occlusion, and therefore their approaches outperform the MHI method in terms of average recognition rates.

3.1.2 Implementation of the MHI with some modifications

In this sub-section, several methods and applications are presented where the MHI method is exploited with little modification, or which follow an almost similar route in developing the motion cues. The MHI method, or the MHI and/or MEI templates, are implemented with some modifications by several researchers in different applications. To start with, Han and Bhanu [63,64] proposed the gait energy image (GEI), which specifically targets the representation of normal human walking, based on the concept of the MEI. The GEI is implemented as a gait template for individual gait recognition. As compared

to the MEI and the MHI, the GEI specifically targets the representation of normal human walking [63]. Given the preprocessed binary gait silhouette images Bt(x, y) at time t in a video sequence, the gray-level GEI is defined as

$$
G(x, y) = \frac{1}{N} \sum_{t=1}^{N} B_t(x, y)
$$

where N is the number of frames in the complete cycle(s) of a silhouette sequence and t is the frame number in the sequence (moment of time) [63]. Therefore, the GEI is a time-normalized accumulative energy image of human walking over the complete cycle(s).
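A sketch of this averaging is given below (assuming binary silhouettes are already extracted; not the authors' implementation):

```python
import numpy as np

def gait_energy_image(silhouettes):
    """GEI: per-pixel mean of N binary silhouettes over complete gait cycle(s).

    silhouettes : iterable of H x W arrays with values in {0, 1}"""
    stack = np.stack([s.astype(np.float32) for s in silhouettes], axis=0)
    return stack.mean(axis=0)     # G(x, y) = (1/N) * sum_t B_t(x, y)
```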

Though it performs very well for gait recognition, it appears from the construction of the equation that for human activity recognition this approach might not perform as well as the MHI method does. In a similar fashion, Zou and Bhanu [170] employ the GEI and co-evolutionary genetic programming (CGP) for human activity classification. They extract Hu moments and normalized histogram bins from the original GEIs as input features. The CGP is employed to reduce the feature dimensionality and learn the classifiers. Bashir et al. [25,26] and Yang et al. [161] implement the GEI directly for human identification with different feature analyses. Similar to the development of the GEI, an action energy image (AEI) is proposed for activity classification by Chandrashekhar and Venkatesh [38]. They use the eigen-decomposition of an AEI in an eigen activity space obtained by PCA, which best represents the AEI data in the least-squares sense. AEIs are computed by averaging silhouettes, and unlike the MEI, which captures only 'where the motion occurred', the AEI captures 'where and how much the motion occurred'. The MEI carries less structural information since it is computed by accumulating motion images obtained by image differencing, while the AEI incorporates information about both structure and motion. They experiment with their AEI concept on walking and running motions and achieve good results. On the other hand, Liu and Zheng [89] propose a method called the gait history image (GHI) for gait representation and recognition. The GHI inherits the idea of the MHI in the sense that temporal information and spatial information are recorded in both cases. The GHI preserves the temporal information besides the spatial information, overcoming the lack of temporal variation in the GEI. However, each cycle yields only one GEI or GHI template, which easily leads to the problem of insufficient training cycles [41]. Moreover, the gait moment energy (GMI) method is developed by Ma et al. [92] based on the GEI. The GMI is the gait probability image at each key moment of all gait cycles. In this approach, the corresponding gait images at a key moment are averaged as the GEI of this key moment. They introduce the moment deviation image (MDI) by using silhouette



images and GMIs. As a good complement to the GEI, the MDI provides more motion features than the GEI. Both the MDI and the GEI are utilized to represent a subject. However, it is not easy for the GMI to select key moments from cycles with different periods. Therefore, to compensate for this problem, Chen et al. [41] propose a cluster-based GEI approach. In this case, GEIs are computed from several clusters and the Dominant Energy Image (DEI) is obtained by denoising the averaged image of each cluster. Frieze and wavelet features are adopted and an HMM is employed for recognition. This approach performs better than the GEI, GHI and GMI representations, as it is superior (due to its clustering concept) when the silhouettes are incomplete or noisy. Wang and Suter [147] directly convert an associated sequence of human silhouettes derived from videos into two types of computationally efficient representations, namely, the average motion energy (AME) and the mean motion shape (MMS), to characterize actions. These representations are used for recognition. The MMS is proposed based on shapes, not silhouettes (in a similar manner to the AME). The process of generating the AME is computationally inexpensive and can be employed in real-time applications [166]. The AME is computed in exactly the same manner as the GEI, though the former is exploited for action recognition whereas the latter is used for gait recognition. In calculating the AME, Wang and Suter [147] employ the sum of absolute differences (SAD) for action recognition and obtain adequate recognition results. However, for large image sizes or databases, the computation of the SAD is inefficient and computationally expensive. This constraint is addressed by Yu et al. [166], who propose a histogram-based approach that can efficiently compute the similarity among patterns. As an initial step, an AME image is converted to the motion energy histogram (MEH). From a histogram point of view, the AME can be regarded as a two-dimensional histogram whose bin values represent the frequency of motion at each position during a time interval. Thus, the AME can be reformed into the MEH using:

$$
\mathrm{MEH}(x, y) = \frac{\mathrm{AME}(x, y)}{\sum_{(x, y) \in \mathrm{AME}} \mathrm{AME}(x, y)}
$$

Then, a multi-resolution structure is adopted to construct the multi-resolution motion energy histogram (MRMEH). A high-speed human motion recognition technique is proposed based on a modified MHI and a superposed motion image (SMI) [108]. Using a multi-valued differential image fi(x, y, t) to extract information about human posture, they propose a modified MHI that can be defined as

$$
H_\alpha(x, y, t) = \max\bigl(f_i(x, y, t),\ \alpha\, H_\alpha(x, y, t-1)\bigr)
$$

where Hα(x, y, t) is the modified MHI and the parameter α is a vanishing rate with 0 < α ≤ 1. An SMI is the maximum-value image generated from summing the past


successive images with an equal weight, and the SMI can be obtained by setting α = 1 in the modified MHI. Employing these images, a motion is described in an eigenspace as a set of points, and each SMI plays the role of a reference point. By calculating the correlation between reference SMIs and the MHI generated from an unknown motion sequence, a match is found with images described in the eigenspace to recognize the unknown motion. Experimental results show good performance [99,108] on different datasets. This method, however, is highly dependent on the parameter α, and the parameter is database-specific. An approach to generate motion-based patterns, where moving objects are first segmented by adaptive threshold-based change detection, is proposed in [72]. They use a scalar-valued rear-MHI and front-MHI to represent how the motion evolves, which are used to segment and measure the motion. After that, motion vectors with orientation and magnitude are generated from the chamfer distance. Finally, an approach is derived to generate an intra-MHI for interior moving parts. Singh et al. [132] use the MHI and the MEI, and develop a motion color image (MCI) by combining motion and color cues. The MCI is constructed by a bit-wise OR of the MEIs of the four previous levels and color localization data. They dynamically control the frame differencing. They then divide the MCI into nine boxes and compute the motion pixels of the MHI in each of the nine boxes. Feature vectors are calculated as the sum of motion pixels in each box for classification. Its recognition rate is highly dependent on the training data. A scene-specific, adaptive camera navigation model is constructed for video surveillance by automatically learning locations of high activity [52]. It measures activity from the MHI at each view across the full viewable field of a PTZ camera. For each MHI blob in the scene, it determines whether the blob is a potential candidate for human activity. Later, the intensity fade of each MHI blob is examined against noise. Using this iterative candidacy-classification-reduction process, one can produce an 'activity map', where brighter areas correspond to locations with more activity. Vitaladevuni et al. [143] present a Bayesian framework for recognizing actions through ballistic dynamics. It temporally segments videos into their atomic movements and enhances the performance of the popular MHI feature: this ballistic segmentation with the MHI improves the recognition rate over that obtained by using the MHI alone. Ahmad et al. [14–16] propose spatio-temporal silhouette representations, called the silhouette energy image (SEI) and the silhouette history image (SHI), to characterize motion and shape properties for the recognition of human movements. The SEI and the SHI are constructed from the silhouette image sequence of an action. They employ the Korea University gesture database and the KTH database [127] for recognition.


The computations of the SHI and the SEI follow exactly the same concepts as the MHI and the GEI, respectively. They are computed from silhouette images, rather than direct motion images of the actions (though the MHI can also be computed from silhouette images). From the SEI and the SHI, they compute a human shape variability model to approximate the anthropometric variability of different actions [14]. Watanabe and Kurita [148] propose new features for motion recognition, called higher-order local autocorrelation (HLAC) features. These are extracted from MHIs and have good properties for motion recognition. The features are tested using image sequences of pitching in baseball games, and good recognition results are achieved on their action datasets. An edge motion history image (EMHI) method [39,40] combines edge detection with the MHI technique. It is extracted as a temporally compressed feature vector from a short video sequence. Usually, the background is not easy to extract in news and sports videos with complex background scenes. Moreover, stereo depth information is usually not available in video data either. Therefore, instead of using the MHI directly, they propose to use edge information detected in each frame, instead of silhouettes, to compute an EMHI. Let Bt(x, y, t) be a binary value indicating whether a pixel is located on an edge at time t. An EMHI, $\mathrm{EMHI}^{\tau}_{t}(x, y, t)$, is computed from the EMHI of the previous frame, $\mathrm{EMHI}^{\tau}_{t-1}(x, y, t)$, as:

$$
\mathrm{EMHI}^{\tau}_{t}(x, y, t) =
\begin{cases}
\tau & \text{if } B_t(x, y, t) = 1 \\
\max\bigl(0,\ \mathrm{EMHI}^{\tau}_{t-1}(x, y, t) - 1\bigr) & \text{otherwise}
\end{cases}
$$

In this equation, the basic feature of the EMHI is edges. Later, they manage scale adaptation and noise (as existing edge detection algorithms are sensitive to noise). The motion history concept can help to smooth noise and provide historical motion clues to help a human vision system build correspondences on edge points [40]. They develop a layered Gaussian mixture model (LGMM) to exploit these features for classifying various shots in video. Another work conceptually similar to the MHI method [31] is proposed by Masoud and Papanikolopoulos [93]. This method extracts motion directly from the image sequence. At each frame, motion information is represented by a feature image, which is calculated efficiently using an infinite impulse response (IIR) filter. In particular, they use the response of the IIR filter as a measure of motion in the image. The idea is to represent motion by its recentness: recent motion is represented as brighter than older motion, just as in [31]. This technique, also called recursive filtering, is simple and time-efficient. Unlike the MHI method [31], an action is represented by several feature images [93]

rather than just two images (namely, the MHI and the MEI) [31].

3.2 Variants of the MHI method in 2D

3.2.1 Solutions to the motion self-occlusion problem

One of the key limitations of the MHI method is its inability to perform well in the presence of motion overwriting due to self-occlusion. Several attempts have been made to mitigate this issue, so that multi-directional activities can be represented with the MHI concept. One initial approach is the multiple-level MHI (MMHI) method [112,141,142]. It aims at overcoming the problem of motion self-occlusion by recording motion history at multiple time intervals (i.e., multi-level MHIs). It creates all MHIs with a fixed number of history levels n, so each image sequence is sampled to (n + 1) frames. The MMHI is computed as follows:

$$
\mathrm{MMHI}(x, y, t) =
\begin{cases}
s \cdot t & \text{if } \Psi(x, y, t) = 1 \\
\mathrm{MMHI}(x, y, t-1) & \text{otherwise}
\end{cases}
$$

where s = 255/n is the intensity step between two history levels and MMHI(x, y, t) = 0 for t ≤ 0. The final template is found by iteratively computing the above equation for t = 1, ..., n + 1. This method encodes motion occurrences at different time instances at the same pixel location in such a manner that they can be uniquely decoded afterwards. For this purpose, it uses a simple bit-wise coding scheme. If a motion occurs at time t at pixel location (x, y), it adds 2^(t−1) to the old motion value of the MMHI as follows:

$$
\mathrm{MMHI}(x, y, t) = \mathrm{MMHI}(x, y, t-1) + \Psi(x, y, t) \cdot 2^{(t-1)}
$$

Due to this bit-wise coding scheme, one can separate multiple actions occurring at the same position [141]. The work focuses on the automatic detection of facial action units that compose expressions. It requires a sophisticated registration system, because all employed image sequences must have the faces at the same position and on the same scale. The results do not clearly demonstrate the superiority of the MMHI with respect to the basic MHI [20,141]; even in their reports, the MMHI produces lower recognition results than the MHI [142]. However, they point out that self-occlusion due to the motion overwriting problem might be solved using the MMHI. Ahad et al. [11] implement the MMHI with an aerobics dataset and another action dataset, but find that the MMHI method shows poor recognition results. The motion overwriting or self-occlusion problem of the MHI method is robustly addressed by the directional motion history image (DMHI) method [8]. In this approach, instead of background or frame subtraction, gradient-based optical flow is



calculated between two consecutive frames and split into four channels (see Fig. 5, as shown above). Based on this strategy, one obtains four directional motion templates for the left, right, up and down directions. The corresponding four history images are calculated as:

$$
\mathrm{DMHI}^{\ell}_{\tau}(x, y, t) =
\begin{cases}
\tau & \text{if } \Psi^{\ell}(x, y, t) > \xi \\
\max\bigl(0,\ \mathrm{DMHI}^{\ell}_{\tau}(x, y, t-1) - \delta\bigr) & \text{otherwise}
\end{cases}
$$

where the direction label is ℓ ∈ {up (+y), down (−y), right (+x), left (−x)}. For the positive and negative horizontal directions, the $\mathrm{DMHI}^{+x}_{\tau}(x, y, t)$ and $\mathrm{DMHI}^{-x}_{\tau}(x, y, t)$ image templates are obtained as motion history templates. Similarly, $\mathrm{DMHI}^{+y}_{\tau}(x, y, t)$ and $\mathrm{DMHI}^{-y}_{\tau}(x, y, t)$ represent the positive and negative vertical directions, respectively. These four motion history templates resemble the directions of the motion vectors. Each DMHI template is passed through a median filter to smooth noisy patterns, and hence smoothed DMHI images are computed:

$$
{}^{med}H^{\ell}_{\tau}(x, y, t) = \mathrm{med}\bigl(\mathrm{DMHI}^{\ell}_{\tau}(x, y, t)\bigr)
$$

where med() is the median filter. Four MEIs are then computed by thresholding these templates above zero:

$$
\mathrm{DMEI}^{\ell}_{\tau}(x, y, t) =
\begin{cases}
1 & \text{if } {}^{med}H^{\ell}_{\tau}(x, y, t) \geq 1 \\
0 & \text{otherwise}
\end{cases}
$$

This method solves the overwriting problem significantly (a small sketch of the directional update appears at the end of this sub-section). Several complex actions and aerobics exercises (which involve more than one direction within an action) are tested. More than 94% recognition is achieved with the DMHI method, whereas the MHI shows around a 50% recognition rate. The DMHI method requires four history templates and four energy templates for the four directions; hence the size of the feature vector becomes large and the method becomes computationally a bit more expensive than the MHI. In recent work based on this approach, various reduced-size feature vectors have been proposed which can recognize motions faster with almost the same recognition result [4]. Moreover, combining the cues from the DMHI and the MEI representations for each action (with an outdoor action dataset) also yields satisfactory results [17]. The DMHI is also employed for low-resolution action recognition, because it keeps information on the motion components even when the resolution is poor. With low-resolution video sequences (from 320×240 down to 64×48 pixels), the recognition results are very promising [5]. However, at very low resolutions, due to the lack of pixel information, it becomes difficult to obtain significant information for recognition: if there is no motion information in the final history or energy templates, then feature vectors cannot be computed from these templates. Another improvement


is proposed [12], called the timed-DMHI, to cover similar actions having different speeds. This concept is simple but not robust. Earlier, Meng et al. [96] proposed an SVM-based system called the hierarchical motion history histogram (HMHH). In [97–100], they compare other methods (i.e., the modified-MHI and the MHI) to demonstrate the robustness of the HMHH in recognizing several actions. This representation retains more motion information than the MHI, and also remains inexpensive to compute [100]. In this approach, to address the overwriting problem, they define patterns Pi in the motion mask sequence D(x, y, :), based on the number of connected '1's, e.g.,

$$
P_1 = 010,\quad P_2 = 0110,\quad P_3 = 01110,\quad \ldots,\quad P_M = 0\underbrace{1 \cdots 1}_{M}\,0
$$

Now define a subsequence $C_i = b_{n_1}, b_{n_2}, \ldots, b_{n_i}$ and denote the set of all subsequences of D(x, y, :) as {D(x, y, :)}. Then, for each pixel, count the number of occurrences of each specific pattern Pi in the sequence D(x, y, :):

$$
\mathrm{HMHH}(x, y, P_i) = \sum_{j} \mathbf{1}\bigl\{C_j = P_i \mid C_j \in \{D(x, y, :)\}\bigr\}
$$

Here, 1{·} is the indicator function. Hence, from each pattern Pi, one gray-scale image (called a motion history histogram, MHH) is constructed, and, in aggregation, all MHH images are called the hierarchical MHH (HMHH). They use these final feature images for classification and recognition using an SVM. These solutions are compared in [11] to show their respective robustness in solving the overwriting problem of the MHI method [31]. The employed dataset has some activities that are complex in nature and exhibit motion overwriting. For the HMHH method, four patterns are considered, as more than four patterns do not provide significant additional information. For every activity, the recognition result with the DMHI representation is very satisfactory (about 94% recognition). Though the HMHH representation achieved better results than the MHI and MMHI representations, the performance of the HMHH is unacceptable, as it achieved a recognition rate of only about 67%. Kellokumpu et al. [73] extract spatially enhanced local binary pattern (LBP) histograms from the MHI and MEI temporal templates and model their temporal behavior with HMMs. They select a fixed frame number. The computed MHI is divided into four sub-regions through the centroid of the silhouette. All MHI and MEI LBP features are concatenated into one histogram and normalized so that the sum of the histogram equals one. In this case, the temporal modeling is done using HMMs. This texture-based description of movements can handle the overwriting problem of the MHI. One concern with this approach is the choice of the sub-region division scheme for every action.
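As referenced above, the following sketch (an approximation, not the authors' code; parameter values are illustrative) makes the directional-template idea concrete: a dense optical flow field, here obtained with OpenCV's Farnebäck estimator, is split into four non-negative directional channels, and one history image is maintained per channel.

```python
import numpy as np
import cv2

def update_dmhi(dmhis, prev_gray, gray, tau=30.0, delta=1.0, xi=1.0):
    """One update of the four directional MHIs.

    dmhis     : dict mapping 'left', 'right', 'up', 'down' to float32 history images
    prev_gray : uint8 gray image at t-1; gray : uint8 gray image at t
    xi        : threshold on the per-direction flow magnitude (illustrative value)"""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    fx, fy = flow[..., 0], flow[..., 1]
    # split the flow into four non-negative directional channels
    # (image coordinates: +x is right, +y is down)
    channels = {
        'right': np.maximum(fx, 0.0), 'left': np.maximum(-fx, 0.0),
        'down':  np.maximum(fy, 0.0), 'up':   np.maximum(-fy, 0.0),
    }
    for key, chan in channels.items():
        psi = chan > xi
        dmhis[key] = np.where(psi, tau, np.maximum(0.0, dmhis[key] - delta))
    return dmhis
```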


3.2.2 Solutions to some issues of the MHI in 2D

To overcome several constraints of the MHI method [31], various developments have been proposed in both the 2D and 3D domains. This sub-section covers some other variants of the MHI method in 2D; 3D extensions are covered in a later sub-section. Davis [51] presents a method for recognizing movement that relies on localized regions of motion, which are derived from the MHI. He offers a real-time solution for recognizing some movements by gathering and matching multiple overlapping histograms of the motion orientations from the MHI. In this extension of the original work [31], Davis explains a method for handling variable-length movements as well as the occlusion issue. The directional histogram for each body region has twelve bins (30 degrees each), and the feature vector is a concatenation of the histograms of the different body regions. In another update, the MHI is generalized by directly encoding the actual time in a floating-point format, which is called the timed motion history image (tMHI) [33,34]. In the tMHI, new silhouette values are copied in with a floating-point timestamp. The MHI representation is updated as follows, using not the frame numbers but the timestamp of the video sequence [34]:

$$
\mathrm{tMHI}_{\gamma}(x, y) =
\begin{cases}
\tau & \text{if current silhouette at } (x, y) \\
0 & \text{else if } \mathrm{tMHI}_{\gamma}(x, y) < (\tau - \gamma)
\end{cases}
$$

where τ is the current timestamp and γ is the maximum time duration constant (typically a few seconds) associated with the template. This makes the representation independent of the system speed or frame rate (within limits), so that a given gesture covers the same MHI area at different capture rates. They also present a method of motion segmentation based on segmenting layered motion regions that are meaningfully connected to movements of the object of interest. The segmented regions are not 'motion blobs', but motion regions that are naturally connected to parts of moving objects. This is motivated by the fact that segmentation by collecting 'blobs' of similar directional motion does not guarantee the correspondence of the motion over time. This motion segmentation, together with silhouette pose recognition, provides a very general and useful tool for gesture and motion recognition [34]. The approach is later employed by Senior and Tosunoglu [128] for tracking objects in real time; they use the tMHI for motion segmentation. The motion gradient orientation (MGO) is also computed by Bradski and Davis [34] from the interior silhouette pixels of the tMHI. These orientation gradients are employed for recognition. Wong and Cipolla [154,155] exploit MGO images to form motion features for gesture recognition. Pixels in the MGO image encode the change in orientation between the nearest moving edges shown in the MHI, and the region of interest is defined as the largest rectangle covering all bright pixels in the MEI. Therefore, the MGO contains information about where and how a motion has occurred [155].
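A minimal sketch of this timestamp-based update (illustrative only; the silhouette mask is assumed to be available from some segmentation step):

```python
import numpy as np

def update_tmhi(tmhi, silhouette, timestamp, gamma=2.0):
    """tMHI update: stamp current silhouette pixels with the current time (in
    seconds) and clear pixels whose stamp is older than (timestamp - gamma)."""
    tmhi = np.where(silhouette > 0, float(timestamp), tmhi)
    tmhi = np.where(tmhi < (timestamp - gamma), 0.0, tmhi)
    return tmhi
```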

The MHI's limitation relating to 'global image' feature calculations can be overcome by computing a dense local motion vector field directly from the MHI to describe the movement [49]. Davis [49] extends the original MHI representation into a hierarchical image pyramid format to provide a means of addressing the gradient calculation at multiple image speeds. An image pyramid is constructed by recursively low-pass filtering and sub-sampling an image (i.e., power-of-2 reduction with anti-aliasing) until reaching a desired amount of spatial reduction. The result is a hierarchy of motion fields, where the computed motion in each level is tuned to a particular speed (i.e., with faster speeds residing at higher levels). The hierarchical MHI (HMHI) is not created directly from the original MHI, but through the pyramid representation of the silhouette images. Afterwards, based on the orientations of the motion flow (computed from the MHI pyramid), a motion orientation histogram (MOH) is produced, and the resulting motion is characterized by a polar histogram. The HMHI approach remains a computationally inexpensive algorithm to represent, characterize and recognize human motion in video [100].

3.2.3 Motion separation and identification approach

Based on the DMHI template [8], a temporal segmentation (or separation) scheme that decomposes a complex motion into its primitives is proposed [6]. This temporal motion segmentation method can provide an intermediate interpretation of a complex motion in terms of four directions, namely, right, left, up and down. After obtaining the motion templates for a complex action or activity, it calculates the volume of pixel values (νt) by summing up the brightness levels of the motion templates. For consecutive frames, it is

$$
\nu_t = \sum_{x=1}^{M} \sum_{y=1}^{N} \mathrm{DMHI}^{\ell}_{\tau}(x, y, t)
$$

One can decide the label ℓ ∈ {up, down, left, right} of the segmented motion based on threshold values that determine the starting point of a motion (Γα) and the ending point of that motion (Γβ):

$$
\Delta_{t+k} = \nu_{t+k} - \nu_t, \qquad
\text{label} =
\begin{cases}
\ell & \text{if } \Delta_t > \Gamma_\alpha \\
\varnothing & \text{if } \Delta_t < \Gamma_\beta
\end{cases}
$$

Here, Δt is the difference between the volumes of pixel values (νt) for two frames, and the variable k is the frame number. When the difference Δt is greater than the starting threshold Γα, we can decide the label of the segmented motion. However, when Δt falls below Γβ, we can say that the scene is static or that an earlier motion no longer exists (∅).
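The bookkeeping behind this separation is small; the sketch below (illustrative threshold value; it reuses the dmhis dictionary from the earlier DMHI sketch) tracks the per-direction pixel volume and reports a direction label only when the volume jump exceeds a start threshold.

```python
import numpy as np

def direction_volumes(dmhis):
    """nu_t per direction: the summed brightness of each directional history image."""
    return {key: float(template.sum()) for key, template in dmhis.items()}

def detect_direction(prev_volumes, volumes, gamma_start=1e4):
    """Return the direction whose volume increased the most, if that increase
    exceeds the (illustrative) start threshold; otherwise return None."""
    deltas = {key: volumes[key] - prev_volumes[key] for key in volumes}
    best = max(deltas, key=deltas.get)
    return best if deltas[best] > gamma_start else None
```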



Therefore, based on this mechanism applied to the motion history templates, a complex motion sequence can easily be segmented into four directions. This is very useful for an intelligent robot to decide the directions of human movement; an action can thus be understood as some consecutive left–right–up–down combination [6].

3.2.4 Other 2D developments

An advantage of the MHI is that, although it is a representation of the history of pixel-level changes, only one previous frame needs to be stored. However, at each pixel location, explicit information about the past is also lost in the MHI when current changes are updated to the model with their corresponding MHI values 'jumping' to the maximal value [159]. To overcome this problem, Ng and Gong [105] propose the pixel signal energy (PSE) to measure the mean magnitude of pixel-level temporal energy over a period of time. It is defined over a backward window, whose size determines the number of frames (the history) to be stored [106]. Another recent development on the MHI representation is the pixel change history (PCH) [159], which can measure multi-scale temporal changes at each pixel. The PCH of a pixel, Pς,δ(x, y, t), can be defined by

$$
P_{\varsigma,\delta}(x, y, t) =
\begin{cases}
\min\bigl(P_{\varsigma,\delta}(x, y, t-1) + \tfrac{255}{\varsigma},\ 255\bigr) & \text{if } D(x, y, t) = 1 \\
\max\bigl(P_{\varsigma,\delta}(x, y, t-1) - \tfrac{255}{\delta},\ 0\bigr) & \text{otherwise}
\end{cases}
$$

where D(x, y, t) is the binary foreground image, ς is an accumulation factor and δ is a decay parameter. When D(x, y, t) = 1, the value of the PCH increases gradually according to the accumulation factor, instead of jumping to the maximum value. When no significant pixel-level visual change is detected at location (x, y) in the current frame, the pixel (x, y) is treated as part of the background and the corresponding PCH starts to decay; the speed of decay is controlled by the decay factor. In fact, the MHI is a special case of the PCH: a PCH image is equivalent to an MHI image when the accumulation factor ς is set to 1. Compared to the PCH, the MHI has weaker discriminative power to distinguish different types of visual changes. Moreover, similar to the PSE [105], a PCH can also capture a zero-order pixel-level change, i.e., the mean magnitude of change over time [159].
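A sketch of this gradual ramp-up/decay (illustrative, not the authors' code):

```python
import numpy as np

def update_pch(pch, foreground, varsigma=10.0, delta=20.0):
    """Pixel change history update: ramp up by 255/varsigma where the binary
    foreground mask is set, decay by 255/delta elsewhere, clipped to [0, 255]."""
    up = np.minimum(pch + 255.0 / varsigma, 255.0)
    down = np.maximum(pch - 255.0 / delta, 0.0)
    return np.where(foreground > 0, up, down)
```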


MHIs can also be used to detect and interpret actions in compressed video data. Compressed-domain human motion is recognized on top of the MHI approach by introducing the motion flow history (MFH) [23,24]. The MFH quantifies the motion in the compressed video domain. Motion vectors are extracted from the compressed MPEG stream by partial decoding. Then noise is reduced, and the coarse MHI and the corresponding MFH are constructed at macro-block resolution instead of at pixel resolution. By this approach, they reduce the computation by 16 times. The MFH can be computed according to the following equations:

$$\mathrm{MFH}_\tau(x,y,t)=\begin{cases}v_\tau(x,y,t) & \text{if } E(v_\tau(x,y,t)) < \wp\\ M(v_\tau(x,y,t)) & \text{otherwise}\end{cases}$$

where

$$E(v_\tau(x,y,t)) = \left\lVert v_\tau(x,y,t) - \mathrm{med}\big(v_\tau(x,y,t),\ldots,v_{\tau-\alpha}(x,y,t)\big)\right\rVert^{2}$$
$$M(v_\tau(x,y,t)) = \mathrm{med}\big(v_\tau(x,y,t),\ldots,v_{\tau-\alpha}(x,y,t)\big)$$

Here med() denotes the median filter, $v_\tau(x,y,t)$ can be the horizontal ($v_\tau^{x}$) or vertical ($v_\tau^{y}$) component of the motion vector located at (x, y) in frame τ, and α indicates the number of previous frames considered for median filtering. The function E() checks the reliability of the current motion vector with respect to former non-zero motion vectors at the same location against a predefined threshold ℘. The MFH gives information about the extent of the motion at each macro-block ('where and how much the motion has occurred'). The MHI, which has spatio-temporal information but no motion vector information, is complemented by the MFH. The features extracted from the MHI and MFH are used to train classifiers for recognizing a set of seven human actions.

However, self-occlusion or overlapping of motion on the image plane may result in the loss of a part of the motion information. Yilmaz and Shah [164] propose two modifications to the MHI representation. In their representation, motion regions are represented by contours rather than by entire silhouettes. Contours in multiple frames are not compressed into one image but are directly represented as a spatio-temporal volume (STV) by computing the correspondences of contour points across frames.

3.3 Extensions of the MHI method in view-invariant methods

All the above developments are based on the 2D MHI and hence are not view-invariant. Several 3D extensions of the basic MHI method have been proposed for view-invariant 3D motion recognition [20,121,130,151]. Also, approaches by Davis [46,49] have looked at the problem of combining MHIs from multiple views (e.g., eight different views [46]) to perform view-invariant recognition. The motion history volume (MHV) is introduced in 3D instead of the 2D MHI. For feature extraction, 3D moments are employed [90], as an extension of the 2D Hu invariants [68]. Shin et al. [130] present a novel method for real-time gesture recognition with a 3D motion history model (MHM).


Utilizing this 3D-MHM with disparity information, not only is the camera view problem solved but also the reliability of recognition and the scalability of the system are improved. Apart from the view-invariance issue, Shin et al. [130] also propose a dynamic history buffering (DHB) scheme to solve the gesture duration problem that arises from the variation of gesture velocity each time the gesture is performed. The DHB mitigates the problem by using the magnitude of motion. Based on their work, the system using the 3D-MHM achieves better recognition results than using only 2D motion information. Another (similar to [130]) view-invariant 3D recognition method, called the volume motion template (VMT), is proposed in [121]. It extracts silhouette images using background subtraction and disparity maps, and then computes a volume object in 3D space to construct the VMT. With 10 gestures, it achieves good recognition results. Weinland et al. [150,151] develop a 3D extension of the MHI method, called the motion history volume (MHV), based on the visual hull for viewpoint-independent action recognition. The proposed transition from 2D to 3D is straightforward: pixels are replaced with voxels, and the standard image differencing function D(x, y, t) is substituted with the space occupancy function D(x, y, z, t), which is estimated using silhouettes and thus corresponds to a visual hull. Voxel values in the MHV at time t are defined as

$$\mathrm{MHV}_\tau(x,y,z,t)=\begin{cases}\tau & \text{if } D(x,y,z,t)=1\\ \max\!\left(0,\;\mathrm{MHV}_\tau(x,y,z,t-1)-1\right) & \text{otherwise}\end{cases}$$
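A minimal sketch of this voxel update, assuming a boolean occupancy volume is already available from the visual hull, might look as follows (array shapes and names are illustrative assumptions).

```python
import numpy as np

def update_mhv(mhv, occupancy, tau):
    """One motion history volume (MHV) update step.

    mhv:       3-D float array of per-voxel history values.
    occupancy: boolean space-occupancy function D(x, y, z, t), e.g. a
               visual hull estimated from multi-view silhouettes.
    tau:       temporal extent of the template.
    """
    occ = occupancy.astype(bool)
    # occupied voxels are stamped with tau; others decay towards zero
    return np.where(occ, float(tau), np.maximum(mhv - 1.0, 0.0))
```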

They automatically segment action sequences into primitive actions that can each be represented by a single MHV. Later they cluster the resulting MHVs into a hierarchy of action classes, which allows recognizing multiple occurrences of repeating actions [150]. The MHV demonstrates that temporal segmentation is a much easier process in 3D than in 2D, so the temporal scale and parameters can be set automatically. The MHV is used both for supervised and unsupervised learning of action primitives. It offers an interesting alternative for action recognition with multiple cameras. However, the additional computational complexity due to calibration, synchronization of multiple cameras, and parallel background subtraction is not discussed in these works [19]. Similar to the MHV [151], Canton-Ferrer et al. [35,36] propose another 3D version of the MHI by adding information about the position of the human body limbs, employing multiple calibrated cameras. An ellipsoid body model is fit to the incoming 3D data to capture the body part in which the gesture occurs. In their temporal analysis module, they first compute a motion energy volume (MEV) from the binary dataset, which indicates the region of motion. This measure captures the 3D locations where there has been motion in the last few (τ) frames. In this approach, the selection of the parameter τ is a crucial factor in defining the temporal extent of a gesture.
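As a rough illustration of this temporal analysis step, the following sketch accumulates the last τ binary volumes into an energy volume marking where motion has occurred within that window. The buffer handling and class interface are assumptions for this example, not the authors' exact formulation.

```python
import numpy as np
from collections import deque

class MotionEnergyVolume:
    """Accumulate the last tau binary volumes and report where motion
    has been detected within that temporal window."""

    def __init__(self, tau):
        self.buffer = deque(maxlen=tau)      # the last tau binary volumes

    def update(self, motion_mask):
        """motion_mask: boolean 3-D volume marking voxels with motion at t."""
        self.buffer.append(motion_mask.astype(bool))
        # union over the window: any voxel active in the last tau frames
        return np.any(np.stack(list(self.buffer), axis=0), axis=0)
```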

To represent the temporal evolution of the motion, they also define a motion history volume (MHV) where the intensity of each voxel is a function of the temporal history of the motion at that 3D location. They exploit 3D invariant statistical moments [90] for shape analysis and classification. This method is implemented by Petras et al. [116], who develop a flexible test bed for unusual behavior detection. Albu et al. [19,20] present a new 3D motion representation, called the volumetric motion history image (VMHI), to be used for the analysis of irregularities in human actions. Such irregularities may occur either in speed or orientation and are strong indicators of the balance abilities and of the confidence level of the subject performing the activity. The VMHI can be computed as

$$\mathrm{VMHI}(x,y,k)=\begin{cases}S(x,y,k)\,\nabla\,S(x,y,k+1) & \text{if } \aleph S(x,y,k)=\aleph S(x,y,k+1)\\ 1 & \text{otherwise}\end{cases}$$

where ℵS(x, y, k) is the one-pixel-thick contour of the binary silhouette in frame k and ∇ stands for the symmetric difference operator. The VMHI attempts to overcome the limitations of the basic MHI related to motion self-occlusion, speed variability and variable-length motion sequences. This 3D representation differs from the other 3D variants of the MHI in that its focus is the analysis of motion rather than motion recognition. It does not need to evaluate the temporal duration of the MHI.

4 Application realms of MHI-based methods

The MHI method and the concept of the MHI/MEI representations are widely employed and analyzed by various computer vision communities. Though most of these methods are presented in the above sections, the objective of this section is to elaborate their applications and categorize them. We categorize the numerous applications (which are based on the MHI method and its variants) into three broad groups: (1) gesture or action recognition; (2) motion analysis; and (3) interactive systems. Table 1 summarizes these applications.

4.1 The MHI for action or gesture recognition

To begin with, the MHI method is used to recognize different actions in [31,47]. Later on, this method is used for the recognition of human movements and for moving object tracking by various groups (e.g., Refs. [8,10,18,23,24,33–35,48,49,51,63,73,82,86,89,96–99,108,115,121–124,129,130,132,139,143,150,151,159]). Xiang and Gong [159] introduce a similar concept for recognizing indoor shopping activities and outdoor aircraft cargo activities. Leman et al. [86] develop a PDA-based recognition system based on the MHI method. Rosales [122] performs experiments using instances of eight human actions and achieves satisfactory recognition results. Singh et al. [132] also use the MHI and the MEI along with a new MCI template for a real-time recognition system.


Table 1 Various applications by employing MHI and its variants

[Ref.] (Year)

Employed databases (DB)/Applications

Results (RR)/Features (F)/Classifier (C)/ Comments

[122] (1998)

DB: 8 actions by 7 subjects

RR: good; F: Hu moment, PCA, Gaussian & Gaussian Mixture; C: KNN

[31] (2001)

DB: 18 aerobic exercises by 1 instructor—taken several times

RR: good; F: Hu moment; C: Mahalanobis distance

[23] (2003) [24] (2004)
DB: 7 actions by 5 subjects with 10 repetitions
RR: ∼98%; F: Compressed domain motion; C: KNN, Neural Network, SVM & Bayes classifier

[82] (2004)
DB: 5 hand gestures by 5 subjects with 10 repetitions
RR: 96%; F: Hu moment; C: Back propagation based multilayer perceptron ANN

[46] (2004)

DB: Walk, run, stand by 3 subjects, 8 different viewpoints, from thermal camera

RR: 77%; F: Sequential reliable-inference likelihood, Hu moment; C: Bayesian Information Criterion

[129] (2004)

DB: 7 hand gestures recognition by a robot in real-time

RR: high; F: Hu moments, Tracking by Mean Shift Embedded PF; C: Mahalanobis distance,

[130] (2005)

DB: 4 gestures (walking, sitting, arm-up, bowing) from calibrated stereo camera

RR: ∼90%; F: 3D global gradient orientations; Likelihood by least square method. Duration issue is considered

A. Recognition

[86] (2005)

PDA-based recognition system

Employ PDA and limited scope

[132] (2006)

DB: 11 gestures for robot behavior in real-time

RR: 90%; F: Motion Color Cue with MHI, MEI

[121] (2006)
DB: 10 gestures from 7 viewpoints with 10 repetitions
RR: 90%; 3D method, by using VMT

[108] (2006)
DB: 6 actions by 9 subjects with 3 repetitions
RR: 79.9%; Eigenspace & calculating reference points for all actions; recognition by mapping on eigenspace

[35,36] (2006)

DB: 8 actions from 5 calibrated wide lens cameras, in a SmartRoom, multiple people

RR: ∼98%; F: Ellipsoid body model, 3D moments, PCA; C: Bayesian classifier

[151] (2006) [150] (2006)

DB: INRIA IXMAS action dataset [120]: 11 actions by 10 subjects (5 males, 5 females), with 3 repetitions from 5 cameras

RR: 93.3%; F: Fourier transformation in cylindrical coordinates; C: Mahalanobis distance, Linear Discriminant Analysis (LDA); Visual Hull, MHV, 3D approach

[39] (2006)

DB: TRECVID’05, TRECVID’03: 6 types of actions/shots from video (total 100 shots)

RR: ∼63%; F: Layered Gaussian Mixture Model; C: PCA, SVM; EMHI

[124] (2006)

DB: 10 different gestures from several subjects

RR: 90%; Indoor environment, stereo camera for a robot

[18] (2006)
DB: (i) Marcel's Dynamic Hand Gesture database: 15 video sequences having 4 gestures (click, no, stop-grasp-ok, rotate) (ii) 4 actions (jump, squat, limp, walk) by 20 subjects
RR: (i) ∼92% on some action pairs of DB, (ii) ∼90% on pairs of 4 actions; F: Discriminant vectors on MHI; C: Fisher Discriminant Analysis

[63,170] (2006) [64] (2003)
DB: USF HumanID Gait DB [126]
RR: overall ∼71% (rank 5); F: Frequency & phase estimation; C: PCA+MDA (Multiple Discriminant Analysis)

[38] (2006)
DB: 9 activities by 9 subjects [62] (one action less than [62])
RR: 93.8%; F: AEI from GMM background model; C: PCA

Table 1 continued

[Ref.] (Year)

Employed databases (DB)/Applications

Results (RR)/Features (F)/Classifier (C)/ Comments

[147] (2006)

DB: 10 activities by 9 subjects [62]

RR: ∼100%; F: AME, mean motion shape (MSS); C: KNN, NN, NN with class exemplar (ENN), Summation of Absolute Difference, Mahalanobis distance

[139] (2007)

DB: 6 actions by 9 subjects from 4 cameras [108]

RR: Poor result due to the presence of motion overwriting in some actions; MHI method is used

[96] (2007)
DB: 6 actions by 25 subjects [127]
RR: 80.3%; F: Motion Geometric Distribution from MHI+HMHH; C: SVM light

[92] (2007)
DB: USF HumanID Gait DB [126]
RR: overall ∼66% (rank 5); F: Gait period estimation, gait moment image, moment deviation image+GEI; C: Nearest Neighbor

[10] (2007) [8] (2008) [17] (2010)
DB: (i) 5 actions by 9 subjects, indoor, from multiple cameras (ii) 10 aerobics by 8 subjects, indoor (iii) 10 actions by 8 subjects, outdoor
RR: (i) 93%, (ii) ∼94%, (iii) 90%; F: Optical flow-based DMHI+DMEI/MEI, Hu moments; C: KNN

[161] (2008)
DB: USF HumanID Gait DB [126]
RR: A bit better than GEI [63]; F: Gabor phase spectrum of GEI; C: Low dimensional manifold

[143] (2008)

DB: (i) 14 gestures by 5 subjects with 5 repetitions (ii) 7 video sequences by 6 subjects (iii) INRIA XMAS Action Dataset [120]

RR: (i) 92%, (ii) ∼85.5%, (iii) 87%; F: Fourier-based MHI on ballistic segments; C: Dynamic Time Warping, Dynamic Programming

[26,25] (2008)

DB: CASIA gait database—all [167]

RR: overall ∼90.5%; F: GEI with feature selection mask; C: Adaptive Component and Discriminant Analysis

[73] (2008)
DB: (i) 15 gestures by 5 subjects [78] (ii) 10 actions by 9 subjects [62]
RR: (i) 95%, (ii) 97.8%; F: LBP histograms from MHI+MEI; C: HMM

[15,16] (2008) [14] (2010)
DB: (i) Full-body gesture DB—14 actions by 20 subjects, ages ranging 60∼80 years [59] (ii) 6 actions by 25 subjects [127]
RR: (i) 89.4%, (ii) 87.5%; F: Hu and Zernike moments, global geometric shape descriptors; C: multi-class SVM

[110] (2008)
DB: (i) Virtual Human Action Silhouette DB [54]—20 different actions by 9 actors (ii) INRIA IXMAS Action dataset [120]
RR: (i) 98.5%, (ii) 77.27%; F: Kohonen Self Organizing feature Map; C: Maximum likelihood (ML)

[148] (2008)
DB: Sequences of pitching in baseball games
RR: 100% (when image is 90x90 pixels), 96.7% (when image is 25x25 pixels); F: Higher-order Local Auto-Correlation (HLAC) features from MHI; C: PCA, Dynamic Programming

[41] (2009)
DB: (i) CMU Mobo gait database—by 25 subjects (ii) CASIA gait database (DB B)—by 124 subjects (93 males, 31 females) [167]
RR: (i) ∼82% (better than [89,63,92]), (ii) 93.9%; F: Frieze and wavelet features from Dominant Energy Image; C: HMM

[166] (2010)
DB: (i) 7 actions (from 9 subjects [62] (3 actions less than [62]) + additional 10 subjects of their own) (ii) CASIA gait database (DB B) [167]
RR: (i) ∼98.5% at normal resolution for action recognition, (ii) 96.4%; F: Multi-Resolution structure on Motion Energy Histogram (HRMEH), quad-tree decomposition; C: Histogram-matching algorithm

B. Motion analysis

[Ref.] (Year)
Applications/Scenes
Employed approaches/Comments

[51] (1998)
Discriminate different movements
Histograms of motion orientations of MHI; Mahalanobis distance


Table 1 continued

[Ref.] (Year)

Applications/Scenes

Employed approaches/Comments

[123] (1999)
Dynamic outdoor activities from a single camera
Trajectory-guided tracking using Extended Kalman Filter based on MHI

[48] (1999)
Action analysis and understanding in real-time
Motion gradients and MHI

[115] (2001)

Tracking first 3 sequences of PETS dataset-1

[125] (2004)

[58] (2004)

Motion tracking for a moving robot in real-time Threat assessment for surveillance in car parking Line fitter

MHI tracker, Gaussian weighting, Kalman filter, Mahalanobis distance Camera motion compensation, Kalman filtering Tracking, NN-based, not promising result

[128] (2005)

Tracking with CAMSHIFT algorithm

Neural network is employed

[133] (2005)

Detection and localization of road area in traffic video Behavior understanding in indoor and outdoor scenes Moving object localization from thermal imagery Moving object tracking

Fuzzy-shadowed sets are used

Adaptive camera models for video surveillance using PTZ camera, outdoor in various places Analysis of sway and speed-related abnormalities of human actions; 5 different actions Real-time detector for unusual behavior, 4 major partners (ACV, BILKENT, UPC and SZTAKI) achieved this task Temporal motion segmentation and action analysis—both indoor and outdoor

Need to include more features (e.g., texture, color) for improvement

[Ref.] (Year)
Applications/Scenes
Comments

C. Interactive systems

[50] (1998)
Virtual aerobics trainer
Watch and respond to the user as he/she performs the workout

[48] (1999)
Interactive art demonstration
Map different colors to the various timestamps within the MHI for fun

[32] (1999)
The KidsRoom—an environment for kids to play
An interactive, narrative play space for children, with virtual monsters

[141] (2004) [142] (2004) [112] (2005)
21 or less facial Action Unit classes; MMI-Face-DB; Cohn-Kanade Face DB
Poor recognition rate; higher pre-processing load

[107] (2006)
Interactive art demonstration
In complex environment

[162,163] (2006)
Speech recognition—3 consonants
SWT; very limited result

[71] (2004)

[159] (2006) [165] (2006) [87] (2006) [52] (2007)

[20] (2007) [19] (2007)

[116] (2007) [137] (2007)

[6] (2009)

For recognizing a set of seven human actions, Refs. [23,24] upgrade the MHI. Davis [49,51] and Bradski and Davis [33,34,48] improve the MHI in different ways for recognizing various motions and gestures. In one of Davis's more recent works [46], a rapid and reliable action (run, walk and stand) recognition approach is proposed using the MHI method. To recognize complex activities and to solve the motion overwriting problem of the MHI, Ahad et al. [8,10] develop the DMHI method for recognizing various aerobics and other actions.


Failure in rotational motion

EM, BIC, DPN, Dynamically Multi-Linked-HMM RANSAC MHI; overall approach is not great

Solved motion self-occlusion, action length variability Employed 3D-MHI of [35], web server-based real-life detection module, tracking, outdoor Need to implement in an intelligent robot; management of outliers in flow vectors is required

Similarly, to solve the overwriting problem, Meng et al., in their sequence of works [96–100], propose the HMHH approach to recognize various actions using an SVM. In another solution to overcome the overwriting problem of the MHI, Kellokumpu et al. [73] extract spatially enhanced local binary pattern (LBP) histograms from the MHI and the MEI to classify various actions. This approach demonstrates robustness against irregularities in the data and partial occlusion. Vitaladevuni et al. [143] enhance the performance of the MHI feature for recognizing actions through ballistic dynamics; their approach temporally segments videos into atomic movements. Using an eigenspace, Ogata et al. [108] employ different MHIs and the SMI for recognizing several actions.
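As an illustration of the texture-based description mentioned above, the following sketch computes a plain 8-neighbour LBP histogram over an MHI; note that [73] actually uses spatially enhanced (block-wise) histograms and an HMM classifier, which are not reproduced here.

```python
import numpy as np

def lbp_histogram(mhi):
    """Very small sketch: an 8-neighbour LBP code histogram over an MHI."""
    img = mhi.astype(np.float32)
    c = img[1:-1, 1:-1]                         # centre pixels
    codes = np.zeros_like(c, dtype=np.uint8)
    # eight neighbours, clockwise from the top-left pixel
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        nb = img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
        codes |= ((nb >= c).astype(np.uint8) << bit)
    hist, _ = np.histogram(codes, bins=256, range=(0, 256))
    return hist / max(hist.sum(), 1)            # normalized 256-bin histogram
```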


Kumar et al. [82] develop a system for hand gesture classification by employing the MHI. In another hand gesture recognition approach, Shan et al. [129] employ MHIs by considering various trajectories of the motion in real-time. Alahari and Jawahar [18] model action characteristics by MHIs for some hand gestures and four different actions. Ryu et al. [124] demonstrate a gesture recognition system by employing MHIs and MEIs, whereas Refs. [35,121,150,151] improve the MHI for view-invariant motion recognition. Similarly, another view-invariant approach is developed for real-time human gesture recognition in [130]. Gait recognition and person identification are targeted by various researchers by modifying the MHI and MEI concept [26,38,41,63,64,89,92,147,161,166,170]. We can note that the HMHH method [96–100] attempts to solve the motion overwriting problem of the MHI, but it is evaluated on a database in which motion is mainly one-dimensional, so the overwriting issue is insignificant there. Therefore, to judge its ability to solve the motion overwriting problem, it should be challenged with databases containing complex actions [11]. Similarly, most of the methods employ their own databases for recognition. Apart from the various approaches for motion representation (whether the MHI, MEI, GHI, GEI, DMHI, MMHI, HMHH, SMI, tMHI, HMHI, MFH, PCH, MHV, VMT or MHM), the feature extraction strategies and the subsequent classification methods also vary between methods. Even some comparative analyses (e.g., [11,96,100,166]) do not follow the same feature and classification strategies as the methods they compare against. For example, Ahad et al. [11] compare the MHI, the HMHH and the MMHI methods with the DMHI method. They use the seven Hu invariants [68] as feature vectors and KNN for classification, even though the MHI method [31] uses the Mahalanobis distance and the HMHH method [96–99] uses an SVM. Therefore, not only the choice of databases, but also the choice of classification and feature analysis approaches, is imperative when evaluating different methods.
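For example, a Hu-moment feature vector with a simple KNN vote, in the spirit of the comparison in [11], could be sketched as follows; the log-scaling and the value of k are illustrative choices, and OpenCV's moment routines are used for brevity.

```python
import numpy as np
import cv2

def hu_features(template):
    """Seven log-scaled Hu moment invariants of an MHI/MEI template."""
    m = cv2.moments(template.astype(np.float32))
    hu = cv2.HuMoments(m).flatten()
    # log scaling keeps the invariants in a comparable numeric range
    return -np.sign(hu) * np.log10(np.abs(hu) + 1e-30)

def knn_classify(query, train_feats, train_labels, k=3):
    """Plain k-nearest-neighbour majority vote in Hu-moment space.

    train_feats: (N, 7) array of features; train_labels: length-N labels.
    """
    dists = np.linalg.norm(train_feats - query, axis=1)
    nearest = np.asarray(train_labels)[np.argsort(dists)[:k]]
    values, counts = np.unique(nearest, return_counts=True)
    return values[np.argmax(counts)]
```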

4.2 The MHI for motion analysis

Apart from human action and activity recognition, the MHI method is also employed for motion detection and localization, automatic video surveillance and other purposes. Automatically localizing and tracking a moving person or vehicle for an automatic visual surveillance system is demonstrated in [87,165]. Li et al. [87] employ the MHI method before exploiting an extended mean shift approach. Yin and Collins [165] combine a forward-MHI and a backward-MHI to obtain a contour shape for the moving object at the current frame; however, this is not a complete tracking system but a localization approach based on the fading trail of the MHI. Rosales and Sclaroff [123] use the MHI for tracking several outdoor activities. In another approach, a multi-modal adaptive tracking system is developed where the MHI is calculated to find the moving part [115]. Jan [71] develops a surveillance system for threat assessment in a car park; this system employs MHIs to display any erratic pattern of a suspicious person in a restricted parking place. Using the MHI and a PTZ camera, a video surveillance system is developed for automatically learning locations of high activity [52]. Jin et al. [72] temporally segment a human body and measure its motion by employing the MHI. Son et al. [133] calculate the MHI and then combine it with a background model to detect the candidate road region. Albu et al. [19,20] extend the MHI method for the analysis of irregularities in human actions; such irregularities may occur either in speed or orientation and are strong indicators of the balance abilities and of the confidence level of the subject performing the activity. Petras et al. [116] devise a flexible test-bed for unusual behavior detection and automatic event analysis using the MHI. In the human model and motion-based unusual event detection (UPC) module, the concept of the MHI/MEI is introduced to realize a simple motion representation [137]. It extends this formulation to represent view-independent 3D motion (following the concept of Canton-Ferrer et al. [35]). A simple ellipsoid body model is fit to the incoming 3D data to capture the body part where the gesture occurs. This improves the recognition ratio and generates a more informative classification. In another approach [125], an AIBO robot detects motion by calculating MHIs before tracking it using a Kalman filter. Based on four directional motion history templates, a complex motion segmentation scheme is proposed [6]. This temporal motion segmentation (TMS) method can split a complex motion into 'left–right–up–down' combinations, so that an intelligent system can understand the semantics of actions promptly in real-time. The MHI is also used to produce input images for the line fitter, a system for fitting lines to a video sequence that describe its motion [58]. In this approach, MHI templates are utilized to summarize the motion in video clips.

4.3 The MHI for interactive systems

Various interactive systems have been successfully constructed using the motion history template as a primary sensing mechanism. For example, using the MHI method, Davis et al. [50] develop a virtual aerobics trainer that watches and responds to the user as he/she performs the workout. An interactive and narrative play space for children, called the KidsRoom [32], is developed successfully using the MHI method. This is a perceptually-based environment in which children could interact with monsters while playing in a story-telling scenario. Nguyen et al. [107] introduce the concept of a motion swarm, a swarm of particles that moves in response to the


Fig. 11 Walking away from the camera (first two columns) and towards the camera (last two columns). The bottom row shows the corresponding energy images

field representing an MHI. A structure imposed on the behavior of the swarm forces a response to MHIs. To create interactive art, the art responds to the motion swarm and not to the motion directly. Since they desire to have the swarm particles respond in a natural and predictable manner, they smooth the MHI in space by convolving it with a Gaussian kernel. Since the brightest pixels in the MHI are where motion has most recently occurred, the particles tend to follow the motion. Following this strategy, they create interactive art that can be enjoyed by groups such as audiences at public events. Another interactive art demonstration is constructed from the motion templates in [48]. Yau et al. [162,163] develop a method for visual speech recognition by employing MHIs; the video data of the speaker's mouth is represented by MHIs. Valstar et al. [141,142] and Pantic et al. [112] focus on the automatic detection of facial action units (AUs) that compose facial expressions. Table 1 presents the three application areas discussed in this section. The year of publication of the referred papers is shown in parentheses after the bracketed references. All of these methods employ the MHI or its variants; nevertheless, the databases, features and classification approaches vary considerably.
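As a small illustration of the smoothing step used for the motion swarm above, the following sketch blurs an MHI with a Gaussian kernel and returns its spatial gradient, which points towards the most recent motion; the particle dynamics themselves are not reproduced here, and the smoothing scale is an arbitrary choice.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def swarm_force_field(mhi, sigma=5.0):
    """Smooth an MHI and return its spatial gradient.

    The gradient of the smoothed template points towards the brightest
    (most recent) motion, so particles that follow it drift towards
    where motion has just occurred.
    """
    smooth = gaussian_filter(mhi.astype(np.float64), sigma=sigma)
    gy, gx = np.gradient(smooth)     # gradients along rows (y) and columns (x)
    return gx, gy
```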

5 Discussions

In this paper we present an overview of the MHI method, its various applications, and its important modifications; we extensively searched related works based on the MHI method and cover all the key variants and applications. We find that the MHI and MEI representations and their variants can be employed for action representation, understanding and recognition in various applications. This section discusses some future issues that are still unsolved.


Moreover, 'will the MHI stand the test of time?' The answer to this question is discussed here by illustrating its key features and their future implications for the computer vision community. In a sub-section above, we present a few methods that are developed to solve the view-invariance problem of the MHI. Though these methods have shown good performance on some datasets, they depend on multiple-camera systems and consequently carry extra computational overhead. Besides, similar to the basic MHI method, most of the other 2D variants have the same view-dependence problem. By using multiple cameras, recognition improves, but at the same time, for some actions, one camera view mirrors another action seen from another angle. These issues are not easy to solve and more exploration is required; incorporating image depth analysis will be one cue to tackle this issue. We also find that these 2D methods (e.g., MHI, DMHI, MMHI) cannot perform well, or face difficulties, in some activities. For example, when more than one person is in the scene, these methods cannot recognize the actions properly, especially when the persons are moving in different directions. They also cannot recognize an action well if the person is walking along the camera's optical axis or moving in near-diagonal directions. Figure 11 shows the case for two actions: (i) a person walking away from the camera, almost along the optical axis; and (ii) a person walking towards the camera from far away, again almost in line with the optical axis [9]. It is evident from the energy images that the system cannot clearly separate these two actions. This issue should be addressed with some semantics or depth analysis; the energy of the moving regions can be analyzed intermittently, and this information may be exploited to resolve the problem. Another important pair of activities is running and walking across the optical axis.


Fig. 12 Motion and its corresponding DMEI (top row) and DMHI (bottom row) images: a walking; b running. The H/E−x images of b show more ripple-shaped information than those of the walking motion in a

Recognizing or distinguishing walking from running motion for video surveillance is very difficult with the present manifestation of the MHI. Like the MHI, the other variants produce almost identical motion templates for walking and running, and hence demonstrate poor recognition results. Though the AEI presented in [38] is claimed to distinguish walking and running motion easily, the action datasets used in that work are limited. One intuitive way to achieve better features for separating walking and running motion is to employ the DMHI method with a higher value of the decay parameter (δ), so that ripples appear at the top of the template (notice the more evident ripples at the top of the white patches in the column H/E−x of Fig. 12 [9]), and to use these features for recognition.

When multiple moving persons or objects are present in the scene, these approaches cannot solve the problem of multiple object identification [37]; image depth analysis can aid in solving this problem. Researchers may also consider camera movement and its effect. Camera motion compensation is usually difficult, and the combined effect of camera movement and the employment of the MHI is not solved, though Davis et al. [52] apply the MHI with a PTZ camera.

Another important issue is whether the MHI and MEI representations are still required when there are several other approaches in different directions. In the last decade, spatio-temporal interest points (STIP), histograms of oriented gradients (HOG) [45], histograms of oriented flow (HOF) [44] and a few other methods have become prominent for action representation and recognition apart from the MHI-based approaches. Nevertheless, among these approaches, the MHI (and its variants) attains notable attention in the computer vision arena according to our analysis. Interest point detection in static images is a well-studied topic in computer vision. Laptev and Lindeberg [83] pioneered a spatio-temporal extension, building on the Harris-Laplace detector. Several spatio-temporal interest/feature point (STIP) detectors have recently been exploited in video analysis for action recognition.

Feature points are detected using a number of measures, namely entropy-based saliency [75,79,109,152], global texture [156], cornerness [83,84], periodicity [53,79] and volumetric optical flow [77]. These are mainly based on intensity [53], texture, color and motion information [119]. In the spatio-temporal domain, however, it is unclear which features indicate useful interest points [79]. Most STIP detectors are computationally expensive (compared to the straightforward computation of the MHI) and are therefore restricted to the processing of short or low-resolution videos (e.g., [53,83,84,109]). Detecting a reduced number of features is a prerequisite for keeping the computational cost under control [152]. Furthermore, in some cases, all input videos need to be preprocessed [156]. Though these approaches are proven in recognizing various actions, they carry additional theoretical and computational complexity compared with the MHI method. The MHI method is very simple and computationally inexpensive. Moreover, it covers every motion detail, and the segmented motion regions are employed for various applications. We notice that the MHI and MHI-based approaches are employed, exploited and adapted for a good number of applications in various domains and dimensions (see above). Therefore, we strongly feel that the MHI method is still useful and that its remaining limitations can be managed in the future. Moreover, the MHI method in its basic form is very easy to understand and implement, which is a key beneficial feature. From the MHI and MEI images, using Hu moments or other shape representation approaches, we can easily obtain feature vectors for recognition. However, the MHI is a global approach; hence, motion from objects that are not the target of interest will degrade the performance (STIP-based methods are better in this context).



The MHI is a representation of choice for action recognition when temporal segmentation is available, when actors are fully visible and can be separated from each other, and when they do not move along the z-axis of the camera. In other cases, other representations are probably needed, including bag-of-features (BOF) methods based on STIP, HOF and HOG, which have been shown to overcome those limitations. The STIP-based approaches and HOG/HOF-based developments can be incorporated along with the MHI/MEI representations in future research. Integration of multiple cues (e.g., motion, shape, edge information [39], color or texture), or a fusion of information, will produce better results [134]. The presence of multiple moving subjects, moving cameras, view-invariance issues, image depth analysis and, overall, a better and more robust image segmentation technique (for producing the update function in outdoor and cluttered environments) are the major challenges ahead for the MHI method. We feel that the above discussion will open some doors for further research to improve the methods for real-life applications.

6 Conclusions

Human motion analysis is a challenging problem due to large variations in human motion and appearance, camera viewpoint and environment settings [118]. The field of action and activity representation and recognition is relatively old, yet not well understood [104]. Some important but common motion recognition problems remain unsolved by the computer vision community. However, in the last decade, a number of good approaches have been proposed and subsequently evaluated by many researchers. Among those methods, one has received significant attention from many researchers in the computer vision field. Therefore, though there are various approaches for motion analysis and recognition, this paper analyzes the MHI method. It is one of the key methods, and a number of variants have been developed from this concept. The MHI is simple to understand and implement; hence many researchers employ this method or its variants for various action/gesture recognition and motion analysis tasks, with different datasets. We present a tutorial that covers important issues for this representation and method, and we mention several key limitations. In this work, we categorize and present various implementations of the MHI and its developments. This paper also discusses several issues to be solved in the future. The motion self-occlusion problem of the MHI has been addressed and solved with satisfactory recognition rates. Though 3D approaches have been proposed as view-invariant methods on top of the 2D MHI, these are computationally expensive.


Nevertheless, several essential concerns of the MHI, related to self-occlusion due to motion, motion overlapping or multiple repetitions, significant occlusion from multiple moving persons, and an object's motion towards the optical axis of the camera, should be investigated further and rigorously in the future, so that this simple approach can be extended to various real-life applications with better performance. We hope that this paper will be beneficial to various researchers (and especially inspiring to new researchers) in understanding the MHI method, its variants and its applications.

Acknowledgments The authors are grateful to the anonymous reviewers for their excellent reviews and constructive comments that helped to improve the manuscript. The work is supported by the Japan Society for the Promotion of Science (JSPS), Japan.

References 1. Aggarwal, J., Cai, Q.: Human motion analysis: a review. In: Proc. Nonrigid and Articulated Motion Workshop, pp. 90–102 (1997) 2. Aggarwal, J.K., Cai, Q.: Human motion analysis: a review. Comput. Vis. Image Underst. 73, 428–440 (1999) 3. Aggarwal, J.K., Park, S.: Human motion: modeling and recognition of actions and interactions. In: Proc. Int. Symposium on 3D Data Processing, Visualization, and Transmission (3DPVT’04), p. 8 (2004) 4. Ahad, Md.A.R., Tan, J.K., Kim, H., Ishikawa, S.: Lowerdimensional feature sets for template-based motion recognition approaches. J. Comput. Sci. 6(8), 920–927 (2010) 5. Ahad, Md.A.R., Tan, J.K., Kim, H., Ishikawa, S.: A simple approach for low-resolution activity recognition. Int. J. Comput. Vis. Biomech. 3(1) (2010) 6. Ahad, Md.A.R., Tan, J.K., Kim, H., Ishikawa, S.: Temporal motion recognition and segmentation approach. Int. J. Imaging Syst. Technol. 19, 91–99 (2009) 7. Ahad, Md.A.R., Tan, J.K., Kim, H., Ishikawa, S.: Human activity recognition: various paradigms. In: Proc. Int. Conf. on Control, Automation and Systems, pp. 1896–1901, October 2008 8. Ahad, Md.A.R., Ogata, T., Tan, J.K., Kim, H., Ishikawa, S.: A complex motion recognition technique employing directional motion templates. Int. J. Innov. Comput. Inf. Control 4(8), 1943– 1954 (2008) 9. Ahad, Md.A.R., Ogata, T., Tan, J.K., Kim, H., Ishikawa, S.: Moment-based human motion recognition from the representation of DMHI templates. In: SICE Annual Conference, pp. 578–583, August 2008 10. Ahad, Md.A.R., Ogata, T., Tan, J.K., Kim, H., Ishikawa, S.: A smart automated complex motion recognition technique. In: Proc. Workshop on Multi-dimensional and Multi-view Image Processing (with ACCV), pp. 142–149 (2007) 11. Ahad, Md.A.R., Tan, J.K., Kim, H., Ishikawa, S.: Analysis of motion self-occlusion problem due to motion overwriting for human activity recognition. J. Multimed. 5(1), 36–46 (2009) 12. Ahad, Md.A.R., Tan, J.K., Kim, H., Ishikawa, S.: Action recognition with various speeds and timed-DMHI feature vectors. In: Proc. Int. Conf. on Computer and Info. Tech., pp. 213–218, December 2008 13. Ahad, Md.A.R., Tan J.K., Kim H., Ishikawa, S.: Human activity analysis: concentrating on motion history image and its variants. In: SICE-ICASE Joint Annual Conf., pp. 5401–5406 (2009) 14. Ahmad, M., Parvin, I., Lee, S.-W.: Silhouette history and energy image information for human movement recognition. J. Multimedia 5(1), 12–21 (2010) 15. Ahmad, M., Lee, S.-W.: Recognizing human actions based on silhouette energy image and global motion description. In: Proc. IEEE Automatic Face and Gesture Recognition, pp. 523–588 (2008)

Motion history image: its variants and applications 16. Ahmad, M., Hossain, M.Z.: SEI and SHI representations for human movement recognition. In: Proc. Int. Conf. on Computer and Information Technology (ICCIT), pp. 521–526 (2008) 17. Ahad, Md.A.R., Tan, J.K., Kim, H., Ishikawa, S.: Action recognition by employing combined directional motion history and energy images. In: IEEE Computer Society Conf. on Computer Vision and Pattern Recognition’s Workshop on CVCG, p. 6 (2010) 18. Alahari, K., Jawahar, C.V.: Discriminative actions for recognizing events. In: Indian Conf. on Computer Vision, Graphics and Image Processing (ICVGIP’06), LNCS, vol. 4338, pp. 552–563 (2006) 19. Albu, A.B., Beugeling, T.: A three-dimensional spatiotemporal template for interactive human motion analysis. J. Multimedia 2(4), 45–54 (2007) 20. Albu, A., Trevor, B., Naznin, V., Beach, C.: Analysis of irregularities in human actions with volumetric motion history images. In: Proc. IEEE Workshop on Motion and Video Computing, Texas, USA, p. 16, February 2007 21. Anderson, C., Bert, P., Wal, G.V.: Change detection and tracking using pyramids transformation techniques. In: Proc. SPIEIntelligent Robots and Computer Vision, vol. 579, pp. 72–78 (1985) 22. Arseneau, S., Cooperstock, J.R.: Real-time image segmentation for action recognition. In: Proc. IEEE Pacific Rim Conf. on Communications, Computers and Signal Processing, pp. 86–89 (1999) 23. Babu, R., Ramakrishnan, K.: Compressed domain human motion recognition using motion history information. In: Proc. ICIP, vol. 2, pp. 321–324 (2003) 24. Babu, R., Ramakrishnan, K.: Recognition of human actions using motion history information extracted from the compressed video. Image Vis. Comput. 22, 597–607 (2004) 25. Bashir, K., Xiang, T., Gong, S.: Feature selection for gait recognition without subject cooperation. In: British Machine Vision Conference, p. 10 (2008) 26. Bashir, K., Xiang, T., Gong, S.: Feature selection on gait energy image for human identification. In: IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 985–988 (2008) 27. Beauchemin, S.S., Barron, J.L.: The computation of optical flow. ACM Comput. Surv. 27(3), 433–467 (1995) 28. Bergen, J.R., Burt, P., Hingorani, R., Peleg, S.: A three frame algorithm for estimating two-component image motion. IEEE Trans. PAMI 14(9), 886–896 (1992) 29. Bimbo, A.D., Nesi, P.: Real-time optical flow estimation. In: Proc. Int. Conf. on Systems Engineering in the Service of Humans, Systems, Man and Cybernetics, vol. 3, pp. 13–19 (1993) 30. Bobick, A., Davis, J.: An appearance-based representation of action. In: Intl. Conf. on Pattern Recognition, pp. 307–312 (1996) 31. Bobick, A., Davis, J.: The recognition of human movement using temporal templates. IEEE Trans. PAMI 23(3), 257–267 (2001) 32. Bobick, A., Intille, S., Davis, J., Baird, F., Pinhanez, C., Campbell, L., Ivanov, Y., Schutte, A., Wilson, A.: The Kidsroom: a perceptually-based interactive and immersive story environment. Presence: Teleoperators Virtual Environ. 8(4), 367–391 (1999) 33. Bradski, G., Davis, J.: Motion segmentation and pose recognition with motion history gradients. In: Proc. IEEE Workshop on Applications of Computer Vision, pp. 174–184, December 2000 34. Bradski, G., Davis, J.: Motion segmentation and pose recognition with motion history gradients. Mach. Vis. Appl. 13(3), 174–184 (2002) 35. Canton-Ferrer, C., Casas, J.R., Pardas, M.: Human model and motion based 3D action recognition in multiple view scenarios. In: Proc. Conf. 
European Signal Process, Italy, pp. 1–5, September 2006 36. Canton-Ferrer, C., Casas, J.R., Pardàs, M., Sargin, M.E., Tekalp, A.M.: 3D human action recognition in multiple view scenarios.

In: Proc. Jornades de Recerca en Automàtica, Visió i Robòtica, Barcelona (Spain), p. 5, 4–6 July 2006
37. Cedras, C., Shah, M.: A survey of motion analysis from moving light displays. In: Proc. IEEE CVPR, pp. 214–221 (1994)
38. Chandrashekhar, V., Venkatesh, K.S.: Action energy images for reliable human action recognition. In: Proc. of Asian Symposium on Information Display (ASID), pp. 484–487 (2006)
39. Chen, D., Yang, J.: Exploiting high dimensional video features using layered Gaussian mixture models. In: Proc. IEEE ICPR, p. 4 (2006)
40. Chen, D., Yan, R., Yang, J.: Activity analysis in privacy-protected video, p. 11 (2007). http://www.informedia.cs.cmu.edu/documents/T-MM_Privacy_J2c.pdf
41. Chen, C., Liang, J., Zhao, H., Hu, H., Tian, J.: Frame difference energy image for gait recognition with incomplete silhouettes. Pattern Recognit. Lett. 30(11), 977–984 (2003)
42. Christmas, W.J.: Spatial filtering requirements for gradient-based optical flow measurement. In: 9th British Machine Vision Conference, pp. 185–194 (1998)
43. Collins, R.T., Lipton, A., Kanade, T., Fujiyoshi, H., Duggins, D., Tsin, Y., Tolliver, D., Enomoto, N., Hasegawa, O., Burt, P., Wixson, L.: A system for video surveillance and monitoring. VSAM final report, CMU-RI-TR-00-12, Technical Report, Carnegie Mellon University, p. 69 (2000)
44. Dalal, N., Triggs, B., Schmid, C.: Human detection using oriented histograms of flow and appearance. In: European Conference on Computer Vision, pp. 428–441 (2006)
45. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: Intl. Conf. on Computer Vision and Pattern Recognition, pp. 886–893 (2005)
46. Davis, J.: Sequential reliable-inference for rapid detection of human actions. In: Proc. IEEE Workshop on Detection and Recognition of Events in Video, pp. 1–9, July 2004
47. Davis, J.W.: Appearance-based motion recognition of human actions. M.I.T. Media Lab Perceptual Computing Group Tech. Report No. 387, p. 51 (1996)
48. Davis, J., Bradski, G.: Real-time motion template gradients using Intel CVLib. In: Proc. ICCV Workshop on Frame-Rate Vision, pp. 1–20, September 1999
49. Davis, J.: Hierarchical motion history images for recognizing human motion. In: Proc. IEEE Workshop on Detection and Recognition of Events in Video, pp. 39–46 (2001)
50. Davis, J., Bobick, A.: Virtual PAT: a virtual personal aerobics trainer. In: Proc. Perceptual User Interfaces, pp. 13–18, November 1998
51. Davis, J.: Recognizing movement using motion histograms. MIT Media Lab. Perceptual Computing Section Tech. Report No. 487 (1998)
52. Davis, J.W., Morison, A.M., Woods, D.D.: Building adaptive camera models for video surveillance. In: Proc. IEEE Workshop on Applications of Computer Vision (WACV'07), p. 6 (2007)
53. Dollar, P., Rabaud, V., Cottrell, G., Belongie, S.: Behavior recognition via sparse spatiotemporal features. In: Intl. Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance, pp. 65–72, October 2005
54. Digital Imaging Research Centre, K.U.L.: Virtual Human Action Silhouette (ViHASi) Database. http://dipersec.king.ac.uk/VIHASI/
55. Efros, A.A., Berg, A.C., Mori, G., Malik, J.: Recognizing action at a distance. In: Proc. ICCV, pp. 726–733 (2003)
56. Elgammal, A., Harwood, D., David, L.S.: Nonparametric background model for background subtraction. In: Proc. European Conference on Computer Vision, p. 17 (2000)
57. Essa, I., Pentland, S.: Facial expression recognition using a dynamic model and motion energy. In: Proc. IEEE CVPR, p. 8, June 1995

123

Md. A. R. Ahad et al. 58. Forbes, K.: Summarizing motion in video sequences, pp. 1–7. http://thekrf.com/projects/motionsummary/MotionSummary.pdf. Accessed 9 May 2004 59. Full-body Gesture Database, Korea University. http://gesturedb. korea.ac.kr/ 60. Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24(6), 381–395 (1981) 61. Gavrilla, D.: The visual analysis of human movement: a survey. Comput. Vis. Image Underst. 73, 82–98 (1999) 62. Gorelick, L., Blank, M., Shechtman, E., Irani, M., Basri, R.: Actions as space-time shapes. IEEE Trans. PAMI 29(12), 2247–2253 (2007) 63. Han, J., Bhanu, B.: Individual recognition using gait energy image. IEEE Trans. PAMI 28(2), 316–322 (2006) 64. Han, J., Bhanu, B.: Gait energy image representation: comparative performance evaluation on USF HumanID database. In: Proc. Joint Intl. Workshop VS-PETS, pp. 133–140 (2003) 65. Haritaoglu, I., Harwood, D., Davis, L.S.: W4 : real-time surveillance of people and their activities. IEEE Trans. PAMI 22(8), 809–830 (2000) 66. Horn, B., Schunck, B.G.: Determining optical flow. Artif. Intell. 17, 185–203 (1981) 67. Hu, W., Tan, T., Wang, L., Maybank, S.: A survey on visual surveillance of object motion and behaviors. IEEE Trans. SMC-Part C. 34(3), 334–352 (2004) 68. Hu, M.K.: Visual pattern recognition by moment invariants. IRE Trans. Info. Theory 8, 179–187 (1962) 69. Jaimes, A., Sebe, N.: Multimodal human-computer interaction: a survey. Comput. Vis. Image Underst. 108(1–2), 116–134 (2007) 70. Jain, A., Duin, R., Mao, J.: Statistical pattern recognition: a review. IEEE Trans. PAMI 2(1), 4–37 (2000) 71. Jan, T.: Neural network based threat assessment for automated visual surveillance. In: Proc. IEEE Joint Conf. on Neural Networks, vol. 2, pp. 1309–1312, July 2004 72. Jin, T., Leung, M.K.H., Li, L.: Temporal human body segmentation. In: Villanieva, J.J. (ed.) IASTED Int. Conf. Visualization, Imaging, and Image Processing (VIIP’04). Acta Press, Marbella. ISSN: 1482-7921, 6–8 September 2004 73. Kellokumpu, V., Zhao, G., Pietikäinen, M.: Texture based description of movements for activity analysis. In: Proc. Conf. Computer Vision Theory and Applications (VISAPP’08), vol. 2, pp. 368–374, Portugal (2008) 74. Kilger, M.: A shadow handler in a video-based real-time traffic monitoring system. In: Proc. IEEE Workshop on Applications of Computer Vision, pp. 1060–1066 (1992) 75. Kadir, T., Brady, M.: Scale, saliency and image description. IJCV 45(2), 83–105 (2001) 76. Kameda, Y., Minoh, M.: A human motion estimation method using 3-successive video frames. In: Proc. Int. Conf. on Virtual Systems and Multimedia, p. 6 (1996) 77. Ke, Y., Sukthankar, R., Hebert, M.: Efficient visual event detection using volumetric features. In: ICCV, vol. 1, pp. 166–173 (2005) 78. Kellokumpu, V., Pietikäinen, M., Heikkilä, J.: Human activity recognition using sequences of postures. Mach. Vis. Appl., pp. 570–573 (2005) 79. Kienzle, W., Scholkopf, B., Wichmann, F.A., Franz, M.O.: How to find interesting locations in video: a spatiotemporal interest point detector learned from human eye movements. In: 29th DAGM Symposium, pp. 405–414, September 2007 80. Kindratenko, V.: Development and application of image analysis techniques for identification and classification of microscopic particles. PhD thesis, University of Antwerp, Belgium (1997). http:// www.ncsa.uiuc.edu/~kindr/phd/index.pdf 81. 
Khotanzad, A., Hong, Y.H.: Invariant image recognition by Zernike moments. IEEE Trans. PAMI 12(5), 489–497 (1990)


82. Kumar, S., Kumar, D., Sharma, A., McLachlan, N.: Classification of hand movements using motion templates and geometrical based moments. In: Proc. Int’l Conf. on Intelligent Sensing and Information Processing, pp. 299–304 (2003) 83. Laptev, I., Lindeberg, T.: Space-time interest points. In: ICCV, vol. 1, p. 432 (2003) 84. Laptev, I.: On space-time interest points. IJCV 64(2), 107–123 (2005) 85. LaViola, J.: A survey of hand posture and gesture recognition techniques and technology. Tech. Report CS-99-11, Brown University, p. 80, June 1999 86. Leman, K., Ankit, G., Tan, T.: PDA-based human motion recognition system. Int. J. Softw. Eng. Knowl. 2(15), 199–205 (2005) 87. Li, L., Zeng, Q., Jiang, Y., Xia, H.: Spatio-temporal motion segmentation and tracking under realistic condition. In: Proc. Int’l Symposium on Systems and Control in Aerospace and Astronautics, pp. 229–232 (2003) 88. Lipton, A.J., Fujiyoshi, H., Patil, R.S.: Moving target classification and tracking from real-time video. In: Proc. IEEE Workshop on Applications of Computer Vision, pp. 8–14 (1998) 89. Liu, J., Zhang, N.: Gait history image: a novel temporal template for gait recognition. In: Proc. IEEE Int. Conf. Multimedia and Expo, pp. 663–666 (2007) 90. Lo, C., Don, H.: 3-D moment forms: their construction and application to object identification and positioning. IEEE Trans. PAMI 11(10), 1053–1063 (1989) 91. Lucas, B., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: Proc. Int. Joint Conf. on Artificial Intelligence, pp. 674–679 (1981) 92. Ma, Q., Wang, S., Nie, D., Qiu, J.: Recognizing humans based on gait moment image. In: 8th ACIS Intl. Conf. on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing, pp. 606–610 (2007) 93. Masoud, O., Papanikolopoulos, N.: A method for human action recognition. Image Vis. Comput. 21, 729–743 (2003) 94. McCane, B., Novins, K., Crannitch, D., Galvin, B.: On benchmarking optical flow. Comput. Vis. Image Underst. 84, 126– 143 (2001) 95. McKenna, S.J., Jabri, S., Duric, Z., Wechsler, H., Rosenfeld, A.: Tracking groups of people. Comput. Vis. Image Underst. 80(1), 42–56 (2000) 96. Meng, H., Pears, N., Bailey, C.: A human action recognition system for embedded computer vision application. In: Proc. Workshop on Embedded Computer Vision (with CVPR), pp. 1–6 (2007) 97. Meng, H., Pears, N., Bailey, C.: Human action classification using SVM_2K classifier on motion features. In: LNCS: Multimedia Content Representation, Classification and Security, vol. 4105/2006, pp. 458–465 (2006) 98. Meng, H., Pears, N., Bailey, C.: Motion information combination for fast human action recognition. In: Proc. Conf. Computer Vision Theory and Applications (VIASAPP07), Spain, March 2007 99. Meng, H., Pears, N., Bailey, C.: Recognizing human actions based on motion information and SVM. In: Proc. IEE Int. Conf. Intelligent Environments, pp. 239–245 (2006) 100. Meng, H., Pears, N., Freeman, M., Bailey, C.: Motion history histograms for human action recognition. In: Embedded Computer Vision (Advances in Pattern Recognition), part II, pp. 139–162. Springer, London (2009) 101. Mittal, A., Paragois, N.: Motion-based background subtraction using adaptive kernel density estimation. In: Proc. IEEE CVPR, p. 8 (2004) 102. Moeslund, T.B.: Summaries of 107 computer vision-based human motion capture papers. Tech. Report: LIA 99-01, University of Aalborg, p. 83, March 1999

Motion history image: its variants and applications 103. Moeslund, T.B., Granum, E.: A survey of computer vision-based human motion capture. Comput. Vis. Image Underst. 81, 231– 268 (2001) 104. Moeslund, T.B., Hilton, A., Kruger, V.: A survey of advances in vision-based human motion capture and analysis. Comput. Vis. Image Underst. 104, 90–126 (2006) 105. Ng, J., Gong, S.: Learning pixel-wise signal energy for understanding semantics. In: Proc. BMVC, pp. 695–704 (2001) 106. Ng, J., Gong, S.: Learning pixel-wise signal energy for understanding semantics. Image Vis. Comput. 21, 1183–1189 (2003) 107. Nguyen, Q., Novakowski, S., Boyd, J.E., Jacob, C., Hushlak, G.: Motion swarms: video interaction for art in complex environments. In: Proc. ACM Int. Conf. Multimedia, CA, pp. 461–469 (2006) 108. Ogata, T., Tan, J.K., Ishikawa, S.: High-speed human motion recognition based on a motion history image and an Eigenspace. IEICE Trans. Inf. Syst. E89-D(1), 281–289 (2006) 109. Oikonomopoulos, A., Patras, I., Pantic, M.: Spatiotemporal salient points for visual recognition of human actions. IEEE Trans. Syst. Man Cybern. B: Cybern. 36(3), 710–719 (2006) 110. Orrite, C., Martınez, F., Herrero, E., Ragheb, H., Velastin, S.: Independent viewpoint silhouette-based human action modelling and recognition. In: Proc. Int. Workshop on Machine Learning for Vision-based Motion Analysis (MLVMA’08) with ECCV, pp. 1–12 (2008) 111. Pantic, M., Pentland, A., Nijholt, A., Hunag, T.S.: Human computing and machine understanding of human behavior: a survey. In: Proc. Int. Conf. on Multimodal Interfaces, pp. 239–248 (2006) 112. Pantic, M., Patras, I., Valstar, M.F.: Learning spatio-temporal models of facial expressions. In: Proc. Int. Conf. on Measuring Behaviour, pp. 7–10, September 2005 113. Papenberg, N., Bruhn, A., Brox, T., Didas, S., Weickert, J.: Highly accurate optic flow computation with theoretically justified warping. Int. J. Comput. Vis. 67(2), 141–158 (2006) 114. Pavlovic, V., Sharma, R., Huang, T.: Visual interpretation of hand gestures for human-computer interaction: a review. IEEE Trans. PAMI 19(7), 677–695 (1997) 115. Piater, J., Crowley, J.: Multi-modal tracking of interacting targets using Gaussian approximations. In: Proc. IEEE Workshop on Performance Evaluation of Tracking and Surveillance at CVPR, pp. 141–147 (2001) 116. Petrás, I., Beleznai, C., Dedeo˘glu, Y., Pardàs, M., et al.: Flexible test-bed for unusual behavior detection. In: Proc. ACM Conf. Image and Video Retrieval, pp. 105–108 (2007) 117. Polana, R., Nelson, R.: Low level recognition of human motion. In: Proc. IEEE Workshop on Motion of Non-rigid and Articulated Objects, pp. 77–82 (1994) 118. Poppe, R.: Vision-based human motion analysis: an overview. Comput. Vis. Image Underst. 108(1–2), 4–18 (2007) 119. Rapantzikos, K., Avrithis, Y., Kollias, S.: Dense saliency-based spatiotemporal feature points for action recognition. In: Intl. Conf. on Computer Vision and Pattern Recognition, pp. 1–8 (2009) 120. Rhne-Alpes, I.: The Inria XMAS (IXMAS) motion acquisition sequences. http://charibdis.inrialpes.fr 121. Roh, M.-C., Shin, H.-K., Lee, S.-W., Lee, S.-W.: Volume motion template for view-invariant gesture recognition. In: Proc. ICPR, vol. 2, pp. 1229–1232 (2006) 122. Rosales, R.: Recognition of human action using moment-based features. Boston University Computer Science Tech. Report, BU 98-020, 1–19, November 1998 123. Rosales, R., Sclaroff, S.: 3D trajectory recovery for tracking multiple objects and trajectory guided recognition of actions. 
In: Proc. CVPR, vol. 2, pp. 117–123 (1999)

124. Ryu, W., Kim, D., Lee, H.-S., Sung, J., Kim, D.: Gesture recognition using temporal templates. In: Proc. ICPR, Demo Program, Hong Kong, August 2006 125. Ruiz-del-Solar, J., Vallejos, P.A.: Motion detection and tracking for an AIBO robot using camera motion compensation and Kalman filtering. In: Proc. RoboCup Int. Symposium 2004, Lisbon, LNCS, vol. 3276, pp. 619–627 (2005) 126. Sarkar, S., Phillips, P.J., Liu, Z., Vega, I.R., Grother, P., Bowyer, K.W.: The humanid gait challenge problem: data sets, performance, and analysis. IEEE Trans. PAMI 27(2), 162–177 (2005) 127. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: a local SVM approach. In: Proc. ICPR, vol. 3, pp. 32–36 (2004) 128. Senior, A., Tosunoglu, S.: Hybrid machine vision control. In: Florida Con. on Recent Advances in Robotics, pp. 1–6, May 2005 129. Shan, C., Wei, Y., Qiu, X., Tan, T.: Gesture recognition using temporal template based trajectories. In: Proc. ICPR, vol. 3, pp. 954–957 (2004) 130. Shin, H.-K., Lee, S.-W., Lee, S.-W.: Real-time gesture recognition using 3D motion history model. In: Proc. Conf. on Intelligent Computing, Part I, LNCS, vol. 3644, pp. 888–898, China, August 2005 131. Sigal, L., Black, M.J.: HumanEva: Synchronized video and motion capture dataset for evaluation of articulated human motion. Department of Computer Science, Brown University, Tech. Report CS-06-08, p. 18, September 2006 132. Singh, R., Seth, B., Desai, U.: A real-time framework for vision based human robot interaction. In: Proc. IEEE/RSJ Conf. on Intelligent Robots and Systems, pp. 5831–5836 (2006) 133. Son, D., Dinh, T., Nam, V., Hanh, T., Lam, H.: Detection and localization of road area in traffic video sequences using motion information and fuzzy-shadowed sets. In: Proc. IEEE Int’l Symp. Multimedia, pp. 725–732, December 2005 134. Spengler, M., Schiele, B.: Towards robust multi-cue integration for visual tracking. Mach. Vis. Appl. 14, 50–58 (2003) 135. Stauffer, C., Grimson, W.: Adaptive background mixture models for real-time tracking. In: Proc. IEEE CVPR, vol. 2, pp. 246–252 (1999) 136. Sun, H.Z., Feng, T., Tan, T.N.: Robust extraction of moving objects from image sequences. In: Proc. Asian Conference on Computer Vision, pp. 961–964 (2000) 137. Sziranyi, T.: with other partners UPC, SZTAKI, Bilkent and ACV: real time detector for unusual behavior. http://www.muscle-noe. org/content/view/147/64/ 138. Talukder, A., Goldberg, S., Matthies, L., Ansar, A.: Real-time detection of moving objects in a dynamic scene from moving robotic vehicles. In: Proc. IEEE/RSJ Intl Conference on Intelligent Robots and Systems, pp. 1308–1313 (2003) 139. Tan, J.K., Ishikawa, S.: High accuracy and real-time recognition of human activities. In: 33rd Annual Conf. of IEEE Industrial Electronics Society (IECON), pp. 2377–2382 (2007) 140. Vafadar, M., Behrad, A.: Human hand gesture recognition using motion orientation histogram for interaction of handicapped persons with computer. In: Elmoataz, A., et al. (eds.) ICISP 2008, LNCS, vol. 5099, pp. 378–385 (2008) 141. Valstar, M., Pantic, M., Patras, I.: Motion history for facial action detection in video. In: Proc. IEEE Int. Conf. SMC, vol. 1, pp. 635–640 (2004) 142. Valstar, M., Patras, I., Pantic, M.: Facial action recognition using temporal templates. In: Proc. IEEE Workshop on Robot and Human Interactive Communication, pp. 253–258 (2004) 143. Vitaladevuni, S.N., Kellokumpu, V., Davis, L.S.: Action recognition using ballistic dynamics. In: Proc. CVPR, p. 8 (2008)


144. Wang, L., Hu, W., Tan, T.: Recent developments in human motion analysis. Pattern Recognit. 36, 585–601 (2003)
145. Wang, J.J.L., Singh, S.: Video analysis of human dynamics—a survey. Real-Time Imaging 9(5), 321–346 (2006)
146. Wang, C., Brandstein, M.S.: A hybrid real-time face tracking system. In: Proc. Int. Conf. on Acoustics, Speech, and Signal Processing, p. 4 (1998)
147. Wang, L., Suter, D.: Informative shape representations for human action recognition. In: Intl. Conf. on Pattern Recognition, vol. 2, pp. 1266–1269 (2006)
148. Watanabe, K., Kurita, T.: Motion recognition by higher order local auto correlation features of motion history images. In: Proc. Bio-inspired, Learning and Intelligent Systems for Security, pp. 51–55 (2008)
149. Wei, J., Harle, N.: Use of temporal redundancy of motion vectors for the increase of optical flow calculation speed as a contribution to real-time robot vision. In: Proc. IEEE TENCON—Speech and Image Technologies for Computing and Telecommunications, pp. 677–680 (1997)
150. Weinland, D., Ronfard, R., Boyer, E.: Automatic discovery of action taxonomies from multiple views. In: Proc. CVPR, pp. 1639–1645 (2006)
151. Weinland, D., Ronfard, R., Boyer, E.: Free viewpoint action recognition using motion history volumes. Comput. Vis. Image Underst. 104(2), 249–257 (2006)
152. Willems, G., Tuytelaars, T., Gool, L.V.: An efficient dense and scale-invariant spatio-temporal interest point detector. In: 10th European Conference on Computer Vision, pp. 650–663 (2008)
153. Wixson, L.: Detecting salient motion by accumulating directionally-consistent flow. IEEE Trans. PAMI 22(8), 774–780 (2000)
154. Wong, S.F., Cipolla, R.: Continuous gesture recognition using a sparse Bayesian classifier. In: Intl. Conf. on Pattern Recognition, vol. 1, pp. 1084–1087 (2006)
155. Wong, S.F., Cipolla, R.: Real-time adaptive hand motion recognition using a sparse Bayesian classifier. In: Intl. Conf. on Computer Vision Workshop, pp. 170–179 (2005)
156. Wong, S.F., Cipolla, R.: Extracting spatiotemporal interest points using global information. In: ICCV, pp. 1–8 (2007)
157. Wren, C.R., Clarkson, B.P., Pentland, A.P.: Understanding purposeful human motion. In: Proc. Int'l Conf. on Automatic Face and Gesture Recognition, pp. 19–25 (1999)
158. Wren, C.R., Azarbayejani, A., Darrell, T., Pentland, A.P.: Pfinder: real-time tracking of the human body. IEEE Trans. PAMI 19(7), 780–785 (1997)
159. Xiang, T., Gong, S.: Beyond tracking: modelling activity and understanding behaviour. Int. J. Comput. Vis. 67(1), 21–51 (2006)
160. Yang, Y.H., Levine, M.D.: The background primal sketch: an approach for tracking moving objects. Mach. Vis. Appl. 5, 17–34 (1992)
161. Yang, X., Zhang, T., Zhou, Y., Yang, J.: Gabor phase embedding of gait energy image for identity recognition. In: 8th IEEE Intl. Conf. on Computer and Information Technology, pp. 361–366, July 2008
162. Yau, W., Kumar, D., Arjunan, S., Kumar, S.: Visual speech recognition using image moments and multiresolution wavelet. In: Proc. Conf. on Computer Graphics, Imaging and Visualization, pp. 194–199 (2006)


163. Yau, W., Kumar, D., Arjunan, S.: Voiceless speech recognition using dynamic visual speech features. In: Proc. HCSNet Workshop on the Use of Vision in HCI, Australia (2006)
164. Yilmaz, A., Shah, M.: Actions sketch: a novel action representation. In: IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, vol. 1, pp. 984–989 (2005)
165. Yin, Z., Collins, R.: Moving object localization in thermal imagery by forward-backward MHI. In: Proc. IEEE Workshop on Object Tracking and Classification in and Beyond the Visible Spectrum, NY, pp. 133–140, June 2006
166. Yu, C.-C., Cheng, H.-Y., Cheng, C.-H., Fan, K.-C.: Efficient human action and gait analysis using multiresolution motion energy histogram. EURASIP J. Adv. Signal Process. 2010, 1–13 (2010)
167. Yu, S., Tan, D., Tan, T.: A framework for evaluating the effect of view angle, clothing and carrying condition on gait recognition. In: Intl. Conf. on Pattern Recognition, pp. 441–444 (2006)
168. Zhang, D., Lu, G.: Review of shape representation and description techniques. Pattern Recognit. 37, 1–19 (2004)
169. Zhou, H., Hu, H.: A survey—human movement tracking and stroke rehabilitation. Tech. Report CSM-420, Department of Computer Sciences, University of Essex, p. 33, December 2004
170. Zou, X., Bhanu, B.: Human activity classification based on gait energy image and co-evolutionary genetic programming. In: Proc. ICPR, vol. 3, pp. 555–559 (2006)

Author Biographies

Md. Atiqur Rahman Ahad was born in Bangladesh and obtained his B.Sc. (Hons.) and Master's degrees from the Department of Applied Physics, Electronics and Communication Engineering, University of Dhaka, Bangladesh. He later received a Master's degree from the School of Computer Science and Engineering, University of New South Wales, Australia, and obtained his Ph.D. degree from the Faculty of Engineering, Kyushu Institute of Technology, Japan. He has taught at several universities since 2000 and has been with the University of Dhaka, Bangladesh, since 2001 (currently on leave). He also served as a Casual Academic at the University of New South Wales for three sessions between 2002 and 2004. He is currently a JSPS Postdoctoral Research Fellow at Kyushu Institute of Technology, Japan. Mr. Ahad is a student member of IEEE, IEEE IES and the Society of Instrument and Control Engineers (SICE). He won the Best Student Paper Award at the International Workshop on Combinatorial Image Analysis (IWCIA), Buffalo, NY, in April 2008, and the Biomedical Fuzzy Systems Association's Best Paper Award (Journal) in 2008. His present research interests include human motion recognition and analysis, motion segmentation, and motion tracking.

Joo Kooi Tan obtained her Ph.D. from Kyushu Institute of Technology in 2000. She is presently an assistant professor with the Faculty of Mechanical and Control Engineering at the same university. Her current main research interests include three-dimensional shape and motion recovery, human motion analysis, human activity recognition and understanding, and applications of computer vision. She received the SICE Kyushu Branch Young Author's Award in 1999, the AROB 10th Young Author's Award in 2004, the Young Author's Award from the IPSJ Kyushu Branch in 2004, the Japanese Journal Best Paper Award from BMFSA in 2008, the Best Paper Award from ISII in 2009, and the Excellent Paper Award from the Biomedical Fuzzy System Association in 2010. She is a member of IEEE, the Society of Instrument and Control Engineers, and the Information Processing Society of Japan.

Seiji Ishikawa obtained his B.E., M.E., and D.E. degrees from The University of Tokyo, where he majored in Mathematical Engineering and Instrumentation Physics. He joined Kyushu Institute of Technology and is currently a Professor in the Department of Control & Mechanical Engineering, KIT. Professor Ishikawa was a visiting research fellow at Sheffield University, UK, from 1983 to 1984, and a visiting professor at Utrecht University, The Netherlands, in 1996. He was awarded the BMFSA Best Paper Award in 2008 and 2010. His research interests include three-dimensional shape/motion recovery, and human detection and motion analysis from car videos. He is a member of IEEE, the Society of Instrument and Control Engineers, the Institute of Electronics, Information and Communication Engineers, and the Institute of Image Electronics Engineers of Japan.

Hyoungseop Kim received his B.A. degree in electrical engineering from Kyushu Institute of Technology in 1994, and his Master's and Ph.D. degrees from Kyushu Institute of Technology in 1996 and 2001, respectively. He is an associate professor in the Department of Control Engineering at Kyushu Institute of Technology. His research interests are focused on medical applications of image analysis. He is currently working on automatic multi-organ segmentation in abdominal CT images and on temporal subtraction of thoracic MDCT image sets.
