Affine-Invariant Feature Extraction for Activity Recognition

Hindawi Publishing Corporation
ISRN Machine Vision, Volume 2013, Article ID 215195, 7 pages
http://dx.doi.org/10.1155/2013/215195

Research Article

Samy Sadek,1 Ayoub Al-Hamadi,2 Gerald Krell,2 and Bernd Michaelis2

1 Department of Mathematics and Computer Science, Faculty of Science, Sohag University, 82524 Sohag, Egypt
2 Institute for Information Technology and Communications (IIKT), Otto von Guericke University Magdeburg, 39106 Magdeburg, Germany

Correspondence should be addressed to Samy Sadek; [email protected]

Received 28 April 2013; Accepted 4 June 2013

Academic Editors: A. Gasteratos, D. P. Mukherjee, and A. Torsello

Copyright © 2013 Samy Sadek et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

We propose an innovative approach for human activity recognition based on affine-invariant shape representation and SVM-based feature classification. In this approach, a compact, computationally efficient affine-invariant representation of action shapes is developed using affine moment invariants. Dynamic affine invariants are derived from the 3D spatio-temporal action volume and from the average image created from that volume, and they are classified by an SVM classifier. On two standard benchmark action datasets (the KTH and Weizmann datasets), the approach yields promising results that compare favorably with those previously reported in the literature, while maintaining real-time performance.

1. Introduction

Visual recognition and interpretation of human-induced actions and events are among the most active research areas in the computer vision, pattern recognition, and image understanding communities [1]. Although a great deal of progress has been made in the automatic recognition of human actions during the last two decades, the approaches proposed in the literature remain limited in their abilities, and much research is still needed to address the ongoing challenges and to develop more efficient approaches. Good algorithms for human action recognition would benefit a large number of applications, for example, the search and structuring of large video archives, human-computer interaction, video surveillance, gesture recognition, and robot learning and control. In fact, the nonrigid nature of the human body and clothing in video sequences, together with drastic illumination changes, changes in pose, and erratic motion patterns, presents a grand challenge to human detection and action recognition. In addition, while real-time performance is a major concern in computer vision, especially for embedded vision systems, most state-of-the-art human action recognition systems employ sophisticated feature extraction and learning techniques that stand in the way of real-time operation. This suggests a trade-off between accuracy and real-time performance.

The remainder of this paper is organized as follows. Section 2 briefly reviews the most relevant literature on human action recognition. Section 3 describes the details of the proposed method. The experimental results corroborating the effectiveness of the proposed method are presented and analyzed in Section 4. Finally, Section 5 concludes and outlines possible future work.

2. The Literature Overview

Recent years have witnessed a resurgence of interest in the analysis and interpretation of human motion, motivated by rising security concerns and the increased ubiquity and affordability of digital media production equipment. Human action can generally be recognized using various visual cues such as motion [2, 3] and shape [4, 5]. Scanning the literature, one notices that a significant body of work in human action recognition focuses on spatio-temporal key points and local feature descriptors [6]. Local features are extracted from the region around each key point returned by the key point detector. These features are then quantized into a discrete set of visual words before they are fed into the classification module.



Figure 1: GMM background subtraction: the first and third rows display two sequences of walking and running actions from the KTH and Weizmann action datasets, respectively, while the second and fourth rows show the results of background subtraction, where foreground objects are shown in cyan.

Another thread of research is concerned with analyzing patterns of motion to recognize human actions. For instance, in [7], periodic motions are detected and classified to recognize actions. Alternatively, some researchers have opted to use both motion and shape cues. In [8], the authors detect the similarity between video segments using a space-time correlation model. In [9], Rodriguez et al. present a template-based approach using a Maximum Average Correlation Height (MACH) filter to capture intraclass variability. Likewise, a significant amount of work targets modelling and understanding human motion by constructing elaborate temporal dynamic models [10]. There is also an active line of research that uses generative topic models for visual recognition based on the so-called Bag-of-Words (BoW) model [11]. The underlying idea of BoW is that each video sequence is represented by counting the occurrences of descriptor prototypes, the so-called visual words; topic models are then fitted to this BoW representation. Three commonly used topic models are Correlated Topic Models (CTMs) [11], Latent Dirichlet Allocation (LDA) [12], and probabilistic Latent Semantic Analysis (pLSA) [13].
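As a brief, generic illustration of the BoW representation just described (it is not part of the method proposed in this paper), the sketch below quantizes precomputed local descriptors with k-means and builds one word-count histogram per video; the function name and parameter values are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def bow_histograms(descriptors_per_video, vocab_size=200, seed=0):
    """Quantize local descriptors into visual words and build one histogram per video."""
    all_desc = np.vstack(descriptors_per_video)                  # pool descriptors from all videos
    kmeans = KMeans(n_clusters=vocab_size, random_state=seed, n_init=10).fit(all_desc)
    hists = []
    for desc in descriptors_per_video:
        words = kmeans.predict(desc)                             # assign each descriptor to a visual word
        hist = np.bincount(words, minlength=vocab_size).astype(float)
        hists.append(hist / max(hist.sum(), 1.0))                # normalize to unit sum
    return np.array(hists), kmeans
```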

3. Proposed Methodology

In this section, the proposed method for action recognition is described. The main steps of the framework are explained in detail in the following subsections.

3.1. Background Subtraction. In this paper, we use a Gaussian Mixture Model (GMM) to model the background distribution. Formally, let $X_t$ be a pixel in the current frame $I_t$, where $t$ is the frame index. Each pixel is modeled separately by a mixture of $K$ Gaussians:

$$ P(X_t) = \sum_{i=1}^{K} \omega_{i,t}\,\eta\!\left(X_t;\, \mu_{i,t}, \Sigma_{i,t}\right), \qquad (1) $$

where $\eta$ is a Gaussian probability density function, and $\mu_{i,t}$, $\Sigma_{i,t}$, and $\omega_{i,t}$ are the mean, the covariance, and an estimate of the weight of the $i$th Gaussian in the mixture at time $t$, respectively. $K$ is the number of distributions, which is set to 5 in our experiments. Before the foreground is detected, the background model is updated (see [14] for details of the updating procedure). After the updates, the weights $\omega_{i,t}$ are normalized and the components are reordered so that the most probable background distributions, those with large weight and low variance, remain on top. The first $B$ components are then retained as the background model, where $B$ is determined by a threshold $T$ (set to 0.6 in our experiments):

$$ B = \arg\min_{b} \left( \frac{\sum_{i=1}^{b} \omega_{i,t}}{\sum_{i=1}^{K} \omega_{i,t}} > T \right). \qquad (2) $$

Finally, all pixels $X_t$ that match none of the background components are good candidates to be marked as foreground. An example of GMM background subtraction is shown in Figure 1.
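For illustration only, the following minimal sketch obtains a comparable foreground mask with OpenCV's adaptive-GMM background subtractor (MOG2); it is not the exact model of [14], and the parameter values and file name are assumptions.

```python
import cv2

# Adaptive per-pixel GMM background subtractor; parameter values are illustrative.
subtractor = cv2.createBackgroundSubtractorMOG2(history=200, varThreshold=16.0,
                                                detectShadows=False)

cap = cv2.VideoCapture("walking_sequence.avi")  # hypothetical input clip
masks = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    fg_mask = subtractor.apply(frame)       # 255 = foreground candidate, 0 = background
    fg_mask = cv2.medianBlur(fg_mask, 5)    # suppress isolated noise pixels
    masks.append(fg_mask)
cap.release()
```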

3.2. Average Images from 3D Action Volumes. The 3D volume in the spatio-temporal ($XYT$) domain is formed by piling up the target region in the image sequences of one action cycle, which is the unit used to partition the sequences into spatio-temporal volumes. An action cycle is the fundamental unit for describing an action. In this work, we assume that the spatio-temporal volume consists of a number of small voxels. The average image $I_{av}(x, y)$ is defined as

$$ I_{av}(x, y) = \frac{1}{\tau} \sum_{t=0}^{\tau-1} I(x, y, t), \qquad (3) $$

where $\tau$ is the number of frames in an action cycle (we use $\tau = 25$ in our experiments) and $I(x, y, t)$ represents the density of the voxels at time $t$. An example of an average image created from the 3D spatio-temporal volume of a running sequence is shown in Figure 2. To characterize these 2D average images, the 2D affine moment invariants are used as features [26].

Figure 2: 2D average image created from the 3D spatio-temporal volume of a walking sequence.
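As a small sketch of (3), assuming a list of equally sized foreground silhouette frames (for example, the masks produced by the background-subtraction step above), the average image reduces to a mean over the time axis:

```python
import numpy as np

TAU = 25  # frames per action cycle, as used in the paper

def average_image(silhouettes):
    """Average image I_av(x, y) of one action cycle, cf. eq. (3)."""
    stack = np.stack(silhouettes[:TAU]).astype(np.float64)  # shape (tau, H, W)
    return stack.mean(axis=0) / 255.0                       # mean over the time axis

# Example (hypothetical): I_av = average_image(masks)
```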

3.3. Feature Extraction. As is well known, moments describe the shape properties of an object as it appears. Affine moment invariants are moment-based descriptors that are invariant under a general affine transform. Six affine moment invariants can be derived from the central moments [27] as follows:

$$ I_1 = \frac{1}{\eta_{00}^{4}}\left[\eta_{20}\eta_{02} - \eta_{11}^{2}\right], $$

$$ I_2 = \frac{1}{\eta_{00}^{10}}\left[\eta_{30}^{2}\eta_{03}^{2} - 6\eta_{30}\eta_{21}\eta_{12}\eta_{03} + 4\eta_{30}\eta_{12}^{3} + 4\eta_{03}\eta_{21}^{3} - 3\eta_{21}^{2}\eta_{12}^{2}\right], $$

$$ I_3 = \frac{1}{\eta_{00}^{7}}\left[\eta_{20}\left(\eta_{21}\eta_{03} - \eta_{12}^{2}\right) - \eta_{11}\left(\eta_{30}\eta_{03} - \eta_{21}\eta_{12}\right) + \eta_{02}\left(\eta_{30}\eta_{12} - \eta_{21}^{2}\right)\right], $$

$$ I_4 = \frac{1}{\eta_{00}^{11}}\left[\eta_{20}^{3}\eta_{03}^{2} - 6\eta_{20}^{2}\eta_{11}\eta_{12}\eta_{03} - 6\eta_{20}^{2}\eta_{02}\eta_{21}\eta_{03} + 9\eta_{20}^{2}\eta_{02}\eta_{12}^{2} + 12\eta_{20}\eta_{11}^{2}\eta_{21}\eta_{03} + 6\eta_{20}\eta_{11}\eta_{02}\eta_{30}\eta_{03} - 18\eta_{20}\eta_{11}\eta_{02}\eta_{21}\eta_{12} - 8\eta_{11}^{3}\eta_{30}\eta_{03} - 6\eta_{20}\eta_{02}^{2}\eta_{30}\eta_{12} + 9\eta_{20}\eta_{02}^{2}\eta_{21}^{2} + 12\eta_{11}^{2}\eta_{02}\eta_{30}\eta_{12} - 6\eta_{11}\eta_{02}^{2}\eta_{30}\eta_{21} + \eta_{02}^{3}\eta_{30}^{2}\right], $$

$$ I_5 = \frac{1}{\eta_{00}^{6}}\left[\eta_{40}\eta_{04} - 4\eta_{31}\eta_{13} + 3\eta_{22}^{2}\right], $$

$$ I_6 = \frac{1}{\eta_{00}^{9}}\left[\eta_{40}\eta_{04}\eta_{22} + 2\eta_{31}\eta_{13}\eta_{22} - \eta_{40}\eta_{13}^{2} - \eta_{04}\eta_{31}^{2} - \eta_{22}^{3}\right], \qquad (4) $$

where $\eta_{pq}$ is the central moment of order $p + q$. For the spatio-temporal ($XYT$) space, the 3D moment of order $(p + q + r)$ of a 3D object $\mathcal{O}$ is derived by the same procedure as the 2D centralized moment:

$$ \eta_{pqr} = \sum_{(x,y,t)\in\mathcal{O}} \left(x - x_{g}\right)^{p}\left(y - y_{g}\right)^{q}\left(t - t_{g}\right)^{r} I(x, y, t), \qquad (5) $$

where $(x_{g}, y_{g}, t_{g})$ is the centroid of the object in the spatio-temporal space. Based on the definition of the 3D moment in (5), six 3D affine moment invariants can be defined. The first two of these invariants are given by

$$ J_1 = \frac{1}{\eta_{000}^{5}}\left[\eta_{200}\eta_{020}\eta_{002} + 2\eta_{110}\eta_{101}\eta_{011} - \eta_{200}\eta_{011}^{2} - \eta_{020}\eta_{101}^{2} - \eta_{002}\eta_{110}^{2}\right], $$

$$ J_2 = \frac{1}{\eta_{000}^{7}}\Bigl[\eta_{400}\left(\eta_{040}\eta_{004} + 3\eta_{022}^{2} - 4\eta_{013}\eta_{031}\right) + 3\eta_{202}\left(\eta_{040}\eta_{202} - 4\eta_{112}\eta_{130} + 4\eta_{121}^{2}\right) + 12\eta_{211}\left(\eta_{022}\eta_{211} + \eta_{103}\eta_{130} - \eta_{031}\eta_{202} - \eta_{121}\eta_{112}\right) + 4\eta_{310}\left(\eta_{031}\eta_{103} - \eta_{004}\eta_{220} + 3\eta_{013}\eta_{121} - 3\eta_{022}\eta_{112}\right) + 3\eta_{220}\left(\eta_{004}\eta_{220} + 2\eta_{022}\eta_{202} + 4\eta_{112} - 4\eta_{013}\eta_{311} - 4\eta_{121}\eta_{103}\right) + 4\eta_{301}\left(\eta_{013}\eta_{130} - \eta_{040}\eta_{103} + 3\eta_{031}\eta_{112} - 3\eta_{022}\eta_{121}\right)\Bigr]. \qquad (6) $$

Owing to their long formulae, the remaining four 3D moment invariants are not displayed here (refer to [28]). Figure 3 shows plots of the 2D dynamic affine invariants for the different action classes, computed on the average images of the corresponding action sequences.

Figure 3: Plots of 2D affine moment invariants ($I_i$, $i = 1, \ldots, 6$) computed on the average images of walking, jogging, running, boxing, waving, and clapping sequences.
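As a sketch of how the 2D invariants of (4) can be evaluated on an average image, the helper below computes the central moments $\eta_{pq}$ directly with NumPy and then the first two invariants $I_1$ and $I_2$; the function names are ours, and only two invariants are shown for brevity.

```python
import numpy as np

def central_moment(img, p, q):
    """Central moment eta_pq of a 2D grayscale (average) image."""
    h, w = img.shape
    y, x = np.mgrid[0:h, 0:w].astype(np.float64)
    m00 = img.sum()
    xg, yg = (x * img).sum() / m00, (y * img).sum() / m00
    return ((x - xg) ** p * (y - yg) ** q * img).sum()

def affine_invariants_2d(img):
    """First two 2D affine moment invariants I1, I2 (cf. eq. (4))."""
    n = {(p, q): central_moment(img, p, q)
         for p in range(4) for q in range(4) if p + q <= 3}
    n00 = n[(0, 0)]
    I1 = (n[(2, 0)] * n[(0, 2)] - n[(1, 1)] ** 2) / n00 ** 4
    I2 = (n[(3, 0)] ** 2 * n[(0, 3)] ** 2
          - 6 * n[(3, 0)] * n[(2, 1)] * n[(1, 2)] * n[(0, 3)]
          + 4 * n[(3, 0)] * n[(1, 2)] ** 3
          + 4 * n[(0, 3)] * n[(2, 1)] ** 3
          - 3 * n[(2, 1)] ** 2 * n[(1, 2)] ** 2) / n00 ** 10
    return np.array([I1, I2])
```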

3.4. Action Classification Using SVM. We formulate the action recognition task as a multiclass learning problem, in which there is one class for each action and the goal is to assign an action label to the individual in each video sequence [1, 29]. Various supervised learning algorithms could be used to train the action recognizer. Support Vector Machines (SVMs) are used in this work owing to their outstanding generalization capability and their reputation as a highly accurate paradigm [30]. SVMs, which provide a principled remedy for the overfitting observed in neural networks, are based on the structural risk minimization principle from statistical learning theory. Originally, SVMs were designed to handle two classes by constructing a maximal separating hyperplane in a high-dimensional space. On each side of this hyperplane, two parallel hyperplanes are constructed, and the SVM seeks the separating hyperplane that maximizes the distance between these two parallel hyperplanes (see Figure 4). Intuitively, a good separation is achieved by the hyperplane with the largest margin; hence, the larger the margin, the lower the generalization error of the classifier.


Figure 4: Generalized optimal separating hyperplane.

Formally, let $\mathcal{D} = \{(\mathbf{x}_i, y_i) \mid \mathbf{x}_i \in \mathbb{R}^d,\ y_i \in \{-1, +1\}\}$ be a training dataset. Vapnik [30] shows that the problem is best addressed by allowing some examples to violate the margin constraints. These potential violations are expressed through positive slack variables $\xi_i$ and a penalty parameter $C \geq 0$ that penalizes margin violations. Thus, the generalized optimal separating hyperplane is determined by solving the following quadratic programming problem:

$$ \min_{\beta,\,\beta_0}\ \frac{1}{2}\|\beta\|^{2} + C \sum_i \xi_i \quad \text{subject to} \quad y_i\left(\langle \mathbf{x}_i, \beta\rangle + \beta_0\right) \geq 1 - \xi_i,\ \ \xi_i \geq 0 \quad \forall i. \qquad (7) $$

Geometrically, $\beta \in \mathbb{R}^d$ is a vector perpendicular to the separating hyperplane, and the offset parameter $\beta_0$ is added so that the hyperplane is not forced to pass through the origin, which would restrict the solution. For computational purposes, it is more convenient to solve the SVM in its dual formulation, which is obtained by forming the Lagrangian and optimizing over the Lagrange multipliers $\alpha$. The resulting decision function has weight vector $\beta = \sum_i \alpha_i y_i \mathbf{x}_i$ with $0 \leq \alpha_i \leq C$. The instances $\mathbf{x}_i$ with $\alpha_i > 0$ are called support vectors, as they uniquely define the maximum-margin hyperplane. In the current approach, several one-versus-all SVM classifiers, one per action class, are trained on the affine moment features extracted from the action sequences in the training dataset. For each action sequence, a set of six 2D affine moment invariants is extracted from the average image, and another set of six 3D affine moment invariants is extracted from the spatio-temporal silhouette sequence. The SVM classifiers are then trained on these features to learn the various categories of actions.
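A minimal training sketch, assuming the per-sequence feature vectors (six 2D plus six 3D affine moment invariants) have already been computed and stored, is given below; it uses scikit-learn's RBF-kernel SVC wrapped in a one-vs-rest scheme, and the file names and hyperparameter values are illustrative assumptions rather than the paper's settings.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = np.load("features.npy")   # hypothetical: (n_sequences, 12) affine-moment features
y = np.load("labels.npy")     # hypothetical: one action label per sequence

# One-versus-all SVMs with a Gaussian RBF kernel; features are standardized first.
clf = OneVsRestClassifier(
    make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
)
clf.fit(X, y)
print(clf.predict(X[:5]))     # predicted action labels for the first five sequences
```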

4. Experiments and Results

To evaluate the proposed approach, two main experiments were carried out, and the results were compared with those reported for other state-of-the-art methods.

4.1. Experiment 1. This experiment was conducted on the KTH action dataset [31]. The KTH dataset contains sequences of six types of human actions (walking, jogging, running, boxing, hand waving, and hand clapping) performed by a total of 25 individuals in four different settings (outdoors, outdoors with scale variation, outdoors with different clothes, and indoors). All sequences were acquired by a static camera at 25 fps with a spatial resolution of 160 × 120 pixels over homogeneous backgrounds. To the best of our knowledge, no other comparable dataset of sequences acquired in such varied environments is available in the literature. To provide an unbiased estimate of the generalization ability of the classifier, the sequences performed by 18 subjects (75% of all sequences) were used for training, and the sequences performed by the remaining 7 subjects (the remaining 25%) were set aside as a test set. SVMs with a Gaussian radial basis function (RBF) kernel were trained on the training set, and recognition performance was evaluated on the test set. The confusion matrix for the KTH action dataset is given in Table 1, and a comparison with other methods reported in the literature is shown in Table 3. As the figures in Table 1 show, most actions are correctly classified, and there is a clear distinction between arm actions and leg actions. Most confusions occur between the "jogging" and "running" actions and between the "boxing" and "clapping" actions.

Table 1: Confusion matrix for the KTH dataset.

Action      Walking   Running   Jogging   Boxing   Waving   Clapping
Walking     0.94      0.01      0.05      0.00     0.00     0.00
Running     0.00      0.96      0.04      0.00     0.00     0.00
Jogging     0.04      0.08      0.88      0.00     0.00     0.00
Boxing      0.00      0.00      0.00      0.94     0.02     0.04
Waving      0.00      0.00      0.00      0.02     0.93     0.05
Clapping    0.00      0.00      0.00      0.01     0.03     0.96

This is intuitively plausible given the high similarity within each of these pairs of actions. From the comparison in Table 3, it is evident that our method performs competitively with other state-of-the-art methods. It is pertinent to mention that the state-of-the-art methods with which we compare ours used the same dataset and the same experimental conditions; therefore, the comparison is fair.

4.2. Experiment 2. This second experiment was conducted on the Weizmann action dataset provided by Blank et al. [32] in 2005, which contains a total of 90 video clips (5098 frames) performed by 9 individuals, each clip showing one person performing a single action. Ten categories of action are involved, namely, walking, running, jumping, jumping in place, bending, jacking, skipping, galloping sideways, one-hand waving, and two-hand waving. All clips are sampled at 25 Hz, last about 2 seconds, and have a frame size of 180 × 144 pixels. To provide an unbiased estimate of the generalization ability of the proposed method, we used the leave-one-out cross-validation (LOOCV) technique: in each fold, the group of sequences from a single subject is held out as the testing data and the remaining sequences are used as the training data; this is repeated so that each group of sequences is used once for validation. As in the first experiment, SVMs with a Gaussian RBF kernel were trained on the training data, and recognition performance was evaluated on the held-out data. The confusion matrix in Table 2 summarizes the recognition results obtained by the proposed method, with correct responses on the main diagonal. The majority of actions are correctly classified, and an average recognition rate of 97.8% is achieved. Moreover, there is a clear distinction between arm actions and leg actions; the only confusions occur between the "skip" and "run" actions, which seems reasonable given the high similarity between these two actions. To quantify the effectiveness of the method, the obtained results are compared with those previously reported by other investigators; the outcome of this comparison is presented in Table 3. In the light of this comparison, one can see that the proposed method is competitive with the state-of-the-art methods.
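The leave-one-subject-out protocol described above can be expressed, for example, with scikit-learn's LeaveOneGroupOut; the feature, label, and subject-ID arrays are assumed to come from the earlier feature extraction stage, and the hyperparameters are illustrative.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = np.load("features.npy")        # hypothetical: one 12-D feature vector per sequence
y = np.load("labels.npy")          # hypothetical: action label per sequence
groups = np.load("subjects.npy")   # hypothetical: subject ID per sequence

# Each fold holds out all sequences of one subject, matching the LOOCV protocol above.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale"))
scores = cross_val_score(model, X, y, cv=LeaveOneGroupOut(), groups=groups)
print("mean accuracy: %.3f" % scores.mean())
```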


Table 2: Confusion matrix for the Weizmann dataset.

Action    Bend   Jump   Pjump   Walk   Run    Side   Jack   Skip   Wave 1   Wave 2
Bend      1.00   0.00   0.00    0.00   0.00   0.00   0.00   0.00   0.00     0.00
Jump      0.00   1.00   0.00    0.00   0.00   0.00   0.00   0.00   0.00     0.00
Pjump     0.00   0.00   1.00    0.00   0.00   0.00   0.00   0.00   0.00     0.00
Walk      0.00   0.00   0.00    1.00   0.00   0.00   0.00   0.00   0.00     0.00
Run       0.00   0.00   0.00    0.00   0.90   0.00   0.00   0.10   0.00     0.00
Side      0.00   0.00   0.00    0.00   0.00   1.00   0.00   0.00   0.00     0.00
Jack      0.00   0.00   0.00    0.00   0.00   0.00   1.00   0.00   0.00     0.00
Skip      0.00   0.00   0.00    0.00   0.10   0.00   0.00   0.90   0.00     0.00
Wave 1    0.00   0.00   0.00    0.00   0.00   0.00   0.00   0.00   1.00     0.00
Wave 2    0.00   0.00   0.00    0.00   0.00   0.00   0.00   0.00   0.00     1.00

Table 3: Comparison with the state of the art on the KTH and Weizmann datasets.

Method                      KTH      Weizmann
Our method                  93.5%    98.0%
Liu and Shah [15]           92.8%    —
Wang and Mori [16]          92.5%    —
Jhuang et al. [17]          91.7%    —
Rodriguez et al. [9]        88.6%    —
Rapantzikos et al. [18]     88.3%    —
Dollár et al. [19]          81.2%    —
Ke et al. [20]              63.0%    —
Fathi and Mori [21]         —        100%
Bregonzio et al. [22]       —        96.6%
Zhang et al. [23]           —        92.8%
Niebles et al. [24]         —        90.0%
Dollár et al. [19]          —        85.2%
Kläser et al. [25]          —        84.3%

It is worthwhile to mention that all the methods with which we compared our method, except the one proposed in [21], used similar experimental setups; thus, the comparison is meaningful and fair. A final remark concerns the real-time performance of our approach: the proposed action recognizer runs at 18 fps on average (on a 2.8 GHz Intel dual-core machine with 4 GB of RAM, running 32-bit Windows 7 Professional).

5. Conclusion and Future Work

In this paper, we have introduced an approach for activity recognition based on affine moment invariants for activity representation and SVMs for feature classification. On two benchmark action datasets, the results obtained by the proposed approach compare favorably with those published in the literature. The primary focus of our future work will be to validate the approach empirically on more realistic datasets that present additional technical challenges, such as object articulation, occlusion, and significant background clutter.

References

[1] S. Sadek, A. Al-Hamadi, B. Michaelis, and U. Sayed, "Recognizing human actions: a fuzzy approach via chord-length shape features," ISRN Machine Vision, vol. 1, pp. 1–9, 2012.
[2] A. A. Efros, A. C. Berg, G. Mori, and J. Malik, "Recognizing action at a distance," in Proceedings of the 9th IEEE International Conference on Computer Vision (ICCV '03), vol. 2, pp. 726–733, October 2003.
[3] S. Sadek, A. Al-Hamadi, B. Michaelis, and U. Sayed, "Towards robust human action retrieval in video," in Proceedings of the British Machine Vision Conference (BMVC '10), Aberystwyth, UK, September 2010.
[4] S. Sadek, A. Al-Hamadi, B. Michaelis, and U. Sayed, "Human activity recognition: a scheme using multiple cues," in Proceedings of the International Symposium on Visual Computing (ISVC '10), vol. 1, pp. 574–583, Las Vegas, Nev, USA, November 2010.
[5] S. Sadek, A. Al-Hamadi, M. Elmezain, B. Michaelis, and U. Sayed, "Human activity recognition via temporal moment invariants," in Proceedings of the 10th IEEE International Symposium on Signal Processing and Information Technology (ISSPIT '10), pp. 79–84, Luxor, Egypt, December 2010.
[6] S. Sadek, A. Al-Hamadi, B. Michaelis, and U. Sayed, "An action recognition scheme using fuzzy log-polar histogram and temporal self-similarity," EURASIP Journal on Advances in Signal Processing, vol. 2011, Article ID 540375, 2011.
[7] R. Cutler and L. S. Davis, "Robust real-time periodic motion detection, analysis, and applications," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 781–796, 2000.

[8] E. Shechtman and M. Irani, "Space-time behavior based correlation," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05), vol. 1, pp. 405–412, June 2005.
[9] M. D. Rodriguez, J. Ahmed, and M. Shah, "Action MACH: a spatio-temporal maximum average correlation height filter for action recognition," in Proceedings of the 26th IEEE Conference on Computer Vision and Pattern Recognition (CVPR '08), June 2008.
[10] N. Ikizler and D. Forsyth, "Searching video for complex activities with finite state models," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '07), June 2007.
[11] D. M. Blei and J. D. Lafferty, "Correlated topic models," in Advances in Neural Information Processing Systems (NIPS), vol. 18, pp. 147–154, 2006.
[12] D. M. Blei, A. Y. Ng, and M. I. Jordan, "Latent Dirichlet allocation," Journal of Machine Learning Research, vol. 3, no. 4-5, pp. 993–1022, 2003.
[13] T. Hofmann, "Probabilistic latent semantic indexing," in Proceedings of the 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '99), pp. 50–57, 1999.
[14] S. J. McKenna, Y. Raja, and S. Gong, "Tracking colour objects using adaptive mixture models," Image and Vision Computing, vol. 17, no. 3-4, pp. 225–231, 1999.
[15] J. Liu and M. Shah, "Learning human actions via information maximization," in Proceedings of the 26th IEEE Conference on Computer Vision and Pattern Recognition (CVPR '08), June 2008.
[16] Y. Wang and G. Mori, "Max-margin hidden conditional random fields for human action recognition," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPR '09), pp. 872–879, June 2009.
[17] H. Jhuang, T. Serre, L. Wolf, and T. Poggio, "A biologically inspired system for action recognition," in Proceedings of the 11th IEEE International Conference on Computer Vision (ICCV '07), pp. 257–267, October 2007.
[18] K. Rapantzikos, Y. Avrithis, and S. Kollias, "Dense saliency-based spatiotemporal feature points for action recognition," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPR '09), pp. 1454–1461, June 2009.
[19] P. Dollár, V. Rabaud, G. Cottrell, and S. Belongie, "Behavior recognition via sparse spatio-temporal features," in Proceedings of the 2nd Joint IEEE International Workshop on Visual Surveillance and Performance Evaluation of Tracking and Surveillance (VS-PETS '05), pp. 65–72, October 2005.
[20] Y. Ke, R. Sukthankar, and M. Hebert, "Efficient visual event detection using volumetric features," in Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV '05), pp. 166–173, October 2005.
[21] A. Fathi and G. Mori, "Action recognition by learning mid-level motion features," in Proceedings of the 26th IEEE Conference on Computer Vision and Pattern Recognition (CVPR '08), June 2008.
[22] M. Bregonzio, S. Gong, and T. Xiang, "Recognising action as clouds of space-time interest points," in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops (CVPR '09), pp. 1948–1955, June 2009.

[23] Z. Zhang, Y. Hu, S. Chan, and L.-T. Chia, "Motion context: a new representation for human action recognition," in Proceedings of the European Conference on Computer Vision (ECCV '08), vol. 4, pp. 817–829, 2008.
[24] J. C. Niebles, H. Wang, and L. Fei-Fei, "Unsupervised learning of human action categories using spatial-temporal words," International Journal of Computer Vision, vol. 79, no. 3, pp. 299–318, 2008.
[25] A. Kläser, M. Marszałek, and C. Schmid, "A spatiotemporal descriptor based on 3D-gradients," in Proceedings of the British Machine Vision Conference (BMVC '08), 2008.
[26] S. Sadek, A. Al-Hamadi, B. Michaelis, and U. Sayed, "Human action recognition via affine moment invariants," in Proceedings of the 21st International Conference on Pattern Recognition (ICPR '12), pp. 218–221, Tsukuba Science City, Japan, November 2012.
[27] J. Flusser and T. Suk, "Pattern recognition by affine moment invariants," Pattern Recognition, vol. 26, no. 1, pp. 167–174, 1993.
[28] D. Xu and H. Li, "3-D affine moment invariants generated by geometric primitives," in Proceedings of the 18th International Conference on Pattern Recognition (ICPR '06), pp. 544–547, August 2006.
[29] S. Sadek, A. Al-Hamadi, B. Michaelis, and U. Sayed, "An SVM approach for activity recognition based on chord-length-function shape features," in Proceedings of the IEEE International Conference on Image Processing (ICIP '12), pp. 767–770, Orlando, Fla, USA, October 2012.
[30] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, NY, USA, 1995.
[31] C. Schüldt, I. Laptev, and B. Caputo, "Recognizing human actions: a local SVM approach," in Proceedings of the 17th International Conference on Pattern Recognition (ICPR '04), pp. 32–36, 2004.
[32] M. Blank, L. Gorelick, E. Shechtman, M. Irani, and R. Basri, "Actions as space-time shapes," in Proceedings of the 10th IEEE International Conference on Computer Vision (ICCV '05), vol. 2, pp. 1395–1402, October 2005.
