Pattern Analysis & Applications manuscript No.
(will be inserted by the editor)

Three-dimensional action recognition using Volume Integrals

Luis Díaz-Más, Rafael Muñoz-Salinas, F. Madrid-Cuevas, R. Medina-Carnicer
Department of Computing and Numerical Analysis, University of Córdoba, 14071 Córdoba (Spain)
e-mail: {i22dimal,rmsalinas,ma1macuf,rmedina}@uco.es

The date of receipt and acceptance will be inserted by the editor
Abstract

This work proposes the Volume Integral (VI) as a new descriptor for three-dimensional action recognition. The descriptor transforms the actor's volumetric information into a two-dimensional representation by projecting the voxel data to a set of planes that maximize the discrimination of actions. Our descriptor significantly reduces the amount of data of the three-dimensional representations yet preserves the most important information. Therefore, the action recognition process is greatly sped up, and very high success rates are obtained. The method proposed is therefore especially appropriate for applications in which limitations of computing power and space are significant aspects to consider, such as real-time applications or mobile devices.

This paper tests the VI using several Dimensionality Reduction techniques (namely PCA, 2D-PCA and LDA) and different Machine Learning approaches (namely Clustering, SVM and HMM) so as to determine the best combination of these for the action recognition task. Experiments conducted on the public IXMAS dataset show that the VI compares favorably to state-of-the-art descriptors both in terms of classification rates and computing times.

Key words: Action recognition, View invariance, Multi-camera, Motion descriptor.

1 Introduction

Action recognition has become an important research topic in several fields, such as video analysis, video surveillance, and human-computer interaction, thus attracting a great deal of attention in recent years [1,2]. The problem has been addressed using different approaches, both in the 2D and 3D domains, with either single monocular cameras [3,4,5], single stereo cameras [6,7,8] or multiple monocular cameras [9,10,11,12]. However, developing robust methods of recognition independent of the view employed remains an unsolved problem [13].

Action recognition approaches can be roughly divided into two main categories: holistic and sequential. Holistic approaches consider the whole action as a shape varying in time and space, which can be represented by a single feature vector. Then, inferences can be drawn by different methods such as Clustering, Neural Networks or SVM. In contrast, sequential approaches consider the action as a series of individual poses, so that a different feature vector is extracted for each frame. Then, inferences can be drawn by models such as HMM or Conditional Random Fields, which capture the intrinsic temporal structure of the action as a sequence of state transitions.

While most research efforts have been focused on recognition from arbitrary single viewpoints, when multiple views are available during the recognition stage, three-dimensional information can be exploited to improve the success of recognition. The works of Weinland et al. [9] and Qian et al. [14] constitute the two most notable examples of three-dimensional viewpoint-invariant action recognition in the holistic and sequential approaches, respectively. In the first work, the authors propose the MHV as a method for compressing the voxels' occupancy of a whole action into a single voxelset. Then, the spectrum of the Fourier transform is obtained, which is rotation invariant, and the result is reduced using PCA. Finally, recognition is performed using a Clustering method. In the second work, the voxel data of each frame is reduced using Multilinear Analysis, and Hidden Markov Models (HMM) are used for inference.

In both cases, manipulating the large amount of information of three-dimensional voxel representations poses a major problem. As a consequence, standard Dimensionality Reduction techniques must be applied prior to the use of Machine Learning approaches. However, a direct application of Dimensionality Reduction techniques to voxel data results in sub-optimal strategies, both in terms of computational efficiency and recognition performance. The reason is that they require the use of large matrices that are difficult to manipulate and determine accurately, given the small number of training examples frequently available. Therefore, a more compact representation of voxel data would be preferable.
This work proposes the use of the Volume Integral (VI), an intermediate representation of voxel data that can be used before applying standard Dimensionality Reduction techniques in both holistic and sequential approaches. Our approach can be seen as the three-dimensional extension of the classical silhouette projection histograms, a very popular method for two-dimensional action recognition [15,16,17,18]. The VI reduces voxel data with a minimum loss of information by projecting it onto a set of planes that maximize the discrimination of the actions. Our descriptor allows a reduction of data by one and three orders of magnitude, respectively, compared to Weinland's and Qian's approaches. Consequently, Dimensionality Reduction techniques can be applied more efficiently without affecting classification performance. Another remarkable property of the VI is that, unlike Weinland's descriptor, it is sensitive to reflection. Thus, it is able to differentiate gestures performed with different limbs, which is of interest in certain applications.

Our descriptor has been evaluated on the public Inria Xmas Motion Acquisition Sequences (IXMAS) dataset to compare it with Weinland's and Qian's works, which employ the same dataset. The VI has been applied using several Dimensionality Reduction techniques and different Machine Learning approaches (both sequential and holistic) to determine the best strategy for the action recognition task. The results show that the proposed descriptor improves on the previous work in terms of both computing times and classification rates.

The remainder of this paper is structured as follows: The rest of this section is devoted to related work in the field of action recognition. Section 2 defines the basis of three-dimensional action representation approaches, including the definition of our proposal. Sect. 3 presents the experiments carried out, while Sect. 4 draws conclusions.

1.1 Related work and proposed contribution

View-invariant action recognition approaches can be broadly divided into two main categories: model-based and view-based. The former rely on a parametric model of the person (normally comprised of a set of connected joints in a tree structure), which is matched against the input observations [19,20,21]. The latter, however, rely on a set of observations from which a set of features are learned and employed for recognition. Our work is related to the latter category, which has two main sub-categories: holistic and sequential.

The work of Haritaoglu et al. is relevant in the field of human action recognition. They showed that silhouette projection histograms are a very simple yet effective representation mechanism [15,16,17,18]. Their main disadvantage is that the results depend strongly on the view. Thus, a great part of the effort on view-invariant gesture recognition has been focused on recognizing actions observed from single arbitrary viewpoints, using models acquired from either single or multiple views. One of the earliest holistic approaches is the Motion-History Image (MHI), proposed by Bobick et al. [3]. In their work, temporal templates are created by analyzing movement information, and gestures are identified by measuring distances to templates previously stored from multiple views. Later, Yilmaz and Shah proposed the use of spatio-temporal volumes (STVs), which are automatically generated for actions observed from any viewing direction using a graph-theoretic approach. STVs were later extended to multiple views by Pingkun et al. [12]. They propose the use of a 4D action feature model (4D-AFM) for recognizing actions from arbitrary viewpoints. The 4D-AFM is created by concatenating in time a sequence of reconstructed Visual Hulls (VHs) and then computing the differential geometric properties of the STVs. Also, Souvenir and Parrigan [22] propose a manifold learning-based framework which is tested on MHI and on the R Transform Surface Motion Descriptor.

When a sequential approach is selected, HMM are most frequently employed. Lv and Nevatia [23] propose an example-based action recognition method by the use of Action Nets. The action nets are automatically generated graph models that contain the 2D representation of one view of 3D key poses, in which links represent transitions between key poses. Inspired by that work, Xiaofei and Liu [24] propose a simpler approach, in which actions are first clustered in the 3D space and then projected onto virtual cameras. On the other hand, Weinland [25] proposes the use of a wrapper approach to determine the key exemplars; Chakraborty et al. [5] propose a different approach, in which individual classifiers are trained for specific body parts seen from specific viewpoints. Then, actions are recognized by a component-wise HMM that takes the classifier results as input.

In any case, recognition from a single monocular view suffers from ambiguities produced by projecting on two-dimensional images; e.g., it is difficult to determine if a person is pointing towards the camera or away from it. Thus, some authors have proposed the use of stereo vision. Shin et al. [6] were the first to extend MHIs to 3D, using stereo input sequences for generating Motion History Models (MHMs). Later, Muñoz-Salinas et al. [8] presented stereo-based extensions to several monocular approaches, showing that using stereo information leads to better results in all cases. Recently, Roh et al. [7] proposed an extension of MHIs to three-dimensional space called the Volume Motion Template (VMT). It is obtained by projecting stereo information to a 2D orthogonal plane that maximizes the discrimination of the action.
Despite the great attention that has been paid to the recognition of actions from a single viewpoint, there are applications in which multiple views are available during the recognition phase. These approaches rely on voxel data, a grid-based division of the 3D space into cubes of the same size that can be either occupied or empty. The set of voxels occupied by an actor forms the so-called Visual Hull (VH), which is a rich description of the actor's shape. Although 3D approaches can yield more descriptive models and thus achieve higher success rates, few view-based approaches have been proposed to exploit this fact.

The work of Weinland et al. [9] is one of them. The authors propose the Motion History Volume (MHV), a natural 3D extension of MHIs, which is a representation of the sequence of VHs that form an action. The MHV is processed using the Fourier transform in order to obtain its spectrum, which is invariant under rotation. Finally, the spectrum is reduced using Principal Component Analysis (PCA), and inference is performed using Clustering. Nonetheless, their approach has two main drawbacks. First, the number of features that result from the Fourier spectrum extraction is very high. After taking advantage of spectrum symmetry, the number of features to be reduced with PCA is (nθ × nz × nr)/2, where nθ, nz and nr are the numbers of subdivisions on each axis. As a consequence, the Dimensionality Reduction process is slow, and in some cases, the number of observations is insufficient to obtain reliable covariance matrices. Second, a side effect of using the Fourier spectrum is that the resulting descriptor is invariant under reflection. Thus, gestures performed with different limbs produce similar descriptors, which might be a problem in certain applications.

In contrast, the work of Qian et al. [14] presents a sequential approach based on HMM. It consists of reducing voxel data by pose tensor decomposition using the High Order Singular Value Decomposition method, which is a generalization of the singular value decomposition method for matrices of arbitrary dimensions. Although their method improves on the results obtained by Weinland's, it suffers from several weaknesses. First, it requires a discretization of all possible body orientations so as to create tensors in each direction. Second, and partially as a consequence of the former, the pose tensor requires a large amount of memory: it requires nx × ny × nz × r × n features, where nx, ny and nz denote the number of subdivisions in each axis, the parameter r represents the number of possible body orientations, and n is the number of key poses (these are employed to calculate the core tensor). These values are set to nx = ny = nz = 32, r = 16 and n = 25 in their work. We consider that the large amount of information required by this method makes it inappropriate for many applications and that the use of lighter methods would be preferable. Our comparison to this method will be based on the results reported by the authors.

2 Action representation

This section explains the basis of the methods employed for action representation. The first two subsections explain the techniques employed for representing 3D actions. First, Sect. 2.1 explains the concept of the VH and how it is created using the Shape-from-Silhouette (SfS) algorithm. Whereas sequential approaches use the VH generated at each frame for recognition, holistic approaches use the MHV, which compresses a set of frames into a single voxelset (Sect. 2.2). Once a suitable 3D representation of the actions is obtained, the information is transformed so as to reduce the amount of data and to achieve rotation invariance. Section 2.3 explains Weinland's approach, while Sect. 2.4 describes our proposal, the VI.

2.1 Visual Hull extraction

Let us assume that the monitored area can be divided into cubes of the same volume (called voxels) denoted by

V = {v^i | i = (x^i, y^i, z^i)},    (1)

where i = (x^i, y^i, z^i) represents the voxel's center in Cartesian coordinates and v^i ∈ {0, 1} depending on whether the voxel is occupied or empty.

The VH of an actor consists of all the occupied voxels that belong to him, and it can be calculated using the well-known SfS algorithm [26], although more recent approaches can also be employed [27,28]. For that purpose, we require a set of cameras surrounding the actor and placed at known locations determined by calibration. In addition, a background subtraction method is required to obtain the actor's silhouettes.

The traditional SfS method examines the projections of the voxels in the foreground images in order to determine whether or not they belong to the shapes in question. This is achieved by means of a projection test. Each voxel is projected in all the foreground images, and if its projection lies completely inside a silhouette in all the foreground images, then it is considered occupied. However, if the voxel projects to a background region in any of the images, it is considered empty. Finally, if the voxel projects partially onto a foreground region, it is considered to belong to a border, and a decision must be made. In the end, the result is a Boolean decision {0, 1} indicating whether the region of the space represented by the voxel is empty or occupied.

Figure 1(a) shows the VHs obtained for three instants of time in one of the kick sequences of the IXMAS dataset.
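As a rough illustration of this projection test, the following sketch (ours, not the authors' code) marks a voxel as occupied only if its center projects inside the silhouette in every view; the camera matrices and masks are hypothetical inputs, and the border handling described above is simplified to a single center-point test.

```python
import numpy as np

def shape_from_silhouette(voxel_centers, cameras, silhouettes):
    """Naive SfS projection test: a voxel is kept only if its center
    projects inside the silhouette in every foreground image.

    voxel_centers: (N, 3) voxel centers in world coordinates.
    cameras:       list of 3x4 projection matrices (assumed calibrated).
    silhouettes:   list of binary foreground masks, one per camera.
    """
    homog = np.hstack([voxel_centers, np.ones((len(voxel_centers), 1))])
    occupied = np.ones(len(voxel_centers), dtype=bool)
    for P, mask in zip(cameras, silhouettes):
        pix = homog @ P.T
        valid = pix[:, 2] > 0                      # in front of the camera
        u = np.zeros(len(pix), dtype=int)
        v = np.zeros(len(pix), dtype=int)
        u[valid] = (pix[valid, 0] / pix[valid, 2]).astype(int)
        v[valid] = (pix[valid, 1] / pix[valid, 2]).astype(int)
        h, w = mask.shape
        inside = valid & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        fg = np.zeros(len(pix), dtype=bool)
        fg[inside] = mask[v[inside], u[inside]] > 0
        occupied &= fg                             # background in any view -> empty
    return occupied
```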
Figure 1: (a) Set of Visual Hulls of the kick action. (b) Volume Integrals of the actions' Visual Hulls. (c) MHV of the kick action. (d) Volume Integral of the MHV.
2.2 Motion History Volumes

The Motion History Volumes compress a set of VHs into a single probabilistic representation of the voxels' occupancy. Let us define v^i_t as the value of voxel v^i at time t. Then the MHV function is defined by

V_τ(i,t) = τ                          if v^i_t = 1,
V_τ(i,t) = max(0, V_τ(i,t−1) − 1)     otherwise,    (2)

where τ is the duration, in number of frames, of the action represented. At instant t, V_τ(i,t) has values close to τ if the corresponding voxel has been recently observed as occupied. Values of V_τ(i,t) near 0 indicate that the voxel has not been occupied recently, and the value 0 indicates that it has not been observed as occupied in any of the sequence frames.

In order to perform a reliable comparison between the MHVs, they must be normalized. Assuming that the start and end time of the action is known, the MHV is normalized as:

V(i) = V_τ(i, τ) / τ.    (3)

Invariance to translation and scale is achieved by calculating the means µx, µy, µz and variances σx, σy, σz of the non-empty voxels in the MHV and then shifting and scaling it so that µx = µy = µz = 0 and σx = σy = σz = 1.

Figure 1(c) shows the MHV of the action shown in Figure 1(a). Red colors represent values near 1, while blue colors correspond to values close to 0. Thus, more recently occupied voxels correspond to redder areas.

The main advantage of the MHV is that it compresses a whole sequence into a single voxelset. However, it has two main drawbacks. First, it requires knowledge of the start and end of the sequence. Second, the speed at which the action is performed affects the MHV obtained. On the other hand, sequential approaches are in theory less sensitive to that problem, because they are capable of modeling the intrinsic dynamics of the action as state transitions.
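For illustration, Eqs. (2) and (3) can be implemented with a simple recurrence over the frame sequence; the sketch below is our own reading of the definitions, not the authors' implementation.

```python
import numpy as np

def motion_history_volume(vh_sequence):
    """Accumulate a sequence of binary voxelsets (Eq. 2) and
    normalize by the sequence length tau (Eq. 3).

    vh_sequence: iterable of 3D {0, 1} arrays, one Visual Hull per frame.
    """
    frames = list(vh_sequence)
    tau = len(frames)
    V = np.zeros_like(frames[0], dtype=float)
    for vh in frames:
        # occupied voxels are reset to tau; the rest decay by one per frame
        V = np.where(vh == 1, tau, np.maximum(0.0, V - 1.0))
    return V / tau  # values in [0, 1]; recently occupied voxels close to 1
```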
2.3 Weinland's descriptor

Although the VHs and MHV are invariant under scaling and translation, they are not invariant under rotation. The descriptor proposed by Weinland et al. [9] achieves rotation invariance by using the Fourier transform (FT), taking into account the Fourier shift theorem. This states that a function f0(x) and its translated counterpart ft(x) = f0(x − x0) differ only by a phase modulation in the Fourier space:

Ft(k) = F0(k) e^{−j2πkx0}.

In this descriptor, voxels are expressed in a cylindrical coordinate system (see Fig. 2), defined as follows:

C(r, θ, z) = V(r cos(θ), r sin(θ), z).

Thus, rotations around the vertical z-axis result in cyclical translation shifts:

V(x cos(θ0) + y sin(θ0), −x sin(θ0) + y cos(θ0), z) → C(r, θ + θ0, z).

Hence, rotation invariance along θ for each pair of values (r, z) is achieved using the Fourier spectrum values |C(r, kθ, z)| of the 1D Fourier transform

C(r, kθ, z) = ∫_{−π}^{π} C(r, θ, z) e^{−j2πkθθ} dθ.    (4)

The descriptor is then defined as the vector

W = {w_{z,r} | z = 1, ..., nz, r = 1, ..., nr},    (5)

where

w_{z,r} = {|C(r, kθ_0, z)|, ..., |C(r, kθ_{nθ}, z)|}.    (6)

The parameters nr, nθ and nz represent the number of divisions employed in the r-, θ- and z-axes, respectively.

Figure 2: Representation of Weinland's descriptor. Fourier transforms over θ are computed for (r, z) pairs, forming the feature vector.

Due to the trivial ambiguity of 1D-Fourier magnitudes with respect to the reversal of the signal, motions that are symmetric with respect to the z-axis (i.e., identical actions performed with different limbs) result in the same motion descriptors. Invariance under reflection can be considered either as a loss of information or as a useful property, depending on the final application. In any case, considering the symmetry of the Fourier spectrum, the total number of features of the descriptor is (nθ × nz × nr)/2, which is very large.

Although in his original work Weinland applied the descriptor only to reduce MHVs, note that the descriptor can also be employed to reduce individual VHs. Actually, this work demonstrates (in the experimental section) that Weinland's descriptor obtains better results when it is employed in combination with a sequential approach (HMM) than when the MHV is employed.
2.4 Volume Integral

As can be observed, the amount of information obtained in a voxel-based representation is very high. Hence, it is desirable to reduce it before applying Machine Learning techniques. To that end, we propose the Volume Integral, which is an alternative representation of a voxelset that can be applied both to VHs and MHVs. The idea is to project the three-dimensional information to a set of planes that maximize the action's discriminability.

For the calculus of the VI, it is assumed that the voxelset is invariant under translation and scaling. This is achieved using the transformation procedure explained above for the MHV (i.e., calculating the means and variances of the voxelset and then shifting and rescaling it so that µx = µy = µz = 0 and σx = σy = σz = 1). Afterwards, rotation invariance is obtained by finding the angle that aligns two identical voxelsets. The actor's main direction is obtained, and the voxelset is rotated accordingly. Assuming that people stand while performing actions, the angle search can be restricted to the x−z plane. Thus, the actor's main direction is defined as the vector from the center of the volume to the farthest occupied voxel.

The resulting voxelset is then integrated over the three axes in pairs, so that the 3D information is represented by three 2D images with a minimum loss of information. Let us define the integrals of the voxelset as

Pz(x, y) = ( Σ_{i=1}^{nz} V(x, y, z^i) ) / ( ε + Σ_{i=1}^{nz} δ(V(x, y, z^i)) ),    (7)

Py(x, z) = ( Σ_{i=1}^{ny} V(x, y^i, z) ) / ( ε + Σ_{i=1}^{ny} δ(V(x, y^i, z)) ),    (8)

Px(y, z) = ( Σ_{i=1}^{nx} V(x^i, y, z) ) / ( ε + Σ_{i=1}^{nx} δ(V(x^i, y, z)) ).    (9)

The function V(x, y, z) is the value of the voxel with center (x, y, z). When calculating the VI from a VH, V(x, y, z) represents the voxel's state, i.e., V(x, y, z) ∈ {0, 1}, indicating whether the voxel is empty or occupied. However, when the VI represents a MHV, V(x, y, z) ∈ [0, 1] is the continuous value given by Eq. 3. The function δ is 0 if the voxel's occupancy is 0, and 1 otherwise. Therefore, the integrals are normalized by dividing by the number of non-empty voxels plus ε, a small value that avoids dividing by zero in case all voxels are empty.

Figure 3 presents the proposed method as an algorithm, with the aim of clarifying the above explanation. Note that the integral is normalized by dividing by the number of non-empty voxels instead of by the total number of subdivisions (nx, ny or nz). We have experimentally found that this approach yields better results, especially when dealing with MHVs. The reason is that, for regions such as limbs, the sum of the occupancy is very low, and dividing by the non-empty voxels better preserves the information for these parts.

Figures 1(b) and 1(d) show the VIs corresponding to the voxelsets of Figs. 1(a) and 1(c). It can be observed that our normalization approach produces binary silhouettes when applied to VHs, but it preserves the most essential spatio-temporal properties when applied to MHVs.
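Since the voxelset is stored as a 3D array, Eqs. (7)-(9) reduce to sums along each axis. The sketch below is a minimal illustration assuming a cubic, already-aligned voxelset (steps 1-2 of Fig. 3 applied beforehand), with eps standing for the small constant ε mentioned above.

```python
import numpy as np

def volume_integral(V, eps=1e-6):
    """Project a (translation/scale/rotation-normalized) voxelset onto
    the xy, xz and yz planes, normalizing by the number of non-empty
    voxels along each line of sight (Eqs. 7-9).

    V: cubic 3D array indexed [x, y, z]; binary for a VH, or
       values in [0, 1] for an MHV (Eq. 3).
    """
    delta = (V > 0).astype(float)                    # delta(v): 1 if non-empty
    Pz = V.sum(axis=2) / (eps + delta.sum(axis=2))   # integrate over z -> Pz(x, y)
    Py = V.sum(axis=1) / (eps + delta.sum(axis=1))   # integrate over y -> Py(x, z)
    Px = V.sum(axis=0) / (eps + delta.sum(axis=0))   # integrate over x -> Px(y, z)
    # concatenate the three projections into a single 2D feature matrix
    # (assumes a cubic grid, e.g. 64x64x64, so the shapes agree)
    return np.concatenate([Pz, Py, Px], axis=1)
```

Note that, for a binary VH, numerator and denominator coincide on non-empty lines of sight, so the projections are binary silhouettes, as observed in Fig. 1(b).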
Finally, a unique descriptor is formed by concatenating all the features in Pz(x, y), Py(x, z) and Px(y, z). When compared with Weinland's descriptor, the VI achieves a substantial reduction of the data, because the final feature vector contains only nx × ny + nx × nz + ny × nz elements. For instance, for a voxelset of resolution 64 × 64 × 64 (as employed in our experiments), Weinland's descriptor produces 131072 features, while ours produces 12288, i.e., approximately ten times fewer features. Compared to Qian's descriptor, ours produces roughly three orders of magnitude fewer features. As we show in the experimental section, these are sufficient to provide better results than Weinland's descriptor. In addition, the VI produces different descriptors for reflected actions. Hence, similar actions performed with different limbs can be distinguished.

Volume Integral creation
INPUT: voxelset V
OUTPUT: Volume Integral
1. Shift and scale the voxels in V so that the resulting volume has µx = µy = µz = 0 and σx = σy = σz = 1.
2. Rotate the voxels in V, finding the main direction as the vector from the center of the volume to the farthest occupied voxel.
3. Project the voxels onto the three planes xy, xz and yz so as to obtain the volume integrals:
   Pz(x, y) = ( Σ_{i=1}^{nz} V(x, y, z^i) ) / ( ε + Σ_{i=1}^{nz} δ(V(x, y, z^i)) )
   Py(x, z) = ( Σ_{i=1}^{ny} V(x, y^i, z) ) / ( ε + Σ_{i=1}^{ny} δ(V(x, y^i, z)) )
   Px(y, z) = ( Σ_{i=1}^{nx} V(x^i, y, z) ) / ( ε + Σ_{i=1}^{nx} δ(V(x^i, y, z)) )
4. Concatenate Pz(x, y), Py(x, z) and Px(y, z), creating a single 2D feature matrix representing the Volume Integral.

Figure 3: Algorithm that transforms a 3D volume into a Volume Integral.

3 Experimental results

This section presents the experiments conducted to evaluate the proposed descriptor. For testing purposes, the public IXMAS dataset has been employed. It is a well-known 3D dataset in the field of action recognition that contains 13 actions, each one performed three times by 12 different actors. The actors freely change their orientations in each acquisition, so that rotation invariance can be analyzed.

The goals of our experiments are four-fold. First, we seek to evaluate the performance of the VI using different combinations of Dimensionality Reduction techniques and Machine Learning approaches in order to find the most appropriate one. Second, we compare the discriminatory power of our descriptor to that of the one proposed by Weinland [9]. For that purpose, the descriptors are evaluated using the original dataset, and the results are presented in Sect. 3.1. Third, the tests evaluate the discriminatory power of the descriptors when reflected actions are considered, i.e., identical actions performed with different limbs. For that purpose, a total of 4 new actions are added to the original dataset by applying a reflection operator to some of the original actions. The results obtained are shown in Sect. 3.2. Finally, in Sect. 3.3, we evaluate the improvement in computational time achieved by our descriptor when compared to Weinland's.

We have employed the same evaluation methodology as in Ref. [9] so as to obtain comparable results. This means that: i) we have used 10 of the 12 actors in the dataset; ii) a leave-one-out cross-validation is applied, i.e., one actor is chosen for testing and the rest are used for training; iii) actions 11 (point) and 13 (throw over head) have not been included, because they are performed very differently by the actors; and iv) the space is discretized into 64 subdivisions for each axis. In addition, the VHs and ground-truth provided with the dataset have been employed.
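As an illustration of points i)-ii) of this protocol, the sketch below runs a leave-one-actor-out loop with an SVM classifier; the arrays and the use of scikit-learn are our own assumptions for the example, not part of the original experimental code.

```python
import numpy as np
from sklearn.svm import SVC

def leave_one_actor_out(descriptors, labels, actors):
    """Leave-one-out over actors: train on 9 actors, test on the 10th.

    descriptors: (N, D) feature matrix (e.g., flattened, reduced VIs).
    labels:      (N,) action labels.
    actors:      (N,) actor identifier for each sample.
    """
    rates = []
    for actor in np.unique(actors):
        test = actors == actor
        clf = SVC(kernel="rbf")                     # SVM-RBF, as in Table 1
        clf.fit(descriptors[~test], labels[~test])
        rates.append(clf.score(descriptors[test], labels[test]))
    return float(np.mean(rates))                    # average success rate
```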
3.1 Comparative evaluation in the IXMAS dataset

The first test compares our descriptor to Weinland's in the original IXMAS dataset using different combinations of Dimensionality Reduction and Machine Learning approaches. The results of the tests are shown in Table 1.

Mach. Learn.   Dim. Red.   Weinland   Volum. Integ.
Clust.         PCA         93.33      90.90
Clust.         2D-PCA      -          85.15
Clust.         PCA+LDA     92.42      90.90
SVM-RBF        PCA         80.90      90.00
SVM-RBF        2D-PCA      -          77.27
SVM-RBF        PCA+LDA     89.39      86.69
SVM-LIN        PCA         89.09      88.87
SVM-LIN        2D-PCA      -          86.66
SVM-LIN        PCA+LDA     86.69      88.48
HMM            PCA         71.21      76.06
HMM            2D-PCA      -          87.57
HMM            PCA+LDA     93.64      98.45

Table 1: Classification results in the original IXMAS dataset.

The first column indicates the Machine Learning approach employed, and the second column indicates the Dimensionality Reduction technique applied. In the first column, SVM-RBF stands for Support Vector Machines (SVM) using a radial basis function kernel, while SVM-LIN indicates that a linear kernel is employed. The third and fourth columns show the results obtained for the different descriptors. For the SVM-RBF, SVM-LIN and HMM methods, the results presented correspond to the best results after testing with different values of the methods' parameters. As previously explained, the Clustering, SVM-RBF and SVM-LIN methods are applied using the MHVs, whereas the HMM tests employ the individual VHs generated at each frame.
Several remarks must be made regarding the Dimensionality Reduction techniques. First, the Complete Two-dimensional Principal Component Analysis (2D-PCA) technique has been applied to our descriptor but not to Weinland's, since the technique requires the data to be a two-dimensional matrix. Different compression ratios have been tested, and the values in the table correspond to those that provided the best results. Second, to determine the number of PCA components, we have employed the minimum-error formulation suggested by Bishop [29]. For the VI, we retain, approximately, the first 123 principal components out of the 12288 features given by the descriptor. For Weinland's descriptor, we employed approximately 233 components out of the 131072 features. Finally, for the Linear Discriminant Analysis (LDA) method, we retained only the 10 most relevant features, i.e., the number of classes minus one.
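A minimal sketch of this (PCA+LDA) chain, using scikit-learn as one possible implementation, is given below; the component counts follow the figures quoted above, while the pipeline itself is our assumption rather than the authors' code.

```python
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# PCA first (about 123 components for the 12288-feature VI, following the
# minimum-error criterion of Bishop [29]), then LDA down to classes - 1.
reduction = make_pipeline(
    PCA(n_components=123),
    LinearDiscriminantAnalysis(n_components=10),  # 11 action classes - 1
)
# Usage (descriptors and labels are hypothetical training arrays):
# low_dim = reduction.fit_transform(descriptors, labels)
```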
The results obtained show that Weinland's descriptor obtains better results than ours when the Clustering and SVM-LIN methods are applied. However, the VI obtains better results when using the SVM-RBF and HMM methods. In general, the results show that the combination of HMM + (PCA+LDA) obtains the best results for both descriptors. In particular, our method obtains, in the best case, a 98.45% success rate. This result is obtained using an HMM with 12 states and no skips between states.

In order to compare both methods more formally, we have used a Paired-Samples test on the data in Table 1 (excluding the 2D-PCA results, because they are not available for both methods). For more information on statistical analysis, see [30,31]. The first two rows of Table 2 show the performance statistics of both methods, while the last row shows the statistics calculated by the Paired-Samples test.

Method        Mean     Std. Deviation   Std. Error
Weinland      0.8708   0.0765           0.0270
VI            0.8879   0.0621           0.0219

              Mean      Std. Deviation   Std. Mean Error   t        DOF   Sig. (2-tailed)
Weinland-VI   -0.0171   0.0422           0.0149            -1.145   7     0.290

Table 2: Statistics of the distributions and results of the Paired-Samples Test. See text for discussion.

As indicated in the table, the significance of the null hypothesis is 0.290 (following a two-tailed T-Distribution with 7 degrees of freedom (DOF)). Using a significance threshold of 0.05, the test indicates that there is not enough evidence to reject the null hypothesis. In other words, there is insufficient evidence to suggest that one method is better than the other in terms of performance in the dataset employed.
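The paired comparison can be reproduced directly from the Table 1 success rates (2D-PCA rows excluded); a sketch with SciPy follows, which yields the t and significance values reported in Table 2.

```python
from scipy import stats

# success rates from Table 1, excluding the 2D-PCA rows
weinland = [93.33, 92.42, 80.90, 89.39, 89.09, 86.69, 71.21, 93.64]
vi       = [90.90, 90.90, 90.00, 86.69, 88.87, 88.48, 76.06, 98.45]

t, p = stats.ttest_rel(weinland, vi)            # paired-samples t-test, 7 DOF
print(f"t = {t:.3f}, two-tailed p = {p:.3f}")   # approx. t = -1.145, p = 0.290
```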
Finally, we must indicate that the work of Qian et al. [14] reported a success rate of 94.59%. To allow replication of our experiments, we provide the complete set of VIs corresponding to the VHs and MHVs of the IXMAS actions, along with references to the libraries we have used. (Note to reviewers: The final location of this data has not yet been determined, since it might be located on the journal servers. However, the dataset is temporarily available at http://www.uco.es/users/i22dimal/files/data-vi.tar.bz2 for reviewing purposes.)

3.2 Comparative evaluation using reflected actions

The goal of this test is to evaluate the ability of our method to deal with reflected actions. For that purpose, a new dataset has been created by adding four new actions to the original IXMAS dataset. The new actions are created by applying a reflection operator to the actions check watch, wave hand, kick and punch. An example can be seen in Fig. 4, which shows the original VH of one of the sequences of the punch action, along with its reflected counterpart. It can be observed that the action is identical, except for the fact that it seems to be performed with the opposite arm.

Figure 4: Reflection of the punch action.
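The reflection operator itself amounts to mirroring the voxelset about a vertical plane; a minimal sketch follows, where the choice of horizontal axis is our assumption (the paper does not specify it).

```python
import numpy as np

def reflect_voxelset(V, axis=1):
    """Mirror a voxelset about a vertical plane, producing the
    'reflected' counterpart of an action frame (e.g., punch -> punch-R).

    V: 3D occupancy array indexed [x, y, z]; axis 0 or 1 selects the
    horizontal direction to flip (an assumption, not from the paper).
    """
    return np.flip(V, axis=axis)
```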
The results using the two descriptors are shown in Table 4. The parameters of the Machine Learning and Dimensionality Reduction techniques have been obtained as in the previous test.

Mach. Learn.   Dim. Red.   Weinland   Volum. Integ.
Clust.         PCA         70.00      84.88
Clust.         2D-PCA      -          80.66
Clust.         PCA+LDA     67.99      85.55
SVM-RBF        PCA         68.44      83.55
SVM-RBF        2D-PCA      -          70.66
SVM-RBF        PCA+LDA     70.22      84.88
SVM-LIN        PCA         64.22      78.44
SVM-LIN        2D-PCA      -          70.80
SVM-LIN        PCA+LDA     59.99      81.77
HMM            PCA         52.20      72.50
HMM            2D-PCA      -          79.20
HMM            PCA+LDA     70.90      91.90

Table 4: Classification results in the IXMAS dataset using reflected actions.

As previously explained, the descriptor proposed by Weinland is not capable of distinguishing between reflected actions because of the symmetry of the Fourier spectrum. This explains the bad results it obtains in this test. In contrast, the proposed VI achieves better classification results using all combinations of Machine Learning and Dimensionality Reduction techniques. As in the previous case, the best results are obtained for a combination of HMM and (PCA+LDA). The HMM had 8 states and no skips between states. The confusion matrices of each descriptor for the best case are shown in Figures 5 and 6. The actions identified with the suffix R correspond to those reflected. It can be clearly observed that most of the errors of Weinland's descriptor correspond to confusion between an action and its reflected counterpart.

Figure 5: Confusion matrix for the VI descriptor in the test with reflected actions. The results correspond to the combination of (PCA+LDA) and HMM.

Figure 6: Confusion matrix for Weinland's descriptor in the test with reflected actions. The results correspond to the combination of (PCA+LDA) and HMM.

As in the previous case, we have employed the Paired-Samples test to compare the results. Table 3 shows the performance statistics of both methods, along with the statistics calculated by the test. In this case, the null hypothesis is rejected at the 0.05 significance level, so we can affirm that the proposed method is significantly better than Weinland's in distinguishing actions performed with different limbs.

Method        Mean     Std. Deviation   Std. Error
Weinland      0.6549   0.0649           0.0229
VI            0.8293   0.0567           0.0200

              Mean      Std. Deviation   Std. Mean Error   t         DOF   Sig. (2-tailed)
Weinland-VI   -0.1743   0.0315           0.0111            -15.622   7     0.00

Table 3: Statistics of the distributions and results of the Paired-Samples Test using the database of reflected actions. See text for discussion.
3.3 Evaluation of the computational cost

The experiments conducted so far show that the proposed descriptor outperforms Weinland's method in several situations while using far fewer features. The results also suggest that a combination of (PCA+LDA) provides the best performance. LDA computation is not a time-consuming process, since the number of features obtained after PCA is small. However, calculating the PCA eigenvectors and eigenvalues, as well as projecting the patterns onto the main components, are computationally demanding processes.

This section aims to analyze the benefit of using the proposed VI descriptor in terms of computing time. To that end, we have measured the times needed to calculate the most relevant phases of the action recognition process for the two descriptors; the results are presented in Table 5. The first column indicates the descriptor employed, and the second column indicates the time required to calculate it from a single voxelset. The third column indicates the average time required to calculate the PCA eigenvectors and eigenvalues for the MHVs of all the train patterns in the IXMAS database (i.e., a total of 297 patterns). Finally, the last column indicates the amount of time required to reduce a single voxelset to its main components. The tests have been performed on an Intel Quad Core at 2.66 GHz with 4 GB of RAM, running Linux, and the routines employed for computing PCA and the Fourier transform are those provided by the OpenCV library [32].

Descriptor   Desc. Comp.    PCA-Comp.   PCA-Reduct.
VI           4.6·10⁻³ s     3.7 s       1.04·10⁻² s
Weinland     1.1·10⁻² s     37.5 s      1.14·10⁻¹ s

Table 5: Times employed in computing and reducing data using PCA.

Note that the calculation of the VI requires half the time required by Weinland's descriptor. In addition, the computing times required for the PCA process are reduced by one order of magnitude. We propose that this reduction in the computational cost makes the proposed descriptor more suitable for a wide variety of applications in which time is a crucial aspect.

4 Conclusions

This paper proposes a new descriptor, the VI, for three-dimensional action recognition. Our descriptor (which is invariant under rotation, scaling and translation) is created by integrating voxel data over a set of planes that maximize the discriminability of actions. The main advantage of our descriptor is that it significantly reduces the amount of data of three-dimensional representations yet preserves the most important information. As a consequence, the action recognition process is greatly sped up, while very high success rates are maintained. In addition, the proposed descriptor can distinguish between reflected actions, which was not possible with some of the previous approaches.

The VI is tested on the IXMAS dataset and compared to the descriptor proposed by Weinland et al. [9]. The experiments evaluate the performance of the descriptors using several Dimensionality Reduction and Machine Learning techniques, so as to determine the most appropriate combination. The results show that: i) the proposed descriptor obtains performance similar to Weinland's descriptor on the original IXMAS dataset, and better performance on mirrored actions; ii) the combination of (PCA+LDA) and HMM obtains the best classification results; and iii) the proposed descriptor reduces the amount of data to be manipulated by one order of magnitude compared to Weinland's approach. As a consequence, the time employed to reduce the dimensionality of our feature vector is also reduced by an order of magnitude.

Acknowledgements: This work was developed with the support of the Research Project TIN2010-18119, financed by the Science and Technology Ministry of Spain.
References

1. Ronald Poppe. A survey on vision-based human action recognition. Image and Vision Computing, 2009.
2. Pavan Turaga, Rama Chellappa, V. S. Subrahmanian, and Octavian Udrea. Machine Recognition of Human Activities: A Survey. IEEE Transactions on Circuits and Systems for Video Technology, 18(11):1473-1488, 2008.
3. Aaron F. Bobick and James W. Davis. The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23:257-267, 2001.
4. Nazlı İkizler and Pınar Duygulu. Histogram of oriented rectangles: A new pose descriptor for human action recognition. Image Vision Comput., 27(10):1515-1526, 2009.
5. Bhaskar Chakraborty, Ognjen Rudovic, and Jordi Gonzalez. View-invariant human-body detection with extension to human action recognition using component-wise HMM of body parts. In 2008 8th IEEE International Conference on Automatic Face & Gesture Recognition, pages 1-6. IEEE, 2008.
6. Ho-Kuen Shin, Sang-Woong Lee, and Seong-Whan Lee. Real-time gesture recognition using 3D motion history model. In ICIC (1), pages 888-898, 2005.
7. Myung-Cheol Roh, Ho-Keun Shin, and Seong-Whan Lee. View-independent human action recognition with volume motion template on single stereo camera. Pattern Recognition Letters, In Press, Corrected Proof, 2009.
8. R. Muñoz-Salinas, R. Medina-Carnicer, F. J. Madrid-Cuevas, and A. Carmona-Poyato. Depth silhouettes for gesture recognition. Pattern Recognition Letters, 29:319-329, 2008.
9. Daniel Weinland, Remi Ronfard, and Edmond Boyer. Free viewpoint action recognition using motion history volumes. Comput. Vis. Image Underst., 104(2):249-257, 2006.
10. Yuedong Yang, Aimin Hao, and Qinping Zhao. View-invariant action recognition using interest points. In International Multimedia Conference, 2008.
11. Srikanth Cherla, Kaustubh Kulkarni, Amit Kale, and V. Ramasubramanian. Towards fast, view-invariant human action recognition. In 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pages 1-8. IEEE, 2008.
12. Yan Pingkun, Saad M. Khan, and Mubarak Shah. Learning 4D action feature models for arbitrary view action recognition. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1-7. IEEE, 2008.
13. Xiaofei Ji and Honghai Liu. Advances in View-Invariant Human Motion Analysis: A Review. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 40(1):13-24, 2010.
14. B. Peng, G. Qian, and S. Rajko. View-Invariant Full-Body Gesture Recognition via Multilinear Analysis of Voxel Data. In ICDSC, 2009.
15. Ismail Haritaoglu, David Harwood, and Larry S. Davis. W4: Real-time surveillance of people and their activities. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:809-830, 2000.
16. Ismail Haritaoglu, Ross Cutler, David Harwood, and Larry S. Davis. Backpack: Detection of people carrying objects using silhouettes. In Computer Vision and Image Understanding, pages 102-107, 1999.
17. R. Cucchiara, C. Grana, A. Prati, and R. Vezzani. Probabilistic Posture Classification for Human-Behavior Analysis. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, 35(1):42-54, January 2005.
18. Chia-Feng Juang and Chia-Ming Chang. Human Body Posture Classification by a Neural Fuzzy Network and Home Care System Application. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, 37(6):984-994, November 2007.
19. Marcus A. Brubaker, David J. Fleet, and Aaron Hertzmann. Physics-Based Person Tracking Using the Anthropomorphic Walker. International Journal of Computer Vision, 87(1-2):140-155, March 2009.
20. Stefano Corazza, Lars Mündermann, Emiliano Gambaretto, Giancarlo Ferrigno, and Thomas P. Andriacchi. Markerless Motion Capture through Visual Hull, Articulated ICP and Subject Specific Model Generation. International Journal of Computer Vision, 87(1-2):156-169, March 2009.
21. Rui Li, Tai-Peng Tian, Stan Sclaroff, and Ming-Hsuan Yang. 3D human motion tracking with a coordinated mixture of factor analyzers. International Journal of Computer Vision, 87(1-2):170-190, 2009.
22. R. Souvenir and K. Parrigan. Viewpoint Manifolds for Action Recognition. EURASIP Journal on Image and Video Processing, 2009:1-13, 2009.
23. Fengjun Lv and R. Nevatia. Single view human action recognition using key pose matching and Viterbi path searching. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1-8, 2007.
24. Xiaofei Ji and Honghai Liu. View-Invariant Human Action Recognition Using Exemplar-Based Hidden Markov Models. Lecture Notes in Computer Science, 5928/2009:78-89, 2009.
25. Daniel Weinland, Edmond Boyer, and Remi Ronfard. Action Recognition from Arbitrary Views using 3D Exemplars. In 2007 IEEE 11th International Conference on Computer Vision, pages 1-7. IEEE, 2007.
26. A. Laurentini. The visual hull: a new tool for contour-based image understanding. In Proceedings of the Seventh Scandinavian Conference on Image Processing, pages 993-1002, 1991.
27. L. Díaz-Más, R. Muñoz-Salinas, F. J. Madrid-Cuevas, and R. Medina-Carnicer. Shape from silhouette using Dempster-Shafer theory. Pattern Recognition, In Press, Corrected Proof, 2010.
28. J. L. Landabaso, M. Pardàs, and J. Ramon Casas. Shape from inconsistent silhouette. Computer Vision and Image Understanding, 112:210-224, 2008.
29. Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 1st ed. 2006, corr. 2nd printing edition, October 2007.
30. David J. Sheskin. Handbook of Parametric and Nonparametric Statistical Procedures. Chapman & Hall/CRC, 4th edition, 2007.
31. J. L. Devore. Probability and Statistics for Engineering and the Sciences. Thomson Brooks/Cole, 7th edition, 2008.
32. Intel. OpenCV: Open source Computer Vision library. http://www.intel.com/research/mrl/opencv/.