Pattern Analysis & Applications manuscript No.
(will be inserted by the editor)

Three-dimensional action recognition using Volume Integrals

Luis Díaz-Más, Rafael Muñoz-Salinas, F. Madrid-Cuevas, R. Medina-Carnicer
Department of Computing and Numerical Analysis, University of Córdoba, 14071 Córdoba (Spain)
e-mail: {i22dimal,rmsalinas,ma1macuf,rmedina}@uco.es

The date of receipt and acceptance will be inserted by the editor
Abstract

This work proposes the Volume Integral (VI) as a new descriptor for three-dimensional action recognition. The descriptor transforms the actor's volumetric information into a two-dimensional representation by projecting the voxel data to a set of planes that maximize the discrimination of actions. Our descriptor significantly reduces the amount of data of the three-dimensional representations yet preserves the most important information. Therefore, the action recognition process is greatly sped up, and very high success rates are obtained. The method proposed is therefore especially appropriate for applications in which limitations of computing power and space are significant aspects to consider, such as real-time applications or mobile devices.

This paper tests the VI using several Dimensionality Reduction techniques (namely PCA, 2D-PCA and LDA) and different Machine Learning approaches (namely Clustering, SVM and HMM) so as to determine the best combination of these for the action recognition task. Experiments conducted on the public IXMAS dataset show that the VI compares favorably to state-of-the-art descriptors both in terms of classification rates and computing times.

Key words: Action recognition, View invariance, Multi-camera, Motion descriptor.

1 Introduction

Action recognition has become an important research topic in several fields, such as video analysis, video surveillance, and human-computer interaction, thus attracting a great deal of attention in recent years [1,2]. The problem has been addressed using different approaches, both in the 2D and 3D domains, with either single monocular cameras [3,4,5], single stereo cameras [6,7,8] or multiple monocular cameras [9,10,11,12]. However, developing robust methods of recognition independent of the view employed remains an unsolved problem [13].

Action recognition approaches can be roughly divided into two main categories: holistic and sequential. Holistic approaches consider the whole action as a shape varying in time and space, which can be represented by a single feature vector. Then, inferences can be drawn by different methods such as Clustering, Neural Networks or SVM. In contrast, sequential approaches consider the action as a series of individual poses, so that a different feature vector is extracted for each frame. Then, inferences can be drawn by models such as HMM or Conditional Random Fields, which capture the intrinsic temporal structure of the action as a sequence of state transitions.

While most research efforts have been focused on recognition from arbitrary single viewpoints, when multiple views are available during the recognition stage, three-dimensional information can be exploited to improve the success of recognition. The works of Weinland et al. [9] and Qian et al. [14] constitute the two most notable examples of three-dimensional viewpoint-invariant action recognition in the holistic and sequential approaches, respectively. In the first work, the authors propose the MHV as a method for compressing the voxels' occupancy of a whole action into a single voxelset. Then, the spectrum of the Fourier transform is obtained, which is rotation invariant, and the result is reduced using PCA. Finally, recognition is performed using a Clustering method. In the second work, the voxel data of each frame is reduced using Multilinear Analysis, and Hidden Markov Models (HMM) are used for inference.

In both cases, manipulating the large amount of information of three-dimensional voxel representations poses a major problem. As a consequence, standard Dimensionality Reduction techniques must be applied prior to the use of Machine Learning approaches. However, a direct application of Dimensionality Reduction techniques to voxel data results in sub-optimal strategies, both in terms of computational efficiency and recognition performance. The reason is that they require the use of large matrices that are difficult to manipulate and determine accurately, given the small number of training examples frequently available. Therefore, a more compact representation of voxel data would be preferable.
This work proposes the use of the Volume Integral (VI), an intermediate representation of voxel data that can be used before applying standard Dimensionality Reduction techniques in both holistic and sequential approaches. Our approach can be seen as the three-dimensional extension of the classical silhouette projection histograms, a very popular method for two-dimensional action recognition [15,16,17,18]. The VI reduces voxel data with a minimum loss of information by projecting it onto a set of planes that maximize the discrimination of the actions. Our descriptor allows a reduction of data by one and three orders of magnitude, respectively, compared to Weinland's and Qian's approaches. Consequently, Dimensionality Reduction techniques can be applied more efficiently without affecting classification performance. Another remarkable property of the VI is that, unlike Weinland's descriptor, it is sensitive to reflection. Thus, it is able to differentiate gestures performed with different limbs, which is of interest in certain applications.

Our descriptor has been evaluated on the public Inria Xmas Motion Acquisition Sequences (IXMAS) dataset to compare it with Weinland's and Qian's works, which employ the same dataset. The VI has been applied using several Dimensionality Reduction techniques and different Machine Learning approaches (both sequential and holistic) to determine the best strategy for the action recognition task. The results show that the proposed descriptor improves on the previous work in terms of both computing times and classification rates.

The remainder of this paper is structured as follows: The rest of this section is devoted to related work in the field of action recognition. Section 2 defines the basis of three-dimensional action representation approaches, including the definition of our proposal. Sect. 3 presents the experiments carried out, while Sect. 4 draws conclusions.

1.1 Related work and proposed contribution

View-invariant action recognition approaches can be broadly divided into two main categories: model-based and view-based. The former rely on a parametric model of the person (normally comprised of a set of connected joints in a tree structure), which is matched against the input observations [19,20,21]. The latter, however, rely on a set of observations from which a set of features are learned and employed for recognition. Our work is related to the latter category, which has two main sub-categories: holistic and sequential.

The work of Haritaoglu et al. is relevant in the field of human action recognition. They showed that silhouette projection histograms are a very simple yet effective representation mechanism [15,16,17,18]. Their main disadvantage is that the results depend strongly on the view. Thus, a great part of the effort on view-invariant gesture recognition has been focused on recognizing actions observed from single arbitrary viewpoints, using models acquired from either single or multiple views. One of the earliest holistic approaches is the Motion-History Image (MHI), proposed by Bobick et al. [3]. In their work, temporal templates are created by analyzing movement information, and gestures are identified by measuring distances to templates previously stored from multiple views. Later, Yilmaz and Shah proposed the use of spatio-temporal volumes (STVs), which are automatically generated for actions observed from any viewing direction using a graph-theoretic approach. STVs were later extended to multiple views by Pingkun et al. [12]. They propose the use of a 4D action feature model (4D-AFM) for recognizing actions from arbitrary viewpoints. The 4D-AFM is created by concatenating in time a sequence of reconstructed Visual Hulls (VHs) and then computing the differential geometric properties of the STVs. Also, Souvenir and Parrigan [22] propose a manifold learning-based framework which is tested on MHI and on the R Transform Surface Motion Descriptor.

When a sequential approach is selected, HMM are most frequently employed. Lv and Nevatia [23] propose an example-based action recognition method by the use of Action Nets. The action nets are automatically generated graph models that contain the 2D representation of one view of 3D key poses, in which links represent transitions between key poses. Inspired by that work, Xiaofei and Liu [24] propose a simpler approach, in which actions are first clustered in the 3D space and then projected onto virtual cameras. On the other hand, Weinland [25] proposes the use of a wrapper approach to determine the key exemplars; Chakraborty et al. [5] propose a different approach, in which individual classifiers are trained for specific body parts seen from specific viewpoints. Then, actions are recognized by a component-wise HMM that takes the classifier results as input.

In any case, recognition from a single monocular view suffers from ambiguities produced by projecting on two-dimensional images; e.g., it is difficult to determine if a person is pointing towards the camera or away from it. Thus, some authors have proposed the use of stereo vision. Shin et al. [6] were the first to extend MHIs to 3D, using stereo input sequences for generating Motion History Models (MHMs). Later, Muñoz-Salinas et al. [8] presented stereo-based extensions to several monocular approaches, showing that using stereo information leads to better results in all cases. Recently, Roh et al. [7] proposed an extension of MHIs to three-dimensional space called the Volume Motion Template (VMT). It is obtained by projecting stereo information to a 2D orthogonal plane that maximizes the discrimination of the action.
Despite the great attention that has been paid to the recognition of actions from a single viewpoint, there are applications in which multiple views are available during the recognition phase. These approaches rely on voxel data, a grid-based division of the 3D space into cubes of the same size that can be either occupied or empty. The set of voxels occupied by an actor forms the so-called Visual Hull (VH), which is a rich description of the actor's shape. Although 3D approaches can yield more descriptive models and thus achieve higher success rates, few view-based approaches have been proposed to exploit this fact.

The work of Weinland et al. [9] is one of them. The authors propose the Motion History Volume (MHV), a natural 3D extension of MHIs, which is a representation of the sequence of VHs that form an action. The MHV is processed using the Fourier transform in order to obtain its spectrum, which is invariant under rotation. Finally, the spectrum is reduced using Principal Component Analysis (PCA), and inference is performed using Clustering. Nonetheless, their approach has two main drawbacks. First, the number of features that result from the Fourier spectrum extraction is very high. After taking advantage of spectrum symmetry, the number of features to be reduced with PCA is (nθ × nz × nr)/2, where nθ, nz and nr are the numbers of subdivisions on each axis. As a consequence, the Dimensionality Reduction process is slow, and in some cases, the number of observations is insufficient to obtain reliable covariance matrices. Second, a side effect of using the Fourier spectrum is that the resulting descriptor is invariant under reflection. Thus, gestures performed with different limbs produce similar descriptors, which might be a problem in certain applications.

In contrast, the work of Qian et al. [14] presents a sequential approach based on HMM. It consists of reducing voxel data by pose tensor decomposition using the High Order Singular Value Decomposition method, which is a generalization of the singular value decomposition method for matrices of arbitrary dimensions. Although their method improves on the results obtained by Weinland's, it suffers from several weaknesses. First, it requires a discretization of all possible body orientations so as to create tensors in each direction. Second, and partially as a consequence of the former, the pose tensor requires a large amount of memory: it requires nx × ny × nz × r × n features, where nx, ny and nz denote the number of subdivisions in each axis, the parameter r represents the number of possible body orientations, and n is the number of key poses (these are employed to calculate the core tensor). These values are set to nx = ny = nz = 32, r = 16 and n = 25 in their work. We consider that the large amount of information required by this method makes it inappropriate for many applications and that the use of lighter methods would be preferable. Our comparison to this method will be based on the results reported by the authors.

2 Action representation

This section explains the basis of the methods employed for action representation. The first two subsections explain the techniques employed for representing 3D actions. First, Sect. 2.1 explains the concept of the VH and how it is created using the Shape-from-Silhouette (SfS) algorithm. Whereas sequential approaches use the VH generated at each frame for recognition, holistic approaches use the MHV, which compresses a set of frames into a single voxelset (Sect. 2.2). Once a suitable 3D representation of the actions is obtained, the information is transformed so as to reduce the amount of data and to achieve rotation invariance. Section 2.3 explains Weinland's approach, while Sect. 2.4 describes our proposal, the VI.

2.1 Visual Hull extraction

Let us assume that the monitored area can be divided into cubes of the same volume (called voxels) denoted by

V = {v^i | i = (x^i, y^i, z^i)},    (1)

where i = (x^i, y^i, z^i) represents the voxel's center in Cartesian coordinates and v^i ∈ {0, 1} depending on whether the voxel is occupied or empty.

The VH of an actor consists of all the occupied voxels that belong to him, and it can be calculated using the well-known SfS algorithm [26], although more recent approaches can also be employed [27,28]. For that purpose, we require a set of cameras surrounding the actor and placed at known locations determined by calibration. In addition, a background subtraction method is required to obtain the actor's silhouettes.

The traditional SfS method examines the projections of the voxels in the foreground images in order to determine whether or not they belong to the shapes in question. This is achieved by means of a projection test. Each voxel is projected in all the foreground images, and if its projection lies completely inside a silhouette in all the foreground images, then it is considered occupied. However, if the voxel projects to a background region in any of the images, it is considered empty. Finally, if the voxel projects partially onto a foreground region, it is considered to belong to a border, and a decision must be made. In the end, the result is a Boolean decision {0, 1} indicating whether the region of the space represented by the voxel is empty or occupied.

Figure 1(a) shows the VHs obtained for three instants of time in one of the kick sequences of the IXMAS dataset.
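As a rough illustration of this projection test, the following sketch (ours, not the authors' code) marks a voxel as occupied only if its center projects inside the silhouette in every view; the camera matrices and masks are hypothetical inputs, and the border handling described above is simplified to a single center-point test.

```python
import numpy as np

def shape_from_silhouette(voxel_centers, cameras, silhouettes):
    """Naive SfS projection test: a voxel is kept only if its center
    projects inside the silhouette in every foreground image.

    voxel_centers: (N, 3) voxel centers in world coordinates.
    cameras:       list of 3x4 projection matrices (assumed calibrated).
    silhouettes:   list of binary foreground masks, one per camera.
    """
    homog = np.hstack([voxel_centers, np.ones((len(voxel_centers), 1))])
    occupied = np.ones(len(voxel_centers), dtype=bool)
    for P, mask in zip(cameras, silhouettes):
        pix = homog @ P.T
        valid = pix[:, 2] > 0                      # in front of the camera
        u = np.zeros(len(pix), dtype=int)
        v = np.zeros(len(pix), dtype=int)
        u[valid] = (pix[valid, 0] / pix[valid, 2]).astype(int)
        v[valid] = (pix[valid, 1] / pix[valid, 2]).astype(int)
        h, w = mask.shape
        inside = valid & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        fg = np.zeros(len(pix), dtype=bool)
        fg[inside] = mask[v[inside], u[inside]] > 0
        occupied &= fg                             # background in any view -> empty
    return occupied
```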
Figure 1: (a) Set of Visual Hulls of the kick action. (b) Volume Integrals of the actions' Visual Hulls. (c) MHV of the kick action. (d) Volume Integral of the MHV.
2.2 Motion History Volumes

The Motion History Volumes compress a set of VHs into a single probabilistic representation of the voxels' occupancy. Let us define v^i_t as the value of voxel v^i at time t. Then the MHV function is defined by

V_τ(i,t) = τ                          if v^i_t = 1,
V_τ(i,t) = max(0, V_τ(i,t−1) − 1)     otherwise,    (2)

where τ is the duration, in number of frames, of the action represented. At instant t, V_τ(i,t) has values close to τ if the corresponding voxel has been recently observed as occupied. Values of V_τ(i,t) near 0 indicate that the voxel has not been occupied recently, and the value 0 indicates that it has not been observed as occupied in any of the sequence frames.

In order to perform a reliable comparison between the MHVs, they must be normalized. Assuming that the start and end time of the action is known, the MHV is normalized as:

V(i) = V_τ(i, τ) / τ.    (3)

Invariance to translation and scale is achieved by calculating the means µx, µy, µz and variances σx, σy, σz of the non-empty voxels in the MHV and then shifting and scaling it so that µx = µy = µz = 0 and σx = σy = σz = 1.

Figure 1(c) shows the MHV of the action shown in Figure 1(a). Red colors represent values near 1, while blue colors correspond to values close to 0. Thus, more recently occupied voxels correspond to redder areas.

The main advantage of the MHV is that it compresses a whole sequence into a single voxelset. However, it has two main drawbacks. First, it requires knowledge of the start and end of the sequence. Second, the speed at which the action is performed affects the MHV obtained. On the other hand, sequential approaches are in theory less sensitive to that problem, because they are capable of modeling the intrinsic dynamics of the action as state transitions.
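For illustration, Eqs. (2) and (3) can be implemented with a simple recurrence over the frame sequence; the sketch below is our own reading of the definitions, not the authors' implementation.

```python
import numpy as np

def motion_history_volume(vh_sequence):
    """Accumulate a sequence of binary voxelsets (Eq. 2) and
    normalize by the sequence length tau (Eq. 3).

    vh_sequence: iterable of 3D {0, 1} arrays, one Visual Hull per frame.
    """
    frames = list(vh_sequence)
    tau = len(frames)
    V = np.zeros_like(frames[0], dtype=float)
    for vh in frames:
        # occupied voxels are reset to tau; the rest decay by one per frame
        V = np.where(vh == 1, tau, np.maximum(0.0, V - 1.0))
    return V / tau  # values in [0, 1]; recently occupied voxels close to 1
```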
2.3 Weinland's descriptor

Although the VHs and MHV are invariant under scaling and translation, they are not invariant under rotation. The descriptor proposed by Weinland et al. [9] achieves rotation invariance by using the Fourier transform (FT), taking into account the Fourier shift theorem. This states that a function f0(x) and its translated counterpart ft(x) = f0(x − x0) differ only by a phase modulation in the Fourier space:

Ft(k) = F0(k) e^{−j2πkx0}.

In this descriptor, voxels are expressed in a cylindrical coordinate system (see Fig. 2), defined as follows:

C(r, θ, z) = V(r cos(θ), r sin(θ), z).

Thus, rotations around the vertical z-axis result in cyclical translation shifts:

V(x cos(θ0) + y sin(θ0), −x sin(θ0) + y cos(θ0), z) → C(r, θ + θ0, z).

Hence, rotation invariance along θ for each pair of values (r, z) is achieved using the Fourier spectrum values |C(r, kθ, z)| of the 1D Fourier transform

C(r, kθ, z) = ∫_{−π}^{π} C(r, θ, z) e^{−j2πkθθ} dθ.    (4)

The descriptor is then defined as the vector

W = {w_{z,r} | z = 1, ..., nz, r = 1, ..., nr},    (5)

where

w_{z,r} = {|C(r, kθ_0, z)|, ..., |C(r, kθ_{nθ}, z)|}.    (6)

The parameters nr, nθ and nz represent the number of divisions employed in the r-, θ- and z-axes, respectively.

Figure 2: Representation of Weinland's descriptor. Fourier transforms over θ are computed for (r, z) pairs, forming the feature vector.

Due to the trivial ambiguity of 1D-Fourier magnitudes with respect to the reversal of the signal, motions that are symmetric with respect to the z-axis (i.e., identical actions performed with different limbs) result in the same motion descriptors. Invariance under reflection can be considered either as a loss of information or as a useful property, depending on the final application. In any case, considering the symmetry of the Fourier spectrum, the total number of features of the descriptor is (nθ × nz × nr)/2, which is very large.

Although in his original work Weinland applied the descriptor only to reduce MHVs, note that the descriptor can also be employed to reduce individual VHs. Actually, this work demonstrates (in the experimental section) that Weinland's descriptor obtains better results when it is employed in combination with a sequential approach (HMM) than when the MHV is employed.
2.4 Volume Integral

As can be observed, the amount of information obtained in a voxel-based representation is very high. Hence, it is desirable to reduce it before applying Machine Learning techniques. To that end, we propose the Volume Integral, which is an alternative representation of a voxelset that can be applied both to VHs and MHVs. The idea is to project the three-dimensional information to a set of planes that maximize the action's discriminability.

For the calculus of the VI, it is assumed that the voxelset is invariant under translation and scaling. This is achieved using the transformation procedure explained above for the MHV (i.e., calculating the means and variances of the voxelset and then shifting and rescaling it so that µx = µy = µz = 0 and σx = σy = σz = 1). Afterwards, rotation invariance is obtained by finding the angle that aligns two identical voxelsets. The actor's main direction is obtained, and the voxelset is rotated accordingly. Assuming that people stand while performing actions, the angle search can be restricted to the x−z plane. Thus, the actor's main direction is defined as the vector from the center of the volume to the farthest occupied voxel.

The resulting voxelset is then integrated over the three axes in pairs, so that the 3D information is represented by three 2D images with a minimum loss of information. Let us define the integrals of the voxelset as

Pz(x, y) = ( Σ_{i=1}^{nz} V(x, y, z^i) ) / ( ε + Σ_{i=1}^{nz} δ(V(x, y, z^i)) ),    (7)

Py(x, z) = ( Σ_{i=1}^{ny} V(x, y^i, z) ) / ( ε + Σ_{i=1}^{ny} δ(V(x, y^i, z)) ),    (8)

Px(y, z) = ( Σ_{i=1}^{nx} V(x^i, y, z) ) / ( ε + Σ_{i=1}^{nx} δ(V(x^i, y, z)) ).    (9)

The function V(x, y, z) is the value of the voxel with center (x, y, z). When calculating the VI from a VH, V(x, y, z) represents the voxel's state, i.e., V(x, y, z) ∈ {0, 1}, indicating whether the voxel is empty or occupied. However, when the VI represents a MHV, V(x, y, z) ∈ [0, 1] is the continuous value given by Eq. 3. The function δ is 0 if the voxel's occupancy is 0, and 1 otherwise. Therefore, the integrals are normalized by dividing by the number of non-empty voxels plus ε, a small value that avoids dividing by zero in case all voxels are empty.

Figure 3 presents the proposed method as an algorithm, with the aim of clarifying the above explanation. Note that the integral is normalized by dividing by the number of non-empty voxels instead of by the total number of subdivisions (nx, ny or nz). We have experimentally found that this approach yields better results, especially when dealing with MHVs. The reason is that, for regions such as limbs, the sum of the occupancy is very low, and dividing by the non-empty voxels better preserves the information for these parts.

Figures 1(b) and 1(d) show the VIs corresponding to the voxelsets of Figs. 1(a) and 1(c). It can be observed that our normalization approach produces binary silhouettes when applied to VHs, but it preserves the most essential spatio-temporal properties when applied to MHVs.
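Since the voxelset is stored as a 3D array, Eqs. (7)-(9) reduce to sums along each axis. The sketch below is a minimal illustration assuming a cubic, already-aligned voxelset (steps 1-2 of Fig. 3 applied beforehand), with eps standing for the small constant ε mentioned above.

```python
import numpy as np

def volume_integral(V, eps=1e-6):
    """Project a (translation/scale/rotation-normalized) voxelset onto
    the xy, xz and yz planes, normalizing by the number of non-empty
    voxels along each line of sight (Eqs. 7-9).

    V: cubic 3D array indexed [x, y, z]; binary for a VH, or
       values in [0, 1] for an MHV (Eq. 3).
    """
    delta = (V > 0).astype(float)                    # delta(v): 1 if non-empty
    Pz = V.sum(axis=2) / (eps + delta.sum(axis=2))   # integrate over z -> Pz(x, y)
    Py = V.sum(axis=1) / (eps + delta.sum(axis=1))   # integrate over y -> Py(x, z)
    Px = V.sum(axis=0) / (eps + delta.sum(axis=0))   # integrate over x -> Px(y, z)
    # concatenate the three projections into a single 2D feature matrix
    # (assumes a cubic grid, e.g. 64x64x64, so the shapes agree)
    return np.concatenate([Pz, Py, Px], axis=1)
```

Note that, for a binary VH, numerator and denominator coincide on non-empty lines of sight, so the projections are binary silhouettes, as observed in Fig. 1(b).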
Finally, a unique descriptor is formed by concatenating all the features in Pz(x, y), Py(x, z) and Px(y, z). When compared with Weinland's descriptor, the VI achieves a substantial reduction of the data, because the final feature vector contains only nx × ny + nx × nz + ny × nz elements. For instance, for a voxelset of resolution 64 × 64 × 64 (as employed in our experiments), Weinland's descriptor produces 131072 features, while ours produces 12288, i.e., approximately ten times fewer features. Compared to Qian's descriptor, ours produces roughly three orders of magnitude fewer features. As we show in the experimental section, these are sufficient to provide better results than Weinland's descriptor. In addition, the VI produces different descriptors for reflected actions. Hence, similar actions performed with different limbs can be distinguished.

Volume Integral creation
INPUT: voxelset V
OUTPUT: Volume Integral
1. Shift and scale the voxels in V so that the resulting volume has µx = µy = µz = 0 and σx = σy = σz = 1.
2. Rotate the voxels in V, finding the main direction as the vector from the center of the volume to the farthest occupied voxel.
3. Project the voxels onto the three planes xy, xz and yz so as to obtain the volume integrals:
   Pz(x, y) = ( Σ_{i=1}^{nz} V(x, y, z^i) ) / ( ε + Σ_{i=1}^{nz} δ(V(x, y, z^i)) )
   Py(x, z) = ( Σ_{i=1}^{ny} V(x, y^i, z) ) / ( ε + Σ_{i=1}^{ny} δ(V(x, y^i, z)) )
   Px(y, z) = ( Σ_{i=1}^{nx} V(x^i, y, z) ) / ( ε + Σ_{i=1}^{nx} δ(V(x^i, y, z)) )
4. Concatenate Pz(x, y), Py(x, z) and Px(y, z), creating a single 2D feature matrix representing the Volume Integral.

Figure 3: Algorithm that transforms a 3D volume into a Volume Integral.

3 Experimental results

This section presents the experiments conducted to evaluate the proposed descriptor. For testing purposes, the public IXMAS dataset has been employed. It is a well-known 3D dataset in the field of action recognition that contains 13 actions, each one performed three times by 12 different actors. The actors freely change their orientations in each acquisition, so that rotation invariance can be analyzed.

The goals of our experiments are four-fold. First, we seek to evaluate the performance of the VI using different combinations of Dimensionality Reduction techniques and Machine Learning approaches in order to find the most appropriate one. Second, we compare the discriminatory power of our descriptor to that of the one proposed by Weinland [9]. For that purpose, the descriptors are evaluated using the original dataset, and the results are presented in Sect. 3.1. Third, the tests evaluate the discriminatory power of the descriptors when reflected actions are considered, i.e., identical actions performed with different limbs. For that purpose, a total of 4 new actions are added to the original dataset by applying a reflection operator to some of the original actions. The results obtained are shown in Sect. 3.2. Finally, in Sect. 3.3, we evaluate the improvement in computational time achieved by our descriptor when compared to Weinland's.

We have employed the same evaluation methodology as in Ref. [9] so as to obtain comparable results. This means that: i) we have used 10 of the 12 actors in the dataset; ii) a leave-one-out cross-validation is applied, i.e., one actor is chosen for testing and the rest are used for training; iii) actions 11 (point) and 13 (throw over head) have not been included, because they are performed very differently by the actors; and iv) the space is discretized into 64 subdivisions for each axis. In addition, the VHs and ground-truth provided with the dataset have been employed.
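As an illustration of points i)-ii) of this protocol, the sketch below runs a leave-one-actor-out loop with an SVM classifier; the arrays and the use of scikit-learn are our own assumptions for the example, not part of the original experimental code.

```python
import numpy as np
from sklearn.svm import SVC

def leave_one_actor_out(descriptors, labels, actors):
    """Leave-one-out over actors: train on 9 actors, test on the 10th.

    descriptors: (N, D) feature matrix (e.g., flattened, reduced VIs).
    labels:      (N,) action labels.
    actors:      (N,) actor identifier for each sample.
    """
    rates = []
    for actor in np.unique(actors):
        test = actors == actor
        clf = SVC(kernel="rbf")                     # SVM-RBF, as in Table 1
        clf.fit(descriptors[~test], labels[~test])
        rates.append(clf.score(descriptors[test], labels[test]))
    return float(np.mean(rates))                    # average success rate
```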
3.1 Comparative evaluation in the IXMAS dataset

The first test compares our descriptor to Weinland's in the original IXMAS dataset using different combinations of Dimensionality Reduction and Machine Learning approaches. The results of the tests are shown in Table 1.

Mach. Learn.   Dim. Red.   Weinland   Volum. Integ.
Clust.         PCA         93.33      90.90
Clust.         2D-PCA      -          85.15
Clust.         PCA+LDA     92.42      90.90
SVM-RBF        PCA         80.90      90.00
SVM-RBF        2D-PCA      -          77.27
SVM-RBF        PCA+LDA     89.39      86.69
SVM-LIN        PCA         89.09      88.87
SVM-LIN        2D-PCA      -          86.66
SVM-LIN        PCA+LDA     86.69      88.48
HMM            PCA         71.21      76.06
HMM            2D-PCA      -          87.57
HMM            PCA+LDA     93.64      98.45

Table 1: Classification results in the original IXMAS dataset.

The first column indicates the Machine Learning approach employed, and the second column indicates the Dimensionality Reduction technique applied. In the first column, SVM-RBF stands for Support Vector Machines (SVM) using a radial basis function kernel, while SVM-LIN indicates that a linear kernel is employed. The third and fourth columns show the results obtained for the different descriptors. For the SVM-RBF, SVM-LIN and HMM methods, the results presented correspond to the best results after testing with different values of the methods' parameters. As previously explained, the Clustering, SVM-RBF and SVM-LIN methods are applied using the MHVs, whereas the HMM tests employ the individual VHs generated at each frame.
Several remarks must be made regarding the Dimensionality Reduction techniques. First, the Complete Two-dimensional Principal Component Analysis (2D-PCA) technique has been applied to our descriptor but not to Weinland's, since the technique requires the data to be a two-dimensional matrix. Different compression ratios have been tested, and the values in the table correspond to those that provided the best results. Second, to determine the number of PCA components, we have employed the minimum-error formulation suggested by Bishop [29]. For the VI, we retain, approximately, the first 123 principal components out of the 12288 features given by the descriptor. For Weinland's descriptor, we employed approximately 233 components out of the 131072 features. Finally, for the Linear Discriminant Analysis (LDA) method, we retained only the 10 most relevant features, i.e., the number of classes minus one.
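A minimal sketch of this (PCA+LDA) chain, using scikit-learn as one possible implementation, is given below; the component counts follow the figures quoted above, while the pipeline itself is our assumption rather than the authors' code.

```python
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# PCA first (about 123 components for the 12288-feature VI, following the
# minimum-error criterion of Bishop [29]), then LDA down to classes - 1.
reduction = make_pipeline(
    PCA(n_components=123),
    LinearDiscriminantAnalysis(n_components=10),  # 11 action classes - 1
)
# Usage (descriptors and labels are hypothetical training arrays):
# low_dim = reduction.fit_transform(descriptors, labels)
```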
The results obtained show that Weinland's descriptor obtains better results than ours when the Clustering and SVM-LIN methods are applied. However, the VI obtains better results when using the SVM-RBF and HMM methods. In general, the results show that the combination of HMM + (PCA+LDA) obtains the best results for both descriptors. In particular, our method obtains, in the best case, a 98.45% success rate. This result is obtained using an HMM with 12 states and no skips between states.

In order to compare both methods more formally, we have used a Paired-Samples test on the data in Table 1 (excluding the 2D-PCA results, because they are not available for both methods). For more information on statistical analysis, see [30,31]. The first two rows of Table 2 show the performance statistics of both methods, while the last row shows the statistics calculated by the Paired-Samples test.

Method        Mean     Std. Deviation   Std. Error
Weinland      0.8708   0.0765           0.0270
VI            0.8879   0.0621           0.0219

              Mean      Std. Deviation   Std. Mean Error   t        DOF   Sig. (2-tailed)
Weinland-VI   -0.0171   0.0422           0.0149            -1.145   7     0.290

Table 2: Statistics of the distributions and results of the Paired-Samples Test. See text for discussion.

As indicated in the table, the significance of the null hypothesis is 0.290 (following a two-tailed T-Distribution with 7 degrees of freedom (DOF)). Using a significance threshold of 0.05, the test indicates that there is not enough evidence to reject the null hypothesis. In other words, there is insufficient evidence to suggest that one method is better than the other in terms of performance in the dataset employed.
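The paired comparison can be reproduced directly from the Table 1 success rates (2D-PCA rows excluded); a sketch with SciPy follows, which yields the t and significance values reported in Table 2.

```python
from scipy import stats

# success rates from Table 1, excluding the 2D-PCA rows
weinland = [93.33, 92.42, 80.90, 89.39, 89.09, 86.69, 71.21, 93.64]
vi       = [90.90, 90.90, 90.00, 86.69, 88.87, 88.48, 76.06, 98.45]

t, p = stats.ttest_rel(weinland, vi)            # paired-samples t-test, 7 DOF
print(f"t = {t:.3f}, two-tailed p = {p:.3f}")   # approx. t = -1.145, p = 0.290
```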
Finally, we must indicate that the work of Qian et al. [14] reported a success rate of 94.59%. To allow replication of our experiments, we provide the complete set of VIs corresponding to the VHs and MHVs of the IXMAS actions, along with references to the libraries we have used. (Note to reviewers: The final location of this data has not yet been determined, since it might be located on the journal servers. However, the dataset is temporarily available at http://www.uco.es/users/i22dimal/files/data-vi.tar.bz2 for reviewing purposes.)

3.2 Comparative evaluation using reflected actions

The goal of this test is to evaluate the ability of our method to deal with reflected actions. For that purpose, a new dataset has been created by adding four new actions to the original IXMAS dataset. The new actions are created by applying a reflection operator to the actions check watch, wave hand, kick and punch. An example can be seen in Fig. 4, which shows the original VH of one of the sequences of the punch action, along with its reflected counterpart. It can be observed that the action is identical, except for the fact that it seems to be performed with the opposite arm.

Figure 4: Reflection of the punch action.
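The reflection operator itself amounts to mirroring the voxelset about a vertical plane; a minimal sketch follows, where the choice of horizontal axis is our assumption (the paper does not specify it).

```python
import numpy as np

def reflect_voxelset(V, axis=1):
    """Mirror a voxelset about a vertical plane, producing the
    'reflected' counterpart of an action frame (e.g., punch -> punch-R).

    V: 3D occupancy array indexed [x, y, z]; axis 0 or 1 selects the
    horizontal direction to flip (an assumption, not from the paper).
    """
    return np.flip(V, axis=axis)
```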
The results using the two descriptors are shown in Table 4. The parameters of the Machine Learning and Dimensionality Reduction techniques have been obtained as in the previous test.

Mach. Learn.   Dim. Red.   Weinland   Volum. Integ.
Clust.         PCA         70.00      84.88
Clust.         2D-PCA      -          80.66
Clust.         PCA+LDA     67.99      85.55
SVM-RBF        PCA         68.44      83.55
SVM-RBF        2D-PCA      -          70.66
SVM-RBF        PCA+LDA     70.22      84.88
SVM-LIN        PCA         64.22      78.44
SVM-LIN        2D-PCA      -          70.80
SVM-LIN        PCA+LDA     59.99      81.77
HMM            PCA         52.20      72.50
HMM            2D-PCA      -          79.20
HMM            PCA+LDA     70.90      91.90

Table 4: Classification results in the IXMAS dataset using reflected actions.

As previously explained, the descriptor proposed by Weinland is not capable of distinguishing between reflected actions because of the symmetry of the Fourier spectrum. This explains the bad results it obtains in this test. In contrast, the proposed VI achieves better classification results using all combinations of Machine Learning and Dimensionality Reduction techniques. As in the previous case, the best results are obtained for a combination of HMM and (PCA+LDA). The HMM had 8 states and no skips between states. The confusion matrices of each descriptor for the best case are shown in Figures 5 and 6. The actions identified with the suffix R correspond to those reflected. It can be clearly observed that most of the errors of Weinland's descriptor correspond to confusion between an action and its reflected counterpart.

Figure 5: Confusion matrix for the VI descriptor in the test with reflected actions. The results correspond to the combination of (PCA+LDA) and HMM.

Figure 6: Confusion matrix for Weinland's descriptor in the test with reflected actions. The results correspond to the combination of (PCA+LDA) and HMM.

As in the previous case, we have employed the Paired-Samples test to compare the results. Table 3 shows the performance statistics of both methods, along with the statistics calculated by the test. In this case, the null hypothesis is rejected at the 0.05 significance level, so we can affirm that the proposed method is significantly better than Weinland's in distinguishing actions performed with different limbs.

Method        Mean     Std. Deviation   Std. Error
Weinland      0.6549   0.0649           0.0229
VI            0.8293   0.0567           0.0200

              Mean      Std. Deviation   Std. Mean Error   t         DOF   Sig. (2-tailed)
Weinland-VI   -0.1743   0.0315           0.0111            -15.622   7     0.00

Table 3: Statistics of the distributions and results of the Paired-Samples Test using the database of reflected actions. See text for discussion.
3.3 Evaluation of the computational cost

The experiments conducted so far show that the proposed descriptor outperforms Weinland's method in several situations while using far fewer features. The results also suggest that a combination of (PCA+LDA) provides the best performance. LDA computation is not a time-consuming process, since the number of features obtained after PCA is small. However, calculating the PCA eigenvectors and eigenvalues, as well as projecting the patterns onto the main components, are computationally demanding processes.

This section aims to analyze the benefit of using the proposed VI descriptor in terms of computing time. To that end, we have measured the times needed to calculate the most relevant phases of the action recognition process for the two descriptors; the results are presented in Table 5. The first column indicates the descriptor employed, and the second column indicates the time required to calculate it from a single voxelset. The third column indicates the average time required to calculate the PCA eigenvectors and eigenvalues for the MHVs of all the train patterns in the IXMAS database (i.e., a total of 297 patterns). Finally, the last column indicates the amount of time required to reduce a single voxelset to its main components. The tests have been performed on an Intel Quad Core at 2.66 GHz with 4 GB of RAM, running Linux, and the routines employed for computing PCA and the Fourier transform are those provided by the OpenCV library [32].

Descriptor   Desc. Comp.    PCA-Comp.   PCA-Reduct.
VI           4.6·10⁻³ s     3.7 s       1.04·10⁻² s
Weinland     1.1·10⁻² s     37.5 s      1.14·10⁻¹ s

Table 5: Times employed in computing and reducing data using PCA.

Note that the calculation of the VI requires half the time required by Weinland's descriptor. In addition, the computing times required for the PCA process are reduced by one order of magnitude. We propose that this reduction in the computational cost makes the proposed descriptor more suitable for a wide variety of applications in which time is a crucial aspect.

4 Conclusions

This paper proposes a new descriptor, the VI, for three-dimensional action recognition. Our descriptor (which is invariant under rotation, scaling and translation) is created by integrating voxel data over a set of planes that maximize the discriminability of actions. The main advantage of our descriptor is that it significantly reduces the amount of data of three-dimensional representations yet preserves the most important information. As a consequence, the action recognition process is greatly sped up, while very high success rates are maintained. In addition, the proposed descriptor can distinguish between reflected actions, which was not possible with some of the previous approaches.

The VI is tested on the IXMAS dataset and compared to the descriptor proposed by Weinland et al. [9]. The experiments evaluate the performance of the descriptors using several Dimensionality Reduction and Machine Learning techniques, so as to determine the most appropriate combination. The results show that: i) the proposed descriptor obtains performance similar to Weinland's descriptor on the original IXMAS dataset, and better performance on mirrored actions; ii) the combination of (PCA+LDA) and HMM obtains the best classification results; and iii) the proposed descriptor reduces the amount of data to be manipulated by one order of magnitude compared to Weinland's approach. As a consequence, the time employed to reduce the dimensionality of our feature vector is also reduced by an order of magnitude.

Acknowledgements: This work was developed with the support of the Research Project TIN2010-18119, financed by the Science and Technology Ministry of Spain.
References

1. Ronald Poppe. A survey on vision-based human action recognition. Image and Vision Computing, 2009.
2. Pavan Turaga, Rama Chellappa, V. S. Subrahmanian, and Octavian Udrea. Machine Recognition of Human Activities: A Survey. IEEE Transactions on Circuits and Systems for Video Technology, 18(11):1473-1488, 2008.
3. Aaron F. Bobick and James W. Davis. The recognition of human movement using temporal templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23:257-267, 2001.
4. Nazlı İkizler and Pınar Duygulu. Histogram of oriented rectangles: A new pose descriptor for human action recognition. Image Vision Comput., 27(10):1515-1526, 2009.
5. Bhaskar Chakraborty, Ognjen Rudovic, and Jordi Gonzalez. View-invariant human-body detection with extension to human action recognition using component-wise HMM of body parts. In 2008 8th IEEE International Conference on Automatic Face & Gesture Recognition, pages 1-6. IEEE, 2008.
6. Ho-Kuen Shin, Sang-Woong Lee, and Seong-Whan Lee. Real-time gesture recognition using 3D motion history model. In ICIC (1), pages 888-898, 2005.
7. Myung-Cheol Roh, Ho-Keun Shin, and Seong-Whan Lee. View-independent human action recognition with volume motion template on single stereo camera. Pattern Recognition Letters, In Press, Corrected Proof, 2009.
8. R. Muñoz-Salinas, R. Medina-Carnicer, F. J. Madrid-Cuevas, and A. Carmona-Poyato. Depth silhouettes for gesture recognition. Pattern Recognition Letters, 29:319-329, 2008.
9. Daniel Weinland, Remi Ronfard, and Edmond Boyer. Free viewpoint action recognition using motion history volumes. Comput. Vis. Image Underst., 104(2):249-257, 2006.
10. Yuedong Yang, Aimin Hao, and Qinping Zhao. View-invariant action recognition using interest points. In International Multimedia Conference, 2008.
11. Srikanth Cherla, Kaustubh Kulkarni, Amit Kale, and V. Ramasubramanian. Towards fast, view-invariant human action recognition. In 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pages 1-8. IEEE, 2008.
12. Yan Pingkun, Saad M. Khan, and Mubarak Shah. Learning 4D action feature models for arbitrary view action recognition. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1-7. IEEE, 2008.
13. Xiaofei Ji and Honghai Liu. Advances in View-Invariant Human Motion Analysis: A Review. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 40(1):13-24, 2010.
14. B. Peng, G. Qian, and S. Rajko. View-Invariant Full-Body Gesture Recognition via Multilinear Analysis of Voxel Data. In ICDSC, 2009.
15. Ismail Haritaoglu, David Harwood, and Larry S. Davis. W4: Real-time surveillance of people and their activities. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22:809-830, 2000.
16. Ismail Haritaoglu, Ross Cutler, David Harwood, and Larry S. Davis. Backpack: Detection of people carrying objects using silhouettes. In Computer Vision and Image Understanding, pages 102-107, 1999.
17. R. Cucchiara, C. Grana, A. Prati, and R. Vezzani. Probabilistic Posture Classification for Human-Behavior Analysis. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, 35(1):42-54, January 2005.
18. Chia-Feng Juang and Chia-Ming Chang. Human Body Posture Classification by a Neural Fuzzy Network and Home Care System Application. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, 37(6):984-994, November 2007.
19. Marcus A. Brubaker, David J. Fleet, and Aaron Hertzmann. Physics-Based Person Tracking Using the Anthropomorphic Walker. International Journal of Computer Vision, 87(1-2):140-155, March 2009.
20. Stefano Corazza, Lars Mündermann, Emiliano Gambaretto, Giancarlo Ferrigno, and Thomas P. Andriacchi. Markerless Motion Capture through Visual Hull, Articulated ICP and Subject Specific Model Generation. International Journal of Computer Vision, 87(1-2):156-169, March 2009.
21. Rui Li, Tai-Peng Tian, Stan Sclaroff, and Ming-Hsuan Yang. 3D human motion tracking with a coordinated mixture of factor analyzers. International Journal of Computer Vision, 87(1-2):170-190, 2009.
22. R. Souvenir and K. Parrigan. Viewpoint Manifolds for Action Recognition. EURASIP Journal on Image and Video Processing, 2009:1-13, 2009.
23. Fengjun Lv and R. Nevatia. Single view human action recognition using key pose matching and Viterbi path searching. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1-8, 2007.
24. Xiaofei Ji and Honghai Liu. View-Invariant Human Action Recognition Using Exemplar-Based Hidden Markov Models. Lecture Notes in Computer Science, 5928/2009:78-89, 2009.
25. Daniel Weinland, Edmond Boyer, and Remi Ronfard. Action Recognition from Arbitrary Views using 3D Exemplars. In 2007 IEEE 11th International Conference on Computer Vision, pages 1-7. IEEE, 2007.
26. A. Laurentini. The visual hull: a new tool for contour-based image understanding. In Proceedings of the Seventh Scandinavian Conference on Image Processing, pages 993-1002, 1991.
27. L. Díaz-Más, R. Muñoz-Salinas, F. J. Madrid-Cuevas, and R. Medina-Carnicer. Shape from silhouette using Dempster-Shafer theory. Pattern Recognition, In Press, Corrected Proof, 2010.
28. J. L. Landabaso, M. Pardàs, and J. Ramon Casas. Shape from inconsistent silhouette. Computer Vision and Image Understanding, 112:210-224, 2008.
29. Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 1st ed. 2006, corr. 2nd printing edition, October 2007.
30. David J. Sheskin. Handbook of Parametric and Nonparametric Statistical Procedures. Chapman & Hall/CRC, 4th edition, 2007.
31. J. L. Devore. Probability and Statistics for Engineering and the Sciences. Thomson Brooks/Cole, 7th edition, 2008.
32. Intel. OpenCV: Open source Computer Vision library. http://www.intel.com/research/mrl/opencv/.