Motion Recognition Based on Dynamic-Time Warping Method with Self-Organizing Incremental Neural Network

Shogo Okada, Dept. of Intelligence Science and Technology, Kyoto University, JAPAN (okada....s@i.kyoto-u.ac.jp)
Osamu Hasegawa, Imaging Science and Engineering Laboratory, Tokyo Institute of Technology, JAPAN ([email protected])
Abstract

This paper presents an approach (SOINN-DTW) for recognition of motion (gesture) that is based on the Self-Organizing Incremental Neural Network (SOINN) and Dynamic Time Warping (DTW). Using SOINN's functions of eliminating noise in the input data and representing the distribution of the input data, the SOINN-DTW method approximates the output distribution of each state in a self-organizing manner corresponding to the input data. The proposed SOINN-DTW method enhances the Stochastic Dynamic Time Warping method (Nakagawa, 1986). Results of experiments show that SOINN-DTW outperforms HMM, CRF, and HCRF on motion data.
978-1-4244-2175-6/08/$25.00 ©2008 IEEE

1 Introduction

A common approach to modeling sequential data is the Hidden Markov Model (HMM), a powerful generative model that includes hidden state structure. In fact, HMM is used for speech recognition [6] and for gesture (motion) recognition. In HMM, sequential data are modeled using a Markov model that has finite states, so we must choose and determine the model size (number of states) in advance. On the other hand, unlike speech, motions (gestures) have different time lengths corresponding to the kind of motion. Therefore, it is difficult to set the optimal number of states corresponding to each motion. In [5], Nakagawa proposed the Stochastic DTW method, which uses the advantages of both the DTW method and HMM. In Stochastic DTW, a probabilistic distance is used instead of the Euclidean distance (local distance), and a probabilistic path takes the place of a warping path. In addition, a Gaussian distribution is used for the output distribution. However, a Gaussian distribution cannot approximate all output distributions, because the output distribution changes according to the feature vector type and the number of training data.

In this paper, we propose a method that approximates the output distribution of each state according to the kind of feature vector and its number. In the proposed model, the output distribution of a state is approximated using a Self-Organizing Incremental Neural Network (SOINN), which can grow incrementally and accommodate input data from an online, non-stationary data distribution [7]. As [7] reported, the design of the network enables it both to represent the distribution of unsupervised data and to report a reasonable number of clusters. It can eliminate noise in the input data both dynamically and efficiently. Using SOINN's function of representing distributions, the output distribution is represented in a self-organizing manner. In addition, the number of states (the time-series length of the template sequence) is set automatically. Therefore, we can avoid the time-consuming choice of the number of states. We call this proposed model SOINN-DTW. The main contribution of this paper is the introduction of this approach, SOINN-DTW, which can approximate the output distribution of a state in a detailed and flexible manner according to the kind of feature vector. Experimental results show that this approach presents advantages for motion recognition. We describe an experiment that highlights how SOINN-DTW outperforms other related methods on human motion data.
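As background, the standard DTW recurrence with a Euclidean local distance — the quantity that Stochastic DTW replaces with a probabilistic one — can be sketched in a few lines. This is an illustrative sketch, not the implementation evaluated in this paper:

```python
import numpy as np

def dtw_distance(a, b):
    """Global DTW distance between two 1-D sequences, using the
    standard dynamic-programming recurrence with an absolute-difference
    local distance."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])          # local distance
            D[i, j] = cost + min(D[i - 1, j],        # vertical step
                                 D[i, j - 1],        # horizontal step
                                 D[i - 1, j - 1])    # diagonal step
    return D[n, m]
```

Because the warping path can stretch or compress time, `dtw_distance([1, 2, 3], [1, 2, 2, 3])` is 0 even though the sequences have different lengths, which is exactly the property that makes DTW attractive for motions of varying duration.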
2 Related Work
Recently, there has been increasing interest in using conditional random fields (CRF) [3] for learning sequences, including gestures and speech. The advantage of CRF over hidden Markov models is its conditional nature, resulting in the relaxation of the independence assumptions that HMM requires to ensure tractable inference.
In [8], Wang proposed a model for gesture recognition; it incorporates hidden state variables in a discriminative multi-class random field model (HCRF). According to the results of the head and arm gesture recognition experiments in [8], HCRF outperforms HMMs and CRF. We compare motion recognition accuracy with Stochastic DTW [5], HMM, CRF, and HCRF to verify the utility of the proposed SOINN-DTW method.
3 Approach

3.1 SOINN

We introduce the self-organizing incremental neural network (SOINN) [7], which is the foundation of the proposed method. Fundamentally, SOINN adopts a two-layer network. The training results of the first layer are used as the training set for the second layer. The goals of using SOINN are to realize unsupervised learning and to represent the input distribution. For those purposes, SOINN adopts two schemes, between-class insertion and within-class insertion, to insert new nodes and thereby realize incremental learning and topology representation. During training of the first layer of SOINN, between-class insertion is the main task; within-class insertion contributes little to inserting new nodes. Here we adopt only the first layer of SOINN for SOINN-DTW and delete the within-class insertion part to facilitate comprehension, leaving five user-determined parameters. We give only a brief outline of SOINN here; details of SOINN are described elsewhere [7].
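As a rough illustration of the insertion behavior described above, the following sketch grows a node set incrementally. The fixed distance threshold and learning rate are simplifications introduced here for exposition; the actual first-layer SOINN uses adaptive similarity thresholds, edges between nodes, and noise-node removal [7]:

```python
import numpy as np

def incremental_insertion(data, threshold=1.0, lr=0.1):
    """Simplified sketch of SOINN-style between-class insertion:
    an input far from every existing node becomes a new node;
    otherwise the nearest node is pulled toward the input.
    (Fixed threshold and learning rate are placeholders, not the
    adaptive rules of the real SOINN.)"""
    nodes = []
    for x in data:
        x = np.asarray(x, dtype=float)
        if not nodes:
            nodes.append(x)
            continue
        dists = [np.linalg.norm(x - n) for n in nodes]
        winner = int(np.argmin(dists))
        if dists[winner] > threshold:
            nodes.append(x)                            # between-class insertion
        else:
            nodes[winner] += lr * (x - nodes[winner])  # adapt the winner node
    return nodes
```

Fed two well-separated clusters, the sketch ends up with one node per cluster, which is the sense in which SOINN "reports a reasonable number of clusters" while absorbing noise around each node.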
3.2 SOINN-DTW

In SOINN-DTW, the global distance between training data is calculated using DTW. In addition, a template model is constructed based on DTW. Let N be the number of training data that belong to category C. We now explain the construction procedure of the template model from the N training data.

[STEP 1: Selection of standard data]

Standard data P* of the template model are selected from among the N training data belonging to category C. The standard data P* are determined using the following equation:

    P* = argmin_{P_m} Σ_n D(P_m, P_n),   {P_m, P_n} ∈ C.   (1)

In eq. (1), P_m and P_n denote training data that belong to category C, and D(P_m, P_n) denotes the global distance in symmetric DTW. T* denotes the time length of the standard data P*.

[STEP 2: Allocation of samples to each state]

Let p*_j be a sample of the standard data P* at time j, and let p^n_i be a sample of training data P_n (n ∈ C) at time i. After DTW of the standard data P* and the training data P_n, the optimal warping path (i = w^n_j, j = 1, 2, ..., T*) between p*_j and p^n_i is determined such that the global distance D(P*, P_n) is minimized. The sample p^n_i at time i is allocated to state (SOINN) j of the template model according to i = w^n_j. This allocation of samples is done from time 1 to time T*. [STEP 2] is executed for all training data (n ∈ C). As a result, the optimal paths w^n (n = 1, ..., N − 1) are determined, and the samples are allocated according to each w^n. We define Z_j as the set of samples allocated to state j (SOINN_j). Samples in Z_j are scarce when training data are scarce, so the set of samples from Z_j to Z_{j+L−1} is input to SOINN (state j) to prevent that problem. L denotes the number of segments and is a parameter of SOINN-DTW. Figure 1 represents the process of STEP 2.

Figure 1. Process of STEP 2.

[STEP 3: Learning by SOINN]

After Z*_j is input into SOINN and learned by SOINN, the number of node sets (clusters) that are connected by edges is output by SOINN. We estimate the output distribution of SOINN (state j) from the position vectors of these node sets.

3.3 Parameter estimation of output distribution

After learning by SOINN, the number of node sets (clusters) that are connected by edges is output by SOINN (Fig. 2). We define two kinds of probability density functions (pdfs) and calculate two likelihoods (a global likelihood and a local likelihood) from these pdfs.

Global likelihood: All nodes in state S_j are used for the calculation of the global likelihood. The node set in state S_j is approximated using a Gaussian distribution that has a full covariance matrix. We define the output probability from the Gaussian distribution as P_global(x_i|S_j).
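To make STEP 1 and STEP 2 concrete, the following sketch selects the standard data by eq. (1) and allocates training samples to template states along each warping path. It is an illustrative sketch with plain dynamic programming; the widening of Z_j over L segments and the SOINN training itself are omitted:

```python
import numpy as np

def dtw(a, b):
    """Symmetric DTW between two sequences of vectors; returns the
    global distance and the optimal warping path (illustration only)."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(np.asarray(a[i - 1], float)
                                  - np.asarray(b[j - 1], float))
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    # backtrack the optimal warping path from (n, m) to (1, 1)
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return D[n, m], path[::-1]

def build_template(training):
    """STEP 1: pick the standard sequence minimizing the summed DTW
    distance to all training data (eq. 1).
    STEP 2: allocate every training sample to the template state it
    warps onto, forming the sample sets Z_j."""
    sums = [sum(dtw(p, q)[0] for q in training) for p in training]
    std = training[int(np.argmin(sums))]
    states = [[] for _ in range(len(std))]   # Z_j, one set per state j
    for p in training:
        _, path = dtw(std, p)
        for j, i in path:                    # i = w_j: sample i -> state j
            states[j].append(p[i])
    return std, states
```

Each `states[j]` then plays the role of Z_j and would be handed to the SOINN of state j in STEP 3.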
We define the global likelihood as log(P_global(x_i|S_j)). The mean vector and the full covariance matrix are calculated using maximum likelihood estimation.

Figure 2. Two kinds of probability distribution formed with the results of SOINN (nodes and edges): the local likelihood is calculated from the nodes in an internal class, and the global likelihood is calculated from all nodes in SOINN.

Local likelihood: The local likelihood is calculated using the nodes in inner classes. Classes 1-3 in Fig. 2 represent inner classes. Nodes in an inner class are scarce. For that reason, the node set of an inner class is approximated not by a Gaussian distribution with a full covariance matrix but by a Gaussian kernel function. Let U_jk be inner class k in SOINN (S_j). Using a kernel function with the Parzen window method [1], the output probability P_local(x_i|U_jk) is estimated from all nodes in U_jk. The width parameter of the kernel function is (1/N_jk) Σ_l |x_l − x̄_jk|, where x̄_jk denotes the centroid of the nodes in U_jk and N_jk their number.

Using P_global(x_i|S_j) and P_local(x_i|U_jk), the likelihood C(x_i, S_j) is represented as

    C(x_i, S_j) = log( P_global(x_i|S_j) + Σ_{k=1}^{K} w_k P_local(x_i|U_jk) ),   (2)

where w_k denotes the weight of inner class U_jk (Σ_k w_k = 1). In eq. (2), n_all denotes the number of nodes in SOINN (S_j), and K denotes the number of inner classes in SOINN (S_j). The global distance between the template model and input data is calculated using the following recurrence formula:

    Q(i, j) = max { Q(i, j−1) + C(x_i, S_j),
                    Q(i−1, j−1) + 2 C(x_i, S_j),
                    Q(i−1, j) + C(x_i, S_j) }.   (3)

In eq. (3), C(x_i, S_j) denotes the likelihood. By calculating the global distance between the template model and the input data, the category to which the input data belong is decided.

4 Experiment

In this section, to evaluate the motion recognition performance of SOINN-DTW, we used a motion dataset for an experiment. In particular, we used moving images that directly captured human motion, using no device such as data gloves. The moving images were captured in four different rooms. The dataset used for the experiment was created by manually extracting only the motion parts from the moving images. The extracted motions are portrayed as M1-M7 in Fig. 3. The time lengths of the motions vary from 110 to 440 frames. In particular, M1 (M2) and M6 (M7) are motions that have similar trajectories, and the sequential data observed from these motions exhibit changes of timescale. In this experiment, we used a total of 175 motion data (7 motions x 25 data) observed from 5 performers. Motions have segmenting information (begin and end points). The dataset split is (training data, test data) = (15, 10) ((20, 5)): we used 5 data x 3 (4) performers for training data and 5 data x 2 (1) other performers for test data, and we performed 10 (5) experiments in a cross-validation manner.

The image processing method applied in this experiment is explained below. We extracted the change of direction of the body's centroid from the moving images. After smoothing each frame of the input image sequence, we calculated the difference between frames, converted the RGB values to luminance values, and converted the color images to binary images. We extracted a self-correlation feature [2] for the time-series direction. A (3 x 3) mask was applied to calculate the correlation feature, giving a nine-dimensional real-valued vector.

4.1 Comparative models

We compared SOINN-DTW with Stochastic DTW, HMM, CRF, and HCRF in terms of motion recognition accuracy.

Stochastic DTW: An asymmetric recurrence formula is used in Stochastic DTW. We implemented Stochastic DTW as shown in [5].

Hidden Markov Model: The HMM that we used was a left-to-right model based on mixed Gaussian probabilities with a full covariance matrix for each state. The model parameters are learned from the training data using the Baum-Welch algorithm. For HMM, we performed experiments and searched for the optimal number of states N_hmm and the optimal number of mixtures M_hmm (1 ≤ N_hmm ≤ 15, 1 ≤ M_hmm ≤ 3) such that the HMM has the best recognition performance.

CRF and HCRF Model: We trained a single CRF chain model in the manner described in [8]. During evaluation, we found the Viterbi path under the CRF model and assigned the sequence label based on the most frequently occurring motion label.

Table 1. Correct recognition rate in the motion recognition task [%] (TD15 (TD20) denotes the number of training data)

    TD15    TD20
    98.29
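The recurrence of eq. (3) and the final decision rule can be sketched as follows. The state likelihood C(x_i, S_j) is abstracted into a caller-supplied function rather than the SOINN-derived distributions of Section 3.3, and the boundary condition Q(0, 0) = 0 is an assumption of this sketch:

```python
import numpy as np

def sequence_score(obs, n_states, C):
    """Accumulated likelihood Q(T, J) of an input sequence against a
    template with n_states states, following the recurrence of eq. (3):
      Q(i, j) = max(Q(i, j-1)   +   C(x_i, S_j),
                    Q(i-1, j-1) + 2*C(x_i, S_j),
                    Q(i-1, j)   +   C(x_i, S_j)).
    C(x, j) is a caller-supplied log-likelihood (a placeholder for
    the SOINN-based likelihood of eq. (2))."""
    T = len(obs)
    Q = np.full((T + 1, n_states + 1), -np.inf)
    Q[0, 0] = 0.0                      # assumed boundary condition
    for i in range(1, T + 1):
        for j in range(1, n_states + 1):
            c = C(obs[i - 1], j - 1)
            Q[i, j] = max(Q[i, j - 1] + c,
                          Q[i - 1, j - 1] + 2 * c,
                          Q[i - 1, j] + c)
    return Q[T, n_states]

def classify(obs, templates):
    """Assign obs to the category whose template scores highest.
    templates: {category: (n_states, C)}."""
    return max(templates,
               key=lambda name: sequence_score(obs, *templates[name]))
```

With a toy likelihood that rewards observation i matching state i, a well-aligned sequence outscores a misaligned one, which is all the category decision in Section 3.3 requires.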