
Multimed Tools Appl DOI 10.1007/s11042-014-2325-3

Relevance feedback for human motion retrieval using a boosting approach Songle Chen & Zhengxing Sun & Yan Zhang & Qian Li

Received: 3 April 2014 / Revised: 19 August 2014 / Accepted: 13 October 2014 © Springer Science+Business Media New York 2014

Abstract Content-based human motion retrieval (CBMR) has become increasingly important with the rapid growth of motion capture data, but the gap between high-level semantic concepts and low-level features hinders further performance improvement. Relevance feedback is an effective tool for narrowing this semantic gap and enhancing retrieval performance. However, as a type of variable-length multivariate time series (VLMTS), motion capture data has its own characteristics, including high dimensionality, the need for elastic matching, and the difficulty of representing different movements in a uniform feature space, which make it much more challenging to design an effective relevance feedback approach. This paper presents a novel boosting approach for CBMR with three main contributions. First, to fit the characteristics of VLMTS data and meet the real-time requirement of relevance feedback, the ensemble learning framework RankBoost is introduced, and k-nearest neighbors combined with dynamic time warping (KNN-DTW) is employed as its weak ranker. Second, a set of extended Boolean geometry features containing much richer geometry elements and measures is used to represent motion content, providing a comparatively complete feature set for designing the weak ranker of RankBoost. Third, to solve the over-fitting problem caused by the small-sample training of relevance feedback, a novel learning objective composed of minimizing the empirical ranking loss and minimizing the maximum generalization loss is proposed for RankBoost ensemble learning. Experimental results on the CMU database and its extended database verify the effectiveness of the proposed approach.

Keywords Motion capture data · Motion retrieval · Relevance feedback · RankBoost · Ranking loss

S. Chen · Z. Sun (*) · Y. Zhang · Q. Li
State Key Lab for Novel Software Technology, Nanjing University, Nanjing 210023, China


1 Introduction

In recent years, as a new kind of digital media, motion capture data has been widely used in computer animation, virtual reality, film special effects, etc. With the emergence of motion capture data and the presence of large 3D human motion databases [1], content-based human motion retrieval (CBMR) [2] has become a hot topic in the field of information retrieval. Through the efforts of researchers, CBMR has made much progress in motion content representation and similarity calculation. As in other content-based multimedia retrieval, the semantic gap between high-level query concepts and low-level features also exists in CBMR and hinders its further development [3].

Relevance feedback is an effective tool for reducing the semantic gap and improving retrieval performance. During each round of relevance feedback, the user labels a number of retrieval results as relevant or irrelevant to the query concept, and the system then refines the retrieval results based on the labeled instances. These two steps are carried out iteratively to gradually learn the user’s preferences and improve the retrieval performance. The topic of relevance feedback is quite new in CBMR. Tang et al. [4] suggested adjusting the weight of each feature in the similarity measurement via the average normalized discounted cumulative gain (AnDCG), while Chen et al. [5] proposed setting the weight of each feature via the ratio of the variance of interclass to intraclass (RVII). Both are heuristic methods. Heuristic methods have advantages such as robustness and simplicity, but because they lack a clear optimization objective, it is hard for them to achieve satisfactory performance [6]. Meanwhile, in other content-based multimedia retrieval such as content-based image retrieval (CBIR), relevance feedback techniques have already progressed from early heuristic weight adjustments to online learning mechanisms.

CBMR urgently needs effective online learning mechanisms to accurately capture the user’s subjective intentions. Compared with other machine learning problems, small-sample training and the real-time requirement are the two main challenges that online learning faces in relevance feedback. Small-sample training is the primary problem of online learning [7–9]. In relevance feedback, the training samples are the instances labeled by the user during each query session, which are very few compared with the feature dimensionality and the size of the database. With such a small training set, learning algorithms are prone to over-fitting: the learned model can separate the labeled relevant and irrelevant instances but cannot discern the unlabeled relevant or irrelevant instances in the database. As a consequence, it is difficult to guarantee the performance and stability of relevance feedback. The real-time requirement is a basic need of relevance feedback, because the online learning algorithm must be fast enough to allow real-time interaction between the user and the system.

Relevance feedback techniques based on online learning mechanisms have been well studied in CBIR. However, a retrieval object (image) in CBIR is usually represented as a vector, whereas a retrieval object in CBMR is an action formed by continuous poses. As a type of variable-length multivariate time series (VLMTS), it is naturally represented as a matrix [10], each column of which is a high-dimensional vector representing a pose. Since different actions have different durations and there are nonlinear distortions between them, it is difficult to represent different actions in a uniform feature space, although their poses can be represented in a uniform vector space. Moreover, VLMTS data needs elastic matching to measure the similarity of different sequential patterns, and elastic matching is often accompanied by a dynamic programming process. Consequently, the computational complexity of VLMTS data is much higher than that of vector data, and the high dimensionality of motion data makes this problem more prominent. The particularity of representation and the complexity of computation make it much more difficult and challenging to design an effective relevance feedback approach for CBMR.

In this paper, we present a novel boosting approach to solve the problems, including small-sample training and the real-time requirement, that the online learning of VLMTS data faces in relevance feedback for CBMR. The main contributions of this paper include:

1) The ensemble learning framework RankBoost is introduced, and k-nearest neighbors combined with dynamic time warping (KNN-DTW) is employed as its weak ranker, which not only fits the characteristics of VLMTS data but also meets the real-time requirement of relevance feedback.
2) The extended Boolean geometry features are used to represent motion content, providing a comparatively complete set of pose spatial-temporal features for designing the weak ranker of RankBoost.
3) A novel learning objective composed of minimizing the empirical ranking loss and minimizing the maximum generalization loss (M-ERL&MGL) is proposed and adopted by RankBoost for ensemble learning, which can effectively solve the over-fitting problem caused by the small-sample training of relevance feedback.

2 Related work

Relevance feedback has been widely used in content-based multimedia retrieval, especially in CBIR. However, research on relevance feedback for CBMR is still at an early stage. As far as the authors know, heuristic weight adjustments based on AnDCG [4] and on RVII [5] are the only two approaches in CBMR. It is necessary for CBMR to employ effective online learning mechanisms to further refine the retrieval performance.

Online learning methods can be broadly characterized as either discriminative or generative according to whether or not the distribution of the data is modelled [11]. Because discriminative methods are trained to predict the class labels rather than a detailed distribution model, they usually tend to have better predictive performance. Support Vector Machine (SVM) and boosting are two of the most representative discriminative methods for relevance feedback [11].

SVM is considered one of the state-of-the-art learning methods in CBIR owing to its good generalization ability [11,12]. Since SVM is a kernel method, the kernel function and its parameters are crucial in determining its performance [11,12]. The most commonly used kernel functions in SVM, such as RBF, are restricted to an input space of fixed-length feature vectors, which is suitable for CBIR. However, the retrieval object in CBMR is VLMTS data, which is naturally represented as a matrix [10]. One way to apply SVM to VLMTS data is to transform a sequence into a feature vector [13]. For example, Li et al. [14] proposed transforming VLMTS data into a vector through singular value decomposition and then using SVM to classify the vectors. Obviously, the classification performance depends mainly on how well the extracted feature vector represents the content of the VLMTS data. Another way to apply SVM to VLMTS data is to define a sequence distance-based kernel function [13], such as the Gaussian DTW kernel [15,16], the Gaussian elastic metric kernel [17], the global alignment kernel [18] and so on. However, Lei et al. [19] concluded that elastic matching distance is not eligible to construct positive definite symmetric (PDS) kernels and theoretically proved that the Gaussian DTW kernel [15,16] is not a qualified PDS kernel. The Gaussian elastic metric kernel [17] uses edit distance with real penalty and time warp edit distance to construct elastic kernels, but the PDS property still cannot be guaranteed. Although the global alignment kernel is PDS, the resulting kernel matrix has very large entries on the diagonal and the resulting classifier may have poor generalization properties [20]. At present, it is still a complex problem to define a kernel function for VLMTS data that meets the required PDS condition and allows the kernel matrix to be calculated in real time. In this paper, typical approaches of these two ways of applying SVM to VLMTS data are compared with the proposed approach.

Boosting is another important method for relevance feedback. Boosting has the well-known capability of adaptively selecting discriminating and complementary features in the training process, so it not only can capture the user’s subjective intentions but also can explain those intentions through the weights of features. The efficiency and flexibility of boosting are hard for most other feedback methods to rival. Boosting uses a forward feature selection strategy, and the desired learning accuracy can generally be obtained after a finite number of iterations, so the learning efficiency of boosting is much higher than that of evolutionary feedback methods [21,22], which use a random search strategy. As an ensemble learning framework, boosting does not restrict the form of weak learners, so different weak learners can be designed to improve the performance of relevance feedback, such as the BiasMap weak ranker [23], the Bayesian weak classifier [9], and the KNN weak classifier [12]. However, as a wrapper feature selection method, boosting is prone to over-fitting with a small training set [24], which makes it difficult to guarantee the performance and stability of relevance feedback.

To solve the problem of small-sample training, Huang et al. [9] proposed a quantization approach based on the ID3-like balance tree and employed Bayesian classification to replace the traditional binary weak classifiers of AdaBoost. Jiang et al. [12] suggested a feature selection criterion combining the unified feature matching measurement with the fuzzy feature contrast model and used this criterion to select the best weak classifier in each iteration of AdaBoost. Zhou et al. [23] proposed a BiasMap-based weak ranker to replace the traditional weak ranker of RankBoost. Nevertheless, the ID3-like balance tree, Bayesian classifier, fuzzy feature contrast model and BiasMap are suitable for vector data but can hardly be applied to VLMTS data. As a result, these approaches are not fit for CBMR. The approach proposed in this paper also belongs to the boosting methods, but it adopts a novel learning objective, M-ERL&MGL, to solve this problem.

3 System overview

The workflow of CBMR based on the proposed approach is shown in Fig. 1. It consists of two major parts: content match and relevance feedback. Content match is used to obtain the first retrieval results, while relevance feedback is used to refine them. After the user inputs the query action, pose feature extraction is carried out first. In this paper, the extended Boolean geometry features described in Section 4.1.1 are used to represent motion content. Then feature weight estimation is performed to obtain the initial weight of each feature according to the movement area of its relevant joints, a scheme proposed by Chao et al. [25]. Next, the K features with the highest weights are passed to the similarity calculation module to calculate the similarity between the query and each action in the database. In this paper, the DTW algorithm [26], which is widely used in CBMR [2,5], serves as the similarity measurement. Finally, the top-N candidate actions with the lowest dissimilarity values are returned to the user.
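The content-match step can be illustrated with a short Python sketch. It assumes the per-feature DTW distances between the query and every database action are already available as a matrix, and it combines the K selected features by a normalized weighted sum; that combination rule, the function name `content_match`, and its parameters are assumptions made for illustration and are not specified in the text.

```python
import numpy as np

def content_match(per_feature_dist, weights, K, N):
    """Rank database actions against a query (illustrative sketch).

    per_feature_dist: array (n_actions, d), assumed precomputed DTW distance
                      between the query and each action along each feature.
    weights:          array (d,), initial feature weights.
    K, N:             number of features to keep, number of results to return.
    """
    per_feature_dist = np.asarray(per_feature_dist, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Keep the K features with the highest weights.
    top_k = np.argsort(-weights)[:K]

    # ASSUMPTION: combine the selected features by a normalized weighted sum.
    w = weights[top_k] / weights[top_k].sum()
    dissimilarity = per_feature_dist[:, top_k] @ w

    # Return the N actions with the lowest dissimilarity.
    ranking = np.argsort(dissimilarity, kind="stable")[:N]
    return ranking, dissimilarity
```

Any other monotone combination of the per-feature distances could be substituted in the marked line without changing the rest of the sketch.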

[Figure 1: workflow diagram with two parts, Content Match and Relevance Feedback. Block labels: Query action, UI, Pose feature extraction, Feature weight estimation, Similarity calculation, 3D human action database, Relevant actions, Irrelevant actions, Unlabeled actions, Random Selection, Training set, Validation set, Set of pose features, Weak ranker KNN-DTW, Learning objective M-ERL&MGL, RankBoost learning, Ensemble ranker.]

Fig. 1 The workflow of the CBMR based on the proposed approach

During each round of relevance feedback, the user first labels the candidate actions as relevant or irrelevant, and the labeled actions constitute the training set. Meanwhile, the system randomly selects a number of actions from the database. The randomly selected actions and the labeled relevant actions constitute the validation set. Both sets are fed into the proposed RankBoost relevance feedback algorithm, which employs KNN-DTW as the weak ranker and adopts M-ERL&MGL as the learning objective. An ensemble ranker is then trained and assigns a score to each action in the database. Finally, the top-N candidate actions with the highest scores are returned to the user as the refined results. If the user is satisfied with the results, the process ends. Otherwise, the system proceeds to the next round of relevance feedback.
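The construction of the two sets in one feedback round can be sketched as follows. This is a minimal illustration, assuming actions are identified by hashable IDs and that the random selection draws only from actions the user has not labeled; the function name `build_feedback_sets` and the parameter `n_random` are hypothetical.

```python
import random

def build_feedback_sets(labeled, database_ids, n_random, seed=None):
    """Build the training and validation sets for one feedback round.

    labeled:      dict mapping action id -> True (relevant) / False (irrelevant).
    database_ids: iterable of all action ids in the database.
    n_random:     number of unlabeled actions to draw at random (hypothetical knob).
    """
    rng = random.Random(seed)
    relevant = [a for a, is_rel in labeled.items() if is_rel]
    irrelevant = [a for a, is_rel in labeled.items() if not is_rel]

    # ASSUMPTION: the random draw excludes actions the user already labeled.
    unlabeled = [a for a in database_ids if a not in labeled]
    sampled = rng.sample(unlabeled, min(n_random, len(unlabeled)))

    # Training set: all labeled actions.
    training_set = {"relevant": relevant, "irrelevant": irrelevant}
    # Validation set: labeled relevant actions plus randomly selected actions.
    validation_set = {"relevant": relevant, "unlabeled": sampled}
    return training_set, validation_set
```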

4 Proposed boosting approach for CBMR

To solve the problems, including small-sample training and the real-time requirement, that the online learning of VLMTS data faces in relevance feedback for CBMR, the ensemble learning framework RankBoost [27] is introduced and a novel boosting approach is presented. The key components of our approach are the weak ranker KNN-DTW, the learning objective M-ERL&MGL, and the extended Boolean geometry features. RankBoost with the weak ranker KNN-DTW is used to fit the characteristics of VLMTS data and meet the real-time requirement of relevance feedback. RankBoost with the learning objective M-ERL&MGL is used to address the over-fitting problem caused by the small-sample training of relevance feedback. The extended Boolean geometry features provide a comparatively complete feature set for designing the weak ranker and further improve the retrieval performance.

Figure 2 shows the flowchart of the proposed boosting approach. The M-ERL part of the learning objective (minimizing the empirical ranking loss) is used to obtain the rankers that correctly separate the labeled relevant and irrelevant actions of the training set, while the M-MGL part (minimizing the maximum generalization loss) is used to filter these rankers and select the ranker with the lowest maximum generalization loss risk. To achieve this learning objective, while using the labeled actions as the training set, the system also uses the labeled relevant actions and the unlabeled actions randomly selected from the database as the validation set. The weak ranker of each pose feature is calculated on the training set. During each iteration of RankBoost, the weights of pairs in the training set and the weights of pairs in the validation set are first updated separately. Then, the two ranking losses of each weak ranker, on the training set and on the validation set, are calculated. The ranking loss on the training set represents the empirical loss, while the ranking loss on the validation set reflects the generalization loss risk. Finally, the weak ranker h_t with the lowest integrated ranking loss is selected and its weight is calculated. After T iterations, the algorithm outputs the ensemble ranker H.

Fig. 2 The flowchart of the proposed boosting approach

In this section, we first present the design of the weak ranker for RankBoost. Then we describe the design of the learning objective M-ERL&MGL for RankBoost. Finally, we give the proposed RankBoost relevance feedback algorithm.

4.1 Design of the weak ranker for RankBoost

4.1.1 Motion content representation

Effective motion content representation is the premise and foundation for content match and relevance feedback. The movement of the human body is a process of continuous pose changes. Euler rotation angles, 3D coordinates and quaternions can describe these changes exactly and have become the classical measures used to evaluate motion similarity [2,28,29]. Based on these pose measures, some studies resorted to dimensionality reduction and clustering techniques such as PCA and SOM to obtain a low-dimensional representation for each pose [30,31], and some applied spherical harmonics transformation, singular value decomposition and other means to extract high-level features for each motion [25,32]. However, it has been shown that Euler rotation angles, 3D coordinates and quaternions are numerical similarity measures and are insufficient to identify similar motions with style differences [3,4].


For retrieval of logically relevant human motions, Müller et al. [3] proposed using Boolean geometry features to express motion content, and Tang et al. [4] suggested using joint relative distance to describe motion content. Both are logical relevance measures [4]. However, the limited number of geometry elements and the binary states of each feature make it hard for Boolean geometry features to distinguish subdivided action classes [33], and joint relative distance features are limited to the distance between joint pairs and ignore other valuable measures contained in Boolean geometry features. In this paper, Boolean geometry features are extended to overcome these limitations. First, the geometry elements, including points, lines and planes formed by joints, are generalized to nearly all parts of the body. Then the angle and distance measures between geometry elements are taken as spatial features, and the angular velocity and acceleration of joints are taken as temporal features, which together form a comparatively complete set of pose spatial-temporal features to represent motion content.

Figure 3 shows the skeleton model of the human body used in our implementation. The set of geometry elements contains 18 points, 17 lines, and 10 planes, where each point corresponds to a joint, each line corresponds to a bone, and each plane is formed by three adjacent joints. To describe the relative positional relationship between geometry elements and its change, the extended Boolean geometry features contain a total of 9 types of measures, which are shown in Fig. 4. The pose spatial features include the distance between two joints, the distance between a joint and a bone, the distance between a joint and a plane, the angle between two bones, the angle between a bone and a plane, and the angle between two planes. To represent the relative rotation of adjacent joints, Euler rotation angles are also included as part of the pose spatial features. The pose temporal features include the norm of the angular velocity and the norm of the angular acceleration of each joint, which reflect the speed and effort of movement. According to [34], suppose the rotations of a joint at frames i−1 and i, expressed as quaternions, are q(i−1) and q(i), and the sample interval is Δt. Then the angular velocity of this joint at frame i is

$$F_\omega(i) = \frac{2\log\left(q^{-1}(i-1)\cdot q(i)\right)}{\Delta t} \tag{1}$$

[Figure 3: skeleton diagram with 18 labeled joints: Head End, Neck, Thorax, Waist, L. Shoulder, R. Shoulder, L. Elbow, R. Elbow, L. Wrist, R. Wrist, Hand End (two), L. Hip, R. Hip, L. Knee, R. Knee, L. Ankle, R. Ankle.]

Fig. 3 The skeleton model of human body


Fig. 4 The nine types of pose spatial-temporal features

After the angular velocity has been calculated, the angular acceleration can be obtained as

$$F_\alpha(i) = \frac{F_\omega(i) - F_\omega(i-1)}{\Delta t} \tag{2}$$
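Equations (1) and (2) can be checked numerically with a small Python sketch. It assumes unit quaternions stored as (w, x, y, z), uses the vector part of the quaternion logarithm, and flips the relative rotation onto the shortest arc; these conventions and all function names are assumptions for illustration, not details given in the paper.

```python
import numpy as np

def quat_conj(q):
    """Conjugate of (w, x, y, z); equals the inverse for a unit quaternion."""
    w, x, y, z = q
    return np.array([w, -x, -y, -z])

def quat_mul(p, q):
    """Hamilton product p * q."""
    pw, px, py, pz = p
    qw, qx, qy, qz = q
    return np.array([
        pw * qw - px * qx - py * qy - pz * qz,
        pw * qx + px * qw + py * qz - pz * qy,
        pw * qy - px * qz + py * qw + pz * qx,
        pw * qz + px * qy - py * qx + pz * qw,
    ])

def quat_log_vec(q):
    """Vector part of log(q) for a unit quaternion: half-angle times axis."""
    v = np.asarray(q[1:], dtype=float)
    n = np.linalg.norm(v)
    if n < 1e-12:
        return np.zeros(3)
    return np.arctan2(n, q[0]) * v / n

def angular_velocity(q_prev, q_cur, dt):
    """Eq. (1): 2 * log(q^{-1}(i-1) * q(i)) / dt."""
    rel = quat_mul(quat_conj(q_prev), q_cur)
    if rel[0] < 0:  # ASSUMPTION: take the shortest rotation (q and -q are equal)
        rel = -rel
    return 2.0 * quat_log_vec(rel) / dt

def angular_acceleration(w_prev, w_cur, dt):
    """Eq. (2): finite difference of consecutive angular velocities."""
    return (np.asarray(w_cur) - np.asarray(w_prev)) / dt
```

The temporal features described in the text would then be the norms of these vectors, for example `np.linalg.norm(angular_velocity(q_prev, q_cur, dt))`.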

To reduce the influence of different skeletons, the skeleton of each motion is normalized before feature extraction in a way similar to [4,25]. By combining different geometry elements and filtering out the features whose relevant joints are rigid pairs, a total of 950 pose spatial-temporal features are defined, as shown in Table 1. All the features are normalized into the range 0–1 before any further computation. Compared with the original Boolean geometry features and joint relative distance features, the set of extended Boolean geometry features contains much richer geometry elements and measures, which provides a comparatively complete set of pose spatial-temporal features for designing the weak ranker of RankBoost.

Table 1 The summary of extended Boolean geometry features

Type         Number   Description
F_{j,j,d}    135      distance of two joints
F_{j,l,d}    270      distance of a joint and a bone
F_{j,p,d}    150      distance of a joint and a plane
F_{l,l,a}    135      angle of two bones
F_{l,p,a}    150      angle of a bone and a plane
F_{p,p,a}    45       angle of two planes
F_{Euler}    39       Euler angle of a joint
F_ω          13       angular velocity of a joint
F_α          13       angular acceleration of a joint
Total        950

4.1.2 Weak ranker KNN-DTW

In this paper, KNN-DTW is used as the weak ranker for RankBoost ensemble learning. On the one hand, the KNN algorithm needs only a distance measure, which overcomes the difficulty of representing different movements of VLMTS data in a uniform feature space, and the DTW algorithm meets the elastic matching demand. On the other hand, RankBoost is an ensemble learning framework in which each weak ranker usually corresponds to a single feature, so the elastic matching distance along a single feature between any two instances in the database can be computed offline, and the high search efficiency of RankBoost in the pose feature space helps with the high dimensionality of motion data. Together, these properties help resolve the conflict between the high computational complexity of VLMTS data and the real-time requirement of relevance feedback.

Let $F=\{f_k, k=1,\dots,d\}$ be the set of extended Boolean geometry features. DTW [26] is employed to calculate the elastic matching distance along feature $f_k$ between two actions. Suppose the number of poses in action $A$ is $S$ and its matrix representation is $A=(F_{A,1},F_{A,2},\dots,F_{A,S})$, and the number of poses in action $B$ is $T$ and its matrix representation is $B=(F_{B,1},F_{B,2},\dots,F_{B,T})$. The distance between pose $F_{A,i}\in A$ and pose $F_{B,j}\in B$ along feature $f_k$ is $d(F_{A,i},F_{B,j},f_k)=|F_{A,i}-F_{B,j}|_k$. Let $R=\{r_1,r_2,\dots,r_L\}$ be a warping path between $A$ and $B$, where $r_l=(l_A,l_B)\in[1:S]\times[1:T]$ and $L$ is the length of the path. The total distance between $A$ and $B$ along feature $f_k$ following the path $R$ is

$$c_R(A,B,f_k)=\sum_{l=1}^{L} d\left(F_{A,l_A},F_{B,l_B},f_k\right) \tag{3}$$
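The minimization of this path cost by dynamic programming, which the text describes next through its boundary conditions and the recursion of Eq. (4), can be sketched in Python. This is a minimal sketch, assuming each action has already been reduced to a 1-D array holding the value of a single feature f_k at every pose; the function name `dtw_single_feature` and the `diag_weight` parameter are illustrative and not from the paper.

```python
import numpy as np

def dtw_single_feature(a, b, diag_weight=2.0):
    """DTW distance between two 1-D feature sequences (illustrative sketch).

    a, b:        per-pose values of one feature f_k for actions A and B.
    diag_weight: weight on the diagonal step; 2 follows the recursion in Eq. (4).
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    S, T = len(a), len(b)

    # Local distance d(F_{A,s}, F_{B,t}, f_k) = |a[s] - b[t]| for every pose pair.
    d = np.abs(a[:, None] - b[None, :])

    # Boundary conditions: cumulative sums along the first column and first row.
    D = np.empty((S, T))
    D[:, 0] = np.cumsum(d[:, 0])
    D[0, :] = np.cumsum(d[0, :])

    # Fill the accumulated distance matrix with the three-way recursion.
    for s in range(1, S):
        for t in range(1, T):
            D[s, t] = min(
                D[s - 1, t] + d[s, t],
                D[s - 1, t - 1] + diag_weight * d[s, t],
                D[s, t - 1] + d[s, t],
            )
    return float(D[-1, -1])

# Sequences that differ only by a repeated pose align at zero cost.
print(dtw_single_feature([0, 1, 2], [0, 1, 1, 2]))  # 0.0
```

Because each weak ranker uses one feature, a function like this could be run offline for every pair of database actions and every feature, which is the precomputation the section relies on for real-time feedback.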

The DTW distance is the smallest distance between $A$ and $B$, attained along the optimal path $R^*$, namely $\mathrm{DTW}(A,B,f_k)=c_{R^*}(A,B,f_k)=\min_R\{c_R(A,B,f_k)\}$, and it can be obtained by dynamic programming. Let $D$ be the accumulated distance matrix, which satisfies $D(s,1,f_k)=\sum_{l=1}^{s} d(F_{A,l},F_{B,1},f_k)$ for $s\in[1:S]$ and $D(1,t,f_k)=\sum_{l=1}^{t} d(F_{A,1},F_{B,l},f_k)$ for $t\in[1:T]$. The resulting accumulated distance matrix $D$ can then be computed by the recursion

$$D(s,t,f_k)=\min\begin{cases} D(s-1,t,f_k)+d\left(F_{A,s},F_{B,t},f_k\right)\\ D(s-1,t-1,f_k)+2\,d\left(F_{A,s},F_{B,t},f_k\right)\\ D(s,t-1,f_k)+d\left(F_{A,s},F_{B,t},f_k\right)\end{cases} \tag{4}$$

for 1