
Interactive Video Indexing With Statistical Active Learning Zheng-Jun Zha, Member, IEEE, Meng Wang, Member, IEEE, Yan-Tao Zheng, Yi Yang, Richang Hong, and Tat-Seng Chua

Abstract—Video indexing, also called video concept detection, has attracted increasing attention from both academia and industry. To reduce human labeling cost, active learning has recently been introduced to video indexing. In this paper, we propose a novel active learning approach based on the optimum experimental design criteria in statistics. Different from existing optimum experimental design approaches, our approach simultaneously exploits each sample's local structure and the sample relevance, density, and diversity information, and makes use of both labeled and unlabeled data. Specifically, we develop a local learning model to exploit the local structure of each sample. Our assumption is that, for each sample, its label can be well estimated based on its neighbors. By globally aligning the local models from all the samples, we obtain a local learning regularizer, based on which a local learning regularized least square model is proposed. Finally, a unified sample selection approach is developed for interactive video indexing, which takes into account the sample relevance, density, and diversity information, as well as the sample efficacy in minimizing the parameter variance of the proposed local learning regularized least square model. We compare the performance of our approach with state-of-the-art approaches on the TREC video retrieval evaluation (TRECVID) benchmark and report superior performance for the proposed approach.
Index Terms—Active learning, optimum experimental design, video indexing.

I. INTRODUCTION
With rapid advances in storage devices, networks, and compression techniques, large-scale video data are now publicly accessible. There is a compelling need for automatic understanding and efficient retrieval of these data. Many commercial search engines, such as Google, Microsoft Bing, and Yahoo!, offer video search based on indexing the textual metadata associated with videos. The advantage of text-based video search is that many text search technologies can be directly used. However, the search performance is generally unsatisfactory, since the textual metadata are usually noisy, incomplete, and sometimes inconsistent with the video content.

Manuscript received January 30, 2011; revised July 18, 2011; accepted August 29, 2011. Date of publication November 04, 2011; date of current version January 18, 2012. This work was supported by A*Star Research Grant R-252-000-437-305. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Qingshan Liu. Z.-J. Zha and T.-S. Chua are with the School of Computing, National University of Singapore, Singapore 117417. M. Wang and R. Hong are with Hefei University of Technology, Anhui, China (e-mail: [email protected]). Y.-T. Zheng is with the Institute for Infocomm Research, Singapore. Y. Yang is with the School of Information Technology and Electrical Engineering, The University of Queensland, Brisbane, Australia. Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TMM.2011.2174782

Fig. 1. Illustration of interactive video indexing.

Recent research moves one step ahead of text-based video search and utilizes a set of semantic concepts as intermediate semantic descriptors to facilitate video search [2]. The concepts of interest include a wide range of categories, such as scenes (e.g., urban, sky, mountain), objects (e.g., airplane, bus, face), and events (e.g., people-marching, walking-running). Predicting the presence of semantic concepts in video clips has shown encouraging progress towards bridging the "semantic gap" and thus leads to more effective video search [2]. The task of predicting the presence of concepts is usually called semantic video indexing, video concept detection, or video annotation in the literature. A wealth of approaches have been proposed for semantic video indexing recently. Most of them are based on machine learning techniques, in which concept models are learned from training data and newly given video clips can then be predicted directly [8], [9], [30], [34], [37], [42]. Despite this encouraging progress, existing studies show that huge training datasets are often required to achieve reasonable performance due to the complexity of video data [2]. However, manual labeling is laborious and time-consuming. For example, existing studies show that annotating 1 h of video with 100 concepts generally takes between 8 and 15 h [3]. To reduce human labeling effort, active learning has recently been introduced into semantic video indexing [14]. As illustrated in Fig. 1, an active learning-based interactive video indexing system typically consists of a concept learning module and a sample selection module. These two modules work alternately as follows: the concept learning module trains concept classifiers based on the current training dataset; then the sample selection module identifies and selects the most informative unlabeled samples for manual labeling, and these samples are added to



the training dataset. In this way, interactive video indexing can harvest the most informative training data quickly and thus boost the performance effectively. Existing interactive video indexing strategies can be roughly divided into three categories: the risk reduction strategy, which aims to select the samples that minimize the expected risk of the current classifier [4], [13], [35]; the most-uncertainty strategy, which selects the samples that the current classifier is most uncertain about [7], [24], [29], [36]; and other strategies that take into account sample relevance, density, and diversity information [5], [10], [38]. The problem of selecting samples to label is typically referred to as experimental design in statistics. The sample $\mathbf{x}$ is referred to as an experiment, and its label as a measurement. The study of optimum experimental design [15] is concerned with designing experiments that are expected to minimize the variances of a parameterized model. The classic optimum experimental design algorithms are based on a supervised learning model, linear regression, which only makes use of the labeled data. As aforementioned, in semantic video indexing the labeled data are usually insufficient, since manual labeling is highly laborious and time-consuming, while unlabeled data are plentiful. In this case, a supervised learning model may not perform well due to the lack of training data. Recent progress on semi-supervised learning shows that unlabeled data, used together with a small number of labeled samples, can improve the learning performance greatly [16]–[18]. Motivated by this, recent works have developed new optimum experimental design algorithms based on the semi-supervised learning framework. Based on Laplacian regularized least squares (LapRLS) [18], Laplacian regularized optimum experimental design algorithms have been proposed [19], [20], [22]. Although these algorithms perform more effectively than the classic optimum experimental design, they generally cannot perform well in video indexing due to the following two limitations. First, Laplacian regularized optimum experimental design focuses on selecting the samples that minimize the model parameter variance or the model prediction variance, and the sample relevance, density, and diversity information is not taken into account. However, this information has been found effective for interactive video indexing in recent studies [5]. Second, Laplacian regularized optimum experimental design is based on the graph Laplacian regularizer, which estimates the label consistency over the graph topology from a pair-wise perspective, i.e., if two samples are close to each other, their labels should be close as well. However, this strategy often fails to characterize the desired property of label consistency, since label consistency is defined over the whole set of neighbors and is naturally characterized by the local structure of the neighbors rather than by individual pair-wise relations between samples [27], [31], [32], [47]. The local structure is especially beneficial to video indexing, since most video corpora, such as the TRECVID dataset, are obtained from different sources and span a long time interval. Motivated by the above observations, in this paper we propose a novel interactive video indexing approach. A unified active learning method is developed to exploit each sample's local structure and the sample density, relevance, and diversity information, as well as to take advantage of both the labeled and unlabeled


data. Specifically, we propose a local learning regularized optimum experimental design approach. Our assumption is that, for each sample, its label can be well estimated based on its neighbors. A local regression model is developed to exploit the local structure of each sample and characterize the desired label consistency. In order to assign an optimal label confidence value to each sample, we globally align the nonlinear regression models from all the samples and obtain a local learning regularizer, based on which a local learning regularized least square model is developed for semantic concept learning. Finally, a unified sample selection approach is proposed, which takes into consideration the sample relevance, density, and diversity, as well as the sample efficacy in minimizing the parameter variance of the proposed local learning regularized least square model. The main contributions of this paper can be summarized as follows.
• A novel active learning approach is proposed. Different from existing methods, our approach simultaneously exploits each sample's local structure and the sample density, relevance, and diversity information, and takes advantage of both the labeled and unlabeled data.
• A local learning regularized least square model is developed, based on which the local structure of each sample can be well exploited.
• A unified sample selection strategy is proposed, which is demonstrated to be effective in selecting the most informative samples and boosting the learning performance.
The rest of this paper is organized as follows. We provide a short review of related work in Section II. The sample relevance, density, and diversity information are detailed in Section III. In Section IV, we elaborate the proposed active learning approach: the local learning regularizer and the local learning regularized least square model are presented in Section IV-A and Section IV-B, respectively, and the sample selection strategy is detailed in Section IV-C. In Section V, we report experimental results on a real-world video corpus, the TRECVID 2007 dataset, followed by conclusions in Section VI.
II. RELATED WORK
In this section, we provide a brief description of the generic active learning problem, review the existing active learning approaches explored in video indexing, and describe conventional optimum experimental design.
A. Notation and Active Learning Problem
In this paper, we use boldface lowercase letters (e.g., $\mathbf{x}$) to denote vectors and boldface uppercase letters (e.g., $\mathbf{X}$) to denote matrices. We use $\det(\mathbf{X})$ to denote the determinant of $\mathbf{X}$ and $\mathbf{X}^{T}$ to denote its transpose. We use calligraphic letters (e.g., $\mathcal{X}$) to denote sets and $|\mathcal{X}|$ to denote the cardinality of a set. Given a set of samples $\mathcal{X}=\{\mathbf{x}_{1},\ldots,\mathbf{x}_{n}\}$, where $\mathbf{x}_{i}\in\mathbb{R}^{d}$ is a $d$-dimensional feature vector, the generic problem of active learning is to find a subset $\mathcal{Z}\subset\mathcal{X}$ containing the most informative samples. That is, if the samples in $\mathcal{Z}$ are labeled and used as training samples, the underlying classifier can predict the labels of the unlabeled samples most precisely.


B. Active Learning in Video Indexing
As aforementioned, active learning has been introduced into semantic video indexing because it can significantly reduce the human cost of labeling training data. The core problem of active learning is sample selection. Existing sample selection strategies explored in video indexing fall into three categories: the risk reduction strategy [4], [13], [35], the most-uncertainty strategy [7], [24], [29], [36], and other strategies that consider sample relevance, density, and diversity information [5], [10], [38]. Cohn et al. [13] and Roy and McCallum [4] suggested that the optimal active learning method should select the samples that minimize the expected risk of the current learning model. The reduced risk is estimated with respect to each unlabeled sample, and the most effective samples are selected. However, for most existing learning models it is infeasible to estimate this risk. Therefore, in practice, many active learning methods adopt the most-uncertainty sample selection strategy, which selects the samples that the current model is most uncertain about. The most popular uncertainty-based active learning method is support vector machine active learning (SVMactive) [24], which was applied to video indexing in [7]. It builds an SVM classifier on the current training samples and then selects the samples closest to the current classification boundary for labeling. The rationale is that the closer a sample is to the boundary, the less reliable its classification is. In addition, Nguyen et al. [29] chose samples closest to the boundary in an information space based on distances to a number of selected samples. A major disadvantage of these methods is that the estimated boundary may not be accurate enough, especially when the training samples are scarce. This impairs the identification of informative samples and consequently reduces the learning performance. Some other sample selection approaches take sample relevance, density, and diversity information into account. Ayache and Quénot [5] proposed to use a relevance criterion in sample selection: the samples with the highest probabilities of being relevant are selected for labeling. They reported that the relevance criterion is effective for concepts whose positive samples are far fewer than the negative ones. The density criterion aims to select samples from dense regions of the feature space [38], [45], [46]. Samples in dense regions are usually representative of the underlying sample distribution and therefore add more information to the learning model than samples in low-density regions. Brinker [39] incorporated a diversity criterion into sample selection, choosing samples that are diverse with respect to each other.
C. Optimum Experimental Design
Closely related to active learning is optimum experimental design in statistics, which has received increasing attention due to its theoretical foundation and practical effectiveness [15]. Optimum experimental design considers a linear regression model
$y = \mathbf{w}^{T}\mathbf{x} + \varepsilon$   (1)
where $\mathbf{w}$ are the model parameters, $y$ is the real-valued output, and $\varepsilon$ is the measurement noise, a Gaussian random variable with zero mean and constant variance $\sigma^{2}$.


Optimum experimental design aims to choose the most informative samples $\mathcal{Z}=\{\mathbf{z}_{1},\ldots,\mathbf{z}_{m}\}$ from $\mathcal{X}$ to learn a prediction function $f(\mathbf{x})=\mathbf{w}^{T}\mathbf{x}$ so that the expected prediction error can be minimized [15]. Given a set of selected samples $\mathcal{Z}$ and the corresponding labels $\mathbf{y}=[y_{1},\ldots,y_{m}]^{T}$, the linear regression model can be estimated following the least squares criterion:
$\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \sum_{i=1}^{m}\big(\mathbf{w}^{T}\mathbf{z}_{i}-y_{i}\big)^{2}$.   (2)
The optimum solution is
$\hat{\mathbf{w}} = \big(\mathbf{Z}\mathbf{Z}^{T}\big)^{-1}\mathbf{Z}\mathbf{y}$   (3)
where $\mathbf{Z}=[\mathbf{z}_{1},\ldots,\mathbf{z}_{m}]$ is the feature matrix and $\mathbf{y}$ is the label vector. It can be proved that $\hat{\mathbf{w}}$ is an unbiased estimate of $\mathbf{w}$ and its covariance matrix is given as
$\mathrm{Cov}(\hat{\mathbf{w}}) = \sigma^{2}\big(\mathbf{Z}\mathbf{Z}^{T}\big)^{-1}$.   (4)
Conventional optimum experimental design can be classified into two categories according to the sample selection criterion. The first category selects the samples that minimize the size of the parameter covariance matrix, and includes D-, A-, and E-optimal design [15]. D-optimal design selects the samples that minimize the determinant of $\mathrm{Cov}(\hat{\mathbf{w}})$ and thus minimize the volume of the confidence region. A-optimal design minimizes the trace of $\mathrm{Cov}(\hat{\mathbf{w}})$ and thus the dimensions of the enclosing box around the confidence region. E-optimal design minimizes the maximum eigenvalue of $\mathrm{Cov}(\hat{\mathbf{w}})$, i.e., the size of the major axis of the confidence region. The other category selects the samples that minimize the variance of the prediction value, and includes I- and G-optimal design. For each sample $\mathbf{x}$, the prediction value is $\hat{y}=\hat{\mathbf{w}}^{T}\mathbf{x}$ and its variance is $\sigma^{2}\mathbf{x}^{T}(\mathbf{Z}\mathbf{Z}^{T})^{-1}\mathbf{x}$. I-optimal design minimizes the average variance of the prediction value, while G-optimal design minimizes the maximum variance of the prediction value. All of these conventional optimum experimental design algorithms are based on a supervised learning model and usually suffer from the insufficient-training-sample problem. To address this issue, Laplacian regularized optimum experimental design has recently been proposed to take into account both the labeled and unlabeled data [19], [20], [22]. These approaches are based on LapRLS [18], which estimates the label consistency over the graph topology from a pair-wise perspective, i.e., if two samples are close to each other, their labels should be close as well. However, label consistency is naturally characterized by the local structure of a sample's whole neighbor set rather than by individual pair-wise relations between samples [26], [27], [31], [32]. The local structure is especially important in video indexing, since most video collections are obtained from diverse sources and span a long time interval. Therefore, it is worthwhile to develop a new active learning approach that exploits the local structure of samples.
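To make the criteria concrete, the following numpy sketch scores a candidate design under the D-, A-, and E-optimality criteria using the covariance expression in (4). The variable names, the unit noise variance, and the toy data are illustrative assumptions, not material from the paper:

```python
import numpy as np

def design_criteria(Z, sigma2=1.0):
    """Score a candidate design Z (d x m matrix of selected sample features).

    Uses the parameter covariance sigma^2 * (Z Z^T)^{-1} from (4);
    for each criterion, smaller is better.
    """
    cov = sigma2 * np.linalg.inv(Z @ Z.T)      # covariance of the least-squares estimate
    d_score = np.linalg.det(cov)               # D-optimality: determinant (confidence volume)
    a_score = np.trace(cov)                    # A-optimality: trace
    e_score = np.linalg.eigvalsh(cov).max()    # E-optimality: largest eigenvalue
    return d_score, a_score, e_score

# toy usage: compare two random 8-sample designs drawn from a 20-sample pool in 5-D
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 20))               # candidate pool, one column per sample
print(design_criteria(X[:, :8]))
print(design_criteria(X[:, 8:16]))
```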



III. RELEVANCE, DENSITY, AND DIVERSITY
In this section, we introduce the estimation of the sample relevance, density, and diversity information, which will be exploited in our active learning approach.
A. Relevance
The relevance criterion aims to maximize the size of the positive sample set by selecting the samples that are most probably relevant to the given concept. In video indexing, the positive samples of most concepts are usually scarce and far fewer than the negative ones, and the negative samples are usually distributed over a very broad domain. In this case, the most probably positive samples can add the most information to the current concept learning model [5]. The relevance of a sample $\mathbf{x}_{i}$ can be estimated in various ways. For example, it could be the normalized model output on $\mathbf{x}_{i}$:
$r(\mathbf{x}_{i}) = \dfrac{f(\mathbf{x}_{i}) - f_{\min}}{f_{\max} - f_{\min}}$ or $r(\mathbf{x}_{i}) = \mathrm{sigmoid}\big(f(\mathbf{x}_{i})\big)$   (5)
where $f(\mathbf{x}_{i})$ is the original model output on $\mathbf{x}_{i}$, $f_{\max}$ and $f_{\min}$ are the maximum and minimum model output over all the unlabeled samples, and $\mathrm{sigmoid}(\cdot)$ is the standard sigmoid function.
B. Density
The density criterion aims to select samples from dense regions of the feature space [38]. The samples in dense regions are usually representative of the underlying sample distribution and therefore add more information to the learning model than the samples in low-density regions. To estimate the density of each sample, we first estimate the sample distribution $p(\mathbf{x})$ by the kernel density estimation (KDE) approach [40]:
$p(\mathbf{x}) = \frac{1}{n}\sum_{j=1}^{n} k(\mathbf{x} - \mathbf{x}_{j})$   (6)
where $k(\cdot)$ is a kernel function that satisfies $k(\mathbf{x})\ge 0$ and $\int k(\mathbf{x})\,d\mathbf{x}=1$. To reduce the computational cost, $p(\mathbf{x}_{i})$ is usually estimated over the neighbors of $\mathbf{x}_{i}$ rather than over all the samples. That is, $p(\mathbf{x}_{i})$ is computed as
$p(\mathbf{x}_{i}) = \frac{1}{|\mathcal{N}(\mathbf{x}_{i})|}\sum_{\mathbf{x}_{j}\in\mathcal{N}(\mathbf{x}_{i})} k(\mathbf{x}_{i} - \mathbf{x}_{j})$   (7)
where $\mathcal{N}(\mathbf{x}_{i})$ denotes the neighbor set of $\mathbf{x}_{i}$. Based on (7), we define the density of sample $\mathbf{x}_{i}$ by normalizing $p(\mathbf{x}_{i})$ to $[0,1]$ as
$d(\mathbf{x}_{i}) = \dfrac{p(\mathbf{x}_{i}) - p_{\min}}{p_{\max} - p_{\min}}$.   (8)
C. Diversity
Previous studies demonstrate that the selected samples should be diverse [39]. The most popular diversity measure is angular diversity. The angle between two samples $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$ is defined as
$\theta(\mathbf{x}_{i},\mathbf{x}_{j}) = \arccos\dfrac{|\mathbf{x}_{i}^{T}\mathbf{x}_{j}|}{\|\mathbf{x}_{i}\|\,\|\mathbf{x}_{j}\|}$.   (9)
Based on (9), the angle between an unlabeled sample $\mathbf{x}_{i}$ and the current labeled sample set $\mathcal{L}$ is defined as the maximal angle from $\mathbf{x}_{i}$ to any sample in $\mathcal{L}$. Consequently, we define the diversity measure of sample $\mathbf{x}_{i}$ as follows; the sample is more diverse with respect to the current labeled sample set when this angle is larger:
$v(\mathbf{x}_{i}) = \max_{\mathbf{x}_{j}\in\mathcal{L}}\, \theta(\mathbf{x}_{i},\mathbf{x}_{j})$.   (10)
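The three measures can be computed along the following lines. This is a minimal numpy sketch assuming a Gaussian kernel for the density estimate in (7) and feature-space cosine angles for (9), with min-max normalization as in (5) and (8); the function and variable names are ours, not the paper's:

```python
import numpy as np

def relevance(scores):
    """Min-max normalize classifier outputs on the unlabeled pool, as in (5)."""
    return (scores - scores.min()) / (scores.max() - scores.min() + 1e-12)

def density(X, k=30, bandwidth=1.0):
    """k-NN kernel density estimate (7), min-max normalized to [0, 1] as in (8)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    p = np.empty(len(X))
    for i in range(len(X)):
        nbrs = np.argsort(d2[i])[1:k + 1]                 # k nearest neighbors, excluding self
        p[i] = np.exp(-d2[i, nbrs] / (2 * bandwidth ** 2)).mean()
    return (p - p.min()) / (p.max() - p.min() + 1e-12)

def diversity(X, labeled_idx):
    """Maximal angle between each sample and the labeled set, following (9)-(10)."""
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    cos = np.abs(Xn @ Xn[labeled_idx].T)                  # |cosine| to every labeled sample
    return np.arccos(np.clip(cos, 0.0, 1.0)).max(axis=1)
```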

IV. PROPOSED ACTIVE LEARNING APPROACH
In this section, we elaborate the proposed active learning method. We present the local learning regularizer in Section IV-A and the local learning regularized least square model in Section IV-B. Finally, we describe our sample selection algorithm in Section IV-C.
A. Local Learning Regularizer
As aforementioned, label consistency is naturally characterized by the local structure of a sample's neighbor set rather than by individual pair-wise relations between samples [27]. Let $f_{i}$ denote the predictive label value of $\mathbf{x}_{i}$ generated by an underlying classifier, such as a regression model $f$. To exploit the local structure of each sample $\mathbf{x}_{i}$, our assumption is that the value of $f_{i}$ can be well estimated based on the neighbors of $\mathbf{x}_{i}$. Let $\mathcal{N}(\mathbf{x}_{i})$ denote the sample $\mathbf{x}_{i}$ and its $k$-nearest neighbors. That is, $f_{i}$ should approach the output of a local model trained locally with the data in $\mathcal{N}(\mathbf{x}_{i})$ [27]. Here, we adopt a local nonlinear regression model $o_{i}(\mathbf{x})=\mathbf{w}_{i}^{T}\phi(\mathbf{x})$, where $\mathbf{w}_{i}$ are the model parameters and $\phi(\cdot)$ maps a sample into a kernel-induced feature space. Let $e_{i}$ denote the local predictive error of $o_{i}$ on the sample $\mathbf{x}_{i}$. The parameters $\mathbf{w}_{i}$ can be estimated by solving the following optimization problem:
$\min_{\mathbf{w}_{i}} \; \ell\big(o_{i},\{f_{j}:\mathbf{x}_{j}\in\mathcal{N}(\mathbf{x}_{i})\}\big) + \lambda\,\Omega(\mathbf{w}_{i})$   (11)
where $\Omega(\mathbf{w}_{i})$ is a regularization term that controls the model complexity and thus avoids overfitting, and $\lambda$ is a tradeoff parameter. We define the loss function $\ell(\cdot)$ as the sum of squared errors and use the $\ell_{2}$ norm in the regularization term. We then get
$\min_{\mathbf{w}_{i}} \; \sum_{\mathbf{x}_{j}\in\mathcal{N}(\mathbf{x}_{i})} \big(\mathbf{w}_{i}^{T}\phi(\mathbf{x}_{j}) - f_{j}\big)^{2} + \lambda\,\|\mathbf{w}_{i}\|^{2}$   (12)
which can be rewritten as
$\min_{\mathbf{w}_{i}} \; \|\mathbf{X}_{i}^{T}\mathbf{w}_{i} - \mathbf{f}_{i}\|^{2} + \lambda\,\|\mathbf{w}_{i}\|^{2}$   (13)
where $\mathbf{X}_{i}=[\phi(\mathbf{x}_{i_{1}}),\ldots,\phi(\mathbf{x}_{i_{|\mathcal{N}(\mathbf{x}_{i})|}})]$ is the transformed feature matrix containing all the samples in $\mathcal{N}(\mathbf{x}_{i})$, and $\mathbf{f}_{i}$ is the corresponding vector of predictive label values. We align all the local regression models $\{o_{i}\}$ by summing (13) over all the samples and then get
$\min_{\{\mathbf{w}_{i}\}} \; \sum_{i=1}^{n}\Big(\|\mathbf{X}_{i}^{T}\mathbf{w}_{i} - \mathbf{f}_{i}\|^{2} + \lambda\,\|\mathbf{w}_{i}\|^{2}\Big)$.   (14)


By taking the derivative of the objective function in (13) with respect to $\mathbf{w}_{i}$, we get the following solution:
$\mathbf{w}_{i} = \big(\mathbf{X}_{i}\mathbf{X}_{i}^{T} + \lambda\mathbf{I}\big)^{-1}\mathbf{X}_{i}\mathbf{f}_{i}$.   (15)
By substituting (15) for $\mathbf{w}_{i}$, we get
$\|\mathbf{X}_{i}^{T}\mathbf{w}_{i} - \mathbf{f}_{i}\|^{2} + \lambda\,\|\mathbf{w}_{i}\|^{2} = \lambda\,\mathbf{f}_{i}^{T}\big(\mathbf{K}_{i} + \lambda\mathbf{I}\big)^{-1}\mathbf{f}_{i}$   (16)
where $\mathbf{K}_{i}=\mathbf{X}_{i}^{T}\mathbf{X}_{i}$ is a kernel matrix between the samples in $\mathcal{N}(\mathbf{x}_{i})$. Thus, the objective function in (14) becomes
$\sum_{i=1}^{n} \lambda\,\mathbf{f}_{i}^{T}\big(\mathbf{K}_{i} + \lambda\mathbf{I}\big)^{-1}\mathbf{f}_{i}$.   (17)
Theorem 1: Equation (17) is equivalent to the form given in (18). Proof: See the Appendix.
Let us introduce a selection matrix $\mathbf{S}_{i}\in\{0,1\}^{|\mathcal{N}(\mathbf{x}_{i})|\times n}$ in which $(\mathbf{S}_{i})_{jl}=1$ if $l=i_{j}$, and $(\mathbf{S}_{i})_{jl}=0$ otherwise, where $i_{1},\ldots,i_{|\mathcal{N}(\mathbf{x}_{i})|}$ are the sample indices in $\mathcal{N}(\mathbf{x}_{i})$. Let $\mathbf{f}=[f_{1},\ldots,f_{n}]^{T}$; we then get $\mathbf{f}_{i}=\mathbf{S}_{i}\mathbf{f}$. We can thus rewrite (18) and obtain the local learning regularizer as
$\mathbf{f}^{T}\mathbf{L}\mathbf{f}$   (19)
where $\mathbf{L}$ is defined as
$\mathbf{L} = \sum_{i=1}^{n} \lambda\,\mathbf{S}_{i}^{T}\big(\mathbf{K}_{i} + \lambda\mathbf{I}\big)^{-1}\mathbf{S}_{i}$.   (20)
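The construction of $\mathbf{L}$ implied by (15)–(20) can be sketched in a few lines of numpy. This is our own illustrative reconstruction (the per-neighborhood kernel blocks and a scatter-add realizing $\mathbf{S}_{i}^{T}(\cdot)\mathbf{S}_{i}$), not code from the paper:

```python
import numpy as np

def local_learning_regularizer(K, knn_idx, lam=0.1):
    """Build the local learning regularizer matrix L of (19)-(20).

    K        : n x n kernel matrix over all samples.
    knn_idx  : list of index arrays, knn_idx[i] = neighborhood N(x_i)
               (the sample itself plus its k nearest neighbors).
    lam      : local ridge parameter (lambda in (12)).
    Returns L such that f^T L f sums the optimal local regression
    residuals of (16) over all samples.
    """
    n = K.shape[0]
    L = np.zeros((n, n))
    for idx in knn_idx:
        Ki = K[np.ix_(idx, idx)]                               # kernel matrix of the neighborhood
        Mi = lam * np.linalg.inv(Ki + lam * np.eye(len(idx)))  # lambda * (K_i + lambda I)^{-1}
        L[np.ix_(idx, idx)] += Mi                              # scatter-add, i.e. S_i^T M_i S_i
    return L
```

With $\mathbf{L}$ in hand, the quadratic form $\mathbf{f}^{T}\mathbf{L}\mathbf{f}$ penalizes predictions that cannot be well reproduced by a regression model fitted on their own neighborhoods.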

B. Local Learning Regularized Least Square Model
Recall the linear regression function $f(\mathbf{x})=\mathbf{w}^{T}\mathbf{x}$ in (1). Here we perform it in the reproducing kernel Hilbert space (RKHS) $\mathcal{H}$; the corresponding nonlinear version is $f(\mathbf{x})=\mathbf{w}^{T}\phi(\mathbf{x})$. To estimate the function $f$, we propose a local learning regularized least square model that simultaneously takes advantage of both the labeled and unlabeled data, as well as the local structure of each sample. Based on the local learning regularizer in (19), the local learning regularized least square model is formulated as follows:
$\min_{f\in\mathcal{H}} \; \sum_{\mathbf{x}_{i}\in\mathcal{L}} \big(f(\mathbf{x}_{i}) - y_{i}\big)^{2} + \lambda_{1}\,\mathbf{f}^{T}\mathbf{L}\mathbf{f} + \lambda_{2}\,\|f\|_{\mathcal{H}}^{2}$   (21)
where $f$ is a function in the RKHS $\mathcal{H}$, $\mathbf{f}$ is the vector $[f(\mathbf{x}_{1}),\ldots,f(\mathbf{x}_{n})]^{T}$, $(\mathbf{x}_{i},y_{i})$ is a labeled sample in $\mathcal{L}$, which is a subset of $\mathcal{X}$, and $y_{i}$ is the label of $\mathbf{x}_{i}$. $\mathbf{f}^{T}\mathbf{L}\mathbf{f}$ is the local learning regularizer and $\|f\|_{\mathcal{H}}^{2}$ is a regularizer that controls the model complexity. According to the Representer Theorem [18], the optimal solution is an expansion of kernel functions over both the labeled and unlabeled data:
$f(\mathbf{x}) = \sum_{i=1}^{n} \alpha_{i}\, k(\mathbf{x},\mathbf{x}_{i})$   (22)
where $k(\cdot,\cdot)$ is a positive definite Mercer kernel. Substituting this form into (21), we arrive at a convex differentiable objective function of the $n$-dimensional variable $\boldsymbol{\alpha}=[\alpha_{1},\ldots,\alpha_{n}]^{T}$:
$\min_{\boldsymbol{\alpha}} \; \|\mathbf{y} - \mathbf{K}_{\mathcal{L}}\boldsymbol{\alpha}\|^{2} + \lambda_{1}\,\boldsymbol{\alpha}^{T}\mathbf{K}\mathbf{L}\mathbf{K}\boldsymbol{\alpha} + \lambda_{2}\,\boldsymbol{\alpha}^{T}\mathbf{K}\boldsymbol{\alpha}$   (23)
where $\mathbf{y}$ is the label vector, $\mathbf{K}_{\mathcal{L}}$ is an $|\mathcal{L}|\times n$ kernel matrix with entries $k(\mathbf{x}_{i},\mathbf{x}_{j})$, $\mathbf{x}_{i}\in\mathcal{L}$, and $\mathbf{K}$ is the $n\times n$ kernel matrix with $\mathbf{K}_{ij}=k(\mathbf{x}_{i},\mathbf{x}_{j})$. The derivative of the objective function vanishes at the minimizer:
$-\mathbf{K}_{\mathcal{L}}^{T}\big(\mathbf{y}-\mathbf{K}_{\mathcal{L}}\boldsymbol{\alpha}\big) + \lambda_{1}\mathbf{K}\mathbf{L}\mathbf{K}\boldsymbol{\alpha} + \lambda_{2}\mathbf{K}\boldsymbol{\alpha} = \mathbf{0}$   (24)
which leads to the following solution:
$\boldsymbol{\alpha} = \big(\mathbf{K}_{\mathcal{L}}^{T}\mathbf{K}_{\mathcal{L}} + \lambda_{1}\mathbf{K}\mathbf{L}\mathbf{K} + \lambda_{2}\mathbf{K}\big)^{-1}\mathbf{K}_{\mathcal{L}}^{T}\mathbf{y}$.   (25)
For a test sample $\mathbf{x}$, its label value can be computed as in (22) with the optimal $\boldsymbol{\alpha}$.
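For concreteness, a minimal numpy sketch of the closed-form solution (25) and the prediction rule (22), under the matrix forms reconstructed above; the function and argument names are ours:

```python
import numpy as np

def fit_llr_ls(K, L, labeled_idx, y, lam1=0.1, lam2=0.01):
    """Closed-form solution (25) of the local learning regularized least square model.

    K           : n x n kernel matrix over all (labeled + unlabeled) samples.
    L           : n x n local learning regularizer from (20).
    labeled_idx : indices of the labeled samples.
    y           : labels of the labeled samples (+1 / -1).
    """
    KL = K[labeled_idx, :]                              # |L| x n kernel matrix K_L
    A = KL.T @ KL + lam1 * K @ L @ K + lam2 * K         # system matrix of (25)
    alpha = np.linalg.solve(A, KL.T @ y)                # expansion coefficients
    return alpha

def predict(alpha, K_test_all):
    """Evaluate (22): f(x) = sum_i alpha_i k(x, x_i), one row of K_test_all per test sample."""
    return K_test_all @ alpha
```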

C. Active Sample Selection
Given the above concept learning model, the task of sample selection is to select the most informative samples $\mathcal{Z}$ from $\mathcal{X}$, such that $\mathcal{Z}$ provides the most information to the concept learning model. That is, if the samples in $\mathcal{Z}$ are labeled and used as training samples, the concept learning model can predict the labels of the unlabeled samples most precisely. Closely related to sample selection is optimum experimental design in statistics. The intent of optimum experimental design is usually to minimize the variance of a parameterized model: the samples that minimize the model parameter variance or the model prediction variance are selected as the most informative ones. Here, we develop our sample selection strategy based on the D-optimal experimental design criterion, due to its connection to the confidence region of the model parameters [15]. In D-optimal experimental design, the samples are selected to minimize the determinant of the covariance matrix of the model parameters. Denote by $\mathbf{K}_{\mathcal{Z}}$ the $|\mathcal{Z}|\times n$ kernel matrix between the selected samples and all the samples, and denote $\mathbf{K}_{\mathcal{Z}}^{T}\mathbf{K}_{\mathcal{Z}} + \lambda_{1}\mathbf{K}\mathbf{L}\mathbf{K} + \lambda_{2}\mathbf{K}$ in (25) by $\mathbf{M}$. The covariance matrix of $\hat{\boldsymbol{\alpha}}$ is given as
$\mathrm{Cov}(\hat{\boldsymbol{\alpha}}) = \sigma^{2}\,\mathbf{M}^{-1}\mathbf{K}_{\mathcal{Z}}^{T}\mathbf{K}_{\mathcal{Z}}\,\mathbf{M}^{-1}$.   (26)



Since the parameters $\lambda_{1}$ and $\lambda_{2}$ in $\mathbf{M}$ are usually set to be very small, we get the determinant of the covariance matrix of $\hat{\boldsymbol{\alpha}}$ as
$\det\big(\mathrm{Cov}(\hat{\boldsymbol{\alpha}})\big) \approx \sigma^{2n}\,\det(\mathbf{M})^{-1}$.   (27)
The smaller $\det(\mathbf{M})^{-1}$ is, the smaller is the size of the covariance matrix. Noticing that $\det(\mathbf{M})^{-1}=1/\det(\mathbf{M})$, the D-optimal experimental design criterion is equivalent to selecting the samples that maximize $\det(\mathbf{M})$, i.e., $\max_{\mathcal{Z}}\det\big(\mathbf{K}_{\mathcal{Z}}^{T}\mathbf{K}_{\mathcal{Z}} + \lambda_{1}\mathbf{K}\mathbf{L}\mathbf{K} + \lambda_{2}\mathbf{K}\big)$.
As aforementioned, the sample relevance, density, and diversity information have been found effective in interactive video indexing [5]. Therefore, incorporating this information into the sample selection procedure of optimum experimental design is a worthwhile direction to explore. Recall that the sample relevance, density, and diversity information are introduced in Section III. This information is usually complementary in real-world applications, and the combination is usually more effective than any individual criterion [5]. We thus combine these criteria to measure the informativeness of each sample. According to the theoretical analysis in [13], it is more rational to use the density measure as a weight on relevance than to combine them linearly. Thus, we define the informativeness of sample $\mathbf{x}_{i}$ as a linear combination of diversity and density-weighted relevance:
$I(\mathbf{x}_{i}) = \beta\, d(\mathbf{x}_{i})\, r(\mathbf{x}_{i}) + (1-\beta)\, v(\mathbf{x}_{i})$.   (28)
In order to simultaneously exploit the sample relevance, density, and diversity information, as well as the sample efficacy in minimizing the variance of the parameterized model, our unified sample selection strategy is formulated as the following optimization problem:
$\max_{\mathcal{Z}} \; \det\big(\mathbf{K}_{\mathcal{Z}}^{T}\mathbf{K}_{\mathcal{Z}} + \lambda_{1}\mathbf{K}\mathbf{L}\mathbf{K} + \lambda_{2}\mathbf{K}\big) + \gamma\sum_{\mathbf{x}_{i}\in\mathcal{Z}} I(\mathbf{x}_{i})$.   (29)
We adopt a sequential optimization approach to solve (29) efficiently. The first sample is selected such that (29) is maximized. Suppose a set of $k$ samples $\mathcal{Z}_{k}$ have been selected; we define
$\mathbf{M}_{k} = \mathbf{K}_{\mathcal{Z}_{k}}^{T}\mathbf{K}_{\mathcal{Z}_{k}} + \lambda_{1}\mathbf{K}\mathbf{L}\mathbf{K} + \lambda_{2}\mathbf{K}$.   (30)
The $(k+1)$th sample is selected by solving the following problem:
$\mathbf{x}^{*} = \arg\max_{\mathbf{x}_{i}\notin\mathcal{Z}_{k}} \; \det\big(\mathbf{M}_{k} + \mathbf{k}_{i}\mathbf{k}_{i}^{T}\big) + \gamma\, I(\mathbf{x}_{i})$   (31)
where $\mathbf{k}_{i}$ is the column vector in $\mathbf{K}$ corresponding to sample $\mathbf{x}_{i}$. By using the matrix determinant lemma [44], the determinant of $\mathbf{M}_{k}+\mathbf{k}_{i}\mathbf{k}_{i}^{T}$ can be rewritten as a multiplicative update of the determinant of $\mathbf{M}_{k}$:
$\det\big(\mathbf{M}_{k} + \mathbf{k}_{i}\mathbf{k}_{i}^{T}\big) = \big(1 + \mathbf{k}_{i}^{T}\mathbf{M}_{k}^{-1}\mathbf{k}_{i}\big)\det(\mathbf{M}_{k})$.   (32)
Since $\det(\mathbf{M}_{k})$ is a constant while selecting the $(k+1)$th sample, (31) can be rewritten as
$\mathbf{x}^{*} = \arg\max_{\mathbf{x}_{i}\notin\mathcal{Z}_{k}} \; \big(1 + \mathbf{k}_{i}^{T}\mathbf{M}_{k}^{-1}\mathbf{k}_{i}\big) + \gamma\, I(\mathbf{x}_{i})$.   (33)
As can be seen, the most expensive computation in (33) is the matrix inverse $\mathbf{M}_{k}^{-1}$. Here we use the Sherman-Morrison-Woodbury formula [28] to avoid directly inverting a matrix at every step. Given an invertible matrix $\mathbf{A}$ and two column vectors $\mathbf{u}$ and $\mathbf{v}$, the Sherman-Morrison-Woodbury formula states
$\big(\mathbf{A} + \mathbf{u}\mathbf{v}^{T}\big)^{-1} = \mathbf{A}^{-1} - \dfrac{\mathbf{A}^{-1}\mathbf{u}\mathbf{v}^{T}\mathbf{A}^{-1}}{1 + \mathbf{v}^{T}\mathbf{A}^{-1}\mathbf{u}}$.   (34)
Once the $(k+1)$th sample $\mathbf{x}^{*}$ is selected, the inverse of $\mathbf{M}_{k+1}=\mathbf{M}_{k}+\mathbf{k}^{*}\mathbf{k}^{*T}$ can thus be updated following the Sherman-Morrison-Woodbury formula as
$\mathbf{M}_{k+1}^{-1} = \mathbf{M}_{k}^{-1} - \dfrac{\mathbf{M}_{k}^{-1}\mathbf{k}^{*}\mathbf{k}^{*T}\mathbf{M}_{k}^{-1}}{1 + \mathbf{k}^{*T}\mathbf{M}_{k}^{-1}\mathbf{k}^{*}}$.   (35)
In this way, we only need to compute a matrix inverse once. Our sample selection strategy is summarized in Algorithm 1.
Algorithm 1: Sample selection algorithm
Input: A set of samples $\mathcal{X}$, the number of samples to be selected $m$
Output: The $m$ most informative samples $\mathcal{Z}$
Compute $\mathbf{M}_{0}^{-1}$ and the informativeness $I(\mathbf{x}_{i})$ of each sample
for $k=0,\ldots,m-1$
  for each $\mathbf{x}_{i}\notin\mathcal{Z}_{k}$, compute the score of $\mathbf{x}_{i}$ by (33)
    if the score is larger than the current best, then record $\mathbf{x}_{i}$ as $\mathbf{x}^{*}$
    endif
  end for
  set $\mathcal{Z}_{k+1}=\mathcal{Z}_{k}\cup\{\mathbf{x}^{*}\}$ and update $\mathbf{M}_{k+1}^{-1}$ by (35)
end for
return $\mathcal{Z}_{m}$
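A compact sketch of the greedy selection loop of Algorithm 1, assuming the score form reconstructed in (33), an externally supplied informativeness vector from (28), and an illustrative weight gamma; it is meant to show the rank-one determinant/inverse updates, not to reproduce the authors' exact implementation:

```python
import numpy as np

def select_samples(K, L, info, m, lam1=0.1, lam2=0.01, gamma=1.0, candidates=None):
    """Greedy sample selection in the spirit of Algorithm 1.

    K, L   : n x n kernel matrix and local learning regularizer.
    info   : per-sample informativeness scores (relevance/density/diversity mix).
    m      : number of samples to select.
    Returns the list of selected sample indices.
    """
    n = K.shape[0]
    candidates = set(range(n)) if candidates is None else set(candidates)
    M_inv = np.linalg.inv(lam1 * K @ L @ K + lam2 * K)          # inverse of M_0
    selected = []
    for _ in range(m):
        best, best_score = None, -np.inf
        for i in candidates:
            ki = K[:, i]
            score = (1.0 + ki @ M_inv @ ki) + gamma * info[i]   # objective (33)
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        candidates.remove(best)
        ks = K[:, best]
        Mk = M_inv @ ks
        M_inv = M_inv - np.outer(Mk, Mk) / (1.0 + ks @ Mk)      # rank-one update (34)-(35)
    return selected
```

Only the initial inversion is cubic; each subsequent update in (35) is a rank-one $O(n^{2})$ operation.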

V. EXPERIMENTS
In this section, we conduct experiments on the widely used benchmark TRECVID dataset [1] to evaluate the proposed approach and compare it to other state-of-the-art methods.



Fig. 2. Sample keyframes of the 36 concepts in TRECVID-07 dataset.

A. Dataset Description
The TRECVID-07 dataset [1] is one of the most widely used datasets in the area of video indexing. It consists of videos taken from a real archive, covering the genres of news magazine, science news, news reports, documentaries, educational programming, and archival video. It contains 219 video clips, which are separated into two groups by NIST: a development set and a test set [1]. Specifically, the development set contains 110 video clips, which are segmented into 21 532 keyframes. For each keyframe, NIST provides the ground truth of 36 concepts. These concepts cover a wide range of categories, including program category, setting/scene/site, people, object, activity, and graphics. Fig. 2 shows some sample keyframes of these concepts.

Fig. 3. MAP performance of the five active learning approaches on TRECVID-07 dataset.

B. Experimental Setting
In this subsection, we describe the experimental setting, starting with the visual features.
1) Visual Features: To represent the content of keyframes, we extracted three kinds of visual features: the 225-dimensional block-wise color moments extracted over a 5-by-5 division of the keyframe, with each block represented by a 9-dimensional feature vector; the 75-dimensional edge direction distribution histogram; and the 128-dimensional wavelet texture feature [43]. We concatenated these features into a single 428-dimensional feature vector and normalized each dimension to zero mean and unit standard deviation. It has been widely acknowledged that other features, including bag of visual words, the pyramid histogram of oriented gradients, and gist, can be aggregated to boost the performance [41]. Here, our work focuses on investigating the effectiveness of the proposed active learning approach for video indexing. The influence of

different visual features on the indexing performance is left as future work.
2) Compared Approaches: To demonstrate how our proposed approach improves the performance of interactive video indexing, we compared it to the following four closely related and representative methods:
• Support vector machine active learning (SVMactive) [24]: SVMactive is one of the most popular active learning algorithms. The samples closest to the boundary of the current SVM classifier are selected for labeling and added to the training data for updating the classifier.
• Support vector machine active learning with relevance, density, and diversity (SVMactive+rdd): In SVMactive+rdd, the relevance, density, and diversity criteria are incorporated into the sample selection strategy of SVMactive.



Fig. 4. AP performance of the five active learning approaches at the first round.

• Dual strategy active learning (DUAL) [45]: DUAL is a dual strategy for active learning that exploits both uncertainty and density information for sample selection.
• Laplacian regularized D-optimal design (LapDOD) [22]: LapDOD is based on a Laplacian regularized least square model. It aims to select the samples that minimize the determinant of the covariance matrix of the model parameters.
3) Evaluation Strategy: We separated the dataset into two partitions: a training subset containing 100 videos with about 19 000 keyframes, and a testing subset containing the keyframes of the remaining 10 videos. All the above active learning approaches started with 4000 labeled samples selected from the training subset. At each round, each approach then selected 1000 samples from the training subset for annotation and incorporated them into the labeled sample set. Note that samples selected at previous rounds were excluded from later selections. We adopted the Gaussian RBF kernel as the kernel function. The kernel bandwidth, the tradeoff parameters in SVM, and the regularization parameters in LapDOD and our approach were selected via three-fold cross-validation on the initial labeled samples. The tradeoff parameter in (28) was empirically set to 0.3. We adopted the $k$-nearest neighbor method to find the neighbors of each sample, where $k$ was set to 30 empirically. The performance evaluation was conducted on the testing subset.
4) Evaluation Metric: We adopted the official TRECVID performance metric, average precision (AP), to evaluate and compare the above approaches on each concept. The AP corresponds to the area under a non-interpolated recall/precision curve and favors highly ranked relevant keyframes. We averaged the AP over all 36 concepts to obtain the mean average precision (MAP), which serves as the overall performance metric.
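For reference, non-interpolated average precision over a ranked list of keyframes can be computed as follows; this is our own sketch of the metric, not the official NIST evaluation tool:

```python
import numpy as np

def average_precision(scores, relevant):
    """Non-interpolated AP: mean of precision@rank over the ranks of relevant items."""
    order = np.argsort(-np.asarray(scores))          # rank keyframes by decreasing score
    rel = np.asarray(relevant, dtype=bool)[order]
    hits = np.cumsum(rel)
    prec_at_hits = hits[rel] / (np.flatnonzero(rel) + 1)
    return prec_at_hits.mean() if rel.any() else 0.0

# MAP is the mean of per-concept APs, e.g. np.mean([average_precision(s, r) for s, r in runs])
```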

C. Experimental Results and Analysis
Fig. 3 illustrates the MAP comparison of our proposed approach and the four representative methods: SVMactive [24], SVMactive+rdd, DUAL [45], and LapDOD [22]. From the experimental results, the following observations can be made.
• Our approach outperforms the other four approaches at every active learning round. For example, it achieves about 7.5%, 10%, 15.4%, and 62.6% MAP improvements at the first round compared to LapDOD, SVMactive+rdd, SVMactive, and DUAL, respectively, and it improves the MAP by 13.4%, 7.3%, 12.8%, and 25.4% at the last round.
• Our approach reduces labeling effort while achieving performance comparable to the other methods. For example, it obtains at the third round (7000 labeled samples in total) a performance comparable to that of SVMactive and LapDOD at the last round (13 000 labeled samples in total). That is, our approach can reduce the labeling effort by about 46% compared to SVMactive and LapDOD. Compared to DUAL at the last round, our approach achieves a close performance at the fourth round (8000 labeled samples in total), indicating a reduction in labeling effort of about 38.5%. Compared to SVMactive+rdd at the last round, our approach obtains a close performance at the fifth round (9000 labeled samples in total), i.e., a reduction in labeling effort of about 31%.
• Both our approach and SVMactive+rdd incorporate the sample relevance, density, and diversity information into the sample selection criterion. Different from SVMactive+rdd, our approach further exploits the local structure of each sample and makes use of both labeled and unlabeled samples. These advantages lead to the performance improvement.
• Both LapDOD and our approach are based on the semi-supervised learning framework. Different from LapDOD, which estimates the label consistency from pair-wise relations between samples, our approach estimates the label consistency by exploiting each sample's local structure through the local learning regularizer. Furthermore, our approach



Fig. 5. AP performance of the five active learning approaches at the last round.

incorporates the sample relevance, density, and diversity information into the sample selection criterion. The experimental results show that our approach outperforms LapDOD at every round.
Fig. 4 shows the APs on all 36 concepts obtained by the five approaches at the first round, while Fig. 5 illustrates the APs at the last round. We can see that our approach improves the AP performance on most concepts. Specifically, it performs best on 26 of the 36 concepts at the first round. Some improvements are significant, such as "charts" (176.9% better than DUAL, 76.6% better than LapDOD, 97.3% better than SVMactive, and 55.2% better than SVMactive+rdd) and "Office" (108.4% better than DUAL, 26.2% better than LapDOD, 48.1% better than SVMactive, and 28.1% better than SVMactive+rdd). At the last round, our approach achieves the best performance on 23 concepts. For example, it improves the AP of "building" by 35.8%, 24.2%, 17.5%, and 15.5% compared to DUAL, LapDOD, SVMactive, and SVMactive+rdd, respectively.

VI. CONCLUSIONS
In this paper, we have proposed a new active learning approach for interactive video indexing. It simultaneously exploits each sample's local structure and the relevance, density, and diversity information, and makes use of both labeled and unlabeled samples. To exploit the local structure of samples, we have proposed a local learning regularizer, based on which a local learning regularized least square model has been developed. To actively select the most informative samples, we have further proposed a unified sample selection approach, which takes into account the sample relevance, density, and diversity, as well as the sample efficacy in optimizing the proposed local learning regularized least square model. The experimental results on the TRECVID benchmark have shown that our approach improves the overall video indexing performance compared to the state-of-the-art methods, and can consistently improve the performance over most concepts.
APPENDIX
PROOF OF THEOREM 1 IN SECTION IV-A
Proof: Equation (17) can be rewritten as

REFERENCES [1] TRECVID: TREC Video Retrieval Evaluation. [Online]. Available: http://www-nlpir.nist.gov/projects/trecvid. [2] C. G. M. Snoek and M. Worring, “Concept-based video retrieval,” Found. Trends Inf. Retriev., vol. 4, no. 2, pp. 215–322, 2009. [3] C.-Y. Lin, B. L. Tseng, M. R. Naphade, A. Natsev, and J. R. Smith, “Mpeg-7 video automatic labeling system,” in Proc. 11th ACM Int. Conf. Multimedia, 2003, pp. 98–99. [4] N. Roy and A. McCallum, “Toward optimal active learning through sampling estimation of error reduction,” in Proc. 18th Int. Conf. Machine Learning, 2001, pp. 441–448. [5] S. Ayache and G. Quénot, “Evaluation of active learning strategies for video indexing,” Image Commun., vol. 22, pp. 692–704, 2007.



[6] M.-Y. Chen and A. Hauptmann, “Active learning in multiple modalities for semantic feature extraction from video,” in Proc. 20th National Conf. Artificial Intelligence Workshop Learning in Computer Vision, 2005. [7] M.-Y. Chen, M. Christel, A. Hauptmann, and H. Wactlar, “Putting active learning into multimedia applications: Dynamic definition and refinement of concept classifiers,” in Proc. 13th ACM Int. Conf. Multimedia, 2005, pp. 902–911. [8] M. Wang, X.-S. Hua, J. Tang, and R. Hong, “Beyond distance measurement: Constructing neighborhood similarity for video annotation,” IEEE Trans. Multimedia, vol. 11, no. 3, pp. 465–476, Apr. 2009. [9] M. Wang, X.-S. Hua, R. Hong, J. Tang, G.-J. Qi, and Y. Song, “Unified video annotation via multi-graph learning,” IEEE Trans. Circuits Syst. Video Technol., vol. 19, no. 5, pp. 733–746, May 2009. [10] C. K. Dagli, S. Rajaram, and T. S. Huang, “Leveraging active learning for relevance feedback using an information theoretic diversity measure,” in Proc. Int. Conf. Image and Video Retrieval, 2006, pp. 123–132. [11] E. Parzen, “On estimation of a probability density function and mode,” Ann. Math. Statist., vol. 33, pp. 1065–1076, 1962. [12] M. Wu and B. Schölkopf, “Transductive classification via local learning regularization,” in Proc. 11th Int. Conf. Artificial Intelligence and Statistics, 2007, pp. 628–635. [13] D. A. Cohn, Z. Ghahramani, and M. I. Jordan, “Active learning with statistical models,” J. Artif. Intell. Res., vol. 4, pp. 129–145, 1996. [14] T. S. Huang, C. K. Dagli, S. Rajaram, E. Y. Chang, M. I. M. , G. E. Polner, and D. P. W. Ellis, “Active learning for interactive multimedia retrieval,” Proc. IEEE, vol. 96, no. 4, pp. 648–667, Apr. 2008. [15] A. Atkinson, A. Donev, and R. Tobias, Optimum Experiment Designs, With SAS. Oxford, U.K.: Oxford Univ. Press, 2007. [16] X. Zhu, Semi-Supervised Learning Literature Survey, 2005. [Online]. Available: http://pages.cs.wisc.edu/~jerryzhu/research/ssl/semireview. html. [17] O. Chapelle, B. Schölkopf, and A. Zien, Semi-Supervised Learning. Cambridge, MA: MIT Press, 2010. [18] M. Belkin, P. Niyogi, and V. Sindhwani, “Manifold regularization: A geometric framework for learning from labeled and unlabeled examples,” J. Mach. Learn. Res., vol. 7, pp. 2399–2434, 2006. [19] X. He, W. Min, D. Cai, and K. Zhou, “Laplacian optimal design for image retrieval,” in Proc. SIGIR Conf. Research and Development in Information Retrieval, 2007, pp. 119–126. [20] L. Zhang, C. Chen, W. Chen, J. Bu, D. Cai, and X. He, “Convex experimental design using manifold structure for image retrieval,” in Proc. ACM Int. Conf. Multimedia, Beijing, China, 2009, pp. 45–54. [21] X. He, M. Ji, and H. Bao, “A unified active and semi-supervised learning framework for image compression,” in Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, Miami, FL, 2009, pp. 65–72. [22] X. He, “Laplacian regularized d-optimal design for active learning and its application to image retrieval,” IEEE Trans. Image Process., vol. 19, no. 1, pp. 254–263, Jan. 2009. [23] C. Chen, Z. Chen, J. Bu, C. Wang, L. Zhang, and C. Zhang, “G-optimal design with Laplacian regularization,” in Proc. 24th AAAI Conf. Artificial Intelligence, pp. 254–263. [24] S. Tong and E. Chang, “Support vector machine active learning for image retrieval,” in Proc. ACM Int. Conf. Multimedia, 2001, pp. 107–118. [25] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning. New York: Springer, 2009. [26] Z.-J. Zha, L. Yang, T. Mei, M. Wang, and Z. 
Wang, “Visual query suggestion,” in Proc. ACM Int. Conf. Multimedia, Beijing, China, 2009, pp. 15–24. [27] M. Wu and B. Schölkopf, “Transductive classification via local learning regularization,” in Proc. 11th Int. Conf. Artificial Intelligence and Statistics, Canada, 2007, pp. 628–635. [28] G. H. Golub and C. F. V. Loan, Matrix Computations, 3rd ed. Baltimore, MD: Johns Hopkins Univ. Press, 1996.

[29] G. P. Nguyen, M. Worring, and A. W. M. Smeulders, “Interactive search by direct manipulation of dissimilarity space,” IEEE Trans. Multimedia, vol. 9, no. 7, pp. 1404–1415, Nov. 2007. [30] J. Shen, D. Tao, and X. Li, “Modality mixture projections for semantic video event detection,” IEEE Trans. Circuits Syst. Video Technol., vol. 18, no. 11, pp. 1587–1596, Nov. 2008. [31] Y. Yang, D. Xu, F. Nie, J. Luo, and Y. Zhuang, “Ranking with local regression and global alignment for cross media retrieval,” in Proc. ACM Int. Conf. Multimedia, 2009, pp. 175–1184. [32] X. Tian, L. Yang, J. Wang, X. Wu, and X.-S. Hua, “Transductive video annotation via local learnable kernel classifier,” in Proc. Int. Conf. Multimedia and Expo., 2008, pp. 1509–1512. [33] M. Wang, X.-S. Hua, Y. Song, J. Tang, and L. Dai, “Multi-concept multi-modality active learning for interactive video annotation,” in Proc. Int. Conf. Semantic Computing, 2007. [34] G.-J. Qi, X.-S. Hua, Y. Rui, J. Tang, T. Mei, and H.-J. Zhang, “Correlative multi-label video annotation,” in Proc. ACM Int. Conf. Multimedia, 2007. [35] R. Yan, J. Yang, and A. Hauptmann, “Automatically labeling video data using multi-class active learning,” in Proc. Int. Conf. Computer Vision, 2003. [36] M. Naphadea and J. R. Smith, “Active learning for simultaneous annotation of multiple binary semantic concepts,” in Proc. Int. Conf. Image Processing, 2004. [37] Z.-J. Zha, X.-S. Hua, T. Mei, J. Wang, G.-J. Qi, and Z. Wang, “Active learning for simultaneous annotation of multiple binary semantic concepts,” in Proc. Int. Conf. Image Processing, 2004. [38] H. T. Nguyen and A. Smeulders, “Active learning using pre-clustering,” in Proc. Int. Conf. Machine Learning, 2004. [39] K. Brinker, “Incorporating diversity in active learning with support vector machines,” in Proc. Int. Conf. Machine Learning, 2003. [40] E. Parzen, “On the estimation of a probability density function and the model,” Ann. Math. Statist., vol. 33, no. 2, 1962. [41] Y.-G. Jiang, J. Yang, C.-W. Ngo, and A. Hauptmann, “Representations of keypoint-based semantic concept detection: A comprehensive study,” IEEE Trans. Multimedia, vol. 12, no. 1, pp. 42–53, Jan. 2010. [42] Z.-J. Zha, T. Mei, J. Wang, Z. Wang, and X.-S. Hua, “Graph-based semi-supervised learning with multiple labels,” J. Vis. Commun. Image Represent., vol. 20, no. 2, 2009. [43] T. Mei, X.-S. Hua, W. Lai, L. Yang, and Z.-J. Zha et al., “MSRA-USTC-SJTU at TRECVID 2007: High-level feature extraction and search,” in Proc. TRECVID 2007 Workshop, 2007. [44] D. A. Harville, Matrix Algebra From a Statistician’s Perspective. New York: Springer, 1997. [45] P. Donmez, J. G. Carbonell, and P. N. Bennett, “Dual strategy active learning,” in Proc. European Conf. Machine Learning, 2007. [46] S. Huang, R. Jin, and Z. H. Zhou, “Active learning by querying informative and representative examples,” in Proc. Advance in Neural Information Processing Systems, 2010. [47] Z.-J. Zha, Y.-T. Zheng, M. Wang, F, Chang, and T.-S. Chua, “Locally Regressive G-optimal design for image retrieval,” in Proc. Int. Conf. Multimedia Retrieval, 2011. Zheng-Jun Zha (M’08) received the B.E. degree and the Ph.D. degree in the Department of Automation from the University of Science and Technology of China (USTC), Hefei, China, in 2004 and 2009, respectively. He is a senior research fellow in the School of Computing, National University of Singapore. 
His current research interests include multimedia content analysis, computer vision, as well as multimedia applications such as search, recommendation, and social networking. Dr. Zha received the Microsoft Research Fellowship in 2007, the President Scholarship of Chinese Academy of Science in 2009, and the Best Paper Award in the 17th ACM International Conference on Multimedia. He is a member of ACM.


Meng Wang (M'09) received the B.E. degree and Ph.D. degree in the Special Class for the Gifted Young and the Department of Electronic Engineering and Information Science, respectively, from the University of Science and Technology of China (USTC), Hefei, China. He is a Professor at the Hefei University of Technology. He previously worked as an associate researcher at Microsoft Research Asia, and then as a core member of a startup in Silicon Valley. After that, he worked at the National University of Singapore as a senior research fellow. His current research interests include multimedia content analysis, search, mining, recommendation, and large-scale computing. He has authored more than 100 book chapters, journal papers, and conference papers in these areas. Dr. Wang received the Best Paper Award in both the 17th and 18th ACM International Conferences on Multimedia and the Best Paper Award in the 16th International Multimedia Modeling Conference. He is a member of ACM.

Yan-Tao Zheng received the B.Eng. degree (with 1st class Hons) from Nanyang Technological University, Singapore, and the Ph.D. degree from the National University of Singapore (with Best Ph.D. Thesis Award). He is a research scientist at the Institute for Infocomm Research, Singapore. His research interests include geo-mining in multimedia, image annotation, and video search. He is the recipient of a number of international awards, including Champion of Star Challenge, Microsoft Research Fellowship, IBM Watson Emerging Multimedia Leaders, and so on. During his attachment at Google Inc. in 2008, he developed a world-scale landmark recognition engine together with Google engineers, which has been highly praised and well publicized. Dr. Zheng is a member of ACM. He has served as program committee member and reviewer of a number of prestigious international conferences and journals.

Yi Yang received the Ph.D. degree from the Department of Computer Science, Zhejiang University, Hangzhou, China, in 2010. He is currently a Post-Doctoral Research Fellow with the School of Computer Science at Carnegie Mellon University, Pittsburgh, PA. His current research interests include machine learning and data mining and their applications to multimedia and computer vision.


Richang Hong received the Ph.D. degree from the University of Science and Technology of China, Hefei, China, in 2008. He worked as a Research Fellow in the School of Computing, National University of Singapore, from September 2008 to December 2010. He is now a Professor at the Hefei University of Technology. His research interests include multimedia question answering, video content analysis, and pattern recognition. He has coauthored more than 50 publications in these areas. Dr. Hong is a member of the Association for Computing Machinery. He won the Best Paper Award at ACM Multimedia 2010.

Tat-Seng Chua received the B.S. and Ph.D. degrees from the University of Leeds, Leeds, U.K., in 1979 and 1983, respectively. He then joined the National University of Singapore, where he is now a full Professor at the School of Computing. He was the Acting and Founding Dean of the School of Computing during 1998–2000. He spent three years as a research staff member at the Institute of Systems Science in the late 1980s. His main research interest is in multimedia information processing, in particular on the extraction, retrieval, and question-answering (QA) of text and video information. He focuses on the use of relations between entities and external information and knowledge sources to enhance information processing. He is currently working on several multi-million-dollar projects: interactive media search, local contextual search, and event detection in movies. His group participates regularly in TREC-QA and TRECVID news video retrieval evaluations. Dr. Chua is active in the international research community. He has organized and served as program committee member of numerous international conferences in the areas of computer graphics, multimedia, and text processing. He is the conference co-chair of ACM Multimedia 2005, CIVR (Conference on Image and Video Retrieval) 2005, and ACM SIGIR 2008. He serves in the editorial boards of: ACM Transactions of Information Systems (ACM), The Visual Computer (Springer), and Multimedia Tools and Applications (Springer). He is a member of the steering committee of CIVR, Computer Graphics International, and Multimedia Modeling conference series; and as member of International Review Panels of two large-scale research projects in Europe.