Multi-Label Transfer Learning with Sparse Representation

Yahong Han, Fei Wu, Yueting Zhuang, Member, IEEE, and Xiaofei He, Member, IEEE
Abstract—Because videos and images are visually polysemous, they may be annotated with multiple tags. Discovering the correlations among different tags can significantly help predict precise labels for videos and images. Many recent studies of multi-label learning construct a linear subspace embedding that encodes the multi-label information, such that data points sharing many common labels tend to be close to each other in the embedded subspace. Motivated by advances in compressive sensing research, a sparse representation that selects a compact subset of features to describe the input data can be more discriminative. In this paper, we propose a sparse multi-label learning method to circumvent the visually polysemous barrier of multiple tags. Our approach learns a multi-label encoded sparse linear embedding space from a related dataset and maps the target data into the learned representation space to achieve better annotation performance. Instead of using only the l1-norm penalty (lasso) to induce the sparse representation, we formulate multi-label learning as a penalized least squares optimization problem with an elastic-net penalty. By casting the video concept detection and image annotation tasks into this sparse multi-label transfer learning framework, ridge regression, the lasso, the elastic net, and a multi-label extension of sparse discriminant analysis are explored and compared.

Index Terms—Image annotation, multi-label learning, sparse representation, transfer learning, video concept detection.
I. Introduction
Taking semantic concepts or tags as class labels, the video concept detection and automatic image annotation problems can be solved by machine learning algorithms [1]-[4]. Furthermore, there are usually many objects or regions of interest within each image, which implies that videos and images are intrinsically visually polysemous and should be annotated with multiple tags (labels). It has been shown that the TRECVID 2005 dataset has this multi-label property: 71.32% of all subshots have more than one label, and some subshots are labeled with as many as 11 concepts [5]. Moreover,
with the booming of social tagging websites such as Flickr and YouTube, we are witnessing an explosion of multi-labeled images and videos. As indicated in [6], there are over 10 million images on Flickr labeled with more than four tags on average by different users. The correlations of multi-labeled tags may reflect the semantic similarity of the tagged images. Therefore, in order to exploit the inherent correlations among multiple labels, many efforts have been made to cast the video concept detection and image annotation problems into a multi-label learning framework [7]-[9].

Early approaches to multi-label learning treat each individual tag as a separate binary classification problem [10]; the drawback is that the correlations among different class labels are ignored. To exploit the correlation information inherent in multiple labels, both supervised and semi-supervised multi-label learning algorithms [7]-[9] have been proposed recently.

In this paper, we propose a multi-label transfer learning framework to solve the video concept detection and automatic image annotation problems. We take automatic image annotation as a running example to motivate the discussion of the proposed framework. Suppose we have a training set of images, each labeled with multiple tags. One may assume that visually similar images should be annotated with similar tags. We can then predict appropriate tags for images in the test set using the semi-supervised manifold ranking algorithm [11]: we first construct an adjacency graph whose weights describe the visual similarity between images; then, for each tag, taking the training images annotated with that tag as positive items, we compute ranking scores for all images in the test set, and the top-ranked images are assigned the tag (a minimal sketch of this procedure is given after the two issues below). However, two issues need to be addressed.
1) First, the above approach considers each tag as an independent class label, so the multi-label correlations are neglected. As discussed previously, correlations among different tags are helpful for predicting image tags.
2) Second, in practice, owing to the so-called "semantic gap" between low-level features and high-level semantics, and also to the visually polysemous nature of images, visually similar images may share only partially overlapping tags among multiple correlated candidates. For a designated target tag, label consistency therefore does not always hold among visually similar images. If we take each tag individually as a class label, the corresponding training data may be small and of low quality, which may result in unsatisfactory performance.
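For concreteness, the following is a minimal sketch of the manifold ranking procedure of [11]; the RBF affinity and the values of `alpha` and `sigma` are illustrative assumptions, not the settings used in this paper.

```python
# A minimal sketch of the manifold ranking baseline [11].
import numpy as np

def manifold_ranking(X, y, alpha=0.99, sigma=1.0):
    """X: n x d feature matrix; y: length-n vector, 1 for positive items."""
    # Pairwise RBF affinities with a zeroed diagonal.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Symmetric normalization S = D^{-1/2} W D^{-1/2}.
    d = 1.0 / np.sqrt(W.sum(axis=1))
    S = W * d[:, None] * d[None, :]
    # Closed-form ranking scores f* = (I - alpha * S)^{-1} y.
    return np.linalg.solve(np.eye(len(y)) - alpha * S, y)

# For each tag: y[i] = 1 if training image i carries the tag, else 0;
# the top-ranked test images are then assigned the tag.
```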
In order to address the above two issues, two types of correlations should be exploited: correlations among multiple labels, and correlations between labels and the common visual features of images. Motivated by recent progress on transfer learning [12], [13], such correlations learned from a larger but related set of unlabeled images can be applied to the learning task at hand. As introduced above, the prosperity of social tagging websites provides abundant multi-labeled image data that are inexpensive to obtain. For the image annotation task, we can download a large number of images and their annotated tags from Flickr to form a different but related multi-labeled data set. The multi-label correlations inherent in this related data set can be exploited, and the corresponding knowledge transferred from the related domain to the current learning task. The basic intuition is that the multi-label correlations learned from the related data set can be applied directly to the target data set through a linear subspace embedding. Different from existing transfer learning algorithms that learn from related labeled or unlabeled data, we propose to learn from a related multi-labeled data set.

To tackle the aforementioned weaknesses under a multi-label transfer learning framework, two key questions need to be answered.
1) First, how to efficiently perform feature extraction on the related multi-labeled data set so as to encode the multi-label information into the new features.
2) Second, how to learn common features or base patterns from examples in the related data set, and then map the target data into the new representation space efficiently to boost learning performance.
Many multi-label learning algorithms address only the first question. In [14] and [15], the authors developed algorithms for multi-label dimensionality reduction. In [16], Ji et al. proposed a general framework for extracting a lower-dimensional subspace assumed to be shared among multiple labels. In [17], Sun et al. proposed a hypergraph spectral multi-label learning framework (denoted HSML in this paper) for learning a lower-dimensional embedding that attempts to preserve the inherent multi-label relationships among data points, as captured by the hypergraph Laplacian. In [18], Raina et al. partially addressed the second question by developing a transfer learning algorithm that uses sparse coding to learn a high-level representation from unlabeled data and then applies this representation to labeled data for classification. Similarly, in [19], the authors learned from multiple related tasks a sparse prototype representation based on kernel distances to unlabeled data points. Recently, Wang et al. [4] attempted to tackle both the multi-label learning and sparse representation problems by a two-stage process, i.e., label sparse coding followed by feature sparse coding. These approaches share a common motivation: instances in the target data set can be represented by a sparse linear combination of an overcomplete dictionary of base elements learned from the related data set. In the statistical signal processing community, such sparse linear combinations
can be efficiently computed by $l_1$-norm penalized least squares methods such as the lasso [20]. Due to the nature of the $l_1$-norm penalty, some regression coefficients are shrunk to exactly zero, which simultaneously produces a sparser and more interpretable model.

Different from the above approaches, this paper addresses the two questions by developing a sparse multi-label transfer learning (S-MLTL) framework in which multi-label subspace learning and sparse representation are integrated. According to the research in [21], HSML [17] involves a generalized eigenvalue problem which, under a mild condition, can be formulated as a least squares problem. Based on this least squares formulation, we propose a sparse hypergraph spectral multi-label (S-HSML) learning algorithm by imposing regularization on the regression coefficients. Specifically, we cast the HSML algorithm into an elastic net formulation [22] by adding an elastic-net penalty [22], a convex combination of the $l_1$-norm and $l_2$-norm penalties, to the least squares objective. The application of the elastic net to gene selection in microarray analysis [22] has shown that the elastic net often outperforms the lasso while enjoying a similarly sparse representation when predictors are strongly correlated. Within the S-MLTL framework, we explore the performance of ridge regression, the lasso, and the elastic net on the multi-label learning task.

We summarize the proposed sparse multi-label transfer learning framework in three sequential steps.
1) First, apply sparse multi-label learning to the related multi-labeled data set. The output of this step is a sparse lower-dimensional transformation matrix W, in which the multi-label correlations are encoded.
2) Second, project the target data set (training data and test data) onto a lower-dimensional space by W.
3) Third, apply a supervised or semi-supervised learning approach for each label in the new feature subspace.
Central to this framework is the first step, which is the primary focus of this paper. Moreover, the framework can easily be extended by introducing different sparse multi-label learning approaches into the first step. Besides the S-HSML method, we also discuss a multi-label extension of sparse discriminant analysis (SLDA) [23]. SLDA was not originally proposed for multi-label classification: its label indicator matrix is constrained so that every instance belongs to a single class. By performing optimal scoring on a multi-label indicator matrix, we propose the multi-label SLDA (ML-SLDA) and embed it into our multi-label transfer learning framework. Formulation details of ML-SLDA and experimental studies are also presented.

We evaluate our framework on two practical tasks: video concept detection and automatic image annotation. For each task, we apply the lasso, ridge regression, and elastic net penalties to the least squares formulation, respectively, and compare the results with a baseline algorithm. The performance of embedding SLDA into our framework is also explored in the experiments. Experimental results show that learning a sparse representation from a different but related data set does improve the multi-label learning performance on the target tasks.
The key contributions of this paper are highlighted as follows.
1) We propose a general framework for multi-label transfer learning, in which multi-label correlations are transferred from a related data set to the target data set through a learned sparse lower-dimensional embedding.
2) We cast hypergraph spectral multi-label learning into a more general sparse representation framework, i.e., the elastic net. Furthermore, the comparative performance of ridge regression, the lasso, and the elastic net in the multi-label learning task is presented in detail.
3) A multi-label extension of sparse discriminant analysis is introduced.
The rest of this paper is organized as follows. Section II reviews HSML and the elastic net. Section III proposes the sparse multi-label transfer learning framework. Section IV introduces the multi-label extension of sparse discriminant analysis. Experimental details and results are reported in Section V, and we conclude in Section VI.

Notations: We use $n_r$, $n_c$, and $n_t$ to denote the numbers of instances in the related data set, training set, and test set, respectively. The data dimensionality and the number of labels are denoted by $d$ and $m$, respectively. The related data set, training set, and test set are denoted by $X^{(r)} \in \mathbb{R}^{n_r \times d}$, $X^{(c)} \in \mathbb{R}^{n_c \times d}$, and $X^{(t)} \in \mathbb{R}^{n_t \times d}$, respectively. $X$ is assumed to be centered, i.e., $\sum_{i=1}^{n} x_i = 0$. The multi-label indicator matrix for $X$ is denoted by $Y \in \{0, 1\}^{n \times m}$.
Fig. 1. Steps of the hypergraph spectral multi-label learning algorithm.

II. Brief Review of HSML and Elastic Net

A. Hypergraph Spectral Multi-Label Learning

The HSML algorithm [17] constructs a hypergraph to capture the correlation information contained in different labels, from which a lower-dimensional embedding, i.e., a linear transformation matrix $W$, is learned for better representation. In the multi-label learning task, the label indicator matrix $Y \in \{0, 1\}^{n \times m}$ can be taken directly as a hypergraph, where $n$ denotes the number of instances and $Y_{ij} = 1$ if the $i$th instance has the $j$th label and 0 otherwise. In this representation, each label corresponds to a hyperedge, i.e., a column vector of $Y$. The HSML framework solves the following optimization problem:
$$\min_{W \in \mathbb{R}^{d \times k}} \operatorname{trace}(W^T X L X^T W) \quad \text{s.t.} \quad W^T X X^T W = I \tag{1}$$
where $I$ is the identity matrix and $L$ denotes the normalized hypergraph Laplacian of the hypergraph $Y$ defined in [24]. Following spectral graph embedding theory [25], the linear transformation $W \in \mathbb{R}^{d \times k}$ ($k \ll d$) projects the data onto a $k$-dimensional embedding subspace. Letting $S = I - L$ and noting that the constraint fixes $\operatorname{trace}(W^T X X^T W) = k$, the optimization problem in (1) can be reformulated as
$$\max_{W \in \mathbb{R}^{d \times k}} \operatorname{trace}(W^T X S X^T W) \quad \text{s.t.} \quad W^T X X^T W = I \tag{3}$$
which can be solved by the following generalized eigenvalue problem:
$$X S X^T w = \lambda (X X^T) w. \tag{4}$$
It has been proved in [17] and [21] that, under a mild condition, the eigenvalue problem (4) can be formulated as the least squares problem
$$\min_{W} \|W^T X - H^T\|_F^2 \tag{6}$$
where $H$ is a target matrix constructed from the top eigenvectors of $S$ (see [21]). We summarize the solution of HSML in Fig. 1.
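As a concreteness check, (4) can also be solved directly with a generalized symmetric eigensolver. The sketch below follows the reformulation above with $S = I - L$, treats $X$ as the $d \times n$ matrix whose columns are the centered instances (as in the equations), and adds a small ridge term to $X X^T$ for numerical invertibility; the ridge term is an assumption of this sketch, not part of HSML.

```python
# A sketch of solving the generalized eigenvalue problem (4).
import numpy as np
from scipy.linalg import eigh

def hsml_embedding(X, L, k, ridge=1e-6):
    """X: d x n centered data; L: n x n normalized hypergraph Laplacian."""
    S = np.eye(L.shape[0]) - L
    A = X @ S @ X.T                            # left-hand side of (4)
    B = X @ X.T + ridge * np.eye(X.shape[0])   # right-hand side of (4)
    evals, evecs = eigh(A, B)                  # eigenvalues in ascending order
    return evecs[:, -k:]                       # W: top-k generalized eigenvectors
```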
B. Lasso and Elastic Net

Given a response $y$ and a design matrix $X$, the lasso [20] performs continuous shrinkage and automatic variable selection simultaneously. It nevertheless has two drawbacks. 1) First, in the $p > n$ case the lasso selects at most $n$ variables before it saturates. 2) Second, if there are high correlations between predictor variables, the prediction performance of the lasso may not be optimal. The elastic net [22] generalizes the lasso to overcome these drawbacks. We first introduce the naive elastic net, and then present the formulation of the elastic net.
1) Naive Elastic Net: Suppose the data set has $n$ instances with $p$ variables. For any non-negative $\lambda_1$ and $\lambda_2$, letting $\beta \in \mathbb{R}^p$ denote the coefficient vector, the naive elastic net is defined as the optimization problem
$$\min_{\beta} \|y - X\beta\|_2^2 + \lambda_2 \|\beta\|_2^2 + \lambda_1 \|\beta\|_1. \tag{7}$$
Let $\alpha = \lambda_2/(\lambda_1 + \lambda_2)$; then (7) can be reformulated as the penalized least squares problem [22]
$$\min_{\beta} \|y - X\beta\|_2^2 \quad \text{s.t.} \quad \alpha \|\beta\|_2^2 + (1 - \alpha)\|\beta\|_1 \le t \ \text{for some } t \tag{8}$$
where $\alpha \|\beta\|_2^2 + (1 - \alpha)\|\beta\|_1$ is called the elastic-net penalty [22], a convex combination of the $l_1$-norm and $l_2$-norm penalties. The naive elastic net criterion (8) can be written as
$$\min_{\beta^*} \|y^* - X^*\beta^*\|_2^2 + \gamma \|\beta^*\|_1 \tag{9}$$
where $\gamma = \lambda_1/\sqrt{1 + \lambda_2}$, $\beta^* = \sqrt{1 + \lambda_2}\,\beta$, $y^*_{(n+p)} = [y^T, \mathbf{0}^T]^T$, and $X^*_{(n+p) \times p} = (1 + \lambda_2)^{-1/2}\,[X^T, \sqrt{\lambda_2}\, I]^T$. Therefore, (9) can be solved by the lasso, and the naive elastic net can potentially select all $p$ variables even in the $p > n$ case. Letting
$$\hat{\beta}^* = \arg\min_{\beta^*} \|y^* - X^*\beta^*\|_2^2 + \gamma \|\beta^*\|_1 \tag{10}$$
we have
$$\hat{\beta}(\text{naive elastic net}) = \frac{1}{\sqrt{1 + \lambda_2}}\,\hat{\beta}^*. \tag{11}$$
2) Elastic Net: The elastic net estimate $\hat{\beta}(\text{elastic net})$ is given as follows [22]:
$$\hat{\beta}(\text{elastic net}) = (1 + \lambda_2)\,\hat{\beta}(\text{naive elastic net}) \tag{12}$$
where the factor $(1 + \lambda_2)$ removes the double amount of shrinkage incurred by the naive elastic net [22]. Obviously, the lasso is the special case of the elastic net with $\lambda_2 = 0$. Given a fixed $\lambda_2$, the optimization problem (12) can be efficiently solved by the LARSEN algorithm [22]. Experimental studies in [22] showed that the elastic net often performs better in prediction while enjoying a similarly sparse representation.
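To make the reduction concrete, the sketch below implements (9)-(12) on top of scikit-learn's lasso solver rather than LARSEN; the mapping of $\gamma$ to scikit-learn's `alpha` reflects that library's $1/(2\,n_{\text{samples}})$ scaling of the squared loss, and `fit_intercept=False` assumes centered data.

```python
# A sketch of the augmented-data reduction (9)-(12): the naive elastic net
# is a lasso on (X*, y*), rescaled by (11) and (12).
import numpy as np
from sklearn.linear_model import Lasso

def elastic_net_via_lasso(X, y, lam1, lam2):
    n, p = X.shape
    c = np.sqrt(1.0 + lam2)
    X_star = np.vstack([X, np.sqrt(lam2) * np.eye(p)]) / c   # (n+p) x p
    y_star = np.concatenate([y, np.zeros(p)])
    gamma = lam1 / c
    lasso = Lasso(alpha=gamma / (2.0 * (n + p)), fit_intercept=False)
    beta_star = lasso.fit(X_star, y_star).coef_              # solves (10)
    return (1.0 + lam2) * (beta_star / c)                    # (11) then (12)
```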
III. Multi-Label Transfer Learning Framework

In this section, we present sparse extensions of the HSML algorithm by adding the $l_1$- and $l_2$-penalties simultaneously to the least squares formulation (6), such that the sparse HSML (S-HSML) method is cast as an elastic net formulation. With the proposed S-HSML approach, we then present the S-MLTL framework.

A. Sparse HSML Formulation

From the viewpoint of model fitting, (6) is an ordinary least squares problem that minimizes the residual sum of squares. Taking prediction accuracy on future data and interpretability of the model as the two criteria for evaluating a model's effectiveness, ordinary least squares often suffers from overfitting. To improve it, we may penalize the regression coefficients. First, we can extend (6) with an $l_2$-penalty on $W$, which yields ridge regression and conducts continuous shrinkage of the regression coefficients. Then (6) is reformulated as
$$\|W^T X - H^T\|_F^2 + \lambda \sum_{j=1}^{k} \|w_j\|_2^2 \tag{13}$$
where $\lambda > 0$ is the regularization parameter and $w_j \in \mathbb{R}^d$ denotes the $j$th column of $W$. However, ridge regression cannot produce a sparse representation: owing to the nature of the $l_2$-norm, it always keeps all predictors in the model. We can instead replace the $l_2$-penalty in (13) with an $l_1$-norm on $w_j$, which yields the lasso and conducts continuous shrinkage and variable selection simultaneously. Specifically, (13) is reformulated as
$$\|W^T X - H^T\|_F^2 + \lambda \sum_{j=1}^{k} \|w_j\|_1. \tag{14}$$
Furthermore, we propose to reformulate (6) with an elastic-net penalty, so that sparse HSML (S-HSML) is formulated as the following optimization problem:
$$\min_{W \in \mathbb{R}^{d \times k}} \left[ \|W^T X - H^T\|_F^2 + \lambda_2 \sum_{j=1}^{k} \|w_j\|_2^2 + \lambda_1 \sum_{j=1}^{k} \|w_j\|_1 \right] \tag{15}$$
where $k$ is the dimensionality of the target embedding subspace.
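Note that the Frobenius objective in (15) decouples over the columns of $W$: each $w_j$ is an independent elastic net fit of $X^T$ against the $j$th column of $H$. A minimal sketch, reusing `elastic_net_via_lasso` from Section II-B and assuming the target matrix $H$ of (6) has already been computed:

```python
# A sketch of the S-HSML step (15), solved one column of W at a time.
import numpy as np

def sparse_hsml(X, H, lam1, lam2):
    """X: d x n centered data; H: n x k target matrix; returns W (d x k)."""
    # ||W^T X - H^T||_F^2 = sum_j ||X^T w_j - h_j||_2^2
    return np.column_stack(
        [elastic_net_via_lasso(X.T, H[:, j], lam1, lam2) for j in range(H.shape[1])]
    )
```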
B. Framework of Sparse Multi-Label Transfer Learning

As introduced in Section I, our sparse multi-label transfer learning framework involves three steps.
1) First, apply S-HSML to the related multi-labeled data set to compute the lower-dimensional embedding matrix $W$.
2) Second, project the target data set onto the embedding subspace by $W$.
3) Third, taking each label independently as a class label, predict the labels of the test set with a supervised or semi-supervised algorithm.
We summarize our S-MLTL framework in Fig. 2; a sketch of the full pipeline is given below. According to (12), to solve (17) by LARSEN we only need to fix $\lambda_2$. Furthermore, when $\lambda_2 = 0$ the optimization problem in (17) can be solved by the lasso, and when $\lambda_1 = 0$ it reduces to a ridge regression problem. The computational complexity of solving (17) by the LARSEN algorithm is critical for the S-MLTL framework. As indicated in [22], LARSEN requires $O(r^3 + pr^2)$ operations to stop after $r$ steps. The experiments in this paper, as well as the real-data experiments in [22], show that the optimal results are achieved at an early stage of LARSEN; in practice, $r$ is a moderate constant.
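An end-to-end sketch of the three steps follows, assuming $W$ was learned on the related set by `sparse_hsml` above. The kNN scorer is only a placeholder for whichever supervised or semi-supervised learner is chosen in step 3.

```python
# A sketch of the three-step S-MLTL pipeline of Fig. 2.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def smltl_predict(W, X_train, Y_train, X_test, n_neighbors=5):
    """X_*: instances x d; Y_train: instances x m binary label matrix."""
    Z_train, Z_test = X_train @ W, X_test @ W       # step 2: project by W
    scores = np.zeros((X_test.shape[0], Y_train.shape[1]))
    for j in range(Y_train.shape[1]):               # step 3: one ranker per label
        clf = KNeighborsClassifier(n_neighbors=n_neighbors)
        clf.fit(Z_train, Y_train[:, j])
        # Probability of the positive class (assumes both classes occur).
        scores[:, j] = clf.predict_proba(Z_test)[:, -1]
    return scores  # rank test items per label by these scores
```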
Fig. 2. Sparse multi-label transfer learning (S-MLTL) framework.

IV. Multi-Label Sparse Discriminant Analysis

Linear discriminant analysis (LDA) [29] is a favored tool for supervised classification and a supervised dimensionality reduction method that projects data along the most discriminative directions. To address the situations in which LDA fails, many penalized discriminant analysis techniques have been proposed, such as PDA [30], FDA [31], and so on.

Furthermore, to introduce a sparseness criterion, we can perform LDA by regression [23], [31]. LDA was rewritten in [23] as a regression-type problem using optimal scoring [31], defined as
$$\min_{\theta, \beta}\; n^{-1}\|Y\theta - X\beta\|_2^2 \quad \text{s.t.} \quad n^{-1}\|Y\theta\|_2^2 = 1 \tag{20}$$
where $Y$ is the label indicator matrix of the data $X$ under the constraint $Y\mathbf{1} = \mathbf{1}$, and $\mathbf{1}$ denotes a vector of all ones; thus, in (20), $Y$ is not the label indicator matrix of a multi-label problem. By assigning scores, $\theta$ converts the binary entries of $Y$ into real numbers. Given a symmetric and positive semi-definite penalization matrix $\Omega$ [23], PDA adds a penalty of $\beta^T \Omega \beta$ to (20) as
$$\min_{\theta, \beta}\; n^{-1}\|Y\theta - X\beta\|_2^2 + \lambda_2 \|\Omega^{1/2}\beta\|_2^2 \quad \text{s.t.} \quad n^{-1}\|Y\theta\|_2^2 = 1. \tag{21}$$
To obtain sparseness in PDA, Clemmensen et al. [23] added a sparsity-inducing term to (21) and formalized sparse discriminant analysis as
$$\min_{\theta, \beta}\; n^{-1}\|Y\theta - X\beta\|_2^2 + \lambda_2 \|\Omega^{1/2}\beta\|_2^2 + \lambda_1 \|\beta\|_1 \quad \text{s.t.} \quad n^{-1}\|Y\theta\|_2^2 = 1. \tag{22}$$
To solve (22), we first fix $\theta$ and obtain
$$\min_{\beta}\; n^{-1}\|Y\theta - X\beta\|_2^2 + \lambda_2 \beta^T \Omega \beta + \lambda_1 \|\beta\|_1 \tag{23}$$
which, for $\Omega = I$, is an elastic net problem and can be solved by the LARSEN algorithm. For fixed $\beta$, we obtain a Procrustes-like problem [32]
$$\min_{\theta}\; n^{-1}\|Y\theta - X\beta\|_2^2 \quad \text{s.t.} \quad n^{-1}\|Y\theta\|_2^2 = 1. \tag{24}$$
Under the constraint $Y\mathbf{1} = \mathbf{1}$, the closed-form solution of (24) was given in [23]. In this paper, we propose to remove the constraint $Y\mathbf{1} = \mathbf{1}$ and relax $Y$ to $Y \in \{0, 1\}^{n \times m}$, which can indicate multiple labels per instance; we thereby extend SLDA to the multi-labeled SLDA (ML-SLDA) for $\Omega = I$. To solve ML-SLDA, we only need to present the updated solution of (24). We rewrite (24) as
$$\min_{\theta}\; \|n^{-1/2}Y\theta - n^{-1/2}X\beta\|_2^2 \quad \text{s.t.} \quad (n^{-1/2}Y\theta)^T(n^{-1/2}Y\theta) = 1 \tag{25}$$
which is a standard Procrustes problem [32]. Taking the SVD [33] of $n^{-1/2}X\beta$ as
$$\mathrm{SVD}(n^{-1/2}X\beta) = USV^T \tag{26}$$
the Procrustes solution gives
$$n^{-1/2}Y\theta = UV^T. \tag{27}$$
Thus the solution of (25) is
$$\theta = n^{1/2}\,Y^{\dagger}UV^T \tag{28}$$
where $Y^{\dagger}$ denotes the pseudo-inverse [33] of $Y$.
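Putting the two updates together, the resulting alternating scheme for $\Omega = I$ might be sketched as follows, reusing `elastic_net_via_lasso` from Section II-B; folding the $n^{-1}$ factor of (23) into the penalty weights is a simplification of this sketch.

```python
# A sketch of the alternating ML-SLDA updates (23) and (25)-(28).
import numpy as np

def ml_slda(X, Y, k, lam1, lam2, n_iter=20):
    """X: n x d centered data; Y: n x m multi-label indicator matrix."""
    n, m = Y.shape
    Theta = np.random.default_rng(0).standard_normal((m, k))
    Y_pinv = np.linalg.pinv(Y)
    for _ in range(n_iter):
        # beta-step (23): one elastic net per discriminative direction.
        W = np.column_stack(
            [elastic_net_via_lasso(X, Y @ Theta[:, j], lam1, lam2) for j in range(k)]
        )
        # theta-step: Procrustes update via the thin SVD (26).
        U, _, Vt = np.linalg.svd(X @ W / np.sqrt(n), full_matrices=False)
        Theta = np.sqrt(n) * Y_pinv @ U @ Vt        # (28)
    return W, Theta
```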
Fig. 3. Steps of the solution for ML-SLDA.
Alternately updating $\beta$ and $\theta$ by solving (23) and applying (28), respectively, until convergence, we can solve the multi-labeled SLDA. We summarize the solution of ML-SLDA in Fig. 3. To embed ML-SLDA into our S-MLTL framework, we first substitute $W$, as defined in (17), for $\beta$ in (22), and then substitute $Y^{(r)}$ for $Y$ and $X^{(r)}$ for $X$ in (22). Taking $\Omega = I$, the ML-SLDA in this paper solves the following optimization problem:
$$\min_{\theta, W}\; \left[ n^{-1}\|Y^{(r)}\theta - X^{(r)}W\|_F^2 + \lambda_2 \sum_{j=1}^{k}\|w_j\|_2^2 + \lambda_1 \sum_{j=1}^{k}\|w_j\|_1 \right] \quad \text{s.t.} \quad n^{-1}\|Y^{(r)}\theta\|_2^2 = 1. \tag{29}$$
Alternately updating $\theta$ and $W$ by the steps defined in Fig. 3, we can compute a multi-label encoded linear embedding $W$. Embedding (29) into Step 1 of Fig. 2, we extend the proposed sparse multi-label transfer learning framework with ML-SLDA.

V. Experiments

To evaluate the performance of the proposed S-MLTL framework, we conduct experiments on two benchmark datasets, TRECVID 2005 [34] and the MIRFLICKR-25000 image collection [35], with two practical tasks: video concept detection and automatic image annotation, respectively. Experimental results from ridge regression, the lasso, the elastic net, and ML-SLDA are compared in this section.

A. Experimental Settings

1) Dataset Details: TRECVID 2005 consists of about 170 h of TV news videos from 13 different programs in English, Arabic, and Chinese. We use the development set in our experiments, since it provides annotations of the semantic concepts defined in LSCOM [36], which can be taken as ground truth. To avoid multilingual complications, we use the news videos broadcast in English. We sequentially select about 3000 shots starting from video index "141" in ascending order; the first 1000 shots are taken as the related dataset for transfer learning, and the remaining shots constitute the target dataset. To evaluate the performance of the S-MLTL framework, we need concepts that appear frequently in the dataset. We therefore also adopt the concept annotations of Columbia 374, released by Columbia University, New York, NY [37], and select 35 concepts for our detection task, of which 17 belong to the 39 concepts annotated in TRECVID 2005 and the other 18 are annotated in Columbia 374. Fig. 4 illustrates these concepts and their distribution in the dataset. As introduced above, this dataset has the multi-label property. The ground truth for the presence of each concept is binary (1 denotes presence, 0 absence).
Fig. 4. Thirty-five video concepts and their distribution in the experimental dataset.
The MIRFLICKR-25000 image collection [38] consists of 25 000 images downloaded from Flickr.com, with an average of 8.94 tags per image. There are 1386 tags that occur in at least 20 images of the collection, which shows the evident multi-label property. We sequentially choose the first 3000 indexed images as the dataset for this experiment and take 33 tags annotated in the MIRFLICKR-25000 collection as the ground truth; with respect to these 33 tags, the average number of tags per image is 3.97. The first 1000 images are taken as the related dataset for transfer learning, and the remaining images constitute the target dataset.

2) Feature Extraction:

a) Visual features of key frames: One key frame is extracted from each shot as its representative image, and visual features are then extracted from the key frames. We use three types of image features [37]: edge direction histogram (EDH, 73 dimensions), Gabor (GBR, 48 dimensions), and grid color moment (GCM, 225 dimensions). We concatenate and normalize the three kinds of features into one 346-dimensional visual feature vector; each visual feature vector is then centered to form the data matrix X.

b) Visual features of Flickr images: We use a bag-of-words model for feature detection and representation of the Flickr images. We use the first 500 images in our related dataset to learn the codeword dictionary, and the feature extraction involves three sequential steps (a sketch of this pipeline follows the list).
1) For each of the first 500 images in the related dataset, we compute SIFT [39] descriptors, represented as 128-dimensional vectors; in total there are about 440 000 SIFT vectors.
2) Taking these 440 000 SIFT vectors as input to the fast k-means algorithm [40], we cluster them into 1000 clusters and take the cluster centroids as the codeword dictionary.
3) For each image in our dataset, we conduct vector quantization against the codeword dictionary; the resulting vectors are normalized to form the final 1000-dimensional visual features.
Finally, we center each visual feature vector to form the data matrix X.
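A sketch of this bag-of-words pipeline follows. `cv2.SIFT_create` (OpenCV >= 4.4) and `MiniBatchKMeans` stand in for the SIFT implementation and the fast k-means of [40], and `related_paths`/`all_paths` are hypothetical lists of image file paths.

```python
# A sketch of the SIFT bag-of-words feature extraction described above.
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def sift_descriptors(paths):
    sift = cv2.SIFT_create()
    for p in paths:
        gray = cv2.imread(p, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(gray, None)
        if desc is not None:
            yield desc                      # (num_keypoints x 128) per image

# Steps 1-2: learn the 1000-word codeword dictionary from related images.
codebook = MiniBatchKMeans(n_clusters=1000, random_state=0)
codebook.fit(np.vstack(list(sift_descriptors(related_paths))))

# Step 3: vector-quantize each image into a normalized 1000-d histogram.
def bow_feature(desc):
    hist = np.bincount(codebook.predict(desc), minlength=1000).astype(float)
    return hist / max(hist.sum(), 1.0)

X = np.vstack([bow_feature(d) for d in sift_descriptors(all_paths)])
X -= X.mean(axis=0)                          # center to form the data matrix X
```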
B. Evaluation Metrics

As defined by Step 3 in Fig. 2, our S-MLTL algorithm outputs a ranking list over the test set for each class label, i.e., each semantic concept or tag. The performance of both video concept detection and automatic image annotation can therefore be measured by average precision (AP) and mean average precision (MAP). AP is defined as
$$\mathrm{AP} = \frac{1}{R} \sum_{k=1}^{S} \frac{R_k}{k} \cdot I_k$$
where $S$ is the size of the test set, $R$ is the number of relevant items returned, $R_k$ is the number of relevant items in the top-$k$ returns, and $I_k = 1$ if the item ranked at the $k$th position is relevant and 0 otherwise. We average the AP over all semantic concepts or tags to obtain the MAP, which is the overall evaluation result. Moreover, to further evaluate the automatic image annotation task, we also compute the annotation precision and recall of the final output, i.e., Output defined in Fig. 2. The annotation precision at the top $k$ positions, P@k, is defined as
$$P@k = \frac{1}{n_t \cdot k} \sum_{i=1}^{n_t} \sum_{j=1}^{k} I_{ij}$$
where $n_t$ denotes the number of instances in the test set, and $I_{ij} = 1$ if the tag estimated by S-MLTL at position $j$ for the $i$th test image is correct with respect to the ground truth and 0 otherwise. Since the average number of tags per image is 3.97 in our dataset, we choose $k = 1, 2, 3, 4$ and calculate the corresponding P@k. The annotation recall is defined as
$$\mathrm{Recall} = \frac{1}{n_t} \sum_{i=1}^{n_t} \frac{1}{s_i} \sum_{j=1}^{s_i} I_{ij}$$
where $s_i$ denotes the number of ground-truth tags of the $i$th test image, and $I_{ij}$ is defined as above.

C. Baseline Algorithm

We take the manifold ranking algorithm proposed in [11] as our baseline in this experiment. The affinity graph is constructed on the target dataset $X$; let $A$ denote the affinity matrix, with $A_{ij}$ measuring the similarity between the $i$th and $j$th sample vectors. The matrices $X$ and $A$ are defined in (16) and (19), respectively. Thus, ignoring Steps 1 and 2 in Fig. 2 yields the baseline algorithm. For brevity, in the rest of this paper we let BL denote the baseline algorithm, L2-HSML the algorithm in Fig. 2 with $\lambda_1 = 0$ in (17), L1-HSML the algorithm in Fig. 2 with $\lambda_2 = 0$ in (17), EN the original S-MLTL algorithm presented in Fig. 2, and SLDA the multi-label SLDA (ML-SLDA) algorithm.

TABLE I
MAP Comparison for the Video Concept Detection Task from L1-HSML, L2-HSML, and BL for Different Sizes of Training Set

Size of Training Set | MAP from L1-HSML | MAP from L2-HSML | MAP from BL
100 | 0.11805 | 0.11290 | 0.11405
200 | 0.10630 | 0.10010 | 0.10358
300 | 0.09875 | 0.09153 | 0.09506
400 | 0.08959 | 0.08262 | 0.08637
500 | 0.08949 | 0.08251 | 0.08609
600 | 0.08293 | 0.07671 | 0.07946
700 | 0.08239 | 0.07627 | 0.07907
800 | 0.08207 | 0.07609 | 0.07895

From the table, we can see that L1-HSML successfully improves the average precision of video concept detection. The best results are shown in boldface.
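The metric definitions of Section V-B can be sanity-checked with a few lines of code; `rel`, `I`, and `s` below are hypothetical arrays, not data from the paper.

```python
# A small sanity-check implementation of AP, P@k, and Recall.
import numpy as np

def average_precision(rel):
    """rel: 0/1 relevance of the ranked test items, in rank order."""
    rel = np.asarray(rel, dtype=float)
    R = rel.sum()                            # number of relevant items returned
    hits = np.cumsum(rel)                    # R_k: relevant items in the top k
    ranks = np.arange(1, len(rel) + 1)
    return float((hits / ranks * rel).sum() / R) if R > 0 else 0.0

def precision_at_k(I, k):
    """I: n_t x k_max matrix, I[i, j] = 1 if the jth tag of image i is correct."""
    return float(I[:, :k].sum()) / (I.shape[0] * k)

def recall(I, s):
    """s[i]: number of ground-truth tags of the ith test image."""
    return float(np.mean([I[i, :s[i]].sum() / s[i] for i in range(len(s))]))

# MAP is the mean of average_precision over all concepts or tags.
```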
D. Experimental Results

In this section, we first compare the multi-label transfer learning results of L1-HSML with those of L2-HSML and BL; the results show that a sparse representation learned from the related dataset helps produce better performance. We then explore the results of EN, mainly on the image annotation task. Finally, we present the results of SLDA and compare the overall evaluation results on the video concept detection and automatic image annotation tasks from BL, L2-HSML, L1-HSML, EN, and SLDA.

1) Comparative Results from the Lasso, Ridge Regression, and Baseline: First, we run the S-MLTL framework with $\lambda_2 = 0$ (L1-HSML) and with $\lambda_1 = 0$ (L2-HSML), as well as the baseline algorithm, on the video concept detection task. We hold out the last 1000 video shots in our dataset as test data, and the size of the training set varies gradually from 100 to 800. For the lasso and ridge regression, the parameters $\lambda_i$ ($i = 1, 2$) are set to 1e-3, 1e-2, 1e-1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0, and 1.25, respectively. Since some concepts do not appear in such small designated training sets, of the 35 concepts we only report results for 20 concepts in the video concept detection task; these are listed in Table II in the Appendix. We report the MAP averaged over the 20 concepts from L1-HSML, L2-HSML, and BL. As listed in Table I, the MAPs for the different training set sizes from L1-HSML and L2-HSML are averaged over the different values of $\lambda_i$. From the table we can see that: 1) for each training set size, the best results are obtained by L1-HSML, and 2) even though the training set is small, transfer learning by L1-HSML still obtains higher MAPs. The MAPs from L1-HSML and L2-HSML for different values of $\lambda_i$ ($i = 1, 2$), averaged over the different training set sizes, are plotted in Fig. 5. Furthermore, we also plot in Fig. 6 the sparseness of the multi-label encoded linear embedding matrix W obtained by the lasso in L1-HSML, where sparseness is defined as the ratio of the number of nonzero elements to the total number of elements of W.

Fig. 5. MAPs of the video concept detection task from L1-HSML, L2-HSML, and BL for different values of the tunable parameters.

Fig. 6. Sparseness of W (defined in Fig. 2) for the video concept detection task from L1-HSML.

From Figs. 5 and 6, we can see that: 1) the overall performance of L1-HSML is better than that of L2-HSML and BL, especially for large values of $\lambda_i$, and 2) the larger the value of $\lambda_1$, the sparser the embedding matrix W and, correspondingly, the better the concept detection performance. With a large $l_1$-penalty on the coefficients of X in (14), the lasso selects parsimonious but discriminative features with multi-label information encoded by our L1-HSML approach, so L1-HSML obtains better concept detection performance.

Similarly, we also run the S-MLTL algorithm with $\lambda_2 = 0$ and with $\lambda_1 = 0$, as well as the baseline algorithm, on the image annotation task. We hold out the last 1000 images in our dataset as test data, and the size of the training set varies gradually from 100 to 1000. For the lasso and ridge regression, the parameters $\lambda_i$ ($i = 1, 2$) are set to 1, 1.05, 1.1, 1.15, 1.2, 1.25, 1.3, 1.35, 1.4, 1.45, and 1.5, respectively. As shown in Figs. 7 and 8, for the different settings of training data size and regression penalty weight, L1-HSML always performs best. Furthermore, as shown in Fig. 9, when imposing a stronger $l_1$-penalty, i.e., $\lambda_1 = 1.5$, the lasso obtains sparse coefficients; correspondingly, as shown in Fig. 8, the overall annotation performance is better for large $\lambda_1$.

Fig. 7. MAP comparison of the image annotation task from L1-HSML, L2-HSML, and BL for different sizes of training set.

Fig. 8. MAPs of the image annotation task from L1-HSML, L2-HSML, and BL for different weights of the regression penalty.

Fig. 9. Sparseness of W (defined in Fig. 2) for the image annotation task from L1-HSML.

2) Image Annotation Results from S-MLTL with the Elastic Net Penalty: In this experiment, we conduct image annotation by solving (17). The parameter $\lambda_2$ is set to 1e-6, 1e-4, 1e-2, 1e-1, 1, and 10, respectively. As before, we hold out the last 1000 images in our dataset as test data, and the size of the training set varies gradually from 100 to 1000. The plots of MAP vs. training size, MAP vs. $\lambda_2$, and the sparseness of W are shown in Figs. 10-12. As can be seen from Figs. 10 and 11, the EN approach performs better than L1-HSML. As shown in Fig. 12, a sparse W is obtained when $\lambda_2$ is small. Furthermore, comparing Fig. 12 with Fig. 9, we can see that, to reach a sparseness comparable to that of the lasso, we have to impose a very small $l_2$-penalty in (17).

a) The p > n case: As described in the Feature Extraction section, the dimensionality of the visual features of Flickr images is 1000, i.e., p = 1000. In this experiment the training size varies from 100 to 1000, so whenever n < 1000 we are in the p > n case. As shown in Fig. 10, when the training size is less than 500, the EN approach outperforms L1-HSML.
Fig. 10. MAP comparison of the image annotation task from EN, L1-HSML, L2-HSML, and BL for different sizes of training set.
Fig. 11. MAPs of image annotation task from EN and BL for different values of λ2 .
Fig. 12. Sparseness of W defined in Fig. 2 of image annotation task from EN for different values of λ2 .
The elastic net can potentially include all variables in the fitted model in the p > n case [22] and may therefore outperform the lasso in such a case. Furthermore, comparing Fig. 11 with Fig. 8, we can see that, in the p > n case, the EN approach is slightly better than L1-HSML. To further explore the transfer learning performance, we compute the annotation precision and recall of the image annotation task under different values of $\lambda_2$. As shown in Fig. 13, P@1 is low for all values of $\lambda_2$, and the best P@4 is obtained at $\lambda_2$ = 1e-6, for which the matrix W has the highest sparsity. For recall, the best result is also obtained at $\lambda_2$ = 1e-6 (or 1e-4), as shown in Fig. 14. Therefore, we conclude that the elastic net performs better in the S-MLTL framework while enjoying a similar sparsity of representation.
Fig. 13. P@k of image annotation task from EN, k = 1, 2, 3, 4.
Fig. 14. Recall of image annotation task from EN for different values of λ2 .
3) Results from Multi-Label Transfer Learning with SLDA: In this section, we explore the results of SLDA on the video concept detection and image annotation tasks, respectively. As introduced in Section IV, taking $\Omega = I$, the multi-labeled SLDA solves the optimization problem (29) by an alternating optimization process based on the LARSEN algorithm. Based on the EN experiment, we here set $\lambda_2$ = 1e-6. We first compute the MAP from SLDA for the video concept detection and image annotation tasks, respectively. As shown in Figs. 15 and 16, although lower than L1-HSML and EN, multi-label transfer learning with SLDA does improve the average precision over the baseline, and even over the L2-HSML algorithm. To further explore the transfer learning performance of SLDA, we compute the annotation precision and recall of the image annotation task. As shown in Fig. 17, SLDA performs slightly better than EN at P@1. For recall, as shown in Fig. 18, SLDA is better than BL and L2-HSML and comparable with L1-HSML.

4) Overall Comparison of BL, L2-HSML, L1-HSML, EN, and SLDA: Since our algorithms have many parameters, we summarize the best APs under the different parameter settings for each approach and present an overall comparative analysis in this section. The details of the best APs for each concept and tag are listed in Tables II and III, respectively, in the Appendix.
Fig. 15. MAPs of video concept detection task from L1-HSML, L2-HSML, BL, and SLDA for different values of λ2 .
Fig. 16. MAPs of image annotation task from EN, BL, and SLDA for different weights of elastic net penalty.
Fig. 18. Recall of image annotation task from L1-HSML, L2-HSML, BL, and SLDA.
Fig. 19. Overall MAPs from BL, L1-HSML, L2-HSML, EN, and ML-SLDA for video concept detection task.
Fig. 20. Overall MAPs from BL, L1-HSML, L2-HSML, EN, and ML-SLDA for the image annotation task.

Fig. 17. P@k of the image annotation task from BL, L2-HSML, L1-HSML, SLDA, and EN, k = 1, 2, 3, 4.
We average the best APs over the 20 concepts and 33 tags, respectively, to obtain the MAP of each approach. From Figs. 19 and 20, we draw the following conclusions.
1) First, since the multi-label correlations are exploited and transferred from the related dataset, the proposed S-MLTL approaches perform better than the baseline algorithm.
2) Second, the sparse-representation-based approaches, i.e., L1-HSML and EN, show clear advantages over the non-sparse approaches. By using the $l_1$-norm or elastic-net penalty, the extension of HSML proposed in this paper successfully produces a parsimonious (hence interpretable) and more discriminative linear embedding.
3) Third, embedding the multi-label extended SLDA into our proposed S-MLTL framework is promising: although not better than L1-HSML and EN in performance, ML-SLDA significantly outperforms the baseline approach and obtains performance comparable to L2-HSML within our S-MLTL framework.

Moreover, two open problems are left for future exploration. First, comparing L1-HSML with EN in Tables II and III, we can see that the EN approach performs relatively better on the video concept detection task than on the image annotation task. As presented in the Feature Extraction section, since we obtain the 346-dimensional visual feature vectors of the key frames
by concatenating three kinds of feature vectors, there may be strong correlations between variables. In contrast, we form the codeword dictionary for the Flickr images by a clustering process, so that each visual word tends to be discriminative and the correlations between the variables of the 1000-dimensional vectors are low. The experimental results agree with the analysis of Zou and Hastie [22], namely that the prediction performance of the elastic net is better than that of the lasso when there are high correlations between variables. Second, it is reasonable to extend SLDA to the multi-labeled SLDA by removing the constraint Y1 = 1 and performing optimal scoring on the multi-label indicator matrix Y; however, a multi-label extension of the original linear discriminant analysis may be even more helpful.
VI. Conclusion

In this paper, we proposed a sparse multi-label transfer learning framework. Our goal is to circumvent the barrier of multiple tags in practical applications such as video concept detection and automatic image annotation. To exploit the correlation information inherent in multiple labels, the proposed S-MLTL framework first learns a multi-label encoded sparse linear embedding space from a related dataset, and then maps the target data into the new representation space. To learn the sparse representation, we cast the hypergraph spectral multi-label learning algorithm as a penalized least squares optimization problem with an elastic-net penalty. Furthermore, by performing optimal scoring on a multi-label indicator matrix, we extended sparse discriminant analysis to the multi-label SLDA, which can be embedded into our S-MLTL framework to learn the sparse representation. Experiments on the video concept detection and image annotation tasks showed the better performance of our framework. By imposing a very small l2-penalty in the elastic net, our framework gained comparable or better performance than the lasso. Furthermore, although ML-SLDA performs better than the baseline algorithm and obtains performance comparable to the ridge regression approach, further improvement could be obtained by developing a multi-label extension of the original linear discriminant analysis, which is left for future work.
Appendix

TABLE II
Performance Comparison of BL, L1-HSML, L2-HSML, EN, and ML-SLDA for Each of the 20 Concepts in the Video Concept Detection Task

Concept | Best AP from BL | Best AP from L1 | Best AP from L2 | Best AP from EN | Best AP from SLDA
Person | 0.1952 | 0.25661 | 0.18607 | 0.30842 | 0.20757
Face | 0.67241 | 0.69956 | 0.67239 | 0.71259 | 0.68286
Standing | 0.00147 | 0.00174 | 0.0017 | 0.00427 | 0.00137
Individual | 0.07633 | 0.09147 | 0.0784 | 0.02245 | 0.04013
Talking | 0.01123 | 0.01224 | 0.00905 | 0.01417 | 0.00538
Crowd | 0.01885 | 0.0196 | 0.01867 | 0.02303 | 0.00890
Government-Leader | 0.00851 | 0.01109 | 0.00834 | 0.01971 | 0.00865
Ties | 0.00404 | 0.0054 | 0.0054 | 0.00424 | 0.00439
Buildings | 0.08147 | 0.08971 | 0.08112 | 0.13526 | 0.08267
Entertainment | 0.00251 | 0.00405 | 0.00352 | 0.02500 | 0.00203
Corporate-Leader | 0.00918 | 0.01071 | 0.00854 | 0.00781 | 0.00933
Vehicle | 0.12495 | 0.14013 | 0.12164 | 0.11338 | 0.11630
Urban | 0.31296 | 0.33518 | 0.3265 | 0.33088 | 0.33618
Sunny | 0.01211 | 0.01882 | 0.01474 | 0.01494 | 0.01300
Flags | 0.03324 | 0.04209 | 0.0348 | 0.04489 | 0.05265
Meeting | 0.00394 | 0.00571 | 0.00519 | 0.00473 | 0.00474
Road | 0.01077 | 0.02203 | 0.01266 | 0.00239 | 0.00335
Computers | 0.01794 | 0.01776 | 0.01639 | 0.00949 | 0.00987
Office | 0.01559 | 0.01767 | 0.01636 | 0.01987 | 0.01137
Congressman | 0.0765 | 0.10787 | 0.0795 | 0.09331 | 0.12504
MAP | 0.08446 | 0.09547 | 0.08505 | 0.09554 | 0.08629

The best results for each concept are shown in boldface.

TABLE III
Performance Comparison of BL, L1-HSML, L2-HSML, EN, and ML-SLDA for Each of the 33 Tags in the Automatic Image Annotation Task

Tag | Best AP from BL | Best AP from L1 | Best AP from L2 | Best AP from EN | Best AP from SLDA
Animals | 0.12811 | 0.18247 | 0.15218 | 0.17029 | 0.13361
Baby | 0.00681 | 0.03064 | 0.03886 | 0.01536 | 0.00645
Bird | 0.03075 | 0.07646 | 0.02744 | 0.09292 | 0.02667
Bird_r1 | 0.03242 | 0.0702 | 0.02211 | 0.04048 | 0.02148
Car | 0.02819 | 0.04739 | 0.0331 | 0.05031 | 0.05969
Car_r1 | 0.01056 | 0.03082 | 0.016 | 0.02334 | 0.01157
Clouds | 0.13502 | 0.16618 | 0.15794 | 0.17927 | 0.12888
Clouds_r1 | 0.02739 | 0.07814 | 0.03218 | 0.05536 | 0.02667
Dog | 0.03087 | 0.07922 | 0.0386 | 0.05094 | 0.03755
Dog_r1 | 0.0255 | 0.07957 | 0.03158 | 0.04798 | 0.03145
Female | 0.24611 | 0.30026 | 0.26244 | 0.29823 | 0.24395
Flower | 0.09445 | 0.12163 | 0.09779 | 0.12418 | 0.10598
Flower_r1 | 0.05041 | 0.07876 | 0.05262 | 0.11463 | 0.05086
Food | 0.0285 | 0.03886 | 0.03468 | 0.05786 | 0.03920
Indoor | 0.3315 | 0.40033 | 0.35247 | 0.35239 | 0.31276
Lake | 0.01944 | 0.01949 | 0.01479 | 0.02165 | 0.01450
Male | 0.23471 | 0.28328 | 0.25614 | 0.27471 | 0.23028
Night | 0.09462 | 0.12637 | 0.11347 | 0.10385 | 0.14717
Night_r1 | 0.03119 | 0.05543 | 0.03666 | 0.06096 | 0.06274
People | 0.4073 | 0.48195 | 0.43711 | 0.47868 | 0.40152
Plant_life | 0.29768 | 0.39043 | 0.33923 | 0.38831 | 0.31707
Portrait | 0.13402 | 0.20605 | 0.15694 | 0.19532 | 0.13142
River | 0.01956 | 0.0312 | 0.04502 | 0.07747 | 0.02193
River_r1 | 0.00244 | 0.00872 | 0.00283 | 0.00613 | 0.00244
Sea | 0.0256 | 0.0396 | 0.02665 | 0.03715 | 0.02889
Sea_r1 | 0.00753 | 0.02128 | 0.01123 | 0.00910 | 0.01407
Sky | 0.29295 | 0.31826 | 0.31269 | 0.32709 | 0.30113
Structure | 0.34531 | 0.37171 | 0.34278 | 0.37840 | 0.32537
Sunset | 0.09921 | 0.09114 | 0.10283 | 0.10192 | 0.10577
Transport | 0.07125 | 0.09395 | 0.08052 | 0.11805 | 0.08859
Tree | 0.11429 | 0.16972 | 0.14066 | 0.17118 | 0.12212
Tree_r1 | 0.01994 | 0.04254 | 0.02482 | 0.06762 | 0.02394
Water | 0.14641 | 0.12642 | 0.1249 | 0.13357 | 0.12856
MAP | 0.10818 | 0.14117 | 0.11877 | 0.14014 | 0.11225

The best results for each tag are shown in boldface.
References

[1] F. Wu, Y. Liu, and Y. Zhuang, "Tensor-based transductive learning for multi-modality video semantic concept detection," IEEE Trans. Multimedia, vol. 11, no. 5, pp. 868–878, Aug. 2009.
[2] Y. Liu, T. Mei, X. Wu, and X. Hua, "Multigraph-based query-independent learning for video search," IEEE Trans. Circuits Syst. Video Technol., vol. 19, no. 12, pp. 1841–1850, Dec. 2009.
[3] Y.-H. Yang, W. H. Hsu, and H. H. Chen, "Online reranking via ordinal informative concepts for context fusion in concept detection and video search," IEEE Trans. Circuits Syst. Video Technol., vol. 19, no. 12, pp. 1880–1890, Dec. 2009.
[4] C. Wang, S. Yan, L. Zhang, and H. J. Zhang, "Multi-label sparse coding for automatic image annotation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2009, pp. 1643–1650.
[5] G. Qi, X. S. Hua, Y. Rui, J. Tang, T. Mei, and H. J. Zhang, "Correlative multi-label video annotation," in Proc. 15th Annu. ACM Int. Conf. Multimedia, 2007, pp. 17–26.
[6] B. Sigurbjörnsson and R. V. Zwol, "Flickr tag recommendation based on collective knowledge," in Proc. 17th Int. Conf. World Wide Web, 2008, pp. 327–336.
[7] Y. Liu, R. Jin, and L. Yang, "Semi-supervised multi-label learning by constrained non-negative matrix factorization," in Proc. 21st Natl. Conf. Artif. Intell. 18th Innovative Applicat. Artif. Intell. Conf., vol. 21, 2006, pp. 421–426.
[8] M. Zhang and Z. Zhou, "M3MIML: A maximum margin method for multi-instance multi-label learning," in Proc. 8th IEEE Int. Conf. Data Mining, 2008, pp. 688–697.
[9] Z. Zha, T. Mei, J. Wang, Z. Wang, and X. Hua, "Graph-based semi-supervised learning with multiple labels," J. Vis. Commun. Image Representat., vol. 20, no. 2, pp. 97–103, 2009.
[10] A. Argyriou, T. Evgeniou, and M. Pontil, "Multi-task feature learning," in Proc. 19th Annu. Conf. Neural Informat. Process. Syst., Dec. 2006, pp. 41–48.
[11] D. Zhou, J. Weston, A. Gretton, O. Bousquet, and B. Schölkopf, "Ranking on data manifolds," Adv. Neural Informat. Process. Syst., vol. 16, pp. 169–176, 2004.
[12] W. Dai, Y. Chen, G. Xue, Q. Yang, and Y. Yu, "Translated learning," in Proc. 21st Annu. Conf. Neural Informat. Process. Syst., 2008, pp. 353–360.
[13] R. K. Ando and T. Zhang, "A framework for learning predictive structure from multiple tasks and unlabeled data," J. Mach. Learn. Res., vol. 6, pp. 1817–1853, Nov. 2005.
[14] Y. Zhang and Z. H. Zhou, "Multi-label dimensionality reduction via dependence maximization," in Proc. 23rd AAAI Conf. Artif. Intell., 2008, pp. 1503–1505.
[15] S. Ji and J. Ye, "Linear dimensionality reduction for multi-label classification," in Proc. 21st IJCAI, 2009, pp. 1077–1082.
[16] S. Ji, L. Tang, S. Yu, and J. Ye, "Extracting shared subspace for multi-label classification," in Proc. 14th ACM SIGKDD Int. Conf. KDD, 2008, pp. 381–389.
[17] L. Sun, S. Ji, and J. Ye, "Hypergraph spectral learning for multi-label classification," in Proc. 14th ACM SIGKDD Int. Conf. KDD, 2008, pp. 668–676.
[18] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng, "Self-taught learning: Transfer learning from unlabeled data," in Proc. 24th Int. Conf. Mach. Learn., 2007, pp. 759–766.
[19] A. Quattoni, M. Collins, and T. Darrell, "Transfer learning for image classification with sparse prototype representations," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2008, pp. 1–8.
[20] R. Tibshirani, "Regression shrinkage and selection via the lasso," J. R. Statist. Soc. Ser. B, vol. 58, no. 1, pp. 267–288, 1996.
[21] L. Sun, S. Ji, and J. Ye, "A least squares formulation for a class of generalized eigenvalue problems in machine learning," in Proc. 26th ICML, 2009, pp. 977–984.
[22] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," J. R. Statist. Soc. Ser. B, vol. 67, no. 2, pp. 301–320, 2005.
[23] L. Clemmensen, T. Hastie, and B. Ersboll. Sparse Discriminant Analysis [Online].
Available: http://www-stat.stanford.edu/∼hastie/Papers/
[24] D. Zhou, J. Huang, and B. Schölkopf, "Learning with hypergraphs: Clustering, classification, and embedding," Adv. Neural Informat. Process. Syst., vol. 19, pp. 169–176, 2007.
[25] M. Belkin and P. Niyogi, "Laplacian eigenmaps for dimensionality reduction and data representation," Neural Computat., vol. 15, no. 6, pp. 1373–1396, 2003.
[26] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 2, pp. 210–227, Feb. 2009.
[27] P. Praks, R. Kucera, and E. Izquierdo, "The sparse image representation for automated image retrieval," in Proc. 15th IEEE Int. Conf. Image Process., 2008, pp. 25–28.
[28] J. Yang, H. Tang, Y. Ma, and T. Huang, "Face hallucination via sparse coding," in Proc. 15th Int. Conf. Image Process., 2008, pp. 1264–1267.
[29] A. M. Martinez and A. C. Kak, "PCA vs. LDA," IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 2, pp. 228–233, Feb. 2001.
[30] T. Hastie, A. Buja, and R. Tibshirani. Penalized Discriminant Analysis [Online]. Available: http://www-stat.stanford.edu/∼hastie/Papers/
[31] T. Hastie, R. Tibshirani, and A. Buja. Flexible Discriminant Analysis by Optimal Scoring [Online]. Available: http://www-stat.stanford.edu/∼hastie/Papers/
[32] P. H. Schonemann, "A generalized solution of the orthogonal procrustes problem," Psychometrika, vol. 31, no. 1, pp. 1–10, 1966.
[33] G. H. Golub and C. F. Van Loan, Matrix Computations, 3rd ed. Baltimore, MD: Johns Hopkins Univ. Press, 1996.
[34] TRECVID: TREC Video Retrieval Evaluation [Online]. Available: http://www.nlpir.nist.gov/projects/trecvid
[35] M. J. Huiskes and M. S. Lew, "The MIR Flickr retrieval evaluation," in Proc. ACM Int. Conf. MIR, 2008, pp. 39–43.
[36] M. Naphade, J. R. Smith, J. Tesic, S. F. Chang, W. Hsu, L. Kennedy, A. Hauptmann, and J. Curtis, "Large-scale concept ontology for multimedia," IEEE Multimedia Mag., vol. 13, no. 3, pp. 86–91, Jul.–Sep. 2006.
[37] A. Yanagawa, S. F. Chang, L. Kennedy, and W. Hsu, "Columbia University's baseline detectors for 374 LSCOM semantic visual concepts," Electr. Eng. Dept., Columbia Univ., New York, Tech. Rep. 222-2006-8, Mar. 2007.
[38] ImageCLEF: Image Retrieval in CLEF [Online]. Available: http://imageclef.org/2009
[39] D. G. Lowe, "Object recognition from local scale-invariant features," in Proc. Int. Conf. Comput. Vis., vol. 2, 1999, pp. 1150–1157.
[40] C. Elkan, "Using the triangle inequality to accelerate k-means," in Proc. 20th Int. Conf. Mach. Learn., 2003, pp. 111–117.
Yahong Han received the B.S. degree from Zhengzhou University, Zhengzhou, Henan, China, in 2000, and the M.S. degree from Hohai University, Nanjing, Jiangsu, China, in 2003. He is currently pursuing the Ph.D. degree from the College of Computer Science, Zhejiang University, Hangzhou, China. His current research interests include multimedia analysis, retrieval, and machine learning.
Fei Wu received the B.S. degree from Lanzhou University, Lanzhou, Gansu, China, the M.S. degree from Macao University, Taipa, Macau, and the Ph.D. degree from Zhejiang University, Hangzhou, China. He is currently an Associate Professor with the College of Computer Science, Zhejiang University. His current research interests include multimedia analysis, retrieval, statistic learning, and pattern recognition.
Yueting Zhuang (M’92) received the B.S., M.S., and Ph.D. degrees from Zhejiang University, Hangzhou, China, in 1986, 1989, and 1998, respectively. Currently, he is a Professor and Ph.D. Supervisor with the College of Computer Science, Zhejiang University. His current research interests include multimedia databases, artificial intelligence, and video-based animation.
Xiaofei He (M’05) received the B.S. degree in computer science from Zhejiang University, Hangzhou, China, in 2000, and the Ph.D. degree in computer science from the University of Chicago, Chicago, IL, in 2005. He is currently a Professor and Ph.D. Supervisor with the College of Computer Science, Zhejiang University. His current research interests include machine learning, information retrieval, computer vision, and multimedia.