Computer Vision and Image Understanding 113 (2009) 384–396
Semi-supervised kernel density estimation for video annotation ☆

Meng Wang a,*, Xian-Sheng Hua a, Tao Mei a, Richang Hong b, Guojun Qi b, Yan Song b, Li-Rong Dai b

a Microsoft Research Asia, Zhichun Road, Beijing 100080, PR China
b University of Science and Technology of China, Huanshan Road, Hefei 230027, PR China

☆ An early version of this paper has been published in the proceedings of ACM Multimedia 2006.
* Corresponding author.
Article info
Article history: Received 30 September 2007; Accepted 18 August 2008; Available online 29 August 2008
Keywords: Video annotation; Semi-supervised learning; Kernel density estimation
Abstract

Insufficiency of labeled training data is a major obstacle to automatic video annotation. Semi-supervised learning is an effective approach to this problem, as it leverages a large amount of unlabeled data. However, existing semi-supervised learning algorithms have not demonstrated promising results in large-scale video annotation due to several difficulties, such as the large variation of video content and intractable computational cost. In this paper, we propose a novel semi-supervised learning algorithm named semi-supervised kernel density estimation (SSKDE), which is developed from the kernel density estimation (KDE) approach. While only labeled data are utilized in classical KDE, in SSKDE both labeled and unlabeled data are leveraged to estimate class conditional probability densities based on an extended form of KDE. It is a non-parametric method, and it thus naturally avoids the model assumption problem that exists in many parametric semi-supervised methods. Meanwhile, it can be implemented with an efficient iterative solution process, which makes it appropriate for video annotation. Furthermore, motivated by existing adaptive KDE approaches, we propose an improved algorithm named semi-supervised adaptive kernel density estimation (SSAKDE). It employs local adaptive kernels rather than a fixed kernel, such that broader kernels can be applied in regions with low density and more accurate density estimates can be obtained. Extensive experiments have demonstrated the effectiveness of the proposed methods.

© 2008 Elsevier Inc. All rights reserved.
1. Introduction

With rapid advances in storage devices, networks, and compression techniques, large-scale video data are becoming available to more and more average users, and managing and accessing these data has become a challenging task. To deal with this issue, a common theme has been to develop techniques that derive metadata from videos to describe their content at syntactic and semantic levels. With the help of such metadata, manipulations of video data such as summarization, indexing, and retrieval can be easily accomplished. Video annotation is an elementary step toward obtaining these metadata.

Ideally, video annotation is formulated as a classification task that can be accomplished by learning-based methods. However, due to the large gap between low-level features and the semantic concepts to be annotated, learning-based methods typically need a large labeled training set to guarantee annotation accuracy. As human annotation is labor-intensive and time-consuming (experiments show that annotating 1 h of video with 100 concepts can take anywhere between 8 and 15 h [22]), several methods that can help reduce human effort have been proposed.

One approach to dealing with the training data insufficiency problem is to apply semi-supervised learning (SSL) algorithms, which leverage a large amount of unlabeled data to boost classification accuracy [9,37,44]. Although many different SSL algorithms have been applied to multimedia annotation and several encouraging results have been reported, SSL methods are still not popular in this field, in particular when a large dataset is involved, such as in the TRECVID benchmark [3]. We attribute this mainly to the following two factors:

(1) Large variation of video content. Many SSL methods are effective only when the assumed models are accurate [10]. Consider SSL with parametric models, a large family of SSL algorithms: although this approach can be employed with different generative models, such as GMMs and multiple multinomials [7,24], it has not been successfully applied to image or video annotation, since it is difficult to accurately model video semantic concepts.

(2) Large computational cost. Many SSL algorithms introduce much larger computational costs than supervised methods, and thus they can hardly be applied when dealing with a
large dataset. For example, the computational cost of transductive SVM scales as O(n^3), where n is the size of the dataset, including both labeled and unlabeled samples [42]. This cost is infeasible when n is large.

In this paper we propose a novel SSL method named semi-supervised kernel density estimation (SSKDE) to address these two difficulties. The method is developed from a non-parametric density estimation approach, kernel density estimation (KDE), so it avoids the model assumption problem. In classical KDE, class conditional probability densities are estimated from labeled samples only, whereas in SSKDE both labeled and unlabeled samples are utilized through an extended form of KDE. Based on the extended KDE, densities and posterior probabilities are related bi-directionally (note that posterior probabilities can be derived from densities by Bayes rule), and SSKDE is formulated on this bi-directional relationship. It can be solved by an efficient iterative process that alternately updates densities and posterior probabilities. We also show that SSKDE is closely related to graph-based SSL methods; based on SSKDE, we can give more natural interpretations of several studies on graph-based methods.

Based on SSKDE, we further propose an improved method named semi-supervised adaptive kernel density estimation (SSAKDE). In SSAKDE, the kernels over the observed samples are adapted such that broader kernels are adopted in regions with low density. In this way, more accurate density estimates can be obtained. Experiments demonstrate that this method further improves the performance of SSKDE and that it is superior to many existing supervised and semi-supervised methods.

The main contributions of this paper are as follows:

(1) We develop the SSKDE method based on a non-parametric approach. It incorporates unlabeled data into KDE such that better performance can be obtained.
(2) We investigate the connection between SSKDE and graph-based SSL methods, and show that SSKDE helps better understand graph-based methods.
(3) We further propose SSAKDE based on SSKDE. It achieves better performance by adopting adaptive kernels.

The organization of the rest of this paper is as follows. In Section 2, we briefly review related work. In Section 3, the SSKDE algorithm is formulated. We provide a discussion of this algorithm in Section 4, including its solution and its relationship with other existing methods. We then introduce SSAKDE in Section 5. Experiments are presented in Section 6, the computational costs of the proposed methods are discussed in Section 7, and concluding remarks follow in Section 8. Additionally, we provide an analysis of the effect of unlabeled samples in SSKDE and SSAKDE in the Appendix.

2. Related work

Video annotation is also named "high-level feature extraction" or "semantic concept detection", and is a task in the TRECVID benchmark [3]. It is regarded as a promising approach to bridging the semantic gap such that higher-level manipulation can be facilitated. As noted by Hauptmann [18], this splits the semantic gap between low-level features and user information needs into two hopefully smaller gaps: (a) mapping the low-level features to intermediate semantic concepts and (b) mapping these concepts to user needs. Annotation is exactly the step
to accomplish the first mapping. When only visual information is considered, it is also closely related to work on image annotation, such as [15,16]. Naphade and Smith [23] give a survey of the TRECVID high-level feature extraction benchmark, where a great number of different algorithms applied to this task can be found.

In recent years, the availability of large data collections with only limited human annotation has turned the attention of a growing community of researchers to the problem of SSL. By leveraging unlabeled data under certain assumptions, SSL methods are expected to build more accurate models than purely supervised learning methods can achieve. Many different SSL algorithms have been proposed. Often-applied ones include self-training [26], co-training [6], transductive SVM [42], SSL with parametric models [7,24], and graph-based SSL methods [5,43,46]. Extensive reviews of the existing methods can be found in [9,44].

Although many different SSL algorithms are available, only a few of them have been applied to image/video content analysis. In [39], Wu et al. proposed a method named Discriminant-EM, which makes use of unlabeled data to construct a generative model, but they also pointed out that its performance is compromised if the components of the data distribution are mixed up. In [33], Tian et al. conducted a study of SSL in image retrieval and illustrated that SSL is not always helpful in this field due to inappropriate assumptions. In [28], Song et al. applied co-training to video annotation based on a careful split of visual features. In [40], Yan et al. pointed out the drawbacks of co-training in video annotation and proposed an improved co-training-style algorithm named semi-supervised cross-feature learning.

Recently, graph-based SSL methods have attracted great interest in this community. In [19], He et al. adopted a graph-based SSL method named manifold-ranking in image retrieval, and Yuan et al. then applied the same algorithm to video annotation [41]. Tang et al. proposed a graph-based SSL method named kernel linear neighborhood propagation and demonstrated its effectiveness in video annotation [30]. Wang et al. developed a multi-graph learning method such that several difficulties in video annotation can be attacked in a unified scheme [36]. More recent work in this field focuses on incorporating the local structures around samples into the design of graphs. In [31], Tang et al. integrated the difference of the densities around two samples into the estimation of their similarity, and this method has been shown to be better than estimating similarities based solely on distances in feature space. In [38], Wang et al. proposed a neighborhood similarity based on the pairwise Kullback–Leibler divergence of the local distributions around samples.

However, much more work on image or video annotation employs only supervised methods; especially in the TRECVID benchmark, no satisfactory results with SSL methods have been reported. It seems that this field has not taken sufficient advantage of SSL algorithms. As aforementioned, we attribute this to the invalid prior models and the large computational costs of the existing SSL methods. In this work, we develop the SSKDE and SSAKDE algorithms based on a non-parametric approach. These two methods avoid the model assumption problem and are computationally efficient, so they are appropriate for video annotation. We show that SSKDE is closely related to graph-based SSL methods, which can explain why graph-based methods are relatively more popular among the SSL approaches applied to video annotation. We will demonstrate the effectiveness of SSKDE and SSAKDE, and show that the proposed methods are computationally efficient and applicable to large-scale annotation.
3. Semi-supervised kernel density estimation

In this section, we detail the formulation of SSKDE. Firstly, we introduce notations and the problem definition. Then we provide an extended form of KDE and derive SSKDE based on it.

3.1. Notations and the problem

We consider a normal K-class classification problem. There are l labeled samples L = \{x_1, x_2, \ldots, x_l\} and u unlabeled samples U = \{x_{l+1}, \ldots, x_{l+u}\}, with x \in \mathbb{R}^d. Let y_i denote the label of x_i (x_i \in L), with y_i \in \{1, 2, \ldots, K\}. Let n = l + u be the total number of samples. Denote by L_k the set of samples with label k and by l_k its size; thus L = \cup_k L_k and l = \sum_k l_k. Assume that the i.i.d. samples x_i (x_i \in L \cup U) are drawn from an unknown (global) probability density function p(x), and denote by p(x \mid C_k) the class conditional probability density function of class k. For concision, we abbreviate "class conditional probability density function" to density in the rest of our paper. The task is then to assign labels to the samples x_i with x_i \in U. For clarity, we list all the notations used throughout this paper in Table 1.

Table 1
Symbols and corresponding descriptions

Symbol        Description
d             Dimension of the feature space
l             Number of labeled samples
u             Number of unlabeled samples
n             n = l + u, number of all samples
K             Number of classes
L             L = \{x_1, x_2, \ldots, x_l\}
U             U = \{x_{l+1}, x_{l+2}, \ldots, x_n\}
N_i           Neighborhood around x_i, used in SSAKDE
N             Neighborhood size, used in SSAKDE
L_k           Set of samples labeled as class k
l_k           Size of L_k
\kappa(x)     Kernel function
p(x)          Global probability density function
p(x|C_k)      Class conditional probability density function of class k
P(C_k|x)      Posterior probability of class k given x
P(C_k)        Prior probability of class k
\delta        Indicator function, \delta[true] = 1, \delta[false] = 0
W             n x n matrix; W_ij indicates the similarity between x_i and x_j
D             n x n diagonal matrix, D_ii = \sum_j W_ij
F             n x K matrix; F_ik is the estimated value of P(C_k|x_i)
P             n x n matrix, see Eq. (10)
F_i           F_i = [F_i1, F_i2, \ldots, F_iK]
\sigma        Bandwidth of the Gaussian kernel, see Eq. (32)
\mu           See Eq. (23) and Eq. (24)
t_i           See Eq. (9)
T'            See Eq. (15)
T''           See Eq. (16)

It is well known that density estimation methods can be categorized into parametric approaches and non-parametric approaches. Among the non-parametric methods, the most popular one is KDE (or Parzen density estimation) [25], by which the class conditional densities in the above problem can be estimated as

\hat{p}(x \mid C_k) = \frac{1}{l_k} \sum_{x_j \in L_k} \kappa(x - x_j)    (1)

where \hat{p}(x \mid C_k) is the estimated density of class k, and \kappa(x) is a kernel function that satisfies \kappa(x) > 0 and \int \kappa(x)\,dx = 1. The most widely applied one is the Gaussian kernel, i.e.,

\kappa_g(x) = \frac{1}{(2\pi)^{d/2}\sigma^d} \exp\left(-\|x\|^2 / 2\sigma^2\right)    (2)

Besides the Gaussian kernel, in this work we will also apply the Exponential kernel, i.e.,

\kappa_e(x) = \frac{1}{(2\sigma)^d} \exp\left(-\|x\| / \sigma\right)    (3)

3.2. Extended kernel density estimation

As traditional KDE is based on only labeled data, the accuracy of the class conditional densities estimated by KDE heavily relies on the number of labeled samples. As shown in Fig. 1, densities estimated from limited labeled samples are inaccurate, which may induce a shifted classification boundary. On the other hand, unlabeled samples are usually much more plentiful than labeled ones. If the labels of the unlabeled samples were also known, the estimated densities would be much more accurate. This directly motivates us to incorporate unlabeled data into KDE.

Fig. 1. (a) True densities and extracted samples; (b) KDE on labeled data with Gaussian kernel (large symbols are labeled samples; the classification boundary shifts); (c) estimated densities (solid line) under the assumption that the labels of the unlabeled samples are known. The densities estimated in (b) are not accurate due to sample insufficiency, whereas the problem is alleviated in (c).

To extend KDE to unlabeled samples, we first assume that the class posterior probabilities of all samples are known (how to compute them is detailed in the next subsection). For concision, in the following discussion we abbreviate "class posterior probability" to posterior probability. Denote by P(C_k \mid x_i) the posterior probability of class k given x_i. We weight the kernels in KDE by the corresponding posterior probabilities as follows:

\hat{p}(x \mid C_k) = \frac{\sum_{j=1}^{n} P(C_k \mid x_j)\,\kappa(x - x_j)}{\sum_{j=1}^{n} P(C_k \mid x_j)}    (4)

We can see that Eq. (1) and Eq. (4) are the same if we let U = \emptyset and P(C_k \mid x_i) = \delta(y_i = k), where i \in L and \delta is the indicator function (i.e., \delta[\mathrm{true}] = 1, \delta[\mathrm{false}] = 0).
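To make the relationship between Eq. (1) and Eq. (4) concrete, the following Python sketch implements both estimators (a minimal illustration, not the authors' code; the Gaussian kernel of Eq. (2) is used, and the sample matrices, bandwidth, and posterior vector are assumed inputs):

    import numpy as np

    def gaussian_kernel(diff, sigma):
        # Eq. (2): isotropic Gaussian kernel in d dimensions.
        d = diff.shape[-1]
        norm = (2.0 * np.pi) ** (d / 2.0) * sigma ** d
        return np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * sigma ** 2)) / norm

    def kde(x, labeled_k, sigma):
        # Eq. (1): classical KDE over the l_k samples labeled as class k.
        return gaussian_kernel(x[None, :] - labeled_k, sigma).mean()

    def extended_kde(x, samples, posteriors_k, sigma):
        # Eq. (4): every sample contributes, weighted by its posterior
        # probability of class k; 0/1 posteriors on labeled data alone
        # reduce this to Eq. (1).
        k = gaussian_kernel(x[None, :] - samples, sigma)
        return np.dot(posteriors_k, k) / posteriors_k.sum()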
Note that an attractive property of KDE is its consistency, i.e., its convergence to the target function as n \to \infty. Here we show that the L1 convergence of the extended KDE can be proven as well. We re-write the kernel function \kappa(x) as h(x/\sigma), where \sigma is the kernel bandwidth (i.e., a smoothing factor). Define

J_n = \int_{x \in \mathbb{R}^d} \left| \frac{\sum_{j=1}^{n} P(C_k \mid x_j)\,\kappa(x - x_j)}{\sum_{j=1}^{n} P(C_k \mid x_j)} - p(x \mid C_k) \right| dx    (5)

Then we have the following result.

Theorem 1. If \sigma \to 0 and n\sigma^d \to \infty, then J_n converges to 0 almost surely as n \to \infty.

The proof of Theorem 1 can be found in the Appendix. From Theorem 1 we can see that Eq. (4) is a natural extension of classical KDE. However, how to compute the posterior probabilities P(C_k \mid x_i) remains a problem. The extended KDE indicates that densities can be estimated from posterior probabilities; meanwhile, it is well known that posterior probabilities can be computed from densities according to Bayes rule. So, densities and posterior probabilities are related bi-directionally, as illustrated in Fig. 2. In the next subsection, we formulate SSKDE based on this bi-directional relationship.

Fig. 2. Relationship between densities and posterior probabilities (the extended KDE maps posterior probabilities to densities; Bayes rule maps densities back to posterior probabilities).

3.3. Formulation of SSKDE

Denote by P(C_k) the prior probability of class k. With the densities estimated by Eq. (4), the posterior probabilities can be re-computed by Bayes rule as follows:

\hat{P}(C_k \mid x_j) = \frac{P(C_k)\,\hat{p}(x_j \mid C_k)}{\sum_{k=1}^{K} P(C_k)\,\hat{p}(x_j \mid C_k)}    (6)

Meanwhile, by Bayes rule the prior probabilities can be approximated based on the strong law of large numbers as

P(C_k) = \int P(C_k \mid x)\,p(x)\,dx \approx \frac{1}{n} \sum_{i=1}^{n} P(C_k \mid x_i)    (7)

Plugging Eq. (4) and Eq. (7) into Eq. (6), we obtain

\hat{P}(C_k \mid x_j) = \frac{P(C_k)\,\frac{\sum_{i=1}^{n} P(C_k \mid x_i)\,\kappa(x_j - x_i)}{\sum_{i=1}^{n} P(C_k \mid x_i)}}{\sum_{k=1}^{K} P(C_k)\,\frac{\sum_{i=1}^{n} P(C_k \mid x_i)\,\kappa(x_j - x_i)}{\sum_{i=1}^{n} P(C_k \mid x_i)}} = \frac{\sum_{i=1}^{n} P(C_k \mid x_i)\,\kappa(x_j - x_i)}{\sum_{i=1}^{n} \kappa(x_j - x_i)}    (8)

To compute \hat{P}(C_k \mid x_i), we assume that they are close to the true posteriors P(C_k \mid x_i). Thus it is rational for us to let P(C_k \mid x_i) = \hat{P}(C_k \mid x_i) for i \in U, and P(C_k \mid x_i) = (1 - t_i)\,\hat{P}(C_k \mid x_i) + t_i\,\delta(y_i = k) for i \in L with 0 < t_i \le 1 (we use the weights t_i to integrate the labeling information of the labeled samples). For clarity, in the following discussion we let F_{jk} denote the estimated posterior probabilities (i.e., replacing \hat{P}(C_k \mid x_j) with F_{jk}) and P(C_k \mid x_j) denote the truths. Thus we have

\frac{\sum_{i=1}^{n} F_{ik}\,\kappa(x_j - x_i)}{\sum_{i=1}^{n} \kappa(x_j - x_i)} = F_{jk}, \quad x_j \in U

(1 - t_j)\,\frac{\sum_{i=1}^{n} F_{ik}\,\kappa(x_j - x_i)}{\sum_{i=1}^{n} \kappa(x_j - x_i)} + t_j\,\delta(y_j = k) = F_{jk}, \quad x_j \in L    (9)

Equation set (9) is a linear equation set with respect to F_{ik}. For clarity of its solution, we re-write the equation set in matrix form. Let

P_{ij} = \frac{\kappa(x_i - x_j)}{\sum_{j=1}^{n} \kappa(x_i - x_j)}, \quad 1 \le i, j \le n    (10)

We split the matrix P into 4 blocks after the lth row and column,

P = \begin{bmatrix} P_{LL} & P_{LU} \\ P_{UL} & P_{UU} \end{bmatrix}    (11)

and split the posterior probability matrix F into 2 blocks after the lth row,

F = \begin{bmatrix} F_L \\ F_U \end{bmatrix}    (12)

Therefore, Eq. set (9) can be written as

(P_{UU} - I)\,F_U + P_{UL}\,F_L = 0
(I - T)(P_{LL}\,F_L + P_{LU}\,F_U) + TY - F_L = 0    (13)

where T = \mathrm{Diag}(t_1, t_2, \ldots, t_l) and Y is the l \times K label matrix with Y_{jk} = \delta(y_j = k). Consequently, after some algebraic operations, the solution of Eq. set (13) can be written as

F = (T' + I - P)^{-1}\,T''\,Y    (14)

where the matrices T' (n \times n) and T'' (n \times l) are defined by

T'_{ij} = t_i/(1 - t_i) if i = j and i \in L; 0 otherwise \quad (1 \le i, j \le n)    (15)

T''_{ij} = t_i/(1 - t_i) if i = j and i \in L; 0 otherwise \quad (1 \le i \le n,\; 1 \le j \le l)    (16)

4. Discussion

4.1. Solution

We can see that the closed-form solution in Eq. (14) involves the inversion of an n \times n matrix, which scales as O(n^3). This cost is intractable when n is large. But we can adopt an EM-style iterative method to avoid the expensive solution. The iterative process is illustrated in Fig. 3. It can be viewed as a label propagation process [45], in which the labels of samples are propagated to each other according to a similarity matrix. In all of our experiments we adopt this iterative process instead of the closed-form solution.

Fig. 3. Iterative solution process of SSKDE.
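The following Python sketch spells out this iterative process (a minimal reading of Fig. 3 and Eq. (17), assuming that P, the one-hot label matrix Y, and the labeled index set are precomputed; it is an illustration, not the authors' implementation):

    import numpy as np

    def sskde_iterate(P, Y, labeled, t=0.9, iters=40):
        # P: n x n row-normalized kernel matrix, Eq. (10).
        # Y: n x K matrix, one-hot labels on labeled rows, zero elsewhere.
        F = Y.copy()                  # F0: initialize with the labels
        for _ in range(iters):
            F = P @ F                 # propagate posteriors, Eq. (8)
            # clamp labeled rows toward their labels, per Eq. (17):
            F[labeled] = (1.0 - t) * F[labeled] + t * Y[labeled]
        return F                      # F[j, k] estimates P(C_k | x_j)

Each pass only re-applies the matrix A of Eq. (17), so the iterate contracts toward the fixed point of Eq. (14), matching the convergence argument below.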
Now we prove the convergence of this iterative process. Steps (2) and (3) of Fig. 3 can be merged as

F = \begin{bmatrix} I - T & 0 \\ 0 & I \end{bmatrix} P F + \begin{bmatrix} TY \\ 0 \end{bmatrix}    (17)

Let A = \begin{bmatrix} I - T & 0 \\ 0 & I \end{bmatrix} P; then we have

F = \lim_{m \to \infty} \left\{ A^m F_0 + \left( \sum_{i=1}^{m} A^{i-1} \right) \begin{bmatrix} TY \\ 0 \end{bmatrix} \right\}    (18)

where F_0 is the initial value of F. Since P is row-normalized and t_i > 0, we can derive that

\exists\, c < 1: \quad \sum_{i=1}^{n} A_{ij} \le c, \quad \forall j = 1, 2, \ldots, n    (19)

Therefore

\sum_{i=1}^{n} (A^m)_{ij} = \sum_{i=1}^{n} \sum_{k=1}^{n} A_{ik}\,(A^{m-1})_{kj} = \sum_{k=1}^{n} \left( \sum_{i=1}^{n} A_{ik} \right) (A^{m-1})_{kj} \le c \sum_{k=1}^{n} (A^{m-1})_{kj} \le c^m    (20)

Thus A^m converges to 0 as m \to \infty. On the other hand, it is not difficult to prove that (I - A) is invertible. Thus Eq. (18) becomes

F = (I - A)^{-1} \begin{bmatrix} TY \\ 0 \end{bmatrix}    (21)

After some algebraic operations, we can derive Eq. (14) from Eq. (21); i.e., the iterative process illustrated in Fig. 3 converges as the number of iterations grows, and the solution in Eq. (14) is its unique fixed point.

4.2. Connection to graph-based SSL

Graph-based methods form a large family of existing SSL methods [43,46]. They are conducted on a graph whose vertices are the labeled and unlabeled samples and whose edges reflect the similarities between sample pairs. An assumption of these methods is label smoothness, which requires the labeling function to simultaneously satisfy two conditions: (1) it should be close to the given truths on the labeled vertices, and (2) it should be smooth on the whole graph. These two conditions are often characterized in regularization frameworks. Many different algorithms in this manner have been proposed, and detailed reviews can be found in [9,44].

We consider two well-known graph-based methods, i.e., the Gaussian random fields (GRF) method and the learning with local and global consistency (LLGC) method. Denote by W an n \times n affinity matrix in which W_{ij} indicates the similarity between x_i and x_j, and by D a diagonal matrix whose (i,i)-element equals the sum of the ith row of W. The GRF method is formulated as

\mathrm{GRF}: \quad \arg\min_F \left\{ \sum_{i,j=1}^{n} W_{ij}\,\| F_i - F_j \|^2 \right\} \quad \text{s.t.} \quad F_i = Y_i, \; i = 1, 2, \ldots, l    (22)

where F_i and Y_i indicate the ith rows of F and Y, respectively. LLGC is formulated analogously in a regularization framework based on the normalized graph Laplacian, with a parameter \mu that adjusts the trade-off between the smoothness term and the fitting term. In fact, GRF can be derived from SSKDE by extending Eq. (22), and the parameters t_i in SSKDE play a role analogous to \mu.
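The correspondence is easy to see in code: the SSKDE matrix of Eq. (10) is exactly the random-walk normalization D^{-1}W of a graph affinity matrix, the object graph-based SSL methods operate on. A hedged sketch in Python (the kernel choice and dense construction are illustrative):

    import numpy as np

    def affinity_and_transition(X, kappa):
        # W_ij = kappa(x_i - x_j): the graph affinity of Eq. (22).
        diffs = X[:, None, :] - X[None, :, :]
        W = kappa(diffs)                      # n x n similarity matrix
        D = W.sum(axis=1)                     # diagonal of the degree matrix
        P = W / D[:, None]                    # Eq. (10): P = D^{-1} W
        return W, P

With kappa set to the Gaussian kernel of Eq. (2), P is the SSKDE transition matrix, while graph-based methods such as GRF and LLGC work with W and D directly.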
5. SSAKDE

In SSKDE, a fixed kernel bandwidth is applied over all samples. Motivated by existing adaptive KDE approaches [27], in SSAKDE the kernels over the observed samples are adapted such that broader kernels are adopted in the regions with low density, which yields more accurate density estimates. Denote by N_i the neighborhood around x_i, consisting of its N nearest samples. The adaptive bandwidths \sigma_i are set such that their rth powers average to \sigma^r while being proportional to the local spread of the data, i.e.,

\frac{1}{n} \sum_{i=1}^{n} \sigma_i^r = \sigma^r, \qquad \frac{\sigma_i^r}{\sigma_j^r} = \frac{\sum_{x_k \in N_i} \| x_k - x_i \|^r}{\sum_{x_k \in N_j} \| x_k - x_j \|^r}    (33)

From Eq. (33) we can derive that

\sigma_i = \left( \frac{n\,\sigma^r \sum_{x_k \in N_i} \| x_k - x_i \|^r}{\sum_{i=1}^{n} \sum_{x_k \in N_i} \| x_k - x_i \|^r} \right)^{1/r}    (34)

Specifically, the adaptive Gaussian kernel and the adaptive Exponential kernel can be computed as follows:

\kappa_g(x, x_i) = \frac{1}{(2\pi)^{d/2} \sigma_i^d} \exp\left(-\| x - x_i \|^2 / 2\sigma_i^2\right), \quad \sigma_i = \left( \frac{n\,\sigma^2 \sum_{x_k \in N_i} \| x_k - x_i \|^2}{\sum_{i=1}^{n} \sum_{x_k \in N_i} \| x_k - x_i \|^2} \right)^{1/2}    (35)

\kappa_e(x, x_i) = \frac{1}{(2\sigma_i)^d} \exp\left(-\| x - x_i \| / \sigma_i\right), \quad \sigma_i = \frac{n\,\sigma \sum_{x_k \in N_i} \| x_k - x_i \|}{\sum_{i=1}^{n} \sum_{x_k \in N_i} \| x_k - x_i \|}    (36)

Then, analogous to the derivation of SSKDE, we can develop SSAKDE. In fact, in comparison with SSKDE, we only have to replace Eq. (10) by

P_{ij} = \frac{\kappa(x_j, x_i)}{\sum_{j=1}^{n} \kappa(x_j, x_i)}    (37)

and the subsequent process of SSAKDE is the same as in SSKDE.
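A sketch of the bandwidth adaptation of Eq. (34) in Python (a brute-force neighbor search for clarity; whether N_i includes the sample itself is an assumption here):

    import numpy as np

    def adaptive_bandwidths(X, sigma, N=50, r=1):
        # Eq. (34): sigma_i^r is proportional to the summed r-th power
        # distances to the N nearest neighbors, rescaled so that the
        # sigma_i^r average to sigma^r. r=1 gives the Exponential case
        # of Eq. (36); r=2 gives the Gaussian case of Eq. (35).
        n = X.shape[0]
        dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
        np.fill_diagonal(dist, np.inf)        # exclude the sample itself
        nbrs = np.sort(dist, axis=1)[:, :N]   # N nearest-neighbor distances
        s = (nbrs ** r).sum(axis=1)           # sum over N_i of ||x_k - x_i||^r
        return (n * sigma ** r * s / s.sum()) ** (1.0 / r)

Samples in sparse regions receive larger sigma_i, i.e., broader kernels, which is precisely the behavior motivating SSAKDE.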
6. Experiments

To evaluate the performance of the proposed methods, we conduct experiments on three different classification tasks: (1) a toy problem, (2) handwritten digit and letter recognition, and (3) video annotation. In all experiments, the parameters t_i in SSKDE and SSAKDE are empirically set to 0.9^1 and the parameter \mu in LLGC is empirically set to 0.1.

6.1. Toy problem

We conduct experiments on the synthetic dataset illustrated in Fig. 4. There are 130 two-dimensional samples that are uniformly distributed within two circles. For each class, one training sample is labeled, as illustrated in Fig. 4.

Fig. 4. A synthetic binary classification task with two training samples. (a) Labels of all samples. (b) The two training samples.

We compare the classification performance of the following six methods: (1) SVM with RBF kernel; (2) k-NN (k = 1); (3) KDE; (4) LLGC; (5) SSKDE; and (6) SSAKDE. We use the Gaussian kernel in the KDE methods. The parameter \sigma in LLGC and the KDE methods is set to 0.1, and the parameter N in SSAKDE is set to 10.

The classification results are illustrated in Fig. 5. The three supervised learning methods, SVM, k-NN, and KDE, all lead to the same classification accuracy, 74.6%. LLGC and SSKDE achieve accuracies of 76.9% and 74.6%, respectively, which indicates that unlabeled data have not brought significant performance improvements to these two SSL methods: the classification boundaries they obtain are significantly biased. But the problem is successfully alleviated in SSAKDE. With merely two training samples, SSAKDE attains a high classification accuracy of 98.5%, i.e., only two samples are misclassified. Comparing the results obtained by SSKDE and SSAKDE, we can clearly see the effectiveness of the proposed adaptive kernel approach.
^1 As indicated by Eqs. (24) and (30), the parameters t_i can be derived from the parameter \mu in graph-based SSL, which adjusts the trade-off between the two terms in the regularization framework (see Eq. (24)). Existing studies have demonstrated that the performance of graph-based SSL is rather insensitive to the setting of \mu (compared with the parameter \sigma), and in most works this parameter is empirically set to a fixed value for simplicity [43,46]. Analogously, here we also empirically set t_i. In Section 6.2 we further conduct experiments on the performance sensitivities of SSKDE and SSAKDE with respect to these parameters.
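For reference, the synthetic data can be regenerated along the following lines (a hypothetical reconstruction in Python: the two disc centers, radii, and the 65/65 split are assumed, since only the total of 130 samples is stated):

    import numpy as np

    rng = np.random.default_rng(0)

    def sample_disc(center, radius, m):
        # Uniform points inside a disc, via rejection sampling.
        pts = []
        while len(pts) < m:
            p = rng.uniform(-radius, radius, size=2)
            if np.linalg.norm(p) <= radius:
                pts.append(center + p)
        return np.array(pts)

    X = np.vstack([sample_disc(np.array([0.0, 0.0]), 0.4, 65),
                   sample_disc(np.array([0.7, 0.7]), 0.4, 65)])
    y = np.repeat([0, 1], 65)
    labeled = np.array([0, 65])   # one labeled training sample per class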
Fig. 5. Performance comparison of the six methods on the synthetic dataset: (a) SVM (accuracy 74.6%); (b) k-NN (74.6%); (c) KDE (74.6%); (d) LLGC (76.9%); (e) SSKDE (74.6%); (f) SSAKDE (98.5%).
6.2. Handwritten digit and letter recognition

We conduct experiments on the following two datasets: the "handwritten digit recognition" dataset from CEDAR Buffalo [20,46] and the "letter image recognition" dataset from the UCI MLR [4]. From these two datasets we generate four classification tasks:

(1) 10-way classification of all digits;
(2) even and odd digits classification;
(3) 26-way classification of all letters;
(4) letters 'A' to 'M' versus 'N' to 'Z' classification.

Table 2 summarizes these four classification tasks.

Table 2
Information on the classification tasks

Task             Dimension   Size     Classes
Digit even/odd   256         11,000   2
Digit 10-way     256         11,000   10
Letter A-M/N-Z   16          20,000   2
Letter 26-way    16          20,000   26

For each classification dataset we set different labeled data sizes l and perform 10 trials for each l. In each trial we randomly select the labeled samples and use the rest of the samples as testing data; if any class contains no labeled sample, we redo the sampling (a sketch of this protocol follows the method list below). We compare the averaged results over the 10 trials of the following methods:

(1) SVM with an RBF kernel, implemented using LIBSVM [8], a free library for SVM;
(2) KDE;
(3) SSKDE;
(4) LLGC;
(5) SSAKDE.
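The trial protocol above amounts to rejection sampling of the labeled set (a minimal Python sketch; the RNG handling is illustrative):

    import numpy as np

    def sample_labeled_indices(y, l, K, rng):
        # Draw l labeled samples at random; redo the draw whenever some
        # class ends up with no labeled sample, as described above.
        while True:
            idx = rng.choice(len(y), size=l, replace=False)
            if len(np.unique(y[idx])) == K:
                return idx        # the remaining samples form the test set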
Since there is no reliable model selection approach when labeled samples are extremely few, the following parameters are tuned to their optimal values: the parameter \sigma in the last four methods, and the radius parameter \gamma of the RBF kernel and the trade-off C between training error and margin in the SVM model. The size of the neighborhood N in SSAKDE is empirically set to 50 (in fact this parameter can also be tuned, which would lead to better results for SSAKDE).

Fig. 6 illustrates the performance comparison of the five methods.

Fig. 6. Performance comparison of the different algorithms for digit and letter classification (panels: Digit 10-way, Digit even/odd, Letter 26-way, Letter A-M/N-Z; error rate versus number of labeled samples).

Firstly, we compare KDE, SSKDE, and SSAKDE (for concision, we refer to them as the KDE methods in the following discussion). We can see that SSKDE performs better than KDE in most cases, which indicates the positive contribution of unlabeled data in SSKDE. But we can also see that in even and odd digits classification, SSKDE performs worse than KDE when l is large; in the Appendix we provide a detailed analysis of the effect of unlabeled data that explains this phenomenon. From the experimental results we can also clearly see the superiority of SSAKDE over SSKDE, which confirms the effectiveness of the adaptive kernel approach.

Then, we compare LLGC and SSAKDE. From the experimental results we can see that in most cases SSAKDE outperforms LLGC. This is an interesting result. LLGC is usually believed to be superior to GRF because it can be viewed as based on the normalized graph Laplacian, whereas GRF is based on the unnormalized graph Laplacian [17]. We have shown that GRF and SSKDE can be viewed as equivalent to some extent (although they are derived from different viewpoints and there are several small differences between their formulations; see Section 4). Thus we regard both LLGC and SSAKDE as variants of the SSKDE method, the former from the viewpoint of spectral graph theory [17] and the latter from the perspective of KDE. SSAKDE can also be regarded as an improved graph-based SSL method. So, the superiority of SSAKDE over LLGC demonstrates that the KDE perspective can help better extend graph-based methods. In these four tasks, the performance gaps between the two methods are small in magnitude; in the next subsection we will demonstrate more significant improvements from LLGC to SSAKDE in video annotation.

Now we study the performance sensitivities of SSKDE and SSAKDE to the parameters t_i and \sigma. We take the classification task "Digit 10-way" as an example and set l to 100. Fig. 7 illustrates the performance curves of SSKDE and SSAKDE with respect to these parameters.

Fig. 7. Performance curves of SSKDE and SSAKDE with respect to the parameters \sigma and t_i (classification task: Digit 10-way; l = 100).

From the figure we can see that, consistent with existing studies [35,43,46], the setting of \sigma is critical to the performance of SSKDE and SSAKDE, whereas the performance is relatively less sensitive to the parameters t_i. These results confirm the rationality of the parameter settings in our experiments.

6.3. Video annotation
We conduct experiments on the TRECVID 2005 dataset [3], which consists of 273 news videos of about 160 h total duration. The dataset is split into a development set and a test set. The development videos are segmented into 49,532 shots and 61,901 subshots, and the test videos are segmented into
45,766 shots and 64,256 subshots. In the following discussion the development set and test set will be referred to as set 1 and set 2, respectively. A key-frame is selected for each subshot, and from the key-frame we extract 225-dimensional block-wise color moment features based on a 5-by-5 division of the image. We annotate the following 10 concepts: Walking_Running, Explosion_Fire, Maps, Flag-US, Building, Waterscape_Waterfront, Mountain, Prisoner, Sports, and Car. The descriptions of these concepts can be found on the TRECVID website [3], and several exemplary key-frames are illustrated in Fig. 8.

For each concept, annotation is considered as a binary classification problem, and thus for sample x_i we obtain F_{i0} and F_{i1} from the KDE methods, which are the posterior probabilities of relevance and irrelevance given x_i. In the previous two tasks (the toy problem and handwritten digit and letter recognition), classification results are directly obtained from the posterior probabilities. But in video annotation we frequently encounter imbalanced classes, i.e., positive samples are much fewer than negative samples for a given concept, so classification accuracy is not a preferred performance measure. To address this issue, NIST has defined non-interpolated average precision over a set of retrieved shots as a measure of retrieval effectiveness [2]. Here we combine F_{i0} and F_{i1} to generate the relevance score of x_i. Since negative samples are usually much more numerous than positive ones and are distributed over a very broad domain, positive samples are expected to contribute more in video concept learning [19]. Thus we compute the relevance scores f_i as
f_i = F_{i0} + \left( \frac{1}{frequency} - 1 \right) (1 - F_{i1})    (38)

where frequency is the percentage of positive samples in the labeled training set. In fact, this setting is equivalent to duplicating (1/frequency - 1) copies of each positive training sample, so that the positive samples are balanced with the negative ones.
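A small numeric illustration of Eq. (38) in Python, with an assumed positive frequency of 5% (the value is hypothetical):

    def relevance_score(F_i0, F_i1, frequency):
        # Eq. (38): the complement of the irrelevance posterior is
        # up-weighted so rare positives carry as much weight as negatives.
        return F_i0 + (1.0 / frequency - 1.0) * (1.0 - F_i1)

    # frequency = 0.05 effectively duplicates each positive 1/0.05 - 1 = 19 times:
    print(relevance_score(0.30, 0.85, 0.05))   # 0.30 + 19 * 0.15 = 3.15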
Fig. 8. Exemplary key-frames of the 10 concepts (Walking_Running, Explosion_Fire, Maps, Flag-US, Building, Waterscape_Waterfront, Mountain, Prisoner, Sports, and Car).
We will show experimentally that this setting successfully integrates the two probabilities and that it is better than using only F_{i0} or F_{i1}. We follow the TRECVID guideline to evaluate annotation performance [2]. The relevance scores on the test set are merged from subshot to shot by maximum aggregation when a shot contains multiple subshots, i.e.,

f(\mathrm{shot}_m) = \max_{\mathrm{subshot}_i \in \mathrm{shot}_m} f_{\mathrm{subshot}_i}    (39)
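In code, the shot-level merging of Eq. (39) is a single max-reduction over each shot's subshots (a Python sketch with assumed dictionary inputs):

    from collections import defaultdict

    def merge_to_shots(subshot_scores, subshot_to_shot):
        # Eq. (39): a shot's relevance is the maximum relevance score
        # among its subshots.
        shot_scores = defaultdict(lambda: float("-inf"))
        for sub, score in subshot_scores.items():
            shot = subshot_to_shot[sub]
            shot_scores[shot] = max(shot_scores[shot], score)
        return dict(shot_scores)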
Then the relevance scores are ranked and we evaluate the average precision of the first 2000 shots. The size of the neighborhood N is empirically set to 50, and the parameter \sigma in the KDE methods is tuned by 10-fold cross-validation. We make the matrix P sparse by keeping only the N largest values in each row in SSKDE and SSAKDE (a sketch of this sparsification follows the method list below). This is a frequently applied strategy that can significantly reduce computational cost while retaining comparable performance.

First, we test the following methods, regarding set 1 as training data and set 2 as testing data:

(1) SVM with RBF kernel. Here we split a hold-out dataset from the original development set and use it to tune the two parameters of the SVM model, i.e., the radius parameter \gamma of the RBF kernel and the trade-off C between training error and margin;
(2) LLGC;
(3) KDE with Gaussian kernel;
(4) KDE with Exponential kernel;
(5) SSKDE with Exponential kernel;
(6) SSKDE with Gaussian kernel;
(7) SSAKDE with Gaussian kernel;
(8) SSAKDE with Exponential kernel.
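The row sparsification mentioned above can be sketched as follows (keeping the N largest entries per row; the renormalization step is an assumption to keep P row-stochastic):

    import numpy as np

    def sparsify_rows(P, N=50):
        # Zero all but the N largest entries in each row; each propagation
        # step then touches O(n * N) entries instead of O(n^2).
        P = P.copy()
        for i in range(P.shape[0]):
            drop = np.argsort(P[i])[:-N]   # indices of all but the N largest
            P[i, drop] = 0.0
        return P / P.sum(axis=1, keepdims=True)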
The experimental results are illustrated in Table 3. Comparing the performance of the KDE methods with the two different kernels, it is clear that the Exponential kernel is superior to the Gaussian kernel. This is due to the fact that the L1 distance is generally more appropriate for many visual features, since it better approximates the perceptual difference between images [29]. Then we compare the KDE methods. We can see that SSKDE outperforms KDE for nearly all concepts; it performs slightly worse than KDE only for the concept Explosion_Fire. The superiority of SSKDE over KDE is evident in the MAP measure. Meanwhile, SSAKDE shows much better performance than SSKDE. These results confirm the effectiveness of our approaches, i.e., exploiting unlabeled data and adopting adaptive kernels. We can also see that SSAKDE significantly outperforms LLGC. SSAKDE with the Exponential kernel obtains the best results for most concepts among the 8 methods, and its superiority is evident in the MAP measure.

In Section 4, we discussed that SSAKDE can be viewed as an extension of traditional graph-based SSL to a certain extent, and thus it is instructive to compare it with state-of-the-art graph-based SSL methods. As introduced in Section 2, several improved graph-based methods also take the local structures around samples into account. Here we compare SSAKDE with three improved graph-based methods: (1) structure-sensitive manifold-ranking (SSMR) [31]; (2) GRF with neighborhood similarity (GRF + NS) [38]; and (3) LLGC with neighborhood similarity (LLGC + NS) [38]. These three methods all modify the traditional distance-based similarity in the design of graphs: the first incorporates density differences into similarity estimation, whereas the last two define a novel "neighborhood similarity" by taking into account the local distributions around samples. Table 4 illustrates the experimental results. Here we only use the Exponential kernel for SSAKDE, since it has been demonstrated to be more effective than the Gaussian kernel. The detailed implementation issues and parameter settings of the three improved graph-based SSL methods can be found in [31,38]. From the results we can see that SSAKDE achieves the best results for most concepts; compared with the three state-of-the-art graph-based SSL methods, its superiority is evident in the MAP measure.

We also investigate the effectiveness of Eq. (38) for the three KDE methods. We compare three ranking approaches: (1) rank shots by F_{i0}; (2) rank shots by F_{i1} (in fact, the shots are ranked according to -F_{i1}); and (3) rank shots by f_i, generated according to Eq. (38). We adopt the Exponential kernel, and the MAP results are illustrated in Table 5. From the results we can see that using F_{i1} individually generates very poor results, which confirms the previous analysis that positive samples contribute more
Table 3
Performance of the eight algorithms on the TRECVID 2005 benchmark

Concept                 SVM     LLGC    KDE(G)  KDE(E)  SSKDE(G)  SSKDE(E)  SSAKDE(G)  SSAKDE(E)
Walking_Running         0.167   0.172   0.109   0.117   0.140     0.152     0.159      0.166
Explosion_Fire          0.067   0.046   0.052   0.058   0.049     0.051     0.054      0.056
Maps                    0.376   0.451   0.338   0.353   0.448     0.462     0.475      0.491
Flag-US                 0.059   0.081   0.051   0.086   0.106     0.089     0.098      0.101
Building                0.413   0.399   0.325   0.317   0.405     0.403     0.404      0.446
Waterscape_Waterfront   0.330   0.361   0.299   0.301   0.357     0.369     0.373      0.383
Mountain                0.266   0.284   0.261   0.259   0.280     0.288     0.321      0.332
Prisoner                0.0003  0.0007  0.0002  0.0015  0.0013    0.0004    0.003      0.003
Sports                  0.324   0.332   0.242   0.281   0.298     0.330     0.384      0.385
Car                     0.262   0.270   0.238   0.240   0.264     0.268     0.278      0.289
MAP                     0.226   0.240   0.192   0.201   0.235     0.241     0.255      0.265

The best result for each concept is shown in boldface.

Table 4
Performance comparison of SSAKDE and improved graph-based SSL methods
Concept                 SSAKDE(E)  SSMR    GRF + NS  LLGC + NS
Walking_Running         0.166      0.152   0.168     0.169
Explosion_Fire          0.056      0.049   0.048     0.047
Maps                    0.491      0.474   0.491     0.479
Flag-US                 0.101      0.127   0.118     0.106
Building                0.446      0.440   0.432     0.436
Waterscape_Waterfront   0.383      0.337   0.367     0.358
Mountain                0.332      0.324   0.331     0.333
Prisoner                0.003      0.003   0.001     0.0008
Sports                  0.385      0.359   0.363     0.368
Car                     0.289      0.252   0.292     0.287
MAP                     0.265      0.252   0.261     0.258

The best result for each concept is shown in boldface.
Table 5
Comparison of MAP results obtained by three different ranking methods

Method       Using f_i   Using F_i0   Using -F_i1
KDE (E)      0.201       0.196        0.013
SSKDE (E)    0.241       0.233        0.113
SSAKDE (E)   0.265       0.260        0.129

The best result for each method is shown in boldface.
than negative ones. However, using f_i shows better performance than using F_{i0} alone, which demonstrates that integrating F_{i1} is still helpful.

Finally, we conduct experiments to study whether the effectiveness of the proposed methods depends on the size of the training set and the relative percentages of labeled and unlabeled data. We randomly split set 1 into two sets, a labeled training set and an unlabeled training set, and regard set 2 as out-of-sample data, as illustrated in Fig. 9.

Fig. 9. Data distribution for the inductive experiments (the TRECVID development set is split into labeled data and unlabeled training data; the TRECVID test set serves as unlabeled testing data).

The sizes of the unlabeled training set and the unlabeled testing set are denoted by u_1 and u_2, respectively; we have l + u_1 = 61,901 and u_2 = 61,614. We first run SSKDE and SSAKDE on the labeled samples and the unlabeled training samples, and then induce the labels of the unlabeled testing samples according to Eq. (31). We compare their performance with KDE (for which the unlabeled training samples are not used). We set different l and perform 10 trials for each l. Fig. 10 shows the MAP measures obtained by the KDE methods; for clarity, the relative improvement curves from KDE to SSKDE and from SSKDE to SSAKDE are plotted in the figure as well.

Fig. 10. (a) MAP measures obtained by the KDE methods with different l and (b) relative improvements from KDE to SSKDE and from SSKDE to SSAKDE.

From the figure we can see that the improvements are always positive, even with little labeled data or little unlabeled training data. When the unlabeled samples are fewer, the improvement percentages are smaller, but their signs are consistent. This confirms the robustness of the proposed methods in video annotation.

7. Computational cost

The computational costs of SSKDE and SSAKDE mainly consist of two parts: one is the construction of the matrix P, and the other is the iterative solution process illustrated in Fig. 3. We can easily derive that the cost of matrix construction scales as O(d n^2) and the cost of the iterative solution scales as O(n N M), where n is the number of samples, d is the dimension of the feature space, N is the neighborhood size, and M is the number of iterations of the solution process. For clarity, Table 6 lists these notations and their values in the video annotation experiments.

Table 6
The practical values of the notations

Notation   Description                  Value
n          Number of samples            126,157
d          Dimension of feature space   225
N          Neighborhood size            50
M          Number of iterations         40

Obviously the iterative solution process is much faster than the matrix construction process. In our experiments the matrix construction step takes about 30 h, whereas the iterative solution process finishes in less than 2 min per concept. But the matrix construction step is concept-independent, i.e., the matrix only has to be constructed once, and it can then be utilized for all concepts. Compared with traditional methods that need to train a model for each individual concept (such as SVM), SSKDE and SSAKDE have a great advantage in computational efficiency when dealing with multiple concepts. For instance, in our experiments we need more than 6 h to train an SVM model for one concept. Since this cost is proportional to the lexicon size, it would be prohibitive if we had to annotate a large lexicon of concepts, such as the large-scale concept ontology for multimedia (LSCOM) [1]. Contrarily, SSKDE and SSAKDE only need to repeat the iterative solution process for different concepts, so their computational costs do not increase dramatically. This property makes the two methods particularly appropriate for large-scale annotation, in terms of both dataset size and lexicon size. All time costs were recorded on a PC with a Pentium 4 3.0 GHz CPU and 1 GB memory.
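Plugging the values of Table 6 into these complexities gives a rough operation count (an order-of-magnitude check only; constant factors differ between the two stages):

d \cdot n^2 = 225 \times 126{,}157^2 \approx 3.6 \times 10^{12}, \qquad n \cdot N \cdot M = 126{,}157 \times 50 \times 40 \approx 2.5 \times 10^{8}

which is consistent with the matrix construction dominating the iterative solution by several orders of magnitude.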
8. Conclusion

In this paper, we have introduced a novel SSL method named SSKDE. It is formulated based on a classical non-parametric approach, i.e., KDE. While only labeled samples are used to estimate densities in KDE, SSKDE is able to leverage both labeled and unlabeled samples. The method naturally avoids the model assumption problem that may degrade performance in many other SSL methods.
Furthermore, we also propose an improved method named SSAKDE. It employs adaptive kernels rather than a globally fixed kernel: the bandwidths of the local kernels are adapted according to the sparseness of the nearby regions. We have analyzed the effect of unlabeled data in the proposed methods and their connections with other SSL methods. Experiments have demonstrated their effectiveness.

A major contribution of this work is the approach that incorporates unlabeled samples into KDE, i.e., the bi-directional relationship between densities and posterior probabilities. As noted in Section 4, KDE is a classical method that has already been extensively studied, and there are many improved variants [14,21,34]. So, we believe that many new SSL methods can be developed from these variants by incorporating unlabeled samples analogously to SSKDE; SSAKDE is just such an example. We will try to develop more methods in this way in the future.

Appendix A

A.1. Proof of Theorem 1

We first decompose J_n:

J_n = \int \left| \frac{\sum_{j=1}^{n} P(C_k \mid x_j)\,\kappa(x - x_j)}{\sum_{j=1}^{n} P(C_k \mid x_j)} - p(x \mid C_k) \right| dx \le \int \frac{\frac{1}{n}\sum_{j=1}^{n} \kappa(x - x_j)}{\frac{1}{n}\sum_{j=1}^{n} P(C_k \mid x_j)} \left| \frac{\sum_{j=1}^{n} P(C_k \mid x_j)\,\kappa(x - x_j)}{\sum_{j=1}^{n} \kappa(x - x_j)} - P(C_k \mid x) \right| dx + \int \left| P(C_k \mid x)\,\frac{\frac{1}{n}\sum_{j=1}^{n} \kappa(x - x_j)}{\frac{1}{n}\sum_{j=1}^{n} P(C_k \mid x_j)} - p(x \mid C_k) \right| dx    (40)

Let J_{1,n} and J_{2,n} denote the two terms on the right-hand side of the above inequality; we only have to prove that both J_{1,n} and J_{2,n} converge to 0 almost surely.

Based on the main theorem in [13] (i.e., the L1 convergence of kernel regression),

\int \left| \frac{\sum_{j=1}^{n} P(C_k \mid x_j)\,\kappa(x - x_j)}{\sum_{j=1}^{n} \kappa(x - x_j)} - P(C_k \mid x) \right| dx \xrightarrow{a.s.} 0    (41)

According to Bayes rule and the strong law of large numbers, we have

\frac{1}{n} \sum_{j=1}^{n} P(C_k \mid x_j) \xrightarrow{a.s.} \int P(C_k \mid x)\,p(x)\,dx = P(C_k) > 0    (42)

Furthermore, it is obvious that

\frac{1}{n} \sum_{j=1}^{n} \kappa(x - x_j) < \sup_x \kappa(x)    (43)

So, we can derive

\int \frac{\frac{1}{n}\sum_{j=1}^{n} \kappa(x - x_j)}{\frac{1}{n}\sum_{j=1}^{n} P(C_k \mid x_j)} \left| \frac{\sum_{j=1}^{n} P(C_k \mid x_j)\,\kappa(x - x_j)}{\sum_{j=1}^{n} \kappa(x - x_j)} - P(C_k \mid x) \right| dx \xrightarrow{a.s.} 0    (44)

i.e., J_{1,n} converges to 0 almost surely.

Now we prove the convergence of J_{2,n}. Based on Bayes rule, we have

J_{2,n} = \int \left| P(C_k \mid x)\,\frac{\frac{1}{n}\sum_{j=1}^{n} \kappa(x - x_j)}{\frac{1}{n}\sum_{j=1}^{n} P(C_k \mid x_j)} - \frac{P(C_k \mid x)\,p(x)}{P(C_k)} \right| dx \le \int \frac{P(C_k \mid x)}{\frac{1}{n}\sum_{j=1}^{n} P(C_k \mid x_j)} \left| \frac{1}{n}\sum_{j=1}^{n} \kappa(x - x_j) - p(x) \right| dx + \int \left| \frac{P(C_k \mid x)\,p(x)}{\frac{1}{n}\sum_{j=1}^{n} P(C_k \mid x_j)} - \frac{P(C_k \mid x)\,p(x)}{P(C_k)} \right| dx    (45)

Denote these two terms by J'_{2,n} and J''_{2,n}, respectively. According to Theorem 2 in [12] (i.e., the convergence of classical KDE), we have

\int \left| \frac{1}{n}\sum_{j=1}^{n} \kappa(x - x_j) - p(x) \right| dx \xrightarrow{a.s.} 0    (46)

Consequently, we can derive that

J'_{2,n} = \frac{1}{\frac{1}{n}\sum_{j=1}^{n} P(C_k \mid x_j)} \int P(C_k \mid x) \left| \frac{1}{n}\sum_{j=1}^{n} \kappa(x - x_j) - p(x) \right| dx \le \frac{1}{\frac{1}{n}\sum_{j=1}^{n} P(C_k \mid x_j)} \int \left| \frac{1}{n}\sum_{j=1}^{n} \kappa(x - x_j) - p(x) \right| dx \xrightarrow{a.s.} 0    (47)

On the other hand, based on Eq. (42), we can easily derive that

J''_{2,n} = \int \left| \frac{P(C_k \mid x)\,p(x)}{\frac{1}{n}\sum_{j=1}^{n} P(C_k \mid x_j)} - \frac{P(C_k \mid x)\,p(x)}{P(C_k)} \right| dx \xrightarrow{a.s.} 0    (48)

So, J_{2,n} converges to 0 almost surely as well, which completes the proof.

A.2. Analysis of unlabeled data

In this appendix, we provide a qualitative analysis of the effect of unlabeled data in SSKDE. Consider the L1 generalization error. Firstly, we cite a conclusion from the study of kernel regression [13]. Define

D_n = \int \left| \frac{\sum_{j=1}^{n} P(C_k \mid x_j)\,\kappa(x - x_j)}{\sum_{j=1}^{n} \kappa(x - x_j)} - P(C_k \mid x) \right| p(x)\,dx    (49)

Then we have the following theorem.

Theorem 2. If \sigma \to 0 and n\sigma^d \to \infty, then for every \varepsilon > 0 there exist constants c and n_0 such that for every n \ge n_0, P(D_n \ge \varepsilon) < e^{-cn}.

The proof of Theorem 2 can be found in [13]. Now we replace P(C_k \mid x_j) by the estimated posterior probabilities F_{jk}. The definition of the generalization error then becomes

D'_n = \int \left| \frac{\sum_{j=1}^{n} F_{jk}\,\kappa(x - x_j)}{\sum_{j=1}^{n} \kappa(x - x_j)} - P(C_k \mid x) \right| p(x)\,dx    (50)

Then we define

D'_l = \int \left| \frac{\sum_{j=1}^{l} F_{jk}\,\kappa(x - x_j)}{\sum_{j=1}^{l} \kappa(x - x_j)} - P(C_k \mid x) \right| p(x)\,dx    (51)

D'_l and D'_n can be regarded as the generalization errors of KDE and SSKDE, respectively. Suppose that the estimated posterior probabilities have biases \Delta_{jk}, i.e., F_{jk} = P(C_k \mid x_j) + \Delta_{jk}. Define \Delta_l = \max_{j \le l} |\Delta_{jk}| and \Delta_n = \max_{j \le n} |\Delta_{jk}|. Thus, we can obtain

D'_n \le \int \left| \frac{\sum_{j=1}^{n} P(C_k \mid x_j)\,\kappa(x - x_j)}{\sum_{j=1}^{n} \kappa(x - x_j)} - P(C_k \mid x) \right| p(x)\,dx + \int \frac{\sum_{j=1}^{n} |\Delta_{jk}|\,\kappa(x - x_j)}{\sum_{j=1}^{n} \kappa(x - x_j)}\,p(x)\,dx \le D_n + \Delta_n

Similarly, we have D'_l \le D_l + \Delta_l. Now we name D_l and D_n the supervised and semi-supervised generalization errors, respectively; analogously, \Delta_l and \Delta_n are named the supervised and semi-supervised bias errors. According to the definitions of \Delta_l and \Delta_n, we find that \Delta_l \le \Delta_n. This is consistent with intuition, since the biases of the estimated posterior probabilities of unlabeled samples are usually greater than those of labeled samples (the posterior probabilities of labeled samples can be directly obtained from their labels). Meanwhile, according to Theorem 2, it is reasonable for us to suppose D_n \le D_l. We thus find a twofold effect of the unlabeled samples:

(1) Decreased generalization error. By Theorem 2, the generalization error in kernel regression reduces as the number of training samples increases.
(2) Increased bias error. This is due to the fact that the posterior probabilities of unlabeled samples are usually not accurate enough.

Thus, if the decrease in generalization error is greater than the increase in bias error, SSKDE outperforms KDE; otherwise, KDE performs better, as in the results illustrated in Fig. 6(b).

This phenomenon can also be explained from another perspective. We revisit the iterative process of Fig. 3, which, as mentioned, can be viewed as a propagation process. When the posterior
probabilities of unlabeled samples have too large biases, the biases may propagate among the samples, so that the performance degenerates and SSKDE may become even worse than the traditional KDE method. Currently it is difficult to accurately predict whether (or by how much) SSKDE will perform better than KDE on a given dataset; establishing such a theoretical framework will be interesting future work.

References

[1] LSCOM lexicon definitions and annotations version 1.0, in: DTO Challenge Workshop on Large Scale Concept Ontology for Multimedia, Columbia University ADVENT Technical Report #217-2006-03.
[2] TREC-10 Proceedings Appendix on Common Evaluation Measures.
[3] TRECVID: TREC Video Retrieval Evaluation.
[4] UCI Repository of Machine Learning Databases.
[5] M. Belkin, L. Matveeva, P. Niyogi, Regularization and semi-supervised learning on large graphs, in: Proceedings of COLT, 2004.
[6] A. Blum, T. Mitchell, Combining labeled and unlabeled data with co-training, in: Proceedings of COLT, 1998.
[7] V. Castelli, T. Cover, The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter, IEEE Transactions on Information Theory 42 (1996).
[8] C.C. Chang, C.J. Lin, LIBSVM: a library for support vector machines, 2001. Available from: http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[9] O. Chapelle, A. Zien, B. Scholkopf, Semi-Supervised Learning, MIT Press, 2006.
[10] I. Cohen, F.G. Cozman, N. Sebe, M.C. Cirelo, T.S. Huang, Semi-supervised learning of classifiers: theory, algorithms and their application to human–computer interaction, IEEE Transactions on Pattern Analysis and Machine Intelligence (2004).
[11] O. Delalleau, Y. Bengio, N.L. Roux, Efficient non-parametric function induction in semi-supervised learning, in: Proceedings of the International Conference on Artificial Intelligence and Statistics, 2005.
[12] L. Devroye, The equivalence of weak, strong and complete convergence in L1 for kernel density estimates, The Annals of Statistics 11 (1983).
[13] L. Devroye, A. Krzyzak, An equivalence theorem for L1 convergence of the kernel regression estimate, Journal of Statistical Planning and Inference (1989).
[14] A. Elgammal, R. Duraiswami, L.S. Davis, Efficient kernel density estimation using the fast Gaussian transform for computer vision, IEEE Transactions on Pattern Analysis and Machine Intelligence (2003).
[15] S.L. Feng, R. Manmatha, V. Lavrenko, Multiple Bernoulli relevance models for image and video annotation, in: Proceedings of the International Conference on Computer Vision and Pattern Recognition, 2004.
[16] A. Ghoshal, P. Ircing, S. Khudanpur, Hidden Markov models for automatic annotation and content-based retrieval of images and video, in: Proceedings of the International ACM SIGIR Conference, 2005.
[17] F.C. Graham, Spectral Graph Theory, Regional Conference Series in Mathematics, vol. 92, American Mathematical Society, 1997.
[18] A.G. Hauptmann, Lessons for the future from a decade of Informedia video analysis research, in: Proceedings of the ACM International Conference on Image and Video Retrieval, 2005.
[19] J.R. He, M.J. Li, H.J. Zhang, H.H. Tong, C.S. Zhang, Manifold-ranking based image retrieval, in: Proceedings of ACM Multimedia, 2004.
[20] J.J. Hull, A database for handwritten text recognition research, IEEE Transactions on Pattern Analysis and Machine Intelligence (1994).
[21] A.J. Izenman, Recent developments in nonparametric density estimation, Journal of the American Statistical Association (1991).
[22] C. Lin, B. Tseng, J.R. Smith, VideoAnnEx: IBM MPEG-7 annotation tool for multimedia indexing and concept learning, in: Proceedings of the International Conference on Multimedia & Expo, 2003.
[23] M.R. Naphade, J.R. Smith, On the detection of semantic concepts at TRECVID, in: Proceedings of ACM Multimedia, 2004.
[24] K. Nigam, A.K. McCallum, S. Thrun, T. Mitchell, Text classification from labeled and unlabeled documents using EM, Machine Learning 39 (2000).
[25] E. Parzen, On the estimation of a probability density function and the mode, Annals of Mathematical Statistics 33 (1962).
[26] C. Rosenberg, M. Hebert, H. Schneiderman, Semi-supervised self-training of object detection models, in: Proceedings of the Workshop on Applications of Computer Vision, 2005.
[27] R.S. Sain, Adaptive Kernel Density Estimation, Ph.D. Thesis, Rice University, 1994.
[28] Y. Song, X.S. Hua, L.R. Dai, M. Wang, Semi-automatic video annotation based on active learning with multiple complementary predictors, in: Proceedings of the ACM SIGMM International Workshop on Multimedia Information Retrieval, 2005.
[29] M. Stricker, M. Orengo, Similarity of color images, in: Proceedings of Storage and Retrieval for Image and Video Databases (SPIE 2420), 2000.
[30] J. Tang, X.S. Hua, G. Qi, Y. Song, X. Wu, Kernel based linear neighborhood label propagation for semantic video annotation, in: Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2007.
[31] J. Tang, X.S. Hua, G.J. Qi, M. Wang, T. Mei, X. Wu, Structure-sensitive manifold ranking for video concept detection, in: Proceedings of ACM Multimedia, 2007.
[32] G.R. Terrell, D.W. Scott, Variable kernel density estimation, The Annals of Statistics 20 (1992).
[33] Q. Tian, J. Yu, Q. Xue, N. Sebe, A new analysis of the value of unlabeled data in semi-supervised learning in image retrieval, in: Proceedings of the International Conference on Multimedia & Expo, 2004.
[34] P. Vincent, Y. Bengio, Manifold Parzen windows, in: Proceedings of Advances in Neural Information Processing Systems, 2003.
[35] F. Wang, C. Zhang, Label propagation through linear neighborhoods, in: Proceedings of the International Conference on Machine Learning, 2006.
[36] M. Wang, X.S. Hua, X. Yuan, Y. Song, L.R. Dai, Optimizing multi-graph learning: towards a unified video annotation scheme, in: Proceedings of ACM Multimedia, 2007.
[37] M. Wang, X.S. Hua, X. Yuan, Y. Song, S. Li, H.J. Zhang, Automatic video annotation by semi-supervised learning with kernel density estimation, in: Proceedings of ACM Multimedia, 2006.
[38] M. Wang, T. Mei, X. Yuan, Y. Song, L.R. Dai, Video annotation by graph-based learning with neighborhood similarity, in: Proceedings of ACM Multimedia, 2007.
[39] Y. Wu, Q. Tian, T.S. Huang, Discriminant-EM algorithm with application to image retrieval, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2000.
[40] R. Yan, M.R. Naphade, Semi-supervised cross feature learning for semantic concept detection in videos, in: Proceedings of the International Conference on Computer Vision and Pattern Recognition, 2005.
[41] X. Yuan, X.S. Hua, M. Wang, X. Wu, Manifold-ranking based video concept detection on large database and feature pool, in: Proceedings of ACM Multimedia, 2006.
[42] T. Zhang, F.J. Oles, A probability analysis on the value of unlabeled data for classification problems, in: Proceedings of the International Conference on Machine Learning, 2000.
[43] D. Zhou, O. Bousquet, T.N. Lal, J. Weston, B. Schölkopf, Learning with local and global consistency, in: Proceedings of Advances in Neural Information Processing Systems, 2004.
[44] X. Zhu, Semi-supervised learning literature survey, Technical Report 1530, University of Wisconsin-Madison.
[45] X. Zhu, Z. Ghahramani, Learning from labeled and unlabeled data with label propagation, Technical Report CMU-CALD-02-106, Carnegie Mellon University.
[46] X. Zhu, Z. Ghahramani, J. Lafferty, Semi-supervised learning using Gaussian fields and harmonic functions, in: Proceedings of the International Conference on Machine Learning, 2003.