Scalable Multi-View Semi-Supervised Classification via Adaptive Regression
Hong Tao, Chenping Hou, Member, IEEE, Feiping Nie, Jubo Zhu, Dongyun Yi
Abstract—With the advent of multi-view data, multi-view learning has become an important research direction in machine learning and image processing. Considering the difficulty of obtaining labeled data in many machine learning applications, we focus on the multi-view semi-supervised classification problem. In this paper, we propose an algorithm named Multi-View Semi-Supervised Classification via Adaptive Regression (MVAR) to address this problem. Specifically, regression based loss functions with the ℓ2,1 matrix norm are adopted for each view, and the final objective function is formulated as the linear weighted combination of all the loss functions. An efficient algorithm with proved convergence is developed to solve the non-smooth ℓ2,1-norm minimization problem. Regressing to class labels directly makes the proposed algorithm efficient in calculation and applicable to large-scale datasets. The adaptively optimized weight coefficients balance the contributions of different views automatically, which makes the performance robust against the existence of low-quality views. With the learned projection matrices and bias vectors, predictions for out-of-sample data can be easily made. To validate the effectiveness of MVAR, comparisons are made with several benchmark methods on real-world datasets and in the scene classification scenario as well. The experimental results demonstrate the effectiveness of our proposed algorithm.

Index Terms—Multi-view, Semi-supervised learning, Classification, ℓ2,1-norm minimization.
H. Tao, C. Hou, J. Zhu and D. Yi are with the College of Science, National University of Defense Technology, Changsha, Hunan, 410073, China (E-mail: [email protected]; [email protected]; [email protected]; [email protected]). F. Nie is with the Center for Optical Imagery Analysis and Learning, Northwestern Polytechnical University, Xi'an 710072, China (E-mail: [email protected]). This work was supported by NSF China (No. 61473302 and No. 61503396). Chenping Hou and Feiping Nie are both corresponding authors.

I. INTRODUCTION

In many pattern recognition and image processing applications, such as video surveillance, social computing, and image retrieval, data are collected from diverse domains or obtained from various feature extractors [1]–[7]. This kind of data, that is, the same instance with different feature representations, is called multi-view data and each representation is referred to as one view [2]. For example, images, which can be characterized by color, shape and texture features, are the most common multi-view data. In general, each view summarizes a specific characteristic of the studied object from heterogeneous feature spaces. Therefore, different views contain complementary and partly independent information to one another. If this information is properly integrated, the performance of many machine learning tasks (e.g. classification, regression and clustering) can
be significantly enhanced [5], [8]–[10]. However, conventional single-view algorithms simply concatenate all views into one single view to meet the learning setting, which neglects the interaction between different views. Thus, for the purpose of boosting the learning performance by effectively exploiting multi-view data, multi-view learning has emerged and attracted increasing attention.

Besides the existence of multiple representations, scarcity of labeled examples and abundance of unlabeled data are characteristics of many machine learning applications as well [11]–[16]. For example, in web image classification, large amounts of web images are supplied by users, while labeled samples are fairly expensive to obtain because they require human effort. The two aspects mentioned above, the existence of multiple views and the abundance of unlabeled data, suggest the Multi-View Semi-Supervised Learning (MVSSL) strategy. In particular, we focus on the classification problem.

The existing multi-view semi-supervised classification algorithms are mainly developed in two paradigms. One is co-training [9], which was originally designed for datasets with two distinct views. It trains classifiers separately on each view, and adds the most confidently predicted examples of either classifier to the training set of the other in each iteration. From the procedure of co-training, it can be seen that it requires the predictions on each view to be accurate. In other words, the overall classification results may deteriorate if either classifier provides erroneous information to the other [17]. Since the standard co-training algorithm was proposed, many variants have been developed, such as Co-trained Expectation-Maximization (Co-EM) [18], which embeds the Expectation-Maximization algorithm into the co-training procedure, Bayesian Co-training [19], which develops a Bayesian undirected graphical model for co-training, and Co-Regularization [20], which constructs a data-dependent "co-regularization" norm through forms of multi-view regularization.

Another paradigm for multi-view semi-supervised classification is graph based methods. Several graph based algorithms with good performance have been presented. M. Wang et al. proposed an approach named Optimized Multi-Graph-based Semi-Supervised Learning (OMG-SSL) for video annotation [21]. It fuses multiple graphs into one and then conducts semi-supervised learning on the fused graph. It can be used for two-class classification. The Adaptive Multi-Modal Semi-Supervised (AMMSS) classification algorithm by X. Cai et al. is designed for image classification [5]. It propagates the class labels from labeled images to unlabeled ones based on the integrated multi-view feature similarity. However, conventional graph based methods have the following three main
drawbacks [11], [22]. First, when building the data graph, the choice of kernel functions may affect the algorithms' performance dramatically. Second, graph based methods mix the training and testing phases and have low efficiency in dealing with newly coming data, i.e., out-of-sample data, since they have to reconstruct graphs and rerun the algorithm. Last but not least, due to the heavy computation of kernel construction, these graph based methods cannot be utilized to tackle MVSSL problems with large-scale data size. To remedy the first two defects, a new graph based algorithm named Multi-feature Learning via Hierarchical Regression (MLHR) has recently been proposed [23]. Instead of constructing the graph by computing the affinity matrix directly, MLHR constructs local linear regression models for each datum to learn multiple view-based graphs. To classify out-of-sample data, global classifiers based on the linear regression model are trained on each view. Nevertheless, in MLHR, the models of the views are combined with equal weight, neglecting the contribution diversity of different views. Additionally, constructing graphs by learning local linear regression models for each datum is more time-consuming, especially for large-scale datasets.

In this paper, we propose a new multi-view semi-supervised classification algorithm, namely Multi-View Semi-Supervised Classification via Adaptive Regression (MVAR), to overcome the above mentioned problems. We only construct global regression models for each view and formulate the final objective function as the linearly weighted combination of all the loss functions of the views. More concretely, non-squared ℓ2-norm losses are employed to compute the residual for each training sample, and the loss function of each view is finally formulated in the ℓ2,1 matrix norm by summing the residuals over all training samples. An efficient algorithm is developed to solve the non-smooth ℓ2,1-norm minimization problem. With the learned projection matrices and bias vectors, predictions for out-of-sample data can be easily made. Since MVAR does not need graph construction or training of local models, it is efficient in calculation. Moreover, the weight coefficients that indicate the contributions of different views are adaptively optimized when minimizing the objective function. Under this mechanism, if the classifier learned on a certain view is not well qualified, a relatively small weight will be assigned to this view, and vice versa. Hence, the impact of a low-quality classifier can be weakened, thereby avoiding deterioration of the overall classification performance.

Our contributions are summarized as follows.
1) With the automatically optimized weight coefficients of each view, the quality diversity of different feature representations is taken into consideration. In other words, the classification performance of MVAR is robust against the existence of low-quality views.
2) Regressing to class labels directly makes the proposed algorithm efficient in calculation and applicable to large-scale multi-view semi-supervised classification problems. In addition, with the learned projection matrices and bias vectors, MVAR can easily make predictions for out-of-sample data.
3) We develop an efficient algorithm to address the
non-smooth ℓ2,1-norm minimization problem and prove that the algorithm monotonically decreases the objective of MVAR until convergence.
4) We evaluate MVAR systematically on six real-world multi-view datasets and in a real scene classification application. The experimental results indicate that our algorithm outperforms the other compared algorithms in most cases.

The rest of the paper is organized as follows. Section II introduces the notations used in this paper and briefly reviews three related works on multi-view semi-supervised classification. The formulation and solution of the proposed method MVAR are introduced in Section III. Section IV gives analyses of MVAR in four aspects. Experimental results on six benchmark datasets are displayed in Section V, followed by the application to scene categorization in Section VI. Finally, we conclude this paper in Section VII.

II. NOTATIONS AND RELATED WORKS

We start with the introduction of some basic notations and definitions used in this paper. Matrices and vectors are written as boldface uppercase letters and boldface lowercase letters respectively. 1_q ∈ R^q denotes a q-dimensional vector of all ones, and the subscript is omitted when the dimension is obvious. For a matrix M = (m_ij), its i-th row and j-th column are denoted by m^i and m_j respectively. The ℓ2-norm of a vector v ∈ R^d is defined as ∥v∥_2 = √(Σ_{i=1}^{d} |v_i|^2).
tr(·) is the trace operation of a matrix and ∥·∥_F denotes the matrix Frobenius norm. The ℓ2,1-norm of a matrix M ∈ R^{d×n} is defined as [24], [25]

∥M∥_{2,1} = Σ_{i=1}^{d} √( Σ_{j=1}^{n} m_{ij}^2 ) = Σ_{i=1}^{d} ∥m^i∥_2.   (1)
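As a small concrete check of definition (1), the following Python sketch (ours, for illustration only; it is not part of the original paper) computes the ℓ2,1-norm of a matrix by summing the ℓ2-norms of its rows.

```python
import numpy as np

def l21_norm(M):
    """||M||_{2,1}: sum over rows of the row-wise l2-norms, cf. Eq. (1)."""
    return np.linalg.norm(M, axis=1).sum()

M = np.array([[3.0, 4.0], [0.0, 5.0]])
print(l21_norm(M))   # 5 + 5 = 10
```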
Given n data samples {x_i}_{i=1}^{n}, the data matrix is denoted as X = [x_1, · · · , x_n] ∈ R^{d×n}. The i-th sample x_i = [(x_i^(1))^T, · · · , (x_i^(V))^T]^T ∈ R^d includes features from V views, and the v-th view x_i^(v) ∈ R^{d^(v)} has d^(v) features such that d = Σ_{v=1}^{V} d^(v). Denote the data matrix of the v-th view as X^(v) = [x_1^(v), · · · , x_n^(v)] ∈ R^{d^(v)×n}, thus X = [(X^(1))^T, · · · , (X^(V))^T]^T. Suppose the input data samples belong to c classes. The 1-of-c binary coding scheme is adopted to indicate the class labels, that is, the label vector of data point x_i is represented by y_i ∈ {0,1}^c, such that y_i(j) = 1 if x_i belongs to the j-th class, and 0 otherwise. Then the label matrix is Y = [y_1, · · · , y_n]^T ∈ {0,1}^{n×c}. Without loss of generality, assume the first l ≪ n data points are already labeled and let L = {1, 2, · · · , l}. Denote the set of indexes of unlabeled samples as U = {l+1, · · · , n}, and denote the cardinality of U as u. Correspondingly, the data matrix and label matrix are split into X = [X_L, X_U] and Y = [Y_L^T, Y_U^T]^T respectively. The notations used in this paper are summarized in Table I.
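To make the 1-of-c encoding concrete, the short Python sketch below (an illustration of ours; the function name is hypothetical) builds the label matrix Y from integer class labels.

```python
import numpy as np

def one_of_c(labels, c):
    """Build Y in {0,1}^{n x c} with y_i(j) = 1 iff sample i belongs to class j."""
    n = len(labels)
    Y = np.zeros((n, c))
    Y[np.arange(n), labels] = 1.0
    return Y

Y = one_of_c(np.array([0, 2, 1, 1, 0]), c=3)   # 5 samples, 3 classes
```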
TABLE I
NOTATIONS

Notation | Description
d | The total dimensionality of data
d^(v) | The dimensionality of the v-th view
n | The data size
l | The number of labeled data
u | The number of unlabeled data
m | The percentage of labeled samples
c | The number of classes
V | The number of views
α^(v) | The weight coefficient of the v-th view
r | The weight redistribution parameter
L | The set of indexes of labeled samples
U | The set of indexes of unlabeled samples
x_i ∈ R^d | The i-th sample
x_i^(v) ∈ R^{d^(v)} | The i-th sample's feature vector in view v
y_i ∈ {0,1}^c | The label vector of the i-th sample
f_i ∈ {0,1}^c | The predicted label vector of the i-th sample
X ∈ R^{d×n} | The data matrix
X^(v) ∈ R^{d^(v)×n} | The data matrix of the v-th view
Y ∈ {0,1}^{n×c} | The label matrix
F ∈ {0,1}^{n×c} | The predicted label matrix
W^(v) ∈ R^{d^(v)×c} | The projection matrix in the v-th view
b ∈ R^c | The bias vector
α ∈ R^V | The weight vector
1_q ∈ R^q | A vector of all ones for an arbitrary number q

A. Co-Regularization

Co-Regularization, a variant of the standard co-training algorithm, is a framework where classifiers are learned in each view through forms of multi-view regularization [20]. Algorithms proposed within this framework are based on optimizing measures of agreement and smoothness over labeled and unlabeled examples. Standard regularization methods like Support Vector Machines (SVM) and Regularized Least Squares (RLS) are naturally extended for multi-view semi-supervised classification.

With manifold regularization techniques, the Co-Regularized Laplacian SVM and Least Squares (Co-LapSVM, Co-LapRLS) utilize multi-view graph regularizers to enforce complementary and robust notions of smoothness in each view. Concretely, a graph Laplacian L^(v) is constructed on the v-th view, and then a multi-view regularizer is formed by taking a convex combination of all Laplacian matrices, L = Σ_{v=1}^{V} α^(v) L^(v), where α^(v) ≥ 0 is the weight coefficient of the v-th view and Σ_{v=1}^{V} α^(v) = 1. Then the optimization problem for each view is

f^(v)* = arg min_{f^(v) ∈ H_{K^(v)}}  (1/l) Σ_{i=1}^{l} loss(x_i, y_i, f^(v)) + γ_A^(v) ∥f^(v)∥_{K^(v)}^2 + γ_I^(v) f^(v)T L f^(v),   (2)

where f^(v) is the classifier of view v, H_{K^(v)} is a Reproducing Kernel Hilbert Space (RKHS) of functions with kernel function K^(v) and ∥·∥_{K^(v)} is the corresponding norm in H_{K^(v)}, loss(·) is some loss function such as the squared loss for RLS or the hinge loss for SVM, f^(v) denotes the vector [f^(v)(x_1), · · · , f^(v)(x_n)]^T, and the regularization parameters γ_A^(v) > 0, γ_I^(v) > 0 control the influence of unlabeled examples relative to the RKHS norm and the manifold regularization respectively.

B. Adaptive Multi-Modal Semi-Supervised Classification

The Adaptive Multi-Modal Semi-Supervised (AMMSS) classification algorithm is a graph based method designed for image classification [5]. It adaptively integrates multiple popularly used visual features for semi-supervised image classification, and aims at obtaining more accurate results than using any single image content representation. In this method, the label information of labeled data is propagated to unlabeled data through the graph on each view. Then a consensus class label matrix is learned by minimizing the differences between the consensus class label matrix F and the class label matrix F^(v) of each view. To be specific, the objective function of AMMSS is summarized as follows,

min_{F^(v), F, F_L=Y_L, α≥0, α^T 1=1}  Σ_{v=1}^{V} (α^(v))^r tr( (F^(v))^T L^(v) F^(v) ) + λ Σ_{v=1}^{V} tr( (F − F^(v))^T (F − F^(v)) ),   (3)

where α = [α^(1), α^(2), · · · , α^(V)]^T is the weight vector for the views, F_L is the predicted label matrix for labeled data, L^(v) stands for the normalized Laplacian matrix of the v-th view, λ > 0 is the regularization parameter to balance the two terms, r > 1 is used to control the weight distribution, and tr(·) is the trace operation of a matrix.
C. Multi-feature Learning via Hierarchical Regression

Multi-feature Learning via Hierarchical Regression (MLHR) [23] is a hierarchical combination of a set of local and global linear regression models. It is also a graph based method, but it does not construct the graph by computing the affinity matrix directly. Instead, it trains local linear regression models on each view for each datum to learn multiple view-based graphs. Moreover, MLHR can classify newly coming data as it trains V global classifiers based on linear regression on each view.

Concretely, given the feature vector x_i^(v) of the i-th sample on view v, the k-nearest neighbors of x_i^(v) are x_{i1}^(v), x_{i2}^(v), · · · , x_{ik}^(v). Let X_i^(v) = [x_i^(v), x_{i1}^(v), x_{i2}^(v), · · · , x_{ik}^(v)] ∈ R^{d^(v)×(k+1)}, and F_i^(v) ∈ R^{(k+1)×c} be the corresponding predicted label matrix. Define Ỹ ∈ {0,1}^{n×c} as the label information matrix. Given a labeled data point x_i, if it belongs to the j-th class, Ỹ_ij = 1, otherwise Ỹ_ij = 0. If x_i is unlabeled, then Ỹ_ij = 0 for any j with 1 ≤ j ≤ c. Denote U as a diagonal matrix, such that U_ii = ∞ if x_i is labeled, and
U_ii = 0 otherwise. Then the objective function of MLHR is

min_{F, F^(v), W^(v), b^(v), w_i^(v), b_i^(v)}  µ1 Σ_{v=1}^{V} ∥F − F^(v)∥_F^2 + tr( (F − Ỹ)^T U (F − Ỹ) )
 + µ2 Σ_{v=1}^{V} ( ∥(X^(v))^T W^(v) + 1(b^(v))^T − F∥_F^2 + γ ∥W^(v)∥_F^2 )
 + Σ_{v=1}^{V} Σ_{i=1}^{n} ( ∥(X_i^(v))^T w_i^(v) + 1(b_i^(v))^T − F_i^(v)∥_F^2 + λ ∥w_i^(v)∥_F^2 ),   (4)

where ∥·∥_F denotes the Frobenius norm, W^(v) ∈ R^{d^(v)×c} and b^(v) ∈ R^c are the global classifier and bias term on the v-th view, w_i^(v) ∈ R^{d^(v)×c} and b_i^(v) ∈ R^c are the local classifier and bias vector of x_i^(v), and µ1 > 0, µ2 > 0, γ > 0 and λ > 0 are parameters.

III. MULTI-VIEW SEMI-SUPERVISED CLASSIFICATION VIA ADAPTIVE REGRESSION

In this section, the formulation and solution of the proposed MVAR algorithm are introduced.

A. Formulation

We first introduce the objective function for single-view semi-supervised classification and then extend it to the multi-view scenario. Denote the predicted label vector of x_i as f_i ∈ {0,1}^c and F = [f_1, · · · , f_n]^T ∈ {0,1}^{n×c}; the general objective function for single-view semi-supervised classification can be formulated as

min_{f, F, F_L=Y_L}  Σ_{i=1}^{n} loss(f(x_i), f_i) + λ Ω(f),   (5)

where f: X → {0,1}^c is a classifier function, loss(·) is a loss function, Ω(f) is the regularization term with λ > 0 as its parameter, and F_L is the predicted label matrix corresponding to the labeled data. With different loss functions and regularizations, semi-supervised classification can be implemented in various ways. We adopt the least square loss function, which is widely used in many applications for its efficiency and simplicity [26], [27]; the objective function then becomes

min_{W, b, F, F_L=Y_L}  Σ_{i=1}^{n} ∥W^T x_i + b − f_i∥_2 + λ ∥W∥_F^2,   (6)

where W ∈ R^{d×c} is the projection matrix and b ∈ R^c is the bias. Here, the not squared residual is employed to increase robustness. For simplicity, we add the constant value 1 to each data point x_i as an additional dimension, so that the bias b can be absorbed into W. Thus the formulation becomes

min_{W, F, F_L=Y_L}  Σ_{i=1}^{n} ∥W^T x_i − f_i∥_2 + λ ∥W∥_F^2.   (7)

In semi-supervised learning, labeled and unlabeled samples usually play different roles, so a score s_i ≥ 0 for each sample is added,

min_{W, F, F_L=Y_L}  Σ_{i=1}^{n} s_i ∥W^T x_i − f_i∥_2 + λ ∥W∥_F^2.   (8)

Empirically, the score of labeled samples is larger than that of unlabeled samples [28]. In the multi-view setting, one sample can be represented by several different views. If these views are combined appropriately, the classification accuracy can be enhanced. Extending the formulation in (8) to the multi-view case, we get the formulation of MVAR,

min_{{W^(v)}, F, α}  Σ_{v=1}^{V} (α^(v))^r ( Σ_{i=1}^{n} s_i ∥(W^(v))^T x_i^(v) − f_i∥_2 + λ^(v) ∥W^(v)∥_F^2 )
 s.t.  F_L = Y_L,  α ≥ 0,  α^T 1 = 1,   (9)

where α = [α^(1), α^(2), · · · , α^(V)]^T is the weight vector for the views, λ^(v) > 0 is the regularization parameter of the v-th view, and r > 1 is the parameter to adjust the weight distribution over all views [29]. Denote S as a diagonal matrix with the i-th diagonal element S_ii = s_i; then for v = 1, 2, · · · , V we have

Σ_{i=1}^{n} s_i ∥(W^(v))^T x_i^(v) − f_i∥_2 = ∥S( (X^(v))^T W^(v) − F )∥_{2,1}.

For simplicity, use E^(v) to represent (X^(v))^T W^(v) − F. We rewrite the formulation in (9) with the matrix ℓ2,1-norm:

min_{W^(v), F, F_L=Y_L, α≥0, α^T 1=1}  Σ_{v=1}^{V} (α^(v))^r ( ∥S E^(v)∥_{2,1} + λ^(v) ∥W^(v)∥_F^2 ).   (10)

One point needs to be highlighted here. The ℓ2,1-norm has been employed to impose row sparsity for feature selection in many works [24], [25], [30]. Nevertheless, in this paper, the ℓ2,1-norm is utilized as a substitute for the Frobenius norm on the loss functions to gain robustness.

B. Solution

The proposed formulation cannot be solved directly for the following two reasons. First, the ℓ2,1-norm is non-smooth. Second, each entry of the predicted label matrix is a binary integer and each row vector must satisfy the 1-of-c coding scheme. We adopt an alternating iterative strategy to solve it. According to [24], the problem in (10) can be addressed by solving the following problem:

min_{W^(v), F, F_L=Y_L, α≥0, α^T 1=1}  Σ_{v=1}^{V} (α^(v))^r tr( (E^(v))^T B^(v) E^(v) ) + Σ_{v=1}^{V} (α^(v))^r λ^(v) tr( (W^(v))^T W^(v) ),   (11)

where B^(v) ∈ R^{n×n} is a diagonal matrix corresponding to the v-th view and its i-th diagonal element is defined as

b_ii^(v) = s_i / ( 2 ∥(e^(v))^i∥_2 ),  ∀ i = 1, 2, · · · , n,   (12)

where (e^(v))^i stands for the i-th row of E^(v) = (X^(v))^T W^(v) − F. Note that B^(v) is related to W^(v) and F, thus it is also an unknown variable.
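As a concrete illustration of the reweighting in (12), the following Python sketch (ours, not part of the paper; variable names are hypothetical) computes the diagonal entries of B^(v) from the current residual matrix E^(v), with a small constant added to avoid division by zero for rows whose residual is numerically zero.

```python
import numpy as np

def compute_B_diag(E_v, s, eps=1e-12):
    """Diagonal of B^(v): b_ii = s_i / (2 * ||i-th row of E^(v)||_2), cf. Eq. (12)."""
    row_norms = np.linalg.norm(E_v, axis=1)          # ||(e^(v))^i||_2 for each sample i
    return s / (2.0 * np.maximum(row_norms, eps))    # safeguard against zero-residual rows

# Usage with a residual E_v of shape (n, c) and sample scores s of shape (n,):
# b_diag = compute_B_diag(E_v, s); B_v = np.diag(b_diag)
```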
The objective function is multi-variable, and the unknown variables are F, {W^(v)}_{v=1}^{V}, {B^(v)}_{v=1}^{V} and α. Since optimizing all these variables simultaneously is difficult, we adopt an iterative method. Specifically, we optimize only one variable at a time and fix the others. The problem in (11) is thus divided into four sets of convex subproblems. Finding the optimal solution to each subproblem alternately and iteratively, the objective function converges to a local solution. With the initialized F and W^(v), we obtain B^(v) (v = 1, 2, · · · , V) according to (12), and then we can begin the iterations.

(1) Fix F, α and {B^(v)}_{v=1}^{V} to optimize {W^(v)}_{v=1}^{V}. Since the constraints are independent of {W^(v)}_{v=1}^{V} and the relations among views are decoupled, the problem disassembles into V separate subproblems (v = 1, 2, · · · , V):

min_{W^(v)}  tr( (E^(v))^T B^(v) E^(v) ) + λ^(v) tr( (W^(v))^T W^(v) ).   (13)

Setting the derivative of (13) with respect to (w.r.t.) W^(v) to zero, we have

W^(v) = ( X^(v) B^(v) (X^(v))^T + λ^(v) I )^{-1} X^(v) B^(v) F,   (14)

where I is the identity matrix.

(2) Fix {W^(v)}_{v=1}^{V}, {B^(v)}_{v=1}^{V} and α to optimize F. Ignoring the regularization term, which is not related to F, the problem becomes

min_{F, F_L=Y_L}  Σ_{v=1}^{V} (α^(v))^r tr( (E^(v))^T B^(v) E^(v) ).   (15)

We have

Σ_{v=1}^{V} (α^(v))^r tr( (E^(v))^T B^(v) E^(v) )
 = Σ_{i=1}^{n} Σ_{v=1}^{V} (α^(v))^r b_ii^(v) ∥(W^(v))^T x_i^(v) − f_i∥_2^2
 = Σ_{i=1}^{n} Σ_{v=1}^{V} (α^(v))^r b_ii^(v) ( (x_i^(v))^T W^(v) (W^(v))^T x_i^(v) − 2 (x_i^(v))^T W^(v) f_i + f_i^T f_i ),

where (x_i^(v))^T W^(v) (W^(v))^T x_i^(v) is independent of f_i. In addition, according to the definition of f_i, we have f_i^T f_i = 1. Noticing that f_i = y_i for i = 1, 2, · · · , l, the problem in (15) is equivalent to solving the following u subproblems,

max_{f_i ∈ {0,1}^c}  Σ_{v=1}^{V} (α^(v))^r b_ii^(v) (x_i^(v))^T W^(v) f_i,  i = l+1, · · · , n.   (16)

Define

ζ_i = [ζ_i1, ζ_i2, · · · , ζ_ic] = Σ_{v=1}^{V} (α^(v))^r b_ii^(v) (x_i^(v))^T W^(v),   (17)

and solve j_max = arg max_{1≤j≤c} ζ_ij. Then the optimal solution to (16) is

f_i(j) = 1 if j = j_max, and 0 otherwise,  i = l+1, · · · , n.   (18)

(3) Fix {W^(v)}_{v=1}^{V}, F and {B^(v)}_{v=1}^{V} to optimize α. The problem becomes

min_{α≥0, α^T 1=1}  Σ_{v=1}^{V} (α^(v))^r g^(v),   (20)

where g^(v) = tr( (E^(v))^T B^(v) E^(v) ) + λ^(v) tr( (W^(v))^T W^(v) ). The Lagrange function of (20) is

L(α, η) = Σ_{v=1}^{V} (α^(v))^r g^(v) − η ( Σ_{v=1}^{V} α^(v) − 1 ).   (21)

Setting the derivative of L(α, η) w.r.t. α^(v) to zero, we have

α^(v) = ( η / (r g^(v)) )^{1/(r−1)}.   (22)

Substituting the resultant α^(v) into the constraint Σ_{v=1}^{V} α^(v) = 1, we obtain

α^(v) = (g^(v))^{1/(1−r)} / Σ_{v=1}^{V} (g^(v))^{1/(1−r)}.   (23)

(4) Fix {W^(v)}_{v=1}^{V}, F and α, and update {B^(v)}_{v=1}^{V} according to (12).

By the above four steps, we alternately update {W^(v)}_{v=1}^{V}, F, {B^(v)}_{v=1}^{V} as well as α, and repeat these procedures iteratively until the objective function converges. We summarize the iteration process in Algorithm 1. For a testing data point x_t, denote ζ_t = [ζ_t1, · · · , ζ_tc] = Σ_{v=1}^{V} (α^(v))^r (W^(v))^T x_t^(v) and j_max = arg max_{1≤j≤c} ζ_tj; then the predicted label vector f_t is computed by

f_t(j) = 1 if j = j_max, and 0 otherwise.   (24)

Algorithm 1 MVAR
Input:
 1. The data matrix X = [(X^(1))^T, · · · , (X^(V))^T]^T,
 2. The labels for the first l samples: Y_L = [y_1, · · · , y_l]^T,
 3. Parameters s_i (i = 1, 2, · · · , n), λ^(v) (v = 1, 2, · · · , V) and r > 1.
Output:
 1. The predicted label vectors f_i (i = l+1, · · · , n) for unlabeled samples,
 2. The projection matrices W^(v), v = 1, 2, · · · , V,
 3. The weight vector α = [α^(1), · · · , α^(V)]^T.
Initialization:
 1. Set t = 0,
 2. Initialize α_t as α_t^(v) = 1/V, v = 1, 2, · · · , V,
 3. Use (X_L, Y_L) to calculate the initial projection matrices W_t^(v) (v = 1, 2, · · · , V) by least square classification on each view,
 4. Predict the initial labels f_{i,t} (i = l+1, · · · , n) for unlabeled samples according to (17) and (18) and set F_t = [Y_L^T, F_{U,t}^T]^T,
 5. Calculate B_t^(v) (v = 1, 2, · · · , V) according to (12).
Procedure:
Repeat
 1. Compute W_{t+1}^(v) according to (14) or (35),
 2. Compute F_{U,t+1} by (17) and (18) and update F_{t+1} = [Y_L^T, F_{U,t+1}^T]^T,
 3. Calculate the weight vector α_{t+1} by (23),
 4. Update {B_{t+1}^(v)}_{v=1}^{V} according to (12).
Until convergence
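For readers who prefer code, the Python sketch below (ours, not the authors' implementation) runs the alternating updates of Algorithm 1 for a fixed number of iterations. It assumes dense NumPy arrays, uses a simplified zero initialization for the unlabeled rows of F, and omits the d^(v) ≥ n variant in (35).

```python
import numpy as np

def mvar(Xs, Y_L, s, lams, r=2.0, n_iter=20, eps=1e-12):
    """Simplified sketch of the MVAR alternating updates (Algorithm 1).
    Xs: list of V view matrices of shape (d_v, n); Y_L: (l, c) 1-of-c labels for the first l samples."""
    V, n = len(Xs), Xs[0].shape[1]
    l, c = Y_L.shape
    alpha = np.full(V, 1.0 / V)                        # equal initial view weights
    # Initial projections by regularized least squares on the labeled part of each view
    Ws = [np.linalg.solve(X[:, :l] @ X[:, :l].T + lam * np.eye(X.shape[0]), X[:, :l] @ Y_L)
          for X, lam in zip(Xs, lams)]
    F = np.zeros((n, c)); F[:l] = Y_L                  # unlabeled rows start at zero (simplification)

    def b_diag(X, W):                                  # Eq. (12)
        return s / (2.0 * np.maximum(np.linalg.norm(X.T @ W - F, axis=1), eps))

    Bs = [b_diag(X, W) for X, W in zip(Xs, Ws)]
    for _ in range(n_iter):
        # Step 1: W^(v) update, Eq. (14)
        Ws = [np.linalg.solve(X @ (b[:, None] * X.T) + lam * np.eye(X.shape[0]), X @ (b[:, None] * F))
              for X, b, lam in zip(Xs, Bs, lams)]
        # Step 2: F update for unlabeled samples via Eqs. (17)-(18)
        Z = sum((a ** r) * b[:, None] * (X.T @ W) for a, b, X, W in zip(alpha, Bs, Xs, Ws))
        F[l:] = 0.0
        F[np.arange(l, n), Z[l:].argmax(axis=1)] = 1.0
        # Step 3: alpha update, Eq. (23)
        g = np.array([b @ np.sum((X.T @ W - F) ** 2, axis=1) + lam * np.sum(W ** 2)
                      for b, X, W, lam in zip(Bs, Xs, Ws, lams)])
        alpha = g ** (1.0 / (1.0 - r)); alpha /= alpha.sum()
        # Step 4: B^(v) update, Eq. (12)
        Bs = [b_diag(X, W) for X, W in zip(Xs, Ws)]
    return F, Ws, alpha
```

The per-view linear systems are solved directly rather than by forming explicit inverses, which mirrors the complexity discussion in Section IV-B.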
IV. ANALYSIS

This section gives analyses of MVAR in four aspects. We first discuss the convergence, time complexity and parameter determination problems of the algorithm. Then an intuitive performance analysis of MVAR is given to illustrate why it has advantages over the aforementioned algorithms in Section II.

A. Convergence analysis

We first prove that the iteration process described in Algorithm 1 monotonically decreases the objective function value in each iteration until convergence.

Theorem 1. The iterative approach in Algorithm 1 will monotonically decrease the objective value of the problem in (10) in each iteration until convergence.

Proof. Suppose after the t-th iteration, we have obtained {W_t^(v)}_{v=1}^{V}, F_t and α_t. In the next iteration, we fix F and α as F_t and α_t respectively for solving {W_{t+1}^(v)}_{v=1}^{V}. According to Algorithm 1, for the v-th view (v = 1, 2, · · · , V), it can be inferred that

W_{t+1}^(v) = arg min_{W^(v)} g^(v)( W^(v), F_t, B_t^(v) ),   (25)

where we represent tr( (E^(v))^T B^(v) E^(v) ) + λ^(v) tr( (W^(v))^T W^(v) ) as g^(v)(W^(v), F, B^(v)) for convenience. From the above equation, we have

g^(v)( W_{t+1}^(v), F_t, B_t^(v) ) ≤ g^(v)( W_t^(v), F_t, B_t^(v) ).   (26)

Summing over all views, it arrives at

Σ_{v=1}^{V} (α_t^(v))^r g^(v)( W_{t+1}^(v), F_t, B_t^(v) ) ≤ Σ_{v=1}^{V} (α_t^(v))^r g^(v)( W_t^(v), F_t, B_t^(v) ).   (27)

With B_t^(v), α_t and the currently calculated {W_{t+1}^(v)}_{v=1}^{V}, we update F_{t+1}, and then we obtain

Σ_{v=1}^{V} (α_t^(v))^r g^(v)( W_{t+1}^(v), F_{t+1}, B_t^(v) ) ≤ Σ_{v=1}^{V} (α_t^(v))^r g^(v)( W_{t+1}^(v), F_t, B_t^(v) ).   (28)

Combining (27) and (28), we have

Σ_{v=1}^{V} (α_t^(v))^r g^(v)( W_{t+1}^(v), F_{t+1}, B_t^(v) ) ≤ Σ_{v=1}^{V} (α_t^(v))^r g^(v)( W_t^(v), F_t, B_t^(v) ).   (29)

Unfolding the above inequality, it becomes

Σ_{v=1}^{V} (α_t^(v))^r ( Σ_{i=1}^{n} s_i ∥(e_{t+1}^(v))^i∥_2^2 / ( 2 ∥(e_t^(v))^i∥_2 ) + λ^(v) ∥W_{t+1}^(v)∥_F^2 )
 ≤ Σ_{v=1}^{V} (α_t^(v))^r ( Σ_{i=1}^{n} s_i ∥(e_t^(v))^i∥_2^2 / ( 2 ∥(e_t^(v))^i∥_2 ) + λ^(v) ∥W_t^(v)∥_F^2 ),   (30)

where (e_{t+1}^(v))^i and (e_t^(v))^i stand for the i-th rows of E_{t+1}^(v) = (X^(v))^T W_{t+1}^(v) − F_{t+1} and E_t^(v) = (X^(v))^T W_t^(v) − F_t respectively. According to Lemma 1 in [24], the following inequality holds:

s_i ∥(e_{t+1}^(v))^i∥_2 − s_i ∥(e_{t+1}^(v))^i∥_2^2 / ( 2 ∥(e_t^(v))^i∥_2 ) ≤ s_i ∥(e_t^(v))^i∥_2 − s_i ∥(e_t^(v))^i∥_2^2 / ( 2 ∥(e_t^(v))^i∥_2 ),
 ∀ i = 1, · · · , n, ∀ v = 1, · · · , V.   (31)

Summing the above inequality over all data points and all views:

Σ_{v=1}^{V} (α_t^(v))^r Σ_{i=1}^{n} ( s_i ∥(e_{t+1}^(v))^i∥_2 − s_i ∥(e_{t+1}^(v))^i∥_2^2 / ( 2 ∥(e_t^(v))^i∥_2 ) )
 ≤ Σ_{v=1}^{V} (α_t^(v))^r Σ_{i=1}^{n} ( s_i ∥(e_t^(v))^i∥_2 − s_i ∥(e_t^(v))^i∥_2^2 / ( 2 ∥(e_t^(v))^i∥_2 ) ).   (32)

Combining (30) and (32), we have

Σ_{v=1}^{V} (α_t^(v))^r ( ∥S E_{t+1}^(v)∥_{2,1} + λ^(v) ∥W_{t+1}^(v)∥_F^2 ) ≤ Σ_{v=1}^{V} (α_t^(v))^r ( ∥S E_t^(v)∥_{2,1} + λ^(v) ∥W_t^(v)∥_F^2 ).   (33)

After updating α_{t+1} by (23), we arrive at

Σ_{v=1}^{V} (α_{t+1}^(v))^r ( ∥S E_{t+1}^(v)∥_{2,1} + λ^(v) ∥W_{t+1}^(v)∥_F^2 )
 ≤ Σ_{v=1}^{V} (α_t^(v))^r ( ∥S E_{t+1}^(v)∥_{2,1} + λ^(v) ∥W_{t+1}^(v)∥_F^2 )
 ≤ Σ_{v=1}^{V} (α_t^(v))^r ( ∥S E_t^(v)∥_{2,1} + λ^(v) ∥W_t^(v)∥_F^2 ).   (34)

Thus Algorithm 1 monotonically decreases the objective value of (10) in each iteration. In addition, the objective function is lower bounded by 0, so the objective value will converge. Therefore, the proof is completed.

B. Time complexity

We first analyze the time complexity of solving the projection matrices {W^(v)}_{v=1}^{V}. As B^(v) is invertible, according to the identity

( P^{-1} + B^T R^{-1} B )^{-1} B^T R^{-1} = P B^T ( B P B^T + R )^{-1},
we have

W^(v) = X^(v) ( (X^(v))^T X^(v) + λ^(v) (B^(v))^{-1} )^{-1} F.   (35)
Note that the time complexity of computing the inverse of a square matrix is cubic w.r.t. the matrix size. When d^(v) < n, we use (14) to calculate W^(v); otherwise (35) is adopted. When d^(v) < n, the time complexity of calculating X^(v) B^(v) (X^(v))^T is O(n(d^(v))^2), and computing X^(v) B^(v) F requires O(nd^(v)c) computations. Instead of computing the inverse of X^(v) B^(v) (X^(v))^T + λ^(v) I, we solve the system of linear equations ( X^(v) B^(v) (X^(v))^T + λ^(v) I ) W^(v) = X^(v) B^(v) F, whose time complexity is O((d^(v))^3). Recall that d^(v) < n, so O(n(d^(v))^2) dominates O((d^(v))^3) and the time complexity is O(n(d^(v))^2). When d^(v) ≥ n, by a similar analysis, the time complexity of computing W^(v) is O(n^2 d^(v)). In the subsequent steps, the time complexity of updating the predicted label matrix F_U for unlabeled data is O(udc). The calculation of α and the updating of {B^(v)}_{v=1}^{V} can be completed together with O(ncd) computations. Therefore, the time complexity of MVAR is O( T Σ_{v=1}^{V} max{n, d^(v)} min{n, d^(v)}^2 ), where T is the total number of iterations.
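To illustrate the choice between (14) and (35), the sketch below (ours, not the authors' code) solves the corresponding linear systems rather than forming explicit inverses, switching on whether d^(v) is smaller than n; it assumes strictly positive scores so that B^(v) is invertible.

```python
import numpy as np

def update_W(X, b_diag, F, lam):
    """Solve for W^(v): Eq. (14) when d < n, Eq. (35) otherwise.
    X: (d, n) view matrix, b_diag: (n,) positive diagonal of B^(v), F: (n, c)."""
    d, n = X.shape
    if d < n:
        A = X @ (b_diag[:, None] * X.T) + lam * np.eye(d)   # X B X^T + lambda I  (d x d system)
        return np.linalg.solve(A, X @ (b_diag[:, None] * F))
    A = X.T @ X + lam * np.diag(1.0 / b_diag)               # X^T X + lambda B^{-1}  (n x n system)
    return X @ np.linalg.solve(A, F)
```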
C. Parameter determination

The proposed formulation MVAR has three kinds of parameters, i.e., the scores {s_i}_{i=1}^{n} for training samples, the regularization parameters {λ^(v)}_{v=1}^{V} on each view, and the parameter r to adjust the weight distribution over all views. The scores determine the contributions of the training samples to the whole formulation. With label information, labeled samples are usually allocated larger scores than unlabeled ones [28]. In this paper, in order to reduce model complexity, we set the scores of unlabeled samples s_i (i = l+1, · · · , n) as 1, and s_i = µ ≥ 1 (i = 1, · · · , l) for all labeled samples. How the value of µ is determined is described in detail in Section V-B.

The regularization parameter λ^(v) controls the smoothness of the estimator W^(v). The performance of the least square classifier is closely related to the choice of λ^(v). We adopt the method proposed in [31] to determine them.

Finally, the parameter r is introduced to avoid the trivial solution. More concretely, if r is not introduced, then the solution of α in (10) is α^(v_min) = 1 and 0 otherwise, where v_min = arg min_{1≤v≤V} ∥S E^(v)∥_{2,1} + λ^(v) ∥W^(v)∥_F^2. This means that
only one view is selected, which is not consistent with our original intention in the multi-view setting. In addition, the introduction of r redistributes the weights of each view [29]. From (23), we can see that when r → ∞, the weight for each view tends to be equal, while when r → 1, we will get the trivial solution. That is to say, the larger the value of r is, the smaller the difference between the weight coefficients is. Thus, we can use r to control the weight distribution.
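The effect of r on the learned view weights can be checked numerically with the small sketch below (ours), which evaluates the closed form (23) for a fixed set of hypothetical per-view objective values g^(v) and several values of r.

```python
import numpy as np

def view_weights(g, r):
    """Closed-form alpha from Eq. (23): alpha_v proportional to g_v^{1/(1-r)}."""
    w = np.asarray(g, dtype=float) ** (1.0 / (1.0 - r))
    return w / w.sum()

g = [1.0, 2.0, 8.0]                      # hypothetical per-view objective values g^(v)
for r in (1.2, 2.0, 5.0, 50.0):
    print(r, np.round(view_weights(g, r), 3))
# As r grows the weights approach 1/V; as r -> 1 they concentrate on the best view.
```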
D. Intuitive Performance Analysis of MVAR

In order to obtain a better understanding of the advantages of MVAR over Co-Regularization, AMMSS and MLHR, we make comparisons between MVAR and the aforementioned methods in this subsection. As shown in (9), MVAR considers the training samples' role diversity and the different views' quality diversity simultaneously, by allocating them different scores s_i (i = 1, · · · , n) and different weight coefficients α^(v) (v = 1, · · · , V) respectively. By contrast, from the formulations of Co-Regularization and AMMSS, we can see that they only take account of the view diversity. Though MLHR and MVAR are similar in employing the least square regression model, MLHR considers neither the samples' role diversity nor the views' quality diversity, as shown in (4). Despite the success of semi-supervised learning and multi-view learning in many applications, exploiting the unlabeled data does not necessarily bring a positive effect, and it may not always be the case that including more features is beneficial [23]. If there exist noisy samples or low-quality views, assigning low weights or scores to them can reduce their negative effects to some extent.

Additionally, the problem of how to determine the kernel function when building Laplacian graphs is still open. Since different kernels may correspond to various notions of similarity, in Co-LapSVM, Co-LapRLS and AMMSS the quality of the Laplacian matrices L^(v) (v = 1, · · · , V) is easily affected by the choice of kernel functions. In comparison, MVAR bypasses the problem of choosing kernel functions by regressing to labels directly. Therefore, from the above mentioned two points, it can be concluded that the classification performance of MVAR is more stable.

V. EXPERIMENTS

In this section, we compare our proposed MVAR with related semi-supervised methods in terms of classification accuracy. Then the impact of parameters on our approach is evaluated. Lastly, the convergence performance of the proposed method and a comparison of computational time are also presented. Six benchmark datasets are used in the experiments: Animal with Attributes (Animal), NUS-WIDE-Object (NUS-Object), COREL Photo Image Dataset (Corel5k), GRAZ02, Microsoft Research Cambridge Volume 1 (MSRC-v1), and the Scene UNderstanding database (SUN). Each dataset has a certain number of types of features (views), whose details are described in the next subsection and summarized in Table II.

Dataset sources: Animal: http://attributes.kyb.tuebingen.mpg.de/; NUS-Object: http://lms.comp.nus.edu.sg/research/NUS-WIDE.htm; Corel5k: http://www.cais.ntu.edu.sg/~chhoi/SVMBMAL/; GRAZ02: http://www.emt.tugraz.at/~pinz/data/GRAZ02; MSRC-v1: http://research.microsoft.com/en-us/projects/objectclassrecognition/; SUN: http://vision.princeton.edu/projects/2010/SUN/.

We compare our proposed MVAR with the following methods.
1) Regularized Least Square (RLS) classification on labeled data of each single view (S-RLS). The learned projection matrix and bias vector are used to make predictions on
unlabeled training data and testing data. We report the best single-view results.
2) RLS classification on labeled data of the concatenated feature matrix of all views (C-RLS). The learned models are used to classify the unlabeled training data and testing data.
3) The Co-LapSVM and Co-LapRLS within the Co-Regularization framework [20].
4) The Adaptive Multi-Modal Semi-Supervised (AMMSS) classification algorithm [5].
5) Multi-feature Learning via Hierarchical Regression (MLHR) [23].
The implementations of Co-LapSVM and Co-LapRLS (http://vikas.sindhwani.org/manifoldregularization.html), AMMSS (http://www.escience.cn/people/fpnie/papers.html), and MLHR (http://www.cs.cmu.edu/~yiyang/Publications.html) are downloaded from the authors' websites.

A. Dataset description

Animal consists of 30,475 images of 50 animal classes with pre-extracted feature representations for each image. We utilize the six earliest published features for all the images: color histogram (CH) features, local self-similarity (LSS) features, pyramid HOG (PHOG) [32] features, SIFT [33] features, colorSIFT (RGSIFT) [34] features, and SURF [35] features.

NUS-Object is a real-world object image dataset. It consists of 31 object categories and 30,000 images in total. The following 5 visual features are available: CH, color correlogram (CORR), edge direction histogram (EDH), wavelet texture (WT), and block-wise color moments (CM).

Corel5k contains 5,000 images from 50 different categories. Each category consists of exactly 100 images that are randomly selected from the COREL database. CM, EDH and WT [36] features are extracted.

GRAZ02 is a database for object categorization. It contains images with objects of high complexity and high intra-class variability. It consists of 365 images with bikes, 311 images with persons, 420 images with cars and 380 images not containing one of these objects. We extract the following 6 visual features: SIFT, SURF, GIST [37], local binary pattern (LBP) [38], PHOG and WT.

MSRC-v1 is an object recognition dataset containing 8 classes, and each class has 30 images. The same as in [5], 7 classes composed of tree, building, airplane, cow, face, car and bicycle are selected. We extract the same visual features: LBP, HOG, GIST, CM, CENTRIST [39] and SIFT.

SUN contains 899 categories and 130,519 images. We randomly choose 10 classes from the well-sampled 397-category subset used in [40], and each class has 100 images. We refer to the sampled subset as SUN1k. Three pre-extracted features are adopted for our experiments: SIFT, HOG and texton histogram (TH) features.

B. Experimental setup

For S-RLS, C-RLS and our proposed MVAR, the regularization parameters are determined by the method proposed in
[31]. In MVAR, r is set to be 2 and the score of unlabeled samples is fixed as 1, while the score µ for labeled samples is tuned in the range of {100 , 101 , 102 , 103 , 104 , 105 , 106 }. In Co-LapSVM and Co-LapRLS, we use linear kernels for all views. When constructing the graph Laplacian L(v) of each view, the affinity matrix is calculated using the 5-nearest neighbors with Euclidean distance and heat kernel weight. The bandwidth parameter in heat kernel is set as the mean value of all edge distances. The multi-view graph Laplacian is V ∑ α(v) L(v) , where the weights the convex combination L = i=1
α^(v) (v = 1, · · · , V) are learned by our proposed method. After computing L, the algorithms are performed on each view and the best results are reported. We fix the manifold regularization parameter γ_I^(v) as 1 and tune the RKHS norm regularization parameter γ_A^(v) in the range of {10^0, 10^2, 10^4} (please refer to [20] for more details). Additionally, the one-versus-rest rule is employed in Co-LapSVM to fulfill the multiclass classification task. AMMSS is a graph based approach. The graph Laplacian of each view is constructed in the same way as in Co-LapSVM. The weight redistribution parameter r is set to be 2 as in MVAR. Finally, AMMSS has a trade-off parameter to balance the first term and the second term in its formulation. According to the experimental settings in [5], we tune it in the range from 0.2 to 1 with incremental step 0.2. There are five parameters in total to be determined in MLHR, i.e., the number of nearest neighbors k, the trade-off parameters µ1 and µ2, and the regularization parameters γ and λ. Here, k is also set to be 5. As reported in [23], the performance of MLHR is not sensitive to the local regularization parameter λ, and the same holds for the global regularization parameter γ when it is not larger than 1. Hence, we fix these two parameters as 1. For µ1 and µ2, we tune them from {10^0, 10^3, 10^6}.

For each dataset, with the class proportions unchanged, we randomly choose 80%, 10%, 10% of the samples as the training set, the testing set and the validation set respectively. Further, among the training set, we randomly generate m% (of the whole dataset) labeled samples, and the remaining are unlabeled. To simulate the "real" situation in the semi-supervised scenario where l ≪ u, we set m from 5 to 30 with an interval of 5. The above procedure is repeated 10 times and produces 10 random partitions of the data. For Animal and NUS-Object only 5 random partitions are produced, because the algorithms take a long computational time on these two datasets. We adopt a window based stopping criterion: for a given window size h, at every iteration t, we calculate the ratio q = (max J_t − min J_t)/max J_t, where the set J_t = {obj_{t−h+1}, · · · , obj_t} consists of the objective values in a history window. If q < θ, where θ is a predefined value, the algorithm stops iterating. Specifically, we set h = 6 and θ = 10^{-4}.
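The window-based stopping rule can be written in a few lines; the sketch below (ours, illustrative only) follows the definition of q above.

```python
def should_stop(obj_history, h=6, theta=1e-4):
    """Window-based stopping criterion: stop when the relative spread of the
    last h objective values falls below theta."""
    if len(obj_history) < h:
        return False
    window = obj_history[-h:]
    q = (max(window) - min(window)) / max(window)
    return q < theta
```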
The classification performance is evaluated in terms of average classification accuracy. In the experiment, we use the validation set to determine the parameters of the compared methods. Concretely, algorithms are first trained on the training data (including labeled and unlabeled data) with different candidate parameter values. Then predictions are made for the validation data with the trained models. The value which produces the best average classification accuracy on the validation data is then chosen as the parameter for training and testing. We report average classification accuracy on both the unlabeled training samples and the unseen testing samples.

TABLE II
DETAILS OF THE MULTI-VIEW DATASETS USED IN OUR EXPERIMENTS (FEATURE TYPE (DIMENSIONALITY)).

Feature type | Animal | NUS-Object | Corel5k | GRAZ02 | SUN1k | MSRC-v1
1 | LSS(2000) | CH(64) | CMT(9) | SIFT(500) | SIFT(6300) | CMT(48)
2 | CH(2688) | CORR(144) | EDH(18) | SURF(500) | HOG(6300) | LBP(256)
3 | PHOG(252) | EDH(73) | WT(9) | PHOG(680) | TH(10752) | HOG(100)
4 | SIFT(2000) | CM(255) | — | LBP(256) | — | SIFT(200)
5 | RGSIFT(2000) | WT(128) | — | GIST(512) | — | GIST(512)
6 | SURF(2000) | — | — | WT(32) | — | CENTRIST(1320)
Data points | 30475 | 30000 | 5000 | 1476 | 1000 | 210
Classes | 50 | 31 | 50 | 4 | 10 | 7

C. Classification results

Fig. 1 and Fig. 2 show the classification accuracy results of all the compared methods with different percentages of labeled samples on the unlabeled training data and testing data, respectively. It should be noted that a single run of AMMSS did not finish within a week on the datasets Animal and NUS-Object under the experimental conditions of this paper, thus only the results of the other six methods are reported for these two datasets. It can be seen that all methods achieve higher classification accuracy with more labeled training samples in most cases, which is consistent with intuition. Surprisingly, the classification results of S-RLS are not the worst, except that it obtains the lowest classification accuracy when testing on Corel5k. This illustrates, from the opposite side, that the learning performance will not be enhanced if the multiple representations are not properly integrated. The performance of AMMSS is unstable, as illustrated in Fig. 1 and Fig. 2. On MSRC-v1, AMMSS ranks second, while it performs the worst on SUN1k. This is probably because the chosen kernel function is not suitable for all cases. In comparison, the stability of the results of MLHR w.r.t. different datasets is better, as it employs a local linear regression model to learn graphs, rather than directly computing the affinity matrix. As shown in Fig. 1, when making predictions on the unlabeled training data, the proposed MVAR prevails over the other compared methods in most cases. From Fig. 2 we can see that, although the advantage of our method on testing data is not as obvious as that when classifying unlabeled training data, MVAR still obtains competitive or higher classification accuracy. The performance of MVAR is remarkable when only a small amount of labeled training data is available.

D. Impact of parameters µ and r

In this subsection, we study the impact of the score µ for labeled samples and the weight redistribution parameter r on the performance of MVAR, respectively.
We set the values of µ and r as {10^0, 10^1, 10^2, 10^3, 10^4, 10^5, 10^6} and {1.2, 1.4, 1.6, 1.8, 2, 3, 5, 7, 9}, respectively. Without loss of generality, we train MVAR on the training data and make predictions on the testing data when 10% of the samples are labeled. This procedure is also repeated 10 times on the 10 random partitions of the data used above. We report the results on GRAZ02 and MSRC-v1, since the algorithm behaves similarly on the other datasets. The detailed results on both unlabeled training data and testing data are displayed in Fig. 3. As shown in Fig. 3, as the value of r varies, the performance of MVAR is only slightly affected. This demonstrates that empirically setting r as 2 is reasonable. On the other hand, the selection of µ affects MVAR's performance to some degree, especially on the testing data. As we can see, the optimal results are achieved when µ > 1 in all cases. Recall that the value of µ stands for the score of the labeled samples, while the scores of unlabeled samples are all set as 1. Thus, this phenomenon illustrates that the labeled samples play more important roles than unlabeled samples. In the semi-supervised setting, we have access to a small amount of labeled samples. Therefore, using the labeled samples for validation, the parameters can be determined, thereby avoiding the degradation in classification accuracy brought by inappropriate parameters.

E. Convergence Analysis and Time Comparison

In order to verify the convergence of Algorithm 1, we present the convergence behavior curves on the datasets Corel5k and SUN1k when the percentage of labeled samples is 10%. The convergence curves are displayed in Fig. 4. As seen from Fig. 4, the objective values are non-increasing during the iterations and converge to a fixed value. Additionally, the algorithm converges within 10 iterations. Therefore, our proposed algorithm scales well in practice because of the fast convergence speed.

The time complexities of the training processes of the compared multi-view methods are listed in Table III. When the data size n is larger than the total dimensionality d, the time complexity of Co-LapSVM, Co-LapRLS, AMMSS and MLHR is cubic w.r.t. n. In other words, the first four methods are not efficient for large-scale datasets. It can also be found that the computational time of both AMMSS and MVAR is closely related with the number of iterations. We report the computational time of these four methods on three datasets: Animal, Corel5k and SUN1k.
Fig. 1. Classification accuracy comparison between different methods on unlabeled training data. Panels: (a) Animal, (b) NUS-Object, (c) Corel5k, (d) GRAZ02, (e) SUN1k, (f) MSRC-v1; each panel plots classification accuracy (%) against the percentage of labeled samples (5%–30%) for the compared methods.
Fig. 2. Classification accuracy comparison between different methods on testing data. Panels: (a) Animal, (b) NUS-Object, (c) Corel5k, (d) GRAZ02, (e) SUN1k, (f) MSRC-v1; axes as in Fig. 1.
[Fig. 3 panels: (a) GRAZ02, unlabeled; (b) GRAZ02, testing; (c) MSRC-v1, unlabeled; (d) MSRC-v1, testing. Each panel shows classification accuracy (%) as a function of µ and r.]
Fig. 3. Performance Comparison of MVAR with different µ and r values on GRAZ02 and MSRC-v1 with 10% labeled samples on both unlabeled training samples and unseen testing samples.
TABLE IV
COMPUTATIONAL TIME (SECOND) ON FOUR DATASETS. −: PROGRAM IS NOT FINISHED IN A WEEK FOR A SINGLE RUN; ∗: COMPUTATIONAL TIME WITH PRE-COMPUTED GRAPHS.

 | Co-LapSVM | Co-LapRLS | AMMSS | MLHR | MVAR
Animal | 5418.8(6.8) | 5327.7(11.8) | − | 4442.8(8.1)∗ | 1980.1(3.1)
Corel5k | 17.0(0.9) | 15.7(0.1) | 248.3(1.6) | 21.8(0.7) | 1.4(0.1)
SUN1k | 1.6(0.1) | 1.5(0.1) | 5.4(0.1) | 237.3(0.8) | 2.6(0.1)

Fig. 4. Convergence analysis of MVAR with 10% labeled data on two datasets: (a) Corel5k, (b) SUN1k; each panel plots the objective value against the iteration number.
TABLE III
TIME COMPLEXITY OF COMPARED MULTI-VIEW METHODS.

Method | Time complexity
Co-LapSVM | max{O(V n^3), O(n^2 d)}
Co-LapRLS | max{O(V n^3), O(n^2 d)}
AMMSS | max{O(T V n^3), O(T n^2 d)}
MLHR | max{O(V n^3), O(Σ_{v=1}^{V} (d^(v))^3)}
MVAR | O(T Σ_{v=1}^{V} max{n, d^(v)} min{n, d^(v)}^2)
Among these datasets, Animal and Corel5k have much larger data size than dimensionality, and SUN1k has the highest dimensionality. All algorithms are tested on a workstation with 12 processors (2.10 GHz each) and 96.0 GB RAM, using MATLAB R2013b implementations. With 10% labeled samples, we run each method for training and testing with predetermined parameters 5 times. The average time and standard deviation are reported in Table IV. Two points need to be noticed. First, the computational time of AMMSS on Animal is not reported, since the program did not finish within a week for a single run. Second, on average, it takes more than 4 hours for MLHR to construct the graph of one view on Animal; therefore, we only report the computational time that MLHR consumes with pre-computed graphs on this dataset. From Table III and Table IV, it can be observed that the computational time results are consistent with the time complexities of the compared methods. MVAR takes the least time to make predictions on Animal and Corel5k, while the other four methods use much more time as they have cubic time complexity w.r.t. the data size. AMMSS spends the most time on Corel5k, because it not only has high complexity w.r.t. the data size but also needs iterations. Since Co-LapSVM and Co-LapRLS are within the same Co-Regularization framework, they have little difference in the amount of time spent on all datasets. Compared with AMMSS and MVAR, Co-LapSVM and Co-LapRLS have the advantage that they do not need iterations. Therefore, they use less time on the small-scale dataset SUN1k. As SUN1k's dimensionality of each view is much larger than its data size, MLHR consumes the most time on it. To sum up, MVAR is more scalable and more efficient for large-scale datasets.
Fig. 5. One image from each of the 15 scene categories. The categories are bedroom, coast, forest, highway, industrial, inside city, kitchen, living room, mountain, office, open country, store, street, suburb, and tall building, respectively (from top to bottom, and from left to right).
VI. APPLICATION TO SCENE CATEGORIZATION

Scene categorization, or scene recognition, usually refers to the problem of recognizing the semantic label (e.g. bedroom, mountain, or coast) of a single image [41]. Scenes are captured more by ensembles of objects than by individual objects. Therefore, classifying scenes is not an easy task owing to their variability and ambiguity [39]. Many visual descriptors or representations have been designed for successful recognition of scenes, such as SIFT [33], GIST [37] and HOG [32]. Instead of using only one representation, in this section we apply the proposed MVAR to scene categorization by integrating several state-of-the-art descriptors in the semi-supervised setting.

Two commonly used scene datasets are employed for evaluation. The first is released by Fei-Fei Li, containing 13 categories. There are a total of 3859 images, and the categories are highway (260 images), inside of cities (308 images), tall buildings (356 images), streets (292 images), suburb residence (241 images), forest (328 images), coast (360 images), mountain (374 images), open country (410 images), bedroom (216 images), kitchen (210 images), livingroom (289 images) and office (215 images). The second dataset is an expansion of the thirteen-category dataset with two new categories: industrial (311 images) and store (315 images). For convenience, these two datasets are referred to as Scene13 and Scene15 respectively. These images are about 300 × 250 in resolution, containing a wide range of scene categories in both indoor and outdoor environments. Fig. 5 shows example images from each category. For both scene datasets, we extract the same 6 visual features as for the GRAZ02 dataset.
TABLE V
THE CLASSIFICATION ACCURACY (%) ON UNLABELED TRAINING DATA OF DIFFERENT METHODS ON TWO SCENE CATEGORIZATION TASKS WITH DIFFERENT PERCENTAGES (m%) OF LABELED SAMPLES. STANDARD DEVIATION IS IN THE PARENTHESES.

Scene13 (m% = 5, 10, 15, 20, 25, 30):
S-RLS | 62.0(1.7) | 65.2(1.0) | 66.5(1.1) | 68.0(1.1) | 68.5(1.1) | 68.0(0.8)
C-RLS | 55.8(1.6) | 58.3(2.1) | 60.7(0.7) | 61.4(1.5) | 64.0(0.9) | 62.1(0.9)
Co-LapSVM | 56.1(0.7) | 62.7(0.9) | 64.8(1.0) | 67.1(1.1) | 68.4(0.8) | 68.7(0.7)
Co-LapRLS | 25.6(1.2) | 27.3(1.2) | 23.1(0.5) | 26.3(0.7) | 31.0(0.9) | 34.8(1.1)
AMMSS | 55.2(1.1) | 61.5(1.1) | 62.9(4.3) | 58.5(4.5) | 69.9(0.5) | 71.1(0.6)
MLHR | 51.4(1.5) | 64.5(1.4) | 70.0(0.6) | 73.7(0.9) | 75.8(0.9) | 77.2(0.5)
MVAR | 65.8(1.0) | 70.1(0.8) | 75.9(1.0) | 78.9(0.9) | 80.1(0.9) | 81.1(0.7)

Scene15 (m% = 5, 10, 15, 20, 25, 30):
S-RLS | 55.5(1.1) | 57.9(0.8) | 60.3(0.8) | 59.6(0.6) | 61.1(0.7) | 61.5(0.7)
C-RLS | 52.4(1.5) | 56.4(0.9) | 58.2(1.0) | 58.3(0.9) | 58.9(1.5) | 59.4(1.3)
Co-LapSVM | 49.9(0.9) | 55.5(1.0) | 58.4(0.6) | 60.3(0.7) | 61.3(0.7) | 62.5(0.5)
Co-LapRLS | 24.7(1.4) | 22.5(0.9) | 25.2(0.9) | 27.5(0.9) | 33.0(0.9) | 26.8(0.8)
AMMSS | 49.6(1.6) | 56.0(0.6) | 59.5(1.4) | 57.3(2.8) | 56.7(3.5) | 66.5(0.9)
MLHR | 46.5(1.5) | 59.9(1.2) | 65.7(1.2) | 69.1(0.8) | 72.0(0.5) | 73.4(0.8)
MVAR | 59.4(1.8) | 66.8(0.7) | 72.2(0.9) | 74.8(0.6) | 76.4(0.6) | 77.0(1.0)
TABLE VI
THE CLASSIFICATION ACCURACY (%) ON TESTING DATA OF DIFFERENT METHODS ON TWO SCENE CATEGORIZATION TASKS WITH DIFFERENT PERCENTAGES (m%) OF LABELED SAMPLES. STANDARD DEVIATION IS IN THE PARENTHESES.

Scene13 (m% = 5, 10, 15, 20, 25, 30):
S-RLS | 62.0(2.3) | 65.2(2.4) | 66.5(1.8) | 68.0(1.8) | 68.5(1.5) | 68.0(2.4)
C-RLS | 55.1(2.1) | 58.2(3.3) | 60.9(1.9) | 60.3(1.5) | 64.0(2.0) | 62.3(1.8)
Co-LapSVM | 55.1(2.1) | 62.7(2.0) | 65.0(1.9) | 67.1(1.4) | 68.3(2.2) | 68.5(3.2)
Co-LapRLS | 43.2(2.0) | 60.0(0.8) | 63.4(2.2) | 66.8(2.9) | 69.4(1.6) | 68.6(2.6)
AMMSS | 53.2(2.3) | 61.7(2.3) | 62.8(4.2) | 59.3(3.8) | 70.5(1.6) | 70.6(1.6)
MLHR | 52.0(2.2) | 63.3(2.0) | 70.1(2.1) | 74.5(2.7) | 75.9(2.1) | 77.0(1.7)
MVAR | 59.4(2.7) | 66.5(2.7) | 74.3(1.7) | 78.1(2.1) | 78.8(1.8) | 79.9(2.2)

Scene15 (m% = 5, 10, 15, 20, 25, 30):
S-RLS | 55.5(1.5) | 57.9(2.5) | 60.3(2.3) | 59.6(1.8) | 61.1(2.6) | 61.5(2.4)
C-RLS | 53.8(1.9) | 55.9(2.2) | 57.7(2.8) | 56.7(3.5) | 58.5(2.1) | 59.9(2.3)
Co-LapSVM | 50.1(1.2) | 54.5(2.2) | 58.2(3.1) | 59.0(1.8) | 62.6(2.0) | 62.5(2.5)
Co-LapRLS | 42.1(1.7) | 51.6(3.0) | 56.7(2.1) | 58.4(1.9) | 61.0(1.5) | 61.8(1.9)
AMMSS | 49.7(2.2) | 55.7(2.5) | 60.0(2.1) | 55.7(4.8) | 56.8(3.9) | 66.4(2.5)
MLHR | 46.6(1.3) | 58.9(1.4) | 66.1(2.4) | 67.9(1.5) | 71.4(2.0) | 73.2(1.7)
MVAR | 53.8(2.7) | 63.9(1.3) | 70.5(2.6) | 72.0(2.3) | 75.1(1.9) | 75.4(1.3)

[Fig. 6 panels: (a) Scene13, unlabeled; (b) Scene15, unlabeled; (c) Scene13, testing; (d) Scene15, testing. Each panel plots the average MacroF1 against the percentage of labeled samples.]
Fig. 6. Comparison of six methods on scene categorization w.r.t. average MacroF1 with different percentages of labeled samples. The error bars show the standard deviation.
Besides classification accuracy, we also employ the MacroF1 value for evaluation. The MacroF1 value is computed by first calculating the F1 score of each category and then averaging these per-category scores. As with classification accuracy, a higher MacroF1 value indicates a better classifier. The other settings are the same as in the previous experiments. The classification accuracy on the unlabeled training data and on the unseen testing data is reported in Table V and Table VI, respectively. Fig. 6 compares the MacroF1 values of all methods, with error bars indicating the standard deviation. On both datasets, MVAR outperforms the other methods in terms of both classification accuracy and MacroF1 in most cases.
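Concretely, letting $P_i$ and $R_i$ denote the precision and recall of the $i$-th category (computed in the usual one-vs-rest manner) and $c$ the number of categories, the metric can be written as follows; this is the standard definition of the macro-averaged F1 score, stated here for completeness rather than reproduced from the paper's implementation:
\[
\mathrm{F1}_i = \frac{2\,P_i R_i}{P_i + R_i}, \qquad
\mathrm{MacroF1} = \frac{1}{c} \sum_{i=1}^{c} \mathrm{F1}_i .
\]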
VII. CONCLUSION AND FUTURE WORK
In this paper, we propose a scalable multi-view semi-supervised classification algorithm. In particular, the regression-based loss function of each view is formulated with the ℓ2,1 matrix norm, and the linear weighted combination of these losses forms the final objective function. We solve the model by decomposing the original problem into three sets of convex subproblems, and an iterative algorithm with proved convergence is developed. Regressing to the class labels directly makes MVAR efficient in computation and well suited to large-scale multi-view semi-supervised classification problems. With the adaptively optimized weight coefficient of each view, MVAR automatically balances the contributions of different views, which makes its performance robust against the existence of low-quality views; a schematic of this weighted objective is sketched below.
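For illustration only, the weighted combination described above can be sketched in the following generic form, where $X_v \in \mathbb{R}^{n \times d_v}$, $W_v \in \mathbb{R}^{d_v \times c}$ and $b_v \in \mathbb{R}^{c}$ denote the data matrix, projection matrix and bias vector of the $v$-th view, $Y \in \mathbb{R}^{n \times c}$ is the regression target matrix built from the (partial) labels, and $\alpha_v$ is the adaptively learned weight of view $v$. The exponent $r > 1$ (or an equivalent device) is a common way to avoid assigning all weight to a single view; the exact target construction and weight-updating rule of MVAR are those given in the main text, so this display is a simplified sketch rather than the precise objective:
\[
\min_{\{W_v,\, b_v\},\; \alpha_v \ge 0,\; \sum_{v} \alpha_v = 1}
\;\sum_{v=1}^{V} \alpha_v^{\,r}\,
\bigl\| X_v W_v + \mathbf{1}_n b_v^{\top} - Y \bigr\|_{2,1},
\qquad r > 1 .
\]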
Finally, the proposed method is evaluated on six real-world datasets and applied to scene categorization, and it performs remarkably well in most cases. There are several interesting directions to investigate in the future. First, we would like to extend the proposed method to handle datasets that provide domain knowledge beyond feature representations. Second, it is possible to formulate the objective function using the ℓ2,p-norm with p < 1. Lastly, how to automatically determine the score of each sample is also an interesting problem.
REFERENCES
[1] X. Liu, "Learning from multi-view data: clustering algorithm and text mining application," Ph.D. dissertation, KU Leuven, Belgium, Sept. 2011.
[2] C. Xu, D. Tao, and C. Xu, "A survey on multi-view learning," arXiv preprint arXiv:1304.5634, 2013.
[3] S. Sun, "A survey of multi-view machine learning," Neural Computing & Applications, vol. 23, no. 7-8, pp. 2031–2038, 2013.
[4] C. Gong, D. Tao, S. J. Maybank, W. Liu, G. Kang, and J. Yang, "Multi-modal curriculum learning for semi-supervised image classification," IEEE Transactions on Image Processing, vol. 25, no. 7, pp. 3249–3260, July 2016.
[5] X. Cai, F. Nie, W. Cai, and H. Huang, "Heterogeneous image features integration via multi-modal semi-supervised learning model," in ICCV, 2013, pp. 1737–1744.
[6] Y. Yang, F. Nie, D. Xu, J. Luo, Y. Zhuang, and Y. Pan, "A multimedia retrieval framework based on semi-supervised ranking and relevance feedback," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 34, no. 4, pp. 723–742, April 2012.
[7] X. Wang, W. Bian, and D. Tao, "Grassmannian regularized structured multi-view embedding for image classification," IEEE Transactions on Image Processing, vol. 22, no. 7, pp. 2646–2660, Jul. 2013.
[8] C. Hou, C. Zhang, Y. Wu, and F. Nie, "Multiple view semi-supervised dimensionality reduction," Pattern Recognition, vol. 43, no. 3, pp. 720–730, 2010.
[9] A. Blum and T. Mitchell, "Combining labeled and unlabeled data with co-training," in COLT, 1998, pp. 92–100.
[10] S. Bahrampour, N. Nasrabadi, A. Ray, and W. Jenkins, "Multimodal task-driven dictionary learning for image classification," IEEE Transactions on Image Processing, vol. 25, no. 1, pp. 24–38, Jan. 2016.
[11] X. Zhu, "Semi-supervised learning literature survey," Computer Sciences, University of Wisconsin-Madison, Tech. Rep. 1530, 2005.
[12] F. Nie, D. Xu, I. W. H. Tsang, and C. Zhang, "Flexible manifold embedding: A framework for semi-supervised and unsupervised dimension reduction," IEEE Transactions on Image Processing, vol. 19, no. 7, pp. 1921–1932, July 2010.
[13] X. Zhu, Z. Ghahramani, J. Lafferty et al., "Semi-supervised learning using gaussian fields and harmonic functions," in ICML, vol. 3, 2003, pp. 912–919.
[14] L. Yong, D. Tao, B. Geng, C. Xu, and S. J. Maybank, "Manifold regularized multitask learning for semi-supervised multilabel image classification," IEEE Transactions on Image Processing, vol. 22, no. 2, pp. 523–536, Feb. 2013.
[15] R. Ewerth and B. Freisleben, "Semi-supervised learning for semantic video retrieval," in Proceedings of the 6th ACM International Conference on Image and Video Retrieval. ACM, 2007, pp. 154–161.
[16] Y. Cong, J. Liu, J. Yuan, and J. Luo, "Self-supervised online metric learning with low rank constraint for scene categorization," IEEE Transactions on Image Processing, vol. 22, no. 8, pp. 3179–3191, Aug. 2013.
[17] M.-L. Zhang and Z.-H. Zhou, "CoTrade: Confident co-training with data editing," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 41, no. 6, pp. 1612–1626, Dec. 2011.
[18] K. Nigam and R. Ghani, "Analyzing the effectiveness and applicability of co-training," in CIKM, 2000, pp. 86–93.
[19] S. Yu, B. Krishnapuram, R. Rosales, and R. B. Rao, "Bayesian co-training," Journal of Machine Learning Research, vol. 12, pp. 2649–2680, 2011.
[20] V. Sindhwani, P. Niyogi, and M. Belkin, "A co-regularization approach to semi-supervised learning with multiple views," in ICML Workshop on Learning with Multiple Views, 2005, pp. 74–79.
[21] M. Wang, X.-S. Hua, R. Hong, J. Tang, G.-J. Qi, and Y. Song, "Unified video annotation via multigraph learning," IEEE Transactions on Circuits and Systems for Video Technology, vol. 19, no. 5, pp. 733–746, May 2009.
[22] X. Cai, F. Nie, and H. Huang, "Multi-view k-means clustering on big data," in IJCAI, 2013, pp. 2598–2604.
[23] Y. Yang, J. Song, Z. Huang, Z. Ma, N. Sebe, and A. G. Hauptmann, "Multi-feature fusion via hierarchical regression for multimedia analysis," IEEE Transactions on Multimedia, vol. 15, no. 3, pp. 572–581, Apr. 2013.
[24] F. Nie, H. Huang, X. Cai, and C. H. Ding, "Efficient and robust feature selection via joint ℓ2,1-norms minimization," in NIPS, 2010, pp. 1813–1821.
[25] H. Tao, C. Hou, F. Nie, Y. Jiao, and D. Yi, "Effective discriminative feature selection with nontrivial solution," IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 4, pp. 796–808, Apr. 2016.
[26] S. Xiang, F. Nie, G. Meng, C. Pan, and C. Zhang, "Discriminative least squares regression for multiclass classification and feature selection," IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 11, pp. 1738–1754, Nov. 2012.
[27] C. Hou, F. Nie, D. Yi, and Y. Wu, "Efficient image classification via multiple rank regression," IEEE Transactions on Image Processing, vol. 22, no. 1, pp. 340–352, Jan. 2013.
[28] X. Chang, F. Nie, Y. Yang, and H. Huang, "A convex formulation for semi-supervised multi-label feature selection," in AAAI, 2014, pp. 1171–1177.
[29] T. Xia, D. Tao, T. Mei, and Y. Zhang, "Multiview spectral embedding," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 40, no. 6, pp. 1438–1446, Jun. 2010.
[30] S. Yang, C. Hou, F. Nie, and Y. Wu, "Unsupervised maximum margin feature selection via L1-norm minimization," Neural Computing & Applications, vol. 21, no. 7, pp. 1791–1799, 2012.
[31] J. Gui, Z. Sun, J. Cheng, S. Ji, and X. Wu, "How to estimate the regularization parameter for spectral regression discriminant analysis and its kernel version?" IEEE Transactions on Circuits and Systems for Video Technology, vol. 24, no. 2, pp. 211–223, Feb. 2014.
[32] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in CVPR, vol. 1, 2005, pp. 886–893.
[33] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.
[34] A. E. Abdel-Hakim and A. A. Farag, "CSIFT: A SIFT descriptor with color invariant characteristics," in CVPR, 2006, pp. 1978–1983.
[35] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded up robust features," in ECCV, 2006, pp. 404–417.
[36] B. S. Manjunath and W.-Y. Ma, "Texture features for browsing and retrieval of image data," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 18, no. 8, pp. 837–842, Aug. 1996.
[37] A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope," International Journal of Computer Vision, vol. 42, no. 3, pp. 145–175, 2001.
[38] T. Ojala, M. Pietikäinen, and T. Mäenpää, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971–987, Jul. 2002.
[39] J. Wu and J. M. Rehg, "CENTRIST: A visual descriptor for scene categorization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 8, pp. 1489–1501, Aug. 2011.
[40] J. Xiao, K. A. Ehinger, J. Hays, A. Torralba, and A. Oliva, "SUN database: Exploring a large collection of scene categories," International Journal of Computer Vision, pp. 1–20, 2014.
[41] L. Fei-Fei and P. Perona, "A bayesian hierarchical model for learning natural scene categories," in CVPR, 2005, pp. 524–531.
Hong Tao is a Ph.D. candidate with the College of Science at the National University of Defense Technology, Changsha, China. She received her B.S. and M.S. degrees from the same university in 2012 and 2014, respectively. Her research interests include machine learning, system science and data mining.
Chenping Hou (M'12) received the Ph.D. degree from the National University of Defense Technology, Changsha, China, in 2009. He is currently an Associate Professor with the College of Science of the same university. He has authored several papers in journals and conferences, such as IEEE TNNLS/TNN, IEEE TSMCB/TCB, IEEE TIP, PR, IJCAI, and AAAI. His current research interests include machine learning, data mining, and computer vision.
Feiping Nie received the Ph.D. degree in Computer Science from Tsinghua University, China, in 2009. His research interests are machine learning and its application fields. He has published more than 100 papers in top journals and conferences, including TPAMI, IJCV, TIP, TNNLS/TNN, TKDE, ICML, NIPS, KDD, IJCAI, AAAI, ICCV, CVPR, SIGIR, and ACM MM. He serves as an associate editor or PC member for several prestigious journals and conferences in related fields.
Jubo Zhu received the B.S. degree from Fudan University, Shanghai, China, and the M.S. and Ph.D. degrees from the National University of Defense Technology, Changsha, China. He is a Professor with the College of Science, National University of Defense Technology. His current research interests include systems science, multivariate information processing and compressed sensing.
Dongyun Yi received the B.S. degree from Nankai University, Tianjin, China, and the M.S. and Ph.D. degrees from the National University of Defense Technology, Changsha, China. He was a visiting researcher with the University of Warwick, Coventry, U.K., in 2008. He is a Professor with the College of Science, National University of Defense Technology. His current research interests include statistics, systems science, and data mining.