Multimedia Systems (2016) 22:41–49 DOI 10.1007/s00530-014-0390-0
SPECIAL ISSUE PAPER
Semi-supervised feature selection via hierarchical regression for web image classification

Xiaonan Song · Jianguang Zhang · Yahong Han · Jianmin Jiang
Published online: 27 May 2014
© Springer-Verlag Berlin Heidelberg 2014
Abstract Feature selection is an important step for large-scale image data analysis, which has been proved to be difficult due to the large size in both dimensions and samples. Feature selection first eliminates redundant and irrelevant features and then chooses a subset of features that performs as efficiently as the complete set. Generally, supervised feature selection yields better performance than unsupervised feature selection because of the utilization of labeled information. However, labeled data samples are always expensive to obtain, which constrains the performance of supervised feature selection, especially for large web image datasets. In this paper, we propose a semi-supervised feature selection algorithm that is based on a hierarchical regression model. Our contribution can be highlighted as: (1) Our algorithm utilizes a statistical approach to exploit both labeled and unlabeled data, which preserves the manifold structure of each feature type. (2) The predicted label matrix of the training data and the feature selection matrix are learned simultaneously, making the two aspects mutually beneficial. Extensive experiments are performed on three large-scale image datasets. Experimental results demonstrate the better performance of our algorithm, compared with the state-of-the-art algorithms.

Keywords Feature selection · Multi-class classification · Semi-supervised learning

X. Song
School of Software Engineering and Technology, Tianjin University, Tianjin, China
e-mail: [email protected]

J. Zhang · Y. Han (✉)
School of Computer Science and Technology, Tianjin University, Tianjin, China
e-mail: [email protected]

J. Zhang
e-mail: [email protected]

J. Zhang
Department of Mathematics and Computer Science, Hengshui University, Hengshui, China

Y. Han
Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin University, Tianjin, China

J. Jiang
University of Surrey, Guildford, UK
e-mail: [email protected]

J. Jiang
School of Computer Science and Software Engineering, Shenzhen University, Shenzhen, China

1 Introduction
With rapid advances in digital cameras and various image processing tools, a huge number of images are available online through image sharing websites such as Flickr and Facebook. To efficiently browse and retrieve web images, numerous approaches have been proposed, such as image classification [1], video annotation [2], and image retrieval [3, 4]. Most of these existing visual computing tasks use image features that are often noisy and redundant. Hence, it is of great value to propose effective feature selection methods for a more precise and compact representation of images. The huge image databases also pose significant scalability challenges to existing applications. In practice, images are usually represented by several different features. Unfortunately, the combination of these features results in high-dimensional vectors (e.g., hundreds to thousands of dimensions), leading to computational and memory-related complications. Reducing the number of redundant features
saves the running time of a learning algorithm and yields a more general concept. Feature selection is one effective means to this end: it aims to choose a subset of features that improves prediction accuracy, or that decreases the size of the structure without significantly decreasing the prediction accuracy of the classifier built using only the selected features [5]. Supervised feature selection evaluates feature relevance by measuring the correlation between the features and the class labels. Because of the utilization of labeled information, supervised feature selection usually yields better and more reliable performance than unsupervised feature selection [6]. However, supervised feature selection requires sufficient labeled data samples, which are expensive to obtain because of the excessive cost in human labor. That is why, in many real applications, we have abundant unlabeled data but only a small number of labeled samples. The desire to develop feature selection methods that are capable of exploiting both labeled and unlabeled data motivates us to introduce semi-supervised learning [7, 8] into the feature selection process.

Recent research on semi-supervised learning [9–11] focuses on the utilization of unlabeled data. Generally, exploiting unlabeled data is based on the manifold assumption that most data examples lie on a low-dimensional manifold in the input space. Well-known algorithms in this category include label propagation, harmonic functions, graph cuts, spectral graph transducers, and manifold regularization; we refer the reader to [8] for a comprehensive study of semi-supervised learning techniques. Different from most of the existing semi-supervised algorithms [12, 13], Yang et al. [7] propose a new semi-supervised learning method that preserves the manifold structure of each feature type and utilizes a statistical approach to better exploit the manifold structure of the training data. Experimental results demonstrate that their method is more robust than simply using the pairwise distances of the training data. In this paper, we retain the advantages of the semi-supervised learning model proposed in [7] and let semi-supervised feature selection benefit from it.

Traditional feature selection algorithms, e.g., Fisher Score [14], generally evaluate the importance of each feature individually and discard features that are less informative by themselves but are informative when combined with other features. In contrast, a recently popular approach, sparsity-based feature selection, selects features jointly across all data points [6, 15–19]. The $\ell_1$-SVM is well known to perform feature selection using the $\ell_1$-norm regularization that gives sparse solutions [20]. A combination of the $\ell_1$-norm and the $\ell_2$-norm is proposed to form a structured regularization in [21]. The methods in [6, 15, 22] all use the $\ell_{2,1}$-norm as a regularizer to evaluate features jointly. The regularization function in their works has a twofold role of avoiding over-fitting and inducing sparsity, making it particularly suitable for feature selection.
In this paper, we propose a new semi-supervised feature selection algorithm, namely feature selection via hierarchical regression (FSHR). Different from most of the existing semi-supervised feature selection algorithms [22, 23], we preserve the manifold structure of each feature type during the training phase. Local classifiers and global classifiers are learnt simultaneously. An $\ell_{2,1}$-norm penalty is added to the regularization of the global classifiers, making many rows of the optimal global classifiers shrink to zero. Once the optimal global classifiers are obtained, we sort the features accordingly in descending order and select the top-ranked ones. The selected discriminative features are shown to be effective for classifying web images when combined with a multi-class SVM classifier in the experiments. In comparison with the existing popular feature selection methods [14, 15, 22, 23], our proposed method takes advantage of both the semi-supervised learning of [7] and $\ell_{2,1}$-norm-based sparsity, which can be highlighted as follows: (1) FSHR embeds the feature selection process in semi-supervised learning that preserves the manifold structure of each feature type. A statistical approach is used to better exploit the manifold structure of the training data, resulting in a more faithful and robust learning result. (2) We propose to simultaneously learn the predicted label matrix of the training data and the feature selection matrices, making the two aspects mutually beneficial.
2 The objective function

In this section, we give the objective function of the proposed algorithm. We begin with the preliminaries. Let $t$ be the number of labeled training images and $n$ the total number of training images, $t \le n$. Given an integer $g \le v$, where $v$ is the number of feature types, we denote $x_g^i$ as the $g$-th feature of the $i$-th image and $X_g = [x_g^1, \ldots, x_g^n] \in \mathbb{R}^{d_g \times n}$, where $d_g$ is the dimension of the $g$-th feature. Denote $Y = [Y_1, Y_2, \ldots, Y_n]^T \in \{0,1\}^{n \times c}$ as the label information; $Y_{ij}$ is 1 if $x_i$ belongs to the $j$-th class, and 0 otherwise. Let $F = [F_1, F_2, \ldots, F_n]^T \in \mathbb{R}^{n \times c}$, where $F_i \in \mathbb{R}^{c}$ is the predicted label vector of $x_i$. We define $f_g = [f_{g1}, f_{g2}, \ldots, f_{gn}]^T \in \mathbb{R}^{n \times c}$ as the predicted label matrix of the training images derived from the $g$-th feature. In the rest of this paper, $\|\cdot\|_F$ denotes the Frobenius norm, $\mathbf{1}_m \in \mathbb{R}^{m}$ is a column vector with all its elements equal to 1, and $H_m = I - \frac{1}{m}\mathbf{1}_m \mathbf{1}_m^T$ is the centering matrix, where $I$ is the identity matrix. For an arbitrary matrix $A \in \mathbb{R}^{r \times p}$, its $\ell_{2,1}$-norm is defined as $\|A\|_{2,1} = \sum_{i=1}^{r} \sqrt{\sum_{j=1}^{p} A_{ij}^2}$. $\mathrm{Tr}(\cdot)$ is the trace operator.
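For readers who prefer code to notation, a minimal NumPy sketch of the two quantities used throughout this section, the $\ell_{2,1}$-norm and the centering matrix $H_m$, could look as follows; the helper names are ours, not the paper's.

```python
import numpy as np

def l21_norm(A):
    """||A||_{2,1}: sum over rows of the Euclidean norm of each row."""
    return np.sum(np.linalg.norm(A, axis=1))

def centering_matrix(m):
    """H_m = I - (1/m) 1_m 1_m^T; multiplying by H_m removes the column mean."""
    return np.eye(m) - np.ones((m, m)) / m

A = np.arange(6.0).reshape(3, 2)
print(l21_norm(A))                 # sqrt(0+1) + sqrt(4+9) + sqrt(16+25)
print(centering_matrix(3) @ A)     # each column now sums to zero
```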
As indicated by previous semi-supervised learning approaches [13, 24, 25], exploiting the "manifold structure" of the data is a promising family of techniques that are generally based upon the assumption that similar unlabeled examples should be given the same classification. To utilize unlabeled data effectively, we construct a local set, denoted as $N_g^i$, consisting of $x_g^i$ and its $k$ nearest neighbors according to the distance derived from the $g$-th feature. We use the statistical approach proposed in [7] to exploit the manifold structure of the input images for semi-supervised learning. Similar to [7], we assume there is a local classifier that classifies all the training images in $N_g^i$. We use least squares regression for classification as

$\min_{f_g, w_g^i, b_g^i} \sum_{g=1}^{v} \sum_{i=1}^{n} \Big( \sum_{x_g^j \in N_g^i} \big\| (w_g^i)^T x_g^j + b_g^i - f_{gj} \big\|_F^2 + \lambda \|w_g^i\|_F^2 \Big),$   (1)
where $w_g^i \in \mathbb{R}^{d_g \times c}$ and $b_g^i \in \mathbb{R}^{c}$ are the local classifier and bias term of $x_g^i$ w.r.t. the $g$-th feature. The predicted label matrix $F$ of the training images should be consistent with each evidence $f_g$ and with the ground truth if it exists. Therefore, we minimize $\min_{F, f_g} \sum_{g=1}^{v} \|F - f_g\|_F^2$ and impose a hard constraint on $F$ that $F_i = Y_i$ if $x_i$ is a labeled training image. In this way, each training image learns a soft label on each kind of feature, which is gained by considering both the labeled images and the manifold structure of the input images. We formulate the semi-supervised learning parts as

$\min_{F, f_g, w_g^i, b_g^i} \sum_{g=1}^{v} \sum_{i=1}^{n} \Big( \sum_{x_g^j \in N_g^i} \big\| (w_g^i)^T x_g^j + b_g^i - f_{gj} \big\|_F^2 + \lambda \|w_g^i\|_F^2 \Big) + \mu_2 \sum_{g=1}^{v} \|F - f_g\|_F^2,$
s.t. $F_i = Y_i$, if $x_i$ is a labeled training image.   (2)
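The local sets $N_g^i$ used in (1) and (2) are simply $k$-nearest-neighbor queries per feature type. The following NumPy sketch (our own helper; it assumes each $X_g$ stores one image per column and uses Euclidean distances) returns, for every image, the indices forming $N_g^i$.

```python
import numpy as np

def local_sets(Xg, k):
    """Row i of the result lists the column index i followed by its k nearest neighbors."""
    sq = np.sum(Xg ** 2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (Xg.T @ Xg)   # pairwise squared distances
    np.fill_diagonal(d2, -1.0)                           # each image is its own first neighbor
    return np.argsort(d2, axis=1)[:, :k + 1]

Xg = np.random.randn(64, 200)        # one feature type: 64-dim features for 200 images
print(local_sets(Xg, k=5).shape)     # (200, 6): x_g^i plus its 5 nearest neighbors
```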
To select features, we utilize the $\ell_{2,1}$-norm to control the classifiers' capacity and also to ensure that they are sparse in rows, making them particularly suitable for feature selection. We propose to simultaneously learn the predicted label matrices $f_g$ from the $g$-th feature and the $v$ feature selection matrices $W_g$. Specifically, the objective function is shown as follows:

$\min_{F, f_g, w_g^i, b_g^i, W_g, B_g} \sum_{g=1}^{v} \sum_{i=1}^{n} \Big( \sum_{x_g^j \in N_g^i} \big\| (w_g^i)^T x_g^j + b_g^i - f_{gj} \big\|_F^2 + \lambda \|w_g^i\|_F^2 \Big) + \mu_1 \sum_{g=1}^{v} \Big( \big\| X_g^T W_g + \mathbf{1}_n B_g - f_g \big\|_F^2 + \gamma \|W_g\|_{2,1} \Big) + \mu_2 \sum_{g=1}^{v} \|F - f_g\|_F^2,$
s.t. $F_i = Y_i$, if $x_i$ is a labeled training image,   (3)

where $W_g \in \mathbb{R}^{d_g \times c}$ and $B_g \in \mathbb{R}^{c}$ are the global classifier and bias term w.r.t. the $g$-th feature. Denote $U$ as a diagonal matrix; if $x_i$ is a labeled datum, $U_{ii} = 1$, and $U_{ii} = 0$ otherwise. Let $N_g^i = \{x_g^i, x_g^{i_1}, \ldots, x_g^{i_k}\}$, where $x_g^{i_1}, \ldots, x_g^{i_k}$ are the $k$ nearest neighbors of $x_g^i$ according to the $g$-th feature, and let $X_g^i = [x_g^i, x_g^{i_1}, \ldots, x_g^{i_k}] \in \mathbb{R}^{d_g \times (k+1)}$. The objective function can be rewritten as

$\min_{F, f_g, w_g^i, b_g^i, W_g, B_g} \sum_{g=1}^{v} \sum_{i=1}^{n} \Big( \big\| (X_g^i)^T w_g^i + \mathbf{1}_{k+1} b_g^i - f_g^i \big\|_F^2 + \lambda \|w_g^i\|_F^2 \Big) + \mu_1 \sum_{g=1}^{v} \Big( \big\| X_g^T W_g + \mathbf{1}_n B_g - f_g \big\|_F^2 + \gamma \|W_g\|_{2,1} \Big) + \mu_2 \sum_{g=1}^{v} \|F - f_g\|_F^2 + \mathrm{Tr}\big( (F - Y)^T U (F - Y) \big),$   (4)

where $f_g^i = [f_{gi}, f_{g i_1}, \ldots, f_{g i_k}]^T \in \mathbb{R}^{(k+1) \times c}$ is the predicted local label matrix of $X_g^i$.

3 Solution to FSHR

The objective problem of (4) can be solved as follows. By setting the derivative of (4) w.r.t. $w_g^i$ and $b_g^i$ to zero, we have

$b_g^i = \frac{1}{k+1} \big( (f_g^i)^T \mathbf{1}_{k+1} - (w_g^i)^T X_g^i \mathbf{1}_{k+1} \big),$   (5)

$w_g^i = \big( X_g^i H_{k+1} (X_g^i)^T + \lambda I \big)^{-1} X_g^i H_{k+1} f_g^i.$   (6)

Similarly, by setting the derivative of (4) w.r.t. $W_g$ and $B_g$ to zero, we have

$B_g = \frac{1}{n} \big( f_g^T \mathbf{1}_n - W_g^T X_g \mathbf{1}_n \big),$   (7)

$W_g = \big( X_g H_n X_g^T + \gamma D_g \big)^{-1} X_g H_n f_g,$   (8)

where $D_g$ is a diagonal matrix whose $i$-th diagonal element is defined as

$D_g^{ii} = \frac{1}{2 \| w_g^i \|_2},$   (9)

with $w_g^i$ here denoting the $i$-th row of $W_g$.

Substituting $b_g^i$, $w_g^i$, $B_g$, and $W_g$ in (4) by (5), (6), (7), and (8), respectively, we arrive at

$\min_{F, f_g} \sum_{g=1}^{v} \sum_{i=1}^{n} \mathrm{Tr}\big( (f_g^i)^T L_g^i f_g^i \big) + \mu_1 \sum_{g=1}^{v} \mathrm{Tr}\big( f_g^T A_g f_g \big) + \mu_2 \sum_{g=1}^{v} \|F - f_g\|_F^2 + \mathrm{Tr}\big( (F - Y)^T U (F - Y) \big),$   (10)

where

$L_g^i = H_{k+1} - H_{k+1} (X_g^i)^T \big( X_g^i H_{k+1} (X_g^i)^T + \lambda I \big)^{-1} X_g^i H_{k+1}$   (11)
and

$A_g = H_n - H_n X_g^T \big( X_g H_n X_g^T + \gamma D_g \big)^{-1} X_g H_n.$   (12)
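To make these closed forms concrete, the following NumPy fragment evaluates (5)-(9), (11), and (12) once for a single feature type on random toy data; all variable names and sizes are ours, and initializing $D_g$ to the identity is our assumption for a first pass.

```python
import numpy as np

def centering(m):
    return np.eye(m) - np.ones((m, m)) / m

d, n, c, k, lam, gamma = 8, 30, 3, 5, 1.0, 1.0
rng = np.random.default_rng(0)
Xg = rng.standard_normal((d, n))          # one feature type, one column per image
fg = rng.standard_normal((n, c))          # current soft labels for this feature type
Hn, Hk = centering(n), centering(k + 1)

# global classifier and bias, Eqs. (7)-(9), with D_g initialized to the identity
Dg = np.eye(d)
Wg = np.linalg.solve(Xg @ Hn @ Xg.T + gamma * Dg, Xg @ Hn @ fg)       # Eq. (8)
Bg = (fg.T @ np.ones(n) - Wg.T @ Xg @ np.ones(n)) / n                 # Eq. (7)
Dg = np.diag(1.0 / (2.0 * np.linalg.norm(Wg, axis=1) + 1e-12))        # Eq. (9)

# local classifier for one local set X_g^i (here: the first k+1 columns), Eqs. (5), (6), (11)
Xi = Xg[:, :k + 1]
fi = fg[:k + 1]
wi = np.linalg.solve(Xi @ Hk @ Xi.T + lam * np.eye(d), Xi @ Hk @ fi)  # Eq. (6)
bi = (fi.T @ np.ones(k + 1) - wi.T @ Xi @ np.ones(k + 1)) / (k + 1)   # Eq. (5)
Li = Hk - Hk @ Xi.T @ np.linalg.solve(
    Xi @ Hk @ Xi.T + lam * np.eye(d), Xi @ Hk)                        # Eq. (11)

# induced matrix A_g from Eq. (12)
Ag = Hn - Hn @ Xg.T @ np.linalg.solve(Xg @ Hn @ Xg.T + gamma * Dg, Xg @ Hn)
```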
For ease of representation, we define the selection matrix $S_g^p \in \mathbb{R}^{n \times (k+1)}$ as follows:

$(S_g^p)_{ij} = 1$ if $x_i$ is the $j$-th element in $N_g^p$, and $(S_g^p)_{ij} = 0$ otherwise.   (13)
Therefore, we have

$f_g^i = (S_g^i)^T f_g.$   (14)

Then, we have

$\sum_{i=1}^{n} \mathrm{Tr}\big( (f_g^i)^T L_g^i f_g^i \big) = \sum_{i=1}^{n} \mathrm{Tr}\big( f_g^T S_g^i L_g^i (S_g^i)^T f_g \big) = \mathrm{Tr}\Big( f_g^T \big( \sum_{i=1}^{n} S_g^i L_g^i (S_g^i)^T \big) f_g \Big).$   (15)

Denote $L_g = \sum_{i=1}^{n} S_g^i L_g^i (S_g^i)^T$. Then, (10) can be written as

$\min_{F, f_g} \sum_{g=1}^{v} \mathrm{Tr}\big( f_g^T L_g f_g \big) + \mu_1 \sum_{g=1}^{v} \mathrm{Tr}\big( f_g^T A_g f_g \big) + \mu_2 \sum_{g=1}^{v} \|F - f_g\|_F^2 + \mathrm{Tr}\big( (F - Y)^T U (F - Y) \big),$   (16)

which is equivalent to the following:

$\min_{F, f_g} \sum_{g=1}^{v} \mathrm{Tr}\big( f_g^T (L_g + \mu_1 A_g) f_g \big) + \mu_2 \sum_{g=1}^{v} \|F - f_g\|_F^2 + \mathrm{Tr}\big( (F - Y)^T U (F - Y) \big).$   (17)

Let us define the block-diagonal matrix

$Q_g = \mathrm{diag}\big( L_g^1, \ldots, L_g^n \big)$   (18)

and

$S_g = [S_g^1, \ldots, S_g^n].$   (19)

Then, we have

$L_g = S_g Q_g S_g^T.$   (20)

By setting the derivative of (17) w.r.t. $f_g$ to be 0, we have

$f_g = \mu_2 \big( L_g + \mu_1 A_g + \mu_2 I \big)^{-1} F.$   (21)

By setting the derivative of (17) w.r.t. $F$ to be 0, we have

$\sum_{g=1}^{v} \mu_2 (F - f_g) + U (F - Y) = 0.$   (22)

Substituting $f_g$ in (22) by (21), we have

$F = \Big( v \mu_2 I + U - \mu_2^2 \sum_{g=1}^{v} \big( L_g + \mu_1 A_g + \mu_2 I \big)^{-1} \Big)^{-1} U Y.$   (23)
Based on the above mathematical deduction, the procedure can be described in Algorithm 1. For each feature type, once $W_g$ is obtained, we sort the features according to $\|w_g^i\|_2$ (the $\ell_2$-norms of the rows of $W_g$) in descending order and select the top-ranked ones. The complexity of the proposed algorithm is discussed as follows. In each iteration, the most time-consuming steps are Steps 8, 10, and 11. In Step 8, the time complexity of computing $L_g$ defined in (20) is about $O(n^3)$. In Steps 10 and 11, the complexity of calculating the inverses of a few matrices is $O(n^3)$.
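Algorithm 1 is given as a pseudocode box in the original paper and is not reproduced here; the sketch below is our own NumPy rendering of the alternating updates derived above, i.e., Eqs. (11)-(12) and (20) to build $L_g$ and $A_g$, Eq. (23) for $F$, Eq. (21) for $f_g$, Eq. (8) for $W_g$, and Eq. (9) for $D_g$. The initialization of $D_g$ to the identity, the fixed iteration count, and the small constant guarding against division by zero are our assumptions rather than the paper's exact steps.

```python
import numpy as np

def centering(m):
    return np.eye(m) - np.ones((m, m)) / m

def knn_index(Xg, k):
    """Row i lists column index i followed by its k nearest neighbors (Euclidean)."""
    sq = np.sum(Xg ** 2, axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (Xg.T @ Xg)
    np.fill_diagonal(d2, -1.0)
    return np.argsort(d2, axis=1)[:, :k + 1]

def local_laplacian(Xg, k, lam):
    """L_g = sum_i S_g^i L_g^i (S_g^i)^T, cf. Eqs. (11), (13)-(15), (20)."""
    d, n = Xg.shape
    Hk = centering(k + 1)
    idx = knn_index(Xg, k)
    Lg = np.zeros((n, n))
    for i in range(n):
        Xi = Xg[:, idx[i]]                                     # local data matrix X_g^i
        Li = Hk - Hk @ Xi.T @ np.linalg.solve(
            Xi @ Hk @ Xi.T + lam * np.eye(d), Xi @ Hk)         # Eq. (11)
        Lg[np.ix_(idx[i], idx[i])] += Li                       # accumulate via S_g^i
    return Lg

def fshr(Xs, Y, labeled, k=5, lam=1.0, mu1=1.0, mu2=1.0, gamma=1.0, n_iter=20):
    """Xs: list of d_g x n feature matrices; Y: n x c 0/1 labels; labeled: boolean mask."""
    n, c = Y.shape
    v = len(Xs)
    U = np.diag(labeled.astype(float))
    Hn = centering(n)
    Ls = [local_laplacian(Xg, k, lam) for Xg in Xs]
    Ds = [np.eye(Xg.shape[0]) for Xg in Xs]
    Ws = [None] * v
    for _ in range(n_iter):
        Ms = []
        for g, Xg in enumerate(Xs):
            Ag = Hn - Hn @ Xg.T @ np.linalg.solve(
                Xg @ Hn @ Xg.T + gamma * Ds[g], Xg @ Hn)       # Eq. (12)
            Ms.append(np.linalg.inv(Ls[g] + mu1 * Ag + mu2 * np.eye(n)))
        F = np.linalg.solve(v * mu2 * np.eye(n) + U
                            - mu2 ** 2 * sum(Ms), U @ Y)       # Eq. (23)
        for g, Xg in enumerate(Xs):
            fg = mu2 * Ms[g] @ F                               # Eq. (21)
            Ws[g] = np.linalg.solve(Xg @ Hn @ Xg.T + gamma * Ds[g],
                                    Xg @ Hn @ fg)              # Eq. (8)
            Ds[g] = np.diag(1.0 / (2.0 * np.linalg.norm(Ws[g], axis=1) + 1e-12))  # Eq. (9)
    scores = [np.linalg.norm(Wg, axis=1) for Wg in Ws]         # rank features of each type
    return Ws, scores, F
```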
4 Experiments
4.1 Datasets descriptions

We choose three image datasets, i.e., MFlickr [26], MSRA-MM [27], and NUS-WIDE [28], in our experiments. As our method does not address multi-label classification, we remove multi-labeled samples so that each sample belongs to exactly one class. However, the labels of the samples in the three datasets are not balanced. For instance, in the NUS-WIDE dataset, after removing multi-labeled samples there is no positive image for one class while there are 31,617 positive images for another class. Therefore, we preprocess each dataset so that the numbers of positive images of different classes are balanced. The following is a brief description of the three datasets and their preprocessing.

MFlickr. This dataset contains 25,000 samples in 33 classes from the Flickr website. Classes with fewer than 2,000 or more than 5,000 samples are excluded. Images that are unlabeled or multi-labeled are removed. After preprocessing, we obtain a dataset with 9,439 images in eight classes. We extract three types of features for each image, namely 499-dimensional Bag of Words, 128-dimensional Color Coherence, and 256-dimensional Color Histogram.

MSRA-MM. We use the concept set of the original MSRA-MM 2.0 dataset, which includes 50,000 images related to 100 concepts. After the same preprocessing as for MFlickr, 14,445 images in six classes remain. Three feature types used in [22, 29], namely Color Correlogram, Edge Direction Histogram, and Wavelet Texture, are also used in our experiments to represent the images.

NUS-WIDE. The dataset includes 269,648 images and 81 concepts that can be used for evaluation. Concepts with more than 10,000 or fewer than 5,000 images are excluded, and images associated with no concept or more than two concepts are removed. We use the remaining 45,227 images associated with 10 concepts in our experiments. The images are also represented by Color Correlogram, Edge Direction Histogram, and Wavelet Texture.

4.2 Experiment setup

We compare our method with the following feature selection algorithms.
• Fisher Score [14] depends on fully labeled training data to select features and calculates a Fisher/correlation score for each feature.
• Feature selection via joint $\ell_{2,1}$-norms minimization (FSNM) [15] employs joint $\ell_{2,1}$-norm minimization on both the loss function and the regularization to realize feature selection across all data points.
• Feature selection via spectral analysis (FSSA) [23] is a semi-supervised feature selection method using spectral regression.
• Structural feature selection with sparsity (SFSS) [22] jointly selects the most relevant features from all the data points using a sparsity-based model and exploiting both labeled and unlabeled data to learn the manifold structure.
To generate the training data, we randomly sample 1,000 data points for each dataset. This sampling process is repeated five times, and average results are reported. We set the percentage of labeled data to 1, 5, 10, 25, 50, and 100 %, respectively, and make sure the labeled data cover all the semantic concepts of each dataset. The remaining images of each dataset serve as the testing set to evaluate the multi-class classification performance. There are five parameters in our algorithm. However, as the performance is not sensitive to the local regularization parameter $\lambda$ [30], we did not tune this parameter and fixed it as one. The parameter $k$ specifies the number of nearest neighbors used to compute the graph Laplacian matrix, and it is used in the same way as in FSSA and SFSS. We set $k$ to five in our algorithm for all three datasets, as well as in FSSA and SFSS. Apart from these parameters, we tune all the other parameters (if any) by a grid-search strategy over $\{10^{-6}, 10^{-3}, 1, 10^{3}, 10^{6}\}$ to fairly compare the different feature selection algorithms. For all the algorithms, we report the best results obtained from different parameters. Overall, in terms of average accuracy, our method works best and is followed by SFSS, FSNM, FSSA, and Fisher Score in sequence. SFSS is a semi-supervised feature selection algorithm published at the ACM Multimedia conference in 2011. Our method and SFSS both utilize the unlabeled data, and when less than 25 % of the data are labeled, our method is competitive with SFSS. In our experiment, each feature selection algorithm is first applied to select features. Then, a support vector machine (SVM) classifier with the linear kernel is employed on these datasets, using fivefold cross-validation. Accuracy is widely used as an evaluation metric, so we use accuracy in this paper.
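The evaluation protocol described above can be reproduced with a short scikit-learn pipeline; the sketch below is our own reading of it (per-feature importance scores from FSHR or any baseline, top-$m$ selection, linear-kernel SVM, fivefold cross-validation), and the function and variable names are illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def evaluate_selection(X, y, scores, num_features):
    """X: n x d data, y: class labels, scores: per-feature importance (e.g. row norms of W_g)."""
    top = np.argsort(scores)[::-1][:num_features]   # keep the top-ranked features
    clf = LinearSVC()                               # linear-kernel SVM classifier
    return cross_val_score(clf, X[:, top], y, cv=5).mean()

# average accuracy over {10, 20, ..., 100} selected features, as in the experiments
# accs = [evaluate_selection(X, y, scores, m) for m in range(10, 101, 10)]
# print(np.mean(accs))
```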
4.3 Experimental results and discussions

For each dataset and each percentage of labeled data, we average the accuracy over different numbers of selected features ({10, 20, ..., 100}). Figure 1 shows the classification accuracy compared with the other feature selection algorithms when different percentages of data are labeled. The detailed differences of the averaged accuracy are listed in Tables 1, 2, and 3 for the cases where 1, 5, and 10 % of the training data are labeled.
Fig. 1 The classification accuracy compared with other feature selection algorithms when different percentages of data are labeled
Table 1 Performance comparison (accuracy ± standard deviation) when 1 % of training images are labeled

            MFlickr            MSRA-MM            NUS-WIDE
Ours        0.1864 ± 0.0669    0.2426 ± 0.0288    0.1811 ± 0.0349
F score     0.1444 ± 0.0371    0.2009 ± 0.0380    0.1603 ± 0.0120
FSNM        0.1794 ± 0.0676    0.2332 ± 0.0430    0.1735 ± 0.0306
FSSA        0.1566 ± 0.0576    0.2149 ± 0.0373    0.1641 ± 0.0098
SFSS        0.1844 ± 0.0644    0.2341 ± 0.0329    0.1738 ± 0.0289

Table 2 Performance comparison (accuracy ± standard deviation) when 5 % of training images are labeled

            MFlickr            MSRA-MM            NUS-WIDE
Ours        0.2577 ± 0.0283    0.2949 ± 0.0185    0.2890 ± 0.0213
F score     0.2152 ± 0.0256    0.2651 ± 0.0238    0.2590 ± 0.0251
FSNM        0.2533 ± 0.0332    0.2804 ± 0.0190    0.2811 ± 0.0259
FSSA        0.2432 ± 0.0315    0.2649 ± 0.0186    0.2596 ± 0.0196
SFSS        0.2577 ± 0.0286    0.2837 ± 0.0176    0.2845 ± 0.0262

Table 3 Performance comparison (accuracy ± standard deviation) when 10 % of training images are labeled

            MFlickr            MSRA-MM            NUS-WIDE
Ours        0.2621 ± 0.0031    0.3263 ± 0.0181    0.3213 ± 0.0114
F score     0.2364 ± 0.0154    0.2934 ± 0.0204    0.2966 ± 0.0183
FSNM        0.2620 ± 0.0259    0.3258 ± 0.0214    0.3120 ± 0.0109
FSSA        0.2610 ± 0.0027    0.2950 ± 0.0212    0.2888 ± 0.0128
SFSS        0.2620 ± 0.0027    0.3167 ± 0.0242    0.3163 ± 0.0150
We have the following observations and analysis from the experimental results: (1) Generally, we can see that in terms of average accuracy, our method works best and is followed by SFSS, FSNM, FSSA, and Fisher Score in sequence. Our method and SFSS both utilize the unlabeled data. This observation suggests that the unlabeled information is important for feature selection. (2) When 10 % or less of the training data are labeled, our method clearly outperforms Fisher Score and FSSA. Our method is competitive with or better than SFSS and FSNM, but the performance improvement is not significant. On the MFlickr and NUS-WIDE datasets, our method achieves the same performance as the better of SFSS and FSNM. On the MSRA-MM dataset, our method outperforms SFSS and FSNM slightly. (3) Finally, when more than 10 % of the data are labeled, our method consistently outperforms the other methods on the NUS-WIDE and MSRA-MM datasets. However, the relative difference between our method and the other feature selection methods becomes smaller. On the MFlickr dataset, our method achieves similar performance to FSSA, FSNM, and SFSS, but outperforms Fisher Score.

Figure 2 shows the classification accuracy comparisons of all five feature selection methods with different numbers of selected features, when 1, 5, and 10 % of the training data are labeled.
Fig. 2 The classification accuracy comparisons of all five feature selection methods with different numbers of selected features, when 1, 5, and 10 % of the training data are labeled
From Fig. 2, we can see that when only 1 % of the images are labeled, the performance of FSHR is constrained by the very few labeled images, so the accuracy remains almost constant across different numbers of features on the three datasets. However, the first few features can indeed improve the classification accuracy, and the proposed method works better than the other approaches. On the MFlickr dataset, when more than 5 % of the images are labeled, FSHR is consistently comparable to FSNM, FSSA, and SFSS. On the MSRA-MM and NUS-WIDE datasets, when more than 5 % of the images are labeled, the general trend shows that the classification accuracy increases quickly at the beginning and then stays steady as the number of features grows. Moreover, our method works consistently better than the other feature selection algorithms on the top 100 features.

To evaluate the accuracy of our algorithm w.r.t. the three parameters $\gamma$, $\mu_1$, and $\mu_2$, we perform an experiment on parameter sensitivity where 10 % of the training data are labeled. We average the accuracy over the different datasets and different numbers of selected features ({10, 20, ..., 100}). On average over the three datasets, our method achieves the best performance when $\gamma = 1$, $\mu_1 = 10^{6}$, and $\mu_2 = 10^{6}$. From Fig. 3, we can see that our algorithm is less sensitive to $\gamma$, $\mu_1$, and $\mu_2$ when $\gamma$ is not less than 1.
Fig. 3 Averaged accuracy of the three datasets for different numbers of selected features with 10 % of the training data labeled. a Performance variation w.r.t. $\mu_1$ and $\mu_2$ when $\gamma$ is fixed as 1. b Performance variation w.r.t. $\gamma$ and $\mu_2$ when $\mu_1$ is fixed as $10^{6}$. c Performance variation w.r.t. $\gamma$ and $\mu_1$ when $\mu_2$ is fixed as $10^{6}$
5 Conclusion

In this paper, we have proposed a new semi-supervised feature selection algorithm, FSHR. In comparison with most feature selection algorithms reported in the literature, our essential contribution is to simultaneously learn the statistically predicted label matrix and the $\ell_{2,1}$-norm-based feature selection matrix, making the two aspects mutually beneficial. By embedding the feature selection process in a recently published state-of-the-art semi-supervised learning algorithm, the manifold structure of each feature type is preserved, resulting in a more faithful learning result. Experiments show that the proposed algorithm achieves better classification results on three large image datasets, in comparison with popular supervised and semi-supervised feature selection methods. Although the semi-supervised feature selection algorithm has shown great potential for effective web image classification, exploiting the unlabeled data might have negative effects if the manifold assumption does not hold. Also, the performance is related to the number of selected features. Selecting too many features may bring in redundant or noisy features, while selecting too few may not provide enough information for classification. Therefore, future work includes designing: (1) algorithms that are able to predict whether the unlabeled data will contribute and (2) methods that decide the optimal number of features that should be selected.
Acknowledgments This paper was partially supported by the National Program on the Key Basic Research Project (under Grant 2013CB329301), NSFC (under Grant 61202166), and the Doctoral Fund of the Ministry of Education of China (under Grant 20120032120042).
References

1. Zha, Z.-J., Hua, X.-S., Mei, T., Wang, J., Qi, G.-J., Wang, Z.: Joint multi-label multi-instance learning for image classification. In: IEEE Conference on Computer Vision and Pattern Recognition. IEEE (2008)
2. Zha, Z.-J., Wang, M., Zheng, Y.-T., Yang, Y., Hong, R., Chua, T.-S.: Interactive video indexing with statistical active learning. IEEE Trans. Multimed. 14, 17–27 (2012)
3. Zha, Z.-J., Yang, L., Mei, T., Wang, M., Wang, Z.: Visual query suggestion. In: Proceedings of the 17th ACM International Conference on Multimedia, pp. 15–24. ACM (2009)
4. Zha, Z.-J., Yang, L., Wang, Z., Chua, T.-S., Hua, X.-S.: Visual query suggestion: towards capturing user intent in internet image search. ACM Trans. Multimed. Comput. Commun. Appl. (TOMCCAP) 6(13), 1–19 (2010)
5. Koller, D., Sahami, M.: Toward optimal feature selection. Technical Report 1996-77, Stanford InfoLab (1996)
6. Han, Y., Yang, Y., Zhou, X.: Co-regularized ensemble for feature selection. In: Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence (2013)
7. Yang, Y., Song, J., Huang, Z., Ma, Z., Sebe, N.: Multi-feature fusion via hierarchical regression for multimedia analysis. IEEE Trans. Multimed. 15, 572–581 (2012)
8. Zhu, X.: Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison (2005)
9. Zhang, T., Xu, C., Lu, H.: A generic framework for video annotation via semi-supervised learning. IEEE Trans. Multimed. 14, 1206–1219 (2012)
10. Zha, Z.-J., Mei, T., Wang, J., Wang, Z., Hua, X.-S.: Graph-based semi-supervised learning with multiple labels. J. Vis. Commun. Image Represent. 20(2), 97–103 (2009). Special issue on Emerging Techniques for Multimedia Content Sharing, Search and Understanding
11. Zhu, J., Hoi, S.C.H., Lyu, M.R., Yan, S.: Near-duplicate keyframe retrieval by semi-supervised learning and nonrigid image matching. ACM Trans. Multimed. Comput. Commun. Appl. (TOMCCAP) 7(1), 4:1–4:24 (2011)
12. Belkin, M., Niyogi, P., Sindhwani, V.: Manifold regularization: a geometric framework for learning from labeled and unlabeled examples. J. Mach. Learn. Res. 7, 2399–2434 (2006)
13. Nie, F., Xu, D.: Flexible manifold embedding: a framework for semi-supervised and unsupervised dimension reduction. IEEE Trans. Image Process. 19(7), 1921–1932 (2010)
14. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd edn. Wiley-Interscience, New York (2001)
15. Nie, F., Huang, H., Cai, X., Ding, C.H.: Efficient and robust feature selection via joint ℓ2,1-norms minimization. In: Advances in Neural Information Processing Systems, pp. 1813–1821 (2010)
16. Yang, Y., Shen, H.T., Ma, Z., Huang, Z., Zhou, X.: ℓ2,1-norm regularized discriminative feature selection for unsupervised learning. In: Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, pp. 1589–1594. AAAI Press (2011)
17. Zhao, Z., Wang, L., Liu, H.: Efficient spectral feature selection with minimum redundancy. In: Proceedings of the AAAI Conference on Artificial Intelligence (2010)
18. Bao, B.-K., Liu, G., Yan, S.: Inductive robust principal component analysis. IEEE Trans. Image Process. 21(8), 3794–3800 (2012)
19. Bao, B.-K., Zhu, G., Shen, J., Yan, S.: Robust image analysis with sparse representation on quantized visual features. IEEE Trans. Image Process. 22(3), 860–871 (2013)
20. Bradley, P.S., Mangasarian, O.L.: Feature selection via concave minimization and support vector machines. In: International Conference on Machine Learning (ICML), vol. 98, pp. 82–90 (1998)
21. Sun, L., Liu, J., Chen, J., Ye, J.: Efficient recovery of jointly sparse vectors. In: Advances in Neural Information Processing Systems, pp. 1812–1820 (2009)
22. Ma, Z., Yang, Y., Nie, F., Uijlings, J., Sebe, N.: Exploiting the entire feature space with sparsity for automatic image annotation. In: Proceedings of the 19th ACM International Conference on Multimedia, pp. 283–292. ACM (2011)
23. Zhao, Z., Liu, H.: Semi-supervised feature selection via spectral analysis. In: SIAM International Conference on Data Mining (2007)
24. Zhu, X., Ghahramani, Z., Lafferty, J., et al.: Semi-supervised learning using Gaussian fields and harmonic functions. In: ICML, vol. 3, pp. 912–919 (2003)
25. Xu, Z., King, I., Lyu, M.R.-T., Jin, R.: Discriminative semi-supervised feature selection via manifold regularization. IEEE Trans. Neural Netw. 21(7), 1033–1047 (2010)
26. Huiskes, M.J., Lew, M.S.: The MIR Flickr retrieval evaluation. In: Proceedings of the 1st ACM International Conference on Multimedia Information Retrieval, pp. 39–43. ACM (2008)
27. Li, H., Wang, M., Hua, X.-S.: MSRA-MM 2.0: a large-scale web multimedia dataset. In: IEEE International Conference on Data Mining Workshops, pp. 164–169. IEEE (2009)
28. Chua, T.-S., Tang, J., Hong, R., Li, H., Luo, Z., Zheng, Y.: NUS-WIDE: a real-world web image database from National University of Singapore. In: Proceedings of the ACM International Conference on Image and Video Retrieval, CIVR '09, pp. 1–9. ACM (2009)
29. Wu, F., Yuan, Y., Zhuang, Y.: Heterogeneous feature selection by group lasso with logistic regression. In: Proceedings of the International Conference on Multimedia, pp. 983–986 (2010)
30. Yang, Y., Xu, D., Nie, F., Luo, J., Zhuang, Y.: Ranking with local regression and global alignment for cross media retrieval. In: Proceedings of the 17th ACM International Conference on Multimedia, pp. 175–184. ACM (2009)