Journal of Systems Engineering and Electronics Vol. 25, No. 3, June 2014, pp.502–513

Multi-label dimensionality reduction and classification with extreme learning machines

Lin Feng1,2, Jing Wang1,2, Shenglan Liu1,2,*, and Yao Xiao1,2

1. Faculty of Electronic Information and Electrical Engineering, School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China;
2. School of Innovation Experiment, Dalian University of Technology, Dalian 116024, China

Abstract: Driven by real applications such as text categorization and image classification, multi-label learning has become a hot research topic in recent years, and much attention has been paid to multi-label classification algorithms. Because the high dimensionality of multi-label datasets may cause the curse of dimensionality and hamper the classification process, a dimensionality reduction algorithm, named multi-label kernel discriminant analysis (MLKDA), is proposed to reduce the dimensionality of multi-label datasets. MLKDA, with the kernel trick, processes the multiple labels as a whole and realizes nonlinear dimensionality reduction with an idea similar to that of linear discriminant analysis (LDA). For the classification of multi-label data, the extreme learning machine (ELM) is an efficient algorithm that maintains good accuracy. MLKDA, combined with ELM, shows good performance in multi-label learning experiments on several datasets. The experiments on both static data and data streams show that MLKDA outperforms multi-label dimensionality reduction via dependence maximization (MDDM) and multi-label linear discriminant analysis (MLDA) on balanced datasets and datasets with stronger correlation between labels, and that ELM is also a good choice for multi-label classification.

Keywords: multi-label, dimensionality reduction, kernel trick, classification.

DOI: 10.1109/JSEE.2014.00058

Manuscript received December 13, 2012.
*Corresponding author.
This work was supported by the National Natural Science Foundation of China (51105052; 61173163) and the Liaoning Provincial Natural Science Foundation of China (201102037).

1. Introduction

Traditional supervised learning focuses on single-label data, in which each instance is associated with one label. With the development of information technology, however, a single label is not capable of describing the information well in some application areas.

For example, a piece of news can be categorized as "Education" and "Economy" simultaneously, and an image can be tagged with "sea" and "beach" at the same time. Thus multi-label learning, in which each instance is associated with a set of labels, has received widespread attention in recent years. Besides text categorization [1,2] and image classification [3], multi-label learning has been applied in areas such as automatic multimedia annotation [4,5], automated annotation of proteins with functions [2,6], music emotion categorization [7,8], and web page categorization [9].
The dimensionality of a dataset affects the learning result. As the dimensionality increases, the Euclidean distance becomes less meaningful [10], since data points in a high dimensional space are more similar to each other than those in a low dimensional space. Therefore, dimensionality reduction is an important preprocessing step when learning from high dimensional data. Principal component analysis (PCA) [11], which identifies a lower dimensional space by maximizing the projection variance, is the most notable unsupervised dimensionality reduction method. However, PCA can only perform linear dimensionality reduction. Two other unsupervised algorithms, locally linear embedding (LLE) [12] and local tangent space alignment (LTSA) [13], take the local structure into consideration and can both reduce dimensionality nonlinearly, but they cannot process new instances, which restricts their use in real applications. Real-life datasets usually contain label information, and supervised dimensionality reduction can exploit this label information for better results. Linear discriminant analysis (LDA) [14,15] is a representative linear supervised method. LDA realizes dimensionality reduction by minimizing the within-class scatter and maximizing the between-class scatter to make the data easy to distinguish.


Another supervised method, the maximum margin criterion (MMC) [16], tries to maximize the distance between any two different classes, but it is computationally expensive for high dimensional data. To enable linear methods to realize nonlinear feature extraction, researchers have introduced the kernel trick into existing algorithms. In 1997, Schölkopf et al. [17] proposed the kernel PCA (KPCA) algorithm, which maps the original data into a feature space and then performs PCA in this feature space. In this way, KPCA achieves nonlinear dimensionality reduction with only slightly more complex calculations than PCA. Similarly, Mika et al. [18] proposed kernel Fisher discriminant analysis and solved the nonlinear problem that FDA [14] faces.
The datasets in multi-label learning usually have high dimensionality, which raises challenges for multi-label supervised learning. Several dimensionality reduction methods for multi-label data have been proposed. Park et al. [19] extended LDA to the multi-label problem: they split the multi-label data into a single-label format and utilize direct LDA [20] to reduce dimensionality in two stages. However, this method destroys the integrity of the multi-label structure and thus affects the classification performance. To maintain the overall multi-label structure, Wang et al. [21] proposed another LDA-based algorithm, named multi-label LDA (MLDA), which takes label correlations into consideration and realizes dimensionality reduction with the basic idea of LDA; however, MLDA cannot deal with the nonlinear case either. Multi-label dimensionality reduction via dependence maximization (MDDM) [22] is a good dimensionality reduction algorithm for both linear and nonlinear cases. It identifies a lower-dimensional space by maximizing the dependence between the original data and the corresponding labels, but much attention must be paid to selecting a suitable method for constructing the multi-label matrix. In [23], the multi-label least squares (MLLS) algorithm is proposed. MLLS extracts the subspace shared among multiple labels and obtains the optimal solution via eigenvalue problems; however, it is time-consuming. Also starting from the labels, Sun et al. [24] proposed a hypergraph spectral learning method which uses a hypergraph to capture label correlations, but this method only focuses on linear models. Besides, Sun et al. [25] extended canonical correlation analysis (CCA) to multi-label dimensionality reduction, but it can only handle the case in which the relation between the two sets is linear. Different from the traditional learning process, Ji et al. [26] proposed a joint framework which realizes dimensionality reduction and classification at the same time.
In multi-label data, multiple labels are not always independent of each other.


For a better learning of multi-label data, we consider the relations among different labels. Furthermore, to achieve nonlinear dimensionality reduction, we propose a kernel-based multi-label dimensionality reduction algorithm, named multi-label kernel discriminant analysis (MLKDA). MLKDA considers the correlations between different labels; with the kernel trick, the instances are mapped into a feature space in which the between-class scatter and the within-class scatter are calculated. The optimization model of MLKDA is similar to that of LDA. In the multi-label classification step, we use the extreme learning machine (ELM) algorithm. Empirical experiments show that MLKDA is a good multi-label dimensionality reduction algorithm and that ELM offers both good efficiency and good accuracy for multi-label classification.
The rest of this paper is organized as follows. In Section 2, we give a brief review of generalized LDA, multi-label classification, and the ELM algorithm. Section 3 presents the details of the proposed MLKDA method. We report the experiment results in Section 4. Finally, Section 5 concludes this paper.

2. Related work

2.1 LDA

Given a matrix X = [x_1, x_2, ..., x_N] ∈ R^(D×N), which represents N instances in a D-dimensional space, traditional LDA tries to compute a linear transformation with which the original data in the D-dimensional space can be mapped to a lower L-dimensional space. The linear transformation can be represented as z_i = W^T x_i, where W ∈ R^(D×L) and L < D.
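As a concrete illustration of the projection z_i = W^T x_i described above, the following minimal sketch uses scikit-learn's LinearDiscriminantAnalysis for the standard single-label case; it is not the paper's MLKDA, and the data sizes and number of components are arbitrary assumptions.

```python
# Minimal sketch of the LDA projection z_i = W^T x_i (single-label case only).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))        # N = 100 instances in D = 20 dimensions
y = rng.integers(0, 3, size=100)      # single-label class assignments

lda = LinearDiscriminantAnalysis(n_components=2)   # L = 2 < D
Z = lda.fit_transform(X, y)           # rows of Z are the projected points z_i
print(Z.shape)                        # (100, 2)
```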

4.1 Evaluation metrics

Five multi-label evaluation metrics are used in the experiments. For a multi-label predictor with a real-valued function f(·,·), rank_f(x_i, y) denotes the rank of label y when the labels are sorted in descending order of f(x_i, ·): if f(x_i, y_1) > f(x_i, y_2), then rank_f(x_i, y_1) < rank_f(x_i, y_2).

(i) HammingLoss
HammingLoss counts the missing errors, where a true label is not predicted, and the prediction errors, where an incorrect label is predicted:

hl(h) = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{C}\sum_{c=1}^{C}\left(\delta(c \in Y_i^{*} \wedge c \notin Y_i) + \delta(c \notin Y_i^{*} \wedge c \in Y_i)\right)    (19)

The δ function outputs 1 when its argument is true and 0 when it is false. A small value of hl(h) corresponds to better performance.

(ii) Coverage
Coverage assesses the performance of the system for all the possible labels of an instance, not only the top-ranked label. The smaller the value of coverage(h) is, the better the performance will be.

coverage(h) = \frac{1}{N}\sum_{i=1}^{N}\max_{y \in Y_i} rank_f(x_i, y) - 1    (20)

(iii) OneError
OneError evaluates how many times the top-ranked label is not in the true label set of the instance:

oe(h) = \frac{1}{N}\sum_{i=1}^{N}\delta\Big(\arg\max_{y \in Y} f(x_i, y) \notin Y_i\Big)    (21)

When there is only one label per instance, oe(h) is the same as the ordinary classification error. A small value of oe(h) corresponds to better performance.

(iv) RankingLoss
RankingLoss evaluates the average fraction of label pairs that are reversely ordered for an instance:

rl(h) = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{|Y_i||\bar{Y}_i|}\left|\{(y_1, y_2) \mid (y_1, y_2) \in Y_i \times \bar{Y}_i,\ f(x_i, y_1) \le f(x_i, y_2)\}\right|    (22)

where \bar{Y}_i denotes the complement of Y_i. The smaller the value of rl(h) is, the better the performance will be.

(v) AveragePrecision
AveragePrecision evaluates the average fraction of labels ranked above a particular label y that are actually in Y_i:

ap(h) = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{|Y_i|}\sum_{y \in Y_i}\frac{|\{y' \in Y_i \mid rank_f(x_i, y') \le rank_f(x_i, y)\}|}{rank_f(x_i, y)}    (23)

The larger the value of ap(h) is, the better the performance will be.
These five evaluation metrics are listed in Table 2. In the column "Good performance", ↑ indicates that a larger result means better performance, and vice versa for ↓.

Table 2  Measures used in the experiment
Measures            For short   Good performance
HammingLoss         HL          ↓
Coverage            Co          ↓
OneError            OE          ↓
RankingLoss         RL          ↓
AveragePrecision    AP          ↑
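For concreteness, the five metrics in (19)-(23) can be computed from a score matrix and binary label matrices roughly as follows. This is a minimal NumPy sketch, not code from the paper; it assumes instances in rows and labels in columns, and that every instance has at least one relevant and one irrelevant label.

```python
import numpy as np

def hamming_loss(P, Y):
    # fraction of label slots that are missed or wrongly predicted, eq. (19)
    return np.mean(P != Y)

def ranks(F):
    # rank_f(x_i, y): rank 1 for the highest-scoring label of each instance
    order = np.argsort(-F, axis=1)
    r = np.empty_like(order)
    rows = np.arange(F.shape[0])[:, None]
    r[rows, order] = np.arange(1, F.shape[1] + 1)
    return r

def coverage(F, Y):
    # how far down the ranking we must go to cover all true labels, eq. (20)
    r = ranks(F)
    return np.mean([r[i, Y[i] == 1].max() for i in range(len(Y))]) - 1

def one_error(F, Y):
    # top-ranked label is not a true label, eq. (21)
    top = F.argmax(axis=1)
    return np.mean([Y[i, top[i]] == 0 for i in range(len(Y))])

def ranking_loss(F, Y):
    # fraction of (relevant, irrelevant) label pairs ordered wrongly, eq. (22)
    total = 0.0
    for i in range(len(Y)):
        pos, neg = F[i, Y[i] == 1], F[i, Y[i] == 0]
        total += np.mean(pos[:, None] <= neg[None, :])
    return total / len(Y)

def average_precision(F, Y):
    # precision of the true labels ranked above each true label, eq. (23)
    r = ranks(F)
    total = 0.0
    for i in range(len(Y)):
        ri = r[i, Y[i] == 1]
        total += np.mean([(ri <= rank).sum() / rank for rank in ri])
    return total / len(Y)
```

Here F holds the real-valued outputs f(x_i, y), Y the ground-truth labels, and P the binarized predictions used by HammingLoss.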

4.2 Contrast algorithms

MDDM and MLDA are selected as the contrast dimensionality reduction algorithms. MDDM is a good dimensionality reduction algorithm for multi-label data: in [22], the authors performed several experiments demonstrating that MDDM outperforms other existing multi-label dimensionality reduction methods. MLDA [21] is also a dimensionality reduction method based on the idea of LDA. Thus we choose them as the contrast algorithms. In the classification procedure, we choose ELM as the multi-label classification algorithm. For a better illustration of the performance of the ELM algorithm, we compare it with RankSVM and ML-KNN, since these three algorithms all belong to the "algorithm adaptation" approach and both RankSVM and ML-KNN obtain good results in multi-label classification.
According to the dataset sizes and the areas they belong to, we divide the datasets into two parts and perform the experiments differently. For the first part, we select the relatively small Scene image, Yahoo! Web pages, and Emotions datasets to conduct static experiments. For the second part, we use the larger multimedia datasets and run the experiments dynamically, as a data stream.
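Since ELM serves as the classifier in all of the following experiments, a generic single-hidden-layer ELM can be sketched as follows. This is the standard formulation (random input weights, sigmoid hidden layer, least-squares output weights), not the authors' implementation; the hidden-layer size and activation are assumptions.

```python
import numpy as np

class SimpleELM:
    def __init__(self, n_hidden=100, seed=0):
        self.n_hidden = n_hidden
        self.rng = np.random.default_rng(seed)

    def _hidden(self, X):
        # sigmoid activations of the randomly weighted hidden layer
        return 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))

    def fit(self, X, T):
        # X: (N, d) inputs, T: (N, C) multi-label targets in {0, 1}
        self.W = self.rng.normal(size=(X.shape[1], self.n_hidden))
        self.b = self.rng.normal(size=self.n_hidden)
        H = self._hidden(X)
        self.beta = np.linalg.pinv(H) @ T    # output weights via Moore-Penrose inverse
        return self

    def predict_scores(self, X):
        # real-valued output per label; thresholding yields the predicted label set
        return self._hidden(X) @ self.beta
```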

4.3 Static experiments

Table 3 - Table 8 list the classification results of the ELM algorithm after the three dimensionality reduction algorithms. For a better illustration of the effectiveness of MLKDA, we also list the classification results of RankSVM in Table 9 - Table 14 and those of MLKNN in Table 15 - Table 20. We reduce the dimensionality to five, with which the performance on most datasets is good. Each result in the tables is the average over 20 runs ± the standard deviation, and we keep four significant figures after the decimal point.

Table 3  ELM classification results on Arts&Humanities dataset
Measures   MDDM             MLDA             MLKDA
HL         0.0632±0.0002    0.0811±0.0007    0.0631±0.0001
Co         6.5967±0.0386    8.8240±0.0596    6.0760±0.0046
OE         0.5933±0.0028    0.6647±0.0028    0.7520±0.0001
RL         0.1829±0.0011    0.2628±0.0023    0.1772±0.0002
AP         0.5143±0.0014    0.4352±0.0017    0.4364±0.0014

Table 4  ELM classification results on Business&Economy dataset
Measures   MDDM             MLDA             MLKDA
HL         0.0311±0.0003    0.0383±0.0005    0.0287±0.0001
Co         3.4607±0.0392    4.3437±0.0312    2.6067±0.0017
OE         0.1477±0.0023    0.1720±0.0028    0.1367±0.0002
RL         0.0691±0.0011    0.0904±0.0012    0.0496±0.0004
AP         0.8467±0.0014    0.8182±0.0017    0.8602±0.0008

Table 5  ELM classification results on Computer&Internet dataset
Measures   MDDM             MLDA             MLKDA
HL         0.0400±0.0002    0.0657±0.0006    0.0428±0.0001
Co         6.1370±0.0733    8.4397±0.0884    5.9650±0.0021
OE         0.4303±0.0025    0.5067±0.0027    0.4867±0.0005
RL         0.1333±0.0017    0.1973±0.0025    0.1313±0.0001
AP         0.6250±0.0018    0.5528±0.0021    0.5886±0.0007

Table 6  ELM classification results on Education dataset
Measures   MDDM             MLDA             MLKDA
HL         0.0450±0.0002    0.0626±0.0006    0.0439±0.0001
Co         5.7793±0.0660    8.3050±0.0723    4.3950±0.0036
OE         0.5513±0.0016    0.6470±0.0026    0.6573±0.0002
RL         0.1325±0.0016    0.2010±0.0023    0.1059±0.0001
AP         0.5572±0.0013    0.4649±0.0019    0.4963±0.0001

Table 7  ELM classification results on Emotions dataset
Measures   MDDM             MLDA             MLKDA
HL         0.3514±0.0132    0.4131±0.0027    0.3149±0.0005
Co         2.6637±0.1106    3.1016±0.0144    3.3409±0.0268
OE         0.5034±0.0205    0.6253±0.0053    0.7269±0.0098
RL         0.3389±0.0211    0.4471±0.0026    0.4667±0.0020
AP         0.6422±0.0158    0.5547±0.0020    0.5226±0.0087

Table 8  ELM classification results on Scene dataset
Measures   MDDM             MLDA             MLKDA
HL         0.3355±0.0034    0.3733±0.0033    0.4400±0.0033
Co         2.4588±0.0201    2.5387±0.0243    2.4213±0.0128
OE         0.7588±0.0068    0.8050±0.0056    0.8575±0.0074
RL         0.5622±0.0054    0.5823±0.0062    0.5583±0.0029
AP         0.4569±0.0029    0.4321±0.0039    0.4234±0.0031

Table 9  RankSVM classification results on Arts&Humanities dataset
Measures   MDDM              MLDA              MLKDA
HL         0.0896±0.0015     0.0996±0.0028     0.1072±0.0014
Co         16.8640±0.0415    13.5471±0.0319    6.3290±0.0136
OE         0.9190±0.0007     0.8780±0.0014     0.7040±0.0003
RL         0.5822±0.0024     0.4529±0.0004     0.1963±0.0002
AP         0.1869±0.0003     0.2260±0.0002     0.4203±0.0016

Table 10  RankSVM classification results on Business&Economy dataset
Measures   MDDM             MLDA             MLKDA
HL         0.0760±0.0004    0.0454±0.0004    0.0286±0.0005
Co         4.2050±0.0052    3.0590±0.0021    2.9133±0.0010
OE         0.5916±0.0030    0.3486±0.0003    0.1366±0.0007
RL         0.0897±0.0001    0.0637±0.0019    0.0542±0.0007
AP         0.5815±0.0022    0.7486±0.0007    0.8504±0.0004

Table 11  RankSVM classification results on Computer&Internet dataset
Measures   MDDM             MLDA             MLKDA
HL         0.0610±0.0012    0.0493±0.0023    0.0450±0.0019
Co         9.9067±0.0016    5.6947±0.0024    5.3150±0.0017
OE         0.7773±0.0003    0.4776±0.0013    0.4813±0.0003
RL         0.2479±0.0005    0.1283±0.0031    0.1105±0.0012
AP         0.3714±0.0006    0.5929±0.0018    0.5862±0.0005

Table 12  RankSVM classification results on Education dataset
Measures   MDDM              MLDA             MLKDA
HL         0.0635±0.0025     0.0504±0.0065    0.0549±0.0008
Co         14.7330±0.0034    5.1597±0.0102    4.8357±0.0002
OE         0.9136±0.0007     0.6200±0.0001    0.6783±0.0003
RL         0.3903±0.0006     0.1239±0.0008    0.1195±0.0014
AP         0.2361±0.0005     0.4931±0.0003    0.4641±0.0008

Table 13  RankSVM classification results on Emotions dataset
Measures   MDDM             MLDA             MLKDA
HL         0.3668±0.0002    0.3340±0.0009    0.3318±0.0003
Co         3.1422±0.0052    3.0497±0.0006    3.2393±0.0028
OE         0.5507±0.0004    0.5959±0.0027    0.5507±0.0009
RL         0.4251±0.0102    0.4379±0.0105    0.4307±0.0014
AP         0.5757±0.0007    0.5729±0.0004    0.5696±0.0010

Table 14  RankSVM classification results on Scene dataset
Measures   MDDM             MLDA             MLKDA
HL         0.1804±0.0006    0.1817±0.0012    0.1793±0.0009
Co         1.0376±0.0005    2.4632±0.0104    2.4941±0.0040
OE         0.4607±0.0001    0.8286±0.0056    0.7901±0.0015
RL         0.4857±0.0054    0.4763±0.0008    0.4747±0.0007
AP         0.4318±0.0024    0.4213±0.0037    0.4348±0.0003

Table 15  MLKNN classification results on Arts&Humanities dataset
Measures   MDDM             MLDA             MLKDA
HL         0.0694±0.0087    0.0722±0.0069    0.0645±0.0001
Co         5.6100±0.0302    6.3977±0.0147    6.1033±0.0012
OE         0.5923±0.0003    0.6313±0.0013    0.7757±0.0002
RL         0.1855±0.0014    0.1857±0.0004    0.1775±0.0031
AP         0.5193±0.0003    0.3738±0.0006    0.4219±0.0028

Table 16  MLKNN classification results on Business&Economy dataset
Measures   MDDM             MLDA             MLKDA
HL         0.0273±0.0089    0.0273±0.0006    0.0288±0.0008
Co         2.6860±0.0002    2.6590±0.0017    2.6503±0.0020
OE         0.1410±0.0014    0.1433±0.0001    0.1366±0.0001
RL         0.0456±0.0005    0.0467±0.0061    0.0504±0.0010
AP         0.8675±0.0007    0.8572±0.0022    0.8594±0.0006


Table 17  MLKNN classification results on Computer&Internet dataset
Measures   MDDM             MLDA             MLKDA
HL         0.0373±0.0023    0.0424±0.0065    0.0448±0.0008
Co         6.8643±0.0004    6.5127±0.0002    5.9537±0.0023
OE         0.5200±0.0012    0.4676±0.0007    0.4866±0.0300
RL         0.1226±0.0007    0.1233±0.0035    0.1054±0.0032
AP         0.6397±0.0018    0.5910±0.0006    0.5858±0.0024

Table 18  MLKNN classification results on Education dataset
Measures   MDDM             MLDA             MLKDA
HL         0.0410±0.0009    0.0428±0.0072    0.0470±0.0091
Co         4.5123±0.0037    4.5687±0.0059    3.9997±0.0003
OE         0.6760±0.0002    0.6160±0.0007    0.5590±0.0008
RL         0.1100±0.0003    0.0930±0.0010    0.1097±0.0064
AP         0.4792±0.0012    0.5095±0.0004    0.5629±0.0004

Table 19  MLKNN classification results on Emotions dataset
Measures   MDDM             MLDA             MLKDA
HL         0.3397±0.0003    0.4103±0.0008    0.4473±0.0003
Co         3.0113±0.0036    3.2059±0.0015    3.3205±0.0019
OE         0.6072±0.0015    0.6912±0.0015    0.6997±0.0070
RL         0.4257±0.0200    0.4619±0.0067    0.4650±0.0005
AP         0.5310±0.0019    0.5312±0.0060    0.5368±0.0006

Table 20  MLKNN classification results on Scene dataset
Measures   MDDM             MLDA             MLKDA
HL         0.2261±0.0013    0.2714±0.0022    0.2279±0.0008
Co         2.7550±0.0201    2.1689±0.0081    2.5020±0.0023
OE         0.6528±0.0009    0.6891±0.0037    0.6438±0.0300
RL         0.3317±0.0012    0.4126±0.0040    0.4990±0.0032
AP         0.4859±0.0045    0.4893±0.0018    0.5411±0.0005

From Table 3 - Table 8 (the ELM classification results), Table 9 - Table 14 (the RankSVM classification results), and Table 15 - Table 20 (the MLKNN classification results), we can see that MLKDA outperforms MLDA in most cases and obtains better results than MDDM on most of the datasets. This can be attributed to the fact that MLKDA takes the label correlations into consideration and also handles the nonlinear case, in which MDDM performs well only when a suitable method for constructing the multi-label matrix is selected. On the Emotions dataset (Table 7, Table 13, Table 19), MLKDA gets relatively worse results than MDDM, which may be caused by the loss of some important information, since the number of features in this dataset is relatively small. MLDA, due to its incapability in the nonlinear case, does not perform as well as MDDM and MLKDA. Compared with the classification results of RankSVM and MLKNN, the results in Table 3 - Table 8 show that the ELM classification algorithm is suitable for multi-label classification.
With the kernel trick, efficiency decreases as the training set grows. Therefore, building on the static experiments, we extend MLKDA to data stream experiments with larger datasets, in which the large datasets are divided into several small batches.

4.4 Data stream experiments

In this part, we conduct experiments on the Mediamill and MPEG datasets, with the batch size set to 500. We slightly adjust our method for the data stream experiments. To begin with, we use efficient PCA [44] to reduce the dimensionality to 20, and then apply MLKDA as in the static experiments, which further improves the efficiency with little loss of important information. Table 21 and Table 22 list the classification results with the ELM algorithm on the Mediamill and MPEG datasets.
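The per-batch pipeline just described can be sketched as follows. This is a hedged illustration of our reading of the protocol (batch size 500, PCA to 20 dimensions, then MLKDA and ELM); `mlkda_reduce` and `elm_classify` are hypothetical placeholders standing in for the paper's algorithms, not real library APIs.

```python
import numpy as np
from sklearn.decomposition import PCA

def process_stream(X, Y, mlkda_reduce, elm_classify, batch_size=500, pca_dim=20):
    """Run the per-batch pipeline PCA -> MLKDA -> ELM, batch by batch."""
    results = []
    for start in range(0, len(X), batch_size):
        Xb, Yb = X[start:start + batch_size], Y[start:start + batch_size]
        Xp = PCA(n_components=pca_dim).fit_transform(Xb)   # cheap linear preprocessing
        Xr = mlkda_reduce(Xp, Yb)                          # nonlinear multi-label reduction
        results.append(elm_classify(Xr, Yb))               # train/evaluate ELM on the batch
    return results
```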

Table 21  ELM classification results on Mediamill dataset
Measures   MDDM              MLDA              MLKDA
HL         0.0761±0.0101     0.0443±0.0048     0.0433±0.0046
Co         56.9350±3.5561    45.3240±4.8930    37.5030±3.4071
OE         0.4018±0.0771     0.2919±0.0722     0.2485±0.0657
RL         0.2778±0.0286     0.1929±0.0288     0.1556±0.0231
AP         0.4561±0.0351     0.5454±0.0381     0.5618±0.0297

Table 22  ELM classification results on MPEG dataset
Measures   MDDM              MLDA              MLKDA
HL         0.0688±0.0096     0.0568±0.0048     0.0381±0.0049
Co         58.1340±3.6565    50.1270±4.8744    33.9080±3.4638
OE         0.4244±0.0766     0.4039±0.0726     0.2181±0.0677
RL         0.2740±0.0258     0.2353±0.0288     0.1281±0.0219
AP         0.4456±0.0325     0.4699±0.0383     0.5825±0.0263

We plot the HammingLoss and AveragePrecision results on the Mediamill and MPEG datasets, as shown in Fig. 5 and Fig. 6. The results show that MLKDA achieves the best performance among the three algorithms on both datasets. Since we perform efficient PCA before MLKDA, this may facilitate the learning process as long as little important discriminant information is lost. Compared with the datasets and results in the static experiments, we can see that, for datasets with more labels and closer label correlation, MLKDA performs better than MDDM and MLDA. In Fig. 5, MLKDA obtains results similar to MLDA, both better than MDDM. In Fig. 6(b) there is a steep drop in the 10th batch, which may be caused by the distribution of the data in this batch not suiting the Gaussian kernel, so the average precision drops sharply.


Fig. 5  Mediamill dataset

Fig. 6  MPEG dataset

4.5 ELM vs (RankSVM & MLKNN)

We use ELM to classify the multi-label data. Comparing it with the existing RankSVM and MLKNN algorithms, we obtain the classification results after MLKDA dimensionality reduction, shown in Table 23. The row "Time/s↓ (train/test)" gives the classification time for training and testing.

Table 23  Comparison of MLKNN, RankSVM and ELM classification results after MLKDA dimensionality reduction on different datasets

Arts&Humanities
Measures               MLKNN             RankSVM           ELM
HL                     0.0645±0.0001     0.1072±0.0014     0.0631±0.0001
Co                     6.1033±0.0012     6.3290±0.0136     6.0760±0.0046
OE                     0.7757±0.0002     0.7040±0.0003     0.7520±0.0001
RL                     0.1775±0.0031     0.1963±0.0002     0.1772±0.0002
AP                     0.4219±0.0028     0.4203±0.0016     0.4364±0.0014
Time/s↓ (train/test)   5.3508/8.7049     763.09/11.9810    0.1560/0.0468

Business&Economy
Measures               MLKNN             RankSVM           ELM
HL                     0.0288±0.0008     0.0286±0.0005     0.0287±0.0001
Co                     2.6503±0.0020     2.9133±0.0010     2.6067±0.0017
OE                     0.1366±0.0001     0.1366±0.0007     0.1367±0.0002
RL                     0.0504±0.0010     0.0542±0.0007     0.0496±0.0004
AP                     0.8594±0.0006     0.8504±0.0004     0.8602±0.0008
Time/s↓ (train/test)   6.1308/10.608     962.21/12.698     0.2028/0.0472

Computer&Internet
Measures               MLKNN             RankSVM           ELM
HL                     0.0448±0.0008     0.0450±0.0019     0.0428±0.0001
Co                     5.9537±0.0023     5.3150±0.0017     5.9650±0.0021
OE                     0.4866±0.0300     0.4813±0.0003     0.4867±0.0005
RL                     0.1054±0.0032     0.1105±0.0012     0.1313±0.0001
AP                     0.5858±0.0024     0.5862±0.0005     0.5886±0.0007
Time/s↓ (train/test)   4.7892/9.0637     1157.8/13.4       2.7612/0.4212

Education
Measures               MLKNN             RankSVM           ELM
HL                     0.0470±0.0091     0.0549±0.0008     0.0439±0.0001
Co                     3.9997±0.0003     4.8357±0.0002     4.3950±0.0036
OE                     0.5590±0.0008     0.6783±0.0003     0.6573±0.0002
RL                     0.1097±0.0064     0.1195±0.0014     0.1059±0.0001
AP                     0.5629±0.0004     0.4641±0.0008     0.4963±0.0001
Time/s↓ (train/test)   5.0388/10.7800    1125.7/13.276     0.156/0.0624

Emotions
Measures               MLKNN             RankSVM           ELM
HL                     0.4473±0.0003     0.3318±0.0003     0.3149±0.0005
Co                     3.3205±0.0019     3.2393±0.0028     3.3409±0.0268
OE                     0.6997±0.0070     0.5507±0.0009     0.7269±0.0098
RL                     0.4650±0.0005     0.4307±0.0014     0.4667±0.0020
AP                     0.5368±0.0006     0.5696±0.0010     0.5226±0.0087
Time/s↓ (train/test)   0.0312/0.1560     1.4976/0.0936     0.0312/0.0002

Scene
Measures               MLKNN             RankSVM           ELM
HL                     0.2279±0.0008     0.1793±0.0009     0.4400±0.0033
Co                     2.5020±0.0023     2.4941±0.0040     2.4213±0.0128
OE                     0.6438±0.0300     0.7901±0.0015     0.8575±0.0074
RL                     0.4990±0.0032     0.4747±0.0007     0.5583±0.0029
AP                     0.5411±0.0005     0.4348±0.0003     0.4234±0.0031
Time/s↓ (train/test)   1.8096/2.2464     31.45/1.9968      0.156/0.0468


The results in Table 23 show that, after MLKDA dimensionality reduction, the ELM algorithm clearly outperforms RankSVM and MLKNN in time consumption and slightly outperforms them in classification results on most datasets. On the Emotions and Scene datasets, however, the classification results of ELM are worse than those of MLKNN and RankSVM, which may be caused by the loss of some important information in the dimensionality reduction process. The basic idea of RankSVM is to adopt the "maximum margin" strategy: based on the ranking loss metric (22), it defines a group of linear classifiers to classify the multi-label dataset, and it handles nonlinear datasets with the kernel trick. Compared with ELM and MLKNN, RankSVM is very inefficient because of its iterative computation; see the comparison of time consumption in the row "Time/s↓ (train/test)" of Table 23. MLKNN is a lazy learning method with low efficiency, and the choice of k is crucial to the final classification result. ELM is highly efficient and has better classification performance, as can be seen from Table 23. Besides, [45] indicates that, with the same kernel function, ELM gets better classification results than SVM. One issue needs to be considered: we obtain the final label set by combining the real-valued outputs of ELM with the average label cardinality of the dataset; therefore, the method may underperform on unbalanced datasets.
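One plausible reading of the labeling rule mentioned above is sketched below: the real-valued ELM outputs are binarized by keeping the top-k labels per instance, with k derived from the average label cardinality of the dataset. The exact rule is not spelled out in the text, so this is an assumption, not the authors' implementation.

```python
import numpy as np

def binarize_by_cardinality(scores, avg_cardinality):
    """scores: (N, C) real-valued ELM outputs; keep the top-k labels per instance,
    with k set to the rounded average label cardinality of the dataset (assumption)."""
    k = max(1, int(round(avg_cardinality)))
    idx = np.argsort(-scores, axis=1)[:, :k]
    P = np.zeros_like(scores, dtype=int)
    np.put_along_axis(P, idx, 1, axis=1)
    return P
```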

5. Conclusions

In this paper, we propose a multi-label dimensionality reduction method, MLKDA, which considers the correlations between different labels and does not destroy the integrity of the label structure. We use the kernel trick in this method for nonlinear cases, and the resulting dimensionality reduction helps improve the classification performance. Experiments on real-world multi-label learning problems, i.e., scene image classification, music emotion categorization, automatic web page categorization, and multimedia annotation, show that MLKDA outperforms some well-established multi-label dimensionality reduction algorithms. Meanwhile, the experiment results indicate that the ELM algorithm is a good choice for multi-label classification, although MLKDA may underperform on unbalanced datasets. In our future work, we plan to tackle large sparse datasets with MLKDA, which is a challenge considering the kernel trick.

References
[1] D. D. Lewis, Y. Yang, T. G. Rose, et al. RCV1: a new benchmark collection for text categorization research. Journal of Machine Learning Research, 2004, (5): 361–397.
[2] M. L. Zhang, Z. H. Zhou. Multilabel neural networks with applications to functional genomics and text categorization. IEEE Trans. on Knowledge and Data Engineering, 2006, 18(10): 1338–1351.
[3] M. R. Boutell, J. Luo, X. Shen, et al. Learning multi-label scene classification. Pattern Recognition, 2004, 37(9): 1757–1771.
[4] G. J. Qi, X. S. Hua, Y. Rui, et al. Correlative multi-label video annotation. Proc. of the 15th International Conference on Multimedia, 2007: 17–26.
[5] X. S. Xu, Y. Jiang, L. Peng, et al. Ensemble approach based on conditional random field for multi-label image and video annotation. Proc. of the 19th ACM International Conference on Multimedia, 2011: 1377–1380.
[6] A. Elisseeff, J. Weston. A kernel method for multi-labelled classification. Advances in Neural Information Processing Systems 14. Cambridge, MA: MIT Press, 2002: 681–687.
[7] K. Trohidis, G. Tsoumakas, G. Kalliris, et al. Multilabel classification of music into emotions. Proc. of the 9th International Conference on Music Information Retrieval, 2008.
[8] A. Wieczorkowska, P. Synak, Z. Raś. Multi-label classification of emotions in music. Intelligent Information Processing and Web Mining. Berlin Heidelberg: Springer, 2006.
[9] N. Ueda, K. Saito. Parametric mixture models for multi-labeled text. Advances in Neural Information Processing Systems 15. Cambridge, MA: MIT Press, 2003: 721–728.
[10] K. S. Beyer, J. Goldstein, R. Ramakrishnan, et al. When is "nearest neighbor" meaningful? Proc. of the 7th International Conference on Database Theory, 1999: 217–235.
[11] I. T. Jolliffe. Principal component analysis. New York: Springer Verlag, 1986.
[12] S. T. Roweis, L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 2000, 290(5500): 2323–2326.
[13] Z. Y. Zhang, H. Y. Zha. Principal manifolds and nonlinear dimensionality reduction via tangent space alignment. Journal of Shanghai University (English Edition), 2004, 8(4): 406–424.
[14] R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 1936, 7(2): 179–188.
[15] C. R. Rao. The utilization of multiple measurements in problems of biological classification. Journal of the Royal Statistical Society, Series B (Methodological), 1948, 10(2): 159–203.
[16] H. F. Li, T. Jiang, K. S. Zhang. Efficient and robust feature extraction by maximum margin criterion. IEEE Trans. on Neural Networks, 2006, 17(1): 157–165.
[17] B. Schölkopf, A. Smola, K. R. Müller. Kernel principal component analysis. Artificial Neural Networks. Berlin Heidelberg: Springer, 1997.
[18] S. Mika, G. Ratsch, J. Weston, et al. Fisher discriminant analysis with kernels. Neural Networks for Signal Processing, 1999: 41–48.
[19] C. H. Park, M. Lee. On applying linear discriminant analysis for multi-labeled problems. Pattern Recognition Letters, 2008, 29(7): 878–887.
[20] H. Yu, J. Yang. A direct LDA algorithm for high-dimensional data—with application to face recognition. Pattern Recognition, 2001, 34(10): 2067–2070.
[21] H. Wang, C. Ding, H. Huang. Multi-label linear discriminant analysis. Berlin: Springer, 2010.
[22] Y. Zhang, Z. H. Zhou. Multilabel dimensionality reduction via dependence maximization. ACM Trans. on Knowledge Discovery from Data, 2010, 4(3): 1–21.
[23] S. Ji, L. Tang, S. Yu, et al. Extracting shared subspace for multi-label classification. Proc. of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008: 381–389.
[24] L. Sun, S. W. Ji, J. P. Ye. Hypergraph spectral learning for multi-label classification. Proc. of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008: 668–676.
[25] L. Sun, S. W. Ji, J. P. Ye. Canonical correlation analysis for multilabel classification: a least-squares formulation, extensions, and analysis. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2011, 33(1): 194–200.
[26] S. W. Ji, J. P. Ye. Linear dimensionality reduction for multi-label classification. Proc. of the 21st International Joint Conference on Artificial Intelligence, 2009: 1077–1082.
[27] G. Tsoumakas, I. Katakis. Multi-label classification: an overview. International Journal of Data Warehousing and Mining, 2007, 3(3): 1–13.
[28] J. Read, B. Pfahringer, G. Holmes, et al. Classifier chains for multi-label classification. Machine Learning, 2011, 85(3): 333–359.
[29] G. Tsoumakas, I. Vlahavas. Random k-labelsets: an ensemble method for multilabel classification. Machine Learning. Berlin: Springer, 2007.
[30] M. L. Zhang, Z. H. Zhou. ML-KNN: a lazy learning approach to multi-label learning. Pattern Recognition, 2007, 40(7): 2038–2048.
[31] G. B. Huang, Q. Y. Zhu, C. K. Siew. Extreme learning machine: theory and applications. Neurocomputing, 2006, 70(1/3): 489–501.
[32] H. Wang, H. Huang, C. H. Q. Ding. Discriminant Laplacian embedding. Proc. of the 24th AAAI Conference on Artificial Intelligence, 2010: 618–623.
[33] Z. Li, J. Zhang, S. Hu. Incremental support vector machine algorithm based on multi-kernel learning. Journal of Systems Engineering and Electronics, 2011, 22(4): 702–706.
[34] B. Scholkopf, A. J. Smola. Learning with kernels: support vector machines, regularization, optimization, and beyond. Cambridge: MIT Press, 2001.
[35] J. Ma. Function replacement vs. kernel trick. Neurocomputing, 2003, 50(4): 479–483.
[36] K. R. Muller, S. Mika, G. Ratsch, et al. An introduction to kernel-based learning algorithms. IEEE Trans. on Neural Networks, 2001, 12(2): 181–201.
[37] N. Kwak. Kernel discriminant analysis for regression problems. Pattern Recognition, 2012, 45(5): 2019–2031.
[38] J. P. Ye, Q. Li. A two-stage linear discriminant analysis via QR-decomposition. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2005, 27(6): 929–941.
[39] W. W. Zong, G. B. Huang. Face recognition based on extreme learning machine. Neurocomputing, 2011, 74(16): 2541–2551.
[40] M. L. Zhang. Data resource page. http://cse.seu.edu.cn/people/zhangml/.
[41] Data source page of Mulan. http://sourceforge.net/projects/mulan/files/datasets/.
[42] G. Madjarov, D. Gjorgjevikj, S. Dzeroski. Two stage architecture for multi-label learning. Pattern Recognition, 2012, 45(3): 1019–1034.
[43] M. L. Zhang, J. M. Peña, V. Robles. Feature selection for multi-label naive Bayes classification. Information Sciences, 2009, 179(19): 3218–3229.
[44] P. M. Roth, M. Winter. Survey of appearance-based methods for object recognition. Styria, Austria: Graz University of Technology, 2008: 25–27.
[45] G. B. Huang, H. M. Zhou, X. J. Ding, et al. Extreme learning machine for regression and multiclass classification. IEEE Trans. on Systems, Man, and Cybernetics—Part B: Cybernetics, 2012, 42(2): 513–529.

Biographies

Lin Feng was born in 1969. He received his B.S. degree in electronic technology from Dalian University of Technology in 1992, M.S. degree in power engineering from Dalian University of Technology in 1995, and Ph.D. degree in mechanical design and theory from Dalian University of Technology in 2004. His research interests include intelligent image processing, robotics, data mining, and embedded systems. E-mail: [email protected]

Jing Wang was born in 1988. She received her B.S. degree in electronic information engineering from Dalian University of Technology in 2011. Currently, she is working toward her M.S. degree in computer technology and application at Dalian University of Technology. Her research interests are data mining and machine learning. E-mail: wangjing [email protected]

Shenglan Liu was born in 1984. He received his M.S. degree from the College of Computer and Information Technology, Liaoning Normal University, in 2011. Currently, he is working toward his Ph.D. degree in the School of Computer Science and Technology, Dalian University of Technology. His research interests include pattern recognition and machine learning. E-mail: [email protected]

Yao Xiao was born in 1986. He received his B.S. degree from Sun Yat-Sen University in 2009. Currently, he is working toward his M.S. degree in computer technology and application at Dalian University of Technology. His research interests are pattern recognition and machine learning. E-mail: [email protected]
