26th Iranian Conference on Electrical Engineering (ICEE2018)

Multi-label Classification of Small Samples Using an Ensemble Technique

Amirreza Mahdavi-Shahri 1, Jamil Karimian 2, Azadeh Javadi 3, Mahboobeh Houshmand 4
Department of Computer Engineering, Mashhad Branch, Islamic Azad University, Mashhad, Iran
[email protected], [email protected], [email protected], houshmand@mshdiau.ac.ir 1, 2, 3, 4

Abstract— Recently, the multi-label classification problem has become an important challenge in many classification tasks. In this problem, each sample is associated with a set of class labels. Small-sample data is an important concept referring to the case where the number of training samples is much smaller than the feature dimension. The classification of small-sample data has become a new problem in the fields of machine learning and pattern recognition. Ensemble learning is a kind of supervised learning in which multiple learners are trained to solve the same problem. In this study, an ensemble learning method is proposed for the classification of multi-label data. To this end, the data is first converted to small-sample form, and then the proposed learning algorithm is applied to it. We used well-known metrics for the validation and evaluation of our results. Our results show that the proposed method achieves better performance than the state-of-the-art base classifiers with respect to these metrics.

Keywords— Machine learning; Multi-label classification; Single-label classification; Small-sample data; Ensemble learning

I. INTRODUCTION

Traditional single-label classification methods assign each sample to exactly one class, with no overlap between classes [1], [2]. In binary classification, the number of labels is two [3], [4], whereas multi-class classification [5], [6] deals with instances belonging to more than two classes. Multi-label classification (MLC) is an important issue, especially in the machine learning research area; it allows a sample to belong to several classes simultaneously [7]–[10]. MLC is used in several applications, such as bioinformatics, where each protein may be labeled with multiple functional labels like metabolism, energy, or cellular biogenesis [11]; video annotation, where a movie can be described with several labels and tags [12]; and text categorization, one of the most important applications of MLC, where each document can be assigned to a set of predefined topics [13]. According to the general definition, small sample refers to the case where the number of training samples is less than the pattern feature dimension. The problem with the classification of small samples is that the parameters of feature extraction and classifier algorithms cannot be correctly estimated. For example, the within-class sample covariance matrix is singular, and the optimal recognition features are difficult to extract. There is another definition of small sample according to which the number of

training samples is greater than the feature dimension, but the difference is not large [14]. On the other hand, small samples have their own advantages. For example, due to the smaller number of training samples, the learning time of the feature extraction and classification algorithms is short [15]. Ensemble learning is typically a kind of supervised learning in which several classifiers are merged in order to make a prediction. These classifiers are called base-level classifiers. Lately, ensemble methods have played an important role in data mining and machine learning research because they can improve the accuracy of base-level classifiers [17]–[19]. Ensemble techniques also help overcome the overfitting problem, which decreases the generalization ability of a system. Base-level classifiers are constructed using various algorithms to improve performance. In this study, an ensemble learning method is proposed for the classification of multi-label data with small samples. The rest of this paper is organized as follows. Section II gives a review of multi-label methods. Section III presents the proposed approach. In Section IV, we discuss the results, and finally Section V concludes the paper and outlines future directions.

II. REVIEW OF EXISTING MULTI-LABEL METHODS

Here we review a sub-class of the available multi-label approaches, classified into four groups.

A) Binary Relevance Methods
Binary relevance methods create one model per label, so that the multi-label problem is decomposed into independent binary classification problems. Table I lists some of the binary relevance methods [19], [20].

TABLE I. An overview of binary relevance methods [19], [20]
1. Binary Relevance (BR)
2. Classifier Chains (CC)
3. Classifier Trellis (CT)
4. Probabilistic Classifier Chains
5. Bayesian Classifier Chains
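For illustration only (this is not the authors' implementation), a minimal sketch of the binary relevance idea trains one independent binary classifier per label; the use of scikit-learn and of logistic regression as the base learner is an assumption:

```python
# Sketch of the binary relevance (BR) transformation: one independent
# binary classifier per label. X is (n_samples, n_features) and Y is a
# binary indicator matrix of shape (n_samples, n_labels).
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_binary_relevance(X, Y):
    """Train one binary model per label column of Y."""
    models = []
    for j in range(Y.shape[1]):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X, Y[:, j])          # the j-th label is treated as an independent problem
        models.append(clf)
    return models

def predict_binary_relevance(models, X):
    """Stack the per-label predictions back into an indicator matrix."""
    return np.column_stack([m.predict(X) for m in models])
```

Methods such as classifier chains extend this idea by feeding the predictions for earlier labels as extra features to the models for later labels.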



B) Label Powerset Methods
Label powerset methods [20], [21] present excellent performance, although only some of their variants can scale up to large datasets [20]. A minimal sketch of the label powerset idea is given below, followed by Table II.
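Under the same caveat (our own sketch, not the paper's code), the label powerset transformation treats every distinct label combination as a single class of an ordinary multi-class problem; the decision tree used as the multi-class learner is an assumption:

```python
# Sketch of the label powerset (LP) transformation: each distinct row of the
# binary indicator matrix Y becomes one class of a multi-class problem.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_label_powerset(X, Y):
    combos, y_mc = np.unique(Y, axis=0, return_inverse=True)  # map label sets to class ids
    clf = DecisionTreeClassifier().fit(X, y_mc)
    return clf, combos

def predict_label_powerset(clf, combos, X):
    return combos[clf.predict(X)]        # map predicted class ids back to label sets
```

The number of classes grows with the number of distinct label combinations seen in training, which is one reason only some variants of this family scale to large datasets.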

TABLE II. An overview of label powerset methods [19]

1. Label Combination
2. Random k-label Pruned Sets
3. Hierarchical Label Sets
4. Pruned Label Powerset

C) Pairwise and Threshold Methods

TABLE III. An overview of pairwise and threshold methods [22]

1. Four-class Pairwise
2. Rank + Threshold

D) Other Methods
Semi-supervised approaches use the test data (with the labels removed first) to facilitate the training process. This group also includes neural-network-based methods, which build a higher-level feature representation in the training phase.

TABLE IV. Semi-supervised methods [22]

1. Expectation Maximization
2. Back-Propagation Neural Network (BPNN)
3. Deep Back-Propagation Neural Network (DBPNN)
4. Deep Multi-label

Comparison Measures

Generally, example-based metrics are a type of bipartition-based metrics, which evaluate the bipartition over all examples of the evaluation dataset, whereas ranking-based metrics evaluate rankings with respect to the ground truth of the multi-label dataset. Here, we define five metrics for the comparison of the base classifiers with the multi-label classifiers. An evaluation dataset can be defined as (x_i, Y_i), i = 1, ..., N, where Y_i ⊆ L is the set of true labels and L = {λ_j : j = 1, ..., M} is the set of all labels. Given an example x_i, the set of labels estimated by a multi-label approach is denoted by Z_i, while the rank predicted for a label λ is denoted by r_i(λ). The most appropriate label gets the highest rank (1), while the least appropriate one receives the lowest rank (M) [23].

1) Exact Match: The output of this criterion is the percentage of samples whose complete label set is classified correctly. For instance, for a three-topic document, if only two of the topics are correctly predicted, the decision is already counted as a failure.

2) Accuracy: Accuracy (Acc) computes the number of correctly predicted labels (the intersection of the predicted and true label sets) divided by the size of the union of the predicted and true labels, averaged over all examples, as defined in (1).

Acc = \frac{1}{N} \sum_{i=1}^{N} \frac{|Y_i \cap Z_i|}{|Y_i \cup Z_i|}    (1)

3) Hamming Loss: Hamming loss (HL) belongs to the bipartition-based measures; it measures the fraction of incorrectly predicted labels out of the total number of labels. It is worth mentioning that the lower the Hamming loss, the better the performance.

HL = \frac{1}{N} \sum_{i=1}^{N} \frac{|Y_i \,\Delta\, Z_i|}{M}    (2)
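To make the example-based measures above concrete, the following sketch (ours, not from the paper; function and variable names are assumptions) computes exact match, accuracy as in (1), and Hamming loss as in (2) from binary indicator matrices:

```python
# Example-based metrics on indicator matrices Y_true (ground truth, Y_i) and
# Y_pred (predictions, Z_i), both of shape (n_samples, n_labels).
import numpy as np

def exact_match(Y_true, Y_pred):
    """Fraction of examples whose complete label set is predicted correctly."""
    return np.mean(np.all(Y_true == Y_pred, axis=1))

def accuracy(Y_true, Y_pred):
    """Mean |Y_i ∩ Z_i| / |Y_i ∪ Z_i| over all examples, as in Eq. (1)."""
    inter = np.logical_and(Y_true, Y_pred).sum(axis=1)
    union = np.logical_or(Y_true, Y_pred).sum(axis=1)
    # an example with empty true and predicted label sets counts as fully correct
    return np.mean(np.where(union > 0, inter / np.maximum(union, 1), 1.0))

def hamming_loss(Y_true, Y_pred):
    """Mean fraction of labels predicted incorrectly, as in Eq. (2)."""
    return np.mean(Y_true != Y_pred)
```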

Ranking-based Measures:

4) One-error: In the single-label classification problem, the one-error (1-Err) is equal to the ordinary classification error. This measure estimates the number of times that the top-ranked label is not in the set of true labels. The lower this measure, the better the performance.

One-error = \frac{1}{N} \sum_{i=1}^{N} \delta\left( \arg\min_{\lambda \in L} r_i(\lambda) \right)    (3)

where

\delta(\lambda) = \begin{cases} 1 & \text{if } \lambda \notin Y_i \\ 0 & \text{otherwise} \end{cases}

5) Ranking Loss: The ranking loss measure (RL), as defined in (4), considers the average fraction of label pairs that are ordered incorrectly [24].

RL = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{|Y_i|\,|\bar{Y}_i|} \left| \{ (\lambda_a, \lambda_b) : r_i(\lambda_a) > r_i(\lambda_b), \; \lambda_a \in Y_i, \; \lambda_b \in \bar{Y}_i \} \right|    (4)

where \bar{Y}_i denotes the complement of Y_i in L.
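Similarly, the ranking-based measures can be sketched as follows (again ours; it assumes a real-valued score matrix S in which a higher score means a more relevant label, and it ignores ties):

```python
# Ranking-based metrics from a score matrix S and the true indicator matrix
# Y_true, both of shape (n_samples, n_labels).
import numpy as np

def one_error(Y_true, S):
    """Fraction of examples whose top-ranked label is not a true label, as in Eq. (3)."""
    top = np.argmax(S, axis=1)
    return np.mean(Y_true[np.arange(len(S)), top] == 0)

def ranking_loss(Y_true, S):
    """Average fraction of (relevant, irrelevant) label pairs ordered wrongly, as in Eq. (4)."""
    losses = []
    for y, s in zip(Y_true.astype(bool), S):
        pos, neg = s[y], s[~y]
        if len(pos) == 0 or len(neg) == 0:
            continue                                     # pairs are undefined for this example
        wrong = np.sum(pos[:, None] < neg[None, :])      # irrelevant label scored above a relevant one
        losses.append(wrong / (len(pos) * len(neg)))
    return float(np.mean(losses)) if losses else 0.0
```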

III. THE PROPOSED METHOD

Here we propose and discuss the ensemble learning method for the classification of multi-label data with small samples. As we mentioned earlier, when the dimension of the data decreases, several characteristics follow: the curse of dimensionality that affects high-dimensional data is avoided, the running time of the learning algorithms is shorter than in the other cases, and overfitting is prevented in some cases. In a previous work [7], we focused on the multi-label classification problem and presented a new learning algorithm with multiple classifiers, but did not consider the running time of the learning algorithm. We divide the proposed method into two parts: first, we perform a preprocessing phase in which we select the data and reduce it to small-sample form; then the learning algorithm is applied to the low-dimensional data. In the second phase, we divide the low-dimensional data into two subsets: train data and test data. The main phases of this process are organized as in Fig. 1.


TABLE V. The used datasets statistics

Dataset   Real Instances   Features   Labels
Scene     2407             300        6
Llog      1460             1079       14
Music     592              77         6

Figure 1. The general scheme of the proposed method. [Block diagram: Input Dataset → Low Dimensional Data → Dividing Phase → Train Data / Test Data → Multi-Label Classifiers (Classifier 1: BR, Classifier 2: CC, Classifier 3: CT, Classifier 4: DBPNN, Classifier 5: BPNN) → Classification → Output. K-fold cross-validation (K = 5); EL: combining the five multi-label classifiers.]

We used three multi-label datasets with 6, 14 and 6 labels, called Scene, Llog, and Music1. For the Scene dataset, the number of instances is greater than the number of attributes. We applied attribute selection to reduce the number of dimensions, and then selected 241 instances for the learning process. To avoid overfitting, we used K-fold cross-validation (K = 5), as shown in Fig. 1. Similarly, for the Llog and Music datasets, after attribute selection we selected 121 and 60 instances with 35 and 77 features, respectively. Table V presents the original datasets that were used in the learning process.

1 http://mulan.sourceforge.net/datasets-mlc.html
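As an illustration of this preprocessing phase (the paper performs attribute selection in Weka/Mulan and does not state the selection criterion, so the variance-based feature ranking and the random instance subsampling below are only assumptions), reducing a dataset to small-sample form and evaluating it with 5-fold cross-validation could look like this:

```python
# Sketch of the preprocessing phase: keep the top-variance attributes, keep a
# small random subset of instances (small-sample form), then evaluate with
# 5-fold cross-validation as in Fig. 1.
import numpy as np
from sklearn.model_selection import KFold

def to_small_sample(X, Y, n_features, n_instances, seed=0):
    top = np.argsort(X.var(axis=0))[::-1][:n_features]   # highest-variance attributes
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=n_instances, replace=False)
    return X[np.ix_(idx, top)], Y[idx]

# e.g. for the Llog dataset: 121 instances and 35 features after selection
# X_small, Y_small = to_small_sample(X, Y, n_features=35, n_instances=121)
# for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X_small):
#     ...  # train the five base classifiers on the train fold, evaluate on the test fold
```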

There are some tools for the implementation of multi-label classification, such as scikit-learn [25] and Orange [26]. All implementations in this study were run in a Weka-based package of Java classes called Mulan2. This package includes multiple classification methods, such as LP and RAkEL, as well as ensembles of classifiers. As mentioned earlier, ensemble learning is a supervised learning approach, and we applied it to build a group of multi-label base learners. Five classifiers were selected for all of the experiments: BR, CC, CT, BPNN and DBPNN. Each base learner in the ensemble makes a prediction on the multi-label dataset; those base learners are then combined into one multi-label learner that makes predictions on all labels of the dataset. According to the metrics defined for the validation of learning algorithms, we expect to see a discriminative difference at the base levels of the ensemble technique. To combine the outputs of classifiers, there are algebraic methods, such as the average, weighted average, maximum and minimum, and voting methods, such as majority voting and weighted majority voting. Ensemble learning is used to construct a group of multi-label base learners, which are then combined to make predictions on the set of labels. In this paper, we applied the majority voting approach for the ensemble classification; a minimal sketch of this voting step is given below, after the footnote.

IV. EXPERIMENTS AND RESULTS

We used five different multi-label classifiers: BR, CC, CT, BPNN and DBPNN. In this paper, we selected five measures in order to compare the output of the learning algorithms. Fig. 2 compares the results of the multi-label classifiers with respect to the mentioned metrics on the Scene dataset. Accuracy and exact match are positive metrics: the higher their value, the better the classifier's performance. We selected our classifiers from different existing multi-label families; the worst performance belongs to DBPNN, in the sense that its accuracy is lower than that of the other classifiers and its one-error is higher than the others.

2 http://meka.sourceforge.net/methods.html
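For illustration, the majority-voting combination described above can be sketched as follows (our own minimal sketch, not the Mulan/MEKA implementation; variable names are assumptions):

```python
# Sketch of majority voting over the binary label predictions of the base
# learners. preds is a list of indicator matrices, one per base learner,
# each of shape (n_samples, n_labels).
import numpy as np

def majority_vote(preds):
    votes = np.sum(preds, axis=0)                 # how many learners switched each label on
    return (votes * 2 > len(preds)).astype(int)   # a label is on if more than half agree

# e.g. combining the predictions of the five base learners of Fig. 1:
# Y_ens = majority_vote([Y_br, Y_cc, Y_ct, Y_dbpnn, Y_bpnn])
```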


Figure 2. Comparison of results on the Scene dataset

Figs. 3 and 4 show the results of the multiple classifiers on the Llog and Music datasets, respectively. In contrast with Fig. 2, in Figs. 3 and 4 DBPNN had a good performance, so we kept it as one of the classifiers; however, it increases the running time.

Figure 3. Comparison of results on the Llog dataset

Figure 4. Comparison of results on the Music dataset

In Fig. 2, the CC classifier had the best performance in terms of accuracy and hence it was evaluated for the other criteria. Then we calculated the mean value of each metric for the mentioned datasets (see Figs. 2 and 3). We chose four parameters for the comparison of the proposed method with the base classifiers [1], [27], which were used in the IBR method. These parameters are RL, 1-Err, HL and Acc.

Figure 4. Comparison of the proposed method with the base classifiers on the Scene dataset

Figure 5. Comparison of the proposed method with the base classifiers on the Music dataset

Figure 6. Comparison of the proposed method with the base classifiers on the Llog dataset

In Figs. 4-6, we evaluated the proposed method, which consists of multi-label classifiers, with respect to the evaluation metrics. For most metrics, our results reach the best performance in comparison to the base classifiers. Basically, ensemble learning helps reduce the overfitting problem in most kinds of problems. In a previous work [7], we applied an ensemble learning method in which the classifiers were single-label


base learners, but in this paper we extend it, in line with the essence of our approach, to multi-label classifiers (instead of single-label classifiers) on small-sample data.

TABLE VI. Comparison of time on the Llog dataset (each row lists one value per base classifier)

IBR method:
  Build Time:   80.634, 47.361, 5.672, 4.828, 25.43
  Test Time:    1.484, 1.375, 1.406, 132.41, 28.352
  Total Time:   82.118, 48.736, 7.078, 137.238, 53.782
  Average Time: 65.79

Proposed method:
  Build Time:   1.06, 0.046, 0.078, 0.062, 6.049
  Test Time:    0, 0.025, 0.078, 0.094, 0.01
  Total Time:   1.06, 0.071, 0.156, 0.156, 6.050
  Average Time: 1.49

In Tables VI and VII, we compare the training time, the test time, and the total time of each classifier. The average time shows the average time of the classifiers in IBR and in our method. As expected, the running time of the algorithms decreased, as shown on two out of the three datasets.

TABLE VII. Comparison of time on the Music dataset (each row lists one value per base classifier)

IBR method:
  Build Time:   0.031, 0.047, 0.047, 6.642, 1.141
  Test Time:    0.01, 0.01, 0.016, 0.01, 0.01
  Total Time:   0.032, 0.048, 0.063, 6.643, 1.1422
  Average Time: 1.58

Proposed method:
  Build Time:   0, 0.328, 0.094, 0.078, 0.282
  Test Time:    0.25, 0.016, 0.14, 0.01, 0.01
  Total Time:   0.25, 0.344, 0.234, 0.079, 0.283
  Average Time: 0.23

V. CONCLUSION AND FUTURE DIRECTIONS

In this study, an ensemble learning method was presented in order to achieve the best performance for the multi-label classification of small samples. To this end, we replaced the single-label base learners with multi-label multiple classifiers. Datasets from various domains were selected and, after being converted to small-sample form in the preprocessing phase, the base-level classifiers and the proposed ensemble learning were applied to them. Four well-known parameters in the multi-label classification domain were used to compare the results. Experimental results showed that the proposed ensemble method leads to better results with respect to these metrics. As future work, applying other classification approaches such as semi-supervised learning is being considered.


ACKNOWLEDGMENT
We are grateful to Samaneh Kamali for useful comments on an earlier version of this research.


REFERENCES

[1] L. Michielan, L. Terfloth, J. Gasteiger, and S. Moro, "Comparison of multilabel and single-label classification applied to the prediction of the isoform specificity of cytochrome p450 substrates," Journal of Chemical Information and Modeling, vol. 49, no. 11, pp. 2588–2605, 2009.
[2] Y. Wei, W. Xia, J. Huang, B. Ni, J. Dong, Y. Zhao, and S. Yan, "CNN: Single-label to multi-label," arXiv preprint arXiv:1406.5726, 2014.
[3] Z. Yang and H. Ai, "Demographic classification with local binary patterns," Advances in Biometrics, pp. 464–473, 2007.
[4] J. H. Min and C. Jeong, "A binary classification method for bankruptcy prediction," Expert Systems with Applications, vol. 36, no. 3, pp. 5256–5263, 2009.
[5] F. Thabtah, P. Cowling, and Y. Peng, "MCAR: multi-class classification based on association rule," in Proceedings of the 3rd ACS/IEEE International Conference on Computer Systems and Applications, 2005, p. 33.
[6] G. Mujtaba, L. Shuib, R. G. Raj, R. Rajandram, K. Shaikh, and M. A. Al-Garadi, "Automatic ICD-10 multi-class classification of cause of death from plaintext autopsy reports through expert-driven feature selection," PLoS ONE, vol. 12, no. 2, p. e0170242, 2017.
[7] A. Mahdavi-Shahri, M. Houshmand, M. Yaghoobi, and M. Jalali, "Applying an ensemble learning method for improving multi-label classification performance," in International Conference of Signal Processing and Intelligent Systems (ICSPIS), 2016, pp. 1–6.
[8] A. M. Santos, "Using semi-supervised learning in multi-label classification problems," pp. 10–15, 2012.
[9] S. Kanj, F. Abdallah, T. Denœux, and K. Tout, "Editing training data for multi-label classification with the k-nearest neighbor rule," Pattern Analysis and Applications, vol. 19, no. 1, pp. 145–161, 2016.
[10] M. A. Tahir, J. Kittler, and A. Bouridane, "Multilabel classification using heterogeneous ensemble of multi-label classifiers," Pattern Recognition Letters, vol. 33, no. 5, pp. 513–523, 2012.
[11] B. Jin, B. Muller, C. Zhai, and X. Lu, "Multi-label literature classification based on the Gene Ontology graph," BMC Bioinformatics, vol. 9, no. 1, p. 525, 2008.
[12] G. Qi, J. Tang, and T. Mei, "Correlative multi-label video annotation," pp. 17–26, 2007.
[13] R. E. Schapire and Y. Singer, "BoosTexter: A boosting-based system for text categorization," Machine Learning, vol. 39, no. 2–3, pp. 135–168, 2000.
[14] S. Ji and J. Ye, "Generalized linear discriminant analysis: a unified framework and efficient model selection," IEEE Transactions on Neural Networks, vol. 19, no. 10, pp. 1768–1782, 2008.
[15] W. Jia, D. Zhao, and L. Ding, "An optimized RBF neural network algorithm based on partial least squares and genetic algorithm for classification of small sample," Applied Soft Computing, vol. 48, pp. 373–384, 2016.
[16] C. Shi, X. Kong, P. S. Yu, and B. Wang, "Multi-label ensemble learning," in Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2011, pp. 223–239.
[17] N. Li, Y. Jiang, and Z. Zhou, "Multi-label selective ensemble," in International Workshop on Multiple Classifier Systems, 2015, pp. 76–88.
[18] J. Read, B. Pfahringer, and G. Holmes, "Classifier chains for multi-label classification," pp. 333–359, 2011.
[19] J. Read, "MEKA: A multi-label/multi-target extension to Weka," Journal of Machine Learning Research, vol. 17, pp. 1–5, 2016.
[20] H. Modi, "Experimental comparison of different problem transformation methods for multi-label classification using MEKA," vol. 59, no. 15, pp. 10–15, 2012.
[21] J. Read, B. Pfahringer, G. Holmes, and E. Frank, "Classifier chains for multi-label classification," Machine Learning and Knowledge Discovery in Databases, pp. 254–269, 2009.
[22] J. Read and P. Reutemann, "MEKA: a multi-label extension to WEKA," URL: http://meka.sourceforge.net, vol. 17, pp. 1–5, 2012.
[23] J. Read, "Scalable multi-label classification," PhD Thesis, University of Waikato, 2010.


[24] M.-L. Zhang and K. Zhang, "Multi-label learning by exploiting label dependency," in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2010, pp. 999–1008.
[25] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, et al., "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[26] J. Demsar, T. Curk, A. Erjavec, C. Gorup, et al., "Orange: data mining toolbox in Python," Journal of Machine Learning Research, vol. 14, pp. 2349–2353, 2013.
[27] G. Tsoumakas and I. Vlahavas, "Random k-labelsets: An ensemble method for multi-label classification," in Proceedings of the 18th European Conference on Machine Learning (ECML 2007), Warsaw, Poland, 2007, pp. 406–417.

1713