Multi-Label Classification for Mining Big Data

Passent M. El-Kafrawy
Math and Computer Science Dept.
Faculty of Science, Menofia University
Shebin Elkom, Egypt
[email protected]fia.edu.eg

Amr M. Sauber
Math and Computer Science Dept.
Faculty of Science, Menofia University
Shebin Elkom, Egypt
[email protected]

Awad Khalil
Dept. of Computer Science and Engineering
American University in Cairo
Egypt
[email protected]
Abstract—Mining big data requires special handling of the problem under investigation to achieve accuracy and speed at the same time. In this research we investigate multi-label classification problems with the aim of achieving better accuracy in a timely fashion. Label dependencies are the largest factor influencing performance, both directly and indirectly, and are what distinguish multi-label from multi-class problems. The key objective in multi-label learning is to exploit this dependency effectively. Most current research either ignores the correlations between labels or develops complex algorithms that do not scale efficiently to large datasets. Hence, the goal of our research is to propose a fundamental solution in which dependencies and correlations between labels are identified explicitly from large multi-label datasets. This is done before any classifiers are induced, by applying an association rule mining algorithm. The dependencies discovered in this step are then used to divide the problem into subsets, according to the correlations between labels, for parallel classification. The experimental results were evaluated using Accuracy, Hamming Loss, Micro F-Measure and Subset Accuracy on a variety of datasets. The proposed model exploits all correlations among labels in multi-label datasets easily, facilitating the process of multi-label classification and increasing both accuracy and performance, since classification time decreased while higher accuracy was achieved.

Keywords-Multi-label classification; data mining; big data analytics
I. INTRODUCTION

Data is nowadays acquired in huge amounts from different sources at a fast pace. These huge amounts of data are considered the main asset of a business; however, the value of such big data is only realized when knowledge can be deduced from it intelligently and in a timely fashion. Current research seeks to exploit new techniques to handle such amounts of data and explores ways to analytically find patterns in the data, deducing knowledge that adds higher value to business intelligence ecosystems. One of the major concerns in many classification problems is that they are multi-label rather than multi-class. When each item can be associated with multiple labels of a certain classification system, the task is known as multi-label classification. A multitude of other sources have also, knowingly or not, embraced the multi-label context. Domains such as microbiology or medicine often inherently require a multi-label scheme: a single gene may influence the production of more than one protein, and a patient's symptoms may be linked to multiple ailments. This explains the explosion of interest in multi-label classification in the academic literature over recent years.

Tsoumakas and Katakis [1] categorized multi-label classification methods into two main categories: problem transformation and algorithm adaptation. In problem transformation, a multi-label problem is transformed into one or more single-label problems. This scheme uses common off-the-shelf single-label classifiers and thus avoids the restrictions of a particular classification paradigm. This is opposed to algorithm adaptation, where a specific classifier is modified to carry out multi-label classification; such methods are often highly suited to specific domains or contexts but are less flexible and have a high computational complexity. Recently, Dembczynski et al. [2] provided a clarification and formalization of label dependence in multi-label classification. According to them, one must distinguish between unconditional and conditional label dependence. Modeling unconditional dependencies is good enough for solving multi-label classification problems and improves performance; in contrast, modeling conditional dependencies does not improve the predictive performance of the classifier as such [3]. Exploiting label dependence has become a popular motivation in recent research, and several multi-label learning methods have been developed along these lines [4]. Usually, label correlations are given beforehand or can be derived directly from the data samples by counting their label co-occurrences. However, in some problem domains this type of knowledge is unfortunately unavailable, and therefore new techniques are required to obtain this valuable knowledge. Accordingly, the goal of our research is to propose a fundamental solution in which dependencies and correlations between labels are identified explicitly from large multi-label datasets. This is done before any classifiers are induced, by applying an association rule mining algorithm. The dependencies discovered in this step are then used to construct a multi-label classifier. The problem is thus divided into subproblems for parallel classification, which improves both execution time and classification accuracy. This allows us to exploit all dependencies between labels easily and
in a timely fashion, facilitating the multi-label classification process. Consequently, this allows us to achieve increased efficiency, accuracy, and speed, which is the main goal of our research.

This paper is organized as follows. Section II defines the problem under investigation and presents the related work. Section III explains our proposed method, aimed at improving both the performance and the accuracy of multi-label classifiers. The experimental results are illustrated and discussed in Section IV. Finally, Section V summarizes the conclusion and the future work.

II. PROBLEM DEFINITION

Before defining the problem, related concepts and notation are introduced. The basic definition of multi-label systems needs to be clarified in order to clearly formulate the main problem under investigation, after which recent related work is presented.

A. Multi-label Classification

Our learning model is the following standard extension of the binary case. Assume that an instance x ∈ X can be associated with a subset of labels y, referred to as the relevant set of labels for x, for which we use subset notation y ⊆ [L] or vector notation y ∈ {0, 1}^L, as dictated by convenience. Assume Y is a given set of predefined binary labels Y = {λ1, ..., λL}. For a given set of labeled examples D = {x1, x2, ..., xn}, the goal of the learning process is to find a classifier h : X → Y which maps an object x ∈ X to the set of its classification labels, such that h(x) ⊆ {λ1, ..., λL} for all x ∈ X. The main feature distinguishing multi-label classification from a regular classification task is that a number of labels have to be predicted simultaneously. Thus, exploiting potential dependencies between labels is important and may improve classifier predictive performance.

In problem transformation, a multi-label problem is transformed into one or more single-label problems; the single-label problems are solved with a commonly used single-label classification approach, and the output is transformed back into a multi-label representation. In this category, the common methods are the label power-set [1], [5], binary relevance [1] and pair-wise methods [6], [7]. Two notable problem transformation methods are the pruned sets method [8] and the classifier chains method [9], together with ensemble schemes for both. These methods put a heavy emphasis on efficiency, achieve better predictive performance than other methods, and can scale to large datasets. For multi-label datasets with a large number of labels, the Hierarchy Of Multi-label Classifiers (HOMER) method [10] was proposed, as well as an ensemble method called RAKEL [5].
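For concreteness, the binary relevance and label power-set transformations can be sketched on a toy label matrix as follows; the data and variable names below are purely illustrative assumptions, not taken from the datasets used later in this paper.

import numpy as np

# Toy multi-label dataset: 4 instances, 3 labels (illustrative values only).
# Each row is the binary label vector y of one instance.
Y = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 1],
              [0, 0, 1]])

# Binary relevance: each label column becomes the target of a separate
# single-label (binary) classifier.
binary_relevance_targets = [Y[:, j] for j in range(Y.shape[1])]

# Label power-set: every distinct label combination becomes one class,
# so a single multi-class classifier can be trained on these targets.
label_powerset_targets = [tuple(row) for row in Y.tolist()]

print(binary_relevance_targets)   # three binary target vectors, one per label
print(label_powerset_targets)     # one class per observed label combination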
B. Related Work

Few works on multi-label learning directly discover the existing dependencies among labels in advance, before any classifiers are induced. In [3], the authors proposed the ConDep and ChiDep algorithms, which explicitly identify conditionally and unconditionally dependent label pairs in a training set. The method is based on analyzing the number of instances in each category: the chi-square test for independence is applied to the instance counts of each possible combination of two categories, and the level of dependence between every two labels in the dataset is thereby identified. The original set of labels is then decomposed into several subsets of dependent labels, an LP classifier is built for each subset, and the classifiers are combined as in the BR method. The results confirm that modeling unconditional dependencies is good enough for solving multi-label classification problems. The disadvantages of the ConDep and ChiDep algorithms are their complexity and the need to improve the time performance of ChiDep.

In another research effort [11], heterogeneous information networks are used to facilitate the multi-label classification process by mining the correlations among labels from such networks. The authors proposed a novel solution, called PIPL, to assign a set of candidate labels to a group of related instances in the networks. Unlike previous work, PIPL can exploit various types of dependencies among both instances and labels based upon different meta-paths in heterogeneous information networks. Empirical studies on real-world tasks show that it effectively boosts classification performance, but it was only tested on bioinformatics datasets, which form a heterogeneous network.
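To illustrate the pairwise dependence test on which ChiDep [3] is based, the following minimal sketch applies a chi-square test of independence to the co-occurrence counts of two labels; the toy label columns and the use of scipy.stats are illustrative assumptions rather than the implementation of [3].

import numpy as np
from scipy.stats import chi2_contingency

# Toy binary columns for two labels over 8 instances (illustrative only).
label_a = np.array([1, 1, 0, 1, 0, 1, 0, 0])
label_b = np.array([1, 1, 0, 1, 0, 0, 0, 1])

# 2x2 contingency table of instance counts for every combination of the
# two labels: (a=1,b=1), (a=1,b=0), (a=0,b=1), (a=0,b=0).
table = np.array([
    [np.sum((label_a == 1) & (label_b == 1)), np.sum((label_a == 1) & (label_b == 0))],
    [np.sum((label_a == 0) & (label_b == 1)), np.sum((label_a == 0) & (label_b == 0))],
])

chi2, p_value, dof, expected = chi2_contingency(table)
print(table, chi2, p_value)   # a small p-value suggests the label pair is dependent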
III. PROPOSED ALGORITHM

We shall study the multi-label learning process by mining the correlations among labels with association rules. This is achieved by exploiting the correlations between labels to divide the dataset into two subsets: the items with unconditional dependencies and the ones with conditional dependencies. Each subset is fed into a parallel framework in the proposed algorithm to further speed up the classification process.

A. Discovering dependencies between labels

There are many multi-label learning methods for handling multi-label datasets, and each of them deals with the relations between labels in a different way. These relations between labels are necessary to facilitate the learning process. However, there may be disparity in the degree of correlation between labels: some labels may be highly correlated, others moderately correlated, and some labels may not be correlated at all. Multi-label learning methods might handle labels either by ignoring the correlations completely (which does not allow us to exploit the dependencies between labels) or by considering the correlations among all labels holistically (which increases the complexity of the learning process). To overcome these problems we shall propose a solution based on the idea that every problem is treated as a unique case, where each set of labels is explored separately.

First, we shall discover the correlations between labels, which is achieved by using association rules. Association rule mining [12] is a branch of data mining that focuses on the discovery of associations and correlations among items in large transactional or relational data sets. It provides knowledge about the system underlying a set of data and can be interpreted as implications, so that the presence of certain elements implies the occurrence of others. The label set is thus split as

Y = Y_u ∪ Y_c, |Y_u| ≥ 0, |Y_c| ≥ 0,    (1)

where Y is the set of all labels, Y_u the set of uncorrelated labels, and Y_c the set of correlated labels.

In the second step, we use a divide-and-conquer technique to divide the dataset into sub-datasets, depending on the degree of correlation among labels. Basically, we divide the dataset into two datasets: one that has correlated labels and the other that has uncorrelated labels. In the third step, we apply a multi-label classifier to each data subset, considering each subset as a distinct problem solved by itself. Finally, the results of all data subsets are integrated by combining them with an average function. A pseudo-code for the proposed method is given in Algorithm III-B.

B. Exploiting unconditional dependencies

Exploiting unconditional dependencies is done by applying the apriori algorithm to the labels of the multi-label dataset. We obtain all associations and correlations among labels that satisfy a minimum support count, i.e. the number of transactions that contain the labels. Accordingly, the correlated labels are pointed out, and the rest of the dataset consists of items with uncorrelated labels. The rule A ⇒ B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪ B; this is taken to be the probability P(A ∪ B). The rule A ⇒ B has confidence c in the transaction set D, where c is the percentage of transactions in D containing A that also contain B; this is taken to be the conditional probability P(B|A). That is,
Support(A ⇒ B) = P(A ∪ B)    (2)

Confidence(A ⇒ B) = P(B|A) = support(A ∪ B) / support(A)    (3)
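For illustration, the support of equation (2) and the confidence of equation (3) can be computed directly from a binary label matrix, as in the following minimal sketch; the toy matrix and the example rule between labels 0 and 2 are assumptions made purely for demonstration.

import numpy as np

# Toy label matrix: rows are instances (transactions), columns are labels.
Y = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 1],
              [1, 0, 1]])

def support(cols):
    # Fraction of transactions in which all labels in `cols` occur together, Eq. (2).
    return np.all(Y[:, cols] == 1, axis=1).mean()

def confidence(antecedent, consequent):
    # Confidence of the rule antecedent => consequent, Eq. (3).
    return support(antecedent + consequent) / support(antecedent)

# Candidate rule: label 0 => label 2
print(support([0, 2]))        # 0.75 -> Support(A => B)
print(confidence([0], [2]))   # 1.0  -> Confidence(A => B)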
Rules that satisfy both a minimum support threshold (min-sup) and a minimum confidence threshold (min-conf) are called strong. Strong rules are used to define the items that correlate. Thus the problem of mining association rules can
be reduced to that of mining frequent itemsets. In general, association rule mining can be viewed as a two-step process:
1) Find all frequent itemsets by applying the apriori algorithm to the labels; each of these itemsets occurs at least as frequently as a predetermined minimum support count, min-sup.
2) Generate strong association rules from the frequent itemsets, i.e. rules satisfying minimum support and minimum confidence.

Algorithm: A pseudo-code for the proposed method.
P = Apriori(L)    # return related labels
C = []
For (i = 0; i
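To make the above steps concrete, the following is a minimal end-to-end sketch of the pipeline described in Section III: a level-2 apriori pass over the label columns, a split into correlated and uncorrelated label subsets, parallel classification of each subset, and recombination of the predictions. The toy data, the helper names (find_correlated_labels, train_subset) and the use of per-label logistic regression as a stand-in base classifier are illustrative assumptions only, not the authors' implementation.

import numpy as np
from itertools import combinations
from concurrent.futures import ThreadPoolExecutor
from sklearn.linear_model import LogisticRegression

# Toy data (illustrative only): 6 instances, 2 features, 4 labels.
X = np.array([[0.1, 1.0], [0.9, 0.2], [0.8, 0.1],
              [0.2, 0.9], [0.7, 0.3], [0.3, 0.8]])
Y = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0],
              [1, 1, 0, 1]])

def find_correlated_labels(Y, min_sup=0.3):
    # Level-2 apriori pass over the label columns: keep every label that
    # appears in at least one pair whose co-occurrence support >= min_sup.
    correlated = set()
    for i, j in combinations(range(Y.shape[1]), 2):
        if np.mean((Y[:, i] == 1) & (Y[:, j] == 1)) >= min_sup:
            correlated.update((i, j))
    return sorted(correlated)

def train_subset(X, Y_sub):
    # Stand-in base learner: one logistic regression per label column of the
    # subset (any off-the-shelf multi-label classifier could be used instead).
    return [LogisticRegression().fit(X, Y_sub[:, j]) for j in range(Y_sub.shape[1])]

correlated = find_correlated_labels(Y)                       # [0, 1, 3] for this toy data
uncorrelated = [j for j in range(Y.shape[1]) if j not in correlated]
subsets = [s for s in (correlated, uncorrelated) if s]       # drop empty subsets

# Treat each label subset as a separate problem and train the classifiers in parallel.
with ThreadPoolExecutor() as pool:
    models = list(pool.map(lambda s: train_subset(X, Y[:, s]), subsets))

# Combine: write each subset's predictions back into a single label matrix.
pred = np.zeros_like(Y)
for s, clfs in zip(subsets, models):
    for col, clf in zip(s, clfs):
        pred[:, col] = clf.predict(X)
print(pred)

In this sketch each label belongs to exactly one subset, so the combination step simply writes the subset predictions back column by column; the averaging function mentioned in Section III would only come into play when subsets overlap or when ensemble outputs for the same label must be merged.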