Hierarchical Multi-label Classification Problems: an LCS Approach

Luiz Melo Romão¹ and Julio César Nievola²

¹ Universidade da Região de Joinville, Departamento de Informática, Joinville, Brasil ([email protected])
² Pontifícia Universidade Católica do Paraná, PPGIA, Curitiba, Brasil ([email protected])

Abstract. Traditional classification tasks deal with assigning instances to a single label. However, in some real-world databases the classes are structured in a hierarchy, and instances can have their classes associated with two or more paths in the hierarchical structure. Such situations are referred to as hierarchical multi-label classification problems. The purpose of this paper is to explore the concept of hierarchical multi-label classification problems and to present a solution based on Learning Classifier Systems (LCS) to solve this kind of problem. The proposed Hierarchical Learning Classifier System Multi-label (HLCS-Multi) presents a comprehensive solution to hierarchical multi-label classification problems, building a global classifier to predict all classes in the application domain.

Keywords: Hierarchical Multi-label Classification Problems, Learning Classifier Systems, Protein Function

1 Introduction

According to [1], the traditional computational approach to automated classification assumes that each object should be assigned to exactly one out of two or more classes. However, some real-world applications deviate from this generic scenario in two important ways. First, each example can belong to several classes simultaneously (multi-label classification). Second, the classes can be hierarchically ordered, in the sense that some are more specific versions of others (hierarchical classification). Such situations are referred to as hierarchical multi-label classification problems. The task has recently received considerable attention: databases in various fields, including text categorization, web content search, image annotation, digital libraries and functional genomics (the focus of this work), are known to be organized as hierarchies.

The purpose of this paper is to explore the concept of hierarchical multi-label classification problems and to present a solution based on Learning Classifier Systems (LCS) to solve this kind of problem. Conceived in 1975 by John Holland [2], the Learning Classifier System (LCS) consists of a set of rules called classifiers. As defined in [3], the LCS develops a model of intelligent decision-making using two biological metaphors, evolution and learning, where learning guides the evolutionary component to move in the direction of the best rules. The proposed approach, called HLCS-Multi (Hierarchical Learning Classifier System Multi-label), will be used in this work for predicting protein functions.

The remainder of this paper is organized as follows: Section 2 discusses the hierarchical multi-label classification concept and how to distinguish hierarchical problems. The HLCS-Multi architecture is described in Section 3. Section 4 presents the computational results achieved, and Section 5 presents the conclusions of this study and possible directions for future research.

2 Hierarchical Multi-label Classification

According to [4], traditional classification tasks deal with assigning instances to a single label. In multi-label classification, the task is to find the set of labels that an instance can belong to, rather than assigning a single label to a given instance. For example, a medical patient may suffer from more than one health condition: diabetes, high blood pressure and high cholesterol. Hierarchical classification is a variant of traditional classification in which the task is to assign instances to a set of labels that are related through a hierarchical classification scheme. In this case, when an instance is labeled with a certain class, it should also be labeled with all of its superclasses.

Hierarchical classification problems can be differentiated according to the organization of the class structure (tree or DAG) and according to the type of algorithmic approach used (local or global). In the local approach, the model trains a binary classifier for each node of the class hierarchy. In this case, it is necessary to use N independent local classifiers, one for each class except the root node. Therefore, the number of classifiers to be trained can be very large when there are many classes. Moreover, the local approach can produce inconsistent results, because there is no guarantee that the class hierarchy will be respected. In the global approach, a single classification model is built from the training set, taking into account the hierarchy of classes as a whole during a single execution of the classification algorithm. Because the global approach maintains the hierarchical relationships between classes during the training and testing phases, the outcome of the prediction is easier to understand.

Therefore, a problem is said to be hierarchical multi-label when, besides the classes being structured in a hierarchy, instances can have their classes associated with two or more paths in the hierarchical structure. Accordingly, a hierarchical multi-label solution is one in which the algorithm can potentially predict multiple paths in the class hierarchy.
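To make the hierarchy constraint concrete, the sketch below is a minimal illustration (not part of the HLCS-Multi code) that assumes the class hierarchy is given as a child-to-parents map of a DAG: it expands a label set with every ancestor class, so that labeling an instance with a class also labels it with all of its superclasses.

```python
def ancestor_closure(labels, parents):
    """Expand a set of class labels with all of their ancestors in a DAG.

    labels  : iterable of class identifiers assigned to one instance
    parents : dict mapping each class to the list of its direct parents
              (the root has no entry, or an empty list)
    """
    closed = set()
    stack = list(labels)
    while stack:
        label = stack.pop()
        if label in closed:
            continue
        closed.add(label)
        stack.extend(parents.get(label, []))
    return closed


# Hypothetical toy hierarchy, for illustration only.
parents = {
    "b": ["root"],
    "c": ["root"],
    "d": ["b", "c"],   # in a DAG a node may have more than one parent
}

# A hierarchically consistent labeling must contain b, c and root whenever d is assigned.
print(ancestor_closure({"d"}, parents))   # {'d', 'b', 'c', 'root'}
```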

3 HLCS-Multi

The Hierarchical Learning Classifier System Multi-label (HLCS-Multi) algorithm proposed in this paper uses the Learning Classifier System (LCS) as its development model. Solutions to hierarchical problems using LCS have already been exploited in another algorithm previously proposed by the authors: in [5], the HLCS-DAG is presented, which also uses a global approach and is able to work with databases structured as DAGs. The HLCS-Multi is an evolution of this work and presents a comprehensive solution to hierarchical multi-label classification problems, building a global classifier to predict all classes in the application domain.

In order to work with the class hierarchy, the HLCS-Multi presents a specific component for this task, the classifier evaluation component. This component has the task of analyzing the predictions of the classifiers considering the class hierarchy. In addition, the HLCS-Multi architecture consists of the following modules, which interact internally: the population of classifiers, the performance component, the credit assignment component and the GA component. The main differences between the HLCS-Multi components and the previous version are detailed below.

The HLCS-Multi starts its execution by analyzing the data hierarchy in the training base. The data is read from a training base of dimension m, consisting of samples represented by Btrain = [v1, v2, v3, ..., vN], where N is the total number of instances. Each instance of Btrain is characterized by its attributes, vi = [a1, a2, ..., aj, ..., ak], where aj is the j-th attribute, 1 ≤ j < k, and ak represents the class attribute. As this is a multi-label problem, the class attribute ak can be formed by one or more classes. While reading the data, the algorithm also learns the hierarchy of classes H that will be used by the HLCS-Multi, represented in the database as follows: H = (root/des1, root/des2, ..., root/desx, des1/des3, ..., desx/desy), where root represents the root node, des represents a descendant node, A/B indicates that A is the parent of B, and x and y index the A/B relations. Next, the instances of the training base are decomposed. This process transforms the multi-label instances into a set of single-label instances, as shown in Table 1.

Table 1. Decomposition of a multi-label instance

Instance  Attrib1  ...  Attribn  Class
1         1        ...  3        GO0003674 @ GO0005624
1.1       1        ...  3        GO0003674
1.2       1        ...  3        GO0005624

In the example, instance 1 is composed of two classes; after decomposition, this instance is replaced by two new instances (1.1 and 1.2). The attribute values of the generated single-label instances are kept equal to the attributes of the multi-label instance, preserving the hierarchical knowledge.

With the new training set defined, the creation of the initial population of classifiers starts. In the HLCS-Multi, the size of the population (SizePop) is determined in the initial settings by a percentage (Perc_Pop) of the number of instances in the training base (Total_Instance), according to Equation 1.

SizePop = Total_Instance * Perc_Pop / 100    (1)
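As an illustration of the preprocessing steps just described, the following sketch is one possible interpretation of the text, not the authors' code: it parses the A/B relations of H into a parent map, decomposes a multi-label instance into single-label copies as in Table 1 (using the '@' separator shown there), and computes SizePop from Equation 1. The helper names and the 10% setting in the usage example are assumptions.

```python
def parse_hierarchy(relations):
    """Build a child -> parents map from 'A/B' strings, where A is the parent of B."""
    parents = {}
    for rel in relations:
        parent, child = rel.split("/")
        parents.setdefault(child, []).append(parent)
    return parents


def decompose(instance):
    """Split one multi-label instance into single-label copies (cf. Table 1).

    instance: (attributes, class_string), with classes joined by ' @ '.
    """
    attributes, class_string = instance
    return [(attributes, label.strip()) for label in class_string.split("@")]


def population_size(total_instances, perc_pop):
    """Equation 1: SizePop = Total_Instance * Perc_Pop / 100 (rounded down here;
    the paper does not state how fractional values are handled)."""
    return int(total_instances * perc_pop / 100)


H = ["root/des1", "root/des2", "des1/des3"]
print(parse_hierarchy(H))   # {'des1': ['root'], 'des2': ['root'], 'des3': ['des1']}
print(decompose(([1, 3], "GO0003674 @ GO0005624")))
print(population_size(2473, 10))   # e.g. Perc_Pop = 10 on the Cellcycle training base
```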

Only exclusive classifiers are added to the initial population; the set of all classifiers forms the prediction model. Each classifier Ci (0 < i ≤ SizePop) of the HLCS comprises a set of t conditions (where t is the number of attributes of a training instance), the class value and the classifier quality measure, according to Equation 2.

Ci = (([Cond0] and ... and [Condt]) (VClass) (Qclassifier))    (2)

Each condition has three parameters, OP, VL and A/I, where OP is the relational operator (= or !=), VL is the condition value, and A/I marks the condition as active or inactive, which determines whether the condition will be used in the classifier or not. In order to form each classifier, the HLCS-Multi randomly chooses an instance Instance_i (0 < i ≤ N) of the training base as a model. For each attribute of the training instance, a condition in the classifier is created. At the beginning, the conditions start with the relational operator (OP) "=". The condition value (VL) receives the attribute value of the training instance, and whether the condition will be active (A) or inactive (I) is randomly determined. The HLCS-Multi runs only on databases with nominal attributes; for databases with continuous attributes, it is necessary to use some discretization method. Each classifier created carries in its structure the definition of a rule (IF-THEN) that will be used for prediction.

After the conditions of the classifier are defined, it is necessary to determine the prediction class assigned to the classifier. For this, the classifier is matched against all instances of the training base, and each instance covered by the classifier scores its class. After this analysis, the class scored the most times is chosen as the class of the classifier. The last step in the creation of the initial population is defining the quality of the classifier. To calculate the quality of the classifier, two factors are considered: the percentage of positive classes predicted and the hierarchical control evaluation of the classifier. The hierarchical control evaluation represents the predictive ability of the classifier, considering not only the class in question but all of its antecedent classes in the hierarchy. This process is performed by the evaluation component, which is essential for the HLCS-Multi to solve problems with hierarchical structures. The steps to calculate the quality of the classifier and the functions of the evaluation component, the performance component, the credit assignment component and the GA component follow the same definitions shown in [5].
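One possible reading of this classifier structure is sketched below; it is illustrative only and not the published HLCS-Multi implementation. Each condition stores the (OP, VL, A/I) triple, a classifier covers an instance when all of its active conditions hold, and the classifier's class is the one scored most often among the covered training instances.

```python
import random
from collections import Counter
from dataclasses import dataclass


@dataclass
class Condition:
    op: str        # OP: relational operator, "=" or "!="
    value: object  # VL: the condition value
    active: bool   # A/I: whether the condition is used by the classifier

    def holds(self, attribute_value):
        if not self.active:
            return True  # inactive conditions never restrict the match
        if self.op == "=":
            return attribute_value == self.value
        return attribute_value != self.value


@dataclass
class Classifier:
    conditions: list          # one Condition per attribute of the training instance
    v_class: str = None       # VClass: the predicted class
    quality: float = 0.0      # Qclassifier: quality measure (see [5])

    def covers(self, attributes):
        return all(c.holds(a) for c, a in zip(self.conditions, attributes))


def create_classifier(model_instance, training_base):
    """Seed a classifier from a randomly chosen training instance, then assign
    the class scored most often among the instances it covers."""
    attributes, _ = model_instance
    conditions = [Condition("=", value, random.choice([True, False]))
                  for value in attributes]
    clf = Classifier(conditions)
    counts = Counter(label for attrs, label in training_base if clf.covers(attrs))
    clf.v_class = counts.most_common(1)[0][0] if counts else None
    return clf
```

The quality value would then combine the percentage of correctly predicted classes with the hierarchical evaluation described above, following the definitions in [5].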

After the credit assignment step, a new population of classifiers is generated. This population, called the final population, constitutes the final prediction model of the HLCS-Multi. It contains the winners of each competition, i.e., the classifiers that predicted correctly or partially correctly the class of the instance chosen at the beginning of the competition. This final population does not have a predefined number of classifiers; its size depends on the learning power of the model. To define the number of learning executions, each classifier inserted into the final population is compared with all instances of the training base. All instances that are covered by the classifier and whose real class is correct or partially correct with respect to the class predicted by the classifier are excluded from the training base. After this, the learning process is restarted until a minimum percentage of instances is covered by the classifiers of the final population. This process maintains consistency between the classifiers sent to the final population and the hierarchical characteristic of the instances of the training base. Every classifier inserted into the final population is excluded from the initial population. When this occurs, a new classifier is generated and included in the initial population to keep the size of the population constant.
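The training loop just described resembles a sequential-covering scheme. The outline below is a hedged sketch under that reading: run_competition and prediction_is_acceptable are hypothetical helpers standing in for the competition and the hierarchical correctness test, and the 90% minimum coverage is an assumed default, not a value taken from the paper.

```python
def build_final_population(training_base, run_competition, prediction_is_acceptable,
                           min_coverage=0.9):
    """Sequential-covering outline of the final-population loop.

    run_competition(instances)           -> winning Classifier (hypothetical helper)
    prediction_is_acceptable(clf, label) -> True when the prediction is correct or
                                            partially correct w.r.t. the hierarchy
    min_coverage                         -> assumed stopping threshold
    """
    final_population = []
    remaining = list(training_base)
    original_size = len(training_base)

    # Repeat until the required fraction of the original base is covered.
    while len(remaining) > original_size * (1.0 - min_coverage):
        winner = run_competition(remaining)
        final_population.append(winner)
        # Drop every instance the winner covers with an acceptable prediction.
        remaining = [(attrs, label) for attrs, label in remaining
                     if not (winner.covers(attrs)
                             and prediction_is_acceptable(winner, label))]
    return final_population
```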

4 Computational Results

In order to demonstrate the results obtained with the HLCS-Multi, the algorithm was compared with another version, the HLCS-DAG, proposed by the authors in [5]. The main difference between these versions is that the HLCS-DAG does not present a complete solution for multi-label databases. The results of the HLCS-Multi were also compared with the Clus-HMC approach proposed in [6].

Table 2. Summary of the data sets used in our experiments: the data set name, the number of training examples, the number of test examples, the number of attributes and the number of classes in the class hierarchy.

Data Set   Training  Test  Attributes  Classes
Cellcycle  2473      1278  77          4126
Church     1627      1278  27          4126
Derisi     2447      1272  63          4120
Expr       2485      1288  551         4132
Pheno      1005      581   69          3128
Spo        2434      1263  80          4120

We have selected six bioinformatics data sets from [6], described in the Gene Ontology (GO) and structured as a DAG. The different data sets describe different aspects of the genes in the yeast genome. They include five types of bioinformatics data: sequence statistics, phenotype, secondary structure, homology and expression. For the evaluations in this paper, the algorithms were executed using the training and test sets, retaining the same examples available at http://dtai.cs.kuleuven.be/clus/hmcdatasets. Table 2 presents the details of the bases used in this experiment.

In order to evaluate the algorithms we have used the metrics of hierarchical precision (hP), hierarchical recall (hR) and hierarchical F-measure (hF) proposed by [7]. These measures are, in fact, extended versions of the well-known precision, recall and F-measure, tailored to the scenario of hierarchical classification.

According to the results of the first test, the HLCS-Multi showed significantly better results than the HLCS-DAG on all bases analyzed. The GO bases are very complex: some examples have up to 22 classes, and the total number of classes exceeds 4100 in most bases. Nevertheless, the HLCS-Multi, through its hierarchical and multi-label solutions, proved to be more suitable for this type of problem. The results of the comparison between the HLCS-Multi and HLCS-DAG models are shown in Table 3.
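For reference, the hierarchical measures of [7] used in Table 3 can be computed as in the sketch below, where both the predicted and the true label sets are first extended with all of their ancestor classes; the helper mirrors the ancestor-closure example from Section 2 and is an illustration rather than the authors' evaluation code.

```python
def ancestor_closure(labels, parents):
    """Expand a label set with all of its ancestors in the class DAG."""
    closed, stack = set(), list(labels)
    while stack:
        label = stack.pop()
        if label not in closed:
            closed.add(label)
            stack.extend(parents.get(label, []))
    return closed


def hierarchical_prf(predicted_sets, true_sets, parents):
    """Hierarchical precision, recall and F-measure in the sense of [7].

    predicted_sets, true_sets: one label set per test instance.
    parents: child -> list of direct parents in the class DAG.
    """
    inter = pred_total = true_total = 0
    for pred, true in zip(predicted_sets, true_sets):
        pred_aug = ancestor_closure(pred, parents)   # predicted classes plus ancestors
        true_aug = ancestor_closure(true, parents)   # true classes plus ancestors
        inter += len(pred_aug & true_aug)
        pred_total += len(pred_aug)
        true_total += len(true_aug)
    h_p = inter / pred_total if pred_total else 0.0
    h_r = inter / true_total if true_total else 0.0
    h_f = 2 * h_p * h_r / (h_p + h_r) if (h_p + h_r) else 0.0
    return h_p, h_r, h_f
```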

Table 3. Hierarchical precision (hP), recall (hR) and F-measure (hF) values (mean ± standard deviation) calculated over 10 runs. An entry in the hF column is shown in bold if the hierarchical F-measure value obtained by one of the methods was significantly greater than that of the other method, according to a Wilcoxon test with 95% confidence.

HLCS-DAG
Data Set   hP             hR             hF
Cellcycle  0.2207 ± 0.03  0.1463 ± 0.04  0.1759 ± 0.03
Church     0.2848 ± 0.05  0.1469 ± 0.06  0.1939 ± 0.04
Derisi     0.2208 ± 0.02  0.1029 ± 0.03  0.1404 ± 0.02
Expr       0.1564 ± 0.04  0.0889 ± 0.03  0.1134 ± 0.03
Pheno      0.2303 ± 0.05  0.0730 ± 0.04  0.1109 ± 0.05
Spo        0.1789 ± 0.03  0.1645 ± 0.03  0.1714 ± 0.04

HLCS-Multi
Data Set   hP             hR             hF
Cellcycle  0.2611 ± 0.04  0.4209 ± 0.02  0.3223 ± 0.03
Church     0.2259 ± 0.06  0.6183 ± 0.03  0.3309 ± 0.04
Derisi     0.2948 ± 0.03  0.5655 ± 0.03  0.3873 ± 0.03
Expr       0.3254 ± 0.05  0.3568 ± 0.03  0.3109 ± 0.04
Pheno      0.2962 ± 0.04  0.7444 ± 0.04  0.4281 ± 0.05
Spo        0.2975 ± 0.03  0.3992 ± 0.04  0.3410 ± 0.04

The second test was conducted against the Clus-HMC algorithm. Clus-HMC is a hierarchical classification model based on the decision tree method. In this test, the evaluation was hampered by the fact that the authors of Clus-HMC publish and make available with the algorithm only results based on binary measures. As seen above, the HLCS-Multi is a hierarchical multi-label global algorithm, which rewards classifiers that are able to predict at least some antecedent class of the real class. This makes the HLCS-Multi show low precision when compared with traditional binary evaluation measures of precision and recall; taking the hierarchy into account, however, the HLCS-Multi presents satisfactory results.

To evaluate the HLCS-Multi against Clus-HMC, tests were performed using the GO databases. Using the precision and recall obtained by Clus-HMC with its standard threshold settings, run from the material provided by the authors, PR curves were plotted for all databases. The PR curve plots the precision measure on the Y axis against the recall measure on the X axis. As the precision and recall values of the HLCS-Multi do not change much with different parameter settings, its results were marked as a single point on each graph. Since the comparison is made between a curve and a point, the HLCS-Multi performs better when the curve lies below the point, and Clus-HMC performs better when the curve lies above the point. Another factor to be considered is that, in these bases, the number of negative examples for each class far outweighs the number of positive examples, which suggests that the recall measure is more significant than the precision in this case. Thus, as shown in Figure 1, the HLCS-Multi algorithm had recall values above 0.4 and, in some bases, a favorable position of the point relative to the Clus-HMC curve.

Fig. 1. Precision-Recall plots for the GO data sets.
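The curve-versus-point reading can be made operational as in the sketch below; this is an illustration, not the evaluation script used for Figure 1. The Clus-HMC precision is interpolated at the recall of the HLCS-Multi operating point, and the point "wins" when its precision lies above the interpolated curve. The numbers in the example are hypothetical.

```python
import numpy as np


def point_beats_curve(curve_recall, curve_precision, point_recall, point_precision):
    """True when the single (recall, precision) operating point lies above the PR curve."""
    recall = np.asarray(curve_recall, dtype=float)
    precision = np.asarray(curve_precision, dtype=float)
    order = np.argsort(recall)                    # np.interp expects increasing x values
    curve_at_point = np.interp(point_recall, recall[order], precision[order])
    return point_precision > curve_at_point


# Hypothetical values for illustration only; they are not taken from the paper.
clus_recall = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
clus_precision = [0.90, 0.50, 0.35, 0.25, 0.18, 0.12]
print(point_beats_curve(clus_recall, clus_precision, 0.45, 0.30))
```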

5 Conclusions

This paper introduced a new global model approach for hierarchical multi-label classification problems using LCS and applied it to the classification of biological datasets. The proposed HLCS-Multi builds a global classification model in the form of an ordered list of IF-THEN classification rules which can predict terms at all levels of the hierarchy, satisfying the parent-child relationships between terms.

The advantage of the HLCS-Multi in contrast to other approaches is its adaptability. Based on the LCS model, the HLCS-Multi constantly iterates over samples from the environment to create its classification rules, making it a more flexible classification model.

The purpose of this paper was to address an underexplored topic with many real-world applications. LCS have been used with great success in several areas such as robotics, environment navigation, function approximation and others. However, the topic of this work, hierarchical multi-label classification problems such as protein function prediction, had not been approached by any previous LCS-based proposal. The computational results with the HLCS-Multi show that LCS models can be an alternative for hierarchical multi-label classification. As future research, we intend to evaluate this method on a larger number of datasets and compare it against other global hierarchical classification approaches.

References

[1] Vateekul, P.: Hierarchical Multi-Label Classification: Going Beyond Generalization Trees. Open Access Dissertations, Paper 723 (2012)
[2] Holland, J.H.: Adaptation in Natural and Artificial Systems: An Introductory Analysis with Applications to Biology, Control and Artificial Intelligence. MIT Press, Cambridge (1992)
[3] Urbanowicz, R.J., Moore, J.H.: Learning classifier systems: a complete introduction, review, and roadmap. J. Artif. Evol. App., 1:1–1:25 (2009)
[4] Noor, A., Chandan, K.R., Farshad, F.: Exploiting Label Dependency for Hierarchical Multi-label Classification. In: PAKDD (1), 294–305 (2012)
[5] Romao, L.M., Nievola, J.C.: Hierarchical Classification of Gene Ontology with Learning Classifier Systems. In: Proceedings of IBERAMIA 2012, LNCS/LNAI 7637, Pavón, J. et al. (eds.), 120–129, Springer-Verlag, Berlin Heidelberg (2012)
[6] Vens, C., Struyf, J., Schietgat, L., Džeroski, S., Blockeel, H.: Decision trees for hierarchical multi-label classification. Mach. Learn. 73(2), 185–214 (2008)
[7] Kiritchenko, S., Matwin, S., Fazel, A.F.: Functional Annotation of Genes Using Hierarchical Text Categorization. In: Proceedings of BioLINK SIG: Linking Literature, Information and Knowledge for Biology (2005)
