
Multi-Label Classification from High-Speed Data Streams with Adaptive Model Rules and Random Rules

Ricardo Sousa · João Gama


Abstract  Multi-Label Classification is a methodology that addresses classification problems in which multiple class labels are associated with each data example. Data streams pose new challenges to this methodology due to the massive, continuous production of structured data; in fact, most existing batch-mode methods cannot cope with this setting. Therefore, this paper proposes two Multi-Label Classification methods for continuous flows of data, based on Rule Learning and Ensemble Learning. These methods are derived from a Multi-Target Regression algorithm. The main contribution of this work is rule specialization on subsets of class labels, instead of the usual local (one individual model per output) or global (one model for all outputs) approaches. A prequential evaluation was conducted in which the global, local and subset operation modes were compared against each other and against other online classifiers found in the literature, using six real-world data sets. The evaluation demonstrated that the subset specialization presents competitive performance when compared to the local and global approaches and to the online classifiers from the literature.

Keywords  Multi-Label · Classification · Data Streams · Online Learning · Rule Learning · Ensemble Learning

Ricardo Sousa
LIAAD-INESC Porto, University of Porto, Campus da FEUP, Rua Dr. Roberto Frias 378, 4200-465 Porto, Portugal
Tel.: +351 910 056 821
E-mail: [email protected]

João Gama
LIAAD-INESC Porto, University of Porto, Campus da FEUP, Rua Dr. Roberto Frias 378, 4200-465 Porto, Portugal
E-mail: [email protected]

1 Introduction

Data streams became one of the main ways of collecting data such as audio and video samples, sensor network signals or network monitoring logs [4]. Structured data is produced continuously, without bound and at all kinds of rates. This fact raises problems of storage, processing time and changes in the data's probability distributions over time [12]. For these reasons, applications based on classification need to adapt their training and prediction operations to obtain better performance [10].

Traditionally, classification assigns one class label, from a set of possible labels, to an unobserved data example. However, in some classification problems, more than one class label is assigned to an example. Additionally, different examples may be assigned a different number of classes [18]. This type of classification is called Multi-Label Classification (MLC) [23].

Formally, the input attributes are defined as a vector of input random variables X = {X_1, ..., X_j, ..., X_M} ∈ R^M, where M is the number of variables. The output attributes are defined as random subsets of labels Y ⊆ {λ_1, ..., λ_k, ..., λ_L}, where L is the number of possible labels. The vectors x_i = (x_{i,1}, ..., x_{i,j}, ..., x_{i,M}) ∈ R^M and y_i ⊆ {λ_1, ..., λ_k, ..., λ_L}, where i ∈ {0, 1, 2, ...}, represent realizations of X and Y, respectively. A stream is defined as the sequence of examples e_i = (x_i, y_i), represented as S = {(x_0, y_0), (x_1, y_1), ..., (x_i, y_i), ...}. The objective of MLC methods is to learn a function f(x_i) → y_i that maps the input values of x_i into the output values of y_i.
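For illustration only, the following Python sketch shows how one stream example e_i = (x_i, y_i) could be represented; the class name and the values are invented for this example and are not part of the systems evaluated in this paper (which run on MOA, in Java).

    from dataclasses import dataclass
    from typing import List, Set

    @dataclass
    class Example:
        """One stream example e_i = (x_i, y_i)."""
        x: List[float]   # input vector x_i in R^M
        y: Set[int]      # indexes of the assigned labels, a subset of {1, ..., L}

    # A toy stream S with M = 3 inputs and L = 4 possible labels:
    stream = [
        Example(x=[0.2, 1.5, -0.7], y={1, 3}),  # two labels assigned to this example
        Example(x=[1.1, 0.3, 0.9],  y={4}),     # a different number of labels per example
    ]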

2

This methodology is present in many domains such as Biology (gene and protein function classification) [6], Engineering (network monitoring and sensor applications) [1], Economics (online stock market data) [16], Social Sciences (social network evolution) [19], Library and Information Science (text categorization) [16] and Multimedia (image, video and music categorization and annotation) [18].

Structured output classifiers predict complex objects, in this particular case label sets [19]. Conventionally, these structured classifiers can be categorized into local and global methods [15]. Local methods decompose the structured output into elementary scalar outputs and use conventional learners to predict each one of these outputs. In turn, global methods train and predict over the complete structured output and consist of adaptations of conventional classification algorithms [8]. Output specialization is an intermediate concept in which subsets of outputs are modelled.

The Rule Learning methodology develops methods that create partitions of the input attribute space and fit local models (e.g., linear models) in each partition. Therefore, the obtained models are more interpretable and highly modular (independent partitions) [9]. This property is the most prominent advantage of Rule Learning over decision trees. Modularity is exploited in this work to surpass the global and local methods through rule specialization on output variables [8].

Ensemble Learning focuses on methods that explore model diversity to improve classification accuracy [5]. Multiple different models are employed to yield multiple predictions. These predictions are aggregated into one final prediction according to a criterion (e.g., majority vote). Diversity is created using Bagging or Boosting techniques [21].

In this paper, two solutions for the online MLC problem based on Rule and Ensemble Learning are presented. These solutions are inspired by a Multi-Target Regression approach. A performance evaluation was conducted in which the proposed solutions were compared to other methods found in the literature [8].

This paper is organized as follows. Section 2 describes related work, covering six online MLC classifiers found in the literature, and Section 3 explains the proposed algorithms. The performance evaluation is described in Section 4; the respective results are discussed in Section 5 and the conclusions are drawn in Section 6.

2 Related work

This section provides a brief description of six existing online approaches from the literature. Most approaches are based on problem transformation [19]. The random subsets of nominal labels Y are transformed into a vector of binary random variables Y = {Y_1, ..., Y_k, ..., Y_L} ∈ {0, 1}^L.


The realizations y_i are transformed into a vector of output variables [y_{i,1} · · · y_{i,k} · · · y_{i,L}], where each y_{i,k} ∈ {0, 1} is binary. If label λ_k is assigned to the i-th example then y_{i,k} = 1; otherwise y_{i,k} = 0. Hereafter, the output realizations are redefined as y_i = [y_{i,1} · · · y_{i,k} · · · y_{i,L}].

Binary Relevance (BR) is a simple classifier that applies the problem transformation directly. One online binary classifier trains and predicts for each k-th output variable only, and these predictions are concatenated to produce the final structured prediction. The final prediction is formally represented by ŷ_i = [f_1(x_i), ..., f_k(x_i), ..., f_L(x_i)], where f_k represents the classifier of the k-th output variable. As advantages, the BR method is very simple and can use any online binary classifier. As a limitation, the information related to correlations between the individual label outputs is lost. The method also requires a vast amount of hardware resources [23]. This classifier is used as a baseline in the performance tests.

Classifier Chains (CC) is also based on the problem transformation [24]. The indexes of the L output variables are randomly reordered into a different sequence to avoid order preference. Then, the k-th classifier models the inputs together with the first (k − 1) output variables. The prediction is defined by ŷ_i = [f_1(x_i), ..., f_k(x_i, ŷ_{i,1}, ..., ŷ_{i,k−1}), ..., f_L(x_i, ŷ_{i,1}, ..., ŷ_{i,L−1})]. Finally, the outputs are reordered into the original sequence. Unlike the BR method, the CC method models label dependencies. However, CC does not guarantee an optimal order of the output variables.

Multi-Label Hoeffding Trees (MHT) is a structured classifier based on a decision tree that uses the Hoeffding bound splitting criterion during induction. This method uses the information gain as the homogeneity heuristic and online classifiers (essentially BR based) at the tree leaves [23]. The prediction can be modelled as ŷ_i = f_n(x_i), where f_n is the basic online classifier at leaf n. The main advantage is the creation of local models by dividing the input space into partitions dynamically. The main disadvantage is the vulnerability to outliers and the creation of disjoint and dependent partitions.

i-SOUP Tree is another structured method that was first developed for multi-target regression and was adapted to MLC through problem transformation. This algorithm is based on the FIMT-MT method, which creates model trees using a split selection criterion (the Hoeffding criterion) and a homogeneity measure (variance). It uses an enhanced top-down training and MTR perceptrons (without activation function, trained with the Delta rule) in the leaves of the tree [20].
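To make the problem transformation and the BR scheme concrete, the sketch below shows a minimal online Binary Relevance learner in Python. It is purely illustrative: the base learner here is a simple logistic perceptron written for this example, whereas the methods evaluated in this paper run on MOA/MEKA (Java) with their own base classifiers.

    import math

    class OnlinePerceptron:
        """Logistic perceptron trained one example at a time (delta rule)."""
        def __init__(self, n_inputs, lr=0.1):
            self.w = [0.0] * n_inputs
            self.b = 0.0
            self.lr = lr

        def predict_proba(self, x):
            z = self.b + sum(wj * xj for wj, xj in zip(self.w, x))
            z = max(min(z, 30.0), -30.0)          # clamp to avoid overflow
            return 1.0 / (1.0 + math.exp(-z))

        def predict(self, x):
            return 1 if self.predict_proba(x) > 0.5 else 0

        def learn(self, x, y):
            err = y - self.predict_proba(x)        # y is 0 or 1
            self.w = [wj + self.lr * err * xj for wj, xj in zip(self.w, x)]
            self.b += self.lr * err

    class BinaryRelevance:
        """One independent online binary classifier per label; predictions are concatenated."""
        def __init__(self, n_inputs, n_labels):
            self.models = [OnlinePerceptron(n_inputs) for _ in range(n_labels)]

        def predict(self, x):
            return [m.predict(x) for m in self.models]

        def learn(self, x, y_vec):                 # y_vec is the binary label vector
            for m, yk in zip(self.models, y_vec):
                m.learn(x, yk)

A Classifier Chains variant could reuse the same base learner by appending the previously predicted labels to the input vector of each subsequent per-label classifier.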


3 AMRules and Random Rules for Multi-Label Classification

This section describes the contributions of this work: the Multi-Label AMRules (ML-AMRules) and Multi-Label Random Rules (ML-Random Rules) methods. The main contribution is the subset specialization, which is more efficient (in memory and processing) than several methods from the literature while remaining very competitive. This mode is also more efficient than the other ML-AMRules operation modes, which are explained below. The underlying theory of Rule Learning and the fundamentals of the local and global operation modes are also clarified. These methods are based on the adaptation of the AMRules and Random Rules Multi-Target Regression (MTR) methods to the MLC problem through problem transformation [8, 2]. The adaptation also implied the modification of the heuristics used in the splitting functions.

3.1 Rule Learning

A rule R is defined as an implication A ⇒ C, where the antecedent A is a conjunction of conditions (called literals) on the input variables x_i. The consequent C is a predicting function, which in this context corresponds to a basic online classifier. For numerical data, literals may take the forms (X_j ≤ v) and (X_j > v), where X_j represents the j-th input variable, meaning that x_{i,j} must be less than or equal to v, or greater than v, respectively. For nominal data, literals may take the form (X_j = v), expressing that x_{i,j} must be equal to v, or (X_j ≠ v), indicating that x_{i,j} must be different from v. R is said to cover x_i if, and only if, x_i satisfies all the literal conditions in A. The support S(x_i) of an example's input variables is the set of rules that cover x_i. The function in C (the basic Multi-Label classifier) returns a prediction ŷ_i if a rule R_r covers the example input variables x_i. A data structure L_r is associated with the rule R_r and contains the statistics (about the rule and the examples) necessary for the training and prediction operations (expanding the rule, detecting changes, identifying anomalies, ...). A default rule D is used in the initial steps and when none of the current rules covers the example (S(x_i) = ∅). The antecedent of D and its statistics L_D start as an empty set. The rule set is composed of U learned rules, represented as R = {R_1, · · · , R_r, · · · , R_U}, and a default rule D, as depicted in Figures 1, 2 and 3.
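The following sketch (illustrative Python, not the actual MOA implementation) captures the rule representation just described: literals over numeric or nominal attributes, an antecedent as a conjunction of literals, and the support S(x_i) as the set of covering rules. The statistics object L_r and the output subset O_r are reduced to placeholders.

    class Literal:
        def __init__(self, attr_index, op, value):
            self.attr_index = attr_index   # j: index of input variable X_j
            self.op = op                   # one of "<=", ">", "==", "!="
            self.value = value             # threshold or nominal value v

        def covers(self, x):
            xj = x[self.attr_index]
            return {"<=": xj <= self.value,
                    ">":  xj >  self.value,
                    "==": xj == self.value,
                    "!=": xj != self.value}[self.op]

    class Rule:
        def __init__(self):
            self.antecedent = []   # conjunction of literals A
            self.stats = {}        # L_r: sufficient statistics for expansion/detection
            self.outputs = None    # O_r: label subset this rule specializes on (None = all)

        def covers(self, x):
            return all(lit.covers(x) for lit in self.antecedent)

    def support(rules, x):
        """S(x): the set of rules whose antecedent covers x."""
        return [r for r in rules if r.covers(x)]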


3.2 Multi-Label Adaptive Model Rules

Algorithm 1 presents the model training procedure of the Multi-Label Adaptive Model Rules (ML-AMRules) method. In the initialization stage, the statistics L_D of the default rule are initialized and the rule set R starts out empty. In the processing stage, given an incoming example (x_i, y_i), the method finds the rules that cover the example's input variables x_i. When a covering rule R_r ∈ S(x_i) is found, the example is submitted to anomaly (isAnomaly(L_r, x_i)) and change (changeDetected(L_r, x_i)) detection. These operations prune the model by avoiding anomalous examples and changes in the probability distributions, respectively. Page-Hinkley (PH) was used for change detection [22], and a method based on Cantelli's inequality was employed for anomaly detection [8]. In case of an anomaly, the example is simply rejected; in case of change detection, R_r is removed from the rule set (the rule is outdated). Otherwise, the statistics L_r are updated (update(L_r)). Rule expansion (addition of a new literal) is attempted (expand(R_r)) and, in the affirmative case, specialization of the rule on the output subset and rule addition to R are performed. This specialization leads to more accurate predictions and increases the processing speed.

The example input variables x_i may not be covered by any rule. Consequently, the statistics of the default rule L_D are updated and the expansion is attempted. If an expansion occurs, the default rule D is added to the rule set R and a new default rule is initialized.

The training process also involves the computation of the online average of false positives and negatives. This measure is used in the prediction stage to compute a weight that reflects the reliability of the model prediction (Algorithm 2). The online average of false positives and negatives of rule r and output k is defined as e_{r,k} = T_{r,k} / W_r, where T_{r,k} is the accumulated number of false positives and negatives and W_r is the number of examples observed since the last expansion. These variables are updated as

    T_{r,k} ← α T_{r,k} + |ŷ_{i,k} ⊕ y_{i,k}|,    W_r ← α W_r + 1,    (1)

where 0 < α < 1 is a fading factor, y_{i,k} is the true value and ŷ_{i,k} is the post-training prediction. The exclusive-or operation identifies the false positives and negatives, and the norm transforms it into an integer (zero or one). Each output variable under the rule R_r is associated with a linear predictor and a majority-vote predictor. The majority-vote predictor finds the binary label ŷ^r_{i,k} that occurred most often in the last n examples since the last expansion of the rule. The purpose is to provide fast training convergence to the linear predictor.
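A minimal sketch of the fading error estimate of Eq. (1) follows, assuming binary label vectors. The class name is invented for this example, and the default α = 0.99 borrows the fading factor reported later in the experimental setup; both are illustrative choices rather than part of the original implementation.

    class RuleErrorEstimator:
        """Tracks e_{r,k} = T_{r,k} / W_r for one rule r (Eq. (1))."""
        def __init__(self, n_outputs, alpha=0.99):
            self.alpha = alpha
            self.T = [0.0] * n_outputs   # faded count of false positives/negatives per output k
            self.W = 0.0                 # faded number of examples since the last expansion

        def update(self, y_hat, y_true):
            self.W = self.alpha * self.W + 1.0
            for k, (p, t) in enumerate(zip(y_hat, y_true)):
                # XOR of binary values: 1 iff false positive or false negative
                self.T[k] = self.alpha * self.T[k] + (1 if p != t else 0)

        def error(self, k):
            """e_{r,k}, used in the prediction stage to weight the rule's prediction."""
            return self.T[k] / self.W if self.W > 0 else 0.0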



Algorithm 1  Multi-Label Adaptive Model Rules training

    Initialization: R ← ∅, D ← 0                       ▷ Initialize rule set
    Input: Example (x_i, y_i) ∈ S
    Output: Rule set model R
    Method:
    for all R_r ∈ S(x_i) do
        if ¬ isAnomaly(L_r, x_i) then
            if changeDetected(L_r, (x_i, y_i)) then
                R ← R \ {R_r}                          ▷ Remove rule
            else
                R_c ← R_r                              ▷ Copy R_r before expanding
                update(L_r)                            ▷ Learn from the example
                expanded ← expand(R_r)                 ▷ Try to expand the current rule
                if expanded = TRUE then
                    Compute O'_c                       ▷ Add complementary rule
                    O_c ← O'_c
                    R ← R ∪ {R_c}                      ▷ Add complementary rule R_c to the rule set
    if S(x_i) = ∅ then                                 ▷ No rule covers the example
        update(L_D)                                    ▷ Learn from the example
        expanded ← expand(D)
        if expanded = TRUE then
            R ← R ∪ {D}                                ▷ Add D to the rule set
            D ← 0                                      ▷ Create a new default rule

A rule expansion event is the addition of a new literal to the antecedent A_r of a rule, or the addition of a new rule to the rule set. This event corresponds to the creation of new partitions in the input space. Composing a literal requires computing the splitting value v of a variable X_j and the literal sign; the split on X_j and v maximizes the uniformity of the two resulting groups of examples in the output space. Rule expansion uses the Extended Binary Search Tree (E-BST) method with limited depth to find these elements of the literals [14]. This method is an adaptation of the Multi-Target Regression (MTR) procedure to the MLC problem. The main adaptation consisted of changing the splitting heuristic. The MTR version uses the Mean Variance Ratio (MVR), which is also a measure of uniformity [8]. Instead of MVR, the MLC version uses the Mean Information Gain (MIG) as the uniformity-maximizing function, which is defined as

    MIG(X_j, v) = (1 / |O_r|) Σ_{u ∈ O_r} IG_u(X_j, v),    (2)

where IG_u(X_j, v) is the Information Gain of splitting X_j at v, considering the output variable Y_u, and O_r is the set of output variable indexes currently being considered by the rule R_r. The Information Gain (IG) is defined as

    IG_u(X_j, v) = H_u(E) − (|E_L| / |E|) (H_u(E_L) / H_u(E)) − (|E_R| / |E|) (H_u(E_R) / H_u(E)),    (3)

where H_u(E) = −[p log(p) + (1 − p) log(1 − p)] is the entropy of Y_u and p is the probability P(Y_u = 1) over the set of examples E. If the input variable is numerical, E_L = {x_i ∈ E : x_{i,j} ≤ v} and E_R = {x_i ∈ E : x_{i,j} > v}. For nominal input variables, E_L = {x_i ∈ E : x_{i,j} = v} and E_R = {x_i ∈ E : x_{i,j} ≠ v}.

The rule expansion procedure uses the Hoeffding bound [13] to determine the minimum number of examples n required to expand. The Hoeffding bound states that the true mean of a random variable β, with range P, will not differ from the sample mean by more than ε with probability 1 − δ. The Hoeffding bound is defined as

    ε = sqrt( P² ln(1/δ) / (2n) ).

The procedure suggests several splitting candidates [MIG(X_j, v_1), ..., MIG(X_j, v_c)], organized in decreasing order. The two best splits are compared through the difference β = MIG(X_j, v_1) − MIG(X_j, v_2), with P = 1 because the range of β is [0, 1]. If β > ε, then MIG(X_j, v_1) is the best split with probability 1 − δ. A threshold τ is defined to limit ε and to avoid divisions by very low values or zero: if ε < τ, the split with the highest MIG(X_j, v_1) is selected and the expansion takes place. The literal sign is determined from the H(E_L) and H(E_R) values: if H(E_L) ≤ H(E_R), the sign is ≤; otherwise, the sign is >.
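The split scoring and the expansion test can be sketched as follows (illustrative Python; candidate generation with the E-BST and the per-rule statistics are omitted, the information-gain formula follows the reconstruction of Eq. (3) above, and the default values for δ and τ are assumptions rather than values taken from the paper).

    import math

    def entropy(p):
        """H_u for a binary output with P(Y_u = 1) = p."""
        if p <= 0.0 or p >= 1.0:
            return 0.0
        return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

    def info_gain(E, EL, ER, u):
        """IG_u(X_j, v) as in Eq. (3); E, EL, ER are lists of (x, y) pairs."""
        def h(examples):
            if not examples:
                return 0.0
            p = sum(y[u] for _, y in examples) / len(examples)
            return entropy(p)
        hE = h(E)
        if hE == 0.0:
            return 0.0
        return hE - (len(EL) / len(E)) * (h(EL) / hE) - (len(ER) / len(E)) * (h(ER) / hE)

    def mean_info_gain(E, EL, ER, outputs):
        """MIG(X_j, v) of Eq. (2): average IG over the outputs O_r of the rule."""
        return sum(info_gain(E, EL, ER, u) for u in outputs) / len(outputs)

    def hoeffding_bound(P, delta, n):
        """epsilon = sqrt(P^2 * ln(1/delta) / (2n))."""
        return math.sqrt(P * P * math.log(1.0 / delta) / (2.0 * n))

    def should_expand(best_mig, second_mig, n, delta=1e-7, tau=0.05):
        """Expand if the best split beats the runner-up by more than epsilon,
        or if epsilon fell below the tie-breaking threshold tau."""
        eps = hoeffding_bound(P=1.0, delta=delta, n=n)
        return (best_mig - second_mig) > eps or eps < tau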

During ML-Random Rules training, each ensemble member m (an ML-AMRules model f̂_m) is updated with an example weight w_m: if w_m > 0, then

    C_m ← α C_m + w_m |f̂_m(x_i) ⊕ y_i|,    N_m ← α N_m + w_m,
    f̂_m ← ML-AMRules(f̂_m, x_i, y_i, w_m),

where C_m and N_m are the faded error counters used to weight that member's predictions.

Algorithm 4  ML-Random Rules prediction

    Initialization: ε, a small positive number used to prevent numerical instabilities
    Input: Example (x_i, ?) ∈ S
    Output: Prediction ŷ_i
    Method:
    for all m ∈ {1, ..., M} do
        ŷ^m ← f̂_m(x_i)
        e_m ← C_m / N_m
    for all m ∈ {1, ..., M} do
        θ_m ← (e_m + ε)^(-1) / Σ_{t=1}^{M} (e_t + ε)^(-1)
    m_i ← Σ_{m=1}^{M} θ_m ŷ^m
    for all k ∈ {1, ..., L} do
        ŷ_{i,k} ← 1 if m_{i,k} > 0.5, and 0 if m_{i,k} ≤ 0.5

The weighted predictions of the ensemble members are thus aggregated for each output. Finally, the aggregation variables are converted into binary variables.
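A compact sketch of this aggregation step is shown below, assuming each ensemble member object exposes a predict() method returning a binary vector and the error counters C and N as attributes (the function and attribute names are illustrative).

    def ml_random_rules_predict(members, x, eps=1e-10):
        preds = [m.predict(x) for m in members]                       # y_hat^m
        errors = [m.C / m.N if m.N > 0 else 0.0 for m in members]     # e_m = C_m / N_m
        inv = [1.0 / (e + eps) for e in errors]
        total = sum(inv)
        thetas = [v / total for v in inv]                             # theta_m: lower error, higher weight
        L = len(preds[0])
        agg = [sum(theta * p[k] for theta, p in zip(thetas, preds)) for k in range(L)]
        return [1 if a > 0.5 else 0 for a in agg]                     # threshold at 0.5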

4 Experimental Setup

This section presents the evaluation tests of the proposed methods described in Section 3. The classification performance of the ML-AMRules and ML-Random Rules operation modes (global, local and subset) described in Subsection 3.2 is compared across modes and against three online classifiers described in Section 2.


The CC method is incorporated in the open-source MEKA platform, which includes both batch and online methods; its algorithms are implemented in the Java programming language and are based on WEKA [23]. BR, MHT and the proposed ML-AMRules and ML-Random Rules methods were implemented in the Massive Online Analysis (MOA) platform, an open-source platform of Machine Learning and Data Mining algorithms applied to data streams, also implemented in Java [3].

The real-world data sets 20NG, mediamill, ENRON, OHSUMED, SLASHDOT and yeast were used to simulate data streams. These data sets are extensively described in the literature and were obtained from the UCI data set repository [19]. The main features of the data sets are presented in Table 1.

Table 1  Data sets description.

    Dataset     #Examples   #Labels   #Inputs
    20NG        19300       20        1006
    mediamill   43907       101       120
    OHSUMED     13929       23        1002
    ENRON       1702        53        1001
    SLASHDOT    3782        22        1079
    yeast       2417        14        103

Example-based performance measures (Exact Match, Accuracy, Precision, Recall and F-measure) were used [18]. Prequential evaluation was applied, where the methods first predict the output values and then use the example for training [11]. The sequence of example predictions was divided into 100 windows and the above-mentioned measures were computed for each window. Since all algorithms use random initializations, 10 runs were performed to obtain more consistent results. As validation, for each run, 100 combinations of the algorithms' parameter values were tested; for all methods, the parameters that led to the best F-measure value were kept. Finally, the mean and the standard deviation of the measures over all windows were computed.

A perceptron with a logistic activation function was used as the linear predictor by all methods due to its simplicity, low computational cost and low error rates [17]. The ML-Random Rules ensembles were configured with 10 ML-AMRules instantiations and the fading factor was set to 0.99. Friedman and Nemenyi post-hoc tests were applied to the results tables to find groups of methods that differ significantly [7]. Both tests were performed at a 0.05 significance level.

Finally, evolution graphs were computed to observe the dynamics of model training. Sliding windows with a length of 2000 examples and steps of 50 examples were used to obtain high resolution. For each window, the F-measure was computed; the F-measure was chosen because it is widely employed for global evaluation.
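A minimal sketch of the prequential (test-then-train) protocol with the example-based F-measure follows. It is illustrative only: the model interface and helper names are assumptions, and only the 100-window division described above is taken from the setup.

    def example_f1(y_true, y_pred):
        """Example-based F-measure for one example (binary label vectors)."""
        inter = sum(t and p for t, p in zip(y_true, y_pred))
        denom = sum(y_true) + sum(y_pred)
        return 2.0 * inter / denom if denom > 0 else 1.0   # convention when both sets are empty

    def prequential(model, stream, n_windows=100):
        examples = list(stream)
        window = max(1, len(examples) // n_windows)
        scores, acc = [], []
        for i, (x, y) in enumerate(examples, start=1):
            y_hat = model.predict(x)       # first predict ...
            acc.append(example_f1(y, y_hat))
            model.learn(x, y)              # ... then use the example for training
            if i % window == 0:
                scores.append(sum(acc) / len(acc))
                acc = []
        return scores                      # one F-measure value per window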


5 Results

In this section, the evaluation results are presented, organized by performance measure. Tables 2 to 11 present the Accuracy, Exact Match, Precision, Recall and F-measure values of the methods for each data set, respectively. Higher values mean better performance. Each pair of tables is followed by the critical diagram used to identify distinct groups of methods. The Friedman test revealed that the method ranks over the data sets differed significantly for all performance measures. ML-AMR(S), ML-AMR(G) and ML-AMR(L) correspond to the subset, global and local ML-AMRules operation modes, respectively. Similarly, ML-RR(S), ML-RR(G) and ML-RR(L) correspond to the subset, global and local ML-Random Rules operation modes. The BR method was chosen as the baseline because of its simplicity. The last part of this section shows the dynamics of training: the F-measure variation is depicted for all data sets and methods.

Tables 2 and 3 show that the ML-AMR approaches present competitive Accuracy values in comparison to the methods from the literature. The ML-RR approaches present the lowest values among all methods. The MHT method stands out in the experiments with the OHSUMED data set.

Table 2  Accuracy for 20NG, mediamill and OHSUMED datasets. Mean and standard deviation values.

    Method       20NG          mediamill     OHSUMED
    BR           0.509±0.115   0.362±0.031   0.339±0.083
    CC           0.510±0.117   0.362±0.031   0.340±0.083
    MHT          0.509±0.114   0.371±0.030   0.376±0.076
    ML-AMR(L)    0.511±0.117   0.362±0.031   0.341±0.084
    ML-AMR(G)    0.514±0.117   0.362±0.031   0.341±0.083
    ML-AMR(S)    0.510±0.116   0.362±0.031   0.340±0.083
    ML-RR(L)     0.420±0.123   0.323±0.034   0.297±0.079
    ML-RR(G)     0.420±0.123   0.283±0.029   0.298±0.081
    ML-RR(S)     0.420±0.123   0.323±0.034   0.298±0.081

Table 3  Accuracy for ENRON, SLASHDOT and yeast datasets. Mean and standard deviation values.

    Methods      ENRON         SLASHDOT      yeast
    BR           0.403±0.053   0.308±0.118   0.465±0.054
    CC           0.409±0.054   0.305±0.119   0.467±0.051
    MHT          0.404±0.051   0.307±0.118   0.467±0.050
    ML-AMR(L)    0.411±0.059   0.307±0.118   0.466±0.053
    ML-AMR(G)    0.409±0.054   0.305±0.119   0.467±0.051
    ML-AMR(S)    0.409±0.054   0.305±0.119   0.468±0.051
    ML-RR(L)     0.353±0.046   0.202±0.110   0.434±0.054
    ML-RR(G)     0.347±0.050   0.207±0.087   0.405±0.045
    ML-RR(S)     0.348±0.050   0.207±0.087   0.405±0.045



Figure 4 displays the critical diagram related to Tables 2 and 3. This diagram shows that the ML-AMR approaches are in the best group and above the baseline method. The ML-RR approaches are at lower ranks.



Fig. 4 Critical diagram for Accuracy. Nemenyi post-hoc test at 0.05 significance

Fig. 5 Critical diagram for Exact Match. Nemenyi post-hoc test at 0.05 significance.

Tables 4 and 5 present very low values of Exact Match for the mediamill data set due to the high number of possible labels (101 outputs). The ML-AMRules and ML-RR approaches show the same relative behaviour as for the Accuracy measure.

Table 4  Exact Match for 20NG, mediamill and OHSUMED datasets. Mean and standard deviation values.

    Method       20NG          mediamill     OHSUMED
    BR           0.483±0.111   0.049±0.022   0.207±0.059
    CC           0.483±0.112   0.049±0.022   0.208±0.059
    MHT          0.482±0.110   0.060±0.025   0.268±0.063
    ML-AMR(L)    0.485±0.113   0.048±0.022   0.210±0.060
    ML-AMR(G)    0.488±0.114   0.049±0.022   0.209±0.060
    ML-AMR(S)    0.484±0.113   0.049±0.022   0.209±0.059
    ML-RR(L)     0.402±0.119   0.041±0.022   0.184±0.056
    ML-RR(G)     0.401±0.118   0.038±0.020   0.185±0.057
    ML-RR(S)     0.401±0.118   0.041±0.022   0.185±0.057

Table 5  Exact Match for ENRON, SLASHDOT and yeast datasets. Mean and standard deviation values.

    Methods      ENRON         SLASHDOT      yeast
    BR           0.096±0.032   0.278±0.107   0.120±0.050
    CC           0.114±0.033   0.275±0.108   0.119±0.050
    MHT          0.099±0.030   0.281±0.107   0.120±0.050
    ML-AMR(L)    0.105±0.031   0.277±0.106   0.119±0.050
    ML-AMR(G)    0.113±0.033   0.275±0.108   0.118±0.050
    ML-AMR(S)    0.113±0.033   0.275±0.108   0.119±0.050
    ML-RR(L)     0.081±0.029   0.186±0.101   0.082±0.045
    ML-RR(G)     0.076±0.033   0.188±0.079   0.062±0.034
    ML-RR(S)     0.078±0.034   0.188±0.079   0.062±0.034

Figure 5 provides the critical diagram associated with Tables 4 and 5. Similarly to the previous diagram, the ML-AMR approaches are in the best groups and the ML-RR approaches at lower ranks. The MHT method stands out in the mediamill and OHSUMED experiments.

Tables 6 and 7 display favourable Precision values for the ML-AMRules approaches. On the other hand, the ML-RR approaches present the lowest values.

Table 6  Precision for 20NG, mediamill and OHSUMED datasets. Mean and standard deviation values.

    Method       20NG          mediamill     OHSUMED
    BR           0.531±0.119   0.402±0.036   0.372±0.094
    CC           0.533±0.122   0.402±0.036   0.374±0.095
    MHT          0.531±0.119   0.408±0.034   0.354±0.087
    ML-AMR(L)    0.533±0.121   0.402±0.037   0.374±0.096
    ML-AMR(G)    0.535±0.121   0.402±0.036   0.373±0.096
    ML-AMR(S)    0.532±0.121   0.402±0.036   0.373±0.096
    ML-RR(L)     0.434±0.127   0.364±0.040   0.320±0.088
    ML-RR(G)     0.434±0.127   0.302±0.034   0.321±0.089
    ML-RR(S)     0.434±0.127   0.365±0.039   0.321±0.089

Table 7  Precision for ENRON, SLASHDOT and yeast datasets. Mean and standard deviation values.

    Methods      ENRON         SLASHDOT      yeast
    BR           0.516±0.056   0.317±0.123   0.578±0.053
    CC           0.519±0.057   0.315±0.124   0.582±0.049
    MHT          0.516±0.055   0.319±0.123   0.580±0.048
    ML-AMR(L)    0.523±0.063   0.317±0.122   0.579±0.053
    ML-AMR(G)    0.519±0.057   0.315±0.124   0.580±0.049
    ML-AMR(S)    0.519±0.057   0.315±0.124   0.580±0.049
    ML-RR(L)     0.391±0.053   0.203±0.111   0.484±0.068
    ML-RR(G)     0.383±0.056   0.210±0.089   0.451±0.057
    ML-RR(S)     0.385±0.057   0.210±0.089   0.451±0.057

Figure 6 presents the critical diagram of Tables 6 and 7. This diagram also shows that the ML-AMR approaches are among the best methods.


Fig. 6 Critical diagram for Precision. Nemenyi post-hoc test at 0.05 significance.

Fig. 7 Critical diagram for Recall. Nemenyi post-hoc test at 0.05 significance.

Tables 8 and 9 exhibit a predominance of the ML-AMRules approaches for Recall; the global approach presents better performance. The ML-Random Rules methods obtained the best results for the ENRON and yeast data sets.

Table 8  Recall for 20NG, mediamill and OHSUMED datasets. Mean and standard deviation values.

    Method       20NG          mediamill     OHSUMED
    BR           0.493±0.112   0.557±0.045   0.408±0.090
    CC           0.493±0.114   0.557±0.046   0.409±0.090
    MHT          0.492±0.112   0.532±0.051   0.387±0.084
    ML-AMR(L)    0.494±0.114   0.556±0.046   0.409±0.090
    ML-AMR(G)    0.498±0.115   0.557±0.046   0.409±0.089
    ML-AMR(S)    0.493±0.114   0.557±0.046   0.408±0.089
    ML-RR(L)     0.412±0.121   0.485±0.048   0.374±0.091
    ML-RR(G)     0.411±0.120   0.545±0.058   0.374±0.093
    ML-RR(S)     0.411±0.120   0.484±0.049   0.374±0.093

Table 9  Recall for ENRON, SLASHDOT and yeast datasets. Mean and standard deviation values.

    Methods      ENRON         SLASHDOT      yeast
    BR           0.425±0.054   0.321±0.124   0.498±0.060
    CC           0.448±0.052   0.322±0.125   0.484±0.048
    MHT          0.426±0.051   0.324±0.124   0.484±0.049
    ML-AMR(L)    0.446±0.069   0.324±0.123   0.497±0.059
    ML-AMR(G)    0.448±0.052   0.322±0.125   0.484±0.048
    ML-AMR(S)    0.448±0.052   0.322±0.125   0.484±0.048
    ML-RR(L)     0.497±0.051   0.214±0.116   0.564±0.072
    ML-RR(G)     0.492±0.048   0.221±0.092   0.556±0.070
    ML-RR(S)     0.492±0.048   0.225±0.092   0.556±0.070

Figure 7 presents the critical diagram of Tables 8 and 9. This diagram also shows that the ML-AMR approaches are in the best groups.

Tables 10 and 11 reveal a generally positive behaviour of the ML-AMRules approaches in terms of F-measure. The MHT method stands out in the experiments with the OHSUMED data set.

Table 10  F-Measure for 20NG, mediamill and OHSUMED datasets. Mean and standard deviation values.

    Method       20NG          mediamill     OHSUMED
    BR           0.518±0.116   0.482±0.035   0.387±0.093
    CC           0.519±0.118   0.482±0.034   0.388±0.093
    MHT          0.518±0.116   0.486±0.032   0.414±0.083
    ML-AMR(L)    0.520±0.118   0.482±0.035   0.389±0.094
    ML-AMR(G)    0.523±0.119   0.482±0.034   0.388±0.094
    ML-AMR(S)    0.519±0.118   0.482±0.034   0.388±0.093
    ML-RR(L)     0.427±0.125   0.436±0.039   0.339±0.089
    ML-RR(G)     0.426±0.124   0.387±0.035   0.339±0.090
    ML-RR(S)     0.426±0.124   0.436±0.039   0.339±0.090

Table 11  F-Measure for ENRON, SLASHDOT and yeast datasets. Mean and standard deviation values.

    Methods      ENRON         SLASHDOT      yeast
    BR           0.467±0.063   0.313±0.122   0.530±0.071
    CC           0.465±0.062   0.311±0.123   0.536±0.062
    MHT          0.468±0.060   0.313±0.122   0.537±0.060
    ML-AMR(L)    0.472±0.071   0.314±0.121   0.534±0.070
    ML-AMR(G)    0.465±0.062   0.311±0.123   0.537±0.062
    ML-AMR(S)    0.465±0.062   0.311±0.123   0.532±0.062
    ML-RR(L)     0.459±0.052   0.207±0.113   0.523±0.053
    ML-RR(G)     0.451±0.056   0.214±0.090   0.522±0.046
    ML-RR(S)     0.453±0.056   0.214±0.090   0.522±0.046

Figure 8 presents the critical diagram of Tables 10 and 11. This diagram also shows that the ML-AMR approaches are in the best groups.

The main observations of this set of results are:

– The ML-AMRules approaches are among the best methods and above the baseline method (BR).
– The MHT method stands out in some scenarios.
– ML-AMRules performs better in global mode.
– The ML-Random Rules approaches are at lower ranks.

The MHT method presents better performance for Accuracy and Exact Match, and the ML-AMR(G) mode presents the best performance for Recall. For the remaining measures, the performance values are fairly similar for these two methods.



Fig. 8 Critical diagram for F-measure. Nemenyi post-hoc test at 0.05 significance.

In general, the mean values are close due to the fact that all methods use the same predictor. Figures 9 to 11 present graphs that reveal the training dynamics of the models; each graph presents the F-measure variation. These figures show overlapping curves, which means that groups of methods present the same training dynamics.


Fig. 10 F-measure evolution curves of all methods for the mediamill data set.

Fig. 9 F-measure evolution curves of all methods for the 20NG data set.

Fig. 11 F-measure evolution curves of all methods for the OHSUMED data set.

Figure 9 shows that the methods BR, CC, ML-AMR(L), ML-AMR(G) and ML-AMR(S) belong to the first, higher-values group of curves; MHT forms the second, isolated curve; and ML-RR(L), ML-RR(G) and ML-RR(S) correspond to the third group of curves. For Figure 10, the methods BR, MHT, CC, ML-AMR(L), ML-AMR(G) and ML-AMR(S) are in the first, higher-values group, the ML-RR(L) and ML-RR(S) methods are in the second group, and the ML-RR(G) method is isolated in the third group.

Figure 11 shows that the MHT method is isolated in the first, higher-values group. The methods BR, CC, ML-AMR(L), ML-AMR(G) and ML-AMR(S) are in the second group, and the last group comprises the ML-RR(L), ML-RR(G) and ML-RR(S) methods.

The most relevant remarks of this set of results are:

– The ML-AMR approaches present similar behaviour.
– The curves are almost parallel.
– The initial training conditions seem to influence the overall performance significantly.
– The ML-RR approaches produced initial training that led to lower performance.


In addition, the curves at higher levels present higher learning rates. The curves show an increasing tendency, which means that the methods could reach higher performance values with larger data sets. In general, the curves are parallel, which means that they present almost the same dynamics.

6 Conclusions

This paper is the result of preliminary work that adapts an MTR algorithm to MLC problems using Rule Learning methods. It can be concluded that the ML-AMRules approaches are competitive when compared to online MLC algorithms from the literature, within the experimental context of this work. The ML-Random Rules approaches did not present favourable results; the initial training conditions of this ensemble method may be the cause of the low performance. Another possible cause is the aggregation function, which usually influences the performance of ensemble methods. The AMRules global mode has been shown to be competitive against its local and subset modes. However, its performance does not stand out from the other operation modes, which means that the subset operation mode presents the advantage of lower resource requirements. As future work, more data sets with larger numbers of examples will be used, new heuristics will be implemented to improve ML-AMRules, and the Random Rules ensemble method will be improved through better aggregation functions.

Acknowledgements This work is financed by the ERDF European Regional Development Fund through the Operational Programme for Competitiveness and Internationalisation - COMPETE 2020 Programme within project POCI-010145-FEDER-006961.

References

1. Aggarwal, C.C.: Data Streams: Models and Algorithms (Advances in Database Systems). Springer-Verlag New York, Inc., Secaucus, NJ, USA (2006)
2. Almeida, E., Ferreira, C., Gama, J.: Adaptive model rules from data streams. In: ECML 2013 - European Conference on Machine Learning (2013)
3. Bifet, A., Holmes, G., Kirkby, R., Pfahringer, B.: MOA: Massive Online Analysis. J. Mach. Learn. Res. 11, 1601–1604 (2010)
4. Bifet, A., Holmes, G., Pfahringer, B., Kirkby, R., Gavaldà, R.: New ensemble methods for evolving data streams. In: Proceedings of the 15th ACM SIGKDD International Conference, KDD '09, pp. 139–148. ACM, New York, NY, USA (2009)


5. Bifet, A., Kirkby, R.: Data stream mining: a practical approach. Tech. rep., The University of Waikato (2009)
6. Clare, A., King, R.D.: Knowledge discovery in multi-label phenotype data. In: Proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discovery, PKDD '01, pp. 42–53. Springer-Verlag, London, UK (2001)
7. Demšar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006)
8. Duarte, J., Gama, J.: Multi-target regression from high-speed data streams with adaptive model rules. In: IEEE Conference on Data Science and Advanced Analytics (2015)
9. Fürnkranz, J., Gamberger, D., Lavrač, N.: Foundations of Rule Learning. Springer (2012)
10. Gama, J.: Knowledge Discovery from Data Streams. Chapman and Hall / CRC Data Mining and Knowledge Discovery Series. CRC Press (2010)
11. Gama, J., Sebastião, R., Rodrigues, P.P.: On evaluating stream learning algorithms. Machine Learning 90(3), 317–346 (2013)
12. Herrera, F., Charte, F., Rivera, A.J., del Jesus, M.J.: Multilabel Classification: Problem Analysis, Metrics and Techniques, 1st edn. Springer Publishing Company, Incorporated (2016)
13. Hoeffding, W.: Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association 58(301), 13–30 (1963)
14. Ikonomovska, E., Gama, J., Džeroski, S.: Learning model trees from evolving data streams. Data Min. Knowl. Discov. 23(1), 128–168 (2011)
15. Kocev, D., Vens, C., Struyf, J., Džeroski, S.: Tree ensembles for predicting structured outputs. Pattern Recognition 46(3), 817–833 (2013)
16. Kong, X., Yu, P.: An ensemble-based approach to fast classification of multi-label data streams, pp. 95–104 (2011)
17. Loza Mencía, E., Fürnkranz, J.: Pairwise learning of multilabel classifications with perceptrons. In: Proceedings of the International Joint Conference on Neural Networks, IJCNN 2008, part of the IEEE World Congress on Computational Intelligence, pp. 2899–2906 (2008)
18. Madjarov, G., Kocev, D., Gjorgjevikj, D., Džeroski, S.: An extensive experimental comparison of methods for multilabel learning. Pattern Recogn. 45(9), 3084–3104 (2012)
19. Osojnik, A., Panov, P., Džeroski, S.: Multi-label classification via multi-target regression on data streams. In: Discovery Science DS 2015, pp. 170–185 (2015)
20. Osojnik, A., Panov, P., Džeroski, S.: Multi-label classification via multi-target regression on data streams. Mach. Learn. 106(6), 745–770 (2017). DOI 10.1007/s10994-016-5613-5. URL https://doi.org/10.1007/s10994-016-5613-5
21. Oza, N.C., Russell, S.: Online bagging and boosting. In: Artificial Intelligence and Statistics 2001, pp. 105–112. Morgan Kaufmann (2001)
22. Page, E.S.: Continuous inspection schemes. Biometrika 41(1/2), 100–115 (1954)
23. Read, J., Bifet, A., Holmes, G., Pfahringer, B.: Scalable and efficient multi-label classification for evolving data streams. Mach. Learn. 88(1-2), 243–272 (2012)
24. Read, J., Pfahringer, B., Holmes, G., Frank, E.: Classifier chains for multi-label classification. In: Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, ECML PKDD '09, pp. 254–269. Springer-Verlag, Berlin, Heidelberg (2009)
