Rank Aggregation Algorithm Selection Meets Feature Selection

Alexey Zabashta, Ivan Smetannikov(B), and Andrey Filchenkov

Computer Science Department, ITMO University, 49 Kronverksky Pr., 197101 St. Petersburg, Russia
{azabashta,ismetannikov,afilchenkov}@corp.ifmo.ru
Abstract. Rank aggregation is an important task in many areas, and different rank aggregation algorithms have been created to find an optimal rank. Nevertheless, none of these algorithms is the best in all cases. The main goal of this work is to develop a method that, for each rank list, determines which rank aggregation algorithm is the best for this rank list. The Canberra distance is used as the metric for determining the optimal ranking. Three approaches are proposed in this paper, and one of them has shown promising results. We also discuss how this approach can be applied to learning a filtering feature selection algorithm ensemble.
Keywords: Meta-learning · Rank aggregation · Ensemble learning · Feature selection

1 Introduction
In many domains where multiple ranking algorithms are applied in practice, the task of rank aggregation arises. An example of such a field is computational biology, where several rank aggregation algorithms are used to detect how physiological characteristics depend on genes [7,10]. In the Web search problem, many different rank aggregation algorithms are also used. Each of them exploits some features for document evaluation, such as document popularity, the quality of matching between a document and a query, information source authority and others [13,24]. For a given set of algorithms solving the rank aggregation problem, it is usually not obvious which one is the best. This situation is typical in other domains as well: different algorithms are better than others on different tasks. John Rice formulated this problem in a general form in 1975 [22]: how to find the algorithm that minimizes an error function for a given problem without executing each algorithm.

The solution of the algorithm selection problem is formulated under the meta-learning approach [6]. Meta-learning algorithms treat the algorithm selection problem as a prediction problem: they require a training set to learn a model for predicting the best algorithm for a new dataset.
Since the problem of finding the best possible resulting rank is usually NP-hard, approximate algorithms are used to find it. Each of these algorithms is usually good for a certain type of rank optimization problem. However, the problem of rank aggregation algorithm selection has never been solved in the scientific literature. In this work, we deal with the selection of the best algorithm for aggregating given ranks of the same length. We extend the previous research on this problem [29] by suggesting new approaches involving meta-feature selection. We also propose an application of the rank aggregation algorithm selection system to the problem of feature selection.

Some definitions and approaches for the rank aggregation problem (including a description of all algorithms for generating permutation lists) are given in Section 2. Section 3 contains definitions and a general scheme for applying meta-learning to the algorithm selection problem. In Section 4, the experiments are explained and their results are presented and discussed. In Section 5, an application of the proposed rank aggregation algorithm selection system to designing a novel ensemble-based feature selection algorithm is proposed. Section 6 concludes the paper with a summary of the obtained results and an outline of future work.
2 Rank Aggregation

2.1 Metrics on Permutations
The mathematical formalization of rank involves permutations. A permutation π is an ordered set {π1, π2, ..., πn} of distinct natural numbers between 1 and n, where n is the length of the permutation. The set of all permutations of length n will be denoted by Πn. A metric μ(a, b) on the permutation space is a function from Πn × Πn to R, which satisfies the following axioms [11]:
– μ(a, b) = μ(b, a), symmetry;
– μ(a, b) = 0 ⇔ a = b, coincidence axiom;
– μ(a, b) ≥ 0, non-negativity;
– μ(a, b) = μ(a · c, b · c), right-invariance, where a · c is the product of permutations a and c.
We use the short term "metric" to refer to a metric on the permutation space. In this paper we use seven metrics on the permutation space described in [11] and [7]. The Minkowski distances between a and b are adaptations of the classical Minkowski distances for vectors to the case of the permutation space [11]:
– $l_1(a, b) = \sum_i |a(i) - b(i)|$, the Manhattan distance;
– $l_2(a, b) = \sqrt{\sum_i (a(i) - b(i))^2}$, the Euclidean distance;
– $l_\infty(a, b) = \max_i |a(i) - b(i)|$, the Chebyshev distance,
where p(i) is the position of number i in the permutation p.
The Canberra distance between a and b is a modification of the Euclidean distance [7], which assigns higher priority to differences in the top positions of the permutation, and which is equal to
$\sum_i \frac{|a(i) - b(i)|}{a(i) + b(i)}$.
The Cayley distance between a and b is the minimum number of transpositions required to obtain b from a [11], which is equal to $n - \mathrm{loops}(a \cdot b^{-1})$, where $\mathrm{loops}(a \cdot b^{-1})$ is the number of loops (cycles) in the permutation $a \cdot b^{-1}$ and $b^{-1}$ is the inverse permutation of b. The Kendall tau rank distance between a and b is the number of pairwise disagreements between a and b [11], which equals
$\sum_{i < j} I[(a(i) - a(j))(b(i) - b(j)) < 0]$.
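For concreteness, the following Python sketch (not from the original paper) computes several of these metrics for permutations given in one-line notation; positions are 1-based, matching the convention that p(i) is the position of number i in p.

```python
import itertools
from math import sqrt

def positions(perm):
    """pos[x] = 1-based position of element x in the permutation."""
    return {x: i + 1 for i, x in enumerate(perm)}

def manhattan(a, b):
    pa, pb = positions(a), positions(b)
    return sum(abs(pa[x] - pb[x]) for x in pa)

def euclidean(a, b):
    pa, pb = positions(a), positions(b)
    return sqrt(sum((pa[x] - pb[x]) ** 2 for x in pa))

def chebyshev(a, b):
    pa, pb = positions(a), positions(b)
    return max(abs(pa[x] - pb[x]) for x in pa)

def canberra(a, b):
    pa, pb = positions(a), positions(b)
    return sum(abs(pa[x] - pb[x]) / (pa[x] + pb[x]) for x in pa)

def kendall_tau(a, b):
    pa, pb = positions(a), positions(b)
    return sum(1 for x, y in itertools.combinations(pa, 2)
               if (pa[x] - pa[y]) * (pb[x] - pb[y]) < 0)

def cayley(a, b):
    """Minimum number of transpositions turning b into a:
    n minus the number of cycles of the permutation a . b^{-1}."""
    sigma = {b[i]: a[i] for i in range(len(a))}
    seen, cycles = set(), 0
    for start in sigma:
        if start in seen:
            continue
        cycles += 1
        x = start
        while x not in seen:
            seen.add(x)
            x = sigma[x]
    return len(a) - cycles

# Example: kendall_tau((1, 2, 3, 4), (2, 1, 4, 3)) == 2 and cayley((1, 2, 3, 4), (2, 1, 4, 3)) == 2
```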
The Ulam distance between a and b is n minus the length of the longest common subsequence of a and b [1].

2.2 Rank Aggregation Algorithms

A rank aggregation algorithm takes a list Q of m permutations of the same length n and returns a single permutation π that minimizes an error function E(π, Q), which measures how far π is from the permutations in Q. For elements i and j, let $v_{i,j} = \sum_{\pi \in Q} I[\pi(i) < \pi(j)]$ denote the number of permutations in Q in which element i is ranked above element j.
The algorithm, which we will refer to as the weight-based algorithm (WBA), assigns a weight $w_i$ to each ith element and returns the permutation π = {π1, π2, ..., πn} in which $i \ge j \Leftrightarrow w_{\pi_i} \le w_{\pi_j}$, i.e., elements are ordered by non-increasing weight. The Borda count is a WBA which assigns the weight $w_i = \sum_{\pi \in Q} f(n - \pi(i))$ to each ith element, where f is an increasing function [5]. We use the original increasing function for the Borda count, which is f(x) = x.
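As an illustration, here is a minimal Python sketch (not part of the paper) of the Borda count as a weight-based aggregator, assuming the reconstructed weight $w_i = \sum_{\pi \in Q} f(n - \pi(i))$ with f(x) = x:

```python
def borda_count(Q):
    """Borda count sketch: Q is a list of permutations in one-line notation
    (values 1..n).  Every element receives n - position points from each
    input permutation (i.e. f(x) = x), and the aggregated permutation lists
    elements in order of non-increasing total weight."""
    n = len(Q[0])
    weights = {x: 0 for x in range(1, n + 1)}
    for perm in Q:
        for pos, x in enumerate(perm, start=1):
            weights[x] += n - pos
    return sorted(weights, key=lambda x: -weights[x])

# borda_count([[1, 2, 3], [2, 1, 3], [1, 3, 2]]) -> [1, 2, 3]
```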
The Copeland's score is a WBA which assigns the weight
$w_i = \sum_{j \neq i} \left( I[v_{i,j} > l_{i,j}] + 0.5 \cdot I[v_{i,j} = l_{i,j}] \right)$
to each ith element [8], where $l_{i,j}$ is the number of permutations in which element j is ranked above element i.

The main idea of Markov chain methods is to construct a Markov chain transition matrix T and find its stationary distribution [13], which we use as a weight vector in the same way as for a WBA. We use two Markov chain methods. In the first method, $T_{i,j} = 1/n$ if $v_{i,j} > 0$, and 0 otherwise. In the second method, $T_{i,j} = 1/n$ if $v_{i,j} > 0.5 \cdot m$. In both methods, $T_{i,i} = 1 - \sum_{j \neq i} T_{i,j}$.

The Local Kemenization method sorts the elements of the resulting permutation using the following comparator: $\pi_i \le \pi_j \Leftrightarrow v_{i,j} \ge v_{j,i}$ [13]. The "Pick a perm" method minimizes the given error function over the permutations from the input permutation list Q [7]. It returns $\arg\min_{\pi \in Q} E(\pi, Q)$.
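Below is a minimal sketch of a Markov chain aggregator written in the spirit of the MC-type methods of Dwork et al. [13]; the assumption here is that the chain moves from an element to any element ranked above it in at least one input permutation, and that the stationary distribution, approximated by power iteration, serves as the WBA weight vector.

```python
import numpy as np

def mc1_weights(Q, n_iter=1000):
    """Markov chain aggregation sketch: states are elements 1..n, and from
    element i the chain may move with probability 1/n to any element j that is
    ranked above i in at least one permutation of Q.  The stationary
    distribution is used as the weight vector, exactly as for a WBA."""
    n = len(Q[0])
    pos = [{x: p for p, x in enumerate(perm)} for perm in Q]  # 0-based positions
    T = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j and any(p[j + 1] < p[i + 1] for p in pos):
                T[i, j] = 1.0 / n
        T[i, i] = 1.0 - T[i].sum()  # remaining probability mass: stay in place
    dist = np.full(n, 1.0 / n)
    for _ in range(n_iter):
        dist = dist @ T              # power iteration towards the stationary distribution
    return {x + 1: dist[x] for x in range(n)}

def mc_aggregate(Q):
    w = mc1_weights(Q)
    return sorted(w, key=lambda x: -w[x])   # higher weight = better rank
```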
2.3 Algorithms for Generating Permutations
In this subsection, we introduce a scheme of an algorithm for generating permutations and describe three algorithms we use to generate a single permutation of fixed length. We follow [29]. An algorithm for generating permutations, for a given permutation length, returns a permutation of this length. It is parameterized with a variable σ ∈ [0, 1], which can be understood as the degree of distinction between the resulting permutation and the identity permutation.

Algorithm A is a modification of the Fisher-Yates shuffle [15]. In the original algorithm, on each ith step the transposition of the ith element with the jth element is performed, where j is a random number generated with the uniform distribution U(1, i). In the modified algorithm, we make a transposition on each step only if the following condition is satisfied: (n − i + 1) · b < n · σ − s, where b is a random number generated from the uniform distribution U(0, 1) and s is the number of transpositions that have already been made. With σ = 1, it is the ordinary Fisher-Yates shuffle.

Algorithm B is another modification of the Fisher-Yates shuffle. The difference is that on each step we make the transposition of the ith element with the element at position i − |δ|, where δ is a random number generated from the normal (Gaussian) distribution N(0, i · σ) with zero mean and scale i · σ.

Algorithm C makes n · σ transpositions of elements at two random positions generated from U(1, n).
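A hedged Python sketch of Algorithm A follows (not the authors' code); it assumes the steps run in ascending order i = 1, ..., n, which makes the σ = 1 case coincide with the ordinary Fisher-Yates shuffle as stated above.

```python
import random

def generate_permutation_a(n, sigma):
    """Sketch of Algorithm A, a modified Fisher-Yates shuffle.  At step i the
    usual random transposition is performed only if
    (n - i + 1) * b < n * sigma - s, where b ~ U(0, 1) and s counts the
    transpositions made so far."""
    perm = list(range(1, n + 1))
    s = 0
    for i in range(1, n + 1):
        b = random.random()
        if (n - i + 1) * b < n * sigma - s:
            j = random.randint(1, i)                       # j ~ U(1, i), 1-based
            perm[i - 1], perm[j - 1] = perm[j - 1], perm[i - 1]
            s += 1                                         # count the transposition
    return perm
```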
2.4 Algorithms for Generating Permutation Lists
In this subsection, we introduce a scheme of an algorithm for generating permutation lists and describe three algorithms we use to generate a list of permutations of fixed length. We follow [29].
An algorithm for generating permutation lists (AGPL), for a given permutation length and number of permutations, generates these permutations. It uses an algorithm for generating permutations and parameters l, r ∈ [0, 1] to choose σ uniformly in the interval [l, r] for the algorithm for generating permutations. Formally,
$\sigma_i = l + \frac{(i - 1)(r - l)}{m - 1}$,
where m is the number of permutations in the resulting list and $\sigma_i$ is the value used for the ith permutation of the resulting list. We described three algorithms for generating permutations in the previous subsection. Each of them can be used to generate permutation lists. We will refer to the permutation list generator that uses Algorithm A, B or C as AGPL-A, AGPL-B and AGPL-C, respectively.
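The following sketch shows how an AGPL can be assembled from any single-permutation generator (for example, the generate_permutation_a sketch above); the function name and signature are illustrative, not the authors'.

```python
def generate_permutation_list(n, m, l, r, generator):
    """AGPL sketch: produces m permutations of length n, where the i-th
    permutation is generated with sigma_i = l + (i - 1) * (r - l) / (m - 1).
    `generator` is any single-permutation generator taking (n, sigma)."""
    perms = []
    for i in range(1, m + 1):
        sigma_i = l + (i - 1) * (r - l) / (m - 1) if m > 1 else l
        perms.append(generator(n, sigma_i))
    return perms

# Example of a list from the AGPL-A family, as used in the experiments below:
# Q = generate_permutation_list(80, 20, 0.2, 0.8, generate_permutation_a)
```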
3 Meta-Learning for Rank Aggregation Algorithm Selection

3.1 Meta-Learning
Meta-learning is an approach used to predict the best algorithm for solving a given task without executing the candidate algorithms. The core idea of this approach is to use information about the given task that helps to make assumptions about the best algorithm. Using this information helps to avoid satisfying the conditions of the No Free Lunch theorems [28]. Information about tasks is described with so-called meta-features, which are properties of the tasks that can be measured efficiently (otherwise, it would be faster to run each algorithm and compare their performance) [17]. Many different meta-feature types exist. Nevertheless, the meta-feature engineering step is the most complicated step in applying meta-learning. After the set of meta-features is chosen, each task is associated with a vector of meta-feature values. Thus, each task is a point in the meta-feature space.

Each meta-learning system has a set of algorithms it can choose from. To learn such a system, a training set of tasks is required. This training set contains tasks and information about the algorithms' performance on these tasks, which is used as labels for the tasks. The system is learnt on these tasks and can predict the best algorithm for a new task. Thus, meta-learning reduces the algorithm selection problem to a supervised learning problem.
3.2 Baseline Approach
The AGPL described in Section 2 uses two hidden variables l and r as parameters to generate a permutation list. We found that there is a certain dependency between the optimal algorithm for a dataset and the values of l and r with which this dataset was generated. These dependencies can be seen in Fig. 1, Fig. 2 and Fig. 3, in which
Fig. 1. Visualization of the distribution of optimal rank aggregation algorithms in the HVA feature space for AGPL-A data. Red points correspond to permutation lists for which the Borda count is optimal; orange to the Markov Chain-1; green to the Markov Chain-2; cyan to the Copeland's score; blue to the Local Kemenization; magenta to "Pick a perm".
for each dataset described with l and r, the best algorithm for this dataset is presented; l and r are the parameters with which this dataset was generated by means of AGPL-A, AGPL-B and AGPL-C, respectively. These three figures demonstrate that a clear dependency exists between the parameters l and r describing a dataset and the best algorithm for this dataset.

The hidden variable approach (HVA) describes a permutation list with two features: the l and r with which it was generated. We can apply this approach only to synthetically generated data, because we have no access to the values of these hidden variables in real-world problems.

3.3 Approaches Without Meta-Feature Selection
In this subsection, we describe three approaches to generating meta-features. The first two are taken from [29], and the last one is novel. The basic meta-features approach (BMFA), for each metric μ, explores all possible pairs of permutations from the input list Q and constructs the sequence $X_\mu = \{\mu(a, b) : a, b \in Q\}$.
Fig. 2. Visualization of the distribution of optimal rank aggregation algorithms in the HVA feature space for AGPL-B data.
After that, it extracts statistical characteristics (minimum, maximum, average, variance, skewness and kurtosis) from each sequence $X_\mu$ as meta-features. The meta-feature set thus consists of 7 (the number of metrics) × 6 (the number of features extracted per metric) = 42 meta-features.

BMFA can be accelerated. The slowest part of this approach is the scan over all possible pairs. It costs O(m²) time, and the whole algorithm works in O(n · log(n) · m²) time. We are interested in decreasing this complexity, because some rank aggregation algorithms work faster; for example, the Borda count works in O(n · (log(n) + m)) time, so BMFA would dominate the cost in practice. To solve this problem, we aggregate the input permutation list into a single permutation c by means of a faster method; in this paper, we use the Borda count. Then we construct the sequence $X_\mu = \{\mu(\pi, c) : \pi \in Q\}$ containing the distances between c and the permutations from the input list, and extract features in the same way as in BMFA. The new algorithm, which we will refer to as the accelerated meta-features approach (AMFA), works in O(n · log(n) · m) time. AMFA creates 42 meta-features that do not intersect with the meta-features generated by BMFA.
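A sketch of BMFA- and AMFA-style meta-feature extraction is shown below; it builds on the metric and borda_count sketches above, uses unordered pairs for BMFA, and takes the skewness and kurtosis statistics from scipy. This is an illustration of the scheme, not the authors' implementation.

```python
import itertools
import numpy as np
from scipy.stats import skew, kurtosis

def six_stats(values):
    """Minimum, maximum, average, variance, skewness and kurtosis of a sequence."""
    v = np.asarray(values, dtype=float)
    return [v.min(), v.max(), v.mean(), v.var(), skew(v), kurtosis(v)]

def bmfa_features(Q, metrics):
    """BMFA sketch: statistics of all pairwise distances, for every metric."""
    feats = []
    for mu in metrics:
        X = [mu(a, b) for a, b in itertools.combinations(Q, 2)]
        feats.extend(six_stats(X))
    return feats

def amfa_features(Q, metrics, aggregate=borda_count):
    """AMFA sketch: statistics of distances to a quickly aggregated centre
    (here the borda_count sketch above), for every metric."""
    c = aggregate(Q)
    feats = []
    for mu in metrics:
        X = [mu(pi, c) for pi in Q]
        feats.extend(six_stats(X))
    return feats

# With the seven metrics of Section 2.1 each approach yields 7 * 6 = 42
# meta-features, e.g. bmfa_features(Q, [manhattan, euclidean, canberra, ...]).
```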
Fig. 3. Visualization of the distribution of optimal rank aggregation algorithms in the HVA feature space for AGPL-C data.
The pairwise meta-features approach (PMFA) explores, for each index i, all possible pairs of permutations from the input list Q and collects the values into the sequence $X = \{|a(i) - b(i)| : a, b \in Q,\ 1 \le i \le n\}$. After that, it extracts the same statistical characteristics from the sequence X as meta-features. This gives 6 meta-features.

3.4 Approaches with Meta-Feature Selection
The BMFA and AMFA approaches generate a lot of meta-features. Since the number of features is high and not all of them are useful, this may reduce the accuracy and slow down the meta-classifiers. Therefore, we apply feature selection algorithms in order to reduce the number of features. A more detailed explanation of feature selection and its purpose can be found in Section 5. Because this work is devoted to meta-learning, we used meta-learning for feature selection algorithm selection. To the best of our knowledge, only two papers are devoted to this topic: by Wang et al. [26] and by Filchenkov and Pendryak [14]. The latter improves on the former, which is why we use it. The system
recommends three of the 16 feature selection algorithms implemented in the WEKA library that are predicted to be the best for preprocessing a given dataset. Thus, we can introduce a new scheme for designing a rank aggregation algorithm selection system that uses a feature selection recommender system together with an approach for meta-feature generation. For a given dataset, the system suggests three feature selection algorithms. Each of them is then executed, and the preprocessed datasets are given as inputs to the approach. We used the three recommended feature selection algorithms for BMFA and AMFA, and we will refer to the resulting variants as BMFA-X, BMFA-Y, BMFA-Z, AMFA-X, AMFA-Y, and AMFA-Z.

The last approach we present is the merged meta-feature space approach (MMFSA). It is based on the application of the feature selection technique to the merged meta-feature space: we used the 42 meta-features from BMFA, the 42 meta-features from AMFA and the 6 meta-features from PMFA, resulting in 90 meta-features. We will refer to the three algorithms obtained after applying the three recommended feature selection algorithms as MMFSA-X, MMFSA-Y and MMFSA-Z. In Table 1, one can see the number of meta-features selected by the three feature selection algorithms for the three meta-feature approaches on the three AGPLs.
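To make the MMFSA step concrete, here is a sketch that merges the meta-feature sets and reduces the merged space with a single univariate filter from scikit-learn, building on the bmfa_features and amfa_features sketches above; the mutual-information score and the value k = 30 are illustrative stand-ins for the three WEKA selectors recommended by the system of [14].

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, mutual_info_classif

def merged_meta_features(Q, metrics):
    """Merged meta-feature space: BMFA and AMFA features concatenated
    (PMFA features would be appended in the same way)."""
    return bmfa_features(Q, metrics) + amfa_features(Q, metrics)

def reduce_meta_feature_space(lists, labels, metrics, k=30):
    """MMFSA-style sketch: keep the k strongest meta-features according to a
    univariate filter applied to the merged space."""
    X = np.array([merged_meta_features(Q, metrics) for Q in lists])
    selector = SelectKBest(mutual_info_classif, k=k).fit(X, labels)
    return selector.transform(X), selector
```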
Table 1. The number of selected meta-features for each approach.

            AGPL-A  AGPL-B  AGPL-C
BMFA-X        19      28      16
BMFA-Y        37      38      31
BMFA-Z        35      18      28
AMFA-X        13      22      12
AMFA-Y        34      36      29
AMFA-Z        28      17      30
MMFSA-X       35      54      20
MMFSA-Y       50      61      43
MMFSA-Z       82      36      80

4 Experiments

4.1 Experiment Setup
We conduct experiments for each approach described in Section 3. Each approach returns a feature description of a permutation list. Each permutation list is labeled with the aggregation algorithm that minimizes the error function, which is based on the Canberra distance. We use the following classifiers to solve the described classification problem: bagging (Bag), decision trees (DT), LogitBoost (LB), ClassificationViaRegression (Regres), and support vector machines (SVM). All implementations were taken from the WEKA library [16].
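The meta-level evaluation can be reproduced along the following lines; scikit-learn models are used here as stand-ins for the WEKA implementations, and cross-validated macro F1 mirrors the protocol described in the next subsections.

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def evaluate_meta_classifiers(X, y, folds=10):
    """Cross-validated macro F1 of several meta-level classifiers.
    X is the meta-feature matrix produced by one of the approaches of
    Section 3; y holds the label of the best rank aggregation algorithm
    for each permutation list."""
    models = {
        "Bag": BaggingClassifier(),      # bagged decision trees by default
        "DT": DecisionTreeClassifier(),
        "SVM": SVC(),
    }
    return {name: cross_val_score(clf, X, y, cv=folds, scoring="f1_macro").mean()
            for name, clf in models.items()}
```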
The result of the experiments is a confusion matrix C, where $C_{i,j}$ is the number of objects from the ith class that the classifier has assigned to the jth class. We used the F1-measure as a quality measure to compare the classification algorithms.

4.2 Experiment Details
We generated permutation lists of the following family: the length of the permutations equals 80, and the number of permutations in each list equals 20. For generating each permutation list, we randomly drew the l and r values for the AGPL from the uniform distribution U(0, 1). We used the six rank aggregation methods described in Section 2 as labels: the Borda count (BC), the Copeland's score (CS), the two Markov chain methods (MC-1 and MC-2), the Local Kemenization (LK), and "Pick a perm" (PP). For each of the algorithms for generating permutation lists, we generated 1000 permutation lists per class, so each sample contained 6000 objects. We ran each experiment 10 times and used cross-validation to measure the efficacy of classification.

4.3 Results
Table 2, Table 3 and Table 4 contain the resulting F1-scores for each classifier used on the meta-level, for each approach, on rank lists generated with AGPL-A, AGPL-B and AGPL-C, respectively.

Table 2. Results on rank lists generated with AGPL-A.

FAlg      Bag    DT     LB     Regres SVM
HVA       0.623  0.625  0.587  0.636  0.506
PMFA      0.549  0.559  0.566  0.580  0.568
BMFA      0.648  0.638  0.649  0.645  0.645
BMFA-X    0.641  0.635  0.652  0.644  0.649
BMFA-Y    0.637  0.638  0.652  0.641  0.647
BMFA-Z    0.647  0.638  0.651  0.643  0.646
AMFA      0.631  0.623  0.638  0.626  0.632
AMFA-X    0.636  0.619  0.630  0.645  0.627
AMFA-Y    0.629  0.622  0.636  0.632  0.633
AMFA-Z    0.629  0.622  0.639  0.635  0.632
MMFSA-X   0.651  0.634  0.654  0.658  0.650
MMFSA-Y   0.645  0.638  0.652  0.655  0.649
MMFSA-Z   0.646  0.638  0.655  0.641  0.659

Table 3. Results on rank lists generated with AGPL-B.

FAlg      Bag    DT     LB     Regres SVM
HVA       0.434  0.442  0.422  0.445  0.335
PMFA      0.391  0.379  0.386  0.415  0.359
BMFA      0.452  0.440  0.456  0.449  0.460
BMFA-X    0.449  0.441  0.455  0.451  0.465
BMFA-Y    0.452  0.440  0.456  0.449  0.460
BMFA-Z    0.448  0.441  0.457  0.456  0.461
AMFA      0.455  0.447  0.450  0.443  0.453
AMFA-X    0.457  0.454  0.463  0.452  0.455
AMFA-Y    0.455  0.447  0.456  0.446  0.456
AMFA-Z    0.454  0.442  0.452  0.450  0.448
MMFSA-X   0.456  0.447  0.455  0.439  0.456
MMFSA-Y   0.465  0.447  0.455  0.449  0.466
MMFSA-Z   0.459  0.446  0.453  0.442  0.462
As we can see, the SVM meta-classifier under MMFSA-Z showed the highest result on the datasets generated with AGPL-A. The confusion matrix for this approach is presented in Table 5. Table 6, Table 7 and Table 8 contain the top meta-features for each AGPL. We used Information Gain (IG) [21] as the feature importance measure.
Table 4. Results on rank lists generated with AGPL-C.

FAlg      Bag    DT     LB     Regres SVM
HVA       0.570  0.579  0.556  0.574  0.484
PMFA      0.508  0.503  0.510  0.526  0.498
BMFA      0.590  0.576  0.585  0.577  0.582
BMFA-X    0.582  0.579  0.578  0.596  0.578
BMFA-Y    0.590  0.576  0.587  0.589  0.581
BMFA-Z    0.581  0.575  0.592  0.598  0.582
AMFA      0.582  0.563  0.587  0.585  0.592
AMFA-X    0.593  0.564  0.572  0.597  0.582
AMFA-Y    0.591  0.563  0.578  0.596  0.589
AMFA-Z    0.589  0.563  0.582  0.597  0.591
MMFSA-X   0.632  0.578  0.613  0.649  0.609
MMFSA-Y   0.632  0.584  0.611  0.633  0.616
MMFSA-Z   0.629  0.584  0.617  0.633  0.631
Table 5. Confusion matrix for SVM under MMFSA-Z for datasets generated with AGPL-A.

       BC   MC1  MC2  CS   LK   PP
BC     956   44    0    0    0    0
MC1    108  825    0   12   36   19
MC2    145   53  615   23  152   12
CS     142  154   58   53  516   77
LK      21  123   49   47  638  122
PP       1   16    4    2   33  944
Table 6. Information gain values of meta-features for datasets generated with AGPL-A.

Meta-feature                  IG
BMFA-E(CayleyDistance)        1.0939
BMFA-E(CanberraDistance)      1.0844
BMFA-E(UlamDistance)          1.0802
PMFA-E                        1.0695
BMFA-E(ManhattanDistance)     1.0695
BMFA-E(KendallTau)            1.0569
AMFA-E(EuclideanDistance)     1.0567
BMFA-E(EuclideanDistance)     1.0547
AMFA-E(KendallTau)            1.0526
AMFA-E(ManhattanDistance)     1.0418
AMFA-E(CanberraDistance)      1.0382
PMFA-Skew                     0.9889
AMFA-E(UlamDistance)          0.9684
AMFA-Min(ManhattanDistance)   0.9564
PMFA-Kurt                     0.9512

Table 7. Information gain values of meta-features for datasets generated with AGPL-B.

Meta-feature                  IG
BMFA-σ(CanberraDistance)      0.5613
BMFA-σ(KendallTau)            0.5600
BMFA-σ(ManhattanDistance)     0.5571
BMFA-σ(EuclideanDistance)     0.5472
BMFA-Kurt(CanberraDistance)   0.5179
AMFA-E(CanberraDistance)      0.5151
BMFA-Min(ManhattanDistance)   0.5130
AMFA-Min(CanberraDistance)    0.5058
BMFA-Min(CanberraDistance)    0.5048
AMFA-E(ManhattanDistance)     0.5013
BMFA-Min(EuclideanDistance)   0.4999
BMFA-Min(KendallTau)          0.4969
AMFA-E(EuclideanDistance)     0.4950
AMFA-σ(ChebyshevDistance)     0.4947
AMFA-Min(ManhattanDistance)   0.4938
Table 8. Information gain values of meta-features for datasets generated with AGPL-C.

Meta-feature                  IG
BMFA-E(CayleyDistance)        0.9610
BMFA-E(UlamDistance)          0.9563
BMFA-E(CanberraDistance)      0.9472
PMFA-E                        0.9415
BMFA-E(ManhattanDistance)     0.9415
BMFA-E(KendallTau)            0.9195
BMFA-E(EuclideanDistance)     0.9108
AMFA-E(KendallTau)            0.9080
AMFA-E(ManhattanDistance)     0.8983
AMFA-E(EuclideanDistance)     0.8969
AMFA-E(CanberraDistance)      0.8855
PMFA-Skew                     0.8540
AMFA-E(UlamDistance)          0.8265
AMFA-Min(ManhattanDistance)   0.8192
AMFA-Min(KendallTau)          0.8117

4.4 Discussion
As the results show, the rank list generation algorithm is a very impactful factor: the most complex task was predicting rank aggregation algorithms for data generated with AGPL-B. The meta-feature approaches, however, have a lesser impact on the success. It is not unexpected that the last approach, involving all the meta-features, outperformed all the other approaches in most of the cases. There are two notable exceptions:
– For decision trees, many approaches showed the same performance. This can be simply explained by the fact that decision trees select features themselves.
– For data generated with AGPL-B, several classifiers showed better performance with other approaches. However, these results were low, so we can claim that these classifiers do not work well for this task (0.454 for DT, 0.457 for LB, and 0.456 for Regres versus 0.465 for Bag and 0.466 for SVM in Table 3).
In general, the application of feature selection has improved most of the initial approaches for most of the classifiers applied on the meta-level. As we can see from the comparison of the most important meta-features, they differ for data generated with different AGPLs. However, the most important meta-features are generated under BMFA. Despite the fact that PMFA generates only 6 meta-features, three of them are among the most important meta-features for AGPL-A and two for AGPL-C.
5 Application to Feature Selection

5.1 Feature Selection
With the development of social networks and other sources of massive data, the dimensionality reduction problem arises. In machine learning, two ways to reduce dimensionality are used: feature selection and feature extraction [18]. In this terminology, features are attributes that describe objects. These features are usually gathered or calculated directly and can belong to any type of data. Feature extraction methods are commonly used for pattern recognition and image processing [20]. Some of these methods require deep data analysis and feature engineering, although if no such expert knowledge is available, they might still work [3].

Feature extraction techniques usually do not consider that some features may be redundant or irrelevant. In order to remove such features, feature selection techniques are used. This approach requires the resulting feature set to be a subset of the original feature set. Feature selection algorithms can be divided into three groups: filters, wrappers and embedded algorithms [23]. Wrappers are usually considered the most accurate methods, but they have a very high computational cost. This is the result of a gradual feature subspace search with a classifier trained at each step. Another way to select features is to use embedded algorithms. Some classifiers, for example random forests, already have built-in feature selection. The main problem here is that this type of feature selection requires the use of that specific classifier. The last group is filtering methods. These methods evaluate single features or feature subsets using an inner measure of feature importance. Filters are quite fast and do not involve any classifier to select a feature subset. They are not that accurate in terms of the resulting classification. However, they are the de facto standard technique in many tasks, such as gene selection [9] or n-gram processing [27].
5.2 Feature Selection Algorithm Ensembling Involving Rank Algorithm Selection System
In order to improve filtering algorithms, filter ensembles are used. There are three ways to build a composition of filters. Methods of the first group use an ensemble of classifiers that are learnt on the datasets obtained after applying the filters [12]. Another way to build a composition of filters is to aggregate their feature importance measures; after that, some cutting rule is applied [25]. The third way to aggregate filters is to aggregate their ranking lists [4]. For this purpose, the approach described in this article can be used. We will explain the last option in detail. A ranking, or univariate, filter is a feature selection algorithm that uses a feature importance measure to estimate how important the different features of a given dataset are for predicting the true labels. All the features are ranked by the value of the used measure, and the top features are selected.
There are many different measures that can be used in feature selection via feature filtering. And, as we mentioned above, the best way to use filters is to aggregate several of them together. The main problem is that many different aggregation algorithms exist. The approach to creating a meta-learning-based filtering feature selection ensemble consists of the following three steps. First, given a set of filters, apply these filters, obtaining different feature ranked lists. Then, for the obtained feature ranked lists, select the most appropriate rank aggregation algorithm using meta-learning and apply this algorithm, obtaining the final feature ranks. Finally, the top features of the resulting rank are selected.
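A hedged sketch of this three-step pipeline is given below; the three scikit-learn filters and the aggregation function are placeholders (for example, the borda_count sketch from Section 2 can be passed), whereas the full system would instead query the meta-learning selector of Section 3 for the most appropriate aggregation algorithm.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, f_classif, chi2

def scores_to_permutation(scores):
    """One-line permutation of feature ids 1..d: the best-scoring feature first."""
    return [int(i) + 1 for i in np.argsort(-np.asarray(scores))]

def filter_ensemble_select(X, y, k, aggregate):
    """Step 1: run several filters, each producing a ranked list of features.
    Step 2: aggregate the rank lists with `aggregate` (e.g. the borda_count
            sketch, or the algorithm recommended by the selection system).
    Step 3: keep the k top features of the aggregated ranking.
    Note: chi2 assumes non-negative feature values."""
    filter_scores = [mutual_info_classif(X, y), f_classif(X, y)[0], chi2(X, y)[0]]
    Q = [scores_to_permutation(s) for s in filter_scores]
    final_order = aggregate(Q)
    return [fid - 1 for fid in final_order[:k]]   # 0-based indices of the selected features
```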
6 Conclusion
In this paper, we focused on the rank aggregation algorithm selection problem. We described several rank aggregation algorithms and provided a formalism for selecting the best one from a set. We also presented three algorithms for generating artificial ranking lists. These ranking lists were used to compare different rank aggregation algorithms. The baseline we used was the algorithm that predicts the best ranking algorithm using the hidden parameters of the data generating process. The experiments showed that the best approach was based on the combination of the basic approaches for meta-feature description and the application of feature selection algorithms to the resulting set. We also suggested an approach utilizing the presented meta-learning approach for feature selection algorithm ensemble learning.

In further research, we plan to improve the meta-learning performance by applying more advanced machine learning techniques. This direction is related to developing a better rank aggregation algorithm selection system. Another direction of future work is the implementation and experimental comparison of the proposed feature selection algorithm ensemble learning.

This work was financially supported by the Government of the Russian Federation, Grant 074-U01, and the Russian Foundation for Basic Research, Grant 16-37-60115 mol a dk.
References

1. Albert, M.H., Aldred, R.E., Atkinson, M.D., van Ditmarsch, H.P., Handley, B., Handley, C.C., Opatrny, J.: Longest subsequences in permutations. Australasian Journal of Combinatorics 28, 225–238 (2003)
2. Bachmaier, C., Brandenburg, F.J., Gleißner, A., Hofmeier, A.: On maximum rank aggregation problems. In: Lecroq, T., Mouchard, L. (eds.) IWOCA 2013. LNCS, vol. 8288, pp. 14–27. Springer, Heidelberg (2013)
3. Benner, P., Mehrmann, V., Sorensen, D.C.: Dimension Reduction of Large-Scale Systems, vol. 45. Springer, Heidelberg (2005)
4. Bolón-Canedo, V., Sánchez-Maroño, N., Alonso-Betanzos, A., Benítez, J., Herrera, F.: A review of microarray datasets and applied feature selection methods. Information Sciences 282, 111–135 (2014)
5. de Borda, J.C.: Mémoire sur les élections au scrutin (1781)
6. Brazdil, P., Carrier, C.G., Soares, C., Vilalta, R.: Metalearning: Applications to Data Mining. Springer Science & Business Media (2008)
7. Burkovski, A., Lausser, L., Kraus, J.M., Kestler, H.A.: Rank aggregation for candidate gene identification. In: Data Analysis, Machine Learning and Knowledge Discovery, pp. 285–293. Springer (2014)
8. Copeland, A.H.: A reasonable social welfare function. In: Seminar on Applications of Mathematics to Social Sciences. University of Michigan (1951)
9. Das, S., Das, A.K.: Sample classification based on gene subset selection. In: Behera, H.S., Mohapatra, D.P. (eds.) Computational Intelligence in Data Mining. AISC, vol. 410, pp. 227–236. Springer, India (2015)
10. DeConde, R.P., Hawley, S., Falcon, S., Clegg, N., Knudsen, B., Etzioni, R.: Combining results of microarray experiments: a rank aggregation approach. Statistical Applications in Genetics and Molecular Biology 5(1) (2006)
11. Deza, M., Huang, T.: Metrics on permutations, a survey. Journal of Combinatorics, Information and System Sciences. Citeseer (1998)
12. Dietterich, T.G.: Ensemble methods in machine learning. In: Kittler, J., Roli, F. (eds.) MCS 2000. LNCS, vol. 1857, pp. 1–15. Springer, Heidelberg (2000)
13. Dwork, C., Kumar, R., Naor, M., Sivakumar, D.: Rank aggregation methods for the web. In: Proceedings of the 10th International Conference on World Wide Web, pp. 613–622. ACM (2001)
14. Filchenkov, A., Pendryak, A.: Datasets meta-feature description for recommending feature selection algorithm. In: AINL-ISMW FRUCT, pp. 11–18 (2015)
15. Fisher, R.A., Yates, F., et al.: Statistical Tables for Biological, Agricultural and Medical Research, 6th edn. (1963)
16. Garner, S.R., et al.: Weka: the Waikato environment for knowledge analysis. In: Proceedings of the New Zealand Computer Science Research Students Conference, pp. 57–64. Citeseer (1995)
17. Giraud-Carrier, C.: Metalearning - a tutorial. In: Proceedings of the 7th International Conference on Machine Learning and Applications, pp. 1–45 (2008)
18. Guyon, I., Gunn, S., Nikravesh, M., Zadeh, L.A.: Feature Extraction: Foundations and Applications, vol. 207. Springer (2008)
19. Jones, N.C., Pevzner, P.: An Introduction to Bioinformatics Algorithms. MIT Press (2004)
20. Kekre, H.B., Shah, K.: Performance Comparison of Kekre's Transform with PCA and Other Conventional Orthogonal Transforms for Face Recognition, pp. 873–879. ICETET (2009)
21. Kent, J.T.: Information gain and a general measure of correlation. Biometrika 70(1), 163–173 (1983)
22. Rice, J.R.: The Algorithm Selection Problem (1975)
23. Saeys, Y., Inza, I., Larranaga, P.: A review of feature selection techniques in bioinformatics. Bioinformatics 23(19), 2507–2517 (2007)
24. Schalekamp, F., van Zuylen, A.: Rank aggregation: together we're strong. In: Proceedings of the Meeting on Algorithm Engineering & Experiments, pp. 38–51. Society for Industrial and Applied Mathematics (2009)
25. Smetannikov, I., Filchenkov, A.: MeLiF: filter ensemble learning algorithm for gene selection. In: Advanced Science Letters (2016, to appear)
26. Wang, G., Song, Q., Sun, H., Zhang, X., Xu, B., Zhou, Y.: A feature subset selection algorithm automatic recommendation method. Journal of Artificial Intelligence Research 47(1), 1–34 (2013)
27. Wang, R., Utiyama, M., Goto, I., Sumita, E., Zhao, H., Lu, B.L.: Converting continuous-space language models into n-gram language models with efficient bilingual pruning for statistical machine translation. ACM Transactions on Asian and Low-Resource Language Information Processing 15(3), 11 (2016)
28. Wolpert, D.H., Macready, W.G.: No free lunch theorems for optimization. IEEE Transactions on Evolutionary Computation 1(1), 67–82 (1997)
29. Zabashta, A., Smetannikov, I., Filchenkov, A.: Study on meta-learning approach application in rank aggregation algorithm selection. In: MetaSel Workshop at ECML PKDD 2015, pp. 115–117 (2015)