Toward the scalability of neural networks through feature selection

D. Peteiro-Barral*, V. Bolón-Canedo, A. Alonso-Betanzos, B. Guijarro-Berdiñas, N. Sánchez-Maroño
Laboratory for Research and Development in Artificial Intelligence (LIDIA), Computer Science Dept., University of A Coruña, 15071 A Coruña, Spain

Expert Systems with Applications 40 (2013) 2807-2816
Keywords: Neural networks; Machine learning; Feature selection; High dimensional datasets
Abstract

In the past few years, the bottleneck for machine learning developers is no longer the limited data available but the inability of algorithms to use all the data in the available time. For this reason, researchers are now interested not only in the accuracy but also in the scalability of machine learning algorithms. To deal with large-scale databases, feature selection can be helpful to reduce their dimensionality, turning an impracticable algorithm into a practical one. In this research, the influence of several feature selection methods on the scalability of four of the most well-known training algorithms for feedforward artificial neural networks (ANNs) is analyzed over both classification and regression tasks. The results demonstrate that feature selection is an effective tool to improve scalability.

© 2012 Elsevier Ltd. All rights reserved.
1. Introduction

Machine learning algorithms are increasingly being applied to data sets of high dimensionality, that is, data sets that present several of the following characteristics: the number of learning samples is very high, the number of input features is very high, or the number of groups or classes to be classified is very high. Most algorithms were developed when data set sizes were much smaller, but nowadays distinct compromises are required for small-scale and large-scale learning problems. Small-scale learning problems are subject to the usual approximation-estimation trade-off. In the case of large-scale learning problems, the trade-off is more complex because it involves not only the accuracy but also the computational complexity of the learning algorithm. Moreover, the majority of algorithms were designed under the assumption that the data set would be represented as a single memory-resident table, so if the entire data set does not fit in main memory these algorithms are useless. For all these reasons, scaling up learning algorithms is a trending issue. The organization of the workshop "PASCAL Large Scale Learning Challenge" at the 25th International Conference on Machine Learning (ICML'08) and the workshop "Big Learning" at the conference of the Neural Information Processing Systems Foundation (NIPS 2011) are cases in point. Scaling up is desirable because increasing the size of the training set often increases the accuracy of algorithms (Catlett, 1991). Scalability is defined as the effect that an increase in the size of the training set has on the computational performance of an algorithm: accuracy, training time and allocated memory. Thus the challenge is to find a trade-off among them or, in
other words, getting "good enough" solutions as "fast" as possible and as "efficiently" as possible. This issue becomes critical in situations with temporal or spatial constraints, such as real-time applications dealing with large data sets, unapproachable computational problems requiring learning, or initial prototyping requiring quickly-implemented solutions. Data partitioning is probably one of the most promising lines of research for scaling up learning algorithms: it involves breaking the data set up into subsets and learning from one or more of the subsets, thus avoiding the processing of data sets that are too large for main memory. On the other hand, selecting a subset of features, also known as feature selection, is a straightforward method for reducing problem size that is often forgotten in discussions of scaling. Feature selection is usually employed to avoid over-fitting (especially with small-size data sets), to train more reliable learners or to provide more insight into the underlying causal relationships; but as the number of samples increases (and thereby feature selection becomes less necessary from a data-fitting perspective), feature selection becomes more necessary from both run-time and spatial-complexity perspectives. The process of selecting a subset of features from the data set and ignoring irrelevant features for training could be an effective way to improve the performance and decrease the training time and memory requirements of a learning algorithm, thus scaling it up. In this paper, we study the influence of feature selection methods on the scalability of ANN training algorithms by using the measures defined during the PASCAL workshop (Sonnenburg, Franc, Yom-Tov, & Sebag, 2008). These measures evaluate the scalability of algorithms in terms of error, computational effort, allocated memory and training time. Previous results showed (Peteiro-Barral, Guijarro-Berdinas, Pérez-Sánchez, & Fontenla-Romero, 2011) that popular algorithms for ANNs are unable to deal with very large data sets, so preprocessing methods may be desirable
for reducing the input space size and improving scalability. We aim to demonstrate that feature selection methods are an appropriate approach to improve scalability. By reducing the number of input features and, consequently, the dimensionality of the data set, we expect to reduce the computational time while maintaining the performance on the other measures mentioned above, as well as being able to apply certain algorithms which otherwise could not deal with large data sets. There are three main models for feature selection: filters, wrappers and embedded methods (Guyon, 2006). Although wrappers and embedded methods tend to obtain better performance, they are very time consuming and become intractable on high dimensional data sets without compromising the time and memory requirements of machine learning algorithms; therefore this work focuses on the filter approach. The rest of the paper is structured as follows: Section 2 presents the background, Section 3 describes the feature selection methods employed, Section 4 introduces the experimental study and Sections 5 and 6 show the results and discussion, respectively. Finally, some conclusions are included in Section 7.
2. Background

The appearance of very large data sets is not by itself sufficient to motivate scaling efforts. The most commonly cited reason for scaling up algorithms is that increasing the size of the training data set typically increases the accuracy of algorithms (Catlett, 1991). In fact, learning from small data sets frequently decreases the accuracy of algorithms as a result of over-fitting. For most scaling problems the limiting factor has been the number of samples and the number of features describing each sample. An immediate question is the growth rate of the training time of an algorithm as the data set size increases. But temporal complexity does not reflect scaling in its entirety, and must be used in conjunction with other metrics. For scaling up learning algorithms the issue is not so much one of speeding up a slow algorithm as one of turning an impracticable algorithm into a practical one. The crucial point in question is seldom how fast you can run on a certain problem but rather how large a problem you can deal with (Provost & Kolluri, 1999). More precisely, space considerations are critical for scaling up learning algorithms. The absolute size of the main memory plays a key role in this matter. Almost all existing implementations of learning algorithms operate with the training set entirely in main memory. If the spatial complexity of the algorithm exceeds the main memory then the algorithm will not scale well - regardless of its computational complexity - because page thrashing renders algorithms useless. Page thrashing is the consequence of many accesses to disk occurring in a short time, drastically cutting the performance of a system using virtual memory. Virtual memory is a technique for making a machine behave as if it had more memory than it really has, by using disk space to simulate RAM; but accessing disk is much slower than accessing RAM. In the worst case scenario, out-of-memory exceptions will make algorithms unfeasible in practice. It is a fact that most existing learning algorithms were developed to handle medium-size data sets. A promising approach to scale up algorithms is to circumvent the need to run algorithms on very large data sets by partitioning the data. The data partitioning approach involves breaking the data set up into subsets, learning from one or more of the subsets, and possibly combining the results. On the one hand, data partitioning methods can be categorized based on whether they separate subsets of instances or subsets of features (Provost & Kolluri, 1999). On the other hand, data partitioning methods can be categorized based on whether they learn from single subsets or multiple subsets. Furthermore, multiple subsets can be processed in sequence or concurrently. The methods most used in practice for data partitioning are as follows:

- Instance sampling (single subset separated by instances).
- Feature selection (single subset separated by features).
- Incremental learning (multiple subsets processed in sequence).
- Distributed learning (multiple subsets processed concurrently).

2.1. Instance sampling

A common method to deal with large data sets is to select a single, smaller subset of samples. Instance sampling methods differ in the particular selection procedure used. The main methods for instance sampling involve simple random sampling, duplicate compaction, which removes duplicated instances from a data set, and stratified sampling, which selects instances of the minority classes with a greater frequency in order to even out the distribution (Catlett, 1991). The most important but very difficult task is to determine the appropriate sample size needed to maintain an acceptable accuracy, and some works on adaptive sampling have been carried out (Domingo, Gavaldà, & Watanabe, 2002; Weiss & Provost, 2001).

2.2. Feature selection

An orthogonal approach to instance selection is feature selection, which selects a single, smaller subset of features. To the best of the authors' knowledge, the majority of the literature on feature selection has not focused on scalability directly but on increasing the accuracy of learning algorithms by reducing the number of features in a proper manner. The three main methods for feature selection are as follows (Guyon, 2006):

Filters rely on the general characteristics of the training data, carrying out the feature selection process as a pre-processing step with independence of the learning algorithm. Wrappers involve a learning algorithm as a black box and consist of using its prediction performance to assess the relative usefulness of subsets of variables. In other words, the feature selection algorithm uses the learning algorithm as a subroutine, with the computational burden that comes from calling the learning algorithm to evaluate each subset of features. Embedded methods perform feature selection in the process of training the classifier and are usually specific to given learning machines. Therefore, the search for an optimal subset of features is built into the classifier construction, and can be seen as a search in the combined space of feature subsets and hypotheses. Feature selection has proven to be effective in many diverse fields, such as DNA microarray analysis (Yu & Liu, 2004), intrusion detection (Bolón-Canedo, Sánchez-Maroño, & Alonso-Betanzos, 2010; Lee, Stolfo, & Mok, 2000), text categorization (Forman, 2003; Gomez, Boiy, & Moens, 2011) or information retrieval (Egozi, Gabrilovich, & Markovitch, 2008), including image retrieval (Dy, Brodley, Kak, Broderick, & Aisen, 2003) or music information retrieval (Saari, Eerola, & Lartillot, 2011).

2.3. Incremental learning

Incremental learning methods learn from multiple subsets of data processed in sequence. Incremental learning is a type of learning where the learner updates its model whenever new data become available. In other words, when multiple subsets are being processed in sequence it is possible to take advantage of the knowledge learned in previous steps to guide learning in the next step. Incremental learning has been used to scale up to data sets that are too large for batch learning because of limits of main
memory. Some examples of incremental algorithms can be found in the literature (Parikh & Polikar, 2007; Pérez-Sánchez, Fontenla-Romero, & Guijarro-Berdiñas, 2010; Polikar, Upda, Upda, & Honavar, 2001; Ruping, 2001).

2.4. Distributed learning

Distributed learning methods learn from multiple subsets of data processed concurrently. In order to increase efficiency, learning can be parallelized by distributing the subsets of data to multiple processors, learning in parallel and then combining the results. In contrast to incremental learning, no prior knowledge from a previous step is needed as input to a subsequent step. Distributed learning has been used to scale up to data sets that are too large for batch learning and for which incremental learning is too slow. Some examples of distributed algorithms can be found in the literature (Ananthanarayana, Subramanian, & Murty, 2000; Chan & Stolfo, 1993; Peteiro-Barral, Guijarro-Berdinas, Pérez-Sánchez, & Fontenla-Romero, 2011; Tsoumakas & Vlahavas, 2002). All of these algorithms process instances in a distributed manner. While not common, there are some other developments that process features (Kargupta, Park, Sanseverino, Silvestre, & Hershberger, 1998; McConnell & Skillicorn, 2004; Skillicorn & McConnell, 2008).

Instance sampling offers speedups because learning from fewer instances is faster. But Catlett's work (Catlett, 1991) showed that, in general, learning from a smaller subset of data decreases accuracy, and consequently instance sampling is discarded in this research. In previous works, some research was done on both incremental (Pérez-Sánchez et al., 2010) and distributed learning (Peteiro-Barral et al., 2011). However, the impact of feature selection on the scalability of learning algorithms has not been explored in depth yet (Bolón-Canedo, Peteiro-Barral, Alonso-Betanzos, Guijarro-Berdinas, & Sánchez-Marono, 2011).

3. Filter methods for feature selection

Feature selection consists of detecting the relevant features and discarding the irrelevant ones. It has several advantages (Guyon, 2006), such as:
- Improving the performance of the machine learning algorithms.
- Data understanding, gaining knowledge about the process and perhaps helping to visualize it.
- Data reduction, limiting storage requirements and perhaps helping in reducing costs.
- Simplicity, possibility of using simpler models and gaining speed.

The benefits provided by feature selection may help in reducing the computational effort, allocated memory and training time, the measures that will be considered to study the scalability of machine learning algorithms. As stated in the Introduction, this work focuses on the filter model, due to the high computational demand of wrappers and embedded methods. Filters rely on the general characteristics of the training data in order to select features with independence of any predictor and are usually computationally less expensive than wrappers and embedded methods. Furthermore, filters have the ability to scale to large data sets and result in a better generalization because they act independently of the induction algorithm. There exist two major approaches for filtering: individual evaluation and subset evaluation (Yu & Liu, 2004). On the one hand, individual evaluation, also known as feature ranking, assesses individual features by assigning them weights according to their degree of relevance. On the other hand, subset evaluation produces candidate feature subsets based on a certain search strategy. Each candidate subset is evaluated by a certain evaluation measure and compared with the previous best one with respect to this measure. While individual evaluation is not able to remove redundant features, because redundant features are likely to have similar rankings, the subset evaluation approach can handle feature redundancy together with feature relevance. However, methods in this framework can suffer from the inevitable problem caused by searching through the feature subsets required in the subset generation step; thus, both approaches will be studied in this research. In what follows, the four filters used in this work are described. The first three follow the subset evaluation framework whilst the last one belongs to the individual evaluation approach.

3.1. Correlation-based Feature Selection (CFS)

Correlation-based Feature Selection (CFS) is a simple filter algorithm suitable for both classification and regression tasks. It ranks feature subsets according to a correlation-based heuristic evaluation function (Hall, 1999). The bias of the evaluation function is toward subsets that contain features that are highly correlated with the output to be predicted and uncorrelated with each other. Irrelevant features should be ignored because they will have low correlation with the class. Redundant features should be screened out as they will be highly correlated with one or more of the remaining features. The acceptance of a feature will depend on the extent to which it predicts classes in areas of the instance space not already predicted by other features. CFS's feature subset evaluation function is:

$$M_S = \frac{k\,\overline{r_{cf}}}{\sqrt{k + k(k-1)\,\overline{r_{ff}}}}$$

where $M_S$ is the heuristic 'merit' of a feature subset $S$ containing $k$ features, $\overline{r_{cf}}$ is the mean feature-class correlation ($f \in S$) and $\overline{r_{ff}}$ is the average feature-feature intercorrelation. The numerator of this equation can be thought of as providing an indication of how predictive of the class a set of features is, and the denominator of how much redundancy there is among the features.
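As an illustration of the evaluation function above, the following is a minimal sketch of how the merit of a candidate subset could be computed. It uses the absolute Pearson correlation as a stand-in for the feature-class and feature-feature correlations (the original CFS uses symmetrical uncertainty for discrete attributes); the function name and interface are illustrative rather than taken from the paper.

```python
import numpy as np

def cfs_merit(X, y, subset):
    """Heuristic merit M_S = k*mean(r_cf) / sqrt(k + k(k-1)*mean(r_ff)) (Hall, 1999).

    X: (n_samples, n_features) array, y: target vector, subset: list of feature indices.
    Absolute Pearson correlation is used here as a simple surrogate correlation measure.
    """
    k = len(subset)
    # mean feature-class correlation over the features in the subset
    r_cf = np.mean([abs(np.corrcoef(X[:, f], y)[0, 1]) for f in subset])
    if k == 1:
        return r_cf
    # average pairwise feature-feature intercorrelation within the subset
    r_ff = np.mean([abs(np.corrcoef(X[:, i], X[:, j])[0, 1])
                    for a, i in enumerate(subset) for j in subset[a + 1:]])
    return (k * r_cf) / np.sqrt(k + k * (k - 1) * r_ff)
```

A forward search that repeatedly adds the feature giving the largest increase in this merit reproduces the usual CFS behavior of keeping features correlated with the output while discarding redundant ones.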
3.2. Consistency-based filter

The consistency-based filter (Dash & Liu, 2003) only supports classification and evaluates the worth of a subset of features by the level of consistency in the values of the class to be predicted when the training instances are projected onto the subset of attributes. The algorithm generates a random subset S from the set of features in every round. If the number of features of S is less than the current best, the data with the features prescribed in S are checked against the inconsistency criterion. If its inconsistency rate is below a pre-specified one, S becomes the new current best.

3.3. INTERACT

The INTERACT algorithm (Zhao & Liu, 2007) is a subset filter, supported for classification tasks, based on symmetrical uncertainty (SU) and the consistency contribution (c-contribution). The c-contribution of a feature is an indicator of how significantly the elimination of that feature will affect consistency. The algorithm consists of two major parts. In the first part, the features are ranked in descending order based on their SU values. In the second part, features are evaluated one by one starting from the end of the ranked feature list. If the c-contribution of a feature is less than an established threshold, the feature is removed; otherwise it is selected.
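Both the consistency-based filter and INTERACT's c-contribution rest on the inconsistency criterion. The sketch below, a hypothetical helper not taken from the paper, shows one common way of computing the inconsistency rate of a candidate subset on discrete data.

```python
from collections import Counter, defaultdict

def inconsistency_rate(X, y, subset):
    """Inconsistency rate of the class labels y when the instances X are
    projected onto the given feature subset (cf. Dash & Liu, 2003).

    Instances sharing the same projected feature pattern but holding different
    class labels are inconsistent; the rate is the number of such instances
    (beyond the majority class of each pattern) divided by the data size."""
    patterns = defaultdict(list)
    for row, label in zip(X, y):
        patterns[tuple(row[j] for j in subset)].append(label)
    inconsistent = sum(len(labels) - max(Counter(labels).values())
                       for labels in patterns.values())
    return inconsistent / len(y)
```

The consistency-based filter keeps drawing random subsets and accepts a smaller one whenever its inconsistency rate stays below the pre-specified threshold.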
The authors stated in Zhao and Liu (2007) that INTERACT can thus handle feature interaction and efficiently select relevant features.

3.4. ReliefF

ReliefF (Kononenko, 1994) is an extension of the original Relief algorithm (Kira & Rendell, 1992) which supports both regression and classification tasks. The original Relief works by randomly sampling an instance from the data and then locating its nearest neighbors from the same and the opposite output to be predicted. The values of the attributes of the nearest neighbors are compared to the sampled instance and used to update relevance scores for each attribute. The rationale is that a useful attribute should differentiate between instances from different classes and have the same value for instances with the same output. ReliefF adds the ability to deal with multiclass problems and is also more robust and capable of dealing with incomplete and noisy data. Since this filter follows the individual evaluation approach, a threshold is required in order to select the attributes whose relevance score is higher than it.

4. Experimental study

In order to check the effect of different feature selection methods on the scalability of machine learning algorithms, four of the most popular training algorithms for ANNs were selected. Two of these algorithms are gradient descent (GD) (Bishop, 2006) and gradient descent with momentum and adaptive learning rate (GDX) (Bishop, 2006), whose complexity is O(n). The other algorithms are scaled conjugate gradient (SCG) (Moller, 1993) and Levenberg-Marquardt (LM) (More, 1978), whose complexities are O(n^2) and O(n^3), respectively.

4.1. Data sets

Classification and regression are two of the most common tasks in machine learning. Table 1 depicts the datasets¹ used in this paper along with a brief description of them (number of features, classes, training examples and test examples).

Table 1. Characteristics of each dataset.

Dataset      Features  Classes  Training   Test     Task
Connect-4    42        3        60,000     7557     Classification
KDD Cup 99   42        2        494,021    311,029  Classification
Covertype    54        2/-      100,000    50,620   Classification/regression
MNIST        748       2/-      60,000     10,000   Classification/regression
Friedman     10        -        1,000,000  100,000  Regression
Lorenz       8         -        1,000,000  100,000  Regression

¹ Connect-4 and Covertype datasets are available on ; KDD Cup 99 on ; and MNIST on .

The Covertype and MNIST datasets, which are originally classification tasks, were also transformed into a regression task by predicting -1 for samples of class 1 and +1 for samples of class 2 (Collobert & Bengio, 2008). Friedman and Lorenz are artificial datasets. Friedman is defined by the equation

$$y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + \sigma(0,1)$$

where the input attributes $x_1, \ldots, x_{10}$ are generated independently, each uniformly distributed over the interval [0, 1], and $\sigma(0,1)$ is a noise term; the variables $x_6, \ldots, x_{10}$ are randomly generated and do not enter the target. On the other hand, Lorenz is defined by the simultaneous solution of the three equations

$$\frac{dX}{dt} = \delta Y - \delta X, \qquad \frac{dY}{dt} = -XZ + rX - Y, \qquad \frac{dZ}{dt} = XY - bZ,$$

where the system exhibits chaotic behavior for $\delta = 10$, $r = 28$ and $b = 8/3$. The goal of the network is to predict the current sample based on the four previous samples.
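For reference, the two artificial datasets can be generated along the following lines. This is a hedged sketch only: the random seed, the Euler integration step and the initial conditions of the Lorenz system are not given in the paper and are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed, not specified in the paper

def friedman(n_samples):
    """Friedman data: only x1..x5 enter the target, x6..x10 are irrelevant inputs."""
    X = rng.uniform(0.0, 1.0, size=(n_samples, 10))
    y = (10 * np.sin(np.pi * X[:, 0] * X[:, 1])
         + 20 * (X[:, 2] - 0.5) ** 2
         + 10 * X[:, 3] + 5 * X[:, 4]
         + rng.normal(0.0, 1.0, n_samples))          # sigma(0, 1) noise term
    return X, y

def lorenz(n_samples, dt=0.01, delta=10.0, r=28.0, b=8.0 / 3.0):
    """Lorenz series integrated with a simple Euler step (dt assumed);
    each sample of X is predicted from its four previous values."""
    x, y, z = 1.0, 1.0, 1.0                          # assumed initial conditions
    series = []
    for _ in range(n_samples + 4):
        x, y, z = (x + dt * (delta * y - delta * x),
                   y + dt * (-x * z + r * x - y),
                   z + dt * (x * y - b * z))
        series.append(x)
    series = np.asarray(series)
    inputs = np.stack([series[i:i + 4] for i in range(n_samples)])  # four previous samples
    targets = series[4:]
    return inputs, targets
```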
4.2. Performance measures

In order to assess the performance of learning algorithms, common measures such as accuracy are insufficient, since they do not take into account all the aspects involved when dealing with large datasets. Accordingly, the goal for machine learning developers is to find a learning algorithm that achieves a low error in the shortest possible time using as few samples as possible. Since there are no standard measures of scalability, those defined in the PASCAL Large Scale Learning Challenge (Sonnenburg et al., 2008) are used:

- Fig. 1(a) shows the relationship between training time and test error, computed on the largest dataset size the algorithm is able to deal with, aimed at answering the question "which test error can we expect given limited training time resources?". Following the PASCAL Challenge, the different training time budgets are set to 10^[..., -1, 0, 1, 2, ...] seconds. We compute the following scalar measures based on this figure:
  - Err: minimum test error (standard class error for classification and mean squared error for regression (Weiss & Kulikowski, 1991)).
  - AuTE: area under the training time vs. test error curve.
  - Te5%: the time t for which the test error e falls within the threshold (e - Err)/Err < 0.05.
- Fig. 1(b) shows the relationship between different training set sizes and the test error of each one, aimed at answering the question "which test error can be expected given limited training data resources?". Following the PASCAL Challenge, the different training set sizes (training samples) are set to 10^[2, 3, 4, ...] up to the maximum size of the dataset. We compute the following scalar measures based on this figure:
  - AuSE: area under the training set size vs. test error curve.
  - Se5%: the size s for which the test error e falls within the threshold (e - Err)/Err < 0.05.
- Fig. 1(c) shows the relationship between different training set sizes and the training time for each one, aimed at answering the question "which training time can be expected given limited training data resources?". Again, the different training set sizes are set to 10^[2, 3, 4, ...] up to the maximum size of the dataset. We compute the following scalar measure based on this figure:
  - Eff: slope b of the curve using a least squares fit to a·x^b.

In order to establish a general measure of scalability, the final Score of an algorithm is calculated as the average rank of its contribution with regard to the six scalar measures defined above.

4.3. Experimental procedure

As a preprocessing step, several feature selection methods were applied over the training set to obtain a subset of features. CFS, INTERACT and Consistency-based were used for classification whilst CFS and ReliefF were employed for regression (see Section 3). The latter follows the individual evaluation framework, and therefore a threshold is required. In this work, we have opted for an aggressive reduction (lower number of features to retain) and a soft reduction (higher number of features to retain). After the preprocessing step, different simulations (N = 10) were carried out over the training set for accurately estimating the scalability of the algorithms on each dataset, as shown in the following procedure:
Fig. 1. Performance measures: (a) training time vs. test error; (b) training set size vs. test error; (c) training set size vs. training time.
1. Select features over the training set using the methods presented in Section 3.
2. Set the number of hidden units of the ANN to 2 × number_of_inputs + 1 (Hecht-Nielsen, 1990) and train the network. It is important to remark that the aim here is not to investigate the optimal topology of an ANN for a given dataset, but to check the scalability of learning algorithms on large networks.
3. Compute the score of the algorithms as the average rank of their contribution with regard to the six scalar measures defined in Section 4.2.
4. Apply a Kruskal-Wallis test to check if there are significant differences among the medians for each algorithm with and without feature selection, for a level of significance α = 0.05.
5. If there are differences among the medians, then apply a multiple comparison procedure (Tukey's) to find the simplest approach whose score is not significantly different from the approach with the best score.
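To make the six scalar measures and the final Score concrete, the following sketch shows one plausible way of computing them from the sampled curves of Fig. 1; the function names and the data layout are illustrative assumptions, not code from the authors.

```python
import numpy as np

def scalar_measures(budget_times, budget_errors, sizes, size_times, size_errors):
    """PASCAL-style measures of Section 4.2 computed from sampled curves.

    budget_times/budget_errors: test error at increasing training-time budgets (Fig. 1(a)).
    sizes/size_errors:          test error at increasing training-set sizes (Fig. 1(b)).
    sizes/size_times:           training time at increasing training-set sizes (Fig. 1(c))."""
    err = min(budget_errors)                            # Err: minimum test error
    aute = np.trapz(budget_errors, budget_times)        # AuTE: area under time vs. error curve
    ause = np.trapz(size_errors, sizes)                 # AuSE: area under size vs. error curve
    # Te5% / Se5%: first budget/size whose error is within 5% of the minimum error
    te5 = next((t for t, e in zip(budget_times, budget_errors) if (e - err) < 0.05 * err), None)
    se5 = next((s for s, e in zip(sizes, size_errors) if (e - err) < 0.05 * err), None)
    # Eff: slope b of a least-squares fit of training time to a * size**b (log-log fit)
    eff, _ = np.polyfit(np.log10(sizes), np.log10(size_times), 1)
    return {"Err": err, "AuTE": aute, "AuSE": ause, "Te5%": te5, "Se5%": se5, "Eff": eff}

def average_rank_score(results):
    """Final Score: average rank of each approach over the measures (lower is better).

    results: dict mapping approach name -> list of its scalar measure values."""
    names = list(results)
    values = np.array([results[n] for n in names], dtype=float)
    ranks = np.zeros_like(values)
    for j in range(values.shape[1]):                    # rank each measure separately
        order = np.argsort(values[:, j])
        ranks[order, j] = np.arange(1, len(names) + 1)
    return {n: float(ranks[i].mean()) for i, n in enumerate(names)}
```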
5. Results

5.1. Classification

During the preprocessing step, the filters CFS, Consistency-based and INTERACT were applied over the training set in order to obtain a subset of features to be employed in the classification stage. The number of features selected by each method, along with the time required for this task, is depicted in Table 2. It has to be noted that CFS is the filter which achieves the greatest reduction in the number of features in the minimum time in 3 out of the 4 datasets tested. On the other hand, Consistency-based is the filter which requires the most time to perform the selection (note that for the Forest dataset, Consistency-based takes over 2 h to perform this task while CFS only needs 17 s). Table 3 presents the average test results obtained by the machine learning algorithms after applying the three different filters, compared with those where no feature selection was performed.

Table 2. Features selected by each feature selection method along with the time required for this task.

Data         Filter       Features  Time (s)   hh:mm:ss
Connect-4    None         42        0.00       00:00:00.00
             CFS          6         8.00       00:00:08.00
             Consistency  40        2760.73    00:46:00.73
             INTERACT     38        173.32     00:02:53.33
Forest       None         54        0.00       00:00:00.00
             CFS          13        17.28      00:00:17.28
             Consistency  31        7702.12    02:08:22.12
             INTERACT     30        203.57     00:03:23.58
KDD Cup 99   None         42        0.00       00:00:00.00
             CFS          5         61.94      00:01:01.95
             Consistency  7         382.01     00:06:22.01
             INTERACT     7         106.71     00:01:46.72
MNIST        None         748       0.00       00:00:00.00
             CFS          55        805.27     00:13:25.27
             Consistency  18        13775.09   03:49:35.10
             INTERACT     36        652.96     00:10:52.96
Regarding the performance measures defined in Section 4.2, notice that the lower the result, the higher the scalability. It has to be noted that not all the learning algorithms were able to deal with all available samples for every dataset, mostly due to the spatial complexity of the algorithms. In particular, on the MNIST dataset the LM algorithm is not able to train even on the smallest subset when no feature selection is applied. When this occurs, the measures explained in Section 4.2 were computed on the largest dataset that the learning algorithms were able to process, and this fact is specified along with the results.

5.2. Regression

The filters CFS and ReliefF were selected for the regression task. It should be recalled that ReliefF provides a ranking of features
Table 3. Performance measures for classification tasks. N/A stands for Not Applicable. Those algorithms whose Score average test results are not significantly worse than the best are labeled with a cross in the original table.

(a) Connect-4

Method  Filter       Score  Err   AuTE    AuSE  Te5%    Se5%    Eff
GD      None         8.67   0.38  5.16e1  0.97  1.08e2  1.00e2  0.43
GD      CFS          5.50   0.51  1.01e1  1.24  1.39e1  1.00e2  0.26
GD      Consistency  7.83   0.40  4.32e1  0.94  8.44e1  1.00e2  0.40
GD      INTERACT     6.67   0.35  3.37e1  0.90  8.14e1  1.00e2  0.40
GDX     None         7.00   0.31  3.71e1  0.92  7.98e1  6.00e4  0.40
GDX     CFS          4.00   0.32  9.44e0  0.89  1.70e1  1.00e4  0.25
GDX     Consistency  4.83   0.31  2.61e1  0.87  5.71e1  1.00e3  0.37
GDX     INTERACT     4.83   0.28  2.71e1  0.80  6.96e1  1.00e4  0.38
LM      None(a)      8.83   0.23  3.79e2  0.77  7.80e2  1.00e4  0.77
LM      CFS          6.67   0.31  4.79e1  0.87  6.68e1  1.00e4  0.44
LM      Consistency  9.33   0.27  2.31e2  0.85  5.01e2  1.00e4  0.71
LM      INTERACT     8.17   0.24  1.60e2  0.79  3.50e2  1.00e4  0.68
SCG     None         7.17   0.21  7.01e1  0.77  2.62e2  1.00e4  0.50
SCG     CFS          3.83   0.29  9.97e0  0.82  2.28e1  1.00e4  0.31
SCG     Consistency  6.83   0.23  5.34e1  0.72  1.44e2  6.00e4  0.47
SCG     INTERACT     6.17   0.23  4.95e1  0.70  1.41e2  6.00e4  0.47

(b) Forest

Method  Filter       Score  Err   AuTE    AuSE  Te5%    Se5%    Eff
GD      None         9.00   0.38  1.24e2  1.20  2.78e2  1.00e3  0.49
GD      CFS          5.67   0.45  2.74e1  1.36  3.34e1  1.00e2  0.35
GD      Consistency  7.67   0.41  5.67e1  1.30  1.12e2  1.00e3  0.42
GD      INTERACT     7.00   0.38  4.97e1  1.28  1.08e2  1.00e5  0.41
GDX     None         7.33   0.42  4.74e1  1.32  1.01e2  1.00e4  0.41
GDX     CFS          5.33   0.51  6.81e0  1.41  0.43e0  1.00e2  0.24
GDX     Consistency  5.67   0.38  3.21e1  1.23  7.20e1  1.00e5  0.37
GDX     INTERACT     4.33   0.40  2.26e1  1.11  4.93e1  1.00e3  0.35
LM      None(a)      9.33   0.24  6.41e2  0.94  1.74e3  1.00e4  0.84
LM      CFS          9.17   0.32  2.99e2  0.95  5.15e2  1.00e3  0.58
LM      Consistency  8.17   0.26  6.71e1  0.96  1.72e2  1.00e4  0.59
LM      INTERACT     6.67   0.25  5.72e1  0.93  1.55e2  1.00e4  0.58
SCG     None         7.50   0.20  1.64e2  0.81  5.80e2  1.00e5  0.55
SCG     CFS          4.00   0.29  3.51e1  0.86  5.97e1  1.00e3  0.40
SCG     Consistency  6.50   0.23  7.21e1  0.84  2.48e2  1.00e4  0.48
SCG     INTERACT     5.67   0.20  6.21e1  0.84  1.70e2  1.00e5  0.47

(c) KDD Cup 99

Method  Filter       Score  Err   AuTE    AuSE  Te5%    Se5%    Eff
GD      None(b)      7.00   0.13  4.29e1  0.43  5.53e1  1.00e2  0.50
GD      CFS          4.67   0.16  8.67e0  0.54  5.41e0  1.00e2  0.34
GD      Consistency  6.50   0.20  8.80e0  0.70  2.49e1  1.00e3  0.32
GD      INTERACT     3.33   0.12  6.55e0  0.45  2.83e1  1.00e2  0.33
GDX     None(b)      7.00   0.15  2.55e1  0.46  5.93e1  1.00e3  0.44
GDX     CFS          1.83   0.11  4.61e0  0.37  2.15e1  1.00e3  0.30
GDX     Consistency  5.33   0.19  7.06e0  0.70  5.65e0  1.00e2  0.31
GDX     INTERACT     3.50   0.11  5.68e0  0.48  3.85e1  1.00e3  0.32
LM      None(a)      9.17   0.11  2.21e2  0.46  1.24e3  1.00e4  0.80
LM      CFS          8.00   0.12  3.38e1  0.49  1.47e2  1.00e2  0.46
LM      Consistency  9.33   0.17  3.11e1  0.63  1.10e2  1.00e4  0.42
LM      INTERACT     8.00   0.12  2.89e1  0.55  6.32e1  1.00e5  0.44
SCG     None(b)      9.67   0.14  1.10e2  0.51  3.54e2  1.00e4  0.55
SCG     CFS          4.17   0.08  1.13e1  0.31  4.40e1  1.00e4  0.38
SCG     Consistency  7.83   0.18  2.06e1  0.70  4.88e1  1.00e3  0.39
SCG     INTERACT     8.00   0.17  2.15e1  0.56  8.41e1  1.00e2  0.39

(d) MNIST

Method  Filter       Score  Err   AuTE    AuSE  Te5%    Se5%    Eff
GD      None(a)      9.17   0.36  1.41e2  0.85  2.26e2  1.00e2  0.65
GD      CFS          6.67   0.26  4.04e1  0.69  1.07e2  1.00e3  0.43
GD      Consistency  5.00   0.37  1.72e1  1.07  3.99e1  1.00e3  0.33
GD      INTERACT     5.50   0.31  2.92e1  0.79  7.46e1  1.00e3  0.39
GDX     None(a)      9.00   0.22  2.30e2  0.66  6.91e2  1.00e3  0.72
GDX     CFS          5.50   0.21  3.32e1  0.66  9.98e1  1.00e3  0.42
GDX     Consistency  3.50   0.22  1.50e1  0.68  3.98e1  1.00e4  0.33
GDX     INTERACT     3.83   0.21  2.33e1  0.64  6.80e1  1.00e3  0.38
LM      None         N/A    N/A   N/A     N/A   N/A     N/A     N/A
LM      CFS          9.17   0.13  3.22e2  0.56  1.10e3  1.00e4  0.80
LM      Consistency  6.83   0.10  8.52e1  0.49  3.38e2  6.00e4  0.55
LM      INTERACT     6.33   0.13  6.45e1  0.48  2.05e2  1.00e4  0.62
SCG     None(a)      8.00   0.05  2.85e2  0.40  1.62e3  1.00e4  0.81
SCG     CFS          6.33   0.11  4.77e1  0.49  2.42e2  6.00e4  0.50
SCG     Consistency  3.83   0.11  1.73e1  0.51  8.62e1  6.00e4  0.40
SCG     INTERACT     5.50   0.11  3.14e1  0.54  1.61e2  6.00e4  0.46

(a) Largest training set it can deal with: 1e4 samples.
(b) Largest training set it can deal with: 1e5 samples.
Table 4. Features selected by each feature selection method along with the time required for this task. ReliefF_ and ReliefF^ stand for aggressive and soft reduction, respectively.

Data      Filter    Features  Time (s)    hh:mm:ss
Forest    None      54        0.00        00:00:00.00
          CFS       23        21.76       00:00:21.76
          ReliefF_  5         11631.56    03:13:51.57
          ReliefF^  48        11631.56    03:13:51.57
Friedman  None      10        0.00        00:00:00.00
          CFS       5         17.39       00:00:17.39
          ReliefF_  4         289191.43   80:19:51.44
          ReliefF^  5         289191.43   80:19:51.44
Lorenz    None      8         0.00        00:00:00.00
          CFS       1         11.83       00:00:11.83
          ReliefF_  4         262575.40   72:56:15.40
          ReliefF^  6         262575.40   72:56:15.40
MNIST     None      748       0.00        00:00:00.00
          CFS       103       938.73      00:15:38.73
          ReliefF_  31        57048.71    15:50:48.72
          ReliefF^  418       57048.71    15:50:48.72
and a threshold is required. In this work, we have opted for two different arrangements: an aggressive and a soft reduction, represented in the tables as ReliefF_ and ReliefF^, respectively. As for classification, Table 4 shows the number of features selected by each method along with the time required. Note that the time required by ReliefF is the same for the two arrangements, since the construction of the ranking is a shared step. Again, CFS is able to perform the selection mostly in the order of seconds whilst ReliefF needs in the order of hours. Table 5 depicts the average test results achieved by the machine learning algorithms after applying the CFS and ReliefF filters, compared with those where no feature selection was performed. As in the classification case, when the learning algorithm is not able to train on all available samples, this fact is specified along with the results. LM was again not able to train on the MNIST dataset (see Table 5(d)). Even though the number of weights of an ANN is lower in a regression task than in a classification task (as the number of outputs is also lower), the spatial complexity of the LM algorithm is still very high.

6. Discussion

The aim of the experiments carried out in this work is to assess the performance of ANN algorithms in terms of scalability and not simply in terms of error, as the great majority of papers in the literature do. All six scalar measures defined in Section 4.2 are considered to evaluate the scalability of learning algorithms, trying to achieve a balance among them. When applying feature selection, it is expected that some measures are positively affected by the dimensionality reduction (such as AuTE, Te5% or Eff) because the algorithms can deal with a larger number of samples employing the same execution time. Nevertheless, the goal of this research is to demonstrate that this reduction does not always negatively affect the remaining measures, and also to find a trade-off among all the measures, which will be reflected in the final average Score.

6.1. Classification

In general lines, the results without feature selection show a scarcely lower error (although in some cases it is maintained or even improved, depending on the filter) at the expense of a longer training time. On the other hand, the results after applying feature selection present a shorter training time. As expected, the measures related to the training time (AuTE, Te5% and Eff) improve, since a shorter time is needed to train the same number of data. On the other hand, AuSE and Se5% deteriorate after applying feature selection. Although the error was expected to be higher after applying feature selection, INTERACT maintains or improves the classification error in most of the cases, obtaining also a good performance on the other measures. Since the assessment of the scalability of learning algorithms is a multi-objective problem and there is no chance of defining a single optimal order of importance among measures, we have opted to focus on the general measure of scalability (Score). Table 3 shows that in most cases applying feature selection was significantly better than not applying it (13 out of 16 cases). In order to decide which filter is the best option, Table 6 depicts the average score for each filter on each dataset, as well as the average filtering time. Since the LM algorithm is not able to train over MNIST when no feature selection is applied, the results over this dataset are averaged over three learning algorithms, instead of four. In light of the results shown in Table 6, it remains clear that applying feature selection is better than not doing it. Among the three filters tested, CFS exhibits the best Score on average, closely followed by INTERACT. Bearing in mind the average time required by each filter (see the last column in Table 6), the consistency-based filter does not seem to be a good option, due to the fact that it obtains the worst score along with the highest processing time. CFS tends to select the smallest number of features at the expense of a slightly higher error than INTERACT; therefore, the decision of which one is more adequate for the scalability of ANNs depends on whether the user is more interested in minimizing the error or the other measures.

6.2. Regression

Over the Forest and MNIST datasets, the performance measures follow the same trends as with classification tasks: those related to the training time (AuTE, Te5% and Eff) improve while AuSE and Se5% slightly worsen. However, this is not the case with the Friedman and Lorenz datasets. This fact is explained because these datasets have only 10 and 8 features, respectively, and reducing the input dimensionality does not lead to a significant reduction in the training time, whilst it may remove some important information which affects the accuracy. Although in general it seems that feature selection methods do not achieve results as good as over classification tasks, an in-depth
Table 5. Performance measures for regression tasks. N/A stands for Not Applicable. Those algorithms whose Score average test results are not significantly worse than the best are labeled with a cross in the original table. ReliefF_ and ReliefF^ stand for aggressive and soft reduction, respectively.

(a) Forest

Method  Filter    Score  Err   AuTE    AuSE  Te5%    Se5%    Eff
GD      None      9.17   0.90  1.26e3  3.62  5.38e2  1.00e4  0.55
GD      CFS       7.67   0.99  1.29e2  3.15  8.27e1  1.00e3  0.38
GD      ReliefF_  4.67   0.95  2.37e1  2.96  1.54e1  1.00e3  0.26
GD      ReliefF^  8.67   0.90  3.90e2  2.98  1.59e2  1.00e4  0.44
GDX     None      8.50   0.68  1.01e3  4.17  4.54e2  1.00e5  0.53
GDX     CFS       7.00   1.13  6.91e1  3.32  3.59e1  1.00e3  0.32
GDX     ReliefF_  4.67   0.99  1.61e1  3.00  1.07e1  1.00e3  0.21
GDX     ReliefF^  9.33   1.16  3.60e2  4.63  1.03e2  1.00e3  0.41
LM      None(a)   9.17   0.60  1.02e4  3.42  1.35e3  1.00e4  0.82
LM      CFS       6.83   0.80  4.20e2  2.48  8.83e1  1.00e3  0.40
LM      ReliefF_  4.50   0.64  4.71e1  2.39  4.07e1  1.00e4  0.35
LM      ReliefF^  9.67   0.54  2.36e3  3.04  2.83e2  1.00e4  0.65
SCG     None      8.17   0.57  1.64e3  2.72  9.86e2  1.00e5  0.60
SCG     CFS       7.00   0.81  1.41e2  2.66  5.89e1  1.00e4  0.42
SCG     ReliefF_  3.67   0.67  2.83e1  2.29  2.48e1  1.00e4  0.31
SCG     ReliefF^  8.00   0.61  5.33e2  2.60  2.16e2  1.00e5  0.50

(b) Friedman

Method  Filter    Score  Err   AuTE    AuSE   Te5%    Se5%    Eff
GD      None(b)   6.50   8.33  2.19e3  36.77  7.51e1  1.00e3  0.37
GD      CFS       7.83   8.16  6.72e3  35.80  1.71e2  1.00e3  0.37
GD      ReliefF_  7.67   9.90  6.30e3  43.70  1.36e2  1.00e4  0.35
GD      ReliefF^  8.67   9.71  7.09e3  36.80  1.71e2  1.00e3  0.37
GDX     None(b)   5.84   4.41  1.83e3  24.57  7.20e1  1.00e5  0.37
GDX     CFS       6.50   3.86  6.36e3  23.00  1.69e2  1.00e5  0.37
GDX     ReliefF_  6.33   5.08  5.49e3  31.10  1.36e2  1.00e6  0.35
GDX     ReliefF^  6.67   4.00  6.44e3  23.71  1.69e2  1.00e4  0.37
LM      None(b)   5.00   0.11  1.11e3  8.57   8.74e2  1.00e5  0.59
LM      CFS       8.33   2.34  1.87e4  14.82  1.03e3  1.00e4  0.50
LM      ReliefF_  7.50   2.35  1.38e4  14.24  7.76e2  1.00e4  0.48
LM      ReliefF^  7.00   0.27  1.79e4  6.98   1.03e3  1.00e4  0.50
SCG     None(b)   5.00   0.79  1.67e3  10.33  1.71e2  1.00e5  0.44
SCG     CFS       6.33   2.66  4.77e3  16.10  3.88e2  1.00e5  0.43
SCG     ReliefF_  6.17   2.85  4.49e3  17.81  3.10e2  1.00e6  0.41
SCG     ReliefF^  5.33   0.93  5.58e3  9.34   3.87e2  1.00e4  0.43

(c) Lorenz

Method  Filter    Score  Err   AuTE    AuSE   Te5%    Se5%    Eff
GD      None(b)   5.17   0.74  4.82e2  2.98   6.17e1  1.00e2  0.36
GD      CFS       7.00   6.57  1.57e3  26.01  5.60e1  1.00e2  0.29
GD      ReliefF_  7.67   1.44  2.16e3  6.37   1.37e2  1.00e3  0.35
GD      ReliefF^  8.17   0.94  2.26e3  4.30   2.02e2  1.00e6  0.38
GDX     None(b)   4.83   2.66  2.45e2  13.63  2.04e1  1.00e4  0.26
GDX     CFS       4.83   1.29  1.19e3  5.37   4.26e1  1.00e3  0.27
GDX     ReliefF_  6.83   4.02  1.76e3  13.23  7.53e1  1.00e3  0.31
GDX     ReliefF^  7.17   5.45  1.57e3  18.71  7.73e1  1.00e2  0.32
LM      None(b)   8.17   0.00  3.26e3  0.00   5.19e2  1.00e5  0.54
LM      CFS       5.17   0.18  8.40e2  0.80   1.24e2  1.00e3  0.36
LM      ReliefF_  7.83   0.00  3.12e3  0.00   7.82e2  1.00e6  0.48
LM      ReliefF^  8.50   0.00  8.97e3  0.00   1.44e3  1.00e6  0.52
SCG     None(b)   5.83   0.01  5.61e2  0.05   1.38e2  1.00e4  0.43
SCG     CFS       5.00   0.21  4.86e2  1.59   1.15e2  1.00e4  0.35
SCG     ReliefF_  6.33   0.01  1.21e3  0.10   3.11e2  1.00e6  0.41
SCG     ReliefF^  7.17   0.01  2.00e3  0.12   4.53e2  1.00e4  0.44

(d) MNIST

Method  Filter    Score  Err     AuTE    AuSE    Te5%    Se5%    Eff
GD      None(a)   8.17   303.12  1.66e3  903.14  6.60e0  1.00e2  0.44
GD      CFS       5.17   37.14   1.44e2  143.11  3.23e0  6.00e4  0.23
GD      ReliefF_  3.50   1.33    3.86e1  3.95    1.40e1  1.00e3  0.28
GD      ReliefF^  6.17   210.98  9.11e2  528.32  4.55e0  1.00e2  0.35
GDX     None(a)   9.33   9.25    7.49e4  66.06   9.71e2  1.00e4  0.75
GDX     CFS       6.33   1.52    1.34e3  6.02    1.62e2  1.00e3  0.46
GDX     ReliefF_  2.83   0.85    3.73e1  3.86    1.26e1  1.00e4  0.27
GDX     ReliefF^  8.83   12.92   1.51e4  76.51   3.06e2  1.00e4  0.62
LM      None(a)   N/A    N/A     N/A     N/A     N/A     N/A     N/A
LM      CFS       N/A    N/A     N/A     N/A     N/A     N/A     N/A
LM      ReliefF_  4.83   0.59    3.80e2  3.08    8.68e1  1.00e4  0.51
LM      ReliefF^  N/A    N/A     N/A     N/A     N/A     N/A     N/A
SCG     None      9.17   3.06    3.10e4  41.52   1.82e3  1.00e4  0.82
SCG     CFS       5.83   0.44    1.09e3  3.62    4.20e2  6.00e4  0.55
SCG     ReliefF_  3.00   0.55    3.87e1  2.12    1.56e1  1.00e4  0.32
SCG     ReliefF^  8.50   2.51    1.12e4  36.71   6.68e2  1.00e4  0.71

(a) Largest training set it can deal with: 1e4 samples.
(b) Largest training set it can deal with: 1e5 samples.
Table 6. Average of Score for each filter on each dataset for classification tasks, along with the average time required by the filters.

Filter       Connect-4  Forest  KDD Cup 99  MNIST  Average  Time (s)
None         7.92       8.29    8.21        8.72   8.29     -
CFS          5.00       6.04    4.67        6.17   5.47     223.12
Consistency  7.21       7.00    7.25        4.11   6.39     6154.99
INTERACT     6.46       5.92    5.71        4.94   5.76     284.14
Table 7. Average of Score for each filter on each dataset for regression tasks, along with the average time required by the filters.

Filter    Forest  Friedman  Lorenz  MNIST  Average  Time (s)
None      7.88    5.59      6.00    8.89   7.09     -
CFS       7.13    7.25      5.50    5.78   6.42     247.43
ReliefF_  4.38    6.92      7.17    3.11   5.39     155111.60
ReliefF^  8.92    6.92      7.75    7.83   7.86     155111.60
analysis reveals that for all the combinations of dataset and learning algorithm, using feature selection obtained a significantly better or equal Score (for all filters) than not using it. Studying in detail the behavior of the filters, one may find that ReliefF_ (aggressive reduction) is the best (or one of the best) method, with a significant difference in all cases but two. Table 7 reinforces this fact by showing that this method obtains the best Score on average over all datasets and learning algorithms. On the other hand, ReliefF^ (soft reduction) presents the worst Score on average. This fact can be explained because ReliefF^ selects a higher number of features, which leads to a slightly better error, but at the expense of requiring more training time and hence getting worse results on the other measures. Albeit ReliefF_ is the best method according to Score, it has to be remembered that ReliefF requires a much higher computational time than CFS (see the last column in Table 7). For this reason, in the opinion of the authors, CFS is the best option when looking for a feature selection method which could help with the scalability of ANNs on regression tasks. In fact, its Score shows that applying CFS is better than not doing it and the computational time required is not prohibitive, as happens with ReliefF. Regarding the results for the Friedman and Lorenz datasets, with 10 and 8 input features respectively, and bearing in mind the results depicted in Table 7, one may question the adequacy of feature selection on datasets with such a small number of features. However, there is no universal answer to this question, since it depends on the nature of the problem and on the presence of irrelevant features. In this work, no benefits were found after applying feature selection over the Friedman dataset, but it was worthwhile over Lorenz. In fact, the CFS filter over the latter dataset only needs a couple of seconds to perform the selection and it retains one single feature. Further experimentation showed that this feature is highly correlated with the output and obtained results as good as with the whole set of features. It is remarkable that, for the GDX algorithm, the lowest error was achieved by CFS using only that one feature (see Table 5(c)). Finally, notice that the current performance measures and the aggregation of the ranks are detrimental to learning algorithms which are accurate but slow. A case in point of the change in approach for assessing training algorithms for ANNs followed in this paper concerns LM (see Tables 3 and 5), as it usually reaches the lowest error by a large margin but never ranks among the best. In this manner, the ranking could be scrambled by using a naive but fast training algorithm. For example, in the regression task on the MNIST dataset (see Table 5(d)), the GD algorithm ranks better than GDX and SCG due to its short training time (early stopping), in spite of obtaining a huge error. Further experiments showed convergence problems with regard to the training process of the GD algorithm. This is an isolated case and the conclusions of this work are not disturbed by it. However, and this is the point, if a learning algorithm which shows convergence problems is able to beat other algorithms in terms of scalability, the performance measures must be revised.

7. Conclusions
When dealing with the performance of machine learning algorithms, most papers focus on the accuracy obtained by the algorithm. However, with the advent of high dimensionality problems, researchers must study not only accuracy but also scalability. Aiming at dealing with problems as large as possible, feature selection can be helpful as it reduces the input dimensionality and therefore the run-time required by an algorithm. In this work, the effectiveness of feature selection for the scalability of training algorithms for ANNs was evaluated, both for classification and regression tasks. Since there are no standard measures of scalability, those defined in the PASCAL Large Scale Learning Challenge were used to assess the scalability of the algorithms in terms of error, computational effort, allocated memory and training time. Results showed that feature selection as a preprocessing step is beneficial for the scalability of ANNs, even allowing certain algorithms to train on datasets on which it was otherwise impossible due to their spatial complexity. Moreover, some conclusions about the adequacy of the different feature selection methods for this problem were extracted. As future work, it is necessary to develop fairer measures of scalability. We are aware that evaluating scalability is a multi-objective problem and there is no chance of establishing a single fair absolute order. However, we believe that some shortcomings of the PASCAL measures can be overcome by putting some constraints on learning algorithms (e.g. avoiding early stopping) and by adding some new scalar measures (e.g. the largest dataset the algorithm is able to deal with).

Acknowledgments

This work was supported by the Secretaría de Estado de Investigación of the Spanish Government under projects TIN 2009-02402 and TIN2012-37954, and by the Xunta de Galicia through projects
CN2011/007 and CN2012/211, all partially supported by the European Union ERDF. D. Peteiro-Barral and V. Bolón-Canedo acknowledge the support of Xunta de Galicia under the Plan I2C Grant Program.

References

Ananthanarayana, V., Subramanian, D., & Murty, M. (2000). Scalable, distributed and dynamic mining of association rules. High Performance Computing HiPC 2000, 559-566.
Bishop, C. (2006). Pattern recognition and machine learning. New York: Springer.
Bolón-Canedo, V., Peteiro-Barral, D., Alonso-Betanzos, A., Guijarro-Berdinas, B., & Sánchez-Marono, N. (2011). Scalability analysis of ANN training algorithms with feature selection. Advances in Artificial Intelligence, 84-93.
Bolón-Canedo, V., Sánchez-Maroño, N., & Alonso-Betanzos, A. (2010). Feature selection and classification in multiple class datasets. An application to KDD Cup 99 dataset. Expert Systems with Applications, 38(5), 5947-5957.
Catlett, J. (1991). Megainduction: Machine learning on very large databases. Ph.D. thesis, Basser Department of Computer Science, University of Sydney, Australia.
Chan, P., & Stolfo, S. (1993). Toward parallel and distributed learning by meta-learning. In AAAI workshop in knowledge discovery in databases (pp. 227-240).
Collobert, R., & Bengio, S. (2008). Support vector machines for large-scale regression problems. The Journal of Machine Learning Research, 1.
Dash, M., & Liu, H. (2003). Consistency-based search in feature selection. Artificial Intelligence, 151(1-2), 155-176.
Domingo, C., Gavaldà, R., & Watanabe, O. (2002). Adaptive sampling methods for scaling up knowledge discovery algorithms. Data Mining and Knowledge Discovery, 6(2), 131-152.
Dy, J., Brodley, C., Kak, A., Broderick, L., & Aisen, A. (2003). Unsupervised feature selection applied to content-based retrieval of lung images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 25(3), 373-378.
Egozi, O., Gabrilovich, E., & Markovitch, S. (2008). Concept-based feature generation and selection for information retrieval. In AAAI'08 (pp. 1132-1137).
Forman, G. (2003). An extensive empirical study of feature selection metrics for text classification. The Journal of Machine Learning Research, 3, 1289-1305.
Gomez, J., Boiy, E., & Moens, M. (2011). Highly discriminative statistical features for email classification. Knowledge and Information Systems, 1-31.
Guyon, I. (2006). Feature extraction: Foundations and applications (Vol. 207). Springer-Verlag.
Hall, M. (1999). Correlation-based feature selection for machine learning. Ph.D. thesis, The University of Waikato.
Hecht-Nielsen, R. (1990). Neurocomputing. Menlo Park, California: Addison-Wesley.
Kargupta, H., Park, B., Johnson, E., Sanseverino, E., Silvestre, L., & Hershberger, D. (1998). Collective data mining from distributed vertically partitioned feature space. In Workshop on distributed data mining, international conference on knowledge discovery and data mining.
Kira, K., & Rendell, L. (1992). A practical approach to feature selection. In Proceedings of the ninth international workshop on machine learning (pp. 249-256). Morgan Kaufmann Publishers Inc.
Kononenko, I. (1994). Estimating attributes: Analysis and extensions of relief. In Machine learning: ECML-94 (pp. 171-182). Springer.
Lee, W., Stolfo, S., & Mok, K. (2000). Adaptive intrusion detection: A data mining approach. Artificial Intelligence Review, 14(6), 533-567.
McConnell, S., & Skillicorn, D. (2004). Building predictors from vertically distributed data. In Proceedings of the conference of the centre for advanced studies on collaborative research (pp. 150-162). IBM Press.
Moller, M. (1993). A scaled conjugate gradient algorithm for fast supervised learning. Neural Networks, 6(4), 525-533.
More, J. (1978). The Levenberg-Marquardt algorithm: Implementation and theory. Numerical Analysis, 105-116.
Parikh, D., & Polikar, R. (2007). An ensemble-based incremental learning approach to data fusion. IEEE Transactions on Systems, Man, and Cybernetics. Part B: Cybernetics, 37(2), 437-450.
Pérez-Sánchez, B., Fontenla-Romero, O., & Guijarro-Berdiñas, B. (2010). An incremental learning method for neural networks based on sensitivity analysis. Current Topics in Artificial Intelligence, 42-50.
Peteiro-Barral, D., Guijarro-Berdinas, B., Pérez-Sánchez, B., & Fontenla-Romero, O. (2011). A distributed learning algorithm based on two-layer artificial neural networks and genetic algorithms. In Proceedings of the ESANN'11 (pp. 471-476).
Peteiro-Barral, D., Guijarro-Berdinas, B., Pérez-Sánchez, B., & Fontenla-Romero, O. (2011). On the scalability of machine learning algorithms for artificial neural networks. Journal of Neural Networks (under review).
Polikar, R., Upda, L., Upda, S., & Honavar, V. (2001). Learn++: An incremental learning algorithm for supervised neural networks. IEEE Transactions on Systems, Man, and Cybernetics. Part C: Applications and Reviews, 31(4), 497-508.
Provost, F., & Kolluri, V. (1999). A survey of methods for scaling up inductive algorithms. Data Mining and Knowledge Discovery, 3(2), 131-169.
Ruping, S. (2001). Incremental learning with support vector machines. In Proceedings IEEE international conference on data mining, ICDM 2001 (pp. 641-642). IEEE.
Saari, P., Eerola, T., & Lartillot, O. (2011). Generalizability and simplicity as criteria in feature selection: Application to mood classification in music. IEEE Transactions on Audio, Speech, and Language Processing, 19(6).
Skillicorn, D., & McConnell, S. (2008). Distributed prediction from vertically partitioned data. Journal of Parallel and Distributed Computing, 68(1), 16-36.
Sonnenburg, S., Franc, V., Yom-Tov, E., & Sebag, M. (2008). PASCAL large scale learning challenge. Machine learning research.
Tsoumakas, G., & Vlahavas, I. (2002). Distributed data mining of large classifier ensembles. In Proceedings companion volume of the second hellenic conference on artificial intelligence (pp. 249-256). Citeseer.
Weiss, G., & Provost, F. (2001). The effect of class distribution on classifier learning: An empirical study. Rutgers University.
Weiss, S., & Kulikowski, C. (1991). Computer systems that learn: Classification and prediction methods from statistics, neural nets, machine learning, and expert systems. San Francisco: Morgan Kaufmann.
Yu, L., & Liu, H. (2004). Efficient feature selection via analysis of relevance and redundancy. The Journal of Machine Learning Research, 5, 1205-1224.
Yu, L., & Liu, H. (2004). Redundancy based feature selection for microarray data. In Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 737-742). Springer.
Zhao, Z., & Liu, H. (2007). Searching for interacting features. In Proceedings of the 20th international joint conference on artificial intelligence (pp. 1161-1516). Morgan Kaufmann Publishers Inc.