ParadisEO: a Framework for Parallel and Distributed Biologically Inspired Heuristics

Sébastien Cahon, El-Ghazali Talbi and Nordine Melab
Laboratoire d'Informatique Fondamentale de Lille
Université des Sciences et Technologies de Lille 1
59655 - Villeneuve d'Ascq cedex - France
E-mail: {cahon, talbi, [email protected]}

Abstract

In this paper we present ParadisEO (1), an open source framework for the flexible parallel and distributed design of hybrid metaheuristics. Flexibility means that parameters such as the data representation and the variation operators can be evolved. It is inherited from the EO object-oriented library for evolutionary computation. ParadisEO provides different parallel and/or distributed models and allows a transparent multi-threaded implementation. Moreover, it supplies different natural hybridization mechanisms, mainly for metaheuristics including evolutionary algorithms and local search methods. The framework is experimented here in the spectroscopic data mining field. The flexibility property allowed an easy and straightforward development of a Genetic-Algorithm-based attribute selection for model discovery in NIR spectroscopic data. Experiments on a cluster of SMPs (IBM SP3) show that a good speed-up is achieved by using the provided parallel distributed models and multi-threading. Furthermore, the hybridization of the GA with the efficient PLS method allows high-quality models to be discovered. Indeed, their accuracy and understandability are improved by 37% and 88%, respectively.

(1) This work is part of a current French joint grid computing project, ACI-GRID DOC-G (Challenges in Combinatorial Optimization on Grids).
Keywords: Metaheuristics library, Parallel and Distributed algorithms, NIR Spectroscopic Data Mining.
1 Introduction

A great deal of important optimization problems are NP-hard. Therefore, there is no efficient way to find optimal solutions in a polynomial execution time.
Over the last years, metaheuristics have proven to be powerful general methods to compute near-optimal solutions efficiently. Nowadays, they are becoming more and more popular across many research domains including logistics, genomics, electrical engineering, telecommunications, etc. People from these domains particularly need metaheuristics libraries, because coding these algorithms by themselves is a huge and difficult task. However, most existing libraries [17] suffer from a lack of flexibility in terms of data representation and the associated variation operators. Indeed, they allow only a few predefined representations and operators, and they generally do not consider parallelism or hybridization. Recently, hybrid metaheuristics have gained considerable interest [14]. For many practical or academic optimization problems, the best found solutions are obtained by hybrid algorithms. Combinations of metaheuristics such as descent local search, simulated annealing, tabu search and evolutionary algorithms have provided very powerful search methods. Although hybrid metaheuristics are efficient, they remain time-consuming on large-size real-life problems [9]. Parallelism and multi-threading have been proven to be two powerful ways to achieve high-performance execution. However, these tools are often not easily accessible to the communities of the research domains quoted above. Therefore, parallelism and multi-threading are two crucial issues that have to be taken into account in a way that is transparent for the programmer. To summarize, a modern library allowing an easy design of high-performance optimization algorithms requires three major characteristics: flexibility, parallelism and multi-threading transparency, and hybridization. Unfortunately, to the best of our knowledge such a library does not exist, because building it requires skills in object-oriented technologies, in parallel and distributed systems,
and in combinatorial optimization. Regarding the flexibility criterion, the EOlib library [4] is particularly outstanding. It is an open source, paradigm-free evolutionary computation library that allows any data structure (object) to be easily evolved. It is component-based with regard to the data representation, the transformation operators, the stopping criteria, etc. The library includes evolutionary computation algorithms such as genetic algorithms, evolution strategies, and evolutionary and genetic programming methods. However, it does not include single solution-based metaheuristics such as tabu search. Therefore, it is limited regarding the hybridization criterion. Furthermore, it does not allow a transparent parallel and multi-threaded design of metaheuristics. In this paper, we aim at extending EOlib in order to achieve the following objectives:
- The flexible design of single solution-based metaheuristics (in addition to the evolutionary algorithms). The descent local search, simulated annealing and tabu search algorithms are included into the library.

- The parallel and multi-threaded design of metaheuristics. Different models are provided and can be deployed on distributed and/or shared memory machines.

- The possible use of hybridization mechanisms to develop hybrid metaheuristics. Different hybridization schemes are now predefined and easy to use.
The framework has been experimented on several academic and real-life applications such as radio-network design in mobile telecommunications [9], etc. In this paper, we present a spectroscopic data mining optimization problem that is solved by a metaheuristic developed with ParadisEO. It deals with predicting the concentration of a compound in agricultural products [12]. The predictive model is built from a set of data samples. Each sample contains the values of the absorbances of the product to Near Infra-Red (NIR) radiations. Each absorbance value corresponds to a wavelength belonging to the domain 830 nm - 2500 nm. For each sample, a concentration of sugar is measured by an expensive chemical analysis. The feature selection problem considered here consists of identifying the relevant wavelengths that allow the concentration of sugar in beet to be predicted. The rest of the paper is organized as follows: in Section 2, we describe the ParadisEO framework. Section 3 and Section 4 present respectively the studied spectroscopic optimization problem and the proposed
PARADISEO-based approach to solve it. In Section 5, experimental results on an IBM-SP3 machine are presented and discussed. Finally, Section 6 concludes the paper and draws some perspectives.
2. The PARADISEO Framework

ParadisEO is an extension of the EO framework developed by the European research group composed of M. Keijzer (Vrije Universiteit Amsterdam), J.J. Merelo and G. Romero (Universidad de Granada), and M. Schoenauer (INRIA Rocquencourt) [4]. Basically, EOlib is a library dedicated to the flexible design of evolutionary algorithms. The flexibility is enabled by the object-oriented paradigm: everything is an object (data structures, operators, statistic computing routines, etc.). Technically, EOlib is an Open Source C++ class library downloadable from http://eodev.sourceforge.net. Besides the "evolutionary classes", general facilities for EA applications are also provided: checkpointing for stopping and restarting applications, multiple statistics gathering, and graphical on-line representation. Furthermore, EO is open, meaning that the programmer can use the existing on-line tutorial template files and/or implement his/her own new components (data structures, operators, etc.). EO is nowadays used successfully for several applications, including evolving multilayer perceptrons [1], voice segmentation [10], etc.

Evolutionary algorithms are known to be powerful in the exploration of the global solution space. However, they are less effective than local search metaheuristics in the exploitation of local regions of the search space [14]. Therefore, we first extended the library with, and experimented it on, local search methods including tabu search, descent local search and simulated annealing. Moreover, in order to benefit from both approaches, their hybridization is highly recommended. In [14], E.-G. Talbi distinguishes two levels and two modes of hybridization (Figure 1): Low and High levels, and Relay and Teamwork modes. Low-level hybridization addresses the functional composition of a single optimization method: a function of a given metaheuristic is replaced by another metaheuristic. On the contrary, in high-level hybrid algorithms the different metaheuristics are self-contained, meaning that no direct relationship to their internal working is considered. On the other hand, relay hybridization means that a set of metaheuristics is applied in a pipeline way: the output of each metaheuristic (except the last one) is the input of the following one.
Figure 1. Hierarchical taxonomy of hybrid metaheuristics (low level / high level, relay / teamwork (coevolution))
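To make the relay mode concrete, here is a minimal C++ sketch of a high-level relay hybrid in which a crude evolutionary algorithm is followed by a descent local search. It is hypothetical, generic code written for this illustration (the bit-string problem, fitness and operators are stand-ins), not the ParadisEO API.

    #include <algorithm>
    #include <cstdlib>
    #include <iostream>
    #include <vector>

    using Solution = std::vector<int>;                     // a bit string

    // Stand-in objective to maximize (replace by a real fitness function).
    int fitness(const Solution& s) {
      return std::count(s.begin(), s.end(), 1);
    }

    Solution randomSolution(std::size_t n) {
      Solution s(n);
      for (auto& b : s) b = std::rand() % 2;
      return s;
    }

    // Very crude (1+1)-style evolutionary algorithm: keep the best mutant.
    Solution evolutionaryAlgorithm(std::size_t n, int generations) {
      Solution best = randomSolution(n);
      for (int g = 0; g < generations; ++g) {
        Solution child = best;
        child[std::rand() % n] ^= 1;                       // bit-flip mutation
        if (fitness(child) > fitness(best)) best = child;
      }
      return best;
    }

    // Descent local search: accept improving 1-bit neighbours until none is left.
    Solution localSearch(Solution s) {
      bool improved = true;
      while (improved) {
        improved = false;
        for (std::size_t i = 0; i < s.size(); ++i) {
          Solution neighbour = s;
          neighbour[i] ^= 1;
          if (fitness(neighbour) > fitness(s)) { s = neighbour; improved = true; }
        }
      }
      return s;
    }

    int main() {
      // Relay mode: the output of the evolutionary algorithm is the input
      // of the local search, applied in a pipeline way.
      Solution s = localSearch(evolutionaryAlgorithm(32, 200));
      std::cout << "best fitness: " << fitness(s) << "\n";
      return 0;
    }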
Conversely, teamwork hybridization is a cooperative optimization model. Each metaheuristic performs a search in a solution space and exchanges solutions with the others. Hybrid metaheuristics allow robust and high-quality solutions to be computed. However, for real-life intensive applications they face an efficiency problem. Parallelism and multi-threading are two powerful ways to achieve high-performance optimization. Parallelism allows the search process, as well as the I/O operations, to be performed in parallel. Multi-threading is particularly well-suited for applications accessing data on disk. Three major parallel models can be distinguished (see Figure 2): the island asynchronous cooperative model, the parallel/distributed (a)synchronous population evaluation, and the distributed evaluation of a single solution.
- Island asynchronous cooperative model: In this model, different evolutionary algorithms are simultaneously deployed. Each of them performs a search on a sub-population. All the genetic mechanisms are applied locally, from the selection step to the replacement step. Solution migrations guided by some criteria are performed between the sub-populations, in a regular or irregular way. The exchange of genetic material induces a diversification of the search that delays the global convergence, especially when the evolutionary algorithms are heterogeneous regarding the variation operators.

- Parallel/distributed (a)synchronous population evaluation: The evaluation step of an E.A. is in general the most time-consuming one; therefore, its parallelization is required. In this centralized model, a master process applies the selection, transformation and replacement mechanisms (which require a global management of the population). It also distributes the set of newly generated solutions between different evaluators dedicated to this task. An efficient execution is often obtained, particularly when the cost of an evaluation is high compared with the communication cost (a minimal sketch of this scheme is given after this list).
- Distributed evaluation of a single solution: In this model, the quality of each solution is evaluated in a parallel, centralized way. Such a model is very important when, for example, the evaluation of each solution requires an access to very large databases distributed among different processing nodes.
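The sketch below illustrates the second model in plain C++ with standard threads (hypothetical helper names, not the ParadisEO API): a master keeps the whole population and farms the fitness computations out to worker threads, then waits for all of them (synchronous evaluation).

    #include <algorithm>
    #include <thread>
    #include <vector>

    struct Individual {
      std::vector<int> genes;
      double fitness;
    };

    // Placeholder for a costly objective function (e.g., building a model).
    double evaluate(const Individual& ind) {
      double s = 0.0;
      for (int g : ind.genes) s += g;
      return s;
    }

    // Synchronous parallel evaluation: each worker thread handles a slice.
    void evaluatePopulation(std::vector<Individual>& pop, unsigned nbThreads) {
      std::vector<std::thread> workers;
      std::size_t chunk = (pop.size() + nbThreads - 1) / nbThreads;
      for (unsigned t = 0; t < nbThreads; ++t) {
        std::size_t begin = t * chunk;
        std::size_t end = std::min(pop.size(), begin + chunk);
        if (begin >= end) break;
        workers.emplace_back([&pop, begin, end] {
          for (std::size_t i = begin; i < end; ++i)
            pop[i].fitness = evaluate(pop[i]);
        });
      }
      for (auto& w : workers) w.join();   // the master waits for all evaluators
    }

    int main() {
      std::vector<Individual> pop(8);
      for (auto& ind : pop) ind.genes.assign(10, 1);
      evaluatePopulation(pop, 4);
      return 0;
    }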
All these hybridization mechanisms, parallel and distributed models, and multi-threading facilities are implemented in ParadisEO. From a technical point of view, the platform is developed with C++/MPI/POSIX threads. It is successfully used to develop metaheuristics in different fields such as molecular and multi-scale modeling, theoretical physics, robotics, mobile telecommunications, etc. In the next section, we present an experimentation of the framework on a spectroscopic data mining application.
3. Spectroscopic Data Mining

3.1. The studied problem

The spectroscopic data mining problem this paper deals with consists of building a model allowing the concentration of a given component in a given product to be predicted. To do that, NIR radiations (N wavelengths) are sent through a sample of the product (see Figure 3). Sensors are then used to measure the absorbances of the product to those radiations. A spectrum of N absorbances is thus obtained. In addition, the concentration of the component in the product is measured by chemical analysis. The experiment is repeated a certain number M of times. Therefore, a set of M samples is obtained. Each sample contains a spectrum of N absorbances and its corresponding real concentration. The set of samples constitutes the data of the studied problem. In order to evaluate the accuracy of the built model, the data set is divided into two sub-sets: a calibration sub-set and a validation sub-set. They allow respectively the building of the model and its validation.

According to the Beer-Lambert law, the absorbance for a given wavelength is a linear combination of the contributions of the pure compounds. Therefore, the variation of the concentration of a given component acts on the absorbance associated with the wavelengths that component can absorb. In other words, the concentration of the component is a function of the absorbances corresponding to the wavelengths it can absorb. Predicting the concentration of a given component in an organic product consists of estimating that function (model). As the problem is linear (due to the Beer-Lambert law), it is classically solved by statistical methods. PLS regression [13] is particularly well-suited for NIR spectroscopy. Moreover, once the predictive model is built from the calibration data sub-set, its accuracy is evaluated on the validation data sub-set by using the following metric, called the Root Mean-Square Error of Prediction (RMSEP):

    RMSEP = \sqrt{\frac{\sum_{i=1}^{n} (c_i - \hat{c}_i)^2}{n}}

where n is the number of validation data samples. For each sample i, c_i and ĉ_i designate respectively its measured (real) concentration and its predicted concentration (computed with the built model). This paper deals with a particular instance of the problem: predicting the concentration of sugar in beet. The data set is provided by the Laboratoire de spectrochimie infrarouge et Raman de Lille, and contains 1800 samples. Each sample contains 1020 absorbances (corresponding to 1020 wavelengths) and the associated concentration measure determined by chemical analysis.
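As a small illustration (added here, not part of the original text), the RMSEP defined above can be computed directly from the measured and predicted concentrations; the values below are toy numbers:

    #include <cmath>
    #include <cstdio>
    #include <vector>

    // Root Mean-Square Error of Prediction over n validation samples.
    double rmsep(const std::vector<double>& measured,
                 const std::vector<double>& predicted) {
      double sum = 0.0;
      for (std::size_t i = 0; i < measured.size(); ++i) {
        double d = measured[i] - predicted[i];
        sum += d * d;
      }
      return std::sqrt(sum / measured.size());
    }

    int main() {
      std::vector<double> c     = {12.1, 15.4, 13.7};   // measured concentrations
      std::vector<double> c_hat = {12.0, 15.9, 13.5};   // predicted concentrations
      std::printf("RMSEP = %f\n", rmsep(c, c_hat));
      return 0;
    }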
Figure 2. The three major parallel models (island model with migrations between populations; parallel/distributed evaluating nodes; distributed and partial evaluating nodes)

Figure 3. Measurement of the radiations absorption in the near-infrared domain (polychromatic source, sample, sensors)
Before applying PLS, a feature selection has to be performed. Let us examine why this selection is needed for the application. We measured, on the absorbances of the calibration samples, two parameters: redundancy and relevance. Figure 4 shows the average correlation of the absorbance corresponding to each wavelength with the absorbances associated with all the other wavelengths. The average values are between 0 and 1. Features (absorbances) with a high average correlation are redundant. Those around 800 nm and 400 nm are particularly redundant.

Figure 4. Average correlation of each absorbance with all other ones
On the other hand, Figure 5 illustrates the correlation of the absorbances with the concentration. The values of the correlation are between -1 and 1. The values near the extremities define a perfect correlation, i.e., strongly relevant features. Conversely, features with a correlation near 0 are irrelevant. One can remark that the features that correspond to the wavelengths between 350 and 400 are particularly irrelevant. Given the above measurements, a feature selection really has to be performed before applying the PLS procedure. The objective is to select the absorbances that are less correlated between them and more correlated with the concentration.
Figure 5. The correlation between the absorbances and the corresponding concentration

3.2 Genetic feature selection

The feature selection problem can be formulated as follows: a data sample can be viewed as an assignment of absorbance values to a set of wavelengths (w1, w2, ..., wn) with a concentration c. If f represents the model to be built, then c = f(w1, w2, ..., wn). In the presence of irrelevant and/or redundant wavelengths, feature selection consists of selecting M relevant wavelengths from the N given wavelengths, where M < N. Several techniques attempting to solve the problem have been proposed in the literature [11, 15, 2, 7]. These approaches may be classified in two categories [5]: filter approaches and wrapper approaches. Filter techniques consider two separate steps: the first one performs the feature selection; the second one builds the model from the selected attributes. The selection is not based on the accuracy of the built model but on some criteria such as the correlation between the features. Conversely, in the wrapper methods the attribute selection and the model building are mixed. The attribute selection is performed in several steps. At each step, a model is built and its accuracy is evaluated. The accuracy informs about the quality of the performed selection: the more accurate the model, the more interesting the selection. The selection is enhanced at each step until no accuracy improvement is possible. In each category, different techniques have been proposed. In this work, we focus on genetic algorithms (GAs). GAs have been revealed as a powerful tool for attribute selection [16]. A survey on genetic feature selection in mining issues may be found in [8]. As shown in Figure 6, our method is a wrapper GA, namely the Wrapper GA Feature Selection (W-GAFS).
The individuals of the population are binary strings whose length equals the total number of considered wavelengths. Each individual is a string of bits w = (w_i, i = 1, 2, ..., N) in which w_i = 1 means that wavelength w_i is relevant, and thus selected, while w_i = 0 means that w_i is irrelevant, and thus not selected. At each generation, a PLS is performed for each individual, i.e., according to its selection of absorbances (wavelengths). The fitness of the individual is the prediction error (RMSEP) of the model built from its corresponding selection. The genetic operators are the classical ones, namely mutation and crossover. In this work, the crossover operator has two schemes: uniform and one-point.
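The following sketch shows, in plain C++, what this encoding and the variation operators might look like. It is a hypothetical illustration written for this text (the fitness call is only a placeholder for the PLS model building and the RMSEP computation), not the ParadisEO implementation.

    #include <algorithm>
    #include <random>
    #include <vector>

    using Selection = std::vector<int>;   // Selection[i] == 1: wavelength i kept

    std::mt19937 rng{42};

    Selection randomSelection(std::size_t n, double rate = 0.10) {
      std::bernoulli_distribution keep(rate);             // e.g. 10% of features
      Selection s(n);
      for (auto& b : s) b = keep(rng) ? 1 : 0;
      return s;
    }

    void flipMutation(Selection& s, double perGeneRate) {
      std::bernoulli_distribution flip(perGeneRate);
      for (auto& b : s) if (flip(rng)) b ^= 1;            // 0 <-> 1
    }

    void onePointCrossover(Selection& a, Selection& b) {
      std::uniform_int_distribution<std::size_t> cut(1, a.size() - 1);
      for (std::size_t i = cut(rng); i < a.size(); ++i) std::swap(a[i], b[i]);
    }

    void uniformCrossover(Selection& a, Selection& b) {
      std::bernoulli_distribution coin(0.5);
      for (std::size_t i = 0; i < a.size(); ++i)
        if (coin(rng)) std::swap(a[i], b[i]);
    }

    // Placeholder: the real fitness builds a PLS model restricted to the
    // selected wavelengths and returns its RMSEP on the validation samples.
    double fitness(const Selection& s) {
      return static_cast<double>(std::count(s.begin(), s.end(), 1));
    }

    int main() {
      Selection a = randomSelection(1020), b = randomSelection(1020);
      uniformCrossover(a, b);
      flipMutation(a, 0.01);
      return fitness(a) >= 0.0 ? 0 : 1;
    }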
Figure 6. The wrapper GA-based attribute selection method
4. PARADISEO-based W-GAFS

During the evaluation step of W-GAFS, a model is built and its RMSEP is computed for each individual (feature selection). Some preliminary experiments have shown that PLS is costly in terms of CPU time and I/O operations. Indeed, the execution of PLS on an IBM RS/6000 at 375 MHz takes about 10 seconds. Therefore, the evaluation of the individuals must be parallel. In addition, concurrency is required in order to overlap the I/O operations with the computation. The definition of the parallel model depends strongly on the partitioning and the placement of the population and the database. The population can be either centralized on one processor or distributed among several processors. On the other hand, the database can be centralized, replicated or distributed. Table 1 summarizes the different possible combinations and the advantages (+) and drawbacks (-) of each configuration. Computing consists of applying the PLS method according to the individuals. I/O operations are mainly performed during the application of PLS.
Table 1. Different possible parallel configurations

Database \ Population | Centralized | Distributed
Centralized | No parallelism | + Parallel computing, - Throttling situation with I/O
Replicated | + Parallel I/O, - Loss of disk space | + Parallel computing and I/O, - Loss of disk space
Distributed | + Parallel I/O | + Parallel computing and I/O
Therefore, defining a parallel model requires taking into account the advantages and drawbacks of each configuration. It consists of finding the best compromise between the different configurations. According to Table 1, the population must be distributed. Regarding the database, as it is not large in our case, its distribution would not be beneficial. Its replication allows I/O parallelism but wastes disk space. On the other hand, centralizing it causes throttling situations, and thus a performance decrease. The best way to achieve I/O efficiency is a hybrid of the two configurations: the parallel (or distributed) machine is logically partitioned into clusters, the database is replicated on each cluster, and it is centralized for the processors belonging to the same cluster. This idea is developed in the present work.

As Figure 7 illustrates, our parallel model is an SPMD one. The farmer creates the workers on N nodes (one worker per node) and sends them the database. Each node (worker) represents a cluster. The farmer then divides the population into sub-populations, and each of them is assigned to one worker. Each worker evaluates the individuals of its sub-population in parallel on the processors of its cluster. These processors perform the PLS procedure by accessing the database of their cluster. Once all the individuals are evaluated, the worker returns the results to the farmer, which performs the replacement operation on the whole new population. The cycle is repeated for a certain number of generations.

The parallel model corresponds to the parallel/distributed synchronous population evaluation model quoted in Section 2. Through ParadisEO, it is exploited in a transparent way. In addition, the framework enables a transparent exploitation of multi-threading at two levels. At the first level, each worker is multi-threaded, and each thread evaluates one or several individual(s). At the second level, the access to the database performed by the PLS procedure is multi-threaded. Furthermore, the hybridization of the GA with PLS is straightforward in ParadisEO. Finally, the ParadisEO-based W-GAFS is flexible. Indeed, all its parameters (population representation, genetic operators, the PLS procedure, etc.) can be easily evolved. For example, PLS can be replaced, without a great effort, by another efficient model-building procedure.
Figure 7. The parallel multi-threaded W-GAFS (the farmer distributes the selections to the worker nodes, which evaluate them in parallel with concurrent accesses to their local copy of the database)
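The following is a hedged, stand-alone MPI sketch of the farmer/worker scheme of Figure 7 (plain MPI calls, not the ParadisEO API; the genome encoding and the evaluate function are placeholders for the real bit strings and the PLS procedure, and the population size is assumed to be a multiple of the number of processes):

    #include <mpi.h>
    #include <vector>

    const int GENOME_LEN = 64;        // bits per individual (placeholder size)
    const int POP_SIZE   = 128;       // whole population held by the farmer

    double evaluate(const int* genome) {        // placeholder for PLS + RMSEP
      double s = 0.0;
      for (int i = 0; i < GENOME_LEN; ++i) s += genome[i];
      return s;
    }

    int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      int localPop = POP_SIZE / size;                 // individuals per node
      std::vector<int> population;                    // only filled on the farmer
      if (rank == 0) population.assign(POP_SIZE * GENOME_LEN, 1);

      // The farmer (rank 0) scatters one sub-population to every process.
      std::vector<int> local(localPop * GENOME_LEN);
      MPI_Scatter(population.data(), localPop * GENOME_LEN, MPI_INT,
                  local.data(),      localPop * GENOME_LEN, MPI_INT,
                  0, MPI_COMM_WORLD);

      // Each worker evaluates its share (this loop could itself be threaded).
      std::vector<double> localFit(localPop);
      for (int i = 0; i < localPop; ++i)
        localFit[i] = evaluate(&local[i * GENOME_LEN]);

      // Fitness values are gathered back by the farmer.
      std::vector<double> allFit(rank == 0 ? POP_SIZE : 0);
      MPI_Gather(localFit.data(), localPop, MPI_DOUBLE,
                 allFit.data(),   localPop, MPI_DOUBLE, 0, MPI_COMM_WORLD);

      // The farmer would now perform selection/replacement on the whole
      // population and start the next generation.
      MPI_Finalize();
      return 0;
    }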
5. Experimental results

In this section, we present the parameters of W-GAFS, the experimental material platform and the experimental results. The results must let us evaluate the benefits of ParadisEO, in addition to its flexibility and ease of use, in terms of prediction quality and efficiency in spectroscopic data mining. The parameters of W-GAFS are reported in Figure 8. The stopping criterion was set empirically by a sequential execution of the PLS procedure on the whole data set: the minimum RMSEP is achieved after 170 generations. The bit-string format predefined in ParadisEO is used to represent the individuals of the population.

The experimental material platform is a 4-node IBM SP3 CLUMPS (Cluster of SMPs: Symmetric Multi-Processors). Each node (SMP) is composed of 16 Power3 NH2 processors at 375 MHz with 16 GB of RAM. The communication network between the SMPs is the Colony network, i.e., an Omega switch (800 MB/s).

Initialization: 10% of features randomly selected
Population: 200 individuals
Crossover: Uniform (100% of effective recombination)
Mutation: Bit-flip (10% of the offspring, and 1% per gene)
Selection: Ranking (100 parents selected)
Replacement: Tournament (bias 0.7)
Stopping criterion: A fixed number of generations (170)

Figure 8. Parameters of W-GAFS
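As a side illustration (hypothetical field names, not the ParadisEO API), the parameters of Figure 8 can be gathered in a single configuration structure, which makes the flexibility claim concrete: changing the experiment only means changing these values.

    // Hypothetical grouping of the W-GAFS parameters of Figure 8; the field
    // names are illustrative only.
    struct WGafsParameters {
      double initSelectionRate = 0.10;  // 10% of features randomly selected
      int    populationSize    = 200;   // individuals
      double crossoverRate     = 1.0;   // uniform crossover, 100% recombination
      double mutationRate      = 0.10;  // 10% of the offspring mutated
      double perGeneFlipRate   = 0.01;  // 1% bit-flip probability per gene
      int    selectedParents   = 100;   // ranking selection
      double replacementBias   = 0.7;   // tournament replacement
      int    maxGenerations    = 170;   // fixed stopping criterion
    };

    int main() {
      WGafsParameters params;
      return params.maxGenerations > 0 ? 0 : 1;
    }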
The parallel multi-threaded model presented above fits the material architecture well. The data set is replicated on the 4 SMPs. On each SMP, the 16 processors access the data samples in a transparent multi-threaded way. Furthermore, computation threads are spawned on these processors to evaluate one or several individuals.

Let us now discuss the experimental results. Three parameters are important: the prediction accuracy, the understandability of the predictive model and the speed-up of the execution. The prediction accuracy is estimated by the RMSEP metric. Recall that two data sub-sets are used to build and validate the model: the calibration sub-set and the validation sub-set. Traditionally, the RMSEP of the model is computed from the validation sub-set. We believe that, as it is used in the model-building process, this RMSEP is biased. To overcome that problem we need an additional, unused data sub-set, called the prediction sub-set. Each one of the three sub-sets contains 600 samples. The prediction sub-set is used in a post-prediction process to compute the RMSEP (namely, the real RMSEP) only on the best solution (discovered model).

In order to evaluate the benefits of our GA-based feature selection, we first evaluated the RMSEP obtained by executing the PLS procedure without preliminary attribute selection. The obtained RMSEP is 0.170. Figure 9 illustrates the evolution of the validation error (real RMSEP) as a function of the number of generations obtained by running W-GAFS. The algorithm converges after 170 generations and 20 hours. The best solution (final model) has a biased RMSEP of 0.095 and a real RMSEP of 0.107. The prediction accuracy is thus improved by 37%. On the other hand, the built model contains only 114 wavelengths, meaning that 88% of the wavelengths have been withdrawn. Figure 10 illustrates an example of a spectral selection obtained by W-GAFS. One can clearly distinguish some empty areas where redundant and/or irrelevant wavelengths have been withdrawn. The attribute reduction allows the understandability of the model to be enhanced.
Figure 9. Evolution of the validation error as a function of the number of generations
Figure 10. A view of such an obtained optimal feature selection (top: the full NIR spectrum; bottom: the final set of selected wavelengths)
Finally, Figure 11 presents the near-linear speed-up and efficiency obtained with the parallel multi-threaded implementation of W-GAFS:

Number of processors:  2     3     4     5     6     7     8     9     10
Speed-up:              1.94  2.81  3.88  4.71  5.67  6.17  7.24  7.77  8.93
Efficiency:            0.97  0.93  0.97  0.94  0.95  0.88  0.90  0.86  0.89
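For reference (an added clarification, not part of the original text), the efficiency values follow from the speed-ups through the usual definitions, where T_p is the execution time on p processors:

    S_p = \frac{T_1}{T_p}, \qquad E_p = \frac{S_p}{p}

For example, with p = 8 processors, E_8 = 7.24 / 8 ≈ 0.90, which matches the table.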
Figure 11. Performance results (speed-up and efficiency as functions of the number of processors used)
6. Conclusion

In this paper we presented ParadisEO, an easy-to-use open source framework allowing the design and development of flexible and efficient metaheuristics. Flexibility means that the parameters, including the data representation, the variation operators, the stopping criteria, and so on, can be evolved. It takes its essence from the component-oriented paradigm. Efficiency can be achieved by exploiting, in a transparent way, multi-threading and the predefined parallelism and distribution models. Furthermore, hybridization mechanisms are implemented and are easily accessible to the programmer. The platform has been experimented on a spectroscopic data mining application. Its development with ParadisEO is straightforward thanks to the flexibility property. In addition, multi-threading, parallelism and distribution allow a near-linear speed-up to be achieved. The hybridization of the GA with PLS enhances the accuracy of the discovered model by a factor of 37%, and its understandability by a factor of 88%.

Although the experience with ParadisEO in NIR spectroscopy is promising, some of the major features have not been exploited. We are currently experimenting with the platform on a more complete application: the radio-network design problem in mobile telecommunications. The three parallel distributed design models are used. Moreover, hybridization mechanisms such as GAs with the tabu search metaheuristic are exploited. On the other hand, in order to deal with large-scale optimization applications such as bioinformatics, we are developing a Grid-enabled multi-domain version of ParadisEO. The current focus is to interface the platform with Condor-G [3]. Condor-G combines the inter-domain resource management protocols of the Globus Toolkit [3] and the intra-domain resource and job management methods of Condor [6] to allow the user to harness multi-domain resources as if they all belong to one personal domain.

References

[1] P.A. Castillo, J.J. Merelo, V. Rivas, G. Romero, and A. Prieto. Evolving Multilayer Perceptrons. Neural Processing Letters, 12(2):115-127, 2000.

[2] M. Dash and H. Liu. Feature selection methods for classifications. Intelligent Data Analysis, 1(3):131-156, 1997.

[3] I. Foster and C. Kesselman. Globus: A Toolkit-Based Grid Architecture. In I. Foster and C. Kesselman, eds., The Grid: Blueprint for a New Computing Infrastructure, Morgan Kaufmann, pages 259-278, 1999.
[4] M. Keijzer, J.J. Merelo, G. Romero, and M. Schoenauer. Evolving Objects: A General Purpose Evolutionary Computation Library. In Proc. of the 5th Intl. Conf. on Artificial Evolution (EA'01), Le Creusot, France, Oct. 2001.
[5] P. Langley. Selection of Relevant Features in Machine Learning. In Proc. of the AAAI Fall Symposium on Relevance, New Orleans, LA, AAAI Press, pages 1-5, 1994.

[6] M. Litzkow, M. Livny, and M. Mutka. Condor: A Hunter of Idle Workstations. In Proc. of the 8th Intl. Conf. on Distributed Computing Systems, pages 104-111, 1988.
[7] H. Liu and H. Motoda, editors. Feature Extraction, Construction and Selection: A Data Mining Perspective. Kluwer Academic Publishers, 1998.
[8] M.J. Martin-Bautista and M.-A. Vila. Towards an Evolutionary Algorithm: A Comparison of Two Feature Selection Algorithms. In Proc. of the 1999 Congress on Evolutionary Computation, Washington D.C., USA, pages 1314-1321, July 6-9, 1999.

[9] H. Meunier. Algorithmes évolutionnaires parallèles pour l'optimisation multi-objectif de réseaux de télécommunications mobiles. PhD thesis, Université de Lille 1, 2002.

[10] D.H. Milone, J.J. Merelo, and H.L. Rufiner. Evolutionary Algorithm for Speech Segmentation. In Proc. of the 2002 Congress on Evolutionary Computation (CEC2002), IEEE Press, pages 1115-1120, 2002.

[11] P.M. Narendra and K. Fukunaga. A branch and bound algorithm for feature subset selection. IEEE Transactions on Computers, C-26(9):917-922, Sept. 1977.

[12] Y. Roggo, L. Duponchel, B. Noé, and J.-P. Huvenne. Sucrose content determination of sugar beets by near infrared reflectance spectroscopy. Comparison of calibration methods and calibration transfer. Journal of Near Infrared Spectroscopy, 10:121-135, 2002.

[13] S. Wold, H. Martens, and H. Wold. The multivariate calibration problem in chemistry solved by the PLS method. In Proc. of the Conf. on Matrix Pencils (A. Ruhe and B. Kagström, eds.), Lecture Notes in Mathematics, Springer-Verlag, Berlin, pages 286-293, 1982.

[14] E.-G. Talbi. A Taxonomy of Hybrid Metaheuristics. Journal of Heuristics, Kluwer Academic Publishers, 8:541-564, 2002.

[15] H. Vafaie and F. Imam. Feature Selection Methods: Genetic Algorithms vs. Greedy-like Search. In Proc. of the 1994 Intl. Fuzzy Systems and Intelligent Control Conf., Louisville, KY, 1994.

[16] H. Vafaie and K. De Jong. Genetic Algorithms as a Tool for Feature Selection in Machine Learning. In Proc. of the 1992 IEEE Intl. Conf. on Tools with Artificial Intelligence, Arlington, VA, pages 200-204, 1992.

[17] S. Voss and D.L. Woodruff, editors. Optimization Software Class Libraries. Kluwer Academic Publishers, Boston, Apr. 2002.