Multi-Objective Memetic Evolution of ART-based Classifiers

Rong Li, Timothy R. Mersch, Oriana X. Wen, Assem Kaylani, and Georgios C. Anagnostopoulos, Member, IEEE

Abstract—In this paper we present a novel framework for evolving ART-based classification models, which we refer to as MOME-ART. The new training framework aims to evolve populations of ART classifiers to optimize both their classification error and their structural complexity. Towards this end, it combines the use of interacting sub-populations, some traditional elements of genetic algorithms to evolve these populations and a simulated annealing process used for solution refinement to eventually give rise to a multi-objective, memetic evolutionary framework. In order to demonstrate its capabilities, we utilize the new framework to train populations of semi-supervised Fuzzy ARTMAP and compare them with similar networks trained via the recently published MO-GART framework, which has been shown as being very effective in yielding high-quality ART-based classifiers. The experimental results show clear advantages of MOME-ART in terms of Pareto Front quality and density, as well as parsimony properties of the resulting classifiers.

I. INTRODUCTION

Using evolutionary algorithms to design/train classification models has lately become a flourishing area of research. Numerous evolution-based approaches have been suggested; among them are methods that evolve single models, populations of models and populations of ensembles of models. For example, in [1] the authors introduce hybrid evolutionary learning for pattern recognition (HELPR), which combines a variety of concepts from evolutionary algorithms to discover effective classification features. In [2], a genetic programming-based approach for two-class classifiers is presented. One of the few recent papers that discusses multi-objective evolution of classifiers, also based on genetic programming, is [3]. Moreover, a symbiotic approach to evolving fuzzy rule-based classification models is illustrated in [4]. Finally, we mention the work presented in [5], which deals with evolving hyper-networks to perform classification tasks. The primary reported strengths of evolutionary approaches are their capabilities for global search of the classifiers' parameter space and, thus, the potential to build high-quality models.

Rong Li is with the Department of Electrical & Computer Engineering, Florida Institute of Technology, Melbourne, Florida, US (email: [email protected]). Timothy R. Mersch is with the Department of Computer Sciences, Florida Institute of Technology, Melbourne, Florida, US (email: [email protected]). Oriana X. Wen is with the Departments of Biomedical Engineering and Electrical & Computer Engineering, Duke University, Durham, North Carolina, US (email: [email protected]). Assem Kaylani is with the Research & Development Department, InCube FZCO, Amman, Jordan (email: [email protected]). Georgios C. Anagnostopoulos is with the Electrical & Computer Engineering Department, Florida Institute of Technology, Melbourne, Florida, US (phone: +1 321 6747125; email: [email protected]).

A recently-published evolutionary algorithm that trains ART-based classifiers is Multi-Objective Genetic ART (MO-GART) ([6], [7], [8]). Research on Adaptive Resonance Theory (ART) neural networks has been active over the last two decades and several architectures have been proposed over the years. For example, extensions to the classic Fuzzy ARTMAP classification model [9] include Hypersphere ARTMAP [10], Ellipsoid ARTMAP [11], BARTMAP-S [12], as well as semi-supervised Fuzzy ARTMAP (ssFAM) and Ellipsoid ARTMAP [13], to mention only a very few. MO-GART is capable of evolving populations of such types of ART-based classifiers to produce superior quality models. It utilizes evolutionary multi-objective optimization concepts and techniques, as well as appropriate mutation and recombination operators, to eventually produce a collection of classifiers whose members represent a trade-off between training set accuracy and model complexity. Having access to such a set of models is an advantage for the designer, as it facilitates the identification of models that demonstrate the best generalization performance for minimum structural complexity. In [8] it is shown that MO-GART rapidly converges to a set of high-quality models, whose champion networks' generalization performance rivals, if not exceeds, the performance of Support Vector Machine or CART decision tree classifiers.

In this paper we present a new method for evolving ART-based classifiers, which we call Multi-Objective Memetic Evolution of ART (MOME-ART). We also refer to it as an evolutionary framework, instead of an algorithm, due to its generality in terms of its applicability. First of all, unlike MO-GART, it is agnostic to the category representation (see [14]) utilized by each ART-based classifier. Secondly, it is agnostic to how the network uses these categories to perform its classification task. Also unlike MO-GART, its initial population does not need to consist of already-trained classifiers. Aside from these important characteristics, it also aims to deliver sets of non-dominated models (Pareto fronts) that are denser and more diverse than the ones produced by MO-GART. Towards this end, MOME-ART evolves in parallel many sub-populations that are differentiated by degrees of model complexity. The superiority of the produced Pareto fronts is achieved by a specially tailored mutation protocol and a synergistic simulated annealing process that aggressively refines the quality of the evolved networks. The combination of all these elements eventually gives rise to a powerful memetic algorithm that takes advantage of both global exploration and localized exploitation. In order to showcase the potential of MOME-ART, we compare its performance against MO-GART on a collection of real

datasets and we show that, with respect to a few important metrics, MOME-ART exhibits noteworthy advantages.

The rest of the paper is organized as follows: Section II provides some important background information on MO-GART. Next, Section III describes the inner workings of MOME-ART. This material is followed by Section IV, where we present a variety of experimental results and compare MOME-ART to MO-GART. We end our paper with a brief mention of conclusions and potential future work in Section V.

II. BACKGROUND AND PRIOR WORK

A. Multi-Objective Optimization

Ideal classification models feature low complexity and high accuracy. Thus, the search for optimal models is, in essence, a multiple-objective optimization problem. As with most multi-objective problems, optimizing one objective often occurs at the expense of the other objective. Naturally, in order to find the best model, it is imperative to be able to simultaneously optimize all of the objectives. A methodical and quantitative approach that is commonly adopted is the Pareto optimum, a definition of the optimal solution originally proposed by Francis Ysidro Edgeworth [15] and later generalized by Vilfredo Pareto [16]. When plotted in objective space, the non-dominated solutions are collectively known as the Pareto front (PF); the corresponding set of solutions is also referred to as the Pareto set. Detailed definitions of dominance can be found in [16]. The ideal situation is for the PF to approach the Pareto-optimal front that contains the absolute non-dominated vectors from the entire search space. Furthermore, the PF should be populated with unique models as densely as possible; this is because having a highly diverse and dense PF means a wider spectrum of models is available for pinpointing the optimum (typically via cross-validation).

B. MO-GART Algorithm

A common approach for solving multi-objective problems is to re-cast them into single-objective problems using weighted summing. This approach has two fundamental problems. First, it lacks robustness, because the optimal weights are strongly problem-dependent. Second, there is not a single best solution to multi-objective problems with conflicting objectives. Kaylani et al.'s MO-GART [7] is an attempt to address these issues. While the work focuses only on ART-based classifiers, it manages to find the solutions without using weighted sums, and it also produces a set of solutions that are trade-offs to each other. In genetic algorithms, models are frequently evaluated and compared throughout the evolutionary process. Without combining multiple objectives into one single value, MO-GART employs the dominance relationship described in Section II-A to address this need. Simply put, network A is considered a better network than B if A dominates B. While this approach will likely lead to networks that are mutually non-dominated, it is in fact a solution to the second problem mentioned above.
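To make the dominance relation and the resulting Pareto front concrete, the following sketch (ours, not from the paper; the Model class and function names are illustrative) filters a set of (training error, size) pairs down to its mutually non-dominated members, which is the comparison MO-GART relies on instead of a weighted sum.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Model:
    error: float   # training-set misclassification error (to be minimized)
    size: int      # number of categories, i.e. structural complexity (to be minimized)

def dominates(a: Model, b: Model) -> bool:
    """a dominates b if it is no worse in both objectives and strictly better in at least one."""
    no_worse = a.error <= b.error and a.size <= b.size
    strictly_better = a.error < b.error or a.size < b.size
    return no_worse and strictly_better

def non_dominated(population: list[Model]) -> list[Model]:
    """Return the unique, mutually non-dominated models (the Pareto front)."""
    unique = set(population)
    return [m for m in unique
            if not any(dominates(other, m) for other in unique if other != m)]

# Example: the (error=0.10, size=5) model dominates (error=0.12, size=7).
pf = non_dominated([Model(0.10, 5), Model(0.12, 7), Model(0.08, 9)])
```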

[Figure 1 flowchart: the non-dominated networks of the MO-GART population are selected and added to the archive; the archive is merged with the population for parent selection; crossover produces a child, which is pruned and/or mutated; the new generation then replaces the old population.]

Fig. 1. Flowchart of MO-GART. The archive records models in each generation that are not dominated by any other model in the population. It undergoes a purging process every generation to make sure that each model in it is unique and is not dominated by any other model. Elitism is implicitly employed, as the archive is available for parent selection during crossover.

At each generation, the models that are not dominated by any other models in the current population are inducted into the archive, a singular model storage space persisting throughout the evolution process. Immediately after, the archive goes through a purging process, where any models that have become dominated due to newcomers, as well as duplicate models, are removed. The archive only maintains non-dominated models. Besides recording the best solutions, the archive in MO-GART is also involved in the evolution process. During the parent selection for crossover, the current population is joined with the archive to form the selection pool. Furthermore, the archive eliminates the need for the explicit elitism phase common in genetic algorithms, in which a percentage of top-performing individuals is passed directly into the next generation; elitism still exists implicitly through the archive. Moreover, the evolutionary process is considered to have converged if the archive has not been updated for a certain number of consecutive generations. After each archive update, mutation and crossover take place. The MO-GART algorithm flow chart is shown in Figure 1.

III. THE MOME-ART FRAMEWORK

A. Motivation & Main Design Principles

Several critical realizations can be drawn from the guiding principles mentioned in Section II-A. Note that a classification model's complexity can simply be defined as its size, namely the number of categories that partake in the model. This means that model complexity is a discrete measure. Furthermore, model training error varies in fixed amounts, because the number of training samples is finite. Consequently, the trade-off curve and Pareto

fronts take on discrete representations. This leads to the first important realization: the objective space is quantized. Next, common operators like relabeling (changing the class label of a category), augmenting (adding a category), and pruning (removing a category) allow evolutionary search to extensively explore the objective space and, thus, are the driving forces in obtaining the ideal Pareto front. Specifically, augmenting and pruning directly affect model complexity, which can be visualized as a predictable, controlled movement along the size axis; on the other hand, the effects of relabeling are manifested on the error axis, though the direction and magnitude of change along this axis cannot be accurately predicted.

The main design principle of this framework is to exploit the above realizations by employing parallel evolution of subpopulations. The entire population of individuals is divided into subpopulations of bounded complexity that are evolved independently to minimize training error. The make-up of each subpopulation is limited to models of the same size. In other words, there is a subpopulation for each particular model size. Since each subpopulation only evolves individuals of a given complexity, the multi-objective search reduces to a single-objective optimization within each subpopulation. With the individual sizes held constant within subpopulations, the fitness evaluation that is essential to all genetic algorithms simplifies to a model error calculation. The evolutionary process acting on each subpopulation seeks only to drive down the error of individuals. Population subdivision avoids the computationally expensive dominance calculations inherent in multi-objective evolutionary algorithms.
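A minimal sketch of the bookkeeping this design implies (our own illustration, reusing the hypothetical Model class from the earlier sketch; none of these names come from the paper): individuals are grouped into subpopulations keyed by size, and ranking within a subpopulation depends on training error alone.

```python
from collections import defaultdict

def redistribute(individuals, max_size):
    """Group individuals into subpopulations keyed by model size (number of categories)."""
    subpops = defaultdict(list)
    for ind in individuals:
        if 1 <= ind.size <= max_size:     # maxSize caps the allowed model complexity
            subpops[ind.size].append(ind)
    return subpops

def rank_by_error(subpop):
    """Within a fixed-size subpopulation, fitness reduces to training error alone."""
    return sorted(subpop, key=lambda ind: ind.error)   # lowest error first
```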

[Figure 2 diagram: a chromosome is a sequence of genes, each holding an exemplar description index and an exemplar class label; the indices point into a shared gene pool of exemplars.]

Fig. 2. Each model in MOME-ART is represented by a chromosome that is made up of genes. Each gene contains a category index and a class label. The indices map to a gene pool, which is a collection of all the categories currently utilized in the population.


B. Overall Structure

As laid out in Figure 2, each individual model is represented by a chromosome made up of genes, which contain category description indices and labels. The indices map to a specific category description residing in the gene pool. The utilization of a gene pool improves memory allocation, since the same category description is not repeatedly stored in different chromosomes. Furthermore, the gene pool dynamically changes size based on the introduction or removal of categories, so unnecessary memory allocations for storing non-existing genes are prevented. Also, the gene indexing allows for a more efficient appraisal of consanguinity between different individuals and of gene duplication within one individual.

Before evolution takes place, a population is initialized either with randomly generated categories or with categories stemming from previously trained classifiers. Two parameters determine the setup of subpopulations:

• maxSize places a cap on the maximum possible complexity any individual model can achieve. For ART-based classifiers, the value of maxSize typically should not surpass the number of patterns in the training set.
• nominalSize limits the number of individuals in each subpopulation.

After proper initialization, the population and gene pool are filled and the individuals are ready to undergo evolution.
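One possible realization of this representation, sketched under our own naming assumptions (GenePool, Gene and Chromosome are not the authors' identifiers), keeps category descriptions in a single shared pool and lets chromosomes store only (index, class label) pairs:

```python
from dataclasses import dataclass, field

@dataclass
class GenePool:
    """Shared storage of category descriptions; chromosomes refer to them by index."""
    categories: list = field(default_factory=list)

    def add(self, category) -> int:
        self.categories.append(category)      # e.g. a Fuzzy ART weight vector
        return len(self.categories) - 1       # index used by the chromosomes

@dataclass
class Gene:
    category_index: int   # points into GenePool.categories
    class_label: int      # class label the exemplar votes for

@dataclass
class Chromosome:
    genes: list           # list of Gene; its length is the model's complexity

    @property
    def size(self) -> int:
        return len(self.genes)

# Usage: two chromosomes can share the same stored category without copying it.
pool = GenePool()
idx = pool.add([0.2, 0.7, 0.8, 0.3])
net_a = Chromosome([Gene(idx, class_label=0)])
net_b = Chromosome([Gene(idx, class_label=0), Gene(pool.add([0.1, 0.4, 0.9, 0.6]), 1)])
```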

[Figure 3 flowchart: the population is generated randomly or from pre-trained networks and evaluated; within each subpopulation, mutations (relabel, augment, prune) are performed, each accepted directly if fitness increases or with a certain probability if it decreases; the Hall of Fame is updated for each size by adding the fittest individual, adjusting the accuracy bracket and keeping only individuals within the bracket; networks whose size has changed migrate to a new subpopulation; selection and crossover follow; the process terminates if the Hall of Fame has not been updated for 10 generations.]

Fig. 3. The top-level description of the evolutionary framework. After a population is initialized (randomly generated) and evaluated, the parallel evolution of subpopulations begins. First, the Hall of Fame is updated (see Section III-C); then, the mutation (see Section III-D) of each subpopulation begins. To account for migration of individuals between subpopulations, the population is redistributed. Finally, each subpopulation undergoes recombination, which involves binary tournament selection. The procedure terminates when a convergence criterion is met.
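For orientation, the following sketch outlines one generation of the loop in Figure 3. It is our own schematic; the operators argument simply bundles placeholder routines for the steps detailed in Sections III-C and III-D, and none of the names reflect the authors' implementation.

```python
def evolve_mome_art(population, operators, max_generations=100, stale_limit=10):
    """Schematic of the per-generation loop in Figure 3 (placeholder routines, not the authors' API)."""
    hall_of_fame = {}                                   # fittest individuals, keyed by model size
    subpops = operators.redistribute(population)        # group individuals by size
    stale = 0
    for generation in range(max_generations):
        updated = operators.update_hall_of_fame(hall_of_fame, subpops)   # Section III-C
        for subpop in subpops.values():
            operators.mutate_with_annealing(subpop, generation)          # relabel / augment / prune
        flat = [ind for subpop in subpops.values() for ind in subpop]
        subpops = operators.redistribute(flat)          # migration: model sizes may have changed
        subpops = {size: operators.select_and_crossover(sp) for size, sp in subpops.items()}
        stale = 0 if updated else stale + 1
        if stale >= stale_limit:                        # Hall of Fame stagnant for 10 generations
            break
    return hall_of_fame
```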

The first step of the evolutionary process is to update the Hall of Fame, which, for all generations, catalogs the best individuals of every size that fall within a certain error bracket. Then, three mutation operators, namely relabeling, augmenting, and pruning, are used to evolve the subpopulations. This procedure is carried out with a ranking and selection scheme that is coupled with simulated annealing [17]. The mutation scheme transforms the framework into a memetic algorithm, since it searches both globally and locally. Also, due to the constraints imposed by subpopulations, where only models with a particular size are allowed, augmenting and pruning cause migration to neighboring subpopulations and, thus, must be addressed with the use of redistribution. Then, after mutation and migration are successfully carried out,

each subpopulation goes through selection and crossover, during which gene duplication is prevented. Finally, the offspring replace the entire parent subpopulation and the same procedure is repeated until a stopping criterion is met. The evolutionary framework is briefly sketched in Figure 3.

C. Hall of Fame

Inspired by the archive in MO-GART, the Hall of Fame is a permanent library of individuals with the highest fitness from all generations. The Hall of Fame improves upon the archive in two ways: it avoids computationally expensive dominance checking and it eliminates the unchecked growth and duplication observed in the archive. It is structured to store the fittest individuals of every size and, thus, guarantees diversity. At each generation, for every subpopulation, the best performing, unique individuals are compared with the individuals of the same size currently in the Hall of Fame. For each size in the Hall of Fame, there is an error bracket that determines a certain allowable variation in error rate. Calculated using Fisher's Exact Test [18], the error bracket finds the error cutoff that is statistically different from the fittest model's error rate. If the newcomers fall within the error bracket, they are inducted into the corresponding size slot of the Hall of Fame. Additionally, if the newcomers perform better than the existing fittest individuals in the Hall of Fame, the error bracket is redefined based on the newcomers' fitness. Any models that fall out of the new bracket are promptly removed from the Hall of Fame. Moreover, the Hall of Fame is an entity separate from the evolving population, so the fittest individuals are barred from imposing too great of an influence over the others.

D. Mutation: A Selection Scheme and Simulated Annealing

In order to improve the mutation process, the framework integrates a selection scheme based on a simulated annealing process very similar to the one used in [17]. These adjustments strengthen the role of mutation as a local-search and diversity-enhancing mechanism. Also, the selection of individuals for mutation employs a ranking system that allows mutation operators to focus on particular types of individuals. For example, it is possible to target and aggressively refine the fittest individuals. Additionally, mutations performed on individuals with below-average fitness are not wasted efforts either, because they still contribute to the diversity of the subpopulation.

At the beginning of the mutation process, a random number of relabeling, augmenting, and pruning operations is chosen based on probability parameters. Then, from the three mutation methods, one is randomly chosen for execution. The next step selects an individual from the subpopulation using the ranking scheme mentioned above. The fittest individuals are assigned a rank of 0 and individuals with the same fitness share ranks. Equation (1) assigns probabilities to ranks based on the parameter λ ∈ (0, 1). In the limit λ → 0, rank 0 is always selected, and in the limit λ → 1, the largest rank, which contains the worst individuals, is always selected. Once a rank is selected, all individuals sharing that rank have equal probability of being selected for

mutation. Note that setting λ = 1/2 equalizes the probability of mutation selection across all individuals in the population.

\[ P_{\mathrm{sel}}(r) \propto \left(\frac{\lambda}{1-\lambda}\right)^{r}, \quad \lambda \neq 0, 1 \qquad (1) \]
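As an illustration of Equation (1), the sketch below (our own code; only the functional form comes from the paper) normalizes the rank weights into probabilities and samples a rank:

```python
import random

def rank_selection_probabilities(num_ranks: int, lam: float) -> list[float]:
    """P_sel(r) proportional to (lam / (1 - lam))**r for ranks r = 0..num_ranks-1."""
    assert 0.0 < lam < 1.0
    weights = [(lam / (1.0 - lam)) ** r for r in range(num_ranks)]
    total = sum(weights)
    return [w / total for w in weights]

def select_rank(num_ranks: int, lam: float) -> int:
    probs = rank_selection_probabilities(num_ranks, lam)
    return random.choices(range(num_ranks), weights=probs, k=1)[0]

# lam < 0.5 favors rank 0 (the fittest); lam = 0.5 makes all ranks equally likely;
# lam > 0.5 favors the largest rank, i.e. the worst individuals.
print(rank_selection_probabilities(4, 0.5))   # [0.25, 0.25, 0.25, 0.25]
```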

Once an individual has been selected, the chosen mutation method is carried out. After the mutation is performed, the fitness of the model is evaluated. If the error decreases, the improved mutated individual is added to the subpopulation. If the fitness of the individual degrades, the mutation is accepted with the probability given by Equation (2), where F is the fitness (accuracy of the model on the training set).

\[ P_{\mathrm{accept}} = e^{\Delta F / (KT)} \qquad (2) \]

K is a greediness factor; when K = 0, P_accept is zero and no degrading mutations, in which the fitness of the individual regresses, are accepted. Conversely, when K approaches infinity, all mutations are accepted. As the temperature, T, cools, the probability of accepting disadvantageous mutations decreases. On the other hand, when the temperature is still high in early generations, randomness dominates the mutation scheme; the early stages of evolution are marked by exploration of the global search space. With the cooling schedule in place, mutations grow increasingly deliberate and serve to expedite convergence of the subpopulation. The later stages of evolution are characterized by exploitation.

The utilized cooling schedule is parameterized to allow adaptation to the problem at hand and to offer direct control over the greediness of mutation. With N_train being the number of training patterns, the relevant parameters are:

• the probability P_min,0 of accepting a degrading transition of size ΔE_min = 1/N_train at the beginning of evolution,
• the probability P_min,f of accepting a degrading transition of size ΔE_min = 1/N_train at the end of evolution,
• the number f of generations after which the final temperature is reached.

The following equations dictate T_0, T_f, and α, which control the cooling schedule. T_0 is the temperature of the beginning generations, T_f is the temperature of the final generations, and α is the cooling rate.

\[ T_0 = -\frac{1}{N_{\mathrm{train}} \ln(P_{\mathrm{min},0})}, \quad T_f = -\frac{1}{N_{\mathrm{train}} \ln(P_{\mathrm{min},f})}, \quad \alpha = \left(\frac{T_f}{T_0}\right)^{\frac{1}{f-1}} \qquad (3) \]

If the mutation is still not accepted, the mutation is performed again on an untried gene. The same mutation on the same candidate is repeated until acceptance or until all possibilities have been exhausted.
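The sketch below ties Equations (2) and (3) together. It is our own illustration: the function names, the example value N_train = 300 and the cooling loop are assumptions, while the formulas themselves come from the text. Note that with K = 1 and ΔF = −1/N_train, the acceptance probability at T_0 evaluates exactly to P_min,0, which is how the schedule is calibrated.

```python
import math
import random

def cooling_schedule(n_train: int, p_min_0: float, p_min_f: float, f: int):
    """Return (T0, Tf, alpha) per Equation (3)."""
    t0 = -1.0 / (n_train * math.log(p_min_0))
    tf = -1.0 / (n_train * math.log(p_min_f))
    alpha = (tf / t0) ** (1.0 / (f - 1))
    return t0, tf, alpha

def accept_mutation(delta_fitness: float, k: float, temperature: float) -> bool:
    """Equation (2): always accept improvements; accept degradations stochastically."""
    if delta_fitness >= 0.0:
        return True
    if k == 0.0:
        return False                        # fully greedy: no degrading mutations
    return random.random() < math.exp(delta_fitness / (k * temperature))

# Example with the parameterization reported in Section IV-C (N_train = 300 assumed,
# f = 33, i.e. one third of the 100-generation budget):
t0, tf, alpha = cooling_schedule(n_train=300, p_min_0=0.05, p_min_f=0.005, f=33)
temperature = t0
for generation in range(100):
    # ... mutate, evaluate, call accept_mutation(...) here ...
    temperature = max(temperature * alpha, tf)   # geometric cooling down to Tf
```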

IV. EXPERIMENTS AND RESULTS

A. Performance Measure and Comparison Metric

In the framework of classification model training that we are considering, in order to maximize the chances of eventually identifying a useful, parsimonious classifier, it is important that the training process produces an ensemble of classifiers that is diverse in terms of structural complexity and rich in non-dominated alternatives. Therefore, good quality PFs of classifier characteristics should reflect the best possible trade-off between complexity and training error. Thus, the set of discovered solutions should provide the lowest misclassification error for a given network size. Furthermore, a good quality PF should include at least one alternative for each level of complexity considered, i.e., the PF should be well-sampled and dense. A high-density PF maximizes our ability to select the best-generalizing individual due to the wider spectrum of alternative models. MOME-ART's design goal is centered on yielding a collection of models with exactly these aforementioned characteristics.

1) Modified Hyper-area: There are several existing measures that facilitate the evaluation of PFs. A relatively simple and straightforward metric is the hyper-area method proposed by Zitzler and Thiele [19]. However, it suffers from a significant drawback: when a new, non-dominated individual is added to the current PF, the PF's density is increased and, thus, improves; yet, this could result in an increase in hyper-area. To remedy this counter-intuitive shortcoming, a modified hyper-area metric was devised. In its simplest terms, the modified hyper-area is an overestimation of the area under the PF, while in its original definition it constitutes an underestimation. Equation (4) presents the revised hyper-area measure in mathematical terms, where h is the PF's hyper-area value, N is the number of models along the PF, ε is the model error, and s is the model size, with s_max being the maximum size. Note that we use ε_0 ≡ 1 and s_{N+1} ≡ s_max. From this point on, when we refer to hyper-area, we will be implying this version.

\[ h = \frac{1}{s_{\max}} \sum_{n=0}^{N} \varepsilon_n (s_{n+1} - s_n) \qquad (4) \]
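A small sketch of how Equation (4) and the PF density defined later in this section might be computed (our own helper names; the PF is assumed to be a list of (size, error) points sorted by increasing size, and we additionally assume s_0 = 0, which the text does not state explicitly):

```python
def modified_hyperarea(pf, s_max):
    """Equation (4): overestimate of the area under the PF.

    pf: list of (size, error) pairs sorted by increasing size, with errors in [0, 1].
    Conventions from the text: eps_0 = 1 and s_{N+1} = s_max; s_0 = 0 is our assumption.
    """
    sizes = [0] + [s for s, _ in pf] + [s_max]
    errors = [1.0] + [e for _, e in pf]
    area = sum(errors[n] * (sizes[n + 1] - sizes[n]) for n in range(len(errors)))
    return area / s_max

def pf_density(pf, s_max):
    """Fraction of the considered complexity levels (1..s_max) represented in the PF."""
    return len({s for s, _ in pf}) / s_max

# Example: three non-dominated ssFAM networks on a problem with s_max = 10.
pf = [(2, 0.30), (4, 0.18), (7, 0.12)]
print(modified_hyperarea(pf, s_max=10), pf_density(pf, s_max=10))   # 0.35 0.3
```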

2) Hyper-area Reduction Rate and Predicted Asymptotic Hyper-area: While the hyper-area directly measures the quality of the PF as estimated based on a training set, when comparing different multi-objective algorithms it is important to measure the amount of computational effort that needs to be invested by each method to reach a certain level of PF quality. Ideally, such a collection of good quality solutions should be attained with minimal effort. When measuring PF quality via hyper-area values and computational effort via execution time, training approaches that reduce the PF hyper-area the fastest and, eventually, produce a PF featuring the lowest asymptotic hyper-area value are, obviously, highly desirable.

For the purpose of comparing MOME-ART and MO-GART, in our experiments we collected hyper-area values versus time rather than versus number of generations executed, in order to facilitate a fair comparison between the two training procedures, as evolving MO-GART's populations is typically more computationally demanding per generation than evolving MOME-ART populations. Additionally, these hyper-area-versus-time curves typically start at a high value and decay monotonically to a terminal/asymptotic level, after which improvement, within given time limits, is quite unlikely. Therefore, in most cases they resemble noise-corrupted samples of a decaying exponential of the form

\[ h(t) = h(\infty) + [h(0) - h(\infty)]\, e^{-\lambda t} \qquad (5) \]

where λ ≥ 0 is the decay (convergence) rate, h(0) is the hyper-area's value at t = 0 and, finally, h(∞) is the asymptotic hyper-area value. Highly efficient multi-objective training methods feature high values of λ and very small h(∞) values. Therefore, each MO-GART and MOME-ART evolution run can be summarized by a pair of values (h(∞), λ) or, equivalently, (h(∞), 1/λ), which facilitates a straightforward visual comparison between runs. As a matter of fact, in Section IV-D we show such plots, when comparing MOME-ART to MO-GART. We estimate the parameters in Equation (5) using a customized, hybrid Newton-Raphson method that fits a sum-of-squares minimizing curve through the recorded data.

3) Pareto Front Density: A given PF's density d can be defined as the number of model complexity levels that are represented in the PF over the total number of model complexity levels considered, which is identical to the maximum model size assumed. Note that the most natural measure of model complexity of ART-based classifiers is the number of categories employed in the network. Among our experimental results, we present density versus hyper-area plots for the obtained PFs. Ideally, a training procedure should produce high-density PFs with very low hyper-area values.

B. Datasets

Six different datasets obtained from the UCI machine learning repository [20] were used for our experiments: Iris, Glass, Pima, Wine, Abalone (treated as a 3-class problem by grouping classes 1-8, 9-10, and 11 and above; only one third of the patterns were used), and Statlog Vehicle Silhouettes (contributed by the Turing Institute, Glasgow, Scotland). Of them, Iris, Glass, and Pima were also used in [7], while Abalone was used in [8]. None of them have missing attributes, and, together, they present a typical variety of classification problems.

C. Experimental Setup

Each dataset was separated into four subsets, which we specifically named for the sake of semantic distinction: training set A, training set B, cross-validation, and test. Each

D. Experiment Results Figure 4 shows h(∞) versus 1/λ plots, where each point represents a single MO-GART or MOME-ART evolution run for the Pima and Glass datasets. The closer the run characteristics to the origin, the faster the run and the better the PF it produced. The results clearly show the rapid convergence of MO-GART, which can be attributed mainly to

[Figure 4: two scatter plots of predicted asymptotic hyper-area h(∞) versus inverse convergence rate 1/λ, one for the Pima dataset and one for the Glass dataset, with separate markers for MO-GART and MOME-ART runs.]

Fig. 4. In these plots, the pair of inverse convergence rate 1/λ and predicted asymptotic hyper-area value h(∞) summarizes the performance of an evolution run for MO-GART and MOME-ART. The results show that while MOME-ART does not have the rapid decay rate of MO-GART, it is able to converge to a PF with very low hyper-area.

The results clearly show the rapid convergence of MO-GART, which can be attributed mainly to the fact that its initial population consists of already-trained classifiers, as well as to its implicit elitism scheme through its archive. This allows it to evolve a PF consisting of high-performing individuals and reach a low value of hyper-area relatively quickly. On the other hand, MOME-ART, being initialized with networks made of completely arbitrary categories, has a much slower convergence rate. Despite this handicap, MOME-ART demonstrated that it can produce PFs with lower asymptotic hyper-area. Let us note here that these observations were very similar across all tested datasets.

Next, Figure 5 depicts evolutionary runs from both approaches as points on final PF hyper-area versus the corresponding density plots. As a reminder, a PF's density is defined as the fraction of considered/possible model complexity levels that are represented in the PF. An ideal training method would produce PFs with a hyper-area of 0 and a density of 1.0, as this would maximize the likelihood that the PF includes classifiers with, potentially, remarkably good generalization properties.

[Figure 5: scatter plots of final PF hyper-area h versus PF density for the Pima dataset and the Glass dataset, with separate markers for MO-GART and MOME-ART runs.]

[Figure 6: plots of test error versus model complexity (number of categories) for the champion classifiers on the Iris dataset and the reduced Abalone dataset, with separate markers for MO-GART and MOME-ART.]

Fig. 5. Again, each point in the plots represents a single evolution run for each of the two methods, when using the Pima and Glass datasets. This time, the final hyper-area and the PF density are correlated and the plots clearly indicate that MOME-ART can produce denser and better PFs than MO-GART.

Fig. 6. This figure also depicts similar plots. Only this time, the plots relate generalization performance to structural complexity of the champion networks produced by MO-GART and MOME-ART. Both plots illustrate the capabilities of MOME-ART to evolve ssFAM classifiers that are both very accurate and small in size and, thus, trustworthy.

Owing to its design, we point out the impressive quality of PFs produced by MOME-ART, which, for these two datasets, feature very low hyper-area and densities often ranging from 20% to 50% of the model complexity range. Again, the results shown are typical for these metrics when comparing the two training procedures on the remaining datasets.

Finally, the generalization abilities for given model complexities of the champion networks are provided in Figure 6. From the final PFs obtained, we select a unique champion individual based on its cross-validation set performance. Then, the generalization performance of all these champions is assessed via the test set. A utopian model would exhibit zero error (perfect generalization) at zero complexity (a perfectly simple model). After reviewing the results from all datasets considered, we came to the realization that the champion networks produced by both schemes were more or less equally good in terms of generalization accuracy. As a matter of fact, the overall champions (the champion of champions produced by each method) typically exhibited differences of

±0.03, which, given the sizes of the test sets utilized, are most likely not statistically significant. This observation is nothing short of impressive, considering the type of initialization MOME-ART undergoes. However, in 5 out of 6 datasets the best performing MOME-ART champion featured fewer categories than its MO-GART equivalent-performing counterpart. We specifically picked the Iris and Abalone datasets to clearly illustrate this interesting point in Figure 6. For Iris, the best champion produced by MOME-ART features a test set error of 0.0 and utilizes only 3 categories, which, by the way, matches the number of classes present in the dataset. In other words, MOME-ART produced a ssFAM network that featured a category for each class and is able to accurately separate the three subspecies of Iris flowers. A similar situation arises with the Abalone dataset. MOME-ART managed to evolve a parsimonious network with less than 10% test error with only 4 categories, while the number of classes is 3.

V. CONCLUSIONS

In this paper we presented MOME-ART, a multi-objective, memetic framework intended to evolve populations of ART-based classifiers. In brief, it evolves interacting subpopulations to produce dense Pareto fronts of models; it attempts to minimize training error while simultaneously decreasing model complexity; and, finally, it employs simulated annealing to further refine its solutions. The experiments of the previous section showcased clear advantages in terms of Pareto front density and hyper-area quality over an existing, state-of-the-art, powerful evolutionary framework, namely MO-GART.

While our presentation here focused on evolving ssFAM networks, owing to the genetic operators of MOME-ART being agnostic to the way categories are represented and to how networks utilize categories in their decision-making process, the framework can be readily used to train populations of other ART architectures. Even further, based on the same rationale, it is foreseeable that MOME-ART can be used to evolve populations of other classification models that employ exemplars, a generalized notion of categories. Examples that come to mind are k-nearest prototype models, kernel density based models, Radial Basis Function neural networks and others.

ACKNOWLEDGMENT

The authors acknowledge partial support from the following NSF grants: No. 0647018, No. 0647120, No. 0717680 and No. 0717674. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation. Finally, the authors are grateful to the three anonymous reviewers that provided constructive feedback about this manuscript.

REFERENCES

[1] M. Rizki, M. Zmuda, and L. Tamburino, "Evolving pattern recognition systems," Evolutionary Computation, IEEE Transactions on, vol. 6, no. 6, pp. 594–609, 2002.
[2] D. Muni, N. Pal, and J. Das, "A novel approach to design classifiers using genetic programming," Evolutionary Computation, IEEE Transactions on, vol. 8, no. 2, pp. 183–196, 2004.
[3] D. Parrott, X. Li, and V. Ciesielski, "Multi-objective techniques in genetic programming for evolving classifiers," in Evolutionary Computation, 2005. The 2005 IEEE Congress on, vol. 2, Sept. 2005, pp. 1141–1148.
[4] M. Baghshah, S. Shouraki, R. Halavati, and C. Lucas, "Evolving fuzzy classifiers using a symbiotic approach," in Evolutionary Computation, 2007. CEC 2007. IEEE Congress on, Sept. 2007, pp. 1601–1607.

[5] J.-K. Kim and B.-T. Zhang, "Evolving hypernetworks for pattern classification," in Evolutionary Computation, 2007. CEC 2007. IEEE Congress on, Sept. 2007, pp. 1856–1862.
[6] A. Kaylani, "An adaptive multiobjective evolutionary approach to optimize ARTMAP neural networks," Ph.D. dissertation, University of Central Florida, Orlando, Florida, USA, 2008. [Online]. Available: http://purl.fcla.edu/fcla/etd/CFE0002212
[7] A. Kaylani, M. Georgiopoulos, M. Mollaghasemi, and G. Anagnostopoulos, "MO-GART: Multiobjective genetic ART architectures," in Evolutionary Computation, 2008. CEC 2008. (IEEE World Congress on Computational Intelligence). IEEE Congress on, June 2008, pp. 1425–1432.
[8] A. Kaylani, M. Georgiopoulos, M. Mollaghasemi, G. C. Anagnostopoulos, C. Sentelle, and M. Zhong, "An adaptive multiobjective approach to evolving ART architectures," IEEE Transactions on Neural Networks, vol. 21, no. 4, pp. 529–550, 2010.
[9] G. Carpenter, S. Grossberg, N. Markuzon, J. Reynolds, and D. Rosen, "Fuzzy ARTMAP: A neural network architecture for incremental supervised learning of analog multidimensional maps," Neural Networks, IEEE Transactions on, vol. 3, no. 5, pp. 698–713, 1992.
[10] G. Anagnostopoulos and M. Georgiopoulos, "Hypersphere ART and ARTMAP for unsupervised and supervised, incremental learning," in Neural Networks, 2000. IJCNN 2000, Proceedings of the IEEE-INNS-ENNS International Joint Conference on, vol. 6, 2000, pp. 59–64.
[11] G. Anagnostopoulos and M. Georgiopoulos, "Ellipsoid ART and ARTMAP for incremental clustering and classification," in Neural Networks, 2001. Proceedings. IJCNN '01. International Joint Conference on, vol. 2, 2001, pp. 1221–1226.
[12] S. Verzi, G. Heileman, M. Georgiopoulos, and G. Anagnostopoulos, "Off-line structural risk minimization and BARTMAP-S," in Neural Networks, 2002. IJCNN '02. Proceedings of the 2002 International Joint Conference on, vol. 3, 2002, pp. 2533–2538.
[13] G. Anagnostopoulos, M. Bharadwaj, M. Georgiopoulos, S. Verzi, and G. Heileman, "Exemplar-based pattern recognition via semi-supervised learning," in Neural Networks, 2003. Proceedings of the International Joint Conference on, vol. 4, 2003, pp. 2782–2787.
[14] G. Anagnostopoulos and M. Georgiopoulos, "Category regions as new geometrical concepts in Fuzzy ART and Fuzzy ARTMAP," Neural Networks, vol. 15, no. 10, pp. 1205–1221, 2002.
[15] F. Edgeworth, Mathematical Psychics. London, England: P. Keagan, 1896.
[16] V. Pareto, Cours d'Économie Politique. Lausanne, Switzerland: F. Rougee, 1896, vol. I and II.
[17] G. Anagnostopoulos and G. Rabadi, "A simulated annealing algorithm for the unrelated parallel machine scheduling problem," in Automation Congress, 2002 Proceedings of the 5th Biannual World, vol. 14, 2002, pp. 115–120.
[18] R. Fisher, "On the interpretation of χ2 from contingency tables, and the calculation of p," Journal of the Royal Statistical Society, vol. 85, no. 1, pp. 87–94, 1922.
[19] E. Zitzler and L. Thiele, "Multiobjective optimization using evolutionary algorithms - a comparative case study," in Parallel Problem Solving from Nature V, ser. Lecture Notes in Computer Science, A. Eiben, Ed., vol. 1498. Amsterdam, The Netherlands: Springer-Verlag, September 1998, pp. 292–301.
[20] A. Asuncion and D. Newman, "UCI machine learning repository," 2007. [Online]. Available: http://archive.ics.uci.edu/ml/
