Genetic Programming and Evolvable Machines, 5, 157–179, 2004. © 2004 Kluwer Academic Publishers. Manufactured in The Netherlands.

An Immune-Evolutionary Algorithm for Multiple Rearrangements of Gene Expression Data

JANAÍNA S. DE SOUSA, LALINKA DE C. T. GOMES, GEORGE B. BEZERRA, LEANDRO N. DE CASTRO, FERNANDO J. VON ZUBEN
Department of Computer Engineering and Industrial Automation, State University of Campinas (Unicamp), CP. 6101, Campinas, SP, 13083-970, Brazil

Submitted February 14, 2003; Revised November 30, 2003

Abstract. Microarray technologies are employed to simultaneously measure expression levels of thousands of genes. Data obtained from such experiments allow inference of individual gene functions, help to identify genes from specific tissues, to analyze the behavior of gene expression levels under various environmental conditions and under different cell cycle stages, and to identify inappropriately transcribed genes and several genetic diseases, among many other applications. As thousands of genes may be involved in a microarray experiment, computational tools for organizing and providing possible visualizations of the genes and their relationships are crucial to the understanding and analysis of the data. This work proposes an algorithm based on artificial immune systems for organizing gene expression data in order to simultaneously reveal multiple features in large amounts of data. A distinctive property of the proposed algorithm is the ability to provide a diversified set of high-quality rearrangements of the genes, opening up the possibility of identifying various co-regulated genes from representative graphical configurations of the expression levels. This is a very useful approach for biologists, because several co-regulated genes may exist under different conditions.

Keywords: gene expression, microarray, artificial immune systems, clustering, evolutionary algorithms

1. Introduction

Although biological data have been produced on a large scale, relatively little is known about genomic functionality. Five decades after the discovery of the DNA double-helix structure, the human genome and the genomes of several organisms, including Escherichia coli, Saccharomyces cerevisiae, Caenorhabditis elegans, and Drosophila melanogaster, have been completely sequenced. The analysis of expression levels of genes under different conditions can provide insights into the functional genome. The expression levels of genes are closely related to the production of proteins and are therefore a determinant factor in the control of cellular processes. The application field is vast: measuring gene expression levels at large scales may help scientists to identify the functional role of various genes, to discover in which processes they participate, and to reveal how gene expression levels are influenced by diseases or drug treatments, with applications in drug discovery, toxicological research, diagnosis, and the study of cellular differentiation, among many others. For these reasons, gene expression analysis is one of the most studied subjects in the field of bioinformatics [31].


DE SOUSA ET AL.

In the last few years, microarray technology [40] has attracted a lot of attention from researchers. Microarray experiments enable the monitoring of thousands of gene expression levels at the same time. Thus, knowledge extraction and the analysis of gene expression information require the processing of large amounts of data, which makes simple visual analysis of the raw data impracticable. Therefore, computational tools are necessary to organize the information in such a way that scientists can effectively analyze and interpret gene expression information in an intuitive manner. Once the raw data is processed, it must be displayed so that useful information may be extracted simply by visual inspection, generally by means of a graphical representation [21].

A broadly used method involves the rearrangement of gene expression data; that is, the reordering of genes according to their functionality. There are basically two approaches in the literature to perform this task: clustering methods [4, 21] and ordering methods [24, 44]. Clustering methods aim at grouping genes with similar profiles along the microarray experiments, while ordering methods search for the best linear order that puts together genes presenting similar profiles. In this way, the problem to be solved may be described as defining the “best” sequential order of genes in such a way that genes with similar expression profiles will be relatively close to each other. In this context, several NP-hard combinatorial optimization problems [22] will certainly be faced.

Computational intelligence techniques have been playing an important role in combinatorial optimization. The solution of mixed optimization problems often involves a variety of characteristics such as non-linearity, non-differentiability, discrete search spaces and explosion of feasible solutions. In terms of computational complexity, a significant set of such problems is classified as NP-hard.
Several approaches capable of providing high-quality solutions to NP-hard combinatorial problems are available in the literature [25, 34]. In this work, a method that explores concepts of artificial immune systems [12, 16, 17] to solve the gene ordering problem is proposed. The model presented in this paper, called copt-aiNet (see Section 5.3), is based on the approach presented in [14] and on the formalism proposed in [17]. The algorithm has some common features with other population-based search techniques, such as evolutionary algorithms [2], allowing it to be classified as an immune-evolutionary algorithm (see Section 5.4). The copt-aiNet adaptation procedure presents three important features inspired by the biological immune system:
• a dynamically adjustable population size;
• the ability to simultaneously provide multiple high-quality solutions;
• a compromise between exploitation and exploration of the search space by means of a bio-inspired mechanism of diversity maintenance (see Section 5.4).

In the context of gene expression data analysis, the maintenance of diversity plays a fundamental role. Since there are several proteins with different functions depending on the groups of genes taken into account under different conditions, many genes may be simultaneously expressed in several groups. The methods typically used cannot appropriately detect these cases, and relationships among conditionally co-regulated genes often go unnoticed [23]. The ability of copt-aiNet to simultaneously maintain several high-quality rearrangements of the gene expression data allows the identification of important relationships among genes. Once multiple rearrangements are obtained, novel and possibly unexpected insights may be revealed to support a decision to be made by an expert.

AN IMMUNE-EVOLUTIONARY ALGORITHM


This paper is organized as follows. Section 2 presents the basic concepts associated with microarrays and gene expression data. In Section 3, we provide a brief literature review of the methods currently applied in the analysis of gene expression data. In Section 4, the ordering problem to be addressed is properly formalized. Section 5 discusses the main limitations of clustering procedures. Section 6 presents the basic concepts related to artificial immune systems and details the proposed algorithm. In Section 7 the computational results are presented, and Section 8 is dedicated to concluding remarks and future perspectives.

2. Microarrays and gene expression

The objective of microarray experiments is to compare gene expression levels in two samples of genetic material. A microarray experiment basically involves the following elements:
• the array, which is a two-dimensional glass slide containing hundreds or thousands of spots. Each spot is typically 200 microns or less in size;
• a set of target samples, which are immobilized on the spots of the array. The choice of DNAs to be used in the spots on a microarray determines which genes can be detected;
• two populations of labeled samples, consisting of single strands of cDNA (complementary DNA, synthesized from messenger RNA by means of reverse transcription) of a known sample and molecules corresponding to the cDNA to be analyzed; and
• a detector to measure the expression levels of the labeled samples.

At first, a known sequence of DNA called “target” is immobilized in the array. The genes forming the target DNA determine the genes whose expression level will be analyzed in the experiment. The next step to perform a microarray experiment is to define the sample populations composed of single strands of cDNA to be hybridized in the array. Typically, an experiment involves two samples: a control sample containing known cDNA and a sample to be studied. The control and study samples might be composed of, for example, healthy cells and cells with genetic diseases, or cells in different stages of the cell cycle. Next, the single strands of known cDNA and of the cDNA to be analyzed are hybridized in the spots of a glass slide by means of complementary base-pairing rules. Since genes are formed by a given alphabet, composed of Adenine (A), Thymine (T), Cytosine (C), and Guanine (G), with the pairing rules A-T and C-G, probes can be constructed in such a way that they are bound to the genetic material of the labeled samples of cDNA molecules immobilized on the spots.
Generally, cDNA molecules are labeled using fluorescent green and red dyes (Cy3 and Cy5) for the control sample and for the sample to be analyzed, respectively. The fluorescent molecules are detectable when stimulated by laser. If a sample contains a cDNA whose sequence is complementary to the DNA on a given spot, the cDNA strand will hybridize to the spot. Each spot contains enough DNA so that both cDNA samples can freely hybridize to it. The product of a microarray experiment is an image composed of several colored spots, each one corresponding to a gene or to an ORF (open reading frame) [21]. The color of each spot is given by the amount of bound cDNA. Spots that have mRNAs present at a higher level in one or the other population are predominantly red or green: red dots correspond to


over-expressed genes in the sample under study; green dots denote under-expressed genes; yellow dots correspond to equally-expressed genes for both samples; and a black color indicates spots where neither of the samples has hybridized. Once the cDNA probes have been hybridized, the array should be scanned to determine the degree of hybridization of each probe. Molecules that have not hybridized to the genetic material are washed off and the array is stimulated by laser, emitting detectable light. The intensity of the light emitted by each spot reflects the number of probes that have bound to the spot.

The visualization of gene expression data is generally made by means of a graphical representation, such as the representation proposed in [21]. The graphical representation consists of an expression matrix M, where rows denote genes and columns denote experiments. The colors and intensities obtained in the array are mapped to a real-valued gene expression matrix M, where $M_{ij}$ is the $\log_2(\mathrm{Cy5}/\mathrm{Cy3})$ ratio for gene i and experiment j. Therefore, each row of M contains the relative expression of a specific gene across experiments, and each column corresponds to one experiment. Under-expressed genes will correspond to negative values, while over-expressed genes will be represented by positive values. Equally expressed genes will correspond to the value 0. Figure 1 summarizes the main steps to perform a microarray experiment.
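As a small illustration of the mapping from dye intensities to the expression matrix M, the following Python sketch computes the log2(Cy5/Cy3) ratio entry by entry. The function name `log_ratio_matrix` and the intensity values are hypothetical and chosen only for this example; real pipelines typically also apply background correction and normalization, which are not shown.

```python
import math

def log_ratio_matrix(cy5, cy3):
    """Return M with M[i][j] = log2(cy5[i][j] / cy3[i][j])."""
    return [[math.log2(red / green) for red, green in zip(row5, row3)]
            for row5, row3 in zip(cy5, cy3)]

# Hypothetical intensities: 3 genes (rows) x 2 experiments (columns).
cy5 = [[400.0, 100.0], [50.0, 200.0], [80.0, 80.0]]   # sample under study (red)
cy3 = [[100.0, 100.0], [200.0, 50.0], [80.0, 80.0]]   # control sample (green)
M = log_ratio_matrix(cy5, cy3)

for gene, row in enumerate(M):
    print(gene, row)
```

Positive entries mark over-expression in the sample under study, negative entries mark under-expression, and entries equal to 0 mark equal expression, as described above.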

3. Organizing gene expression data: Clustering versus ordering

Clustering is the process of organizing genes into groups in such a way that genes belonging to the same group are similar to each other according to some criterion. In general, the choice of the most suitable clustering technique to be used in a given data set is a very difficult task. Most techniques must have the number of clusters defined a priori, and often require the adjustment of several parameters. Another aspect is that the data may indicate clusters with different shapes and sizes. Generally the clustering methods are rooted in the existence of well-formed clusters, and assume a predefined spatial configuration for the clusters. Different clustering procedures may yield quite different solutions, and the best choice is generally highly dependent on particular features of the data set being analyzed.

Direct-ordering methods find the best linear order for gene expression data. In this sense, the ordering problem becomes equivalent to a traveling salesman problem (TSP) [30]. The objective is to find a linear order of the genes that minimizes the sum of the distances (usually measured using Euclidean distance or the correlation coefficient [21]) between adjacent genes. Once the problem of rearranging gene expression data is modeled as a combinatorial optimization problem, the goal is to define a sequence of genes, among an enormous number of candidates, in such a way that genes presenting similar expression profiles in the N-dimensional space (where N is the number of experiments) are placed as close to each other as possible [24]. Figure 2 illustrates the ordering of 102 artificially generated points of dimension 2, plotted in the 2-D plane, and their corresponding graphical representation before and after the application of an ordering algorithm. Note that the graphical representation in Figure 2(b) stresses the existence of genes with similar expression profiles, having similar patterns placed close to each other.
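The objective of the ordering problem can be made concrete with a short Python sketch that evaluates the cost of a given linear order as the sum of Euclidean distances between adjacent genes. The names `euclidean` and `ordering_cost` and the four two-dimensional profiles are illustrative assumptions, not taken from the paper.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def ordering_cost(profiles, order):
    """Sum of distances between genes that are adjacent in `order` (an open path)."""
    return sum(euclidean(profiles[order[i]], profiles[order[i + 1]])
               for i in range(len(order) - 1))

# Hypothetical data: 4 genes measured in N = 2 experiments.
profiles = [(0.0, 0.0), (3.0, 4.0), (0.0, 1.0), (3.0, 5.0)]
initial = [0, 1, 2, 3]      # order in which the genes were given
reordered = [0, 2, 1, 3]    # similar profiles placed next to each other

print(ordering_cost(profiles, initial), ordering_cost(profiles, reordered))
```

An ordering algorithm searches the space of permutations for an order with low cost; the sketch only scores candidate orders and does not perform that search.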


Figure 1. Main steps of a microarray experiment: immobilized DNA containing genes of interest; control sample labeled using green dye (Cy3) and sample in analysis labeled using red dye (Cy5); hybridization; laser stimulation; spots with detectable color and intensity; mapping of colors and intensities to a real-valued matrix.

Due to the difficulty in selecting the best clustering method, especially when there is little information regarding the data being analyzed, methods based on direct ordering can potentially provide valuable information to analyze expression profiles and thus constitute interesting alternatives to be used in place of, or in conjunction with, clustering methods.

The most popular method used for clustering gene expression data is hierarchical clustering. The outcome of a hierarchical clustering procedure applied to the gene expression problem is a tree, with the gene expression profiles as leaves, where each leaf corresponds to a row of matrix M, as defined in Section 2. The sequence of leaves may be taken as a rearrangement of the rows, with matrix M providing the original arrangement. However, a multitude of rearrangements is possible for the leaves in a given tree without violating the hierarchical relationships already determined, that is, without changing the tree. Post-processing techniques are then necessary to perform a permutation of branches in the tree aiming at optimizing a cost function, such as the approach proposed in [5] where a polynomial-time algorithm for rearranging the leaves in the tree is used.

Figure 3 shows the ordering of 1,000 points generated using a uniform distribution in the 2-D plane and the corresponding graphical representation after the application of a TSP-like ordering algorithm, and the same set of data ordered by means of a hierarchical clustering algorithm (agglomerative clustering). Note that the configuration of the genes does not characterize well-defined clusters. It is also interesting to note that the resulting graphical representation of the ordering algorithm presents a more uniform aspect than the one resulting from the hierarchical clustering algorithm used. Figure 4 shows the sequence of genes plotted in the plane and the corresponding graphical representation for a set of data containing 100 genes and 2 experiments with four well-defined clusters. Figure 4(a) presents the result for the linear ordering algorithm and Figure 4(b) depicts the result for the hierarchical clustering algorithm.

Figure 2. Example of the application of an ordering algorithm to an artificially generated set. (a) Initial ordering for 102 genes and 2 experiments in the 2-D plane and corresponding graphical representation; (b) sequence of genes in the plane and respective graphical representation after the application of an ordering algorithm.


Figure 3. Ordering versus clustering 1,000 genes generated using a uniform distribution. (a) solution (cost 24) provided by an ordering algorithm and the corresponding graphical representation; (b) sequence of genes (cost 40) in the plane and respective graphical representation after the application of a hierarchical clustering algorithm.

4. Literature review

Several works can be found in the literature concerning the two basic approaches for dealing with the problem of organizing gene expression data: clustering procedures [21] and direct-ordering methods [24]. Essentially, both strategies result in combinatorial optimization problems [22] aiming at rearranging the genes so that genes with similar expression profiles are placed close to each other. Gene expression data analysis tools can also be classified as classical or standard methods [27] and as new metaheuristics originating from the field of computational intelligence [6, 10, 25, 34]. Gene expression studies have so far focused on clustering methods to separate genes by similarities in expression profiles. A number of clustering algorithms have been applied to gene expression data. The most commonly used approaches are hierarchical clustering [21] and cluster partitioning (generally implemented using self-organizing maps [42], and k-means clustering [23]).


Figure 4. Ordering versus clustering points belonging to four well-defined clusters. (a) solution (cost 5.37) provided by an ordering algorithm for 100 genes and 2 experiments distributed in 4 clusters plotted in the plane and the corresponding graphical representation; (b) sequence of genes (cost 8.25) in the plane and respective graphical representation after the application of a hierarchical clustering algorithm.

Hierarchical clustering methods proceed by grouping genes in a bottom-up fashion. Genes with most similar profiles are grouped together first, and those with more different profiles are included iteratively until all genes are grouped in the same group. The product of a hierarchical clustering process is a hierarchical tree, whose leaves are distributed so that the hierarchical structure of the data is revealed. Self-organizing maps are a class of artificial neural networks [26] based on unsupervised and competitive learning, and constitute a method widely applied for data clustering. The neurons in the self-organizing map perform a dimensionality reduction, and tend to highlight similarities in the data set. The k-means algorithm starts with an arbitrary initial distribution of the gene expression data in clusters. The genes are then used to iteratively adjust the clusters in order to obtain a


better clustering solution, given by maximizing the across-cluster variance and minimizing the within-cluster variance. The k-means method does not assume any particular structure on the data. However, the number of clusters must be defined a priori and k-means may require an excessive amount of computational time when applied to large sets of data. It is also sensitive to the initial choice of prototypes, and is thus susceptible to getting stuck in local optima. Hierarchical clustering is a greedy method that is also prone to local minima. Several strategies were proposed to enhance the solutions obtained by hierarchical clustering procedures, such as reordering the leaves of the hierarchical tree generated [4, 5]. In general, clustering techniques are very effective when the dataset presents few data points. However, when considering high-dimensional data, the relationships among genes tend not to be unique.

Whether the rearrangement of gene expression data is modeled as a clustering problem or as a TSP-like (combinatorial optimization) problem, metaheuristics achieve very good results, presenting in general a better performance when compared with classical approaches. A class of metaheuristics that has been successfully applied to hard problems is based on evolutionary computation [2]. Evolutionary approaches have been yielding very satisfactory results when applied to complex combinatorial problems. One of the major advantages of using evolutionary algorithms lies in the fact that these methods work with a population of solutions, simultaneously exploring and exploiting the search space.

There are a few works dealing with the problem of rearranging gene expression data based on evolutionary algorithms. In [44], Tsai et al. proposed a genetic algorithm for finding a near-optimal solution for the gene ordering problem. In order to maintain the diversity of the population, they used the concept of “family competition”.
In their approach, each individual of the entire population is sequentially considered the “father”, and is combined with other individuals randomly chosen, generating λ offspring solutions. At the end of one cycle of the family competition, the µ individuals with highest fitness are selected from the (µ + λ) offspring generated. The most well-adapted individuals will compose the new population, and the process is repeated until a stopping criterion is reached. The method led to very satisfactory results in the problems evaluated, providing, in general, superior results to those obtained using a hierarchical and a self-organizing clustering procedure. In [33], Merz and Zell proposed a memetic algorithm [35] using k-means to perform a local search for clustering gene expression data. In this work, the clustering procedure is seen as a combinatorial problem in which the gene vectors must be assigned to clusters such that the sum of squared distances of the vectors to their cluster centroid is minimal. The performance of the memetic algorithm was compared with that of a multi-start k-means, and the results obtained demonstrated the superiority of the memetic algorithm. In [11], Cotta et al. also applied a memetic algorithm to the gene ordering problem. The main features of their approach were the use of a fitness function that captures global properties of the gene arrangement, the use of a tree representation of solutions, the definition of ad hoc operators to manipulate the representation, the use of two local search procedures, and the use of multiple populations in an island model migration scheme [45]. The instances used for experimentation were extracted from the literature and represent real biological systems. The results presented suggested that a memetic algorithm can improve the performance of classical agglomerative clustering algorithms.
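The clustering objective mentioned above for the memetic algorithm of [33], the sum of squared distances of the gene vectors to their cluster centroids, can be written down in a few lines. The Python sketch below only evaluates this objective for a given assignment; it is not the memetic algorithm itself, and the names `centroid` and `within_cluster_ss` as well as the data are hypothetical.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def within_cluster_ss(profiles, labels):
    """Sum over all genes of the squared distance to the centroid of their cluster."""
    clusters = {}
    for vec, lab in zip(profiles, labels):
        clusters.setdefault(lab, []).append(vec)
    total = 0.0
    for members in clusters.values():
        c = centroid(members)
        total += sum(sum((x - m) ** 2 for x, m in zip(v, c)) for v in members)
    return total

# Hypothetical data: two tight groups of genes far apart from each other.
profiles = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = [0, 0, 1, 1]   # assignment that follows the two groups
bad = [0, 1, 0, 1]    # assignment that mixes the groups

print(within_cluster_ss(profiles, good), within_cluster_ss(profiles, bad))
```

A search procedure, whether k-means or a metaheuristic, would look for the assignment that minimizes this quantity for a fixed number of clusters.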


Xiao et al. [46] proposed a hybrid evolutionary approach involving self-organizing maps (SOM) and particle swarm optimization (PSO) algorithms [29]. The synaptic weights of the self-organizing map are determined by the particle swarm algorithm. The weights are trained by a self-organizing algorithm in a first stage and are then optimized by the particle swarm algorithm to refine the clustering process. The authors also provide a comparison between the SOM-PSO approach and hierarchical clustering, and in most simulations performed, the SOM-PSO presented a superior performance.

5. An immune-evolutionary algorithm for ordering gene expression data

The last decades have witnessed the emergence of novel problem solving techniques inspired by nature [20]. In addition to the already well-known artificial neural networks and evolutionary algorithms, swarm systems [7, 29] and artificial immune systems [17, 43] have been receiving increasingly more attention from academics and industrial parties interested in solving complex problems. In contrast to neural networks and evolutionary algorithms, which were inspired by the mammalian nervous system and the evolution of species, respectively, swarm systems are inspired by the collective behavior of social organisms and artificial immune systems are motivated by the vertebrate immune system. The immune-evolutionary algorithm to be presented here as a tool for solving the gene ordering problem was designed by taking into account some theories of how the immune system works. One of these theories, which will be described later, is evolutionary in nature, and thus gave rise to an algorithm that can be viewed as a hybrid between an evolutionary and an immune algorithm. This section is divided into five parts. In Section 5.1 the necessary background on immunology is provided; Section 5.2 briefly introduces the field of artificial immune systems; Section 5.3 describes the proposed copt-aiNet algorithm; Section 5.4 outlines the diversity maintenance mechanism of the algorithm; and Section 5.5 places copt-aiNet in the context of evolutionary algorithms.

5.1. Background on immunology

The copt-aiNet adaptation procedure, used in this paper as a nonhierarchical clustering approach for gene ordering, employs three important concepts from immunology: (1) clonal selection and affinity maturation [9, 41], (2) immune network theory [28], and (3) gene recombination [37]. This section initially provides some background knowledge on the immune system, introducing the appropriate terminology, and then reviews these three main theories and processes. For a more comprehensive discussion, please refer to [17].

The immune system plays the important role of protecting our bodies from disease-causing agents, known as pathogens or antigens. To provide protection, the immune system makes use of several types of organs, cells, and molecules. Most immune cells have receptors on their surfaces, called antibodies, that allow them to recognize and bind with antigens. The recognition process depends on a shape matching between the antigen and the antibody: the better the matching, the better the recognition and the stronger the binding. The degree of recognition, or strength of binding, between an antigen and an antibody is termed affinity. This binding triggers an immune response by signaling other cells to destroy the antigen-antibody complex.


The clonal selection theory [9] states that when an antigen invades the organism, some immune cells with antibodies that recognize this antigen start proliferating. The higher the affinity between an antibody and an antigen, the more offspring cells, called clones, will be generated. During proliferation, the antibodies on these cells may undergo mutation with rates inversely proportional to their affinity [36]: the higher the affinity, the smaller the mutation rate, and vice-versa. The mutated clones with increased affinity with the antigen are selected, through a natural selection process, to become memory cells and promote fast and reliable future immune responses. Clonal selection and affinity maturation constitute the basis of the vaccination procedure: individuals are inoculated with dead or inactivated samples of antigens, and antibodies capable of successfully binding with these antigens are generated and kept as immune memories. Thus, clonal selection and affinity maturation have several important features from a computational perspective: affinity proportional selection and mutation, and maintenance of memory cells with long life spans.

The next important theory used in the development of copt-aiNet is the so-called immune network theory [28]. It proposes that immune cells are not only capable of recognizing antigens, but they are also capable of recognizing each other. This suggests an immune system that is responsive even in the absence of antigen; there is an intrinsic dynamic state within the immune system. Basically, when an immune cell recognizes an antigen or another immune cell, it is stimulated. By contrast, when an immune cell is recognized by another immune cell, it is suppressed. In the present paper, stimulation by an antigen results in the clonal selection and affinity maturation of the stimulated cell, and network suppression results in the elimination of some of the suppressed cells.
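From a computational perspective, the two affinity-dependent rules of clonal selection described above (more clones for higher affinity, smaller mutation rate for higher affinity) can be sketched as follows. This is only an illustration under stated assumptions: affinity is taken to be normalized to [0, 1], and the linear clone count and the exponential decay `exp(-rho * affinity)` are assumed functional forms with hypothetical parameters `max_clones` and `rho`, not formulas given in this section.

```python
import math

def clone_count(affinity, max_clones=10):
    """Number of clones grows with affinity (affinity assumed to lie in [0, 1])."""
    return max(1, round(max_clones * affinity))

def mutation_rate(affinity, rho=3.0):
    """Mutation rate shrinks as affinity grows; exp(-rho * affinity) is an assumed form."""
    return math.exp(-rho * affinity)

affinities = [0.1, 0.5, 0.9]
schedule = [(a, clone_count(a), mutation_rate(a)) for a in affinities]
for a, n_clones, rate in schedule:
    print(f"affinity={a:.1f} clones={n_clones} mutation_rate={rate:.3f}")
```

Any monotone pair of functions would express the same principle; the choice above is made only to keep the example short.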
The last immune principle used in this work is the generation of antibody molecules from gene libraries [37]. All blood cells, including the immune cells, are generated in a site known as bone marrow, a soft tissue inside the most elongated bones. Within the bone marrow, genes, from several gene libraries, are concatenated in order to produce a single cell receptor, in a process similar to the one illustrated in Figure 5.

5.2. Artificial immune systems

Artificial immune systems (AIS) constitute a new computational intelligence approach based on principles and models from immunology [17]. Similarly to most bio-inspired techniques and other metaheuristics [20, 25], its main objective is to solve complex problems whose (satisfactory) solutions cannot be obtained with the more traditional methods, such as linear, nonlinear, integer and dynamic programming.

Figure 5. Concept of gene library, with a new individual generated by grouping genes from 3 libraries. (Library 1: 1A 1B 1C 1D 1E 1F; library 2: 2A 2B 2C 2D 2E 2F; library 3: 3A 3B 3C 3D 3E 3F. Selected segments: 1A 1B, 2E 2F, 3C 3D. New individual: 1A 1B 2E 2F 3C 3D.)
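The gene-library idea of Figure 5 can be mimicked with a short Python sketch in which one segment is drawn from each library and the segments are concatenated. The function name `recombine`, the fixed segment length, and the choice of contiguous segments are assumptions made to mirror the figure; how copt-aiNet builds valid gene orderings by recombination is not specified in this section and is not modeled here.

```python
import random

def recombine(libraries, segment_len, rng):
    """Concatenate one contiguous segment drawn at random from each library."""
    individual = []
    for lib in libraries:
        start = rng.randrange(len(lib) - segment_len + 1)
        individual.extend(lib[start:start + segment_len])
    return individual

# Three libraries labeled as in Figure 5: 1A..1F, 2A..2F, 3A..3F.
libraries = [[f"{k}{c}" for c in "ABCDEF"] for k in (1, 2, 3)]
new_individual = recombine(libraries, segment_len=2, rng=random.Random(0))
print(new_individual)
```

Passing an explicit `random.Random` instance keeps the example reproducible without touching the global random state.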


The AIS field emerged in the mid 1980s from the observation that theoretical models of the immune system present similarities with other machine-learning techniques and thus can potentially provide solutions to complex computational and engineering problems. AIS have also been argued to be useful as a novel soft computing paradigm, being compared to and hybridized with a number of other techniques, including evolutionary algorithms, fuzzy systems and artificial neural networks [19]. The application domains of AIS tools range widely, from computer network security to control and bioinformatics [12, 17, 43].

The design of an artificial immune system is quite similar to the design of the other, more traditional, computational intelligence approaches. It is first necessary to choose a representation scheme for the components of the system (this is usually problem-dependent), it is then necessary to define one or more evaluation functions to assess the behavior of the components of the system, and finally it is necessary to choose or propose an immune algorithm to govern the dynamics of the system (the immune algorithms are usually simplified models, sometimes referred to as abstractions or metaphors, of an immune principle, theory or theoretical model). The description of the copt-aiNet algorithm, presented next, follows these three basic steps: representation, evaluation and dynamics.

5.3. Copt-aiNet adaptation procedure and gene ordering

The copt-aiNet algorithm is an extension of a simple artificial immune network model, named aiNet (Artificial Immune Network), originally proposed to perform data compression and clustering [13, 14], and further adapted to solve multimodal optimization tasks [16], giving rise to the opt-aiNet algorithm. The acronym copt-aiNet thus refers to the optimization version of aiNet, opt-aiNet, for solving combinatorial optimization problems. The copt-aiNet algorithm uses the following terminology:
• Network cell: a feasible solution to the problem. In this case, it is a sequence of integer numbers; each position in the vector corresponds to a gene.
• Fitness: fitness of a cell in relation to an objective function to be optimized (either minimized or maximized). In the specific case of the gene ordering task, the fitness is the inverse of the sum of the Euclidean distances between adjacent gene vectors in the sequence (see Eq. (1)).
• Affinity: distance between a cell and another one.
• Clone: offspring cells generated from a single parent cell. These cells will undergo a mutation process so that they become variations of their parent cell.

In the copt-aiNet algorithm, each individual in the population corresponds to an antibody or network cell and is represented by an attribute string in a Shape-Space S [39], in which the elements correspond to a sort of generalization of specific input patterns. In the case of the gene ordering problem, each immune cell corresponds to a gene, therefore to a row in the gene expression matrix. In the problem of ordering expression data, the solution sought is not a closed path, but an open path between 2 genes. To adapt the proposed algorithm to the target problem, an auxiliary gene whose distance from all others is equal to 0 is inserted in the problem, thus allowing the representation of the solution as an open path.
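The effect of the auxiliary gene can be checked numerically: a closed tour that passes through a gene at distance 0 from all others has the same cost as the open path obtained by cutting the tour at that gene. The following Python sketch illustrates this under assumed names (`open_path_cost`, `closed_tour_cost`, `AUX`) and hypothetical data; it is not the authors' implementation.

```python
import math

def dist(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def open_path_cost(profiles, order):
    """Sum of distances between consecutive genes of an open path."""
    return sum(dist(profiles[order[i]], profiles[order[i + 1]])
               for i in range(len(order) - 1))

def closed_tour_cost(profiles, tour, aux):
    """Cost of a closed tour in which index `aux` is at distance 0 from every gene."""
    def d(i, j):
        return 0.0 if aux in (i, j) else dist(profiles[i], profiles[j])
    n = len(tour)
    return sum(d(tour[i], tour[(i + 1) % n]) for i in range(n))

# Hypothetical data: 4 genes measured in 2 experiments.
profiles = [(0.0, 0.0), (0.0, 1.0), (3.0, 5.0), (3.0, 4.0)]
AUX = len(profiles)           # index of the auxiliary gene (it has no profile)
tour = [1, 0, AUX, 2, 3]      # closed tour that includes the auxiliary gene

# Cut the tour at the auxiliary gene to read off the open path.
k = tour.index(AUX)
path = tour[k + 1:] + tour[:k]

print(path, closed_tour_cost(profiles, tour, AUX), open_path_cost(profiles, path))
```

With this construction, any procedure designed for closed tours can be reused for the open-path ordering problem by reading the result starting just after the auxiliary gene.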


As mentioned above, two indexes are used to evaluate the quality (adaptability) of each network cell: (1) a fitness value, and (2) an affinity value. The fitness of a network cell indicates its degree of adaptability to the external environment; that is, how good this cell is at solving the problem. The fitness value of a network cell is based on an objective function to be optimized. For the specific case of the gene expression application, the fitness corresponds to the inverse of the sum of the Euclidean distances between pairs of the sequence of gene vectors, and can be computed according to Eq. (1):

$$\mathrm{Fitness} = \frac{1}{\sum_{i=1}^{Num-1} \lVert x_i - x_{i+1} \rVert + \lVert x_1 - x_{Num} \rVert + 1}, \tag{1}$$

where Num denotes the number of genes, and x_i is the i-th gene vector in the N-dimensional shape-space. In this way, the fitness decreases as the sum of the Euclidean distances between each pair of adjacent genes in a sequence increases: the larger the total distance, the smaller the fitness, and vice versa.

The affinity of a network cell indicates its degree of recognition of other network cells. The recognition of one network cell by another is measured by a similarity metric that indicates how similar they are. This feature is exploited in the copt-aiNet algorithm to regulate the number of network cells and to preserve the population diversity. Similar cells are identified, using a similarity metric, and the ones having lower fitness are removed from the population. This corresponds to the immune network suppression.

The pseudocode presented in Algorithm 1 summarizes the main steps of the copt-aiNet adaptation procedure. Each step is explained in more detail after the listing, and the specific parameters of each step are presented in Section 6.2 (Parameter Setting).

1. While StopCriterion is not met do
   1.1. Generate the initial population
   1.2. Perform cloning and mutation
   1.3. If Average Fitness (avF) has not improved then
        1.3.1. Perform suppression
   1.4. If number of individuals in the population is less than m then
        1.4.1. Insert new individuals generated by recombination
   1.5. If none of the k best individuals had its Fitness improved during M iterations then
        1.5.1. Submit every individual in the population to a weak affinity maturation
   1.6. If none of the k best individuals had its Fitness improved for Max_it iterations then
        1.6.1. StopCriterion = true
2. Submit the k best individuals in the population to an intensive affinity maturation process

Algorithm 1. Pseudocode for the copt-aiNet algorithm.
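Before the individual steps are detailed, the fitness evaluation of Eq. (1) can be illustrated with a minimal Python sketch. The function names `tour_length` and `fitness` and the toy expression profiles are illustrative assumptions, not part of the original C implementation, and the auxiliary zero-distance gene used to open the path is not modeled here. The sketch relies on `math.dist`, available from Python 3.8.

```python
import math

def tour_length(order, profiles):
    """Denominator of Eq. (1) without the +1: distances between adjacent
    genes in the ordering, plus the closing term between last and first."""
    n = len(order)
    return sum(
        math.dist(profiles[order[i]], profiles[order[(i + 1) % n]])
        for i in range(n)
    )

def fitness(order, profiles):
    """Eq. (1): inverse of the total distance plus one."""
    return 1.0 / (tour_length(order, profiles) + 1.0)

# Toy data (hypothetical): four genes measured in two experiments.
profiles = [(0.0, 0.0), (3.0, 4.0), (3.0, 0.0), (0.0, 4.0)]
good = [0, 2, 1, 3]  # walks around the rectangle
bad = [0, 1, 2, 3]   # crosses both diagonals

print(tour_length(good, profiles), fitness(good, profiles))
print(tour_length(bad, profiles), fitness(bad, profiles))
```

The ordering that keeps similar profiles adjacent has the shorter total distance and therefore the higher fitness, which is the behavior described for Eq. (1).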


Step 1.1. Generation of the initial population. The initial population is constructed following the nearest neighbor heuristic starting with different seeds. Initially the population contains a predefined number of individuals and is allowed to grow and shrink dynamically, according to the requirements of the problem (see explanation below).

Step 1.2. Clonal Selection and Affinity Maturation [18]. At each iteration, a population of cells is optimized locally through clonal selection and affinity proportional mutation (exploitation of the fitness landscape). In this phase, each cell gives origin to a number of clones. The fact that no parent cell has a selective advantage over the others contributes to the multimodal search capability of the algorithm. The offspring cells undergo a mutation in which n genes of each clone are randomly selected for mutation. The number n of genes to be mutated is inversely proportional to the fitness of the parent cell: cells with higher fitness are submitted to lower mutation rates. There are three types of mutation operators in the copt-aiNet algorithm: (1) the first one exchanges two genes; (2) the second one exchanges three genes; and (3) the third alters four genes. The genes to be mutated and the type of mutation are randomly chosen. For each set of clones, the clone with highest fitness is selected. If this clone has a fitness value higher than that of its parent cell, then the parent cell is replaced by its offspring.

Step 1.3. Network Interactions and Diversification. When the network cells reach a stable state in terms of capability of improving fitness (measured via the stagnation of the average fitness of the whole population, avF), the cells interact with each other in a network form by determining their affinity. Some of the similar cells are eliminated to avoid redundancy and thus maintain diversity. The similarity among cells is computed as detailed in Section 5.4.
This corresponds to the network suppression, a process by which the most similar cells (candidate solutions) are eliminated. The similarity between two solutions (affinity) is measured by the length of the shortest sequence of operations that transforms one solution into the other. Notice that an assertion like “the higher the number of edges in common between two solutions, the smaller the number of operations to convert one solution into the other” is not completely true in the present case, and metrics based on the maximal common subgraph [8] should not be adopted here.

Step 1.4. Gene Recombination. New network cells composed of individuals generated by means of recombination operators using the concept of gene library [37] are introduced. The gene libraries are constructed from a set of memory cells, formed by antibodies having high fitness. To do that, the memory cells are broken down into p pieces of length l each, where p is a random number varying from 2 to 4. These pieces are randomly selected and recombined in order to build a new solution.

Step 1.5. Local Search. If none of the k best solutions is improved during a predefined number of iterations, all the cells in the population undergo a maturation process. During the maturation process, the immune cells undergo a series of guided mutations in order to better match the antigens. This process is implemented by a local search heuristic. In copt-aiNet, a tabu search heuristic [25, 34] is employed as a local search procedure. The maturation process is performed until the current individual has its fitness improved t times by a randomly selected move, or until its fitness is not improved by a complete tabu search iteration. There are 3 types of moves available, involving the exchange of 2, 3 and 4 genes.

Step 2. Intensive Local Search. If none of the k best solutions is improved along Max_it iterations, the stopping criterion is met and the k most adapted individuals are submitted to an intensive maturation process, implemented by the tabu search heuristic. As described above, the local search employs 3 types of moves, involving the exchange of 2, 3 and 4 genes. All the parameters of the tabu search, such as tabu tenure and the number of non-improvement iterations performed before changing the move type, among others, are problem dependent and are automatically determined by the algorithm.

5.4. Diversity maintenance: Similarity criterion

Particularly when considering analysis of gene expression data, the recognition of multiple co-regulation relationships is a very important feature, and the diversity maintenance of the population of solutions is crucial. In this section the procedure for computing the similarity among solutions is briefly described. Let S(i) be the element (gene) of index i in solution S. The pseudocode to compute the similarity between two solutions A and B can be summarized as follows:

Step 1 - k := 1; Num_Operations := 0; Cur_Element := A(k); Finished := False
Step 2 - Search for index j so that B(j) = Cur_Element
Step 3 - While Finished = False do
   Step 3.1 - k := k + 1; Cur_Element := A(k)
   Step 3.2 - Search for index n so that B(n) = Cur_Element
   Step 3.3 - If j + 1 > Num_Genes then Index1 := 1 else Index1 := j + 1
              If j - 1 < 1 then Index2 := Num_Genes else Index2 := j - 1
   Step 3.4 - If n ≠ Index1 and n ≠ Index2 then Num_Operations := Num_Operations + 1
   Step 3.5 - j := n
   Step 3.6 - If k = Num_Genes then Finished := True

In the algorithm presented, Num_Genes is the number of genes involved in the microarray experiment, and Num_Operations represents the minimum number of operations necessary to convert one solution into another. In other words, Num_Operations counts the pairs of genes that are adjacent in A but not adjacent in B.

5.5. Copt-aiNet as an immune-evolutionary algorithm

Standard evolutionary algorithms aim at obtaining a single high-quality, near-optimal solution by exploring promising regions of the search space by means of a population of candidate solutions. An inherent problem of most evolutionary algorithms is the loss of diversity in the population. Several strategies have been proposed to overcome this difficulty, such as island models [45], and niche and species methods [3, 32, 38]. This work proposes the use of a new type of evolutionary algorithm inspired by the immune network theory [28]. The copt-aiNet algorithm has the advantage of being inherently capable of locating and maintaining stable multiple and diverse solutions to complex problems [14, 15]. The immune system of humans is naturally capable of maintaining several sets of cells and molecules capable of fighting against different viruses, bacteria, fungi, etc. The simple model of the immune system presented here is capable of reproducing this potentiality, to a certain degree, and is thus suitable for application to ordering gene expression data.

The determination of multiple solutions is particularly interesting for the analysis of gene expression data. As an example, consider the two distinct rearrangements shown in Figure 6, for an instance with 100 genes and 2 experiments plotted in the 2-D plane. As can be observed, there are two possibly co-regulated genes. However, there is a long path between these two genes in the first solution, while in the second solution they are very close together. It is then possible to conclude that some closely-related gene relationships may be hidden to give place to the representation of other co-regulation relationships, each one associated with a distinct sequence of hierarchical relations. Therefore, a single hierarchical approach followed by a post-processing phase will not be capable of providing these two alternative views of the same data set.

Figure 6. Co-regulated genes inside the circle in two arrangements for an instance with 100 genes and 2 experiments. The hierarchical relationships among elements are distinct in the two arrangements, and cannot be explained by a single tree. (a) One possible arrangement and (b) a different arrangement for the same points.

The copt-aiNet adaptation procedure was inspired by clonal selection, affinity maturation and the immune network theory. It can be observed, from its adaptation procedure, that it performs the main steps of all evolutionary algorithms, namely, reproduction with inheritance, genetic variation and selection. These features allow copt-aiNet to be classified as an evolutionary algorithm, or more accurately, an immune-evolutionary algorithm. The similarities between these two approaches have a natural basis. They come from the close relationship between some of the main functioning mechanisms of the immune system and the neo-Darwinian theory of evolution.
From this evolutionary perspective, clonal selection and affinity maturation together result in a micro-evolutionary process that occurs within our bodies in a time scale that is orders of magnitude faster than the evolution of species (see [17], pp. 188–192, for a broader discussion). The combination of features presented by copt-aiNet results in an algorithm with great potential to solve multimodal and multiobjective problems, such as the multiple rearrangement of gene expression data. To name just a few of these important characteristics, there is the continual self-recognition of the network cells; the use of a fitness proportionate mutation as a single genetic variation operator; the use of gene libraries to generate new individuals for the population; the use of a fitness proportionate reproduction rate; and a population with dynamically adjustable size.

6. Computational results

6.1. Materials and methods

In this section, the computational results obtained are presented. The program was coded in C language, using the C++ Builder 3 compiler, and the simulations were performed on a PC microcomputer with a 1.8 GHz Xeon processor. We performed simulations with three different sets of data:

• Diffuse Large B-cell Lymphoma: 380 genes and 63 experiments from the biological public dataset described in [1] were used. The data together with additional information are available at http://llmpp.nih.gov/lymphoma/data/figure4.
• Yeast Saccharomyces cerevisiae: 300 genes and 79 experiments, constituting a subset of the well-known Eisen data set [21], were used. Eisen’s original PNAS paper, the whole data set and further information are available at http://rana.lbl.gov/EisenData.htm.
• Yeast Saccharomyces cerevisiae: 500 genes and 79 experiments, constituting another subset of the well-known Eisen data set [21], were used.

All missing values in the experiments were filled with the value ‘0’. The complete experimental data used in this work are available at the Infobiosys Research Group website at http://www.dca.fee.unicamp.br/projects/infobiosys/geneexp.

In order to assess the effectiveness of copt-aiNet, a comparative study was performed involving the proposed immune-evolutionary algorithm and hierarchical clustering followed by reordering of the leaves of the generated tree. The public tool TreeArrange was used to flip the leaves of the hierarchical tree. TreeArrange is offered by Biedl et al. [5], and is available at http://monod.uwaterloo.ca/downloads/treearrange. Euclidean distance was used as the dissimilarity metric for all algorithms.

6.2. Parameter setting

The parameters employed by the copt-aiNet algorithm were the same for all tested instances, and these are described here according to the pseudocode presented in Section 5.3.

Step 1.1. Generation of the initial population. An initial population with 20 individuals is randomly generated.

Step 1.2. Clonal Selection and Affinity Maturation. Number of offspring generated for each cell: 10. The mutation rate p_m(i) for individual i is given by:

$$p_m(i) = \frac{(\mathit{max\_pm} - 0.01) \times (f(i) - f_{min})}{f_{min} - f_{max} + 0.001} + \mathit{max\_pm}, \tag{2}$$

where max_pm = 0.2, f(i) is the fitness of individual i, and f_min and f_max are the minimum and maximum fitness values. Number of genes n to be mutated: n = max{(total number of genes × p_m(i)), 1}.

Step 1.3. Network Interactions and Diversification. If 1 − (avF(t)/avF(t − 1)) < 0.0015, then the algorithm is assumed to have reached a steady state in terms of capability of improving fitness, where avF(t) is the average fitness of the population at iteration t, and avF(t − 1) is the average fitness at iteration t − 1. This verification is performed every 5 iterations. In this case, suppression is needed and those individuals whose fitness is lower than the average fitness are successively eliminated, until the minimal size of the population (10 individuals) is reached.

Step 1.4. Gene Recombination. If the number of individuals in the population is less than the maximal number m allowed (m = 60), then new network cells are generated by gene recombination. To generate the gene libraries, the 10 individuals with highest fitness are determined and a random number between 2 and 4 is chosen as the number of gene fragments to generate a new individual. These fragments are taken randomly from the best individuals of the population.

Step 1.5. Local Search. A tabu search procedure is applied to each individual of the population until there is no improvement for one complete tabu iteration or until the tabu search results in three (t = 3) consecutive improvements of the individual. The tabu tenure used was equal to 10% of the total number of genes and the number of non-improvement iterations performed before changing the type of move was equal to 20.

Step 2. Intensive Local Search. The intensive maturation process finishes when the current solution is not improved along 250 tabu iterations. As above, the tabu tenure used was equal to 10% of the total number of genes and the number of non-improvement iterations performed before changing the type of move was equal to 20.
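As an illustration of the Step 1.2 parameters, the following Python sketch transcribes Eq. (2) term by term and derives the number of genes to mutate. The function names, the rounding to the nearest integer, and the fitness range used in the example are assumptions made here for illustration; the source does not state how a fractional n is handled. The sketch also assumes that f_max − f_min is well above the 0.001 offset, since otherwise the denominator changes sign.

```python
def mutation_rate(f_i, f_min, f_max, max_pm=0.2):
    """Eq. (2): the least fit cell gets max_pm, fitter cells get lower rates."""
    return (max_pm - 0.01) * (f_i - f_min) / (f_min - f_max + 0.001) + max_pm

def genes_to_mutate(total_genes, pm):
    """n = max(total_genes * pm, 1); rounding to an integer is an assumption."""
    return max(round(total_genes * pm), 1)

# Illustrative fitness range (hypothetical values, not taken from the paper).
f_min, f_max = 0.0, 1.0
pm_worst = mutation_rate(f_min, f_min, f_max)  # least fit cell
pm_best = mutation_rate(f_max, f_min, f_max)   # fittest cell

print(pm_worst, genes_to_mutate(380, pm_worst))
print(pm_best, genes_to_mutate(380, pm_best))
```

Under these assumptions the least fit cell receives the maximum rate of 0.2, while the fittest cell receives a rate close to 0.01, which matches the statement that cells with higher fitness are submitted to lower mutation rates.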

6.3. Results

To benchmark the proposed algorithm, its performance was compared with the solutions provided by the software TreeArrange offered by the Bioinformatics Research Group at the University of Waterloo, available at http://monod.uwaterloo.ca/downloads/treearrange/. TreeArrange rearranges the leaves of the tree generated by a hierarchical clustering procedure in order to improve the hierarchical clustering solution. It is important to remark, however, that while TreeArrange provides a single solution to the problem under study, the proposed immune-evolutionary algorithm aims at providing multiple high-quality solutions to the problem. This makes it possible to identify various co-regulated genes from representative graphical configurations of the expression levels.

The results presented here thus correspond to the cost of the best solution found by the TreeArrange software, CostTA, and the cost of several solutions found by the copt-aiNet algorithm. The cost of a solution is calculated as the sum of the Euclidean distances between each pair of adjacent genes. In addition, it is also important to compare the similarity or dissimilarity between the solutions obtained by the copt-aiNet algorithm, and this follows the criterion described in Section 5.4. An integer value corresponding to the number of operations (Num_Operations) is used here to represent the diversity (dissimilarity) between two solutions: larger values indicate more dissimilar solutions, whilst smaller values indicate more similar solutions. If two solutions are exactly the same, then Num_Operations = 0.

Tables 1–3 present the results for eight individuals of the population evolved by the copt-aiNet algorithm in terms of diversity and compare the cost of these solutions with those obtained by the TreeArrange software. Each table contains the results of the algorithms for one of the three data sets used to assess the proposed method. It can be noted, in all cases, that the solutions obtained by the copt-aiNet algorithm are not only high-quality solutions (all eight costs are lower than the cost obtained by TreeArrange), but also diversified solutions. To illustrate the differences in terms of multiple views (rearrangements) of gene expression data, Figure 7 shows the graphical representations for the 4 best solutions obtained by the immune algorithm for the Diffuse Large B-cell Lymphoma dataset. As already depicted in Figure 6 for the 2-D case, distinct rearrangements in Figure 7 are not only characterized by a distinct allocation for the same clusters, but also by the proposition of novel clusters.

Table 1. Diversity and cost of the best solutions obtained for 380 genes and 63 experiments from the biological public dataset of Diffuse Large B-cell Lymphoma. The costs correspond to the sum of the Euclidean distances between each pair of adjacent genes.

Solution     1     2     3     4     5     6     7     8    Cost
1            0   102    87   102    99    91   118   107    1827.25
2          102     0   107   108   102   105    93   115    1827.69
3           87   107     0    93   104   108   108   123    1829.07
4          102   108    93     0    99   110   117   117    1829.18
5           99   102   104    99     0   111   117   100    1829.58
6           91   108   105   110   111     0   115   117    1830.44
7          118    93   108   117   117   115     0   130    1830.83
8          107   115   123   117   100   117   130     0    1831.40
CostTA:                                                     1909.50
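The diversity values reported in Tables 1–3 are Num_Operations counts obtained with the criterion of Section 5.4. A minimal Python sketch of that counting procedure is given below. It assumes that Step 3.4 tests inequality (n different from both Index1 and Index2), uses 0-based indices with modulo arithmetic instead of the explicit wrap-around tests, and the function name `num_operations` is an illustrative choice.

```python
def num_operations(a, b):
    """Count consecutive pairs of `a` that are not neighbors in `b`,
    treating `b` as circular (Section 5.4, read with inequality tests)."""
    n = len(a)
    position = {gene: idx for idx, gene in enumerate(b)}  # index of each gene in b
    count = 0
    j = position[a[0]]
    for k in range(1, n):
        m = position[a[k]]
        index1 = (j + 1) % n  # right neighbor of position j, wrapping around
        index2 = (j - 1) % n  # left neighbor of position j, wrapping around
        if m != index1 and m != index2:
            count += 1
        j = m
    return count

reference = [0, 1, 2, 3, 4]
print(num_operations(reference, [0, 1, 2, 3, 4]))  # identical ordering
print(num_operations(reference, [4, 3, 2, 1, 0]))  # reversed ordering
print(num_operations(reference, [0, 2, 1, 3, 4]))  # two genes exchanged
```

Because both neighbors are checked and positions wrap around, a reversed or rotated copy of an ordering is not counted as different under this reading; only broken adjacencies contribute to the count.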

Table 2. Diversity and cost of the best solutions obtained for 300 genes and 79 experiments from Yeast Saccharomyces cerevisiae. The costs correspond to the sum of the Euclidean distances between each pair of adjacent genes.

Solution     1     2     3     4     5     6     7     8    Cost
1            0    78    72    71    92    82    96    93    1070.54
2           78     0    85    88   101   102    94   108    1072.80
3           72    85     0    84    89    90    95   110    1072.81
4           71    88    84     0    88    94    95   101    1074.52
5           92   101    89    88     0    93   106   110    1074.66
6           82   102    90    94    93     0   107   105    1075.59
7           96    94    95    95   106   107     0   110    1075.73
8           93   108   110   101   110   105   110     0    1077.09
CostTA:                                                     1120.10


Table 3. Diversity and cost of the best solutions obtained for 500 genes and 79 experiments from Yeast Saccharomyces cerevisiae. The costs correspond to the sum of the Euclidean distances between each pair of adjacent genes.

Solution     1     2     3     4     5     6     7     8    Cost
1            0   177   182   174   179   174   182   172    1913.84
2          177     0   167   170   171   177   187   183    1916.80
3          182   167     0   186   177   194   200   187    1917.31
4          174   170   186     0   189   187   183   190    1917.70
5          179   171   177   189     0   176   164   192    1918.35
6          174   177   194   187   176     0   121   200    1919.00
7          182   187   200   183   164   121     0   215    1919.79
8          172   183   187   190   192   200   215     0    1920.77
CostTA:                                                     2016.10
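To make the cost comparison in Tables 1–3 easier to read, the short Python sketch below computes how far the copt-aiNet costs fall below CostTA in relative terms. The cost values are transcribed from the tables; the dictionary layout and the function name `reduction_percent` are illustrative assumptions.

```python
# Costs transcribed from Tables 1-3: the eight copt-aiNet solutions and CostTA.
RESULTS = {
    "Lymphoma, 380 genes": ([1827.25, 1827.69, 1829.07, 1829.18,
                             1829.58, 1830.44, 1830.83, 1831.40], 1909.50),
    "Yeast, 300 genes": ([1070.54, 1072.80, 1072.81, 1074.52,
                          1074.66, 1075.59, 1075.73, 1077.09], 1120.10),
    "Yeast, 500 genes": ([1913.84, 1916.80, 1917.31, 1917.70,
                          1918.35, 1919.00, 1919.79, 1920.77], 2016.10),
}

def reduction_percent(cost, cost_ta):
    """Percentage by which `cost` falls below the TreeArrange cost."""
    return 100.0 * (cost_ta - cost) / cost_ta

for name, (costs, cost_ta) in RESULTS.items():
    best = reduction_percent(min(costs), cost_ta)
    worst = reduction_percent(max(costs), cost_ta)
    print(f"{name}: {worst:.2f}% to {best:.2f}% below CostTA")
```

For the three data sets, every one of the eight listed solutions has a cost roughly 4 to 5 percent below the corresponding TreeArrange cost.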

Figure 7. Four solutions obtained by the copt-aiNet algorithm for 380 genes and 63 experiments from the biological public dataset of Diffuse Large B-cell Lymphoma.

7. Discussion and future trends

The application of mathematics, statistics, and technology derived from computer science and information theory is essentially what characterizes the emerging field of bioinformatics. However, bioinformatics is not directly devoted to the proposition of fundamental mathematical laws to describe biological phenomena. Instead, the main purpose is the conception of computational tools to be used by an expert to analyze biological data, given the challenging gap between the availability of new biological data and the currently limited capability of extracting useful results from them.

Technological innovations are generating an increasing demand for data analysis in biological investigations of gene expression, and the specific aim of the present paper is to provide guidance to practitioners in interpreting the results from the simultaneous proposal of alternative and well-characterized relationships among the expression profiles. The exploratory nature of the proposed computational tool is motivated by the fact that gene expression is associated with complex stochastic processes occurring over time and space. Together with the inherent randomness of any sampling process, many chance mechanisms are involved in the production of gene expression data. As a consequence, the availability of a diverse set of high-quality rearrangements should be interpreted as the beginning, rather than the end, of a gene expression analysis.

This approach is motivated by an increasing consensus that it is not efficient to consider each gene in isolation, or a single association involving a specific subset of genes. Improved and more reliable results can be obtained by considering multiple rearrangements of gene expression profiles, which are still under-explored in the context of gene expression data analysis. The existence of several diverse possibilities of interpretation may uncover novel and possibly unexpected relationships among gene expression data.
The computational tool developed here to organize gene expression data is a multimodal optimization algorithm, inspired by immunology, and named copt-aiNet (artificial immune network for multimodal and combinatorial optimization). The description of copt-aiNet was written so as to allow a proper understanding and implementation of the algorithm, taking into account its biological motivation and evolutionary nature. A brief review of the main methods used in the analysis of gene expression data has also been provided.

Some computational experiments were presented, using public biological datasets, to assess the effectiveness of the proposed method. We performed simulations involving 380 genes and 63 experiments from the biological public dataset of Diffuse Large B-cell Lymphoma, 300 genes and 79 experiments from Yeast Saccharomyces cerevisiae, and 500 genes and 79 experiments from Yeast Saccharomyces cerevisiae. The data used in the first experiment were extracted from [1], and for experiments 2 and 3, we selected subsets of genes from the public dataset of Eisen et al. [21]. The performance of the copt-aiNet ordering algorithm was compared with the solution generated by the software TreeArrange, which corresponds to a hierarchical clustering technique (the most used clustering approach in the context of gene expression data).

As a topic for future research, we intend to incorporate the degree of reliability associated with the expressed profile of each gene. Also, more accurate indications of the uncertainty to be attributed to each cluster are desirable, given that alternative rearrangements may indicate stable and unstable clustering configurations, in the sense that the most frequent clusters along the multiple rearrangements should lead to more confident hypotheses.


Acknowledgments

This research is supported by CNPq grants 521100/01-1, 350310/02-5, 300910/96-7, and 141882/01-8.

References

1. A. A. Alizadeh, M. B. Eisen, R. E. Davis, C. Ma, I. S. Lossos, A. Rosenwald, J. C. Boldrick, H. Sabet, T. Tran, X. Yu, J. I. Powell, I. Yang, G. E. Marti, T. Moore, J. Hudson, L. Lu, and D. B. Lewis, “Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling,” Nature, vol. 403, pp. 503–511, 2000.
2. T. Bäck and D. B. Fogel, Handbook of Evolutionary Computation, Institute of Physics Publishing and Oxford University Press, 1997.
3. T. Bäck, D. B. Fogel, and Z. Michalewicz, Evolutionary Computation 2: Advanced Algorithms and Operators, Institute of Physics Publishing (IOP): Bristol and Philadelphia, 2000.
4. Z. Bar-Joseph, D. Gifford, and T. Jaakkola, “Fast optimal leaf ordering for hierarchical clustering,” Bioinformatics (Proceedings of ISMB 2001, Denmark), vol. 17, pp. 22–39.
5. T. Biedl, B. Brejová, E. D. Demaine, M. A. Hamel, and T. Vinar, “Optimal arrangement of leaves in the tree representing hierarchical clustering of gene expression data,” Technical Report 2001-14, University of Waterloo, Canada, 2001.
6. A. Ben-Hur, D. Horn, H. T. Siegelmann, and V. Vapnik, “Support vector clustering,” Journal of Machine Learning Research, vol. 2, pp. 125–137, 2001.
7. E. Bonabeau, M. Dorigo, and G. Theraulaz, Swarm Intelligence: From Natural to Artificial Systems, Oxford University Press: New York, 1999.
8. H. Bunke and K. Shearer, “A graph distance metric based on the maximum common subgraph,” Pattern Recognition Letters, vol. 19, pp. 255–259, 1998.
9. F. M. Burnet, The Clonal Selection Theory of Acquired Immunity, Cambridge University Press, 1959.
10. D. Corne, M. Dorigo, and F. Glover, New Ideas in Optimization, McGraw-Hill, 1999.
11. C. Cotta, A. Mendes, V. Garcia, P. França, and P. Moscato, “Applying memetic algorithms to the analysis of microarray data,” in Proceedings of EvoBIO2003, 1st European Workshop on Evolutionary Bioinformatics, Lecture Notes in Computer Science, no. 2611, Springer-Verlag: Colchester, England, April 2003, pp. 22–32.
12. D. Dasgupta (ed.), Artificial Immune Systems and Their Applications, Springer-Verlag, 1999.
13. L. N. de Castro and F. J. Von Zuben, “An evolutionary immune network for data clustering,” in Proceedings of the IEEE Brazilian Symposium on Neural Networks (SBRN’00), 2000, pp. 84–89.
14. L. N. de Castro and F. J. Von Zuben, “aiNet: An artificial immune network for data analysis,” in Data Mining: A Heuristic Approach, H. A. Abbass, R. A. Sarker, and C. S. Newton (eds.), Idea Group Publishing, USA, Chapter XII, 2001, pp. 231–259.
15. L. N. de Castro and F. J. Von Zuben, “Immune and neural network models: Theoretical and empirical comparisons,” International Journal of Computational Intelligence and Applications (IJCIA), vol. 1, no. 3, pp. 239–257, 2001.
16. L. N. de Castro and J. I. Timmis, “An artificial immune network for multimodal function optimization,” in Proceedings of IEEE Congress on Evolutionary Computation (CEC’02), 2002, vol. 1, pp. 699–704.
17. L. N. de Castro and J. Timmis, Artificial Immune Systems: A New Computational Intelligence Approach, Springer-Verlag, 2002.
18. L. N. de Castro and F. J. Von Zuben, “Learning and optimization using the clonal selection principle,” IEEE Transactions on Evolutionary Computation, vol. 6, no. 3, pp. 239–251, 2002.
19. L. N. de Castro and J. I. Timmis, “Artificial immune systems as a novel soft computing paradigm,” Soft Computing, vol. 7, no. 8, pp. 526–544, 2003.
20. L. N. de Castro and F. J. Von Zuben, Recent Developments in Biologically Inspired Computing, Idea Group Publishing, 2004.
21. M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein, “Cluster analysis and display of genome-wide expression patterns,” Proc. Natl. Acad. Sci. USA, vol. 95, no. 25, pp. 14863–14868, 1998.
22. M. R. Garey and D. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, Freeman, 1979.
23. A. P. Gasch and M. B. Eisen, “Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering,” Genome Biology, vol. 3, no. 11, pp. 1–22, 2002.
24. L. C. T. Gomes, F. J. Von Zuben, and P. Moscato, “Ordering microarray gene expression data using a self-organising neural network,” in Proceedings of the Recent Advances in Soft Computing (RASC2002), 2002, pp. 307–312.
25. F. Glover and G. A. Kochenberger, Handbook of Metaheuristics, Kluwer Academic Publishers, 2002.
26. S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice Hall, 1999.
27. A. K. Jain, M. N. Murty, and P. J. Flynn, “Data clustering: A review,” ACM Computing Surveys, vol. 31, no. 3, pp. 264–323, 1999.
28. N. K. Jerne, “Towards a network theory of the immune system,” Ann. Immunol. (Inst. Pasteur), vol. 125C, pp. 373–389, 1974.
29. J. Kennedy, R. Eberhart, and Y. Shi, Swarm Intelligence, Morgan Kaufmann Publishers, 2001.
30. E. L. Lawler and J. K. Lenstra, The Traveling Salesman Problem, John Wiley & Sons, 1985.
31. N. M. Luscombe, D. Greenbaum, and M. Gerstein, “What is bioinformatics? A proposed definition and overview of the field,” Methods of Information in Medicine, vol. 40, no. 4, pp. 346–358, 2001.
32. S. W. Mahfoud, Niching Methods for Genetic Algorithms, Illinois Genetic Algorithms Laboratory, Illinois, IL, IlliGAL Report no. 95001, May 1995.
33. P. Merz and A. Zell, “Clustering gene expression profiles with memetic algorithms,” in Proceedings of the 7th International Conference on Parallel Problem Solving from Nature (PPSN VII), Lecture Notes in Computer Science 2439, Springer, Berlin, Heidelberg, 2002, pp. 811–820.
34. Z. Michalewicz and D. B. Fogel, How to Solve It: Modern Heuristics, Springer-Verlag, 2000.
35. P. Moscato, “On evolution, search, optimization, genetic algorithms and martial arts: Towards memetic algorithms,” Technical Report C3P Report 826, Caltech Concurrent Computation Program, California Institute of Technology, 1989.
36. G. J. V. Nossal, “The molecular and cellular basis of affinity maturation in the antibody response,” Cell, vol. 68, pp. 1–2, 1993.
37. M. Oprea, “Antibody repertoires and pathogen recognition: The role of germline diversity and somatic hypermutation,” Ph.D. Dissertation, University of New Mexico, Albuquerque, New Mexico, USA, 1999.
38. J. P. Pedroso, “Niche search: An evolutionary algorithm for global optimization,” in Parallel Problem Solving from Nature IV, vol. 1141 of Lecture Notes in Computer Science, H.-M. Voigt, W. Ebeling, I. Rechenberg, and H. P. Schwefel (eds.), Springer, Berlin, 1996, pp. 430–440.
39. A. S. Perelson and G. F. Oster, “Theoretical studies of clonal selection: Minimal antibody repertoire size and reliability of self-nonself discrimination,” J. theor. Biol., vol. 81, pp. 645–670, 1979.
40. M. Schena, “Genome analysis with gene expression microarrays,” Bioessays, vol. 18, pp. 427–431, 1996.
41. U. Storb, “Progress in understanding the mechanisms and consequences of somatic hypermutation,” Immun. Rev., vol. 162, pp. 5–11, 1998.
42. P. Tamayo, D. Slonim, Q. Zhu, S. Kitareewan, E. S. Lander, and T. R. Golub, “Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation,” Proc. Natl. Acad. Sci. USA, vol. 96, no. 6, pp. 2907–2912, 1999.
43. J. Timmis, P. J. Bentley, and E. Hart (eds.), Proc. of the 2nd Int. Conf. on Artificial Immune Systems, Lecture Notes in Computer Science, vol. 2787, Springer-Verlag, 2003.
44. H.-K. Tsai, J.-M. Yang, and C.-Y. Kao, “Applying genetic algorithms to finding the optimal gene order in displaying the microarray data,” in Proc. of the Genetic and Evolutionary Computation Conference (GECCO 2002), 2002, pp. 610–617.
45. D. Whitley, S. Rana, and R. B. Heckendorn, “Island model genetic algorithms and linearly separable problems,” in Proceedings of the AISB Workshop on Evolutionary Computation, 1997, pp. 109–125.
46. X. Xiao, E. Dow, R. Eberhart, Z. B. Miled, and R. J. Oppelt, “Gene clustering using self-organizing maps and particle swarm optimization,” Second IEEE International Workshop on High Performance Computational Biology (HiCOMB 2003), 2003, p. 154.