A Hardware Engine for Genetic Algorithms

Technical Report UNL-CSE-97-001
University of Nebraska-Lincoln
July 4, 1997

Stephen D. Scott, Sharad Seth, and Ashok Samal

Stephen D. Scott is with the Department of Computer Science at Washington University in St. Louis. He was at the University of Nebraska-Lincoln when this work was done. E-mail: [email protected]. Sharad Seth and Ashok Samal are with the Department of Computer Science and Engineering at the University of Nebraska-Lincoln. E-mail: [email protected] and [email protected].

Abstract

A genetic algorithm (GA) is an optimization method based on natural selection. Genetic algorithms have been applied to many hard optimization problems including VLSI layout optimization, boolean satisfiability, power system control, fault detection, control systems, and signal processing. GAs have been recognized as a robust general-purpose optimization technique. But application of GAs to increasingly complex problems can overwhelm software implementations of GAs, causing unacceptable delays in the optimization process. This is true of any non-trivial application of GAs if the search space is large or if real-time performance is necessary. It follows that a hardware implementation of a GA is desirable for application to problems too complex for software-based GAs. Hardware's speed advantage and its ability to parallelize offer great rewards to genetic algorithms. Speedups of 1–2 orders of magnitude have been observed when frequently used software routines were implemented in hardware by way of field-programmable gate arrays (FPGAs). Since most of the GA's operations are simple, a hardware implementation is feasible. Reprogrammability is essential in a general-purpose GA engine because certain GA modules require changeability (e.g. the function to be optimized by the GA). In hardware, reprogrammability is possible with FPGAs. Thus an FPGA-based GA is both feasible and desirable. A fully functional self-contained hardware-based genetic algorithm (the HGA) is presented here as a proof-of-concept system. It was designed using VHDL to allow for easy scalability. It is designed to act as a coprocessor with the CPU of a PC. The user programs the FPGAs which implement the function to be optimized. Other GA parameters may also be specified by the user. An analysis of the design is given that identifies the bottleneck of the HGA's pipeline under varying conditions. Simulation results of the HGA are also presented. A prototype HGA is described and compared to a similar GA implemented in software. In our tests, the prototype and simulations took two to five percent as many clock cycles to run as the software-based GA. Suggested design improvements could dramatically increase the HGA's speed even further. Finally, we give other potential applications of the HGA which are feasible with current FPGA technology.

Keywords: Parallel Genetic Algorithms, Function Optimization, Field Programmable Gate Arrays (FPGAs), Performance Acceleration, Performance Evaluation.

I. Introduction

A genetic algorithm (GA) is an optimization method based on natural selection that is simple to implement (see the Appendix for an introduction to GAs). Genetic algorithms have been applied to many hard optimization problems including VLSI layout optimization, boolean satisfiability and the Hamiltonian circuit problem [1], [2], [3]. They have also been applied to many engineering and manufacturing problems, including aerodynamic optimization, power system control, fault detection, control systems, and signal processing [4]. GAs have been recognized as a robust general-purpose optimization technique. But application of GAs to increasingly complex problems can overwhelm software implementations of GAs, causing unacceptable delays in the optimization process. This is true of any non-trivial application of GAs if the search space is large. This especially holds true for real-time applications, e.g. disk scheduling [5] and image registration [6]. It follows that a hardware implementation of a GA is desirable for application to problems too complex for software-based GAs.

Hardware's speed advantage and its ability to parallelize offer great rewards to genetic algorithms. Speedups of 1–2 orders of magnitude have been observed when frequently used software routines were implemented in hardware by way of field-programmable gate arrays (FPGAs) [7]–[15]. Characteristically, these systems identify simple operations that are bottlenecks in software and map them to hardware. This is the approach we take in this work, except that we map the entire GA to hardware. Since most of the GA's operations are simple, a hardware implementation is feasible. This was not practical, however, before the advent of reprogrammable FPGAs, e.g. those from Xilinx [16] that are programmed via a bit pattern stored in a static RAM, because a general-purpose GA engine requires certain parts of its design, notably the function to be optimized, to be easily changed. Thus reprogrammable FPGAs are essential to the development of the HGA system. Typically the function to be optimized is the only component that requires changing in a general-purpose GA. The other operations are generic and can be implemented in an ASIC or other non-reprogrammable medium if desired.

Our empirical analyses of software-based GAs indicate that in basic GAs, a small number of simple operations and the function to be optimized are executed frequently during the run. Neglecting I/O, these operations accounted for 80–90% of the total execution time. If m is the population size (number of members manipulated by the GA in one iteration) and g is the number of generations (GA iterations), a typical GA would execute each of its operations mg

times. For complex problems, large values of m and g are required, so it is imperative to make the operations as efficient as possible. Work by Spears and De Jong [2] indicates that for NP-complete problems, m = 100 and values of g on the order of 10^4 to 10^5 may be necessary to obtain a good result and avoid premature convergence to a local optimum. Pipelining and parallelization can help provide the desired efficiency, and these are easily done in hardware.

This work describes the HGA, a self-contained implementation of a hardware-based genetic algorithm. Because of the reprogrammability of FPGAs, the HGA is a general-purpose GA engine which is useful in many applications where conventional GA implementations are too slow. The HGA works as a coprocessor with the CPU of a PC and gives its user the ability to specify many of the GA parameters via a software interface. It was designed in VHDL to facilitate scaling. The goals of our work were to: (a) propose an architecture for a general-purpose GA engine that can employ a combination of pipelining and parallelization to achieve speedups, (b) study the design analytically and empirically to identify the bottlenecks, (c) demonstrate the feasibility of a GA engine by developing a prototype HGA coprocessor, and (d) collect accurate simulation data and demonstrate its usefulness in comparing the performance of the HGA versus a software-based GA. Many implementation details of this work are omitted here. The interested reader can find these in the first author's master's thesis [17]. VHDL source code for the design is also available [17], [18].

The remainder of this paper is organized as follows. Section II presents the HGA design. In Section III we study the design analytically and empirically to identify its bottlenecks while deriving equations to predict the system's performance. A prototype and simulations on a complex optimization problem are presented in Section IV, along with a comparison to a software-based GA on the same problems. We then give other feasible applications of the HGA in Section V. Design improvements and extensions appear in Section VI. In Section VII we compare related work in hardware GAs and in reconfigurable hardware systems to the HGA. Finally, we conclude in Section VIII. For background on GAs, the reader is referred to the Appendix.

II. The HGA System Design

Conceptually, the HGA fits into a general computing environment in the following way. The front end of the HGA system consists of a simple interface program running on a personal

computer or workstation. This interface gets the GA parameters (Section II-A) interactively from the user and writes them into a memory which is shared with the back end, consisting of the HGA hardware. Additionally, the user specifies the fitness function in some programming or other specification language (e.g. C or VHDL). Then software translates the specification into a hardware image and programs the FPGA(s) which implement the fitness function. This software-to-hardware translator could be similar in function to the PRISM II compiler [12], [13], the spC compiler [19], Flamel [20], Cyber [21], or an HDL synthesizer such as IBM's HIS system [22] or AutoLogic from Mentor Graphics [23]. Then the front end sends a "Go" signal to the back end. When the HGA back end detects the signal, it runs the GA based on the parameters already in the shared memory. When done, the back end sends a "Done" signal to the front end. The front end detects this signal and reads the final population from the shared memory. The population could then be written to a file for the user to view or have other computations performed on it.

Currently the only termination condition that the user can specify is the number of generations to run. If other termination conditions are desired (e.g. amount of population diversity, minimum average fitness), the user must tell the HGA to run for a fixed number of generations and then check the resultant population to see if it satisfies the termination criteria. If not, then that population can be retained in the HGA's memory for another run. This process repeats until the termination criteria are satisfied.

Note that while the front end consisted of a user interface in our experiments, the front end could in fact be any interface between the HGA and a (possibly autonomous) system that occasionally requires the use of a GA. This system would be in charge of selecting its own HGA parameters and initial population, giving them to the front end for writing to the shared memory, and invoking the HGA. The system could even be given the ability to program the FPGAs containing the fitness function.

The HGA hardware was designed using VHDL. This allowed the design to be specified behaviorally rather than structurally. It also allowed general (parameter-independent) designs to be created, facilitating scaling. The specific designs implemented from the general designs depend upon designer-specified parameters provided at implementation time, e.g. the maximum width of the population members. When the parameters are specified, the design can be implemented with a VHDL synthesizer such as AutoLogic from Mentor Graphics.
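To make the front-end protocol concrete, the sketch below shows in C how a host program might drive one HGA run through the shared memory. It is a minimal illustration under assumed names: the field layout (go, done, population, and so on) is hypothetical, since the actual placement of parameters in the shared memory is fixed by the MIC implementation.

    #include <stdint.h>

    /* Hypothetical layout of the shared parameter block; the real
       offsets are determined by the MIC, not by this sketch. */
    typedef struct {
        volatile uint32_t go;       /* set by front end to start the HGA   */
        volatile uint32_t done;     /* set by HGA back end upon completion */
        uint32_t pop_size;          /* m: population size                  */
        uint32_t num_generations;   /* g: generations to run               */
        uint32_t rng_seed;          /* seed for the CA-based RNG           */
        uint32_t p_crossover;       /* crossover probability (fixed point) */
        uint32_t p_mutation;        /* mutation probability (fixed point)  */
        uint32_t population[];      /* initial/final population members    */
    } hga_shared_t;

    /* One HGA invocation: write parameters, raise "Go", poll "Done". */
    void hga_run(hga_shared_t *shm, const uint32_t *init_pop,
                 uint32_t m, uint32_t g, uint32_t seed)
    {
        shm->pop_size = m;
        shm->num_generations = g;
        shm->rng_seed = seed;
        for (uint32_t i = 0; i < m; i++)
            shm->population[i] = init_pop[i];

        shm->done = 0;
        shm->go = 1;                /* "Go" signal to the back end */
        while (!shm->done)          /* busy-wait for "Done"        */
            ;                       /* a real driver might sleep   */
        /* shm->population now holds the final population. */
    }

In the prototype of Section IV-A, such a shared block would correspond to the static RAM on the BORG board, accessed by the CPU over the PC bus.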

A. The User-Controlled Parameters

Our goal was to create a VHDL implementation of a general genetic algorithm which would allow its user (in this paper, a "user" can be a person using a software front end or any system that utilizes a GA) to choose the values of several GA parameters. The user-controlled parameters are the initial population's size and its members, the number of generations in the HGA run, the initial seed for the pseudorandom number generator, and the mutation and crossover probabilities. Values for these parameters are selected by the user and given to the front end, which sends the appropriate signals to initialize and start the HGA.

B. The Architecture

The HGA (Fig. 1) was designed to satisfy several criteria. First, the components should be simple and easily scalable (Section II-E). This allows us to arbitrarily scale the design (e.g. allow for larger population members) within the constraints imposed by the hardware that the HGA is implemented on. Second, we should be able to easily parallelize the design's components as long as there is sufficient chip area (Section II-D). Thus the design should be modular and the components should have simple, scalable interfaces. Third, due to the modules' simplicity, some of them can exploit concurrency within themselves. To this end, the selection module (Section II-C.5) and the crossover/mutation module (Section II-C.6) operate on two population members simultaneously. Finally, we want the design to exploit the advantages of pipelining (Section II-D). Note that in Fig. 1, when a module completes a task it can immediately await more input to repeat processing. Due to pipelining, GA operations do not have to be suspended while other GA operations run.

The modules in Fig. 1 are patterned after the GA operators defined in Goldberg's simple genetic algorithm (SGA) [1]. The HGA modules operate concurrently with each other and together form a coarse-grained pipeline. All modules are written in VHDL and are independent of the operating environment and implementation technology (e.g. Xilinx FPGAs or fabricated chips) except for the memory interface and control module, whose functionality varies according to the physical memory attached to it and the HGA's interface. The basic functionality of the HGA design of Fig. 1 is as follows (the figure shows only the data path; control lines are omitted).

Fig. 1. The data path of the HGA architecture.

1. After all the parameters have been loaded into the shared memory, the memory interface and control module (MIC) receives a "Go" signal from the front end via the shared memory. The MIC acts as the main control unit of the HGA during start-up and shut-down and is the HGA's sole interface to the outside world. After start-up and before shut-down, control is distributed; all modules operate autonomously and asynchronously.
2. The MIC notifies the fitness module (FM), the crossover/mutation module (CMM), the pseudorandom number generator (RNG) and the population sequencer (PS) that the HGA is to begin execution. Each of these modules requests its required parameters from the MIC, which fetches them from predetermined locations in the shared memory.
3. The PS starts the pipeline by requesting population members from the MIC and passing them along to the selection module (SM).
4. The task of the SM is to receive new members from the PS and judge them until a pair of sufficiently fit members is found. At that time it passes the pair to the CMM, resets itself, and restarts the selection process.
5. When the CMM receives a selected pair of members from the SM, it decides whether to perform crossover and mutation based on random values sent from the RNG. When done, the new members are sent to the FM for evaluation.
6. The FM evaluates the two new members from the CMM and writes the new members to memory through the MIC. The FM also maintains information about the current state of the HGA that is used by the SM to select new members and by the FM to determine when the HGA is finished.

7. The above steps continue until the FM determines that the current HGA run is finished. It then notifies the MIC of completion, which in turn shuts down the HGA modules and sends the "Done" signal to the front end.

C. Component Modules

In this section we describe in more detail the exact functionality of each module from Fig. 1. The VHDL code describing each module is available elsewhere [17], [18].

C.1 Pseudorandom Number Generator (RNG)

The pseudorandom number generator (RNG) is a key component of the HGA system. Its output is used by two HGA modules. The RNG supplies pseudorandom bit strings to the selection module for scaling down the sum of fitnesses (Section II-C.5). This scaled sum is used when selecting pairs of members from the population. The RNG also supplies pseudorandom bit strings to the crossover/mutation module for determining whether to perform crossover, mutation, both, or neither (Section II-C.6). Other pseudorandom bit strings are given to the CMM to choose the crossover and mutation points.

The RNG generates a sequence of pseudorandom bit strings based on the theory of linear cellular automata (CA). CA were shown by Hortensius et al. [24] to generate better random sequences than linear feedback shift registers (LFSRs), which are commonly used as pseudorandom number generators. The CA used in the RNG consists of 16 alternating cells which change their states according to rules 90 and 150 as described by Wolfram [25]:

    Rule 90:  s_i^+ = s_{i-1} ⊕ s_{i+1}
    Rule 150: s_i^+ = s_{i-1} ⊕ s_i ⊕ s_{i+1}

Here s_i is the current state of site (cell) i in the linear array, s_i^+ is the next state for s_i, and ⊕ is the exclusive-OR operator. Thus a rule-90 cell updates itself based only on the inputs from its neighbors, while a rule-150 cell also considers its own state when updating. Serra et al. [26] showed that a 16-cell CA whose cells are updated by the rule sequence 150-150-90-150-...-90-150 produces a maximum-length cycle, i.e. it cycles through all 2^16 bit patterns except the all-0s pattern. It has also been shown that such a rule sequence has more randomness than an LFSR of corresponding length [24]. This scheme is implemented in the RNG.
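The CA update itself is just the two XOR rules applied cell by cell. The following C model illustrates one synchronous step of a 16-cell hybrid 90/150 CA with null boundary conditions; the particular rule150_mask shown is an illustrative placeholder only, not the maximum-length rule sequence of Serra et al. [26].

    #include <stdint.h>
    #include <stdio.h>

    /* One synchronous CA step with null boundary conditions.
     * Bit i of state is cell s_i.  If bit i of rule150_mask is set,
     * cell i uses rule 150 (s_i^+ = s_{i-1} XOR s_i XOR s_{i+1});
     * otherwise it uses rule 90 (s_i^+ = s_{i-1} XOR s_{i+1}). */
    static uint16_t ca_step(uint16_t state, uint16_t rule150_mask)
    {
        uint16_t left  = state << 1;   /* s_{i-1}; zero at the boundary */
        uint16_t right = state >> 1;   /* s_{i+1}; zero at the boundary */
        return left ^ right ^ (state & rule150_mask);
    }

    int main(void)
    {
        uint16_t rule150_mask = 0xB6DB; /* hypothetical 90/150 assignment */
        uint16_t state = 0x0001;        /* any nonzero seed */
        for (int i = 0; i < 8; i++) {
            state = ca_step(state, rule150_mask);
            printf("%04X\n", (unsigned)state); /* one 16-bit random word */
        }
        return 0;
    }

With a maximum-length rule assignment, the state visits every one of the 2^16 - 1 nonzero patterns before repeating, matching the behavior described above.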

C.2 Shared Memory

The shared memory is actually external to the HGA system, but is presented here for completeness. It is assumed that some memory is available to the HGA system and that its specifications are known to the memory interface and control module (MIC). The memory is shared by the back end and front end and acts as the communication medium between them. Before the HGA run, the front end writes the GA parameters specified in Section II-A into the memory and signals the MIC via the shared memory. After receiving the signal, the MIC distributes the parameters to the appropriate modules. During the HGA run, the population members are read from and written to the memory by the MIC. When the HGA run is finished, the memory holds the final population, which is then read by the front end.

C.3 Memory Interface and Control Module (MIC)

The memory interface and control module (MIC) is the only module in the HGA system which has knowledge of the HGA's environment. It provides a transparent interface to the memory for the rest of the system. At HGA startup time, the MIC instructs the other modules to initialize. While initializing, the other modules send requests to the MIC for user-specified HGA parameters. These requests involve a simple handshaking protocol initiated by the requesting modules providing a coded address to the MIC. The MIC then translates this coded address to a physical address for the memory it is accessing. The parameter received from memory is then passed along to the requesting module. When notified by the fitness module that the GA run is complete, the MIC shuts down the system and notifies the front end of completion.

More coded address translation is done by the MIC when interacting with the population sequencer (PS) and the fitness module (FM). At any time t, the HGA system keeps track of the current population P_t and the next population P_{t+1}. The PS reads from P_t and the FM writes to P_{t+1}. In memory, P_t and P_{t+1} are stored in distinct memory blocks, labeled B_0 and B_1. The mapping h : {P_t, P_{t+1}} -> {B_0, B_1} alternates between generations, e.g. if h(P_t) = B_0 in the current generation, then h(P_t) = B_1 in the next generation. At any given time the FM knows which of P_t and P_{t+1} is located in B_0 and B_1 and makes this information available to the MIC via a signal that is toggled after every generation.

To make the HGA independent of its operating environment (and thus portable to other environments), the PS and FM only know the indexes of the population members they want to read or write instead of the physical memory addresses. Therefore, it is up to the MIC to translate each index from the PS and FM to the correct address by adding the appropriate base address (i.e. B_0's base address or B_1's base address) to the product of the index and the number of bytes per population member.

C.4 Population Sequencer (PS)

The job of the population sequencer (PS) is simply to cycle through the current population and pass the members on to the selection module(s). The roulette wheel selection process used by the SGA [1] (Section II-C.5) is independent of the order in which the population members are examined, so a blind cycle through the population works as well as the SGA implementation. The PS sends the index of a population member to the MIC, awaits reception of the member, and passes this member to the selection module(s). The index is then incremented modulo the population size so the next population member will be requested from the MIC. This process continues until the GA run is complete and the MIC shuts down all the modules.

C.5 Selection Module (SM)

The HGA's selection method is similar to the implementation of roulette wheel selection found in the SGA [1]. The SGA's selection procedure is as follows.

1. Using a random real number r in [0, 1], scale down the sum of the fitnesses of the current population to get S_scale = r · S_fit.
2. Starting at the first population member, examine the members in the order they appear in the population.
3. Each time a new member is examined, accumulate its fitness in a running sum of fitnesses S_R. If at that time S_R >= S_scale, then the member under examination is selected. Otherwise the next population member is examined (Step 2).

Each time a new population member is to be selected, the above process is executed.

The selection module (SM) implements the roulette wheel selection process used by the SGA, but it selects a pair of population members A and B simultaneously rather than a single member at a time. First it receives the sum of the fitnesses of the current population from the fitness module (FM). It then scales down this sum by two random values provided by the pseudorandom number generator. These two scaled sums S_A and S_B are stored for future use. Upon receipt of a population member M and its fitness from the population sequencer, M's fitness is accumulated in a running sum of fitnesses S_R. If S_R surpasses S_A at this time, then M is latched as the selected member A. Selected member B is chosen in the same fashion. Since the values S_A and S_B are determined by independent random numbers, selection of A and B are independent, concurrent processes. Once A and B are both selected, they are sent to the crossover/mutation module for further processing. After sending A and B, the SM resets S_R and scales down the sum of fitnesses by two new random values to generate new values for S_A and S_B. When an entire generation has been selected, the FM resets the SM so it can use the new sum of fitnesses in its calculations. This process repeats until the SM is shut down by the MIC at the end of the HGA run.

C.6 Crossover/Mutation Module (CMM)

The crossover/mutation module (CMM) waits for a new pair of members A and B to be sent by the selection module (SM). Two random numbers S_1 and S_2 from the pseudorandom number generator (RNG) and a user-specified parameter p_c are used by the CMM for crossover. S_1 and p_c are numbers from [0, 1] and S_2 is from [0, n-1], where n is the number of bits per population member. If S_1 < p_c, crossover between A and B is performed using S_2 as the crossover point, yielding two new members A' and B'. If S_1 >= p_c then crossover is not performed and A' = A and B' = B. After crossover, mutation is performed. For mutation, the CMM uses S_3 in [0, 1] and S_4 in [0, n-1] from the RNG and the user-specified parameter p_m in [0, 1]. If S_3 < p_m then mutation is performed on the bit in A' indexed by S_4. Here the HGA differs from the SGA in that the SGA makes a decision about mutating each bit in both A' and B', effectively increasing the mutation probability. The HGA makes only one mutation decision per pair of strings A' and B' and chooses the mutation point at random. This implementation decision was based on simplifying the design. Because of this implementation, the crossover and mutation steps together take only one clock cycle and require logic consisting essentially of arrays of multiplexers and inverters.

C.7 Fitness Module (FM)

The purpose of the fitness module (FM) is to evaluate the population members received from the crossover/mutation module and insert them into the new population. Because of the hardware implementation, the evaluation step could take as little as one clock cycle per member, or, if sufficient space is available on the chip, the FM could evaluate two members in parallel. This is potentially much more efficient than any software implementation. Because fitness evaluation is often the bottleneck of a GA, this is a significant advantage of implementing a GA in hardware.

When the FM receives a pair of members from the crossover/mutation module (CMM), it evaluates their fitnesses and then writes the new members and their fitnesses to the appropriate memory location with the help of the memory interface and control module (MIC). The FM then waits for the CMM to send other members. The FM also maintains a running sum of the fitness values for reporting to the SM after each generation. The SM will use that sum for the selection process (Section II-C.5). Additionally, the fitness module maintains a record of how many generations remain and how many members still need to be chosen in the current generation. When the last generation is complete, the FM notifies the memory interface and control module of HGA completion.

Since the FM does more than just evaluate the new population members, it may be desirable to implement another module to do only fitness evaluation in case there is insufficient space for the actual fitness function on the FM's chip. Thus the FM supports the ability to attach to it an optional external fitness evaluator (FE) which can perform the fitness evaluation of each member. Upon receiving a new member, the FE is expected to evaluate this member and then make the fitness available to the FM. If no FE is attached, the FM evaluates the members with an internal default fitness function. Use of an external FE allows all the other HGA modules (including the FM) to be implemented in a non-reprogrammable technology such as fabricated chips, to reap a space and time savings over FPGAs. Only the FE need be implemented on reprogrammable FPGAs. Additionally, since the implementation of the FE is independent of the FM's implementation, the FE could be implemented in software if the fitness function is much too complex for an FPGA implementation. All that is required is that the software-based FE adhere to the communication protocol expected by the FM. Naturally, in this case the speedup of hardware over software will be limited by how much of a bottleneck the fitness evaluation process is.
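A compact software model ties the SM and CMM behavior together. The C sketch below mimics pair selection against two independently scaled sums, one-point crossover with probability p_c, and the HGA's single mutation decision per pair with probability p_m. It substitutes a generic rand01() source for the CA-based RNG and is only a behavioral sketch, not the VHDL design.

    #include <stdlib.h>
    #include <stdint.h>

    #define N_BITS 32                  /* bits per population member */

    typedef struct { uint32_t bits; unsigned fitness; } member_t;

    static double rand01(void) { return rand() / (RAND_MAX + 1.0); }

    /* Roulette-wheel selection of a pair, mimicking the SM: scaled
     * sums S_A and S_B are compared against one running sum S_R as
     * the population streams past (as it would from the PS). */
    static void select_pair(const member_t *pop, unsigned m,
                            unsigned sum_fit, member_t *a, member_t *b)
    {
        double s_a = rand01() * sum_fit, s_b = rand01() * sum_fit;
        unsigned s_r = 0;
        int got_a = 0, got_b = 0;
        for (unsigned i = 0; !(got_a && got_b); i = (i + 1) % m) {
            s_r += pop[i].fitness;           /* running fitness sum */
            if (!got_a && s_r >= s_a) { *a = pop[i]; got_a = 1; }
            if (!got_b && s_r >= s_b) { *b = pop[i]; got_b = 1; }
        }
    }

    /* One-point crossover and single-bit mutation, mimicking the CMM. */
    static void cross_mutate(member_t *a, member_t *b, double pc, double pm)
    {
        if (rand01() < pc) {                       /* crossover?      */
            unsigned pt = rand() % N_BITS;         /* crossover point */
            uint32_t mask = (pt == 0) ? 0 : (0xFFFFFFFFu >> (N_BITS - pt));
            uint32_t t = (a->bits & mask) | (b->bits & ~mask);
            b->bits   = (b->bits & mask) | (a->bits & ~mask);
            a->bits   = t;
        }
        if (rand01() < pm)                         /* one decision per pair */
            a->bits ^= 1u << (rand() % N_BITS);    /* flip one random bit   */
    }

In hardware the crossover is just the bit-mask multiplexing visible in cross_mutate, and the mutation is a single inverter selected by S_4, which is why the CMM needs only one clock cycle.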

D. Pipelining and Parallelization

The design in Fig. 1 is a coarse-grained pipeline. This is evident by noting that when a module completes a task as described in Section II-C, it immediately awaits more input to repeat processing. Because of this pipelining, GA operations do not have to be suspended while other GA operations run, as happens in a sequential software implementation. Thus we realize a speedup over a software GA. Parallelization of HGA modules is also possible. For example, multiple SMs can be inserted, all of which feed into a single CMM (Fig. 2). This can help expedite the GA run by easing the pipeline bottleneck when it exists in the PS (Section III). When the bottleneck lies in the FM, multiple FMs can be utilized.

Fig. 2. Example of parallel selection modules.

Due to area constraints of our FPGAs, only the configuration in Fig. 1 was implemented in a prototype (Section IV-A). But both the configurations of Figs. 1 and 2 were created in VHDL, analyzed and simulated (Sections III and IV-B). Other parallelization schemes are mentioned in Section VI-B.

E. Parameterization and Scalability

Since the modules of the HGA system were written entirely in VHDL, specific aspects of the design such as I/O bus size and memory size can be parameterized. Among the interesting parameters are the maximum width in bits of the population members, the maximum width in bits of the fitness values, the maximum size of the population, and the maximum number of generations. When module parallelization is involved (Section II-D), the parameter n_sel indicates the number of parallel selection modules and n_fit indicates the number of parallel

fitness modules. These parameters are specified at VHDL compile time and should not be confused with the HGA run-time parameters described in Section II-A. The HGA's parameterization affords it a scalability that is limited only by chip area, pin counts, and the complexity and size requirements of the fitness function. Chip area limitations are mitigated by steadily increasing FPGA densities, and we address pin count limitations in Section VI-B.1. Also, Sections IV-B and V give several fitness functions that are sufficiently simple and compact for an FPGA implementation given current technology.
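As a loose software analogy, these implementation-time parameters behave like constants fixed before the design is built, as in the C-style sketch below. The names and values are purely illustrative; the actual design expresses them as parameters of the VHDL source.

    /* Hypothetical compile-time parameters, mirroring the VHDL-level
     * constants described above (names and values illustrative only). */
    #define MEMBER_WIDTH   32    /* max bits per population member    */
    #define FITNESS_WIDTH  32    /* max bits per fitness value        */
    #define MAX_POP_SIZE   256   /* max population size m             */
    #define MAX_GENS       4096  /* max number of generations g       */
    #define N_SEL          2     /* parallel selection modules n_sel  */
    #define N_FIT          1     /* parallel fitness modules n_fit    */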

III. Performance Analysis

We analyzed the HGA's performance by studying the modules in the pipelines pictured in Figs. 1 and 2 (with an arbitrary number of SMs) to determine the parameters which impact asynchronous pipeline performance [27]. The parameters used in this analysis are as follows.

1. The actual service time s_i of pipeline stage (module) i is the amount of time stage i takes to receive a message at its inputs, process it and send the output to the next stage.
2. The flow rate F_i of stage i is the number of messages arriving at stage i during the entire run.
3. The normalized service time S_normi of stage i is defined as S_normi = (s_i · F_i)/F_out, where F_out, the flow rate out of the pipeline, acts as a normalizing factor. In this analysis, F_out = mg/2 because mg members are selected in the GA run a pair at a time. The stage with the highest value of S_normi is the pipeline's bottleneck.

One key difference to note between the HGA's pipeline and those in Kenyon et al.'s model [27] is that in ours, not all module output is useful. For example, if all SMs are waiting for acknowledgement from the CMM, none of them is currently selecting members. Thus all outputs from the PS are ignored until an SM starts a new selection process. Also, while the FM processes the last pair of members at the end of a generation, the SMs and CMM continue to run, generating more output. These outputs are ignored since they are based on members from an old population. Thus when we describe the pipeline performance analytically and empirically, we consider only useful computation, i.e. processing whose results are used throughout the rest of the pipeline. This normalizes our results and prevents useless computation from deceptively changing the pipeline's bottleneck.

Another difference between the HGA and Kenyon et al.'s model is that their model assumes all service times are nearly constant. By contrast, the randomized nature of the HGA prevents

exact predictability of the service times and flow rates for some of the modules. Several of these parameters are random variables with unknown distributions and are not necessarily independent. So we instead find upper and lower bounds on these values and attempt to approximate their expected values. When estimating the expectations of these random variables, we assume independence and a uniform distribution. We later use empirical evidence to assess the accuracy of our approximations. All formulas and simulation results are in terms of clock cycles.

In the following sections, g is the number of generations in the HGA run, m is the population size, T is the total number of clock cycles in the HGA run, and n_sel is the number of parallel selection modules in the pipeline. Technology-dependent parameters are r and w, the total number of cycles for a single population member to be read from and written to the memory. Additionally, when discussing random variables in this section, we use superscripts max and min to denote upper and lower bounds, respectively. A superscript avg denotes that variable's expected value. Although not explicitly mentioned, the analyses of the PS (Section III-A) and the FM (Section III-D) involve the MIC. This is because the PS's and the FM's service times partially depend on communication overhead with the MIC and the MIC's time to read from and write to the memory. Thus the MIC can be thought of as partially merged with the PS and partially merged with the FM. Table I summarizes the variables defined and used in this section.

Sections III-A through III-D explain the derivation of the equations for s_i, F_i and S_normi. These equations are summarized in Table II. The equations are compared and evaluated in Sections III-E and III-F. Then the equation evaluations are contrasted with simulation results in Section III-G.

A. Population Sequencer Analysis

The request/acknowledge handshaking protocol between the PS and the MIC requires 4 clock cycles to communicate a request for a new population member and to receive that member. It also requires r cycles for the MIC to read the member from memory. Thus it takes at least s_0^min = 4 + r cycles to fetch a member from memory. If the FM requests access to the MIC to write a new population member to the memory, it will receive priority. No preemption is supported, so if the FM has a lock on the MIC, the PS is blocked. The FM's access of the MIC could block the PS between 1 and 3 + w clock cycles, so s_0^max = 7 + r + w. Note that s_0 > s_0^min at most mg times during the entire HGA run, since the FM will make exactly mg write requests

TABLE I
The variables used in Section III.

Variable     Definition
i            Module number: i = 0 is the PS, 1 the SM, 2 the CMM, 3 the FM
s_i          Actual service time of module i
s_1in        SM's actual service time per input
s_1out       SM's actual service time per output
F_i          Flow rate of module i
F_out        Flow rate out of the pipeline
S_normi      Normalized service time of module i
g            Number of generations
m            Population size
T            Number of clock cycles in the HGA run
n_sel        Number of parallel SMs
n_fit        Number of parallel FMs
r            Number of clock cycles required to read a member from memory
w            Number of clock cycles required to write a member to memory
m_select     Number of members seen by an SM before a pair is selected
t_eval       Number of clock cycles for the FM to evaluate two members and contact the MIC
x^max        Maximum value of random variable x
x^min        Minimum value of random variable x
x^avg        Expected value of random variable x

during the run. The rest of the time s_0 = s_0^min. Assuming that the distribution of FM write requests is uniform over the entire HGA run of duration T, the probability of an FM write request at a given clock cycle is mg/T. Hence at the time of a PS read request (assuming this is also uniformly distributed), the additional delay to s_0^min is 3 + w with probability mg/T, 2 + w with probability mg/T, etc. The weighted average of these possible delays is

    (mg/T) · sum_{i=1}^{3+w} i = (3+w)(4+w)mg / (2T).

Therefore the average service time of the PS is

    s_0^avg ≈ 4 + r + (3+w)(4+w)mg / (2T).

The flow rate of the PS is the number of useful inputs it receives, which equals the number of useful outputs it generates. The selection modules must select a total of mg/2 pairs of members. Define m_select ∈ [1, m] as a random variable indicating the number of members an SM must see before selecting a pair. In the best case, each SM would select a pair of members after seeing only one member from the PS, so F_0 is at least F_0^min = mg/(2·n_sel). In the worst case, each pair selected requires examination of all m population members, so F_0^max = m^2·g. We expect each SM to select about the same number of pairs of members, so using m_select^avg as the average number of members seen before selection yields

    F_0^avg ≈ m_select^avg · mg / (2·n_sel).

The normalized service time for the PS (S_norm0) is the ratio of the product of s_0 and F_0 to the normalizing factor F_out = mg/2, the total flow out of the pipeline. So

    S_norm0^min = (4 + r)/n_sel  ≤  S_norm0  ≤  2m(7 + r + w) = S_norm0^max

and

    S_norm0^avg = s_0^avg · F_0^avg / F_out ≈ (m_select^avg / n_sel) · (4 + r + (3+w)(4+w)mg/(2T)).

B. Selection Module Analysis

Since the input flow differs from the output flow, the service time s_1 for the SM(s) is a weighted average of the service time per input (s_1in) and the service time per output (s_1out). Accumulation of fitnesses and the necessary comparisons to check for selection are easily done in one clock step, so s_1in = 1. To determine s_1out, note that the handshaking and transmission between the SM and the crossover/mutation module (CMM) requires 5 cycles. Add another 3 cycles for reinitialization after selecting a pair, and each output sequence by an SM takes 8 cycles, so s_1out = 8. Since one output message is generated for every m_select input messages, s_1 = (m_select · s_1in + s_1out)/m_select = (m_select + 8)/m_select = 1 + 8/m_select. Since m_select ∈ [1, m], s_1^min = 1 + 8/m and s_1^max = 9. Finally,

    s_1^avg ≈ 1 + 8/m_select^avg.

To determine F_1, note that since we only consider useful inputs and outputs, the number of messages leaving the PS is the same as the number entering the SM. Thus F_1 = F_0, yielding F_1^min = mg/(2·n_sel), F_1^max = m^2·g, and

    F_1^avg ≈ m_select^avg · mg / (2·n_sel).

Since S_norm1 = s_1 · F_1 / F_out,

    S_norm1^min = (1 + 8/m)/n_sel  ≤  S_norm1  ≤  18m = S_norm1^max

and

    S_norm1^avg ≈ (m_select^avg + 8) / n_sel.

C. Crossover/Mutation Module Analysis

The simplest HGA module is the CMM. It receives input, processes it, and transmits it to the FM. This process takes exactly 5 cycles, so s_2 = 5. The CMM's flow rate is always F_2 = mg/2 because it receives and outputs exactly that many useful pairs during the HGA run. Since this equals F_out, S_norm2 = s_2 = 5.

D. Fitness Module Analysis

When the FM receives input in the form of members A and B, the following events occur during processing.

1. Evaluate A and B, accumulate their fitnesses and request access to the MIC: this takes some number of cycles denoted by t_eval.
2. Wait for the MIC to acknowledge the request: this takes some number of cycles between 1 and 3 + r. The potential additional delay is due to the lock that the PS may have on the MIC. In the worst case, the FM requested MIC access immediately after the PS locked the MIC, causing the delay of 3 + r.
3. Receive the MIC's acknowledgment, send A and its fitness, wait for the MIC to write A and tell the FM it is done, and issue the next request: this requires 3 + w cycles.
4. Wait for the MIC to acknowledge the request: this takes 3 + r cycles. The explanation for this is similar to that in Step 2, except that the FM's second request is guaranteed to be blocked by the PS because the PS gets access to the MIC right after the MIC finishes writing the members.
5. Receive the MIC's acknowledgment, send B and its fitness, wait for the MIC to write B and tell the FM it is done: this requires 3 + w cycles.

The number of cycles in Step 2 can vary anywhere between 1 and 3 + r, so s_3^min = t_eval + 1 + (3+w) + (3+r) + (3+w) = t_eval + 10 + 2w + r and s_3^max = t_eval + (3+r) + (3+w) + (3+r) + (3+w) = t_eval + 12 + 2w + 2r. If we assume that all delays in Step 2 are equally probable, then the average wait is (4 + r)/2,

making the expected service time for the fitness module s_3^avg ≈ t_eval + 11 + 2w + 3r/2. As with the CMM, the flow rate for the fitness module is F_3 = mg/2, implying that S_norm3 = s_3. The equations from Sections III-A through III-D are summarized in Table II.

TABLE II
The equations derived in Sections III-A through III-D.

Module  Quantity  Minimum                Maximum                Approx. Average
PS      s_0       4 + r                  7 + r + w              4 + r + (3+w)(4+w)mg/(2T)
        F_0       mg/(2 n_sel)           m^2 g                  m_select^avg mg/(2 n_sel)
        S_norm0   (4 + r)/n_sel          2m(7 + r + w)          (m_select^avg/n_sel)(4 + r + (3+w)(4+w)mg/(2T))
SM      s_1       1 + 8/m                9                      1 + 8/m_select^avg
        F_1       mg/(2 n_sel)           m^2 g                  m_select^avg mg/(2 n_sel)
        S_norm1   (1 + 8/m)/n_sel        18m                    (m_select^avg + 8)/n_sel
CMM     s_2       5                      5                      5
        F_2       mg/2                   mg/2                   mg/2
        S_norm2   5                      5                      5
FM      s_3       t_eval + 10 + 2w + r   t_eval + 12 + 2w + 2r  t_eval + 11 + 2w + 3r/2
        F_3       mg/2                   mg/2                   mg/2
        S_norm3   t_eval + 10 + 2w + r   t_eval + 12 + 2w + 2r  t_eval + 11 + 2w + 3r/2

E. Comparison of Equations

By comparing the equations of Table II, we see that the CMM will never be the bottleneck, since S_norm3^min > S_norm2^max. We also see that S_norm0^avg is at least m_select^avg/n_sel times 4 + r, but S_norm1^avg is only m_select^avg/n_sel plus 8/n_sel. So typically S_norm0^avg > S_norm1^avg, eliminating the SMs as the bottleneck. Thus the bottleneck's location depends on the value of t_eval: if t_eval is sufficiently large that S_norm3^avg > S_norm0^avg, we expect the FM to be the bottleneck; otherwise the bottleneck lies in the PS. In the next section we evaluate S_normi^avg for different values of the parameters.

F. Evaluation of Analysis Functions

We now evaluate each formula with different values of m, g, n_sel, and t_eval and compare the results. We will concentrate on S_normi^avg since the minimum and maximum values vary greatly. But before evaluating the equations, we must first estimate T^avg, the average total run time of

the HGA. Its value depends on where the system's bottleneck lies.

Case 1: The bottleneck lies in the PS. (The same arguments hold if the bottleneck lies in the SMs.) Then T^avg is roughly the amount of time required to select mg/2 pairs of members and give each pair to the CMM. Selection of each pair takes m_select^avg · s_0^avg cycles to fetch and examine enough members, and s_1out^avg cycles to send the selected pair to the CMM. Since the bottleneck lies in the PS, we ignore the time for the CMM and FM to process the last pair, clearing the pipeline. But we do add the amount of time the PS and SMs waste generating useless output. This happens at the end of each generation when the FM resets the SMs; the work done between the last successful selection and the reset is lost. The amount of time dedicated to this work is approximately s_2^avg + s_3^avg. So

    T^avg ≈ (m_select^avg · s_0^avg + s_1out^avg) mg / (2·n_sel) + g(s_2^avg + s_3^avg)
          = (m_select^avg · s_0^avg + 8) mg / (2·n_sel) + g(5 + s_3^avg)        (1)

assuming each SM selects about the same number of members. Substituting r = 2 and w = 3 (the values used in our simulations) yields

    2 T^avg n_sel ≈ 6mg · m_select^avg + 21 m^2 g^2 m_select^avg / T^avg + 8mg + 2g · n_sel · (5 + s_3^avg).

Multiplying through by T^avg and applying the quadratic formula gives (ignoring negative roots)

    T^avg ≈ (b_1 + sqrt(168 · n_sel · m^2 g^2 m_select^avg + b_1^2)) / (4·n_sel)        (2)

where b_1 = 8mg + 6mg · m_select^avg + 2g · n_sel · (5 + s_3^avg). If we assume m_select is uniformly distributed over [1, m], then m_select^avg = (m + 1)/2. However, as we will see in the simulations, this assumption is poor and yields gross underestimations of T^avg. But even though m_select^avg may be difficult to determine, it will likely grow linearly with m, causing T^avg to grow quadratically with m and linearly with g. Also note that T^avg decreases approximately linearly with n_sel.

Case 2: The bottleneck lies in the FM. Then T is the amount of time required to perform crossover and to evaluate and store mg/2 pairs of members. Also, as in Case 1, there are g resets of the SMs, one per generation. At these times the FM sits idle waiting for the SMs to generate another pair and send it down the pipeline through the CMM. This extra delay is approximately m_select^avg · s_0^avg + s_1out^avg + s_2^avg, giving

    T^avg ≈ (mg/2) s_3^avg + g(m_select^avg · s_0^avg + s_1out^avg + s_2^avg)
          = (mg/2) s_3^avg + g(m_select^avg · s_0^avg + 13).        (3)

Again, substituting r = 2 and w = 3 yields

    T^avg ≈ (mg/2) s_3^avg + 6g · m_select^avg + 21 m g^2 m_select^avg / T^avg + 13g.

Applying the quadratic formula gives (ignoring negative roots)

    T^avg ≈ (b_2 + sqrt(84 m g^2 m_select^avg + b_2^2)) / 2        (4)

where b_2 = (mg/2) s_3^avg + 6g · m_select^avg + 13g. Note here that there is still a linear dependence on g, but the dependence on m is now only linear. Notice also that changes in t_eval have a greater impact on T^avg than in Case 1, because here the effect grows with m and g whereas in Case 1 the effect only grew with g (we assume n_sel << m, g). This makes sense since the bottleneck is now in the FM. Finally, note that T^avg does not decrease linearly with n_sel as in Case 1. This is also logical since the bottleneck is in the FM.

We first evaluated the equations with t_eval = 3, which is the minimum number of cycles required to evaluate both members and accumulate their fitnesses if only one evaluator and one accumulator are available. Here we expect the bottleneck to lie in the PS, so we use Equation 2 to determine T. Using this equation for T^avg, we determined s_i and S_normi for m in {32, 64}, g in {256, 512, 1024} and n_sel in {1, 2, 3}. The results are in Table III. They suggest that the PS is indeed the bottleneck for these parameter values. We also evaluated the equations using Equation 4 for T (results are not shown). In these evaluations, S_norm1, S_norm2 and S_norm3 are unchanged and S_norm0 is increased, confirming that the bottleneck lies in the PS. The results of Table III also indicate that S_norm0 increases nearly linearly with m and decreases nearly linearly with n_sel. By the equations, increasing n_sel to 5 or 6 should move the bottleneck to the FM if m = 32.

We then evaluated the equations with t_eval sufficiently large to move the bottleneck to the FM for all values of m, g and n_sel used in Table III. The results of Table III imply that any t_eval > 181 would suffice. However, as mentioned before, our estimate of m_select^avg is too low, so t_eval should be larger. Based on our simulations with t_eval = 3 (Section III-G), we decided that t_eval = 272 is sufficiently large to move the bottleneck to the FM. Now we use Equation 4 to determine T. Using this equation for T, we determined s_i and S_normi for the same values of m, g and n_sel as before. The results are in Table IV and suggest that the FM is indeed the bottleneck for these parameter values. Not shown are the same evaluations using Equation 2 for T.
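The numbers in Tables III and IV can be reproduced directly from Equations 2 and 4 and the averages in Table II. The C sketch below does so for one parameter setting, under the same assumptions used in the text: r = 2, w = 3, s_1out^avg = 8 and m_select^avg = (m + 1)/2.

    #include <math.h>
    #include <stdio.h>

    /* Equation 2: T^avg when the PS is the bottleneck (r = 2, w = 3). */
    static double t_avg_ps(double m, double g, double nsel, double s3avg)
    {
        double msel = (m + 1) / 2;        /* assumed m_select^avg */
        double b1 = 8*m*g + 6*m*g*msel + 2*g*nsel*(5 + s3avg);
        return (b1 + sqrt(168*nsel*m*m*g*g*msel + b1*b1)) / (4*nsel);
    }

    /* Equation 4: T^avg when the FM is the bottleneck (r = 2, w = 3). */
    static double t_avg_fm(double m, double g, double s3avg)
    {
        double msel = (m + 1) / 2;
        double b2 = (m*g/2)*s3avg + 6*g*msel + 13*g;
        return (b2 + sqrt(84*m*g*g*msel + b2*b2)) / 2;
    }

    int main(void)
    {
        /* One row of Table III: m = 32, g = 256, nsel = 1, t_eval = 3,
           so s_3^avg = t_eval + 11 + 2w + 3r/2 = 23. */
        double m = 32, g = 256, nsel = 1, s3avg = 23;
        double T = t_avg_ps(m, g, nsel, s3avg);
        double msel = (m + 1) / 2;
        double s0avg = 6 + 42*m*g/(2*T);  /* (3+w)(4+w) = 42 */
        printf("T = %.0f, s0 = %.3f, Snorm0 = %.2f, Snorm1 = %.2f\n",
               T, s0avg, (msel/nsel)*s0avg, (msel + 8)/nsel);
        /* Prints values matching Table III, row 1:
           T = 470169, s0 = 6.366, Snorm0 = 105.04, Snorm1 = 24.50. */
        (void)t_avg_fm;                   /* Equation 4, used for Table IV */
        return 0;
    }

Running t_avg_fm with s_3^avg = 292 (t_eval = 272) reproduces the Table IV run times, e.g. T = 1225297 for m = 32 and g = 256.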


TABLE III
Bottleneck analysis function evaluations of the HGA tests for t_eval = 3. The value of T comes from Equation 2 and m_select^avg = (m + 1)/2.

 m     g   n_sel       T     s_0   S_norm0    s_1  S_norm1  s_2  S_norm2  s_3  S_norm3
 32   256    1     470169   6.366   105.04  1.485    24.50    5        5   23       23
 64   256    1    1723300   6.200   201.49  1.246    40.50    5        5   23       23
 32   512    1     940337   6.366   105.04  1.485    24.50    5        5   23       23
 64   512    1    3446600   6.200   201.49  1.246    40.50    5        5   23       23
 32  1024    1    1880674   6.366   105.04  1.485    24.50    5        5   23       23
 64  1024    1    6893200   6.200   201.49  1.246    40.50    5        5   23       23
 32   256    2     249595   6.689    55.19  1.485    12.25    5        5   23       23
 64   256    2     890112   6.387   103.78  1.246    20.25    5        5   23       23
 32   512    2     499190   6.689    55.19  1.485    12.25    5        5   23       23
 64   512    2    1780224   6.387   103.78  1.246    20.25    5        5   23       23
 32  1024    2     998380   6.689    55.19  1.485    12.25    5        5   23       23
 64  1024    2    3560449   6.387   103.78  1.246    20.25    5        5   23       23
 32   256    3     175359   6.981    38.4   1.485     8.17    5        5   23       23
 64   256    3     611433   6.563    71.1   1.246    13.5     5        5   23       23
 32   512    3     350718   6.981    38.4   1.485     8.17    5        5   23       23
 64   512    3    1222865   6.563    71.1   1.246    13.5     5        5   23       23
 32  1024    3     701437   6.981    38.4   1.485     8.17    5        5   23       23
 64  1024    3    2445731   6.563    71.1   1.246    13.5     5        5   23       23

In these evaluations, S_norm1, S_norm2 and S_norm3 are unchanged and S_norm0 remains less than S_norm3, confirming that the bottleneck lies in the FM. The results of Table IV also indicate that S_norm0 is affected by m and n_sel as in Table III.

G. Simulation Results

We simulated the HGA on the same values of m, g, n_sel and t_eval used in Section III-F. In these simulations, the FM evaluated each population member with the fitness function f(x) = 2x. After evaluating each pair of members and accumulating their fitnesses, the FM sat idle for some time before requesting access to the MIC. This delay simulated a total time of t_eval to evaluate both members, accumulate their fitnesses, and issue a write request to the MIC. During the simulations, statistics on the number of inputs and the total service times were generated for all modules listed above and then adjusted to measure only useful computation. For each value

TABLE IV
Bottleneck analysis function evaluations of the HGA tests for t_eval = 272. The value of T comes from Equation 4 and m_select^avg = (m + 1)/2.

 m     g   n_sel       T     s_0   S_norm0    s_1  S_norm1  s_2  S_norm2  s_3  S_norm3
 32   256    1    1225297   6.140   101.32  1.485    24.50    5        5  292      292
 64   256    1    2446482   6.141   199.57  1.246    40.50    5        5  292      292
 32   512    1    2450594   6.140   101.32  1.485    24.50    5        5  292      292
 64   512    1    4892964   6.141   199.57  1.246    40.50    5        5  292      292
 32  1024    1    4901188   6.140   101.32  1.485    24.50    5        5  292      292
 64  1024    1    9785928   6.141   199.57  1.246    40.50    5        5  292      292
 32   256    2    1225297   6.140    50.66  1.485    12.25    5        5  292      292
 64   256    2    2446482   6.141    99.79  1.246    20.25    5        5  292      292
 32   512    2    2450594   6.140    50.66  1.485    12.25    5        5  292      292
 64   512    2    4892964   6.141    99.79  1.246    20.25    5        5  292      292
 32  1024    2    4901188   6.140    50.66  1.485    12.25    5        5  292      292
 64  1024    2    9785928   6.141    99.79  1.246    20.25    5        5  292      292
 32   256    3    1225297   6.140    33.77  1.485     8.17    5        5  292      292
 64   256    3    2446482   6.141    66.52  1.246    13.50    5        5  292      292
 32   512    3    2450594   6.140    33.77  1.485     8.17    5        5  292      292
 64   512    3    4892964   6.141    66.52  1.246    13.50    5        5  292      292
 32  1024    3    4901188   6.140    33.77  1.485     8.17    5        5  292      292
 64  1024    3    9785928   6.141    66.52  1.246    13.50    5        5  292      292

of m, g, n_sel and t_eval, simulations were run on two initial populations. These two populations were used for each combination of parameters and were randomly generated. The results were averaged and are reported in Tables V (for t_eval = 3) and VI (for t_eval = 272). When multiple SMs were involved, the values for all SMs were averaged.

Contrasting Table V with Table III and Table VI with Table IV indicates two problems with our estimates. First, the estimates of s_3 and S_norm3 are about 1–2 cycles more than the simulation results indicate. This is because our estimate of the delay in Step 2 of Section III-D is (4 + r)/2 = 3, but in the simulations this value was typically very near 2, causing the discrepancy. We also discovered that our estimates of T, S_norm0 and S_norm1 were each too low by over 20% (except that our estimate of T is very accurate when t_eval = 272). This is because the empirical value of m_select^avg is 31–39% higher than our estimate when t_eval = 3 and 37–54% higher than our estimate when t_eval = 272. Adjusting our estimate of m_select^avg to match the empirical evidence greatly improves

TABLE V
Bottleneck simulation results of the HGA tests for t_eval = 3.

 m     g   n_sel       T      s_0    S_norm0     s_1  S_norm1  s_2  S_norm2      s_3  S_norm3
 32   256    1     573208   6.4620  140.1845  1.3781  29.8963    5        5  21.9894  21.9894
 64   256    1    2183381   6.2340  266.4905  1.1893  50.8421    5        5  21.9953  21.9953
 32   512    1    1144997   6.4625  139.8690  1.3786  29.8374    5        5  21.9873  21.9873
 64   512    1    4353733   6.2346  265.7175  1.1899  50.7127    5        5  21.9947  21.9947
 32  1024    1    2297290   6.4608  140.1705  1.3773  29.8822    5        5  21.9883  21.9883
 64  1024    1    8716412   6.2344  265.9965  1.1897  50.7596    5        5  21.9943  21.9943
 32   256    2     314468   6.9017   76.3578  1.3798  15.2658    5        5  21.4327  21.4327
 64   256    2    1149130   6.4613  139.9795  1.1887  25.7535    5        5  21.4661  21.4661
 32   512    2     633357   6.8942   76.7432  1.3747  15.3031    5        5  21.4342  21.4342
 64   512    2    2292408   6.4625  139.7515  1.1893  25.7183    5        5  21.4644  21.4644
 32  1024    2    1263788   6.8966   76.7499  1.3768  15.3218    5        5  21.4309  21.4309
 64  1024    2    4572030   6.4639  139.3660  1.1899  25.6541    5        5  21.4654  21.4654
 32   256    3     231644   7.3354   55.7471  1.3752  10.4509    5        5  21.1914  21.1914
 64   256    3     804314   6.7079   97.8319  1.1892  17.3441    5        5  21.1832  21.1832
 32   512    3     463901   7.3343   55.7888  1.3747  10.4566    5        5  21.1838  21.1838
 64   512    3    1608330   6.7070   97.8443  1.1892  17.3482    5        5  21.1848  21.1848
 32  1024    3     930243   7.3276   55.9402  1.3734  10.4846    5        5  21.1917  21.1917
 64  1024    3    3216746   6.7068   97.8345  1.1892  17.3473    5        5  21.1915  21.1915

our estimates of T, S_norm0 and S_norm1, each of which is now within 5% of its empirical value (Tables VII and VIII). Unfortunately, in general it is not clear how to obtain a better estimate of m_select^avg without a priori empirical evidence. If g is sufficiently large, we could expect the population to be nearly homogeneous in terms of fitness values at the end of the run, implying that in fact m_select^avg = (m + 1)/2 toward the end of the run, since empirically our random bit strings (Section II-C.1) are uniformly distributed. But in the rest of the HGA run, the distribution of m_select depends on the fitness function and the diversity of the population.

It is obvious that for sufficiently small values of t_eval and n_sel and for sufficiently large values of m, the population sequencer is the bottleneck of the HGA system. To remove a bottleneck, Kenyon et al. [27] suggest either parallelizing the bottleneck stage or breaking it into smaller stages. Due to the functional simplicity of the PS and its tight coupling with the MIC, neither of these options is possible. However, parallelizing the SMs increases the system's overall selection rate and speeds up the run, effectively reducing the PS's normalized service time. But this will

TABLE VI
Bottleneck simulation results of the HGA tests for t_eval = 272.

 m     g   n_sel       T      s_0    S_norm0     s_1  S_norm1  s_2  S_norm2       s_3   S_norm3
 32   256    1    1230739   6.2066  141.3635  1.3616  31.0115    5        5  291.0620  291.0620
 64   256    1    2492444   6.2040  270.6855  1.1857  51.7323    5        5  291.1150  291.1150
 32   512    1    2461733   6.2066  141.0290  1.3624  30.9577    5        5  291.0620  291.0620
 64   512    1    4983424   6.2040  269.8280  1.1863  51.5932    5        5  291.1090  291.1090
 32  1024    1    4923862   6.2065  141.2295  1.3619  30.9902    5        5  291.0620  291.0620
 64  1024    1    9964958   6.2040  269.0830  1.1868  51.4735    5        5  291.1060  291.1060
 32   256    2    1224577   6.2077   74.5652  1.3526  16.2470    5        5  291.0280  291.0280
 64   256    2    2444369   6.2081  139.3715  1.1833  26.5641    5        5  291.0160  291.0160
 32   512    2    2447966   6.2078   74.6168  1.3524  16.2552    5        5  291.0315  291.0315
 64   512    2    4892602   6.2079  139.2895  1.1834  26.5520    5        5  291.0170  291.0170
 32  1024    2    4895875   6.2078   74.5638  1.3526  16.2467    5        5  291.0335  291.0335
 64  1024    2    9778655   6.2080  140.0110  1.1824  26.6675    5        5  291.0160  291.0160
 32   256    3    1219560   6.2086   52.4304  1.3437  11.3467    5        5  291.0000  291.0000
 64   256    3    2437026   6.2088   95.2322  1.1815  18.1217    5        5  291.0005  291.0005
 32   512    3    2439942   6.2085   52.4402  1.3436  11.3485    5        5  291.0040  291.0040
 64   512    3    4875130   6.2087   95.7456  1.1805  18.2043    5        5  291.0000  291.0000
 32  1024    3    4878060   6.2086   52.1630  1.3454  11.3038    5        5  291.0020  291.0020
 64  1024    3    9750577   6.2087   95.3931  1.1812  18.1480    5        5  291.0000  291.0000

only work up to a certain limit, after which the bottleneck will shift from the PS to the FM. With the improved estimate of m_select^avg and by subtracting 1 from s_3 and S_norm3, we can use the equations of Table II to estimate this maximum value of n_sel.

When the bottleneck lies in the FM, FM parallelization is possible to relieve it. In this case, the duty of writing new members to memory and maintaining records of the HGA's state (see Section II-C, Item 6) would have to be shifted from the FM to a new module called the memory writer (MW). This would be necessary because parallel FMs would have difficulty performing these jobs themselves. If the number of parallel FMs is n_fit, then since the FMs' duties are only to evaluate the two members and hand them to the MW, s_3 = t_eval + 4. If we assume the pairs to evaluate are evenly distributed over all FMs, F_3^avg ≈ mg/(2·n_fit), yielding S_norm3^avg ≈ (t_eval + 4)/n_fit, an approximately linear improvement as with the SMs. Since the MW must now handle writing new members to memory, s_4^avg ≈ 11 + 2w + 3r/2.

TABLE VII
Bottleneck analysis of the HGA tests for t_eval = 3. The value of T comes from
Equation 2 and the value of m_select^avg is empirically determined.

                                Sequencer           Selection        Crossover     Fitness
 m     g    n_sel      T       s_0    S_norm0     s_1    S_norm1    s_2  S_norm2   s_3  S_norm3
 32    256    1      598319   6.288   136.32    1.369    29.68      5      5       23     23
 64    256    1     2225066   6.155   262.74    1.187    50.69      5      5       23     23
 32    512    1     1196638   6.288   136.32    1.369    29.68      5      5       23     23
 64    512    1     4450132   6.155   262.74    1.187    50.69      5      5       23     23
 32   1024    1     2393276   6.288   136.32    1.369    29.68      5      5       23     23
 64   1024    1     8900264   6.155   262.74    1.187    50.69      5      5       23     23
 32    256    2      321000   6.536    72.62    1.360    15.11      5      5       23     23
 64    256    2     1154696   6.298   136.08    1.185    25.61      5      5       23     23
 32    512    2      642000   6.536    72.62    1.360    15.11      5      5       23     23
 64    512    2     2309391   6.298   136.08    1.185    25.61      5      5       23     23
 32   1024    2     1284000   6.536    72.62    1.360    15.11      5      5       23     23
 64   1024    2     4618782   6.298   136.08    1.185    25.61      5      5       23     23
 32    256    3      228678   6.752    51.41    1.350    10.28      5      5       23     23
 64    256    3      797549   6.431    93.82    1.183    17.25      5      5       23     23
 32    512    3      457356   6.752    51.41    1.350    10.28      5      5       23     23
 64    512    3     1595098   6.431    93.82    1.183    17.25      5      5       23     23
 32   1024    3      914712   6.752    51.41    1.350    10.28      5      5       23     23
 64   1024    3     3190196   6.431    93.82    1.183    17.25      5      5       23     23

Since it must write all member pairs, F_4 = mg/2, giving S_norm4^avg = s_4^avg.
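Where a rough estimate of m_select^avg is acceptable, it can be obtained numerically. The following C sketch estimates the average scan length of a roulette-wheel selection by Monte Carlo simulation, under the assumption that m_select counts the members the PS streams past an SM before a selection completes; the fitness array and trial count are illustrative, not values from the HGA experiments. For a homogeneous population the estimate converges to (m+1)/2, matching the end-of-run estimate above.

    #include <stdio.h>
    #include <stdlib.h>

    /* Estimate m_select^avg: the average number of members scanned per
       roulette-wheel selection.  For a homogeneous population the scan
       stops, on average, after (m+1)/2 members. */
    static double avg_scan_length(const double *fitness, int m, int trials)
    {
        double sum_fit = 0.0;
        long total_scanned = 0;

        for (int i = 0; i < m; i++)
            sum_fit += fitness[i];

        for (int t = 0; t < trials; t++) {
            double threshold = ((double)rand() / RAND_MAX) * sum_fit;
            double acc = 0.0;
            int scanned = 0;
            for (int i = 0; i < m; i++) {   /* stream members past the SM */
                acc += fitness[i];
                scanned++;
                if (acc >= threshold)
                    break;                  /* selection completes */
            }
            total_scanned += scanned;
        }
        return (double)total_scanned / trials;
    }

    int main(void)
    {
        enum { M = 32 };
        double fit[M];
        for (int i = 0; i < M; i++)
            fit[i] = 1.0;                   /* homogeneous fitnesses */
        printf("m_select^avg = %.2f (expect (m+1)/2 = %.1f)\n",
               avg_scan_length(fit, M, 100000), (M + 1) / 2.0);
        return 0;
    }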

IV. Example Implementations

To demonstrate the HGA's feasibility, a proof-of-concept prototype was built and tested on simple fitness functions. Additionally, the HGA was simulated on a more complex fitness function. This section presents the results of these tests and contrasts them with runs of a software GA on the same functions.

A. Simple Fitness Functions (Prototype)

The HGA system presented in Fig. 1 was prototyped and tested using three fitness functions. The HGA was designed to operate in a coprocessor capacity, waiting for the CPU to supply a "Go" signal to start HGA execution. For this to be feasible, the HGA was implemented on a prototyping board called the BORG board [28], which was connected to the bus of a PC.

TABLE VIII
Bottleneck analysis of the HGA tests for t_eval = 272. The value of T comes from
Equation 4 and the value of m_select^avg is empirically determined.

                                Sequencer           Selection        Crossover     Fitness
 m     g    n_sel      T       s_0    S_norm0     s_1    S_norm1    s_2  S_norm2   s_3  S_norm3
 32    256    1     1235117   6.139   139.68    1.352    30.75      5      5       292    292
 64    256    1     2463761   6.140   267.07    1.184    51.50      5      5       292    292
 32    512    1     2470234   6.139   139.68    1.352    30.75      5      5       292    292
 64    512    1     4927522   6.140   267.07    1.184    51.50      5      5       292    292
 32   1024    1     4940468   6.139   139.68    1.352    30.75      5      5       292    292
 64   1024    1     9855044   6.140   267.07    1.184    51.50      5      5       292    292
 32    256    2     1237124   6.139    73.76    1.333    16.01      5      5       292    292
 64    256    2     2466057   6.140   138.02    1.178    26.48      5      5       292    292
 32    512    2     2474247   6.139    73.76    1.333    16.01      5      5       292    292
 64    512    2     4932113   6.140   138.02    1.178    26.48      5      5       292    292
 32   1024    2     4948494   6.139    73.76    1.333    16.01      5      5       292    292
 64   1024    2     9864227   6.140   138.02    1.178    26.48      5      5       292    292
 32    256    3     1239110   6.139    51.76    1.316    11.10      5      5       292    292
 64    256    3     2467885   6.139    94.39    1.173    18.04      5      5       292    292
 32    512    3     2478220   6.139    51.76    1.316    11.10      5      5       292    292
 64    512    3     4935770   6.139    94.39    1.173    18.04      5      5       292    292
 32   1024    3     4956439   6.139    51.76    1.316    11.10      5      5       292    292
 64   1024    3     9871541   6.139    94.39    1.173    18.04      5      5       292    292

This allowed the HGA and the CPU to share memory, relieving the need for large amounts of I/O between the CPU and HGA. The BORG board consists of five Xilinx FPGAs (Fig. 3). Two XC4003s (X1 and X2) contain user-specified logic, two XC4002s (R1 and R2) provide user-specified interconnect between the XC4003s, and one XC4003 (X0) controls the interface to the PC's bus. Also available on the BORG board are 8 kilobytes of static RAM, an 8 MHz oscillator, and a "sea-of-holes" prototyping area. One of the two user-programmable XC4003s on the BORG board housed the pseudorandom number generator and the crossover/mutation module; these two modules shared an XC4003 to reduce the number of inter-chip connections. The other XC4003 was unused. Since the FPGAs on the BORG board were too small for the entire HGA design, additional FPGAs were inserted into the BORG board's prototyping area and connected to the BORG FPGAs. The BORG's prototyping area was used to support the fitness, selection, and memory

Fig. 3. Simplified schematic of the Xilinx BORG prototyping board: logic FPGAs X1 and X2, routing FPGAs R1 and R2, PC-bus interface FPGA X0, an 8K x 8 static RAM, an 8 MHz system clock, the PC bus interface, and a prototyping area.

interface and control modules (Fig. 4). The population sequencer shared an FPGA with the memory interface and control module. The prototyping area had three XC4005s wire-wrapped to each other and to the chips on the BORG board. If the desired fitness function is not the default programmed into the fitness module (f(x) = x was our default), the user can place other FPGAs in the prototyping area and connect them to the fitness module. These extra FPGAs act as an external fitness evaluator (Section II-C.7), which does no bookkeeping like the fitness module does; rather, it simply evaluates whatever member is presented to it.

Fig. 4. Simplified schematic of the FPGAs in the BORG board's prototyping area: the fitness module (with an optional external fitness evaluator), the selection module, and the memory interface module and population sequencer, wired to the chips on the BORG board.

Finally, a software front end was written to act as a simple user interface to the HGA. It receives as input all user-specified HGA parameters described in Section II-A and loads them into the memory shared by the HGA and CPU. The parameters are specified in a text file created by the user and read by the front end. After the parameters are stored in memory, the front end sends a "Go" signal to the memory interface and control module and waits for the MIC to send the "Done" signal back. Then the front end reads the final population from the shared memory.
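For illustration, a minimal C sketch of this front-end handshake appears below. The shared-memory base address, offsets, and control bits are hypothetical placeholders (the real layout is fixed by the HGA's memory map), and parsing of the user's parameter file is omitted.

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical shared-memory map: the real offsets, widths, and the
       parameter block layout are fixed by the HGA design and are not
       reproduced here. */
    #define SHARED_BASE  ((volatile uint8_t *)0xD0000)
    #define PARAM_OFF    0x0000  /* user parameters (m, g, n_sel, seeds, ...) */
    #define POP_OFF      0x0100  /* population members and fitnesses */
    #define CTRL_OFF     0x0FFF  /* control/status byte */
    #define CTRL_GO      0x01
    #define CTRL_DONE    0x02

    /* Write the parameter block, assert "Go", poll until the MIC raises
       "Done", then read back the final population. */
    void run_hga(const uint8_t *params, size_t nparams,
                 uint8_t *final_pop, size_t pop_bytes)
    {
        volatile uint8_t *mem = SHARED_BASE;
        size_t i;

        for (i = 0; i < nparams; i++)          /* load parameters */
            mem[PARAM_OFF + i] = params[i];

        mem[CTRL_OFF] = CTRL_GO;               /* signal the MIC */

        while ((mem[CTRL_OFF] & CTRL_DONE) == 0)
            ;                                  /* busy-wait for the HGA */

        for (i = 0; i < pop_bytes; i++)        /* read final population */
            final_pop[i] = mem[POP_OFF + i];
    }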

We tested the HGA prototype against the software-based SGA [1] running on an Ultra SPARC with a 143 MHz CPU. We chose the SGA for comparison because the HGA was patterned after it. The HGA was compared with the SGA when optimizing the fitness functions f(x) = x, f(x) = 2x and f(x) = x + 5 (the fitness functions' simplicity is due to the limited space available on our prototype's FPGAs). The different fitness functions were tested by changing the default fitness function in the fitness module and reprogramming the fitness module FPGA in the prototyping area. The HGA and SGA both ran with population size m = 16, population member width = 3 bits, and population member fitness width = 4 bits. The SGA executable was optimized during compilation. Both the SGA and HGA started with the same initial population, so the only variations in the runs were from the pseudorandom number generation. Each fitness function was optimized six times by both the SGA and the HGA: three runs for 10 generations and three runs for 20 generations. The results were then averaged.

The results of the runs appear in Table IX. In this table, f(x) is the fitness function used. The notations "2x (add)" and "2x (mult)" refer to how the function f(x) = 2x was implemented in the SGA: the "2x (add)" rows indicate that the SGA computed x + x, and the "2x (mult)" rows indicate that the SGA computed 2 * x. In both cases, the HGA implemented the function with a single left bit shift. The rest of the table presents the average execution times of the SGA and HGA in clock cycles and in milliseconds, where the SGA's performance values were obtained via the ACE Profile Timer [29]. All I/O times are removed from the comparisons. The HGA prototype used on average 4.431% as many clock cycles as the SGA, but its execution time was 3.168 times as long as the SGA's. This is due to the HGA's meager 2 MHz clock rate, caused by noise and glitches in the wire-wrapped part of the prototype. However, if the HGA as currently designed were implemented on a printed circuit board (PCB), it should run at at least 10 MHz when implemented with

TABLE IX
Performance of the HGA prototype and the SGA on simple fitness functions.

 Fitness      No. of       HGA                  SGA             Ratio: HGA/SGA
 Function     Gens. g   Cycles     ms       Cycles      ms      Cycles      ms
 x              10        5636    2.818     122890     0.859     0.046     3.281
 x              20       10622    5.311     233829     1.635     0.045     3.248
 x + 5          10        5585    2.793     125545     0.878     0.044     3.181
 x + 5          20       10945    5.473     243812     1.705     0.045     3.210
 2x (add)       10        5390    2.695     126378     0.884     0.043     3.049
 2x (add)       20       10659    5.330     238660     1.669     0.045     3.194
 2x (mult)      10        5390    2.695     130409     0.912     0.041     2.955
 2x (mult)      20       10659    5.330     235956     1.650     0.045     3.230

Xilinx XC4005-4 FPGAs (footnote 4). This is because a critical path analysis of the SM's multiplier (the design's longest register-to-register path) reveals that the design's critical path can be traversed in about 97 ns in a Xilinx XC4005-4 FPGA, including register set-up times. After factoring out set-up times, we found that a signal could traverse the critical path at a rate of about 6.36 ns per configurable logic block (CLB) [16] (for a critical path 14 CLBs long) in an XC4005-4. Measuring the critical path length in terms of CLBs allows us to estimate the maximum clock rates for our design on Xilinx XC4000-series parts with varying speed grades [16], [30]. Using a 10 MHz clock, a speedup of 1.58 is possible even for this simple prototype with only a single SM. Based on the equations of Table II, up to 3 SMs (one per XC4005) are possible before the pipeline's bottleneck shifts to the FM (footnote 5). These extra SMs would give our prototype a speedup of more than 4 over the software GA. Also, the number of cycles in the run can be reduced via some of the design improvements suggested in Section VI-B, including merging the PS with the MIC, which could cut the run time in half when the bottleneck lies in the PS (Section VI-B.2). Finally, the FPGA-based multiplier could be replaced with a dedicated multiplier chip, such as the AMD Am29323 multiplier [31]. This multiplier uses only combinational logic and thus operates in

a single clock cycle, just like the multiplier in our prototype. Even with 1985 technology, the Am29323 can multiply two 32-bit numbers in under 80 ns, potentially improving the maximum clock rate of our prototype to 12.5 MHz. In lieu of a dedicated multiplier chip, the pure combinational multiplier could be replaced with one utilizing sequential logic, e.g. using an "add and shift" approach [32]. This requires only one register, one shift register, an adder, and some control logic, decreasing the multiplier's critical path length to 6 CLBs and allowing the clock rate to increase to 21 MHz on an XC4005-4. But reducing the critical path length of the multiplier only increases the system's clock rate if the multiplier's critical path remains the system's critical path; a more thorough analysis is required to determine the system's actual critical path when this new multiplier is added. Another problem with this new multiplier is that it increases by an additive factor the number of cycles for an SM to select a pair of members, because the SM uses the multiplier twice to scale down the sum of fitnesses from the FM before each selection process begins. The additive factor is about twice the length in bits of the largest argument to the multiplier (footnote 6), since the number of cycles for each multiplication is roughly the length of the largest argument and the SM must perform two multiplications per selection. Thus the impact of the new multiplier on the total number of cycles for the HGA run can in general be approximated by adding the appropriate factor to s_1out^avg in Equations 1 and 3. In our prototype, s_1out^avg would increase from 8 to about 24.

Our theoretical and empirical evidence indicate that the total execution times of both the SGA and HGA grow quadratically with m and linearly with g, which concurs with Table IX. Thus the HGA's speed advantage over the SGA for a given problem should remain roughly constant, independent of m and g.

Footnote 4: The -4 is a "speed grade" that Xilinx uses to rate the speed of its FPGAs [16]. In the XC4000 series, smaller speed grades imply faster parts. Currently available speed grades include -6 through -2.
Footnote 5: Naturally, the FMs can be parallelized as well after the bottleneck shifts to that part of the pipeline.
Footnote 6: For our prototype, this size is 8, so the additive factor is 16.
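The add-and-shift multiplier's cycle cost is easy to see in software. Below is a minimal C model of the approach: one conditional add and one shift per cycle, so the cycle count equals the operand width, which is exactly the additive factor discussed above. The operand widths are illustrative.

    #include <stdint.h>
    #include <stdio.h>

    /* Add-and-shift multiplication: one partial-product step per cycle,
       so an n-bit multiply takes about n clock cycles.  One accumulator
       register, one shifting register, and an adder are the whole
       datapath. */
    static uint32_t shift_add_mul(uint16_t a, uint16_t b, int width)
    {
        uint32_t acc = 0;        /* accumulator register */
        uint32_t addend = a;     /* left-shifting register */

        for (int cycle = 0; cycle < width; cycle++) {
            if (b & 1)           /* low multiplier bit selects the add */
                acc += addend;
            addend <<= 1;        /* shift once per cycle */
            b >>= 1;
        }
        return acc;
    }

    int main(void)
    {
        printf("%u\n", shift_add_mul(20, 13, 8));  /* prints 260 */
        return 0;
    }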

B. A Complex Fitness Function (Simulations)

To understand the HGA's behavior on a more complex fitness function, we performed VHDL simulations based on an example from Michalewicz [33]. The evaluation function was

    f(x1, x2) = 21.5 + x1 sin(4π x1) + x2 sin(20π x2)                    (5)

where -3.0 ≤ x1 ≤ 12.1 and 4.1 ≤ x2 ≤ 5.8. A plot of f(x1, x2) is in Fig. 5. The spiky nature of the plot indicates that it should be difficult to optimize, given the myriad local optima.

Fig. 5. The function f(x1, x2) of Equation 5 optimized by the HGA and SGA.

To obtain four decimal places of precision for each variable, Michalewicz used 18 bits for x1 and 15 for x2. His binary strings were manipulated directly and converted to real values only during fitness evaluation. We attempted to model this as closely as possible in our VHDL and software simulations. Although our VHDL simulations evaluated Equation 5 directly, we inserted an artificial delay into the simulations (as in Section III-G) to simulate the time required in a straightforward hardware implementation. This artificial delay assumes that the evaluations of sine in Equation 5 are performed using the CORDIC algorithm [34], [35]. In the evaluation process, first x1 is multiplied by 4π in a multiplier, and then x2 is multiplied by 20π in the same multiplier in the next cycle. Then both sine calculations run concurrently via CORDIC. But since CORDIC only works on arguments between 0 and 2π, before running CORDIC we must subtract from each argument to sine an amount 2πi, where i = ⌊x/(2π)⌋ for x ∈ {4πx1, 20πx2}. These amounts are 2π⌊2x1⌋ and 2π⌊10x2⌋, each of which is computable with two more uses of the multiplier (footnote 7). So after four cycles, one of the two concurrent CORDIC systems is given an argument of 4πx1 - 2π⌊2x1⌋. After another three cycles, the other CORDIC system receives an argument of 20πx2 - 2π⌊10x2⌋. Fourteen steps of each CORDIC are required to attain four decimal places of precision in the result, the precision used in Michalewicz's implementation: approximately one bit of accuracy is attained in each step, and 2^-14 < 10^-4 < 2^-13. After the first CORDIC finishes, multiply x1 by sin(4πx1) and add the result to 21.5. One cycle later the second CORDIC finishes, so multiply x2 by sin(20πx2) and add this result to x1 sin(4πx1) + 21.5. By overlapping as many operations as possible while sharing the multiplier and adder, the total time to evaluate a member is 23 cycles. Repeating for the other member yields a cumulative delay of 46 cycles, but operations performed in evaluating the first member can partially overlap operations evaluating the second, so in fact both members can be evaluated in 43 cycles. After evaluating the second member, the FM accumulates its fitness and requests access to the MIC to write the members. This takes one more cycle, so t_eval = 44.

Like the SGA runs of Section IV-A, the SGA implementation of this problem ran on an Ultra SPARC with a 143 MHz CPU, was optimized during compilation, and started with the same initial populations as the HGA. The SGA was run twice and the HGA was run six times, twice for each value of n_sel ∈ {1, 2, 3}. As in Michalewicz's implementation, the population size was m = 20 and the number of generations was g = 1000. We used the same crossover and per-bit mutation probabilities as Michalewicz, namely 0.25 and 0.01 respectively. However, since the HGA only considers mutation once per member rather than once per bit, its mutation probability was set to 0.33 to compensate (all members are 33 bits long). Michalewicz's implementation yielded a final maximum fitness of 35.4780 and a final mean fitness of 31.2686; he also tracked the best fitness over the entire run, which was 38.8276. Our SGA implementation gave (averaged over both runs) a final maximum fitness of 36.2321 and a final mean fitness of 33.4366; the best fitness over both runs was 38.5764 and the average of the best fitnesses over all runs was 37.8648. Our HGA implementation gave (averaged over all six runs) a final maximum fitness of 35.8741 and a final mean fitness of 33.2648. The best fitness over all runs was 38.8419 and the average of the best fitnesses over all runs was 37.2174.

Footnote 7: For arbitrary x ≥ 0, 2πi can be found with a binary search, repeatedly subtracting 2^j·π from x for j ranging from ⌊log2(x/π)⌋ down to 1. After each subtraction, if the result is ≥ 0, put j in a set S and continue; if the result is < 0, add 2^j·π back to x and continue. When finished, 2πi = π·Σ_{j∈S} 2^j. If initially x < 0, perform a similar process, but repeatedly add 2^j·π to x and test whether the result is < 0; when finished, -π·Σ_{j∈S} 2^j yields a quantity between -2π and 0, so add 2π to it.
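To make the evaluation concrete, here is a behavioral C sketch of Equation 5 built on a 14-step rotation-mode CORDIC sine. It is a floating-point model only: the hardware uses fixed-point arithmetic and the modulo-2π argument reduction described above, while this sketch additionally folds the reduced argument into CORDIC's native convergence range of roughly [-π/2, π/2].

    #include <math.h>
    #include <stdio.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    #define CORDIC_STEPS 14   /* one bit of accuracy per step; 2^-14 < 10^-4 */

    /* Rotation-mode CORDIC sine.  Plain CORDIC converges only for
       |theta| <= ~1.74 rad, so this sketch folds the argument into
       [-pi/2, pi/2]; the hardware reduces the argument modulo 2*pi
       before the CORDIC stage, as described in the text. */
    static double cordic_sin(double theta)
    {
        theta = fmod(theta, 2.0 * M_PI);
        if (theta < 0.0)
            theta += 2.0 * M_PI;
        if (theta > 1.5 * M_PI)
            theta -= 2.0 * M_PI;        /* (3pi/2, 2pi) -> (-pi/2, 0)     */
        else if (theta > 0.5 * M_PI)
            theta = M_PI - theta;       /* (pi/2, 3pi/2) -> [-pi/2, pi/2) */

        double x = 1.0, y = 0.0, z = theta, pow2 = 1.0, gain = 1.0;
        for (int i = 0; i < CORDIC_STEPS; i++) {
            double d = (z >= 0.0) ? 1.0 : -1.0;
            double xn = x - d * y * pow2;
            y += d * x * pow2;          /* one shift-add rotation per step */
            x = xn;
            z -= d * atan(pow2);
            gain *= sqrt(1.0 + pow2 * pow2);
            pow2 *= 0.5;
        }
        return y / gain;                /* undo the CORDIC gain (~1.6468) */
    }

    /* The fitness function of Equation 5. */
    static double fitness(double x1, double x2)
    {
        return 21.5 + x1 * cordic_sin(4.0 * M_PI * x1)
                    + x2 * cordic_sin(20.0 * M_PI * x2);
    }

    int main(void)
    {
        printf("f(11.6, 5.7) = %f\n", fitness(11.6, 5.7));
        return 0;
    }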

TABLE X
HGA and SGA performance on Michalewicz's fitness function (Equation 5).

             HGA                      SGA               Ratio: HGA/SGA
 n_sel   Cycles       ms*       Cycles        ms        Cycles      ms
   1     9.72 x 10^5   78      4.06 x 10^7    284        0.024     0.275
   2     7.12 x 10^5   57      4.06 x 10^7    284        0.018     0.201
   3     7.02 x 10^5   56      4.06 x 10^7    284        0.017     0.197

 * Assuming the use of a dedicated multiplier chip allows a clock rate of 12.5 MHz.

Thus the results of the HGA and SGA were competitive with Michalewicz's run. Given the constraints on x1 and x2, f(x1, x2) ≤ 39.4, so all best results are within 2% of the global optimum.

The timing results of our simulations appear in Table X. As in Table IX, the ratio of HGA cycles to SGA cycles is very small. To compute an approximate clock rate for the HGA in this experiment, we again note that the HGA's multiplier has the longest critical path. We found that this path can be traversed in about 382 ns in a Xilinx XC4013-4 FPGA, including register set-up times. After factoring out set-up times, we found that a signal could traverse the critical path at a rate of about 8.92 ns per CLB (for a critical path 42 CLBs long) in an XC4013-4. So we can assume a clock rate of 2 MHz, allowing the HGA to complete its run in 356 ms for n_sel = 2. This is unacceptable, so we should exercise one of the multiplier options described in Section IV-A, i.e. a dedicated multiplier chip or one based on sequential logic. The sequential-logic multiplier for this problem has a critical path of length 22 CLBs, allowing the clock rate to increase to 6 MHz, which is less than the improvement from using the dedicated Am29323 multiplier (12.5 MHz). Thus Table X assumes the use of the dedicated multiplier and a clock rate of 12.5 MHz. So for n_sel = 2, the HGA is about 5 times faster than the SGA. Finally, as in Section IV-A, the execution time of the run can be reduced further via some of the design improvements suggested in Section VI-B, including extensive parallelization of the SMs and FMs.

Empirically we found that for the HGA, m_select^avg ≈ 1.34(m+1)/2 for n_sel = 1, m_select^avg ≈ 1.47(m+1)/2 for n_sel = 2, and m_select^avg ≈ 1.58(m+1)/2 for n_sel = 3. Using these values in the equations of Table II, we expect the bottleneck to lie in the PS for n_sel = 1 and in the FM for n_sel ∈ {2, 3}. This is in fact what happened in our simulations, explaining why increasing n_sel did not reduce the execution time nearly linearly. Our estimates of the normalized service times were within 10% of the empirical results, and our estimates of T^avg were within 9% of the empirical results.

Given the values of the HGA parameters, we estimated the following hardware requirements for the fitness function implementation (footnote 8). If a dedicated multiplier chip is not used, the single pure combinational multiplier (shared by the FM and the SMs; see footnote 9) would require approximately 315 CLBs. The two adder/subtracters together require approximately 20 CLBs. Six registers are also required, occupying a total of about 6 CLBs and 84 flip-flops. The fitness module also needs two shifters (72 CLBs) and a lookup table in the form of a 16 x 18 ROM (9 CLBs). So the FM requires approximately 422 CLBs and 84 flip-flops, which is easily provided by a Xilinx XC4013 [16]. The remaining logic in the FM and the hardware for the other modules grow slowly with the HGA parameters or remain constant, since the SM shares the FM's multiplier. Problems with pin counts can be overcome by using a bus and time-multiplexing the chip pins. Thus the HGA can optimize Equation 5 using current FPGA technology.

Footnote 8: The estimates came from mapping Mentor Graphics schematics containing the components to Xilinx .bit files and counting the number of CLBs occupied by those components.
Footnote 9: Since each SM only needs to perform two multiplications every m_select^avg·s_0^avg + s_1out^avg steps, sharing a single multiplier via an arbitrator is reasonable.

V. Other Applications

Section IV-B illustrated the feasibility of the HGA in optimizing a complex numeric function. This section gives a high-level description of several non-numeric problems to which the HGA is applicable given the current state of the art in FPGA technology.

A. FPGA Partitioning

Sitko et al. [36] have proposed a scheme to apply GAs to the problem of partitioning logic designs across two FPGAs. A design comprises c components, and a particular partitioning (population member) is represented by a c-bit string P, where the ith bit is 1 if and only if component i lies in FPGA 1. Accompanying this bit string is a set of c-bit strings N_j, one per inter-component net in the design, where the ith bit of N_j is 1 if and only if component i is connected to N_j. So net j lies in FPGA 1 of partition P if one of the bits in P ∧ N_j is 1, where ∧ is the bitwise AND operator. Likewise, net j lies in FPGA 2 of partition P if one of the bits in NOT(P) ∧ N_j is 1, where NOT(P) is the bitwise complement of P. Thus, a net j crosses a chip boundary if and only if some bit of P ∧ N_j is 1 and some bit of NOT(P) ∧ N_j is 1. This can easily be determined with combinational logic (see the sketch at the end of this subsection). A partition's fitness is then the total number of boundary crossings. This fitness function can easily be evaluated in hardware, just as in Sitko et al.'s work [36]. The nets N_j used to evaluate each P are the same for every P, so they can be permanently stored in the FM. Since the number of potential nets is exponential in c, the nets N_j might need to be stored in some memory attached to the FM.

This approach can be generalized to an arbitrary (but bounded) number of FPGAs F as follows (Fig. 6). First store P in c registers, each with ⌈log2 F⌉ bits. Each register represents which of the F FPGAs its corresponding component lies in. Then for each net N_j, a counter initializes itself to 0 and cycles through all integer values v ∈ [0, F]. For each value v, compare it to P_i (the index of the FPGA holding component i) for all i. If they are equal, then component i lies in FPGA v in partition P. Now logically AND this result with the ith bit of N_j; if this result is 1, then N_j lies at least in part on FPGA v. The results for all i are logically ORed, yielding a single bit indicating whether N_j lies in FPGA v. This result is fed into an accumulator which counts the number of FPGAs that N_j lies in. After looping through all values of v ∈ [0, F], the accumulator is checked; if it holds a value > 1, then N_j crosses a chip boundary. Repeat this process for all N_j. The fitness of P is as defined before.

Fig. 6. Circuit to evaluate a general F-way partition.

This scheme will work if F is a constant known a priori. In addition to some control logic, its hardware requirements are as follows. To store P, we need c registers, each of size ⌈log2 F⌉. One ⌈log2 F⌉-bit counter is required to cycle through the values v. The counter output is fed into c ⌈log2 F⌉-bit comparators, each comparator taking its other input from one of the registers storing part of P. Each comparator's output is fed into one of c 2-input AND gates along with one value from the c-bit register storing N_j. The outputs of these AND gates feed into one c-input OR gate, whose output enters a ⌈log2 F⌉-bit accumulator. After cycling through all the v's, all bits except the lowest-order one of the accumulator are fed into a (⌈log2 F⌉ - 1)-input OR gate to determine if the accumulator's value is > 1. This output activates a final accumulator that counts the number of inter-chip nets in P. The width of this accumulator is at most ⌈log2 n⌉, where n is the number of nets in the design; note that ⌈log2 n⌉ ≤ c since n ≤ 2^c. Finally, a table of the N_j's is needed, either in a bank of registers or in an off-chip memory. In this scheme, the time to evaluate two members is t_eval ≈ 2·n·⌈log2 F⌉.

There is a potential difficulty of invalid partitions if bitwise crossover and mutation are used when F < 2^⌈log2 F⌉, because a value > F could appear in P_i, defining an invalid partition. This can be remedied by requiring that the initial population be valid, that crossover respect the boundaries between bit groups, and that mutation only map a bit group into a valid bit group. Finally, note that given the net specification of a circuit, the set of vectors N_j and the fitness evaluation hardware can be automatically generated by software, so the user's work is limited to specifying the components and nets of the circuit.
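To make the two-FPGA fitness test concrete, here is a minimal C sketch of the boundary-crossing count. The 32-component bit-vector width and the example nets are illustrative assumptions; the hardware performs the same two bitwise tests per net with combinational logic rather than a loop.

    #include <stdint.h>
    #include <stdio.h>

    /* One 32-bit word per bit string: bit i is component i.  The width
       is an illustrative assumption (at most 32 components). */
    typedef uint32_t bitvec;

    /* Fitness of partition P: the number of nets with pins in both
       FPGAs.  Net j crosses the chip boundary iff P AND N[j] and
       NOT(P) AND N[j] are both nonzero, exactly the combinational
       test described above. */
    static unsigned crossings(bitvec P, const bitvec *N, unsigned nets)
    {
        unsigned count = 0;
        for (unsigned j = 0; j < nets; j++)
            if ((P & N[j]) != 0 && (~P & N[j]) != 0)
                count++;
        return count;
    }

    int main(void)
    {
        bitvec P = 0x3;                 /* components 0,1 on FPGA 1 */
        bitvec N[2] = { 0x3,            /* net entirely within FPGA 1 */
                        0x6 };          /* net spanning both chips */
        printf("boundary crossings = %u\n", crossings(P, N, 2));
        return 0;
    }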

B. Hypergraph Partitioning

A GA approach to the hypergraph partitioning problem by Chan and Mazumder [37] uses a fitness function similar to that described in Section V-A. After counting the number of nets that span the partition, this value is divided by the product of the sizes of the two partitions. This is known as the ratio cut cost and involves only a little extra work (the multiplication and division operations) beyond what is done in Section V-A. Thus the HGA is applicable to this problem. Also, as in Section V-A, the fitness function can be generalized to F-way partitioning where F is arbitrary but fixed.

C. Test Vector Generation

O'Dare and Arslan [38] have described a GA to generate test vectors for stuck-at fault detection in combinational circuits. In their scheme, each population member is a single test vector, and the member's fitness is evaluated on the basis of how many new faults it covers. The GA maintains a global table which defines the current test set; a pattern is added to the table if it covers a fault not already covered by another pattern in the table. Using a software-based fault simulator, each vector is evaluated by first applying it to a fault-free version of the circuit under test (CUT). Then each node within the CUT is in turn forced to a logic 1 and a logic 0 to simulate the stuck-at faults. If the circuit's output differs from the fault-free output, then the vector detects the given fault. Each pattern gets a fixed number of points for each fault it covers that is not already covered by the test set, and a smaller number of points for each fault it covers that is already covered.

We now propose how to map O'Dare and Arslan's fitness function to hardware for the HGA. Each logic gate in the CUT is mapped to a pair of gates that allow simulation of stuck-at faults. Fig. 7 gives an example of this for an AND gate. To simulate the output c as stuck at 0, both x and y are set to 0. To simulate c stuck at 1, x is set to 0 and y is set to 1. To simulate fault-free behavior, x is set to 1 and y is set to 0. OR and NOT gates and the original inputs can be modified in a similar fashion. For a circuit with n gates and m inputs, the new circuit has at most 2n + 2m gates and at most 2n + 2m extra inputs that are controlled by the fitness module. This hardware-based fault simulation component of our proposed implementation of O'Dare and Arslan's fitness function is similar to hardware accelerators designed for fault simulation [39], [40] and logic simulation [41], [42], [43]. Fitness evaluation then simply requires a look-up table of previously selected vectors and the faults they cover, a counter to cycle through all 2(m + n) possible stuck-at faults, an accumulator for the members' scores, and some simple control logic. The time to evaluate two members is about twice the number of faults plus one, or t_eval ≈ 4(m + n) + 2. Finally, the mapping process from the original circuit to the fitness evaluation hardware can be automated as in Section V-A, relieving the user of that responsibility.

Fig. 7. An example of mapping a logic gate to a stuck-at fault simulation gate: an AND gate with inputs a and b and output c is augmented with control inputs x and y.
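The following C sketch models this scoring scheme on a small hypothetical CUT (a toy circuit, not one from [38]): each node is forced to 0 and to 1 in turn, and the vector earns points for every fault whose effect reaches the output, with fewer points for faults the test set already covers. The point values are illustrative.

    #include <stdio.h>

    #define NODES 3                 /* internal nodes of the toy CUT */

    /* Toy CUT: out = (a AND b) OR NOT(b).  fault < 0 means fault-free;
       otherwise node `fault` is forced to the value `stuck`. */
    static int eval_cut(int a, int b, int fault, int stuck)
    {
        int n0 = a & b;                          /* node 0: AND gate  */
        if (fault == 0) n0 = stuck;
        int n1 = !b;                             /* node 1: NOT gate  */
        if (fault == 1) n1 = stuck;
        int n2 = n0 | n1;                        /* node 2: OR output */
        if (fault == 2) n2 = stuck;
        return n2;
    }

    /* Points per fault: illustrative values only. */
    #define NEW_POINTS 10
    #define OLD_POINTS 2

    static int fitness(int a, int b, const int covered[NODES][2])
    {
        int good = eval_cut(a, b, -1, 0);        /* fault-free response */
        int score = 0;
        for (int node = 0; node < NODES; node++)
            for (int stuck = 0; stuck <= 1; stuck++)
                if (eval_cut(a, b, node, stuck) != good)
                    score += covered[node][stuck] ? OLD_POINTS : NEW_POINTS;
        return score;
    }

    int main(void)
    {
        int covered[NODES][2] = {{0}};           /* nothing covered yet */
        printf("fitness(1,1) = %d\n", fitness(1, 1, covered));
        return 0;
    }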

D. Traveling Salesman and Related Problems

Here we consider GA approaches to the traveling salesman problem (TSP), which poses special difficulties. A straightforward encoding consisting of a permutation of cities encounters problems if conventional crossover and mutation operators are used, because regular (or uniform) crossover, if it changes anything, will create two invalid tours, i.e. some cities will appear more than once and some will not appear at all. Thus much work in applying GAs to the TSP (e.g. [1], [44], [45]) involves special crossover operators that preserve the validity of tours. This method can be used in the HGA but requires modification of the CMM. In lieu of this, conventional crossover operators can be used in conjunction with a special encoding of the population members. One such encoding is called a random keys encoding [46], [47], [48]. In this encoding, each tour is represented by a tuple of random numbers, one number for each city, with each number from [0, 1]. After selecting a pair of tours, simple or uniform crossover can be applied, yielding two new tuples. To evaluate these tuples, sort the numbers and visit the cities in ascending order of the sort. For example, in a five-city problem the tuple (0.52, 0.93, 0.26, 0.07, 0.78) represents the tour 4 -> 3 -> 1 -> 5 -> 2. This tour can then be evaluated. Note that every tuple maps to a valid tour, so any crossover scheme is applicable (a decoding sketch follows below).

When employing this scheme, all that is required of the HGA's FM, upon receiving a population member, is to sort the tuple and accumulate the distances between the cities of the tour. The sorting can be done with a sorting circuit based on the odd-even merge sort algorithm [49]. For sorting n numbers (i.e. for n-city tours), the depth of the sorting network is (log2^2 n + log2 n)/2, and each level of the network has n registers and n/2 comparators (footnote 10). Also, a single set of n registers and n/2 comparators can simulate the sorting network using some finite state logic and (log2^2 n + log2 n)/2 steps. Thus the hardware requirements of the FM include some finite state logic, a linear number of registers and comparators, and a lookup table that provides the inter-city distances (with O(n^2) entries). In this case, t_eval ≈ 2(log2^2 n + log2 n).

Footnote 10: The size of each register and comparator depends on the desired precision of the numbers in the tuples, but should be at least log2 n bits.

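The random-keys decoding step is straightforward to model in software, as the following C sketch shows: sort the keys and visit the cities in ascending key order. The library qsort stands in for the odd-even merge network, and the five-city tuple is the example from above.

    #include <stdio.h>
    #include <stdlib.h>

    #define CITIES 5

    struct key { double r; int city; };      /* key value plus city index */

    static int cmp_key(const void *a, const void *b)
    {
        double x = ((const struct key *)a)->r;
        double y = ((const struct key *)b)->r;
        return (x > y) - (x < y);
    }

    /* Decode a random-keys tuple into a tour: visit cities in ascending
       order of their keys.  (0.52, 0.93, 0.26, 0.07, 0.78) decodes to
       4 -> 3 -> 1 -> 5 -> 2, as in the example above. */
    static void decode(const double *keys, int *tour)
    {
        struct key k[CITIES];
        for (int i = 0; i < CITIES; i++) {
            k[i].r = keys[i];
            k[i].city = i + 1;               /* cities numbered from 1 */
        }
        qsort(k, CITIES, sizeof k[0], cmp_key);  /* stands in for the
                                                    sorting network */
        for (int i = 0; i < CITIES; i++)
            tour[i] = k[i].city;
    }

    int main(void)
    {
        double keys[CITIES] = { 0.52, 0.93, 0.26, 0.07, 0.78 };
        int tour[CITIES];
        decode(keys, tour);
        for (int i = 0; i < CITIES; i++)
            printf("%d%s", tour[i], i < CITIES - 1 ? " -> " : "\n");
        return 0;
    }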

Finally, we note that the scheme just presented can be adapted for application to other problems with constraints similar to the TSP's. These include scheduling problems, vehicle routing, and resource allocation (a generalization of the 0-1 knapsack problem and the set partitioning problem) [46], [47], [48]. The HGA can be applied to these problems as well with a slight increase in complexity, yielding a solution to the real-time GA disk scheduling problem of Turton et al. [5]. But a major difference is that each member selected in Turton et al.'s scheme is selected from a very small set of members and parallelism is employed; this structure is used to ensure real-time performance. Our selection routine (running on the entire population) would probably require modification so that real-time performance is not compromised.

E. Other NP-Complete Problems

Since each problem attacked by a GA requires special consideration of the choice of member encoding, fitness function, and operators, a "universal GA" would be welcome. To our knowledge, such a GA does not exist, but what do exist are polynomial-time reductions between the solutions of NP-complete problems. Therefore, developing a GA to solve any NP-complete problem (e.g. SAT, the boolean satisfiability problem) yields automatic solutions to all other NP-complete problems via the reductions. All that is required is to (in software) map the instance of any NP-complete problem to SAT, apply the hardware-based SAT GA to solve it, and then (in software) map the SAT solution to a solution of the original problem. Of course, the GA must find an optimal solution (i.e. a satisfying assignment for a SAT instance) for this to work: a fit member in the SAT GA may map to a worthless non-solution in another problem unless the SAT member is optimal. The idea of exploiting the reductions between NP-complete problems has been suggested by Megson and Bland [50]. It was studied extensively by DeJong and Spears [2], [3], who also provided a SAT GA and empirical results on the Hamiltonian circuit (HC) problem. Their SAT GA evaluates a population member by quantifying how "close" the bit string is to satisfying the given boolean function f. They do this by assigning a numeric value to each expression in f and combining them. Specifically, given boolean expressions e_1, ..., e_l, the fitness value of e_i, denoted val(e_i), is given as follows for the operators AND, OR and NOT:

    val(e_1 ∧ ... ∧ e_l) = avg(val(e_1), ..., val(e_l)),
    val(e_1 ∨ ... ∨ e_l) = max(val(e_1), ..., val(e_l)), and
    val(¬e_i) = 1 - val(e_i),

where avg(x_1, ..., x_l) returns the mean of the values x_1, ..., x_l. An example of these evaluation functions appears in Table XI. Notice in the table that an assignment has a fitness of 1.0 if and only if it satisfies f. This is true in general if f satisfies some simple conditions (footnote 11). While improvements to these evaluation functions were suggested, empirically those described above performed well in the work of DeJong and Spears [2] and are simple enough for a hardware implementation.

TABLE XI
An example of the SAT GA evaluation functions for f(x1, x2) = ¬x1 ∧ (x1 ∨ x2).

 x1   x2   val(f(x1, x2)) = avg(1 - x1, max(x1, x2))
  0    0   avg(1 - 0, max(0, 0)) = 0.5
  0    1   avg(1 - 0, max(0, 1)) = 1.0
  1    0   avg(1 - 1, max(1, 0)) = 0.5
  1    1   avg(1 - 1, max(1, 1)) = 0.5

A hardware GA implementation for the SAT GA could work as follows. Take the given boolean formula f and map it to a circuit consisting of AND, OR and NOT gates where each AND and OR gate has only two inputs. Then replace each inverter in the circuit with a module that subtracts its input from the constant 1, replace each OR gate with a module that outputs the maximum of its two inputs, and replace each AND gate with a module that adds its two inputs and right-shifts the result by one bit. The result is a circuit that outputs the fitness of the input binary string. The number of gates in the new circuit exceeds that of the old circuit by only a linear factor of the precision (number of bits) used to represent the fitnesses. Thus an HGA implementation of the SAT GA is feasible.

Footnote 11: These conditions are not listed here, but any boolean formula can be made to satisfy them with only a linear increase in its size.
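To see the gate substitution in action, the C sketch below evaluates the formula of Table XI using the three replacement modules just described: subtract-from-one for NOT, maximum for OR, and add-then-right-shift (i.e. averaging) for AND. The 8-bit fixed-point scale is an illustrative assumption.

    #include <stdio.h>

    /* Fitness values in fixed point: 1.0 == 256.  Hypothetical
       precision; the hardware would pick a width to suit the formula. */
    #define ONE 256

    static int not_mod(int e)        { return ONE - e; }        /* 1 - x */
    static int or_mod(int a, int b)  { return a > b ? a : b; }  /* max   */
    static int and_mod(int a, int b) { return (a + b) >> 1; }   /* add,  */
                                                                /* shift */

    /* f(x1, x2) = NOT(x1) AND (x1 OR x2), as in Table XI. */
    static int val_f(int x1, int x2)
    {
        return and_mod(not_mod(x1), or_mod(x1, x2));
    }

    int main(void)
    {
        for (int x1 = 0; x1 <= 1; x1++)
            for (int x2 = 0; x2 <= 1; x2++)
                printf("val(f(%d,%d)) = %.2f\n", x1, x2,
                       (double)val_f(x1 * ONE, x2 * ONE) / ONE);
        return 0;   /* prints 0.5, 1.0, 0.5, 0.5 as in Table XI */
    }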

VI. Extensions of this Work

This paper combined the benefits of hardware with genetic algorithms. Both areas can be extended as described below.

A. Genetic Algorithm Extensions

The genetic algorithm side of this work could be extended by implementing other genetic algorithm operators such as uniform crossover [51], [52], multi-point crossover (allowing for more than 2 parents), and inversion [1]. Permutation-preserving crossover and mutation operators [1], [44], [45] could be implemented for constrained problems such as the TSP (Section V-D). Additionally, the CMM could be parameterized to respect the boundaries of bit groups, i.e. only permit crossover at certain locations. This would be useful in preventing invalid strings in the generalized FPGA partitioning (Section V-A) and hypergraph partitioning (Section V-B) problems. Also, other selection methods [53], [54] could be implemented. When implemented, these methods would be made available to the user via the software front end; the user would select the desired selection method as an HGA parameter. Finally, note that small extensions to the front end make it possible to impose termination conditions on the HGA other than simply running for a fixed number of generations. As mentioned in Section II, if other termination conditions are desired (e.g. amount of population diversity, minimum average fitness), the front end can tell the HGA to run for a fixed number of generations and then check the resultant population to see if it satisfies the termination criteria. If not, that population is retained in the HGA's memory for another run. This process repeats until the termination criteria are satisfied.

B. Hardware Extensions

A potential difficulty associated with a hardware GA is that some applications require extremely large members, a problem mentioned by Megson and Bland [50], [55]. In Section VI-B.1, we address this difficulty by introducing the stream model of the HGA. Other hardware extensions are given in Section VI-B.2.

B.1 The Stream Model

If the population members are extremely large (e.g. hundreds or thousands of bits), then it is unrealistic to send entire members between modules in parallel. It may not even be realistic to

time-multiplex the pins and wait until an entire member is buffered in a module before processing it. Instead, the modules could process data in stream form, where processing begins when the first few bits (words) arrive in the module and output begins while input and processing still occur. That is, the result of processing the first portion of input is sent to the next module before the rest of the input arrives. The remainder of this section describes the modifications to the original design required to support the stream model, which appears in Fig. 8. Like the original HGA model, the stream model can be parallelized (Sections II-D and VI-B.2).

Fig. 8. An overview of the stream model for an HGA with large population members. The shared memory is split across several small chips; the SM and PS exchange members' addresses and fitnesses, while the CMM and FM pass streams of two members.

In the stream model, the SM operates on members' addresses and fitnesses rather than on the members themselves. Once a pair of members is selected, the SM tells the CMM the addresses of the selected members, and the CMM then fetches these members from memory. The CMM chooses a random crossover point, initiates the fetch from memory, and then counts the bits (words) of the members as they enter the module. Before the crossover point, input A is sent to output A' and input B is sent to output B'. After the crossover point, the CMM swaps the two streams, sending input A to output B' and input B to output A'. Uniform crossover can also be implemented in this model. To implement mutation, either the module flips a biased coin to decide whether to apply mutation to each bit, or it flips one coin before fetching the members and, if it decides to apply mutation, then chooses a random bit.
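The stream crossover amounts to a counted swap of two streams, as the following C sketch illustrates word by word; the word width, member length, and crossover point are illustrative. No full-member buffering is needed, only a word counter.

    #include <stdint.h>
    #include <stdio.h>

    /* Stream-model crossover: copy words of members A and B straight
       through until the crossover point, then swap the two streams. */
    static void stream_crossover(const uint8_t *a, const uint8_t *b,
                                 uint8_t *a_out, uint8_t *b_out,
                                 int words, int cross_point)
    {
        for (int i = 0; i < words; i++) {
            if (i < cross_point) {       /* before the crossover point */
                a_out[i] = a[i];
                b_out[i] = b[i];
            } else {                     /* after: streams are swapped */
                a_out[i] = b[i];
                b_out[i] = a[i];
            }
        }
    }

    int main(void)
    {
        uint8_t a[4] = {0xAA, 0xAA, 0xAA, 0xAA};
        uint8_t b[4] = {0x55, 0x55, 0x55, 0x55};
        uint8_t ao[4], bo[4];
        stream_crossover(a, b, ao, bo, 4, 2);
        printf("%02X %02X %02X %02X\n", ao[0], ao[1], ao[2], ao[3]);
        return 0;   /* prints AA AA 55 55 */
    }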

This model is most efficient if fitness function evaluation can begin before receiving the entire member, especially if the member can be evaluated in one pass through the FM. If this is the case, then as the FM evaluates the member, it begins to write it into memory before evaluation or input is complete. Offhand, the only functions we know of that can be evaluated in this fashion are numeric functions in which each variable appears at most once (so-called read-once formulas); if the variables are properly ordered within each member, evaluation can run in a single pass over all the variables. If multiple passes over the members are necessary for evaluation, then on-chip buffers are required. This is not a serious problem, since the stream model's primary objective is to reduce pin counts, which is still achieved in the revised model. Finally, the memory in the stream model is composed of several small memory chips rather than one large one, since the length of memory accesses increases substantially in this model due to the large size of the members. Using several small memory chips and a more powerful MIC would reduce blocking among the modules due to memory accesses. A multi-port RAM could also be used for this purpose.

B.2 Other Hardware Extensions

We have shown how parallelization of the SMs or FMs can improve performance depending on the bottleneck's location. To extend the parallelization of the system, the SM-CMM-FM pipeline could be replicated to form several parallel pipelines by replicating the highlighted portion (dotted box) of Fig. 2 (see Fig. 9). In this case, the duty of writing new members to memory and maintaining records of the HGA's state (see Section II-C, Item 6) would have to be shifted from the FM to a new module called the memory writer (MW), as described in Section III-G. Since the flow rates into each module except the MW should decrease approximately linearly with the number of pipelines, the intuition is that the equations for the normalized service times of the PS, SMs, CMMs and FMs are the same as in Table II except that each would be divided by the number of parallel pipelines (footnote 12). Also, the MW's normalized service time should be as given in Section III-G. Empirical analysis is required to confirm these ideas.

Footnote 12: The PS's normalized service time is divided by the number of pipelines for the same reason that S_norm0^avg is divided by n_sel in Table II: increasing the number of pipelines reduces F_0^avg because the PS's output is used by multiple SMs simultaneously.

Fig. 9. Example of parallel selection-crossover-fitness pipelines.

The highest degree of parallelism of the HGA involves banks of modules (Fig. 10). Here there are an arbitrary number of SMs, CMMs and FMs. Connectivity is complete in the sense that each SM is connected to each CMM and each CMM is connected to each FM. This configuration would maximize the utilization of each module but also complicates the communication between modules. Naturally, a memory writer is required in this scheme. Again, intuition suggests that each module's normalized service time is as given in Table II but divided by the number of instances of that module. If this holds true, the optimal number of each type of module (given the available chip area and the constraint that only one MW may exist) could be determined analytically.


Fig. 10. Example of a completely parallel HGA.

Another improvement of the HGA is to merge the PS with the MIC. Since much of the PS's time is wasted due to the communication delay between the PS and MIC, merging them would greatly reduce the waste. Specifically, s_0 would be reduced by an additive factor of 4. To approximate the impact of this improvement on the average total run time T^avg, recall the equations of Section III-F. If the bottleneck lies in the PS, then Case 1 applies, so the new run

time T^avg' would be, based on Equation 1,

    T^avg' ≈ (m_select^avg (s_0^avg - 4) + s_1out^avg) mg / (2 n_sel) + g (s_2^avg + s_3^avg) ≈ T^avg - 2mg m_select^avg / n_sel.

Evaluating this equation with the values from Table VII yields a reduction in run time of about 57% for m = 32 and about 61% for m = 64. These reductions were about 1-3% smaller for each additional SM. Of course, these results require empirical verification. If the bottleneck lies in the FM, then Case 2 applies, so the new run time T^avg' would be, based on Equation 3,

could be scaled up so that the HGA could handle larger members, larger populations, more complex tness functions and more advanced GA operators. Also, increased FPGA densities will allow increased parallelism of the HGA modules, potentially providing an increased speedup over software-based GAs. VII. Comparison with Related Work

This work is similar to other research in recon gurable hardware systems which improved performance by mapping some or all software components to hardware using reprogrammable FPGAs. These systems rst analyze the software and identify simple, frequently executed components and map them to hardware. Some examples of these are the Splash project [7], [8], the programmable active memory (PAM) architecture [9], the Armstrong/PRISM project [11], [12], [13], the BORG board [28], the FPGA-based neural network [56] which utilizes run-time recon guration, and the Nano Processor (nP) [57]. This line of research has also inspired many commercial and university-based products that are intended for use in recon gurable hardware systems [58]. The earliest hardware-based GA known to us is from DCP Research Corporation in Edmonton, Alberta. It implemented a suite of proprietary GAs in a text compression chip [59]. The chip was not intended to implement a general-purpose genetic algorithm. A little later, Liu [60] designed and simulated a hardware implementation of the crossover and mutation operators. This design expected a software-based GA to perform tness evaluation and selection and present the selected members to the GA board whenever crossover and mutation were desired. In similar work, Red'ko et al. [61] developed a hardware GA which implemented crossover and mutation with standard LSI components, expected tness evaluation to be executed on m parallel processors (where m is the population size), and implemented selection in software. Hesser et al. [62] implemented crossover, mutation, and a simple neighborhood-based selection routine13 as a pipeline on a single FPGA, assuming tness is evaluated externally. In contrast to the above works, our design [17], [63] is a completely self-contained general-purpose GA implemented in hardware, i.e. the modules for tness evaluation and selection also reside on the board. This greatly reduces communication between the board and the CPU of the host computer. In their selection routine, a member is mated with the most t member lying within a small neighborhood around it. 13

45

Many hardware GAs have been implemented only for speci c problems, e.g. image processing [64], image registration [6], disk scheduling [5], and hypergraph partitioning [37]. These GAs were designed for implementation on VLSI chips and thus are neither recon gurable nor general-purpose. They are also expensive to produce in small quantities. However, the intended applications are popular, so a VLSI implementation seems justi able since the systems can be produced in bulk. Alander et al. [65] wrote a general GA engine in VHDL, but intended a VLSI rather than an FPGA implementation of the tness function. Our original design [17], [63] was adapted by Salami and Cain for application to the problems of nding optimal gains for a proportional integral di erential (PID) controller [66] and optimization of electricity generation in response to demand [67]. In these implementations, they address a potential problem of extremely long population members by splitting these large members across multiple GA processors (GAPs). Each GAP would hold its own population of partial strings and perform selection, crossover and mutation independently of other GAPs. For tness evaluation, each GAP would present a substring to the tness evaluator which evaluates the entire string, giving each substring its own tness. This scheme would seem to encounter problems since a substring's tness could conceivably change over time despite a xed tness function. This could happen when a substring is juxtaposed with several other di erent substrings during the course of the GA run. But despite this potential problem, this scheme worked well empirically for the PID controller and electricity generation problems. Another approach to hardware-based GAs lies in the paradigm of hardware/software codesign used in the Armstrong III/PRISM II system [12], [13]. Armstrong III is a loosely coupled MIMD multicomputer composed of several PRISM II boards, each connected to its own communications board. Each PRISM II board includes a 33 MHz AMD Am29050 RISC processor [68] for conventional processing and three Xilinx XC4010 FPGAs [16] for recon gurable computing. Ideally, the user need only write a C or C++ implementation of a system, run the PRISM II con guration compiler to map the software bottlenecks to hardware once they've been identi ed, and compile the remaining software. If the bottlenecks are too dicult to map automatically to hardware, they must be mapped by hand. A library of routines is available to access the hardware components from the software component. This is the approach taken by Sitko et al. [36] with their GA for partitioning Xilinx CLBs [16] across FPGAs. After running the GA in software on a PRISM II board, the bottleneck was determined to be in evaluation of the tness 46

function. Thus parallel tness evaluation modules were implemented on the FPGAs and the remainder of the GA ran in software. Since Armstrong III is a parallel system, several GAs ran simultaneously, periodically exchanging their best population members with each other (a process called migration). This is an example of performing only tness evaluation in hardware and the rest in software. Graham and Nelson's [44] hardware GA for the traveling salesman problem (TSP) was based on the Splash 2 system [8]. Splash 2 is a recon gurable computer consisting of two processor array boards, each with 16 Xilinx XC4010 FPGAs to provide recon gurable computing capabilities. In this TSP GA, called the SPGA, each population member represented a possible tour. Since each city must be visited exactly once, special crossover and mutation operators were used to ensure that each member was a valid tour. A four-stage (consisting of four FPGAs), coarse-grained, bidirectional pipeline was used, with each stage controlling its own memory. The rst three stages respectfully performed selection, crossover, and tness evaluation as well as mutation. The nal stage generated statistics about the new population. When an entire generation was complete, the new members were copied from stage 3 back to stages 2 and 1 to be stored in their private memories. This pipeline is similar to ours except that the operations are di erently distributed over the stages and ours has a single, central memory, removing the requirement of copying new members back to earlier stages. Since only four FPGAs were required per SPGA run in Graham and Nelson's design, eight SPGAs could run in parallel, occasionally allowing for migration. Our system can also be parallelized arbitrarily, but migration would require additional hardware or software control. In a later work [45], the SPGA was found to yield a signi cant speedup in number of clock cycles when compared to a software-based GA on the same problem. This is primarily attributed to faster random number generation in the SPGA (use of a linear feedback shift register rather than incurring the cost of function call overhead), lack of address, branch and function call overheads, coarse-grain parallelism of the four-stage pipeline, and ne-grained parallelism within the selection module. This last feature is identi ed as the cause for the most signi cant speedup because their software GA's bottleneck existed in the selection process. Tommiska and Vuori [69] wrote a GA engine in the Altera Hardware Description Language (AHDL) for implementation on Altera FLEX 10K FPGAs [70]. Their pipeline is similar to ours except that they use bu ers at the input ports of their modules to synchronize the inter-module 47

communication while we use a handshake protocol. Hamalainen et al. [71], [72] designed the genetic algorithm parallel accelerator (GAPA), which is a tree structure of processors for executing a GA. The GAPA is essentially a parallel GA with specialized hardware support to accelerate certain operations. The tree's root is a computer, the internal nodes are called communication units (CUs) which transfer and process information, and the leaves are processing units (PUs) which process information. Each PU consists of a Texas Instruments TMS320C25 general purpose digital signal processor (DSP) [73] and a Xilinx XC4005 FPGA. Each CU consists of a Xilinx XC4005 FPGA. In one mode of operation, each PU creates a subpopulation of members, evaluates their tnesses on the DSP, and transmits them up the tree. The CUs sort the members according to their tnesses as they propagate towards the root, and the root performs selection based on the sorted population. Then the indices of the selected members are broadcast down the tree, and the PUs transmit the selected members back up the tree. On their way to the root, crossover and mutation are performed on the members by the CUs. Upon arriving at the root, the new members are broadcast down the tree for storage in the PUs. In its other mode of operation, each PU runs a sequential GA and periodically transmits its population toward the root for the purpose of migrating its members to other PUs. The CUs lter out all members except the best k (where k is a constant known to the CUs) as the members make their way up the tree. Then the root broadcasts those k best down the tree, where each PU replaces its worst members with the k it receives. These two modes of operation are di erent types of parallel GAs that exploit specialized hardware for acceleration of some functions. Thus the GAPA is not directly comparable with our design that pipelines the GA operations. A di erent approach to hardware-based GAs is taken by Megson and Bland [50], [55]. They present a design for a hardware-based GA that consists of a pipeline of seven systolic arrays. The rst ve stages implement selection while the remaining two perform crossover and mutation. To facilitate a VLSI implementation (giving a size and speed advantage over an FPGA implementation), they assume that typically tness would be evaluated in software, perhaps on parallel processors so it is not as severe a bottleneck. Although the design does allow for a tness evaluation chip to sit in the pipeline if the tness function can be evaluated in a bit-serial or byte-serial manner for compatibility with the systolic array. By contrast, our design assumes 48

that the fitness function is implemented in hardware (typically on FPGAs, although this is not necessary if only one function is ever optimized by the system), giving a complete hardware GA solution to problems whose population members cannot be evaluated bit-serially. Our design is also useful when no parallel processors are available for software fitness evaluation. By assuming fitness evaluation resides on the board, we also greatly reduce the communication overhead between the board and the CPU. Of course, our system forces the user to specify the fitness function in hardware and to ensure that it fits on the user's FPGAs. But the specification problem is becoming easier with the advent of compilers that map high-level software descriptions to hardware [12], [13], [19]-[23]. Also, problems with the speed and density of FPGAs are mitigated by the steady improvement of FPGA technologies. Additionally, Section V describes real problems that can be attacked with our design given current FPGA technologies, showing that FPGA density is no longer a serious issue. Finally, it should be noted that if the user insists on evaluating fitness in software (e.g. if the fitness function is too complex for a hardware implementation), our design is still applicable via the external fitness evaluator (FE) mentioned in Section II-C.7. The FE can be implemented in software if it adheres to the communication protocol expected by the fitness module. Naturally, in this case the speedup of hardware over software will be limited by how severe a bottleneck the fitness evaluation process is.

One advantage of the systolic array implementation is that it can handle arbitrarily long population members without reimplementation. While our VHDL design allows for easy rescaling, reimplementation is still required if the maximum population member size increases. Also, our current design requires that each member be transmitted completely in parallel, causing potential problems with pin counts. We worked around this difficulty by proposing a stream model in Section VI-B.1 in which the members are transmitted bit- or byte-serially through the pipeline. Another advantage of the systolic array implementation is that the time to select a new population is linear in the population size, while in our system this time is quadratic in the population size. This is problematic because our experiments and those of others [45], [50], [55] show that selection can easily be the bottleneck. But there is a cost to the systolic array implementation: the size of the selection systolic array grows quadratically with the population size, while in our selection module only a few registers, one set of input pins, and a multiplier have a size dependency on the population size, and this dependency is only polylogarithmic. This reveals a tradeoff


between area and time in the selection process. To rectify the quadratic growth problem, Megson and Bland suggest breaking the population into subpopulations and running parallel versions of their system with occasional migration between the subpopulations. This scheme would work, but it incurs an additional cost for the parallel hardware and the migration control. Additionally, we can parallelize our selection modules arbitrarily with little additional cost, yielding an almost linear speedup in selection.

Finally, it is important to distinguish hardware-based GAs from evolvable hardware. The former (what we study here) is an implementation of a genetic algorithm in hardware. The latter involves using GAs and other evolution-based strategies to generate hardware designs that implement specific functions; in that case, each population member represents a hardware design, and the goal is to find an optimal design with respect to an objective function, e.g. how well the design performs a specific task. There are many examples of evolvable hardware in the literature [74]-[77].

VIII. Summary

Presented here was the HGA, a working implementation of a general-purpose hardware-based genetic algorithm. Due to the reprogrammability of FPGAs, the HGA possessed the speed of hardware while retaining the flexibility of a software implementation, thus overcoming a major obstacle that previously prevented hardware-based GA implementations. The result is a general-purpose GA engine that is useful in many applications where software-based GA implementations are too slow. This is especially true for GAs that use large populations or many generations, and for GAs with real-time constraints (e.g. disk scheduling). The HGA was designed with parameterized modules to allow scalability, providing easy reimplementation as the state of the art in FPGAs advances. Simulation and analysis were used to study the HGA's performance and identify its bottleneck. The performance analyses revealed possible improvements to the design, including options for different parallel and pipelined configurations.

Some future work includes implementing and analyzing the extensions of Section VI. Another avenue of future work is finding better methods for estimating the average number of members seen per selection (m_avg_select) in the equations of Section III. This is important because our assumption in Section III that m_avg_select = (m + 1)/2 is apparently the only assumption that is grossly inconsistent with our simulation results.
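One plausible source of that (m + 1)/2 figure, stated here as our gloss rather than a derivation taken from Section III: if the roulette-wheel scan is equally likely to halt at each of the m members, then the expected number of members examined per selection is

```latex
E[m_{\text{avg select}}]
  = \sum_{i=1}^{m} i \cdot \frac{1}{m}
  = \frac{1}{m} \cdot \frac{m(m+1)}{2}
  = \frac{m+1}{2}
```

In practice the halting position is fitness-weighted rather than uniform, which may be why the uniform estimate diverges from the simulations.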

Improving this estimate could allow for an accurate performance analysis of the HGA without a priori empirical evidence and indicate where and how much parallelism is useful.

Acknowledgments

Assistance for this work was received from Mentor Graphics Corporation and Xilinx, Incorporated through their donations of software and hardware, respectively. Help with specific problems was garnered from Sue Drouin, Sam Picken, David Lam, and David M. Zar. Appreciation is offered to Paul Kenyon for his help with pipeline analysis, John Kelty and Mike Dvorsky for their assistance with the prototype, Douglas C. Schmidt for his aid with the software timing analyses, and Pak K. Chan for the BORG prototyping board. Finally, the authors thank the FPGA '95 committee members for their helpful comments on an early version of this paper and SIGDA for their generous support, enabling the first author to present the FPGA '95 paper.

References

[1] D. E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley Publishing Company, Incorporated, Reading, Massachusetts, 1989.
[2] K. A. De Jong and W. M. Spears, "Using genetic algorithms to solve NP-complete problems," in Proceedings of the Third International Conference on Genetic Algorithms, J. D. Schaffer, Ed., June 1989, pp. 124-132, Morgan Kaufmann Publishers, Incorporated.
[3] W. M. Spears and K. A. De Jong, "Using neural networks and genetic algorithms as heuristics for NP-complete problems," in Proceedings of the International Joint Conference on Neural Networks, January 1990, vol. 1, pp. 118-125, Lawrence Erlbaum.
[4] A. M. S. Zalzala, Ed., Proceedings of the First IEE/IEEE International Conference on Genetic Algorithms in Engineering Systems: Innovations and Applications, IEE, 1995.
[5] B. C. H. Turton and T. Arslan, "A parallel genetic VLSI architecture for combinatorial real-time applications - disc scheduling," in Proceedings of the First IEE/IEEE International Conference on Genetic Algorithms in Engineering Systems: Innovations and Applications, September 1995, pp. 493-499, http://vlsi2.elsy.cf.ac.uk/group/.
[6] B. C. H. Turton, T. Arslan, and D. H. Horrocks, "A hardware architecture for a parallel genetic algorithm for image registration," in Proceedings of the IEE Colloquium on Genetic Algorithms in Image Processing and Vision, October 1994, pp. 11/1-11/6, http://vlsi2.elsy.cf.ac.uk/group/.
[7] M. Gokhale, W. Holmes, A. Kosper, S. Lucas, R. Minnich, D. Sweely, and D. Lopresti, "Building and using a highly parallel programmable logic array," IEEE Computer, vol. 24, no. 1, pp. 81-89, January 1991.
[8] J. M. Arnold, D. A. Buell, and E. G. Davis, "Splash 2," in Proceedings of the 4th Annual ACM Symposium on Parallel Algorithms and Architectures, June 1992, pp. 316-324.


[9] P. Bertin, D. Roncin, and J. Vuillemin, "Programmable active memories: A performance assessment," Tech. Rep. 24, Digital Equipment Corporation Paris Research Laboratory, Cedex France, March 1993.
[10] P. Bertin, D. Roncin, and J. Vuillemin, "Programmable active memories: A performance assessment," in Research on Integrated Systems: Proceedings of the 1993 Symposium, G. Borriello and C. Ebeling, Eds., 1993, pp. 88-102.
[11] P. M. Athanas and H. F. Silverman, "Processor reconfiguration through instruction-set metamorphosis," IEEE Computer, vol. 26, no. 3, pp. 11-18, March 1993.
[12] M. Wazlowski, A. Smith, R. Citro, and H. F. Silverman, "Armstrong III: A loosely-coupled parallel processor with reconfigurable computing capabilities," Tech. Rep., Brown University, 1996, http://www.lems.brown.edu/arm/.
[13] L. Agarwal, M. Wazlowski, and S. Ghosh, "An asynchronous approach to synthesizing custom architectures for efficient execution of programs on FPGAs," in Proceedings of the International Conference on Parallel Processing, 1994, vol. 2, pp. 290-294, http://www.lems.brown.edu/arm/.
[14] S. Casselman, "Virtual computing and the virtual computer," in Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines, R. Werner and R. S. Sipple, Eds., April 1993, pp. 43-48, IEEE Computer Society Press, http://www.vcc.com/.
[15] J. McLeod, "Reconfigurable computer changes architecture," Electronics, p. 5, April 1994.
[16] Xilinx, Incorporated, San Jose, California, The Programmable Logic Data Book, 1996, http://www.xilinx.com/.
[17] S. D. Scott, "HGA: A hardware-based genetic algorithm," M.S. thesis, University of Nebraska-Lincoln, August 1994, ftp://ftp.cs.unl.edu/pub/TechReps/UNL-CSE-94-020.ps.gz.
[18] S. D. Scott, "HGA v1.3: VHDL source code for the HGA design," June 1997, http://www.cs.wustl.edu/~sds/.
[19] H. Hogl, A. Kugel, J. Ludvig, R. Manner, K.-H. Noffz, and R. Zoz, "Enable++: A second generation FPGA processor," in Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, April 1995, pp. 45-53, http://www-mp.informatik.uni-mannheim.de/.
[20] H. Trickey, "Flamel: A high-level hardware compiler," IEEE Transactions on Computer-Aided Design, vol. CAD-6, no. 2, pp. 259-269, March 1987.
[21] K. Wakabayashi, "Cyber: High-level synthesis from software into ASIC," in High-Level VLSI Synthesis, 1991, pp. 127-151.
[22] R. Camposano, R. A. Bergamaschi, C. E. Haynes, M. Payer, and S. M. Wu, "The IBM high-level synthesis system," in High-Level VLSI Synthesis, 1991, pp. 79-104.
[23] Mentor Graphics Corporation, Wilsonville, Oregon, AutoLogic VHDL Synthesis Guide, 1994, http://www.mentorg.com/.
[24] P. D. Hortensius, H. C. Card, and R. D. McLeod, "Parallel random number generation for VLSI using cellular automata," IEEE Transactions on Computers, vol. 38, pp. 1466-1473, October 1989.
[25] S. Wolfram, "Universality and complexity in cellular automata," Physica, vol. 10D, pp. 1-35, 1984.
[26] M. Serra, T. Slater, J. C. Muzio, and D. M. Miller, "The analysis of one-dimensional linear cellular automata and their aliasing properties," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 9, no. 7, pp. 767-778, July 1990.
[27] P. Kenyon, S. Seth, P. Agrawal, A. Clematis, G. Dodero, and V. Gianuzzi, "Programming pipelined CAD applications on message passing architectures," Concurrency: Practice and Experience, vol. 7, no. 4, pp. 315-337, June 1995.
[28] P. K. Chan, A Field-Programmable Prototyping Board: XC4000 BORG User's Guide, Board of Studies in Computer Engineering, University of California, Santa Cruz, April 1994, http://www.cse.ucsc.edu/~pak/.
[29] D. C. Schmidt, "ASX: An object-oriented framework for developing distributed applications," in Proceedings of the 6th USENIX C++ Technical Conference, April 1994, pp. 200-220, USENIX Association, http://www.cs.wustl.edu/~schmidt/.
[30] B. New, "LCA speed estimation: Asking the right question," in The Programmable Logic Data Book, p. 8.16, Xilinx, Incorporated, San Jose, California, 1993, Xilinx Application Note 011.001, http://www.xilinx.com/.
[31] Advanced Micro Devices, Bipolar Microprocessor and Logic Interface (Am29000 Family) Data Book, 1985, http://www.amd.com/.
[32] D. A. Patterson and J. L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, Morgan Kaufmann Publishers, Inc., San Mateo, California, 1994.
[33] Z. Michalewicz, Genetic Algorithms + Data Structures = Evolution Programs, Springer-Verlag, Berlin, second edition, 1994.
[34] J. P. Hayes, Computer Architecture and Organization, McGraw-Hill Book Company, New York, second edition, 1989.
[35] J. E. Volder, "The CORDIC trigonometric computing technique," IRE Transactions on Electronic Computers, vol. EC-8, pp. 330-334, September 1959.
[36] N. Sitkoff, M. Wazlowski, A. Smith, and H. Silverman, "Implementing a genetic algorithm on a parallel custom computing machine," in IEEE Symposium on FPGAs for Custom Computing Machines, April 1995, pp. 180-187, http://www.lems.brown.edu/arm/.
[37] H. Chan and P. Mazumder, "A systolic architecture for high speed hypergraph partitioning using a genetic algorithm," in Progress in Evolutionary Computation, X. Yao, Ed., Berlin, 1995, pp. 109-126, Springer-Verlag, Lecture Notes in Computer Science number 956.
[38] M. J. O'Dare and T. Arslan, "Hierarchical test pattern generation using a genetic algorithm with a dynamic global reference table," in Proceedings of the First IEE/IEEE International Conference on Genetic Algorithms in Engineering Systems: Innovations and Applications, September 1995, pp. 517-523, http://vlsi2.elsy.cf.ac.uk/group/.
[39] Zycad Corporation, Fremont, California, Paradigm XP Product News, 1996, http://www.zycad.com/.
[40] S. Kang, Y. Hur, and S. A. Szygenda, "A hardware accelerator for fault simulation utilizing a reconfigurable array architecture," VLSI Design, vol. 4, no. 2, pp. 119-133, 1996, http://www.ece.utexas.edu/ece/people/profs/Szygenda.html.
[41] Precedence, Incorporated, Campbell, California, Product Brief, 1996, http://www.precedence.com/.
[42] Synopsys, Incorporated, Mountain View, California, Arkos Datasheet, 1997, http://www.synopsys.com/.


[43] C. Burns, "An architecture for a Verilog hardware accelerator," in Proceedings of the IEEE International Verilog HDL Conference, February 1996, pp. 2-11, http://www.crl.com/www/users/cb/cburns/.
[44] P. Graham and B. Nelson, "A hardware genetic algorithm for the traveling salesman problem on Splash 2," in 5th International Workshop on Field-Programmable Logic and its Applications, August 1995, pp. 352-361, http://splish.ee.byu.edu/.
[45] P. Graham and B. Nelson, "Genetic algorithms in software and in hardware - a performance analysis of workstation and custom computing machine implementations," in Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, April 1996, pp. 216-225, http://splish.ee.byu.edu/.
[46] B. Norman and J. Bean, "Random keys genetic algorithm for job shop scheduling," Engineering Design and Automation, to appear, http://www-personal.engin.umich.edu/~jbean/.
[47] B. Norman and J. Bean, "A genetic algorithm methodology for complex scheduling problems," Tech. Rep. 94-5, University of Michigan, Ann Arbor, Department of Industrial and Operations Engineering, 1994, http://www-personal.engin.umich.edu/~jbean/.
[48] J. Bean, "Genetics and random keys for sequencing and optimization," ORSA Journal on Computing, vol. 6, pp. 154-160, 1994.
[49] F. T. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes, Morgan Kaufmann Publishers, Incorporated, San Mateo, California, 1992.
[50] G. M. Megson and I. M. Bland, "A generic systolic array for genetic algorithms," Tech. Rep., University of Reading, May 1996, http://www.cs.rdg.ac.uk/cs/research/Publications/reports.html.
[51] G. Syswerda, "Uniform crossover in genetic algorithms," in Proceedings of the Third International Conference on Genetic Algorithms and their Applications, 1989, pp. 2-9.
[52] D. Beasley, D. R. Bull, and R. R. Martin, "An overview of genetic algorithms: Part 2, research topics," University Computing, vol. 15, no. 4, pp. 170-181, 1993.
[53] J. E. Baker, "Reducing bias and inefficiency in the selection algorithm," in Proceedings of the First International Conference on Genetic Algorithms, 1987, pp. 14-21.
[54] D. E. Goldberg and K. Deb, "A comparative analysis of selection schemes used in genetic algorithms," in Foundations of Genetic Algorithms, G. Rawlings, Ed., 1991, pp. 69-93.
[55] I. M. Bland and G. M. Megson, "Implementing a generic systolic array for genetic algorithms," in WSC1: The First Online Workshop for Soft Computing, 1996, http://www.bioele.nuee.nagoya-u.ac.jp/wsc1/papers/p058.html.
[56] J. G. Eldredge and B. L. Hutchings, "FPGA density enhancement of a neural network through run-time reconfiguration," in IEEE Workshop on FPGAs for Custom Computing Machines, Napa, CA, April 1994, pp. 180-188.
[57] M. K. Wirthlin, K. Gilson, and B. L. Hutchings, "The nanoprocessor: A low resource reconfigurable processor," in IEEE Workshop on FPGAs for Custom Computing Machines, Napa, CA, April 1994, pp. 23-30.
[58] S. Guccione, "List of FPGA-based computing machines," June 1997, http://www.io.com/~guccione/.
[59] L. Wirbel, "Compression chip is first to use genetic algorithms," Electronic Engineering Times, p. 17, December 1992.


[60] J. Liu, "A general purpose hardware implementation of genetic algorithms," M.S. thesis, University of North Carolina at Charlotte, 1993.
[61] V. G. Red'ko, M. I. Dyabin, V. M. Elagin, N. G. Karpinskii, A. I. Polovyanyuk, V. A. Serechenko, and O. V. Urgant, "On microelectronic implementation of an evolutionary optimizer," Russian Microelectronics, vol. 24, no. 3, pp. 182-185, 1995, translated from Mikroelektronika, vol. 24, no. 3, pp. 207-210, 1995.
[62] J. Hesser, J. Ludvig, and R. Manner, "Real-time optimization by hardware supported genetic algorithms," in Proceedings of the 2nd International Mendel Conference on Genetic Algorithms, Optimization, Fuzzy Logic and Neural Networks, P. Osmera, Ed., 1996, pp. 52-59.
[63] S. D. Scott, A. Samal, and S. Seth, "HGA: A hardware-based genetic algorithm," in Proceedings of the 1995 ACM/SIGDA Third International Symposium on Field-Programmable Gate Arrays, February 1995, pp. 53-59, http://www.cs.wustl.edu/~sds/.
[64] B. C. H. Turton and T. Arslan, "An architecture for enhancing image processing via parallel genetic algorithms & data compression," in Proceedings of the First IEE/IEEE International Conference on Genetic Algorithms in Engineering Systems: Innovations and Applications, September 1995, pp. 337-342, http://vlsi2.elsy.cf.ac.uk/group/.
[65] J. T. Alander, M. Nordman, and H. Setala, "Register-level hardware design and simulation of a genetic algorithm using VHDL," in Proceedings of the 1st International Mendel Conference on Genetic Algorithms, Optimization, Fuzzy Logic and Neural Networks, P. Osmera, Ed., 1995, pp. 10-14.
[66] M. Salami and G. Cain, "An adaptive PID controller based on a genetic algorithm processor," in Proceedings of the First IEE/IEEE International Conference on Genetic Algorithms in Engineering Systems: Innovations and Applications, September 1995, pp. 88-93.
[67] M. Salami and G. Cain, "Multiple genetic algorithm processor for the economic power dispatch problem," in Proceedings of the First IEE/IEEE International Conference on Genetic Algorithms in Engineering Systems: Innovations and Applications, September 1995, pp. 188-193.
[68] Advanced Micro Devices, AM29050 Microprocessor User's Manual, 1991, http://www.amd.com/.
[69] M. Tommiska and J. Vuori, "Implementation of genetic algorithms with programmable logic devices," in Proceedings of the Second Nordic Workshop on Genetic Algorithms and their Applications (2NWGA), J. T. Alander, Ed., August 1996, pp. 71-78, http://www.uwasa.fi/cs/publications/2NWGA.html.
[70] Altera Corporation, San Jose, California, FLEX 10K Embedded Programmable Logic Family, 1996, http://www.altera.com/.
[71] T. Hamalainen, J. Saarinen, P. Ojala, and K. Kaski, "Implementing genetic algorithms in a tree shape computer architecture," in Proceedings of the First Nordic Workshop on Genetic Algorithms and their Applications (1NWGA), J. T. Alander, Ed., January 1995, pp. 259-283, ftp://ftp.uwasa.fi/cs/1NWGA/Hamalainen.ps.Z.
[72] T. Hamalainen, H. Klapuri, J. Saarinen, P. Ojala, and K. Kaski, "Accelerating genetic algorithm computation in tree shaped parallel computer," Journal of Systems Architecture, vol. 42, no. 1, pp. 19-36, August 1996.
[73] Texas Instruments, Incorporated, Second Generation TMS320 User's Guide, 1989, http://www.ti.com/.
[74] E. Sanchez and M. Tomassini, Eds., Towards Evolvable Hardware: The Evolutionary Engineering Approach, Springer-Verlag, Berlin, 1996, Lecture Notes in Computer Science number 1062.
[75] H. de Garis, "An artificial brain," New Generation Computing, vol. 12, pp. 215-221, 1994.
[76] T. Higuchi, H. Iba, and B. Manderick, "Evolvable hardware," in Massively Parallel Artificial Intelligence, H. Kitano and J. A. Hendler, Eds., 1994, pp. 398-421, MIT Press.
[77] "Darwin on a chip," The Economist, p. 85, February 1993.
[78] J. H. Holland, Adaptation in Natural and Artificial Systems, Ph.D. thesis, University of Michigan, Ann Arbor, 1975.
[79] T. C. Fogarty and R. Huang, "Implementing the genetic algorithm on transputer-based parallel processing systems," in Parallel Problem Solving from Nature, 1st Workshop, 1990, pp. 145-149, Springer-Verlag.

Appendix: Background on Genetic Algorithms

A genetic algorithm (GA) [78] is a natural selection-based optimization technique. There are four major differences between GA-based approaches and conventional problem-solving methods: (a) GAs work with an encoding of the parameter set, not the parameters themselves; (b) GAs search for optima from a population of points, not a single point; (c) GAs use payoff (objective function) information, not other auxiliary knowledge such as the derivative information used in calculus-based methods; and (d) GAs use probabilistic transition rules, not deterministic rules. These four properties make GAs robust, powerful, and data-independent [1].

A GA is a stochastic technique with simple operations based on the theory of natural selection. The basic operations are selection of population members for the next generation, "mating" these members via crossover of "chromosomes," and performing mutations on the chromosomes to preserve population diversity and thus avoid convergence to local optima. Finally, the fitness of each member of the new generation is determined using an evaluation (fitness) function. This fitness influences the selection process for the next generation. The GA operations selection, crossover, and mutation primarily involve random number generation, copying, and partial string exchange; thus they are powerful tools that are simple to implement (a minimal end-to-end software sketch follows). Its basis in natural selection allows a GA to employ a "survival of the fittest" strategy when searching for optima. The use of a population of points helps the GA avoid converging to false peaks (local optima) in the search space.
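To make the loop concrete, here is a minimal, self-contained software sketch (ours, not taken from the HGA): roulette-wheel selection, single-point crossover (applied to every pair for brevity; GAs typically apply it with some probability), and per-bit mutation, optimizing the f(x) = 2x example used in the next subsection. All parameter values are illustrative.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define POP   16     /* population size m (kept even for pairing) */
#define BITS  5      /* bits per member */
#define GENS  50     /* number of generations g */
#define P_MUT 0.01   /* per-bit mutation probability */

static double fitness(unsigned x) { return 2.0 * (double)x; }  /* f(x) = 2x */

/* Roulette wheel: member i is chosen with probability f[i] / total. */
static unsigned select_one(const unsigned *pop, const double *f, double total)
{
    double r = ((double)rand() / RAND_MAX) * total, acc = 0.0;
    for (int i = 0; i < POP; i++) {
        acc += f[i];
        if (r <= acc) return pop[i];
    }
    return pop[POP - 1];
}

int main(void)
{
    unsigned pop[POP], next[POP];
    srand((unsigned)time(NULL));
    for (int i = 0; i < POP; i++) pop[i] = (unsigned)rand() % (1u << BITS);

    for (int g = 0; g < GENS; g++) {
        double f[POP], total = 0.0;
        for (int i = 0; i < POP; i++) { f[i] = fitness(pop[i]); total += f[i]; }

        for (int i = 0; i < POP; i += 2) {          /* build pairs of children */
            unsigned a = select_one(pop, f, total);
            unsigned b = select_one(pop, f, total);
            int cut = 1 + rand() % (BITS - 1);      /* single-point crossover */
            unsigned mask = (1u << cut) - 1u;       /* swap the low `cut` bits */
            unsigned a2 = (a & ~mask) | (b & mask);
            unsigned b2 = (b & ~mask) | (a & mask);
            for (int j = 0; j < BITS; j++) {        /* rare per-bit mutation */
                if ((double)rand() / RAND_MAX <= P_MUT) a2 ^= 1u << j;
                if ((double)rand() / RAND_MAX <= P_MUT) b2 ^= 1u << j;
            }
            next[i] = a2; next[i + 1] = b2;
        }
        for (int i = 0; i < POP; i++) pop[i] = next[i];
    }
    for (int i = 0; i < POP; i++)
        printf("%2u  f = %g\n", pop[i], fitness(pop[i]));
    return 0;
}
```

Each GA operation is a few lines of arithmetic and bit manipulation, which is precisely what makes the hardware mapping feasible.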

A. A Genetic Algorithm Example

As a simple example, imagine a population of four members, each with five bits, and consider the objective function f(x) = 2x. The goal is to optimize (in this case maximize) the objective function over the domain 0 ≤ x ≤ 31. Now consider the population of four members in Table XII, generated at random before GA execution. The corresponding fitness values, fractions, and counts come from the objective function f(x):

TABLE XII
Four members and their fitness values.

 i   String x_i   Fitness f(x_i) = 2x_i   f_i / Σf   Expected count f_i / f_avg   Actual count
 1   11000        48                      0.381      1.524                        2
 2   00101        10                      0.079      0.317                        0
 3   10110        44                      0.349      1.397                        1
 4   01100        24                      0.191      0.762                        1
     Sum          126                     1.000      4.000                        4
     Avg          31.5                    0.250      1.000                        1
     Max          48                      0.381      1.524                        2

The values in the "f_i / Σf" column provide the probability of each member's selection. So initially 11000 has a 38.1% chance of selection, 00101 has a 7.9% chance, and so on. The selection process can be thought of as spinning a "weighted roulette wheel," where the size of each member's share of the wheel is proportional to its fraction of the total fitness. Section II-C.5 describes a way to simulate the roulette-wheel selection process; a minimal software analogue appears below. The results of the selections are given in the "Actual Count" column of Table XII. These values are similar to those in the "Expected Count" column, which shows each member's fitness f_i divided by the average fitness f_avg (31.5 in Table XII).
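As a concrete illustration, the following sketch (ours, a software analogue rather than the Section II-C.5 hardware mechanism) simulates four spins of the weighted roulette wheel over the Table XII population.

```c
/* Roulette-wheel selection over the Table XII population. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    /* Fitnesses of 11000, 00101, 10110, 01100. */
    const double f[4] = { 48.0, 10.0, 44.0, 24.0 };
    double total = 0.0;
    for (int i = 0; i < 4; i++) total += f[i];            /* 126 */

    srand((unsigned)time(NULL));
    for (int trial = 0; trial < 4; trial++) {             /* four selections */
        double r = ((double)rand() / RAND_MAX) * total;   /* spin the wheel */
        double acc = 0.0;
        int picked = 3;
        for (int i = 0; i < 4; i++) {
            acc += f[i];              /* member i owns the slice (acc - f[i], acc] */
            if (r <= acc) { picked = i; break; }
        }
        printf("selected member %d\n", picked + 1);
    }
    return 0;
}
```

Over many spins, member 1 is chosen about 38.1% of the time, member 2 about 7.9%, and so on, matching the fractions in the table.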

After selecting the members, the GA randomly pairs the newly selected members and looks at each pair individually. For each pair (e.g. A = 11000 and B = 10110), the GA decides whether or not to perform crossover. If it does not, then both members of the pair are placed into the new population with possible mutations (described below). If it does, then a random crossover point is selected and crossover proceeds as follows:

A = 1 1 | 0 0 0
B = 1 0 | 1 1 0

are crossed and become

A' = 1 1 1 1 0
B' = 1 0 0 0 0.

Then the children A' and B' are placed in the population with possible mutations. The GA invokes the mutation operator on the new bit strings very rarely (usually with probability on the order of 0.01), generating a random number for each bit and flipping that particular bit only if the random number is less than or equal to the mutation probability. A sketch of this crossover and mutation step appears below.
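The following sketch (ours) reproduces this crossover for A = 11000 and B = 10110 at the crossover point shown above, then applies per-bit mutation with probability 0.01. Bit positions are numbered from the least significant bit, so a cut after the second bit from the left of a 5-bit string swaps the low three bits.

```c
/* Single-point crossover of A = 11000 and B = 10110, then mutation. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BITS  5
#define P_MUT 0.01

static void print_bits(const char *name, unsigned x)
{
    printf("%s = ", name);
    for (int j = BITS - 1; j >= 0; j--) putchar('0' + ((x >> j) & 1u));
    putchar('\n');
}

int main(void)
{
    unsigned a = 0x18, b = 0x16;              /* 11000 and 10110 */
    unsigned mask = (1u << 3) - 1u;           /* swap low 3 bits: 11|000 x 10|110 */
    unsigned a2 = (a & ~mask) | (b & mask);   /* 11110 */
    unsigned b2 = (b & ~mask) | (a & mask);   /* 10000 */

    srand((unsigned)time(NULL));
    for (int j = 0; j < BITS; j++) {          /* flip each bit with prob. P_MUT */
        if ((double)rand() / RAND_MAX <= P_MUT) a2 ^= 1u << j;
        if ((double)rand() / RAND_MAX <= P_MUT) b2 ^= 1u << j;
    }
    print_bits("A'", a2);                     /* usually 11110 */
    print_bits("B'", b2);                     /* usually 10000 */
    return 0;
}
```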

After the current generation's selections, crossovers, and mutations are complete, the new members are placed in a new population representing the next generation, as shown in Table XIII. In this example generation, average fitness increased by approximately 30% and maximum fitness increased by 25%. This simple process would continue for several generations until a stopping criterion is met. Possible stopping criteria include number of generations, amount of population diversity, and minimum average fitness.

TABLE XIII
The population after selection and crossover.

 After reproduction   Mate   Crossover point   After crossover   Fitness f(x_i) = 2x_i
 11|000               x_3    2                 11110             60
 1|1000               x_4    1                 11100             56
 10|110               x_1    2                 10000             32
 0|1100               x_2    1                 01000             16
                                               Sum               164
                                               Avg               41
                                               Max               60

GAs work well because they exploit the existence of fit schemata, or templates. A schema is any string from {0, 1, *}^n, where n is the length of each population member and * is a wildcard character.

For example, the schema 10**0 represents the four members yielded by substituting a 0 or a 1 for each *. A fit schema is one that represents members that are very fit on average. In the examples of Tables XII and XIII, one very fit schema is 111**. GAs are very good at discovering and retaining members from fit schemata; a small sketch of schema matching follows.
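A short sketch (ours) of what "matching a schema" means, representing a schema as a pattern together with a care mask that marks the fixed (non-wildcard) positions. Applied to the Table XIII population, it counts the members of 111**.

```c
/* Count how many strings in a population match a schema such as 111**. */
#include <stdio.h>

/* A schema is a `pattern` plus a `care` mask marking fixed positions:
 * for 111** (MSB first), pattern = 11100 and care = 11100. */
static int matches(unsigned x, unsigned pattern, unsigned care)
{
    return ((x ^ pattern) & care) == 0;
}

int main(void)
{
    /* 11110, 11100, 10000, 01000: the Table XIII population. */
    const unsigned pop[4] = { 0x1E, 0x1C, 0x10, 0x08 };
    unsigned pattern = 0x1C, care = 0x1C;     /* schema 111** */
    int count = 0;
    for (int i = 0; i < 4; i++)
        if (matches(pop[i], pattern, care)) count++;
    printf("%d members match 111**\n", count);  /* prints 2 */
    return 0;
}
```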

B. Advanced GA Operators

A popular alternative to the standard crossover operator is uniform crossover [51], [52]. In this scheme, for each bit position i in the children A' and B', a fair coin is tossed. If it comes up heads, then A' gets bit i from A and B' gets bit i from B; otherwise A' gets bit i from B and B' gets bit i from A. Uniform crossover is believed to be effective in preserving fit schemata; a sketch appears at the end of this subsection.
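A short sketch (ours) of uniform crossover on the same two parents used earlier; each bit position is resolved by an independent coin toss, so the children vary from run to run.

```c
/* Uniform crossover: each child bit comes from one parent or the other
 * according to an independent fair coin toss. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BITS 5

int main(void)
{
    unsigned a = 0x18, b = 0x16;      /* parents 11000 and 10110 again */
    unsigned a2 = 0, b2 = 0;

    srand((unsigned)time(NULL));
    for (int j = 0; j < BITS; j++) {
        unsigned bit_a = (a >> j) & 1u, bit_b = (b >> j) & 1u;
        if (rand() & 1) {             /* heads: A' takes A's bit, B' takes B's */
            a2 |= bit_a << j;
            b2 |= bit_b << j;
        } else {                      /* tails: the bits are swapped */
            a2 |= bit_b << j;
            b2 |= bit_a << j;
        }
    }
    printf("A' = %u, B' = %u\n", a2, b2);
    return 0;
}
```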

Genetic algorithms have also been implemented on parallel architectures [36], [44], [50], [55], [79] by running a standard sequential GA on several processors. Each processor maintains its own subpopulation (called a deme) and runs oblivious to the other processors. Periodically, the processors exchange members from their demes (a process called migration). This allows smaller GAs to run quickly while still influencing each other; however, it requires extra control for the migration processes.