
Speeding up Genetic Programming: A Parallel BSP Implementation

Simon Kent
Brunel University
Department of Computer Science and Information Systems
London, UK
[email protected]
http://www.brunel.ac.uk:8080/~cspgssk

Dimitris C. Dracopoulos
Brunel University
Department of Computer Science and Information Systems
London, UK
[email protected]
http://www.brunel.ac.uk:8080/~csstdcd

ABSTRACT

A parallel implementation of Genetic Programming is described, using the Bulk Synchronous Parallel (BSP) model, as implemented by the Oxford BSP Library. It is shown that considerable speedup of the GP execution can be achieved. As the complexity and size of the problem increase, the actual speedup improves (assuming a constant number of processors), since the communication overhead becomes small compared with the parallel GP parts. The ease of use of BSP and the speedup achieved with the corresponding parallel implementation suggest that GP researchers should consider BSP parallel implementations when dealing with time-consuming problems.

1 Introduction

Despite its advantages, Genetic Programming (GP) has the drawback that it can take a long time to arrive at a solution, mainly due to the significant amount of CPU time required to evaluate the fitness of individuals. Since the field's birth, much effort has been applied to reducing the time taken to produce results. One approach to increasing speed is to parallelise the GP process. A considerable amount of work has been carried out on the parallelisation of Genetic Algorithms (GAs), as demonstrated by Cantú-Paz (1995); however, this is not the case for Genetic Programming. Koza and Andre (1995) have carried out some work on parallel GP using transputers, but in general very little work has been done in this area. Those involved in GP wish to encourage the acceptance of the field within the mainstream computer science community. General computer scientists will not necessarily have specialist equipment such as transputers at their disposal; however, they are much more likely to have access to a local area network of reasonably powerful workstations. This paper therefore investigates a simple parallelisation of GP on such a network, based on the Bulk Synchronous Parallel (BSP) model of parallel computation.

2 BSP

The Bulk Synchronous Parallel (BSP) model was proposed by Leslie Valiant (1990) for general-purpose parallel computing. The model provides a very basic set of operations which allow synchronisation of processors and communication among them. The model may be implemented using a dedicated language or, more probably, by means of a library of routines. A typical BSP computer model consists of:

- a number of processors, each with local memory
- a mechanism for performing one-way fetches and stores between processors to access non-local data
- a method for bulk synchronisation of all processors

Each processor executes the same program, therefore conforming to a single program, multiple data (SPMD) model. Processor allocation is static: processors are allocated at the beginning of the program, with no subsequent change to the allocation during execution. BSP uses an abstract communication system to perform stores and fetches between processors. The physical communication system may be implemented in a number of ways, for example TCP/IP or shared memory.

BSP START                                      start of the BSP program
BSP FINISH                                     end of the BSP program
BSP SSTEP START(n)                             start of superstep n
BSP SSTEP END(n)                               end of superstep n
BSP STORE(to, from data, to data, length)      store from local to remote processor
BSP FETCH(from, from data, to data, length)    fetch from remote to local processor

Table 1: The Oxford BSP Library basic operations.

Processors within a BSP computer do not proceed in lock-step. Synchronisation is achieved using supersteps. A superstep is a unit of parallel execution during which a processor may perform computations on its own local data, or initiate stores or fetches between itself and other processors. At the end of a given superstep, the processors in the BSP machine wait until all other processors have finished and all non-local communication has completed. To ensure deterministic behaviour, it should never be assumed that a communication started in a particular superstep will be completed within that same superstep. However, the barrier synchronisation which occurs at the end of each superstep does guarantee that non-local data stored or fetched during one superstep will be available in the following superstep. A BSP computer can be defined in terms of the following parameters:

p = number of processors
s = CPU speed of processors
L = minimum elapsed time between successive synchronisations
g = global computation/communication balance

The units of s, L and g are arbitrary, although they must obviously be the same to make meaningful comparisons. An appropriate unit is flops (floating point operations per second). Whilst many GP applications may not use floating point operations, flops provide a unifying measure which is easily computed or referenced for a given processor. These parameters can be useful when predicting the increase in speed that can be expected from a particular BSP configuration. L and g are indicators of the overhead which may be experienced when using BSP: L represents the overhead involved in completing a superstep (barrier synchronisation), and g represents the amount of computation which could have been carried out locally while inter-processor communication takes place. The BSP model can be applied through a library of routines which implement the basic BSP operations. The implementation carried out for this paper used the Oxford BSP Library (Miller and Reed, 1993). This library can be used with FORTRAN, C and Pascal, using different means of communication, including PVM, TCP/IP sockets and shared memory. BSP was chosen because it provides an easy-to-use, portable model for what is a first attempt at implementing a parallel GP system. In addition, the performance of the BSP GP implementation can be predicted across a variety of parallel architectures. The Oxford BSP Library consists of only six operations, as shown in Table 1.
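As an illustration of how these six operations fit together, the following sketch shows the shape of an SPMD BSP program in which every processor fetches work from a master, computes locally, and stores results back. This is pseudocode assembled from the operation names and argument lists given in Table 1 (written here with underscores); it is not the Oxford BSP Library's exact C syntax.

```
BSP_START                               /* all p processors run this same program */

BSP_SSTEP_START(1)
  BSP_FETCH(master, master_buf, local_buf, length)   /* one-sided fetch */
BSP_SSTEP_END(1)    /* barrier: fetched data is guaranteed usable from here on */

evaluate(local_buf)                     /* purely local computation */

BSP_SSTEP_START(2)
  BSP_STORE(master, local_result, result_buf, length)
BSP_SSTEP_END(2)    /* barrier: the master now holds every processor's result */

BSP_FINISH
```

Note that the fetch and the store each occupy their own superstep: as described above, a communication initiated in one superstep is only guaranteed complete after the barrier that ends it.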

3 BSP GP Implementation

The software developed for this paper was an enhancement of a GP system written in C by Kent (1995). Combining the Oxford BSP Library and the existing GP code provided a relatively straightforward means to obtain the results presented here, in this new area of parallel GP. Version 1.2 of the Oxford BSP Library was used. The software was run on a number of SUN SPARCstation 5 machines with 70MHz microSPARC II processors, running Solaris 2.4. These machines were connected on a common segment with Ethernet cabling. Communication between processors used TCP/IP, achieved by linking to an appropriate TCP/IP version of the Oxford BSP Library. Switching to a different means of communication, for example PVM or shared memory, is a simple matter of relinking the software. The aim of the implementation was to parallelise the most computationally expensive part of the GP process: the fitness evaluation of the individual programs which represent possible solutions to a problem. The processors were organised in a master-slave paradigm in which a single master processor coordinates the work, with the slaves supporting the master during fitness evaluation. This is shown in Figure 1. The master is responsible for initialising the population for the first generation. The population is then divided into equal portions for the master and slaves to evaluate. In this way, load balancing is obtained, since it is assumed that all processors have similar capabilities and the same amount of free resources. At an abstract level, the slaves should now be able to fetch their portion from the master. However, the current Oxford BSP library implementation only allows non-local communication of statically declared data¹. The GP code dynamically allocates and deallocates tree nodes every generation, so clearly these cannot be accessed directly using BSP.
¹The next release of the Oxford BSP library will allow communication between dynamic data objects. Therefore, this paper also presents some additional results which assume that dynamic communication is possible.

[Figure 1: Architecture of the BSP GP system. The master initialises the population, packs it and prepares an index; the slaves fetch the index and then the population and unpack it; master and slaves each evaluate the fitness of their local portion; the slaves pack and store their results, which the master unpacks and processes before evolving the next generation. The loop repeats, over three supersteps per generation, until done.]

Instead, the tree representations of each of the programs to be evaluated by slaves are traversed postorder, and the nodes packed into a statically declared data buffer. As it does so, the master also prepares a second index buffer which stores, for each individual, an offset into the data buffer. Once the master has packed the population, communication can begin. First, the slaves fetch their portion of the index; then they are able to fetch their portion of the data buffer. Having fetched the population data, the slaves unpack the nodes, expanding them back into trees. The data buffer actually contains individuals in reverse Polish notation form, so it is relatively simple to regenerate trees from the buffer using a stack. Having retrieved a personal portion of the population, each slave is able to evaluate it. The slave packs the results into a small results buffer which is returned to the master. From Figure 1 it can be seen that three supersteps are used during each generation. The supersteps do not encompass all commands executed, only those initiating communication between processors. After communication, it is essential for bulk, barrier synchronisation to occur. This ensures that: 1. all stores and fetches are complete; 2. no processor is left with an incomplete set of data. The remaining commands can be considered to be contained within implicit supersteps.

4 The Problem

The problem chosen to test the BSP GP implementation was the Artificial Ant problem used by Koza (1993). To clearly demonstrate the speedup achieved, a large population was required so that even when several processors were working on the problem, each still had a reasonable number of individuals to evaluate. If many processors each evaluate only a very small number of individuals, the communication overhead will outweigh the benefit gained from the extra processors. The trail used was referred to by Koza (1993) as the Los Altos Hills trail. The robot ant navigates a 100 × 100 grid, of which the upper-left 50 × 70 section contains a trail of 157 pieces of food, as shown in Figure 2. In this figure, a black square represents a piece of food, and a grey square represents a gap in the trail. To allow comparison, the same GP parameters were used for the problem as were used by Koza (1993). These parameters are summarised in Table 2.

population size                                        2000
crossover probability                                  0.90
reproduction probability                               0.10
mutation probability                                   0.00
probability of choosing function node for crossover    0.90
maximum depth of individuals in initial population     6
maximum depth of individuals in evolved population     17
generative method                                      ramped half-and-half
selection method                                       fitness proportionate
adjusted fitness used                                  TRUE
over-selection used                                    TRUE

Table 2: GP parameters for the Artificial Ant Los Altos Hills problem.

[Figure 2: The Artificial Ant Los Altos Hills trail.]

5 Speedup Prediction

Reference was made in section 2 to the BSP parameters and the fact that they can be used to predict the speedup achievable by parallelising the GP process. Under perfect conditions, if a program running sequentially takes time t to execute, then with p processors the time would be reduced to t/p. Obviously these perfect conditions cannot be achieved, so when predicting speedup it is necessary to take account of the cost incurred when using more than one processor to run a program. The cost of performing one of the supersteps in the BSP GP system is:

    C = L + g · h

The variables L and g are as defined in section 2, and h is the number of data words communicated (words sent + words fetched) in the superstep. Values for the BSP parameters were determined by running a short program which produced estimated values of s, L, g∞ and N½. The variable g∞ represents the asymptotic value of g which would be approached by very large data objects, and N½ is the number of words that must be communicated to achieve half the asymptotic bandwidth. The value of g can be approximated by g ≈ g∞ · (1 + N½/n), where n is the size of the communicated data object in words. The actual values for the BSP machine used are shown in Table 3.

    p     s (Mflops)    L        g∞     N½
    2     1.75          8000     35     2.3
    4     1.75          13000    80     2.0
    8     1.75          22000    180    1.6
    16    1.75          55000    350    1.5

Table 3: Estimated values of BSP parameters for a network of SPARCstation 5 workstations communicating using TCP/IP.

As has already been discussed, the data to be transferred must be packed into a data buffer. The time taken to perform this packing must also be considered a cost, and allowance must therefore be made for it when performing speedup predictions.

Taking all the overheads into consideration, calculations were made to predict the speedup which could be achieved with the BSP GP system. The graph in Figure 3 shows the predicted speedup achievable.

[Figure 3: Graph of predicted speedup. The graph plots speedup factor (1 to 5.5) against number of processors (1 to 8), with curves for the predicted speedup and for the predicted speedup less packing time.]

6 Results

The implemented system was run a number of times, and the average elapsed time for a run of 50 generations was recorded. The results are shown in Table 4.

    processors    elapsed time (sec)
    1             4786
    2             3380
    4             2268
    8             1878
    16            1611

Table 4: Results of BSP GP runs for 50 generations of the Los Altos Hills Artificial Ant problem.

The actual speedup recorded is shown as a graph in Figure 4.

[Figure 4: Graph of actual speedup achieved. The graph plots speedup factor (1 to 3.5) against number of processors (1 to 8), with curves for the actual speedup and for the actual speedup less packing time.]

Although the size of the problem was not very large, one can see that the speedup achieved for the GP run (per generation) is significant. As the size of the problem increases, one can expect the achievable speedup of the BSP GP system to improve, as the communication overhead becomes small compared with the actual parallel computation time. A prediction of such a speedup can easily be calculated by applying the cost modelling procedure of the previous section. Evidently, for a problem of fixed size, there is an optimum number of processors up to which a maximum speedup can be achieved. Therefore, to optimise the performance of the BSP GP system, the number of processors should follow the number p suggested by the cost modelling procedure. From Figures 3 and 4, it can be seen that there is a small difference between actual and predicted performance. This is due to the inaccuracy of the program supplied with the Oxford BSP Library which was used to calculate the values of s, L, g∞ and N½.

7 Conclusions

A parallel implementation of a Genetic Programming system was described, using the Bulk Synchronous Parallel (BSP) model of parallel computation. Although the size of the test problem was quite small, the speedup achieved was significant. The implementation is portable to a number of different platforms, and the performance of the BSP GP system can be predicted for different parallel architectures. The proposed parallelisation of an existing GP system (i.e. parallelisation of fitness evaluation) can achieve very high speedups on large problems. In addition, the proposed BSP GP model is easy to implement and portable to many different platforms. The time required to parallelise an existing serial GP system with the BSP model is insignificant compared with the speedups achieved, suggesting that GP researchers should consider such an implementation when dealing with complex, time-consuming problems.

Acknowledgements

This work was partly supported by EPSRC award no. 95700741.

Bibliography

[1] Erick Cantú-Paz. A summary of research on parallel genetic algorithms. Technical Report 95007, Illinois Genetic Algorithms Laboratory, July 1995.

[2] V. Scott Gordon and Darrell Whitley. Serial and parallel genetic algorithms as function optimizers. In Stephanie Forrest, editor, Proceedings of the 5th International Conference on Genetic Algorithms. Morgan Kaufmann, 1993.

[3] Simon Kent. Using genetic programming techniques in problem solving. Final Year BSc Project, May 1995.

[4] John R. Koza. Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, 1993.

[5] John R. Koza and David Andre. Parallel genetic programming on a network of transputers. Technical Report CSTR-95-1542, Stanford University, 1995.

[6] Richard Miller and Joy Reed. The Oxford BSP Library users' guide. Technical report, University of Oxford, 1993.

[7] Leslie G. Valiant. A bridging model for parallel computation. Communications of the Association for Computing Machinery, 33(8):103–111, 1990.
