A Distributed Evolutionary Algorithm Using C♯ and Mono
Simon Harding
Department of Computer Science, Memorial University
June 2006
Abstract
This technical report describes the implementation of a distributed genetic algorithm, using the Mono implementation of the Microsoft .Net framework.
1 Introduction
This technical report describes the implementation of a distributed genetic algorithm, using the Mono implementation of the Microsoft .Net framework. Memorial University has several hundred Linux machines accessible for distributed computing: some are in dedicated computer clusters, while others, in classrooms and laboratories, are accessible off-peak. This particular arrangement introduces constraints into the design, particularly in the transient availability of the non-dedicated machines. It is believed that this project represents one of the largest, if not the largest, uses of Mono.
1.1 Genetic Algorithms
A genetic algorithm (GA) is a general purpose search technique inspired by the principles of Darwinian evolution. As in nature, a genetic algorithm optimizes a population of individuals by selecting the ones that are best suited to solving a problem and allowing their genetic make-up to propagate into future generations (1; 2; 3). It is typically guided only by the evolutionary process and often contains very limited domain-specific knowledge. Although these algorithms are bio-inspired, it is important that any analogies drawn with nature are considered only as analogies.
Their lack of specialization for a problem makes genetic algorithms ideal search techniques where little is known about a problem. As long as a suitable representation is chosen, along with a fitness function that allows for ease of movement around the search space, a GA can search vast problem spaces rapidly. Another feature of their “dumb” behaviour is that they will try solutions to problems that are unconventional. A human designer normally has a set of predefined rules and strategies that they adopt to solve a problem. These preconceptions may prevent the designer from trying a new method, and so from finding a better solution. A genetic algorithm does not necessarily require such domain knowledge.
Evolutionary algorithms have been shown to beat human-designed solutions in a number of different areas. Their ability to take an unconventional approach to discovering solutions means that they are able to exploit things that human designers cannot. A human designer works from an understanding of how a system works, whereas a GA has no such knowledge. A trial-and-error approach allows the GA to explore a search space without inhibitions, and it will try candidate solutions that would be incomprehensible to a human designer. This is a very powerful feature, and it is the cornerstone of this work.
Many different versions of genetic algorithms exist. Variations in representations and genetic operators change the performance characteristics of the algorithm, and depending upon the problem people employ a variety of modifications of the basic algorithm. However, all the algorithms follow a basic pattern.
A population of random individuals is first generated. Each individual in the population encodes the properties of a potential solution. Each encoding, or genotype, comprises one or more chromosomes. In its most basic form, an individual has a single chromosome made of a binary string of 1s and 0s. However, it is common to use integer and floating-point numbers if more appropriate for the task. Each entry in the chromosome string is an allele, with a segment of the chromosome making up a gene, as illustrated in figure 1. Combinations of different representations can also be used within the same chromosome, and that is the approach used in this work. Whatever representation is used, it should be able to adequately describe the individual and provide a mechanism by which its characteristics can be transferred to future generations without loss of information.
Each of these individuals is then tested to see how “fit” it is. The genotype is decoded into its phenotype, the outward, physical manifestation of the individual. It is the phenotype that is tested and assigned a fitness score. Typically it is this phase of a genetic algorithm that is the most time consuming. The performance of a genetic algorithm is normally measured in terms of the number of evaluations required to find a solution of a given quality.
The next stage is to select what genetic information will proceed to the next generation. In nature the fitness function and selection are essentially the same: individuals that are better suited to the environment survive to reproduce and pass on their genes. In the genetic algorithm a procedure is applied to determine what information gets to proceed. Genetic algorithms are often generational, where all of the old population is removed before moving to the next generation; in nature this process is not as rigorously clocked. However, to increase the continuity of information between generations, some versions of the algorithm use elitism, where the fittest individuals are always selected for promotion to the next generation. This ensures that good solutions are not lost from the population, but may have the side effect that the genetic information in the population converges too quickly, a form of inbreeding.
Figure 1: A natural and binary chromosome.
To generate the next population, a procedure analogous to sexual reproduction occurs. For example, two individuals are selected and their genetic information is combined to produce the genotype of the offspring. This process is called recombination or crossover. The genotype is split into sections at randomly selected points called crossover points. A “simple” GA has only one of these points; however, it is possible to perform multiple-point crossover. Sections of the two chromosomes are then put together to form a new individual. This individual shares some of the characteristics of both parents. There are many different ways to choose which members of the population to breed with each other; the aim in general is to ensure that fit individuals get to reproduce with other fit individuals. Individuals can be selected with a probability proportional to their relative fitness, or selected in some form of tournament.
In natural recombination, errors occur when the DNA is split and combined together. Errors in the DNA of a cell can also occur at any time under the influence of a mutagen, such as radiation, a virus or a toxic chemical. The genetic algorithm also has mutations. A number of alleles are selected at random and modified in some way. For a binary GA, the bit may be flipped; in a real-numbered GA, a random value may be added or subtracted. Although GAs often have both mutation and crossover, it is possible to use just one. Using a mutation-only approach has been demonstrated to work, and crossover often acts as a macro-mutation operator, effectively mutating large sections of a chromosome.
The new individuals in the population are then retested and have fitness scores assigned. Hopefully the average fitness of the population has increased, and the population has moved closer toward a solution. This cycle of test, select and reproduce is continued until a solution is found (or some other termination condition is reached), at which point the algorithm stops.
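The test, select and reproduce cycle described in this section can be sketched in a few lines of C♯. The following is a minimal illustration only, not the implementation used in this project: the class name SimpleGa, the toy "count the 1 bits" fitness function, the tournament size of two and the mutation rate of one over the chromosome length are all assumptions chosen to keep the example short.

```csharp
using System;

// Hypothetical minimal generational GA with elitism (illustrative only).
public static class SimpleGa
{
    // Toy fitness function (an assumption): the number of 1 alleles.
    public static int Fitness(bool[] genotype)
    {
        int score = 0;
        foreach (bool allele in genotype)
        {
            if (allele) score++;
        }
        return score;
    }

    // Tournament selection of size two: pick two at random, keep the fitter.
    static bool[] Select(bool[][] population, Random rng)
    {
        bool[] a = population[rng.Next(population.Length)];
        bool[] b = population[rng.Next(population.Length)];
        return Fitness(a) >= Fitness(b) ? a : b;
    }

    // Single-point crossover followed by per-allele bit-flip mutation.
    static bool[] Breed(bool[] first, bool[] second, double mutationRate, Random rng)
    {
        int point = rng.Next(1, first.Length); // crossover point
        bool[] child = new bool[first.Length];
        for (int i = 0; i < child.Length; i++)
        {
            child[i] = i < point ? first[i] : second[i];
            if (rng.NextDouble() < mutationRate) child[i] = !child[i];
        }
        return child;
    }

    static bool[] Fittest(bool[][] population)
    {
        bool[] best = population[0];
        foreach (bool[] individual in population)
        {
            if (Fitness(individual) > Fitness(best)) best = individual;
        }
        return best;
    }

    // Runs the GA and returns the best genotype found.
    public static bool[] Run(int populationSize, int length, int maxGenerations, int seed)
    {
        Random rng = new Random(seed);

        // Generate a population of random individuals.
        bool[][] population = new bool[populationSize][];
        for (int i = 0; i < populationSize; i++)
        {
            population[i] = new bool[length];
            for (int j = 0; j < length; j++) population[i][j] = rng.Next(2) == 1;
        }

        for (int generation = 0; generation < maxGenerations; generation++)
        {
            bool[] best = Fittest(population);
            if (Fitness(best) == length) break; // termination: solution found

            bool[][] next = new bool[populationSize][];
            next[0] = best; // elitism: the fittest individual always survives
            for (int i = 1; i < populationSize; i++)
            {
                next[i] = Breed(Select(population, rng), Select(population, rng),
                                1.0 / length, rng);
            }
            population = next;
        }
        return Fittest(population);
    }
}
```

A call such as SimpleGa.Run(20, 16, 50, 1) evolves 20 individuals of 16 bits for at most 50 generations. Because of elitism, the best fitness in the population never decreases from one generation to the next.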
Figure 2: Flow chart for an evolutionary algorithm.
2 .Net and Mono
The Microsoft .Net Framework is a development platform comprising a large API and a virtual machine, as specified by the Common Language Infrastructure (CLI). The virtual machine, called the common language runtime (CLR), features an optimising just-in-time compiler that translates a common intermediate language (CIL) into native code. Many languages, such as C++, Visual Basic, Perl, Fortran, Java and C♯, can be compiled to CIL. A full list of available languages can be found at http://www.dotnetpowered.com/languages.aspx.
One of the major benefits of the .Net framework is data type compatibility between different programming languages. Applications and libraries are compiled into assemblies. .Net assemblies support introspection, and functions in applications can be called as easily as those in libraries. An application can therefore be created by interfacing to separately developed applications, easing the development process. Different versions of the same assembly may exist on the same machine, as the platform assigns each assembly a unique name.
The standards for the CLI and C♯ are openly available and have been ratified by ECMA (ECMA 335 and ECMA 334 respectively). ISO followed in April 2003 (ISO/IEC 23271 and ISO/IEC 23270) (http://en.wikipedia.org/wiki/Mono_development_platform).
Mono is an open source¹ implementation of the CLI and C♯ compiler that aims to meet the ECMA standards. Mono can be run on many operating systems, including Linux, FreeBSD, UNIX, Mac OS X, Solaris and Windows. Existing .Net applications can be executed using the Mono CLR (http://www.mono-project.com/Main_Page).
¹ Mono is licensed under a combination of the LGPL, GPL and X11 agreements.
Development of Mono is ongoing, and many features found in the Microsoft .Net framework are missing from it. The main difference is in the implementation of the GUI components, with limited support for the Microsoft Windows Forms API. However, Mono does provide GTK controls for interface design, which are cross-platform. Distributed tasks typically operate “headless”, so this is not an important difference. Mono uses a different method of binary serialisation from the Microsoft implementation, and the two are therefore incompatible. However, .Net objects can be serialised to XML, which is compatible. This difference is important in remoting (discussed in section 2.3).
2.1 C♯
C♯ is an object-oriented language based on C++, Java and Delphi. It is compiled down to the Common Intermediate Language (CIL) for execution on a CLR, such as Mono. The language has features such as automatic garbage collection, generics, hierarchical namespaces and enumerations. One feature that is not shared with languages such as Java is the use of pointers, which can be used within code marked as “unsafe”. In future releases the language will support lambda expressions and SQL-like select and where operators on SQL datasets, XML and other user-defined collections.
2.2 Efficiency of Mono
One of the main criticisms of languages that use virtual machine implementations is that programs perform poorly. From the benchmarks available at http://shootout.alioth.debian.org/gp4/index.php, we can see that over all the benchmarks performed, C♯ running on Mono is on average 2.18 times slower² than GNU C++ (see figure 3). Each bar on the plot represents the performance on a particular benchmark, in terms of both processor and memory use. However, it is expected that as development of Mono continues these margins will be reduced. Compared to other languages, Mono’s performance is comparable with Java (figure 4), and shows considerable performance gains over Perl and Python (figures 5 and 6).
² Ignoring the startup time result.
Figure 3: Benchmark results comparing C♯ on Mono against G++.
Figure 4: Benchmark results comparing C♯ on Mono against Java.
Figure 5: Benchmark results comparing C♯ on Mono against Perl.
Figure 6: Benchmark results comparing C♯ on Mono against Python.
2.3 Remoting
Remoting allows objects to communicate with objects running in other applications, including applications running on different machines. Using C♯ and the .Net framework, remoting can be a straightforward task. In essence, it allows objects to be instantiated on a remote machine, and methods to be executed on that remote object with the results visible locally. The process can be made virtually transparent, with code needed only to provide an object server and to attach to such a server. Example code can be found in section 4. Mono requires more assistance in remoting than the Microsoft implementation, and requires all classes to be tagged as [Serializable].
For remoting, objects have to support serialization. Typically, binary serialization is preferred; however, for communicating between Mono and the Microsoft CLR, XML serialisation is the preferred approach, although it is slower and consumes more bandwidth. XML remoting is handled as a SOAP request, and typically uses HTTP, which can be beneficial if there are firewall restrictions between the distributed machines. TCP is used when remoting using binary serialisation, and any available port can be used. Interprocess communication can be made secure using authentication and encryption (http://msdn.microsoft.com/webservices/remoting/default.aspx).
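To make the idea of XML serialisation concrete, the sketch below round-trips a small object through XML text. It is illustrative only: it uses the general-purpose XmlSerializer class rather than the SOAP formatter that a remoting channel applies internally, and the Individual class and its fields are hypothetical, not the project's own data types.

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

// Hypothetical data class. XmlSerializer needs a public type with a
// public parameterless constructor and public fields or properties.
[Serializable]
public class Individual
{
    public int Id;
    public double Fitness;
    public int[] Genotype;
}

public static class XmlRoundTrip
{
    // Serialises an individual to an XML string.
    public static string ToXml(Individual individual)
    {
        XmlSerializer serializer = new XmlSerializer(typeof(Individual));
        using (StringWriter writer = new StringWriter())
        {
            serializer.Serialize(writer, individual);
            return writer.ToString();
        }
    }

    // Rebuilds an individual from an XML string produced by ToXml.
    public static Individual FromXml(string xml)
    {
        XmlSerializer serializer = new XmlSerializer(typeof(Individual));
        using (StringReader reader = new StringReader(xml))
        {
            return (Individual)serializer.Deserialize(reader);
        }
    }
}
```

Because the intermediate form is plain text, it does not depend on the binary layout chosen by a particular runtime. The cost is a larger payload and slower encoding than a binary format.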
2.4 Reflection
One benefit of languages such as C♯ and Java is that they allow reflection. This is essentially a technique by which programs can inspect and modify themselves at run time. In this project, reflection is used to apply experiment configurations to the client that is performing fitness evaluations. A list of fully qualified variable names and their values can be accessed from a database and applied to the program, dynamically changing the values of the variables. This mechanism provides a simple and generic method of passing configuration information to programs.
2.5 Databases and Persistence
Under the Microsoft CLR, C♯ and other .Net languages can communicate with database servers using the ADO.Net libraries. The Mono project provides compatible libraries, which provide the same basic functionality but as yet do not implement the full API of .Net 2.0. Databases are an attractive alternative to flat files in a distributed environment, and allow for ease of asynchronous access to shared information. The database APIs are [...]
3 The Distributed Algorithm
Figure 7: Diagram showing the tiers in the distributed algorithm.
For the purposes of this project, a layered approach was used (figure 7). At the top sits an SQL database. In this instance MS SQL Server was used, mainly because of its XML data type, which allows for querying (using XQuery) on XML fields. Below this sits the job server, which acts as a wrapper between the database and the clients. It helps marshal the flow of jobs to the clients, and runs the actual genetic algorithm. At the bottom are the individual clients, which perform the fitness evaluations.
The job server and the database maintain a number of experiments, which are run in parallel. In order to obtain results quickly, only a subset of the total experiments is active at any one time. When an active experiment is finished, another experiment from the pool of all experiments is activated and becomes available for evaluation.
When a client is started, it connects using remoting to a job server. It then requests an experiment configuration from the job server, which the job server can retrieve from the SQL database. If there are no more experiments to be performed, the client exits. If there are experiments still running, the client then downloads a number of unevaluated individuals from the population. Again, this is done using remoting, so all the client needs to do is call a method that returns an instance of an individual as an object. If the client requests more unevaluated individuals than the population contains, the job server generates new individuals using selection and mutation. This guarantees that the client will have individuals to evaluate. The job server selects experiments and unevaluated individuals at random, so it can be expected that all the currently active experiments will complete in approximately the same time. The client then performs the fitness evaluation, returns the fitness scores to the population, and repeats the process until all experiments are completed.
To ensure integrity, all individuals and experiments have unique numbers assigned to them. The job server monitors which individuals it has sent for evaluation, and if they are not returned within a given time, the individuals are freed for evaluation by other clients. So if clients fail, the operation of the algorithm is not affected. As clients do nothing other than process individuals, they are able to join and leave the experiment at any time. In the environment used at Memorial University, there is a mixture of permanently available clients and transient clients in labs, so such an approach allows for full use of the available computing time.
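The timeout mechanism that frees unreturned individuals could be organised along the following lines. This is a hedged sketch rather than the job server's actual code: the LeaseTracker class, its method names and the use of a simple queue are assumptions. The real server selects individuals at random and breeds new ones when none are free, whereas this sketch hands them out in order and returns -1 when none are free. The current time is passed in explicitly so that the behaviour is easy to follow.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical job-server bookkeeping for individuals sent to clients.
public class LeaseTracker
{
    private readonly Queue<int> unevaluated = new Queue<int>();
    private readonly Dictionary<int, DateTime> leased = new Dictionary<int, DateTime>();
    private readonly TimeSpan timeout;

    public LeaseTracker(IEnumerable<int> individualIds, TimeSpan timeout)
    {
        foreach (int id in individualIds) unevaluated.Enqueue(id);
        this.timeout = timeout;
    }

    // Hands out the next unevaluated individual, or -1 if none is free.
    public int Lease(DateTime now)
    {
        ReclaimExpired(now);
        if (unevaluated.Count == 0) return -1;
        int id = unevaluated.Dequeue();
        leased[id] = now; // remember when it was sent out
        return id;
    }

    // Called when a client returns a result. Returns false if the
    // individual was not on loan (for example, its lease had expired).
    public bool Complete(int id)
    {
        return leased.Remove(id);
    }

    // Frees individuals that were not returned within the timeout,
    // so that other clients can evaluate them.
    private void ReclaimExpired(DateTime now)
    {
        List<int> expired = new List<int>();
        foreach (KeyValuePair<int, DateTime> entry in leased)
        {
            if (now - entry.Value > timeout) expired.Add(entry.Key);
        }
        foreach (int id in expired)
        {
            leased.Remove(id);
            unevaluated.Enqueue(id);
        }
    }
}
```

With this arrangement a client that disappears costs nothing more than one timeout period. Its individuals simply return to the pool and are handed to the next client that asks.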
References
[1] D. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading, Massachusetts, 1989.
[2] J. Holland. Adaptation in Natural and Artificial Systems. MIT Press, Cambridge, Massachusetts, second edition, 1992.
[3] M. Mitchell. An Introduction to Genetic Algorithms. MIT Press, Cambridge, Massachusetts, 1996.
4 Example code for client server
    /* This code is the server */
    using System;
    using System.Runtime.Remoting;
    using System.Runtime.Remoting.Channels;
    using System.Runtime.Remoting.Channels.Tcp;

    namespace Example
    {
        // Deriving from MarshalByRefObject means clients receive a proxy,
        // so that method calls execute on the server.
        [Serializable]
        public class MyRemoteObject : MarshalByRefObject
        {
            public int Add(int a, int b)
            {
                return a + b;
            }

            public static void StartServer()
            {
                // register a TCP channel
                // Channels should be given names and port numbers
                TcpServerChannel chan = new TcpServerChannel("RemotingServer", 8085);
                ChannelServices.RegisterChannel(chan);
                // register remote object
                RemotingConfiguration.RegisterWellKnownServiceType(
                    typeof(MyRemoteObject),
                    "RemotingServer",
                    WellKnownObjectMode.Singleton);
            }
        }
    }

    /* This code is for the client */
    using System;
    using System.Runtime.Remoting;
    using System.Runtime.Remoting.Channels;
    using System.Runtime.Remoting.Channels.Tcp;

    namespace Example
    {
        public class ClientNode
        {
            public MyRemoteObject RemoteAdder = null;

            private void ConnectToServer(String ServerName)
            {
                TcpChannel channel = new TcpChannel();
                ChannelServices.RegisterChannel(channel);
                RemoteAdder = (MyRemoteObject)Activator.GetObject(
                    typeof(Example.MyRemoteObject),
                    "tcp://" + ServerName + ":8085/RemotingServer");
            }

            public void Run()
            {
                ConnectToServer("myserver.cs.mun.ca");
                int x = 1;
                int y = 2;
                // RemoteAdder is actually running on the remote machine
                // so the addition computation is done remotely,
                // with the result visible locally.
                int z = RemoteAdder.Add(x, y);
                Console.WriteLine(x + " + " + y + " = " + z);
            }
        }
    }
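In the same spirit as the remoting listing, the reflection-based configuration mechanism of section 2.4 can be sketched as follows. This is an assumption-laden illustration, not the project's code: the GaSettings class, its fields and the ConfigApplier helper are hypothetical, and the name and value pairs are supplied in a dictionary where the real system would read them from the database.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Reflection;

// Hypothetical settings holder standing in for the client's own variables.
public static class GaSettings
{
    public static int PopulationSize = 50;
    public static double MutationRate = 0.01;
}

public static class ConfigApplier
{
    // Applies "TypeName.FieldName" -> value pairs to public static fields.
    public static void Apply(IDictionary<string, string> settings)
    {
        foreach (KeyValuePair<string, string> setting in settings)
        {
            int dot = setting.Key.LastIndexOf('.');
            if (dot < 0) throw new ArgumentException("Expected Type.Field: " + setting.Key);
            string typeName = setting.Key.Substring(0, dot);
            string fieldName = setting.Key.Substring(dot + 1);

            // Look the type and field up by name at run time.
            Type type = Type.GetType(typeName, true);
            FieldInfo field = type.GetField(fieldName, BindingFlags.Public | BindingFlags.Static);
            if (field == null) throw new ArgumentException("Unknown field: " + setting.Key);

            // Convert the stored text to the field's declared type and assign it.
            object value = Convert.ChangeType(setting.Value, field.FieldType,
                                              CultureInfo.InvariantCulture);
            field.SetValue(null, value);
        }
    }
}
```

Passing a dictionary containing the entries "GaSettings.PopulationSize" = "200" and "GaSettings.MutationRate" = "0.05" to ConfigApplier.Apply updates both fields without the client code naming them explicitly. This is what lets a single client binary serve many differently configured experiments.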