New Generation Computing, 29(2011)163-184 Ohmsha, Ltd. and Springer
DEMCMC-GPU: An Efficient Multi-Objective Optimization Method with GPU Acceleration on the Fermi Architecture Weihang ZHU Department of Industrial Engineering Lamar University 211 Redbird Ln, P.O. Box 10032, Beaumont, TX 77710, USA
[email protected] Ashraf YASEEN and Yaohang LI Department of Computer Science Old Dominion University 4700 Elkhorn Ave, Suite 3300, Norfolk, VA 23529, USA
{ayaseen, yaohang}@cs.odu.edu Received 11 October 2010 Revised manuscript received 11 January 2011
Abstract
In this paper, we present DEMCMC-GPU, an efficient method implemented on the Graphics Processing Unit (GPU) for multi-objective continuous optimization problems. The kernel of DEMCMC-GPU is the DEMCMC algorithm, which combines the attractive features of Differential Evolution (DE) and Markov Chain Monte Carlo (MCMC) to evolve a population of Markov chains toward a diversified set of solutions at the Pareto optimal front in the multi-objective search space. With parallel evolution of a population of Markov chains, the DEMCMC algorithm is a natural fit for the GPU architecture. The implementation of DEMCMC-GPU on the pre-Fermi architecture can lead to a ~25-fold speedup on a set of multi-objective benchmark function problems, compared to the CPU-only implementation of DEMCMC. By taking advantage of the new cache mechanism in the emerging NVIDIA Fermi GPU architecture, an efficient sorting algorithm on the GPU, and efficient parallel pseudorandom number generators, the speedup of DEMCMC-GPU can be improved further to ~100-fold.
Keywords: Markov Chain Monte Carlo, Multi-objective Optimization, Graphics Processing Unit.
W. Zhu, A. Yaseen and Y. Li
§1 Introduction
In recent years, with its massively parallel computing mechanism and fast floating-point operations, the Graphics Processing Unit (GPU) has shifted from traditional graphics processing to general-purpose scientific and engineering computing applications. Earlier efforts used vertex or fragment shaders as the main way of programming the GPU for general-purpose computing. With the introduction of the NVIDIA CUDA environment in 2007, a wide variety of applications have emerged.1) Zhu has designed a number of GPU-accelerated algorithms with CUDA for global optimization.2, 3) Zhou and Tan4) developed a parallel approach to run standard particle swarm optimization on the GPU with a speedup of around 11. You5) designed a parallel Ant Colony Optimization (ACO) system for the Traveling Salesman Problem on GPUs. Bakhtiari et al. applied the GPU to Monte Carlo optimization in dose calculation in radiation therapy.6) A constantly updated list for general-purpose evolutionary algorithms can be found at 7).
GPU applications in multi-objective optimization algorithms are relatively new and little literature can be found. Wong8) implemented a variant of a multi-objective genetic algorithm on GPUs and tested it on a set of classical benchmark problems. Li and Zhu8) used GPUs to accelerate the Multiobjective Shuffled Complex Evolution Metropolis (MOSCEM) algorithm on the protein structure modeling problem.
In this paper, we present an efficient GPU implementation of the Differential Evolutionary Markov Chain Monte Carlo (DEMCMC) method for multi-objective optimization. The DEMCMC method incorporates Differential Evolution (DE),8) an efficient and robust evolutionary algorithm for continuous variables, into a population-based Markov Chain Monte Carlo (MCMC) sampler to evolve the Markov chains toward solutions at the Pareto optimal front in the presence of multiple objective functions.
Since the evolution of the Markov chains in a population can be carried out in parallel as multiple threads, the DEMCMC algorithm is a natural fit for the Single Instruction Multiple Thread (SIMT) programming model on the GPU. We implement the DEMCMC-GPU method using a master-slave paradigm, where the slave threads in the GPU carry out the computations of Markov chain evolution and the master process in the CPU synchronizes the Markov chain population. Moreover, by taking advantage of the L1/L2 cache mechanism in the emerging GPU Fermi architecture and migrating more parallel components to the GPU, DEMCMC-GPU can achieve a more significant performance improvement. DEMCMC-GPU is tested on a set of standardized benchmark functions to demonstrate the effectiveness of the approach and the efficiency of the GPU implementation on the NVIDIA Fermi GPU architecture.
The rest of the paper is organized as follows. Section 2 discusses the MCMC method for multi-objective optimization. Sections 3 and 4 illustrate the DEMCMC algorithm and the implementation of DEMCMC-GPU, respectively. Section 5 analyzes the computational results and Section 6 summarizes the paper.
§2 Markov Chain Monte Carlo and Multi-Objective Optimization
Markov Chain Monte Carlo (MCMC) methods have been used as indispensable tools to produce random samples from complex and intractable distributions in a variety of problems. The fundamental algorithm was proposed by Metropolis et al.9) and generalized by Hastings10) by incorporating a transition kernel. The rationale of MCMC is to construct a Markov chain of dependent samples whose equilibrium distribution is the target distribution. Recent advances in MCMC methods include simulated annealing,11) simulated tempering,13) parallel tempering,14) generalized-ensemble Monte Carlo,15) multiple try Metropolis,16) hybrid Monte Carlo,17) configuration-biased Monte Carlo 18) and many others.
In some situations, an optimization problem can also be formulated as a Monte Carlo sampling problem, where solutions located within the vicinity of the global minimum of the objective function have the highest probability of being reached during the sampling process. Therefore, MCMC is also widely used as a stochastic optimization procedure. In most applications where MCMC is used for optimization, the goal is to find the global minima or maxima of a single objective function. Without loss of generality, suppose that we are interested in finding the global minimum of an objective function f(x) over all possible configurations x. The fundamental formulation of MCMC is to construct the probability distribution $\pi(x) \propto e^{-f(x)/T}$, $T > 0$, where T is the simulated temperature. Then, a Markov chain is constructed to sample $\pi(x)$. The solutions in the vicinity of the global minimum of f(x) are more likely to be drawn in the sampling process.
MCMC can also be applied to multi-objective optimization applications. In 2002, Vrugt et al. developed the Multiobjective Shuffled Complex Evolution Metropolis (MOSCEM) algorithm20) based on the concept of Pareto dominance. MOSCEM has demonstrated certain effectiveness in multi-objective optimization of hydrologic models.
However, a recent study21) has shown that, on a set of standardized benchmark functions and a couple of hydrologic calibration cases, the MOSCEM algorithm is inferior to genetic-algorithm-based multi-objective optimization algorithms, including NSGA-II (Nondominated Sorting Genetic Algorithm II) 21) and SPEA-2 (Strength Pareto Evolutionary Algorithm 2),22) in terms of both the capability of reaching Pareto optimal solutions and the computation time. One main reason is that the stochastic search operator in MOSCEM depends on the prior distribution, which requires covariance calculation of the parameters in the population. For this reason, the stochastic search in MOSCEM using complex shuffling “does not scale well for problems of increasing size and/or difficulty.”20) Moreover, the fitness assignment in MOSCEM requires constantly updating the current population, which is not amenable to implementation on the emerging many-core parallel computing architectures.
§3 Differential Evolutionary Markov Chain Monte Carlo (DEMCMC) Method for Multi-Objective Optimization
The DEMCMC method for multi-objective optimization is an improvement of the MOSCEM algorithm.19) First, the Differential Evolution (DE) operator is employed to replace the Gaussian search operator in MOSCEM.23) The DE operator does not rely on a predefined probability distribution function. Instead, DE uses the differences of the parameter vectors in randomly sampled individuals to generate offspring configurations, so the estimation of the prior distribution can be avoided. Secondly, a genetic-algorithm-style selection scheme is employed in population reproduction to accelerate DEMCMC convergence. Thirdly, an adaptive Metropolis scheme is used to automatically tune the MCMC sampler during the sampling process. Finally, unlike MOSCEM, where the fitness assignment is based on the most recent population, the fitness evaluation in DEMCMC is based on the old population, which is particularly suitable for the SIMT programming paradigm on the GPU.
3.1 Dominance-based Method
In multi-objective optimization, when N objective functions, $f_1(x), \ldots, f_N(x)$, are present, the probability distribution becomes $\pi(x) \propto e^{-\mathrm{dom}(x)/T}$, $T > 0$, where dom(x) is the dominance function. The definition of the dominance function is based on the concept of Pareto dominance.24) A configuration u is said to Pareto dominate another configuration v ($u \prec v$) if both of the following conditions are satisfied (assuming all objectives are to be minimized):
1. $f_i(u) \le f_i(v)$ holds for every objective function $f_i(\cdot)$, $i = 1, \ldots, N$; and
2. there is at least one objective function $f_j(\cdot)$ for which $f_j(u) < f_j(v)$ is satisfied.
Consequently, the dominance function dom(x) is the integral over all possible configurations that dominate x:
$$\mathrm{dom}(x) = \int_{y \prec x} 1 \, d|y|.$$
As a result, the MCMC for multi-objective optimization can be formulated as a sampling problem of the dominance function, where the Pareto optimal solutions correspond to the global minima of the dominance function with value 0. In practice, since the Pareto optimal front is usually unknown, it is difficult to explicitly calculate the exact dominance function value for a configuration. Instead, a population of configuration samples is created and the Pareto dominance relationship is evaluated between each pair of the configurations, which can be used to estimate the dominance function value for each configuration. By sampling and minimizing the dominance function, the MCMC sampler evolves the population of Markov chains toward the solutions at the Pareto optimal front in the objective function space.
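The pairwise dominance test and the population-based estimate described above can be sketched in plain C as follows. This is only an illustrative host-side sketch, not the authors' GPU kernel: the function names `dominates` and `dominance_count`, and the row-major layout of `costs`, are assumptions made for the example. The estimate simply counts how many members of the population dominate a given configuration, so a count of 0 marks a configuration that is non-dominated within the population.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if objective vector u Pareto-dominates v (all objectives are
   minimized): u is no worse in every objective and strictly better in one. */
int dominates(const float *u, const float *v, size_t n_obj)
{
    int strictly_better = 0;
    assert(n_obj > 0);
    for (size_t i = 0; i < n_obj; ++i) {
        if (u[i] > v[i]) return 0;          /* u is worse in objective i */
        if (u[i] < v[i]) strictly_better = 1;
    }
    return strictly_better;
}

/* Population-based estimate of dom(x): the number of configurations whose
   objective vectors dominate configuration idx.
   Assumed layout (hypothetical): costs[c * n_obj + k] is objective k of
   configuration c. */
size_t dominance_count(const float *costs, size_t m, size_t n_obj, size_t idx)
{
    size_t count = 0;
    for (size_t j = 0; j < m; ++j) {
        if (j != idx &&
            dominates(&costs[j * n_obj], &costs[idx * n_obj], n_obj)) {
            ++count;
        }
    }
    return count;
}
```

Each call to `dominance_count` only reads the shared objective values, which is why one such evaluation per thread parallelizes without write conflicts.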
3.2 Fitness Assignment
In many single-objective optimization applications, the goal is to search for an optimal solution, whereas in multi-objective optimization there are two equally important goals: 1) finding the Pareto optimal solutions; and 2) obtaining diversified coverage of the Pareto optimal front. Sampling the dominance function may result in clustering of solutions in certain regions of the objective space while missing other Pareto optimal solutions. To preserve diversity in sampling, the dominance function is modified to give preference to the undersampled solutions. This results in a new fitness function. Consequently, the MCMC sampler evolves based on the fitness function instead of the dominance function.
Our DEMCMC algorithm employs the fitness assignment scheme 19) used in MOSCEM, which is based on the number of external non-dominated points. Considering a population P with m individual configurations, $C_1, \ldots, C_m$, the fitness calculation is based on the strength $s_i$ of each non-dominated configuration $C_i$. The strength $s_i$ is defined as the proportion of configurations in P dominated by $C_i$. Then, the fitness of an individual $C_i$ is defined as
$$\mathrm{fit}(C_i) = \begin{cases} s_i, & C_i \text{ is non-dominated} \\ 1 + \sum_{j:\, C_j \prec C_i} s_j, & C_i \text{ is dominated.} \end{cases}$$
The configurations with fitness less than 1.0 are non-dominated. The fitness function favors the non-dominated configurations that dominate fewer configurations, while those with many neighbors in their niche are penalized.19)
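A minimal serial sketch of this strength-based fitness assignment is given below. It is written under two labeled assumptions that go beyond the text: the strength of a non-dominated configuration is computed as (number of configurations it dominates) / m, and the sum for a dominated configuration runs only over the non-dominated configurations that dominate it, since strengths are defined for non-dominated configurations only. The names `assign_fitness` and `dominates` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* 1 if u Pareto-dominates v (minimization), 0 otherwise. */
static int dominates(const float *u, const float *v, int n_obj)
{
    int strictly_better = 0;
    for (int k = 0; k < n_obj; ++k) {
        if (u[k] > v[k]) return 0;
        if (u[k] < v[k]) strictly_better = 1;
    }
    return strictly_better;
}

/* Writes the fitness of each of the m configurations into fit[].
   costs[c * n_obj + k] is objective k of configuration c (assumed layout).
   Returns 0 on success, -1 if memory allocation fails. */
int assign_fitness(const float *costs, int m, int n_obj, float *fit)
{
    int *n_dominated = calloc((size_t)m, sizeof *n_dominated);
    int *is_dominated = calloc((size_t)m, sizeof *is_dominated);
    assert(m > 0 && n_obj > 0);
    if (!n_dominated || !is_dominated) {
        free(n_dominated);
        free(is_dominated);
        return -1;
    }
    /* Pairwise dominance: O(m^2 N) comparisons. */
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < m; ++j)
            if (i != j && dominates(&costs[i * n_obj], &costs[j * n_obj], n_obj)) {
                n_dominated[i]++;
                is_dominated[j] = 1;
            }
    for (int i = 0; i < m; ++i) {
        if (!is_dominated[i]) {
            /* Strength: proportion of the population dominated by C_i. */
            fit[i] = (float)n_dominated[i] / (float)m;
        } else {
            /* Assumption: sum strengths of non-dominated dominators only. */
            float sum = 0.0f;
            for (int j = 0; j < m; ++j)
                if (!is_dominated[j] &&
                    dominates(&costs[j * n_obj], &costs[i * n_obj], n_obj))
                    sum += (float)n_dominated[j] / (float)m;
            fit[i] = 1.0f + sum;
        }
    }
    free(n_dominated);
    free(is_dominated);
    return 0;
}
```

With this definition a strength is at most (m - 1)/m, so non-dominated configurations always receive a fitness below 1.0 and dominated ones a fitness of at least 1.0, consistent with the statement above.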
3.3 Selection and Reproduction
A genetic-algorithm-style selection and reproduction scheme is employed in DEMCMC. Only the top d configurations out of the overall m configurations in a population have a chance to be selected to generate a mutant vector in DE to form a new configuration in the next population. Similar to the selection and reproduction parameters in genetic algorithms, d is an important tunable parameter. Too small a value of d may lead to premature convergence, while a large d slows convergence.
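As a rough illustration of restricting reproduction to the top d configurations, the sketch below draws three mutually distinct parent indices from [0, d), assuming the population has already been sorted by ascending fitness so that indices 0 to d-1 are the best configurations. The function name `pick_parents` is hypothetical, and the C library `rand()` is used only as a placeholder for the parallel generators a GPU implementation would need.

```c
#include <assert.h>
#include <stdlib.h>

/* Draws r[0], r[1], r[2] uniformly from [0, d), mutually distinct and all
   different from i, by simple rejection. Requires d >= 4 so that three
   valid candidates always exist even when i itself lies in [0, d). */
void pick_parents(int i, int d, int r[3])
{
    int count = 0;
    assert(d >= 4);
    while (count < 3) {
        int candidate = rand() % d;   /* placeholder RNG, slight modulo bias */
        int ok = (candidate != i);
        for (int t = 0; t < count; ++t)
            if (r[t] == candidate) ok = 0;
        if (ok) r[count++] = candidate;
    }
}
```

A small d concentrates the draws on a few elite configurations, which is the mechanism behind the premature-convergence risk noted above.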
3.4 Differential Evolution
DE is used to produce new configurations based on the current population. In the DE operator,23) for each configuration $C_i(x_1, \ldots, x_n)$ a mutant vector $V_i$ is typically formed by
$$V_i = C_{r_1} + F \, (C_{r_2} - C_{r_3}),$$
where $i$, $r_1$, $r_2$, and $r_3$ are mutually distinct random integers in the interval [0, d), d is the number of top configurations selected for reproduction, and F > 0 is a tunable amplification control constant as described in 23). The DE strategy adopted in DEMCMC is DE/rand/1/bin.23) Other forms of mutant vector generation are also provided in 23), which can be employed in DEMCMC as well. Then, a new configuration $C_i'(x_1', \ldots, x_n')$ is generated by the crossover operation on $V_i$ and $C_i$:
$$x_j' = \begin{cases} v_j, & j = \langle k \rangle_n, \langle k+1 \rangle_n, \ldots, \langle k+L-1 \rangle_n \\ x_j, & \text{otherwise,} \end{cases}$$
where $\langle \cdot \rangle_n$ denotes the modulo operation with modulus n, k is a randomly generated integer from the interval [0, n-1], L is an integer drawn from [0, n-1] with probability $\Pr(L = l) = (CR)^l$, and $CR \in [0, 1)$ is the crossover probability. Practical advice suggests that CR = 0.9 and F = 0.8 are suitable choices in the DE scheme.23) These parameter values are also adopted in the DEMCMC algorithm.
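The mutation and crossover steps can be sketched for a single configuration as follows. This is a sketch under stated assumptions rather than the authors' kernel: the names `de_mutant` and `de_crossover` are hypothetical, and the random quantities k and L are passed in as arguments so that the arithmetic is deterministic and easy to check; drawing them is left to the caller.

```c
#include <assert.h>

/* Mutant vector V = C_r1 + F * (C_r2 - C_r3), element-wise over n variables. */
void de_mutant(const float *c_r1, const float *c_r2, const float *c_r3,
               float F, int n, float *v)
{
    assert(n > 0 && F > 0.0f);
    for (int j = 0; j < n; ++j)
        v[j] = c_r1[j] + F * (c_r2[j] - c_r3[j]);
}

/* Crossover: the L consecutive components starting at index k (indices taken
   modulo n) come from the mutant v; all other components are kept from x. */
void de_crossover(const float *x, const float *v, int n, int k, int L,
                  float *x_new)
{
    assert(n > 0 && k >= 0 && k < n && L >= 0 && L <= n);
    for (int j = 0; j < n; ++j)
        x_new[j] = x[j];
    for (int s = 0; s < L; ++s) {
        int j = (k + s) % n;          /* <k + s>_n in the notation above */
        x_new[j] = v[j];
    }
}
```

For example, with n = 4, k = 3 and L = 2 the wrapped index set is {3, 0}, so components 3 and 0 are taken from the mutant and components 1 and 2 are inherited unchanged.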
3.5 Adaptive Metropolis Scheme
The Metropolis transition is employed as the Monte Carlo move for generating a new population. The transition probability depends only on the change in the fitness function value. The Metropolis-Hastings ratio is calculated as
$$w(C_i \to C_i') = e^{-(\mathrm{fit}(C_i') - \mathrm{fit}(C_i))/T}.$$
The new configuration $C_i'$ generated by DE is accepted with probability $\min(1, w(C_i \to C_i'))$. T is the simulated temperature, which is an important parameter used to control the acceptance rate of the Metropolis transitions. Studies25, 26) in the Monte Carlo literature have shown that the MCMC sampler yields the best possible performance with an acceptance rate of around 20%. In DEMCMC, we use an adaptive Metropolis scheme to maintain an acceptance rate of 15%-25% by adjusting T in each iteration during the optimization process. When the acceptance rate is less than 15%, T is increased by 10%. On the other hand, when the acceptance rate is over 25%, T is decreased by 10%.
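The temperature adjustment rule can be written as a few lines of C. The sketch below covers only the adaptive part (the 15% and 25% thresholds and the 10% steps stated above); the function name `adapt_temperature` and the choice to measure the acceptance rate as accepted/proposed over one iteration are assumptions of this example.

```c
#include <assert.h>

/* Returns the temperature for the next iteration, given the number of
   accepted Metropolis transitions out of the number proposed in the
   current iteration. */
double adapt_temperature(double T, int accepted, int proposed)
{
    double rate;
    assert(T > 0.0 && proposed > 0 && accepted >= 0 && accepted <= proposed);
    rate = (double)accepted / (double)proposed;
    if (rate < 0.15) return T * 1.10;   /* too few acceptances: raise T  */
    if (rate > 0.25) return T * 0.90;   /* too many acceptances: lower T */
    return T;                           /* within the 15%-25% band       */
}
```

Raising T flattens the ratio w toward 1 for moves that worsen the fitness, so more of them are accepted; lowering T has the opposite effect, which is what pulls the acceptance rate back into the target band.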
3.6 Description of DEMCMC Algorithm
Putting the pieces together, the pseudocode of the DEMCMC algorithm is shown in Fig. 1. First, a population P of m configurations, C1, ..., Cm, is produced randomly and the corresponding objective function values are evaluated. Then, the fitness function values of C1, ..., Cm are calculated against P. In the DEMCMC loop, the configurations in P are sorted according to their fitness values and new configurations C1', ..., Cm' are generated by DE. After evaluating the objective function values of the new configurations, the fitness values of C1', ..., Cm' are calculated against the old population P. Afterwards, the Metropolis transitions are carried out to determine the acceptance of each new configuration to form the new population P'. The value of the simulated temperature T is adjusted to maintain the desired acceptance rate. The DEMCMC computation iterates until the maximum iteration number is reached, and then the non-dominated configurations are output as solutions.
The computations related to the parallel evolution of Markov chains in the DEMCMC loop, including generation of new configurations (lines 8-9 in Fig. 1), evaluation of objective functions, fitness assignment (lines 10-11 in Fig. 1), and Metropolis transition (lines 12-20 in Fig. 1), can be carried out independently and only require read operations to access the shared population information. This is particularly suitable for the master-slave paradigm on a massively parallel computing platform.
Fig. 1 Pseudocode of DEMCMC Algorithm
§4 DEMCMC-GPU
DEMCMC-GPU is the DEMCMC implementation on the CPU-GPU heterogeneous computing platform using the SIMT programming paradigm. The design goal of DEMCMC-GPU is to move heavily repetitive computation tasks to the GPU for data-parallel computing while limiting the CPU-GPU communication overhead. The DEMCMC-GPU program is a combination of CPU and GPU code using a master-slave paradigm. The initialization and evolution of the Markov chains of the population P are carried out by m threads in the GPU, where each thread carries out the computation on one configuration. The GPU memory hierarchy and new features of the Fermi architecture are utilized to achieve the desired performance improvement.
4.1 GPU Architectures
The GPU hardware used in this paper is an NVIDIA GeForce GTX 480, which is built on the recent Fermi architecture. On the GPU, each thread runs the same program (instruction set), called a ‘kernel’. A kernel can use ‘registers’ (32K 32-bit registers per multiprocessor) as fast-access memory and the slower ‘local memory’ if the registers are not enough. For management purposes, all threads are organized into blocks. Each ‘block’ can have a maximum of 1,024 threads. Communication among threads in the same block can be realized with ‘shared memory’. Communication between blocks can be performed with the slower global memory. Communication between the CPU and the GPU can be done through global memory, constant memory, or texture memory on a GPU board. Global memory is a relatively slow memory location that allows both read and write operations. Texture memory is relatively fast read-only memory. Constant memory is fast read-only memory whose size cannot be changed dynamically at runtime.
In the Fermi architecture, the device memory (including global memory and local memory), texture memory and constant memory are all cached with a small amount of cache memory, which is important for program execution performance. Before the Fermi architecture, device memory was not cached, and therefore it sometimes helped to copy data from global memory to texture memory for further processing, especially when the data access was uncoalesced. In the Fermi architecture, this is not necessarily the case, since device memory is also cached with the L1/L2 cache. The L1 cache is shared within one multiprocessor while the L2 cache is shared among all multiprocessors. The ‘L1 cache’ and ‘shared memory’ together are 64KB, which can be set as either 16KB L1 + 48KB shared memory (the default setting) or 48KB L1 + 16KB shared memory. Due to the different performance of memory locations and the limited amount of fast and flexible memory, algorithm implementation and design choices must be guided by memory limitations.
4.2 Population Size and Number of Threads
The GPU relies on multithreading to hide the latency of accessing external memory. Therefore, for a program to take advantage of the GPU hardware efficiently, there must be enough parallel work to keep the resources of the processor fully utilized. Typically, in current GPUs, at least 5,000-10,000 simultaneous threads are required to utilize the GPU processor efficiently.28) On the other hand, our recent study32) has shown that the population size is an important parameter for the DEMCMC algorithm. For a complex multi-objective optimization problem, a large population size is necessary to achieve sufficient coverage of the Pareto optimal front. Therefore, the DEMCMC algorithm can take full advantage of the GPU processor by spawning a large number of threads to handle a large population. As a result, our DEMCMC-GPU implementation usually uses a population size of 15,360 or more for complex multi-objective optimization problems.
4.3 Data Structure for Coalesced Memory Access
In the DEMCMC-GPU program, the population of configurations in global memory or texture memory is shared by concurrent threads. Within a GPU kernel, coalescing the global memory accesses of all threads in a warp (32 threads per warp), so that they access a contiguous portion of memory at one time during the evolution of the parallel Markov chains, is critical for utilizing the wide GPU memory bandwidth. In the implementation, coalesced access to global memory is achieved by reorganizing the solution (variables) and objective (costs) data structure from an ‘array of structs’ (AoS) to a ‘struct of arrays’ (SoA). Figure 2 (a) and (b) show the organization of the population array using AoS and SoA, respectively. Since this is for multi-objective optimization, a large one-dimensional array is defined for storing all solution parameters in the decision space and another one-dimensional array is defined for storing all objective function values in the objective space, as below.

    typedef struct configurations_t {
        float* variables; // # of parameters * population size
        float* costs;     // # of objectives * population size
        float* fitness;   // one fitness value * population size
    } configurations;

Using the SoA data structure to access the configuration population in DEMCMC-GPU leads to coalesced memory access, that is, the memory locations accessed by the threads are contiguous.
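To make the difference between the two layouts concrete, the following sketch computes the flat array index of variable `var` of configuration `config` under each organization. The helper names `aos_index` and `soa_index`, and the specific ordering inside the SoA array (variable-major, configuration-minor), are assumptions of this example, since Fig. 2 is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* AoS: all variables of one configuration are stored next to each other. */
size_t aos_index(size_t config, size_t var, size_t n_vars)
{
    assert(var < n_vars);
    return config * n_vars + var;
}

/* SoA: the same variable of all m configurations is stored contiguously. */
size_t soa_index(size_t config, size_t var, size_t m)
{
    assert(config < m);
    return var * m + config;
}
```

If thread t handles configuration t, neighboring threads reading the same variable touch addresses that differ by 1 element under SoA but by n_vars elements under AoS; only the former lets a warp fetch one contiguous segment.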
Fig. 2 Data Organization in GPU for Coalesced Memory Access
4.4 Pseudorandom Numbers in GPU
One of the challenges in the DEMCMC-GPU implementation is the requirement for a large number of parallel pseudorandom numbers. In practice, an efficient approach is to generate a large number of pseudorandom numbers beforehand to be used by the other kernels. Due to their iterative nature, most traditional pseudorandom number generators do not map efficiently to the GPU programming paradigm.25) In our implementation of the DEMCMC-GPU program, we adopt the XORWOW33) pseudorandom number generator (part of the CURAND library) available from the recent CUDA version 3.2RC as well as a parallel Mersenne Twister pseudorandom number generator.27) Both pseudorandom number generators can concurrently provide high-quality random numbers to meet the needs of the parallel Markov chains in DEMCMC-GPU.
4.5 Objective Functions Evaluation and Fitness Assignment
The parallelization of objective function evaluations in DEMCMC-GPU is straightforward, since they are independent in each thread. Parallelizing these computations saves significant computational time because, in many optimization applications, most of the computational time is spent on objective function evaluation. The computational complexity of the objective function evaluation is $O(mN)$, where m is the population size and N is the number of objective functions. The benchmark functions used in this paper are computationally inexpensive compared to the fitness assignment. Therefore, the overall computation time is dominated by the fitness assignment procedure for multi-objective optimization. However, in real-world problems, the objective functions are often quite complex and computationally very expensive. To illustrate this, we also applied our algorithm to a protein structure modeling problem, as detailed in Section 5. It demonstrates that when the objective function is complicated, the total computation time is dominated by the objective function evaluation.
The computation of the fitness assignment requires evaluating the dominance relationship between each pair of configurations in population P. As a result, the computational complexity of evaluating the fitness of a population with size m and N objective functions is $O(m^2 N)$. Therefore, the fitness assignment can be time-consuming when a large population is employed. The fitness assignment computation in each thread is relatively independent, but it requires read operations on the shared objective function values in the population.
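To give a feel for the gap between the two complexity terms, the following back-of-the-envelope count uses the population size m = 15,360 mentioned in Section 4.2 and N = 2 objectives. It counts elementary objective-value operations only and ignores all constant factors, so it is an order-of-magnitude illustration rather than a measured cost.

```latex
\begin{align*}
  \text{objective evaluation:} \quad & mN = 15{,}360 \times 2 = 30{,}720 \\
  \text{fitness assignment:}   \quad & m^2 N = 15{,}360^2 \times 2
      = 471{,}859{,}200 \approx 4.7 \times 10^{8} \\
  \text{ratio:}                \quad & \frac{m^2 N}{mN} = m = 15{,}360
\end{align*}
```

The ratio grows linearly with the population size, which is why the fitness assignment dominates the run time whenever the individual objective functions are cheap.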
4.6 GPU Sort
The DEMCMC algorithm requires sorting the configurations in a population for the selection and reproduction scheme. With a large population size, migrating the sorting computation from the CPU to the GPU may further improve DEMCMC-GPU performance. However, sorting on the GPU has been controversial.38) Depending on the algorithm complexity and the data size, GPU sort per se may yield limited speedup, and it may even be slower than CPU sort, especially when only a small amount of data is sorted, because there are too few concurrent threads to fully utilize the GPU processor. However, on a large data set, the benefit of GPU sort over CPU sort becomes obvious, because the GPU can be fully utilized and copying data back and forth between CPU and GPU memories can be avoided.
In our DEMCMC-GPU implementation, a GPU-based radix sort method 28) is adopted. The GPU radix sort method sorts key-value pairs where the keys are the fitness values and the values are the indices. To reorganize the data, we first need to make another copy of the unsorted configurations. During the data reorganization process, these configurations are copied back to the population in device memory according to the new index sequence generated by the GPU radix sort. In the pre-Fermi architecture, it is advisable to copy data to the texture memory before the data reorganization, because the texture memory is cached while the global device memory is not. The advantage of copying old data into the texture memory instead of another location in the device memory is to avoid frequent uncoalesced reads on the device memory and to take advantage of the cache mechanism of the texture memory. However, in the Fermi architecture, copying data to the texture memory can be counter-productive, as the global device memory is also cached and the L1 cache has higher bandwidth than the texture cache. Therefore, we use different strategies on different architectures: in the pre-Fermi architecture, the unsorted configurations are copied to texture memory, while in the Fermi architecture they are copied to another location in device memory.
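The key-value sort followed by data reorganization can be mimicked on the CPU as shown below. This is a serial stand-in written for illustration only: the C library `qsort` replaces the GPU radix sort, the population is assumed to be stored in a struct-of-arrays layout with `variables[v * m + c]`, and the names `key_index` and `sort_population` are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct { float key; int index; } key_index;   /* (fitness, index) */

static int cmp_key(const void *a, const void *b)
{
    float ka = ((const key_index *)a)->key;
    float kb = ((const key_index *)b)->key;
    return (ka > kb) - (ka < kb);
}

/* Reorders the population so that configurations appear in ascending
   fitness. Returns 0 on success, -1 if memory allocation fails. */
int sort_population(float *variables, float *fitness, int m, int n_vars)
{
    size_t n_total = (size_t)m * (size_t)n_vars;
    key_index *pairs = malloc((size_t)m * sizeof *pairs);
    float *old_vars = malloc(n_total * sizeof *old_vars);
    assert(m > 0 && n_vars > 0);
    if (!pairs || !old_vars) { free(pairs); free(old_vars); return -1; }

    for (int c = 0; c < m; ++c) {
        pairs[c].key = fitness[c];
        pairs[c].index = c;
    }
    qsort(pairs, (size_t)m, sizeof *pairs, cmp_key);  /* stand-in for radix sort */

    /* Second copy of the unsorted data, then gather by the sorted indices. */
    memcpy(old_vars, variables, n_total * sizeof *old_vars);
    for (int c = 0; c < m; ++c) {
        int src = pairs[c].index;
        fitness[c] = pairs[c].key;
        for (int v = 0; v < n_vars; ++v)
            variables[v * m + c] = old_vars[v * m + src];
    }
    free(pairs);
    free(old_vars);
    return 0;
}
```

The gather reads `old_vars` at scattered source indices, which is the uncoalesced access pattern the paragraph above refers to when it discusses where the second copy should live.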
4.7 GPU Tasks
All operations in the DEMCMC loop, including the generation of new configurations from the previous generation using DE, pseudorandom number generation, objective function evaluation, fitness assignment, configuration sorting, and Metropolis transitions, are executed on the GPU. There is no need for memory synchronization until all the above operations are over. This makes the whole DEMCMC-GPU implementation highly efficient. The algorithm parameters, such as the total population size, the number of variables and the number of objectives, are kept in constant memory. In the pre-Fermi architecture, when the new configurations are generated in GPU global memory, they are copied into the texture memory for objective function evaluation. After the values of every objective function are obtained, the results are copied to the texture memory for the fitness assignment computation. The texture memory is used extensively to save time in accessing larger chunks of (possibly uncoalesced) memory that exceed the capacity of the available registers and shared memory. In contrast, in the Fermi architecture, because of the GPU L1/L2 cache for the device memory, copying the data into the texture memory is no longer necessary. On any GPU architecture, coalescing data in global memory is always necessary to achieve optimal GPU performance. In this program, most global device memory access operations are coalesced, except for the new configuration generation with DE, where the configurations selected for crossover are randomly picked and their memory addresses are inevitably dispersed.
§5 Computational Results
Our DEMCMC-GPU program is implemented in Visual C++ 2008 with the CUDA environment. The same source code can also be compiled and run under Linux Ubuntu 10.04LTS with the G++ 4.4 compiler. For benchmarking purposes, the DEMCMC algorithm is also implemented with CPU-only functions so as to compare the computation speed with the GPU-accelerated implementation. Both the CPU and GPU implementations are tested on a Dell Precision T7400 Workstation computer with an Intel Xeon E5420 CPU, 2GB memory, and an NVIDIA GeForce GTX 480 GPU with 480 (15 multiprocessors x 32 processors per multiprocessor) SIMT processors. The graphics driver version is 260.21 and the CUDA version is 3.2RC.
In the CUDA environment, GPU threads are organized into blocks. In our computational experiments, we use a heuristic number of 256 threads per block. With 256 threads per block, we test up to 30,720 threads (120 blocks). The number of blocks that can be launched with full GPU capability is determined by the registers needed per thread, the shared memory needed per block, or the maximum number of warps or blocks per multiprocessor. When compiling the GPU kernels, it is possible to limit the number of registers used per thread. A maximum of 32 registers per thread is used in the compilation setting for this program. If the actual number of variables exceeds the register limit, the variables spill to local memory, which is much slower but is now cached with the L1/L2 cache in the Fermi architecture. For this program, the number of registers is the factor that limits the number of blocks that can be launched simultaneously on a multiprocessor. Since only 32K (32,768) registers are available per multiprocessor, 4 blocks can be launched per multiprocessor, because 32,768 registers / (32 registers per thread * 256 threads per block) is equal to 4 blocks. With 15 multiprocessors on the GTX 480 board, 15,360 threads (256 threads per block * 4 blocks per multiprocessor * 15 multiprocessors per GTX 480) can be launched at the same time.
The CUDA SDK provides a tool called the CUDA occupancy calculator to help programmers determine these numbers.
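The register-limited thread count can be reproduced with a short calculation. The sketch below assumes a register file of 32,768 32-bit registers per multiprocessor (the Fermi figure this example relies on) and considers the register limit only; shared memory and the per-multiprocessor warp and block limits, which the occupancy calculator also accounts for, are deliberately ignored. The function names are hypothetical.

```c
#include <assert.h>

/* Number of blocks that fit on one multiprocessor when registers are the
   only limiting resource. */
int blocks_per_multiprocessor(int regs_per_sm, int regs_per_thread,
                              int threads_per_block)
{
    assert(regs_per_sm > 0 && regs_per_thread > 0 && threads_per_block > 0);
    return regs_per_sm / (regs_per_thread * threads_per_block);
}

/* Total number of threads resident on the whole GPU at the same time. */
int resident_threads(int blocks_per_sm, int threads_per_block, int n_sm)
{
    return blocks_per_sm * threads_per_block * n_sm;
}
```

With 32 registers per thread and 256 threads per block this gives 4 blocks per multiprocessor and 4 * 256 * 15 = 15,360 resident threads on the 15 multiprocessors of the GTX 480, matching the population size used in the experiments.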
5.1 Effectiveness of DEMCMC Algorithm
[1] DEMCMC on benchmark functions
We test the DEMCMC algorithm on a set of standard benchmark functions provided in Deb et al.’s DTLZ test suite,34) which has been used broadly in the multi-objective optimization literature. Figure 3 shows the Pareto optimal solutions found by DEMCMC in DTLZ 1-7. (DTLZ 2-6 have similar Pareto optimal fronts, but different objective functions.) A population size of 1,024 and 10,000 iterations (40,000 in DTLZ 6) are used. One can find that in all benchmark functions, DEMCMC yields good convergence and coverage of the Pareto optimal front. Moreover, with the help of the GPU, the computation on a test case can typically be completed within a minute. Similar effectiveness32) of DEMCMC can be found in Zitzler et al.’s ZDT test suite35) where the benchmark functions have convex, nonconvex, noncontiguous, multimodal, or nonuniform Pareto optimal fronts.
Fig. 3 DEMCMC solutions in benchmark functions of DTLZ1-7. The black solid lines represent the true Pareto optimal fronts and the green marks are the DEMCMC solutions. DEMCMC has led to solutions with good coverage of the Pareto optimal fronts in DTLZ1-7
DEMCMC has also been applied to a protein loop structure modeling application.36) The rationale of the protein loop structure sampling is to obtain structurally diversified protein loop backbone conformations satisfying various scoring (objective) functions in order to discover conformations close to the
native. Figure 4 shows the structure-score distribution of 849 non-dominated solutions obtained in an 11-residue protein loop 153l(154:164) using the DEMCMC algorithm with population size of 1,000 and 1,000 iterations. The three objective functions are: 1) Soft-sphere van der Waals scoring function [VDW]; 2) Atom pair-wise distance-based scoring function [DIST]; 3) Triplet torsion angle scoring function [TRIPLET]. One can find that optimization using DEMCMC has led to diversified, non-dominated solutions satisfying the three scoring functions. Among these solutions, a solution cluster close to the native conformation (< 0.5 Angstrom) emerges.
Fig. 4 Structure-Score Distribution of 849 Non-dominated Solutions in Protein Loop 153l(154:164)
[2] Convergence analysis with hypervolume
For multiobjective optimization, two distinct conditions usually need to be considered when determining solution quality: 1) how close the solutions are to the Pareto optimal front; 2) how diversely the solutions are distributed on the current non-dominated front. Many performance metrics have been suggested for evaluating either or both conditions.24) In this paper, we use the hypervolume metric30) to measure both conditions in DEMCMC. We test the DEMCMC algorithm on the DTLZ1 problem with 100 variables and 2 objectives, with the results in 31) as the basis for comparison. The DTLZ1 Pareto optimal front is known to be a straight line, and the reference point used for calculating the hypervolume is (11, 11). The maximum hypervolume is about 120.87. The hypervolume value is computed and recorded every 10 generations. Figure 5 shows the hypervolume convergence curve on the DTLZ1 problem obtained with the DEMCMC-GPU program. It reaches a hypervolume of 120.85 in 500 seconds, which is an order of magnitude faster than the CPU implementation.
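To make the hypervolume figure concrete, the following Python sketch computes the two-objective hypervolume of a sampled front by a simple sweep. It is a generic textbook procedure, not the routine used in DEMCMC-GPU. It assumes minimization and the standard two-objective DTLZ1 front f1 + f2 = 0.5 with f1, f2 >= 0, which is not stated in this section; `hypervolume_2d` is a hypothetical helper name. Under that assumption the exact value for reference point (11, 11) is 11 * 11 - 0.5 * 0.5 * 0.5 = 120.875, consistent with the maximum of about 120.87 quoted above.

```python
def hypervolume_2d(points, ref):
    """Area dominated by `points` (minimization) and bounded above by `ref`."""
    # Keep only points that are strictly better than the reference point.
    pts = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    area = 0.0
    best_f2 = ref[1]  # lowest f2 seen so far while sweeping f1 upward
    for f1, f2 in pts:
        if f2 < best_f2:  # the point adds a new rectangle to the dominated area
            area += (ref[0] - f1) * (best_f2 - f2)
            best_f2 = f2
    return area


# Assumed front: 1,001 evenly spaced points on the line f1 + f2 = 0.5.
n = 1001
front = [(0.5 * i / (n - 1), 0.5 - 0.5 * i / (n - 1)) for i in range(n)]
hv = hypervolume_2d(front, ref=(11.0, 11.0))
print(round(hv, 3))  # 120.875
```

A finite sample of the front always yields a value slightly below the exact maximum, which is why a converged run reports a hypervolume just under 120.875.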
Fig. 5 Convergence Analysis with Hypervolume (DTLZ1 problem, 100 variables, 2 objectives, 15,360 threads, 500 seconds time limit)
5.2 Speedup Analysis
We implemented three GPU schemes for testing the speedup of DEMCMC-GPU:
1. ‘Scheme Fermi’ uses the default setting of 16KB L1 cache, and does not use the texture memory for data copying.
2. ‘Scheme pre-Fermi’ uses texture memory for data copying.
3. ‘Scheme CPU sort’ is similar to ‘Scheme pre-Fermi’ except that the sort operation is done on the CPU through CPU-GPU data synchronization.
Comparing ‘Scheme Fermi’ and ‘Scheme pre-Fermi’ helps us understand the performance difference between using texture memory and global memory for data copying. Comparing the pre-Fermi scheme and the CPU sort scheme shows the performance difference between GPU sort and CPU sort in DEMCMC-GPU. Tables 1 and 2 summarize the GPU speedup using these three schemes on a set of scalable benchmark functions adapted from the PISA library37) with 30,720 threads. Table 1 compares the performance of the three schemes. With the help of the L1 cache, the Fermi scheme yields better performance than the pre-Fermi scheme, where texture memory is needed. Moreover, migrating fitness sorting from CPU to GPU in the Fermi and pre-Fermi schemes also leads to a performance improvement over the CPU sort scheme. Table 2 shows the performance of the Fermi scheme compared to the DEMCMC implementation on CPU, where speedups ranging from 72.3 to 129.9 are found in these multi-objective optimization benchmark functions. Figure 6 illustrates the GPU speedup comparison among the three schemes with different numbers of threads in DTLZ1. With a small number of threads, there does not seem to be a significant performance difference among the CPU sort, pre-Fermi, and Fermi schemes; however, as more threads are employed and the GPU is thus more efficiently utilized, the Fermi scheme exhibits a significant performance improvement over the pre-Fermi scheme and the CPU sort scheme.
Table 1 Comparison of execution times of the three schemes on a set of benchmark functions (30,720 threads, 100 iterations, average of 20 runs, all times in msec)
Function   Var   Obj   Scheme Fermi   Scheme pre-Fermi   Scheme CPU sort
Simple2D   2     3     20,542         29,955             32,388
ZDT1       30    2     15,807         23,604             28,366
ZDT2       30    2     16,486         23,723             28,705
ZDT3       30    2     15,644         23,583             28,281
ZDT4       10    2     16,883         24,267             25,938
ZDT6       10    2     16,930         23,702             27,271
SPHERE     100   3     22,533         32,837             46,668
KUR        3     2     14,748         24,061             25,754
QV         100   2     17,872         24,923             38,716
DTLZ1      100   3     20,746         32,758             46,115
DTLZ2      100   3     19,738         31,361             42,333
DTLZ3      100   3     19,989         31,326             42,348
DTLZ4      100   3     19,729         31,872             43,311
DTLZ5      100   3     20,833         32,458             43,904
DTLZ6      100   3     19,548         31,069             42,396
DTLZ7      100   3     21,475         32,986             47,301
COMET      3     3     20,519         33,219             35,561
Table 2 The speedup compared to the CPU-only implementation on a set of benchmark functions (30,720 threads, 100 iterations, average of 20 runs, all times in msec)
Function   Var   Obj   CPU (msec)   Scheme Fermi (msec)   Speedup
Simple2D   2     3     1,485,674    20,542                72.3
ZDT1       30    2     1,426,327    15,807                90.2
ZDT2       30    2     1,547,742    16,486                93.9
ZDT3       30    2     1,440,813    15,644                92.1
ZDT4       10    2     1,599,865    16,883                94.8
ZDT6       10    2     1,553,531    16,930                91.8
SPHERE     100   3     2,546,937    22,533                113.0
KUR        3     2     1,779,529    14,748                120.7
QV         100   2     1,601,542    17,872                89.6
DTLZ1      100   3     2,136,386    20,746                103.0
DTLZ2      100   3     1,744,374    19,738                88.4
DTLZ3      100   3     1,764,484    19,989                88.3
DTLZ4      100   3     2,562,986    19,729                129.9
DTLZ5      100   3     1,925,714    20,833                92.4
DTLZ6      100   3     1,888,219    19,548                96.6
DTLZ7      100   3     2,321,794    21,475                108.1
COMET      3     3     1,576,856    20,519                76.8
Fig. 6 Schemes comparison as a function of the number of threads in DTLZ1 test function (100 variables, 3 objectives, 100 iterations, average of 20 runs)
Fig. 7 GPU speedup comparison based on the number of threads in DTLZ1 test function (100 variables, 3 objectives, 100 iterations, average of 20 runs)
Figures 7 and 8 illustrate the GPU speedup of the Fermi scheme over the DEMCMC CPU implementation as a function of the number of threads in the DTLZ1 test function and the protein modeling program, respectively. The computational time of the CPU program increases quadratically as the number of threads (population size) increases. This is because the computational complexity of fitness assignment, one of the major components of the DEMCMC algorithm, is $O(m^2 N)$, where m is the population size and N is the number of objective functions. By comparison, in DEMCMC-GPU, the computational time increases nearly linearly as the number of threads grows, because the utilization of the GPU becomes more efficient. More importantly, DEMCMC-GPU exhibits a speedup of over 100 on both DTLZ1 and the protein modeling program. Tables 3 and 4 show the GPU time of the major computational components in DTLZ1 and the protein modeling program, respectively. DTLZ1 has simple objective functions; therefore, evaluation of the DTLZ1 objective functions is fast and most of the computational time is spent in fitness assignment. In contrast, in the protein modeling application, most of the computational time is spent in generating new configurations and evaluating objective functions. Overall, DEMCMC-GPU demonstrates its efficiency both in multi-objective optimization problems with simple objective functions and in complicated real-life applications.
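The quadratic growth of the fitness assignment cost can be seen in a minimal Python sketch of pairwise Pareto dominance counting. This is a generic illustration under the assumption of minimization, not the DEMCMC fitness scheme itself; `dominates` and `dominance_counts` are hypothetical names introduced here.

```python
def dominates(a, b):
    """True if objective vector `a` Pareto-dominates `b` (minimization)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def dominance_counts(objectives):
    """For each solution, count how many other solutions dominate it.

    Two nested loops over m solutions, each comparison touching N objectives,
    give a cost that grows as m * m * N.
    """
    m = len(objectives)
    counts = [0] * m
    for i in range(m):
        for j in range(m):
            if i != j and dominates(objectives[j], objectives[i]):
                counts[i] += 1
    return counts


# Five hypothetical solutions with two objectives each.
population = [(1.0, 4.0), (2.0, 2.0), (3.0, 3.0), (4.0, 1.0), (4.0, 4.0)]
counts = dominance_counts(population)
print(counts)  # [0, 0, 1, 0, 4]
```

Solutions with a count of zero are non-dominated. Doubling the population size quadruples the number of pairwise comparisons on a serial CPU, whereas on the GPU each thread can take the loop for one solution, so the wall-clock time grows more slowly until the hardware is saturated.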
Fig. 8 GPU speedup comparison based on the number of threads in protein loop sampling program on a 12-residue loop 1xyz (813:824)
Table 3 Time breakdown of GPU tasks for DEMCMC in DTLZ1 function with 100 variables and 3 objectives (30,720 threads, 100 iterations)
Category   Method                  #calls   GPU (µsec)    GPU time %
Kernel     fitness assignment      100      19,742,200    95.46
Kernel     make new solutions      99       228,114       1.10
Kernel     metropolis              99       209,709       1.01
Kernel     random # generation     99       63,831        0.30
Kernel     objectives evaluation   100      29,768        0.14
Memory     memory sync             -        232,632       1.11
Kernel     Other kernels           -        -             0.88
Table 4 Time breakdown of GPU tasks for DEMCMC in the protein structure modeling program (30,720 threads, 1xyz (813:824), 100 iterations)
Category   Method                  #calls   GPU (µsec)    GPU time %
Kernel     make new solutions      101      63,438,524    48.73
Kernel     objectives evaluation   101      55,893,169    42.93
Kernel     fitness assignment      101      7,281,382     5.58
Memory     memory sync             -        3,547,711     2.71
Kernel     Other kernels           -        -             0.05
§6 Conclusions and Future Research Directions
In this paper, we present an implementation of the DEMCMC algorithm on the latest GPU Fermi architecture. The DEMCMC algorithm has shown efficiency on a variety of multi-objective optimization benchmark functions as well as a protein modeling application. By taking advantage of the new cache mechanism in the Fermi GPU architecture, migrating fitness sorting from CPU to GPU, organizing DEMCMC data structures for coalesced memory access, using a large population size with a large number of threads to fully utilize the GPU, and employing efficient parallel pseudorandom number generators, DEMCMC-GPU reaches a speedup of about 100 over the CPU-only implementation. Our computational results also show that DEMCMC-GPU is efficient both in multi-objective optimization problems with simple objective functions and in complex real-life applications such as protein structure modeling. Our future research directions include finding the optimal population size for DEMCMC that can sufficiently cover the Pareto optimal front, more effectively leveraging the large population size on the GPU to attack large-scale optimization problems, and applying DEMCMC-GPU to other important multi-objective optimization applications.
Acknowledgements
This work is partially supported by NSF grants CCF-0829382 and CCF-0845702 to Y. Li, and by Lamar Research Enhancement Grants, a grant from the Texas Hazardous Waste Research Center, and NSF grant DUE-0737173 to W. Zhu.
References
1) CUDA Zone, http://www.NVIDIA.com/object/cuda_home_new.html, accessed on October 1, 2010.
2) Zhu, W., “Massively Parallel Differential Evolution with Graphics Hardware Acceleration: an Investigation on Bound Constrained Optimization Problems,” Journal of Global Optimization, doi: 10.1007/s10898-010-9590-0, 2010.
3) Zhu, W., “Nonlinear Optimization with a Massively Parallel Evolution Strategy Algorithm on Graphics Hardware,” Applied Soft Computing Journal, doi: 10.1016/j.asoc.2010.05.020, 2010.
4) Zhou, Y. and Tan, Y., “GPU-based Parallel Particle Swarm Optimization,” in Proc. of IEEE Congress on Evolutionary Computation, 2009.
5) You, Y., “Parallel Ant System for Traveling Salesman Problem on GPUs,” in Proc. of ACM Genetic and Evolutionary Computation Conference, 2009.
6) Bakhtiari, M., Malhotra, H., Jones, M. D., Chaudhary, V., Walters, J. P. and Nazareth, D., “Applying Graphics Processor Units to Monte Carlo Dose Calculation in Radiation Therapy,” J. Med. Phys., 35, 2, pp. 120–122, 2010.
7) General purpose genetic programming on Graphics Processing Units, http://www.gpgpgpu.com, accessed on October 1, 2010.
8) Wong, M. L., “Parallel Multi-Objective Evolutionary Algorithms on Graphics Processing Units,” in Proc. of ACM Genetic and Evolutionary Computation Conference, 2009.
9) Li, Y. and Zhu, W., “GPU-Accelerated Multi-scoring Functions Protein Loop Structure Modeling,” in Proc. of 9th IEEE International Workshop on High Performance Computational Biology, 2010.
10) Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H. and Teller, E., “Equation of State Calculations by Fast Computing Machines,” Journal of Chemical Physics, 21, pp. 1087–1092, 1953.
11) Hastings, W. K., “Monte Carlo Sampling Methods using Markov Chains and Their Applications,” Biometrika, 57, pp. 97–109, 1970.
12) Kirkpatrick, S., Gelatt, Jr. C. D. and Vecchi, M. P., “Optimization by Simulated Annealing,” Science, 220, pp. 671–680, 1983.
13) Marinari, E. and Parisi, G., “Simulated Tempering: a New Monte Carlo Scheme,” Europhysics Letters, 19, pp. 451–458, 1992.
14) Geyer, C. J. and Thompson, E. A., “Annealing Markov Chain Monte Carlo with Applications to Ancestral Inference,” J. of the American Statistical Association, 90, pp. 909–920, 1995.
15) Hansmann, U. H. E. and Okamoto, Y., “Generalized-ensemble Monte Carlo method for systems with rough energy landscape,” Physical Review E, 56, 20, pp. 2228–2233, 1997.
16) Liu, J. S., Liang, F. and Wong, W. H., “The Use of Multiple-Try Method and Local Optimization in Metropolis Sampling,” Technical Report, Department of Statistics, Stanford University, 1998.
17) Li, Y., Protopopescu, V. A., Arnold, N., Zhang, X. and Gorin, A., “Hybrid Parallel Tempering/Simulated Annealing Method,” Applied Mathematics and Computation, 212, pp. 216–228, 2009.
18) Siepmann, J. I. and Frenkel, D., “Configuration bias Monte Carlo: a new sampling scheme for flexible chains,” Molecular Physics, 75, 1, pp. 59–70, 1992.
19) Vrugt, J. A., Gupta, H. V., Bastidas, L. A., Bouten, W. and Sorooshian, S., “Effective and Efficient Algorithm for Multiobjective Optimization of Hydrologic Models,” Water Resources Research, 39, 8, pp. 1214–1232, 2002.
20) Tang, Y., Reed, P. and Wagener, T., “How effective and efficient are multiobjective evolutionary algorithms at hydrologic model calibration?,” Hydrol. Earth Syst. Sci., 10, pp. 289–307, 2006.
21) Deb, K., Pratap, A., Agarwal, S. and Meyarivan, T., “A Fast and Elitist Multiobjective Genetic Algorithm: NSGA-II,” IEEE Trans. Evol. Computation, 6, pp. 182–197, 2002.
22) Zitzler, E., Laumanns, M. and Thiele, L., “SPEA2: Improving the Strength Pareto Evolutionary Algorithm,” TIK-103, Department of Electrical Engineering, Swiss Federal Institute of Technology, 2001.
23) Storn, R. and Price, K., “Differential Evolution - A Simple and Efficient Heuristic for Global Optimization over Continuous Spaces,” J. of Global Optimization, 11, 4, pp. 341–359, 1997.
24) Deb, K., Multi-objective Optimization Using Evolutionary Algorithms, John Wiley & Sons, 2001.
25) Rathore, N., Chopra, M. and de Pablo, J. J., “Optimal Allocation of Replicas in Parallel Tempering Simulations,” J. Chem. Phys., 122, 2, 024111, American Institute of Physics, 2005.
26) Kone, A. and Kofke, D. A., “Selection of Temperature Intervals for Parallel Tempering Simulations,” J. Chem. Phys., 122, 20, 206101, American Institute of Physics, 2005.
27) Mersenne Twister random number generator, http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/emt.html, accessed on October 1, 2010.
28) Satish, N., Harris, M. and Garland, M., “Designing Efficient Sorting Algorithms for Manycore GPUs,” in Proc. of 23rd IEEE International Parallel and Distributed Processing Symposium, 2009.
29) Zitzler, E., Deb, K. and Thiele, L., “Comparison of Multiobjective Evolutionary Algorithms: Empirical Results,” Evolutionary Computation, 8, 2, The MIT Press, pp. 173–195, 2000.
30) Beume, N., Fonseca, C. M., López-Ibáñez, M., Paquete, L. and Vahrenhold, J., “On the complexity of computing the hypervolume indicator,” IEEE Transactions on Evolutionary Computation, 13, 5, pp. 1075–1082, 2009.
31) DTLZ1 test problem and its convergence plot with maximum hypervolumes, http://www.tik.ee.ethz.ch/sop/download/supplementary/testproblems/dtlz1/index.php, accessed on October 1, 2010.
32) Zhu, W. and Li, Y., “GPU-Accelerated Differential Evolutionary Markov Chain Monte Carlo Method for Multi-Objective Optimization over Continuous Space,” in Proc. of 2nd Workshop on Bio-Inspired Algorithms for Distributed Systems, 2010.
33) Marsaglia, G., “Xorshift RNGs,” Journal of Statistical Software, 8, 14, pp. 1–6, 2003.
34) Deb, K., Thiele, L., Laumanns, M. and Zitzler, E., “Scalable Test Problems for Evolutionary Multi-Objective Optimization,” KanGAL Report, Indian Inst. Technol., 2001.
35) Zitzler, E., Deb, K. and Thiele, L., “Comparison of multiobjective evolutionary algorithms: Empirical results,” Evol. Comput., 8, 2, pp. 173–195, 2000.
36) Li, Y., Rata, I. and Jakobsson, E., “Integrating Multiple Scoring Functions to Improve Protein Loop Structure Conformation Space Sampling,” in Proc. of IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, 2010.
37) Bleuler, S., Laumanns, M., Thiele, L. and Zitzler, E., “PISA - A Platform and Programming Language Independent Interface for Search Algorithms,” LNCS, 2632, pp. 494–508, 2003.
38) Kipfer, P. and Westermann, R., “Improved GPU Sorting,” GPU Gems 2: Programming Techniques for High-Performance Graphics and General-Purpose Computation, Addison-Wesley Professional, pp. 733–746, 2005.
Weihang Zhu, Ph.D.: He is an Assistant Professor of Industrial Engineering, Lamar University, USA. He received his Ph.D. in Industrial Engineering from North Carolina State University (2003) and his M.S. (2000) and B.S. (1997) in Mechanical and Energy Engineering at Zhejiang University, China. His research interests include high performance computing, computer haptics, CAD/CAM, virtual reality-based medical simulation, haptics in education and training, computational biology, thermal system analysis, information technology, multi-axis NC surface machining, polyhedral machining, micro-machining and applied computational geometry.
Ashraf Yaseen: He received his B.S. degree in CS from Jordan University of Science and Technology in 2002 and his M.S. degree in CS from New York Institute of Technology in 2003. He is currently pursuing his Ph.D. degree in Computer Science at Old Dominion University. His research interests include Computational Biology and High Performance Computing.
Yaohang Li, Ph.D.: He is an Associate Professor in Computer Science at Old Dominion University. He received his B.S. in Computer Science from South China University of Technology in 1997 and his M.S. and Ph.D. degrees from the Department of Computer Science, Florida State University in 2000 and 2003, respectively. After graduation, he worked as a research associate in the Computer Science and Mathematics Division at Oak Ridge National Laboratory, TN. His research interests include Computational Biology, Monte Carlo Methods, and High Performance Computing.