OpenCL implementation of Particle Swarm Optimization: a comparison between multi-core CPU and GPU performances

Stefano Cagnoni^1, Alessandro Bacchini^1, and Luca Mussi^2

^1 Dept. of Information Engineering, University of Parma, Italy
[email protected], [email protected]
^2 Henesis s.r.l., Parma, Italy
[email protected]

Abstract. GPU-based parallel implementations of algorithms are usually compared against the corresponding sequential versions compiled for a single-core CPU machine, without taking advantage of the multi-core and SIMD capabilities of modern processors. This leads to unfair comparisons, in which speed-up figures are much larger than what could actually be obtained if the CPU-based version were properly parallelized and optimized. The availability of OpenCL, which compiles parallel code for both GPUs and multi-core CPUs, has made it much easier to compare the execution speed of different architectures while fully exploiting each architecture's best features. We tested our latest parallel implementations of Particle Swarm Optimization (PSO), compiled under OpenCL for both GPUs and multi-core CPUs and separately optimized for the two hardware architectures. Our results show that, for PSO, a GPU-based parallelization is still generally more efficient than a multi-core CPU-based one. However, the speed-up of the GPU-based version with respect to the CPU-based one is far lower than the orders-of-magnitude figures reported by papers which compare GPU-based parallel implementations to basic single-thread CPU code.

Keywords: Parallel computing, GPU computing, Particle Swarm Optimization

1 Introduction

Particle Swarm Optimization (PSO), the simple but powerful algorithm introduced by Kennedy and Eberhart [4], is intrinsically parallel, even more so than other evolutionary, swarm intelligence or, more generally, population-based optimization algorithms. Because of this, several parallel PSO implementations have been proposed, the latest of which are mainly based on GPUs [2, 3, 8, 9]. It is very hard to fairly compare the results of different implementations, on different architectures or compilers, or even of the same programs run on different releases of software-compatible hardware.

Execution time is generally the only direct, objective, quantitative parameter on which comparisons can be based; therefore, this is the approach most authors presently adopt. After all, obtaining a significant speed-up in execution time is obviously the main reason for developing parallel versions of the algorithm, considering also that PSO is intrinsically one of the most efficient stochastic search algorithms, owing to the simplicity and compactness of the equations on which it is based. Undoubtedly, the GPU-based parallelization of PSO has produced impressive results. Most papers on the topic report speed-ups of orders of magnitude with respect to single-core CPU-based versions, especially when large-size problems or large swarms are taken into consideration. This may lead to a misinterpretation of the results, suggesting that CPUs (from here onwards, the term CPU will refer to a multi-core CPU, unless differently specified) are overwhelmingly outperformed by GPUs on this task. However, most comparisons have actually been made between accurately-tuned GPU-based parallel versions and a sequential version compiled for a single-processor machine. Thus, while being absolutely objective and informative, they do not reflect the actual disparity between the top performances the two computational architectures can offer. In fact, they totally ignore the parallel computation capabilities of modern CPUs, both in terms of the number of cores embedded in the CPU architecture and of the CPU's SIMD (Single Instruction Multiple Data) capabilities. Papers comparing GPUs and CPUs "at their best" have started being published only recently [5]. This has been mainly justified by the lack of a handy environment for developing parallel programs on CPUs. As new environments and libraries for parallel computing supporting both GPUs and CPUs, such as OpenCL and Microsoft Accelerator, have been released, this gap has been filled. In particular, OpenCL allows one to develop parallel programs for both architectures, offering the chance either to compile the same code for execution on both CPUs and GPUs or, more importantly, to develop parallel implementations of the same algorithm, specifically optimized for either computing architecture.

We have previously presented two GPU implementations of PSO [6, 7], a synchronous and an asynchronous version, both developed on CUDA, nVIDIA's environment for GPU computing on its cards. Those implementations were aimed at maximizing execution speed, disregarding other limitations, such as the maximum number of particles of which a swarm could be composed. Thus, our best-performing GPU-based parallel implementation could only manage up to 64 particles, depending on hardware capabilities. In that work, we also made a comparison with the single-thread sequential implementation of a standard PSO (SPSO2006 [1]), mainly to allow for a comparison with other, previously published, results.

In this paper we try to make the fairest possible comparison between the computing performances obtainable by GPU-based and CPU-based parallelized versions of PSO, developed within the OpenCL environment. On the one hand, we have slightly modified our most efficient GPU-based PSO algorithm, allowing swarms of virtually any size to be run, at the price of a slight reduction in performance. On the other hand, we have also developed a parallel OpenCL version of PSO whose structure and parameters are optimized for running on a CPU. In this way, we have obtained two implementations with similar limitations, which can be used for a fair comparison between the actual performances obtainable by the two different architectures. We report results obtained on a set of classical functions used for testing stochastic optimization algorithms, on five different GPUs and CPUs which are presently offered in typical medium/high-end desktop and laptop configurations. In the following sections, we first describe our parallel algorithm and the slight differences between the CPU-oriented and GPU-oriented versions. We then report the results obtained in the extensive set of tests we performed to compare their performances. Finally, we close the paper with a discussion of the results and draw some conclusions about the efficiency and applicability of the two implementations.

2 PSO parallelization

The parallel versions of PSO developed in our previous work are quite similar, both being fine-grained parallelizations down to the dimension level, i.e., we allocate one thread per particle dimension when implementing the PSO update equations. Resources permitting, this means that it is virtually possible to perform a full update in a single step. The other common feature is the use of a ring topology of radius one for the particles' neighborhoods, which minimizes the data dependencies between the executions of the update cycles of the different particles. The main difference between the two implementations is related to the update of the particles' social attractors, i.e., the best-performing particle in each particle's neighborhood or in the whole swarm, depending on whether a “local-best” or a “global-best” PSO is being implemented. Our earlier version [6] implements the most natural way of parallelizing PSO, as regards both task separation and synchronization between particles: it was implemented as three separate kernels: i. position/velocity update; ii. fitness evaluation; iii. local/global best(s) computation. Each kernel also represented a synchronization point, so the implementation corresponded to the so-called “synchronous PSO”, in which the algorithm waits for all fitness evaluations to be over before computing the best particles in the swarm or in each particle's neighborhood. In a later version [7] we relaxed this synchronization constraint, obtaining an “asynchronous PSO” by letting a particle's position and velocity be updated independently of the computation of its neighbors' fitness values. This allowed us to implement the whole algorithm as a single kernel, minimizing the overhead related to both context switching and data exchange via global memory, as the whole process, from the first to the last generation, is scheduled as a single kernel call.

Therefore, while in the synchronous version the status of each particle needed to be saved and reloaded after each kernel, in the asynchronous version the parallelization occurs task-wise, with no data exchange in global memory before the whole process terminates. While this feature is optimal for speed, it introduces a severe drawback related to resource availability. Since no partial inter-generation results can be saved during a run to allow another batch of particles to be processed within the same generation, a swarm can only comprise up to Nmax particles, Nmax being the number of particles that the resources available on the GPU allow to be executed at the same time. To allow for a virtually unlimited swarm size, we have turned back to a synchronous version, which saves the partial results of each generation back to global memory, while still using a single kernel to run one generation of a swarm of up to Nmax particles. Therefore, if a swarm of N > Nmax particles is to be run, a full update of the swarm can be obtained by running a batch of ⌈N/Nmax⌉ instances of the kernel sequentially, as sketched below. The three stages of the previous synchronous version of PSO have been merged and the synchronization between stages removed: the only synchronization occurs at the end of a generation. This avoids some accesses to global memory, because each particle loads its state at the beginning of each generation and writes it back to memory only at the end of it. While this obviously has a price in terms of execution speed, it has the advantage of limiting the update delay between particles to no more than one generation.
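The following host-side sketch illustrates this batching scheme. It is only a minimal illustration under our stated assumptions: the kernel, buffer and variable names (pso_generation, state, n_max, etc.) are hypothetical rather than the actual code, and error checking is omitted.

#include <CL/cl.h>

/* Host-side generational loop (illustrative sketch). "pso_generation" is assumed
   to be the single kernel performing one full update of up to n_max particles,
   whose state already resides in the device buffer "state". */
static void run_pso(cl_command_queue queue, cl_kernel pso_generation, cl_mem state,
                    cl_uint n_particles, cl_uint n_max, size_t dim,
                    cl_uint n_generations)
{
    const size_t local_size = dim;               /* one work group per particle */
    for (cl_uint g = 0; g < n_generations; g++) {
        /* cover the whole swarm with ceil(N / n_max) sequential kernel instances */
        for (cl_uint first = 0; first < n_particles; first += n_max) {
            cl_uint batch = (n_particles - first < n_max) ? n_particles - first : n_max;
            size_t global_size = (size_t)batch * dim;    /* work items in this batch */
            clSetKernelArg(pso_generation, 0, sizeof(cl_mem), &state);
            clSetKernelArg(pso_generation, 1, sizeof(cl_uint), &first);
            clEnqueueNDRangeKernel(queue, pso_generation, 1, NULL,
                                   &global_size, &local_size, 0, NULL, NULL);
        }
        /* with an in-order queue, the end of the last batch acts as the
           per-generation synchronization point */
    }
    clFinish(queue);
}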

3 Open Computing Language

The evolution of parallel processors and of the corresponding programming environments has been extremely intense but, until recently, far from any standardization. The Open Computing Language (OpenCL™), released at the end of 2008, is the first open standard framework for cross-platform programming of heterogeneous resources. It makes it easy to write parallel code for modern processors and to obtain major improvements in speed and responsiveness for a potentially very wide range of applications, from gaming to scientific computing.

Fig. 1. OpenCL definition for the Platform Model (left) and the Memory Hierarchy (right)

On the one hand, to cope with modern computers, which often include one or more CPUs and one or more GPUs (and possibly also DSPs and other devices), the OpenCL Platform Model (see Figure 1) defines one 'host' controlling one or more 'compute devices'. Each 'compute device' is composed of one or more 'compute units', each of which is further divided into one or more 'processing elements'. Usually the 'host' and the 'compute devices' are mapped onto different processors, but they may also be mapped onto the same one; this is the case for the Intel OpenCL implementation, where host and device can be the very same CPU. On the other hand, the OpenCL Execution Model defines the structure of an OpenCL application, which runs on the host using classical sequential code to submit work to the available compute devices. The basic unit of work for an OpenCL device is called a 'work item' (something like a thread) and its code is called a 'kernel' (something similar to a plain C function). Work items are grouped into 'work groups': all the work items belonging to the same work group run effectively in parallel, while different work groups can be scheduled sequentially depending on the available resources. Each device also exposes a memory hierarchy: global memory is shared across all work groups, local memory is shared across all work items within a work group, and private memory is visible only to a single work item. Using local and private memory usually greatly improves performance for so-called memory-bound kernels. Finally, in addition to the memory hierarchy, OpenCL defines barriers to synchronize all the work items within a work group; work items belonging to different work groups cannot be synchronized while a kernel is running. The OpenCL standard also allows OpenCL code to be compiled at execution time, so the code can be optimized in place to execute the requested operations as fast as possible. Despite this, to achieve top performance on different kinds of devices it is still necessary to write slightly different versions of the same kernel: for example, a kernel must explicitly use vector types (float4, float8, ...) to fully exploit the SIMD instructions of modern CPUs.
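To make the Platform and Execution Models more concrete, the sketch below shows a minimal host-side setup: it selects the first available platform and GPU device, creates a context and a command queue, and builds a kernel from source at execution time. Names are illustrative and error checking is omitted; this is not the actual code of our implementation.

#include <CL/cl.h>

/* Minimal host-side setup sketch (illustrative names, no error checking). */
static cl_kernel setup_kernel(const char *source)
{
    cl_platform_id platform;
    cl_device_id   device;
    cl_int         err;

    clGetPlatformIDs(1, &platform, NULL);
    /* CL_DEVICE_TYPE_CPU would select a multi-core CPU instead of a GPU */
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    cl_context       ctx   = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);
    (void)queue;  /* the queue is later used to submit kernels and memory transfers */

    /* the OpenCL C source is compiled here, at execution time, for the chosen device */
    cl_program program = clCreateProgramWithSource(ctx, 1, &source, NULL, &err);
    clBuildProgram(program, 1, &device, "", NULL, NULL);
    return clCreateKernel(program, "pso_generation", &err);
}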

Fig. 2. Workflow organization for the GPU version (left) and the CPU version (right) of PSO.

4 PSO implementation within OpenCL

This new OpenCL implementation of PSO roughly inherits the structure of our previous versions developed for CUDA [7]. The three main steps of PSO are implemented as a single kernel scheduled many times in a row to simulate the generational loop. This organization introduces a fixed synchronization point for all particles at the end of every generation, realizing a synchronous PSO variant. Each particle is simulated by a work group comprising one work item per dimension of the problem to be optimized. At the beginning of the kernel, the particles' positions, velocities and best positions are read from global memory: this step severely limits the kernel performance, because accesses to global memory are the slowest operations for an OpenCL device. Then velocities and positions are updated; no bottleneck is present here, because only simple arithmetic operations are required. Fitness evaluation comes next and is usually the most complex stage of the kernel, because of the possible presence of transcendental functions and of the parallel reductions used to sum the addends computed by the different work items. The last stage updates the particles' best fitness values, positions, best positions and velocities and stores them into global memory: some further time is lost here, due to the high latency of global memory write operations. The OpenCL standard allows kernels to be compiled at execution time, so it is possible to define preprocessor constants that save some parameter passing. We use this feature to embed all PSO parameters into the kernel code, while still allowing the user to specify everything when launching the application. This technique is especially advantageous when dealing with parallel reductions: in our case, knowing in advance the number of problem dimensions (i.e., the number of work items inside each work group) allows us to minimize the number of synchronization barriers inside each work group and improve overall performance. Another way to reduce the execution time of the kernel is to use the native transcendental functions ('fast math' enabled), which are usually implemented in hardware and are an order of magnitude faster than the non-native ones. A further trick to obtain the best performance, as already mentioned, is to write the code with the actual target device in mind, in order to exploit its peculiar parallel instructions. To this end, we developed two versions of the same kernel, one for each of the two architectures under consideration.
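As an illustration of these two tricks, a hypothetical fragment could look as follows; the macro names, values and the small benchmark kernel are chosen by us only to show the mechanism, and do not reproduce the actual code.

/* Host side (fragment): PSO parameters and problem size are injected as
   preprocessor constants when the kernel is compiled at execution time. */
const char *build_options =
    "-DDIM=32 -DW=0.72134f -DC1=1.19315f -DC2=1.19315f -cl-fast-relaxed-math";
clBuildProgram(program, 1, &device, build_options, NULL, NULL);

/* Kernel side (separate .cl source): DIM is now a compile-time constant, so no
   parameter buffer has to be read, and native_cos() uses the hardware unit
   where available (here, one addend of the Rastrigin function per work item). */
__kernel void rastrigin_term(__global const float *x, __global float *term)
{
    const size_t d = get_global_id(0);        /* one work item per dimension */
    if (d < DIM) {
        float xi = x[d];
        term[d] = xi * xi - 10.0f * native_cos(2.0f * M_PI_F * xi) + 10.0f;
    }
}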

4.1 GPU-based implementation

This version is oriented to massively parallel architectures, like GPUs, for which each work item maps to a single thread. Indeed, nVidia GPUs have hundreds of simple ALUs that process only a single instruction at a time: in this case it is usually better to run thousands of light threads instead of hundreds of heavy ones. Moreover, global and local memory on GPU devices are located in distinct areas and have different performances: local memory is usually one order of magnitude faster than global memory.

The best practice is hence to use local memory as much as possible and to minimize the number of accesses to global memory. Finally, the OpenCL implementations by AMD and nVidia ensure that work items are scheduled in groups of at least 32 threads at a time, which, in some cases, makes it possible to avoid the use of synchronization barriers. For example, this is the case for the final steps of parallel reductions, as sketched below.
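The sketch below mirrors the well-known warp-synchronous reduction pattern and assumes a power-of-two work-group size of at least 64; the barrier-free tail relies on the vendor scheduling behaviour mentioned above (not guaranteed by the OpenCL standard itself), and the local buffer must be declared volatile so that the unsynchronized updates are not kept in registers. This is an illustration, not necessarily our exact code.

/* Work-group sum reduction in local memory (GPU-oriented sketch).
   The caller must have filled scratch[] and executed a barrier beforehand. */
void reduce_sum(volatile __local float *scratch, size_t lid, size_t lsz)
{
    /* standard tree reduction, one barrier per step */
    for (size_t stride = lsz / 2; stride > 32; stride /= 2) {
        if (lid < stride)
            scratch[lid] += scratch[lid + stride];
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    /* final steps: all active work items fit in one 32-wide scheduling unit,
       so no barriers are needed (warp-synchronous assumption) */
    if (lid < 32) {
        scratch[lid] += scratch[lid + 32];
        scratch[lid] += scratch[lid + 16];
        scratch[lid] += scratch[lid + 8];
        scratch[lid] += scratch[lid + 4];
        scratch[lid] += scratch[lid + 2];
        scratch[lid] += scratch[lid + 1];
    }
    /* the result of the sum is left in scratch[0] */
}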

4.2 CPU-based implementation

CPUs are architecturally different from GPUs, and the above-mentioned optimizations are not always the best option for this kind of device. CPUs have a very small number of cores compared to GPUs, but each core includes a large amount of cache and a complex processing unit: branch prediction, conditional instructions and misaligned memory accesses are usually more efficient on CPUs than on GPUs. The limited number of cores is also compensated for by SIMD (Single Instruction, Multiple Data) instructions, an extension to the standard x86 instruction set that allows each ALU to perform parallel operations over a set of 4 or 8 homogeneous values: SSE instructions operate in parallel on 4 floats/ints or 2 doubles, while the newer AVX instructions operate on 8 floats or 4 doubles. The OpenCL standard natively supports SIMD instructions via vector data types: int4, int8, float4, float8, etc. For these reasons, our version of the PSO kernel was rewritten for CPUs, grouping the work items four by four or eight by eight, so that each work item takes care of four or eight dimensions of the problem under optimization, as in the sketch below. It is easy to see why this yields a large improvement, considering that the OpenCL standard requires all the work items belonging to the same work group to be executed sequentially on the same compute unit (one core of the CPU).
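A minimal sketch of the vectorized update follows; names are hypothetical, W, C1 and C2 are assumed to be defined at build time as described above, and the random numbers are assumed to be pre-generated into buffers.

/* CPU-oriented fragment: each work item updates four consecutive dimensions of
   a particle, so the operations map directly onto 4-wide SSE instructions
   (a float8 variant would target AVX in the same way). */
__kernel void update_vec4(__global float4 *pos, __global float4 *vel,
                          __global const float4 *pbest,   /* personal best      */
                          __global const float4 *lbest,   /* neighborhood best  */
                          __global const float4 *rnd1,    /* uniform in [0,1]   */
                          __global const float4 *rnd2)
{
    const size_t i = get_global_id(0);   /* particle index * (DIM/4) + dimension group */
    float4 v = W * vel[i]
             + C1 * rnd1[i] * (pbest[i] - pos[i])
             + C2 * rnd2[i] * (lbest[i] - pos[i]);
    vel[i] = v;
    pos[i] += v;
}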

5 Test and results

We compared the performance of our parallel CPU-based and GPU-based PSO implementations on a set of “classical” functions used to evaluate stochastic optimization algorithms. Our goal was to fairly compare the performances of our GPU and CPU implementations in terms of speed. We also verified the correctness of our implementations (results not reported; see [7] for the tests on the previously developed algorithms) by checking that the minima found by the parallel algorithms did not differ significantly from the results of the sequential SPSO [1]. We kept all algorithm parameters equal in all tests, setting them to the standard values suggested for SPSO: w = 0.72134 and C1 = C2 = 1.19315. The tests were performed on different processors: two Intel CPUs (i7-2630M and i7-2600K), each having 4 physical and 8 logical cores; two nVidia GPUs (GT-540M and GTX-560Ti), having 96 and 384 cores, respectively; and an ATI Radeon HD6950 GPU (1408 cores). These can be considered typical examples of the processors and GPUs which presently equip medium/high-end laptop and desktop computers.
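For reference, these coefficients appear in the standard per-dimension PSO update equations, written below in the local-best form our kernels implement (the notation is ours):

\begin{aligned}
v_{id}(t+1) &= w\, v_{id}(t) + C_1 r_1 \bigl(p_{id} - x_{id}(t)\bigr) + C_2 r_2 \bigl(l_{id} - x_{id}(t)\bigr) \\
x_{id}(t+1) &= x_{id}(t) + v_{id}(t+1)
\end{aligned}

where $x_{id}$ and $v_{id}$ are the position and velocity of particle $i$ along dimension $d$, $p_i$ is its personal best position, $l_i$ is the best position found in its ring neighborhood, and $r_1, r_2$ are uniform random numbers in $[0,1]$, drawn independently for each dimension and update.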

Fig. 3. Execution times (in ms, log-log scale) for different fitness functions, problem dimensions and swarm sizes: each panel plots time versus number of particles (32 to 8192) for the Sphere, Rastrigin and Rosenbrock functions in 32, 64 and 128 dimensions, on the Nvidia GTX560Ti, Nvidia GT540M, ATI Radeon 6950, Intel i7-2600K and Intel i7-2630M.

Function     Search Range
Sphere       [−100, +100]^N
Elliptic     [−100, +100]^N
Rastrigin    [−5.12, +5.12]^N
Rosenbrock   [−30, +30]^N
Griewank     [−600, +600]^N

Table 1. Function set used for the tests and corresponding search ranges.

Table 1 lists the functions used as benchmarks, with the corresponding ranges within which the search was constrained. For each function, we tested the scaling properties with respect to problem dimension and swarm size, running 20 repetitions of each test. Figure 3 reports the average results obtained in all tests for three of the five functions we considered; similar results were obtained for the other two. The relative performances of the five processors are quite regular and repeatable, with the GPUs outperforming the CPUs by a speed gain ranging from about 1 to no more than 5-6 for the largest swarm sizes. For the smallest swarm sizes, depending on the resources available on each processor, and therefore more evidently for the most powerful processors, the curves start with a flat segment; a “knee” appears when the available resources are no longer sufficient to run the whole swarm in parallel, after which the graph (in log-log scale) becomes quite smoothly linear. Only some minor peculiarities can be observed. One is the apparently surprising performance of the nVidia GT540M (a GPU dedicated to mobile systems, with a limited amount of resources) with small swarm sizes: in some tests, for swarms of 32 particles, it even exhibited the best performance. This can be easily explained by considering that the processor clock frequency of the GT540M is slightly higher than the corresponding frequency of the GTX560Ti, so it can perform better than the latter when the function is not too memory-intensive (memory bandwidth is much lower in the GT540M). Another peculiarity is the relative performance of the Intel desktop processor with respect to the mobile version, which shows a narrowing of the performance gap between the two processors in “extreme” conditions (large swarms and high-dimensional functions). We have no explanation for this, except for the possible influence of other components of the PC, which may affect its overall performance.

6 Final remarks

We have assessed the performances of essentially the same version of PSO implemented on a set of five CPUs and GPUs, taking advantage of the opportunity, offered by OpenCL, to develop and optimize the code for different architectures within the same environment.

The main goal of our work was to compare GPU and CPU performances using PSO code which had been optimized for both computing architectures, going beyond the usual comparison between parallel GPU code and single-thread sequential CPU code, in which GPUs outperform CPUs by orders of magnitude. From this point of view, we showed that, while GPUs still outperform CPUs on this task, the performance gap is not as large as the usual “unfair” comparison tends to suggest: the speed gain between processors belonging to comparable market segments never gets even close to one order of magnitude. One should also consider that PSO, as well as the target functions used in the tests, is probably among the algorithms most suitable for parallelization on massively parallel architectures. From the practical point of view of development cost, the comparison has been fair in this regard too, since developing the parallel PSO from scratch using OpenCL has the same cost for both architectures. So, especially for larger optimization problems to be tackled by PSO, it makes little sense to use the CPU if a well-performing GPU is available, unless, of course, one needs to produce graphics at the same time. What this work suggests is that the range of problems in which the GPU clearly outperforms a CPU is possibly much smaller than a superficial interpretation of the results usually available on the topic might lead one to imagine.

References

1. http://www.particleswarm.info/Standard PSO 2006.c (2006)
2. Cadenas-Montes, M., Vega-Rodriguez, M., Rodriguez-Vazquez, J., Gomez-Iglesias, A.: Accelerating particle swarm algorithm with GPGPU. In: 19th Euromicro International Conference on Parallel, Distributed and Network-based Processing (PDP), pp. 560-564. IEEE (2011)
3. de P. Veronese, L., Krohling, R.: Swarm's flight: Accelerating the particles using C-CUDA. In: Proc. IEEE Congress on Evolutionary Computation (CEC'09), pp. 3264-3270. IEEE (2009)
4. Kennedy, J., Eberhart, R.: Particle swarm optimization. In: Proc. IEEE International Conference on Neural Networks, vol. IV, pp. 1942-1948. IEEE (1995)
5. Lee, V.W., Kim, C., Chhugani, J., Deisher, M., Kim, D., Nguyen, A.D., Satish, N., Smelyanskiy, M., Chennupaty, S., Hammarlund, P., Singhal, R., Dubey, P.: Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU. In: Proc. 37th International Symposium on Computer Architecture (ISCA), pp. 451-460. ACM (2010)
6. Mussi, L., Daolio, F., Cagnoni, S.: Evaluation of parallel particle swarm optimization algorithms within the CUDA architecture. Information Sciences 181(20), 4642-4657 (2011)
7. Mussi, L., Nashed, Y.S., Cagnoni, S.: GPU-based asynchronous Particle Swarm Optimization. In: Proc. 13th Annual Conference on Genetic and Evolutionary Computation (GECCO), pp. 1555-1562. ACM (2011)
8. Papadakis, S., Bakirtzis, A.: A GPU accelerated PSO with application to Economic Dispatch problem. In: 16th International Conference on Intelligent System Application to Power Systems (ISAP), pp. 1-6. IEEE (2011)
9. Zhou, Y., Tan, Y.: GPU-based parallel particle swarm optimization. In: Proc. IEEE Congress on Evolutionary Computation (CEC'09), pp. 1493-1500. IEEE (2009)