Strategies for Accelerating Ant Colony Optimization Algorithms on Graphical Processing Units

Alejandro Catala, Javier Jaen, Jose A. Mocholi
Department of Information Systems, Polytechnic University of Valencia, Cami de Vera S/N, 46022, Valencia, Spain

Abstract— Ant Colony Optimization (ACO) is being used to solve many combinatorial problems. However, existing implementations fail to solve large problem instances effectively. In this paper we propose two ACO implementations that use Graphical Processing Units to support the required computation. We also provide experimental results obtained by solving several instances of the well-known Orienteering Problem, emphasizing the properties that make these implementations highly competitive with parallel approaches.

I. INTRODUCTION

The Orienteering Problem (OP) is a typical problem that arises in routing and scheduling. It is a combinatorial optimization problem that can be solved with exact methods such as those presented in [1] and [2]. However, the OP belongs to the class of NP-hard problems, so solving medium-size instances with these methods is infeasible, in particular in the context of Hybrid Museums [3], [4], where we are interested in planning visits modeled as medium and large OP instances. Hence, the use of approximate methods, which trade optimality for efficiency, is inevitable for large instances. Several approximate methods for solving the OP are the S and D algorithms by Tsiligirides [5], the genetic approach by Tasgetiren [6], and several heuristics based on multiple optimization stages [7], [8].

Many metaheuristic frameworks have been developed in order to guide approximate methods towards high-quality solutions. Some of these metaheuristics are Genetic Algorithms, Simulated Annealing, and Tabu Search. The metaheuristic known as Ant Colony Optimization (ACO), which is based on the behavior of natural ant colonies, has been shown to be superior to other heuristic methods according to the results presented in [9].

Remarkable ACO algorithms are the Ant System [10] and the Ant Colony System [11], both regarded as reference works within the field. Other algorithms are the Rank-based Ant System [12], the Best-Worst Ant System [13], and the Max-Min Ant System [14].

Moreover, research on parallel ACO algorithms has been carried out with the aim of speeding up the process of obtaining solutions without degrading their quality. Parallel algorithms are commonly classified into fine-grained and coarse-grained approaches. Fine-grained approaches, also known as real parallel systems, are usually characterized by ant-level parallelism, using one computational node per ant. In contrast, coarse-grained approaches, also known as simulated parallel algorithms, are characterized by colony-level parallelism, where several ants share the computing power of a node. The work of Bolondi [15] is an example of a fine-grained parallelization where every processor holds only a single ant. However, this approach does not scale well because of the overhead associated with the required communication between ants. A representative of coarse-grained parallelism is the work of Bullnheimer [16], who proposes a parallelization where information exchange between colonies takes place every k generations of solutions. The larger k is, the less time it takes to obtain a solution; however, no discussion about quality degradation is provided in their work. Stützle [17] studies the quality of the obtained solutions when running several independent short runs as opposed to a single long run whose overall running time is the sum of the running times of the shorter trials; in some cases the short trials outperform the long ones. In [18] an island model is proposed: every computing node holds a colony of ants and exchanges its locally best solution after a fixed number of iterations. If a solution received from a neighbor colony is better than the current best solution of the receiving colony, the best solution of this colony is updated. Also, in [19] the authors investigate different types of information exchange strategies in multi-colony ant algorithms and show that exchanging only a small amount of information can be advantageous to obtain short running times. Finally, Talbi et al. [20] use a master-slave approach, where every slave holds a single ant that obtains one solution. Each slave then sends its solution to the master, which computes the new pheromone matrix and sends it back to the slaves.


There are two important issues that limit the performance of almost all the existing parallel algorithms reviewed here. Firstly, in both fine- and coarse-grained approaches, all ants or colonies work over the whole search space. This may restrict the number of nodes that can be considered in the construction graph because of memory space limitations. Secondly, ants and colonies must share information about pheromones, which has to be communicated. Parallel algorithms that communicate the whole pheromone matrix do not achieve good performance and, even if only the best solution is shared as described in [21], synchronization points between colonies must be established, which also has a negative impact on performance. Another algorithm that deserves special attention is GRID-ACO-OP [22]. This is a multicolony coarse-grained algorithm based on grid computation that brings together features from real and simulated parallel approaches by using a divide-and-conquer strategy, obtaining better quality solutions and scaling to larger instances. However, the major drawback of grids is that they need a significant amount of hardware resources, which may not always be available.

After all these previous efforts it may seem difficult to find new parallelization strategies that contribute to the acceleration of ACO algorithms. However, the exponential growth of computers containing powerful programmable graphics processing units (GPUs) opens new windows of research on parallel ACO algorithms. This type of device, traditionally used for graphics-specific computations, has been shown to be more powerful than general-purpose processors in terms of GFLOPS [23]. Besides, these devices are not exploited by ordinary computer applications, since users' applications usually do not need 3D graphics. These facts suggest that the GPU may be used as an effective coprocessor, as the General-Purpose computation on GPUs (GPGPU) community has proposed [24].

Therefore, as a refinement of previous ACO algorithms for the OP, and taking into account previous research efforts on parallel ACO algorithms, especially those based on ordinary low-cost hardware infrastructures, we present in this work two GPU-based algorithms for solving the OP through ACO. We define mathematically the required data structures and their GPU programming correspondences. We define a strategy to be applied on the GPU, which leads to our two implementations. The first one (GPU-Vertex-ACO-OP) is oriented to the computation of vertices, and the second one (GPU-Fragment-ACO-OP) is oriented to the computation of fragments. Finally, we study the performance and quality of both GPU-ACO-OP algorithms and compare them to the results obtained with the GRID-ACO-OP algorithm.

The organization of this paper is as follows: Section II describes the OP mathematically, presents the basic concepts of ACO, and introduces the algorithmic schema and data structures that have been used in this work. Section III introduces the operation of a programmable GPU and the main strategies used in GPGPU programming.

Section IV defines our ACO-OP algorithm with vertex-oriented processing and compares the obtained experimental results against a GRID-oriented implementation. Section V defines our ACO-OP algorithm with fragment-oriented processing and compares the two GPU implementations presented in this paper. Finally, conclusions and future work are discussed in Section VI.

II. THE ORIENTEERING PROBLEM AND ANT COLONY OPTIMIZATION

A. Definition of the Orienteering Problem
To model the Orienteering Problem we consider a simple directed graph G=(V,E), where V is a set of n control points (nodes) and E is the set of edges between points in V. Each node i is generally defined in a Euclidean space and has an associated score or benefit S_i ≥ 0. The distance c_ij is the nonnegative cost of traveling between nodes i and j, and the goal is to obtain a tour from the start node 1 to the end node n, without revisiting nodes, so that the total score collected from the visited nodes is maximized without violating the given distance constraint (the available budget T_max). Formally, the OP can be formulated as follows:

\max \sum_{i=1}^{n}\sum_{j=1}^{n} S_i \cdot x_{ij}    (1)

Subject to:

\sum_{j=2}^{n} x_{1j} = \sum_{i=1}^{n-1} x_{in} = 1    (2)

\sum_{j=2}^{n} x_{j1} = \sum_{i=1}^{n-1} x_{ni} = 0    (3)

\sum_{i=1}^{n-1} x_{ik} = \sum_{j=2}^{n} x_{kj} \le 1, \quad k = 2,\dots,n-1    (4)

\sum_{i=1}^{n}\sum_{j=1}^{n} c_{ij} \cdot x_{ij} \le T_{max}    (5)

x_{ij} \in \{0,1\}, \quad i,j = 1,\dots,n    (6)

where x_{ij} = 1 if a visit to node i is directly followed by a visit to node j in the tour, and x_{ij} = 0 otherwise.
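To make the formulation concrete, the following sketch (in Python; the function and variable names are ours and are not part of the algorithms presented later) evaluates one candidate tour against the score objective (1) and the budget constraint (5).

  # Illustrative sketch (not from the paper): scoring one candidate OP tour.
  # Assumed inputs: scores[i] holds S_i, cost[i][j] holds c_ij, and tour is a
  # node sequence (0-based here) running from the start node to the end node.
  import math

  def evaluate_tour(tour, scores, cost, t_max):
      """Return (total_score, total_cost, feasible) for one candidate tour."""
      total_cost = sum(cost[u][v] for u, v in zip(tour, tour[1:]))
      total_score = sum(scores[v] for v in tour)
      feasible = (
          total_cost <= t_max                # budget constraint (5)
          and len(set(tour)) == len(tour)    # no node is revisited
      )
      return total_score, total_cost, feasible

  # Tiny usage example: 4 nodes on a line, Euclidean costs.
  coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
  scores = [0.0, 5.0, 7.0, 0.0]
  cost = [[math.dist(a, b) for b in coords] for a in coords]
  print(evaluate_tour([0, 2, 1, 3], scores, cost, t_max=6.0))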

B. Ant Colony Optimization
Ant colonies are insect societies that accomplish complex tasks through highly structured organization and communication mechanisms. Ants' behavior is based on the use of pheromones and, among them, the trail pheromone is particularly interesting because it is used to mark paths on the ground from food sources to the nest. Several experiments [25], [26] have shown that this communication mechanism is very effective in finding the shortest paths and, as a result, it has inspired several stochastic models that describe the dynamics of these colonies.

Our starting point is the ant colony approach for the OP known as ACO-OP, proposed by Liang [9].



In this approach, each ant belonging to the colony constructs solutions iteratively, choosing a node in each iteration or step according to a predefined state transition rule. The algorithmic schema that we have adopted is:

  Set parameters
  Initialize pheromone trails
  Do
    Do
      Build solutions based on the state transition rule
    Until every ant has constructed its path
    Apply local search (optional)
    Evaluate solutions and record the best solution so far
    Apply the offline pheromone update rule
  Until the stopping criterion is reached

It is important to note that the step Build solutions based on the state transition rule (the boxed block in this algorithmic schema) comprises the actions that we have chosen to express as GPU programs in our proposed algorithms. We will see in Sections IV and V that this step is decomposed into two steps in our approach: selecting by "projection", which selects the nodes to be visited, and updating the list of selected nodes, which marks the selected nodes. The remaining actions are still performed on the CPU.

The state transition rule is governed by a random number q, generated from a uniform distribution U[0,1], and by the parameter q0 ∈ [0,1], which determines the importance of exploitation (7) versus exploration (8) and how the vector of probabilities p^k is calculated. If q ≤ q0,

p_{uv}^{k} = \begin{cases} 1, & \text{if } v = \arg\max_{w \in A_u^k} \{[\tau_{uw}]^{\alpha} \cdot [\eta_{uw}]^{\beta}\} \\ 0, & \text{otherwise} \end{cases}    (7)

else

p_{uv}^{k} = \begin{cases} \dfrac{[\tau_{uv}]^{\alpha} \cdot [\eta_{uv}]^{\beta}}{\sum_{w \in A_u^k} [\tau_{uw}]^{\alpha} \cdot [\eta_{uw}]^{\beta}}, & \text{if } v \in A_u^k \\ 0, & \text{otherwise} \end{cases}    (8)

where η_uv = S_v / c_uv is local heuristic information specific to the OP, τ_uv are the pheromone trails, α and β are parameters that adjust the importance of the pheromone trails versus the local heuristic information, and A_u^k is the set of nodes not yet visited by the ant k standing at node u. Thus, each ant can choose the next node to be visited in its tour according to the vector of probabilities returned by the state transition rule. Finally, there is an offline pheromone update rule, expressed as (9), which controls how the pheromone trails must be modified after finding a better solution than the current one. This rule is governed by the parameter ρ, which controls the pheromone persistence:

\tau_{ij}^{new} \leftarrow \rho \cdot \tau_{ij}^{old} + (1 - \rho) \cdot \Delta\tau_{ij}    (9)
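As an illustration of how (7)-(9) drive the construction step, the following CPU-side sketch (Python; sequential code with helper names of our own, not the GPU kernels defined later) selects the next node for a single ant and applies the offline pheromone update. Which edges receive the update of (9) is our assumption here (the edges of the best tour found so far).

  import random

  def next_node(u, unvisited, tau, eta, alpha, beta, q0, rng=random):
      """Pseudo-random proportional rule of (7)-(8) for one ant at node u.
      eta[u][v] is assumed to hold S_v / c_uv, as defined in the text."""
      if not unvisited:
          return None
      weights = {v: (tau[u][v] ** alpha) * (eta[u][v] ** beta) for v in unvisited}
      if rng.random() <= q0:                      # exploitation, eq. (7)
          return max(weights, key=weights.get)
      total = sum(weights.values())               # exploration, eq. (8)
      r, acc = rng.random() * total, 0.0
      for v, w in weights.items():
          acc += w
          if acc >= r:
              return v
      return v                                    # guard against rounding error

  def offline_update(tau, best_tour, rho, delta):
      """Offline pheromone update of eq. (9), applied along the best tour."""
      for u, v in zip(best_tour, best_tour[1:]):
          tau[u][v] = rho * tau[u][v] + (1.0 - rho) * delta[u][v]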

C. Structures supporting ACO for OP
In order to facilitate the definition of our algorithms later, we introduce some mathematical structures supporting information management. In our context, the underlying information structure for the OP can be defined as a 6-tuple G_OP = (V, E, ψ, w, I, A), where:
- V is the set of nodes; the elements of V are natural numbers representing node identifiers, ranging from 1 to n, where n = |V|.
- E is the set of edges connecting nodes.
- ψ: E → V×V is the incidence function.
- I: V → ℝ⁴ returns additional information for each node, such as the spatial coordinates, the score and the time spent visiting the node.
- w: E → ℝ returns the weight or cost of moving between nodes. This is computed as the Euclidean distance between them.
- A: E → ℝ returns the attractiveness value between any pair of nodes. This function is a binary relation between nodes weighing their mutual attraction.

In addition, we define M as the set of ants belonging to the colony and m as the number of ants in M, m = |M|.

III. INTRODUCTION TO THE GPU

A. Operation of the GPU
Graphics Processing Units are mainly designed to accelerate the transformation of 3D geometric primitives into pixels on screen. The processing is organized as a set of pipelined stages. From a high-level point of view, we can consider the following stages [27]: vertex transformation; primitive assembly; rasterization; interpolation, texturing and coloring; and raster operations. Vertices are graphic entities that usually carry information about their position and, optionally, color and texture coordinates, among others. The vertices which compose the geometric primitives (points, lines, triangles) flow through these stages. In the vertex transformation stage, the vertices are transformed according to the transformation matrices and the camera setup so that they are located accordingly in space. The transformed vertices are then assembled into the specified primitives for rasterization. The rasterization process decomposes the primitives into fragments, which are graphics entities considered as potential pixels. Next, the fragment parameters are interpolated, and a sequence of texturing and math operations is performed to determine the final color of each fragment. Textures can be understood as matrices or arrays of texels, where a texel is a set of channels that stores color information in a specific format (typically RGBA32, a 4-channel red-green-blue-alpha format with 32-bit floating-point numbers).


Finally, raster operations are applied (depth test, alpha test and stencil test, among others) and the image is rendered, with the definitive fragments updating their pixels. From a logical point of view, current programmable GPUs have two processors: the vertex processor and the fragment processor. The former makes the vertex transformation stage programmable, and the latter makes the texturing and coloring of fragments programmable. The remaining stages are not programmable and keep the same functionality that the fixed-function graphics pipeline provides for non-programmable GPUs. The power of the GPU resides in its capability to exploit the data parallelism of vertices and fragments, in the vertex pipelines of the vertex processor and the fragment pipelines of the fragment processor respectively, because these entities are independent of each other.

B. General-Purpose computation on GPUs
With the latest advances in GPU programmability, more general computations that are not specific to graphics rendering can be performed. Consequently, the philosophy of performing general-purpose computation on GPUs is becoming more and more popular. The GPGPU community bets on using the GPU as a coprocessor, which allows the workload to be balanced among all the processing units in a computer. As a result of the intense research efforts of this community, a wide diversity of problems has now been addressed by GPU-based algorithms, mainly in the field of scientific computing. Moreover, several programming interfaces have been proposed, such as BrookGPU [28] and Sh [29].

Designing algorithms based on GPGPU computing is not a trivial task because adequate correspondences between concepts from the conventional computational problem space and the graphics computational space must be defined. The basic strategies to obtain such correspondences can be summarized as follows [30] (a CPU-side sketch of this mapping is given below):
- Data with the same structure and requiring similar computation are grouped into collections called streams.
- Computations to be applied to each element of a stream are considered as functions called kernels.
- Streams are mapped to textures and sometimes to vertex buffers.
- Kernels are mapped to vertex and fragment programs to be executed by the vertex and fragment processors.
- Chaining kernels, which consume and produce streams, constitutes the steps of the algorithm for solving a specific problem.
- Feedback of outputs consists of render-to-texture processes.
- Usually kernels are concentrated in fragment programs and, consequently, the streams are exclusively mapped to textures, requiring an orthographic projection together with a viewpoint in order to get a pixel-to-texel mapping.

Although algorithms on GPUs are usually designed with streams mapped exclusively to textures that are processed by kernels implemented as fragment programs, we think, instead, that mapping streams to vertex buffers and using massive vertex computation on the vertex processor is also a possibility to be considered. Our algorithms try to balance the use of both processors on the GPU.
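Purely as a mental model (a CPU-side analogy written in Python/NumPy, not GPU code), the stream/kernel correspondence above can be pictured as arrays processed element-wise by functions, with the output of one pass fed as the input of the next, mimicking render-to-texture feedback. The array sizes and parameter values below are arbitrary.

  import numpy as np

  # A "stream" is just an array, standing in for a texture of texels.
  attractiveness = np.random.rand(8, 8).astype(np.float32)   # an n x n "texture"
  pheromone = np.full((8, 8), 0.5, dtype=np.float32)

  # A "kernel" is a function applied independently to every element.
  def combine_kernel(tau, eta, alpha=1.0, beta=3.0):
      return (tau ** alpha) * (eta ** beta)

  # Chaining kernels: the output "texture" of one pass feeds the next pass,
  # just as a render-to-texture result is bound as input of the next rendering.
  desirability = combine_kernel(pheromone, attractiveness)               # pass 1
  normalized = desirability / desirability.sum(axis=1, keepdims=True)   # pass 2
  print(normalized.shape)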

IV. ALGORITHM GPU-VERTEX-ACO-OP

Our first proposed algorithm, known as GPU-Vertex-ACO-OP, obtains solutions to the OP by intensively processing vertices on the GPU, benefiting from the parallel nature of the vertex processor. Consequently, we have to adequately define the required streams, their implementation in terms of buffers and textures, and finally the kernels that will process them.

A. GPU-Vertex-ACO-OP Streams
The mathematical structures presented in Section II.C are encoded in several streams: the Graph, the Attractiveness, the Tour, the Selection Node Set, and the Forbidden streams. The Graph stream is implemented as a texture of size n×1 where each texel represents a node of the graph and encodes the information related to the function I (see Fig. 1).

Fig. 1. Detail of a texel of the Graph stream.

The Attractiveness stream encodes the function A and is implemented as a texture of size n×n where each texel stores the attractiveness between a pair of nodes in V. The Tour stream is the main output stream where the tours obtained by the ants are stored (see Fig. 2). It is implemented as a texture of size m×n, where texel (k,i) encodes the i-th visited node by ant k, among other information required to incrementally build the tours. This information is:
- The ordinal number representing the i-th visited node.
- A seed for a random number generator.
- The score associated to the tour partially built until the i-th iteration, and a flag that indicates whether the ant k has already found a tour or not.
- The weight associated to the tour partially built until the i-th iteration.
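On the CPU side, the layout of these storage streams can be pictured as the arrays below (a Python/NumPy sketch with channel assignments assumed by us; the exact packing of the color channels is only partially specified in the text).

  import numpy as np

  n, m = 1000, 32   # number of nodes and of ants, chosen as an example

  # Graph stream: an n x 1 RGBA texture; one texel per node holding the four
  # values returned by I (e.g., x, y, score, visiting time).
  graph = np.zeros((n, 1, 4), dtype=np.float32)

  # Attractiveness stream: an n x n texture with A(u, v) in each texel.
  attractiveness = np.zeros((n, n), dtype=np.float32)

  # Tour stream: an m x n texture; texel (k, i) stores the i-th node visited by
  # ant k plus auxiliary data (RNG seed, partial score and flag, partial weight).
  tour = np.zeros((m, n, 4), dtype=np.float32)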

Fig. 2. Structure of the texture for the Tour Stream.



The Forbidden stream is an auxiliary texture of size m×n where the texel (k,v) is used as a flag that marks whether the ant k has visited the node v in its tour (i.e., marking as forbidden the nodes already visited). Finally, the Selection Node Set is defined as:

S = \{(k, v) \in \mathbb{N}^2 : k \in M \wedge v \in V\}    (10)

The cardinality of S is |S| = m·n, and (k,v) ∈ S represents the possibility or intention of ant k to visit node v. Although the previous streams have been mapped to textures whose texels contain the information encoded in their color channels, the Selection Node Set is mapped to a vertex buffer where each pair (k,v) is encoded using the position coordinates of a 3D vertex. This buffer and the previous textures are processed in parallel within the vertex processor by means of the kernels described next.

B. Kernels for Building solutions based on the state transition rule

1) Selection by projection
We use an orthographic camera which defines a clipping space of dimension m×n×A_max (see Fig. 3). The clipping space determines what is seen by the camera, so any graphical entity placed outside the boundaries of the clipping space is not rendered. To explain how ants select the next node to be visited, we define a strategy for positioning the vertices of the Selection Node Set within the clipping space. It is important to note that this strategy takes place in parallel for every ant, so that we take advantage of the processing power of advanced GPUs.

Let us consider S_k = {(k,v) : v ∈ V} ⊆ S, the subset of elements in the Selection Node Set associated to a given ant k (|S_k| = n). Let us define the set of transformed vertices at the i-th iteration as

S_k^i = \{(i, k, z, v) \in \mathbb{N} \times \mathbb{N} \times \mathbb{R} \times \mathbb{N} : z \in [0, \infty) \wedge v \in V\}

where the i, k, and z components are encoded as vertex coordinates (x, y, and z respectively) and v is encoded as the color of the vertex. We define a kernel (in this case a vertex program) over S_k as a function T_i: S_k → S_k^i that obtains the transformed vertices for the i-th iteration as follows: T_i((k,v)) = (i,k,z,v), where z is set according to (11).

z_{uv}^{k} = \begin{cases} A_{max} - [\tau_{uv}]^{\alpha} \cdot [\eta_{uv}]^{\beta}, & \text{in case of exploitation} \\ \left| r^{k} - [\tau_{uv}]^{\alpha} \cdot [\eta_{uv}]^{\beta} \right|, & \text{in case of exploration} \\ \infty, & \text{if the node } v \text{ is already visited} \end{cases}    (11)

Equation (11) is a simplified low-cost version of the selection function proposed by Liang (see (7) and (8)). In our simplification, u is the latest node visited by ant k at the (i-1)-th iteration, [τ_uv]^α·[η_uv]^β is the attractiveness between nodes u and v, and A_max is its maximum value. Note that A_max is needed to be able to obtain the most attractive node as a minimization problem. The value of r^k is distributed according to a uniform probability distribution U(A_min, A_max), where A_min and A_max are the minimum and maximum attractiveness respectively. In case of exploration, each ant k computes its own value of r^k, and the selection is based on the minimum distance between r^k and the attractiveness of the unvisited nodes.

By locating the vertices of S_k^i in space and adjusting their z component, we implement a strategy based on graphical techniques to select the next node v_k^i in the tour, which is equivalent to v_k^i = arg_z min(S_k^i). Fig. 3 summarizes the graphical strategy for S_k^i. It is based on discarding vertices by using the clipping planes and the visibility properties from the camera viewpoint. If a vertex is related to an already visited node, its z value is set beyond the far plane in order to discard it. Otherwise, the z value is set by applying exploitation or exploration in such a way that the selected node is the one related to the vertex nearest to the viewpoint. This strategy, applied to each input vertex from S, that is, to S_k for every k ∈ M at iteration i, obtains the set of nodes to be visited in this iteration by each ant and, consequently, a full column is rendered in the Tour stream.

Fig. 3. Strategy for selecting nodes. For clarity in this representation, the state transition rule is only applied to the transformed vertices S_k^i.
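The net effect of the selection-by-projection kernel can be emulated on the CPU as in the sketch below (Python; our own names, with sequential loops standing in for the parallel vertex program and the clipping/visibility test): every unvisited node receives a z value according to (11), visited nodes are pushed beyond the far plane, and the selected node is simply the one with minimum z.

  import random

  def select_next_node(k, u, visited, tau, eta, A_min, A_max,
                       alpha, beta, q0, far=float("inf"), rng=random):
      """Emulates selection by projection (eq. (11)) for one ant k at node u."""
      explore = rng.random() > q0
      r_k = rng.uniform(A_min, A_max) if explore else None
      best_v, best_z = None, far
      for v in range(len(visited[k])):
          if visited[k][v]:
              z = far                                    # beyond the far plane
          else:
              attr = (tau[u][v] ** alpha) * (eta[u][v] ** beta)
              z = abs(r_k - attr) if explore else A_max - attr   # eq. (11)
          if z < best_z:                                 # nearest vertex wins
              best_v, best_z = v, z
      return best_v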

2) Updating the list of selected nodes
Once the next nodes to be visited have been selected for each ant, they have to be marked as unavailable for the upcoming iterations. For this purpose, an orthographic camera is again defined and the same vertex buffer is used. If ant k has visited the node v in the previous step, then its corresponding vertex is made visible from the camera viewpoint, so that its projection sets the corresponding flag in the Forbidden stream to true. The remaining vertices are discarded by placing them outside of the clipping space.
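On the CPU side, this marking step amounts to setting one flag per (ant, node) pair; the sketch below (Python, hypothetical names) reproduces the effect that the projection into the Forbidden stream achieves.

  def mark_visited(forbidden, selected):
      """forbidden[k][v] plays the role of the Forbidden stream texel (k, v)."""
      for k, v in enumerate(selected):     # selected[k] = node chosen by ant k
          forbidden[k][v] = True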


C. Experimental Results: GPU-Vertex-ACO-OP vs. GRID-ACO-OP
We ran experiments to evaluate the performance (execution time) and the quality of solutions (score) of GPU-Vertex-ACO-OP versus GRID-ACO-OP [22]. The computation in GRID-ACO-OP is distributed in a GRID with N computing nodes, N being the number of ant colonies. This multicolony grid infrastructure makes it possible to obtain spectacularly low execution times while maintaining the quality of the solutions as the number of participating PCs in the Grid is increased. Its effectiveness resides in a clustering algorithm which splits the problem into smaller subproblems, and in an algorithm that recombines the solutions of these subproblems to return the definitive solution of the original problem. However, we must keep in mind that GRID-ACO-OP requires a lot of hardware resources to be effective (several computing nodes); otherwise, the execution time for medium-size problems is prohibitive. Therefore, our motivation here is to take advantage of specific graphics hardware to accelerate the computations, trying to improve the execution time by using only one PC.

We used up to 32 Intel Pentium IV @2.40GHz PCs with 256 MB RAM for GRID-ACO-OP versus only one PC with 512 MB RAM and a GeForce 6600 GT nVidia GPU for GPU-Vertex-ACO-OP. The experimental results in this paper were produced by averaging results from 20 trials, varying the number of colonies from 1 to 32 on GRID-ACO-OP, and the number of ants in a single colony also from 1 to 32 on GPU-Vertex-ACO-OP. Graphs had varying sizes (up to several thousand nodes) and were generated following the same principles described in [22]. The following parameters for the evolutionary algorithm were established: α = 1, β = 3, q0 = 0.2, ρ = 0.9.

With respect to the figures and discussions about performance, notice that when we are using GRID-ACO-OP we are talking about N colonies located in exactly N computing nodes. However, when we are using a GPU-based algorithm we are talking about m ants in a single colony located in one computing node. Keep this consideration in mind because we will use the term "ants-colonies" for both approaches, but with a dual meaning.

Fig. 4 shows that the quality of the solutions for GPU-Vertex-ACO-OP increases as the number of ants in the colony is increased. With few ants (1, 2 or 4), GPU-Vertex-ACO-OP obtains better solutions than GRID-ACO-OP. However, as the number of colonies is increased in GRID-ACO-OP, the quality of the obtained solutions improves by up to +6%. These results are consistent in all our trials with different graphs and show that our GPU-Vertex-ACO-OP implementation obtains solutions whose quality is comparable to the ones obtained by using GRID-ACO-OP, although GRID-ACO-OP is superior in this respect.

Fig. 4. Score comparison: GPU-Vertex-ACO-OP vs. GRID-ACO-OP, against a graph with 3000 nodes, varying the number of ants for the GPU and the number of colonies for the GRID.

However, if we consider the overall execution time, we can observe in Fig. 5 that GPU computing is competitive when few ants-colonies are involved (i.e., 1, 2, 4 and 8). This approach obtains in some trials computation times up to 80 times faster than GRID-ACO-OP running on one computing node. Besides, if the number of computing nodes is increased in GRID-ACO-OP, GPU-Vertex-ACO-OP remains competitive to an extent, despite being run on just one GPU. Obviously it behaves worse and is not recommendable if 16 or 32 computing nodes are available.

Fig. 5. Execution time comparison: GPU-Vertex-ACO-OP vs. GRID-ACO-OP, against a graph of 3000 nodes. Time expressed in milliseconds, varying the number of ants for the GPU and the number of colonies for the GRID.

V. ALGORITHM GPU-FRAGMENT-ACO-OP

This second algorithm, called GPU-Fragment-ACO-OP, is conceptually very similar to GPU-Vertex-ACO-OP, but it has been implemented in a different way to exploit the massive fragment processing power available on GPUs. The reason to propose a second algorithm is that we expect to improve the execution time by taking advantage of the greater number of fragment pipelines available in the GPU.

A. GPU-Fragment-ACO-OP Streams
The data structures that only support data storage, such as the Graph, the Attractiveness, the Forbidden and the Tour streams, are mapped in the same way.



However, the computation is fragment-oriented, and the Selection Node Set has to change to be able to massively process fragments instead of vertices. The Selection Node Set is a stream defined as:

S = \{(v, p) \in \mathbb{N}^2 : v \in V \wedge p \in \{0,1\}\}    (12)

where p is a flag that indicates whether the element is located at the "top" or at the "bottom" of an imaginary vertical segment. The cardinality of S is |S| = 2·n, and the subset S_v = {(v,0),(v,1)} ⊆ S forms a straight vertical line that represents the possibility for any ant to visit the node v (see Fig. 6). The elements of S are encoded by using the position coordinates of 3D vertices in a vertex buffer.

B. Kernels for Building solutions based on the state transition rule

1) Selection by projection
In this kernel we use an orthographic camera which defines a clipping space of size m×n×d, where d is a depth chosen arbitrarily. A vertex program transforms the vertices encoding S, locating them in space as shown in Fig. 6. The vertices with their p flag set to 0 ("top") are located at the top of the clipping space and the ones with the p flag set to 1 ("bottom") are located at the bottom. The offset in the x-axis is fixed by the i-th iteration being performed. The depth of these transformed vertices is restricted to be within the clipping space, and each pair of vertices related to the same node must have the same depth. Each pair of transformed vertices is assembled into a straight line covering the complete height of the clipping space. Consequently, once the rasterization process takes place in the GPU, m×n fragments are generated, i.e., m fragments for each assembled line related to S_v.

Fig. 6. Locating input vertices of S and assembling primitives. Each line will be broken up into m fragments after the rasterization process.

Formally, let us define these fragments by means of the Fragment Selection Set for the i-th iteration as:

FSS^i = \{(i, k, d, v) \in \mathbb{N} \times \mathbb{N} \times \mathbb{R} \times \mathbb{N} : d \in [0, \infty) \wedge v \in V \wedge k \in M\}    (13)

An element of FSS^i represents the intention of a given ant k to visit the node v. The components i and k are interpolated during the rasterization, fixing the position coordinates of the fragment. The component v is encoded into the colour component of the fragment together with additional information required to build the tour. The component d is encoded in the depth of the fragment and is established according to (14). This selection rule (14) is an equivalent form of the state transition rule described in (11), and it is performed by a fragment program. Again we are locating graphical entities, in this case fragments, which represent the intention of a specific ant to visit a specific node, allowing the selection of a node by each ant by using projections.

d_{uv}^{k} = \begin{cases} 1.0 - \dfrac{[\tau_{uv}]^{\alpha} \cdot [\eta_{uv}]^{\beta} - A_{min}}{A_{max} - A_{min}}, & \text{in case of exploitation} \\ \dfrac{\left| r^{k} - [\tau_{uv}]^{\alpha} \cdot [\eta_{uv}]^{\beta} \right|}{A_{max} - A_{min}}, & \text{in case of exploration} \\ \infty, & \text{if the node } v \text{ is already visited} \end{cases}    (14)

If we focus on the set of fragments related to a given ant k, FSS_k^i = {(i,k,d,v) : v ∈ V} ⊆ FSS^i, then we can observe that the corresponding generated fragments are located in the position associated to the i-th component of the tour for the ant k in the Tour stream. These fragments differ in depth and they are set as Fig. 7 shows. As a result, if the node v represented by a fragment has already been visited by the ant k, then the fragment is discarded. Otherwise, the depth of the fragment is set differently depending on whether exploration or exploitation is applied. The fragment in FSS_k^i with depth closest to 0 is selected and its associated node is the one marked as visited by the ant k (see Fig. 7). This strategy is based on the generation of sets of fragments located in the same position and on carrying out the selection by using the depth testing mechanism over the Z-buffer.

Fig. 7. Strategy for selecting nodes. For clarity, in the figure the state transition rule is only applied to FSS_k^i.
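Analogously, under our reading of (14), the fragment-oriented selection can be emulated on the CPU as below (Python; hypothetical names, with the depth test over the Z-buffer replaced by a running minimum).

  import random

  def fragment_depth(attr, r_k, A_min, A_max, visited, explore):
      """Depth assigned to one fragment according to eq. (14)."""
      if visited:
          return float("inf")          # fragment killed: node already visited
      if explore:
          return abs(r_k - attr) / (A_max - A_min)
      return 1.0 - (attr - A_min) / (A_max - A_min)

  def select_with_depth_test(k, u, visited, tau, eta, A_min, A_max,
                             alpha, beta, q0, rng=random):
      """Keeps, per ant, the fragment with the smallest depth (Z-buffer test)."""
      explore = rng.random() > q0
      r_k = rng.uniform(A_min, A_max)
      z_buffer, winner = float("inf"), None
      for v in range(len(visited[k])):
          attr = (tau[u][v] ** alpha) * (eta[u][v] ** beta)
          d = fragment_depth(attr, r_k, A_min, A_max, visited[k][v], explore)
          if d < z_buffer:             # depth test: closest to 0 passes
              z_buffer, winner = d, v
      return winner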

2) Updating the list of selected nodes
To carry out this step we again use an orthographic camera, but with a triangle whose legs are 2·m and 2·n in length in order to obtain m×n fragments (see Fig. 8). This is needed because the process of marking the visited nodes is also performed, in this version of our algorithm, by the fragment processor. Each one of these generated fragments is related to a specific node v and a specific ant k. If the ant k has visited the node v in the previous step, then the corresponding fragment is rendered into the Forbidden stream. The remaining fragments related to k are discarded and have no effect on the texture.

Fig. 8. Generating an m×n array of fragments to update the list of visited nodes.

C. Experimental Results: GPU-Vertex vs. GPU-Fragment
The experiments described next compare the effectiveness of both GPU-based algorithms when solving OP instances of different sizes. For a valid comparison, the same values for the evolutionary parameters and the same hardware configuration were used. Several problem instances like the ones described in [22] were solved, and we averaged the results from 20 trials. Both algorithms behaved similarly in terms of the quality of the obtained solutions. This was an expected result because the same state transition rule was used in both algorithms, albeit with different implementations.

If we compare the execution times of both algorithms (see Fig. 9 for a problem size of 1000 nodes) we observe that, if up to 4 ants are used, the time required by GPU-Fragment-ACO-OP is slightly greater. However, as the number of ants increases, the execution time of GPU-Vertex-ACO-OP becomes higher than that of GPU-Fragment-ACO-OP; in this particular case GPU-Fragment-ACO-OP is around 37.5% faster. This behaviour occurred systematically with all tried problem instances and can be explained by analyzing the theoretical asymptotic cost expressions associated with these algorithms.

Fig. 9. Execution time comparison: GPU-Vertex-ACO-OP vs. GPU-Fragment-ACO-OP, against a graph of 1000 nodes. Time in milliseconds.

Assuming that the selection-by-projection step is the most expensive section of each generation for both algorithms, the costs in the vertex and fragment processors would be as computed in Table I (the workload on each processor and the overall cost are listed). Notice that the number of vertices in GPU-Fragment-ACO-OP depends only linearly on the number of nodes, O(2n), so the cost of processing them is linear too and remains constant even when we vary the number of ants in the colony. This is a reasonable simplification of the cost per generation, since the GPU kernels for this step perform almost all the computations. However, for GPU-Vertex-ACO-OP, both the number of vertices and the number of fragments increase as more ants are added to the colony, penalizing the overall cost.

TABLE I
THEORETICAL BEHAVIOR OF GPU-VERTEX-ACO-OP AND GPU-FRAGMENT-ACO-OP WITH N=1000

         GPU-Vertex-ACO-OP                           GPU-Fragment-ACO-OP
  m   Vertex O(m·n)  Fragment O(m·n)   Total     Vertex O(2·n)  Fragment O(m·n)   Total
  1        1000           1000          2000          2000           1000          3000
  2        2000           2000          4000          2000           2000          4000
  4        4000           4000          8000          2000           4000          6000
  8        8000           8000         16000          2000           8000         10000
 16       16000          16000         32000          2000          16000         18000
 32       32000          32000         64000          2000          32000         34000

To verify that our theoretical analysis is correct, we have plotted these theoretical models (see Fig. 10). We observe that the experimental results fit the simplified theoretical model.

Fig. 10. Theoretical time (counting cost) comparison: GPU-Vertex-ACO-OP vs. GPU-Fragment-ACO-OP for a graph of 1000 nodes.

VI. CONCLUSION

In this paper, we have presented two implementations for solving the Orienteering Problem based on ACO by using a Graphics Processing Unit (GPU). We have performed experiments with large instances of several thousand nodes and varied the number of ants, comparing the score of the solutions and the computing time of the algorithms GPU-Vertex-ACO-OP, GPU-Fragment-ACO-OP, and GRID-ACO-OP. We have observed that the GPU algorithms obtain excellent results versus GRID-ACO-OP when few computing nodes are involved in the grid infrastructure.



We have also observed that the scores obtained by both GPU algorithms are very similar. This is an expected result, because both algorithms are based on the same state transition rule but implemented in different ways: GPU-Vertex-ACO-OP orients its computation to the massive processing of vertices, whereas GPU-Fragment-ACO-OP is oriented to the massive processing of fragments. With respect to computation time, GPU-Fragment-ACO-OP is slightly slower than GPU-Vertex-ACO-OP with few ants, but it becomes around 30-45% faster when the number of ants increases to 16 or 32. This performance improvement is explained by its better design, as we can infer from our brief study of the theoretical cost.

Our most immediate future work is to integrate the GPU algorithms into GRID-ACO-OP in order to take advantage of the good properties of both approaches. On the one hand, the GPU computes solutions very fast for reasonably large instances involving colonies of up to 32 ants and, on the other hand, the GRID infrastructure is able to solve large instances and improve the score of the obtained solutions when several colonies are involved in the computation. Therefore, a combination of both approaches, a GRID of GPUs, would be able to solve very large instances of the problem, obtaining high-quality solutions with short execution times.

ACKNOWLEDGMENT

We thank the Microsoft Research Labs (Cambridge) for supporting this work under the grant "Microsoft Research Excellence Awards for Embedded Systems". We also thank the Spanish Ministry of Education and Science for supporting A. Catala with an FPU fellowship with reference AP2006-00181.

REFERENCES

[1] G. Laporte, S. Martello. "The Selective Traveling Salesman Problem," Discrete Applied Mathematics, vol. 26, pp. 193-207, 1990.
[2] M. Fischetti, J. J. S. Gonzalez, P. Toth. "Solving the Orienteering Problem through Branch-and-Cut," INFORMS Journal on Computing, vol. 10, no. 2, pp. 133-148, 1998.
[3] J. Jaén, J. H. Canós. "A Grid Architecture for Building Hybrid Museums," Human Society @Internet, pp. 312-322, 2003.
[4] J. Jaén, V. Bosch, J. M. Esteve, J. A. Mocholí. "MoMo: A Hybrid Museum Infrastructure," Museums and the Web, 2005.
[5] T. Tsiligirides. "Heuristic Methods Applied to Orienteering," Journal of the Operational Research Society, vol. 35, no. 9, pp. 797-809, 1984.
[6] M. F. Tasgetiren, A. E. Smith. "A Genetic Algorithm for the Orienteering Problem," Congress on Evolutionary Computation, vol. 2, no. 2, pp. 910-915, 2000.
[7] I. M. Chao, B. L. Golden, E. A. Wasil. "A Fast and Effective Heuristic for the Orienteering Problem," European Journal of Operational Research, vol. 88, no. 3, pp. 475-489, 1996.
[8] B. L. Golden, L. Levy, R. Vohra. "The Orienteering Problem," Naval Research Logistics, vol. 34, no. 3, pp. 307-318, 1987.
[9] Y.-C. Liang, A. Smith. "An Ant Colony Approach to the Orienteering Problem," Journal of the Chinese Institute of Industrial Engineers, vol. 23, no. 5, pp. 403-414, 2006.

[10] M. Dorigo, V. Maniezzo, A. Colorni. "The Ant System: Optimization by a Colony of Cooperating Agents," IEEE Transactions on Systems, Man, and Cybernetics-Part B: Cybernetics, vol. 26, no. 1, pp. 29-41, 1996.
[11] M. Dorigo, L. M. Gambardella. "Ant Colony System: A Cooperative Learning Approach to the Traveling Salesman Problem," IEEE Transactions on Evolutionary Computation, vol. 1, no. 1, pp. 53-66, 1997.
[12] B. Bullnheimer, R. F. Hartl, C. Strauss. "A New Rank-Based Version of the Ant System: A Computational Study," Central European Journal for Operations Research and Economics, vol. 7, no. 1, pp. 25-38, 1999.
[13] O. Cordón, I. Fernandez, F. Herrera, Ll. Moreno. "A New ACO Model Integrating Evolutionary Computation Concepts: The Best-Worst Ant System," Proc. of ANTS'2000, From Ant Colonies to Artificial Ants: Second International Workshop on Ant Algorithms, Brussels, Belgium, September 7-9, pp. 22-29, 2000.
[14] T. Stützle, H. H. Hoos. "MAX-MIN Ant System," Future Generation Computer Systems, vol. 16, pp. 889-914, 2000.
[15] M. Bolondi, M. Bondaza. "Parallelizzazione di un algoritmo per la risoluzione del problema del commesso viaggiatore," Master's thesis, Dipartimento di Elettronica e Informazione, Politecnico di Milano, 1993.
[16] B. Bullnheimer, G. Kotsis, C. Strauss. "Parallelization Strategies for the Ant System," in R. De Leone, A. Murli, P. Pardalos, G. Toraldo (Eds.), High Performance Algorithms and Software in Nonlinear Optimization, Applied Optimization series, vol. 24, pp. 87-100, 1997.
[17] T. Stützle. "Parallelization Strategies for Ant Colony Optimization," in A. E. Eiben, T. Bäck, M. Schonauer, H.-P. Schwefel (Eds.), Parallel Problem Solving from Nature, Springer-Verlag LNCS, vol. 1498, pp. 722-731, 1998.
[18] R. Michels, M. Middendorf. "An Ant System for the Shortest Common Supersequence Problem," in D. Corne, M. Dorigo, F. Glover (Eds.), New Ideas in Optimization, McGraw-Hill, 1999.
[19] M. Middendorf, F. Reischle, H. Schmeck. "Information Exchange in Multi-Colony Ant Algorithms," Proceedings of the International Parallel and Distributed Processing Symposium, LNCS vol. 1800, pp. 645-652, 2000.
[20] E.-G. Talbi, O. Roux, C. Fonlupt, D. Robillard. "Parallel Ant Colonies for Combinatorial Optimization Problems," in J. Rolim et al. (Eds.), Parallel and Distributed Processing, 11 IPPS/SPDP'99 Workshops, LNCS 1586, Springer-Verlag, pp. 239-247, 1999.
[21] F. Krüger, M. Middendorf, D. Merkle. "Studies on a Parallel Ant System for the BSP Model," unpublished manuscript, 1998.
[22] J. A. Mocholí, J. Jaén, J. H. Canós. "A Grid Ant Colony Algorithm for the Orienteering Problem," Proceedings of the IEEE Conference on Evolutionary Computation (CEC'05), vol. 1, pp. 491-948, 2005.
[23] P. Hanrahan. "Stream Programming Environments," GP2 ACM Workshop on General Purpose Computing on Graphics Processors, August 7-8, 2004.
[24] GPGPU: General-Purpose Computation Using Graphics Hardware. Web site: www.gpgpu.org
[25] J.-L. Deneubourg, S. Aron, S. Goss, J.-M. Pasteels. "The Self-Organizing Exploratory Pattern of the Argentine Ant," Journal of Insect Behavior, vol. 3, pp. 159-168, 1990.
[26] S. Goss, S. Aron, J.-L. Deneubourg, J.-M. Pasteels. "Self-Organized Shortcuts in the Argentine Ant," Naturwissenschaften, vol. 76, no. 12, pp. 579-581, 1989.
[27] R. Fernando, M. J. Kilgard. "The Cg Tutorial," nVidia, Addison-Wesley, 2003.
[28] "BrookGPU," project at the Stanford University Graphics Lab. http://graphics.stanford.edu/projects/brookgpu/index.html. Last visited on Dec 13th, 2006.
[29] "Sh: A High-Level Metaprogramming Language for Modern GPUs," Web site: http://libsh.org/index.html
[30] M. Harris. "Mapping Computational Concepts to the GPU," Courses in GPGPU: General-Purpose Computation on Graphics Hardware, Conference SIGGRAPH 2004.
