FPGA Implementation of Tabu Search for the Quadratic Assignment Problem
Shin’ichi Wakabayashi, Yoshihiro Kimura, Shinobu Nagayama
Faculty of Information Sciences, Hiroshima City University
3-4-1, Ozuka-higashi, Asaminami-ku, Hiroshima 731-3194, Japan
Abstract— In this paper, we propose an FPGA implementation of tabu search to solve the quadratic assignment problem in a short execution time. In the proposed hardware implementation of tabu search, multiple neighbor solutions are evaluated in parallel and each solution is evaluated in a pipeline fashion. The proposed method effectively utilizes internal block RAMs of recent large scale FPGAs. Experimental results show the efficiency and effectiveness of the proposed method.
0-7803-9729-0/06/$20.00 ©2006 IEEE

I. INTRODUCTION

The quadratic assignment problem (QAP), an NP-hard combinatorial optimization problem, is known to be difficult to solve optimally with ordinary optimization methods such as mathematical programming [1], [3], [6]. Tabu search is one of the heuristics used to solve the QAP [4], [5], and is known as a robust search method for NP-hard combinatorial problems [2]. For the well-known benchmark set QAPLIB [7], most benchmark problems have been solved with tabu search. However, tabu search has one drawback: as the problem instance grows, it can hardly find a near-optimal solution in a practical execution time, because the search space becomes extremely large. To resolve this problem, Taillard proposed two parallelization methods for tabu search [5] and showed experimentally that a number of processors proportional to the size of the problem can be used efficiently. However, since Taillard’s parallel implementation ran as software on a multiprocessor with 10 CPUs, the speedup ratio was limited to less than 10.

In this paper, we propose an FPGA implementation of tabu search to solve the QAP in a shorter execution time. The tabu search we adopt is based on the heuristic method proposed by Taillard [5], which was devised to be implemented as a parallel program running on a multiprocessor system. In the proposed hardware implementation of tabu search, multiple neighbor solutions are evaluated in parallel, and each solution is evaluated in a pipeline fashion. For a problem instance of size n, the proposed method can achieve a speedup ratio of n compared with an ordinary sequential program.

This paper is organized as follows. Section II gives the definition of the quadratic assignment problem, Section III proposes a hardware implementation of tabu search for the QAP, Section IV shows some experimental results, and Section V concludes the paper.

II. PRELIMINARIES

A. The Quadratic Assignment Problem (QAP)

Given a set of n units and n locations, the quadratic assignment problem (QAP) is the problem of finding an assignment of the n units to the n locations that minimizes the following objective function:

$$F(\pi) = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} a_{ij}\, b_{\pi(i)\pi(j)} \tag{1}$$
where π is a permutation of n units, and A and B are n × n matrices. A is called the distance matrix, and each element a_ij denotes the distance between locations i and j. B is called the flow matrix, and each element b_kl denotes the flow of materials moving from unit k to unit l. n is called the size of the problem.

B. Tabu Search

Tabu search is a heuristic method for solving combinatorial optimization problems [2]. During the search, it always keeps a feasible solution. Let x be a feasible solution, and N(x) be the set of neighborhood solutions of x. Tabu search begins with a feasible solution and proceeds by iteratively moving from the current feasible solution x to another feasible solution in N(x). When moving, tabu search selects the best solution y in N(x), provided that y has not been selected recently. Once a feasible solution y has been selected as the current solution, y is kept in a special memory called the tabu list for a certain period of time, so that y cannot be selected again as the current solution during that period. Using the tabu list, tabu search can escape from a local minimum. Several terminating conditions are possible, for example, a maximum number of solution updates, or a maximum number of consecutive updates that fail to improve the solution.

FPT 2006

III. HARDWARE IMPLEMENTATION

A. Tabu Search for Solving the QAP

When applying tabu search to the QAP, a feasible solution of the problem is represented as a permutation of n units. Given a permutation φ of n units, let π be a permutation obtained from φ by interchanging two distinct units, i and j, in φ. Then π is defined as a neighbor of φ. From this definition, each feasible solution of the problem has n(n−1)/2 neighborhood solutions. Any move from the current solution to a neighborhood solution can be represented as the pair (r, s) of units to be interchanged. The tabu list consists of a set of pairs of units (r, s) that are prohibited from being interchanged.

[Fig. 2. Difference calculation unit (k-th subunit). The diagram shows memories A[∗, k], φ, and B, the values a_sk, a_rk, b_φ(s)φ(k), and b_φ(r)φ(k), two subtracters, a multiplier, a comparator, and pipeline registers.]

Since tabu search iteratively updates the current solution, there is no need to calculate the objective function from expression (1) each time; its value can be obtained from the difference between the costs of the current solution and the updated solution. Assume that φ is the current solution, and π is the solution obtained by interchanging two units r and s in φ. Then the reduction in cost obtained by moving from φ to π, Δ(φ, r, s) = F(φ) − F(π), is given by expression (2), and the cost of π follows from expression (3).

$$\Delta(\phi, r, s) = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} \left( a_{ij}\, b_{\phi(i)\phi(j)} - a_{ij}\, b_{\pi(i)\pi(j)} \right), \tag{2}$$

$$F(\pi) = F(\phi) - \Delta(\phi, r, s), \tag{3}$$
where π(r) = φ(s), π(s) = φ(r), and π(k) = φ(k) for 0 ≤ k < n, k ≠ r, k ≠ s. If the matrices A and B are both symmetric and their diagonal elements are all 0, then expression (2) can be rewritten as follows [1].

$$\Delta(\phi, r, s) = 2 \sum_{k \neq r, s} (a_{sk} - a_{rk}) \times (b_{\phi(s)\phi(k)} - b_{\phi(r)\phi(k)}). \tag{4}$$

Since, in most applications of the QAP, A and B are both symmetric with all-zero diagonals, in this study we assume that the difference between the costs of φ and π is given by expression (4).

B. Overview of the Hardware

The proposed hardware implementation of tabu search for the QAP mainly consists of two units: the difference calculation unit (DCU) and the tabu memory unit (TMU). Fig. 1 shows the block diagram of the proposed hardware for a problem of size 16. The DCU calculates expression (4) for a given pair of units (r, s), which is a candidate for the next move, with its subunits operating in parallel. The pair of units with the largest improvement of the cost function is chosen as the next move. If there are no improving moves, the one that least degrades the cost function is selected. The TMU checks whether a given pair of units is contained in the tabu list. If it is, the pair is not selected as the next move unless it leads to a better solution than the best one found so far. In the following, we explain the DCU and the TMU in detail.

1) Difference Calculation Unit: The difference calculation unit (DCU) is a subcircuit that calculates the difference of the objective function after interchanging two units, according to expression (4). The DCU consists of n subunits, where n is the size of the given problem. Each subunit k, 0 ≤ k < n, is responsible for the part of expression (4) that concerns a specific value of k; that is, subunit k evaluates the following expression.

$$\Delta(\phi, r, s, k) = (a_{sk} - a_{rk}) \times (b_{\phi(s)\phi(k)} - b_{\phi(r)\phi(k)}). \tag{5}$$
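The decomposition of expression (4) into per-subunit terms (5) can be checked in software. The following is a minimal sketch, not the authors' implementation: it assumes symmetric integer matrices with zero diagonals, takes Δ(φ, r, s) to be F(φ) − F(π) (which is what the double sum in expression (2) evaluates to), and uses illustrative names (`cost`, `subunit_term`, `delta`, `random_symmetric`, `swapped`, `check_all_moves`) that do not appear in the paper.

```python
import random

def cost(a, b, phi):
    """Objective function (1): sum over i, j of a[i][j] * b[phi[i]][phi[j]]."""
    n = len(phi)
    return sum(a[i][j] * b[phi[i]][phi[j]] for i in range(n) for j in range(n))

def subunit_term(a, b, phi, r, s, k):
    """Expression (5) for subunit k; the result is forced to 0 when k is r or s."""
    if k == r or k == s:
        return 0
    return (a[s][k] - a[r][k]) * (b[phi[s]][phi[k]] - b[phi[r]][phi[k]])

def delta(a, b, phi, r, s):
    """Expression (4): twice the sum of the n subunit outputs."""
    return 2 * sum(subunit_term(a, b, phi, r, s, k) for k in range(len(phi)))

def random_symmetric(n, rng, high=9):
    """Hypothetical helper: random symmetric integer matrix with zero diagonal."""
    m = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            m[i][j] = m[j][i] = rng.randint(0, high)
    return m

def swapped(phi, r, s):
    """Permutation pi obtained from phi by interchanging units r and s."""
    pi = list(phi)
    pi[r], pi[s] = pi[s], pi[r]
    return pi

def check_all_moves(n=8, seed=0):
    """Compare expression (4) with a full recomputation of (1) for every pair."""
    rng = random.Random(seed)
    a, b = random_symmetric(n, rng), random_symmetric(n, rng)
    phi = list(range(n))
    rng.shuffle(phi)
    base = cost(a, b, phi)
    return all(
        base - cost(a, b, swapped(phi, r, s)) == delta(a, b, phi, r, s)
        for r in range(n) for s in range(r + 1, n)
    )
```

Calling `check_all_moves()` returns `True` when the O(n) incremental formula agrees with the O(n²) recomputation for all n(n−1)/2 moves. The check relies on the symmetry and zero-diagonal assumptions; for asymmetric matrices, expression (4) does not apply.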
The block diagram of each subunit is shown in Fig. 2. One subunit consists of subtracters, multipliers, counters, registers, and memories that store one column of matrix A and one row of matrix B (identical to the corresponding column, since B is assumed symmetric). In the FPGA implementation, the memories are implemented using the internal block RAMs of the FPGA chip.

The behavior of each subunit shown in Fig. 2 is as follows. The inputs of the subunit are r, s, and k, where r and s specify the units to be interchanged. The value of k is fixed for each subunit. In the figure, large rectangles containing A or B represent memories, small rectangles show pipeline registers, and trapezoids show arithmetic units such as subtracters and multipliers. Subunit k stores the k-th column of matrix A, A[∗, k], and the φ(k)-th column of matrix B, B[∗, φ(k)]. Matrices A and B are thus distributed among the subunits of the DCU with no overlap of data between subunits, which is a major advantage of the proposed architecture. The matrix data A[∗, k] is initialized at the start of the algorithm and kept unchanged, since there is no need to update it during execution. The matrix data B[∗, φ(k)] is initialized at the start of the algorithm and is updated whenever the current solution φ is updated. Note that each time the current solution is updated, the matrix data B needs to be updated in only two subunits: these two subunits exchange their columns of matrix B. Thus, O(n) clock cycles are enough for updating the memory data.

Since recent large-scale FPGAs have a sufficient number of internal block RAMs, the proposed hardware architecture with distributed RAMs is well suited to implementation with recent FPGAs. For example, a recent FPGA chip (Altera EP2S180) contains 768 internal block RAMs, each of which is a 4-Kbit RAM [8].

[Fig. 1. Overview of the circuit (n = 16). Blocks shown: counters r and s; the difference calculation unit (DCU) with subunits k = 0, 1, 2, ..., 14, 15; the current solution φ; the tabu memory unit (TMU); and the adders and comparators for Δ(φ, r, s), the current cost F(φ), the updated cost F(π), and the best cost.]

Each subunit of the DCU runs as follows. In the first clock cycle of each phase for computing (4), a_sk and a_rk are read out from memory A[∗, k], and φ(s), φ(r), and φ(k) are read out from memory φ. In the same clock cycle, k is compared with r and s to check whether k = r or k = s holds. In the second clock cycle, b_φ(s)φ(k) and b_φ(r)φ(k) are obtained from memory B using φ(s), φ(r), and φ(k). In the third clock cycle, the subunit calculates a_sk − a_rk and b_φ(s)φ(k) − b_φ(r)φ(k), and in the next clock cycle the multiplication is performed. If k = r or k = s, the result of the multiplication is set to 0.

As shown in Fig. 1, for a problem of size n, n subunits of the DCU are implemented. Each subunit is dedicated to a specific k, 0 ≤ k < n, all subunits run in parallel, and expression (4) is evaluated for each pair (r, s), 0 ≤ r, s < n, in a pipeline fashion. The outputs of the n subunits are added to obtain the cost difference of the candidate move, and the pair (r, s) that yields the lowest cost F(π) is selected as the candidate for the next feasible solution of tabu search. The candidate pair is then transferred to the tabu memory unit and checked against the tabu list. If the candidate pair is not in the tabu list, or if it gives a solution better than the best one found so far, it becomes the next feasible solution of tabu search. Otherwise, the candidate is discarded and the next candidate is examined.

2) Tabu Memory Unit: The tabu memory unit (TMU) is responsible for checking whether a given pair of units, as a candidate for the next move, is contained in the current tabu list. Additionally, it manages the tabu list and the
current solution of the problem. The TMU mainly consists of a shift register and a memory. The shift register realizes a queue (FIFO), and the memory realizes a two-dimensional array. When a pair of units (i, j) is given as a candidate move, the TMU checks whether it is tabu: the unit accesses the memory at address (i, j), and if the memory returns the value 0, the pair is not tabu. Tabu checking by the TMU and cost calculation by the DCU are performed in parallel. For the current solution φ, once the next move (i, j) is finally determined, the pair is entered into the queue, and the memory data at address (i, j) is set to 1. The oldest tabu pair in the queue, (s, t), is read out, and the memory data at address (s, t) is set to 0. The tabu list size is therefore determined by the length of the shift register. The current solution is also updated.

C. Execution Time

There are several possible termination conditions for tabu search. In the current implementation of the proposed method, the termination condition is the maximum number of moves: when the number of moves reaches a user-given number, the method terminates. The time complexity of the proposed method is evaluated as follows. For a problem instance of size n, one move is performed in O(n²) clock cycles, since each current solution has O(n²) possible moves and the DCU can evaluate each move in constant time. Thus, if the maximum number of moves is M, the overall time complexity is O(M × n²).

D. FPGA Implementation

The proposed hardware solver based on tabu search for the quadratic assignment problem is implemented on FPGAs.
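As a software illustration of the tabu memory unit and the move-count termination rule described in Sections III-B and III-C, the following sketch models the queue and the two-dimensional flag array, and uses them in a sequential search loop. It is a sketch under stated assumptions rather than the authors' circuit: the class and function names are hypothetical, pairs are stored with the smaller index first (the paper does not say how (i, j) and (j, i) are handled), matrices are assumed symmetric with zero diagonals, and each move is evaluated in O(n) time here instead of the constant time achieved by the n parallel subunits.

```python
from collections import deque

def cost(a, b, phi):
    """Objective function (1)."""
    n = len(phi)
    return sum(a[i][j] * b[phi[i]][phi[j]] for i in range(n) for j in range(n))

def delta(a, b, phi, r, s):
    """Expression (4): cost reduction F(phi) - F(pi) when r and s are interchanged."""
    return 2 * sum((a[s][k] - a[r][k]) * (b[phi[s]][phi[k]] - b[phi[r]][phi[k]])
                   for k in range(len(phi)) if k != r and k != s)

class TabuMemory:
    """Model of the TMU: a FIFO of recent moves plus an n x n array of flags."""

    def __init__(self, n, length):
        self.flags = [[0] * n for _ in range(n)]  # 1 means the pair is tabu
        self.queue = deque()                      # stands in for the shift register
        self.length = length                      # tabu list size

    @staticmethod
    def _key(i, j):
        # Assumption: a pair is stored with the smaller index first.
        return (i, j) if i < j else (j, i)

    def is_tabu(self, i, j):
        i, j = self._key(i, j)
        return self.flags[i][j] == 1

    def record(self, i, j):
        """Enter move (i, j); clear the flag of the oldest move once the queue is full."""
        i, j = self._key(i, j)
        self.queue.append((i, j))
        self.flags[i][j] = 1
        if len(self.queue) > self.length:
            s, t = self.queue.popleft()
            self.flags[s][t] = 0

def tabu_search(a, b, phi, max_moves, tabu_length):
    """Perform up to max_moves moves; return (best cost, best permutation)."""
    n = len(phi)
    phi = list(phi)
    current = cost(a, b, phi)
    best, best_phi = current, list(phi)
    tmu = TabuMemory(n, tabu_length)
    for _ in range(max_moves):
        chosen = None
        for r in range(n):
            for s in range(r + 1, n):
                new_cost = current - delta(a, b, phi, r, s)
                # A tabu move is allowed only if it beats the best cost so far.
                if tmu.is_tabu(r, s) and new_cost >= best:
                    continue
                if chosen is None or new_cost < chosen[0]:
                    chosen = (new_cost, r, s)
        if chosen is None:  # every move is tabu and none improves on the best
            break
        current, r, s = chosen
        phi[r], phi[s] = phi[s], phi[r]
        tmu.record(r, s)
        if current < best:
            best, best_phi = current, list(phi)
    return best, best_phi
```

A call such as `tabu_search(a, b, list(range(n)), max_moves=100000, tabu_length=n)` mirrors the parameter choices reported in Section IV. Picking the lowest-cost admissible move in one pass gives the same result as the hardware's procedure of testing the best candidate against the tabu list and falling back to the next one.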
Specifications of the circuit that depend on a given instance are the problem size n, the data width of matrices A and B, the number representation (integer, fixed-point, or floating-point), etc. Currently, for given specifications of the QAP, the HDL description of the proposed method is written by hand. It would be convenient to develop a program that automatically generates the HDL source code from given specifications of the QAP.

TABLE I
RESULTS OF LOGIC SYNTHESIS.

size    #LE              memory (byte)
32      11,068 (19 %)    8,512
64      39,634 (69 %)    33,534

TABLE II
EXPERIMENTAL RESULTS ON EXECUTION TIME.

size    best    soft (s)    hard (s)         speedup
32      130     264         1.404            188
64      116     2153        5.472            393
128     64      17528       21.888 (est.)    801
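The entries of Table II can be cross-checked against the O(n²) cycles-per-move estimate of Section III-C with a little arithmetic. The sketch below assumes that the whole hardware execution time is spent at the 40 MHz clock reported in Section IV and that all 100,000 interchanges are performed; the constant and function names are illustrative only.

```python
CLOCK_HZ = 40e6   # FPGA clock frequency reported in Section IV
MOVES = 100_000   # number of interchanges before termination (Section IV)

# (problem size n, hardware time in s, software time in s), taken from Table II
TABLE_II = [(32, 1.404, 264), (64, 5.472, 2153), (128, 21.888, 17528)]

def cycles_per_move(hard_seconds):
    """Average number of clock cycles spent on one move."""
    return hard_seconds * CLOCK_HZ / MOVES

def summarize():
    """For each size: neighborhood size, cycles per move, their ratio, and speedup."""
    rows = []
    for n, hard, soft in TABLE_II:
        neighbors = n * (n - 1) // 2
        cycles = cycles_per_move(hard)
        rows.append({
            "n": n,
            "neighbors": neighbors,
            "cycles_per_move": cycles,
            "cycles_per_neighbor": cycles / neighbors,
            "speedup": round(soft / hard),
        })
    return rows

if __name__ == "__main__":
    for row in summarize():
        print(row)
```

Under these assumptions, the average number of cycles per move stays within about 15 % of the neighborhood size n(n−1)/2 for all three problem sizes, which is consistent with one pipelined evaluation per clock cycle, and the recomputed speedup ratios round to the values listed in the table.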
IV. EXPERIMENTS

To evaluate the effectiveness of the proposed hardware solver, we compared it with a software program. The proposed hardware was designed in Verilog HDL and implemented on an FPGA board. We used Altera’s Quartus II Version 5.0 as the FPGA design tool. The FPGA board used in the experiments (Mitsubishi Electric Micro-Computer Application Software Co., Ltd., PowerMedusa MU200-SX60) is equipped with an Altera EP1S60 FPGA, which contains 57,120 logic elements. The FPGA tool was run on a PC with a Pentium 4 2.4 GHz CPU. As the software solver, we implemented an algorithm logically equivalent to the proposed tabu search based hardware as a program written in C, and executed it on a workstation, a Sun Microsystems Sun Blade 1500 (CPU: UltraSPARC, 1062 MHz, 2 GB main memory).

In the experiments, we prepared three QAP benchmark instances, whose problem sizes were 32, 64, and 128. The proposed hardware and the software program were terminated when the number of interchanges of units reached 100,000, and their execution times were measured. The length of the tabu list was set to n, the problem size. Each benchmark instance was selected from QAPLIB, a well-known benchmark suite for the QAP [7]. For each benchmark, we wrote the HDL source code of the circuit and implemented it on the FPGA board. The clock frequency of the FPGA was set to 40 MHz.

Table I shows the results of logic synthesis, and Table II shows the execution times of the hardware and software solvers for each benchmark. In Table I, “#LE” represents the number of logic elements required to implement the circuit, and “memory” shows the total size of the internal FPGA memory used to implement the circuit. In Table II, “best” shows the cost of the best solution obtained by the algorithm, and “hard” and “soft” show the execution times of the hardware and software solvers, respectively. “speedup” is the speedup ratio of the proposed method over the software program. In the experiments, we could not implement the hardware solver for the problem of size 128 because of the shortage of logic elements on the FPGA. Thus, for the benchmark of size 128, the results were estimated from simulation. Note that current state-of-the-art FPGAs, such as the Altera EP2S180 [8], contain more than three times as many logic elements per chip as the FPGA used in the experiments. Using such a chip, we will be able to implement the proposed hardware solver for this problem size.

In the experimental results, the proposed method outperformed the software in execution time even though the CPU clock frequency of the workstation was about 26 times higher than the FPGA clock frequency. This is due to the parallel and pipeline processing adopted in the proposed method. As the problem size increased, the speedup ratio also improved, because the degree of parallelism grows with the problem size. Furthermore, for every benchmark, the proposed method obtained the best solution published in QAPLIB. From these results, we conclude that the proposed method is very effective for solving the QAP. In particular, for large problems, the proposed method is far superior to the software implementation of tabu search.

V. CONCLUSION

In this paper, we have proposed a hardware solver for the quadratic assignment problem and implemented it on an FPGA board. As future work, the development of a more effective tabu search based hardware solver is important. Another direction is to improve the circuit structure so as to solve larger problems.
To effectively utilize a state-of-the-art large-scale FPGA chip, we need to develop a more sophisticated architecture for the tabu search based hardware solver.

ACKNOWLEDGMENT

This research is partially supported by Grant-in-Aid for Scientific Research (C) #18500042 from the Japan Society for the Promotion of Science.

REFERENCES

[1] E. Çela, The Quadratic Assignment Problem: Theory and Algorithms, Kluwer Academic Publishers, 1998.
[2] F. Glover and M. Laguna, Tabu Search, Kluwer Academic Publishers, 1997.
[3] P. Merz and B. Freisleben, “A comparison of memetic algorithms, tabu search, and ant colonies for the quadratic assignment problem,” Proc. Congress of Evolutionary Computation, Vol. 3, pp. 2063–2070, 1999.
[4] J. Skorin-Kapov, “Tabu search applied to the quadratic assignment problem,” ORSA Journal on Computing, 2, 1, pp. 33–45, 1990.
[5] E. Taillard, “Robust taboo search for the quadratic assignment problem,” Parallel Computing, 17, pp. 443–455, 1991.
[6] M. Vázquez and L. D. Whitley, “A hybrid genetic algorithm for the quadratic assignment problem,” Proc. Genetic and Evolutionary Computation Conference, pp. 135–142, 2000.
[7] http://www.seas.upenn.edu/qaplib/
[8] Altera Corporation, “Stratix II Device Handbook,” Vol. 1, 2006.