Acceleration of FPGA placement

Joseph Rios
Department of Computer Engineering
University of California, Santa Cruz

June 8, 2005

Abstract

Placement (and routing) of circuits is very computationally intensive. This intensity has motivated several attempts to accelerate the process for application-specific integrated circuits (ASICs) and field-programmable gate arrays (FPGAs). This paper gives an overview of some of these attempts. Specifically, parallelization of the standard simulated annealing (SA) algorithm is examined, as well as a particular improvement to VPR, the academic Versatile Place and Route tool. Overall, it is clear that SA is difficult to parallelize and that even very minor improvements to a well-known tool are cause for publication. A discussion is provided outlining a more innovative and potentially fruitful direction for acceleration of placement and routing.
1 Introduction and motivation
FPGAs are circuits that can be programmed (and reprogrammed) in the field. Logic functions are typically implemented with look-up tables and flip-flops. The routing between the various blocks is also programmable, with the ability to connect the output of virtually any logic block to any other. Through synthesis of a hardware-description language like Verilog or VHDL, user logic is mapped to these logic blocks. After this mapping, it is necessary to decide the physical location of each logic block and which routing resources should be dedicated to which nets. Performing this placement and routing optimally is a proven NP-complete problem [1]. There are several methods for producing solutions that are acceptable to the designer in a tractable amount of time, including simulated annealing (SA), force-directed placement, min-cut placement, placement by numerical optimization, and evolution-based placement [2]. In this paper we are mostly concerned with SA, though evolutionary techniques will make a brief appearance.

SA is a stochastic method for solving optimization problems based on the process of annealing metals[3, 4]. The basic idea is to start with an initial solution with high energy (randomness) and an evaluation function that computes the “goodness” of a given solution.
A monotonically decreasing temperature schedule is kept, with a very high initial temperature. A new solution is created from the current solution by perturbing it in some way. This new solution is always kept if it is better and may be kept if it is worse; the higher the temperature, the more likely a worse solution is to be kept. This allows the search for a global optimum to escape local optima, especially early in the process. With an appropriate annealing schedule and acceptance function, it can be proven that the global optimum will be found; however, this is not practical in most situations, since it usually requires an infinite or extremely long cooling schedule. Fortunately, SA has been shown to be a useful tool for solving many problems[5].

Simulated annealing is the algorithm used in VPR[6]. VPR is an open-source, academic tool for experimentation with various aspects of FPGA placement and routing. The widespread use of VPR is a testament to its robust nature and to the current state of industry tools, which are not available for experimental, academic use.
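To make the loop concrete, here is a minimal, generic SA skeleton in Python. The geometric cooling schedule, the parameter values, and all names are illustrative textbook choices, not VPR's actual implementation.

    import math
    import random

    def simulated_annealing(initial, perturb, cost,
                            t0=100.0, alpha=0.95,
                            moves_per_temp=1000, t_min=1e-3):
        # perturb() proposes a neighboring solution; cost() evaluates it.
        current, current_cost = initial, cost(initial)
        t = t0
        while t > t_min:
            for _ in range(moves_per_temp):
                candidate = perturb(current)
                delta = cost(candidate) - current_cost
                # Always keep improvements; keep a worse solution with a
                # probability that shrinks as the temperature drops
                # (the Metropolis criterion).
                if delta < 0 or random.random() < math.exp(-delta / t):
                    current, current_cost = candidate, current_cost + delta
            t *= alpha  # geometric cooling: one common schedule choice
        return current

    # Toy usage: minimize x^2 over the integers.
    print(simulated_annealing(50,
                              lambda x: x + random.choice((-1, 1)),
                              lambda x: x * x))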
2 Parallelization of simulated annealing
There has been significant effort put toward the task of parallelizing simulated annealing. There appears to be a fundamental limit on the amount of parallelization that can be applied to the standard simulated annealing algorithm, because it works on a single data structure and every move depends heavily on every previous move. To circumvent this bottleneck, the following algorithms change the basic algorithm in some way to achieve some degree of parallelization. The biggest problem with each of these solutions is that they depend on the ability of the system to maintain multiple, often independent, copies of the large data structures involved in creating a placement.
2.1 Two-level approach
Two different papers[7, 8] published within a year of each other offer similar approaches to the parallelization of SA. They suggest a “two-level” approach in which standard SA is one level and the other is some greedier algorithm. Standard SA searches for a global optimum while the secondary level uses hill-climbing to find the local optimum whenever SA reaches a new valley (or hill). Greedy algorithms that operate via gradient descent find local optima quickly, and since the two levels operate fairly independently, they can run on separate processors. In actuality, several hill-climbers can be run on different hills (or valleys) simultaneously, as in the sketch below. While this idea is an innovative use of a multiprocessor environment applied to SA, it was not tested on problems with solution spaces as large as those of FPGA placement. It is not hard to see that the large size of the circuits involved may cause interprocessor communication problems, especially since each processor needs its own copy of the circuit being designed.
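A minimal sketch of the second level, using toy stand-ins for the solution representation, cost function, and neighborhood (none of which come from the cited papers):

    from multiprocessing import Pool
    import random

    # Toy stand-ins: a "solution" is a list of block coordinates and the
    # cost is the total squared distance from the origin. A real
    # placement and its cost function are, of course, far larger.
    def cost(sol):
        return sum(x * x + y * y for x, y in sol)

    def neighbors(sol):
        # All single-block moves of one grid unit.
        for i, (x, y) in enumerate(sol):
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                yield sol[:i] + [(x + dx, y + dy)] + sol[i + 1:]

    def hill_climb(sol):
        # Second level: greedy descent to the nearest local optimum.
        improved = True
        while improved:
            improved = False
            for nb in neighbors(sol):
                if cost(nb) < cost(sol):
                    sol, improved = nb, True
                    break
        return sol

    if __name__ == "__main__":
        # The first level (standard SA) would supply these snapshots as
        # it enters new regions of the search space; random placeholders
        # are used here. Note each worker gets its own full copy of the
        # solution -- exactly the memory cost noted above.
        snapshots = [[(random.randint(-5, 5), random.randint(-5, 5))
                      for _ in range(4)] for _ in range(8)]
        with Pool(processes=4) as pool:   # one hill-climber per region
            optima = pool.map(hill_climb, snapshots)
        print(min(cost(s) for s in optima))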
2.2 Parallel moves
VPR has been used as a platform for experimenting with parallelization of SA[9]. The idea here is to allow more than one move to be attempted at a time; each move can be accepted or rejected without consideration of the other moves currently being attempted. The main drawback is that when parallel moves are attempted on the same net, the bounding-box calculations become difficult or outright incorrect. When the bounding-box calculations are wrong, acceptance of a move is based on bad data, which can hurt the quality of SA; the sketch below illustrates the problem. The paper discusses ways to deal with this, including allowing SA to proceed independently on several processors and occasionally sampling the solutions, either taking the best current solution or taking the “best” pieces of each solution. Synchronization is a major concern with this system, as is the amount of data being handled.
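A small, self-contained illustration of the bounding-box problem: two moves on the same net, each evaluated against the original bounding box, report cost deltas whose sum disagrees with the true combined change. The net and coordinates are invented for the example.

    # Each net's wiring cost is estimated from the bounding box of its
    # terminals (half-perimeter wirelength). If two processors move
    # blocks on the same net at once, each evaluates its move against a
    # bounding box the other is simultaneously invalidating.
    def half_perimeter(terminals):
        xs = [x for x, _ in terminals]
        ys = [y for _, y in terminals]
        return (max(xs) - min(xs)) + (max(ys) - min(ys))

    net = [(0, 0), (1, 0), (10, 0)]   # terminal locations of one net
    base = half_perimeter(net)        # 10

    # Processor A moves terminal 0 to (3, 0); processor B moves
    # terminal 1 to (4, 0). Each computes its delta against the
    # original, unmodified net.
    delta_a = half_perimeter([(3, 0), (1, 0), (10, 0)]) - base   # -1
    delta_b = half_perimeter([(0, 0), (4, 0), (10, 0)]) - base   #  0
    combined = half_perimeter([(3, 0), (4, 0), (10, 0)]) - base  # -3

    # The independently computed deltas sum to -1, but applying both
    # moves actually changes the cost by -3: the acceptance decisions
    # were based on stale bounding boxes.
    print(delta_a + delta_b, combined)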
2.3 Infusion of genetic algorithms
There is a decent amount of literature on combining SA with genetic algorithms (GA) in order to gain some parallelization [10, 11, 12]. A GA is a search method that uses a “population” of potential solutions that “evolve” over the course of several generations. This evolution takes place via reproduction between individuals in the population and occasional mutation to introduce new pieces of solutions. Individual solutions are selected for reproduction based on some evaluation of their fitness, so that the more fit solutions (those closer to the global optimum) are more likely to survive and reproduce. There is some parallelism to be exploited in a GA: the fitness calculation is typically the computationally intensive portion of the algorithm, and each individual's fitness can be computed without regard to the others, so the calculations theoretically scale well with the number of processors, as in the sketch below. Likewise, the “mating” operations are independent of each other. The problem with applying this to FPGAs is that the size of each individual is determined by the size of a complete placement, which is very large. All of the cited papers show promising results, but none attempts an application to the placement problem.
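A minimal sketch of the parallel fitness evaluation, with a toy individual and fitness function of my own invention standing in for a real placement:

    from multiprocessing import Pool
    import random

    # Toy fitness: an "individual" is a permutation, a crude stand-in
    # for a placement; fitness is the total distance between consecutive
    # elements, which we want to minimize.
    def fitness(individual):
        return sum(abs(a - b) for a, b in zip(individual, individual[1:]))

    if __name__ == "__main__":
        population = [random.sample(range(32), 32) for _ in range(64)]
        # Each fitness evaluation is independent of the others, so they
        # map cleanly onto separate processors.
        with Pool(processes=4) as pool:
            scores = pool.map(fitness, population)
        # Selection then favors the fitter individuals, e.g. keep the
        # best half as parents for the next generation.
        ranked = sorted(zip(scores, population))
        parents = [ind for _, ind in ranked[: len(ranked) // 2]]
        print(ranked[0][0])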
3 Improvement on VPR
There are many examples of people experimenting with VPR to improve it in some way[13, 14]. For this project, the paper by Danilin and Sawitzki[13] was examined. Their results were somewhat incomplete, so the changes they made to VPR were duplicated and further results were obtained. Essentially, they tried to reduce the amount of time the placer ran without sacrificing placement quality. Their results purported a 3% improvement in timing with a 6% reduction in swap moves. The major drawback to their results is that they increased the track width by 20% over the reported best values for VPR in order to make these comparisons. After setting this new width, each of the 20 circuits was run through the VPR placer and their “PROBE” placer. These placements were then supplied to the VPR router with the new, 20% larger width to obtain the previously mentioned results.
In VPR the wire cost for each net is based on the bounding box of its logic blocks:

    Wire\ Cost = \sum_{i=1}^{N_{nets}} q(i) \left[ \frac{bb_x(i)}{C_{av,x}(i)} + \frac{bb_y(i)}{C_{av,y}(i)} \right]    (1)

The bb terms refer to the bounding box dimensions and the C terms to the average channel width within the bounding box. q(i) is a compensation term that allows wires to exceed the bounding box by some amount based on the number of nodes in the net. The change to this equation was to incorporate the size of the target circuit: C_{av,x} was replaced by C_{av,x} \times n_x, where n_x is the number of logic blocks in the circuit in the x direction, and a similar substitution was made for the y term. This effectively lowers the wiring cost, causing faster convergence. Since this changes how the overall \Delta Cost will be calculated,

    \Delta Cost = (1 - \rho) \times \frac{\Delta Timing\ Cost}{Previous\ Timing\ Cost} + \rho \times \frac{\Delta Wire\ Cost}{Previous\ Wire\ Cost}    (2)
the weighting ρ in this function is the second change made in this implementation. In the default VPR implementation, ρ is 0.5. For this experiment, ρ was instead based on R_{limit}, the maximum distance over which a move can occur at a given iteration of SA, which decreases as the temperature decreases. The new ρ is defined as

    \rho = \frac{2 \times R_{limit}}{n_x + n_y}    (3)
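The following minimal Python sketch captures the two modifications; the function and variable names are mine, not VPR's internals.

    # One net's contribution to the modified Eq. (1), with the average
    # channel capacities scaled by the device dimensions n_x and n_y.
    def scaled_wire_cost_term(bb_x, bb_y, c_av_x, c_av_y, q, n_x, n_y):
        # Scaling C_av by the device size shrinks the wire term, which
        # is what drives the faster convergence reported in the paper.
        return q * (bb_x / (c_av_x * n_x) + bb_y / (c_av_y * n_y))

    # The adaptive weighting of Eq. (3).
    def adaptive_rho(r_limit, n_x, n_y, lo=0.3, hi=0.45):
        # rho tracks the SA move range R_limit (which shrinks as the
        # temperature drops) and is clamped so that neither the timing
        # term nor the wire term is weighted out of the cost function.
        return min(hi, max(lo, 2.0 * r_limit / (n_x + n_y)))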
The ρ value is further constrained to stay between 0.3 and 0.45 to ensure it does not become too large or too small. The reasoning for this modified ρ is not completely clear, but the results are reasonable.

For my experiment I made the same changes described in the paper and then allowed VPR to run with the default settings on the 20 circuits. This allowed a comparison of how much the timing improves (or degrades) when the router tries to minimize the size of the channels, rather than working with the +20% values used in the paper. To track whether the program performed fewer swaps, each run was profiled with gprof and the number of calls to try_swap() was noted; this is a very expensive call in the placement process, so decreasing the number of times it is called can significantly decrease the run time. The actual runtimes of the unaltered VPR versus the modified VPR were not recorded, because the program was run in a multiuser environment where several factors could have artificially inflated runtime, so runtime was disregarded. The various delays were also noted, and the percentage change between the original VPR and the modified version was calculated for each circuit. The results are in Table 1.

These results seem to bring the claims of improvement [13] down to earth a little. While fewer calls to try_swap() should be expected because of the faster convergence, it should also have been expected that the delays would not be as good as those of the standard implementation. The fact that the original authors allowed a 20% increase in track width in order to show improvements seems dubious. While I cannot explain why they achieved slightly better delays after the width increase, typically one would expect the minor speed increase not to offset the cost of wider tracks. It is also clear that this is not really a scalable idea; there is no way to use this method to get any more benefit than already reported.

                 Tracks     Logic delay   Net delay   Crit. path        try_swap() calls
    Circuit     VPR   mod        ∆            ∆           ∆          VPR         mod        ∆
    alu4         11    10      0.00%      -19.03%     -18.27%     18199466    17485822   -3.92%
    apex2        12    11     11.88%       -5.36%      -4.57%     24558046    23366001   -4.85%
    apex4        14    14      0.00%       20.39%      19.46%     14602266    14040690   -3.85%
    bigkey        7     6      0.00%       33.46%      32.44%     31020633    27726633  -10.62%
    clma         13    14     16.10%       -4.02%      -3.31%    188093551   181127439   -3.70%
    des           8     7      0.00%       17.81%      17.27%     29158284    26483404   -9.17%
    diffeq        8     8      6.93%       -2.40%      -1.35%     18523609    17588154   -5.05%
    dsip          7     7      0.00%       21.39%      20.62%     25319492    22700420  -10.34%
    elliptic     12    11      0.00%       47.01%      43.37%     63925771    62116660   -2.83%
    ex1010       11    12      0.00%      -20.41%     -19.94%     81497948    77653923   -4.72%
    ex5p         15    14      0.00%       32.52%      30.61%     12192730    11482540   -5.82%
    frisc        15    13     72.45%      -18.21%     -14.34%     59902622    57050292   -4.76%
    misex3       12    12      0.00%       -7.49%      -7.14%     16834710    16193442   -3.81%
    pdc          19    19     10.62%      -14.13%     -13.45%     87205827    83347367   -4.42%
    s298          8     8     -6.09%       -0.22%      -0.59%     24934121    23723821   -4.85%
    s38417        8     8    -56.35%        9.09%       5.22%    124744381   117406861   -5.88%
    s38584.1      9     9      0.00%       -4.93%      -4.78%    132374973   124664205   -5.82%
    seq          12    12      0.00%        7.64%       7.30%     23653606    22537956   -4.72%
    spla         15    16     11.88%      -23.32%     -22.58%     61204052    58872612   -3.81%
    tseng         7     7      0.00%       -2.54%      -2.18%     12657005    12004645   -5.15%
    Avg.                       3.37%        3.36%       3.19%                            -5.41%

Table 1: VPR vs. modified VPR. The ∆ values are from the original, default version of VPR to the modified version.
4 Future direction
As FPGAs become larger and the designs mapped to them grow to match, the design process needs to improve in order to keep pace. If the design process falls behind the growth of the devices, many of the benefits of “rapid” prototyping on FPGAs risk being lost. Parallel computation of the accepted algorithms (force-directed placement, SA, etc.) may prove to be the savior of the design process, allowing faster placement and routing through the use of several workstations or, perhaps, future chip multiprocessors. However, there is another promising direction that is not being pursued with nearly enough vigor: hardware-assisted placement and routing.
To my knowledge, there is only one research group publishing accounts of accelerating the design process through the use of programmable logic[15, 16]. The idea is relatively simple: since the designer of an FPGA will likely have an FPGA available, why not design an ASIC that can perform placement and routing in hardware to avoid costly software instructions? One approach [15] involves using several interconnected processing elements that operate in parallel to perform a modified version of simulated annealing; up to three orders of magnitude improvement in computation time was reported. The cost of this system, however, is in silicon area, since it takes multiple FPGAs to place a single-FPGA design.

Current FPGAs have processor cores synthesized and ready for implementation in the programmable logic. Some more advanced FPGAs have hard processor cores embedded in the logic fabric. It is not uncommon to have an FPGA prototyping system with off-chip RAM connected and ready for use. This opens the possibility of running software that can interact with custom hardware in the FPGA fabric. For example, if one were able to put into hardware the heap data structure that is central to the operation of VPR's routing algorithm, significant improvements might be realized[17]. To illustrate this point, ten of the benchmark circuits for VPR were chosen somewhat randomly from the 20 available netlists, then placed and routed to the target architecture for the “Place and Route Challenge”. Statistics were gathered through the use of gprof (see Table 2). Close to 25% of the computation time, on average, is spent performing some sort of heap operation, and this percentage would rise significantly if only routing were considered.

    Circuit        a      b      c      d      e   Total
    alu4       18.78   5.50   2.70   3.11   2.25   32.34
    bigkey      2.83   3.67   3.50   1.61   4.71   16.32
    des         7.20   3.77   1.98   2.07   1.61   16.63
    dsip        2.40   5.41   6.66   3.97  10.53   28.97
    elliptic   15.05   5.00   3.21   3.21   3.30   29.77
    ex5p        9.26   5.95   2.61   2.31   1.79   21.92
    frisc       8.38   4.19   2.56   2.54   3.21   20.88
    misex3     15.82   4.82   2.32   2.84   1.74   27.54
    pdc        13.77   4.21   1.89   2.15   0.84   22.86
    seq        14.06   4.69   2.30   2.33   0.97   24.35
    Average    10.76   4.72   2.97   2.61   3.10   24.16

    a: get_heap_head()    b: add_to_heap()    c: node_to_heap()
    d: alloc_heap_data()  e: add_route_tree_to_heap()

Table 2: VPR: percentage of computation time spent on various heap operations, as measured with gprof.
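To see why the heap dominates, consider the shape of the router's inner loop. The sketch below is an illustrative PathFinder-style wavefront expansion in Python, not VPR's actual code; essentially every iteration performs the kinds of heap operations profiled in Table 2.

    import heapq

    def maze_route(source, sink, neighbors, edge_cost):
        # Heap-driven wavefront expansion: repeatedly pop the cheapest
        # frontier node ("get_heap_head") and push its neighbors
        # ("add_to_heap") until the sink is reached.
        frontier = [(0.0, source)]
        best = {source: 0.0}
        while frontier:
            cost, node = heapq.heappop(frontier)
            if node == sink:
                return cost
            if cost > best.get(node, float("inf")):
                continue  # stale heap entry superseded by a cheaper one
            for nxt in neighbors(node):
                new_cost = cost + edge_cost(node, nxt)
                if new_cost < best.get(nxt, float("inf")):
                    best[nxt] = new_cost
                    heapq.heappush(frontier, (new_cost, nxt))
        return None

    # Tiny usage example: route across a 1-D chain of nodes 0..9.
    print(maze_route(0, 9,
                     lambda n: [m for m in (n - 1, n + 1) if 0 <= m <= 9],
                     lambda a, b: 1.0))  # -> 9.0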
If these accesses to the heap data occurred in a few cycles rather than the several hundred cycles that would probably be necessary for software management of the heap, one could conclude that the operation of the overall algorithm would speed up quite dramatically.
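A back-of-the-envelope Amdahl's law bound (my own estimate, not a figure from the cited work) helps calibrate that expectation. If a fraction p of the runtime consists of heap operations and hardware makes them a factor of s faster, the overall speedup is

    speedup = \frac{1}{(1 - p) + p/s}

With p ≈ 0.25 and a large s, place-and-route as a whole gains at most about 1.33×; the gain would be considerably larger for the routing phase alone, where p is much higher.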
5 Conclusion
Overall, it is clear that there is a bottleneck in FPGA design at the placement and routing stage, and attempts to alleviate this bottleneck have not been overly successful. Future work on parallelizing the common algorithms will prove important to accelerating the design process, and much of this parallelization may be achieved in hardware, adding to the gains of parallelization itself. Other methods may still need to be developed to take full advantage of programmable logic even at the design stage. With the current state of research, it is clear that even minor improvements to established algorithms, tools, and methods are cause for publication; such is the importance of this issue to the design community.
References

[1] S. Sahni, A. Bhatt, and R. Raghavan. Complexity of design automation problems. In Guy Rabbat, editor, Advanced Semiconductor Technology and Computer Systems, pages 526–573. Van Nostrand Reinhold, 1988.

[2] K. Shahookar and P. Mazumder. VLSI cell placement techniques. ACM Comput. Surv., 23(2):143–220, 1991.

[3] N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller. Equation of state calculations by fast computing machines. Journal of Chemical Physics, 21:1087–1092, 1953.

[4] Mark Fleischer. Simulated annealing: Past, present, and future. In Winter Simulation Conference, pages 155–161, 1995.

[5] P. J. M. van Laarhoven and E. H. L. Aarts, editors. Simulated Annealing: Theory and Applications. Kluwer Academic Publishers, Norwell, MA, USA, 1987.

[6] Vaughn Betz and Jonathan Rose. VPR: A new packing, placement and routing tool for FPGA research. In Wayne Luk, Peter Y. K. Cheung, and Manfred Glesner, editors, Field-Programmable Logic and Applications, pages 213–222. Springer-Verlag, Berlin, September 1997. Proceedings of the 7th International Workshop on Field-Programmable Logic and Applications, FPL 1997. Lecture Notes in Computer Science 1304.

[7] Guo-Liang Xue. Parallel two-level simulated annealing. In International Conference on Supercomputing (ICS'93), pages 357–366, Tokyo, July 1993. ACM Press.
[8] M. Conti, S. Orcioni, and C. Turchetti. Parametric yield optimisation of MOS VLSI circuits based on simulated annealing and its parallel implementation. IEE Proceedings - Circuits, Devices and Systems, 141(5):387–398, October 1994.

[9] Malay Haldar, Anshuman Nayak, Alok N. Choudhary, and Prithviraj Banerjee. Parallel algorithms for FPGA placement. In Majid Sarrafzadeh, Prithviraj Banerjee, and Kaushik Roy, editors, ACM Great Lakes Symposium on VLSI, pages 86–94. ACM, 2000.

[10] Samir W. Mahfoud and David E. Goldberg. A genetic algorithm for parallel simulated annealing. In PPSN, pages 303–312, 1992.

[11] Hao Chen, Nicholas S. Flann, and Daniel W. Watson. Parallel genetic simulated annealing: A massively parallel SIMD algorithm. IEEE Trans. Parallel Distrib. Syst., 9(2):126–136, 1998.

[12] Adora E. Calaor, Augusto Y. Hermosilla, and Bobby O. Corpus Jr. Parallel hybrid adventures with simulated annealing and genetic algorithms. In ISPAN, pages 39–44, 2002.

[13] Alexander Danilin and Sergei Sawitzki. Optimizing the performance of the simulated annealing based placement algorithms for island-style FPGAs. In FPL, pages 852–856, 2004.

[14] Pak K. Chan and Martine D. F. Schlag. Parallel placement for field-programmable gate arrays. In Proceedings of the 2003 ACM/SIGDA Eleventh International Symposium on Field Programmable Gate Arrays (FPGA-03), pages 43–50, New York, February 23–25 2003. ACM Press.

[15] Michael G. Wrighton and André M. DeHon. Hardware-assisted simulated annealing with application for fast FPGA placement. In Proceedings of the 2003 ACM/SIGDA Eleventh International Symposium on Field Programmable Gate Arrays (FPGA-03), pages 33–42, New York, February 23–25 2003. ACM Press.

[16] Randy Huang, John Wawrzynek, and André DeHon. Stochastic, spatial routing for hypergraphs, trees, and meshes. In Proceedings of the 2003 ACM/SIGDA Eleventh International Symposium on Field Programmable Gate Arrays (FPGA-03), pages 78–90, New York, February 23–25 2003. ACM Press.

[17] Pak Chan. Personal communication, 2004.