An Approach to the Physical Design Problems of 3D-FPGAs John Karro and James P. Cohoon Department of Computer Science, University of Virginia Charlottesville, VA 22903 Phone: 804-982-2291 Fax: 804-982-2214 email:
[email protected],
[email protected]
Abstract
rithms quickly adopted for the 3D architecture. In this paper, we present Spiffy: a new tool specifically designed for the simultaneous placement and global routing of 3D-FPGAs. When applied to standard FPGAs, Spiffy is able to produce better results than the leading tools in the literature, and we have used Spiffy to show that the new architecture does have potential for better, more efficient, circuit layouts.
Although FPGAs are a flexible alternative to custom design chips, they suffer from severe interconnection delay resulting in comparatively poor performance. The threedimensional FPGA was proposed in order to reduce this delay [2] without sacrificing any of the versatility. Here we present Spiffy – the first tool specifically designed for the placement and global routing of 3D-FPGAs. Spiffy produces some of the best results in the literature over the standard FPGA architecture, and we use it to show that the 3DFPGA architecture does lead to better performance.
Section two is an overview of the 3D-FPGA, and Section three describes the Spiffy algorithm. Section four compares Spiffy to a leading placement tool, and Section five discusses the results of using Spiffy to map certain benchmarks to a different 3D architectures. We conclude and discuss some remaining problems in Section six.
1 Introduction Field-Programmable Gate Arrays (FPGAs) are a flexible design alternative to custom integrated chips that can be used to implement arbitrary logic functions quickly and at low cost. Since their introduction they have become quite popular, as their design cycle is typically more efficient than that of custom design chips. However, this flexibility often comes with a substantial performance cost, due mostly to an interconnect delay which can account for over 70% of the clock cycle period.
2
3D-FPGAs
Our 3D-FPGAs will be constructed by placing a number of 2D-FPGAs in layers, providing vertical interconnections between adjacent switch-blocks through the use of solderbumps. Hence the switch-block configuration is generalized to the third dimension. The motivation behind the 3D-FPGAs is to improve both the logic density and performance of the chip. Logic density of 3D-FPGAs will be improved when compared to their 2D counterparts both because we now have more levels on which to place the logic, and the increased number of paths between logic blocks.
Recently the concept of a three-dimensional (3D) FPGA was proposed [2]. In this model several standard (2D) FPGAs are stacked and provided with interconnects between vertically-adjacent switch-blocks. It was argued that such a structure will reduce the average interconnection length, hence greatly reduce this propagation delay. However, the actual improvement in the circuit mapping depends on the quality of the physical design tools used – and this new architecture requires new design tools. In [1], the first placement statistics were given for any benchmarks, but these were only crude estimates taken from standard 2D algo-
In terms of performance, we get a win in the expected path-lengths. If we arrange logic blocks in a square (the optimal configuration), the average distance between blocks is . Arranged in a cube, the average distance is . Hence for any FPGA of more than twelve logic blocks (that is, any FPGA), the average distance will be smaller in the
John Karro is supported by a grant from the Virginia Aerospace Consortium Fellowship Program. Our benchmarks and additional related papers may be found at WWW URL http://www.cs.virginia.edu/ vlsicad/
1
(a)
(b)
(c)
Figure 1. An example of the partition model, with (a) the partition template, (b) the partitioning graph, and (c) two possible thumbnails over a given set of terminals in the partitioning graph.
partitions in such a way as to minimize the total number of thumbnail edges when summed over all the nets. Intuitively, we hope to force any given net into a single partition – or, failing that, neighboring partitions – thus grouping it together and reducing connection length.
cubic configuration. Thus the expected path length will be shorter, leading to better performance. The idea of three dimensional chip architectures is certainly not new; MCM-Vs have been been studied in detail, and other schemes for 3D architectures have also been proposed. Since our initial proposal for a 3D-FPGA architecture[1], other schemes have since been proposed [9, 11].
The classic partitioning problem can easily be reduced to this more complicated problem, hence thumbnail partitioning is NP-complete. We heuristically solve the problem using a simulated annealing technique.
Clearly there will be problems with the construction of such chips; with the increased volume to surface ratio, heat dissipation will become an issue, and building such chips could be difficult. However, similar problems have been encountered and explored for the other 3D packaging technologies [6, 10]. We believe the 3D-FPGA architecture will not pose any additional problems, and is a feasible goal.
Once we have assigned the logic-blocks to their partitions, we allow each net that must cross over a partition edge to do so at exactly one point along that edge. In order to minimize congestion, we want to minimize the maximum number of points crossing over any one point. We begin by picking a specific thumbnail for each net. As shown in Figure 1(c), any net can have multiple thumbnails associated with it. We want to pick a specific thumbnail for each net in such a way as to minimize the maximum number of thumbnail edges
crossing a given partition cut. While this can be done in run-time, we use a linear greedy heuristic instead. Intuitively, it makes sense to save the nets with more options for later, and pick those with fewer associated thumbnails right away. We order the nets based on this measure, and assign them accordingly. As the actual algorithm is based on dynamic-programming, this tends to produce good results as well.
3 Spiffy
The algorithm used in Spiffy is based on the concepts introduced by the Mondrian tool in [3], generalized to threedimensions. To provide an understanding of the Spiffy algorithm, we must explain the four important steps. Our first step partitions the logic blocks into an grid, where is the number of levels, the number of rows, and the number of columns. Figure 1(a) shows how the FPGA is partitioned into the areas, while Figure 1(b) shows the partition graph, where each node corresponds to one partition.
Once we have picked a thumbnail for a given net, we need to decide for each edge exactly which switch-block, or virtual node, on the corresponding partition-cut the net will cross. We consider all the nets crossing the partitioncut, and try to assign them to switch-blocks in such a way as to minimize the maximum number of nets using any one switch-block. This reduces to the problem of minimum bipartite matching.
For each net we consider the possible thumbnails – the minimum Steiner trees connecting the nodes corresponding to the partitions the net occupies. In Figure 1(c), we show two of the many possible thumbnails for a given net.
The algorithm recurses over progressively smaller parti-
Our objective in the first phase is to assign the nets to 2
Circuit busc dma bnre dfsm z03 9symml term1 apex7 alu2 too large example vda alu4 k2 Average
FPGA Size 13x12 18x16 22x21 32x22 27x26 11x10 10x9 12x10 15x13 14x14 14x12 16x16 19x17 22x20
Num. Nets 151 213 352 420 608 79 88 115 153 186 205 225 255 404
Mondrian 77.5 196.8 654.4 528.6 1042.3 56.1 30.9 48.0 154.4 216.5 139.7 1019.3 1105.8 12198.8 1247.2
Spiffy 106.3 137.9 230.0 289.1 523.0 27.7 30.2 56.5 74.5 102.3 161.5 280.2 153.2 872.3 217.5
Improvement -37.2% 29.9% 64.4% 45.3% 49.8% 50.6% 2.3% -17.7% 51.8% 52.8% -15.6% 72.5% 86.2% 92.8% 82.6%
Table 1. A comparison of the run-times of Mondrian against Spiffy. All tests were run on a 50MHZ Sparc 20 using SunOS 4.1.3. Time is given in seconds.
Name busc 9symml term1 apex7 alu2 2large example Averages
Mondrian 9 10 7 9 10 11 13 9.9
Congestion Spiffy Improvement 11.1 -23.3% 13.5 -35.0% 9.4 -34.2% 10.1 -12.2% 14.4 -44.4% 14.0 -27.3% 12.1 6.9% 12.1 -22.2%
Mondrian 9.3 11.4 7.0 9.1 12.2 11.9 9.3 10.0
Net Length Spiffy Improvement 7.0 24.7% 9.3 18.8% 6.0 14.3% 7.1 22.0% 10.3 15.6% 9.5 20.2% 6.6 29.0% 8.0 20.0%
Table 2. A comparison of Mondrian against Spiffy.
Spiffy consistently produces better net-lengths in a significantly shorter time. Its higher congestion is due to the absence of a post-processing algorithm used in Mondrian. Such post-processing will be incorporated with the addition of our detailed router.
Once a placement and global routing has been produced, our system uses Alexander’s graph-based detailed router [4] to complete the physical design. It is our plan to replace this tool with a router that will successively refine the coarse route to a detailed route. Such a tool will better integrate the routing information produced by Spiffy into a final solution.
4
that of Mondrian. Table 2 compares the statistics of the detailed routings resulting from each tool (using , for our partition size).
tions, eventually resulting in partitions with only one logic block. At this point the algorithm need only connect any virtual nodes to pins on the logic-block. This interconnect is done using a integer-programming scheme devised by Ganley [7].
5 Mapping to 3D-FPGAs The true importance of Spiffy is its ability to map designs to 3D-FPGAs, thus allowing us to determine whether this new architecture actually does bring about improvements over the standard 2D-FPGA. We mapped each benchmark to 3D architectures ranging from one level (i.e. a 2D-FPGA) to four levels.
Comparison of Spiffy to Mondrian
Our first test of Spiffy was against Mondrian, which currently produces the best published results in the literature. By treating a 2D-FPGA as a single layer 3D-FPGA, we are able to make a meaningful comparison. The tool was tested in selected benchmarks for the Xilinx 3000 and 4000 series FPGAs [5, 8, 12], and compared against the Mondrian results [7]. Table 1 compares the run-time of Spiffy against
Due to the complexity of calculating the Steiner trees, Spiffy maintains a list of all possible such trees for the partition graph. As this data structure renders it computationally infeasible to use a partition larger than , we tried both and . (For the single-level architecture,
3
Congestion Avg. Net Length Avg. Run Time
Two Levels 2x2x2 3x2x2 3.97% 3.30% 3.27% 3.02% 23.8% -83.3%
Three Levels 2x2x2 3x2x2 6.09% 9.06% 6.95% 9.34% 23.6% -105%
Four Levels 2x2x2 3x2x2 11.9% 8.31% 10.7% 10.9% 20.0% -90.1%
Table 3. The average percentage gains of the 3D architectures of the 2D architecture using different partition sizes.
we used a partition.) In Table 3, we see the percentage gain of each architecture over the 2D architecture.
We can conclude a number of things from these tables. First, we gain nothing from the use of the partition. The results are at best marginally better than the results from using the partition, and the run-time of Spiffy increases dramatically.
[2]
[3]
The results do improve as the number of layers increase. In terms of run-time, adding the second layer greatly speeds the algorithm, though this gain seems to decrease as the number of layers increase. Assuming this trend continues, it will be interesting to determine the point at which the runtime gain becomes insignificant. Furthermore, we need to investigate whether improvements resulting from the addition of another layer suffer from diminishing returns. The addition of the third layer results in much larger improvements in average net length then does the addition of the the fourth layer, though the reverse seems to be true with congestion.
6
[4]
[5]
[6]
Conclusion [7]
Spiffy is the first serious tool to produce either placements or global routes for the 3D-FPGA architecture, and our results show that it does its job well. There is still room for improvement, as evidenced by the the congestion present in its mappings.
[8]
The 3D architecture does have potential. Even with our first design tool, we have already produced mappings better than 2D-FPGA mappings. However, there are still a number of questions that need to be investigated.
[9] [10]
One avenue of investigation we are currently pursuing is the possibility of eliminating some of the vertical connections. In practice it seems each route uses very few such connection, hence it may be worth experimenting with the elimination of some of these connections.
[11]
References
[12]
[1] Alexander, M.J., Cohoon, J.P., Colflesh, J.L., Karro, J.,
4
Peters, E.L., and Robins, G. Physical layout for threedimensional fpgas. Fifth ACM/SIGDA Physical Design Workshop, pages 142–149, April 1996. Alexander, M.J., Cohoon, J.P., Colflesh, J.L., Karro, J., and Robins, G. Three-dimensional field-programmable gate arrays. IEEE ASIC Conference, pages 253–256, September 1995. Alexander, M.J., Cohoon, J.P., Ganley, J.L., and Robins, G. Performance-oriented placement and routing for fieldprogrammable gate arrays. Proc. European Design Automation Conference, pages 80–85, 1995. Alexander, M.J. and Robins, G. A new approach to fpga routing basses on multi-weight graphs. Proc. ACM/SIGDA Intl. Workshop on Field-Programmable Gate Arrays, February 1994. Brown, S.D., Rose, J.S., and Vranesic, Z.G. A detailed router for field-programmable gate arrays. Proc. International Conference on Computer-Aided Design, pages 382– 385, 1990. Cahill, C., Compagno, A., O’Donovan, J., Slattery, O., Mathuna, S.C., Barrett, J., Sethelon, I., Val, C., Tigneres, J.P., Stern, J., Ivey, P., Masgrangeas, M., and Coello-Vera, A. Thermal characterization of vertical multichip modules mcm-v. IEEE Transactions on Components, Packaging, and Manufactoring Technology, 18(4):765–771, December 1995. Ganley, J. and Cohoon, J. Improved computation of optimal rectilinear steiner tree minimal trees. International Journal of Computational Geometry and Applications, 7:457–472, 1997. Lemieux, G. and Brown, S. A detailed routing algorithm for allocating wire segments in field-programmable gate arrays. Proc. Fourth Physical Design Workshop, pages 215– 226, 1993. Lesser, M., Meleis, W., and Vai, M. Rothko: 3-Dimensional FPGA’s. http://www.ece.neu.edu/research/rothko/. Reber, M. and Kirsch, A. Testability controlled physical design of vertically stacked integrated circuits. Proc. Eigth Annual IEEE Internaional ASIC Conference and Exhibit, pages 249–252, 1995. Seed, L., Meacham, R., Zawads, A., and Thorne, P. TriMorph: 3D Computational Structures. http://www.shef.ac .uk/uni/academic/D-H/eee/esg/research/trimorph.html. Wu, Y. and Marek-Sadowaka, M. An efficient router for 2-d field programmable gate arrays. Proc. European Design and Test Conference, pages 412–416, 1994.