Signal Scheduling Driven Circuit Partitioning for Multiple FPGAs with Time-multiplexed Interconnection

Young-Su Kwon ([email protected]), Woo-Seung Yang ([email protected]), Chong-Min Kyung ([email protected])
VLSI Systems Lab., KAIST, Taejon, Republic of Korea

Abstract

An FPGA-based logic emulator with a large gate capacity generally comprises a large number of FPGAs. However, the gate utilization of the FPGAs and the speed of emulation are limited by the number of signal pins among FPGAs and by the interconnection architecture of the logic emulator. Time-multiplexing of the interconnection wires is required for a multi-FPGA system incorporating several state-of-the-art FPGAs. This paper proposes a circuit partitioning algorithm called SCATOMi (SCheduling driven Algorithm for TOMi) for a multi-FPGA system incorporating four to eight FPGAs interconnected through TOMi (Time-multiplexed, Off-chip, Multicasting interconnection). SCATOMi improves the performance of the TOMi architecture by limiting the number of inter-FPGA signal transfers on the critical path and by considering the scheduling of inter-FPGA signal transfers. The performance of the partitioning result of SCATOMi is 5.5 times faster than that of traditional partitioning algorithms. Experiments on architecture comparison show that, by adopting the proposed TOMi interconnection architecture along with SCATOMi, the pin count is reduced to 15.2%-81.3% while the critical path delay is reduced to 46.1%-67.6% compared to traditional architectures including the mesh, crossbar and VirtualWire architectures.

1 Introduction

In designing a system-on-chip (SoC), functional verification represents over 48% of the whole design process [1]. Simulation accelerators and logic emulators, which are usually based on FPGAs, are widely used for the functional verification of complex logic designs at speeds 100 to 1000 times faster than software simulation. An essential limitation of the FPGA comes from the fact that its gate capacity lags behind that of contemporary application-specific integrated circuits (ASICs) by a factor of five to ten [1][10]. The multi-FPGA logic emulation system is, therefore, the only viable solution for providing gate-level emulation with a large gate capacity, combining multiple FPGAs through hard-wired interconnections. A multi-FPGA system is characterized by its interconnection architecture, target FPGA and logic partitioning software. The mesh [8] or crossbar architecture [15] and their variations have been adopted for interconnecting several FPGAs in traditional multi-FPGA architectures. A problem with these traditional interconnection architectures is that the logic mapping into the FPGAs is easily pin-limited; only a fraction of the available FPGA gates can be utilized, and the circuit partitioning algorithm sometimes fails to route inter-FPGA nets due to the shortage of IO pins. This is because the number of pins per FPGA has not increased as fast as its gate count: for Xilinx FPGAs, the average number of IO pins has increased only by 14 times over the last five years, while the number of CLB gates has increased by 600 times during the same period. The time-multiplexed interconnect architecture VirtualWire [2][14] was proposed to overcome the pin limitation problem by using time-multiplexed physical wires. Time-multiplexing was also implemented in commercial emulators [6], where a physical wire transfers a multitude of logical signals between FPGAs. The time-multiplexing of inter-FPGA wires is inevitable in a multi-FPGA system comprising four to eight state-of-the-art FPGAs, where the number of IO pins is small while the available CLB gate count is very large (6M ASIC gates). The issues in a time-multiplexed interconnection architecture are as follows. The first is to preserve the precedence relations among inter-FPGA signals for correct operation.
If inter-FPGA signal A is an input to FPGA 0, inter-FPGA signal B is an output of FPGA 0, and the value of A influences the value of B, then A must be transferred before B. The phase routing algorithm proposed in [2] schedules the sequence
of inter-FPGA signal transfers. The second issue is the performance degradation caused by the time-multiplexing of physical wires. Traditional partitioning algorithms have mostly concentrated on minimizing the number of cuts (inter-FPGA nets) between logic partitions by utilizing the locality of logic designs [4][5]. Performance-driven partitioning algorithms concentrate on reducing the number of inter-FPGA nets on the critical path because the delay of off-chip signal transfer between FPGAs dominates the critical path delay [7][12][13]. Fang and Wu [7] assigned higher weights to nets on the critical path and used functional clustering to improve the performance. Sawkar and Thomas [13] implemented a circuit clustering and set-covering-based partitioning algorithm to minimize the number of off-chip interconnections on the critical path. In this paper, we propose a novel circuit partitioning algorithm called SCATOMi (SCheduling driven Algorithm for TOMi) for an interconnection architecture called TOMi (Time-multiplexed, Off-chip, Multicasting interconnection), which uses time-multiplexed interconnection wires between FPGAs and exploits multicasting of signals to solve the pin limitation problem. The proposed circuit partitioning algorithm improves the performance by removing backward-directed nets and by considering the scheduling result of the time-multiplexed inter-FPGA signals. This paper is organized as follows. In Section 2, the TOMi architecture is presented. SCATOMi is described in Section 3. Experimental results are shown in Section 4.
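The precedence constraint described above is, in essence, a topological-ordering requirement on the inter-FPGA nets. As a minimal illustration (this is not the phase routing algorithm of [2]; the net names and the helper are ours), a dependency-respecting transfer order can be derived with Kahn's algorithm:

```python
from collections import defaultdict, deque

def transfer_order(deps):
    """Return an inter-FPGA signal transfer order that respects
    precedence: every net is ordered after all nets whose values
    influence it. deps maps each net to the set of nets it depends on."""
    indeg = defaultdict(int)
    dependents = defaultdict(list)
    nets = set(deps)
    for net, preds in deps.items():
        nets.update(preds)
        for p in preds:
            indeg[net] += 1
            dependents[p].append(net)
    ready = deque(sorted(n for n in nets if indeg[n] == 0))
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for d in dependents[n]:
            indeg[d] -= 1
            if indeg[d] == 0:
                ready.append(d)
    if len(order) != len(nets):
        raise ValueError("cyclic dependency: no valid transfer order")
    return order

# Net B (an output of FPGA 0) depends on net A (an input of FPGA 0):
print(transfer_order({"B": {"A"}}))  # → ['A', 'B']
```

A real scheduler must additionally pack these ordered transfers onto a limited number of physical wires, which is what the time-multiplexed architectures discussed below do.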
Figure 1. A block diagram of a system comprising four FPGAs interconnected using TOMi.
2 Multi-FPGA Architecture with TOMi

A multi-FPGA system based on TOMi consists of several FPGAs connected via shared inter-FPGA interconnections, as shown in Fig. 1. The interconnection among FPGAs consists of wires for the "emulation clock", "µclk" and TOMi. µclk, which is of higher frequency than the emulation clock, controls the micro-operations for signal transfer between consecutive edges of the emulation clock. TOMi is composed of time-multiplexed interconnection wires that transfer logic signals from one FPGA to another according to µclk. Each bit line of TOMi, shared by all FPGAs, transfers a logic signal driven by one of the FPGAs in one µclk cycle. It is a bidirectional wire where each signal is driven by a single source and transferred to multiple destination FPGAs. The TOMi architecture always guarantees, by adding µclk cycles as necessary, to route all inter-FPGA signals; multi-terminal inter-FPGA nets can also be easily routed. The TOMi architecture is, however, not suitable for interconnecting many FPGAs because of the capacitive loading of the multicasting wire.

Fig. 2 shows an enlarged view of two FPGAs comprising the TOMi architecture and the operation of TOMi. Each circle indicates an instance of combinational logic, and arrows indicate nets between logics. The direction of a net indicates the dependency relation; for example, N3 cannot be transferred until N0 and N1 are transferred. The signal transfer scheduling is also shown in the figure. We have assumed that an inter-FPGA signal transfer takes two µclk cycles and the evaluation of the combinational logics in one FPGA takes one µclk cycle. The bit width (number of bit lines) of TOMi is assumed to be two for simplicity. After N0 and N1 are transferred during the time interval δt = (t + 2, t + 4), the combinational logics in FPGA 0 are evaluated during (t + 4, t + 5), after which N3 is transferred because N3 is influenced by N0 and N1.

Figure 2. An operation of the TOMi architecture. The enlarged view of two FPGAs comprising the TOMi architecture has seven inter-FPGA nets whose signals are transferred according to µclk.
3 SCATOMi Circuit Partitioning Algorithm for TOMi Architecture

The terminologies used in SCATOMi are shown in Fig. 3. The FPGA index is the sequential number, starting from 0, assigned to each FPGA. Forward direction is the direction of signals flowing from an FPGA with a lower index to one with a larger index; backward direction is the opposite net direction. Types of nets are categorized as follows. An inter-FPGA forward net (abbreviated as F-net) is an inter-FPGA net that carries a signal flowing in the forward direction but is not the output of a storage element such as a flip-flop. An inter-FPGA backward net (abbreviated as B-net) is an inter-FPGA net in which some port pair of the net carries a signal flowing in the backward direction but is not the output of a storage element such as a flip-flop. An internal net (I-net) is a net whose connected logic instances are all located in the same FPGA. Finally, a storage net (S-net) is one of the following: a primary input, a primary output, or the output of a storage element such as a flip-flop. An S-net is never considered an F-net or a B-net, irrespective of its flow direction. I_i (influencing set of instances) of net n_i is the set of instances that are transitive fan-in instances of n_i bounded by any S-net, while D_i (depending set of instances) of n_i is the set of instances that are transitive fan-out instances of n_i bounded by any S-net (see Fig. 4).

Figure 3. Definition of terminologies used in SCATOMi. "FPGA index" (k) is the sequential number assigned to each FPGA.

Figure 4. Definition of I_i and D_i.

3.1 Initial Partitioning with B-net Removal
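Under the definitions above, a net's type follows directly from the FPGA indices of its driver and sinks. A minimal sketch (ours; single-driver nets are assumed, and the function name is illustrative):

```python
def classify_net(source_fpga, dest_fpgas, is_storage_output,
                 is_primary_io=False):
    """Classify a net per the SCATOMi terminology. source_fpga is the
    FPGA index of the driving instance; dest_fpgas are the FPGA indices
    of the sink instances."""
    if is_primary_io or is_storage_output:
        return "S-net"                 # exempt from F/B classification
    if all(d == source_fpga for d in dest_fpgas):
        return "I-net"                 # entirely inside one FPGA
    if any(d < source_fpga for d in dest_fpgas):
        return "B-net"                 # some port pair flows backward
    return "F-net"                     # every sink has an equal or higher index

print(classify_net(1, [0, 2], False))  # → B-net (one sink is behind the driver)
```

Note that a single backward port pair suffices to make the whole multi-terminal net a B-net, which is why B-net removal (next subsection) may need to move whole instance groups rather than single sinks.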
The design is first partitioned by a min-cut partitioning algorithm; in this paper, the improved FM algorithm implemented in [5] is used. After min-cut partitioning, all B-nets are removed to reduce the number of inter-FPGA nets on the critical path. By removing all B-nets, every inter-FPGA net becomes an F-net or an S-net. Therefore, in the worst case, the critical path is composed of one S-net and (n(F) - 1) F-nets, where n(F) is the number of FPGAs. The algorithm shown in Fig. 5 is responsible for removing all B-nets.

  SCATOMi_Init_and_Bnet_Removal() {
    Min_cut_partitioning();
    foreach B-net, bn_i {
      if bn_i is no longer a B-net, continue;
      k_s = partition index of bn_i's source instance;
      k_d,min = min. partition index of bn_i's destination instances;
      Search for I_i and D_i;
      Select I'_i (⊂ I_i) with FPGA index > k_d,min;
      Select D'_i (⊂ D_i) with FPGA index < k_s;
      // Choose the option which generates
      // the more balanced partitioning result.
      if BLD after movement of I'_i is larger than BLD after movement of D'_i,
        then move I'_i to partition k_d,min;  // I-option is taken. (See Fig. 6)
      else move D'_i to partition k_s;        // D-option is taken.
    }
  }
Figure 5. Pseudo code of the B-net removal algorithm. All B-nets are removed by moving I'_i to partition k_d,min or D'_i to partition k_s.

For each B-net, denoted as bn_i, the algorithm first checks whether it is still a B-net, because bn_i may have been changed to an F-net or I-net during the B-net removal process. Each bn_i has exactly one source instance and one or more destination instances. k_s is the partition index of the source instance, and k_d,min is the smallest index among those of the partitions covering all destination instances. There are two options to change the net type of bn_i from B-net to either F-net or I-net. The first option (I-option) is to move the source instance of bn_i to the partition with index k_d,min. In this case, a subset of I_i denoted as I'_i is defined such that I'_i includes all the instances located in partitions with index larger than k_d,min. All the instances belonging to I'_i are then moved to partition k_d,min to eliminate bn_i, as shown in Fig. 6. In Fig. 6, the source instance of bn_i is C2 and the destination instances are C3, C4 and C6. k_d,min is 0 because the smallest partition index among the partitions covering C3, C4 and C6 is 0. I_i consists of C0, C1 and C2, while I'_i consists of C1 and C2, as their partition indices are larger than k_d,min (= 0). bn_i becomes an F-net by moving I'_i to FPGA 0.

Figure 6. The removal of bn_i by moving I'_i. I'_i consists of C2 and C1. bn_i becomes an F-net by moving I'_i to P0, while no F-net or I-net becomes a B-net.

The second option (D-option) is to move the destination instances of bn_i to the partition with index k_s. A subset of D_i denoted as D'_i is defined such that D'_i includes all the instances located in partitions with index smaller than k_s. All the instances belonging to D'_i are then moved to partition k_s to remove the B-net. BLD is a balance measure of the partitioning result, defined as follows:

  BLD = ( Σ_k | (Total CLB count)/n(F) − n_k(CLB) | )^(−1)    (1)

where n(F) is the number of FPGAs and n_k(CLB) is the CLB count of the partition in FPGA k. If the movement of I'_i results in the larger BLD, the I-option is taken, i.e., I'_i is moved; otherwise, the D-option is taken, i.e., D'_i is moved. It can be simply proven that no F-net or I-net is changed to a B-net during the B-net removal procedure. This property of "creating no additional B-net" is very important for computing the algorithm complexity: the complexity of SCATOMi_Init_and_Bnet_Removal() is O(n), where n is the number of B-nets in the initial partitioning.
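Eq. (1) and the I-option/D-option choice of Fig. 5 can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation; the move triples and function names are ours.

```python
def bld(clb_counts):
    """Balance measure BLD of Eq. (1): the reciprocal of the summed
    deviation of each partition's CLB count from the ideal average.
    A larger BLD means a more balanced partitioning."""
    avg = sum(clb_counts) / len(clb_counts)
    dev = sum(abs(avg - n) for n in clb_counts)
    return float("inf") if dev == 0 else 1.0 / dev

def pick_option(counts, i_move, d_move):
    """Choose between I-option and D-option for one B-net by simulating
    both instance-group moves and keeping the one with the larger BLD,
    as in Fig. 5. i_move/d_move are hypothetical
    (src_partition, dst_partition, num_instances) triples."""
    def bld_after(c, move):
        src, dst, n = move
        c = list(c)
        c[src] -= n
        c[dst] += n
        return bld(c)
    bld_i = bld_after(counts, i_move)
    bld_d = bld_after(counts, d_move)
    return "I-option" if bld_i > bld_d else "D-option"

# Three partitions with 40/20/30 CLBs: moving 10 instances from
# partition 0 to 1 (the I-option here) balances them perfectly,
# whereas the D-option move makes the imbalance worse.
print(pick_option([40, 20, 30], (0, 1, 10), (1, 0, 5)))  # → I-option
```

Because each B-net is examined once and no move creates a new B-net, this per-net decision is what keeps the whole removal pass linear in the number of initial B-nets.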
3.2 Performance Optimization Considering Signal Transfer Scheduling

After the initial partitioning, the number of inter-FPGA signal transfers on the critical path is reduced by considering the signal scheduling result. The balance measure is also considered in the optimization procedure. No B-net is created in this procedure because instances are moved in units of I'_i or D'_i. The optimization algorithm is shown in Fig. 7. Initially, the dependency graph for the inter-FPGA nets is created. In the dependency graph, nodes represent inter-FPGA nets and edges represent the precedence relations between inter-FPGA nets. The priority, the distance from a node to the sync node, is computed for each node. The number of resources is W_TOMi, the number of bit lines of TOMi. The list scheduling routine, denoted as List_Scheduling(), schedules the dependency graph to determine the time interval in which each inter-FPGA signal is transferred. After scheduling, I^c_i and D^c_i for each n^c_i (inter-FPGA net on the critical path) are found, as shown in Fig. 8. In Fig. 8, the critical path after list scheduling of the inter-FPGA nets is shown as a bold line. Two inter-FPGA nets, n^c_0 and n^c_1, are on the critical path. I'^c_0 (D'^c_0) is the subset of I^c_0 (D^c_0) whose instances are in FPGA 0 (FPGA 1). In Fig. 8, I'^c_0 = {C1}, D'^c_0 = {C2, C3}, I'^c_1 = {C2, C3, C4} and D'^c_1 = {C5, C7, C8}.

  SCATOMi_Optimization() {
    forever {
      List_Scheduling();
      Search for the inter-FPGA nets n^c_i on the critical path;
      Search for I^c_i and D^c_i for each n^c_i;
      Select I'^c_i and D'^c_i whose instances are in the FPGA adjacent to n^c_i;
      Select the I'^c_a or D'^c_a with max. gain G;
      // I'^c_0 and D'^c_{n(F)-1} are excluded from the candidates.
      // Gain G = BLD*/T*_c.
      if performance is not improved, break;
      else {
        Move I'^c_a to FPGA a+1;
        (or move D'^c_a to FPGA a-1;)
      }
    }
  }

Figure 7. Pseudo code of the optimization algorithm. Subsets of I^c or D^c are moved to optimize the performance.

Figure 8. I^c and D^c; the critical path after list scheduling is shown as a bold line.

If D'^c_0 is moved to FPGA 2, the number of inter-FPGA signal transfers on the critical path is decreased by one. Similarly, I'^c_1 can be moved to FPGA 0 to reduce the number of inter-FPGA signal transfers on the critical path. No additional B-net is created when D'^c_0 or I'^c_1 is moved.
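The List_Scheduling() step can be sketched as a standard resource-constrained list scheduler in which the priority of a net is its longest-path distance to a sink of the dependency graph and at most W_TOMi nets start per transfer slot. The sketch below is ours and, for brevity, ignores the one-cycle evaluation latency between dependent transfers.

```python
def list_schedule(deps, width, t_xfer=2):
    """Priority list scheduling of the inter-FPGA net dependency graph
    on `width` TOMi bit lines. deps maps each net to the set of nets it
    depends on. Returns {net: start_cycle}."""
    succs = {n: [] for n in deps}
    for n, preds in deps.items():
        for p in preds:
            succs.setdefault(p, []).append(n)
    nets = set(succs)
    prio = {}
    def dist(n):                # memoized longest-path distance to a sink
        if n not in prio:
            prio[n] = 1 + max((dist(s) for s in succs.get(n, [])), default=0)
        return prio[n]
    for n in nets:
        dist(n)
    start, finished, t = {}, {}, 0
    while len(start) < len(nets):
        ready = [n for n in nets if n not in start
                 and all(finished.get(p, t + 1) <= t
                         for p in deps.get(n, ()))]
        if not ready:
            raise ValueError("cyclic dependency among nets")
        # the most critical ready nets grab the available bit lines
        for n in sorted(ready, key=lambda n: (-prio[n], n))[:width]:
            start[n] = t
            finished[n] = t + t_xfer
        t += t_xfer
    return start

# One bit line: the chain n0 -> n1 is more critical than n2, so n0
# transfers first even though n2 is also ready at cycle 0:
print(list_schedule({"n0": set(), "n1": {"n0"}, "n2": set()}, width=1))
# → {'n0': 0, 'n1': 2, 'n2': 4}
```

With two bit lines, n0 and n2 would transfer in parallel, shortening the schedule by one slot, which is exactly the kind of change the gain function T*_c below has to account for after each candidate move.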
The property of "creating no additional B-net" can also be easily proven for this procedure. The optimization algorithm selects the one of the I'^c's and D'^c's with the maximum gain. The gain function G is defined as G = BLD*/T*_c, where BLD* is the balance measure after the movement and T*_c is the required number of µclk cycles after the movement. When I'^c_i or D'^c_i is moved, only the nets adjacent to that instance group change their types; the list scheduling after the movement can therefore be done by considering only the type changes of the adjacent nets. After selecting the I'^c_a or D'^c_a which maximizes the gain, the selected I'^c_a is moved to FPGA a+1 or D'^c_a is moved to FPGA a-1. The algorithm repeats the above procedure until no more performance improvement is possible.

4 Experiments

SCATOMi, described in the previous section, was implemented in C. SCATOMi accepts an EDIF netlist and generates partitioned EDIF designs. The Partitioning93 XNF (Xilinx Netlist Format) netlists were selected for the experiment, and the EDIF netlists for the Partitioning93 circuits were generated from the XNF files. The merit of the proposed partitioning algorithm is shown by comparing the number of inter-FPGA nets and the performance between the proposed partitioning algorithm and other partitioning algorithms: we applied several traditional partitioning algorithms to partition the benchmark circuits on the TOMi architecture. The number of µclk cycles for an inter-FPGA net transfer is two and that for one FPGA evaluation is one. The bit width of TOMi is W_TOMi = 256. The TOMi architecture comprises four Virtex2 FPGAs. DDR (Double Data Rate) registers in the Virtex2 FPGAs are used to transfer inter-FPGA nets at high speed, where the frequency of µclk is 133MHz (7.5ns period). We compared the results of SCATOMi against those of multi-level KL [9], PFM [5] and hMetis [11]. The numbers of inter-FPGA nets and the performance in the number of µclk cycles per emulation clock cycle (denoted as Te) when the circuits are partitioned into four FPGAs are shown in Table 1. All the algorithms in Table 1 use the TOMi architecture as the implementation platform. The numbers of F-nets, B-nets and inter-FPGA S-nets are shown in the first column of each algorithm, and the number of required µclk cycles per emulation clock is shown in the second column. The traditional algorithms such as multi-level KL, PFM and hMetis do not consider the scheduling result of the time-multiplexed architecture and, therefore, generate many B-nets. The µclk count per emulation clock cycle of SCATOMi shows a significant reduction compared to the others, although its number of cut nets is larger than that of the other algorithms. In SCATOMi, all B-nets are completely removed, while there are many B-nets in the results of the traditional partitioning algorithms. The removal of B-nets reduces the dependencies between partitions, and the scheduling-driven optimization greatly improves the performance: the operation speed of SCATOMi is 5.5 times faster than that of the other algorithms on the average.

Table 1. A comparison of 4-way partitioning results in terms of the distribution of inter-FPGA nets (# F-nets/# B-nets/# inter-FPGA S-nets), denoted as "cut nets", and the number of µclk cycles per emulation clock period (Te), denoted as "#µ/Te", for various partitioning algorithms including the proposed algorithm, SCATOMi.

Circuit  #nets  #inst.  Multi-KL              PFM                 hMetis             SCATOMi
                        cut nets       #µ/Te  cut nets     #µ/Te  cut nets    #µ/Te  cut nets   #µ/Te
s5378    3224   3082    207/254/42     24     109/113/44   28     44/65/0     17     142/0/78   7
s9234    6097   5890    441/411/83     37     197/105/3    37     100/36/10   11     272/0/63   7
s13207   9444   8808    510/918/295    70     123/173/60   40     108/38/3    34     311/0/112  11
s15850   11070  10489   966/516/367    82     144/236/83   50     84/49/15    27     148/0/202  13
s35932   19879  18818   1554/1631/272  149    441/344/587  88     32/106/35   8      213/0/86   11
s38417   25588  23982   2151/1391/858  147    274/322/400  69     54/120/16   11     123/0/62   7
sum                                    509                 312                108               56
ratio                                  9.1                 5.6                1.9               1.0

The results of implementing the benchmark circuits into four FPGAs on several architectures, namely the NTT mesh [8], the BORG partial crossbar [3], VirtualWire (denoted as VWire) [2] and TOMi, are shown in Table 2 and Table 3. For the NTT mesh (denoted as NTT), the BORG partial crossbar (denoted as BORG) and the VirtualWire interconnection architectures, the PFM partitioning algorithm is used, while SCATOMi is used for the TOMi architecture. Each architecture incorporates four FPGAs, and 4-way partitioning is done for each benchmark circuit. The number of required pins for TOMi is about 22.0% of NTT, 15.2% of BORG and 81.3% of VirtualWire; the pin-count ratio normalized to 1.0 for the TOMi architecture shows the advantage of TOMi.

Table 2. The required number of pins for several architectures to implement the benchmark circuits.

Circuit  NTT    BORG   VWire  TOMi
s5378    1242   1824   776    676
s9234    1880   2690   868    848
s13207   2536   3748   1174   1024
s15850   3132   4556   1394   1024
s35932   9488   13932  906    1024
s38417   7248   10340  1836   1024
sum      25526  37090  6954   5620
ratio    4.54   6.60   1.23   1.0
In Table 3, the critical path delays of the benchmark circuits for the several interconnection architectures are shown. In the TOMi architecture, the delay of a wire connected to four FPGAs is 7.5ns, the same as the period of µclk; in NTT, BORG and VWire, the delay of an inter-FPGA wire connecting two FPGAs is 3.75ns (= 0.5 µclk period). Although the time-multiplexing feature of TOMi degrades the performance and multicasting increases the capacitive load, the critical path delay of TOMi is 67.6% of that of the NTT mesh, 49.8% of that of the BORG partial crossbar and 46.1% of that of the VirtualWire architecture.

Table 3. The critical path delay of the benchmark circuits for several interconnection architectures with the PFM algorithm and the TOMi architecture with the SCATOMi algorithm. The unit of measurement is one µclk period, 7.5ns.

Circuit  NTT    BORG    VWire   TOMi
s5378    4.92   11.93   11.50   7.00
s9234    14.16  17.97   20.50   7.00
s13207   18.16  25.34   23.00   11.00
s15850   22.28  28.19   31.00   13.00
s35932   9.95   11.64   14.00   11.00
s38417   13.31  18.02   22.00   7.00
sum      82.78  113.09  122.00  56.00
ratio    1.48   2.01    2.17    1.0
5 Conclusion

In this paper, we presented a novel multi-FPGA partitioning algorithm called SCATOMi for the TOMi architecture with its time-multiplexed, off-chip, multicasting interconnection wires. SCATOMi partitions the circuit to minimize the number of off-chip signal transfers and considers the scheduling result of the inter-FPGA signals to improve the performance. The TOMi architecture with SCATOMi is suitable for a multi-FPGA system with a small number of FPGAs because of the capacitive load of the shared wires. Its salient features include freedom from the pin limitation problem, no routing failures and expansibility. The TOMi architecture with SCATOMi shows a significant performance improvement for the benchmark circuits while requiring a small number of pins.
References

[1] International technology roadmap for semiconductors. Technical report, Semiconductor Industry Association, 2001.
[2] J. Babb, R. Tessier, M. Dahl, S. Hanono, D. Hoki, and A. Agarwal. Logic emulation with virtual wires. IEEE Transactions on CAD, 16(6):609-626, Jun. 1997.
[3] P. Chan and M. Schlag. Architectural tradeoffs in field programmable device based computing systems. In IEEE Symposium on FPGAs for Custom Computing Machines, pages 152-161, 1993.
[4] C.-S. Chen, T. Hwang, and C. Liu. Architecture driven circuit partitioning. IEEE Transactions on VLSI Systems, 9(2):383-389, Apr. 2001.
[5] A. Dasdan and C. Aykanat. Two novel multiway circuit partitioning algorithms using relaxed locking. IEEE Transactions on CAD, 16(2):169-178, Feb. 1997.
[6] S. S. et al. Emulation system with time-multiplexed interconnect. US patent 5,960,191, US Patent and Trademark Office, Sep. 28, 1999.
[7] W.-J. Fang and A.-H. Wu. Performance-driven multi-FPGA partitioning using functional clustering and replication. In IEEE/ACM Design Automation Conference, pages 283-286, 1998.
[8] S. Hauck, G. Borriello, and C. Ebeling. Mesh routing topologies for multi-FPGA systems. IEEE Transactions on VLSI Systems, 6(3):400-408, Sep. 1998.
[9] B. Hendrickson and R. Leland. A multilevel algorithm for partitioning graphs. Technical report SAND93-1301, Sandia National Laboratories, 1993.
[10] Xilinx Inc. [online]. http://www.xilinx.com.
[11] G. Karypis, R. Aggarwal, V. Kumar, and S. Shekhar. Multilevel hypergraph partitioning: applications in VLSI domain. IEEE Transactions on CAD, 16(9):956-964, Sep. 1997.
[12] C. Kim and H. Shin. A performance-driven logic emulation system: FPGA network design and performance-driven partitioning. IEEE Transactions on CAD, 15(5):560-568, May 1996.
[13] P. Sawkar and D. Thomas. Multi-way partitioning for minimum delay for look-up table based FPGAs. In IEEE/ACM Design Automation Conference, pages 647-652, 2001.
[14] H.-P. Su and Y.-L. Lin. A phase assignment method for virtual-wire-based hardware emulation. IEEE Transactions on CAD, 16(7):776-782, Jul. 1997.
[15] J. Varghese, M. Butts, and J. Batcheller. An efficient logic emulation system. IEEE Transactions on VLSI Systems, 1(2):171-174, Jun. 1993.