sFPGA - A SCALABLE SWITCH BASED FPGA ARCHITECTURE AND DESIGN METHODOLOGY

Shakith Fernando, Xiaolei Chen and Yajun Ha
Department of Electrical and Computer Engineering, National University of Singapore
Email: {shakith,g0600118,elehy}@nus.edu.sg

This work was supported by the Science and Engineering Research Council (SERC), under A*STAR Grant P0521010098 and FOE/NUS Incentive Grant R397-000-035-731.

ABSTRACT

The poor scalability of current mesh-based FPGA interconnection networks is impeding our attempts to build next-generation FPGAs of larger logic capacity. A few alternative interconnection network architectures have been proposed for future FPGAs, but they still have several design challenges that need to be addressed. In this paper, we propose sFPGA, a scalable FPGA architecture, which is a hybrid between hierarchical interconnection and Network-on-Chip. The logic resources in sFPGA are organized into an array of logic tiles. The tiles are connected by a hierarchical network of switches, which route data packets over the network. In addition, we propose a design flow for sFPGA that integrates with current design flows seamlessly. Through a case study on our emulation prototype, we have validated the sFPGA design flow.

1. INTRODUCTION

Modern breakthroughs in semiconductor processes have enabled us to build larger and larger FPGAs. However, this poses a big challenge in designing scalable interconnects to route between FPGA logic blocks. In current FPGAs, the interconnection network dominates the logic blocks as the main contributor to area, delay and power consumption [1]. For example, the routing network consumes up to 90% of the silicon area of an FPGA. The interconnection network in most commercial FPGAs today is Manhattan mesh based. Mesh-based interconnection networks have several desirable properties: they are regular, easy to design and lay out, and friendly to CAD tools. Unfortunately, these mesh-based interconnects do not scale linearly with the logic block array, and thus have poor scalability.

There have been several studies on how to improve the FPGA interconnect's scalability.


In the literature, the proposed alternatives to mesh-based interconnection networks include the Tree-of-Meshes (ToM) architecture [2], the Mesh-of-Trees (MoT) architecture [3] and 3D mesh architectures [4, 5]. Although these new architectures can, in theory, provide good scalability, constraints imposed by the semiconductor manufacturing process make them not very practical. Hence, it is meaningful to look beyond these architectures for alternative FPGA interconnect architectures.

Inspired by the success of computer networks (e.g., the Internet and LANs), the idea of Network-on-Chip (NoC) has been introduced as an alternative to global on-chip wiring. NoC is envisioned to offer better scalability and predictability, and to be more fault-tolerant [6]. In a NoC, logic resources are organized into tiles, which communicate with one another over a network. The on-chip networks can be direct or indirect, and employ packet-based communication. The RAW architecture [7] is an example implementation of a direct network, where each tile has the granularity of a processor. Another similar design is Intel's 80-tile TeraFLOPS chip [8]. However, it is usually very hard to program a multiprocessor system like the TeraFLOPS processor. While the RAW architecture takes a leap forward in mitigating the mismatch between software parallelism and hardware parallelism, it is still coarse-grained and requires the support of an intelligent compiler.

In this paper, we present a scalable FPGA (sFPGA) architecture. It consists of an array of FPGA tiles, where each tile corresponds to the granularity of a current FPGA. These tiles are connected by a packet switched network, which is hierarchical and cluster based. Adding more tiles to the architecture only requires adding additional switches, making sFPGA scalable in a way similar to the current Internet architecture. We also present a design flow for sFPGA that is compatible with current FPGA design flows. Our solution is unique in three ways: (1) the architecture keeps the growing FPGA interconnect scalable; (2) the granularity of a tile in our sFPGA is a fine-grain logic block array (e.g., a current FPGA device), compared to the coarse-grain blocks (e.g., ALUs or processors) used in other research; and (3) the complexity of the design flow is reduced by reusing available FPGA design flows for technology mapping, placement, routing and bitstream generation.


We have emulated this architecture using several Xilinx University Program Virtex-II Pro boards (XUPV2P). We have completed a case study through every stage of the sFPGA design flow to show the feasibility of mapping an application to our sFPGA architecture.

This paper is organized as follows. Section 2 reviews the related work. Section 3 introduces the new scalable FPGA architecture. Section 4 introduces its design methodology using a case study, before concluding in Section 5.

2. RELATED WORK

We look at the problem of wiring requirements before looking at the related work on routing architectures for FPGAs. Rent's rule [9] gives an empirical formula to estimate the wiring requirements of a circuit:

    Q = c N^p,    (1)

where N is the number of gates in the circuit, Q is the total number of input and output signals, and p and c are characteristic parameters of the circuit. It is observed that the typical value of p is between 0.5 and 0.75. If we want to implement a design with parameters (c, p) on an architecture A, the available wiring Q_A that A can provide must meet the following constraint:

    Q ≤ Q_A.    (2)

In a mesh-based FPGA with a uniform channel width W_c in both the horizontal and vertical directions, it can be derived that Q_A = W_c N^(1/2) [10]. To ensure that Q_A is no less than Q, we need an ever larger W_c as N increases. As a result, more area is consumed by global wiring. In fact, DeHon points out that the wiring area forces the chip area to grow faster than linearly with N for p > 1/2 [2]. While this problem can be alleviated by the availability of more metal layers, we also have to take the switches into consideration. The total number of switches required per logic block is linear in W_c [3], and switches consume area on the substrate. Hence, the logic density of a mesh-based FPGA will decrease as N increases. Betz et al. [11] suggest that segmentation can help reduce area and improve timing performance, and that wires spanning 4-8 logic blocks are the most important to mesh-based FPGAs.

Identifying that it is the connection box "which ultimately impedes scalability", DeHon in [3] proposes the use of the MoT architecture for the FPGA interconnect. In MoT interconnection networks, the number of connections between a logic block and the outside is bounded by a constant independent of the design size. As a result, MoT uses fewer switches. More significantly, given enough wiring layers, MoT interconnection networks require only O(N) area. A complementary interconnect architecture for FPGAs, ToM, is analyzed in [2].
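As a rough numerical illustration of equations (1) and (2), the sketch below computes the channel width W_c a mesh would need as N grows; the Rent parameters used are assumed values for illustration, not figures from the paper.

    # Required channel width for a mesh to satisfy Rent's rule:
    # W_c * N^(1/2) >= c * N^p  =>  W_c >= c * N^(p - 1/2).
    # The values of c and p below are assumptions for illustration only.
    def required_channel_width(n_blocks, c=2.5, p=0.65):
        return c * n_blocks ** (p - 0.5)

    for n in (1_000, 10_000, 100_000, 1_000_000):
        print(f"N = {n:>9,d}  ->  W_c >= {required_channel_width(n):6.1f}")

W_c keeps growing with N whenever p > 1/2, which is exactly the scalability problem described above.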

Early works on FPGA interconnection networks, such as [12], [13] and [14], can be categorized as ToM architectures. A desirable property of ToM is that it allows us to do placement by recursive bisection only. Also, DeHon builds a mapping from ToM to MoT, which means that ToM can also be laid out in O(N) area, given enough metal layers [2]. However, we note that there are a few problems with ToM and MoT waiting to be resolved. First, it is not clear whether there will be enough metal layers to be exploited. Second, no data is given about the timing performance of ToM and MoT. So, it is still meaningful to look beyond ToM and MoT for alternative FPGA interconnection network architectures.

3D integration is believed to be able to reduce wiring length and increase logic density. 3D FPGAs are studied in [4] and [5]. The CLBs are organized into a 3D mesh array, and each 3D switch connects to the nearest six switches. Although analytical results based on predictive models show that 3D integration can achieve great improvements in logic density (as much as 20%-40%) and interconnect delay (as much as 45%-60%) [4], the fabrication of 3D ICs still poses a huge challenge. In [10], the concept of an extra-dimensional FPGA, in which logical third and fourth dimensions are mapped onto a standard two-dimensional IC, is introduced.

To attack the interconnect delay problem in high performance designs, on-chip packet-switched interconnection networks have been proposed to replace global and shared buses. The RAW microprocessor [7], an implementation of Network-on-Chip, consists of 16 tiles, each containing a processing core and a router. The router connects to the routers of the 4 neighboring tiles. This type of organization is termed a direct or point-to-point network in [6]. RAW provides a software interface to the gate, wire and pin resources of a chip. It accepts applications written in high-level languages (e.g., C and Java), and relies on its scalable instruction set architecture (ISA) and a compiler (e.g., RawCC for C and Fortran compilation) to implement the software circuits.

Our sFPGA architecture resembles the ToM architecture in the way both hierarchically organize tiles and switches. However, the fact that communication in the sFPGA interconnection network is packet-based differentiates sFPGA from the conventional ToM architecture. We also highlight that sFPGA is, as defined in [6], an indirect or switch-based network. So it is different from RAW and RAW-like architectures, for example, Intel's 80-tile TeraFLOPS processor [8].

3. sFPGA ARCHITECTURE

In this section, we present the sFPGA architecture, shown in Figure 1. The sFPGA contains an array of tiles (A0, A1, A2, A3, and so on). Each tile can be considered to have the granularity of a current FPGA.

Fig. 1. Scalable FPGA Architecture (FPGA tiles connected to switches by serial links)

These tiles are connected to a packet switched network. FPGA tiles A0, A1, A2, A3, ..., A7 together with switch S0 form a layer-one logical cluster. FPGA tiles B0, B1, B2, B3, ..., B7 and switch S7 form another logical cluster in layer one. These layer-one clusters are then connected to another switch, G0, which forms a layer-two logical cluster, so that a hierarchy of clusters is created. The communication links among FPGA tiles and switches are high speed serial links. The tiles at the boundary of the architecture contain the necessary IO blocks to connect with external circuitry. Two important system parameters of this architecture are the number of tiles that can be connected to one switch and the number of layers in the hierarchy. We envision that using a range of values for these parameters would create a family of sFPGAs able to meet the area requirements of different application domains such as networking, cryptography and digital signal processing.

3.1. Tile

Each tile in sFPGA has the granularity of a current FPGA. It contains two network interface primitives, for incoming and outgoing data, and a transceiver, as shown in Figure 2(a). The function of the network interface primitive is to generate packets from the internal net signals to be sent across the serial links from one tile to another. Each primitive connects to n nets on the tile, which we define as ports. Each network interface primitive contains a FIFO buffer, which holds the packets to be sent or received. The transceiver is a high speed SerDes that serializes the packets to be sent and de-serializes the packets received.

We define the addressing scheme of sFPGA. An address consists of an address id and a port id. For example, the packet address 3 2 5 : 4 refers to the fourth net in Tile G5 (refer to Figure 1), the middle bottom block of the next layer. The design flow will generate the destination address and port id for each net, to be embedded in the TX network interface primitive such that it can generate the correct packet. Similarly, the design flow will generate the destination port id for each RX network interface primitive, so that the correct net is decoded from the packet. This will be discussed further in Section 4.2.

We also define the packet structure of sFPGA. For example, for a four-layered sFPGA device, 12 bits of the 40-bit packet are reserved for the address id. The other 28 bits of the packet are used concurrently to send 7 inter-tile nets, where each net uses 3 bits for the port id and 1 bit for the data.
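The sketch below models this 40-bit packet as an integer: 12 bits of address id followed by seven 4-bit net slots (3-bit port id plus 1 data bit). The bit ordering, the per-layer address encoding and the helper names are assumptions made for illustration; the paper does not specify them.

    def pack_packet(address_id, nets):
        # address_id: 12-bit destination address; nets: up to 7 (port_id, data_bit) pairs.
        assert 0 <= address_id < (1 << 12) and len(nets) <= 7
        packet = address_id << 28                     # assumed: address in the top 12 bits
        for slot, (port_id, bit) in enumerate(nets):
            packet |= ((port_id << 1) | bit) << (4 * (6 - slot))
        return packet                                 # 40-bit value

    def unpack_packet(packet):
        address_id = (packet >> 28) & 0xFFF
        slots = [(packet >> (4 * (6 - s))) & 0xF for s in range(7)]
        return address_id, [(s >> 1, s & 0x1) for s in slots]

    # Example: address "3 2 5" (assumed 3-bit field per hierarchy level), data bit 1 on port 4.
    addr = (3 << 6) | (2 << 3) | 5
    packet = pack_packet(addr, [(4, 1)])
    print(hex(packet), unpack_packet(packet))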

Fig. 2. Tile and Switch Block Diagram: (a) Tile Architecture; (b) Switch Architecture

3.2. Switch

The switch of sFPGA is an ASIC primitive which serves to connect and forward packets between several sFPGA subnets and layers, as shown in Figure 2(b). Head-of-line blocking [15] is reduced in the switch architecture by maintaining separate FIFO buffers for each input-output port pair. The key difference between our sFPGA switch and an Internet network switch is that the routing path is static and created offline at design time. This avoids network congestion, as the network traffic behavior is statically determined. The routing path is generated by the design flow and programmed into the switch. The contention latency of the switch is completely deterministic, as we can compute the traffic rate at each input and output port of the switch. Another two system parameters of the sFPGA architecture are the number of ports on the switch and the speed of the switch. These, too, define a family of sFPGA devices for different application domains.
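A minimal software model of the switch just described, assuming the packet layout sketched in Section 3.1; the class and method names are illustrative only, not part of the architecture definition.

    from collections import deque

    class Switch:
        # Statically programmed routing table plus one FIFO per input-output
        # port pair, so a blocked output never stalls traffic for other outputs.
        def __init__(self, n_ports, routing_table):
            self.routing_table = routing_table        # dest address id -> output port (set offline)
            self.queues = {(i, o): deque()
                           for i in range(n_ports) for o in range(n_ports)}

        def receive(self, in_port, packet):
            dest = (packet >> 28) & 0xFFF             # 12-bit address id, as in the packet sketch
            out_port = self.routing_table[dest]
            self.queues[(in_port, out_port)].append(packet)

        def forward(self, out_port):
            # Drain at most one packet per input FIFO headed for out_port.
            for (i, o), q in self.queues.items():
                if o == out_port and q:
                    yield i, q.popleft()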


4. DESIGN METHODOLOGY FOR sFPGA

In this section, we describe our design flow. It consists of three major steps. The first step is the same as the traditional FPGA design flow up to the logic optimization stage; the result is a gate-level netlist, as shown in Figure 3(a). The second step is to partition this netlist into clusters. Partitioning is constrained such that each cluster has the capacity of one tile (equivalent to one FPGA today) and signal traffic is localized in its logic cluster. After that, as shown in Figure 3(b), each cluster is mapped to one tile, and each net becomes either an intra-tile net or an inter-tile net. The final step contains two subtasks. One is to implement the tile and the intra-tile nets using the existing FPGA EDA tools for mapping, placement and routing, which we define as Tile Generation. The other task, Routing Path Generation, implements each inter-tile net as a packet-switch schedule. This new design flow stage maps inter-tile nets into packets with their respective source and destination cluster addresses, and creates a static schedule to meet the real-time deadlines of the packets. The result of this step is a list of placed and routed tiles interconnected by statically scheduled packet paths in a real-time switch network, as shown in Figure 3(c).

Fig. 3. sFPGA Design Methodology

In the following subsections, we describe the Partitioning, Tile Generation and Routing Path Generation steps of our flow, using a two-bit full adder example to show how each step works. The first step is not discussed here, as we use the conventional logic synthesis step, which has been widely studied in the literature [16].

4.1. Partitioning

Design partitioning is a critical step in the design flow. It contains two components: logic partitioning and timing partitioning. The input for logic partitioning is a gate-level netlist.

We represent this gate-level netlist as a directed acyclic graph (DAG) G(V, E, W_v, W_e), whose vertex set V is in one-to-one correspondence with the primary inputs (PIs), local functions (i.e., modules) and primary outputs (POs). The set of directed edges E represents the decomposition of multi-terminal nets into two-terminal nets [17]. As an example, the netlist of a two-bit adder is shown in Figure 4(a). The corresponding DAG, shown in Figure 4(b), contains ten vertices, each mapped to a module in the netlist, and twelve edges, each mapped to a two-terminal net.

Fig. 4. Adder Example: (a) Netlist; (b) related Directed Acyclic Graph

Then the partitioning problem can be formulated as follows [18]: find a partition V_1, V_2, ..., V_k of V such that the area of each partition is bounded,

    Σ_{v ∈ V_i} W_v ≤ A,    (3)

and the weight sum of the edges connecting two nodes in different partitions is minimized,

    Σ w_e(e), e = (r, s), P(r) ≠ P(s).    (4)

This partitioning problem can be readily solved by using Kernighan-Lin algorithm based partitioning tools, such as Capo [19].

Partitioning the timing constraints is relatively simpler. There are two possible scenarios for each input system timing constraint, depending on whether the constrained timing path contains an inter-tile net or not. If it does not contain any inter-tile net, the system timing constraint becomes a tile timing constraint, which is used in the place and route stage of the tile. If the constrained timing path does contain an inter-tile net, the timing constraint is partitioned into three components, namely the source tile timing constraint, the switching timing constraint and the destination tile timing constraint. As before, the source and destination timing constraints are used to implement the respective tiles. The switching timing constraints are used to generate the routing table information for each switch in the architecture, as discussed further in Section 4.3.
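To make the logic partitioning cost concrete, the following sketch evaluates the area bound (3) and the cut weight (4) for the adder example. The edge list, unit vertex areas and the two-cluster assignment are assumptions chosen for illustration; they are not taken from Figure 5.

    vertex_area = {v: 1 for v in [
        "A0_IBUF", "B0_IBUF", "A1_IBUF", "B1_IBUF",
        "LUT2_A", "LUT4_A", "LUT4_B",
        "Sum0_OBUF", "Sum1_OBUF", "Carry_OBUF"]}

    edges = {  # (source, sink): weight w_e -- an assumed two-terminal decomposition
        ("A0_IBUF", "LUT2_A"): 1, ("B0_IBUF", "LUT2_A"): 1,
        ("A0_IBUF", "LUT4_A"): 1, ("B0_IBUF", "LUT4_A"): 1,
        ("A1_IBUF", "LUT4_A"): 1, ("B1_IBUF", "LUT4_A"): 1,
        ("A0_IBUF", "LUT4_B"): 1, ("B0_IBUF", "LUT4_B"): 1,
        ("A1_IBUF", "LUT4_B"): 1, ("B1_IBUF", "LUT4_B"): 1,
        ("LUT2_A", "Sum0_OBUF"): 1, ("LUT4_A", "Sum1_OBUF"): 1,
        ("LUT4_B", "Carry_OBUF"): 1,
    }

    partition = {  # P(v): assumed cluster assignment
        **{v: "A" for v in ["A0_IBUF", "B0_IBUF", "A1_IBUF", "B1_IBUF",
                            "LUT2_A", "LUT4_A", "Sum0_OBUF", "Sum1_OBUF"]},
        **{v: "B" for v in ["LUT4_B", "Carry_OBUF"]},
    }

    A = 8  # tile capacity
    areas = {}
    for v, cluster in partition.items():
        areas[cluster] = areas.get(cluster, 0) + vertex_area[v]
    assert all(a <= A for a in areas.values())           # constraint (3)

    cut_weight = sum(w for (r, s), w in edges.items()    # objective (4)
                     if partition[r] != partition[s])
    print(areas, "cut weight =", cut_weight)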

4.2. Tile Generation

In this section, the generation of each tile is presented. The input to this step is a list of partitioned logic clusters (including the modules and intra-tile nets) and a list of inter-tile nets. For each partitioned logic cluster, a network interface primitive is added; each outgoing inter-tile net is mapped to a local net that connects to an input port of the network interface, and, similarly, each incoming inter-tile net is mapped to a local net that connects to an output port of the network interface. For the adder example, the partitioned logic clusters are shown in Figure 5(a), while the generated tile structures are shown in Figure 5(b).
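A small sketch of this port assignment, assuming the nets are simply numbered in the order they are encountered; the net names below are hypothetical inter-tile nets of the adder example, not taken from the figure.

    def assign_ni_ports(outgoing_nets, incoming_nets):
        # Outgoing inter-tile nets drive input ports of the TX network interface;
        # incoming inter-tile nets are decoded onto output ports of the RX interface.
        tx_ports = {net: port for port, net in enumerate(outgoing_nets)}
        rx_ports = {net: port for port, net in enumerate(incoming_nets)}
        return tx_ports, rx_ports

    tx, rx = assign_ni_ports(outgoing_nets=["A0_net", "B0_net", "A1_net", "B1_net"],
                             incoming_nets=[])
    print(tx, rx)

These port ids are what the packet sketch in Section 3.1 carries in each 4-bit net slot.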

Fig. 5. Adder Example: (a) Partitioned Logic Clusters; (b) Generated Tiles

Fig. 5. Adder Example Further, in Figure 4(a), the timing constraint A becomes tile constraint in as shown in Figure 5(b). But the timing constraint B which contains an inter-tile net is partitioned into three, TC B Source, TC B Switching and TC B Destination. TC B Switching will be used to generate routing path.

Algorithm 1 Find routing path and meet timing constraints
Procedure: Generate Routing Table
1: Generate graph
2: Map local cluster traffic to switches
3: Update weights
4: while timing constraints are not met do
5:   for each inter-tile net Ni do
6:     Find shortest path for net Ni using Dijkstra algorithm
7:     Update weights using this path for Ni
8:   end for
9: end while
10: Add each inter-tile net Ni path to the respective routing table
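Below is a minimal, self-contained sketch of the shortest-path search in step 6 of Algorithm 1. The edge weights model the switching latency of the source switch; the graph fragment and the 4 ns latencies are assumed values loosely echoing the example of Figure 6, not data from the paper.

    import heapq

    def dijkstra(graph, source, target):
        # graph: {node: [(neighbor, latency_ns), ...]}; returns (latency, path).
        dist, prev = {source: 0.0}, {}
        heap = [(0.0, source)]
        while heap:
            d, u = heapq.heappop(heap)
            if u == target:
                break
            if d > dist.get(u, float("inf")):
                continue
            for v, w in graph.get(u, ()):
                nd = d + w
                if nd < dist.get(v, float("inf")):
                    dist[v], prev[v] = nd, u
                    heapq.heappush(heap, (nd, v))
        path, node = [target], target
        while node != source:
            node = prev[node]
            path.append(node)
        return dist[target], path[::-1]

    # Simplified fragment inspired by Figure 6: tile A0 reaches tile C4 either
    # through switches S0-S7-S6 or S0-G0-S6; each hop costs the source switch's latency.
    graph = {
        "A0": [("S0", 0)],
        "S0": [("S7", 4), ("G0", 4)],
        "S7": [("S6", 4)],
        "G0": [("S6", 4)],
        "S6": [("C4", 4)],
    }
    latency, path = dijkstra(graph, "A0", "C4")
    print(path, latency, "ns")   # compare against the switching timing constraint

If the returned latency violates a switching timing constraint, the outer loop of Algorithm 1 updates the contention weights and retries, as the example below illustrates.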

For the adder example, let us assume a switching timing constraint B of 20 ns, with TC_B_Source and TC_B_Destination of 2 ns and 3 ns respectively (obtained from traditional static timing analysis of the tile implementations). Assume a switching latency of 4 ns for the one entry in the routing table shown in Figure 5(b). This allows timing constraint B to be met.

Let us now consider a non-trivial example to show how the timing constraints are met. Suppose Tile 0_0_0 and Tile 0_0_1 of Figure 5 are mapped respectively to Tile A0 and Tile C4, as set X, as shown in Figure 6. Constrained path B then has two candidate paths: Path A1 (A0, S0, S7, S6 and C4) and Path A2 (A0, C0, G0, S6 and C4). To increase the complexity, another set Y of Tile 0_0_0 and Tile 0_0_1 is mapped respectively to Tile A2 and Tile B2, with constrained path A3 (A2, S0, S7 and B2). Both sets have timing constraints of 20 ns. During the first iteration of the algorithm, if path A1 is selected for timing constraint B in set X, the switching latency of path A3, which shares switches S0 and S7 with A1, can become greater than 20 ns, making constraint B in set Y infeasible. The algorithm will then select path A2 in the next iteration (due to the updated weights) for set X, thus allowing path A3 to meet its timing constraint.

Fig. 6. Routing path example

4.4. Emulation Prototype

In this section, we describe the sFPGA emulation prototype we have developed. Our emulation prototype uses a Xilinx FPGA as the FPGA tile of the sFPGA architecture. The communication link is through a high speed SMA connector, which communicates through the Multi-Gigabit Transceiver (MGT) module in the FPGA. An MGT is a SerDes capable of operating at serial bit rates of 3.125 Gigabit/second. A Verilog module was written as a network interface to this MGT. The implementation platform is the Xilinx XUP Virtex-II Pro Development Board [21], with one xc2vp30 FPGA per tile. Xilinx ISE 9.1i is used as the back-end tool for synthesis and implementation. Early results show that the network interface primitive, with 8 input ports and 8 output ports and a twenty-bit packet size, uses 11 slices on a Virtex-II Pro Xilinx FPGA. For the full adder example, as there is only one entry in the switch, 2 bytes of memory are used for the entry.

5. CONCLUSIONS AND FUTURE WORK

In this paper, we propose a scalable FPGA architecture. It consists of an array of FPGA tiles connected by a hierarchical, cluster-based packet switched network, which makes the sFPGA very scalable. We also propose a design flow for sFPGA that uses a cluster-based approach to integrate existing design tools seamlessly. We have also presented a theoretical case study to show the feasibility of our design flow. In future work, we would like to verify more complex applications on our emulation prototype, complete the automation of the design flow using Java EDIF [22], and make it available to the research community. It would be interesting to explore different parameters for the network in our experimental prototype and observe the performance of the new sFPGA architecture. Currently, logic cluster to logic tile placement is simplified by random placement, and we would like to explore optimal placement algorithms for this.

6. REFERENCES

[1] J. M. Rabaey, Digital Integrated Circuits: Second Edition. Upper Saddle River, NJ, USA: Prentice-Hall, 1996, p. 413.

[2] A. DeHon, "Unifying mesh- and tree-based programmable interconnect," IEEE Trans. Very Large Scale Integr. Syst., vol. 12, no. 10, pp. 1051–1065, 2004.

[3] A. DeHon and R. Rubin, "Design of FPGA interconnect for multilevel metallization," IEEE Trans. Very Large Scale Integr. Syst., vol. 12, no. 10, pp. 1038–1050, 2004.

[4] A. Rahman, S. Das, A. Chandrakasan, and R. Reif, "Wiring requirement and three-dimensional integration of field-programmable gate arrays," in SLIP '01: Proceedings of the 2001 International Workshop on System-Level Interconnect Prediction. New York, NY, USA: ACM, 2001, pp. 107–113.

[5] W. Meleis, P. Leeser, M. v Zavracky, and M. Vai, "Architectural design of a three dimensional FPGA," in Proceedings of the Seventeenth Conference on Advanced Research in VLSI, Sep. 1997, pp. 256–268.

[6] L. Benini and G. D. Micheli, "Networks on chips: A new SoC paradigm," Computer, vol. 35, no. 1, pp. 70–78, 2002.

[7] M. B. Taylor, J. Kim, J. Miller, D. Wentzlaff, F. Ghodrat, B. Greenwald, H. Hoffman, P. Johnson, J.-W. Lee, W. Lee, A. Ma, A. Saraf, M. Seneski, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, and A. Agarwal, "The RAW microprocessor: A computational fabric for software circuits and general-purpose programs," IEEE Micro, vol. 22, no. 2, pp. 25–35, 2002.

[8] S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, A. Singh, T. Jacob, S. Jain, V. Erraguntla, C. Roberts, Y. Hoskote, N. Borkar, and S. Borkar, "An 80-tile sub-100-W TeraFLOPS processor in 65-nm CMOS," IEEE Journal of Solid-State Circuits, vol. 43, no. 1, pp. 29–41, Jan. 2008.

[9] B. Landman and R. Russo, "On a pin versus block relationship for partitions of logic graphs," IEEE Transactions on Computers, vol. C-20, no. 12, pp. 1469–1479, Dec. 1971.

[10] H. Schmit, "Extra-dimensional island-style FPGAs," in Field-Programmable Logic and Applications, 13th International Conference (FPL 2003), Lecture Notes in Computer Science vol. 2778, Lisbon, Portugal, 2003, pp. 406–415.

[11] V. Betz, J. Rose, and A. Marquardt, Architecture and CAD for Deep-Submicron FPGAs. Norwell, MA, USA: Kluwer Academic Publishers, 1999, pp. 151–190.

[12] A. A. Aggarwal and D. M. Lewis, "Routing architectures for hierarchical field programmable gate arrays," in Proceedings of the 1994 IEEE International Conference on Computer Design: VLSI in Computers & Processors (ICCD '94). Washington, DC, USA: IEEE Computer Society, 1994, pp. 475–478.

[13] R. Amerson, R. Carter, W. Culbertson, P. Kuekes, G. Snider, and L. Albertson, "Plasma: An FPGA for million gate systems," in FPGA '96: Proceedings of the 1996 ACM Fourth International Symposium on Field-Programmable Gate Arrays. New York, NY, USA: ACM, 1996, pp. 10–16.

[14] W. Tsu, K. Macy, A. Joshi, R. Huang, N. Walker, T. Tung, O. Rowhani, V. George, J. Wawrzynek, and A. DeHon, "HSRA: High-speed, hierarchical synchronous reconfigurable array," in FPGA '99: Proceedings of the 1999 ACM/SIGDA Seventh International Symposium on Field Programmable Gate Arrays. New York, NY, USA: ACM, 1999, pp. 125–134.

[15] "Head-of-line blocking," available from: http://en.wikipedia.org/wiki/Head-of-line_blocking, 2008.

[16] R. Murgai, R. K. Brayton, and A. Sangiovanni-Vincentelli, Logic Synthesis for Field-Programmable Gate Arrays. Norwell, MA, USA: Kluwer Academic Publishers, 1995.

[17] G. D. Micheli, Synthesis and Optimization of Digital Circuits. McGraw-Hill Higher Education, 1994, pp. 345–348.

[18] B. Preas and M. Lorenzetti, Physical Design Automation of VLSI Systems. Benjamin/Cummings, 1988, pp. 67–70.

[19] A. E. Caldwell, A. B. Kahng, and I. L. Markov, "Can recursive bisection alone produce routable placements?" in DAC '00: Proceedings of the 37th Conference on Design Automation. New York, NY, USA: ACM, 2000, pp. 477–482.

[20] T. H. Cormen, C. Stein, R. L. Rivest, and C. E. Leiserson, Introduction to Algorithms. MIT Press, 2001, pp. 595–601.

[21] "Xilinx XUPV2P Development System," available from: http://www.xilinx.com/univ/xupv2p.html, 2008.

[22] "BYU Java EDIF Tools," available from: http://reliability.ee.byu.edu/edif/, 2008.

