Mesh-of-Trees and Alternative Interconnection Networks for Single-Chip Parallelism Aydin O. Balkan, Student Member, IEEE, Gang Qu, Senior Member, IEEE, and Uzi Vishkin, Senior Member, IEEE

Abstract—In single-chip parallel processors, it is crucial to implement a high-throughput low-latency interconnection network to connect the on-chip components, especially the processing units and the memory units. In this paper, we propose a new mesh of trees (MoT) implementation of the interconnection network and evaluate it relative to metrics such as wire complexity, total register count, single switch delay, maximum throughput, tradeoffs between throughput and latency, and post-layout performance. We show that on-chip interconnection networks can provide higher bandwidth between processors and shared first-level cache than previously considered possible, facilitating greater scalability of memory architectures that require it. MoT is also compared, both analytically and experimentally, to some other traditional network topologies, such as hypercube, butterfly, fat trees, and butterfly fat trees. When we evaluate a 64-terminal MoT network at 90-nm technology, concrete results show that MoT provides higher throughput and lower latency, especially when the input traffic (or the on-chip parallelism) is high, at comparable area. A recurring problem in networking and communication is that of achieving good sustained throughput in contrast to just high theoretical peak performance that does not materialize for typical workloads. Our quantitative results demonstrate a clear advantage of the proposed MoT network in the context of single-chip parallel processing.
Index Terms—Layout, mesh-of-trees, multiprocessor interconnection, multistage interconnection networks, network-on-chip (NoC), parallel processing, topology.

I. INTRODUCTION The advent of the billion-transistor chip era coupled with a slowdown in clock rate improvement has brought about a growing interest in parallel computing. Ongoing expansion in the demands of scientific and commercial computing workloads also contributes to this growth in interest. To date, the outreach of parallel computing has fallen short of historical expectations. This has primarily been attributed to programmability shortcomings of parallel computers. The Parallel Random Access Model (PRAM) is an easy model for parallel algorithmic thinking and for programming. Current multichip multiprocessor designs that aim to support the PRAM (such as Tera/Cray MTA [1] and SB-PRAM [2]), although interesting, are constrained by interchip interconnections. Latency and bandwidth problems have limited their success in supporting PRAM. With the continuing increase of silicon capacity, it

becomes possible to build a single-chip parallel processor, as demonstrated in the Explicit Multithreading (XMT) project [3], [4] that seeks to prototype a PRAM-On-Chip vision. To handle the high level of parallelism and memory access needed for a PRAM-on-a-chip, XMT uses a memory architecture where partitioning of data memory starts from the first level of the on-chip cache. It is a challenging task to build a high-throughput low-latency interconnection network on such an architecture (see Section II-A for details). A poorly designed interconnection network may create many on-chip queuing bottlenecks when concurrent read and/or write requests are issued to the memory. This will significantly affect the network's throughput, memory latency, and the overall system performance. In this paper, we study the interconnection network design problem for a memory architecture designed to achieve high single-chip parallelism. The contributions of this paper include a novel high-performance design and evaluation of the Mesh-of-Trees (MoT) topology, and a comprehensive comparison with existing popular networks. We demonstrate, through both theoretical analysis and simulation, that the proposed MoT network can achieve competitive throughput and low latency, especially when the input traffic (or the on-chip parallelism) is high. In pre-layout simulations, the average 8-terminal MoT throughput is 4-26% higher compared to butterfly and hypercube networks of similar cost. The difference is 2-23% in the case of 64 terminals. A post-layout comparison of 8-terminal networks shows that MoT can operate at higher frequencies and provide 206 Gbps (up to 56% higher) average throughput with a 32-bit data path. We extended the 8-terminal configuration with power routing and I/O pads and taped it out for fabrication [Fig. 1(a)]. The rest of the paper is organized as follows. We analyze the existing interconnection network models in Section II and describe their shortcomings. We then propose a new approach based on the MoT topology in Section III. We conduct simulations to evaluate the performance of both the existing approaches and the proposed MoT network in Section IV. Section V concludes and points out our current and future work. An extended discussion of the results presented in this paper can be found in [5]. II. BACKGROUND AND RELATED WORK

Manuscript received October 01, 2007; revised May 18, 2008. First published April 17, 2009; current version published September 23, 2009. This work was supported in part by Grant CCF-0325393 from the National Science Foundation. The authors are with the Electrical and Computer Engineering Department and Institute of Advanced Computer Studies, University of Maryland, College Park, MD 20742 USA (e-mail: [email protected]; [email protected]; [email protected]). Digital Object Identifier 10.1109/TVLSI.2008.2003999

A. A Memory Architecture For Single-Chip Parallelism Parallel computing generally requires a larger number of memory accesses per clock than serial computation. A standard technique for hiding access latencies is feeding the functional units with instructions coming from multiple hardware threads.


Fig. 1. (a) Die photo of 8-terminal chip. (b) Global memory is partitioned into modules (separated by dashed lines). Each module has its own possibly multilevel on-chip caches (within dotted lines).

This allows, for example, overlapping several arithmetic instructions as well as read instructions, each of which requires waiting for data. Such overlap implies a steady and high demand for memory accesses. To facilitate concurrent accesses by many processing elements, memory is normally partitioned on parallel machines [6]. For example, [1] uses 512 memory modules of 128 MB each, and [2] uses as many memory modules as processing elements. The following memory structure is used in the XMT single-chip parallel architecture [7], which is designed to optimize single-task completion time [Fig. 1(b)]. A globally shared memory space is partitioned into multiple memory modules. Each memory module consists of on-chip cache and off-chip memory portions. A universal hashing function is used to avoid pathological access patterns (similar to [1], [2], [8]). As a result, memory requests are expected to be distributed uniformly over all memory modules. Furthermore, we assume that the processors issue load and store instructions. These instructions are converted to network packets consisting of one or two flits.1 Load instructions are sent in a single flit and less-frequent store instructions are sent with at most two flits. Returning data or acknowledgment messages are sent in a single flit. This structure completely avoids cache coherence issues because the processors do not have private caches of their own. However, it imposes significant challenges for the interconnection network design. First, a hardware thread should be able to send memory requests to any memory location on the chip. Coupled with the objective of avoiding cache coherence issues, this requires placing an interconnection network between the processing units and the first level of the cache and will cause a higher latency compared to cache access latencies of traditional serial processors. Multiple threads will run concurrently to hide this latency.
1 A flow control digit, or flit, is the smallest part of a packet that can be manipulated by the flow control mechanisms.
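To make the address-to-module step above concrete, here is a minimal sketch in C of how a shared address could be hashed to one of the memory modules before the corresponding one- or two-flit packet is formed. The multiplicative constant and the function name are illustrative assumptions; this is not the actual universal hash used by XMT [7].

    #include <stdint.h>

    /* Illustrative only: spread global addresses over num_modules memory
     * modules with a simple multiplicative hash so that regular access
     * patterns do not all land on the same module. The constant is an
     * arbitrary odd 32-bit multiplier, not the hash of [1], [2], [7], [8]. */
    static uint32_t module_of_address(uint32_t addr, uint32_t num_modules)
    {
        uint32_t h = addr * 2654435761u;  /* multiplicative hashing step */
        return h % num_modules;           /* index of the target memory module */
    }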

In order to satisfy the steady and high demand of threads for data, the interconnection network needs to provide high throughput between the processing units (threads) and the memory modules. Second, the interconnection network needs to provide low on-chip communication latency. This will allow designating fewer threads just to hide latencies and will simplify the overall design. It will also improve performance in cases where a sufficiently large number of thread contexts is not available to overlap communication with computation. B. Existing Interconnection Network Models In this section, we briefly discuss popular interconnection networks that have been extensively studied and used in earlier parallel processing computers. 1) Hypercube: An n-dimensional hypercube interconnects N = 2^n nodes by connecting each node to n other nodes. If we label the nodes from 0 to N - 1 in binary, a pair of nodes are connected directly if and only if their labels differ by one bit [10]–[12]. This connection consists of two unidirectional physical communication channels (wires). Fig. 2(a) depicts the best known implementation of a 4-D hypercube in terms of area efficiency [11]. Its N = 16 nodes (PC stands for Processing Cluster in the figure) are connected by wires in the tracks shown in the shaded areas between the PCs. 2) Butterfly: A binary butterfly network also connects N nodes, as shown in Fig. 2(b). The 8 PCs are connected to each other through switch nodes labeled (by their vertical layers) A, B, C, and D. For example, the connection between source 0 and destination 5 and the connection between source 6 and destination 6 are highlighted. Fig. 2(c) shows the best known physical layout to implement the same butterfly network for single-chip multiprocessors [9]. In general, butterfly networks can have switches with k input and k output ports [k = 2 in Fig. 2(b)]. In Section II-D, we explain why larger values of k are not desirable for high throughput.
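The hypercube connection rule quoted above (two nodes are directly linked if and only if their binary labels differ in exactly one bit) reduces to a power-of-two test on the XOR of the labels; the small C helper below merely restates that rule and is not taken from the paper.

    #include <stdint.h>

    /* Two hypercube nodes are directly connected iff the XOR of their
     * labels has exactly one bit set, i.e., it is a nonzero power of two. */
    static int hypercube_connected(uint32_t a, uint32_t b)
    {
        uint32_t d = a ^ b;
        return d != 0 && (d & (d - 1)) == 0;
    }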

Fig. 2. (a) Physical implementation of 4-D hypercube. (b) Butterfly network and (c) its layout with N = 8 PCs as shown in [9].

Fig. 3. (a) k-ary n-tree (fat tree) with k = 2, n = 4, N = k^n = 16. (b) Butterfly fat tree with N = 16. (c) 2-D mesh (4 × 4) with N = 16.

3) Fat Trees: A fat tree network provides multiple paths between each pair of nodes. A disadvantage of the fat tree is its large switch size. The following two structures were proposed to overcome this disadvantage. (i) The k-ary n-tree [13] connects N = k^n PCs with a fat tree of n levels, as illustrated in Fig. 3(a). Root nodes (small circles in the center) have k children, switch nodes (oval-shaped internal nodes) have k children and k parents, and there are two unidirectional links between a child and a parent. Thus, there are 2k input ports and 2k output ports for each switch node. (ii) Fig. 3(b) depicts a butterfly fat tree (BFT) [14]. Each internal switch node (the square with no label surrounded by 4 PCs) is connected to 4 PCs and the 2 root switch nodes. Thus, it has 6 input ports and 6 output ports. 4) Ring and 2-D-Mesh: Recently, ring [Fig. 4] and 2-D mesh [2-D-mesh, Fig. 3(c)] topologies [10] have been used in some industrial single-chip parallel processor projects. The Element Interconnect Bus of the Cell processor (IBM, SONY, Toshiba) [15] is designed with 4 parallel rings that connect 12 cores consisting of processing elements and peripherals. The interconnection network of the Teraflop processor (Intel) [16] connects 80 processing cores in an 8 × 10 2-D-mesh. This topology has also been used in academic projects such as RAW (MIT) [17] and TRIPS (UT-Austin) [18]. Furthermore, several network-on-chip (NoC) studies chose the 2-D-mesh as underlying topology, due to its regularity and low hardware complexity [19]–[21]. In general, N terminals can be connected in 2-D-mesh networks by an R × C grid, where R and C represent the number of rows and columns.

Fig. 3(c) shows a case where R = C = 4. Recently, new topologies, such as the spidergon, have been proposed and evaluated as a compromise between the ring and 2-D-mesh topologies [22]. 5) Crossbar: Several studies considered crossbar (or bus matrix) networks for connecting processors, memory modules, and application-specific components on Multiprocessor Systems-on-Chip (MPSoC) [23], [24]. Such networks are built by connecting multiple pipelined buses, based on communication bandwidth requirements between heterogeneous components. While diversity in required bandwidth allows efficient hardware optimization for different applications, such optimizations are not expected to provide a similar benefit when the connectivity of components is symmetric and high bandwidth is required between processors and memory. The pipelined crossbar network of the Cyclops-64 processor (IBM) [25] connects the processors and globally shared memory modules. However, due to its centralized architecture, it is unable to provide the desired performance. C. Performance Tuning With Additional Resources In many cases, the per-cycle throughput of interconnection networks can be improved by increasing the amount of resources. In this section, we summarize some common methods. 1) Virtual-Channel Routers: Virtual channels act as buffers for incoming data packets that are stalled due to contention in later stages. A packet is stored in a virtual channel in the switch until an output port and physical channel toward its destination becomes available [10], [13], [26], [27]. These studies typically


use 2 or 4 virtual channels per physical channel. Using more virtual channels improves throughput by increasing the utilization of physical channels (wires). However, in addition to the area cost of virtual channel buffers, this approach also increases the complexity and logic delay of the network switch [28]. 2) Virtual Output Queuing and Buffered Crossbars: Virtual Output Queues (VOQ) are the most commonly used method to achieve maximum throughput with crossbars. In its classical implementation, N buffers per input port (N² buffers in total) precede the inputs of a monolithic crossbar [29]. In such crossbars, the complexity of the arbitration and scheduling may affect the length of clock cycles, similar to the effect of virtual channels discussed above. Buffered crossbars use asymptotically the same number of buffers as VOQ-based crossbars; however, the buffers are distributed at each crosspoint. More advanced VOQ-based crossbar architectures are built by combining input buffers and distributed buffers, and are generally used in large-scale network routers [30]. Using some amount of crosspoint buffers effectively decouples input scheduling from output scheduling; however, it does not completely eliminate them as centralized operations [31]. Therefore, such architectures may still require long clock cycles, or multiple iterations of arbitration. 3) Tuned Butterfly Networks: The butterfly network is one of the most extensively studied interconnection networks. Many variants of the butterfly network have been developed to improve its performance. The improvement is achieved by increasing the resources. One group of networks extends the regular butterfly by adding parallel resources. Extra hardware provides additional bandwidth, reduces congestion, and improves throughput. Examples of this approach include the multibutterfly [32], the dilated butterfly [33], [34], and the replicated butterfly [33], [35]. D. Deficiency of the Existing Interconnection Networks 1) Interference: Consider two packets going from sources s1 and s2 to destinations d1 and d2, respectively. In a noninterfering network, excess traffic destined to target node d1 will not interfere with or steal bandwidth from traffic destined to target node d2 [10]. In most of the networks discussed in Section II-B, interference will reduce performance. VOQ-based or buffered crossbars eliminate such interference by decoupling input and output scheduling. 2) Bisection Bandwidth: The bisection bandwidth is the bandwidth of the smallest cut that partitions the network nearly in half2 [10]. The upper bound on the per-terminal throughput of a network under uniform traffic is proportional to its bisection bandwidth, and inversely proportional to the number of terminals [10]. As a result, under the uniform traffic assumption, the overall throughput of a network is proportional to its bisection bandwidth. Therefore, it is an important performance measure to consider for the memory system of Section II-A. If the links between nodes have identical bandwidths, a ring network has a constant bisection bandwidth, and a square 2-D-mesh network has a bisection bandwidth of at most √N. If the number of rows and columns (R, C) are not equal, the bisection bandwidth of the 2-D-mesh is proportional to min(R, C). While these networks may be better suited for
2 Exactly in half, if the number of nodes and terminals are even.

Fig. 4. Ring network with N = 16 terminals.

localized traffic patterns, for parallel processors with uniform memory access patterns they face scalability challenges. Networks can be replicated to improve the bisection bandwidth, such as the 4-ring network of the Cell processor [15]. However, there may be challenges in scaling much further (e.g., to a much larger number of rings), and we are unaware of any comprehensive performance study of such replicated networks. 3) Network Diameter: Packets in a network advance in hops from one node to the next, as they travel from the source to the destination. The diameter of a network is the largest number of hops among all shortest paths [10]. Since the network we intend to build is placed between processors and the first level of globally shared cache memory, a low and scalable diameter is desirable. Among the networks of Section II-B, ring and spidergon have diameter proportional to N, and the 2-D-mesh has diameter proportional to √N. Other networks that we consider in this paper have logarithmic or better diameter. 4) Switch Delay: Peh and Dally showed in [28] that the critical delay of a routing switch increases when (i) the number of input and output ports, or (ii) the number of virtual channels increases. A longer critical delay requires a slower clock rate, and this reduces the peak and average throughput of the network. Although there is an apparent cap on the global clock rate of the processors, it is still reasonable to run a small and centralized module with a fast clock, and many other modules with a slower clock. Therefore, it is important to seek short critical delays in the interconnection network. Switches of the hypercube network have log N input and output ports, where N is the number of network terminals. Therefore, the peak throughput of the hypercube would reduce with an increasing number of terminals. Similarly, a butterfly network built with larger switches, with e.g., 4 ports, would have a longer clock period and, therefore, lower peak throughput, compared to a butterfly with 2-port switches. Leiserson [36] bounds the root capacity of the fat tree for N terminals, where the capacity between the network and each processor at the leaves is defined as 1. The capacity increases exponentially at each level between the leaves and the root. The number of input and output ports of the switching nodes is proportional to the capacity at that level of the tree. Therefore, in Leiserson's fat tree, the number of switch ports increases between leaves and root, reaching its maximum at the root switch. Alternative fat tree architectures as described in Section II-B3 are built with small switches with a constant number of ports. The smallest of these switches has 4 input and 4 output ports. As a result, they will provide higher peak throughput compared to Leiserson's fat tree; however, they will


Fig. 5. Mesh of trees with four clusters and four memory modules.

fall short compared to the peak throughput of butterfly network with 2-port switches. 5) Global Synchronization: Crossbar networks provide one standard type of high-throughput interconnection networks, where packets do not interfere. They achieve this by scheduling the switches based on the global state of the network. For example, the arbiters in each of the 96 destination ports of pipelined crossbar in [25] select a winner from all requested source ports, and the result is used in a 96-to-1 multiplexer. Other studies discuss pipelined arbitration methods for VOQ-based crossbars [37]; however, the arbiter remains centralized, trying to process all sources at once in order to match them to available destinations. The overhead for global scheduling may be acceptable with large payloads in messages. However, in the XMT single-chip parallelism context, the messages between processors and cache are short. Therefore, the networks that need to globally schedule the switches will incur significant overheads. E. Advantages of MoT Network In this paper, we propose a new interconnection network implementation based on mesh of trees (MoT) topology. We will demonstrate, both analytically and experimentally, that the proposed MoT network can provide high throughput with low latency within a reasonable area cost. The MoT topology and routing method guarantee that unless the memory access traffic is extremely unbalanced, packets between different sources and destinations will not interfere. Therefore, MoT network provides high per-cycle throughput, very close to its peak throughput. Furthermore, using simpler switches and avoiding global scheduling allows higher operating frequencies.

Notable differences between the Mesh-of-Trees network and VOQ-based buffered crossbar switches [30] include the distribution of buffers over trees with logarithmic depth; the use of completely decentralized and decoupled constant-complexity routing and arbitration operations; and combining arbitration with data traversal over the network, similar to [38]. III. MESH OF TREES NETWORK In earlier sections the memory architecture and its implementation challenges were overviewed. We now present the MoT network in the context of single-chip parallel processing. A. Topology The MoT concept was discussed earlier in studies such as [12], [39], [40]. These approaches are not interference-free and they allow performance bottlenecks. We implement the MoT network differently (Fig. 5), to eliminate such bottlenecks. The network consists of two main structures, a set of fan-out trees and a set of fan-in trees.3 Fig. 5(b) shows the four binary fan-out trees, where each PC is a root and connects to two children (we call them the up child and the down child), and each child has two children of its own. The 16 leaf nodes also represent the leaf nodes in the binary fan-in trees that have MMs as their roots [Fig. 5(c)]. There are two interesting properties of our MoT network. First, there is a unique path between each source and each destination. This simplifies the operation of the switching circuits and allows a faster implementation, which translates into an improvement in throughput when pipelining a path (registers need to be added to separate pipeline stages).
3 They are called row and column trees in [12]. Here, we use the names fan-out and fan-in trees to convey the data flow direction.
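The unique-path property can be made explicit: the leaf of fan-out tree i that corresponds to destination j meets the leaf of fan-in tree j that corresponds to source i, so there is exactly one route from each PC to each MM. The C sketch below enumerates that route for an N = 2^logN terminal MoT, following the bit-steering rule described in Section III-B ('1' selects the down child, '0' the up child, most significant bit first). It is a simplified illustration under these naming assumptions, not the RTL used in the paper.

    #include <stdio.h>

    /* Enumerate the unique MoT path from PC `src` to MM `dst`:
     * descend the fan-out tree of `src` guided by the bits of `dst`,
     * then ascend the fan-in tree of `dst` toward its root. */
    static void mot_unique_path(unsigned src, unsigned dst, unsigned logN)
    {
        printf("fan-out tree %u: root", src);
        for (int level = (int)logN - 1; level >= 0; --level) {
            unsigned bit = (dst >> level) & 1u;
            printf(" -> %s child", bit ? "down" : "up");
        }
        printf(" (leaf %u)\n", dst);

        printf("fan-in tree %u: leaf %u", dst, src);
        for (unsigned level = 0; level < logN; ++level)
            printf(" -> parent");
        printf(" (root = MM %u)\n", dst);
    }

For example, mot_unique_path(0, 2, 2) reproduces the PC 0 to MM 2 route of Fig. 5(d): down child for the leading '1' of "10", then up child for the trailing '0'.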


Fig. 6. Switch primitives for MoT network. Data paths are marked with thick lines. Control paths are simplified. (a) Fan-out tree primitive: One input channel, two output channels. (b) Fan-in tree (arbitration) primitive: Two input channels, one output channel. (c) Pipeline primitive: One input channel, one output channel. Signals: req: request; ks: kill-and-switch; write/read: write and read pointers; b: storage buffer; select: result of arbitration; destination: destination address.

Second, packets between different sources and destinations will not interfere, unless the traffic is heavily unbalanced. When several packets need to reach the same destination, some packets may be queued at fan-in tree nodes. Since each node has a limited queue storage capacity, packets can be backed up to previous nodes in the fan-in trees. In extreme cases such backup can spill to the fan-out trees. This is the only case in which the path to one destination can affect the path to a different destination. But for that to happen, the demand for the first destination needs to be very high and exceed the storage capacity of the fan-in tree nodes. We will further discuss this below in the flow control section. B. Routing Routing is the process of finding a path for each packet from its source to its destination. Fig. 5(d) gives the communication paths from PCs to MMs for three memory requests. Each memory request will travel from the PC (source) through a fan-out tree and then a fan-in tree before it reaches the MM (destination). There is no routing decision to be made in the fan-in trees as all packets move toward the root. In fan-out trees, the routing decision is trivial from the binary representation of the destination. For example, when PC 0 sends a packet to MM 2 ("10" in binary) as shown in Fig. 5(d), the packet goes from the root to its down child (because of the first bit "1" in "10") and then it selects the up child (because of the "0" in the next bit position in "10") and reaches the leaf. This simple routing scheme also ensures that the fan-out tree part of the network is noninterfering. Similarly, packets with different destinations will not interfere in the fan-in trees. C. Flow Control Flow control mechanisms manage the allocation of channels and buffers to packets. Fig. 6 illustrates the switching primitives in our MoT network. Each node in the fan-out and fan-in trees of the network is implemented as the fan-out or fan-in (arbitration) primitive shown in Fig. 6(a) and (b). The pipeline primitive in Fig. 6(c) is used to divide long wires into multiple short segments. In general, packets do not compete for resources in the fan-out trees. They stall only when the fan-in trees stall and the stall propagates to the fan-out trees. This occurs only rarely, when the traffic pattern is extremely unbalanced. Arbitration in fan-in trees ensures that the two children of a node pass requests to their parent in an alternating fashion when

there is a stream of incoming packets at each child. In other words, if a request from one child loses the arbitration to the other child's request in one cycle, the request is guaranteed to win the arbitration in the next cycle. This step-wise arbitration concept is similar to the use of arbitrate-and-move primitives in [38]. In each cycle, one request out of two is arbitrated at each arbitration primitive and moved to the parent node. The use of local flow control mechanisms reduces the communication overhead. However, in the absence of a global stall signal, a node has to get the stall information from its immediate successor. A raised kill-and-switch (ks) signal allows the subsequent data packet to appear at the output port. A low ks signal keeps the same packet at the output, effectively signaling a stall condition. To prevent data loss, each primitive has two buffers per input [41], and a control circuit ensures that consecutive packets are stored in different buffers. If in a given cycle the ks signal remains low, one more data packet can be read without overwriting the stalled one. D. Floorplan Fig. 7 depicts our proposed floorplan for the MoT networks. We first explain the layout of the fan-out and fan-in trees. Both the fan-out and fan-in trees are placed in pairs for better area utilization. Fig. 7(a) shows such a pair of 8-leaf fan-out trees for an MoT network with 8 clusters. The two root nodes of the two fan-out trees are connected to the source clusters by the thick lines. Empty circles are internal nodes and crosses are leaf connections. Fig. 7(c) shows the same layout for a pair of 32-leaf fan-out trees. Fig. 7(d) shows a pair of 8-leaf fan-in trees. Leaves are connected to internal nodes represented by the empty squares. Roots of the two fan-in trees are connected to the destination clusters through the connections with arrowheads. Fig. 7(e) gives the layout of a pair of 32-leaf fan-in trees. Fig. 7(b) shows how the fan-out and fan-in trees are placed for an 8-terminal network. Sources (PC) and destinations (MM) are marked in the figure. Each pair of fan-out trees is placed vertically and each pair of fan-in trees is laid out horizontally. The leaves of the fan-out trees are connected to the leaves of the fan-in trees. The path of a packet from source 5 to destination 2 is highlighted.
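The alternating arbitration of Section III-C can be captured in a few lines. The cycle-level C sketch below models only the grant decision of the fan-in (arbitration) primitive of Fig. 6(b): when both children request in the same cycle, the loser of the previous cycle wins the current one, and a low ks (stall) input freezes everything. Double buffering and the full request/ks handshake are omitted, so this is an illustrative model rather than the paper's switch circuit.

    #include <stdbool.h>

    /* State of one fan-in (arbitration) primitive: two input channels,
     * one output channel, alternating priority between the two children. */
    typedef struct {
        bool req[2];       /* request pending from child 0 / child 1 */
        int  last_winner;  /* child granted in the previous cycle    */
    } fanin_node;

    /* One cycle of arbitration. Returns the granted child (0 or 1),
     * or -1 if nothing is forwarded (no request, or parent stalled). */
    static int fanin_arbitrate(fanin_node *n, bool parent_ready)
    {
        if (!parent_ready)              /* ks low: keep packets in place */
            return -1;
        if (n->req[0] && n->req[1]) {   /* contention: alternate grants  */
            int winner = 1 - n->last_winner;
            n->last_winner = winner;
            n->req[winner] = false;
            return winner;
        }
        for (int c = 0; c < 2; ++c) {
            if (n->req[c]) {            /* single requester wins outright */
                n->last_winner = c;
                n->req[c] = false;
                return c;
            }
        }
        return -1;                      /* idle cycle */
    }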


TABLE II ASYMPTOTIC AREA COMPARISON OF NETWORKS

Fig. 7. Detailed floorplan of the MoT interconnection network. (b): the eight-cluster MoT network floorplan; (a) and (d): details of a fan-out tree pair and a fan-in tree pair in (b); (c) and (e): layout of 32-leaf trees.

IV. EVALUATION We evaluate the proposed MoT network in two parts. Pre-layout evaluation metrics include area complexity in terms of wire area and register count, and performance in terms of maximum throughput, the throughput-latency relation, and single-switch performance. In this stage, latency and throughput are measured in terms of cycles and flits per cycle (fpc), respectively. For each metric we derive an asymptotic and experimental comparison between MoT and other popular interconnection networks such as hypercube, butterfly, replicated butterfly, fat trees, 2-D-mesh, ring, and buffered crossbar. We also simulate MoT and butterfly using real applications. Post-layout evaluations include clock rate, actual area, and throughput in Gbps. We build and compare layouts for the MoT and replicated butterfly networks, which promise the highest peak throughput in pre-layout evaluations. Table I describes the symbols used in these evaluations. We perform network simulations with single-flit packets on the networks being evaluated. Two-flit packets are used when sending store, or some architecture-specific and less frequently issued, instructions [42] from processors to memory modules. All load instructions, returning data, and store-acknowledgment packets fit in single flits. Therefore, our simulations and comparisons cover the majority of communication packets between processors and memory modules. Furthermore, application simulations in a parallel processor context show that such less-frequent two-flit instructions do not significantly affect performance. Earlier work showed a small amount of performance loss of the MoT network with increasing amounts of two-flit packets [43]. Longer packets may further reduce performance; however, they are not expected to occur in the

TABLE I LIST OF SYMBOLS

underlying memory architecture. Cost-efficient optimization for other specific architectures with longer packets (e.g., those with vector load/store operations) requires further study. A. Wire Area Complexity

We follow the grid assumptions of Thompson's classical VLSI complexity theory [44]: (i) widths of wires and square switches are assumed to be one unit; (ii) there are two levels of metal wires; (iii) two wires can intersect in one unit square if one is horizontal, the other is vertical, and they belong to different levels. These assumptions are sufficient to evaluate and compare the asymptotic wire area complexity of networks. However, in modern VLSI processes, there are more than 2 levels of metal available for wiring, and switch nodes are usually wider (and taller) than wires. Therefore, the area of logic gates is an important practical factor in area comparison. In fact, our recent results [43] indicate that logic gate area will remain higher than actual wire area for several future technology generations. Earlier work [45] assumed a fixed chip size and compared the wire areas of different networks. In this paper we generalize that discussion by removing the chip size constraint. Our results show that the wire area complexity of the MoT network is asymptotically larger than that of the other networks by a logarithmic to polylogarithmic factor (Table II). 1) Mesh of Trees: In the structure of Fig. 7(b), both the width and the height of the wire area grow proportionally to N log N. Their product gives the wire area of the MoT network in (1). 2) Hypercube: As shown in Fig. 2(a), the chip area is determined by the size of a PC, the width of the wire area between two PCs (with a factor of 2 due to the use of unidirectional channels), and the number of wire tracks in that area. We use the track-count formula from [11], assume unit width for PCs, and obtain the hypercube network's wire area in (2). 3) Butterfly: The number of wire tracks required in both dimensions can be obtained from the layout of [9]. Similar to the


MoT approach, their product gives the wire area in (3). The wire area of a replicated butterfly network is r times the area of a single butterfly, plus the area of the connections between the multiple copies and the sources and destinations. 4) Fat Trees: The wire area does not change for k-ary n-trees with different values of k and n as long as N = k^n is kept constant. We calculate the total wire area by iteratively adding wires starting from the root. Assuming unit size for PCs, the wire areas of k-ary n-trees and butterfly fat trees are calculated as in (4) and (5), respectively. 5) 2-D-Mesh and Ring: Considering only switches and links, and assuming unit width for PCs, the wire area of 2-D-mesh and ring networks is computed as shown in (6) and (7), respectively. 6) Buffered Crossbar: The wire area of a traditional crossbar is proportional to N². The buffered crossbar adds a constant overhead per crosspoint buffer; therefore, its wire area remains of the same order, as given in (8). With VOQ support outside the crossbar, the wire area complexity may increase. B. Register Count Routing switches consist of several data registers of b bits each, and some control circuitry that handles resource allocation and forward and backward signaling. In typical virtual-channel routing switches [10], there are v virtual channels per input and output port to improve performance. Each virtual channel uses at least one b-bit register for one data packet. In our proposed MoT network, each switch primitive has either one or two input and output ports and no virtual channels. In both types of switches, the control circuit consumes negligible area compared to the data registers. Next, we discuss the register count for different networks. If v is constant, hypercube, butterfly, and fat trees have asymptotically fewer registers than the MoT network. However, if v is increased so that these networks become noninterfering, their register count reaches or exceeds that of MoT (Table II). 1) Mesh of Trees: As shown in Figs. 5 and 7, the network consists of 2N fan-out and fan-in trees, each with N - 1 internal nodes and N leaves. The leaves do not contain switching circuits, since they are only wire connections. Based on the primitive circuits shown in Fig. 6, the total number of b-bit registers is given in (9). 2) Hypercube: The hypercube has N switching nodes, each with log N input and output ports. Each of the input and output ports contains v data registers [10], one per virtual channel. Therefore, the total number of b-bit registers is given in (10). 3) Butterfly: Similarly, the switches of the butterfly network have a total of N log N input and N log N output ports, with v virtual channels each, which gives (11). Replicated butterfly switches have two registers per input, and no virtual channels. The network consists of r copies of a regular butterfly, and binary trees between the network and the source/destination modules. The total number of registers is given in (12). 4) Fat Trees: The k-ary n-tree with N = k^n terminals has k^(n-1) root nodes and (n - 1)k^(n-1) internal switch nodes; counting the input and output ports of all these nodes, with v virtual channels each, yields the total number of b-bit registers in (13). In the butterfly fat tree, the total number of switches approaches N/2 as N grows [14], [27], [46]. Each switch node has 6 input and 6 output ports. Therefore, the total number of b-bit registers will be approximately as given in (14). 5) 2-D-Mesh and Ring: 2-D-mesh and ring networks have N switches. In the 2-D-mesh, each switch has 5 input and 5 output ports (North, South, East, West, and the corresponding PC). In total, the number of b-bit registers is given in (15). Each switch in a ring has three ports, which gives (16). 6) Buffered Crossbar: In a buffered crossbar or a VOQ-based crossbar, there are N² buffers of usually constant depth. We assume that each buffer has a constant number of registers, which gives (17). C. Maximum Throughput In order to evaluate the maximum throughput provided by each network model, we assume the maximum packet generation rate of one flit per cycle (1.0 fpc) at each input port of the network.
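The bisection-bandwidth argument of Section II-D2 gives a quick sanity bound on the per-port numbers of Table III: under uniform traffic, roughly half of all injected flits must cross the bisection. The C helper below restates that textbook bound from [10]; it is added here for illustration and is not one of the paper's own equations.

    /* Upper bound on per-port throughput (flits/cycle) under uniform random
     * traffic: about half of the injected flits must cross the bisection,
     * so N * theta / 2 <= B_c, i.e., theta <= 2 * B_c / N, where B_c is the
     * number of unit-bandwidth channels crossing the bisection cut [10]. */
    static double uniform_throughput_bound(double bisection_channels,
                                           unsigned n_terminals)
    {
        return 2.0 * bisection_channels / (double)n_terminals;
    }

Since a ring contributes only a constant number of channels to the cut, this bound shrinks as 1/N, which is the scalability problem noted in Section II-D2; topologies whose bisection grows linearly with N keep the bound at a constant number of flits per cycle per port.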


TABLE III MAXIMUM THROUGHPUT (IN FPC PER PORT) OF NETWORKS (BFT: BUTTERFLY FAT TREE)

Fig. 8. Cost-performance comparison of 8-, 16-, and 64-terminal networks. On the curves, the number of virtual channels for hypercube, butterfly, 2-D mesh, and ring is doubled from left (v = 2) to right (v = N). With the replicated butterfly, the number of copies is doubled from left (r = 1) to right (r = N).

At this generation rate, the network will saturate, and the injection and delivery rates will come to balance at the maximum throughput. We assume a uniform traffic pattern, which is expected for the memory architecture described in Section II-A. We obtain the results for hypercube, butterfly, 2-D-mesh, ring, and MoT networks from theoretical analysis and simulation. The results for fat tree networks are from [13] and [47]. As one can see from Table III, the proposed MoT network can provide the highest maximum throughput, which is 76% and 28% higher than butterfly and hypercube with a small number of virtual channels, and 3% and 16% higher than butterfly and hypercube with 64 virtual channels, respectively. We plot area cost versus performance for these networks and MoT for various numbers of terminals (Fig. 8). Cost and performance of the replicated butterfly increase as we increase the number of copies r, and cost and performance of virtual-channel networks increase as we increase the number of virtual channels v. There is a single MoT configuration on each chart for a given number of terminals. The replicated butterfly achieves higher performance at comparable cost with respect to virtual-channel networks. On the other hand, the MoT network achieves higher throughput at comparable cost, or comparable throughput at lower cost, up to 64 terminals. D. Throughput and Latency As traffic in the network increases, packets will experience longer latencies. We follow the guidelines in [10] to design simulations in order to evaluate the throughput and latency of various network models under different input traffic. The network is warmed up until the throughput stabilizes, and then marked packets are injected for latency measurement. We are particularly interested in the case when the input traffic, or the on-chip parallelism, is high. Hypercube and butterfly networks are simulated on the simulator provided by [10] with N = 64 terminals, and with different numbers of virtual channels, namely a typical setting and an aggressive setting. Router switches have a three-cycle switch latency per the speculative virtual-channel router design of [28]. The MoT network is simulated using a SystemC-based simulator [43], [45]. We vary the input traffic from a low of 0.1 fpc per port to the maximum of 1.0 fpc. Traffic is generated as a Bernoulli process and the packet destinations are uniformly random. The latency of a packet is measured as the time from when it is generated to the time it is received at the destination, which includes the waiting time at the source queue. For each input traffic rate, we use different seeds to generate a set of traffic patterns with the same traffic rate. These input traffic sets are injected into the simulators for each network model, and the average throughput and latency are reported in Fig. 9. The MoT network provides competitive throughput and latency and has a clear advantage over the others when the input traffic is high. More importantly, the MoT network has a more predictable latency when the input traffic varies. For example, when we increase the input traffic from 0.1 fpc per port to 0.9 fpc, the hypercube latency increases by a factor of 3.2 and the butterfly network latency increases by a factor of 3.9, while the MoT latency increases only by a factor of 1.6. This could allow more accurate design and analysis of algorithms [48].
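The open-loop measurement described above (Bernoulli injection at a chosen offered load, uniformly random destinations, latency counted from the generation time so that source queuing is included) can be reproduced with a small harness. The C sketch below covers only the injection side and is an assumption-level stand-in for, not a copy of, the SystemC simulator of [43], [45].

    #include <stdlib.h>

    /* A single-flit packet carries its destination and the cycle in which
     * it was generated, so latency can include source-queue waiting time. */
    typedef struct {
        unsigned dst;
        unsigned long gen_cycle;
    } packet;

    /* One injection attempt per port per cycle: with probability `rate`
     * (offered load in flits/cycle), emit a packet with a uniformly random
     * destination among the n_terminals memory modules. Returns 1 if a
     * packet was generated this cycle, 0 otherwise. */
    static int maybe_inject(double rate, unsigned n_terminals,
                            unsigned long cycle, packet *out)
    {
        if ((double)rand() / RAND_MAX >= rate)
            return 0;                                  /* no flit this cycle */
        out->dst = (unsigned)(rand() % n_terminals);   /* uniform destination */
        out->gen_cycle = cycle;                        /* latency starts here */
        return 1;
    }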


Fig. 9. Throughput and latency of various networks for N = 64 terminals.

E. Program Execution Time We conduct a preliminary study with real programs to demonstrate the effectiveness of the proposed MoT network in a parallel processing context. Table IV lists the applications we used. In inc-1 and inc-8 we perform an arithmetic operation (increment by one) on each element of a 256k-long array. The inc-1 program computes one element of the result array in one parallel iteration, whereas inc-8 computes 8 elements at once. In matmul-1 and matmul-2 we compute the product of two 64 × 64 matrices. In one parallel iteration, matmul-1 computes one element of the result matrix, and matmul-2 computes 2 elements. Programs inc-8 and matmul-2 are expected to generate higher traffic than inc-1 and matmul-1, respectively. In FFT, we apply a fixed-point FFT implementation to a 64k-long data array, where each parallel iteration computes the fixed-point FFT of two points, and intermediate results are stored in the memory. The programs are compiled by a development version of the XMT compiler. We configure the hardware model of the FPGA prototype of the XMT processor [42] to generate 1024 light-weight processors grouped in 64 clusters. We build two 64-terminal MoT networks using Verilog, one from the processors to the memory

TABLE IV SIMULATION RESULTS FOR EXECUTION TIME AND TRAFFIC RATE

modules and another one in the opposite direction, and integrate them into the XMT processor. We execute the compiled programs and measure the execution time in cycles as reported by the Verilog simulator. We exclude the time during which data is uploaded before execution and the result is downloaded after execution. We also measure the average traffic rate that enters the interconnection network from the processors towards the memory, by tracing the flits at the network ports. For comparison, we implement a 64-terminal replicated butterfly network in Verilog and conduct the same simulation. Table IV reports the execution time (in cycles) and traffic rates (in flits per cycle per port) on the two interconnection networks


TABLE V SINGLE SWITCH DELAY OF VARIOUS NETWORKS (IN FO4). REPLICATED BUTTERFLY AND MOT DO NOT HAVE VIRTUAL CHANNELS


for the same processor. We make the following observations: (1) On average, the MoT network accepts more traffic per cycle, for example, 69% more on the tested applications compared to the butterfly network. (2) The execution time of these applications is reduced by approximately the same amount as the increase in traffic rate. This indicates that processor-memory communication strongly impacts the execution time of the tested applications, and the use of a high-throughput MoT network improves execution time. (3) The improvement is significant, even when the corresponding BF traffic is below BF's saturation throughput (Fig. 8). This could be due to temporary bursts in traffic demand, which are more efficiently handled by MoT. (4) When we increase the traffic rate of the same program and data set (from inc-1 and matmul-1 to inc-8 and matmul-2), the performance of the system with MoT improves, whereas the performance of the system with BF slightly degrades. This could be caused by higher contention due to increased interference. F. Single Switch Delay As stated earlier in Section II-D, the long logic delay of switches in existing networks limits performance. In this section we evaluate the switch delay of the networks. For switches with virtual channels, we use the analytical results of [28]. For the MoT and replicated butterfly networks, we build Verilog models of the switches and synthesize them using Cadence tools, an ARM regular-Vt standard cell library, and IBM 90-nm (9SF) CMOS technology. We normalize all results using the FO4 delay unit, which represents the technology-independent delay of an inverter driving four identical inverters. For this technology and library the FO4 delay is 66.5 ps (at the slow process corner with 1.08 V supply voltage and 125 °C temperature). These results do not immediately translate into throughput in terms of Gbps, because they do not include wire delays. They are useful for comparing the complexity of switches, and for deciding which networks to choose for more detailed analysis. In Section IV-G, we generate full layouts and obtain clock rate and throughput in terms of Gbps. The results are summarized in Table V. For butterfly, ring, 2-D-mesh, hypercube, and fat trees, we increase the number of virtual channels. MoT and replicated butterfly do not have virtual channels. For MoT, we used the longest critical delay of the three switch primitives (Fig. 6). We assumed a 32-bit data path in our computations. Our results show that attempts to improve throughput by increasing virtual channels will increase switch delay and reduce the clock rate. The replicated butterfly can improve throughput without increasing switch delay. In [28], the routing stage of switches is not analyzed, and it is assumed to take less than 20 FO4 of delay, which used to be the typical clock period for earlier serial processors. In most networks that we consider, routing requires checking a single bit at each stage. Therefore, this is not likely to affect the critical path for switches of butterfly, hypercube, and fat trees. However, if routing takes as long as 20 FO4, the minimum switch delay for these networks will be 20 FO4. G. Post-layout Throughput

We extended the arbitration primitive to handle longer flits to support both load and store operations [43]. We created fully placed and routed layouts for the MoT network and a replicated butterfly, and measured their clock rates. We used the clock rate of that replicated butterfly as an upper bound for more complicated networks with more copies. Therefore, our results reflect the upper bound for replicated butterfly throughput. Assuming a 32-bit wide data path, we computed the throughput in terms of Gbps. In the pre-layout comparison (Fig. 8), the throughput of the replicated butterfly approaches its peak throughput of 1.0 fpc asymptotically. In the post-layout comparison, it has a clock frequency of 578 MHz, and its cumulative peak throughput is 148 Gbps. The average throughput approaches this performance as the number of copies increases. For the configuration whose number of registers is approximately the same as MoT's, the throughput is 132 Gbps. On the other hand, the MoT clock frequency is 890 MHz, providing a throughput of 206 Gbps, 39% higher than the replicated butterfly's asymptotic limit, and 56% higher than the 132 Gbps of the comparable-cost configuration. Compared to pre-layout measurements, we observe that the gap between the replicated butterfly and MoT is larger, mostly because of the difference in clock rates. H. Layout Results for Larger MoT Networks We follow the standard flow of the Cadence tools. MoT networks with different configurations are synthesized, placed, and routed. Throughput and latency are measured by Verilog simulations for higher accuracy compared to pre-layout simulations. Power consumption has been estimated based on the layout and simulated switching activity at the highest traffic rate. As expected, the power consumption grows quadratically with the number of terminals, that is, at the same rate as the number of cells. Pipelining increases power consumption by both adding more cells and increasing the operating frequency. In this study, we did not optimize for power consumption. However, typical approaches such as clock gating could reduce power consumption. Table VI summarizes our results. The letter 'p' denotes a pipelined configuration. Clock rates are reported in MHz, throughput in Gbps, area in mm², and power in mW. We measured latency at low (0.1 fpc) and high (0.9 fpc) traffic in terms of cycles. Due to constraints on computing resources, power consumption results for 32-terminal configurations are not available. Clock frequency decreases as the number of terminals increases. This is mainly caused by longer wires on the critical path. Pipelining reduces the loss of throughput, but average latency increases due to the increased number of stages between some sources and destinations. We follow a high-level heuristic approach to determine the amount of pipelining [43]. A more elaborate pipelining method could recover more throughput.
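The Gbps figures quoted in Sections IV-G and IV-H are the product of the number of ports, the flit width, the clock rate, and the sustained flits-per-cycle rate. The C helper below restates that arithmetic; the 0.9 fpc value in the example is an assumption used only to show how roughly 205 Gbps follows from 8 ports, 32-bit flits, and 890 MHz, consistent with the 206 Gbps reported for the 8-terminal MoT layout.

    /* Cumulative network throughput in Gbit/s:
     *   ports x bits_per_flit x clock_GHz x sustained flits/cycle per port.
     * Example (assumed 0.9 fpc): 8 * 32 * 0.89 * 0.9 = ~205 Gbps, in line
     * with the 206 Gbps quoted for the 8-terminal MoT layout. */
    static double throughput_gbps(unsigned ports, unsigned bits_per_flit,
                                  double clock_ghz, double avg_fpc)
    {
        return ports * bits_per_flit * clock_ghz * avg_fpc;
    }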


TABLE VI POST-LAYOUT COMPARISON OF MOT CONFIGURATIONS

V. CONCLUSION AND FUTURE WORK We introduce a particular Mesh of Trees (MoT) interconnection network and compare it with traditional networks in terms of wire area complexity, register count, maximum throughput, latency-throughput relation, and single switch delay. We generate layouts for MoT networks with up to 32 terminals, and evaluate their cost and performance. We compare the layout-accurate throughput of 8-terminal MoT and replicated butterfly networks. We taped out an 8-terminal network for fabrication with IBM 90-nm technology. The MoT network enables (i) support of easy parallel programming approaches (e.g., PRAM-like [3], [4]) that cannot be supported otherwise [49]; and (ii) feeding data to the functional units at a sufficient rate. Due to its high-throughput architecture, multiple processors (or hardware threads) can share one terminal of the MoT network without saturating the network. Recent studies show a design with 16 such processors sharing one port [42], [50]. As a result, a 64-terminal network could support 1024 processors on a single chip. A memory hierarchy similar to Fig. 1(b) was considered in [49]. However, [49] opined that while this strategy is possible for a multiprocessor-on-chip, it applies to only a very small number of processors (2–8), since the interconnection network between processors and shared first-level cache will not be able to deliver the tremendous bandwidth needed by the multiprocessors accessing it simultaneously. The current paper shows that it can, for a significantly larger number of processors. We study and compare cost-performance trade-offs with other networks. Usually, throughput can be improved by adding more hardware (registers, buffers, or virtual channels). Asymptotically, the cost of MoT seems higher than that of other networks in terms of wire area complexity and register count. However, we observed the contrary for small numbers of terminals. We showed that the MoT network is more cost-effective, at least up to 64 terminals, and provides the highest throughput among the evaluated networks. We perform experiments with single-flit packets, which represent the majority of all processor-memory communication in the underlying memory system architecture. A 64-terminal MoT provides approximately 2%, 8%, and 23% more throughput compared to the replicated butterfly, virtual-channel butterfly, and virtual-channel hypercube of comparable cost. Based on the trend, we predict that between 128 and 512

terminals replicated butterfly network may pass MoT. On the other hand, our recent study shows that MoT-Butterfly hybrid networks have lower area cost and high performance [51]. We generated and evaluated layouts of MoT networks for up to 32 terminals. We applied a heuristic pipelining method to reduce wire delays. A pipelined 32-terminal network with 32-bit wide data path provides close to 750 Gbps average throughput, and it can support up to 512 processors. We compared the layout-accurate performance of 8-terminal MoT with replicated butterfly. Post-layout throughput of MoT is 39% higher than peak throughput, and 56% higher than average throughput with comparable register cost. The difference is due to switch complexity, network architecture, and mostly due to wire delays. Pipelining and placement optimizations could reduce the gap due to wire delays; however, they cannot reduce the gap of other factors. We integrated 64-terminal MoT and BF networks with verilog model of an XMT processor with 1024 parallel hardware threads, and showed through simulation of a limited set of applications that the high performance promise of MoT reduces parallel execution time. Finally, we note that the use of MoT network is not limited to XMT processor. As the number of functional units grow in system-on-chip applications, on-chip networks will be required to satisfy the communication needs. MoT network may provide an alternative solution, where high-throughput and low-latency communication is needed. REFERENCES [1] R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, and B. Smith, “The Tera computer system,” in Proc. Int. Conf. Supercomputing, 1990, pp. 1–6. [2] P. Bach et al., “Building the 4 processor SB-PRAM prototype,” in Proc. 30th Hawaii Int. Conf. System Sciences, Jan. 1997, vol. 5, pp. 14–23. [3] U. Vishkin, S. Dascal, E. Berkovich, and J. Nuzman, “Explicit multi threading (XMT) bridging models for instruction parallelism,” in Proc. Symp. Parallel Algorithms and Architectures (SPAA), 1998, pp. 140–151. [4] D. Naishlos, J. Nuzman, C.-W. Tseng, and U. Vishkin, “Towards a first vertical prototyping of an extremely fine-grained parallel programming approach,” Theory Comput. Syst., 2003. [5] A. O. Balkan, “Mesh-of-trees interconnection network for an explicitly multi-threaded parallel computer architecture,” Ph.D. dissertation, Univ. Maryland, College Park, 2008. [6] D. J. Kuck, “A survey of parallel machine organization and programming,” Comput. Surv., pp. 29–59, 1977. [7] J. Nuzman, “Memory subsystem design for explicit multithreading architectures,” M.S. thesis, Univ. Maryland, College Park, 2003. [8] A. Gottlieb et al., “The NYU ultracomputer-designing an MIMD shared memory parallel computer,” IEEE Trans. Comput., pp. 175–189, Feb. 1983. [9] T. T. Ye and G. D. Micheli, “Physical planning for on-chip multiprocessor networks and switch fabrics,” in Proc. Application-Specific Systems, Architectures and Processors (ASAP), 2003, pp. 97–107. [10] W. J. Dally and B. Towles, Principles and Practices of Interconnection Networks. San Fransisco, CA: Morgan Kaufmann, 2004. [11] R. I. Greenberg and L. Guan, “On the area of hypercube layouts,” Inf. Process. Lett., vol. 84, pp. 41–46, 2002. [12] F. T. Leighton, Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes. San Francisco, CA: Morgan Kaufmann, 1992. [13] F. Petrini and M. Vanneschi, “Performance analysis of wormhole routed K-ary N-trees,” Int. J. Foundations Comput. Sci., vol. 9, no. 2, pp. 157–178, 1998.


[14] P. P. Pande, C. Grecu, A. Ivanov, and R. Saleh, "Design of a switch for network on chip applications," in Proc. IEEE Int. Symp. Circuits and Systems (ISCAS), 2003, vol. 5, pp. V–217.
[15] M. Kistler, M. Perrone, and F. Petrini, "Cell multiprocessor communication network: Built for speed," IEEE Micro, vol. 26, no. 3, pp. 10–23, May-Jun. 2006.
[16] S. Vangal et al., "An 80-tile sub-100-W teraflops processor in 65-nm CMOS," IEEE J. Solid-State Circuits, vol. 43, no. 1, pp. 29–41, Jan. 2008.
[17] M. B. Taylor, W. Lee, S. Amarasinghe, and A. Agrawal, "Scalar operand networks: On-chip interconnect for ILP in partitioned architectures," in Proc. Int. Symp. High-Performance Computer Architecture, 2003, pp. 341–353.
[18] P. Gratz, C. Kim, K. Sankaralingam, H. Hanson, P. Shivakumar, S. W. Keckler, and D. Burger, "On-chip interconnection networks of the TRIPS chip," IEEE Micro, vol. 27, no. 5, pp. 41–50, Sep.-Oct. 2007.
[19] S. Murali, D. Atienza, P. Meloni, S. Carta, L. Benini, G. D. Micheli, and L. Raffo, "Synthesis of predictable networks-on-chip-based interconnect architectures for chip multiprocessors," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 15, no. 8, pp. 869–880, Aug. 2007.
[20] F. Angiolini, P. Meloni, S. M. Carta, L. Raffo, and L. Benini, "A layout-aware analysis of networks-on-chip and traditional interconnects for MPSoCs," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 26, no. 3, pp. 421–434, Mar. 2007.
[21] A. Pullini, F. Angiolini, S. Murali, D. Atienza, G. D. Micheli, and L. Benini, "Bringing NoCs to 65 nm," IEEE Micro, vol. 27, no. 5, pp. 75–85, Sep.-Oct. 2007.
[22] M. Coppola, R. Locatelli, G. Maruccia, L. Pieralisi, and A. Scandurra, "Spidergon: A novel on-chip communication network," in Proc. Int. Symp. System-on-Chip, Nov. 16–18, 2004, p. 15.
[23] S. Murali, L. Benini, and G. D. Micheli, "An application-specific design methodology for on-chip crossbar generation," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 26, no. 7, pp. 1283–1296, Jul. 2007.
[24] S. Pasricha, N. Dutt, and M. Ben-Romdhane, "BMSYN: Bus matrix communication architecture synthesis for MPSoC," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 26, no. 8, pp. 1454–1464, Aug. 2007.
[25] Y. P. Zhang, T. Jeong, F. Chen, H. Wu, R. Nitzsche, and G. Gao, "A study of the on-chip interconnection network for the IBM Cyclops64 multi-core architecture," in Proc. 20th Int. Parallel and Distributed Processing Symp., Apr. 25–29, 2006.
[26] W. J. Dally, "Virtual-channel flow control," IEEE Trans. Parallel Distrib. Syst., vol. 3, pp. 194–205, Mar. 1992.
[27] C. Grecu, P. P. Pande, A. Ivanov, and R. Saleh, "Structured interconnect architecture: A solution for the non-scalability of bus-based SoCs," in Proc. Great Lakes Symp. VLSI, 2004, pp. 192–195.
[28] L.-S. Peh and W. J. Dally, "A delay model and speculative architecture for pipelined routers," in Proc. Int. Symp. High-Performance Computer Architecture, 2001, pp. 255–266.
[29] M. Mehmet-Ali, M. Youssefi, and H. Nguyen, "The performance analysis and implementation of an input access scheme in a high-speed packet switch," IEEE Trans. Commun., vol. 42, no. 12, pp. 3189–3199, Dec. 1994.
[30] F. Abel, C. Minkenberg, I. Iliadis, T. Engbersen, M. Gusat, F. Gramsamer, and R. Luijten, "Design issues in next-generation merchant switch fabrics," IEEE/ACM Trans. Networking, vol. 15, no. 6, pp. 1603–1615, Dec. 2007.
[31] G. Kornaros and Y. Papaefstathiou, "A buffered crossbar-based chip interconnection architecture supporting quality of service," in Proc. 3rd Southern Conf. Programmable Logic, Feb. 26–28, 2007, pp. 51–56.
[32] E. Upfal, "An O(log N) deterministic packet-routing scheme," J. Assoc. Comput. Mach., vol. 39, no. 1, pp. 55–70, Jan. 1992.
[33] C. P. Kruskal and M. Snir, "The performance of multistage interconnection networks for multiprocessors," IEEE Trans. Comput., vol. 32, no. 12, pp. 1091–1098, Dec. 1983.
[34] J. T. Schwartz, The Burroughs FMP Machine. New York: Courant Institute, NYU, 1980.
[35] G. R. Goke and G. J. Lipovski, "Banyan networks for partitioning multiprocessor systems," in Proc. 1st Annu. Symp. Computer Architecture, 1973, pp. 21–28.


[36] C. E. Leiserson, "Fat trees: Universal networks for hardware-efficient supercomputing," IEEE Trans. Comput., vol. 34, no. 10, pp. 892–901, Oct. 1985.
[37] C. Minkenberg, I. Iliadis, and F. Abel, "Low-latency pipelined crossbar arbitration," in Proc. IEEE Global Telecommunications Conf., 2004, vol. 2, pp. 1174–1179.
[38] A. O. Balkan, G. Qu, and U. Vishkin, "Arbitrate-and-move primitives for high throughput on-chip interconnection networks," in Proc. IEEE Int. Symp. Circuits and Systems (ISCAS), Vancouver, BC, Canada, May 2004, vol. II, pp. 441–444.
[39] F. T. Leighton, "New lower bound techniques for VLSI," in Proc. 22nd IEEE Symp. Found. Comput. Sci., 1981, pp. 1–12.
[40] A. DeHon and R. Rubin, "Design of FPGA interconnect for multilevel metallization," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 12, no. 10, pp. 1038–1050, Oct. 2004.
[41] L. P. Carloni, K. McMillan, A. Saldanha, and A. L. Sangiovanni-Vincentelli, "A methodology for correct-by-construction latency insensitive design," in Proc. IEEE/ACM Int. Conf. Computer Aided Design (ICCAD), 1999, pp. 301–315.
[42] X. Wen and U. Vishkin, "FPGA-based prototype of a PRAM-on-Chip processor," in Proc. ACM Computing Frontiers, Ischia, Italy, May 2008, pp. 55–66.
[43] A. O. Balkan, M. N. Horak, G. Qu, and U. Vishkin, "Layout-accurate design and implementation of a high-throughput interconnection network for single-chip parallel processing," in Proc. IEEE Symp. High Performance Interconnection Networks, Aug. 2007, pp. 21–28.
[44] C. D. Thompson, "A complexity theory for VLSI," Ph.D. dissertation, Carnegie Mellon Univ., Pittsburgh, PA, 1979.
[45] A. O. Balkan, G. Qu, and U. Vishkin, "A Mesh-of-Trees interconnection network for single-chip parallel processing," in Proc. Application-Specific Systems, Architectures and Processors (ASAP), 2006, pp. 73–80.
[46] A. DeHon, "Compact, multilayer layout for butterfly fat-tree," in Proc. Symp. Parallel Algorithms and Architectures (SPAA), 2000, pp. 206–215.
[47] P. P. Pande, C. Grecu, M. Jones, A. Ivanov, and R. Saleh, "Evaluation of MP-SoC interconnect architectures: A case study," in Proc. IEEE Int. Workshop on System-On-Chip for Real-Time Applications, 2004, pp. 253–356.
[48] U. Vishkin, G. Caragea, and B. Lee, "Models for advancing PRAM and other algorithms into parallel programs for a PRAM-On-Chip platform," in Handbook on Parallel Computing: Models, Algorithms, and Applications, S. Rajasekaran and J. Reif, Eds. Boca Raton, FL: CRC, 2008.
[49] D. E. Culler and J. P. Singh, Parallel Computer Architecture. San Francisco, CA: Morgan Kaufmann, 1999.
[50] X. Wen, "Hardware design, prototyping and studies of the explicit multi-threading (XMT) paradigm," Ph.D. dissertation, Univ. Maryland, College Park, 2008.
[51] A. O. Balkan, G. Qu, and U. Vishkin, "An area-efficient high-throughput hybrid interconnection network for single-chip parallel processing," in Proc. IEEE/ACM Design Automation Conf., Jun. 2008, pp. 435–440.

Aydin O. Balkan (S’00) received the B.S. degree in electrical engineering from Bogazici University, Istanbul, Turkey, in 2000, and the M.S. and Ph.D. degrees in electrical and computer engineering from the University of Maryland, College Park, in 2006 and 2008, respectively. His research interests include on-chip interconnection networks, digital VLSI design technologies, and parallel computing.



Gang Qu (S’98–A’00–M’03) received the B.S. and M.S. degrees in mathematics from the University of Science and Technology of China in 1992 and 1994, respectively, and the Ph.D. degree in computer science from the University of California, Los Angeles, in 2000. In 2000, he joined the Electrical and Computer Engineering Department, University of Maryland, College Park. In 2001, he became a member of the University of Maryland Institute for Advanced Computer Studies. His research interests include low-power system design, computer-aided synthesis, sensor networks, and intellectual property reuse and protection. Dr. Qu has received many awards for his academic achievements and service and has served on the Technical Program Committee for many conferences. Currently, he is a member of the ACM SIGDA Low Power Technical Committee.

Uzi Vishkin (SM’98) received the B.Sc. and M.Sc. degrees in mathematics from the Hebrew University, Jerusalem, Israel, and the D.Sc. degree in computer science from the Technion–Israel Institute of Technology, Haifa, in 1981. One of the founders of the field of parallel algorithms and the inventor of the explicit multithreading (XMT) architecture, his primary area of interest has been parallel computing in general and parallel algorithms in particular. He is a permanent member of the University of Maryland Institute for Advanced Computer Studies and has been a Professor of electrical and computer engineering at the University of Maryland, College Park, since 1988. He was Professor of computer science at the Technion from 2000 to 2001. Previously, he was Professor of computer science at Tel Aviv University, Israel, where he was Chair of the Computer Science Department from 1987 to 1988 and where he became a Professor in 1988. Prior to that, he was research faculty at the Courant Institute, New York University, and a postdoctoral fellow at the IBM T. J. Watson Research Center. He has authored or coauthored over 130 publications, including several patents. Prof. Vishkin is a Fellow of the ACM.
