Journal of Computer and System Sciences 79 (2013) 421–439


Scalable load balancing congestion-aware Network-on-Chip router architecture Chifeng Wang ∗ , Wen-Hsiang Hu, Nader Bagherzadeh Dept. of Electrical Engineering and Computer Science, University of California, Irvine, Irvine, CA 92697-2625, USA

Article info

Article history: Received 20 December 2010; Received in revised form 30 April 2011; Accepted 26 September 2012; Available online 12 October 2012

Keywords: Interconnection network; Network-on-Chip (NoC); Congestion-aware; Load balance; Fault tolerance

Abstract

Adaptive routing algorithms have been employed in interconnection networks to improve network throughput and provide better fault tolerance characteristics. However, they can harm performance by disturbing any inherent global load balance through greedy local decisions. This paper proposes a novel scalable load balancing congestion-aware Network-on-Chip (NoC) architecture that not only enhances network transmission performance while maintaining a feasible implementation cost, but also improves overall network throughput for various traffic scenarios. The congestion control scheme, which consists of dynamic input arbitration and adaptive routing path selection, balances global traffic load distribution so as to alleviate congestion caused by heavy network activity. Furthermore, faulty link information can be broadcast over existing congestion management control signals to prevent packets from routing through defective areas, eliminating potential heavy congestion around these regions. Experimental results show that throughput is improved dramatically while maintaining superior latency performance for various traffic patterns. Compared to a baseline router, the proposed congestion management mechanism requires negligible cost overhead but provides better throughput for both mesh and diagonally-linked mesh NoC platforms. © 2012 Elsevier Inc. All rights reserved.

1. Introduction

As semiconductor technology continues its phenomenal growth, on-chip transistor densities increase, enabling the integration of dozens of components on a single die. These components include regular arrays of processors and heterogeneous resources for System-on-Chip (SoC) design. Several multi-core integrated circuit designs, such as a 64-core SoC and an 80-core NoC architecture [6,31], have been proposed recently. The NoC interconnection scheme has been introduced as a better solution for the design of chip multiprocessors (CMPs) because of its superior performance and fault tolerance characteristics [7,11]. NoC interconnection architecture uses a distributed control mechanism, providing a scalable interconnection network. A multiprocessor system platform called Network-based Processor Array (NePA) [5], in which processors are interconnected by an on-chip two-dimensional (2D) mesh network, has recently been developed and evaluated. NePA is a deadlock-free and livelock-free network that implements the wormhole packet switching technique and utilizes an adaptive minimal routing algorithm. Because of the limitations of the traditional 2D mesh topology, additional alternative routing resources which provide more network tolerance are employed to further improve the performance of the NePA architecture. Diagonal links for the 2D mesh network are proposed because of the emergence of the X-architecture routing technique in chip manufacturing [17,28]. These X-architecture articles demonstrated that wire length, via count and core area are reduced by adopting X-architecture routing. If five metal layers are used for layout, layout inefficiency results when

* Corresponding author. E-mail addresses: [email protected] (C. Wang), [email protected] (W.-H. Hu), [email protected] (N. Bagherzadeh).

0022-0000/$ – see front matter © 2012 Elsevier Inc. All rights reserved. http://dx.doi.org/10.1016/j.jcss.2012.09.007


only the orthogonal wiring rule is enforced. This situation can be improved by adopting layout rules with two diagonal and three orthogonal wiring layers. Implementation examples are also given to validate diagonal routing technology. There are more than twelve metal layers in the proposed new technologies, and those additional layers could be used to accommodate our Diagonally-linked Mesh (DMesh) NoC architecture without impacting the IP core connectivity of metal layers. DMesh employs diagonal express links between routers on a baseline mesh network. These diagonal links not only reduce the distance between source and destination nodes but also alleviate traffic congestion in the network, so that network performance is enhanced dramatically [34,35]. Adaptive routing algorithms have been adopted in interconnection networks as a means to improve network performance and to tolerate link faults or router failures. The key shortcoming of these methods is that they neglect global network status in distributed parallel networks, which leads to performance degradation because packets are routed according to local network information only. Without sufficient network congestion information, such methods tend to upset global load balance in some high-workload scenarios. To exploit available routing resources more effectively, we propose a scalable load balancing congestion control scheme that provides intelligent routing arbitration control on top of the target baseline adaptive routing algorithm; it can adaptively balance traffic load and increase overall NoC throughput by efficiently allocating existing resources and bypassing congested areas. The proposed routing scheme can be easily extended to routers with virtual channels (VCs) [9] or parallel buffers (PBs) [3].
The congestion index broadcasting mechanism can further incorporate information about faulty links or damaged routers into the existing congestion control signal lines, so as to enhance routing performance and provide fault tolerance capability. The implementation results of the proposed mechanism also demonstrate negligible logic and moderate wiring overhead. The major contributions of this work are as follows:

• Introduced a load-balanced congestion-aware routing scheme which effectively resolves upstream flow congestion and adaptively directs packets toward their destinations based on downstream network status. This approach manages transmission tasks efficiently and optimizes resource utilization by intelligently exploiting the routing resources of NoC platforms.
• Introduced a low-wiring-cost, scalable congestion-aware routing architecture that adjusts to different NoC platforms and router designs employing various numbers of ports or multiple channels per port.
• Incorporated link and router fault-tolerant routing into the proposed congestion-aware routing scheme to bypass faulty regions and potentially highly congested areas.

This work designed and evaluated a scalable congestion-aware high-throughput router architecture for 2D mesh NoCs. It first introduced the NePA and DMesh platforms, which achieve higher throughput and better performance [35]. Beyond the baseline router architecture, an enhanced parallel buffer was also introduced and integrated to further enhance transmission performance. Deploying the Congestion-Aware (CA) technique can dramatically improve throughput as well as performance, especially for designs with adaptive routing that have additional routing resources, such as the NePA and DMesh platforms. Because more alternative routing resources are employed in both platforms, a lightweight CA mechanism capable of managing multiple ports/buffers and abundant routing resources should be considered and designed for these NoCs. As the experimental results demonstrate, DMesh shows the most significant improvement when CA is employed, owing to its extra routing flexibility. The proposed CA mechanism can be seamlessly employed in traditional NoCs with five-port routers, but the performance improvement ratio might not be as significant as for NePA or DMesh because of their routing resource limitations.
This effective load balancing routing scheme not only improves transmission latency but also enhances network capacity. Section 2 gives an overview of the NePA and DMesh high-performance NoC router architectures and summarizes related congestion management work in on-chip interconnection networks. Section 3 proposes the congestion-aware router scheme. Section 4 presents the performance and cost evaluation of the proposed architecture. Section 5 provides concluding remarks.

2. Background and related work

2.1. NePA system platform

The NePA platform is a scalable, flexible and reconfigurable multiprocessor architecture which meets high-performance and low-power requirements. NePA implements the wormhole packet switching technique and is based on a 2D mesh topology, as shown in Fig. 1. Each node in NePA consists of a router and a local IP, which can be a CPU, DSP, memory block, or application-specific logic. The router is connected to its four neighboring routers via six bidirectional links: two horizontal links and four vertical links. A key feature of the six-link architecture is the use of two separate vertical links to construct a deadlock-free network [4]. The NePA network is actually composed of two disjoint subnetworks. One subnetwork is responsible for delivering eastbound packets while the other handles westbound packets. Once packets are injected into the network, they can only travel in one of the subnetworks. Therefore, there are no cycles in the resource dependence graph [8], preventing deadlocks from happening. To increase network performance and provide fault-tolerant routing ability, NePA utilizes an adaptive XY routing scheme following a minimal path policy. When


Fig. 1. Example of 4 × 4 NePA platform with two additional vertical links.

Fig. 2. Example of 4 × 4 DMesh platform with diagonal connections. Dashed lines indicate the eastbound subnetwork (E-subnet); solid lines indicate the westbound subnetwork (W-subnet).

an output port is congested or the output buffer is full, the router selects an alternative output port for packets. Therefore, link utilization is well balanced and network performance improves accordingly.

2.2. DMesh system platform

The DMesh network [34] is constructed by integrating diagonal links into NePA, as presented in Fig. 2. Four additional bidirectional links connect each router with its diagonal neighbors in order to provide shortcuts and enrich routing resources. Additionally, there are three ports for connecting with the local processing element (PE): IntE, IntW and Int. Fig. 3 depicts the input and output ports of NePA and DMesh routers. The DMesh network is composed of two subnetworks, the E-subnet and the W-subnet, represented with dashed arrows and solid arrows in Fig. 2, respectively. The E-subnet is responsible for transferring eastward packets while the W-subnet handles westward traffic. When a source PE starts packet transmission, it injects packets into the network via the IntE or IntW port, depending on the direction of the destination PE. The IntE port injects packets into the E-subnet and the IntW port serves the W-subnet. Subsequently, packets traverse one of the subnetworks to their destinations. When packets arrive at the destination node, they are ejected from the Int port. DMesh inherits the deadlock-free routing scheme and provides express links to shorten transmission latency and resolve congestion under high traffic load. The benefit of adaptive routing becomes significant owing to these added diagonal links.

2.3. Related work

Adaptive routing has been employed as a means to improve network performance by balancing traffic load and to tolerate link or router failures by adjusting routing paths based on network conditions [8]. Congestion control mechanisms have been proposed to avoid saturation and improve throughput in NoCs.
Congestion-aware adaptive routing helps explore the routing space and evens out traffic load distribution over the network. Different congestion detection and management mechanisms have been proposed to improve transmission performance.


Fig. 3. Router ports description.

Buffer status and link utilization are two general metrics for evaluating network congestion. Congestion detection and management strategies differ in their routing decisions and implementation cost. A self-optimized routing strategy uses buffer load information as the congestion index to select a favorable path for incoming packets [29]. A proximity congestion awareness technique avoids congested areas based on stress values passed from neighboring switches [23]. Both techniques attempt to divert packets from hot spots in the network based on buffer occupancy. A congestion control strategy which uses link utilization to perform traffic prediction has been proposed to enable efficient routing resource allocation [30]. Buffer availability from adjacent routers [19] and output queue length [25] are other examples of employing buffer status as a congestion metric. Mechanisms using buffer information only from the local router achieve less improvement, although they need no extra wires to broadcast congestion status between routers. Mechanisms with congestion information from neighboring routers, however, usually need extra logic to detect buffer occupancy or considerable wiring overhead to broadcast the congestion index. One scheme uses the number of free VCs at an output port as a contention metric, with the routing algorithm favoring the port with the most available VCs [10]. The authors compared it to congestion-oblivious routing methods and showed that congestion awareness yields lower latency and better throughput. A contention-aware input selection algorithm which gives priority to incoming packets from congested areas was proposed [36] in order to alleviate congestion in upstream areas. These two schemes utilize only input or only output congestion status to decide routing paths, which might decrease load balancing effectiveness.
Regional Congestion Awareness (RCA) routers [14] use non-local information to improve dynamic load balancing and validated that regional congestion awareness can enhance adaptive routing performance. A similar design also uses regional congestion information to guide routing path decisions [13]. The weighted function between local and non-local congestion indices and the congestion aggregation circuit might impact router cost and performance. Besides, the RCA mechanism simply detects congestion from the buffer status of downstream routers, and design overhead increases dramatically when RCA with quadrant congestion information is employed. Injection throttling has been proposed to improve network throughput, especially under high workload, by limiting the injection of new packets [16,24]. This methodology raises fairness issues, making it infeasible as a generic method for versatile applications. A fault tolerance mechanism for handling permanent and transient failures has been proposed to improve the reliability of NoC designs [1]. Its deterministic routing, routing table usage and table update overhead could hinder its adoption in low-cost and low-power NoC designs. On-chip stochastic communication mitigates the effect of faulty components by using a probabilistic broadcasting algorithm [12,22]. Although low latency can be achieved, duplicating packets increases router processing overhead and deteriorates energy efficiency, which prohibits its use in low-power NoC designs. Adaptive routing can potentially provide more robustness than deterministic routing. Enhanced Dynamic XY routing (EDXY) benefits from extra links that broadcast global traffic information throughout the network, and it extends the congestion index to carry faulty link information, allowing it to tolerate a single link failure [21].
In order to design a lightweight scalable CA router, we devised an approach based on dynamic input port arbitration to resolve congestion and adaptive output path selection to distribute traffic load efficiently. CA routers can effectively enhance network throughput by using simple congestion index statistics to determine network congestion status, so that implementation complexity is reduced. This method can be easily adjusted for routers with single or multiple buffers per input port and extended to incorporate faulty-status broadcasting with minimal cost overhead. Non-minimal adaptive routing has the potential to further improve load balance at the cost of increased design complexity and possibly higher transmission latency [10,25]. Therefore, this work focused its evaluation on minimal adaptive routing, but the general principles of the proposed CA mechanism could be applied to networks employing non-minimal routing algorithms.


Fig. 4. Microarchitecture of congestion-aware router in DMesh networks.

3. Congestion-aware router architecture

In order to measure the congestion situation, the transmission status of the network has to be monitored on-line. A simple and effective congestion detection method is used to log congestion status and inform connecting routers. Conventional congestion control algorithms usually use buffer status or link status as an index indicating the existence of network congestion. For example, buffer utilization is used to balance the traffic [29], and the number of flits stored in a buffer is adopted to adjust the routing policy [24]. In the proposed CA router design, instead of observing buffer or link usage, network congestion is detected by inspecting the input ports of a router [33]. We use the number of blocked input ports as the congestion index. A blocked input port is defined as an input port with an incoming packet that is not routed by the router. This method helps reduce hardware cost because only simple logic is added to the router arbiter to count the number of blocked input ports. If buffer or link status were used, we would instead need to attach a monitor to each of the 10 buffers or 10 links in DMesh, which leads to higher implementation cost. Buffer occupancy detection worsens the overhead further in router architectures with multiple parallel channels per input port. As will be described later, our studies have shown that the proposed scheme is effective in congestion detection. Each router is divided into two sub-routers, so it needs two congestion indices, C_E and C_W, for the E-router and W-router respectively. C_E and C_W are generated on a cycle-by-cycle basis and sent to neighboring routers via dedicated signal wires, where they are used in the next clock cycle. The microarchitecture of the DMesh CA router with a single FIFO per input port is presented in Fig. 4. DMesh CA routers have two separate sub-routers for processing traffic toward the E-subnet and W-subnet output ports.
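As a minimal sketch of the congestion index computation described above (the data layout and names are illustrative assumptions, not taken from the paper): each sub-router simply counts its blocked input ports once per cycle.

```python
def congestion_index(input_ports):
    """Count blocked input ports for one sub-router.

    Each entry is a (has_waiting_packet, was_routed_this_cycle) pair;
    a port is blocked when it holds a packet that was not routed.
    This tuple representation is an illustrative assumption.
    """
    return sum(1 for has_pkt, routed in input_ports if has_pkt and not routed)

# C_E and C_W are computed independently, one per sub-router:
c_e = congestion_index([(True, False), (True, True), (False, False)])  # one blocked port
```

Because this is a simple population count over per-port status bits, the hardware equivalent is a small adder tree in the arbiter rather than per-buffer occupancy monitors.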
There is a FIFO associated with each input port. The Header Parsing Unit (HPU) extracts destination information from the header flit, and the Arbiter Logic (AL) decides the routing path, performs arbitration and manages the crossbar switch. The switching unit takes care of packet traversal from input FIFOs to output links. The Congestion Index (CI) unit counts the input ports whose packets are not serviced and passes the congestion index to neighboring routers. The following sections demonstrate in detail how these two indices are used as a decisive factor in the congestion control algorithms.

3.1. Dynamic input port arbitration

The purpose of dynamic input port arbitration is to alleviate traffic congestion by allowing packets coming from hot spots to move first. Resource contention in the congested region is reduced; the congestion management mechanism therefore relieves congestion in hot spots, so that transmission tasks are load-balanced and link utilization also improves accordingly. To describe our approach in more detail, the CA routing procedure is shown in Algorithm 1. For a router with M input ports and N output ports, conceptually we declare an input priority queue that stores the input ports according to a specified priority scheme. The priority scheme can be fixed, round-robin, first-come first-served, congestion-based and so on. At the beginning of the routing procedure, we establish a priority queue PQ_i. Each input port has a key associated with it, and these keys decide the priority of each port. After PQ_i is constructed, the procedure makes routing decisions from the output port point of view. That is, it first picks an output port and


Algorithm 1 CA routing for single-FIFO routers
Input: Packet_i, Congestion_idx_in
Output: Congestion_idx_out
PQ_i: a priority queue that holds M input ports
PQ_o: a priority queue that holds N output ports
P_i: input port index
P_o: output port index
1: Generated packets are injected into either E-subnet or W-subnet
2: Construct PQ_i by Congestion_idx_in for nonempty input FIFOs
3: Construct PQ_o by Congestion_idx_in for available output links
4: for all P_i in PQ_i do
5:     for all P_o in PQ_o do
6:         if input-output pair is available then
7:             route packet from selected input to output port
8:             remove P_i and P_o from PQ_i and PQ_o
9:         else
10:            restore P_i and P_o into PQ_i and PQ_o
11:        end if
12:    end for
13: end for
14: Calculate congestion index from input ports with waiting packets for both subnetworks
15: Transmit Congestion_idx_out to neighboring routers
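A minimal executable sketch of one arbitration pass of this procedure, under stated assumptions: the priority queues are modeled as sorted lists, and the candidate-output map (which outputs lie on a packet's minimal paths) is an illustrative input rather than something computed here.

```python
def ca_arbitrate(inputs, outputs, candidates):
    """One CA arbitration pass over a router (a sketch of Algorithm 1).

    inputs:     {input_port: upstream congestion index} for nonempty FIFOs
    outputs:    {output_port: downstream congestion index} for free links
    candidates: {input_port: set of output ports on minimal paths}
    Returns {input_port: granted_output_port} for this cycle.
    """
    pq_i = sorted(inputs, key=inputs.get, reverse=True)  # most congested upstream first
    pq_o = sorted(outputs, key=outputs.get)              # least congested downstream first
    free = set(pq_o)
    grants = {}
    for p_i in pq_i:
        for p_o in pq_o:
            if p_o in free and p_o in candidates.get(p_i, ()):
                grants[p_i] = p_o   # route packet from p_i to p_o
                free.remove(p_o)    # the pair leaves the queues for this cycle
                break
    return grants
```

For example, an input port whose upstream router reports the highest congestion index is served first and takes the least congested output it can legally use; a lower-priority input contending for the same output waits until the next cycle.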

Fig. 5. The candidate paths of different source-destination configurations in DMesh networks.

chooses an input port which has packets intended for this output port. This choice is based on the priority of the input ports. If multiple input ports demand the same output port, the one with the highest priority gets access. For input port selection, the congestion indices sent from neighboring routers are used to decide the priority. Each input port is associated with a congestion index from its upstream router. For example, the N1 input port of Router(1, 1) in Fig. 2 uses the C_E from Router(1, 0) as its key because N1 belongs to the E-subnet, and the NE input port of Router(1, 2) uses the C_W from Router(2, 1). A larger key, which means higher congestion, gives the input port higher priority.

3.2. Adaptive path selection scheme

The routing path is decided in a distributed manner for NePA and DMesh. NePA adopts adaptive minimal routing, but for DMesh each router chooses the next hop for a packet following a quasi-minimal rule.1 To explore more available paths, CA adaptive path selection is utilized. Fig. 5 depicts the candidate paths for various source-destination combinations in the E-subnet of DMesh. The candidate paths in the W-subnet are symmetric to those in the E-subnet. Take Fig. 5(a) for example: the destination is located to the east of the source, and there are three candidate next hops, node a, b, or c. The original NePA and DMesh routing algorithms adopt a fixed selection scheme to choose the next hop without considering network status. In the proposed scheme, the congestion indices (C_E and C_W) are utilized as keys to construct PQ_i and PQ_o in Algorithm 1. Therefore, a less congested output node will be selected. Alternative paths enrich routing resources and serve as escape channels, forwarding packets toward lightly loaded routers. In the case of Router(1, 1) sending packets to Router(2, 2) in Fig. 2, Router(1, 1) receives congestion indices from Router(2, 1), Router(1, 2) and Router(2, 2).
PQ_o is sorted according to the gathered congestion information, and the candidate with the smallest congestion index is selected.

3.3. Adaptive CA routing scheme enhancement

This simple congestion detection mechanism and routing algorithm can be easily extended to input ports with multiple parallel channels and to error message transfer between neighboring routers.

1 The rule balances traffic load and mitigates congestion situations so as to maintain optimal communication performance. Diagonal links are considered first, and packets use alternatives if the diagonal links are taken or congested [34].
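The quasi-minimal rule in the footnote can be sketched as follows; treating any nonzero congestion index as "congested" is an assumption of this sketch, as are the port names.

```python
def quasi_minimal_next_hop(diagonal, orthogonals, congestion, taken):
    """Quasi-minimal selection sketch: the diagonal shortcut is considered
    first; if it is taken or congested, fall back to the least congested
    free orthogonal candidate.  'Congested' here means a nonzero index,
    which is an assumption of this sketch."""
    if diagonal is not None and diagonal not in taken and congestion.get(diagonal, 0) == 0:
        return diagonal
    free = [p for p in orthogonals if p not in taken]
    return min(free, key=lambda p: congestion.get(p, 0)) if free else None
```

With a free, uncongested diagonal the shortcut wins; otherwise the packet still makes minimal progress through the best orthogonal link.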


Fig. 6. Enhanced parallel buffer architecture for NePA E-subnet.

3.3.1. Multiple buffers in each port

To enhance network channel utilization while keeping design overhead moderate, a routing-independent parallel buffer structure and its management scheme have been proposed [3], as shown in Fig. 6. Each added channel keeps the merit of the adaptive minimal routing strategy instead of being mapped to dedicated outputs in a fixed pattern. This scheme works independently to explore more routing resources, so that efficient channel utilization and maximum throughput are achieved. The proposed architecture maintains the routing flexibility to deliver packets along less congested paths; packets can therefore bypass blocked output ports while still heading to their destinations along minimal routing paths. For multiple-buffer router architectures, although complex congestion management can enhance performance and throughput, it also significantly increases the overhead of the congestion detection and routing arbitration mechanisms. Generally, free buffer length and crossbar information serve as congestion information for the arbiter to make routing decisions. Furthermore, to enhance congestion control efficiency, this information is broadcast to neighboring routers to reinforce congestion control [14]. But these messages require many control signals between neighbors, resulting in considerable overhead. Instead of collecting buffer length information, the count of free buffers or VCs is a better solution for dramatically lowering the wiring cost. To take full advantage of this approach, the modified CA routing mechanism can reduce the wiring requirement further, as illustrated in Algorithm 2. The congestion index is calculated from active FIFOs with waiting packets. A higher value indicates more waiting packets in the current router, and this information is delivered to its neighbors: four routers in NePA networks and eight routers in DMesh networks.
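The control-signal cost of broadcasting per-VC buffer lengths, as discussed above, can be tallied with a small helper; the configuration values used below (2 bits per 4-flit VC, 4 VCs per port) are the ones this paper uses in its cost comparison, while the function itself is only an illustrative sketch.

```python
def buffer_status_bits(bits_per_vc, num_vcs, inter_router_ports):
    """Control bits per neighbor link when every VC advertises its
    available buffer length using bits_per_vc bits."""
    return bits_per_vc * num_vcs * inter_router_ports

mesh_5port = buffer_status_bits(2, 4, 4)    # 32 bits per link (4 inter-router ports)
nepa_7port = buffer_status_bits(2, 4, 6)    # 48 bits per link (6 inter-router links)
dmesh      = buffer_status_bits(2, 4, 10)   # 80 bits per link (10 inter-router links)
```

Multiplying the 5-port figure by its four neighbor links gives the 128-bit total quoted in the text, which is what makes the blocked-port count (a few bits per subnetwork) attractive by comparison.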
For the case of a router with four VCs per port, two bits are required to indicate available buffer length when each VC holds four flits. The router has to use 32 (2 × 4 × 4) bits of control signals to send this information to one of its neighbors, for 128 bits in total in a 5-port NoC network. Since NePA employs a 7-port router, 48 (2 × 6 × 4) bits are needed per link. This number becomes 80 bits (2 × 10 × 4) for each neighboring router in DMesh networks, making this approach impractical compared with the number of bits in each flit. The congestion metric can be compressed by using free VC and buffer counts, requiring only 8 or 16 bits per link to identify the congestion status [14]. For the proposed modified CA scheme, only 4 bits are needed for 5-port routers, which is half of the compressed method, 8 bits for NePA routers (4 bits per subnetwork) and 10 bits for DMesh routers (5 bits per subnetwork). Thus the proposed CA mechanism can significantly reduce the wiring overhead of passing congestion information to neighbors.

3.3.2. Link/Node fault tolerance

On-line fault detection in NoC communication fabrics has been introduced to distinguish between faults in the communication links and faults in the NoC switches [15]. It requires minimal modification to enable NoC routers to detect faults in a timely manner. This scheme works seamlessly with the routing algorithm to prevent packets from traversing faulty links and partially damaged routers, which are also part of highly congested areas. The implementation can easily be achieved by integrating this information into the congestion index or by adding extra wires to indicate specific error patterns, so that routing arbitration can adjust accordingly.


Algorithm 2 Modified fault tolerance CA mechanism for multiple channels
Input: Packet_i, Congestion_idx_in
Output: Congestion_idx_out
CHnum_i: nonempty parallel channel count
PQ_i: a priority queue that holds M input ports
PQ_o: a priority queue that holds N output ports
P_i: input port index
P_o: output port index
1: Generated packets are injected into either E-subnet or W-subnet
2: Construct PQ_i by Congestion_idx_in for input ports with nonempty channels
3: Construct PQ_o by Congestion_idx_in for available output links
4: Resort PQ_i and PQ_o according to faulty link or router status
5: for all P_i in PQ_i do
6:     for all CHnum_i in each P_i do
7:         for all P_o in PQ_o do
8:             if input-output pair is available then
9:                 route packet from selected input to output port
10:                remove P_i and P_o from PQ_i and PQ_o
11:            else
12:                restore P_i and P_o into PQ_i and PQ_o
13:            end if
14:        end for
15:    end for
16: end for
17: Calculate congestion index from parallel channels with waiting packets for both subnetworks
18: On-line link and router fault monitoring and detection
19: Update faulty link and router information in the congestion index
20: Transmit Congestion_idx_out to neighboring routers

Fig. 7. Link/Node fault tolerance routing examples.

Similar to EDXY [21], the proposed fault tolerance CA routing can tolerate more link or router failures as long as routing paths remain available between source-destination pairs. Routers are assumed to be equipped with fault investigation units that signal when faulty links or routers are detected. This faulty link information is integrated into the congestion information and broadcast outward. As described in Algorithm 2, routing candidates with faulty links are degraded to the lowest priority in PQ_o, while input ports fed from faulty regions are upgraded to the highest priority in PQ_i when making routing path decisions. The more precise the delivered fault information, the more sophisticated the routing policy that can be adopted to reorder PQ_i and PQ_o. With the modified fault tolerance CA routing scheme, packets from faulty areas tend to traverse healthy routers toward their destinations. Examples are shown in Fig. 7: only horizontal and vertical candidates can be selected for NePA routers, but for DMesh routers diagonal links serve as alternative paths while still maintaining the minimal-path routing spirit. In Fig. 7(a), the faulty link status at node b is broadcast to nodes S1 and S2, which both select alternative paths that avoid the faulty region. In Fig. 7(b), there are three different minimal paths between nodes S and D. The faulty link status is passed to node S, so it chooses other alternatives to forward the packets.

3.4. CA router variants

The routing path decision logic can flexibly apply congestion-status-based arbitration at the input ports, the output ports, or both, so there are different variants of the CA router. CA_INPORT: As in Algorithm 1, the congestion indices of the input ports are stored in PQ_i. In the routing arbitration process, an input port with a higher congestion index has higher priority to use a given output port.
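The priority reordering step of Algorithm 2 can be sketched as below; modeling the priority queues as ordered lists and the specific port names are illustrative assumptions.

```python
def reorder_for_faults(pq_o, pq_i, faulty_outputs, faulty_inputs):
    """Fault-aware reordering sketch: output candidates that lead toward
    faulty links drop to the lowest priority, while input ports fed from
    faulty regions rise to the highest priority.  Queues are
    priority-ordered lists (highest priority first)."""
    new_pq_o = ([p for p in pq_o if p not in faulty_outputs] +
                [p for p in pq_o if p in faulty_outputs])
    new_pq_i = ([p for p in pq_i if p in faulty_inputs] +
                [p for p in pq_i if p not in faulty_inputs])
    return new_pq_o, new_pq_i
```

Faulty candidates are demoted rather than removed, so a packet can still use a degraded path when no healthy minimal-path alternative remains.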

C. Wang et al. / Journal of Computer and System Sciences 79 (2013) 421–439

429

Fig. 8. Average latency of CA variants in 8 × 8 DMesh networks.

CA_OUTPORT: Instead of storing congestion indices for input ports, the CA_OUTPORT scheme gathers the congestion indices of the output ports and stores them in PQ_o. During arbitration, an output port with a higher congestion index is given lower weight in granting routes, to prevent congestion from getting worse. CA_HYBRID: CA_HYBRID combines the priority selection schemes of CA_INPORT and CA_OUTPORT to ensure that packets coming from more congested regions are routed toward less congested regions. It first selects the output port with the smallest congestion index and then chooses, by input-port priority, an input port with a packet destined for that output. The implementation cost of CA_HYBRID is much larger than that of CA_INPORT or CA_OUTPORT, because arbiters with double priority selection are considerably more complex and require more logic. We evaluated network performance for these variants; the simulation environment is described in Section 4. As illustrated in Fig. 8, CA_OUTPORT only routes packets toward less congested regions with a fixed input priority, without first serving inputs from highly congested regions, so it cannot effectively relieve congestion arriving from upstream. CA_HYBRID performs best because it has the most information about network congestion status. CA_INPORT performs comparably to CA_HYBRID because it effectively resolves upstream congestion before it propagates downstream. CA_INPORT is therefore chosen as the most efficient CA router algorithm, lowering design overhead while keeping complexity manageable.

4. Evaluation

4.1. Experimental setup

The NoC platform employing CA routers is modeled with a SystemC-based cycle-accurate simulator.
With the simulator, various network configurations such as network size, topology, buffer size, routing algorithm, priority scheme for router arbitration, and traffic patterns can be set up to evaluate NoC performance (see Table 1). The traffic generator produces four synthetic traffic patterns: Random, Bit complement, Bit reverse, and Matrix transpose [8]. These patterns define the spatial distribution of packets. A temporal distribution is applied to transmitted packets via a self-similar traffic generation technique. Self-similar traffic has been observed between on-chip modules in MPEG-2 video applications [32] and in conventional computer networks [20]. Researchers have shown that self-similar traffic can be generated by aggregating a large number of packet sources that exhibit a long-range dependence property [27]. The modeling method proposed by Avresky et al. [2] is used to produce the self-similar traffic: an ON/OFF state is imposed on each source node to control traffic generation during simulation, with the ON/OFF periods determined by the Pareto distribution:

F(x) = 1 − x^{−α},   1 < α < 2.   (1)

The ON and OFF durations are computed from the shape parameters α_ON and α_OFF as

T_ON = U^{−1/α_ON},   T_OFF = U^{−1/α_OFF},

where U is a uniformly distributed value in the range (0, 1]. The simulation sets α_ON = 1.9 and α_OFF = 1.25.
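The ON/OFF durations above follow from inverse-transform sampling of the Pareto distribution. A minimal sketch, using the shape parameters from the paper (the function name and RNG handling are ours):

```python
import random

ALPHA_ON, ALPHA_OFF = 1.9, 1.25  # shape parameters used in the simulation

def pareto_duration(alpha, rng=random):
    """Inverse-transform sample: U uniform in (0, 1], T = U**(-1/alpha)."""
    u = 1.0 - rng.random()        # random() is in [0, 1); map to (0, 1]
    return u ** (-1.0 / alpha)

# A source alternates between ON periods (injecting packets) and OFF periods (idle).
rng = random.Random(42)
t_on = pareto_duration(ALPHA_ON, rng)
t_off = pareto_duration(ALPHA_OFF, rng)
```

Because U lies in (0, 1] and the exponent is negative, every sampled duration is at least 1; the smaller α_OFF makes long idle periods heavier-tailed than the ON periods, which produces the bursty, long-range-dependent injection process.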


Table 1
Simulation parameters.

Characteristics          | Baseline       | Variations
-------------------------|----------------|--------------------------
Topology                 | 2D mesh        | –
Network size             | 8 × 8          | 4 × 4, 16 × 16, 32 × 32
Routing                  | CA adaptive    | fixed, local, arithmetic
Parallel buffers/port    | 1              | 2, 4 PBs
Buffer size (flits)/PB   | 4              | 2, 8, 16
Packet length (flits)    | 4              | 1, 8, 16
Flit size                | 64 bits        | –
Traffic workload         | Synthetic      | 3-tuple
Simulated period         | 100,000 cycles | –

Table 2
3-tuple traffic categories and characteristics.

Category | Burstiness | Injection  | Hop distance
---------|------------|------------|-------------
HgHsLc   | High       | Hot-spot   | Local
MdHsLc   | Moderate   | Hot-spot   | Local
HgEoLc   | High       | Evened-out | Local
MdEoLc   | Moderate   | Evened-out | Local
HgHsGb   | High       | Hot-spot   | Global
MdHsGb   | Moderate   | Hot-spot   | Global
HgEoGb   | High       | Evened-out | Global
MdEoGb   | Moderate   | Evened-out | Global

To emulate practical application traffic, the 3-tuple traffic generation technique [26] extracts spatial and temporal distribution behavior. It is composed of three traffic behavior description functions: burstiness, injection distribution, and hop distance. The first factor models how often packet bursts are injected into network routers and how large these bursts are; the Hurst parameter, 0.5 < H ≤ 1, defines the level of self-similarity, and the closer H is to 1 the higher the level of burstiness. The second factor models how packet injection streams are distributed among the nodes; σGauss decouples the network size from the actual packet injection distribution, and a smaller σGauss models a more hot-spot injected traffic pattern. The last factor models how far packets travel from source to destination. The resulting scenarios are categorized as {HgHsLc, MdHsLc, HgEoLc, MdEoLc, HgHsGb, MdHsGb, HgEoGb, MdEoGb}, listed in Table 2. These groupings are permutations of moderately (Md) versus highly (Hg) bursty traffic, hot-spot (Hs) versus evened-out (Eo) injection, and local (Lc) versus global (Gb) traffic. Highly bursty traffic is set with H = 0.9 and moderately bursty traffic with H = 0.65. Hot-spot injection is modeled as 10% of the nodes receiving 68% of the total injected traffic, while evened-out injection has 20% of the nodes receiving the same portion. For local traffic only 20% of the total injected traffic traverses distances greater than four hops; for global traffic this value rises to 40%. This traffic modeling approach provides the flexibility to adjust the corresponding factor coefficients to emulate specific application traffic. An open-loop network measurement setup was used [8].
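The eight category labels in Table 2 are simple concatenations of the three factor abbreviations. A small helper makes this explicit; the function name and the threshold test on H are our illustration, grounded only in the two H values quoted above.

```python
def tuple_label(hurst, hot_spot, is_global):
    """Build a 3-tuple category label from the three traffic factors.

    hurst:     Hurst parameter, 0.5 < H <= 1 (H = 0.9 highly bursty, 0.65 moderate)
    hot_spot:  True for hot-spot injection (10% of nodes get 68% of traffic),
               False for evened-out injection (20% of nodes get the same share)
    is_global: True if 40% of traffic travels more than four hops, False if 20%
    """
    burst = "Hg" if hurst >= 0.9 else "Md"
    inject = "Hs" if hot_spot else "Eo"
    reach = "Gb" if is_global else "Lc"
    return burst + inject + reach

label_a = tuple_label(0.9, True, False)   # highly bursty, hot-spot, local
label_b = tuple_label(0.65, False, True)  # moderately bursty, evened-out, global
```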
After packets are generated, they are stored in an infinite queue at the source node and wait until they are injected into the network. This mechanism isolates packet generation from network behavior, i.e. packet generation is independent of the network condition. Each simulation executes 10,000 clock cycles for warm-up and then continues for 100,000 cycles, during which router performance and power consumption measurements are conducted.

4.2. Performance evaluation

Performance evaluation is based on transmission time and traffic load among source-destination pairs; latency and throughput are the major performance criteria of NoC platforms. For comparison, we implemented various arbitration schemes, including algorithm-based arbitration and the CA arbitration algorithm, on the NePA and DMesh platforms.

4.2.1. Single buffer router architecture performance

To validate the CA router architecture, we compared the proposed CA scheme with basic arbitration schemes including Round-Robin (RR), First-Come First-Served (FCFS), and uniform RANDOM, as shown in Figs. 9 and 10. The results demonstrate that the CA scheme improves transmission latency and achieves better throughput than these basic arbitration schemes. It was also observed that exploiting current congestion information impacts routing path selection favorably, balancing traffic load and reducing congestion. With the congestion control scheme, transmitted flits eventually find available paths to their destinations within acceptable time; the control method improves link utilization and prevents the network from excessive congestion. We also compared CA with other known congestion control mechanisms such as Input Port Selection based on Buffer Status


Fig. 9. Average latency of CA and standard arbitration schemes in 8 × 8 DMesh networks for synthetic traffic patterns (FIFO depth = 4).

Fig. 10. Throughput of CA and standard arbitration schemes in 8 × 8 DMesh networks for synthetic traffic patterns (FIFO depth = 4).

(IPSBS) and Output Port Selection based on Link Status (OPSLS). IPSBS uses input-port buffer usage status to select an input candidate, and OPSLS uses output-port link congestion status to adjust the routing path. In DMesh, CA consistently outperforms the other methods, including fixed priority, IPSBS, and OPSLS. In NePA, CA prevails for the Random and Bit complement patterns. For the Bit reverse and Matrix transpose patterns, which feature large portions of diagonal source-destination pairs, routers with congestion control can adaptively choose less congested paths and thus cause less congestion. CA, IPSBS, and OPSLS all demonstrate similar improvement over the fixed-priority schemes. Because NePA has limited routing flexibility and resources to choose from, IPSBS and OPSLS, which have detailed buffer status and link utilization information, yield slightly better improvement than the simplified CA mechanism, especially for the Matrix transpose pattern. When more routing resources are available, as in DMesh, the simplified CA not only improves performance but also reduces implementation cost. As illustrated in Figs. 11 and 12, networks with congestion control have equal or shorter latency compared with the original fixed-priority routing arbitration, especially under heavy traffic load.
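To make the contrast between the two baseline mechanisms concrete, here is a rough sketch of their selection rules; the function names, port labels, and occupancy values are hypothetical, not from the paper.

```python
def ipsbs_select(input_occupancy):
    """IPSBS: select the input port whose buffer is most heavily occupied."""
    return max(input_occupancy, key=input_occupancy.get)

def opsls_select(link_congestion):
    """OPSLS: select the output link with the lowest congestion status."""
    return min(link_congestion, key=link_congestion.get)

# Hypothetical snapshot: input "E" holds the most flits, link "W" is least loaded.
chosen_input = ipsbs_select({"E": 3, "W": 1, "N": 2})
chosen_output = opsls_select({"E": 0.7, "W": 0.2})
```

Each baseline thus acts on only one side of the router, whereas CA's priority queues combine regional congestion information on both sides.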


Fig. 11. Average latency of different congestion control schemes in 8 × 8 NePA networks for synthetic traffic patterns (FIFO depth = 4).

Fig. 12. Average latency of different congestion control schemes in 8 × 8 DMesh networks for synthetic traffic patterns (FIFO depth = 4).

Among them, CA has the best latency performance on both the NePA and DMesh platforms. Congestion information from neighboring routers helps packets bypass congested areas and accordingly lowers average latency. In general, routers with congestion control handle the network better because they exploit adaptive routing paths more flexibly to achieve higher network utilization. The throughput of 8 × 8 DMesh networks for four different traffic patterns is shown in Figs. 13 and 14. It can be observed that the CA and IPSBS congestion control schemes outperform


Fig. 13. Throughput of different congestion control schemes in 8 × 8 DMesh networks for synthetic traffic patterns (FIFO depth = 4).

Fig. 14. Throughput of different congestion control schemes in 8 × 8 DMesh networks for 3-tuple traffic patterns (FIFO depth = 4).

the OPSLS scheme and the original DMesh platform for synthetic traffic. CA maintains a consistent throughput improvement for each of the realistic application patterns, and it especially excels in cases featuring high burstiness and long traveling distances; its counterparts show improvement in some cases but degradation in others. As a reliable and generic traffic control scheme, CA is thus the best candidate for congestion management across versatile applications. The simulation results also show that output link utilization alone cannot improve DMesh throughput, and actually makes it worse than the original fixed-priority arbitration scheme. IPSBS, on the other hand, adaptively allocates routing resources to heavily loaded input FIFOs to alleviate potential congestion, and so achieves better network bandwidth utilization. The overall throughput of CA routers is better than or equal to that of IPSBS routers because CA routers act on regional network congestion status instead of individual congestion status. Based on sophisticated channel allocation and updated congestion status, the CA routers not only sustain the highest throughput but also maintain a comparatively stable throughput after the saturation load is reached. The throughput improvement of the CA router on the NePA and DMesh platforms is illustrated in Fig. 15. NePA_CA improves throughput by 7.7% to 110% under different traffic patterns. For the CA scheme in DMesh, the improvement of DMesh_CA routers is better than that of NePA_CA, about 30% to 135%. The major reason is that more alternative


Fig. 15. Throughput in 8 × 8 mesh networks for synthetic traffic patterns (FIFO depth = 4).

routing paths are provided in DMesh to support more flexible routing path selection; therefore, DMesh_CA routers can allocate routing resources more efficiently and achieve better performance.

4.2.2. Multiple Parallel Buffers (PBs) router architecture performance

Better performance and throughput are achieved when multiple buffers are employed. Performance and throughput comparisons for NePA and DMesh networks are shown in Figs. 16–18. Employing PBs helps improve performance and increase throughput, and the proposed CA scheme amplifies this effect because resources are utilized more efficiently. All traffic patterns and network platforms lead to the same conclusion. It was also observed that CA routers with fewer PBs improve much more than fixed-priority routers with more PBs, an observation that points to a resource optimization strategy for cost-sensitive NoC platform designs.

4.2.3. Different simulation variants verification

Different simulation setups were evaluated and analyzed to validate the feasibility and generality of the proposed mechanism. NoC size: Fig. 19(a) shows the latency comparison for different network sizes (4 × 4, 8 × 8, 16 × 16, and 32 × 32 mesh NoCs) under the Random traffic profile with four-flit packets. Throughput improvement is about 75% for the 8 × 8 and 16 × 16 NoCs, 50% for the 32 × 32 NoC, and 28% for the 4 × 4 NoC; medium-size NoCs thus gain the most from sophisticated congestion management. FIFO size: Fig. 19(b) indicates the influence of FIFO size on latency and throughput under the Random traffic profile with four-flit packets. The CA scheme yields an identical improvement of 50% for each FIFO size, and CA routers with smaller FIFOs outperform their counterparts. In a cost-stringent NoC platform design, CA routers with smaller FIFOs can satisfy performance requirements and cost restrictions simultaneously. Packet length: Fig.
19(c) shows the latency impact of various packet sizes under the Random traffic profile. The average latency of long packets is higher than that of short packets. This is caused by an imbalance in resource utilization that occurs when wormhole routing is adopted and long packets are transmitted: packets holding resources across multiple routers are the major cause. The CA scheme yields a moderate improvement of around 12–30% across packet sizes. All simulation variants support the same conclusion: the CA scheme enhances the transmission performance and sustainable throughput of NoC platforms.

4.2.4. Verification of fault tolerance capability

As discussed in Section 3.3.2, CA can be extended to handle link or node failures. If routes remain available between source-destination pairs, the Congestion-Aware Fault-Tolerance (CA_FT) scheme finds alternative paths to deliver packets toward their destinations. In our simulation, four randomly picked links in the NePA and DMesh networks are considered faulty when communication starts. Fig. 20 shows the resulting performance and throughput. For the original NePA and DMesh with fixed-priority routing, throughput suffers because packets cannot bypass the faulty areas, nor the highly congested regions near the faulty components. The proposed CA_FT scheme, by contrast, is aware of faulty links and congestion status, so packets can be detoured to other available alternatives beforehand, preventing the network from


Fig. 16. Average latency and throughput comparison of with and without congestion control schemes in 8 × 8 mesh networks for synthetic traffic patterns (FIFO depth = 4).

Fig. 17. Throughput comparison of with and without congestion control schemes in 8 × 8 NePA networks for 3-tuple traffic patterns (FIFO depth = 4).


Fig. 18. Throughput comparison of with and without congestion control schemes in 8 × 8 DMesh networks for 3-tuple traffic patterns (FIFO depth = 4).

Fig. 19. Average latency for different FIFO/Packet/Network size in DMesh networks for Random traffic.


Fig. 20. Average latency and throughput comparison of with and without CA fault tolerance schemes in 8 × 8 mesh networks for Random traffic pattern (FIFO depth = 4).

Table 3
Cost comparison of baseline and CA router design with single FIFO per input port.

                   | NoC5_Orion2     | NePA     | NePA_CA  | NePA     | NePA_CA  | DMesh    | DMesh_CA
                   | FIFO = 8 (4 VC) | FIFO = 4 | FIFO = 4 | FIFO = 8 | FIFO = 8 | FIFO = 4 | FIFO = 4
Area (um²)         | 170,442         | 30,295   | 31,524   | 48,900   | 49,407   | 56,583   | 59,939
Dynamic power (mW) | 29.33           | 20.27    | 20.40    | 37.33    | 37.34    | 33.68    | 34.16
Leakage power (uW) | –               | 139.08   | 147.44   | 227      | 277.75   | 265.02   | 278.88

serious congestion. NePA_CA_FT even outperforms the baseline DMesh platform in throughput, which underlines that the proposed fault-information passing scheme is superior to simply adding routing resources.

4.3. Implementation cost evaluation

Feasibility is also evaluated in terms of added hardware area and power consumption. The CA router has been designed at Register-Transfer Level (RTL) in Verilog HDL. For NePA routers, we used the architecture described in our previous work [5]. A logic description of the DMesh router component [34] was synthesized with Synopsys Design Compiler in TSMC 65 nm CMOS generic process technology to analyze hardware cost. For the Network Interface (NI), a custom architecture featuring two input ports and one output port was designed and implemented for NePA and DMesh to improve performance. If a single input/output-port NI is adopted by time-multiplexing both injection buffers, the NePA and DMesh NoCs still function correctly, sending either eastbound or westbound traffic one at a time, with slight performance degradation compared with the two-input-port NI design. The cost of the CA scheme in both NePA and DMesh was evaluated. The CA router employs two adders to calculate the congestion indices of the eastward and westward sub-routers separately, and changes the routing arbiter from fixed-priority to dynamic-priority arbitration composed of a priority multiplexer circuit. Table 3 compares the implementation cost of both platforms. The hardware synthesis frequency is set to 800 MHz and the switching activity to 10%. NoC5_Orion2,² a five-port NoC counterpart, is also listed for comparison. The results show that NePA_CA increases area by only 4.1% for FIFO = 4 and 1% for FIFO = 8, and DMesh_CA by a modest 6%, proving that CA routers enhance interconnection network throughput with a cost-efficient modification.
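The overhead percentages quoted above follow directly from the Table 3 areas; a quick check (variable names are ours):

```python
def area_overhead(base, ca):
    """Relative area increase of the CA router over the baseline, in percent."""
    return 100.0 * (ca - base) / base

# Areas (um^2) from Table 3.
nepa_fifo4 = area_overhead(30_295, 31_524)  # NePA_CA vs. NePA, FIFO = 4
nepa_fifo8 = area_overhead(48_900, 49_407)  # NePA_CA vs. NePA, FIFO = 8
dmesh_ca = area_overhead(56_583, 59_939)    # DMesh_CA vs. DMesh, FIFO = 4
```

These evaluate to roughly 4.1%, 1.0%, and 5.9%, matching the figures in the text.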

² Area and power consumption are estimated with the Orion2 simulator [18]. Configuration parameters: 65 nm technology, 800 MHz speed, 4 VCs per port, 8 flits per buffer, and 64-bit flits.


Fig. 21. Link utilization in an 8 × 8 DMesh for Random traffic under high load condition.

Increasing buffer size or the number of virtual channels cannot always further improve transmission performance. In our view, adding routing resources is the more plausible approach: to provide routing flexibility, more buffers and more capable switches are needed to manage the various routing directions, which is the major reason why DMesh outperforms NePA and traditional five-port routers in average latency and throughput. Moreover, the proposed router design eliminates the virtual-channel control and allocation overhead that would otherwise offset a more capable switch design [35]. Compared with five-port routers, NePA and DMesh are more area efficient because they eliminate VC flow-control management and are optimally customized designs. The power consumption increase is proportional to buffer size, as reflected in the NePA FIFO = 8 and DMesh cases; NePA and DMesh are power efficient when buffer sizes are aligned. The power comparison between the original NePA/DMesh and the CA designs shows trivial overhead for both dynamic and leakage power, and the area and power increases become negligible for routers with multiple parallel buffers per input port. Furthermore, the wiring cost of passing CA congestion messages is dramatically reduced, as discussed in Section 3.3.1.

4.4. Link utilization analysis

Statistical traffic load distribution was recorded and plotted to demonstrate the improvement in link utilization of CA routers. Fig. 21 illustrates the link utilization under heavily congested random traffic for the original DMesh and DMesh_CA platforms. The congested region is located around the central part of the network. DMesh accepts on average 53% of all injected flits. The link distribution in Fig. 21(a) shows that the right-center part has lower link utilization owing to congestion.
By adaptively allocating routing paths according to congestion status, DMesh_CA balances traffic load across the NoC and avoids worsening network congestion. As illustrated in Fig. 21(b), link utilization in the center of the network is improved and the congested area seen in Fig. 21(a) is resolved; the overall average accepted rate of DMesh_CA is significantly improved, to 70% in this scenario.

5. Conclusion

Flexible router design and adaptive routing algorithms not only use area and power efficiently, but also support more advanced features to accommodate various services on an NoC platform. Congestion-aware router designs enhance NoC performance in terms of latency and throughput. To validate the proposed congestion management mechanism, thorough simulations were conducted to demonstrate its feasibility and superiority. CA extends easily to multiple-buffer router architectures at negligible cost and achieves significant performance improvement through resource usage optimization. CA can also integrate a faulty-component information exchange mechanism into the routing algorithm, providing a robust routing scheme that sustains network operation in faulty environments. Experimental results showed considerable performance improvement at moderate implementation cost overhead on both mesh and diagonally-linked mesh NoC platforms. With alternative links employed between routers, the proposed mechanism showed that DMesh has better potential for fault tolerance, bypassing faulty areas while accommodating more traffic.

References

[1] M. Ali, M. Welzl, S. Hessler, A fault tolerant mechanism for handling permanent and transient failures in a network on chip, in: Proc. Fourth Int. Conf. Information Technology ITNG '07, 2007, pp. 1027–1032.


[2] D.R. Avresky, V. Shurbanov, R. Horst, P. Mehra, Performance evaluation of the ServerNet SAN under self-similar traffic, in: Proc. IPPS/SPDP Parallel and Distributed Processing, 13th Int. and 10th Symp. Parallel and Distributed Processing, 1999, pp. 143–147.
[3] J.H. Bahn, N. Bagherzadeh, Efficient parallel buffer structure and its management scheme for a robust Network-on-Chip (NoC) architecture, in: 13th International CSI Computer Conference CSICC, Kish Island, Iran, March 2008, pp. 98–105.
[4] J.H. Bahn, S.E. Lee, N. Bagherzadeh, On design and analysis of a feasible Network-on-Chip (NoC) architecture, in: Proc. Fourth Int. Conf. Information Technology ITNG '07, 2007, pp. 1033–1038.
[5] J.H. Bahn, S.E. Lee, Y.S. Yang, J. Yang, N. Bagherzadeh, On design and application mapping of a Network-on-Chip (NoC) architecture, Parallel Process. Lett. 18 (2008) 239–255.
[6] S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Leung, J. MacKay, M. Reif, L. Bao, J. Brown, M. Mattina, C.-C. Miao, C. Ramey, D. Wentzlaff, W. Anderson, E. Berger, N. Fairbanks, D. Khan, F. Montenegro, J. Stickney, J. Zook, TILE64 processor: A 64-core SoC with mesh interconnect, in: Proc. IEEE Int. Solid-State Circuits Conf. ISSCC 2008, Digest of Technical Papers, 2008, pp. 88–598.
[7] L. Benini, G. De Micheli, Networks on chip: a new paradigm for systems on chip design, in: Proc. Design, Automation and Test in Europe Conf. and Exhibition, 2002, pp. 418–419.
[8] W. Dally, B. Towles, Principles and Practices of Interconnection Networks, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2003.
[9] W.J. Dally, Virtual-channel flow control, IEEE Trans. Parallel Distrib. Syst. 3 (2) (1992) 194–205.
[10] W.J. Dally, H. Aoki, Deadlock-free adaptive routing in multicomputer networks using virtual channels, IEEE Trans. Parallel Distrib. Syst. 4 (4) (1993) 466–475.
[11] W.J. Dally, B. Towles, Route packets, not wires: on-chip interconnection networks, in: Proc. Design Automation Conf., 2001, pp. 684–689.
[12] T. Dumitras, R. Marculescu, On-chip stochastic communication [SoC applications], in: Proc. Design, Automation and Test in Europe Conf. and Exhibition, 2003, pp. 790–795.
[13] X. Fu, T. Li, J.A.B. Fortes, Architecting reliable multi-core Network-on-Chip for small scale processing technology, in: Proc. IEEE/IFIP Int. Dependable Systems and Networks (DSN) Conf., 2010, pp. 111–120.
[14] P. Gratz, B. Grot, S.W. Keckler, Regional congestion awareness for load balance in Networks-on-Chip, in: Proc. IEEE 14th Int. Symp. High Performance Computer Architecture HPCA 2008, 2008, pp. 203–214.
[15] C. Grecu, A. Ivanov, R. Saleh, E.S. Sogomonyan, P.P. Pande, On-line fault detection and location for NoC interconnects, in: Proc. 12th IEEE Int. On-Line Testing Symp. IOLTS 2006, 2006.
[16] H. Gu, J. Xu, K. Wang, M. Morton, A new distributed congestion control mechanism for networks on chip, Telecommun. Syst. 44 (2010) 321–331, http://dx.doi.org/10.1007/s11235-009-9257-7.
[17] M. Igarashi, T. Mitsuhashi, A. Le, S. Kazi, Y.-T. Lin, A. Fujimura, S. Teig, A diagonal-interconnect architecture and its application to RISC core design, in: IEEE Int. Solid-State Circuits Conf. ISSCC 2002, Digest of Technical Papers, vol. 1, 2002, pp. 210–460.
[18] A.B. Kahng, B. Li, L.-S. Peh, K. Samadi, ORION 2.0: A fast and accurate NoC power and area model for early-stage design space exploration, in: Proc. DATE '09, Design, Automation and Test in Europe Conf. and Exhibition, 2009, pp. 423–428.
[19] J. Kim, D. Park, T. Theocharides, N. Vijaykrishnan, C.R. Das, A low latency router supporting adaptivity for on-chip interconnects, in: Proc. 42nd Design Automation Conf., 2005, pp. 559–564.
[20] W.E. Leland, M.S. Taqqu, W. Willinger, D.V. Wilson, On the self-similar nature of Ethernet traffic (extended version), IEEE/ACM Trans. Netw. 2 (1) (1994) 1–15.
[21] P. Lotfi-Kamran, A. Rahmani, M. Daneshtalab, A. Afzali-Kusha, Z. Navabi, EDXY – A low cost congestion-aware routing algorithm for Network-on-Chips, J. Syst. Archit. 56 (7) (2010) 256–264, Special Issue on HW/SW Co-Design: Systems and Networks on Chip.
[22] R. Marculescu, Networks-on-Chip: the quest for on-chip fault-tolerant communication, in: Proc. IEEE Computer Society Annual Symp. VLSI, 2003, pp. 8–12.
[23] E. Nilsson, M. Millberg, J. Oberg, A. Jantsch, Load distribution with the proximity congestion awareness in a network on chip, in: Proc. Design, Automation and Test in Europe Conf. and Exhibition, 2003, pp. 1126–1127.
[24] U.Y. Ogras, R. Marculescu, Prediction-based flow control for Network-on-Chip traffic, in: Proc. 43rd ACM/IEEE Design Automation Conf., 2006, pp. 839–844.
[25] A. Singh, W.J. Dally, A.K. Gupta, B. Towles, GOAL: a load-balanced adaptive routing algorithm for torus networks, in: Proc. 30th Annual Int. Computer Architecture Symp., 2003, pp. 194–205.
[26] V. Soteriou, N. Eisley, H. Wang, B. Li, L.-S. Peh, Polaris: A system-level roadmap for on-chip interconnection networks, in: Proc. Int. Conf. Computer Design ICCD 2006, 2006, pp. 134–141.
[27] M.S. Taqqu, W. Willinger, R. Sherman, Proof of a fundamental result in self-similar traffic modeling, SIGCOMM Comput. Commun. Rev. 27 (April 1997) 5–23.
[28] S.L. Teig, The X architecture: not your father's diagonal wiring, in: Proceedings of the 2002 International Workshop on System-Level Interconnect Prediction, SLIP '02, ACM, New York, NY, USA, 2002, pp. 33–37.
[29] W. Trumler, S. Schlingmann, T. Ungerer, J. Bahn, N. Bagherzadeh, Self-optimized routing in a Network-on-a-Chip, in: Biologically-Inspired Collaborative Computing, IFIP Int. Fed. Inf. Process., vol. 268, Springer, Boston, 2008, pp. 199–212.
[30] J.W. van den Brand, C. Ciordas, K. Goossens, T. Basten, Congestion-controlled best-effort communication for Networks-on-Chip, in: Proc. Design, Automation and Test in Europe Conf. and Exhibition DATE '07, 2007, pp. 1–6.
[31] S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, P. Iyer, A. Singh, T. Jacob, S. Jain, S. Venkataraman, Y. Hoskote, N. Borkar, An 80-tile 1.28TFLOPS Network-on-Chip in 65nm CMOS, in: Proc. IEEE Int. Solid-State Circuits Conf. ISSCC 2007, Digest of Technical Papers, 2007, pp. 98–589.
[32] G. Varatkar, R. Marculescu, Traffic analysis for on-chip networks design of multimedia applications, in: Proc. 39th Design Automation Conf., 2002, pp. 795–800.
[33] C. Wang, W.-H. Hu, N. Bagherzadeh, Congestion-aware Network-on-Chip router architecture, in: Proc. 15th CSI Int. Computer Architecture and Digital Systems (CADS) Symp., 2010, pp. 137–144.
[34] C. Wang, W.-H. Hu, N. Bagherzadeh, S.E. Lee, Area and power-efficient innovative Network-on-Chip architecture, in: Proc. 18th Euromicro Int. Parallel, Distributed and Network-Based Processing (PDP) Conf., 2010, pp. 533–539.
[35] C. Wang, W.-H. Hu, S.E. Lee, N. Bagherzadeh, Area and power-efficient innovative congestion-aware Network-on-Chip architecture, J. Syst. Archit. 57 (January 2011) 24–38.
[36] D. Wu, B.M. Al-Hashimi, M.T. Schmitz, Improving routing efficiency for Network-on-Chip through contention-aware input selection, in: Proc. Asia and South Pacific Conf. Design Automation, 2006.