2011 19th International Euromicro Conference on Parallel, Distributed and Network-Based Processing
A Wireless Network-on-Chip Design for Multicore Platforms

Chifeng Wang, Wen-Hsiang Hu, Nader Bagherzadeh
Dept. of Electrical Engineering and Computer Science
University of California, Irvine
Irvine, CA 92697 USA
{chifengw, wenhsiah, nader}@uci.edu

Abstract: Aggressive scaling of transistors allows integration of hundreds of processors on a chip. However, on-chip interconnects carrying signals between different blocks will be the bottleneck for system performance and reliability. To tackle this problem, we developed an on-chip communication infrastructure based on a network-on-chip architecture and a hybrid mechanism that transfers data among IP cores by taking advantage of both wired and wireless communications. By using on-chip antennas, one can provide on-chip wireless communication to transfer data across long distances and reduce transfer latency and energy dissipation accordingly. A wireless network-on-chip architecture was designed and evaluated, and the experimental results showed significant improvement in transfer latency, network throughput and energy dissipation.

Keywords: Network-on-Chip (NoC); On-chip wireless interconnect network; Wireless Network-on-Chip (WNoC)

1066-6192/11 $26.00 © 2011 IEEE DOI 10.1109/PDP.2011.37

I. INTRODUCTION

NoC architecture has emerged as a promising technology to tackle the design challenges of the conventional bus architecture by using network-like interconnection among multiple cores [14][26]. Today's multi-core chip multiprocessors (CMP) support tens to low hundreds of cores, such as Tilera's 64-core TILE64 [19], Intel's 80-core TFLOPS [23], NVIDIA's 128-core Quadro GPU [35] and 240-core Tesla C1060 GPU [36]. However, on-chip interconnects carrying signals across different components will be the bottleneck to system performance and reliability, especially when CMPs scale to hundreds or thousands of cores on a chip. According to the International Technology Roadmap for Semiconductors (ITRS) [34], wiring delay will be one of the critical issues of future designs. Technology scaling will cause limitations such as long wiring delay and global synchronization as chip dimensions grow with the integration of numerous cores on a chip. Therefore, alternative communication approaches such as optical interconnections, RF Interconnect (RF-I) transmission lines and CMOS Ultra-wideband (UWB) wireless interconnect technology have been proposed [3][8][16]. The basic concept of optical interconnections and RF-I is to insert express communication links to reduce transmission latency and power dissipation. However, optical links face technological challenges such as the design of efficient transmitter and receiver components, the reliability of integrated light sources and high manufacturing cost, which prevent their commercial adoption. These disadvantages hinder optical interconnections from becoming a feasible solution. Although RF-I can be implemented in silicon-based CMOS technology, it needs additional physically overlaid transmission lines that serve as waveguides to enable data transmission. To sustain high throughput, an RF-I based system must employ multiple high-frequency oscillators and high-precision filters to be feasible. CMOS UWB wireless interconnection enables on-chip multi-hop communication through embedded wireless channels, which gives wire-based NoCs more opportunities for alternative routing strategies. On-chip wireless communication offers the feasibility and flexibility to overcome the limitations of wired communication using existing and well-understood CMOS technology. To exploit these wireless capabilities, multi-hop wired NoC channels are replaced with high-bandwidth single-hop long-range wireless channels so that the transmission performance, power consumption and long-distance communication problems of traditional wired NoCs can be addressed simultaneously.

In this work, we design and analyze a hybrid NoC system that combines on-chip wireless interconnect technology with an existing two-dimensional (2-D) mesh NoC. We refer to this hybrid architecture as Wireless NoC (WNoC). By using on-chip antennas, WNoC utilizes both intra-chip wireless and wired communication resources to coordinate data transmission among cores and improve system performance. WNoC routers provide a routing scheme that keeps transfer latency and power requirements minimal. Performance and feasibility analyses of WNoC routers are addressed in this work.

The rest of the paper is organized as follows. Section II introduces the existing interconnection solutions for CMP. Section III presents the WNoC architecture and routing scheme. Section IV demonstrates experimental results from cycle-accurate simulation to show significant improvement in transfer latency and throughput over conventional 2-D mesh
NoC. Section V concludes this paper.

II. BACKGROUND AND RELATED WORK

NoCs have been considered an enabling technology for high-degree integration in multi-core CMPs. 2-D mesh networks [23] are the most common topology due to their feasibility and low dimensionality. 2-D meshes have the benefits of short channel lengths and low router complexity. However, the 2-D floor plan limits the choice of NoC topology and thus constrains the performance improvement expected from NoC, because the network diameter of a 2-D mesh topology grows linearly with the number of routers along each dimension. For example, a 10x10 2-D mesh NoC has a network diameter of 18 hops. This leads to long transfer latency and energy inefficiency. To reduce the cost of long-distance transmission, inserting express channels [22][27] was proposed.

Beyond traditional 2-D wired interconnect solutions, different revolutionary approaches were proposed. 3-D NoCs [24] combine the advantages of 3-D ICs and NoCs to improve latency, throughput and power consumption. Instead of employing express metal links, several on-chip interconnect alternatives have been proposed. On-chip optical interconnects [3] are expected to give very high throughput with low latency. RF interconnects [16] modulate data on a carrier frequency to deliver data over an on-chip RF waveguide instead of sending baseband signals on a parallel bus. However, each of them has its own implementation challenges, which prevent adoption [1].

The idea of on-chip wireless interconnects was first demonstrated to distribute global clock signals [5][12]. A wireless NoC based on UWB technology was proposed in [8]; it utilized wired signals to realize a synchronous and distributed medium access control (MAC) protocol. This work achieved a 1 mm data transmission range with an antenna length of 2.98 mm. The peak bandwidth on a single channel was 10 Gbps in 0.18 um technology.
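As a quick check of the mesh-diameter figure quoted in this section, the following minimal sketch computes the diameter of a 2-D mesh as the hop distance between opposite corners. The function name `mesh_diameter` is illustrative and does not come from the paper.

```python
def mesh_diameter(cols: int, rows: int) -> int:
    """Longest shortest path, in hops, between two routers of a cols x rows 2-D mesh."""
    if cols < 1 or rows < 1:
        raise ValueError("mesh dimensions must be positive")
    # Opposite corners are (cols - 1) hops apart in X and (rows - 1) hops apart in Y.
    return (cols - 1) + (rows - 1)


if __name__ == "__main__":
    for side in (4, 5, 10, 16):
        print(f"{side}x{side} mesh: diameter = {mesh_diameter(side, side)} hops")
```

For a 10x10 mesh this gives 9 + 9 = 18 hops, matching the figure in the text, and it makes the linear growth with the side length explicit.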
Reference [13] reported that the cutoff frequency of NMOS transistors was predicted to reach 500 GHz by 2013. At such a high operating frequency, a typical bandwidth can be predicted to be hundreds of Gbps. From the implementation cost point of view, since wireless RF signals are in the GHz range, the sizes of transmitter/receiver blocks and antennas are substantially reduced, which results in viable solutions. Recently, a scalable wireless interconnect structure [20] was proposed to validate on-chip wireless communication and evaluate the feasibility of a hybrid wired and wireless platform. This work provided a two-tier hybrid wireless/wired architecture to demonstrate the benefits of long-range wireless links. Besides silicon-based on-chip antenna solutions, the feasibility of applying carbon nanotubes (CNT) to implement on-chip antennas was discussed and analyzed in [1][15][18]. A CNT-based antenna was reported to feature better emission and absorption characteristics than
Figure 1. NePA architecture: (a) 4x4 2-D NePA architecture; (b) NePA router port description
traditional materials, making it a good candidate for on-chip antenna elements at optical frequencies.

III. WNOC ARCHITECTURE

A. On-chip wireless interconnect

A 324 GHz oscillator using a 90 nm CMOS process [7] and a 410 GHz oscillator using a 6M 45 nm CMOS process [9] have been reported as solutions to realize on-chip wireless interconnects. Based on these technologies, the output power level of the on-chip millimeter-wave generator has been predicted to be as high as -1.4 dBm in a 32 nm CMOS process [20], which enables on-chip short-distance communication. With the advance in CMOS mm-wave circuits, hundreds of GHz of bandwidth will be available in the near future. An on-chip antenna deposited in the polyimide layer to minimize substrate loss was proposed to improve the wireless transmission bit-error rate (BER) to a reliable level [20]. At a distance of 10 mm, which is the diagonal distance of a 10x10 NoC, the BER will be less than 10⁻¹⁴, which is sufficient for a reliable transmission medium.

Based on the estimated 500 GHz switching rate of a CMOS transistor in the 32 nm CMOS process, we can implement many high-frequency bands for the on-chip wireless network. The maximum available bandwidth is empirically predicted to be 10% of the carrier frequency. This scenario can accommodate up to a total of 16 channels for the on-chip wireless network in the range from 100 GHz to 500 GHz, and each channel can transmit at around 20 Gbps. Besides bandwidth capacity, the on-chip wireless network requires a simple wireless transceiver architecture to achieve a low-power design that satisfies the stringent requirements of future NoC designs. Optical Mach-Zehnder modulation at data rates up to 10 Gbps, demonstrated with a low RF power consumption of only 5 pJ/bit [31], is currently commercially available. A simple On-Off-Keying (OOK) system suffices to satisfy these requirements in such a high frequency range.

B. WNoC Topology

1) Baseline wired NoC network: Our WNoC architecture is based on a conventional wired 2-D mesh NoC architecture called Network-based Processor Array (NePA) [10]. As illustrated in Fig.1(a), each Processing Element (PE) consists of a processor core, a Network Interface (NI) and a router. Each router has 64-bit bidirectional links connecting it with its neighbor routers. Additionally, there are two extra vertical ports for building up two separate eastward and westward sub-networks: E-subnet and W-subnet. Fig.1(b) depicts the input and output ports of the NePA router. E-subnet, which includes the E port, N1 port and S1 port, is responsible for transferring eastward packets, while W-subnet, which includes the W port, N2 port and S2 port, is responsible for westward traffic. When a source PE starts packet transmission, it injects packets into the network via the Int port, directing them to either E-subnet or W-subnet. Subsequently, packets traverse one of the sub-networks to their destinations. When packets arrive at the destination node, they are ejected from the Int port. Separating the whole network into two sub-networks reduces the design complexity of the router and ensures deadlock freedom. To increase network performance and provide fault-tolerant routing ability, NePA utilizes an adaptive XY routing scheme. When an output port is congested, the router selects an alternative output port for incoming packets. Therefore, link utilization is balanced and network performance improves.

2) On-chip wireless NoC network: WNoC is constructed by replacing some of the routers in NePA with Wireless Routers (WR), which have wireless links to other routers in addition to the original wired links. Therefore, WRs are capable of transferring packets via both wired signals and wireless channels. To distinguish the two kinds of routers in WNoC, we call the router in NePA the Baseline Router (BR). In WNoC, PEs are divided into subnets, each of which has one WR responsible for providing wireless communication for the PEs in that subnet. The allocation of WRs in subnets is an important factor in system performance. Fig.2 depicts an example of WNoC with 10x10 PEs divided into 4 rectangular subnets, with WRs located at the center of each subnet. Solid lines and dotted lines represent wired and wireless links, respectively, for transmitting packets between routers. Because multiple channels are available, the Frequency Division Multiple Access (FDMA) technique is adopted for channelization to achieve simultaneous communication between WRs. Each wireless transmitter and receiver pair uses an independent carrier frequency to accommodate data from different channels.

Data delivery in WNoC is done by wormhole packet switching, because it has the advantages of both low transfer latency and low buffer requirement. Packets are composed of 64-bit flits in our scenario. As illustrated in Fig.3, the first flit of a packet is the header flit, which carries control information for packet delivery such as destination address, sequence number, packet type, payload size, and some control flags. The body flits that follow the header flit are the actual payload. The address ID in WNoC is described by four bit fields: X subnet, Y subnet, X local, and Y local, where the first two fields specify the subnet location and the remaining two fields identify the router location within a subnet. The separation of subnet and local address enables fast routing decisions and a scalable hierarchical system design. It can also simplify router design and lower hardware complexity.

Figure 2. 10x10 WNoC architecture
Figure 3. WNoC packet format
C. Routing Scheme

WNoC BRs implement an adaptive minimal routing algorithm which takes buffer utilization into account when deciding the routing path. This approach has been shown to perform better than a conventional 5-port router [10]. Wireless data communication in WNoC is performed over the wireless links between WRs: WRs equipped with on-chip antennas exchange data wirelessly. When transferring data between PEs in WNoC, a packet may be transmitted over the wired links, the wireless links, or a mix of the two. Thus, we need an efficient decision criterion to choose a proper path for transmitting packets. When considering this issue, we view WNoC as a network formed by adding expressways (wireless links) to the baseline mesh network, so the problem becomes whether a packet uses the expressway or not. For packets whose source node and destination node are located in the same subnet, there is no need to use wireless links. For packets whose source node and destination node are in different subnets, the path selection scheme in our design is a function of travelling distance (expressed in terms of hop count) and buffer utilization status. The path decision procedure is shown in Algorithm 1. This algorithm is executed at packet injection time by the router, which could be either a BR or a WR.

The purpose of the first line of this algorithm is to provide congestion control capability for WRs. Since a WR is shared by all PEs in the same subnet, it is vulnerable to congestion. We use buffer utilization information from WRs to avoid traffic build-up when there is congestion at WRs. Wired signals are sent from WRs to each BR in the same subnet to notify BRs that the output buffer on the wireless links is full, which means the intended path is congested. In this case, the BR will not use wireless links, so that congestion can be alleviated. The condition check in line 3 is used to balance the usage of wired and wireless links. HW and HB are the travelling distances between source and destination node for the wireless path and the wired path, respectively. They are expressed in terms of hop count and are calculated from the source and destination IDs. If we just used (HW ≤ HB) to make the decision, the routers would overuse wireless links, resulting in hotspots at WRs. Thus, δ is used to adjust the condition to suit different network configurations and traffic patterns. δ is an application-dependent variable; δ = 4 is used in our simulation to balance the traffic load between both networks. The result of this path decision algorithm is stored in the header flit. If a packet is assigned the wired path, it traverses the wired links to reach its destination. If a packet uses wireless links, it is first sent to its local WR via the wired network. Then it is transferred to the destination WR via wireless links, and finally returns to the wired network.

Algorithm 1 Path Decision Algorithm
Input: Packet_i and BUF_WR
 1: if BUF_WR is full then
 2:   NePA_Routing(s, d)
 3: else if HW < HB + δ then
 4:   WNoC_Routing(s, d)
 5: else
 6:   NePA_Routing(s, d)
 7: end if
 8:
 9: Proc NePA_Routing(s, d)
10:   if eastward transmission then
11:     inject to E-subnet buffer
12:   else
13:     inject to W-subnet buffer
14:   end if
15:
16: Proc WNoC_Routing(s, d)
17:   if BUF_WR is full then
18:     NePA_Routing(s, WR_s)
19:   else
20:     NePA_Routing(s, WR_s)
21:     Wireless_Channel(WR_s, WR_d)
22:     NePA_Routing(WR_d, d)
23:   end if

Figure 4. Parallel buffer router architecture
Figure 5. Deadlock example in WNoC network
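The path decision of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' router logic: it assumes the 10x10 mesh with four 5x5 subnets and a WR at the center of each subnet, and it assumes that the wireless hop count HW is the wired distance to the local WR, plus one wireless hop, plus the wired distance from the destination WR. The names `Node`, `subnet_of`, `wireless_router` and `choose_path` are hypothetical. The comparison is taken exactly as printed in line 3 (HW < HB + δ); for integer hop counts δ = 1 reproduces HW ≤ HB, larger values admit more wireless traffic, and smaller values admit less.

```python
from dataclasses import dataclass

SUBNET_SIDE = 5   # assumption: 10x10 mesh split into four 5x5 subnets
WR_OFFSET = 2     # assumption: the WR sits at the center of its subnet


@dataclass(frozen=True)
class Node:
    x: int
    y: int


def hops(a: Node, b: Node) -> int:
    """Minimal wired hop count between two routers of a 2-D mesh."""
    return abs(a.x - b.x) + abs(a.y - b.y)


def subnet_of(n: Node) -> tuple[int, int]:
    """(X subnet, Y subnet) fields of the hierarchical address."""
    return (n.x // SUBNET_SIDE, n.y // SUBNET_SIDE)


def wireless_router(n: Node) -> Node:
    """Location of the WR serving the subnet that contains n."""
    sx, sy = subnet_of(n)
    return Node(sx * SUBNET_SIDE + WR_OFFSET, sy * SUBNET_SIDE + WR_OFFSET)


def choose_path(src: Node, dst: Node, wr_buffer_full: bool, delta: int = 4) -> str:
    """Return "wired" or "wireless" following the structure of Algorithm 1."""
    if subnet_of(src) == subnet_of(dst):
        return "wired"            # same subnet: no need for wireless links
    if wr_buffer_full:
        return "wired"            # line 1: congestion control for the WR
    h_b = hops(src, dst)
    # Assumed H_W: wired to local WR, one wireless hop, wired to destination.
    h_w = hops(src, wireless_router(src)) + 1 + hops(wireless_router(dst), dst)
    return "wireless" if h_w < h_b + delta else "wired"   # line 3 as printed


if __name__ == "__main__":
    print(choose_path(Node(0, 0), Node(9, 9), wr_buffer_full=False))
    print(choose_path(Node(4, 0), Node(5, 0), wr_buffer_full=False))
```

In this sketch a corner-to-corner packet takes the wireless path (9 hops instead of 18), while a packet between two adjacent routers on a subnet boundary stays on the wired mesh.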
D. Performance Enhancement Architecture

Virtual channels (VCs) are generally employed to enhance network throughput by dividing the buffer storage associated with each network channel into several virtual channels [28]. By proper control of VCs, network flow control can be easily implemented [29]. However, in order to maximize their utilization, the allocation of virtual channels is a critical issue in designing routing algorithms [4][17]. Each virtual channel is mapped to an individual dedicated output port by the virtual channel allocation logic. However, because our WRs and BRs adopt a real-time adaptive routing algorithm, packets arriving at input ports are not mapped to dedicated outputs and retain the flexibility to be routed toward less congested paths. Based on this NoC architecture, a new routing-independent parallel buffer structure and its management scheme are proposed instead of conventional VCs [11], as shown in Fig.4. As a result, channel utilization and maximum throughput are improved significantly. It also helps isolate wired and wireless traffic to gain the most benefit from the hybrid architecture. WNoC transmits long-distance flits and local flits over shared links. When the traffic load increases, transmission performance deteriorates due to the head-of-line (HOL) blocking effect. Incorporating the parallel buffer structure effectively mitigates this effect, improving throughput and shortening average latency significantly.

E. Deadlock Avoidance

Although BRs separate transmitted flits into the individual E-subnet and W-subnet to provide a deadlock-free routing scheme, deadlock is still possible because incorporating wireless links can form cyclic routing paths. As illustrated in Fig.5, a hybrid routing path forms a cycle which spans both wired and wireless links. To deal with this issue, the WNoC routing schemes provide the following strategies to avoid deadlock and guarantee the effectiveness of the WNoC routing mechanism.
WNoC_VC1: In the single-buffer scenario, flits have no additional buffers to bypass a congested path, so they suffer from the HOL effect. Although the adaptive routing algorithm embedded in BRs provides flexible routing arbitration to relieve network traffic, flits whose source-destination pairs lie on the same horizontal or vertical coordinate as the WR can still cause deadlock. In this case, the routing algorithm exempts such flits from taking wireless express links. By breaking the potential cyclic routing paths, WNoC maintains deadlock freedom in the single-buffer scheme.

WNoC_VCn: When WNoC input ports employ more than one virtual channel, transmitted flits have more routing resources, which almost eliminates the possibility of deadlock. To guarantee a deadlock-free routing scheme, flits toward the WR and flits toward destination nodes are isolated in individual virtual channels. One channel is reserved for flits toward the WR, one channel is reserved for flits toward destination nodes, and the other channels are shared by all flits to provide the best utilization and performance.

Table I. 3-TUPLE TRAFFIC CATEGORIES AND CHARACTERISTICS

Case    Burstiness   Injection     Hop distance
Case1   High         Hot-spot      Local
Case2   Moderate     Hot-spot      Local
Case3   High         Evened-out    Local
Case4   Moderate     Evened-out    Local
Case5   High         Hot-spot      Global
Case6   Moderate     Hot-spot      Global
Case7   High         Evened-out    Global
Case8   Moderate     Evened-out    Global
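The eight categories of Table I are the permutations of three two-level factors, so they can be generated rather than typed by hand. The sketch below is illustrative only: the dictionary keys and the function name `traffic_cases` are hypothetical, while the numeric settings (H = 0.9 and 0.65, 10% or 20% of the nodes receiving 68% of the traffic, 20% or 40% of the traffic travelling more than four hops) are the values stated in the experimental setup.

```python
from itertools import product

# Factor levels, ordered so that the generated sequence matches Table I.
HOP_DISTANCE = ("Local", "Global")
INJECTION = ("Hot-spot", "Evened-out")
BURSTINESS = ("High", "Moderate")

# Simulator settings quoted in the experimental setup.
HURST = {"High": 0.9, "Moderate": 0.65}
NODE_SHARE_RECEIVING_68_PERCENT = {"Hot-spot": 0.10, "Evened-out": 0.20}
SHARE_BEYOND_FOUR_HOPS = {"Local": 0.20, "Global": 0.40}


def traffic_cases() -> dict[str, dict]:
    """Enumerate the 3-tuple traffic categories as Case1..Case8."""
    cases = {}
    combos = product(HOP_DISTANCE, INJECTION, BURSTINESS)
    for index, (hop, injection, burst) in enumerate(combos, start=1):
        cases[f"Case{index}"] = {
            "burstiness": burst,
            "injection": injection,
            "hop_distance": hop,
            "hurst": HURST[burst],
            "node_share": NODE_SHARE_RECEIVING_68_PERCENT[injection],
            "long_hop_share": SHARE_BEYOND_FOUR_HOPS[hop],
        }
    return cases


if __name__ == "__main__":
    for name, params in traffic_cases().items():
        print(name, params)
```

Hop distance varies slowest and burstiness fastest, which is why Case1 to Case4 are local and Case5 to Case8 are global.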
IV. EVALUATION

To demonstrate the router's performance and its feasibility for VLSI implementation, this section describes the methodology used to analyze the performance, power consumption and area cost of the architecture. A simulation platform is set up to validate the new architecture and compare it with traditional wired NoCs.

A. Experimental Setup

The WNoC platform is modeled with a SystemC-based cycle-accurate simulator. The simulator uses a 100-core system divided into four subnets. Each subnet consists of a 5x5 mesh network and high-speed wireless channels to neighboring subnets. Each subnet employs 24 BRs and one WR which features one-hop transmission. The simulator also supports various network configurations such as network size, topology, buffer size, routing algorithm, priority scheme for router arbitration, and traffic pattern generation. The WNoC simulator has a traffic generator capable of producing both synthetic and application traffic patterns. For synthetic traffic, WNoC provides the {Random, Bit complement, Bit reverse, Matrix transpose} patterns [30]. To apply spatial and temporal distribution behavior that simulates practical application traffic, we adopted the 3-tuple traffic generation technique [25], which comprises:

(1) Traffic burstiness: It models how often packet bursts are injected into network routers and how large these bursts are. This factor reflects on-chip traffic self-similarity, where traffic burst patterns repeat themselves over time. Burstiness modeling uses the Hurst parameter, 0.5 < H ≤ 1, which defines the level of self-similarity. The closer H is to 1, the higher the level of burstiness.

(2) Traffic injection distribution: It models how packet injection streams are distributed among the various nodes in an NoC. A Gaussian injection distribution is observed when the aggregated injection peaks are collected. σGauss is used to decouple the size of the network from the actual packet injection distribution; hot-spot or evened-out traffic can be modeled, where a relatively larger σGauss models a more evened-out injected traffic pattern.

(3) Traffic hop distance: It models how far packets travel from source to destination, capturing long- and short-distance traffic with respect to the network size and type, and consequently in relation to the maximum hop distance of an NoC. Both local and global traffic tendencies are considered.

These traffic pattern cases are categorized as listed in Table I. The spatio-temporal traffic groupings are based on permutations of moderately and highly bursty traffic, global (long hop distance) and local traffic, and hot-spot and evened-out injected traffic. The WNoC simulator sets highly bursty traffic to H=0.9 and moderately bursty traffic to H=0.65. Traffic injections are set as hot-spot or evened-out traffic. Hot-spot traffic is modeled as 10% of the nodes receiving 68% of the total injected traffic, and evened-out traffic is modeled as 20% of the nodes receiving the same portion of traffic. Local traffic is set as traffic where only 20% of the total injected traffic traverses distances greater than four hops; for global traffic this value rises to 40%. Note that the 3-tuple traffic generation method can be adjusted to fit specific application traffic.

The interconnection network measurement setup proposed in [30] was used. After packets are generated, they are stored in an infinite queue at the source node and wait until they are injected into the network. This mechanism, referred to as the open-loop measurement configuration, isolates packet generation from network behavior. Before data collection, there is a 10,000-cycle warm-up phase to let the system reach steady state. Subsequently, the simulation continues for 100,000 cycles, during which router performance, system throughput and power consumption measurements are conducted.
B. Performance Evaluation

Latency and throughput are fundamental performance evaluation metrics. The WNoC simulator keeps track of data transmission time and monitors the throughput under different network traffic loads. Each flit in the WNoC simulator is an independent object that carries a time stamp and travelling distance information. Saturation load means the point where throughput no longer grows linearly with traffic load. All of our results in Fig.6(a) indicate that WNoC_VC1 networks can sustain a higher traffic load than NePA networks. The improvement in throughput is about 6 to 22%. By incorporating parallel buffers, the improvement increases markedly, to 65 to 88%. The increase in symmetric traffic is more than that in random traffic, which implies that long-distance data take more advantage of wireless links in WNoC. For the 3-tuple results in Fig.6(b), WNoC shows moderate improvements for case1, case2 and case6, which feature hot-spot and local traffic. Improvements become significant, around 15 to 30%, in the other cases because WNoC takes advantage of wireless one-hop express links to shorten latency and increase throughput accordingly.

Fig.7 depicts the comparison of average latency for various traffic patterns in WNoC and NePA. We observed that all of these delays are reduced in WNoC for different traffic loads. The scheme especially benefits long-distance source and destination pairs located almost diagonally, as in {Matrix transpose} traffic, which demonstrates noticeable improvement. All 3-tuple experimental results shown in Fig.8 consistently demonstrate that WNoC outperforms NePA in terms of transmission latency. The parallel buffer architecture can effectively relieve local congestion and greatly facilitate long-distance traffic using wireless links. These effects can be easily observed in case5 to case8. Average latency in WNoC is far lower than in NePA, especially in long-distance communication scenarios.

Wireless links are crucial for shortening the distance travelled, especially for transmission between two cores that are far apart. Incorporating a load-balancing routing algorithm to manage traffic load efficiently takes advantage of the alternative express links, so that the inter-node distance is reduced remarkably. To isolate wired and wireless traffic effectively, we incorporate a parallel buffer architecture, which largely enhances overall performance and network capacity. From the results, we can conclude that two or more parallel buffers significantly improve overall performance. WNoC_VC2 achieves twice the performance of the NePA network and WNoC_VC1, which indicates that separating wired and wireless traffic yields the most benefit. Balancing performance and implementation cost, WNoC_VC2 or WNoC_VC4 is the best candidate to implement WNoC routers. This mechanism not only decreases the travelling distance by using wireless links but also helps relieve the busy links located in the central part of the network. Although subnets are still wired networks, they are less likely to cause congestion than global networks due to their comparatively smaller size, resulting in an overall system throughput improvement.

Figure 6. Saturation load in NePA/WNoC VC networks: (a) synthetic traffic (Random, Bit complement, Bit reverse, Matrix transpose); (b) 3-tuple traffic (Case1 to Case8). Vertical axis: Saturation load. Series: NePA, WNoC_VC1, WNoC_VC2, WNoC_VC3, WNoC_VC4.

Figure 7. Average latency in NePA/WNoC VC networks for synthetic traffic: (a) Random; (b) Bit complement; (c) Bit reverse; (d) Matrix transpose. Axes: Traffic Load (flits/node/cycle) versus Average Latency (cycles). Series: NePA, WNoC_VC1, WNoC_VC2, WNoC_VC4, WNoC_VC8.

C. Implementation Cost Evaluation
The feasibility of WNoC is assessed by estimating the hardware implementation cost. The BR has been designed at Register-Transfer Level (RTL) in Verilog™ HDL. A logic description of the router component has been obtained by logic synthesis using Synopsys™ Design Compiler and the TSMC™ 65 nm CMOS generic process technology. The BRs of WNoC_VC2 and WNoC_VC4, which include two and four parallel buffers for each input port, occupy an area of approximately 0.071 mm² and 0.123 mm², respectively, in 65 nm technology. The area of the WRs, each of which consists of a hybrid router and a wireless base station, is 0.089 mm² for WNoC_VC2 and 0.141 mm² for WNoC_VC4. The area of the wireless base station is estimated to be 18,332 um² in 65 nm technology, obtained by scaling a 130 nm design [8]. We compare the area of the WR with widely used processing elements such as the ARM11 MPCore™ and the PowerPC™ 405 10SF, which occupy 0.938 mm² and 1.4 mm² in 65 nm technology, respectively [32][33]. If the WR and FIFOs are integrated within a CMP as an interconnection network, the area overhead imposed by the network would be within the margin of design, demonstrating the feasibility of this approach.
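To put the area figures above in proportion, the following sketch computes the WR area relative to each reference core and checks that the difference between WR and BR area is consistent with the quoted base-station estimate. It is a back-of-the-envelope calculation on the numbers in the text; the dictionary and function names are illustrative.

```python
# Areas in mm^2 at 65 nm, as quoted in the text.
ROUTER_AREA_MM2 = {
    "WNoC_VC2": {"BR": 0.071, "WR": 0.089},
    "WNoC_VC4": {"BR": 0.123, "WR": 0.141},
}
CORE_AREA_MM2 = {"ARM11 MPCore": 0.938, "PowerPC 405": 1.4}
BASE_STATION_MM2 = 18332e-6  # 18,332 um^2 expressed in mm^2


def overhead_percent(router_mm2: float, core_mm2: float) -> float:
    """Router area as a percentage of the processing-element area."""
    return 100.0 * router_mm2 / core_mm2


def wr_minus_br(config: str) -> float:
    """Extra area of a WR over a BR, which should be the wireless base station."""
    areas = ROUTER_AREA_MM2[config]
    return areas["WR"] - areas["BR"]


if __name__ == "__main__":
    for config, areas in ROUTER_AREA_MM2.items():
        for core, core_area in CORE_AREA_MM2.items():
            pct = overhead_percent(areas["WR"], core_area)
            print(f"{config} WR vs {core}: {pct:.1f}% of core area")
        print(f"{config} WR - BR = {wr_minus_br(config):.3f} mm^2 "
              f"(base station estimate {BASE_STATION_MM2:.3f} mm^2)")
```

With these inputs the WR occupies roughly 6% to 15% of the area of a reference core, and the WR minus BR difference (0.018 mm²) agrees with the base-station estimate in both configurations.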
Figure 8. Average latency in NePA/WNoC VC networks for 3-tuple traffic: (a) Case1; (b) Case2; (c) Case3; (d) Case4; (e) Case5; (f) Case6; (g) Case7; (h) Case8. Axes: Traffic Load (flits/node/cycle) versus Average Latency (cycles). Series: NePA, WNoC_VC1, WNoC_VC2, WNoC_VC4, WNoC_VC8.
D. Power Consumption Evaluation

Power analyses are performed with the Synopsys PrimeTime™ PX tool [37], which can analyze power consumption with nanosecond precision. A statistical approach is used to extract the power model from the physical implementation of interconnection networks [6][21]. Furthermore, regression analysis is adopted to characterize the related parameters and form the router power equation. For wireless communication, wireless interconnect power is evaluated analytically. In the scenario of a 10x10 WNoC platform, assuming each core occupies an area of 1 mm² which includes a processor and a router, the longest transmission path is about 7.5 mm in the diagonal direction. Therefore, an on-chip wireless antenna with a transmitter power of -10 dBm can provide acceptable transmission performance, and the energy consumption will be 4.5 pJ/bit [20]. Based on this assumption, a WNoC power model is produced and incorporated into the WNoC simulator so that we can estimate performance and power consumption simultaneously for different benchmarks. The power consumption of the NoC is mainly attributed to router, FIFO and link power:

$$P = P_{ROUTER} + P_{FIFO} + P_{LINK} \qquad (1)$$

The router model is:

$$P_{ROUTER} = \alpha_0 + \alpha_H \Psi_H + \alpha_S \Psi_S + \alpha_{S'} \Psi_{S'} \qquad (2)$$

where $\Psi_H$ is the Hamming distance of outgoing flits; $\Psi_S$ is the number of outgoing ports passing body flits; $\Psi_{S'}$ is the number of state transitions of outgoing ports; $\alpha_H$, $\alpha_S$ and $\alpha_{S'}$ are the regression coefficients for the respective variables; and $\alpha_0$ is the power term which is independent of the variables. The FIFO model is:

$$P_{FIFO} = \alpha_0 + \alpha_{HW} \Psi_{HW} + \alpha_{HR} \Psi_{HR} \qquad (3)$$

where $\Psi_{HW}$ is the Hamming distance of write flits; $\Psi_{HR}$ is the Hamming distance of read flits; $\alpha_{HW}$ and $\alpha_{HR}$ are the regression coefficients for the respective variables; and $\alpha_0$ is the power term which is independent of the variables.

Link power consumption also plays an important role in deep submicron design. To evaluate link power consumption, we used the Orion 2.0 power models [2] to calculate WNoC link power. $P_{LINK}$ depends on the number of router ports, the link length and the transmission probability. The link model is:

$$P_{LINK} = \alpha \cdot C_l \cdot V_{dd}^2 \cdot f_{clk} \qquad (4)$$

where $\alpha$, $C_l$, $V_{dd}$ and $f_{clk}$ denote the activity factor, load capacitance, supply voltage, and frequency, respectively. Using a 65 nm ARM11 MPCore, which occupies 0.938 mm², for an embedded design of the NoC platform, the link between processors is assumed to be 1050 um. In our simulation scenario, the NePA platform uses only wired links to transfer data and WNoC utilizes the hybrid network. We observed that WNoC consumes less link power because wireless links accelerate data transmission and lower the travelling hop count, which decreases link transitions during communication tasks.

To evaluate power consumption comprehensively, different traffic patterns were used to analyze and compare system power consumption for various network scenarios. Fig.9 depicts performance and power consumption for synthetic traffic patterns in NePA and WNoC networks. The results reveal that WNoC networks reduce both latency and power consumption when compared with NePA. Flits traverse fewer hops through WNoC and access fewer FIFOs, so the overall power consumption is less than in NePA. Latency in WNoC is about 80 to 90% of that in NePA, which means flits in WNoC either traverse fewer hops or block fewer
times than NePA. These characteristics also explain why WNoC consumes about 70 to 90% of the power of NePA. On one hand, accelerating transmission tasks reduces average latency and the energy consumed by the tasks themselves, which can be exploited to optimize system energy consumption. On the other hand, a shorter active time lets the system enter power-saving mode earlier, lowering overall system energy consumption for the rest of the data transmission time.

[Figure: x-axis: Traffic Patterns (Random, Bit complement, Bit reverse, Matrix transpose); y-axes: Power (mW) and Average Latency (cycles); series: NePA_Power, WNoC_Power, NePA_Latency, WNoC_Latency.]
Figure 9. Comparison of average latency and power consumption for NePA/WNoC networks under a load of 0.11 flits/node/cycle

V. CONCLUSION AND FUTURE WORK

In this paper, we proposed a hybrid wired and wireless network platform. The proposed WNoC architecture and effective routing algorithm not only improve performance and throughput by using wireless links between subnets and make efficient use of area and power, but also support more advanced features to accommodate various services for an NoC architecture. We have demonstrated its performance enhancement over traditional wired networks. WNoC networks outperform wired networks in average latency and saturation load by taking advantage of wireless links. With alternative links employed between routers, we anticipate that WNoC has the potential to support QoS, fault-tolerant routing and multicasting/broadcasting to accommodate more diverse services in the future. Our next step for WNoC routers is to continue improving router mechanisms so as to enhance transmission quality.

REFERENCES

[1] A. Ganguly, K. Chang, P. P. Pande, B. Belzer and A. Nojeh, “Performance evaluation of wireless networks on chip architectures,” in Proceedings of the 10th International Symposium on Quality of Electronic Design, pp. 350-355, 2009.
[2] A. Kahng, B. Li, L. Peh and K. Samadi, “ORION 2.0: A Fast and Accurate NoC Power and Area Model for Early-Stage Design Space Exploration,” in Proceedings of Design Automation and Test in Europe (DATE), Nice, France, April 2009.
[3] A. Shacham et al., “Photonic Network-on-Chip for Future Generations of Chip Multi-Processors,” IEEE Transactions on Computers, vol. 57, issue 9, pp. 1246-1260, 2008.
[4] A. S. Vaidya, A. Sivasubramaniam and C. R. Das, “Impact of Virtual Channels and Adaptive Routing on Application Performance,” IEEE Trans. Parallel Distributed Systems 12(2), pp. 223-237, 2001.
[5] B. A. Floyd, C. M. Hung, and K. K. O, “Intra-chip wireless interconnect for clock distribution implemented with integrated antennas, receivers and transmitters,” IEEE Journal of Solid-State Circuits, 37(5):543-552, May 2002.
[6] C. Wang, W. Hu, S. E. Lee and N. Bagherzadeh, “Area and power-efficient innovative Network-on-Chip architecture,” 18th Euromicro International Conference on Parallel, Distributed and Network-Based Computing, PDP 2010, Pisa, Italy, Feb. 2010.
[7] D. Huang et al., “Terahertz CMOS frequency generator using linear superposition technique,” IEEE Journal of Solid State Circuits, Dec 2008.
[8] D. Zhao and Y. Wang, “SD-MAC: Design and Synthesis of a Hardware-Efficient Collision-Free QoS-Aware MAC Protocol for Wireless Network-on-Chip,” IEEE Trans. Comput., pp. 1230-1245, 2008.
[9] E. Seok et al., “A 410GHz CMOS push-push oscillator with an on-chip patch antenna,” ISSCC, 2008.
[10] J. H. Bahn, S. E. Lee and N. Bagherzadeh, “On design and analysis of a feasible Network-on-Chip (NoC) architecture,” 4th International Conference on Information Technology, ITNG 2007, pp. 1033-1038, 2007.
[11] J. H. Bahn and N. Bagherzadeh, “Efficient Parallel Buffer Structure and Its Management Scheme for a Robust Network-on-Chip (NoC) Architecture,” 13th International CSI Computer Conference, CSICC 2008, Kish Island, Iran, March 2008.
[12] K. Kim, H. Yoon and K. O, “On-chip wireless interconnection with integrated antennas,” in International Electron Devices Meeting, pp. 485-488, 2000.
[13] K. K. O, K. Kim, et al., “On-chip antennas in silicon ICs and their application,” IEEE Trans. Electron Devices, vol. 52, pp. 1312-1323, 2005.
[14] L. Benini and G. De Micheli, “Networks on chip: a new paradigm for systems on chip design,” Proc. of Design, Automation and Test Conference in Europe, pp. 418-419, 2002.
[15] L. P. Carloni, P. Pande and Y. Xie, “Networks-on-chip in emerging interconnect paradigms: Advantages and challenges,” in Proceedings of the 2009 3rd ACM/IEEE International Symposium on Networks-on-Chip, pp. 93-102, 2009.
[16] M. F. Chang et al., “CMP Network-on-Chip Overlaid With Multi-Band RF-Interconnect,” Proceedings of IEEE International Symposium on High-Performance Computer Architecture (HPCA), pp. 191-202, 2008.
[17] M. Rezazad and H. Sarbazi-azad, “The Effect of Virtual Channel Organization on the Performance of Interconnection Networks,” In: IPDPS 2005, p. 264.1. IEEE Computer Society, Washington, 2005.
[18] P. P. Pande, A. Ganguly, K. Chang, and C. Teuscher, “Hybrid wireless Network on Chip: a new paradigm in multi-core design,” in Proceedings of the 2nd International Workshop on Network on Chip Architectures, pp. 71-76, 2009.
[19] S. Bell et al., “TILE64 Processor: A 64-Core SoC with Mesh Interconnect,” Solid-State Circuits Conference, Digest of Technical Papers. IEEE International, pp. 88-598, 2008.
[20] S. Lee et al., “A scalable micro wireless interconnect structure for CMPs,” in Proceedings of the 15th Annual International Conference on Mobile Computing and Networking, pp. 217-228, 2009.
[21] S. E. Lee and N. Bagherzadeh, “A high-level power model for Network-on-Chip (NoC) router,” Computer & Electrical Engineering, High Performance Computing Architectures (HPCA), vol. 35, no. 6, pp. 837-845, 2009.
[22] S. Loucif, M. Ould-Khaoua, and L. M. Mackenzie, “On the performance merits of bypass channels in hypermeshes and k-ary n-cubes,” Comput. J., vol. 42, no. 1, pp. 62-72, 1999.
[23] S. Vangal et al., “An 80-Tile 1.28TFLOPS Network-on-Chip in 65nm CMOS,” Solid-State Circuits Conference, Digest of Technical Papers. IEEE International, pp. 98-589, 2007.
[24] V. F. Pavlidis and E. G. Friedman, “3-D topologies for networks-on-chip,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 15(10):1081-1090, 2007.
[25] V. Soteriou, N. Eisley, Hangsheng Wang, Bin Li and L-S. Peh, “Polaris: A System-Level Roadmap for On-Chip Interconnection Networks,” in Proceedings of the 24th International Conference on Computer Design (ICCD), San Jose, October 2006.
[26] W. J. Dally and B. Towles, “Route packets, not wires: on-chip interconnection networks,” Proc. of Design Automation Conference, pp. 684-689, 2001.
[27] W. J. Dally, “Express cubes: improving the performance of k-ary n-cube interconnection networks,” IEEE Trans. Comput., vol. 40, no. 9, pp. 1016-1023, 1991.
[28] W. J. Dally and C. L. Seitz, “Deadlock-Free Message Routing in Multiprocessor Interconnection Networks,” IEEE Trans. Computer C-36(5), pp. 547-553, 1987.
[29] W. J. Dally, “Virtual-Channel Flow Control,” IEEE Trans. Parallel and Distributed Systems 3(2), pp. 194-205, 1992.
[30] W. J. Dally, Principles and Practices of Interconnection Networks. Morgan Kaufmann, 2004.
[31] W. M. J. Green, M. J. Rooks, L. Sekaric, and Y. A. Vlasov, “Ultracompact, low RF power, 10 Gb/s silicon Mach-Zehnder modulator,” Optics Express, vol. 15, issue 25, pp. 17106-17113, December 2007.
[32] ARM. ARM11 MPCore. (http://www.arm.com)
[33] IBM. IBM PowerPC405 Embedded Core. (http://www.ibm.com)
[34] International Technology Roadmap for Semiconductors, 2007 edition.
[35] NVIDIA Quadro FX 5600. (http://www.nvidia.com)
[36] NVIDIA Tesla C1060. (http://www.nvidia.com)
[37] Synopsys, Synopsys Design Compiler, PrimeTime PX. (http://www.synopsys.com)
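As a rough illustration of the power model in Eqs. (1)-(4), the following Python sketch evaluates each term for one set of inputs. It is only a sketch under stated assumptions: the class and function names are hypothetical, the coefficient values in the usage note are placeholders rather than regression results from the paper, and the paper's simulator may organize this computation differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouterCoeffs:
    """Regression coefficients of Eq. (2). Values are supplied by the caller."""
    a0: float       # constant power term (alpha_0)
    a_h: float      # weight of the Hamming distance of outgoing flits
    a_ports: float  # weight of the number of outgoing ports passing body flits
    a_trans: float  # weight of the number of output-port state transitions


@dataclass(frozen=True)
class FifoCoeffs:
    """Regression coefficients of Eq. (3). Values are supplied by the caller."""
    a0: float    # constant power term (alpha_0)
    a_hw: float  # weight of the Hamming distance of write flits
    a_hr: float  # weight of the Hamming distance of read flits


def hamming_distance(prev_flit: int, next_flit: int) -> int:
    """Number of bit positions that differ between two flits."""
    return bin(prev_flit ^ next_flit).count("1")


def router_power(c: RouterCoeffs, psi_h: float, psi_ports: float, psi_trans: float) -> float:
    """Eq. (2): linear regression model of router power."""
    return c.a0 + c.a_h * psi_h + c.a_ports * psi_ports + c.a_trans * psi_trans


def fifo_power(c: FifoCoeffs, psi_hw: float, psi_hr: float) -> float:
    """Eq. (3): linear regression model of FIFO power."""
    return c.a0 + c.a_hw * psi_hw + c.a_hr * psi_hr


def link_power(activity: float, c_load: float, vdd: float, f_clk: float) -> float:
    """Eq. (4): alpha * C_l * Vdd^2 * f_clk (watts for farads, volts, hertz)."""
    return activity * c_load * vdd ** 2 * f_clk


def total_power(p_router: float, p_fifo: float, p_link: float) -> float:
    """Eq. (1): sum of router, FIFO and link power (all in the same unit)."""
    return p_router + p_fifo + p_link
```

For example, with placeholder coefficients `RouterCoeffs(1.0, 0.5, 0.25, 0.125)` and inputs `psi_h=4, psi_ports=2, psi_trans=8`, `router_power` returns `4.5`. Keeping each equation as a separate function makes it easy to swap in measured coefficients without touching the summation in Eq. (1).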
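The wireless-link assumption used in the power evaluation (a -10 dBm transmitter and 4.5 pJ/bit [20]) can be turned into quick arithmetic. The Python sketch below converts dBm to milliwatts and estimates the energy of sending one flit over a wireless link. The function names and the 64-bit flit width in the usage note are assumptions made for illustration; the text above does not state a flit width.

```python
ENERGY_PER_BIT_PJ = 4.5  # wireless energy per bit quoted in the text [20]


def dbm_to_mw(p_dbm: float) -> float:
    """Convert a power level in dBm to milliwatts: P_mW = 10^(P_dBm / 10)."""
    return 10 ** (p_dbm / 10)


def wireless_energy_pj(bits: int, energy_per_bit_pj: float = ENERGY_PER_BIT_PJ) -> float:
    """Energy in picojoules to send `bits` over one wireless hop."""
    return bits * energy_per_bit_pj
```

Under these assumptions, `dbm_to_mw(-10)` is approximately 0.1 mW, and a hypothetical 64-bit flit costs `wireless_energy_pj(64)`, which is 288.0 pJ, per wireless hop.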