
IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 31, NO. 3, MARCH 2012

Intelligent Hotspot Prediction for Network-on-Chip-Based Multicore Systems Elena Kakoulli, Student Member, IEEE, Vassos Soteriou, Member, IEEE, and Theocharis Theocharides, Senior Member, IEEE

Abstract—Hotspots are network-on-chip (NoC) routers or modules in multicore systems which occasionally receive packetized data from other networked element producers at a rate higher than they can consume it. This adverse phenomenon may greatly reduce the performance of NoCs, especially when wormhole flow-control is employed, as backpressure can cause the buffers of neighboring routers to quickly fill up, leading to a spatial spread in congestion. This can cause the network to saturate prematurely; in the worst scenario the NoC may be rendered unrecoverable. Thus, a hotspot prevention mechanism can be greatly beneficial, as it can potentially enable the interconnection system to adjust its behavior and prevent the rise of potential hotspots, subsequently sustaining NoC performance. The inherently uneven traffic patterns in an NoC-based general-purpose multicore system such as a chip multiprocessor, due to the diverse and unpredictable access patterns of applications, produce unexpected hotspots whose appearance cannot be known a priori, as application demands are not predetermined, making hotspot prediction, and subsequently prevention, difficult. In this paper, we present an artificial neural network-based (ANN) hotspot prediction mechanism that can potentially be used in tandem with a hotspot avoidance or congestion-control mechanism to handle unforeseen hotspot formations efficiently. The ANN uses online statistical data to dynamically monitor the interconnect fabric, and reactively predicts the location of any about-to-be-formed hotspot(s), allowing enough time for the multicore system to react to these potential hotspots. Evaluation results indicate that a relatively lightweight ANN-based predictor can forecast hotspot formation(s) with an accuracy ranging from 65% to 92%.

Index Terms—Multiprocessor interconnection, neural network hardware, on-chip network, ultralarge-scale integration.

I. Introduction

NETWORKS-ON-CHIP (NoCs) [10] have become the preferred communication backbone in high-performance multicore chips such as general-purpose chip multiprocessors (CMPs) and application-specific systems-on-chips (SoCs). NoCs have already been utilized in ultrahigh-performance products such as the Tilera TILE64 CMP [2] and the 48-core single-chip cloud computer (SCC) [23], hence becoming functional and inbuilt components in massively parallel on-chip systems.

Manuscript received March 28, 2011; revised July 13, 2011; accepted August 29, 2011. Date of current version February 17, 2012. This paper was recommended by Associate Editor R. Marculescu. E. Kakoulli and V. Soteriou are with the Cyprus University of Technology, Limassol 3603, Cyprus (e-mail: [email protected]; [email protected]). T. Theocharides is with the University of Cyprus, Nicosia, Cyprus (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TCAD.2011.2170568

In the commonly NoC-employed [3], [16] wormhole flow-control (WFC) [11], communication among the various on-chip components is organized in the form of packetized messages, of arbitrary length, which are further segmented into logical link-width chunks called flow-control units, or flits for short. However, the spreading of a packet in a pipelined fashion across several routers at a time makes WFC susceptible to packet blocking and possible indefinite stalling, as a prolonged delay at an intermediate router quickly forms backpressure which can spread spatially across the network topology in the reverse direction of the packet's traversal path. To alleviate this problem, designers usually employ several resource utilization-enhancing mechanisms, such as using virtual channels (VCs) at buffers [9], exploiting advance resource-reserving control packets [36], designing specialty architectures or schemes to guarantee communication rates [7], or, inefficiently, operating high-performance NoCs at relatively low utilization rates to avoid message blocking [35].

Hotspots¹ are NoC routers or modules in multicore systems which occasionally receive packetized traffic from the remaining networked element producers at a rate faster than they can consume it, as interconnecting links and output (and input) ports are bandwidth-limited and as the traffic load distribution of actual applications is intrinsically uneven, such as at the bisection of 2-D mesh NoC topologies [32]. This consumer–producer gap is inherently unavoidable, especially in general-purpose multicore systems (i.e., CMPs), due to the diverse and unpredictable access patterns of applications, which dictate network traffic that is uneven in distribution or possesses a bursty or streaming nature across interconnected paths spanning portions of the network topology.
Even a single router can cause a hotspot, and worse, a hotspot can appear even when using links of theoretically infinite bandwidth. Hotspots can also be caused by nonoptimal application mapping, a lack of traffic balancing when using oblivious routing algorithms, application migration, and resource demands that occur unpredictably and dynamically [4], [34]. Hotspots have a spatial component, when a subset of the routers receives the majority of the traffic, and a temporal component, as these router nodes receive traffic often.

¹The term "hotspot" may also refer to networked elements which possess a thermal profile that is higher than that of the network's average temperature level, a direct consequence of localized network contention.

0278-0070/$31.00 © 2012 IEEE

Fig. 1. Performance comparison of nonhotspot (URT) traffic versus synthetic hotspot traffic for various adaptive and deterministic routing algorithms and VC per-input port counts. The term "VCs" stands for virtual channel count per router input port.

The adverse phenomenon of hotspot formation greatly reduces the performance of an NoC, degrading its effective throughput and increasing the average communication latency, even under adaptive routing, which aims at balancing out the traffic among alternative routing paths (see Fig. 1 and the associated Section II-A). This effect is particularly intense when the widely employed WFC used in today's NoCs is in place; backpressure can cause the buffers of neighboring routers to quickly fill up in a domino-style effect, leading to a spatial spread in congestion. This can cause the network to form a "saturation tree," and even worse, to block due to severe traffic-induced congestion owing to network resource overutilization (links, buffers, VCs, etc.), a phenomenon which can unfortunately be irreversible. A traffic hotspot, an intense form of network congestion, may thus be responsible for catastrophic NoC performance degradation when it causes the network to stall indefinitely, under which state the NoC is deemed inoperable. Though hotspots can be predicted with relative accuracy in SoCs, where application demands can be determined a priori [48], and thus be reduced, hotspot formation is especially unpredictable in the general-purpose best-effort parallel on-chip systems, such as CMPs, that are considered in this article; the application behavior of CMPs is not known exactly a priori so as to react accordingly by activating exact spatiotemporal hotspot reduction mechanisms. A number of surveys [4], [31], [34] which outline the design challenges and lay the roadmap for future multicore design have emphasized the need to conduct research and identify the primary challenges in NoC congestion-management techniques as a means to safeguard the scalability and performance sustainability of both general-purpose CMPs and application-driven SoCs.
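The consumer–producer gap described above can be made concrete with a back-of-the-envelope model (the rates and cycle counts below are invented for illustration): whenever the aggregate injection rate toward one router exceeds its drain rate, the backlog at that router grows linearly, and it is this backlog that backpressure then spreads to its neighbors.

```python
# Minimal sketch of hotspot onset: excess offered load accumulates as a
# growing backlog at the oversubscribed (hotspot) router. Finite buffers
# are ignored; in a real WFC NoC the backlog spills into upstream routers.
def queue_depth(inject_rate: float, drain_rate: float, cycles: int) -> float:
    """Flits queued at a consumer after `cycles` of sustained traffic."""
    return max(0.0, (inject_rate - drain_rate) * cycles)

# Four producers, each at 0.3 flits/cycle, toward a router draining 1/cycle:
backlog = queue_depth(4 * 0.3, 1.0, 500)  # 0.2 excess flits/cycle -> ~100 flits
balanced = queue_depth(0.8, 1.0, 500)     # under capacity: no backlog
```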
Techniques such as application scheduling [4], and even dynamic congestion management in the form of adaptive routing protocols [11], [13], [24], [27], [29], [40], are not always sufficient, as NoC congestion is an unpredictable and complex phenomenon. Unfortunately, adding extra buffering to alleviate the impact of hotspots in resource-constrained NoCs will not improve the network throughput, but will only help in housing flits for longer in the case of bursty traffic causing such a hotspot [33]. In this paper, we present an attempt toward utilizing artificial intelligence (AI) in predicting the formation of traffic hotspots. We choose AI principles because of their adaptability to changing traffic conditions and their ability to learn about small network spatiotemporal variations which can lead to online congestion, and thus to progressively improve their ability to forecast the next hotspot occurrence in advance. We use a hardware-based artificial neural network (ANN), which we train and employ to detect the formation of hotspots, in an effort to notify the system so that it can potentially take corrective action. We extend our initial work [26] with several new contributions: 1) the impact of traffic hotspots for several routing algorithms is explored; 2) detailed synthesis and ANN overhead analysis for several NoC router architectures are performed; 3) the experimental result range is expanded; and 4) architectural details on the ANN are extended, while scalability issues and optimization strategies are examined. The proposed ANN is designed with minimal hardware overheads, and can be implemented as an independent processing element (PE). The ANN uses traffic pattern data to dynamically monitor the interconnect fabric, and reactively predicts the location of a potential hotspot(s), allowing ample time for the system to adjust its behavior and hence potentially avert the hotspot formation(s). The ANN is trained using synthetic hotspot traffic, and evaluated using both synthetic and real-system application traces from the Raw [45] and TRIPS [38] 2-D mesh-interconnected multicore chips. The applicability of our methodology across various network sizes, applications, and platforms shows its scalability, versatility, and adaptability. Evaluation results indicate that a relatively small ANN can predict hotspot formations with an accuracy ranging from 65% to 92%.

Section II presents a survey of related work, and briefly introduces ANNs and their predictive abilities.
Section III presents our ANN-based NoC hotspot prediction methodology, the proposed ANN architecture, and the data it processes, and Section IV explains how we model hotspots. Next, Section V outlines our experimental setup, while Section VI presents ANN hardware synthesis results and quantitative accuracy results in predicting NoC hotspots in CMPs using a range of traffic traces gathered from full-system application runs. Finally, Section VII concludes this paper.

II. Background and Related Work

A. Hotspot Impact on an NoC's Effective Throughput

The intrinsically unevenly distributed nature of packetized traffic being transported in an NoC, both spatially across the topology and temporally during operation, emanates from the unpredictable runtime access patterns of applications. This trait is particularly observable in NoCs that handle intertile communication in general-purpose multicore systems, and especially noticeable when running multimedia applications where traffic can be described as bursty, streaming, or "jittery" [4], [31], [33], [34]. Cacheable traffic (read-only and read-write sharing) also places a significant load onto the NoC [34]. This uneven distribution of network traffic creates regions at and around network routers with increased contention, which can quickly lead to congestion, or in some cases create demand for throughput at levels above and beyond the network's transport capability (i.e., bandwidth), which can give rise to hotspots. Hotspots, if not handled in time, can quickly lead to severe performance degradation and even indefinite blocking of packet flows, which can render the NoC unrecoverable, stalling the entire multicore system's operation. The effects of even a single hotspot can quickly spread spatially across the NoC topology, applying backpressure onto the advancing traffic and eventually halting its flow even when traversing interconnect links of theoretically infinite bandwidth. There are three types of hotspots: 1) hotspots at the network interface of the source router(s)² (producer); 2) intermediately located hotspots between receiver–transmitter router pairs; and 3) receiver or destination router (consumer) hotspots.³ The source-induced hotspots can more easily be handled, as the operating system, the sender router(s), or a multicore module(s) can delay the transmission of packets at the network interface buffers, which are usually much larger in capacity than the intermediate buffers located at the input ports of routers. This method restricts the throughput demands placed upon the network, ensuring that they fall well within the bandwidth capacity of the network; however, this comes at a penalty of increased application execution time. The last type, the destination hotspots which are examined in this paper, are the most severe form of hotspots, as several sources of traffic streams, even with each transmitting packets at typical rates, can cause the appearance of a destination hotspot that the multicore system cannot anticipate. Worse, once a hotspot builds up it is very hard to alleviate its effect, as traffic is already in-flight.
Here, certain architectural [18], [32], [49] and algorithmic approaches, such as adaptive routing protocols with load-balancing selection functions [24], [27], [29], can reduce the impact of hotspots, but as will be seen later they alone cannot guarantee the avoidance of hotspots and the prevention of their adversarial effects on the network's sustained effective throughput. Other possible solutions would be to use infinite buffering to house traffic when it cannot advance due to a contention-induced hotspot(s), or to use wider switches at crossbars [34]. However, studies [28] have shown that these approaches cannot help by extending network throughput to sustain a hotspot(s), and that they can only help with the "absorption" of the bursty traffic that causes the hotspot; thus, the negative effects of a hotspot(s) cannot be fully eradicated. Last, intermediately located hotspots are hotspots created when a combination of sender–receiver and/or receiver hotspots exists, as hotspots can shift their effect geographically in the network, causing other hotspots. Even the superimposition of traffic streams can cause an unforeseen hotspot(s); intermediate hotspots are therefore by-products of other hotspots and/or of increased NoC throughput demands due to convergent traffic spread at various routers in the network.

²The term router can also be replaced with the terms "module," "tile," "switch," etc., depending upon the architecture of the multicore system.

³In the rest of this article the term "hotspot" refers to consumer router, or destination router, or receiver router hotspots, unless explicitly stated.

To show the effect of hotspots we simulated two types of traffic, uniform random (URT) and destination hotspot (HS)

traffic produced with our hotspot generation model described in Section IV, on an 8 × 8 mesh-connected NoC using three state-of-the-art adaptive functions: the fully adaptive Duato's protocol [13], the near-optimal O1TURN adaptive routing algorithm [40], and the DyXY congestion-oriented adaptive routing algorithm [29]. We also run the standard nonadaptive dimension-order routing (DOR) function for comparison. Fig. 1 shows the large negative impact of HS traffic in an NoC versus URT, even under the adaptive routing functions examined here; NoC latency under HS traffic increases at a faster rate and the network saturates well below its traffic transport rate potential, with certain cases showing more than 40% degradation in the NoC's sustainable effective throughput. In short, the network saturates prematurely and becomes slower across its entire operational range. Note that some of the adaptive routing functions exhibit worse performance under URT; it is a well-known fact that under URT DOR provides better load-balancing than adaptive routing algorithms, which tend to concentrate packets at the diagonal of 2-D mesh-interconnected NoCs [32].

B. Related Work

The problem of hotspot prevention in interconnected parallel and distributed systems has been explored, especially in the domain of large-scale interconnection networks found in off-chip parallel computer systems, and relatively recently, to a much lesser extent, in the domain of NoCs. There are two main categories of approaches: 1) implicit approaches that mostly deal with reducing network congestion via workload load-balancing, and 2) explicit approaches that are specifically geared toward reducing the impact of hotspots upon an NoC. Approaches under category 1) are equally applied to off-chip and on-chip systems and their range is vast; the interested reader is urged to refer to survey works such as [4], [11], and [34].
These mostly involve load-balancing schemes incorporated into relevant adaptive routing functions, most of which are localized or decentralized to reduce overheads, and which aim at distributing uneven traffic among alternative lightly-loaded router output ports or routing paths spanning the topology. To achieve this they employ selection functions that make use of online statistics, such as link and buffer utilization, as load-level indicators which lead them to decide among less heavily utilized candidate link(s) or path(s) to which to direct traffic. The routing functions tested in Section II-A, namely [13], [40], and [29] (the latter designed only for NoCs), can be coupled to such selection functions and, armed with their adaptability, can further reduce network contention, but cannot explicitly handle hotspots. A few sample works from the off-chip domain include the GOAL load-balanced adaptive routing algorithm for torus networks, which follows a hybrid locally-adaptive globally-oblivious approach [41], and the spatially-based mechanism in [15] that deflects packets away from congested paths. In the on-chip domain, the recent work by Gratz et al. [18] aimed to enhance global load-balancing in adaptive routing, and thus improve NoC throughput, using a lightweight technique that informs a congestion-aware routing policy about traffic levels in network regions beyond the locality of adjacent network routers. Next, in [27] the balanced adaptive routing protocol is proposed, which uses hybrid local–global information regarding network congestion to distribute traffic evenly among the shortest routing paths so as to avoid congestion emergencies. Next, in [32] the authors proposed a memory-less switch for NoCs that misroutes packets to nonideal routes during instances of elevated localized congestion levels so as to improve performance. The DyAD routing algorithm [24] adopts buffer occupation statistics and accordingly exchanges flags between neighbors to signal congestion, and hence acts accordingly (deterministically or adaptively) to increase NoC throughput. The mechanism in [46] uses source routing to adaptively change the path traversed by traffic between source and destination nodes to reduce congestion and thus improve NoC throughput. This requires QoS session establishment and traffic monitoring, where congestion information is exchanged globally across the topology. Finally, the work in [49] investigated the impact of input selection at NoC routers and presented a contention-aware technique to enhance NoC routing efficiency. All the above works measure network congestion online, react to adverse conditions, and then act to reduce or correct the impact of elevated network throughput demands by spatially spreading traffic among alternative routes so as to decrease localized contention. All these schemes exploit no advance information about potential hotspots in the network topology and respond only to already elevated network delays in packet transmission progression. Instead, the hotspot prediction mechanism proposed here predicts near-future hotspots in a holistic, network-wide manner, unlike the aforementioned works where corrective actions are applied locally or semiglobally, to allow the NoC to adjust its behavior before increased latency and delays take effect.
Thus, our approach is orthogonal to all these load-balancing and/or congestion-aware approaches, and can be applied alongside them as a means of forecasting future events detrimental to NoC performance, directing high throughput-oriented routing decisions. Work in NoC traffic analysis also plays an important role in accurately modeling workload characteristics so as to monitor the state of the network and hence provide metrics for cost-performance tradeoffs, i.e., to ensure certain throughput levels and to reduce the impact of congestion on the NoC. The work in [6] analyzed the traffic dynamics of multicore systems to show how the nonstationary effects of NoC workloads can be captured mathematically. Results demonstrate that both network architecture and application characteristics dictate a power-law distribution in buffer utilization. This helps account for buffer finiteness with reduced overflow probability, even in the presence of hotspots, according to both application and user behaviors. The same authors in [5] successfully determined the optimal buffer sizing in NoC routers by adopting a statistical physics approach to characterize the probability density of buffers. Their approach incorporates said power-law distribution, correlations, and scaling properties exhibited within an NoC due to various network transactions. Next, in both off-chip and on-chip systems, most of the explicit hotspot management efforts [category 2)] focus on reducing the number of routed packets destined to highly utilized router modules using various schemes that are detailed next. All these works assume that the spatial locations of hotspots are either: 1) known a priori, with the hotspot management techniques proactively aiming at reducing the possibility of hotspot formations, or 2) handled by reacting spatially and temporally to reduce the possibility of hotspot occurrences. There are various hotspot prevention works in the domain of large-scale off-chip interconnected parallel computers. The methodology in [1] reacts temporally and throttles traffic at the router injection ports, preventing further traffic from being routed into a hotspot(s). Another temporally-based mechanism is the discarding of hotspot-bound network packets, which are later retransmitted to their destinations [21]. Additionally, the schemes in [14] use separate buffering for efficiently handling hotspot-destined traffic, or use a large number of VCs. Finally, the work in [39] presented an analytical model of hotspot traffic in k-ary n-cube networks. Most of the hotspot-related work in the domain of off-chip networks is unsuitable for NoCs, as their resources are restricted (VCs, protocol computation, etc.) and packet dropping and retransmission are unacceptable due to the inherent complexity of these mechanisms and the NoC's need to attain ultrahigh performance. Hotspot prevention is mostly based on spatial techniques via the use of adaptive routing which aims to route packets around hotspots, such as the work in [12], which sends control packets up-front to gather information about various paths, update routing tables, and then direct traffic so as to avoid those heavily used paths. Next, [8] presented a congestion-control strategy for NoCs based on a congestion-controlled best-effort communication service that monitors link utilization as a congestion measure. The gathered measurements are then transported to a model predictive controller that bounds latency and offers bandwidth availability at links, hence ensuring traffic progress.
Finally, [33] described a predictive closed-loop flow-control mechanism that utilizes NoC traffic source and router models to control the packet injection rate of traffic producers, so as to regulate the total number of packets in the network and achieve a smooth packet flow. Last, the work by Walter et al. [48] uses a low-cost end-to-end credit-based allocation technique in NoCs to throttle and regulate hotspot-destined traffic. Our hotspot prediction methodology may be coupled to some of the aforementioned works for improved hotspot management.

C. ANNs and Prediction

An ANN is an information processing paradigm consisting of computation nodes called neurons and interconnections between them, called synapses. Neurons are organized in various topologies, depending on the ANN type and application [25]. This paper employs a multilayer perceptron ANN, with neurons structured in layers, where each neuron in layer n forwards its output to all neurons in layer n + 1. The input layer receives the data to be processed, and the output layer returns the computation result. The ANN operates in two stages: the training stage and the computation stage. Each neuron is configured with training data that allows it to make the necessary decisions based on the input data during execution of a desired operation. This paper utilizes a digital implementation of the Hodgkin–Huxley [22] neuron model, where a neuron takes a set of inputs and multiplies each input by a weight value, accumulating the result until all inputs are received. Weights are determined during the training process by a training algorithm (such as back-propagation [25]). A threshold value (also determined in training) is then subtracted from the accumulated result, and this result is then used to compute the output of the neuron via the so-called activation function. Typical activation functions include the hyperbolic tangent, sinusoidal, sigmoid, and even simple step functions [25]. The neuron output is then propagated to a number of destination neurons in another layer, which perform the same general operation with the newly received set of inputs and their own weight/threshold values, and this is repeated for other layers, depending on the complexity and accuracy required by the application. The activation function used, as well as the connectivity of the neurons, depends on the application. ANNs have been successfully used as prediction and forecasting mechanisms in several application areas [30], [44], with relatively high accuracy. A prediction ANN acts as a pattern recognition system: it observes patterns of data and, based on the training data provided to it through supervised learning, attempts to recognize an event that will happen in the future. In our case, the data patterns used by the ANN are traffic data concerning the utilization of NoC router buffers, and, based on this data, the ANN attempts to predict which patterns will lead to a hotspot formation and where the hotspot will be formed in the network. One of the advantages of ANNs over other prediction algorithms is their uncomplicated hardware implementation, especially when the number of neurons remains small and the activation function remains simple [25].
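The per-neuron computation just described (multiply-accumulate over the inputs, subtraction of the trained threshold, then the activation function) can be sketched in a few lines; the weight and threshold values below are made-up examples standing in for trained values.

```python
import math

# Sketch of the neuron model described in the text: a MAC over all inputs,
# minus a trained threshold, passed through an activation function.
# Example weights and threshold are invented; training would determine them.
def neuron(inputs, weights, threshold, activation=math.tanh):
    acc = sum(x * w for x, w in zip(inputs, weights))  # multiply-accumulate
    return activation(acc - threshold)                 # fire via activation

out = neuron([0.5, 0.9, 0.1], [0.4, 0.7, -0.2], threshold=0.3)
```

Swapping `activation` for a simple step function yields the spiking behavior used by the output-layer neurons described in Section III-B.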
Furthermore, the neuron operation can be modeled as a multiply-accumulate (MAC) operation, which can be designed efficiently in hardware, in contrast to the more complicated mathematical models employed in other learning algorithms. ANNs also offer more flexibility and better recognition and prediction capabilities than simple control mechanisms, as they can be dynamically trained for variable scenarios [25]. While this paper focuses on ANNs, other pattern recognition mechanisms such as support vector machines can also be used; this is left as future work.

III. ANN Architecture

In designing the ANN, we take into consideration the conditions leading toward a hotspot formation. As already explained, hotspots are not single routing nodes, but rather a combination of increased utilization in a neighborhood of routers, usually involving two or more routers. Thus, the ANN is designed in such a way as to integrate information from neighboring routers, combining conditions occurring one and two routers away from the router that the ANN will potentially identify as a hotspot. Furthermore, we address the hardware overheads by partitioning the on-chip network into smaller subregions and designing independent ANNs for each region, in an effort to reduce the hardware resources necessary. Potentially, the ANN can be designed as a PE in the network, as Fig. 2(a) and (b) show.

Fig. 2. Examples. (a) Placement of an ANN engine as a dedicated PE in an NoC architecture. (b) Four ANN engines controlling an 8 × 8 NoC. (c) Overall data flow of the proposed ANN engine.

The ANN receives monitoring data from the NoC routers, processes the data, and returns potential hotspot formation warnings to the handling mechanism (beyond the scope of this paper), as shown in Fig. 2(c). The ANN receives utilization information from all the routers that it monitors within a fixed time frame (setting the information from any routers which have not managed to send their information within that time frame to a fixed sentinel value), uses the information, and, within another fixed time frame, produces a "warning" about the possibility of a hotspot forming within the NoC in the next time frame. Routers compute and transmit their utilization data over the NoC in fixed discrete time intervals, using a sequence number for each packet (to distinguish possibly outdated data); this can be done over the existing NoC infrastructure or, in the event that an extra control network is used (i.e., a dedicated NoC used for control purposes), over the control network. Each control packet is therefore transmitted to each base ANN that the router reports to. Let us assume that all available data arrives (or is set to a predetermined sentinel value in the event that there is congestion) at time t0. The ANN processes the data belonging to that time frame, and outputs its predictions at time t = t0 + tα, where tα is the time required for the ANN to compute its output. If we assume that a correction mechanism will require tc cycles to adjust the system operation and avoid the hotspot, then the hotspot must be predicted at least tc cycles before it is formed, which forces the ANN to be trained with hotspots that exhibit formation patterns at least t0 + tα + tc cycles earlier than their formation. We attempt to minimize tα, which eases the training and improves the prediction as well.
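This timing constraint reduces to a simple deadline check, sketched below (the helper name is ours; the 50-cycle values reuse the frames adopted in our experimental setup purely as an example).

```python
# Sketch of the prediction deadline: collection ends at t0, the ANN adds
# t_alpha cycles, and the correction mechanism needs t_c more, so a hotspot
# forming at cycle t_form is avoidable only if t0 + t_alpha + t_c <= t_form.
def prediction_in_time(t0: int, t_alpha: int, t_c: int, t_form: int) -> bool:
    return t0 + t_alpha + t_c <= t_form

meets = prediction_in_time(t0=50, t_alpha=50, t_c=50, t_form=160)   # in time
misses = prediction_in_time(t0=50, t_alpha=50, t_c=50, t_form=140)  # too late
```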
In our experimental setup, we defined the collection time frame as 50 cycles, targeted the ANN computation time to also be less than 50 cycles (so that the ANN computation can overlap a new collection of utilization data), and allowed another 50 cycles for the potential hotspot handling/correction mechanism to address an emerging hotspot. The latter value depends on the chosen mechanism that will handle the prediction result. The prediction result of the ANN must be delivered to the handling mechanism early enough, before the hotspot is about to be formed. The overall time (t) taken from the moment the routers send their information until the handling mechanism receives the prediction result has to be smaller than the time between the delivery of the result and the formation of the hotspot.

A. Base ANN Hardware Implementation Issues

We employ a scalable strategy, starting from a small ANN designed to monitor a small NoC region, and use that as a base module to construct ANNs that monitor larger NoC regions by hierarchically connecting overlapping ANNs. We label the ANN that monitors the smallest NoC region as the base ANN system. The overlapping base ANN regions can then be processed either by an additional layer of neurons or through a simple voting mechanism. The voting mechanism simply evaluates the decision of each base ANN concerning the overlapping routers and selects the majority vote on whether a router (or a combination of routers) becomes a hotspot or not; the voting can also be OR-based (i.e., at least one positive) or AND-based (all positive). Hardware overheads for larger ANNs increase significantly, along with the latency. This is because the top ANN needs to receive data from each of the smaller ANNs and compute the final output; this adds significant delays in the overall operation of the ANN and increases the hardware requirements exponentially. The base ANN size and architecture were selected based on the associated hardware overheads and the anticipated prediction accuracy. Choosing a large NoC region to serve as the base monitoring architecture results in significant overheads stemming from the larger number of neurons required (at all layers), and adds significant communication overheads in the NoC.
As the monitored region grows, more routers need to transmit information to a single ANN, adding further delays and traffic to the NoC. A larger region would, however, provide more information to an ANN, and perhaps result in better overall accuracy. We experimented with different region sizes, and obtained results in terms of hardware overheads and accuracy. We started with evenly-sized NoC mesh regions as an experimental starting point, as such regions can subsequently grow to form regular mesh and torus NoC topologies. Results shown in Fig. 3 indicate that a 4 × 4 NoC region offers sufficient accuracy at significantly lower overheads (in terms of hardware area and storage requirements) when compared to larger regions, whereas a 3 × 3 region does not provide enough information for the ANN to make accurate predictions. Consequently, we selected a 4 × 4 region as the monitored region for the base ANN.

B. 4 × 4 Base ANN Monitoring Architecture

The proposed ANN hotspot prediction mechanism consists of three perceptron layers which monitor 2 × 2, 3 × 3, and 4 × 4 regions of routers in the NoC, and fully connected hidden and output layers that combine the three perceptron

Fig. 3. Accuracy versus CMOS hardware overheads corresponding to various sizes of ANN monitoring regions in an NoC. These overhead calculations are based on an NoC consisting of three-stage pipelined virtual-channel router architectures with credit-based flow-control, two VCs at each input port and speculative switch arbitration [37] (see also Section VI-D).

Fig. 4. Base ANN monitoring a 4 × 4 mesh NoC, showing the (a) 2 × 2, (b) 3 × 3, and the top (c) 4 × 4 partition and the inputs to the respective input layer neurons.

layers and return the location of a potential hotspot router. The computation starts when the first monitoring packet containing buffer utilization values arrives from any of the monitored routers. In our experimental implementation, data is collected within a 50-cycle time frame window (TFW). The ANN consists of an input layer which partitions the routers being monitored into nine segments of 2 × 2 routers, four segments of 3 × 3 routers, and one segment for the entire 4 × 4 mesh. The segmentations are shown in Fig. 4. The segmentation is done in such a way so as to detect hotspots not as single routers, but as a combination of events affecting routers one and two hops away from the probable hotspot location. The input layer thus consists of nine neurons responsible for monitoring regions of 2 × 2 routers, four neurons responsible for monitoring regions of 3 × 3 routers, and one neuron that monitors the entire 4 × 4 region (shown in Fig. 4). We use one hidden layer (as both input and output data are within a finite space [25]) with ten neurons, a number selected by trial and error, starting from the rule of thumb that the number of hidden neurons should lie between the number of input layer neurons (14) and output layer neurons (16) [25]. During training, we determined that there was negligible accuracy loss in going from 15 hidden layer neurons to 10, and both cases were bound by the maximum error margin that we set. We selected ten hidden layer neurons as this gave us comparable accuracy (less than 1% error relative to the 15-neuron starting case). The last layer consists of 16 spiking neurons, each representing a router within the region. The network receives the average buffer utilization from each router in the region it is responsible for monitoring, processes the information, and returns as output a binary vector


Fig. 5. Three-layer ANN structure. Connections from input layers to hidden layer neurons are shown for only a few neurons, for readability purposes.

containing the location of a potential hotspot router, including cases where two or more routers become hotspots. Incoming packets containing utilization rates from each router arrive through the NoC using the existing infrastructure, one packet per cycle. A single packet may contain all buffer utilization data from a single router, although this depends on the network; in our case, we assume that a single packet delivers the utilization values for one router, and that packets arrive at the ANN at a rate no greater than one packet per cycle. Keeping the base ANN small also keeps the latency of these packets small. It must be noted that if a packet does not arrive within the fixed discrete time interval, the utilization rates for that router are marked with a predefined sentinel value, indicating that the information is not available; the ANN was trained with such values to preserve prediction accuracy in their presence, hence it maintains accurate operation even if some router control packets fail to reach it in time. These packets also carry sequence identification numbers, in case a packet arrives in a different interval than the one in which it was sent. Such a packet is discarded; the correct one is used instead if it arrives within the appropriate interval. Computation can start as soon as the first packet of information arrives, and proceeds in a pipelined fashion, in parallel across all base ANNs. For all first-layer neurons to forward their values to the second layer of neurons (as the system is fully connected), the last packet has to arrive within the fixed time frame. A controller receives incoming packets, depacketizes them, forwards the utilization values in the right order, and delivers these values to the ANN. Each input layer neuron acts as a MAC unit that multiplies the incoming utilization rate from each of the routers it is responsible for by the corresponding weight and accumulates the result.
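The 2 × 2/3 × 3/4 × 4 segmentation just described can be enumerated with a few lines of Python (an illustrative sketch; the function name is ours, not from the paper):

```python
# Enumerate the overlapping sub-regions of a 4x4 mesh that feed the
# 14 input-layer neurons: nine 2x2, four 3x3, and one 4x4 segment.

def input_regions(mesh: int = 4):
    """Return one set of router coordinates per input-layer neuron."""
    regions = []
    for size in (2, 3, 4):                      # 2x2, 3x3, 4x4 windows
        for top in range(mesh - size + 1):
            for left in range(mesh - size + 1):
                regions.append({(top + r, left + c)
                                for r in range(size)
                                for c in range(size)})
    return regions
```

For a 4 × 4 mesh this yields nine 2 × 2, four 3 × 3, and one 4 × 4 region, i.e., the 14 input-layer neurons.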
When the results are accumulated, they are passed through the activation function look-up table (LUT) (hyperbolic tangent), and propagated to a fully-connected hidden layer of ten neurons. The hidden layer neurons similarly process the 14 inputs from the input layer neurons, and propagate their results to 16 spiking neurons (which activate, i.e., output a "1," only when their accumulated sum exceeds a predefined threshold); these return a 16-bit vector indicating whether a hotspot is predicted and its location in the 4 × 4 NoC mesh. The overall ANN layered structure is shown in Fig. 5. The ANN architecture (Fig. 6) was designed with emphasis on hardware reuse, and with consideration of the fact that results must be computed before the next timing interval at which the NoC retransmits new utilization data. The primary hardware component of the ANN is a 5-bit × 5-bit MAC

Fig. 6. Base ANN engine in hardware. It consists of 16 MAC units (only three shown for readability purposes), a memory which holds the ANN weights, registers used to hold partial neuron sums, and the activation function LUT. The entire process is controlled through an FSM control unit.

unit (5 bits for the weight, 5 bits for the utilization value). Each router transmits a status packet which contains four 5-bit utilization values. The first layer receives a total of 64 inputs (16 routers, four ports per router), but at any given time each input is computed at most three times. Assuming that during each cycle the ANN receives four utilization values (i.e., one packet per cycle from each router), there will be at most 12 parallel MAC operations (assuming each incoming value is part of all three regions). While the first layer computes, the second layer can compute in parallel, hence we can have at most ten MAC operations in parallel. Last, the output neurons can also compute in parallel, hence we need 16 MAC operations in parallel. Given the relatively small size of each MAC, we designed the network with 16 MAC units to boost parallel computation, reusing them on demand during each layer of computation. The neuron weights are stored in a RAM with an 80-bit bus, transmitting at most 16 5-bit weight values in parallel for use during the multiplication stage. An FSM control unit synchronizes the entire computation: input values from each router arrive and are directed toward the appropriate MAC unit, depending on the neuron to which they belong. As explained, buffer utilization data for each router arrive as individual packets; if we also assume one packet per cycle, then the ANN receives four buffer utilization rates for a single router in each cycle. However, given the overlap of the 2 × 2, 3 × 3, and 4 × 4 regions, each value is used at least three times, and is therefore directed to each corresponding MAC unit. The MAC units are also interconnected to a set of accumulation registers, where each register holds the value of a neuron necessary for the computation.
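Functionally, one evaluation of the three-layer network can be sketched in floating point as a stand-in for the fixed-point datapath described above (all names and the weight layout are illustrative assumptions, not the authors' implementation):

```python
import math

def forward(util, w_in, w_hid, w_out, theta):
    """One ANN evaluation. util maps (x, y) -> mean buffer utilization.

    w_in  : list of (region, per-router weights) pairs, one per input neuron
    w_hid : 10 x 14 hidden-layer weight matrix
    w_out : 16 x 10 output-layer weight matrix
    theta : spiking threshold of the 16 output neurons
    Returns a 16-element 0/1 vector: 1 = hotspot predicted at that router.
    """
    # Input layer: one MAC per neuron over its region, tanh activation.
    x = [math.tanh(sum(w * util[r] for r, w in zip(region, weights)))
         for region, weights in w_in]
    # Hidden layer: fully connected over the 14 input-neuron outputs.
    h = [math.tanh(sum(wi * xi for wi, xi in zip(row, x))) for row in w_hid]
    # Output layer: 16 spiking neurons, one per router in the 4x4 region.
    return [1 if sum(wi * hi for wi, hi in zip(row, h)) > theta else 0
            for row in w_out]
```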
There are 24 12-bit registers (one for each of the 14 input and 10 hidden layer neurons, with extra bits for accumulation overflow) and a single 16-bit register that holds the result of the output neurons. Each 12-bit register holds the MAC value corresponding to the neuron it represents. When


the next input arrives at the ANN and is directed to a neuron already being computed, its partial sum is fed back into a MAC unit using the stored value from the register. This enables data reuse and reduces the overall hardware requirements. When a neuron finishes its computation cycle, its final sum-of-products is used as an input to the LUT, which returns the activation function value. The output of the function is a 5-bit signed fixed-point value in the range −1 to +1. This value is then reused as input to the MAC units for computing the second-layer neuron values, and the process is repeated for all second-layer neurons. The last layer receives the outputs of all ten hidden layer neurons, and the spiking result is encoded as a 16-bit binary vector representing the location(s) of the router(s) with potential hotspot formation(s). The MACs happen in parallel for all regions, so the first layer of neuron MAC operations requires 16 cycles. Next, four additional cycles are required to finalize the computation of the first-layer neurons, because each completed neuron output has to pass through the activation function LUT; this can be reduced by parallelizing access to the LUT (a four-port LUT). Once the activation is complete, the second layer is computed; there are a total of 14 MAC operations per neuron, hence the results for the spiking neurons will be available after 18 cycles (14 MAC operations and four LUT accesses). The output neurons then perform 16 parallel MAC operations, needing an additional 11 cycles (ten for all MACs, and one for the comparison to the predetermined threshold). In total, the base ANN requires 49 cycles from the time the first router utilization data is received until its output is produced, meeting the 50-cycle target delay and readying it to start the computation for the next TFW. The training data consists of 5-bit weight values, biased to yield positive integers for ease of hardware implementation.
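A fixed-point tanh activation LUT of the kind the text describes can be sketched as follows; the ±8 saturation cutoff and the 5-bit signed output follow the text, while the input step size is our assumption:

```python
import math

FRAC_BITS = 4    # 5-bit signed output: sign + 4 fractional bits, step 1/16
SAT = 8          # per the text: |x| >= 8 bypasses the LUT entirely
STEP = 0.25      # LUT input granularity (an assumption, not from the paper)

# Entries are needed only for x in [0, SAT); odd symmetry covers negatives.
LUT = [round(math.tanh(i * STEP) * (1 << FRAC_BITS))
       for i in range(int(SAT / STEP))]

def activate(x: float) -> int:
    """Fixed-point tanh via LUT; outputs lie in [-16, 16], i.e., [-1, +1]."""
    sign = -1 if x < 0 else 1
    mag = abs(x)
    if mag >= SAT:
        return sign * (1 << FRAC_BITS)    # saturate to +/-1
    return sign * LUT[int(mag / STEP)]
```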
During the training stage (Section IV), we varied the bit-width precision of the weights in intervals of 1 bit; 4-bit weight values yielded high error rates (>10%), whereas for 5-bit and 6-bit precision the error rates were 7.0% and 6.4%, respectively. We therefore chose 5 bits to store the weight data. We used the hyperbolic tangent as the activation function of the network, as its advantage lies in its function properties: it is an odd function with mirror symmetry, tanh(−x) = −tanh(x), and it satisfies the conjugate symmetry tanh(z*) = (tanh z)*. The function is also asymptotic, and for values above or below a certain threshold its output can be considered to be 1 (or −1). Hence it is easy to implement as a LUT, requiring entries only for nonnegative inputs ranging from zero up to the threshold value that yields 1 (or −1). In our case, any neuron value greater than +8 or less than −8 was not fed through the LUT; instead, the neuron output was set directly to 1 (or −1, respectively). All neurons were loaded with the neuron threshold value as the initial accumulation value used in the computation. The threshold value is typically subtracted at the end of the MAC operation, prior to the activation function; however, by setting the initial partial sum to the threshold value (with the appropriate sign), we reduce the overall latency by at least one cycle. In addition to the hardware implementation advantages that the hyperbolic tangent function presents, it is also the


Fig. 7. Demonstration of ANN scalability using the base 4 × 4 region partitioning already presented in Section III-A. The 3 × 3 ANN is shown in the top left. A 5 × 5 mesh (top middle) can be monitored by a 4 × 4 ANN and seven extra neurons monitoring the seven 2 × 2 partitions. A 6 × 6 mesh can be monitored by the base 4 × 4 ANN and five ANNs monitoring the five 3 × 3 partitions (top right). A 7 × 7 mesh (bottom left) can be monitored by four overlapping 4 × 4 ANNs, using an OR-voting mechanism to determine from the combined outputs whether the overlapping routers are hotspots or not. An 8 × 8 mesh can be monitored using four independent 4 × 4 ANNs and a shared voting mechanism for the boundary routers between the four regions (shaded in the bottom right).

preferred function for binary classification tasks such as the one employed in this paper.

C. System Scalability

The ANN presented in this section is directed toward a 2-D 4 × 4 mesh architecture, and was trained and designed for such a region. Larger networks, however, can be monitored by adding more base ANN monitoring regions and any necessary additional neurons. Thus, a strategy that builds on the base architecture is followed, and, given the 4 × 4 region that covers the base case, larger monitoring regions can be formed. The condition that needs to be imposed is that at least one router must be overlapped by two regions, so that its information is passed to two or more ANNs and a joint decision is made. Fig. 7 shows how several combinations of regions can be used to cover mesh NoCs of sizes between 4 × 4 and 8 × 8; e.g., a 4 × 4 region as described in Section III-B can be used along with seven 2 × 2 regions covered by seven extra input neurons, as shown in the top left part of Fig. 7. Note that the extra seven neurons can form an independent, much smaller ANN with its own hidden layer and output neurons; in that case, a simple voting mechanism (OR-based voting) can be used for the neurons that overlap between the 4 × 4 ANN and the smaller ANN. In larger networks, such as the 8 × 8 mesh, four independent ANNs can be used; in this case, since there is no overlap between the four regions, a top-level neuron with a 28-bit input (one bit per boundary router) that processes the outcomes of the four smaller ANNs regarding the boundary routers (see Fig. 7, bottom right) can be used to make a weighted decision. This is what we followed in our 8 × 8 experimental platform as well.
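The OR-based, AND-based, and majority voting options for routers covered by more than one base ANN can be sketched as follows (illustrative code, not the authors' implementation):

```python
def combine(votes, mode: str = "or"):
    """Combine per-ANN hotspot decisions for one overlapped router.

    votes : iterable of 0/1 outputs, one per base ANN covering the router
    mode  : "or"  -> flag if at least one covering ANN predicts a hotspot
            "and" -> flag only if every covering ANN agrees
            anything else -> simple majority vote
    """
    votes = list(votes)
    if mode == "or":
        return int(any(votes))
    if mode == "and":
        return int(all(votes))
    return int(sum(votes) * 2 > len(votes))   # strict majority
```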


There are several combinations of regions that can be formed in virtually any type of NoC topology, provided that the hotspot is treated as a combination of events concerning routers at least two hops around it. Optimally segmenting NoC topologies into ANN regions is extremely complicated, and perhaps NP-complete [25], and hence beyond the scope of this paper; we simply propose a partitioning scheme that works relatively efficiently. Network topologies such as regular symmetric meshes and symmetric tori can easily be partitioned using the proposed technique, since the base ANN is also symmetric. The challenge is to formulate the cost of each partitioning choice. This cost spans all metrics: power and hardware area overheads, performance, and accuracy. Area and power overheads depend on the number of extra neurons required as the number of base ANNs increases, and on the hardware used to relay the base ANN information to the extra neurons (or voting mechanisms). Performance is impacted mostly by the added latency of the extra layer(s) and/or voting mechanisms. Finally, in terms of accuracy, an extra layer of neurons keeps the accuracy very high, whereas a voting mechanism is faster and less power hungry but reduces the overall system accuracy. Depending on the hardware budget, performance requirements, topology, and complexity, the partitioning scheme and the ANN monitoring region can therefore be chosen on a case-by-case basis with all tradeoffs explored.

IV. Hotspot Modeling and ANN Training

To both train our ANN-based hotspot prediction mechanism and test its effectiveness in forecasting hotspots, we devise a hotspot model with well-defined spatially located hotspot formations of relatively short temporal duration, centered on randomly chosen routers in an 8 × 8 mesh network.
Using this model, we synthesize a number of traces that exhibit a range of network throughput demands upon the NoC, to be used for the two aforementioned purposes. The gist is to compare uneven (hotspot) traffic flows against a base uniform random traffic (URT) NoC pattern. Under URT, all router nodes in an NoC have an equal probability of sending a packet to and receiving a packet from any other node per unit time; thus the traffic is almost perfectly balanced across the NoC topology, and the attainable network throughput is the maximum that can be achieved both practically and theoretically [11]. Our hotspot model uses the URT model as a base for most of the time, except during short, prespecified, periodically occurring time intervals in which just two arbitrarily selected nodes each receive, with equal probability, all the network traffic from all remaining sender nodes. Specifically, we define TFWs of 10 000 cycles, during any of which the two hotspots can occur simultaneously for a short duration of 200 cycles. These values were set empirically. No other hotspots can occur within a TFW, and for the rest of the TFW the traffic behaves purely as URT. The injection rate during both the URT phases and the hotspot phases is steady, set at a predefined value. This model stress-tests our prediction mechanism, as hotspots occur both

TABLE I
Synthetic Hotspot Traffic Results

Normalized   Simulation Experiment Completion (From Start)   % False
Throughput     20%     40%     60%     80%    100%           Positives
0.22*          0.96    0.97    0.97    0.98    0.98              4
0.33           0.84    0.87    0.88    0.88    0.88              8
0.44           0.81    0.84    0.86    0.86    0.87              5
0.56           0.85    0.88    0.89    0.89    0.91              6
0.67*          0.98    0.98    0.98    0.98    0.99              5
0.78           0.86    0.88    0.88    0.89    0.91              6
0.89           0.85    0.86    0.88    0.90    0.92              7
0.98*          0.95    0.96    0.96    0.96    0.96              4

Prediction accuracy results as a function of simulation time at various levels of normalized network throughput. The asterisk symbol "*" indicates ANN training data.

infrequently and for short durations. To achieve this, the routing algorithm used in all of our experiments (see Sections V and VI) is DOR-XY routing; it was specifically chosen so as to eliminate the load-balancing behavior exhibited by adaptive routing protocols. Despite the short duration of hotspots and their relative infrequency in our synthetic hotspot traffic model, Fig. 1 (refer to the four curves denoted as "XY") shows the large negative impact of hotspots in an NoC versus URT: latency under hotspot formations increases at a faster pace, and the NoC saturates well below its packet transport rate potential, with certain cases showing more than 40% degradation in the NoC's sustainable effective throughput. To train the network, we used part of our suite of synthetic traffic traces, consisting of buffer utilization data collected over half a million cycles. We targeted three different traffic scenarios (moderate, average, and high hotspot temporal intensity), expressed in terms of the NoC's normalized saturation throughput (defined as the throughput at which the latency value is three times the zero-load latency of the network for hotspot traffic) at 0.22, 0.67, and 0.98 (asterisks in Table I), through the duration of the simulation. This constant NoC throughput rate was measured in terms of the number of flits injected per cycle in the entire network. Moreover, as mentioned earlier, a hotspot will likely become nonlocalized, as it may span a region of routers; we included such spatial hotspot scenarios in our training data. Using the synthetic traces, we collected the utilization rates of all buffers in all routers, measuring the average utilization rate of each router during 50-cycle intervals, which we input as training data to the ANN.
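The synthetic model of Section IV (URT as the base, with two hotspot nodes absorbing all traffic for a 200-cycle burst inside each 10 000-cycle TFW) can be sketched as a destination-selection routine; function and parameter names are ours:

```python
import random

TFW = 10_000   # time frame window of the traffic model (cycles)
BURST = 200    # hotspot duration within each TFW (cycles)

def pick_destination(cycle, nodes, hotspots, burst_start):
    """Choose a destination for a packet injected at `cycle`.

    nodes       : all router ids in the mesh
    hotspots    : the two nodes arbitrarily selected for this TFW
    burst_start : cycle offset of the hotspot burst inside the TFW
    """
    offset = cycle % TFW
    if burst_start <= offset < burst_start + BURST:
        # Hotspot phase: the two hotspot nodes receive all traffic,
        # each with equal probability.
        return random.choice(hotspots)
    # Otherwise the traffic behaves as plain URT.
    return random.choice(nodes)
```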
The throughput values represent a full spectrum of traffic variability, consisting of hotspots of low, medium, and high intensity, formed under low, medium, and heavy network traffic. The training process can be enhanced by integrating application-specific traffic models, thus targeting selected scenarios and directing the training of the ANN toward certain anticipated system behavior. Furthermore, training can be done at any stage of the multicore chip's operation, since training weights are loaded onto on-chip RAM memory modules during the system boot phase. As such, the ANN can be retrained even after the system is deployed, using new application traffic behavioral characteristics that can be collected and modeled for each specific chip. This enables software-based reconfiguration of the ANN, tailoring its behavior to each specific multicore chip based on that chip's operation.
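The training inputs described above, per-router average buffer utilization over 50-cycle windows, can be produced with a simple sketch (names are illustrative):

```python
WINDOW = 50  # the 50-cycle statistics-collection TFW from Section III-B

def window_averages(samples):
    """Average per-cycle utilization samples over consecutive 50-cycle
    windows; one value per complete window feeds the ANN as input.
    Trailing samples that do not fill a window are dropped."""
    return [sum(samples[i:i + WINDOW]) / WINDOW
            for i in range(0, len(samples) - WINDOW + 1, WINDOW)]
```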


We used the MATLAB ANN toolbox to train the ANN, using supervised learning (back-propagation) [19], [20]. The acceptable error rate was set to less than 10%, given the large data size, the considerations of training memory for the ANN weights, and the large range variations and sparsity of the training set data. Error rates of less than 5% could not be obtained in realistic training time, so we consider them impractical for the purposes of the presented network; however, optimization of the training set could probably improve the accuracy of the network and enable training toward an acceptable error rate below 5% [47]. As the output of the prediction mechanism essentially represents the probability that a hotspot will be formed at location (x, y) in a 2-D NoC, the acceptable error rate represents the probability that a predicted hotspot is not actually created; hence, if the ANN is trained to a resulting accuracy of, e.g., 85%, then when the ANN predicts a hotspot at location (x, y), there is an 85% probability of a hotspot actually occurring there.

V. Experimental Platform

A. Experimental ANN Architecture and Training

The ANN was modeled and trained using the MATLAB ANN toolbox, and then tested using a number of traffic traces to evaluate its hotspot prediction accuracy. During the training stage, synthetic traffic utilization data were used, as explained in Section IV. We experimented with an 8 × 8 2-D mesh network utilizing four ANN prediction engines, each monitoring a nonoverlapping 4 × 4 mesh subnetwork, and a top-level neuron which received the outputs for the boundary routers of each of the four 4 × 4 regions to further evaluate the hotspot prediction(s).
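One plausible reading of the top-level weighted evaluation of boundary-router outputs is sketched below; the boost weight and threshold are our assumptions, not values from the paper:

```python
def boundary_vote(flag, neighbor_flags, boost: float = 0.25,
                  threshold: float = 0.5):
    """Weighted OR-style vote for one boundary router (illustrative).

    flag           : 0/1 prediction from the router's own base ANN
    neighbor_flags : 0/1 predictions for its neighboring routers
    A router already flagged passes; an unflagged router may still be
    flagged when enough of its neighbors are predicted hotspots.
    """
    score = flag + boost * sum(neighbor_flags)
    return int(score >= threshold)
```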
Assuming that the prediction results from each of the 4 × 4 regions are propagated in a small number of cycles (since each ANN outputs a single bit corresponding to each router, this can actually be done with a single dedicated wire, requiring virtually no delay), the output concerning each of the boundary routers can be made available in a few extra cycles. The top-level ANN simply uses a weighted OR-based voting algorithm, where center routers are assigned a greater probability of being a hotspot if their neighboring routers are also predicted as hotspots. Having trained the ANN using synthetic traffic with predefined hotspots, we then used the ANN predictor on: 1) a variety of synthetic traffic traces (Section IV) running on an 8 × 8 2-D mesh network, and 2) traffic traces gathered from applications run on two NoC-based processors, the Raw [45] and the TRIPS [17], [38] multicore chips. This experimental process facilitated the evaluation of the ANN prediction accuracy under both synthetic and real NoC application scenarios. The Raw and TRIPS multicore applications exhibit highly spatial and time-varying behavior (irregular injection rates), with periods of low activity interchanged with periods of localized high activity (hotspots), especially in the case of the TRIPS applications, rendering them suitable for testing our proposed predictor. The predictions of the ANN were then compared to the corresponding actual hotspots formed under each of these benchmarks. The accuracy of the ANN was progressively measured during constant simulation intervals, to observe the


behavior of the network as the amount of data increased. Overall, these simulation experiments ensured the exposure of the ANN predictor to a wide range of applications of different behaviors and variable packet injection rates across each PE, especially when the TRIPS and Raw multicore chip applications, which exhibit all kinds of traffic variability, were in use. Experimental results gathered when executing these scenarios drew a detailed picture of the accuracy of the ANN predictor, both across elapsed NoC operational time and in terms of NoC application load variability.

B. Simulation Setup

In order to obtain the buffer utilization data to train and evaluate the neural network, we implemented a detailed cycle-accurate simulator that supports k-ary 2-mesh topologies with four-stage pipelined routers, each with two VCs. The pipeline stages are as follows: 1) routing computation, to compute the next-hop route for in-transit packets; 2) VC arbitration, to allocate VCs at the downstream router; 3) switch arbitration, to allocate per-flit switch bandwidth at the crossbar to traversing flits; and 4) switch (crossbar) traversal, followed by physical link traversal (taken to be one cycle in duration but not considered part of the router pipeline). Packets are composed of five 32-bit flits, with each flit transported in one link cycle. Each router has four incoming and four outgoing unidirectional channels. Under the experiments carried out using synthetic traffic, we evaluate the effectiveness of our ANN-based predictor based on the number of preset hotspots (see Section IV) that were actually predicted at least 50 cycles prior to their occurrence (see Sections III-B and VI-A).
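The simulator parameters above can be captured in a small configuration sketch (field names are ours):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RouterConfig:
    """Simulator parameters from Section V-B (names illustrative)."""
    pipeline_stages: int = 4      # RC, VC arbitration, SA, crossbar traversal
    link_cycles: int = 1          # link traversal, outside the pipeline
    virtual_channels: int = 2
    flits_per_packet: int = 5
    flit_bits: int = 32
    ports: int = 4                # four in + four out unidirectional channels

CFG = RouterConfig()
```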
However, in the case of the Raw [45] and TRIPS [38] benchmark suite traces (see Sections VI-B and VI-C, respectively), these direct comparisons cannot be made, as the hotspots are obviously not intentionally preset (as in the case of the synthetic NoC traces); they depend solely on the behavior of the application. Thus, the first step in determining the existence of hotspots in these benchmark sets is to quantify what constitutes a hotspot. To do this, we used the following heuristic, based on empirical experience and traffic trace profiling: using our simulation infrastructure, we measured the buffer utilization at every input port of each router in the NoC individually, over a moving window (MW) of 300 cycles, for the entirety of each Raw and TRIPS benchmark, and kept all the statistical data in a database. This MW duration is just long enough to capture traffic "spikes," avoiding the "smoothing" effects of longer statistics-collection periods, while short enough to test the effectiveness of our predictor in filtering out and recognizing short-term hotspots from the traffic "jitter." The router's (x, y) coordinate and the starting cycle of measurement for each corresponding buffer utilization data point were also recorded. We then ranked the buffer utilization data in decreasing order, and consider the top 0.1% of measurements to be hotspot occurrence indicators, due to their high-ranking network resource (buffer) demands. We note that a single hotspot does not necessarily span exactly 300 cycles; it can span several MWs, each of 300-cycle duration (and several hotspots can coexist, each at a different router coordinate);


e.g., a hotspot with a duration of 500 cycles will span 201 consecutive buffer utilization MWs. Thus, the aforementioned 0.1% figure of highest buffer utilizations does not signify that an NoC hotspot(s) exists for exactly 0.1% of the application runtime, or that 0.1% of the aggregate NoC traffic is hotspot-bound. During the ANN-based predictor experiments for each benchmark trace of the Raw and TRIPS multicore chips (see Sections VI-B and VI-C, respectively), we compare the ANN-based predictor results, which likewise report the cycle of each hotspot occurrence (a 50-cycle advance prediction is considered successful, see Section III-B) and its router spatial location [(x, y) coordinate] in the NoC, against the original simulator-created database, so as to determine the accuracy of our ANN-based predictor. Timely, missed, and falsely reported hotspot predictions are each accounted for.
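The 300-cycle moving-window ranking heuristic can be sketched as follows (a simplified illustration; the real infrastructure streams measurements into a database rather than holding them in memory):

```python
MW = 300  # moving-window length in cycles

def hotspot_windows(util_by_port, top_frac: float = 0.001):
    """Label hotspot-indicator windows per the paper's heuristic (a sketch).

    util_by_port : dict (router_xy, port) -> list of per-cycle utilizations
    Returns the set of (router_xy, port, start_cycle) records whose
    moving-window average utilization ranks in the top 0.1%.
    """
    records = []
    for (xy, port), series in util_by_port.items():
        for start in range(len(series) - MW + 1):
            win = sum(series[start:start + MW]) / MW
            records.append((win, xy, port, start))
    records.sort(reverse=True)
    keep = max(1, int(len(records) * top_frac))
    return {(xy, port, start) for _, xy, port, start in records[:keep]}
```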

TABLE II
Raw Application Traces Hotspot Prediction Results

             Simulation Experiment Completion       % of Hotspots
             (From Start)                           Predicted 50
Benchmark      20%     40%     60%     80%    100%  Cycles Ahead
8b-encode      0.82    0.82    0.80    0.80    0.81      64
802.11a        0.88    0.89    0.88    0.88    0.89      69
adpcmRAW       0.84    0.86    0.85    0.86    0.85      82
FFT            0.76    0.78    0.78    0.80    0.80      80
MPEG2          0.91    0.90    0.91    0.91    0.92      72
streams        0.95    0.94    0.95    0.95    0.95      84
VPR            0.92    0.90    0.91    0.91    0.90      68

Prediction accuracy results for the Raw benchmarks as a function of simulation time, and percentage of the correctly identified hotspots with at least 50 cycles of advance prediction.

VI. Results and Discussion

A. Synthetic Traffic Simulation Results

We used various hotspot spatiotemporal intensities as inputs to the ANN, each expressed as a function of the normalized NoC saturation throughput, and compared the output of the ANN with the actual behavior of each trace, at discrete intervals of 20% increments during the simulation process. A hotspot prediction is considered successful if it foresees a hotspot at least 50 cycles ahead of its occurrence with the correct (x, y) locality. Table I shows the accuracy of the ANN for each of the synthetic traffic models. Each traffic model exhibits different hotspot behavior; Table I also shows the three training traces used [normalized throughputs marked with asterisks (*)], which were also utilized as evaluation benchmarks to validate the operation of the ANN. The accuracy values obtained range from 88% to 92% (excluding the training data), which is relatively good given that the error margin used during training was bound to 10% (see Section IV). The ANN also produced a number of false positive predictions (i.e., predicted hotspots which never occurred in the actual trace); these are shown in Table I. The accuracy of the ANN predictor can potentially be improved by expanding the ANN training set and by using more bits to represent the weights. Both are left as future work.

B. Raw CMP Applications

The Raw CMP [45] is a scalable 32-bit fabric for general-purpose and embedded computing, consisting of 16 identical tiles, each interconnected to its closest neighbors in a 4 × 4 mesh array. Every tile contains its own pipelined RISC processor, computational resources, memory, and programmable routers. The Raw architecture contains two overlapped dynamic networks that support unpredictable intercommunication requirements among the tiles (e.g., cache misses) and two static networks that implement software-directed routing among tiles, with ordered and flow-controlled transfer of operands and data streams between functional units. The switching of the routers in the two static networks is preprogrammed by the compiler, collectively configuring the entire network on a cycle-by-cycle basis to enable supervised flow of data. The 4 × 4 size of the Raw mesh is ideal for evaluating our base ANN architecture.

We used seven traces extracted from Raw's static network, through binaries compiled by the Raw compiler on the Raw cycle-accurate simulator. The traces accurately match the hardware timing. Though the two static networks that Raw incorporates are preprogrammed with a priori orchestrated behavior at application compile-time, this does not mean that routed traffic cannot at times be dense, since Raw's static networks were designed with no practical architectural limit on the number of simultaneous communication patterns, and are used to route stream-based memory traffic [45]. (In our experiments we did notice, however, that hotspots in the Raw fabric were less intense and less frequent than in the TRIPS CMP's OCN network [17]; for details see Section VI-C.) We used the same methodology as with the synthetic benchmarks, and obtained the ANN prediction accuracy for each of the seven Raw benchmark traces. Table II shows the accuracy rates obtained during discrete simulation intervals, until the simulation was completed at half a million cycles. Table II also shows the percentage of the identified hotspots that were detected at least 50 cycles in advance of their occurrence. It is evident that the ANN remains accurate throughout the simulation, even though the prediction accuracy for the fast Fourier transform (FFT) benchmark is relatively low. The ANN does exceptionally well with three benchmark traces (MPEG2, streams, and VPR). The streaming nature of the MPEG2 and streams applications yields relatively repetitive patterns; therefore, if the ANN is properly trained, streaming applications could largely benefit from it.

C. TRIPS Applications and Accuracy Discussion

The TRIPS [38] prototype CMP consists of two large, tiled, distributed processing cores. Each core contains an execution array of ALU tiles, distributed register file tiles, and partitioned local L1 cache tiles interconnected via a 5 × 5 mesh network. The two cores are interconnected by a second, 4 × 10 2-D mesh network to a shared, distributed static-NUCA L2 cache. The main, processor-core, 5 × 5 wormhole-routed NoC is the operand mesh-connected network, abbreviated as OPN. The two multitile processor cores communicate through the on-chip secondary cache system using an embedded 4 × 10 wormhole-routed mesh network with four VCs per port and DOR-YX routing, abbreviated as the on-chip network (OCN) [17]. The OCN is optimized for cache-line-sized transfers, with support for other memory-related operations, acting as an intercore fabric for the two processors, the L2 cache, DRAM, I/O, and DMA traffic [38].

TABLE III
TRIPS Application Traces Hotspot Prediction Results

Benchmark   | Simulation Experiment Completion (From Start) | % of False | % of Hotspots Predicted
            |  20%    40%    60%    80%   100%              | Positives  | 50 Cycles Ahead
ammp        | 0.78   0.80   0.81   0.85   0.90              |     8      |       82
applu       | 0.75   0.81   0.82   0.83   0.84              |     5      |       80
apsi        | 0.65   0.68   0.69   0.71   0.71              |     7      |       65
art         | 0.67   0.72   0.75   0.77   0.78              |     5      |       78
bzip2       | 0.78   0.80   0.82   0.82   0.83              |     6      |       80
crafty      | 0.77   0.79   0.80   0.84   0.85              |     4      |       81
equake      | 0.85   0.85   0.86   0.88   0.89              |     7      |       82
gap         | 0.74   0.76   0.77   0.77   0.78              |     7      |       69
gzip        | 0.79   0.81   0.82   0.82   0.83              |     8      |       80
mcf         | 0.85   0.86   0.86   0.88   0.88              |     9      |       84
mesa        | 0.71   0.73   0.74   0.74   0.75              |     6      |       70
mgrid       | 0.68   0.72   0.74   0.75   0.76              |     5      |       69
parser      | 0.78   0.78   0.79   0.80   0.81              |     8      |       74
sixtrack    | 0.71   0.74   0.75   0.80   0.82              |     4      |       77
swim        | 0.70   0.74   0.77   0.81   0.83              |     3      |       75
twolf       | 0.75   0.80   0.81   0.82   0.82              |     8      |       80
vortex      | 0.67   0.70   0.71   0.75   0.75              |     7      |       68
VPR         | 0.85   0.86   0.88   0.91   0.92              |     9      |       88
wupwise     | 0.83   0.85   0.85   0.89   0.90              |     8      |       86

Prediction accuracy results for the TRIPS SPEC CPU2000 benchmarks as a function of simulation time, the percentage of false-positive hotspot predictions, and the percentage of correctly identified hotspots predicted at least 50 cycles in advance.
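The scoring rule behind the "50 cycles ahead" columns can be sketched as follows; this is our own minimal reconstruction, not the paper's implementation — the event-tuple representation, the function name, and the 500-cycle upper matching window are assumptions. A prediction counts only if it names the correct (x, y) router at least 50 cycles before the hotspot actually forms; unmatched predictions are tallied as false positives.

```python
# Hypothetical sketch of the scoring rule: a prediction is a "hit" only if
# it names the correct (x, y) router at least LEAD_CYCLES before the hotspot
# forms. The `window` upper bound is our assumption, not from the paper.

LEAD_CYCLES = 50  # minimum advance notice required by the evaluation

def score_predictions(predicted, actual, lead=LEAD_CYCLES, window=500):
    """predicted/actual: lists of (cycle, x, y) hotspot events.
    A predicted event matches an actual one if the coordinates agree and the
    prediction is issued between `lead` and `window` cycles beforehand."""
    matched = set()
    hits = 0
    for (pc, px, py) in predicted:
        for i, (ac, ax, ay) in enumerate(actual):
            if i in matched:
                continue
            if (px, py) == (ax, ay) and lead <= ac - pc <= window:
                matched.add(i)
                hits += 1
                break
    false_positives = len(predicted) - hits
    accuracy = hits / len(actual) if actual else 1.0
    return accuracy, false_positives

actual = [(1000, 2, 3), (1400, 0, 1), (2000, 2, 3)]
predicted = [(940, 2, 3), (1340, 0, 1), (1700, 3, 3)]  # last never occurs
acc, fp = score_predictions(predicted, actual)
print(round(acc, 2), fp)  # 2 of 3 hotspots caught; 1 false positive
```

Under this rule an unpredicted hotspot (a miss) and a false positive are penalized independently, mirroring how Table III reports accuracy and false-positive percentages as separate columns.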

To determine the predictive accuracy of our ANN-based hotspot predictor, we used 19 benchmarks from the SPEC CPU2000 suite [43], gathered from the cycle-accurate TRIPS processor simulator to generate OCN requests. These 19 benchmarks were compiled using the Scale compiler [42], appropriately modified for TRIPS. There are two memory-system transaction types, writes and reads, each consisting of request and reply packets. Write transactions from the processor to the L2 bank consist of a five-flit request packet containing the evicted L1 dirty cache line, answered with a one-flit acknowledgement packet to the processor, while read transactions consist of a single-flit read request packet from the processor, replied to by the L2 bank with a five-flit packet containing the requested cache line. The TRIPS 4 × 10 mesh OCN was covered using three base ANNs; an OR-based voting engine combined the outputs of the ANNs covering the overlapping routers. Table III shows the results for each of the 19 applications, as well as the percentage of the correctly identified hotspots predicted at least 50 cycles ahead of their occurrence. As in the case of Raw, the ANN maintains relatively high accuracy for the VPR trace. The overall accuracy of our ANN-based predictor ranges from 71% to 92%, and the percentage of correct hotspot predictions made at least 50 cycles ahead of their occurrence also demonstrates high accuracy, ranging from 65% to 88%, slightly less than the just-in-time prediction rates for each corresponding benchmark. Benchmarks with a 50-cycle advance prediction accuracy at or above 80% are ammp, applu, bzip2, crafty, equake, gzip, mcf, twolf, VPR, and wupwise; these exhibit relatively high burstiness or traffic-pattern unevenness, rendering them good candidates for our ANN-based hotspot predictor.
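The OR-based fusion step over the 4 × 10 OCN can be sketched as below. The exact tiling of the three overlapping 4 × 4 windows (columns 0–3, 3–6, and 6–9) is our assumption; the paper states only that three base ANNs cover the mesh and that predictions for routers covered by more than one ANN are OR-ed together.

```python
# Sketch of OR-based voting over three overlapping ANN regions on the
# 4 x 10 TRIPS OCN. The column spans below are assumed, not from the paper.

ROWS, COLS = 4, 10
REGIONS = [range(0, 4), range(3, 7), range(6, 10)]  # column span per ANN

def fuse_predictions(per_ann):
    """per_ann[k] maps (row, col) -> bool for routers inside region k.
    A router covered by several ANNs is flagged hot if ANY ANN flags it."""
    fused = {(r, c): False for r in range(ROWS) for c in range(COLS)}
    for region, preds in zip(REGIONS, per_ann):
        for (r, c), hot in preds.items():
            assert c in region            # each ANN reports only its own span
            fused[(r, c)] = fused[(r, c)] or hot
    return fused

# Toy outputs: ANN 0 and ANN 1 disagree about the shared router (2, 3);
# OR-voting keeps the positive vote.
ann0 = {(2, 3): True}
ann1 = {(2, 3): False, (1, 5): True}
ann2 = {(0, 8): True}
fused = fuse_predictions([ann0, ann1, ann2])
print(fused[(2, 3)], fused[(1, 5)], fused[(0, 8)])  # True True True
```

OR-voting errs toward flagging: a disagreement over a shared router yields a prediction, which trades a few extra false positives for fewer missed hotspots at the region boundaries.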
The effect of the system's accuracy lies in the consequences of a misprediction, whether it is a hotspot that was not predicted or a hotspot that was predicted but did not in fact occur. The consequences of each case vary. A missed hotspot will obviously reduce the performance of the NoC, as explained earlier, and will likely induce other hotspots in the multicore chip. If the traffic patterns that cause such induced hotspots (i.e., hotspots formed as a result of hotspots not being predicted and handled) are well defined, the ANN training should encapsulate them and allow the ANN to predict them as well, provided that they form within the time frame of the ANN's operation. In our case, given that a correction mechanism had not yet been developed, the training does encapsulate such occurrences. The experimental platform, however, simply produces the detection outcome; it does not model the traffic resulting from mispredictions. In the case of an event being wrongly predicted as a hotspot (i.e., a false positive), the impact clearly depends on how the eventual hotspot correction mechanism operates; if the mechanism reroutes packets, for example, then the cost of a false positive is the cost of the unnecessary rerouting. As such, we cannot assess how a false positive will impact the system without knowing a priori the effects of the correction mechanism.

D. Hardware Synthesis Results

To determine the hardware overheads of the proposed ANN architecture, we implemented the ANN hardware architecture shown in Figs. 5 and 6 in Verilog and synthesized it using Synopsys Design Vision, targeting a 65 nm commercial CMOS library, in order to obtain indicative hardware results. The target frequency was 500 MHz at a 1 V supply voltage. Synthesis results indicate an estimated 40 000 gates for an ANN controlling a 4 × 4 NoC region (including the hardware overheads in each router for collecting and transmitting utilization values).
Next, we investigated the impact of the overall ANN engine on the NoC topologies with which we experimented. To determine the hardware overhead of the ANN controlling a 4 × 4 NoC region, we also synthesized three-stage pipelined virtual-channel routers comprising: 1) routing; 2) VC arbitration and speculative switch allocation (performed in parallel); and 3) switch traversal. Link traversal lasts one cycle but is not part of the router pipeline. The speculative switch allocation prioritizes non-speculative requests over speculative ones, while prioritized matrix arbitration, look-ahead routing, and credit-based WFC [37] are employed. Using these router architectures, we synthesized routers with two, three, and four VCs, and for each router configuration we obtained the estimated hardware requirements of the NoCs used to support the Raw multiprocessor (4 × 4 mesh), TRIPS's OCN (4 × 10 mesh), and the 8 × 8 mesh topology used in our synthetic experiments. Table IV depicts the overheads for each NoC configuration; the ANN is a very small part of the on-chip interconnect infrastructure, especially as the routers grow in size to support more routing and QoS protocols. The ANN-based predictor can therefore be easily integrated as a PE in a typical NoC architecture and, using the scalability options discussed in Section III-C, can cover arbitrary NoC sizes while maintaining a low hardware overhead ratio.
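A quick consistency check of the overhead ratios in Table IV can be sketched as follows. Only the 40 000-gate ANN figure and the reported percentages come from the paper; the per-router gate counts are back-solved from the 4 × 4 numbers, and the assumption that an 8 × 8 mesh needs four ANN regions is ours (one 4 × 4 region per quadrant), so this is an arithmetic illustration rather than the authors' methodology.

```python
# Back-of-envelope check of Table IV. ANN_GATES is from the paper; the
# per-router gate counts are back-solved (hypothetical), not measured.

ANN_GATES = 40_000  # one ANN engine covering a 4 x 4 region

def ann_overhead(routers, gates_per_router, regions=1):
    """Percentage overhead of `regions` ANN engines over the router logic."""
    return 100.0 * regions * ANN_GATES / (routers * gates_per_router)

# Gates/router implied by the paper's 4 x 4 overheads for 2, 3, and 4 VCs.
implied = {vcs: ANN_GATES / (16 * pct / 100.0)
           for vcs, pct in [(2, 4.21), (3, 1.99), (4, 1.13)]}

# With those per-router costs, an 8 x 8 mesh (64 routers, assumed 4 ANN
# regions) reproduces the same ratios, since ANNs scale with router count.
for vcs, g in implied.items():
    print(vcs, round(ann_overhead(64, g, regions=4), 2))
```

This also explains why the 4 × 4 and 8 × 8 columns of Table IV coincide: when the number of ANN regions grows proportionally with the router count, the overhead ratio is invariant to mesh size.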


TABLE IV
ANN Hardware Overheads

NoC Size      | Two VCs | Three VCs | Four VCs
4 × 4 Mesh    |  4.21%  |   1.99%   |  1.13%
8 × 8 Mesh    |  4.21%  |   1.99%   |  1.13%
4 × 10 Mesh   |  5.06%  |   2.39%   |  1.36%

Estimated ANN overheads for the three mesh networks used in our experiments. Results are presented with the NoC utilizing a speculative router architecture with two, three, and four VCs [37], as typically used in high-performance NoCs.

Next, we used Synopsys PrimePower to obtain an estimate of the power consumption. Assuming a 50% switching activity probability, the synthesized ANN engine described above consumes an estimated 0.011 mW when computing one hotspot prediction. The reported power consumption does not include the power consumed in transmitting the utilization rates from each router, but it does include the computation of these values at each router. However, the extra packets involved in the ANN computation are very few, given that each router transmits one packet every 50 cycles (in our case; the number of cycles and the time frame over which utilization rates are collected can be adjusted for each specific system, as explained before). Therefore, in our experimental NoC, if we assume that each router routes at least one packet per cycle (i.e., fully utilized routers), the control packets amount to 1/50th of the total number of NoC packets. This, however, is an extreme scenario; system-level simulation that includes cycle-accurate NoC power modeling is required to compute the impact of the power consumption of the ANN-induced communication overheads. Overall, the estimated power requirements of the ANN predictor can be considered negligible when compared to existing real architectures such as Intel's SCC [23]. Both the synthesis and power results demonstrate that the ANN-based predictor is a feasible hardware engine in an NoC.

VII. Conclusion

In this paper, we presented an ANN-based hotspot prediction mechanism that uses buffer utilization data to dynamically monitor the interconnect fabric and reactively predict the location of about-to-be-formed hotspots. The ANN was trained using synthetic traffic models, and evaluated using both real and synthetic application traces.
Results based on two benchmark suites run on two multicore platforms, the Raw and TRIPS multicore processors, showed that a relatively small ANN architecture can predict hotspot formation with accuracy ranging from 65% to 92%. The overall impact of the ANN-based hotspot prediction engine is encouraging; further optimizations should improve the accuracy and reduce both the latency and the overheads of the proposed ANN-based system. Future work includes the exploration of further-optimized ANNs for predicting NoC hotspots more accurately, and the pairing of proactive ANN prediction schemes with reactive hotspot reduction mechanisms. Moreover, we plan to utilize other system parameters, such as link utilization and topology information, to improve the training of the ANN, aiming to enhance the accuracy and efficiency of the hotspot predictor.

Acknowledgment

The authors would like to thank the Massachusetts Institute of Technology Raw Team, Cambridge, for supplying the Raw simulator and the VersaBench applications, and P. Gratz of Texas A&M University, College Station, for supplying and helping to understand the TRIPS CMP OCN applications.

References

[1] E. Baydal, P. Lopez, and J. Duato, “A family of mechanisms for congestion control in wormhole networks,” IEEE Trans. Parallel Distrib. Syst., vol. 16, no. 9, pp. 772–784, Sep. 2005.
[2] S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Leung, J. MacKay, M. Reif, L. Bao, J. Brown, M. Mattina, C.-C. Miao, C. Ramey, D. Wentzlaff, W. Anderson, E. Berger, N. Fairbanks, D. Khan, F. Montenegro, J. Stickney, and J. Zook, “TILE64 processor: A 64-core SoC with mesh interconnect,” in Proc. 55th IEEE Int. Solid-State Circuits Conf., Feb. 2008, pp. 88–598.
[3] D. Bertozzi and L. Benini, “Xpipes: A network-on-chip architecture for gigascale systems-on-chip,” IEEE Circuits Syst. Mag., vol. 4, no. 2, pp. 18–31, Mar.–May 2004.
[4] T. Bjerregaard and S. Mahadevan, “A survey of research and practices of network-on-chip,” ACM Comput. Surveys, vol. 38, no. 1, pp. 1–51, Mar. 2006.
[5] P. Bogdan and R. Marculescu, “Statistical physics approaches for network-on-chip traffic characterization,” in Proc. 7th ACM Int. Conf. Hardw./Softw. Codes. Syn., Oct. 2009, pp. 461–470.
[6] P. Bogdan and R. Marculescu, “Workload characterization and its impact on multicore platform design,” in Proc. 8th ACM Int. Conf. Hardw./Softw. Codes. Syn., Oct. 2010, pp. 231–240.
[7] E. Bolotin, I. Cidon, R. Ginosar, and A. Kolodny, “QNoC: QoS architecture and design process for network on chip,” Elsevier J. Syst. Architect., vol. 50, nos. 2–3, pp. 105–128, Feb. 2004.
[8] J. W. van den Brand, C. Ciordas, K. Goossens, and T. Basten, “Congestion-controlled best-effort communication for networks-on-chip,” in Proc. 10th ACM/IEEE Des., Automat. Test Eur. Conf. Exhibit., Apr. 2007, pp. 948–953.
[9] W. J. Dally, “Virtual-channel flow control,” IEEE Trans. Parallel Distrib. Syst., vol. 3, no. 2, pp. 194–205, Mar. 1992.
[10] W. J. Dally and B. Towles, “Route packets not wires: On-chip interconnection networks,” in Proc. 38th IEEE Des. Automat. Conf., Jun. 2001, pp. 684–689.
[11] W. J. Dally and B. Towles, Principles and Practices of Interconnection Networks. San Mateo, CA: Morgan Kaufmann, 2004.
[12] M. Daneshtalab, A. Sobhani, A. Afzali-Kusha, O. Fatemi, and Z. Navabi, “NoC hot spot minimization using AntNet dynamic routing algorithm,” in Proc. 16th IEEE Int. Conf. Applicat.-Specific Syst., Architect. Process., Dec. 2006, pp. 33–38.
[13] J. Duato, “A new theory of deadlock-free adaptive routing in wormhole networks,” IEEE Trans. Parallel Distrib. Syst., vol. 4, no. 12, pp. 1320–1331, Dec. 1993.
[14] J. Duato, I. Johnson, J. Flich, F. Naven, P. Garcia, and T. Nachiondo, “A new scalable and cost-effective congestion management strategy for lossless multistage interconnection networks,” in Proc. 11th IEEE Symp. HPCA, Feb. 2005, pp. 108–119.
[15] P. Gaughan and S. Yalamanchili, “Adaptive routing protocols for hypercube interconnection networks,” IEEE Comput., vol. 26, no. 5, pp. 12–23, May 1993.
[16] K. Goossens, J. Dielissen, and A. Radulescu, “AEthereal network on chip: Concepts, architectures, and implementations,” IEEE Des. Test Comput., vol. 22, no. 5, pp. 414–421, Sep.–Oct. 2005.
[17] P. Gratz, C. Kim, R. McDonald, S. W. Keckler, and D. Burger, “Implementation and evaluation of on-chip network architectures,” in Proc. 24th IEEE Int. Conf. Comput. Des., Oct. 2006, pp. 477–484.
[18] P. Gratz, B. Grot, and S. W. Keckler, “Regional congestion awareness for load balance in networks-on-chip,” in Proc. 14th IEEE Int. Symp. High-Performance Comput. Architect., Feb. 2008, pp. 203–214.


[19] S. Hashem, Z. H. Ashour, E. F. A. Gawad, and A. A. Hakeem, “A novel approach for training neural networks for long-term prediction,” in Proc. IEEE Int. Joint Conf. Neural Netw., vol. 3, Jul. 1999, pp. 1594–1599.
[20] K. S. Hashemi and R. J. Thomas, “On the number of training points needed for adequate training of feedforward neural networks,” in Proc. 1st IEEE Int. Forum Applicat. Neural Netw. Power Syst., Jul. 1991, pp. 232–236.
[21] W. S. Ho and D. L. Eager, “A novel strategy for controlling hot-spot congestion,” in Proc. 29th IEEE Int. Conf. Parallel Process., Aug. 1989, pp. 14–18.
[22] A. Hodgkin and A. Huxley, “A quantitative description of membrane current and its application to conduction and excitation in nerve,” J. Physiol., vol. 117, no. 4, pp. 500–544, Aug. 1952.
[23] J. Howard, S. Dighe, S. R. Vangal, G. Ruhl, N. Borkar, S. Jain, V. Erraguntla, M. Konow, M. Riepen, M. Gries, G. Droege, T. Lund-Larsen, S. Steibl, S. Borkar, V. K. De, and R. Van Der Wijngaart, “A 48-core IA-32 processor in 45 nm CMOS using on-die message-passing and DVFS for performance and power scaling,” IEEE J. Solid-State Circuits, vol. 46, no. 1, pp. 173–183, Jan. 2011.
[24] J. Hu and R. Marculescu, “DyAD: Smart routing for networks-on-chip,” in Proc. 41st IEEE/ACM Des. Automat. Conf., Jun. 2004, pp. 260–263.
[25] A. K. Jain, J. Mao, and K. M. Mohiuddin, “Artificial neural networks: A tutorial,” IEEE Comput., vol. 29, no. 3, pp. 31–44, Mar. 1996.
[26] E. Kakoulli, V. Soteriou, and T. Theocharides, “An artificial neural network-based hotspot prediction mechanism for NoCs,” in Proc. 9th IEEE Annu. Symp. VLSI, Jul. 2010, pp. 339–344.
[27] P. Lotfi-Kamran, M. Daneshtalab, C. Lucas, and Z. Navabi, “BARP: A dynamic routing protocol for balanced distribution of traffic in NoCs,” in Proc. 11th ACM/IEEE Des., Automat. Test Eur. Conf. Exhibit., Mar. 2008, pp. 1408–1413.
[28] S. Kumar, A. Jantsch, J.-P. Soininen, M. Forsell, M. Millberg, J. Oberg, K. Tiensyrja, and A. Hemani, “A network-on-chip architecture and design methodology,” in Proc. 1st IEEE Annu. Symp. VLSI, Aug. 2002, pp. 105–112.
[29] M. Li, Q.-A. Zeng, and W.-B. Jone, “DyXY: A proximity congestion-aware deadlock-free dynamic routing method for network on chip,” in Proc. 43rd IEEE Des. Automat. Conf., Jul. 2006, pp. 849–852.
[30] I. Maqsood, M. R. Khan, and A. Abraham, “An ensemble of neural networks for weather forecasting,” Neural Comput. Applicat., vol. 13, no. 2, pp. 112–122, Jun. 2004.
[31] R. Marculescu, U. Y. Ogras, L.-S. Peh, N. E. Jerger, and Y. Hoskote, “Outstanding research problems in NoC design: System, microarchitecture, and circuit perspectives,” IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 28, no. 1, pp. 3–21, Jan. 2009.
[32] E. Nilsson, M. Millberg, J. Oberg, and A. Jantsch, “Load distribution with the proximity congestion awareness in a network on chip,” in Proc. 6th IEEE/ACM Des., Automat. Test Eur. Conf. Exhibit., Mar. 2003, pp. 1126–1127.
[33] U. Y. Ogras and R. Marculescu, “Analysis and optimization of prediction-based flow control in networks-on-chip,” ACM Trans. Des. Automat. Electron. Syst., vol. 13, no. 1, pp. 1–28, Jan. 2008.
[34] J. D. Owens, W. J. Dally, R. Ho, D. N. Jayasimha, S. W. Keckler, and L.-S. Peh, “Research challenges for on-chip interconnection networks,” IEEE Micro, vol. 27, no. 5, pp. 96–108, Sep.–Oct. 2007.
[35] P. P. Pande, C. Grecu, M. Jones, A. Ivanov, and R. Saleh, “Performance evaluation and design trade-offs for network-on-chip interconnect architectures,” IEEE Trans. Comput., vol. 54, no. 8, pp. 1025–1040, Aug. 2005.
[36] L.-S. Peh and W. J. Dally, “Flit-reservation flow-control,” in Proc. 6th IEEE Int. Symp. High-Performance Comput. Architect., Jan. 2000, pp. 73–84.
[37] L.-S. Peh and W. J. Dally, “A delay model and speculative architecture for pipelined routers,” in Proc. 7th IEEE Int. Symp. High-Performance Comput. Architect., Jan. 2001, pp. 255–266.
[38] K. Sankaralingam, R. Nagarajan, R. McDonald, R. Desikan, S. Drolia, M. S. Govindan, P. Gratz, D. Gulati, H. Hanson, C. Kim, H. Liu, N. Ranganathan, S. Sethumadhavan, S. Sharif, P. Shivakumar, S. W. Keckler, and D. Burger, “Distributed microarchitectural protocols in the TRIPS prototype processor,” in Proc. 39th IEEE/ACM Int. Symp. Microarchitect., Dec. 2006, pp. 480–491.
[39] H. Sarbazi-Azad, M. Ould-Khaoua, and L. M. Mackenzie, “An analytical model of fully-adaptive wormhole-routed k-ary n-cubes in the presence of hot spot traffic,” IEEE Trans. Comput., vol. 50, no. 7, pp. 623–634, Jul. 2001.
[40] D. Seo, A. Ali, W.-T. Lim, and N. Rafique, “Near-optimal worst-case throughput routing for two-dimensional mesh networks,” in Proc. 32nd IEEE/ACM Int. Symp. Comput. Architect., Jun. 2005, pp. 432–443.


[41] A. Singh, W. J. Dally, A. K. Gupta, and B. Towles, “GOAL: A load-balanced adaptive routing algorithm for torus networks,” in Proc. 30th IEEE/ACM Int. Symp. Comput. Architect., Jun. 2003, pp. 194–205.
[42] A. Smith, J. Burrill, J. Gibson, B. Maher, N. Nethercote, B. Yoder, D. Burger, and K. S. McKinley, “Compiling for EDGE architectures,” in Proc. 4th IEEE/ACM Int. Symp. Code Generat. Optimiz., Mar. 2006, pp. 185–195.
[43] Standard Performance Evaluation Corporation [Online]. Available: http://www.spec.org
[44] G. Steven, R. Anguera, C. Egan, F. Steven, and L. Vintan, “Dynamic branch prediction using neural networks,” in Proc. 4th Euromicro Conf. Digital Syst. Des., Sep. 2001, pp. 178–185.
[45] M. B. Taylor, J. Psota, A. Saraf, N. Shnidman, V. Strumpen, M. Frank, S. Amarasinghe, A. Agarwal, W. Lee, J. Miller, D. Wentzlaff, I. Bratt, B. Greenwald, H. Hoffmann, P. Johnson, and J. Kim, “Evaluation of the Raw microprocessor: An exposed-wire-delay architecture for ILP and streams,” in Proc. 31st IEEE/ACM Int. Symp. Comput. Architect., Jul. 2004, pp. 2–13.
[46] L. Tedesco, T. R. da Rosa, F. Clermidy, N. Calazans, and F. G. Moraes, “Implementation and evaluation of a congestion aware routing algorithm for networks-on-chip,” in Proc. 23rd Symp. Integr. Circuits Syst. Des., Sep. 2010, pp. 91–96.
[47] R. de A. Teixeira, A. de P. Braga, R. H. C. Takahashi, and R. R. Saldanha, “A multi-objective optimization approach for training artificial neural networks,” in Proc. 6th IEEE Brazilian Symp. Neural Netw., Nov. 2000, pp. 168–172.
[48] I. Walter, I. Cidon, R. Ginosar, and A. Kolodny, “Access regulation to hot-modules in wormhole NoCs,” in Proc. 1st ACM/IEEE Annu. Symp. Netw.-on-Chip, May 2007, pp. 137–148.
[49] D. Wu, B. M. Al-Hashimi, and M. T. Schmitz, “Improving routing efficiency for network-on-chip through contention-aware input selection,” in Proc. 11th IEEE Asia South Pacific Conf. Des. Automat., Mar. 2006, pp. 6–10.

Elena Kakoulli (S’10) received the Master's degree in computer science from the University of Cyprus, Nicosia, Cyprus. She is currently pursuing the Ph.D. degree in computer engineering with the Department of Electrical and Computer Engineering and Informatics, Cyprus University of Technology, Limassol, Cyprus. Her current research interests include computer architecture and interconnection networks.

Vassos Soteriou (S’03–M’08) received the B.S. degree from Rice University, Houston, TX, in 2001, and the Ph.D. degree from Princeton University, Princeton, NJ, in 2006, both in electrical engineering. Since 2007, he has been a Lecturer with the Department of Electrical and Computer Engineering and Informatics, Cyprus University of Technology, Limassol, Cyprus. His current research interests include interconnection networks, on-chip networks, and multicore architectures, with emphasis on power-consumption management methodologies, fault tolerance, performance enhancements, and design-space exploration. Dr. Soteriou was a recipient of the Best Paper Award at the 2004 IEEE International Conference on Computer Design.

Theocharis Theocharides (S’01–M’05–SM’11) received the Ph.D. degree in computer science and engineering from the Pennsylvania State University, University Park. He is currently an Assistant Professor with the Department of Electrical and Computer Engineering, University of Cyprus, Nicosia, Cyprus. His current research interests include intelligent embedded systems design, with emphasis on the design of reliable and low power embedded and application-specific processors, media processors, and real-time digital artificial intelligence applications.