Runtime Multi-Optimizations for Energy Efficient On-chip Interconnections¹

Yuan He∗, Masaaki Kondo∗, Takashi Nakada∗, Hiroshi Sasaki†, Shinobu Miwa‡ and Hiroshi Nakamura∗
∗The University of Tokyo, Bunkyo-ku, Tokyo, Japan
†Columbia University, New York, NY, USA
‡The University of Electro-Communications, Chofu, Tokyo, Japan
Email: ∗{he, kondo, nakada, nakamura}@hal.ipc.i.u-tokyo.ac.jp, †[email protected], ‡[email protected]
Abstract—The on-chip interconnect (or NoC) is a major performance and power contributor in modern and future multicore processors. Many optimization techniques have been developed to improve its bandwidth, latency and power consumption. However, it is not clear how energy efficiency is affected, since an optimization technique normally comes with overheads. This paper therefore addresses when and how such optimization techniques should be applied and tuned to achieve better energy efficiency. We first model the performance and energy impacts of representative NoC optimization techniques. These models help us understand the consequences of applying these optimization techniques, and their combinations, under different circumstances. Moreover, based on this modeling, we propose and implement an adaptive control over these NoC optimization techniques to improve both the performance and the energy efficiency of the network. Our results show that this proposal achieves average improvements of 26% and 57% in network performance and energy-delay product, respectively.
Keywords—networks-on-chip; energy efficiency; optimization; modeling; adaptive control

¹This work was, in part, supported by the Japan Science and Technology Agency CREST program, “A Power Management Framework for Post Peta-Scale Supercomputers”.
978-1-4673-7166-7/15/$31.00 © 2015 IEEE

I. INTRODUCTION

With a rapidly increasing number of cores, the demand for scalable and efficient on-chip interconnects grows significantly. As a consequence, both the performance and the power of NoCs must be considered carefully, since their size and complexity scale rapidly with the number of cores. This has made NoCs one of the main performance and power contributors, and thus their energy efficiency an important concern, as it is one of the most critical metrics for modern and future computer systems. There have been many studies on optimizing the performance and power consumption of NoCs. Performance-wise, there are router designs focusing on shortening the latency of a router [19, 17, 14, 5, 6, 7]. There are also studies which focus on improving the bandwidth of NoCs through different approaches, such as compression [2, 10], allocators with higher matching efficiency [16], and buffer designs allowing higher capacity [9]. Power-wise, there are many existing studies on saving the static power of routers through power management techniques such as power gating [15] or proportionally supplying power to the network with respect to the traffic demand [3]. This wide variety of optimization techniques helps improve both the performance and the power of NoCs, but it also brings several concerns. First, all these techniques come with overheads. For example, performance optimization techniques such as low-latency routers have power overhead, while power optimization techniques such as power gating may hurt performance. Second, the usefulness of some optimization techniques depends on independent factors, for example, the compressibility of the traffic in traffic compression.

With so many optimization techniques, and with the complexity of evaluating and implementing them in the design process increasing exponentially, a better approach to assess and utilize them is indispensable. How the network behaves under different workloads at runtime once optimization techniques are applied is also important. Some optimization techniques may not provide much energy-saving benefit, as they may cause performance loss which even worsens the energy efficiency. Thus, further ideas are needed to control such optimization techniques at runtime.

Conventionally, to find out the performance and energy impacts of NoC optimization techniques, they may need to be evaluated with methods such as architecture-level simulations. The problem with such simulations, however, is the time they take. To address this issue, we carry out a study on the adaptive control of NoC optimization techniques in order to improve energy efficiency. We build and verify performance and energy models for various NoC optimization techniques, which also helps us explore and find the best optimization mix for a particular application statically. With the help of these models, we can also make reasonable decisions at runtime to turn the implemented optimization techniques on or off for further energy efficiency improvements. We thus propose an adaptive NoC optimization control and evaluate its effectiveness.

The rest of this paper is organized as follows. Section II presents the modeling of NoC optimization techniques. We discuss an adaptive control method in Section III. In Section IV, we provide details on how we validate and evaluate the models and our method. Section V presents the results. Finally, Section VI concludes this paper.

II. MODELING THE NOC OPTIMIZATION TECHNIQUES
In this section, we model the performance and the energy consumption of NoCs, including the performance and energy impacts of three representative NoC optimization techniques. All these models are validated later in Section V-A.

A. Performance Modeling

To understand the performance of a NoC, we first focus on the average network latency per flit, as in Equation (1). It can be divided into the average zero-load latency (L_ZeroLoad), which is the latency from the source to the destination without contentions, and the queuing latency (L_Queue). The zero-load latency can be further divided into several parts: the network interface (L_NI), routing (L_Route) and link traversal (L_Link), as written in Equation (2), where H is the average number of hops a flit travels in the network.

    L_Net = L_ZeroLoad + L_Queue                                        (1)

    L_ZeroLoad = 2 × L_NI + L_Route × H + L_Link × (H + 1)              (2)

The average queuing latency per flit, L_Queue, can be approximated with the M/D/1 queue model [4] as in Equation (3), where N_Packet is the number of packets injected into the network and N_NI is the number of network interfaces injecting traffic. L_Prop is the average propagation delay of the packet containing the modeled flit; since it is determined by the number of flits in the packet, we model it by dividing the number of flits (N_Flit) by the number of packets (N_Packet) traveling through the network, as in Equation (4).

    L_Queue = (N_Packet / N_NI) × (L_ZeroLoad + L_Prop)² / (2 × (1 − (N_Packet / N_NI) × (L_ZeroLoad + L_Prop)))    (3)

    L_Prop = N_Flit / N_Packet                                          (4)
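Equations (1)–(4) are simple enough to evaluate directly. The sketch below is a minimal Python rendering of this latency model; the function names are ours and any inputs are illustrative assumptions (in particular, N_Packet is treated as a per-cycle injection rate so that the M/D/1 utilization term is dimensionless):

```python
# Minimal sketch of the latency model in Equations (1)-(4).
# Function names and example values are illustrative assumptions.

def zero_load_latency(l_ni, l_route, l_link, hops):
    """Equation (2): NI, per-hop routing, and per-hop link latency."""
    return 2 * l_ni + l_route * hops + l_link * (hops + 1)

def propagation_delay(n_flit, n_packet):
    """Equation (4): average packet length in flits."""
    return n_flit / n_packet

def queuing_latency(n_packet, n_ni, l_zero_load, l_prop):
    """Equation (3): M/D/1 approximation of the queuing delay.

    n_packet / n_ni is treated as the per-cycle injection rate of one
    network interface; the model is only valid below saturation.
    """
    rate = n_packet / n_ni
    service_time = l_zero_load + l_prop
    utilization = rate * service_time
    assert utilization < 1, "beyond saturation; M/D/1 model invalid"
    return (rate * service_time ** 2) / (2 * (1 - utilization))

def network_latency(l_ni, l_route, l_link, hops, n_flit, n_packet, n_ni):
    """Equation (1): zero-load latency plus queuing latency."""
    l_zero_load = zero_load_latency(l_ni, l_route, l_link, hops)
    l_prop = propagation_delay(n_flit, n_packet)
    return l_zero_load + queuing_latency(n_packet, n_ni, l_zero_load, l_prop)
```

For instance, with a 2-cycle NI, a 3-cycle router, 1-cycle links and 3 average hops, the zero-load latency is 17 cycles, and a light injection load adds only a few cycles of queuing delay on top of it.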
B. Energy Modeling

We model the average energy consumption per flit (E_Net) as written in Equation (5), which is further divided into static and dynamic energy (E_SNet and E_DNet) in Equations (6) and (7), respectively. In Equation (7), the dynamic router and link energy per flit are modeled with the average number of accesses per flit to routers (H) and links (H + 1). Since the static power (P_SNet) and the clock power (P_DClk) are always consumed, they are multiplied by the runtime (T(L_Net)) in Equations (6) and (7). T(L_Net) is related to the network performance (in our case, L_Net). When we validate our models against simulations, we simply take the application runtime as T(L_Net); when the models are applied in our proposal, we take the runtime of the shutter period as T(L_Net). The power and energy parameters of these models, such as the static power of the network, the clock power of the network, the dynamic energy per router access (E_DRouter) and the dynamic energy per link access (E_DLink), are obtained from the Orion simulator.

    E_Net = E_DNet + E_SNet                                             (5)

    E_SNet = P_SNet × T(L_Net) / N_Flit                                 (6)

    E_DNet = E_DRouter × H + E_DLink × (H + 1) + P_DClk × T(L_Net) / N_Flit    (7)

C. The NoC Optimization Techniques and Their Impacts

In this subsection, we introduce a few important NoC optimization techniques: power gating (PG), prediction router (PR), and traffic compression (TC). These are chosen because they target different problems and each is representative of its kind. After a brief introduction of each, we model their impacts on both performance and energy; these models are summarized in TABLE I. Note that we also consider combinations of these optimization techniques in our evaluations.

•  Power Gating: PG is a representative static power reduction technique, which cuts off the power supply to idle circuit blocks by turning off (or on) power switches inserted between the GND/VDD lines and the blocks [8]. Applying it to routers saves their static power when they are not actively used [15]. We model the wakeup latency (L_Wakeup) and the dynamic energy overhead (E_DPS) of switching (N_PG is the number of switching events). The static power after PG is modeled by considering the power-on time of the routers (T_RouterOn_i is the time Router i is powered on, while N_Router is the number of routers).

•  Prediction Router: PR [14] is a popular low-latency router design whose speculative switch traversal is enabled (in which case the router latency equals the time of a switch traversal, L_ST) by predicting the output port before a packet actually arrives at a router. For a successful prediction (R_Pred is the prediction accuracy), the latency of the router pipeline is hidden. The prediction units also consume dynamic and static energy (E_DPR and E_SPR; N_PR is the number of predictions taken).

•  Traffic Compression: TC is used to conserve bandwidth and reduce packet latency [2, 10]. It may also help save power, since a successfully compressed packet has its size shrunk (R_Compression is the compression ratio), which saves dynamic energy while the shrunk packet traverses the network. The performance impact is modeled through the change in the number of flits, which in turn affects the propagation delay of packets. For energy, smaller packets reduce dynamic energy, but the compressor and de-compressor circuits consume dynamic and static energy (E_DCU and E_SCU; N_TC is the number of compressions taken).

III. ADAPTIVE CONTROL OF NOC OPTIMIZATION TECHNIQUES

This section introduces our proposal for adaptively controlling NoC optimization techniques. Applying multiple optimization techniques statically is not good enough, since applications have phases in which they may behave very differently. This leads us to adaptively throttle these optimization techniques from time to time. To establish adaptive control, we rely on the models introduced in Section II.
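The per-flit energy model of Section II-B, on which this adaptive control relies, can be sketched as follows. This is a minimal illustration of Equations (5)–(7); the function names and any numbers are our own assumptions, and in the paper the power and energy parameters would come from Orion:

```python
# Minimal sketch of the per-flit energy model in Equations (5)-(7).
# All names and values are illustrative assumptions.

def static_energy_per_flit(p_snet, runtime, n_flit):
    """Equation (6): static power over the runtime, amortized per flit."""
    return p_snet * runtime / n_flit

def dynamic_energy_per_flit(e_drouter, e_dlink, hops, p_dclk, runtime, n_flit):
    """Equation (7): router/link access energy per flit plus the
    always-on clock power amortized over all flits."""
    return e_drouter * hops + e_dlink * (hops + 1) + p_dclk * runtime / n_flit

def energy_per_flit(p_snet, e_drouter, e_dlink, hops, p_dclk, runtime, n_flit):
    """Equation (5): total per-flit energy = dynamic + static."""
    return (dynamic_energy_per_flit(e_drouter, e_dlink, hops,
                                    p_dclk, runtime, n_flit)
            + static_energy_per_flit(p_snet, runtime, n_flit))
```

Note how the runtime T(L_Net) enters both the static and the clock term: any latency change produced by an optimization technique feeds back into the energy estimate.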
Fig. 1: Procedures to select optimization techniques adaptively (a timeline alternating between a shutter period, model-based optimization selection, and execution with the selected optimizations).

The method works as follows: optimization techniques are switched on for an epoch in which they can productively help (performance or energy efficiency), according to performance counter values collected during a shutter period (see Fig. 1). In our proposal, we rely on our performance and energy models to determine whether an optimization technique should be switched on or off at runtime. Within a shutter period, all optimization techniques are switched on, and if the performance and energy models
predict a positive outcome for an optimization technique, we switch it on for the epoch following the shutter period. This process is repeated until the application ends. This proposal can be implemented in hardware as a centralized controller, which communicates with network components to retrieve performance statistics and to send control messages. The cost of this communication is negligible, since it is much shorter than the length of an epoch.

TABLE I: Performance and energy models of NoC optimization techniques.

PG — Performance impact:     L_Route = L_Route + L_Wakeup                                          (8)
     Dynamic energy impact:  E_DNet = E_DNet + E_DPS × 2N_PG                                       (9)
     Static energy impact:   E_SNet = P_SNet × (Σ_{i=1..N_Router} T_RouterOn_i) / (T(L_Net) × N_Router)   (10)

PR — Performance impact:     L_Route = L_ST × R_Pred + L_Route × (1 − R_Pred)                      (11)
     Dynamic energy impact:  E_DNet = E_DNet + E_DPR × N_PR                                        (12)
     Static energy impact:   E_SNet = E_SNet + E_SPR × N_Router                                    (13)

TC — Performance impact:     N_Flit = N_Flit / R_Compression                                       (14)
     Dynamic energy impact:  E_DNet = E_DNet / R_Compression + E_DCU × N_TC                        (15)
     Static energy impact:   E_SNet = E_SNet + E_SCU × N_NI                                        (16)

IV. METHODOLOGY FOR MODEL VALIDATIONS AND EVALUATIONS

The evaluation in this paper involves two parts. We first validate our models with full system simulations. We then evaluate the proposed method with execution traces.

Full system simulations of performance and energy are carried out on GEMS/Simics [13, 12], extended with the network model from GARNET [1] and the power model from Orion [11]. The simulation parameters are summarized in TABLE II. A schematic view of the simulated platform and the composition of a tile are illustrated in Fig. 2. The applications used are based on workloads from NPB 3.3 [18] and SPLASH-2 [20].

Fig. 2: The evaluated platform (a mesh of tiles, each containing a core/L1$, NI, L2$/directory and router; links connect the routers, and memory controllers sit on the corners).

TABLE II: Simulation parameters.
Number of cores:     16
Topology:            4 × 4 mesh
Processor:           4 GHz, in-order
L1 I/D cache:        32 KB per core, 4-way set associative, 1-cycle access latency
L2 cache:            16 banks, 256 KB per bank, 16-way set associative, 6-cycle access latency
Cache line size:     64 Bytes
Memory controller:   4 in total, 1 on each corner
Main memory:         4 GB, 160-cycle access latency
Coherence protocol:  MOESI, directory
Link:                128-bit, 1-cycle traversal
Packet:              128-bit control, 640-bit data
Router:              1 GHz, virtual channel router
Virtual channel:     2 per virtual network
Virtual network:     3 per physical link
Routing algorithm:   X-Y routing

To adaptively select optimization techniques, we collect periodic execution traces with the necessary performance counters to let our models determine which combination of optimization techniques is best for the coming epoch. In the evaluation of our adaptive approach, the sizes of shutter periods and epochs are determined by instruction counts. We pick 200K instructions as the epoch size, since finer-grained control does not bring any obvious benefit to the evaluation results. The metrics we use are latency per flit and energy-delay product per flit; the first measures network performance, while the second captures both performance and energy efficiency.

V. RESULTS

A. Model Validations

Fig. 3: Model validation for latency per flit (error in percent for NO-OP, PG, PR, TC, PG&PR, PG&TC, PR&TC and PG&PR&TC).

Fig. 4: Model validation for energy per flit (error in percent for the same configurations).

Fig. 3 and Fig. 4 present the validation results of our models against simulations. Both our latency and energy models are accurate. The error of the latency model is mostly within 15%, and it comes from the approximation we make for the queuing latency. The energy model is more accurate, with an error of less than 2%, since we model energy in the same way as the power simulator (Orion 2): both our model and Orion estimate energy consumption by counting the events at network components. With these models, we can easily predict the performance and energy impacts of different NoC optimization techniques
without incurring any time-consuming simulations. This is very important for the runtime adaptive control we propose.
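The per-epoch decision can be sketched as follows. This is an illustrative rendering of the selection step only: it enumerates the 2³ on/off combinations of PG, PR and TC, predicts latency and energy for each with simplified, Table I-style adjustments, and enables the combination with the lowest predicted energy-delay product. All adjustment factors and counter names here are hypothetical placeholders, not the paper's exact models:

```python
# Illustrative sketch of model-based selection at the end of a shutter
# period. The adjustment factors below are hypothetical placeholders.

from itertools import product

def predict(stats, use_pg, use_pr, use_tc):
    """Predict (latency, energy) for one on/off combination.

    'stats' holds counter-derived baseline estimates for the epoch.
    """
    latency = stats["latency"]
    energy = stats["energy"]
    if use_pr:   # prediction router: hides pipeline latency on hits
        latency *= 1 - 0.3 * stats["pred_accuracy"]
        energy += stats["pr_overhead"]
    if use_tc:   # compression: fewer flits, but (de)compressor energy
        latency *= 1 / stats["compression_ratio"]
        energy = energy / stats["compression_ratio"] + stats["tc_overhead"]
    if use_pg:   # power gating: saves static energy, adds wakeup delay
        latency += stats["wakeup_penalty"]
        energy -= stats["static_saving"]
    return latency, energy

def select_best(stats):
    """Enumerate all 2^3 combinations and pick the minimum-EDP one."""
    best, best_edp = None, float("inf")
    for cfg in product([False, True], repeat=3):   # (pg, pr, tc)
        latency, energy = predict(stats, *cfg)
        edp = energy * latency
        if edp < best_edp:
            best, best_edp = cfg, edp
    return dict(zip(("PG", "PR", "TC"), best))
```

Because only eight combinations exist, an exhaustive model evaluation per shutter period is cheap, which is what makes a hardware implementation of the controller plausible.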
Fig. 5: Latency per flit (in cycles) under different optimization techniques (NO-OP, PG, PR, TC, PG&PR, PG&TC, PR&TC, PG&PR&TC, Adaptive and Oracle).

Fig. 6: Energy delay product per flit (in nJ·cycles) under the same optimization techniques.

B. Adaptive Control of NoC Optimization Techniques

Fig. 5 depicts the latency per flit when single and multiple NoC optimization techniques are applied, compared with our adaptive approach and the oracle case. Our approach performs nearly as well as the oracle, while PR&TC is the best-performing combination; its latency per flit is also similar to that of our approach. This confirms the effectiveness of our approach, since it selects the best combination of optimization techniques for each epoch, and most of these choices should be PR&TC. Fig. 6 presents the energy-delay product results. Our approach is again close to the oracle, while PG&PR is the best-performing combination in terms of energy-delay product per flit. In summary, if the target is performance, PR&TC is preferred, while PG&PR is more suitable for improving energy efficiency. In either case, our approach performs very well and produces outcomes very similar to the oracle.

VI. CONCLUSIONS

In this work, we have presented how multiple NoC optimization techniques can be tuned at runtime for performance and energy efficiency. Through modeling, we capture the impacts of selected NoC optimization techniques on both performance and energy in a simple way. Then, with the help of this modeling, we establish both design exploration and adaptive control of multiple optimization techniques on NoCs. Our approach is the first to provide in-depth analyses of the use of multiple NoC optimization techniques. Through our evaluations, we found that adaptively throttling NoC optimization techniques at runtime is very promising for both network performance and energy efficiency.

REFERENCES
[1] N. Agarwal et al., “GARNET: a detailed on-chip network model inside a full-system simulator,” in Proc ISPASS’09, pp. 33–42, 2009. [2] R. Das et al., “Performance and power optimization through data compression in network-on-chip architectures,” in Proc. 14th HPCA, pp. 215–225, Feb 2008. [3] R. Das et al., “Catnap: Energy proportional multiple networkon-chip,” in Proc. 40th ISCA, pp. 320–331, 2013. [4] Z. Guz et al., “Network delays and link capacities in applicationspecific wormhole nocs,” VLSI Design, 2007. [5] M. Hayenga and M. Lipasti, “The NoX router,” in MICRO 44: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, pp. 36–46, Dec. 2011. [6] Y. He et al., “Predict-more router: A low latency noc router with more route predictions,” in Parallel and Distributed Processing Symposium Workshops and PhD Forum (IPDPSW), 2013 IEEE 27th International, pp. 842–850, May 2013. [7] Y. He et al., “Mcrouter: Multicast within a router for high performance network-on-chips,” in Proceedings of the 22Nd International Conference on Parallel Architectures and Compilation Techniques, ser. PACT ’13, pp. 319–330. [8] Z. Hu et al., in Proc. ISLPED ’04, pp. 32–37, 2004. [9] H. Jang et al., “A hybrid buffer design with stt-mram for on-chip interconnects,” in Proc. NOCS’12, ser. NOCS ’12, pp. 193–200, 2012. [10] Y. Jin, K. H. Yum, and E. J. Kim, “Adaptive data compression for high-performance low-power on-chip networks,” in Proc. 41st MICRO, pp. 354–363, 2008. [11] A. Kahng et al., “Orion 2.0: A fast and accurate noc power and area model for early-stage design space exploration,” in Design, Automation Test in Europe Conference Exhibition, 2009. DATE ’09., pp. 423–428, April 2009. [12] P. Magnusson et al., “Simics: a full system simulation platform,” IEEE Computer, vol. 35, no. 2, pp. 50–58, 2002. [13] M. M. K. Martin et al., “Multifacet’s general execution-driven multiprocessor simulator (GEMS) toolset,” SIGARCH Computer Architecture News, vol. 33, no. 
4, Nov. 2005. [14] H. Matsutani et al., “Prediction router: yet another low latency on-chip router architecture,” in Proc 15th HPCA, pp. 367–378, 2009. [15] H. Matsutani et al., “Ultra fine-grained run-time power gating of on-chip routers for cmps,” in Proc. NOCS’10, pp. 61–68, 2010. [16] G. Michelogiannakis et al., “Packet chaining: Efficient singlecycle allocation for on-chip networks,” in Proc. 44th MICRO, pp. 83–94, 2011. [17] R. Mullins, A. West, and S. Moore, “Low-latency virtualchannel routers for on-chip networks,” in ISCA ’04: Proceedings of the 31st annual international symposium on Computer architecture, pp. 188–197, Jun. 2004. [18] http://www.nas.nasa.gov/Resources/Software/npb.html, NAS parallel benchmarks 3.3. [19] L.-S. Peh and W. J. Dally, “A Delay Model and Speculative Architecture for Pipelined Routers,” in Proc. 7th HPCA, pp. 255–255, 2001. [20] S. Woo et al., “The SPLASH-2 programs: characterization and methodological considerations,” in Proc. 22nd ISCA, pp. 24–36, 1995.
2015 33rd IEEE International Conference on Computer Design (ICCD)