
Low-Powered Self-Timed Pipeline with Runtime Fine-Grain Power Supply

Kei MIYAGI (1), Shuji SANNOMIYA (2), Makoto IWATA (1), and Hiroaki NISHIKAWA (2)
(1) School of Information, Kochi University of Technology, Kami, Kochi, Japan
(2) Department of Computer Science, Graduate School of Systems and Information Engineering, University of Tsukuba, Tsukuba Science City, Ibaraki, Japan

Abstract: This paper describes a runtime fine-grain power supply scheme based on self-timed pipeline (STP) circuits. The STP works with local handshake signals, so it requires no global clock distribution, i.e., no centralized control. Therefore, various power supply controls for the STP can be naturally localized in both the spatial and temporal domains without stopping effective data transfer, e.g., program execution in the case of microprocessors. As a result, the proposed power supply scheme can efficiently incorporate both of the commonly used voltage scaling and power gating techniques, and it can further produce synergetic effects on total power saving. In the paper, the effective power-performance characteristics of the STP supported by the proposed scheme are discussed and analyzed in terms of a break-even model of the power reduction effect against its control overhead. Furthermore, the scheme is applied to an ultra-low-power data-driven networking processor, named ULP-CUE, designed in a 65 nm CMOS process, and it is evaluated with typical UDP/IP traffic of a wireless ad hoc network. In this case, the total power can be reduced to about 13% of that of the normal-STP-based data-driven processor.

Keywords: self-timed pipeline, runtime power gating, runtime voltage scaling, fine-grain power-supply

1. Introduction

Lowering the power dissipation of LSI systems is increasingly crucial for realizing greener devices while fully utilizing the potential speed of transistors in deep-submicron process technologies. To achieve such low power consumption, both dynamic and static power dissipation should be reduced as much as possible. The main causes of dynamic power dissipation are transistor switching that is unnecessary for processing and a switching frequency higher than the required processing speed, while static power dissipation is mainly due to the leakage current through inactive transistors. Voltage scaling and power gating are well-known techniques for reducing switching power and static power, respectively [1][2]. These techniques have mostly been applied to relatively coarse-grain power domains of an LSI chip, e.g., a whole die or a processor core, and they are assumed to be controlled by the operating system.

To achieve the ultimate reduction of power consumption, it is essential to control the power supply more finely in both the spatial and temporal domains. To realize such fine control, a powered circuit block itself has to behave autonomously without any centralized control such as a global clock signal. Therefore, in our collaborative research project to establish an ultra-low-power data-driven networking system (ULP-DDNS) [3], we have fully adopted self-timed pipeline (STP) circuits to implement our ultra-low-power data-driven processor (ULP-CUE) without any global clock signal. To minimize the overheads caused by fine-grain power control, our ultra-low-power self-timed pipeline (ULP-STP) circuit has been designed to keep working in part even during supply voltage scaling or power gating. That is, the ULP-CUE processor core can execute programs without evacuating their contexts even when the supply voltage is altered or the power supply is partly cut.

Because of the self-timed elastic data-transfer mechanism of the original STP [4], it works well under variable voltage without any clock frequency adjustment, even if the altered voltage fluctuates transiently at individual pipeline stages. Since the pipeline throughput can be adapted to the processing load only by altering the supply voltage appropriately, a power-aware pipeline scheme can be realized naturally in terms of runtime power saving. For instance, a proportional-integral-derivative (PID) controller can be applied to such voltage control by monitoring the consumption current of a target power domain within the chip.

The STP is also suitable for gating the power supply to fine-grain circuits, since its stage-by-stage data-transfer control independently activates only the pipeline stages holding valid data. We therefore proposed a stage-by-stage power gating scheme for the STP [5].
This scheme provides natural signal gating, i.e., it stops unnecessary signal propagation and transistor switching at the pipeline-stage level without any global control mechanism, which would otherwise cost both power and processing speed. Moreover, the voltage can be scaled even while stages are active, because the STP needs no global oscillator such as a phase-locked loop (PLL) circuit, which would force a pipeline flush ahead of every frequency and voltage change.

By introducing both voltage scaling and power gating into the STP, further synergetic effects on total power saving can be expected. In particular, power gating can be conducted at the lowest scaled voltage so that its power overhead is minimized.

The rest of this paper is organized as follows. Section 2 describes the circuit design of the ULP-STP with runtime fine-grain power supply. Section 3 discusses a break-even model of the power reduction effect against its control overhead for implementing various ULP-STP systems. Section 4 presents a quantitative analysis of the ULP-CUE implemented with the optimized ULP-STP in a 65 nm CMOS process, and Section 5 concludes the paper.

2. Runtime fine-grain power-supply scheme for STP

With the runtime fine-grain power-supply scheme proposed in this paper, voltage scaling and power gating are performed adaptively according to the current processing load within a system. In general, the processing load of a system varies with both the intrinsic parallelism of a program and the extrinsic request traffic to the program. In this section, the autonomous behavior of the self-timed elastic pipeline (STP) is briefly introduced, and then its natural contribution to runtime fine-grain power-supply control is discussed.

2.1 Self-timed elastic pipeline

Each pipeline stage of the STP consists of a data latch serving as a pipeline register, function logic, and a transfer control unit named the C-element. The basic structure of the STP is shown in Figure 1. The data latch, function logic, and C-element are denoted by DL, Logic, and C, respectively. Data is packed with a tag into packet form, and the packet is transferred between pipeline stages as a result of the communication between the C-elements of adjacent stages. The communication is performed stage by stage according to the 4-phase handshake protocol [6], using transfer request and acknowledge signals, which are called the send signal and the ack signal, respectively.

The stage-by-stage transfer control changes the state of each pipeline stage independently. The states of the stages are defined below according to the handshake protocol. Here, the C-element of the i-th stage is denoted by $C_i$.
• Reset state: The send and ack signals are negated after the assertion of the reset signal.
• Idle state: $C_i$ waits until $send_{i-1}$ is asserted.
• Busy state: $send_{i-1}$ is asserted at the beginning of the transfer of a packet from the preceding (i-1)-th stage. After the assertion of $send_{i-1}$, $C_i$ asserts its ack signal ($ack_{i-1}$). In response to this assertion, $C_{i-1}$ negates $send_{i-1}$. After that, if and only if both $send_{i-1}$ and $ack_i$ are negated, $C_i$ asserts $ToDL_i$ to open $DL_i$ and asserts $send_i$ at the same time. As a consequence, the packet is latched in the i-th stage, and the i-th stage goes to the idle state. Otherwise, $C_i$ waits until $ack_i$ is negated while it keeps its send and ack signals.

Fig. 1: Basic structure of self-timed pipeline. (DL: Data Latch; C: Transfer Control Circuit; Ds: Delay Element of send Signal; Da: Delay Element of ack Signal.)

The successive stages receiving the assertion of the send signal go to the busy state, and their C-elements repeat the same transfer control sequence individually. During the handshakes, the send signals are delayed to assure the completion of the primitive logic function, and the ack signals are delayed to assure the setup and hold timing of the DLs.

This stage-by-stage transfer control of the STP suggests the timing of the power controls. That is, in idle stages, the DL and the combinational Logic can be powered off, i.e., their supply voltage can be cut, while the C-element and the sequential Logic can be powered down, i.e., their supply voltage can be lowered to a level just sufficient to keep the circuit states. Moreover, in busy stages, those circuits can be powered down to a level that still assures transistor switching, i.e., the supply voltage can be lowered as long as the required switching speed is achieved.
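The send/ack exchange on a single link between stage i-1 and stage i can be sketched as a small state machine. This is only an illustrative model, not the C-element circuit itself: the class name `HandshakeLink` and the event labels are hypothetical, the Ds/Da delay elements are ignored, and the final negation of ack is taken from the standard return-to-zero form of the 4-phase protocol rather than from the state description above.

```python
class HandshakeLink:
    """Illustrative 4-phase (return-to-zero) send/ack link between two stages."""

    def __init__(self):
        self.send = False   # request driven by the sending stage (i-1)
        self.ack = False    # acknowledge driven by the receiving stage (i)
        self.trace = []     # recorded signal transitions, in order

    def step(self):
        """Perform the single transition that is legal in the current state."""
        if not self.send and not self.ack:
            self.send = True        # sender asserts send: transfer begins
            event = "send+"
        elif self.send and not self.ack:
            self.ack = True         # receiver asserts ack in response
            event = "ack+"
        elif self.send and self.ack:
            self.send = False       # sender negates send after seeing ack
            event = "send-"
        else:
            self.ack = False        # receiver negates ack: link is idle again
            event = "ack-"
        self.trace.append(event)
        return event


link = HandshakeLink()
cycle = [link.step() for _ in range(4)]
print(cycle)
```

Each packet transfer costs exactly four transitions on the link, after which both signals are negated again, which is the condition the busy-state description relies on before a stage forwards its packet.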

2.2 Stage-by-stage power gating

To realize fine stage-by-stage power control, we have already proposed the ultra-low-power self-timed pipeline (ULP-STP) structure illustrated in Figure 2 [5]. In the ULP-STP structure, the VDD supplied to all circuits is scaled by using the dynamic voltage scaling (DVS) technique. In addition, to cut the power supply to the DL and Logic, a high-threshold NMOS transistor, called a power switch (PS), is placed between VSS and the ground-side terminals of the DLs and the combinational logic, which are composed of low-threshold transistors. In this case, an isolation element (ISO) must be inserted between adjacent stages in order to block the propagation of electrically unstable signals from gated stages to active stages. This ISO function can be implemented as a part of a data latch, so the circuit overhead of the ISO (i.e., power and delay time) is negligible. Each power switch is controlled by its power control circuit (PC), which observes the send and ack signals.

Fig. 2: Power-supply control for STP. (PS: Power Switch; PC: Power Control Circuit.)

Fig. 3: Voltage controller. (Blocks shown: shunt resistance, I-V lookup table, PID controller, DC/DC converter.)

The ULP-STP structure makes it possible to power down or power off stage by stage, so that the necessary power is supplied only to the processing stages holding valid data while the leakage power of the other, idle stages is finely cut. This power gating scheme is therefore named runtime fine-grain power gating (RTPG) in this paper. In addition, power gating of each core is achieved without any additional mechanism, because all stages in an idle core are powered down as a result of the stage-by-stage power gating. Since the stages wake up consecutively as newly arriving data travels through the STP, the total rush current of the whole STP system is distributed over time when the PSs are turned on.
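The effect of supplying power only to stages that hold valid data can be illustrated with a simple occupancy model. This is a sketch under strong assumptions, not a circuit-level model: the pipeline is linear rather than ring-shaped, every transfer takes one abstract time step, and wake-up delay, rush current, and PS switching energy are ignored. The function names `advance` and `powered_fraction` are hypothetical.

```python
def advance(stages):
    """Move every packet one stage forward if the next stage is empty."""
    n = len(stages)
    nxt = list(stages)
    # Scan from the output side so that a packet moves at most once per step.
    for i in range(n - 1, -1, -1):
        if nxt[i] is None:
            continue
        if i == n - 1:
            nxt[i] = None                        # packet leaves the pipeline
        elif nxt[i + 1] is None:
            nxt[i + 1], nxt[i] = nxt[i], None    # elastic stage-to-stage transfer
    return nxt


def powered_fraction(n_stages, arrivals, steps):
    """Fraction of stage-steps in which a stage holds data and must be powered."""
    stages = [None] * n_stages
    powered = 0
    for t in range(steps):
        stages = advance(stages)
        if t in arrivals and stages[0] is None:
            stages[0] = t                        # a new packet enters stage 0
        powered += sum(s is not None for s in stages)
    return powered / (n_stages * steps)


# One packet through a 4-stage pipeline observed for 8 steps.
print(powered_fraction(4, {0}, 8))   # 0.125
```

In this toy setting a lightly loaded pipeline keeps only a small fraction of its stages powered, while a saturated input stream drives the fraction toward one, which is the load dependence that RTPG exploits.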

2.3 Runtime voltage scaling

With the proposed structure, DVS and power gating (PG) can be enabled independently or simultaneously. The supply voltage of the STP can be altered autonomously without adjusting any clock frequency, since the STP itself is clockless. The supply voltage can be controlled based on the consumption current, which is proportional to the processing load in the STP system.

When the supply voltage is scaled at runtime, overly sharp scaling may cause the voltage to overshoot or undershoot and thus generate power noise and malfunctions. On the other hand, overly gradual scaling degrades the promptness of the operation. Since the consumption current of the STP usually fluctuates, an adaptive control mechanism is necessary. Thus, we introduce a proportional-integral-derivative (PID) controller into our voltage scaling scheme to minimize the error according to the process control inputs, i.e., the consumption current values.

Figure 3 shows the runtime voltage scaling (RTVS) mechanism with the PID controller. In the I-V lookup table of the figure, the appropriate supply voltage is indexed by a sampled consumption current value. For instance, the target voltages in the I-V table may be designed to achieve the maximum throughput per unit power. How to optimize the I-V table depends on the low-power policy of the target application. Note that the transfer function of a target system can be simply approximated by a dead time and a first-order lag. Thus, the PID parameters for our RTVS can be adjusted by using this approximation.

Fig. 4: Equivalent load capacitance of ULP-STP.

These features indicate that the STP is more robust than a clocked pipeline, especially in nanometer-scale processes with larger variation in transistor performance.
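The RTVS loop described in this subsection, an I-V lookup table feeding a target voltage to a PID controller that drives a plant approximated by a dead time and a first-order lag, can be sketched as follows. All numbers are hypothetical: the table entries, the gains, the time constant, and the dead time are illustrative assumptions in normalized time units, not values from the ULP-CUE chip, and the names `target_voltage`, `PID`, and `simulate_rtvs` are introduced only for this example.

```python
from bisect import bisect_right
from collections import deque

# Hypothetical I-V table: (current threshold [mA], target VDD [V]).
IV_TABLE = [(0.0, 0.8), (5.0, 0.9), (10.0, 1.1), (20.0, 1.3)]


def target_voltage(current_ma):
    """Look up the target VDD for a sampled consumption current."""
    thresholds = [i for i, _ in IV_TABLE]
    idx = max(bisect_right(thresholds, current_ma) - 1, 0)
    return IV_TABLE[idx][1]


class PID:
    """Textbook discrete PID controller (no saturation or anti-windup)."""

    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = None

    def update(self, target, measured):
        error = target - measured
        self.integral += error * self.dt
        if self.prev_error is None:
            derivative = 0.0          # no derivative kick on the first sample
        else:
            derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate_rtvs(current_ma, v0=0.8, steps=2000, dt=0.01, tau=1.0, dead_steps=5):
    """Drive a dead-time plus first-order-lag supply model toward the target."""
    pid = PID(kp=2.0, ki=5.0, kd=0.01, dt=dt)
    target = target_voltage(current_ma)
    v = v0
    in_flight = deque([v0] * dead_steps)   # commands delayed by the dead time
    for _ in range(steps):
        in_flight.append(pid.update(target, v))
        u = in_flight.popleft()
        v += (u - v) * dt / tau            # first-order lag with unity gain
    return target, v


target, settled = simulate_rtvs(25.0)
print(target, round(settled, 3))
```

With these assumed gains the loop is well damped, so the supply settles on the table value without sustained oscillation; a real design would tune the gains against the measured dead time and lag of the actual DC/DC converter and power lines.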

3. Break-even model of the ULP-STP

Although voltage scaling and power gating can reduce the switching power and the leakage power of the transistors, respectively, they incur power overheads such as the charge/discharge current during supply voltage scaling and the switching current of the power switch transistors for power gating. In this section, a break-even model of the power reduction effect against this overhead is discussed, especially for the runtime, fine-grain operation of RTPG and RTVS.

Figure 4 illustrates a simple equivalent circuit model of the ULP-STP. For voltage scaling, there is an equivalent load capacitance $C_{VDD}$ of the on-die/off-die power lines. For power gating, there are equivalent load capacitances $C_L$, $C_{VVSS}$, and $C_{PS}$, which are extracted from the target power domain circuit, the virtual ground line (VVSS), and the power switch (PS), respectively. Here, the switching energy of the PS, $E_{PS}$, is represented as follows:

$$E_{PS} = C_{PS} \times V_{DD}^{2} \tag{1}$$

It is assumed here that the power of the ISO is negligible because the ISO can be merged into the data latch. When the power switch is turned on, the energy consumed by the rush current into the circuit, $E_{rush}$, can be calculated based on [7] as follows:

$$E_{rush} = \left( C_{VVSS} + \frac{1}{2} C_{L} \right) \times V_{DD} \times \Delta V_{VVSS} \tag{2}$$

where $\Delta V_{VVSS}$ denotes the amount by which the voltage of VVSS has risen before wake-up. Based on Equations 1 and 2, the lower bound of the sleep time needed to obtain a power reduction effect, i.e., the break-even time $BET$ at which the leakage energy saved, $P_{leak} \times BET$, equals the overhead $E_{PS} + E_{rush}$, can be represented as follows:

$$BET = \frac{C_{PS} \times V_{DD}^{2} + \left( C_{VVSS} + \frac{1}{2} C_{L} \right) \times V_{DD} \times \Delta V_{VVSS}}{P_{leak}} \tag{3}$$

where $P_{leak}$ denotes the leakage power of the target power domain circuits.

The break-even condition related to voltage scaling depends on the processing load. Thus, the break-even processing load, $BEPL$, can be expressed as follows:

$$BEPL = \frac{C_{VDD}}{C_{L} \times \alpha} \tag{4}$$

where $\alpha$ denotes the average switching probability of the transistors within the target power domain circuits. Although the PID controller itself consumes energy, this energy is negligible compared to the charge/discharge energy of $C_{VDD}$.

Equations 3 and 4 represent the basic break-even conditions of the proposed runtime fine-grain power-supply control scheme. As these equations show, the break-even conditions depend on the load capacitances of the target system. In the following section, the low-power characteristics of the ULP-CUE as an actual STP system are evaluated through the execution of a practical networking application program.
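Equations 1 to 4 translate directly into a few lines of code. The sketch below is only a numerical restatement of the model; the function names are hypothetical and the capacitance, voltage, and leakage values in the usage lines are round illustrative numbers, not parameters of the ULP-CUE.

```python
def ps_switching_energy(c_ps, vdd):
    """Equation 1: switching energy of the power switch [J]."""
    return c_ps * vdd ** 2


def rush_energy(c_vvss, c_l, vdd, delta_vvss):
    """Equation 2: energy consumed by the rush current at wake-up [J]."""
    return (c_vvss + 0.5 * c_l) * vdd * delta_vvss


def break_even_time(c_ps, c_vvss, c_l, vdd, delta_vvss, p_leak):
    """Equation 3: minimum sleep time [s] for power gating to pay off."""
    overhead = ps_switching_energy(c_ps, vdd) + rush_energy(c_vvss, c_l, vdd, delta_vvss)
    return overhead / p_leak


def break_even_processing_load(c_vdd, c_l, alpha):
    """Equation 4: break-even processing load for voltage scaling."""
    return c_vdd / (c_l * alpha)


# Illustrative values only (farads, volts, watts).
bet = break_even_time(c_ps=1e-12, c_vvss=2e-12, c_l=4e-12,
                      vdd=1.0, delta_vvss=0.5, p_leak=1e-6)
bepl = break_even_processing_load(c_vdd=1e-9, c_l=4e-12, alpha=0.1)
print(bet, bepl)
```

Both results scale linearly with the overhead capacitances, so halving $C_{PS}$ and $C_{VVSS}$ shortens the break-even time, and shrinking $C_{VDD}$ lowers the break-even processing load in the same proportion.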

4. Power-performance evaluation

In our collaborative research project on the ultra-low-power data-driven networking system (ULP-DDNS), a data-driven processor based on the self-timed pipeline, ULP-CUE, has been implemented in a 65 nm CMOS process. In this section, the ULP-CUE is briefly introduced, and then its basic power-performance characteristics are evaluated by integrating actual measurement results of the ULP-CUE chip with SPICE simulation results. Finally, the total power reduction effect of the proposed runtime fine-grain power supply is shown for the case in which the ULP-CUE executes a UDP/IP protocol handling program on a typical traffic log of an ad hoc wireless network [8] generated by a network simulator.

Fig. 5: Layout of the experimental ULP-CUE (65nm CMOS 7ML process).

4.1 Circuit configuration of ULP-CUE

The ULP-CUE is a 32-bit dynamic data-driven processor implemented as a 13-stage ring-shaped STP, and each STP stage implements one of the following elemental functional modules.
• M: merging function for input tokens and internally circulated tokens.
• FC: firing control function to detect a pair of operand tokens for instruction execution. It is divided into two STP stages, FC0 and FC1.
• MB: merging function for tokens bypassing the FC stages.
• IF: instruction fetching function. It is divided into two stages, IF0 and IF1.
• ID: instruction decoding function.
• EX: execution function, i.e., ALU. It is divided into two stages, EX0 and EX1.
• MA: data-memory access function. It is divided into two stages, MA0 and MA1.
• BB: branch function to bypass the FC stages or not.
• B: branch function to either the output port or the circular STP.

Those stages are placed and routed on the die shown in Figure 5. As shown in the figure, the area of each stage differs from the others, so the load capacitance of each stage also differs. This means that the break-even condition differs from stage to stage.

4.2 Evaluation procedure

Because each STP stage of the ULP-CUE is implemented as a different circuit, the break-even time differs from stage to stage. It is difficult to measure every parameter of Equation 3 for each stage. Thus, in this evaluation, the PS switching energy $E_{PS}$, the energy consumed by the rush current $E_{rush}$, and the leakage power $P_{leak}$ are evaluated by SPICE simulation of each stage, because the detailed breakdown of each stage's power consumption cannot be measured on the fabricated ULP-CUE chip. Since the voltage of VVSS depends on the sleep time, the SPICE simulation is repeated for different sleep times. Furthermore, the gate width of the power switch NMOS transistor is designed to keep the voltage drop below 5%. The wake-up delay of a stage is hidden by asserting a power-on signal from the preceding STP stage.

As for voltage scaling, the total power of the ULP-CUE processor can be measured on the fabricated chip, as shown in Figure 6. The measured waveform shows an example of the consumption current of the ULP-CUE when the supply voltage VDD is changed from 0.8 V to 1.3 V by the PID controller. This consumption current includes the charge current to both $C_{VDD}$ and $C_L$. Since VDD can be forced to change from 0.8 V to 1.3 V, the charge current to $C_{VDD}$ alone can also be measured. The difference between the two measurements therefore gives the charge current to $C_L$. As a result, the break-even processing load can be calculated based on Equation 4.

Fig. 6: An example measurement result of RTVS (25°C).

4.3 Basic evaluations on break-even model

As described in the previous subsection, the basic characteristics of power consumption in the ULP-CUE have been evaluated while it executes the UDP/IP protocol handling program. Figure 7 shows the break-even time and power reduction effect of each STP stage of the ULP-CUE at 0.8 V, 25°C. The solid line shows the break-even time, and the solid bars show the amount of power reduction when the sleep time is 250 ns. If the sleep time is longer than 250 ns, a total power reduction of the ULP-CUE can be gained by the stage-by-stage power gating. The FC0 stage has the shortest BET, 159 ns. This is because its area is the largest of all stages. Furthermore, the gate width of its PS can be shortened because the switching probability of the transistors composing the FC0 is not so high compared with other stages. The IF0 stage has the longest BET, 839 ns, which is about 5 times longer than the shortest one. These results imply that it is important to introduce an adaptive power gating mechanism, such as a scheme that invalidates individual power gating based on a leakage current monitor [9].

Fig. 7: Break even time of each STP stage (0.8 V, 25°C). (Horizontal axis: STP stages; vertical axes: break-even time [ns] and power gain and loss [uW].)

As for voltage scaling, the break-even processing load is evaluated based on Equation 4. Figure 8 shows the break-even processing load of the ULP-CUE based on the measured current of the chip when the supply voltage is changed from 0.8 V to 1.2 V. The diamond-shaped plots indicate $C_L$, i.e., the denominator part of Equation 4, and the square-shaped plots indicate $C_{VDD}$, i.e., the numerator part. From this result, the BEPL is about 113 tokens.

Fig. 8: Break even processing load of ULP-CUE (0.8 V to 1.2 V, 25°C). (Horizontal axis: the number of tokens; vertical axis: energy [nJ]; legend: Gain, Loss (off-die, measured), Loss (on-die, estimated).)

In order to estimate an acceptable voltage scaling frequency, the potential performance-power characteristics of the ULP-CUE have been measured at 0.8 V to 1.3 V. The results are shown in Figure 9. The maximum performance of the ULP-CUE is 87K UDP/IP packets/sec. at 1.3 V, where the length of a UDP/IP packet is 512 bytes. When the throughput is 29K packets/sec., the total power consumption at 0.8 V is reduced to 38% of that at 1.2 V. Therefore, the evaluated BEPL, i.e., 113 tokens, indicates that the acceptable voltage scaling frequency is at most 32.7 kHz.

Fig. 9: Performance-power characteristics of the experimental ULP-CUE under altered supply-voltage (25°C). (Horizontal axis: throughput [K packet/sec.]; vertical axis: power [mW]; one curve per supply voltage.)

The measured chip is not equipped with an on-die DC-DC converter. If a DC-DC converter can be implemented on the die, the load capacitance of the power line, $C_{VDD}$, can be reduced to one-tenth of its present value. In this case, the break-even processing load is reduced to 11 tokens. Under this assumption, runtime voltage scaling may be conducted at up to 336 kHz.

Fig. 10: Performance-power characteristics of RTVS (25°C). (Horizontal axis: time [us]; vertical axis: ratio [K packet/mW]; legend: performance/power, rise time.)

Fig. 11: Power comparisons of the ULP-CUE (54 M bit/sec., 25°C).

Figure 10 shows the measured transient performance-power ratios and voltage rise times when the supply voltage is altered from 0.8 V to 0.9 V, 1.1 V, and 1.3 V. Even during such supply-voltage transients, the ULP-CUE works at a reasonable performance-power ratio. Therefore, the total performance-power ratio can be improved, and better dependability against hard real-time constraints can be obtained.
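The two scaling-frequency bounds quoted in Section 4.3 are consistent with a simple reading in which each voltage change must be followed by at least BEPL processed tokens. The sketch below makes that reading explicit. It rests on assumptions not stated in the text: the relation "frequency = token rate / BEPL" and the token rate of about 3.695 M tokens/s are inferred here only because they reproduce both quoted figures, and the function names are hypothetical.

```python
def scaled_bepl(bepl_tokens, cvdd_ratio):
    """BEPL after scaling C_VDD by cvdd_ratio (Equation 4 is linear in C_VDD)."""
    return bepl_tokens * cvdd_ratio


def max_scaling_frequency_hz(token_rate_per_s, bepl_tokens):
    """Upper bound on voltage changes per second under the assumed relation."""
    if bepl_tokens <= 0:
        raise ValueError("bepl_tokens must be positive")
    return token_rate_per_s / bepl_tokens


ASSUMED_TOKEN_RATE = 3.695e6   # tokens/s, inferred; not given in the paper

off_die_bepl = 113                                   # measured, off-die converter
on_die_bepl = round(scaled_bepl(off_die_bepl, 0.1))  # C_VDD reduced to one-tenth

print(on_die_bepl)
print(max_scaling_frequency_hz(ASSUMED_TOKEN_RATE, off_die_bepl) / 1e3)  # kHz
print(max_scaling_frequency_hz(ASSUMED_TOKEN_RATE, on_die_bepl) / 1e3)   # kHz
```

Under these assumptions the tenfold reduction of $C_{VDD}$ lowers the BEPL from 113 to 11 tokens and raises the frequency bound by roughly the same factor, from about 32.7 kHz to about 336 kHz.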

4.4 Evaluations on typical network traffic

Since the ULP-CUE has been designed to implement network protocol handling efficiently, its practicality should be evaluated with an actual network traffic pattern. Therefore, we used a network simulator, OPNET, to obtain benchmark traffic logs. In this evaluation, an ad hoc wireless network [8] is assumed and simulated by OPNET to obtain the traffic logs. From those traffic logs, a set of input tokens to the ULP-CUE is extracted, and the data-driven UDP/IP program is executed on it.

For the cases in which the basic wireless transmission rate of this network is 54 M bit/s or 162 M bit/s, the power consumption of the ULP-CUE is estimated based on the basic evaluation results above. Figures 11 and 12 show the power comparison among the normal STP, the STP with RTPG, the STP with RTVS, and the STP with the proposed runtime fine-grain power-supply scheme (i.e., with both RTPG and RTVS). In Figure 12, the STP with RTPG reduces its leakage power to 48 uW, which is 6% of the leakage power of the normal STP. The STP with RTVS reduces its switching power to 51.7 uW and its leakage power to 235 uW; this switching power is 32% of that of the normal STP. With the proposed power-supply scheme, the ULP-CUE reduces its total power to 13%. Because of the synergetic effects between RTPG and RTVS, the switching power of the PS in RTPG is reduced to 68%.

Fig. 12: Power comparisons of the ULP-CUE (162 M bit/sec., 25°C).

5. Conclusion

In this paper, a runtime fine-grain power-supply mechanism based on the self-timed elastic pipeline (STP) was proposed to realize lower-power LSI circuits, and it was analyzed by defining a break-even model in terms of the power trade-off. The proposed circuit introduces both power gating and voltage scaling techniques so that it can utilize the synergetic effects between them. The low-powered STP circuit was then applied to an ultra-low-power data-driven processor, ULP-CUE, and evaluated with typical UDP/IP traffic of a wireless ad hoc network. In this case, the total power can be reduced to about 13% of that of the original STP-based CMP.

Since the break-even condition of the proposed scheme may change depending on temperature and process variations, a kind of self-checking circuit for typical leakage and switching power should be introduced on the die, and its monitoring result should be fed back to the power-supply controller. Furthermore, in order to verify such an on-die mechanism in terms of power consumption and performance, a microarchitecture simulator must be developed which can simulate not only architectural behavior but also transient power consumption. We are now developing such a platform simulator [10] in our collaborative research project, and we will report comprehensive evaluation results using this simulator in the near future.

Acknowledgement

Although it is impossible to give credit individually to all those who organized and supported our project, the authors would like to express their sincere appreciation to all the colleagues in the project. This research work was supported in part by Core Research for Evolutional Science and Technology (CREST), Japan Science and Technology Agency (JST). The circuit design work was supported by VLSI Design and Education Center (VDEC), the University of Tokyo in collaboration with Synopsys, Inc. and Cadence Design Systems, Inc.

References

[1] P. Pillai and K. Shin, “Real-Time Dynamic Voltage Scaling for Low-Power Embedded Operating Systems,” Proc. of the 18th ACM Symposium on Operating Systems Principles (SOSP’01), pp.89–102, October 2001.
[2] S. Mutoh, S. Shigematsu, Y. Gotoh, and S. Konaka, “Design method of MTCMOS power switch for low-voltage high-speed LSIs,” Proc. of Asia and South Pacific Design Automation Conference (ASP-DAC’99), pp.113–116, January 1999.
[3] H. Nishikawa, K. Aoki, H. Ishii, and M. Iwata, “Intermediate Achievement of Ultra-Low-Power Data-Driven Networking System: ULP-DDNS,” Proc. of the 2011 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA’11), pp.421–427, July 2011.
[4] H. Terada, S. Miyata, and M. Iwata, “DDMP’s: Self-Timed Super-Pipelined Data-Driven Processors,” Proc. of the IEEE, Vol.87, No.2, pp.282–296, February 1999.
[5] S. Sannomiya, K. Miyagi, K. Sakai, M. Iwata, and H. Nishikawa, “Self-timed power gating for ultra-low-power pipeline circuit,” Proc. of the 2009 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA’09), pp.575–580, July 2009.
[6] C. J. Myers, “Asynchronous circuit design,” Univ. of Utah, John Wiley & Sons, Inc., July 2001.
[7] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson, and P. Bose, “Microarchitectural Techniques for Power Gating of Execution Units,” Proc. of the 2004 International Symposium on Low Power Electronics and Design (ISLPED’04), pp.32–37, August 2004.
[8] K. Utsu, H. Sano, C. Chow, and H. Ishii, “Proposal of Load-aware Dynamic Flooding over Ad Hoc Networks,” Proc. of IEEE TENCON 2009, THU2, pp.1–6, November 2009.
[9] N. Seki, L. Zhao, J. Kei, D. Ikebuchi, Y. Kojima, Y. Hasegawa, H. Amano, T. Kashima, S. Takeda, T. Shirai, M. Nakata, K. Usami, T. Sunata, J. Kanai, M. Namiki, M. Kondo, and H. Nakamura, “A Fine Grain Dynamic Sleep Control Scheme in MIPS R3000,” Proc. of the 26th IEEE International Conference on Computer Design (ICCD’08), pp.612–617, October 2008.
[10] K. Aoki, H. Ishii, M. Iwata, and H. Nishikawa, “A Comprehensive Evaluation of ULP-DDNS by Platform Simulator,” Proc. of the 2012 International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA’12), PDP6025, July 2012 (to be presented).