Clock Gating and Multi-VTH Low Power Design Methods Based on 32/28 nm ORCA Processor
Vazgen Melikyan National polytechnic university of Armenia Synopsys Armenia CJSC
[email protected]
Eduard Babayan Synopsys Armenia CJSC
[email protected]
Anush Melikyan National polytechnic university of Armenia
[email protected]
Abstract
This paper presents a power optimization method implemented on the RISC-architecture ORCA processor. The method combines clock gating with a multi-threshold approach and aims at a significant reduction of dynamic (switching) power and leakage power. The results are compared with previous research that applied another low-power technique to the same processor, and with the standard design.
1. Introduction
According to research, 60% of customers highlighted problems related to the power consumption of modern ICs [6]. The RISC-architecture ORCA processor is a prominent example of this issue. The purpose of this work was to investigate possible ways of reducing the power of the ORCA processor by applying different power optimization methods. The power consumption of a CMOS IC is composed of two components (Fig.1): dynamic and static power. Dynamic power is divided into two components: short-circuit power, which is consumed when both transistors conduct simultaneously, and switching power, which is consumed by the charging and discharging of capacitances.
Davit Babayan National polytechnic university of Armenia Synopsys Armenia CJSC
[email protected]
Poghos Petrosyan Synopsys Armenia CJSC
[email protected]
Edvard Mkrtchyan Synopsys Armenia CJSC
[email protected]
The third component is the static power consumption [2], which is caused by leakage and is dissipated in standby mode.
Figure 1. CMOS power consumption (diagram labels: Power; Dynamic, divided into Switching and Short Circuit; Static (Leakage))
$P_{total} = P_{short} + P_{switching} + P_{leakage}$ (1)
where $P_{short}$ is the energy consumed during the switching of CMOS gates when the complementary parts conduct simultaneously, $P_{switching}$ is the energy consumed during the data-dependent switching of the capacitances in the transistors and in the connections between them, and $P_{leakage}$ is the energy consumed by currents flowing during the nonconducting state of the gates [5]. In equation (1) leakage power is the most essential component, because in advanced technologies leakage power is becoming greater than the dynamic power component, and, as a result of device geometry scaling, leakage is very sensitive to the gate dielectric thickness.
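As a rough illustration of equation (1), the sketch below sums the three components using the common first-order switching-power model (activity factor times capacitance times supply voltage squared times frequency). The function names and every numeric value are hypothetical and chosen only for illustration; none of them are taken from the ORCA design.

```python
def switching_power(activity, capacitance_f, vdd_v, freq_hz):
    """First-order switching power in watts: alpha * C * Vdd^2 * f."""
    return activity * capacitance_f * vdd_v ** 2 * freq_hz


def leakage_power(leakage_current_a, vdd_v):
    """Static power in watts: leakage current times supply voltage."""
    return leakage_current_a * vdd_v


def total_power(p_short, p_switching, p_leakage):
    """Equation (1): sum of short-circuit, switching and leakage power."""
    return p_short + p_switching + p_leakage


# Hypothetical values, for illustration only.
p_sw = switching_power(activity=0.1, capacitance_f=1e-9, vdd_v=1.0, freq_hz=200e6)
p_lk = leakage_power(leakage_current_a=5e-3, vdd_v=1.0)
p_sc = 0.1 * p_sw  # assumed: short-circuit power as 10% of switching power
p_total = total_power(p_sc, p_sw, p_lk)

print(f"switching = {p_sw * 1e3:.1f} mW, short-circuit = {p_sc * 1e3:.1f} mW, "
      f"leakage = {p_lk * 1e3:.1f} mW, total = {p_total * 1e3:.1f} mW")
```

With these assumed numbers the leakage term is a quarter of the switching term; raising the leakage current or lowering the activity factor shifts the balance toward static power.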
ORCA is a 32-bit CPU microprocessor core (Fig.2). The microprocessor has two main interfaces: a PCI interface and a source-synchronous DDR interface for SDRAM. The processor core consists of a high-speed RISC machine with a power save mode. During power save mode the BLENDER block is shut down and RISC_CORE is slowed down to half its frequency [3].
In the digital core of the processor we can find the trade-off between performance and leakage [2]. We can find the critical path of the digital core and design it with low-threshold devices to preserve performance, while high-threshold devices on the remaining paths reduce leakage (Fig.4).
Figure 2 diagram labels: I_CLOCK_GEN, CLOCK_GEN, I_ORCA_TOP, ORCA_TOP, test clocks, system clocks, global reset, mode ports, SDRAM interface, data, address, R/W/LD/BWS, control ports, output clocks, PCI interface, address/data
Figure 2. ORCA processor
Besides the power save mode, ORCA has two main modes: functional and test. In test mode, the clocks for functional mode are not used. All asynchronous interfaces between clock domains are isolated with dual-port FIFOs. The ORCA design contains six scan paths to enable high fault coverage for stuck-at fault testing on Automatic Test Equipment (ATE). These scan paths are essentially six long shift registers made up of all the multiplexed scan flip-flops in the ORCA design. The sub-block CLOCK_GEN contains two PLLs (Phase-Locked Loops) and a clock multiplier for the functional clocks [3]. The two PLLs cancel the clock tree insertion delay for the PCI I/O interface timing and for the SDRAM input interface timing. The clock multiplier block creates the internal clocks for the processor core. Six generated clocks are created to model the effects of CLOCK_GEN.
2. Multi-VTH
Low-threshold devices are faster than high-threshold devices, but they have higher subthreshold leakage current, as the following diagram shows (Fig.3).
Figure 4. Multi-Threshold design example (diagram labels: Low threshold, High threshold)
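The trade-off between leakage and delay described in this section can be sketched with two textbook first-order models: subthreshold leakage that falls exponentially with threshold voltage, and the alpha-power-law gate delay that grows as the threshold approaches the supply voltage. The slope factor `n`, the exponent `alpha`, the supply voltage and the three threshold values below are assumptions for illustration, not parameters of the 32/28 nm library used in this work.

```python
import math

THERMAL_VOLTAGE_V = 0.02585  # kT/q near 300 K


def relative_leakage(vth_v, n=1.5):
    """Relative subthreshold leakage: proportional to exp(-Vth / (n * kT/q))."""
    return math.exp(-vth_v / (n * THERMAL_VOLTAGE_V))


def relative_delay(vth_v, vdd_v=1.0, alpha=1.3):
    """Relative gate delay (alpha-power law): Vdd / (Vdd - Vth)^alpha."""
    return vdd_v / (vdd_v - vth_v) ** alpha


# Hypothetical threshold voltages for the three cell flavors.
for label, vth in (("low-Vth", 0.25), ("standard-Vth", 0.35), ("high-Vth", 0.45)):
    print(f"{label:13s} Vth = {vth:.2f} V  "
          f"leakage ~ {relative_leakage(vth):.2e}  delay ~ {relative_delay(vth):.2f}")
```

Under these assumptions, moving from the low to the high threshold cuts leakage by orders of magnitude while the delay grows only moderately, which is why high-threshold cells are attractive on paths with timing margin.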
3. Clock-Gating
Clock power is usually a major component of SoC dynamic power because the clock switches every cycle. Considering all of the clock signals, the total clock power is 25%-35% of the SoC power [5]. The most common and effective way to reduce dynamic power is to turn clocks off when they are not required. This method is known as clock gating. A register bank is a group of flip-flops that share the same clock and synchronous control signals and are inferred from the same HDL variable. Without clock gating, register banks are implemented with a feedback loop and a multiplexer, and such feedback loops can consume power unnecessarily. Controlling the clock signal of the register bank eliminates the need to reload the same value into the register over multiple clock cycles [2]. The method inserts clock-gating circuitry into the register bank's clock network, creating the control that eliminates unnecessary register activity. It reduces the clock network power dissipation, relaxes the data path timing, and reduces the probability of routing congestion by eliminating feedback multiplexer loops. Synopsys Power Compiler can gate clocks in an RTL design automatically when the clock-gating command is inserted after elaboration (Fig.5).
Figure 3. Leakage and delay dependencies on threshold voltage
Figure 5. Clock gating circuit
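The difference between a register bank with a feedback multiplexer and a clock-gated register bank can be illustrated with a small behavioral model. This is only a cycle-level sketch with hypothetical function names and stimulus; it does not model the latch inside a real clock-gating cell or anything specific to Power Compiler. It counts how many clock pulses reach the flip-flops in each style and checks that the stored values are identical.

```python
def run_mux_feedback(enables, data_in, initial=0):
    """Ungated bank: clocked every cycle; a mux reloads the old value when enable is 0."""
    q, clock_pulses, trace = initial, 0, []
    for en, d in zip(enables, data_in):
        clock_pulses += 1       # the clock reaches the flip-flops every cycle
        q = d if en else q      # feedback multiplexer selects new or old value
        trace.append(q)
    return trace, clock_pulses


def run_clock_gated(enables, data_in, initial=0):
    """Gated bank: the clock pulse is delivered only in cycles where enable is 1."""
    q, clock_pulses, trace = initial, 0, []
    for en, d in zip(enables, data_in):
        if en:                  # gated clock: pulse passes only when enabled
            clock_pulses += 1
            q = d
        trace.append(q)         # with no pulse the flip-flops simply hold
    return trace, clock_pulses


# Hypothetical stimulus: the register is written in 3 of 8 cycles.
enables = [1, 0, 0, 1, 0, 0, 0, 1]
data_in = [5, 7, 7, 9, 1, 2, 3, 4]

mux_trace, mux_pulses = run_mux_feedback(enables, data_in)
gated_trace, gated_pulses = run_clock_gated(enables, data_in)

print("same register contents:", mux_trace == gated_trace)
print("clock pulses at flip-flops:", mux_pulses, "ungated vs", gated_pulses, "gated")
```

Both styles hold the same data, but the gated bank sees a clock pulse only in the enabled cycles, which is where the saving in clock network power comes from.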
4. Previous Research
Previous research (Table 1) on power reduction of the ORCA processor with the power gating method was performed using RETENTION and ISOLATION cells.
Table 1. Results of timing/power/area report with power gating design method and multi voltage design method
Parameter | Power gating
Frequency | 200 MHz
Data required time | 27.48 ns
Data arrival time | -24.86 ns
Slack (MET) | 2.62 ns
Total Power | 69.42 mW ~(-8%)
Macro/Black Box area | 16340.8 μm2
Total cell area | 807616.35 μm2 (+22%)
Total area | 823956.35 μm2 ~(+21%)
As a result, power consumption decreased by ~8% compared with the multi voltage design, the area overhead was ~21%, and the timing characteristics were essentially unchanged (at 200 MHz RISC core clock frequency) [1]. Compared to the standard design results, the power gating design method reduced the total power of ORCA (RISC CORE) by more than 23%.
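The slack reported in Table 1 follows from the two timing values in the same table. The short check below assumes that the arrival time is printed with a negative sign only because it is subtracted from the required time, which is a common convention in timing reports; the function name is hypothetical.

```python
def slack_ns(required_ns, arrival_ns):
    """Setup slack: data required time minus data arrival time."""
    return required_ns - arrival_ns


# Values from Table 1 (arrival time taken as a magnitude).
s = slack_ns(required_ns=27.48, arrival_ns=24.86)
status = "MET" if s >= 0 else "VIOLATED"
print(f"slack = {s:.2f} ns ({status})")
```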
During the design process and the integration of the clock gating and multi-Vth design methods, a special library was chosen that contains cells in three versions:
• Low-threshold voltage
• Standard-threshold voltage
• High-threshold voltage
Critical paths were detected with respect to their power/timing characteristics. The chosen paths were analyzed, and the cells with the highest contribution to the power/timing of each path were identified. Cells in paths with timing issues were replaced with low-Vth devices, and cells in paths with power issues were replaced with high-Vth devices. Low-Vth devices were used on critical paths where timing is important, since the delay on such a path should be small; at the same time, leakage power increases. High-Vth devices were used on paths with high leakage power. Their delay increases, but this is acceptable as long as the delay increase does not affect functionality. Fig.7 shows one of the critical paths of ORCA (RISC CORE) that was chosen for replacement.
5. Design Process
The design flow of ORCA with the clock-gating and multi-VTH methods is fully integrated into the standard digital design flow.
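The replacement procedure described above can be sketched as a simple rule applied per cell. The sketch is a simplification: it uses only timing slack as the criterion, whereas the text also considers each cell's power contribution. The `Cell` class, the cell names, the slack values and the two slack limits are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    name: str
    worst_slack_ns: float   # slack of the worst timing path through the cell
    vth: str = "standard"   # one of "low", "standard", "high"


def assign_vth(cells, critical_slack_ns=0.0, relaxed_slack_ns=1.0):
    """Pick a threshold flavor per cell from its worst slack (assumed limits)."""
    for cell in cells:
        if cell.worst_slack_ns <= critical_slack_ns:
            cell.vth = "low"        # timing-critical: favor speed, accept leakage
        elif cell.worst_slack_ns >= relaxed_slack_ns:
            cell.vth = "high"       # ample slack: favor low leakage, accept delay
        else:
            cell.vth = "standard"
    return cells


# Hypothetical cells and slack values, for illustration only.
cells = assign_vth([Cell("U1", -0.12), Cell("U2", 0.40), Cell("U3", 2.62)])
flavors = {c.name: c.vth for c in cells}
print(flavors)
```

In a real flow the swap would be followed by a new timing analysis, since moving a cell to a high threshold consumes part of the slack that justified the swap.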
SPECIFICATION
Logic Design DC
INTEGRATION OF CLOCK GATING AND MULTI VTH
Physical design ICC
STATIC TIMING ANAL