2009 International Conference on Embedded Software and Systems
Implementation and Optimization of DSP Suspend Resume on Dual-Core SOC
Ming-Wei Chang (1), Shau-Yin Tseng (1), Homn Lin (1), and Ching-Lung Su (1, 2)
(1) SoC Technology Center, Industrial Technology Research Institute, Hsinchu, Taiwan, R.O.C.
(2) Department of Electrical Engineering, National Yunlin University of Science and Technology, Yun-Lin, Taiwan, R.O.C.
{ MingWeiChang, tseng, lin1255, CLSu }@itri.org.tw
[email protected]
978-0-7695-3678-1/09 $25.00 © 2009 IEEE. DOI 10.1109/ICESS.2009.52

Abstract
Energy consumption has become an increasingly important design issue in mobile multimedia systems. This paper presents a realization of suspend/resume on a dual-core SoC to conserve energy while the system is idle. A generic backup/restore mechanism is proposed for any DSP application. Furthermore, the latency optimization is refined to exploit the characteristics of the H.264/AVC decoder, so that only 26 words need to be backed up for resume. Experimental results show that the optimization scheme achieves a 99% improvement in latency, and the realization of suspend/resume can provide valuable experience for similar designs.

1. Introduction
With the rapid development of complicated multimedia applications on mobile systems, computational requirements increase as well. However, progress in battery capacity falls far behind. Thus, energy consumption has become a critical design issue in mobile multimedia systems. Numerous studies focus on dynamic power management for multimedia applications but ignore one point: users may pause for a period of time. For example, a user playing multimedia content may be interrupted and pause playback for a period of time. In this situation, we need special dynamic power management techniques to save as much energy as possible during this idle time.

The PAC (Parallel Architecture Core) platform [1][2] provides multiple power management functions, such as DVFS (Dynamic Voltage/Frequency Scaling) for runtime power management [3][4][5][6][7], and clock gating and power down for reducing energy consumption in idle time [8][9]. Powering down the DSP when video decoding is idle is energy efficient, but all decoding information on the DSP is lost. Thus, we propose a DSP suspend/resume mode that backs up all decoding information in the DSP before powering it down and restores the decoding information after powering it up. The DSP suspend/resume mode can minimize energy consumption and continue decoding video after an idle period. In this paper, in order to minimize energy consumption during idle time, we implement and optimize the DSP suspend/resume mode and verify it with an H.264/AVC decoder. The optimization scheme achieves a 99% improvement in latency.

The rest of this paper is organized as follows. The PAC platform, its power management techniques, and the PACDSP are introduced in Section 2. The H.264/AVC decoder and suspend/resume are discussed in Section 3. Section 4 summarizes the optimization techniques for reducing the backup latency in the suspend/resume process. Experimental results are provided in Section 5, and conclusions are given in Section 6.

2. Overview of the PAC Platform
2.1. PAC platform architecture and power management
The PAC platform is an asymmetric dual-core architecture that contains an MPU for control flow and a PACDSP for multimedia computation. It also contains an APB and a multi-layer AHB to connect the two cores and all of the IPs. The PAC platform architecture is shown in Figure 1. The PAC platform supports many power management techniques and power domains, which are summarized in Table 2. There are six power domains, including the MPU, DSP, AHB, APB, DVFS, and PLL domains. Each domain can operate independently with a different voltage and frequency. In the DSP power domain, there are three active modes for runtime power management, operating at 1.2 V/228 MHz, 1.0 V/152 MHz, and 0.9 V/114 MHz respectively. In inactive mode and pending mode, the DSP is clock gated at 1.2 V and 0.9 V respectively. In sleep mode, the DSP is powered down, so it does not consume any energy. Table 4 summarizes the DSP average power for each mode.

We can define power states for the dual-core system by permuting the MPU and DSP power modes. Taking video decoding as an example, we can map the user behavior of video decoding onto system power states as shown in Figure 3. After the system powers up, the MPU is woken up and enters active mode. When video decoding is requested, the DSP also changes to active mode for video decoding. In active state 2, the voltage and frequency of the DSP are adjusted automatically by the DVFS algorithm [10]. If the user is interrupted during video decoding, our system supports three types of pause. For a short pause (active state 1), we gate the DSP clock, which provides a fast power mode switch and saves energy. For a middle pause (active state 3), the DSP voltage changes to the lowest supported voltage and the clock is also gated. Because the voltage needs to change, entering active state 3 incurs the regulator delay, but it is more power efficient than active state 1. For a long pause (active state 4), the DSP is powered down for the greatest power saving, but this takes the longest time because of the regulator delay and the data backup delay. Finally, when video decoding is no longer needed, the DSP is also powered down.
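The trade-off between the three pause types can be illustrated with a small calculation. The following is a minimal sketch, not the authors' implementation: the power figures are copied from Table 4, while the dictionary keys, the function name `idle_energy_mj`, and the 10-second pause are hypothetical. It deliberately ignores the transition costs (regulator delay and backup/restore latency) that the text mentions.

```python
# Average DSP power per mode in mW, copied from Table 4.
# The key names are hypothetical labels chosen for this sketch.
DSP_POWER_MW = {
    "active_228": 158.81,  # 1.2 V / 228 MHz
    "active_152": 83.24,   # 1.0 V / 152 MHz
    "active_114": 53.92,   # 0.9 V / 114 MHz
    "inactive": 7.29,      # 1.2 V, clock gated
    "pending": 2.71,       # 0.9 V, clock gated
    "suspend": 0.0,        # powered down
}

# Pause type -> DSP mode, following the short/middle/long pause description.
PAUSE_TO_MODE = {"short": "inactive", "middle": "pending", "long": "suspend"}


def idle_energy_mj(pause: str, idle_seconds: float) -> float:
    """Energy in mJ drawn by the DSP while idle, ignoring transition costs."""
    mode = PAUSE_TO_MODE[pause]
    return DSP_POWER_MW[mode] * idle_seconds  # mW * s = mJ


for pause in ("short", "middle", "long"):
    print(pause, PAUSE_TO_MODE[pause], round(idle_energy_mj(pause, 10.0), 2))
```

Under these assumptions, a longer pause favors the deeper modes, since the idle energy saved grows with the pause length while the transition cost is paid only once.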
Figure 1. PAC platform architecture and partitions of power domain.

Table 2. Summary of PAC platform power domain.
Power domain name | Power mode | Voltage | Frequency | Power gating
MPU-domain | Active-1 | 1.2 V | 228 MHz | N
MPU-domain | Active-2 | 1.2 V | 152 MHz | N
MPU-domain | Active-3 | 1.2 V | 114 MHz | N
MPU-domain | Inactive | 1.2 V | 0 | N
DSP-domain | Active-1 | 1.2 V | 228 MHz | Y
DSP-domain | Active-2 | 1.0 V | 152 MHz | Y
DSP-domain | Active-3 | 0.9 V | 114 MHz | Y
DSP-domain | Inactive | 1.2 V | 0 | Y
DSP-domain | Pending | 0.9 V | 0 | Y
DSP-domain | Sleep | 0 | 0 | Y
ME-domain | Same as DSP-domain | | | Y
AHB-domain | Full-speed | 1.2 V | 114 MHz | N
AHB-domain | Low-power | 1.0 V | 76 MHz | N
SRAM-domain, LCD-domain | Same as AHB-domain | | | Y
APB-domain, DVFS-domain | Fix V | 1.0 V | 48 MHz | N
PLL-domain | Fix V | 1.2 V | 456 MHz | N

Table 4. DSP average power in each power mode.
DSP Voltage/Frequency | Average Power (mW)
1.2 V/228 MHz | 158.81
1.0 V/152 MHz | 83.24
0.9 V/114 MHz | 53.92
1.2 V/Clock Gating (Inactive Mode) | 7.29
0.9 V/Clock Gating (Pending Mode) | 2.71
0 V/Clock Gating (Suspend or Sleep Mode) | 0

Figure 3. Mapping behavior of video decoding into system power state transition for dual-core system. [State diagram. States: Active state 1 (MPU active, DSP inactive); Active state 2 (MPU active, DSP active); Active state 3 (MPU active, DSP pending); Active state 4 (MPU active, DSP suspend); Active state 5 (MPU active, DSP sleep). Transition labels: System Power Up; Start Video Decoding or Resume Decoding; Short Pause; Middle Pause; Long Pause; Resume Decoding; Exit Video Decoding.]

2.2. PACDSP Architecture
The PACDSP [11][12] is a 32-bit fixed-point, 5-way issue, 9-stage pipelined DSP designed by STC/ITRI. The architecture of the PACDSP is shown in Figure 5. It contains one scalar unit and two clusters for computation and provides 32 KB of instruction memory and 64 KB of data memory. Both the scalar unit and the clusters contain many registers for computation. For example, there are coefficient registers and accumulator registers for the arithmetic unit, address registers for the load/store unit, and ping-pong registers for both.
Figure 5. Architecture of the PACDSP core.

3. H.264/AVC Decoder and Suspend/Resume
In this section, the implementation of the suspend/resume mode is verified with an H.264/AVC decoder to demonstrate the concept of suspend/resume on a dual-core system.

3.1. H.264/AVC decoder on PAC platform
H.264/AVC [13][14][15] contains many features that make video compression very efficient, and it is realized on the dual-core platform with macroblock (MB) level software partitioning [16]. The memory layout of the related input/output data on the PAC platform is illustrated in Figure 6. The H.264/AVC decoder is partitioned for the dual-core system, so it contains an MPU program and a DSP program. The MPU program resides in SDRAM. It is responsible for controlling the decoding flow and moving the result (decoded frame) to the LCD frame buffer for display. The DSP program is an H.264/AVC decoder and resides in SRAM. It is responsible for decoding H.264/AVC video data. The H.264/AVC decoder also needs a constant table that resides in DSP data memory. The input data (encoded frame) and output data (decoded frame) of the H.264/AVC decoder are in SDRAM. Before decoding starts, all related data must reside in the correct location.

Figure 6. Memory Layout of H.264/AVC Decoder. [Diagram labels: SDRAM (128 MB) holding the MPU program, decoder input data (encoded frame), and decoder output data (decoded frame); LCD frame buffer; SRAM (128 KB) holding the DSP program; DSP data memory (64 KB) holding the constant table.]

3.2. Suspend/Resume that Uses Interrupt and ISR to Backup DSP Context
In order to support suspend/resume, we need to consider how to back up the decoding status in the DSP. The decoding status consists of all data in DSP data memory and the DSP registers. Certain special registers in the DSP (for example, the program counter) cannot be accessed directly by the DSP or the MPU. However, these registers have corresponding shadow registers that can be accessed when the DSP has been interrupted. Therefore, we can raise an interrupt on the DSP and then save the contents of those shadow registers in the DSP ISR (interrupt service routine).

Figure 7. Control flow of interrupt and ISR backup method. [Flow chart with MPU and DSP columns. MPU: set up DSP ISR for suspend/resume; start DSP to decode; move output data to frame buffer and start DSP to decode next frame; on suspend notification, invoke an interrupt to DSP for notifying ISR to back up DSP context; back up DSP data memory to SDRAM; power down DSP; after a period of time and a resume notification, power on DSP; restore DSP data memory; invoke an interrupt to DSP for notifying ISR to restore DSP context. DSP: H.264/AVC decode a frame; H.264/AVC decode frame N; DSP ISR backs up all DSP registers to DSP data memory; DSP ISR restores all DSP registers from DSP data memory; H.264/AVC continue decoding frame N.]

The control flow is illustrated in Figure 7. At the beginning, the MPU must set up the correct address for the DSP ISR, which is responsible for backing up and restoring all DSP registers. When video decoding is requested, the MPU changes a flag that tells the DSP to decode a frame. When the DSP finishes decoding a frame, the MPU receives a finish signal and moves the decoded result (decoded frame) to the LCD frame buffer. The decoding flow repeats until a suspend notification is received by the MPU. The MPU then raises an interrupt on the DSP to back up all DSP registers. After the DSP ISR backs up all registers to DSP data memory, the MPU backs up the contents of DSP data memory to SDRAM. When the backup procedure finishes, the MPU powers down the DSP by setting the DVFS controller. During this idle time, the DSP consumes almost no energy.

When a resume request is received, the MPU powers up the DSP and waits for the voltage regulator delay. This delay depends on which voltage regulator the system uses. After the DSP voltage is ready, the MPU restores DSP data memory, which includes the contents of all registers. Then the MPU raises an interrupt on the DSP. The DSP ISR restores all registers, including the shadow registers, from DSP data memory. At the end of the DSP ISR, a special return-from-interrupt instruction is executed. The contents of the special registers are restored from the shadow registers, and the DSP status rolls back to the state of the previous decoding. The suspend/resume control flow then finishes and video decoding continues.
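The two-stage backup described above (registers to DSP data memory in the ISR, then DSP data memory to SDRAM by the MPU) can be sketched as a small simulation. This is an illustrative model under stated assumptions, not the PAC platform code: the `Dsp` class, the register count of 64, and the save-area address are hypothetical, and the regulator delay is only noted in a comment.

```python
from dataclasses import dataclass, field

DATA_MEM_WORDS = 16384  # 64 KB of DSP data memory as 32-bit words
NUM_REGS = 64           # hypothetical register count, for illustration only
REG_SAVE_BASE = DATA_MEM_WORDS - NUM_REGS  # hypothetical register save area


@dataclass
class Dsp:
    regs: list = field(default_factory=lambda: [0] * NUM_REGS)
    data_mem: list = field(default_factory=lambda: [0] * DATA_MEM_WORDS)
    powered: bool = True

    def isr_backup(self):
        # DSP ISR: copy every register (shadow registers included) to data memory.
        self.data_mem[REG_SAVE_BASE:REG_SAVE_BASE + NUM_REGS] = self.regs

    def isr_restore(self):
        # DSP ISR: reload every register from the save area in data memory.
        self.regs = self.data_mem[REG_SAVE_BASE:REG_SAVE_BASE + NUM_REGS]

    def power_down(self):
        # Powering down loses all register and data-memory contents.
        self.regs = [0] * NUM_REGS
        self.data_mem = [0] * DATA_MEM_WORDS
        self.powered = False

    def power_up(self):
        self.powered = True


def suspend(dsp):
    dsp.isr_backup()                 # MPU interrupts DSP; ISR saves registers
    sdram_copy = list(dsp.data_mem)  # MPU copies DSP data memory to SDRAM
    dsp.power_down()
    return sdram_copy


def resume(dsp, sdram_copy):
    dsp.power_up()                   # real hardware waits for the regulator here
    dsp.data_mem = list(sdram_copy)  # MPU restores DSP data memory
    dsp.isr_restore()                # MPU interrupts DSP; ISR reloads registers


dsp = Dsp()
dsp.regs[3] = 0xCAFE      # pretend decoding state held in a register
dsp.data_mem[100] = 1234  # pretend decoding state held in data memory
saved = suspend(dsp)
resume(dsp, saved)
print(dsp.regs[3] == 0xCAFE, dsp.data_mem[100] == 1234)
```

The sketch makes the cost of this generic method visible: the SDRAM copy always covers the whole data memory, regardless of how much of it the application still needs.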
4. Suspend/Resume Optimization for H.264/AVC Decoder
In this section, we modify the backup mechanism to improve the latency of suspend/resume. The concept of the optimization is to reduce the amount of backup data. We observe that when the H.264/AVC decoder finishes a frame, the temporary data in DSP data memory and all register contents can be discarded. This method has the constraint that suspend/resume can proceed only after a frame is finished. We only need to back up some global variables (for example, the current decoding frame number and related data addresses) and some H.264/AVC headers, including the SPS (Sequence Parameter Set) and PPS (Picture Parameter Set). The total amount of backup data is reduced to only 26 words, rather than 16384 words for the whole DSP data memory. In this method, we do not back up the constant table either, because there is a copy in SDRAM. Conversely, we need to restore the constant table (388 words) in addition to the global variables during the resume flow.

The control flow of the global variables backup method is shown in Figure 8. In this method, we do not need the DSP ISR because we do not need to back up any registers. Although backing up the DSP registers takes only about 78 cycles, removing the DSP ISR reduces the DSP code size, which greatly affects performance. The decoding flow is the same as in the previous method and repeats until a suspend notification is received by the MPU. The MPU waits for the DSP to finish a frame and then starts to back up the global variables to SDRAM. After the backup procedure finishes, the MPU powers down the DSP. After a period of idle time, the MPU may receive a resume notification. The MPU then powers on the DSP and again waits for the voltage regulator delay. After the DSP voltage is ready, the MPU restores the global variables and the constant table to DSP data memory. Then the MPU can start the DSP to continue decoding the next frame. This method also simplifies the implementation of the backup procedure.
Figure 8. Control flow of global variables backup method. [Flow chart with MPU and DSP columns. MPU: start DSP to decode; move output data to frame buffer and start DSP to decode next frame; on suspend notification, back up global variables in DSP data memory to SDRAM; power down DSP; after a period of time and a resume notification, power on DSP; restore global variables and constant table to DSP data memory. DSP: H.264/AVC decode a frame; H.264/AVC decode frame N; H.264/AVC continue decoding frame N+1.]
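The word counts behind this method can be checked with a short sketch. The sizes (16384, 26, and 388 words) come from the text; the memory offsets `GLOBALS_BASE` and `CONST_BASE`, the function names, and the stand-in data are hypothetical and chosen only so the example runs.

```python
DATA_MEM_WORDS = 16384    # whole DSP data memory: 64 KB as 32-bit words
GLOBAL_WORDS = 26         # global variables plus SPS/PPS headers
CONST_TABLE_WORDS = 388   # constant table, reloaded from its SDRAM copy
GLOBALS_BASE = 0          # hypothetical offset of the globals
CONST_BASE = 1024         # hypothetical offset of the constant table


def suspend(data_mem, frame_done):
    """MPU side: save only the globals, and only at a frame boundary."""
    if not frame_done:
        raise RuntimeError("wait until the DSP finishes the current frame")
    return list(data_mem[GLOBALS_BASE:GLOBALS_BASE + GLOBAL_WORDS])


def resume(saved_globals, const_table):
    """MPU side: rebuild DSP data memory after power-up."""
    data_mem = [0] * DATA_MEM_WORDS  # contents were lost at power-down
    data_mem[GLOBALS_BASE:GLOBALS_BASE + GLOBAL_WORDS] = saved_globals
    data_mem[CONST_BASE:CONST_BASE + CONST_TABLE_WORDS] = const_table
    return data_mem


const_table = list(range(CONST_TABLE_WORDS))  # stand-in for the SDRAM copy
mem = [7] * DATA_MEM_WORDS                    # stand-in for live DSP memory
mem[CONST_BASE:CONST_BASE + CONST_TABLE_WORDS] = const_table
saved = suspend(mem, frame_done=True)
restored = resume(saved, const_table)
backup_words = len(saved)
restore_words = len(saved) + len(const_table)
print(backup_words, restore_words)
```

Running the sketch prints `26 414`: the suspend path moves 26 words, while the resume path moves 26 + 388 = 414 words, which is why resume is the slower direction in this method.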
5. Experimental Results
As shown in Figure 9, an evaluation board of the PAC SoC platform is used to evaluate the implementation, where the system bus is AMBA 2.0 lite running at 76 MHz with multi-layer AHBs. A Philips PCF50606 PMIC [17] is used for DSP voltage regulation. Two Agilent 34401A digital multimeters [18] are used to verify the voltage and current of the DSP.

Figure 9. Evaluation board of PAC SoC and power measurement environment. [Photograph labels: digital multimeter, PAC SoC, PMIC.]

In the experiment, we measure the latency of suspend and resume with a 48 MHz timer. We start the timer before the backup procedure starts, stop it after the restore procedure finishes, and read the timer counter to calculate the latency. The experimental results are shown in Table 10. There are three cases in which the system changes to DSP suspend: the MPU and DSP may run at 228 MHz, 152 MHz, or 114 MHz, and we measure the backup/restore latency under each condition.

In the interrupt and ISR suspend/resume method, the amounts of backup data and restore data are the same. The MPU reads from DSP data memory and writes to SDRAM for suspend, and reads from SDRAM and writes to DSP data memory for resume. The backup and restore times differ because the read and write latencies of SDRAM and DSP data memory differ; in this method, suspend takes longer than resume. In the global variables suspend/resume method, resume takes longer than suspend because the constant table for the H.264/AVC decoder must additionally be restored in the resume flow. In both methods, latency increases as the MPU and DSP frequency decreases. In all three conditions, the global variables backup method achieves about a 99% improvement (latency decreases by more than 100 times). Thus, this method greatly reduces the backup latency of suspend/resume, and the shorter latency also reduces energy consumption.
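The improvement figures can be reproduced from the latencies reported in Table 10. The sketch below is a hedged check, with the function names chosen for illustration. The improvement formula follows directly from the reported percentages; the energy formula (average DSP power from Table 4 multiplied by latency) is our reading of the numbers, which it matches to within rounding, and is not stated explicitly in the paper.

```python
# (frequency MHz, ISR-method latency us, global-variables latency us, DSP power mW)
# Latencies are from Table 10; power values are from Table 4.
ROWS = [
    (228, 6371.05, 62.25, 158.81),
    (152, 7279.01, 67.66, 83.24),
    (114, 8784.05, 71.48, 53.92),
]


def improvement_pct(baseline_us, optimized_us):
    """Relative latency reduction in percent."""
    return (1.0 - optimized_us / baseline_us) * 100.0


def energy_mj(power_mw, latency_us):
    """Assumed relation: energy = average power * latency."""
    return power_mw * latency_us * 1e-6  # mW * us = nJ; 1e-6 converts nJ to mJ


for mhz, isr_us, glob_us, mw in ROWS:
    print(
        mhz,
        round(improvement_pct(isr_us, glob_us), 2),  # improvement in percent
        round(isr_us / glob_us, 1),                  # speed-up factor
        round(energy_mj(mw, isr_us), 5),             # ISR method energy, mJ
        round(energy_mj(mw, glob_us), 5),            # global variables energy, mJ
    )
```

The speed-up factor exceeds 100 in every row, which is consistent with the statement that latency decreases by more than 100 times.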
Table 10. Comparison and experimental results. Latency is in μs, given as the suspend/resume total followed by [suspend, resume]; energy consumption is in mJ.
Condition | Interrupt and ISR backup: latency [suspend, resume], energy | Global variables: latency [suspend, resume], energy | Improvement
MPU/DSP 228 MHz | 6371.05 [3497.65, 2873.40], 1.01178 | 62.25 [6.33, 55.92], 0.00988 | 99.02 %
MPU/DSP 152 MHz | 7279.01 [3948.64, 3330.37], 0.60590 | 67.66 [7.08, 60.58], 0.00563 | 99.07 %
MPU/DSP 114 MHz | 8784.05 [4835.93, 3948.12], 0.47364 | 71.48 [8.17, 63.31], 0.00385 | 99.19 %

6. Conclusions
In this paper, a generic backup/restore mechanism for DSP suspend/resume is presented. This mechanism can be used by any DSP application to realize suspend/resume, but its long latency makes suspend/resume less likely to be used. The optimization scheme, verified with an H.264/AVC decoder, reduces suspend/resume latency by 99%. The optimization concept is that the decoder can discard most of the obsolete data in the DSP when it finishes a frame. The constraint is that the optimization scheme can proceed only after the decoder finishes a frame. For the H.264/AVC decoder, only 26 words need to be backed up and 414 words restored for suspend/resume, so the optimization scheme greatly reduces latency.

References
[1]. Juan-Ming Lu, Hsin-Long Wu, Tsai-Min Chiang and Wen-Feng Chen. High Performance and Low-Power Dual-Core SoC Platform for Portable Multimedia Applications. SoC Technical Journal, vol. 2, pp. 36-45, May 2005.
[2]. Chien-Yuan Lai, Jin-Hon Lin and Yao-Feng Wang. DVFS SoC Architecture and Implementation. SoC Technology Journal, vol. 3, pp. 84-91, November 2005.
[3]. Nurvitadhi, E., Lee, B., Yu, C., and Kim, M. A Comparative Study of Dynamic Voltage Scaling Techniques for Low-Power Video Decoding. International Conference on Embedded Systems and Applications, pp. 23-26, June 2003.
[4]. Kihwan Choi, Ramakrishna Soma, and Massoud Pedram. Off-chip latency-driven dynamic voltage and frequency scaling for an MPEG decoding. Proceedings of the 41st annual conference on Design automation, pp. 07-11, June 2004.
[5]. Seongsoo Lee. Low-Power Video Decoding on Variable Voltage Processor for Mobile Multimedia Applications. ETRI Journal, vol. 27, no. 5, pp. 504-510, Oct. 2005.
[6]. Jia-Ming Chen, Chih-Hao Chang, Shau-Yin Tseng, Jenq-Kuen Lee, and Wei-Kuan Shih. Power Aware H.264/AVC Video Player on PAC Dual-Core SoC Platform. IFIP International Conference on Embedded and Ubiquitous Computing (EUC 2006), Seoul, Korea. Also in LNCS 4096, pp. 57-68, Aug. 2006.
[7]. Fen Xie, Margaret Martonosi, and Sharad Malik. Efficient behavior-driven runtime dynamic voltage scaling policies. Hardware/Software Codesign and System Synthesis, CODES+ISSS '05, Sept. 2005.
[8]. Matthew Garrett. Powering Down. http://queue.acm.org/detail.cfm?id=1331293.
[9]. TuxOnIce for Linux. http://www.tuxonice.net/.
[10]. Shau-Yin Tseng and Ming-Wei Chang. DVFS aware Techniques on Parallel Architecture Core (PAC) Platform. International Conference on Embedded Software and Systems Symposia, pp. 79-84, July 2008.
[11]. David Chih-Wei Chang, I-Tao Liao, Shau-Yin Tseng, Chein-Wen Jen. PAC DSP Core and Its Applications. IEEE Asian Solid-State Circuits Conference, 19-22, November 2006.
[12]. David Chih-Wei Chang, I-Tao Liao, Jenq-Kuen Lee, Wen-Feng Chen, Shau-Yin Tseng and Chein-Wei Jen. PAC DSP core and application processors. IEEE International Conference on Multimedia and Expo, pp. 289-292, July 2006.
[13]. ITU-T H.264. Advanced Video Coding for Generic Audiovisual Services. 2005.
[14]. Wiegand, T., Sullivan, G.J., Bjøntegaard, G., and Luthra, A. Overview of the H.264/AVC video coding standard. IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, issue 7, pp. 560-576, July 2003.
[15]. G. J. Sullivan, P. Topiwala, and A. Luthra. The H.264/AVC Advanced Video Coding Standard: Overview and Introduction to the Fidelity Range Extensions. SPIE Conference on Applications of Digital Image Processing, vol. 5558, part 1, pp. 454-474, Aug. 2004.
[16]. Jia-Ming Chen, Chiu-Ling Chen, Jian-Liang Luo, Po-Wen Cheng, Chia-Hao Yu, Shau-Yin Tseng and Wei-Kuan Shih. Realization and Optimization of H.264 Decoder for Dual-Core SoC. International 2007 Conference on Signal Processing and Multimedia Applications, pp. 28-31, July 2007.
[17]. Philips PCF50606 PMIC. Philips PCF50606 Preliminary Specification. 2003.
[18]. Agilent 34401A Digital Multimeter Data Sheet. http://cp.literature.agilent.com/litweb/pdf/59680162EN.pdf