This paper describes processing performance of MP3 audio encoding on a ... scheme of an MP3 audio encoding program and by scheduling the program onto ...
Performance Evaluation of Heterogeneous Chip Multi-Processor with MP3 Audio Encoder Hiroaki SHIKANO,*† Yuki SUZUKI,‡ Yasutaka WADA,‡ Jun SHIRAKO,‡ Keiji KIMURA*‡ and Hironori KASAHARA*‡ *Advanced Chip Multi-Processor Research Institute, Waseda University †Central Research Laboratory, Hitachi, Ltd. ‡Department of Computer Science, Waseda University 3-4-1 Okubo, Shinjuku-ku, Tokyo 169-8555, Japan { shikano, yuki, yasutaka, shirako, kimura, kasahara } @oscar.elec.waseda.ac.jp
1. Abstract This paper describes processing performance of MP3 audio encoding on a heterogeneous chip multiprocessor (HCMP) that possesses different types of processing elements (PEs) such as general-purpose processors and special-purpose processors. The HCMP realizes higher performance than conventional single-core processors or even homogeneous multi-processors for some specific applications such as media processing under the condition of low operating frequency aiming at lower power consumption. In this paper, the performance of the HCMP is analyzed by studying parallelizing scheme and power control scheme of an MP3 audio encoding program and by scheduling the program onto the HCMP using these two schemes. As a result, it is confirmed that an HCMP consisting of three CPUs and two DRPs outperforms a single-core processor by a speed-up factor of 16.3, and a homogeneous multi-processor with 5 CPUs by a speed-up factor of 4.0. It is also confirmed that the power control on the HCMP results in 24 % power reduction. Keywords: Heterogeneous chip multi-processor, Parallelizing compiler, Power control
2. Heterogeneous Chip Multi-Processor In recent years, embedded processors in digital consumer appliances have been required to handle various types of data such as video and audio. The processors are expected to deliver high performance as well as low power consumption. At the same time, as process technology evolves, the number of transistors available is increasing. Therefore, chip multi-processors (CMPs) have attracted much attention since CMPs attain higher performance without raising operating frequency [1,2]. However, embedded processors demand more efficiency in both performance and power especially in terms of cost. This paper proposes a heterogeneous chip multi-processor (HCMP) that possesses different types of processing elements as shown in Fig 1. The HCMP consists of a multiple number of general purpose processors such as CPUs and special purpose processors such as digital signal processors (DSPs) or dynamically reconfigurable processors (DRPs), handling a specific form of program in a highly efficient manner. The HCMP has the following architectural features: a hierarchical memory structure, an intelligent data transfer unit (DTU) and a power control register (FVR). High-speed local memories, that is, a local data memory (LDM), a local program memory (LPM) and a distributed shared memory (DSM), are attached to each PE core and a central shared memory (CSM) is placed on a chip or off the chip. All the memories are mapped with a global address and the coherence of the memories is managed with software. The DTU interprets command chains on a local memory generated by an HCMP compiler, and transfers data between the local memory and the shared memory, or between the local memories on different PEs in a background as the PE core is in operation. The power control register (FVR) is attached to each PE and each system component such as interconnection-network, memory module and I/O, which controls frequency and voltage of each PE core and power of each component.
3. Evaluation Model The evaluation of the HCMP architecture has been performed using “UZURA[3]”, an MP3 audio encoder with 4 frames of 16-bit 44.1KHz audio data and the output bit rate option of 128 kbps, by the following steps by hand in the same way as the HCMP parallelizing and power control compiler [4,5] behaves. The evaluated HCMP consists of three SuperH (SH) processors [6] (one solely for a task scheduler) and two Flexible Engine / Generic ALU Array (FE-GA) processors [7], a sort of DRP. First the program is analyzed in detail and decomposed into program parts such as loops, subroutines and basic blocks, defined as tasks. All the tasks are examined by checking and marking if the task is suitable for FE-GA operation or not. Then the operation cycles of the tasks executed on a CPU are measured using the SH-4 architecture
simulator. Note that the cycles of the FE-GA tasks are calculated as 1/10 that of CPU’s, based on the evaluation result of an AAC encoding on FE-GA [8]. Then the data transfer cycles are calculated on the supposition that the DTU transfers data required for a task and generated by the task between a local memory and the shared memory in a burst transfer mode via an atomic transaction single bus. Then these tasks with their operation cycles and their data transfer cycles are assigned onto PEs on the HCMP. Finally, the average operation cycles consumed in one-frame encoding process are obtained.
4. Parallelizing Scheme and Evaluation Fig. 2 shows the scheduling result of an HCMP for one SH worked as a scheduler, two SHs for application processing and two FE-GA accelerators. MP3 encoding tasks for 4 input frames are executed on the FE-GAs and the application SHs in parallel and alternative. The average execution cycles of 1-frame encoding with this scheduling is approximately 1,527 K cycles, which gives up 16.3 speed-ups against scheduling on a single-core SH processor. Other structures of HCMPs such as with one scheduler SH and four application SHs (homogeneous) and with one scheduler SH, one application SH and one FE-GA have also been evaluated as shown in Fig. 3.
5. Power Control Scheme and Evaluation Not all the processors on an HCMP are always on duty at the same time due to data dependency or control dependency among tasks. Fig. 4 shows a sample macro task graph (a) and its scheduling on an HCMP. Because of data dependency among the three tasks, CPU is in an idle condition between MT1 and MT3. The compiler generates such power control codes setting up FVR registers that clock and power supply are shut off (b), or clock frequency and supply voltage are lowered (c). Power control has been applied and evaluated on the HCMP with three SHs and two FE-GAs using the following four techniques: (1) 1/8 clock lowering, (2) clock stop, (3) power cut-off and (4) the mixed of the former three techniques. Transition cycles required for the power control is determined as 100, 2,000, 20,000 cycles for (1), (2), (3) respectively [9]. These techniques are applied only if inter-task idle cycles are longer than the transition cycles. The mixed technique gives the first priority to power cut-off in use, the second to clock stop and the third to clock lowering. The power estimation is based on 90 nm process, using average dynamic power index of 0.3 mW/MHz for SH [6] and 0.8 mW/MHz for FE-GA [7], and setting static power as 20 % of the total [10]. Fig. 5 shows the detail of 4-frame encoding operations with the mixed technique applied. It is confirmed that power cut-off is mainly applied since the average size of MP3 encoding tasks is large, for example 20K cycles, and most of the idle cycles are greater than the power cut-off transition cycles. Tab. 1 shows calculated power dissipation of the HCMP with the four power control techniques applied. It is estimated that the power control on the HCMP results in 24 % power reduction.
Acknowledgements A part of this research has been supported by NEDO “Advanced Heterogeneous Multi-Processor”.
References [1] D. Pham, et. al., “The design and implementation of a first-generation CELL processor”, ISSCC 2005. [2] T. Shiota, et. al., “A 51.2GOPS, 1.0GB/s-DMA Single-Chip Multi-Processor Integrating Quadruple 8-Way VLIW Processors”, ISSCC 2005. [3] Uzura MP3 audio encoder, http://members.at.infoseek.co.jp/kitaurawa/index_e.html . [4] Keiji Kimura, et. al., “Multigrain Parallel Processing on Compiler Cooperative Chip Multiprocessor”, INTERACT-9, Feb., 2005 [5] Jun Shirako, et. al., “Compiler Control Power Saving Scheme for Multi Core Processors”, LCPC2005. [6] T. Yamada, et. al., “Low-Power Design of 90-nm SuperH Processor Core”, Proc. Of Computer Design 2005, Int’l. Conf on Computer Design, 2005. [7] M. Ito, “Reconfigurable LSI, major vendors rushing its release”, the 13th FPGA/PLD Design Conference 2006, Jan., 2006 (in Japanese). [8] H. Tanaka, et. al., “Study of Audio Signal Processing on Reconfigurable Processor FE-GA”, technical report of IEICE, RECONF2005-67, Nov., 2005 (in Japanese). [9] H. Kawaguchi, et. al., “uItron-LP: Power Conscious Real-Time OS Based on Cooperative Voltage Scaling for Multimedia Applications”, IEEE Trans. On multimedia, Feb. 2005. [10] S. Naffziger, et. al., “The Implementation of a 2-core Multi-Threaded Itanium-Family Processor”, ISSCC2005.
HCMP Chip CPU0
CPU
CPU1
SH0: Scheduler SH1
CPUn
DTU
LPM
LDM
FVR
DSM
S2 P F1 F1
S2 F2
S1 F1
Q1 Q1 F2 F4
HF F3&F4
S1 F3
MD MD Q2 Q2 F1 F3 F1 F3
Q6 Q7 Q8 Q8 F1 F1&F2 F1 F2
S1 F4
MD MD Q2 Q2 F2 F4 F2 F4
Q6 F2
S1 F2
HF F1&F2
Q9 Q9 F1 F2
HF F1&F2
Q10Q10 F1 F2
BS F1
BS F2
S2 P F5 F5
Q9 Q9 F3 F4
HF F3&F4
Q6 F1
Q6 F2
Q7 Q8 Q8 Q6 F1&F2 F1 F2 F1
Q6 F3
Q6 Q7 Q8 Q8 Q6 F4 F3&F4 F3 F4 F3
CSM 0
Controller DRPm
18
16.3
16
10
24
18 15
Speed-up ratio
8
12
6.7
6
9 4.0
4
6 3
1.0
0
0 SH x 1
BS S2 F4 F6
Q6 F2
Q7 Q8 Q8 F1&F2 F1 F2
Q8 F3
S1 F5
S1 F7
Q6 F4
Q7 Q8 Q8 Q6 F3&F4 F3 F4 F3
S1 F6
S1 F8
Q6 Q7 Q8 F4 F3&F4 F4
SH x 5
2,000
6,108 K cyc / 4 frames 4,000 5,000 6,000
3,000
Q: Non-Linear Quantization HF: Huffman Coding BS: Bit Stream Generation
[K cycle] 7,000
Fxx: xx th frame
27
21
Execution cycles
12
HF Q10 BS F3&F4 F3 F4 F3
Fig. 2 Scheduling result of MP3 audio encoder on an HCMP with three SHs and two FE-GAs.
Fig.1 Standard heterogeneous multi-processor architecture with CPUs and DRPs.
Execution cycles per frame [M cycles]
DRP0
1,000
S: Sub-band Analysis P: Psycho-Acoustic Analysis MD: Modified Discrete Transform
ALU Array
14
Q9 Q9 F3 F4
DMA
FVR
LDM
Speed-up ratio vs. SH x 1
Q5 F2
Q9 Q9 F1 F2
FE-GA1
Network Interface
2
S2 F4
HF F1&F2
FE-GA0
Interconnection Network
DTU
Q5 F1
SH2
Network Interface
DSM
S2 Q1 Q1 F3 F1 F3
SH x 2, SH x 3, DRP x 1 DRP x 2
CPU0 CPU1 MT1
MT1 FV: FULL
MT2
FV: OFF MT3
MT2 FV: FULL
MT1 FV: LOW
MT2 FV: FULL
MT3 FV: time FULL
MT3 FV: time FULL
(a) Macro Task Graph
CPU0 CPU1
(b) Clock stop / Power cut-off
(c) clock lowering / voltage lowering
Fig. 4 Compiler power control scheme.
Fig. 3 Execution cycles and performance comparison on various types of HCMPs. (1) Clock lowering cycles (not applied) SH1
Tab. 1 Calculated power dissipation of the HCMP with the power control techniques applied.
(3) Power supply cut-off cycles
Power Control
SH2 Power control transition cycles
Task execution cycles DRP1
DRP2
0
1,000
2,000 3,000 4,000 Time [ K cycles ]
5,000
6,000
Fig. 5 Detail of 4-frame encoding operations with the mixed power control technique applied.
Per-
[mW]
centage
Not applied
129.3
100.0 %
(Leakage power) (1) 1/8 clock lowering (2) Clock stop
(25.9) 101.0 98.3
78.2 % 76.0 %
99.8 97.7
77.2 % 75.6 %
(3) Power cut-off (4) (1)(2)(3) Mixed
(2) Clock stop cycles
Power