Evaluation of Multicore Processors for Embedded Systems by Parallel Benchmark Program Using OpenMP

Toshihiro Hanawa (1,2), Mitsuhisa Sato (1,2), Jinpil Lee (1), Takayuki Imada (1), Hideaki Kimura (1), Taisuke Boku (1,2)
1: Graduate School of Systems and Information Engineering
2: Center for Computational Sciences
University of Tsukuba
Outline
• Background & Motivation
• Multicore processors for embedded systems evaluated in this study
  – Renesas: M32700
  – ARM & NEC Electronics: MPCore
  – Waseda Univ. & Renesas & Hitachi: RP1
• Performance evaluation of synchronization with OpenMP
• Evaluation by parallel benchmarks using OpenMP
• Conclusions & Future works
Background (1)
• Embedded systems with complicated functions are widely used.
  – Digital home appliances, car navigation systems, and so on
  → These systems require increasingly higher performance.
• However, the power consumption of embedded systems must be reduced
  – To extend the operating time of mobile devices
  – To realize more environmentally friendly products
• Multicore technology has also been introduced for embedded processors
  – Multicore improves performance through parallel processing rather than by increasing the clock frequency.
Background (2)
• OpenMP
  – A portable programming model that provides a flexible interface for developing parallel applications on shared-memory multiprocessors
  – By inserting "directives" for parallelization into a program, the OpenMP compiler generates the parallelized code (a minimal example follows below).
  – OpenMP is available in several compiler products for general-purpose processors; GCC has also supported OpenMP since version 4.2.
→ Apply OpenMP to multicore processors for embedded systems
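As a minimal, hypothetical illustration of this directive-based style (the loop below is not taken from any benchmark used in this study), a single pragma is enough to parallelize a reduction loop:

```c
#include <stdio.h>
#include <omp.h>

int main(void)
{
    double sum = 0.0;

    /* One directive turns the sequential loop into a parallel one;
       the reduction clause combines the per-thread partial sums. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < 1000000; i++)
        sum += 1.0 / (i + 1);

    printf("threads available: %d, sum = %f\n", omp_get_max_threads(), sum);
    return 0;
}
```

Removing the pragma leaves a valid sequential program, which is why only a handful of directive insertions are needed per benchmark later in this talk.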
Our motivation
• We evaluate several embedded multicore processors with a shared-memory mechanism, using parallel benchmark programs written with OpenMP.
  – We develop the OpenMP implementation using a cross-compiling environment for each target multicore architecture.
• We investigate the following points:
  – The effect of OpenMP as a programming method
  – The memory bandwidth and synchronization performance as an SMP multicore processor, and their impact on overall system performance
Multicore Processors for Embedded Systems
• Three multicore processors for embedded systems with shared memory
  – Renesas Technology: M32700
  – ARM & NEC Electronics: MPCore
  – Waseda Univ. & Renesas & Hitachi: RP1
• Example of a multicore processor for a desktop PC (for comparison)
  – Intel: Core2Quad Q6600
Related work
Several embedded multicore processors have been studied individually with benchmarks using OpenMP, in a few cases including extensions of OpenMP.
• M32700 (Hotta et al.)
  – mpeg2encode from MediaBench parallelized using OpenMP
• MPCore (Blume et al.)
  – Several applications parallelized using OpenMP
• BlackFin 561 (Seo et al.)
  – Extension of OpenMP directives suitable for the BlackFin architecture
  – Modification of EEMBC benchmarks into parallel versions using OpenMP
• RP1 & Fujitsu FR1000 (Miyamoto et al.)
  – SMP vs. AMP; multimedia applications parallelized automatically by the OSCAR compiler, which they developed (including a few OpenMP functions)
• In this paper, we evaluate several kinds of multicore processors for embedded systems with parallel benchmarks using OpenMP under similar conditions.
M32700 (Renesas [formerly Mitsubishi Electric])
• M32R-II core x 2
  – 7-stage pipeline
  – 32-bit instructions (single-issue) + 16-bit instructions (dual-issue)
  – No floating-point unit: a floating-point library with GCC (soft-float) is used
• μT-Engine M3T32700UT board is used
[Block diagram of the M32700: two M32R-II cores, each with an 8KB/2-way I-cache, an 8KB/2-way D-cache, and 32-entry I-/D-TLBs, connected through a bus arbiter to a 512KB shared SRAM; CPU core debugging interface; peripherals (ICU, clock control, timer, UART, GPIO); clock PLL/divider; bus controller, DMAC, and SDRAM controller]
MPCore (ARM & NEC Electronics)
• ARM MP11 core (ARM11 architecture) x 4
  – ARMv6 instruction set: ARM (32-bit), Thumb (16-bit), and Jazelle (variable-length) instruction sets
  – 8-stage single-issue pipeline
• CT11MPCore + RealView Emulation Baseboard are used
  – External peripheral devices, including two AXI bus interfaces for attachment to the MPCore processor and a DRAM controller, are incorporated into a single FPGA chip.
  – To reduce the performance degradation, a 1MB shared L2 cache is embedded in the CT11MPCore.
[Block diagram of the MPCore: four MP11 CPUs, each with a timer & watchdog and a per-CPU interrupt interface under the distributed interrupt controller, connected through the coherency control bus and Snoop Control Unit (SCU) to a 64-bit instruction & data bus and DRAM]
RP1 (Waseda Univ. & Renesas & Hitachi)
• SH-X3 architecture, SH-4A core x 4
  – 16-bit instruction set
  – 8-stage dual-issue pipeline
• A dedicated snoop bus for cache consistency
  – Data transfers for cache coherence control can avoid traffic on the on-chip system bus (SuperHwy).
[Block diagram of the RP1: four cores, each with CPU, FPU, 32KB I-cache, 32KB D-cache, CCN, DTU, and local memories (ILRAM, OLRAM, and 128KB URAM), connected by the snoop bus with its snoop controller (SNC) and by the on-chip system bus (SuperHwy) to a 128KB CSM, LBSC (SRAM), and DBSC (DDR2-SDRAM)]
Summary of multicore processors

                    M32700             MPCore                  RP1               Q6600
# of cores          2                  4                       4                 4
Core frequency      300MHz             210MHz                  600MHz            2.4GHz
Internal bus freq.  75MHz              210MHz                  300MHz            -
External bus freq.  75MHz              30MHz                   50MHz             -
Cache (I+D)         2-way 8K+8K        4-way 32K+32K (L1),     4-way 32K+32K     8-way 32K+32K (L1),
                                       8-way 1M (L2, shared)                     16-way 4M (per 2 cores) x 2 (L2)
Line size           16 byte            32 byte                 32 byte           64 byte
Main memory         32MB SDRAM         256MB DDR-SDRAM         128MB DDR2-600    4GB DDR2-800
                    100MHz             30MHz                   300MHz            400MHz

On-chip fast memories are unused in this study.
Implementation of the runtime library for Omni OpenMP
• The Omni OpenMP compiler Ver. 2 is used, with the cross compiler for each target architecture as its backend.
• For mutual exclusion, the Omni OpenMP runtime can use either the mutex lock provided by the POSIX thread library or a dedicated spinlock function for the target system.
• We implemented runtime libraries using spinlocks for each of the multicore processors examined in this study
  – by referencing the implementation in the Linux kernel
  – Please refer to the paper for how the spinlock functions are implemented for each multicore (omitted here due to limited presentation time); a generic sketch follows below.
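As a rough illustration only, not the actual runtime code (whose per-processor atomic instructions are described in the paper), a spinlock can be built on a test-and-set style primitive. Here GCC's __sync builtins stand in for the processor-specific instructions, and the names omni_spinlock_t, spin_lock, and spin_unlock are ours, not the Omni runtime API:

```c
/* Minimal spinlock sketch using GCC atomic builtins as a stand-in for the
   processor-specific atomic instructions used in the real runtime. */
typedef volatile int omni_spinlock_t;   /* hypothetical type name */

static void spin_lock(omni_spinlock_t *lock)
{
    /* __sync_lock_test_and_set atomically writes 1 and returns the old value;
       loop until the old value was 0, i.e. until we actually took the lock. */
    while (__sync_lock_test_and_set(lock, 1))
        while (*lock)
            ;   /* spin on a plain read to reduce atomic-bus traffic */
}

static void spin_unlock(omni_spinlock_t *lock)
{
    __sync_lock_release(lock);   /* store 0 with release semantics */
}
```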
Evaluation environment

                 M32700                MPCore                 RP1                   Q6600
Main memory      32MB SDRAM 100MHz     256MB DDR-SDRAM 30MHz  128MB DDR2-600 300MHz 4GB DDR2-800 400MHz
OS (uname -m)    Linux 2.6.25 (m32r)   Linux 2.6.19 (armv6l)  Linux 2.6.16 (sh4a)   Linux 2.6.18 (i686)
File system      NFS                   NFS                    NFS                   Local (ext3)
C compiler       gcc 4.1.2 20061115    gcc 4.1.1              gcc 3.4.5 20060103    gcc 4.1.2 20061115
Compile option   -m32r2                -mcpu=mpcore           -m4a                  -march=nocona
C library        glibc 2.3.6.ds1-13    glibc 2.3.6            glibc 2.3.3           glibc 2.3.6.ds1-13.etch5
pthread          LinuxThreads 0.10     NPTL 2.3.6             LinuxThreads 0.10     NPTL 2.3.6
Pthread library
• LinuxThreads
  – The first implementation of POSIX threads on Linux
  – A manager thread, separate from the computation threads, handles thread creation and termination, and operations related to synchronization are implemented with signals.
• NPTL (Native POSIX Thread Library)
  – An implementation that solves the problems of LinuxThreads
  – Uses futex (fast user-level locking mechanism)
• Correction of the pthread library (LinuxThreads) on RP1
  – TAS (test-and-set) instruction → changed to a combination of the MOVLI and MOVCO (load-linked and store-conditional) instructions; a sketch of this replacement follows below.
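As a hedged sketch only, modeled on the Linux kernel's SH-4A load-linked/store-conditional atomics rather than the exact patch applied to LinuxThreads (the function name llsc_testandset is ours), the TAS-based acquire can be replaced by a MOVLI/MOVCO sequence roughly like this:

```c
/* Sketch of an SH-4A test-and-set replacement using movli.l/movco.l
   (load-linked/store-conditional). Both instructions operate on register R0,
   hence the "z" constraint on the temporary. */
static inline int llsc_testandset(volatile int *spinlock)
{
    int ret, tmp;
    __asm__ __volatile__(
        "1:                      \n\t"
        "movli.l  @%2, %0        \n\t"   /* R0 = *spinlock, set reservation      */
        "mov      %0, %1         \n\t"   /* remember the previous value          */
        "mov      #1, %0         \n\t"   /* value to store: 1 = locked           */
        "movco.l  %0, @%2        \n\t"   /* store only if the reservation holds  */
        "bf       1b             \n\t"   /* retry if the conditional store failed */
        : "=&z"(tmp), "=&r"(ret)
        : "r"(spinlock)
        : "t", "memory");
    return ret;   /* old value: 0 means the lock was acquired */
}
```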
Evaluation of synchronization
• EPCC microbenchmark
  – A benchmark by the Edinburgh Parallel Computing Centre that measures the overhead of OpenMP language constructs
  – syncbench: benchmark for the performance evaluation of synchronization; a sketch of its measurement style follows below
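Roughly, syncbench times a reference loop and the same loop wrapped in the construct under test, and reports the difference per iteration as the construct's overhead. A simplified sketch of that idea for the barrier construct (not the actual EPCC source; INNER and delay() are illustrative) is:

```c
#include <stdio.h>
#include <omp.h>

#define INNER 1000   /* repetitions per timed sample, for illustration */

/* a small amount of dummy work the compiler cannot optimize away */
static void delay(void)
{
    for (volatile int k = 0; k < 100; k++)
        ;
}

int main(void)
{
    /* reference: the loop body without any synchronization construct */
    double t0 = omp_get_wtime();
    for (int i = 0; i < INNER; i++)
        delay();
    double ref = omp_get_wtime() - t0;

    /* measured: the same body followed by a barrier inside a parallel region */
    double t1 = omp_get_wtime();
    #pragma omp parallel
    {
        for (int i = 0; i < INNER; i++) {
            delay();
            #pragma omp barrier
        }
    }
    double meas = omp_get_wtime() - t1;

    printf("barrier overhead ~ %.3f us\n", (meas - ref) / INNER * 1e6);
    return 0;
}
```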
Results of EPCC syncbench (upper: mutex lock, lower: spinlock; unit: μs)
[Table of measured overheads for the parallel, for, parallel for, barrier, single, critical, lock/unlock, ordered, atomic, and reduction constructs on M32700, MPCore, RP1, and Q6600; numeric values omitted here, see the paper]
• LinuxThreads demonstrates very low performance with the mutex lock: the spinlock is up to 482 times and 695 times faster on the M32700 and the RP1, respectively.
• Even with NPTL, the spinlock is still faster (1.12 times on the MPCore). → Hereinafter, we use only the spinlock for synchronization.
• Memory access speed has an impact on the "parallel" directive.
Parallel benchmark using OpenMP: MiBench (1/2)
• MiBench suite
  – A free, commercially representative benchmark suite for embedded systems
  – Modeled on the EEMBC (Embedded Microprocessor Benchmark Consortium) benchmark suites
  – We parallelize benchmarks using OpenMP for typical applications on embedded systems

  Auto./Industrial: basicmath, bitcount, qsort, susan (edge), susan (corners), susan (smoothing)
  Consumer: jpeg, lame, mad, tiff2bw, tiff2rgba, tiffdither, tiffmedian, typeset
  Office: ghostscript, ispell, rsynth, sphinx, stringsearch
  Network: dijkstra, patricia, (CRC32), (sha), (blowfish)
  Security: blowfish enc., blowfish dec., pgp sign, pgp verify, rijndael enc., rijndael dec., sha
  Telecomm.: CRC32, FFT, IFFT, ADPCM enc., ADPCM dec., GSM enc., GSM dec.
Parallel benchmark using OpenMP: MiBench (2/2)
• MiDataSets: a set of workload inputs for MiBench
  – susan smoothing: 19.pgm
  – blowfish encoding: 4.txt
• Large dataset provided by MiBench
  – FFT: nwave = 6, nsample = 65536
Parallel benchmark using OpenMP: NPB
• NAS Parallel Benchmarks: NPB3.3-OMP (OpenMP version)
  – IS: Integer Sort
    - Class W (problem size = 2^20)
    - USE_BUCKETS is undefined, since a huge slowdown occurred with buckets
• Modified version of NPB, translated from Fortran into C using OpenMP directives
  – CG: Conjugate Gradient method
    - Class S, 1400 x 1400 sparse matrix, 14 iterations until convergence
    - Provided as a sample with Omni OpenMP
Parallel benchmark using OpenMP: MediaBench2
• MediaBench2
  – mpeg2encode: the version parallelized using OpenMP by Hotta et al.
    - Input file: input_base_4CIF_96bps_15.par
  – JPEG2000 encode: parallelized using OpenMP by ourselves
    - Command line: -f input_base_4CIF.ppm -F output_base_4CIF_96bps.jp2 -T jp2 -O rate=0.010416667
Speedup of Susan smoothing
[Chart: speedup vs. number of cores for M32700, MPCore, RP1, and Q6600]
• High parallelism: 3.4 – 3.7x faster at 4 cores
Execution time of Susan smoothing
[Chart: execution time [s] vs. number of cores for M32700, MPCore, RP1, and Q6600; annotated gaps of 24 – 25x and 10x between the platforms]
Power consumption, power efficiency
• M32700: 800mW (core only)
• MPCore: 355mW x 4 = 1.4W (core only)
• RP1: 0.6mW/MHz x 600MHz x 4 = 1.4W (core only)
• Q6600: TDP (Thermal Design Power) 105W / 95W
→ RP1 achieves several times higher performance per watt than Q6600.
Speedup of NPB IS
• CLASS=W: the maximum size acceptable to all processors
[Chart: speedup vs. number of cores for M32700, MPCore, RP1, and Q6600; annotated speedups of 1.12x, 1.6x, 2.0x, and 2.1x; detailed values are given in Table 3 of the paper]
• Memory-intensive application: memory access speed has a major impact.
• Chart annotations: "Dedicated snoop bus works effectively", "Cache size is too small", "DRAM access is too slow"
Speedup of CG
[Chart: speedup vs. number of cores for M32700, MPCore, RP1, and Q6600; annotated speedups of 2.0x, 2.8x, and 3.8x]
• Computation-intensive: FP calculations dominate the execution time.
• Chart annotation: "L2 cache is shared among 4 cores"
Speedup of Mpeg2encode
• On MPCore, mpeg2encode with OpenMP did not run.
[Chart: speedup vs. number of cores for M32700, RP1, and Q6600; annotated speedups of 1.5x, 1.6x, and 2.2x]
• Chart annotations: "FP calculations dominate the execution time", "Overhead of file operation via NFS"
Speedup of JPEG2000encode
[Chart: speedup vs. number of cores for M32700, RP1, and Q6600; annotated speedups of 1.4x and 2.3x]
Modification cost for parallelization using OpenMP
Principle: the parallel region should be assigned to the largest section possible in order to reduce the overhead of thread assignment (fork-join); see the sketch after the table.

Application          Amount of modification
susan smoothing      Add 6 directives
Blowfish encoding    Add 9 directives & modify 12 lines
FFT                  Add 4 directives
Mpeg2enc             Add 5 directives & modify 7 lines
JPEG2000enc          Add 6 directives & modify 9 lines
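A hypothetical before/after fragment (not taken from the benchmarks above; NFRAMES, HEIGHT, and process_row are illustrative placeholders) shows the principle: moving the parallel region out of the frame loop pays the fork-join cost once instead of once per frame.

```c
#define NFRAMES 30
#define HEIGHT  480

/* placeholder for the per-row work of a hypothetical encoder */
static void process_row(int f, int y) { (void)f; (void)y; }

/* Before: a parallel region is created (fork-join) for every frame. */
static void encode_naive(void)
{
    for (int f = 0; f < NFRAMES; f++) {
        #pragma omp parallel for
        for (int y = 0; y < HEIGHT; y++)
            process_row(f, y);
    }
}

/* After: one parallel region spans all frames; only the work-sharing loop
   (with its implicit barrier per frame) remains inside. */
static void encode_wide_region(void)
{
    #pragma omp parallel
    {
        for (int f = 0; f < NFRAMES; f++) {
            #pragma omp for
            for (int y = 0; y < HEIGHT; y++)
                process_row(f, y);
        }
    }
}

int main(void)
{
    encode_naive();
    encode_wide_region();
    return 0;
}
```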
Comparison between spinlock and mutex lock
• In the case of mpeg2encode on RP1, the speedup saturates with the mutex lock.
[Chart: speedup vs. number of cores on RP1 with spinlock and with mutex; the spinlock reaches about 1.6x]
• Chart annotation: "Overhead of synchronization is large"
Power consumption of mpeg2encode on RP1 with 4 cores
• We evaluate the power consumption using the PowerWatch system developed by Hotta.
• Cores are idle due to waiting at barriers.
• ATX 12V drives the cores, DDR2 memory, etc. (cores use 1V, DDR2 memory uses 1.8V), but the ATX 12V line shows almost no variation.
• ATX 5V drives LEDs and stays in a constant range (2.0-2.8W); we remove these results from the power consumption of the system.

Energy [Ws]   Cores   Entire system
Mutex         129.0   271.0
Spin           89.9   181.8

[Chart: power [W] over time [s] of the ATX 12V and Core 1V lines for the mutex and spinlock runs]
Conclusions
We evaluated the performance of four multicore processors: the M32700, MPCore, and RP1 for embedded systems, and the Core2Quad Q6600 for a desktop PC.
• The effect of OpenMP as a programming method:
  – Just by inserting several OpenMP directives into the source code, the performance increased with the number of cores in most of the applications.
  – Of course, OpenMP directives are not always easy to apply to every application, and we cannot always obtain the results we expect.
• The memory bandwidth and synchronization performance as an SMP multicore processor, and their impact on overall system performance:
  – Multicore processors for embedded systems have larger synchronization costs and slower memory than multicore processors for desktop PCs. Nevertheless, in our observation the spinlock mechanism enables embedded multicore processors to improve their synchronization performance.
  – Since NPTL is a sophisticated library, it may solve the problems around the mutex lock and improve power efficiency; however, the Linux kernel for the target system must offer the "futex" system call.
Future works
• Most multicore processors for embedded systems include fast internal memories on the chip; we will consider using these internal memories to speed up synchronization.
  – An extension of the Linux kernel is needed.
• The effect of the spinlock for synchronization under multiple parallel workloads should be examined.
  – We will try to use Intel's dual-core Atom platform to measure performance and power consumption, comparing LinuxThreads and NPTL.
• Parallel processing using OpenMP alone has difficulty satisfying the real-time constraints of embedded systems; to apply OpenMP to embedded systems, some extensions of the OpenMP directives will be required.
Thank you for your attention!