Evaluation of Multicore Processors for Embedded Systems by Parallel Benchmark Program Using OpenMP

Toshihiro Hanawa(1,2), Mitsuhisa Sato(1,2), Jinpil Lee(1), Takayuki Imada(1), Hideaki Kimura(1), Taisuke Boku(1,2)

1: Graduate School of Systems and Information Engineering
2: Center for Computational Sciences, University of Tsukuba

Outline
• Background & Motivation
• Multicore processors for embedded systems evaluated in this study
  - Renesas: M32700
  - ARM & NEC Electronics: MPCore
  - Waseda Univ. & Renesas & Hitachi: RP1
• Performance evaluation of synchronization with OpenMP
• Evaluation by parallel benchmarks using OpenMP
• Conclusions & Future work


Background (1)
• Embedded systems with complex functions are widely used.
  - Digital home appliances, car navigation systems, and so on
  → These systems require increasingly higher performance.
• However, the power consumption of embedded systems must be reduced
  - to extend the operating time of mobile devices
  - to realize more environmentally friendly products
• Multicore technology has therefore been introduced for embedded processors as well.
  - Multicore improves performance through parallel processing rather than by raising the clock frequency.


Background (2)
• OpenMP
  - A portable programming model that provides a flexible interface for developing parallel applications on shared-memory multiprocessors
  - By inserting "directives" for parallelization into a program, the OpenMP compiler generates parallelized code.
  - OpenMP is available in several compilers for general-purpose processors; GCC has also supported OpenMP since version 4.2.
• Our approach: apply OpenMP to multicore processors for embedded systems (a minimal example follows).
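As a minimal illustration of this directive-based style (an example of ours, not code from the evaluated benchmarks), a serial loop is parallelized by adding a single pragma; the array names and sizes are arbitrary:

```c
#include <omp.h>
#include <stdio.h>

#define N 1024

int main(void)
{
    static float a[N], b[N], c[N];

    for (int i = 0; i < N; i++) {      /* initialize input arrays */
        a[i] = (float)i;
        b[i] = 2.0f * i;
    }

    /* The only change needed for parallelization: one directive.
     * The OpenMP compiler distributes the loop iterations over the
     * cores; the serial code is recovered by ignoring the pragma. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[N-1] = %f (threads: %d)\n", c[N - 1], omp_get_max_threads());
    return 0;
}
```

Compiled with `gcc -fopenmp`, the loop runs in parallel; without the flag, the same source runs serially.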


Our motivation
• We evaluate embedded multicore processors with shared memory using parallel benchmark programs written in OpenMP.
  - We develop the OpenMP implementation using a cross-compiling environment for each target multicore architecture.
• We investigate the following points:
  - the effect of OpenMP as a programming method
  - the memory bandwidth and synchronization performance of each SMP multicore processor, and their impact on overall system performance


Multicore Processors for Embedded Systems
• Three multicore processors for embedded systems with shared memory
  - Renesas Technology: M32700
  - ARM & NEC Electronics: MPCore
  - Waseda Univ. & Renesas & Hitachi: RP1
• As an example of a multicore processor for a desktop PC
  - Intel: Core 2 Quad Q6600


Related work
Several embedded multicore processors have previously been evaluated individually with benchmarks using OpenMP, including extensions of OpenMP in a few cases.
• M32700 (Hotta et al.)
  - mpeg2encode from MediaBench parallelized using OpenMP
• MPCore (Blume et al.)
  - Several applications parallelized using OpenMP
• BlackFin 561 (Seo et al.)
  - Extension of OpenMP directives suited to the BlackFin architecture
  - Modification of EEMBC benchmarks into parallel versions using OpenMP
• RP1 & Fujitsu FR1000 (Miyamoto et al.)
  - SMP vs. AMP; multimedia applications parallelized automatically by their OSCAR compiler (supporting a subset of OpenMP functionality)
• In this paper, we evaluate several kinds of multicore processors for embedded systems with the same parallel benchmarks using OpenMP under similar conditions.

M32700 (Renesas Technology [formerly Mitsubishi Electric])
• M32R-II core x 2
  - 7-stage pipeline
  - 32-bit instructions (single-issue) + 16-bit instructions (dual-issue)
  - No floating-point unit: floating-point library with GCC (soft-float)
• 512KB shared SRAM; per core, 8KB/2-way I-cache and D-cache, 32-entry I-TLB and D-TLB
• The μT-Engine M3T32700UT board is used.
[Block diagram: two M32R-II CPU cores and a bus arbiter, 512KB shared SRAM, bus controller, DMAC, SDRAM controller, clock PLL/divider, and peripherals (ICU, clock control, timer, UART, GPIO) on the M32700 chip]

MPCore (ARM & NEC Electronics)
• ARM MP11 core (ARM11 architecture) x 4
  - ARMv6 instruction set: ARM (32-bit), Thumb (16-bit), and Jazelle (variable-length) instruction sets
  - 8-stage single-issue pipeline
• CT11MPCore + RealView Emulation Baseboard are used.
  - External peripheral devices, including two AXI bus interfaces for attachment to the MPCore processor and a DRAM controller, are incorporated into a single FPGA chip.
  - To reduce the performance degradation, a 1MB shared L2 cache is included on the CT11MPCore.
[Block diagram: four MP11 CPUs, each with an interrupt interface and a timer/watchdog unit, a distributed interrupt controller, a snoop control unit (SCU) on the coherency control bus, and a 64-bit instruction/data bus to DRAM]

RP1 (Waseda Univ. & Renesas & Hitachi)
• SH-X3 architecture, SH-4A core x 4
  - 16-bit instruction set
  - 8-stage dual-issue pipeline
• A dedicated snoop bus for cache consistency
  - Data transfers for cache coherence control can avoid traffic on the on-chip system bus (SuperHwy).
[Block diagram: four SH-4A cores (CPU, FPU, 32KB I-cache and D-cache, local ILRAM/OLRAM, 128KB URAM, DTU, CCN) connected by the snoop bus and snoop controller (SNC), with 128KB CSM, LBSC (SRAM), and DBSC (DDR2-SDRAM) on the SuperHwy]

Summary of multicore processors

|                    | M32700            | MPCore                                    | RP1                   | Q6600                                                |
|--------------------|-------------------|-------------------------------------------|-----------------------|------------------------------------------------------|
| # of cores         | 2                 | 4                                         | 4                     | 4                                                    |
| Core frequency     | 300MHz            | 210MHz                                    | 600MHz                | 2.4GHz                                               |
| Internal bus freq. | 75MHz             | 210MHz                                    | 300MHz                | -                                                    |
| External bus freq. | 75MHz             | 30MHz                                     | 50MHz                 | -                                                    |
| Cache (I+D)        | 2-way 8K+8K       | 4-way 32K+32K (L1), 8-way 1M (L2, shared) | 4-way 32K+32K         | 8-way 32K+32K (L1), 16-way 4M (per 2 cores) x 2 (L2) |
| Line size          | 16 bytes          | 32 bytes                                  | 32 bytes              | 64 bytes                                             |
| Main memory        | 32MB SDRAM 100MHz | 256MB DDR-SDRAM 30MHz                     | 128MB DDR2-600 300MHz | 4GB DDR2-800 400MHz                                  |

On-chip fast memories are unused in this study.

Implementation of the runtime library for Omni OpenMP
• The Omni OpenMP compiler Ver. 2 is used, with the cross compiler for each target architecture as its backend.
• For mutual exclusion, the Omni OpenMP runtime can use either
  - the mutex lock provided by the POSIX thread library, or
  - a dedicated spinlock function for the target system.
• We implement runtime libraries using spinlocks for each of the multicore processors examined in this study
  - by referencing the corresponding implementations in the Linux kernel.
  - (Please refer to the paper for how the spinlock functions are implemented for each multicore; a sketch of the general idea follows.)
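As a rough sketch of the idea (not the Omni runtime's actual code), a dedicated spinlock can be built on an atomic test-and-set primitive; here GCC's portable __sync builtins stand in for the per-architecture atomic sequences taken from the Linux kernel:

```c
/* Minimal spinlock sketch for an OpenMP runtime's critical sections.
 * Illustrative only: the real runtime uses a dedicated atomic sequence
 * per target architecture; GCC's __sync builtins stand in for it here. */
typedef volatile int omp_spinlock_t;

static inline void spin_lock(omp_spinlock_t *lock)
{
    /* __sync_lock_test_and_set atomically writes 1 and returns the
     * previous value; keep trying until it was previously 0.        */
    while (__sync_lock_test_and_set(lock, 1))
        while (*lock)          /* spin on a plain read to reduce bus traffic */
            ;
}

static inline void spin_unlock(omp_spinlock_t *lock)
{
    __sync_lock_release(lock); /* store 0 with release semantics */
}
```

Because the lock is acquired entirely in user space, it avoids the signal-based mutex path of LinuxThreads, which is where the large gains in the syncbench results below come from.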


Evaluation environment

|                | M32700              | MPCore                | RP1                   | Q6600                   |
|----------------|---------------------|-----------------------|-----------------------|-------------------------|
| Main memory    | 32MB SDRAM 100MHz   | 256MB DDR-SDRAM 30MHz | 128MB DDR2-600 300MHz | 4GB DDR2-800 400MHz     |
| OS (uname -m)  | Linux 2.6.25 (m32r) | Linux 2.6.19 (armv6l) | Linux 2.6.16 (sh4a)   | Linux 2.6.18 (i686)     |
| File system    | NFS                 | NFS                   | NFS                   | Local (ext3)            |
| C compiler     | gcc 4.1.2 20061115  | gcc 4.1.1             | gcc 3.4.5 20060103    | gcc 4.1.2 20061115      |
| Compile option | -m32r2              | -mcpu=mpcore          | -m4a                  | -march=nocona           |
| C library      | glibc 2.3.6.ds1-13  | glibc 2.3.6           | glibc 2.3.3           | glibc 2.3.6.ds1-13etch5 |
| pthread        | LinuxThreads 0.10   | NPTL 2.3.6            | LinuxThreads 0.10     | NPTL 2.3.6              |

Pthread library
• LinuxThreads
  - The first implementation of POSIX threads on Linux
  - A manager thread, separate from the computation threads, is required to handle thread creation and termination, and synchronization-related operations are implemented with signals.
• NPTL (Native POSIX Thread Library)
  - An implementation designed to solve the problems of LinuxThreads
  - Uses futex (fast user-space locking mechanism)
• Correction of the pthread library (LinuxThreads) on RP1
  - The TAS (test-and-set) instruction is replaced with a MOVLI/MOVCO (load-linked / store-conditional) combination, as sketched below.
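The following is a hedged sketch of such a replacement, modeled on the load-linked/store-conditional sequences in the Linux SH kernel sources rather than on the actual patch; the register constraints and the trailing synco barrier are assumptions:

```c
/* Sketch of an atomic test-and-set built from SH-4A MOVLI.L/MOVCO.L
 * (load-linked / store-conditional), in the style of the Linux SH
 * kernel sources. Illustrative only; not the authors' actual patch. */
static inline int sh4a_test_and_set(volatile int *lock)
{
    int old, tmp;

    __asm__ __volatile__(
        "1:                     \n\t"
        "movli.l  @%2, %0       \n\t"  /* load-linked: %0 (r0) = *lock      */
        "mov      %0, %1        \n\t"  /* keep the old value in %1          */
        "mov      #1, %0        \n\t"  /* new value: 1 = locked             */
        "movco.l  %0, @%2       \n\t"  /* store-conditional, sets the T bit */
        "bf       1b            \n\t"  /* retry if another core intervened  */
        "synco                  \n\t"  /* ordering barrier after acquire    */
        : "=&z"(tmp), "=&r"(old)
        : "r"(lock)
        : "t", "memory");

    return old;                        /* 0 means the lock was acquired     */
}
```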


Evaluation of synchronization
• EPCC microbenchmarks
  - Benchmarks from the Edinburgh Parallel Computing Centre that measure the overhead of OpenMP language constructs
  - syncbench: the benchmark for evaluating synchronization performance (a sketch of the measurement idea follows)
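As a rough sketch of what syncbench measures (not the EPCC source itself), the overhead of a construct such as barrier can be estimated by timing a reference loop without the construct and subtracting it from the time of the same loop containing the construct; the iteration count and delay function here are illustrative:

```c
#include <omp.h>
#include <stdio.h>

#define ITERS 10000

/* Stands in for the small fixed amount of work that the EPCC
 * benchmark places inside each iteration. */
static void delay(volatile double *x) { *x += 1.0; }

int main(void)
{
    volatile double x = 0.0;
    double t0, ref_time, barrier_time;

    /* Reference: the work loop with no synchronization construct. */
    t0 = omp_get_wtime();
    for (int i = 0; i < ITERS; i++)
        delay(&x);
    ref_time = omp_get_wtime() - t0;

    /* Measured: the same loop with a barrier in every iteration. */
    t0 = omp_get_wtime();
    #pragma omp parallel
    {
        volatile double y = 0.0;       /* private work variable per thread */
        for (int i = 0; i < ITERS; i++) {
            delay(&y);
            #pragma omp barrier
        }
    }
    barrier_time = omp_get_wtime() - t0;

    printf("barrier overhead: %g us/iteration\n",
           (barrier_time - ref_time) / ITERS * 1e6);
    return 0;
}
```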


Results of EPCC syncbench
[Table: overhead in μs of the parallel, for, parallel for, barrier, single, critical, lock/unlock, ordered, atomic, and reduction constructs on M32700, MPCore, RP1, and Q6600, with the mutex lock (upper row of each pair) and the spinlock (lower row)]
• LinuxThreads shows very low performance with the mutex lock: with LinuxThreads, the spinlock is up to 482 times faster on M32700 and up to 695 times faster on RP1.
• Even with NPTL, the spinlock is still faster (e.g., 1.12 times on MPCore).
• Memory access speed has an impact on the "parallel" directive.
→ Hereinafter, we use only the spinlock for synchronization.

Parallel benchmarks using OpenMP: MiBench (1/2)
• MiBench suite
  - A free benchmark suite representative of commercial embedded applications
  - Modeled on the EEMBC (Embedded Microprocessor Benchmark Consortium) benchmark suites
  - We parallelize typical embedded applications from the suite using OpenMP (an illustrative smoothing loop follows the table).

| Auto./Industrial  | Consumer   | Office       | Network    | Security       | Telecomm.  |
|-------------------|------------|--------------|------------|----------------|------------|
| basicmath         | jpeg       | ghostscript  | dijkstra   | blowfish enc.  | CRC32      |
| bitcount          | lame       | ispell       | patricia   | blowfish dec.  | FFT        |
| qsort             | mad        | rsynth       | (CRC32)    | pgp sign       | IFFT       |
| susan (edges)     | tiff2bw    | sphinx       | (sha)      | pgp verify     | ADPCM enc. |
| susan (corners)   | tiff2rgba  | stringsearch | (blowfish) | rijndael enc.  | ADPCM dec. |
| susan (smoothing) | tiffdither |              |            | rijndael dec.  | GSM enc.   |
|                   | tiffmedian |              |            | sha            | GSM dec.   |
|                   | typeset    |              |            |                |            |
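As a hypothetical illustration of how an image-smoothing kernel such as susan (smoothing) can be parallelized (this is not the actual MiBench modification), the outer row loop is distributed across the cores with a single directive:

```c
/* Hypothetical 3x3 box-smoothing kernel, not the actual susan source:
 * each output row depends only on the input image, so the outer loop
 * can be distributed over the cores with one directive. */
void smooth(const unsigned char *in, unsigned char *out, int w, int h)
{
    #pragma omp parallel for
    for (int y = 1; y < h - 1; y++) {
        for (int x = 1; x < w - 1; x++) {
            int sum = 0;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++)
                    sum += in[(y + dy) * w + (x + dx)];
            out[y * w + x] = (unsigned char)(sum / 9);
        }
    }
}
```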

Parallel benchmarks using OpenMP: MiBench (2/2)
• MiDataSets: a set of workload inputs for MiBench
  - Susan smoothing: 19.pgm
  - Blowfish encoding: 4.txt
• Large dataset provided by MiBench
  - FFT: nwave=6, nsample=65536


Parallel benchmarks using OpenMP: NPB
• NAS Parallel Benchmarks: NPB3.3-OMP (OpenMP version)
  - IS: Integer Sort
    - Class W (problem size = 2^20)
    - USE_BUCKETS is undefined, since a huge slowdown occurred with buckets
• Modified version of NPB translated from Fortran into C using OpenMP directives
  - CG: Conjugate Gradient method
    - Class S, 1400 x 1400 sparse matrix, 14 iterations until convergence
    - Provided as a sample with Omni OpenMP


Parallel benchmarks using OpenMP: MediaBench2
• MediaBench2
  - Mpeg2encode
    - The version parallelized using OpenMP by Hotta et al.
    - Input file: input_base_4CIF_96bps_15.par
  - JPEG2000encode
    - Parallelized using OpenMP by ourselves
    - Command line: -f input_base_4CIF.ppm -F output_base_4CIF_96bps.jp2 -T jp2 -O rate=0.010416667


Speedup of Susan smoothing
[Chart: speedup vs. number of cores (1-4) for M32700, MPCore, RP1, and Q6600]
• High parallelism: 3.4x to 3.7x faster with 4 cores

Execution time of Susan smoothing
[Chart: execution time in seconds vs. number of cores (1-4) for M32700, MPCore, RP1, and Q6600; annotated ratios on the chart: 24x-25x and 10x]

Power consumption and power efficiency
• M32700: 800mW (cores only)
• MPCore: 355mW x 4 = 1.4W (cores only)
• RP1: 0.6mW/MHz x 600MHz x 4 = 1.4W (cores only)
• Q6600: TDP (Thermal Design Power) 105W / 95W
→ RP1 achieves several times higher performance per watt than the Q6600.


Speedup of NPB IS
• CLASS=W: the largest problem size acceptable to all processors
[Chart: speedup vs. number of cores (1-4) for M32700, MPCore, RP1, and Q6600, with annotated speedups of 1.12x, 1.6x, 2.1x, and 2.0x; Table 3 values as extracted (labels lost): 1.81, 1.11, 0.93, 0.91, 0.034, 0.019, 0.020, 0.016]
• IS is a memory-intensive application, so memory access speed has a major impact.
• M32700: the cache size is too small.
• MPCore: DRAM access is too slow.
• RP1: the dedicated snoop bus works effectively.

Speedup of CG
[Chart: speedup vs. number of cores (1-4) for M32700, MPCore, RP1, and Q6600, with annotated speedups of 3.8x, 2.8x, and 2.0x]
• CG is computation intensive: floating-point calculations dominate the execution time.
• The L2 cache is shared among the 4 cores (MPCore).

Speedup of Mpeg2encode
• On MPCore, mpeg2encode with OpenMP did not run.
[Chart: speedup vs. number of cores (1-4) for M32700, RP1, and Q6600, with annotated speedups of 2.2x, 1.6x, and 1.5x]
• The overhead of file operations via NFS limits the speedup.
• FP calculations dominate the execution time.

Speedup of JPEG2000encode
[Chart: speedup vs. number of cores (1-4) for M32700, RP1, and Q6600, with annotated speedups of 2.3x and 1.4x]

Modification cost for parallelization using OpenMP
Principle: the parallel region should cover the largest section possible in order to reduce the fork-join overhead of thread assignment (see the sketch after the table).

| Application       | Amount of modification             |
|-------------------|------------------------------------|
| susan smoothing   | Add 6 directives                   |
| Blowfish encoding | Add 9 directives & modify 12 lines |
| FFT               | Add 4 directives                   |
| Mpeg2enc          | Add 5 directives & modify 7 lines  |
| JPEG2000enc       | Add 6 directives & modify 9 lines  |
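The sketch below illustrates the principle with hypothetical loop nests (not code from the benchmarks in the table): a single parallel region encloses several work-sharing loops, so threads are forked and joined only once per frame instead of once per loop.

```c
/* Illustration of the "largest possible parallel region" principle.
 * Function and array names are hypothetical, not taken from the
 * benchmarks in the table above. */
void process_frame(float *a, float *b, float *c, int n)
{
    #pragma omp parallel            /* one fork-join for the whole frame */
    {
        #pragma omp for             /* work-sharing loop 1 */
        for (int i = 0; i < n; i++)
            b[i] = a[i] * 0.5f;

        #pragma omp for             /* work-sharing loop 2, no extra fork */
        for (int i = 0; i < n; i++)
            c[i] = b[i] + a[i];
    }   /* threads join here once, instead of after every loop */
}
```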


Comparison between spinlock and mutex lock
• For mpeg2encode on RP1, the speedup saturates with the mutex lock.
[Chart: speedup of mpeg2encode on RP1 vs. number of cores (1-4) with spinlock and with mutex lock; the spinlock reaches about 1.6x]
• The overhead of mutex-based synchronization is large.

Power consumption of mpeg2encode on RP1 with 4 cores
• We evaluate the power consumption using the PowerWatch system developed by Hotta.
• ATX 12V drives the cores, DDR2 memory, etc. (the cores use 1V, the DDR2 memory uses 1.8V); apart from these loads, the ATX 12V line shows almost no variation.
• ATX 5V drives LEDs and similar devices and stays in a constant range (2.0-2.8W), so we exclude it from the reported system power consumption.
• With the mutex lock, cores become idle while waiting for synchronization.

| Energy [Ws] | Cores | Entire system |
|-------------|-------|---------------|
| Mutex       | 129.0 | 271.0         |
| Spin        | 89.9  | 181.8         |

[Chart: power (W) over time (s) for the ATX 12V line and the Core 1V line, with mutex lock and with spinlock]

Conclusions
We evaluated the performance of four multicore processors: the M32700, MPCore, and RP1 for embedded systems, and the Core 2 Quad Q6600 for a desktop PC.
• The effect of OpenMP as a programming method:
  - In most applications, simply inserting several OpenMP directives into the source code improved performance as the number of cores increased.
  - Of course, OpenMP directives are not always easy to apply, and we did not always obtain the results we expected.
• Memory bandwidth and synchronization performance as an SMP multicore processor, and their impact on overall system performance:
  - Multicore processors for embedded systems have higher synchronization cost and slower memory performance than multicore processors for desktop PCs. Nevertheless, in our observation the spinlock mechanism lets embedded multicore processors improve their synchronization performance.
  - Since NPTL is a more sophisticated library, it may solve the problems around the mutex lock and improve power efficiency; however, the Linux kernel for the target system must offer the futex system call.


Future work
• Most multicore processors for embedded systems include fast on-chip memories, and we will consider using these internal memories to speed up synchronization.
  - An extension of the Linux kernel is needed.
• The effect of spinlock synchronization under multiple parallel workloads should be examined.
  - We plan to use Intel's dual-core Atom platform to measure performance and power consumption, comparing LinuxThreads and NPTL.
• Parallel processing using plain OpenMP has difficulty satisfying the real-time constraints of embedded systems; applying OpenMP to embedded systems will require some extensions to the OpenMP directives.


Thank you for your attention!
