Embedded Parallel Computing

Lecture 5: The anatomy of a modern multiprocessor, the multicore processors
Tomas Nordström
Course webpage:
Course responsible and examiner: Tomas Nordström, [email protected]; Room E313; Tel: +46 35 16 7334


Outline Modern Multicore • Symmetrical Multiprocessing (SMP) • Multicore • ARM Multi-core architecture


MIMD: Symmetrical Multiprocessing • A multi-core architecture with Symmetrical Multiprocessing (SMP) is defined by the following characteristics: ‣ The architecture consists of two or more identical CPU cores. ‣ All cores share a common system memory and are controlled by a single operating system. ‣ Each CPU can operate independently on a different workload and, whenever possible, can also share workloads with the other CPUs.


Multicore • [Wikipedia]: A multi-core processor is a single computing component with two or more independent actual processors (called "cores"), which are the units that read and execute program instructions • A many-core processor is a multi-core processor in which the number of cores is large enough that traditional multiprocessor techniques are no longer efficient — largely because of issues with congestion in supplying instructions and data to the many processors

An Intel Core 2 Duo E6750 dual-core processor

Several tens of cores!


Multicore • Symmetric multiprocessing (SMP) designs using discrete CPUs have existed for a long time • Thus the issues of implementing a multicore processor architecture and supporting it with software are well known • Utilizing a proven processing-core design without architectural changes reduces design risk significantly

Single Core Design is Hitting the f.... Wall • Greatly diminished gains in processor performance from increasing the operating frequency. This is due to three primary factors: ‣ The memory wall ‣ The ILP wall ‣ The power wall


Multicore SMP • The proximity of multiple CPU cores on the same die allows the cache coherency circuitry to operate at a much higher clock rate than is possible if the signals have to travel off-chip • Combining equivalent CPUs on a single die significantly improves the performance of cache snoop (also called bus snooping) operations


Example: ARM MPCore


Thread-level Parallelism • For thread-level parallelism, ARM needed to improve exception handling to prepare for the increased complexity in handling multithreading on multiple processors • These requirements added inherent complexity in the interrupt handler, scheduler, and context switch


MPCore Semaphores • Earlier ARM architectures implemented semaphores with the swap (SWP) instruction, which held the external bus until completion. One processor could thus lock every other processor out of the bus for the duration of the swap. Unacceptable! • ARMv6 introduced two new instructions, load-exclusive (LDREX) and store-exclusive (STREX), which take advantage of an exclusive monitor in memory: ‣ LDREX loads a value from memory and sets the exclusive monitor to watch that location ‣ STREX checks the exclusive monitor and, if no other write has taken place to that location, performs the store to memory; it returns a value indicating whether the data was written
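The retry pattern that LDREX and STREX enable can be sketched in software. The Python model below is only an illustration under stated assumptions: the class `ExclusiveMonitorMemory`, its methods, and the one-monitor-per-CPU layout are hypothetical names chosen for this sketch, and a `threading.Lock` stands in for the atomicity the hardware provides for each individual step. The return convention of `strex` (0 for success, 1 for failure) follows the ARM instruction.

```python
import threading


class ExclusiveMonitorMemory:
    """Toy memory with one exclusive monitor per CPU (hypothetical model)."""

    def __init__(self):
        self._mem = {}        # address -> value
        self._monitors = {}   # cpu id -> address currently watched
        self._lock = threading.Lock()  # stands in for hardware atomicity

    def ldrex(self, cpu, addr):
        """Load the value at addr and arm this CPU's monitor on it."""
        with self._lock:
            self._monitors[cpu] = addr
            return self._mem.get(addr, 0)

    def strex(self, cpu, addr, value):
        """Store only if the monitor is still armed: 0 = stored, 1 = failed."""
        with self._lock:
            if self._monitors.get(cpu) != addr:
                return 1
            self._mem[addr] = value
            # A successful store disarms every monitor watching this address.
            for watcher in [c for c, a in self._monitors.items() if a == addr]:
                del self._monitors[watcher]
            return 0

    def read(self, addr):
        with self._lock:
            return self._mem.get(addr, 0)


def atomic_increment(mem, cpu, addr):
    """Load-exclusive, modify, store-exclusive; retry until the store lands."""
    while True:
        old = mem.ldrex(cpu, addr)
        if mem.strex(cpu, addr, old + 1) == 0:
            return old + 1


def run_demo(n_cpus=4, n_increments=1000):
    """Let several simulated CPUs increment one shared counter."""
    mem = ExclusiveMonitorMemory()
    counter_addr = 0x1000  # arbitrary address chosen for the demo

    def worker(cpu):
        for _ in range(n_increments):
            atomic_increment(mem, cpu, counter_addr)

    threads = [threading.Thread(target=worker, args=(cpu,))
               for cpu in range(n_cpus)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return mem.read(counter_addr)
```

Note that no CPU ever holds the bus across the whole read-modify-write: a CPU whose store fails simply retries, which is the property the swap-based approach lacked.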


Physically Tagged Caches • Should the cache be tagged with virtual or physical addresses? • A virtually tagged cache must be flushed every time a context switch takes place, because its entries reflect the old virtual-to-physical translations • In the ARM11 MPCore, the memory management unit logic resides between the level 1 cache and the processor core, so the cache is physically tagged and need not be flushed on a context switch

Atomic Instructions • Traditionally, swap-based and compare-and-exchange-based semaphores have been used to control access to critical data • SMP systems often aim for lock-free synchronization


cmpxchg8b • Many lock-free routines use the Intel cmpxchg8b instruction because it can compare and exchange 8 bytes of data atomically • Typically, this involves 4 bytes for the payload and 4 bytes for a version number that distinguishes payloads that could otherwise have the same value, which guards against the so-called A-B-A problem
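The A-B-A problem and the payload-plus-version remedy can be illustrated with a small single-threaded model. This is a sketch, not the behavior of any particular instruction: the `Cell` class and `aba_demo` function are hypothetical, and the interleaving of "thread 1" and "thread 2" is written out by hand rather than produced by real threads.

```python
class Cell:
    """A memory word with a compare-and-exchange operation (toy model)."""

    def __init__(self, value):
        self.value = value

    def compare_exchange(self, expected, new):
        """Write new only if the current value equals expected."""
        if self.value == expected:
            self.value = new
            return True
        return False


def aba_demo():
    # Plain compare-and-exchange on the payload alone.
    plain = Cell("A")
    seen = plain.value                    # thread 1 reads "A", then is delayed
    plain.compare_exchange("A", "B")      # thread 2 changes A -> B
    plain.compare_exchange("B", "A")      # thread 2 changes B -> A
    # Thread 1 resumes: the value looks untouched, so the exchange succeeds
    # even though the data changed twice in between.
    plain_ok = plain.compare_exchange(seen, "C")

    # Payload and version compared as one unit, as with an 8-byte exchange.
    tagged = Cell(("A", 0))
    seen_tagged = tagged.value            # thread 1 reads ("A", 0)
    for new_payload in ("B", "A"):        # thread 2 again goes A -> B -> A
        payload, version = tagged.value
        tagged.compare_exchange((payload, version),
                                (new_payload, version + 1))
    # Thread 1 resumes: the payload matches but the version does not.
    tagged_ok = tagged.compare_exchange(seen_tagged,
                                        ("C", seen_tagged[1] + 1))
    return plain_ok, tagged_ok, tagged.value
```

In the versioned case thread 1's exchange is rejected, so it learns that it has to reread the data before trying again.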


• The ARM exclusives provide atomicity using the data address rather than the data value, so routines can atomically exchange data without experiencing the A-B-A problem • Exploiting this would, however, require rewriting much of the existing two-word exclusive code • Consequently, ARM added instructions for performing load and store exclusives with various payload sizes, including 8 bytes, thus ensuring the direct portability of existing multithreaded code


Misc MP improvements • Improved access to localized data • Power-conscious spin-locks • Weakly ordered memory consistency


Two Main Enhancements • The ARM11 multiprocessor includes two main SMP enhancements: ‣ Generic Interrupt Controller (GIC), providing interprocessor communication ‣ Snoop Control Unit (SCU), an intelligent memory-communication system providing cache coherence


Cache Coherency • The ARM11 MPCore implements a Snoop Control Unit (SCU) between the processors, operating at CPU frequency • This configuration also provides a very rapid path for data to move directly between the CPUs’ caches


MESI • Modified ‣ The cache line is present only in the current cache, and is dirty; it has been modified from the value in main memory. The cache is required to write the data back to main memory at some time in the future, before permitting any other read of the (no longer valid) main memory state. The write-back changes the line to the Exclusive state. • Exclusive ‣ The cache line is present only in the current cache, but is clean; it matches main memory. It may be changed to the Shared state at any time, in response to a read request. Alternatively, it may be changed to the Modified state when writing to it. • Shared ‣ Indicates that this cache line may be stored in other caches of the machine and is "clean"; it matches the main memory. The line may be discarded (changed to the Invalid state) at any time. • Invalid ‣ Indicates that this cache line is invalid.
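The four states can be exercised with a small simulation of a single cache line shared by several caches. This is a minimal sketch under simplifying assumptions: the class `MesiSystem` and its methods are hypothetical, only one line is tracked, evictions are not modeled, and a remote read of a Modified line is treated as a write-back followed directly by the move to Shared.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


class MesiSystem:
    """Tracks the MESI state of one cache line in each cache (toy model)."""

    def __init__(self, n_caches):
        self.states = [State.INVALID] * n_caches
        self.writebacks = 0  # write-backs of dirty data to main memory

    def _others(self, cache):
        return [i for i in range(len(self.states)) if i != cache]

    def read(self, cache):
        """The given cache reads the line."""
        if self.states[cache] is not State.INVALID:
            return  # read hit: the state does not change
        shared = False
        for other in self._others(cache):
            if self.states[other] is State.MODIFIED:
                self.writebacks += 1  # dirty data must reach memory first
            if self.states[other] is not State.INVALID:
                self.states[other] = State.SHARED
                shared = True
        self.states[cache] = State.SHARED if shared else State.EXCLUSIVE

    def write(self, cache):
        """The given cache writes the line, invalidating all other copies."""
        for other in self._others(cache):
            if self.states[other] is State.MODIFIED:
                self.writebacks += 1
            self.states[other] = State.INVALID
        self.states[cache] = State.MODIFIED

    def snapshot(self):
        """Return the per-cache states as a compact string such as 'MI'."""
        return "".join(state.value for state in self.states)
```

A short usage trace with two caches: after cache 0 reads, the snapshot is "EI"; after cache 0 writes, "MI"; after cache 1 reads, "SS" with one write-back; after cache 1 writes, "IM".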

MOESI • The processor maintains cache coherence with an optimized version of the MESI (modified, exclusive, shared, invalid) protocol. • In addition to the four common MESI protocol states, there is a fifth "Owned" state representing data that is both modified and shared. This avoids the need to write modified data back to main memory before sharing it.
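The benefit of the Owned state can be shown by comparing what happens when other caches read a line that one cache holds in the Modified state. The function below is a deliberately small sketch; its name, arguments, and return shape are hypothetical, and it models only this one transition.

```python
def share_modified_line(protocol, n_readers):
    """One cache holds a line in Modified; n_readers other caches read it.

    Returns (owner_state, reader_states, memory_writebacks).
    """
    if n_readers < 1:
        raise ValueError("need at least one reader")
    if protocol == "MESI":
        # The dirty line is written back to main memory before it is shared,
        # so every copy ends up clean and Shared.
        return "S", ["S"] * n_readers, 1
    if protocol == "MOESI":
        # The owner supplies the data itself and stays responsible for the
        # eventual write-back; main memory is not updated at this point.
        return "O", ["S"] * n_readers, 0
    raise ValueError(f"unknown protocol: {protocol}")
```

For example, `share_modified_line("MOESI", 3)` reports an owner in state "O", three readers in state "S", and no write-back, whereas the MESI variant reports one write-back.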


Interrupt System • Generic Interrupt Controller (GIC) • External interrupts • Internal interrupts ‣ Example: one processor allocates virtual memory -> all others need to update their memory translations -> ARM uses the GIC to quickly signal this between processors


Distributed Interrupt Controller • Masking of interrupts • Prioritization of the interrupts • Distribution of the interrupts to the target MP11 CPUs • Tracking the status of interrupts • Generation of interrupts by software

[Figure: the Distributed Interrupt Controller, its interface, and the MP11 CPUs]
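The five functions listed for the Distributed Interrupt Controller can be sketched as a small software model. Everything named here (`ToyInterruptDistributor`, its methods, the status strings) is hypothetical and greatly simplified; the only convention taken from the ARM GIC is that a lower priority number means a higher priority.

```python
from dataclasses import dataclass


@dataclass
class Interrupt:
    irq: int
    priority: int      # lower number = higher priority (ARM GIC convention)
    targets: tuple     # CPU ids that may receive this interrupt
    enabled: bool = True
    status: str = "inactive"   # inactive -> pending -> active -> inactive


class ToyInterruptDistributor:
    """Hypothetical model of masking, prioritization, distribution,
    status tracking, and software-generated interrupts."""

    def __init__(self):
        self.table = {}

    def configure(self, irq, priority, targets, enabled=True):
        self.table[irq] = Interrupt(irq, priority, tuple(targets), enabled)

    def raise_irq(self, irq):
        """Mark an interrupt pending; serves external and software sources."""
        entry = self.table[irq]
        if entry.status == "inactive":
            entry.status = "pending"

    def acknowledge(self, cpu):
        """Give cpu its highest-priority pending, unmasked interrupt."""
        candidates = [e for e in self.table.values()
                      if e.enabled and e.status == "pending"
                      and cpu in e.targets]
        if not candidates:
            return None
        best = min(candidates, key=lambda e: (e.priority, e.irq))
        best.status = "active"   # no other CPU can take it now
        return best.irq

    def end_of_interrupt(self, irq):
        self.table[irq].status = "inactive"
```

A software-generated interrupt, such as the translation-update signal between processors, would in this model be a `raise_irq` call made by one CPU on an interrupt whose `targets` are the other CPUs.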


NVIDIA Tegra 2


Applications Using MPCore • Frostbite is an example of a game engine that employs job-based parallelism. This engine is used by the popular Battlefield: Bad Company series of games. It is capable of using as many threads as the underlying hardware platform provides. The engine performs the primary Game and Render tasks on the GPU and divides up the other system-related work into jobs. • Each job typically consists of 15K to 200K lines of C++ code, with the average job size being around 25K lines of code. Most of these jobs are independent, while some have interdependencies. • Each frame of the game typically contains two hundred to three hundred jobs, and the engine assigns the jobs to all available hardware cores.
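Job-based parallelism of this kind can be sketched with a worker pool that runs every job whose dependencies have finished. The sketch below is not how Frostbite is implemented: `run_frame`, the job names, and the wave-by-wave scheduling are assumptions made for illustration, and Python threads only stand in for hardware cores.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def run_frame(jobs, deps, workers=None):
    """Run one frame's jobs on a pool of workers.

    jobs: dict mapping a job name to a zero-argument callable.
    deps: dict mapping a job name to the names that must finish first.
    Returns a dict mapping each job name to its result.
    """
    done = {}
    remaining = dict(jobs)
    with ThreadPoolExecutor(max_workers=workers or os.cpu_count() or 1) as pool:
        while remaining:
            # A job is ready once all of its dependencies have completed.
            ready = [name for name in remaining
                     if all(d in done for d in deps.get(name, ()))]
            if not ready:
                raise ValueError("dependency cycle or unknown dependency")
            # Independent jobs in the same wave run in parallel.
            futures = {name: pool.submit(remaining.pop(name))
                       for name in ready}
            for name, future in futures.items():
                done[name] = future.result()
    return done
```

Leaving `workers` unset sizes the pool to the number of cores reported by the operating system, which mirrors the idea of using as many threads as the platform provides. Scheduling in waves is a simplification; a production scheduler would start each job as soon as its own dependencies finish.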


Task Level Parallelism on Frostbite Game Engine

Questions • Study-support questions ‣


Links • Multi-core • SMP - Symmetric Multiprocessor System • ABA problem • MESI • MOESI • Embedded moves to multicore • Goodacre, J.; Sloss, A.N., "Parallelism and the ARM instruction set architecture," IEEE Computer, vol. 38, no. 7, pp. 42-50, July 2005, doi: 10.1109/MC.2005.239 • Goodacre, J., "Details of a New Cortex Processor Revealed, Cortex-A9," presentation at the ARM Developers' Conference, October 2007 • Stevens, A., "Introduction to AMBA 4 ACE," ARM white paper, June 6, 2011
