THE STATE OF THE ART IN

Multi-Core Processors

Mostafa I. Soliman
Computer & System Section, Electrical Engineering Department, Aswan Faculty of Engineering, South Valley University, Egypt

20/3/2010


Outline

- Introduction
  - Computer Parallelism: ILP, TLP, and DLP
  - Moore's Law and the Current Challenge
- How to Exploit the Increasing Number of Transistors
  - Increase the Size of Cache Memories
  - Ultrawide-Issue, Out-of-Order Superscalar Processors
  - Single-Chip Multiprocessor
  - Simultaneous Multi-Threading (SMT)
  - Vector-Thread Architecture
  - Virtual Vector Architecture (ViVA)
- Multi-Core Hardware
  - Core Count and Complexity
  - Heterogeneity vs. Homogeneity
  - Memory Hierarchy
  - Interconnect
  - Memory Interface
- Conclusion

Introduction

- The rapid improvement in computer technology has come both
  - from advances in the technology used to build computers, and
  - from innovation in computer design.
- Processor performance growth:
  - about 25% per year (prior to the mid-1980s)
  - over 50% per year (1986-2002)
  - about 20% per year (since 2002)
- The growth rate has dropped due to
  - the power wall,
  - limited ILP, and
  - the memory wall.
- In 2004 Intel canceled its high-performance uniprocessor projects and switched to multiple processors (multi-core) per chip.

Computer Parallelism

- Taking advantage of parallelism is one of the most important methods for improving performance.
- All processors since about 1985 use pipelining to improve performance by overlapping the execution of instructions.
- Beyond simple pipelining, there are three major forms of parallelism, which are not mutually exclusive:
  - instruction-level parallelism (ILP),
  - thread-level parallelism (TLP), and
  - data-level parallelism (DLP).
- Exploiting ILP was the primary focus of processor designs for about 20 years, starting in the mid-1980s.
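To see how the three forms differ, here is a small sketch (my illustration; the slides contain no code) pointing out where each kind of parallelism appears in one elementwise kernel:

```c
/* Illustration (mine, not from the slides): the three forms of
   parallelism visible in a small kernel, elementwise matrix add. */
void matrix_add(int n, float a[n][n], float b[n][n])
{
    /* TLP: disjoint blocks of rows could run on different threads/cores. */
    for (int i = 0; i < n; i++)
        /* DLP: the same operation is applied across the j elements,
           so a SIMD/vector unit can process several of them at once. */
        for (int j = 0; j < n; j++)
            /* ILP: the loads, add, store, and loop bookkeeping of
               neighboring iterations overlap in a pipelined core.    */
            a[i][j] += b[i][j];
}
```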

Moore's Law / Current Challenge

- Moore's Law: the number of transistors incorporated on a silicon die doubles roughly every 18 months.
- Challenge: how to translate the increasing number of transistors per chip into a correspondingly large increase in computing performance.

[Figure: transistor counts per Intel processor, 1970-2010, on a logarithmic scale from about 1,000 to 1,000,000,000: 4004, 8008, 8080, 8086, 286, 386, 486 DX, Pentium, Pentium II, Pentium III, Pentium 4, Itanium, Itanium 2, dual-core Itanium 2, and quad-core Itanium.]
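As a worked form of the doubling claim (standard compound-growth arithmetic, not from the slides), a doubling period of 18 months means the transistor count after $t$ years is

$$N(t) = N_0 \cdot 2^{t/1.5},$$

so, for example, 15 years of doubling gives $2^{15/1.5} = 2^{10} \approx 1000\times$ the starting count.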

Increase the Size of Cache Memories

- Reduce cache misses by increasing the size of cache memories to improve hit ratios.
  - This approach is limited by the amount of performance lost to misses in the L2 cache.
  - More than 90% of the die area in the Intel Itanium processor is occupied by caches for hiding memory latency; its 30 MB L3 cache alone accounts for 1.42 billion transistors.
- Intelligent RAM (IRAM): chips that remember and compute.

[Figure: IA-64 Itanium processor cartridge]
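The limit of this approach can be made precise with the standard average-memory-access-time model (a textbook formula, not from the slides):

$$\mathrm{AMAT} = \text{hit time} + \text{miss rate} \times \text{miss penalty}.$$

Enlarging a cache lowers the miss rate with diminishing returns, while the miss penalty stays at hundreds of cycles; with a 200-cycle penalty, even a 2% miss rate still adds 4 cycles to every access on average.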

VIRAM: Vector Processor with Integrated Memory

- VIRAM is a media-oriented vector processor with an integrated main memory system to exploit the increased number of transistors.
- Enormous, high-bandwidth memories (DRAM technology) are placed on the processor die to
  - increase the main memory bandwidth,
  - reduce power and energy consumption, and
  - reduce space and weight.

Ultrawide-Issue, Out-of-Order Superscalar Processors

- Patt et al. advocated that the best use of billions of transistors is for ultrawide-issue, out-of-order superscalar processors, with the resulting chips interconnected to create a multiprocessor system.
- With one billion transistors:
  - 60 million transistors allocated to the execution core,
  - 240 million to the trace cache,
  - 48 million to the branch predictor,
  - 32 million to the data caches, and
  - 640 million to L2 caches.

Diminishing Returns

- ILP can be more aggressively exploited by
  - deep pipelines,
  - multiple instruction issue,
  - speculation, and
  - out-of-order instruction execution.
- Recently, these techniques have reached a point of diminishing returns:
  - increasing design complexity,
  - low power efficiency, and
  - only a limited amount of exploitable ILP in single-threaded code.

Single-Chip Multiprocessor

- Hammond et al. proposed a single-chip multiprocessor (multi-core processor) to exploit the increased number of transistors.
- With the announcement of multi-core microprocessors from Intel, AMD, IBM, and Sun Microsystems, CMPs have recently expanded from an active area of research to a hot product area.
- The programming model has to change to exploit explicit thread-level parallelism (TLP); a minimal sketch follows below.
  - The required re-engineering of existing application codes will likely be as dramatic as the migration from vector HPC systems to massively parallel processors (MPPs) that occurred in the early 1990s.

"The ManyCore Revolution: Will the HPC Community Lead or Follow?" (Shalf et al.)
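As a minimal sketch of what "explicit TLP" asks of the programmer (my illustration; the slides contain no code, and the interleaved partitioning scheme is arbitrary), here is a reduction split across POSIX threads:

```c
/* Minimal sketch of explicit TLP: partition a loop across
   POSIX threads, nominally one per core.                  */
#include <pthread.h>
#include <stdio.h>

#define N 1000000
#define NTHREADS 4

static double x[N];
static double partial[NTHREADS];

static void *worker(void *arg)
{
    long id = (long)arg;                     /* thread index      */
    double s = 0.0;
    for (long i = id; i < N; i += NTHREADS)  /* interleaved split */
        s += x[i];
    partial[id] = s;
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (long i = 0; i < N; i++) x[i] = 1.0;
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *)i);
    double sum = 0.0;
    for (long i = 0; i < NTHREADS; i++) {
        pthread_join(t[i], NULL);            /* wait, then combine */
        sum += partial[i];
    }
    printf("sum = %.0f\n", sum);             /* prints: sum = 1000000 */
    return 0;
}
```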

Simultaneous Multi-Threading (SMT)

- Key idea: issue multiple instructions from multiple threads each cycle.
- Features:
  - fully exploits TLP and ILP
  - better performance for
    - a mix of independent programs,
    - programs that are parallelizable, and
    - single-threaded programs.
- The hardware changes needed to enable SMT are minimal: per-thread PCs, subroutine return stacks, and additional physical registers.
- SMT's performance reaches 6.1 instructions per cycle, compared with 4.3 for MP2, 4.2 for MP4, and 2.7 for a single-threaded superscalar.

[Figure: issue-slot utilization over time, showing unutilized slots versus slots filled by threads 1-5.]

Scale: Vector-Thread Architecture

- Krashinsky et al. proposed the vector-thread (VT) architecture to support both vector and multithreaded computation, flexibly and compactly encoding application parallelism and locality.
- Goals:
  - achieve high performance with low energy and small area
  - take advantage of whatever parallelism and locality is available
  - allow intermixing of multiple levels of parallelism.
- Scale includes a RISC control processor and a four-lane vector-thread unit:
  - can execute 16 operations per cycle
  - supports up to 128 active threads.

Vector / Multithreaded Architectures

[Figure: in a vector architecture, a control processor issues vector commands to processing elements PE0-PEN above a shared memory; in a multithreaded architecture, each PE has its own thread control; in the vector-thread architecture, the control processor vector-fetches work to virtual processors VP0-VPN, each of which can also thread-fetch its own instructions. The slide notes executing 3-11 operations per cycle.]

Maven: Many-Core Vector-Thread Architectures

- Batten et al. proposed Maven, which would include tens to hundreds of simple control processors, each with its own single-lane vector-thread unit (VTU).
- A Maven single-lane VTU is potentially easier to implement and more efficient than a Scale multiple-lane VTU.
- Maven will extend VT to more general-purpose systems with larger dies and more execution resources.

Virtual Vector Architecture (ViVA)

- Gebis et al. proposed ViVA to combine the memory semantics of vector computers with a software-controlled memory in order to hide memory latency.
- ViVA adds vector-style memory operations to existing microprocessors but does not include arithmetic datapaths;
  - instead, memory instructions work with a new buffer placed between the core and the L2 cache.
- ViVA achieved a 2x-13x improvement compared with the scalar version.

Multi-Core Hardware

1. Core Count and Complexity
2. Heterogeneity vs. Homogeneity
3. Memory Hierarchy
4. Interconnect
5. Memory Interface

Multi-Core Hardware

- Currently, multi-core processors are the norm for servers as well as desktops and laptops, and some embedded processors.
- There are two broad classes:
  - Processors that contain a few very powerful cores, essentially the same core one would put in a single-core processor.
    - Examples include the AMD Athlon, Intel Core 2, and IBM POWER5 and POWER6.
  - Processors that trade single-core performance for number of cores, limiting core area and power.
    - Examples include the Tilera TILE64, the Intel Larrabee (64 cores), and the Sun UltraSPARC T1 (8 cores) and T2 (8 cores).

Relative Size and Power Dissipation of Different CPU Cores

- Simpler processor cores require far less silicon area and power, with only a modest drop in clock frequency:
  - Power5 (server): 389 mm², 120 W @ 1,900 MHz (100% relative area)
  - Intel Core2 single-core (laptop): 130 mm², 15 W @ 1,000 MHz (33%)
  - ARM Cortex A8 (automobiles): 5 mm², 0.8 W @ 800 MHz (1.3%)
  - Tensilica DP (cell phones/printers): 0.8 mm², 0.09 W @ 600 MHz (0.2%)
  - Tensilica Xtensa (Cisco router): 0.32 mm², 0.05 W @ 600 MHz (0.08%)

1- Core Count and Complexity

- For markets where a substantial fraction of the software is not parallelized, such as desktop systems,
  - speedup from extra cores is less than linear, and
  - a few copies of the most powerful core that can reasonably be designed are preferred.
- If the expected speedup from extra cores is assumed to be linear,
  - core design should follow the KILL rule (Kill If Less than Linear):
  - any architectural feature should be included if and only if it gives a relative speedup that is at least as big as the relative increase in size (or power, or whatever the limiting factor is) of the core.
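Both points can be written compactly (standard formulas, not from the slides). Amdahl's law gives the speedup of a program whose parallelizable fraction is $p$ on $n$ cores, and the KILL rule is an inequality on any candidate core feature:

$$S(n) = \frac{1}{(1-p) + p/n}, \qquad \text{include a feature iff}\ \frac{\Delta \text{speedup}}{\text{speedup}} \ge \frac{\Delta \text{area (or power)}}{\text{area (or power)}}.$$

For example, with $p = 0.5$ even infinitely many cores give $S(\infty) = 2$, which is why desktop-oriented chips favor a few powerful cores.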

Intel Multi-Core Processors

- Intel Core 2 Duo and Core 2 Quad processors are based on the high-performance and power-efficient Intel Core microarchitecture.
- The processor core
  - exploits ILP by dynamically executing instructions independent of the program order,
  - exploits DLP by executing multimedia (SIMD) instructions, and
  - exploits TLP by processing multiple threads on multiple cores.

SIMD Extensions

- MMX: MultiMedia eXtension (Pentium II)
- SSE: Streaming SIMD Extensions (Pentium III)
- SSE2 is a major enhancement to SSE (Pentium 4)
- SSE3 is an incremental upgrade to SSE2 (Pentium D & Pentium Dual-Core)
- SSSE3: Supplemental Streaming SIMD Extension 3 (Xeon 5100 & Intel Core 2)
- SSE4 is a major enhancement, adding a dot product instruction, …
- AVX: Advanced Vector Extensions
  - extends the 128-bit XMM registers to 256-bit registers called YMM0-YMM15
  - Sandy Bridge processor, 2010
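To make the DLP mechanism concrete, here is a minimal sketch using SSE intrinsics (my illustration, not from the slides; `_mm_loadu_ps`, `_mm_add_ps`, and `_mm_storeu_ps` are real SSE intrinsics, while the function and data are arbitrary):

```c
/* Minimal sketch of DLP with SSE intrinsics: add two float
   arrays four elements at a time.                          */
#include <stdio.h>
#include <xmmintrin.h>   /* SSE intrinsics */

void vec_add(const float *a, const float *b, float *c, int n)
{
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);          /* load 4 floats  */
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb)); /* 4 adds at once */
    }
    for (; i < n; i++)                            /* scalar tail    */
        c[i] = a[i] + b[i];
}

int main(void)
{
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[8];
    vec_add(a, b, c, 8);
    printf("%.0f %.0f\n", c[0], c[7]);            /* prints: 9 9    */
    return 0;
}
```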

Performance Evaluation of Intel Core 2 Quad Processor Using Intel MKL

[Figure: block diagram of the 2.66 GHz Intel Core 2 Quad (four cores, each with its own architecture state, execution engine, and local APIC; each pair of cores shares a 4 MB L2 cache and a bus interface) alongside plots of FLOPs/cycle versus problem size for MKL kernels: Scal, SAXPY, and Givens; VMmul and MVmul; rank-1 update; and matrix multiply in its four transpose variants (AxB, A'xB, AxB', A'xB'), the last reaching roughly 28 FLOPs/cycle at large matrix orders.]
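For reference, Scal and SAXPY are Level-1 BLAS kernels. A scalar SAXPY is sketched below (the benchmarked code would call the tuned MKL routine, e.g. `cblas_saxpy`; this sketch just defines the operation):

```c
/* SAXPY (y = a*x + y, single precision), a Level-1 BLAS kernel:
   2 FLOPs per element, and each element is touched once, so for
   large n performance is bound by memory bandwidth, not the ALUs. */
void saxpy(int n, float a, const float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

That one-pass, two-FLOP structure is why Level-1 kernels typically flatten out once the vectors outgrow the caches.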

The Sun UltraSPARC T1 Processor

- T1 is a multi-core multiprocessor introduced by Sun as a server processor.
- It is almost totally focused on exploiting TLP rather than ILP.
- Each T1 processor contains eight cores, each supporting four threads.
- Each core consists of a simple six-stage, single-issue pipeline.
- T1 uses fine-grained multithreading:
  - switching to a new thread on each clock cycle
  - threads idle because of a pipeline delay or cache miss are bypassed in the scheduling.
- A single set of floating-point functional units is shared by all eight cores, as floating-point performance was not a focus for T1.

Sun UltraSPARC T1

- Average CPI per thread: 6.5
- Average CPI per core: 1.6
- Average CPI across all 8 cores: 0.2
- Average IPC across all 8 cores: 5
- 62.5% of peak performance
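These figures are mutually consistent. A quick check, assuming four threads per core and a single-issue peak of one instruction per core per cycle (8 per chip):

$$\mathrm{CPI}_{\text{core}} = \frac{6.5}{4} \approx 1.6, \quad \mathrm{CPI}_{\text{chip}} = \frac{1.6}{8} = 0.2, \quad \mathrm{IPC}_{\text{chip}} = \frac{1}{0.2} = 5, \quad \frac{5}{8} = 62.5\%.$$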

Sun UltraSPARC T2 (8 cores x 8 threads)

- T2 supports concurrent execution of 64 threads by utilizing eight SPARC cores.
- The cores communicate via a high-bandwidth crossbar and share a 4 MB, eight-bank L2 cache.
- Each SPARC core includes two integer execution units and a dedicated floating-point and graphics unit, which delivers a peak floating-point throughput of 11.2 GFLOPS at 1.4 GHz.
- T2 is a good choice for a range of applications including web servers, database and application servers, HPC, and secure networking.
- The chip has ~500M transistors on a 342 mm² die, with a power consumption of 84 W at 1.4 GHz.
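The peak figure follows directly from the core count (an arithmetic note, assuming one floating-point operation per FPU per cycle):

$$8\ \text{FPUs} \times 1.4\ \mathrm{GHz} \times 1\ \mathrm{FLOP/cycle} = 11.2\ \mathrm{GFLOPS}.$$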

2- Heterogeneity vs. Homogeneity

In a multi-core chip, the cores could be:
- identical (homogeneous multi-core)
  - cores implementing the same instruction set
  - simpler to design and to allocate resources for
  - examples: Intel Core 2 and Tilera TILE64
- of more than one kind (heterogeneous)
  - single-ISA heterogeneous multi-core architectures: one instruction set, cores of differing size and power
  - cores with different instruction sets, as in the Cell processor, where one core implements the PowerPC architecture and 6-8 synergistic processing elements implement a vector instruction set
  - specialized hardware is more area- and energy-efficient.

One potentially useful kind of heterogeneity is to have
- a small number of very fast cores for parts of the computation with limited parallelism, and
- a large number of simpler cores to exploit abundant parallelism when it is available (apply the KILL rule).

TILE64 Processor

- 8 x 8 grid of identical, general-purpose processor cores (tiles)
- 3-way VLIW pipeline for instruction-level parallelism
- 5 MB of total on-chip cache
- 192 billion 32-bit operations per second at 1 GHz
  - by subword arithmetic: 256 billion 16-bit ops/s, or 0.5 tera-ops/s for 8-bit data
- 27 Tbps of on-chip mesh interconnect enables linear application scaling
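The headline rate is just the product of the machine's dimensions (an arithmetic note):

$$64\ \text{tiles} \times 3\ \text{VLIW slots} \times 1\ \mathrm{GHz} = 192 \times 10^9\ \text{ops/s},$$

with subword SIMD on 16-bit and 8-bit data raising the count further.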

Cell System Architecture

- Cell project = IBM + Sony + Toshiba
- Goal: improve performance by an order of magnitude over desktop processors.

The Cell Architecture

Some Cell statistics:
- Clock speed: 4 GHz
- Peak performance: 256 GFLOPS single precision, 26 GFLOPS double precision
- Local storage size per SPU: 256 KB
- Area: 221 mm²
- Technology: 90 nm
- Transistors: 234M
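The single-precision peak is consistent with each SPE completing a 4-wide fused multiply-add (8 FLOPs) per cycle (an arithmetic note, counting all eight SPEs):

$$8\ \text{SPEs} \times 8\ \mathrm{FLOPs/cycle} \times 4\ \mathrm{GHz} = 256\ \mathrm{GFLOPS}.$$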

3- Memory Hierarchy

- First-level caches are typically private to each core and split into instruction and data caches, as in the preceding generation of single-core processors.
- Early dual-core processors had private per-core second-level caches.
- Now the design choices center on the L2 cache:
  - Some designs continue with separate L2 caches, like the Tilera TILE64, where each core has a 64 KB L2 cache.
  - L2 caches can be shared between the cores on a chip, as in the Sun T1 (3 MB) and Intel Core 2 Duo (2-6 MB).

Memory Hierarchy (Cont.)

- Separate L2 caches can be backed by a shared L3 cache, as in the AMD Phenom (512 KB L2 per core, shared 2 MB L3) or the recent Intel Core i7 (256 KB L2 per core, shared 8 MB L3).
- A hierarchy where L2 caches are shared by subsets of cores:
  - Intel Core 2 Quad: each of the two dies has an L2 cache shared between its two cores.
  - Rock ("A High-Performance SPARC CMT Processor"): each four-core cluster shares a 2 MB L2 cache.

It will be increasingly difficult to maintain a shared L2 cache as the number of cores increases to tens and hundreds.

Intel Core i7 Processor

- Quad-core, supporting Hyper-Threading (HT) Technology
- 3.20 GHz, 731M transistors
- Memory: 25.6 GB/s bandwidth, up to 64 GB
- Caches: L1 64 KB per core, L2 256 KB per core, L3 8 MB shared

4- Interconnect

The cores on a die must be connected to each other, and there are several possibilities (a toy scaling model follows this list):

1. Classical buses
   - do not scale beyond a limited number of cores
   - long lines give high power consumption and low speed
2. Rings
   - are used in the Cell processor and in the Intel Larrabee
   - better than buses due to lower power, and higher frequencies due to shorter lines
   - latency is linear in the number of cores
3. Crossbars
   - are used in the Sun T1 and T2 processors
   - offer low-latency, high-bandwidth interconnect
   - cost scales as the square of the number of ports
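A toy cost model makes these scaling differences concrete (my own illustration with simplified constants, not from the slides):

```c
/* Toy interconnect scaling model (illustrative constants only):
   ring worst-case hops ~ n/2, 2-D mesh hops ~ 2*sqrt(n) (corner-
   to-corner Manhattan distance), crossbar switch area ~ n^2.    */
#include <math.h>
#include <stdio.h>

int main(void)
{
    for (int n = 4; n <= 256; n *= 4)
        printf("%3d cores: ring hops ~%3d, mesh hops ~%3.0f, "
               "crossbar area ~%6d\n", n, n / 2, 2.0 * sqrt(n), n * n);
    return 0;
}
```

The output shows why the conclusion on the next slide holds: by 256 cores the crossbar term (65,536) dwarfs the mesh term (32).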

Interconnect (Cont.)

4. Switched networks (typically 2-D meshes)
   - are used in the Tilera TILE64
5. Hierarchical interconnects, where groups of cores are interconnected in some way and groups of groups are interconnected in a possibly different way
   - e.g., cores could be interconnected in small groups using buses or rings, and those groups could communicate with each other over a mesh.

Switched networks and hierarchical interconnects are the main contenders for the future.

[Figure: Intel Larrabee processor; Sony PlayStation 4]

5- Memory Interface

- The memory bandwidth per core will decrease as we move to chips with more cores. Two emerging remedies:
- Stacking memory chips on top of processor chips and spreading the connections over the area of the chips:
  - significant wire reduction and better scaling
  - improved memory latency
  - multiple dies are placed on top of each other and connected using Through-Silicon Vias (TSVs), which give dense signal connections between chips and allow for a 1024-bit bus in a small area.
- Building many-core processor-to-DRAM networks with monolithic CMOS silicon photonics.
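The appeal of a 1024-bit TSV bus is easy to quantify (arithmetic with an assumed clock; the slides give no frequency): at 1 GHz, transferring 1024 bits per cycle is

$$1024\ \text{bits} \times 10^9\ /\mathrm{s} = 128\ \mathrm{GB/s},$$

several times the 25.6 GB/s of the Core i7's off-chip memory interface cited earlier.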

[Figure: die-stacking options: the original 2D floorplan; 3D-stacking a second die of SRAM; stacking a layer of DRAM on top of the original processor; replacing the original cache with a stacked die of DRAM.]

Conclusion

Scientific production submitted by Dr. Mostafa Ibrahim Soliman for the academic title of Associate Professor in Computer Engineering.

Thank You


More Details

- J. Shalf et al., The ManyCore Revolution: Will HPC Lead or Follow?, SciDAC Review, No. 14, pp. 40-49, Fall 2009.
- J. Ahn et al., Future Scaling of Processor-Memory Interfaces, Proceedings of the 22nd International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), Portland, OR, November 2009.
- Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 1: Basic Architecture, www.intel.com/products/processor/manuals, September 2009.
- C. Batten et al., Building Many-core Processor-to-DRAM Networks with Monolithic CMOS Silicon Photonics, IEEE Micro, Vol. 29, No. 4, pp. 8-21, July/August 2009.
- C. Batten, Many-Core Vector-Thread Architectures, presentation available at http://chess.eecs.berkeley.edu/pubs/538.html, March 2009.
- J. Byun et al., Performance Analysis of Coarse-Grained Parallel Genetic Algorithms on the Multi-core Sun UltraSPARC T1, Proceedings of IEEE SoutheastCon 2009, pp. 301-306, March 2009.

More Details (Cont.)

- T. Ziaja et al., Efficient Array Characterization in the UltraSPARC T2, Proceedings of the 27th IEEE VLSI Test Symposium (VTS '09), pp. 3-8, May 2009.
- S. Chaudhry et al., Rock: A High-Performance Sparc CMT Processor, IEEE Micro, Vol. 29, No. 2, pp. 6-16, March/April 2009.
- J. Gebis et al., Improving Memory Subsystem Performance using ViVA: Virtual Vector Architecture, Proceedings of the International Conference on Architecture of Computing Systems, Delft, Netherlands, March 2009.
- L. Seiler et al., Larrabee: A Many-Core x86 Architecture for Visual Computing, IEEE Micro, Vol. 29, No. 1, pp. 10-21, January/February 2009.
- B. Stackhouse et al., A 65nm 2-Billion-Transistor Quad-Core Itanium Processor, IEEE Journal of Solid-State Circuits, Vol. 44, No. 1, pp. 18-31, January 2009.
- R. Krashinsky et al., Implementing the Scale Vector-Thread Processor, ACM Transactions on Design Automation of Electronic Systems (TODAES), Vol. 13, No. 3, Article 41, July 2008.

More Details (Cont.)

- J. Gebis, Low-complexity Vector Microprocessor Extensions, PhD Thesis, University of California, Berkeley, CA, USA, May 2008.
- F. Montenegro et al., Tile64 Processor: A 64-core SOC with Mesh Interconnect, Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC 2008) Digest of Technical Papers, Vol. 51, pp. 88-598, February 2008.
- K. Faxén et al., Multicore Computing: The State of the Art, http://eprints.sics.se/3546/.
- J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, 4th Edition, Morgan Kaufmann, San Francisco, CA, 2007.
- M. Shah et al., UltraSPARC T2: A Highly-Threaded, Power-Efficient, SPARC SOC, Proceedings of the IEEE Asian Solid-State Circuits Conference (A-SSCC '07), pp. 22-25, November 2007.
- H. Le et al., IBM POWER6 Microarchitecture, IBM Journal of Research and Development, Vol. 51, No. 6, pp. 639-662, November 2007.
- R. Krashinsky, Vector-Thread Architecture and Implementation, PhD Thesis, Massachusetts Institute of Technology, June 2007.

More Details (Cont.)

- D. Wentzlaff et al., On-Chip Interconnection Architecture of the Tile Processor, IEEE Micro, Vol. 27, No. 5, pp. 15-31, September/October 2007.
- A. Agarwal and M. Levy, The KILL Rule for Multicore, Proceedings of the Design Automation Conference (DAC), pp. 750-753, June 2007.
- G. Loh et al., Processor Design in 3D Die-Stacking Technologies, IEEE Micro, Vol. 27, No. 3, pp. 31-48, May/June 2007.
- A. Leon et al., The UltraSPARC T1: A Power-Efficient High-Throughput 32-Thread SPARC Processor, IEEE Journal of Solid-State Circuits, Vol. 42, No. 1, pp. 7-16, January 2007.
- M. Gschwind et al., Synergistic Processing in Cell's Multi-core Architecture, IEEE Micro, Vol. 26, No. 2, pp. 10-24, March/April 2006.
- J. Kahle et al., Introduction to the Cell Multiprocessor, IBM Journal of Research and Development, Vol. 49, No. 4/5, pp. 589-604, 2005.
- R. Krashinsky et al., The Vector-Thread Architecture, IEEE Micro, Vol. 24, No. 6, pp. 84-90, November/December 2004.

More Details (Cont.)

- D. Burger and J. Goodman, Billion-Transistor Architectures: There and Back Again, IEEE Computer, Vol. 37, No. 3, pp. 22-28, March 2004.
- R. Kumar et al., Single-ISA Heterogeneous Multi-Core Architectures for Multithreaded Workload Performance, ACM SIGARCH Computer Architecture News, Vol. 32, No. 2, March 2004.
- S. Eggers et al., Simultaneous Multithreading: A Platform for Next-Generation Processors, IEEE Micro, Vol. 17, No. 5, pp. 12-19, September/October 1997.
- Y. Patt et al., One Billion Transistors, One Uniprocessor, One Chip, IEEE Computer, Vol. 30, No. 9, pp. 51-57, September 1997.
- L. Hammond et al., The Stanford Hydra CMP, IEEE Micro, Vol. 20, No. 2, pp. 71-84, March/April 2000.
- C. Kozyrakis et al., Scalable Processors in the Billion-Transistor Era: IRAM, IEEE Computer, Vol. 30, No. 9, pp. 75-78, September 1997.
- G. Amdahl, Validity of the Single-Processor Approach to Achieving Large Scale Computing Capabilities, Proceedings of the AFIPS 1967 Spring Joint Computer Conference, Atlantic City, NJ, AFIPS Press, Vol. 30, pp. 483-485, April 1967.
- G. Moore, Cramming More Components onto Integrated Circuits, Electronics, Vol. 38, No. 8, 1965.
