THE STATE OF THE ART IN MULTI-CORE PROCESSORS

Mostafa I. Soliman
Computer & System Section, Electrical Engineering Department,
Aswan Faculty of Engineering, South Valley University, Egypt

20/3/2010
Outline

Introduction
Computer Parallelism: ILP, TLP, and DLP
Moore's Law and Current Challenge
How to Exploit the Increasing Number of Transistors
  - Increase the Size of Cache Memories
  - Ultrawide-Issue, Out-of-Order Superscalar Processors
  - Single Chip Multiprocessor
  - Simultaneous Multi-Threading (SMT)
  - Vector-Thread Architecture
  - Virtual Vector Architecture (ViVA)
Multi-Core Hardware
  - Core Count and Complexity
  - Heterogeneity vs. Homogeneity
  - Memory Hierarchy
  - Interconnect
  - Memory Interface
Conclusion
Introduction

The rapid improvement in computer performance has come both from advances in the technology used to build computers and from innovation in computer design.

Processor performance growth:
  - about 25% per year (prior to the mid-1980s)
  - over 50% per year (1986-2002)
  - about 20% per year (since 2002)

Performance growth has slowed due to the power wall, the limited ILP available in programs, and the memory wall. In 2004 Intel canceled its high-performance uniprocessor projects and switched to multiple processors (multi-core) per chip.
Computer Parallelism

Taking advantage of parallelism is one of the most important methods for improving performance. All processors since about 1985 have used pipelining to improve performance by overlapping the execution of instructions.

Beyond simple pipelining, there are three major forms of parallelism, which are not mutually exclusive:
  - instruction-level parallelism (ILP)
  - thread-level parallelism (TLP)
  - data-level parallelism (DLP)

Exploiting ILP was the primary focus of processor designs for about 20 years, starting in the mid-1980s. A small illustration of the three forms is sketched below.
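As a rough, hypothetical illustration (not from the original slides), the C kernel below exposes all three forms at once; the function name and the two output arrays are invented for the example:

    /* Hypothetical sketch of the three forms of parallelism in one kernel. */
    void axpy2(int n, float a, const float *x, const float *y,
               float *u, float *v)
    {
        for (int i = 0; i < n; i++) {
            /* ILP: these two statements are independent, so a superscalar
               core can issue them in the same cycle. */
            u[i] = a * x[i] + y[i];
            v[i] = a * x[i] - y[i];
            /* DLP: iterations are independent, so a compiler can map groups
               of them onto SIMD instructions (e.g., SSE handles 4 floats). */
        }
        /* TLP: the disjoint ranges [0, n/2) and [n/2, n) could run as two
           threads on two cores or two SMT contexts. */
    }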
Moore's Law / Current Challenge

Moore's Law: the number of transistors that can be incorporated on a silicon die doubles roughly every 18 months.

Challenge: how to translate the increasing number of transistors per chip into a correspondingly large increase in computing performance.
[Chart: Intel processor transistor counts on a log scale (1,000 to 1,000,000,000) plotted against year (1970-2010): 4004, 8008, 8080, 8086, 286, 386, 486 DX, Pentium, Pentium II, Pentium III, Pentium 4, Itanium, Itanium 2, Dual-core Itanium 2, and Quad-core Itanium processors.]
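A back-of-the-envelope check on the trend in the chart (my arithmetic, not from the slides): growing from about 2,300 transistors in the 4004 (1971) to roughly 10^9 four decades later takes about 19 doublings, since 2,300 × 2^19 ≈ 1.2 × 10^9. That works out to one doubling every two years or so; the 18-month figure is the more aggressive statement of the same exponential trend.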
Increase the Size of Cache Memories

Reduce cache misses by increasing the size of cache memories to improve hit ratios.
  - This approach is limited by the amount of performance lost to misses in the L2 cache.
  - More than 90% of the die area in the Intel Itanium processor is occupied by caches for hiding memory latency.

See: Intelligent RAM (IRAM): Chips that Remember and Compute.
[Photo: IA-64 Itanium processor cartridge. L3 cache: 30 MB (1.42 billion transistors).]
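To see why larger caches help only up to a point, consider the standard average memory access time relation (the numbers below are illustrative, not from the slides):

    AMAT = hit time + miss rate × miss penalty

With a 2-cycle hit, a 2% miss rate, and a 200-cycle miss penalty, AMAT = 2 + 0.02 × 200 = 6 cycles. Halving the miss rate with a larger, and typically slower (say 3-cycle), cache gives 3 + 0.01 × 200 = 5 cycles: a diminishing return on the extra die area.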
VIRAM: Vector Processor with Integrated Memory

VIRAM is a media-oriented vector processor with an integrated main memory system, proposed to exploit the increased number of transistors. Enormous, high-bandwidth memories (DRAM technology) are placed on the processor die to
  - increase the main memory bandwidth
  - reduce power and energy consumption
  - reduce space and weight
Ultrawide-Issue, Out-of-Order Superscalar Processors

Patt et al. advocated that the best use of billions of transistors is an ultrawide-issue, out-of-order superscalar processor, with the resulting chips interconnected to create a multiprocessor system. With one billion transistors:
  - 60 million transistors allocated to the execution core
  - 240 million to the trace cache
  - 48 million to the branch predictor
  - 32 million to the data caches
  - 640 million to L2 caches
(a total of 1,020 million, roughly the one-billion budget)
Diminishing Returns

ILP can be aggressively exploited by deep pipelines, multiple instruction issue, speculation, and out-of-order instruction execution. Recently, these techniques have reached a point of diminishing returns:
  - increasing design complexity
  - low power efficiency
  - only a limited amount of exploitable ILP in single-threaded code
Single Chip Multiprocessor

Hammond et al. proposed a single chip multiprocessor (multi-core processor) to exploit the increased number of transistors.

With the announcement of multi-core microprocessors from Intel, AMD, IBM, and Sun Microsystems, CMPs have recently expanded from an active area of research to a hot product area.

The programming model has to change to exploit explicit thread-level parallelism (TLP). The required reengineering of existing application codes will likely be as dramatic as the migration from vector HPC systems to massively parallel processors (MPPs) that occurred in the early 1990s.

See: The ManyCore Revolution: Will the HPC Community Lead or Follow?
Simultaneous Multi-Threading (SMT)

Key idea: issue multiple instructions from multiple threads each cycle.

Features:
  - fully exploits both TLP and ILP
  - better performance for a mix of independent programs, for parallelizable programs, and for single-threaded programs
  - the hardware changes to enable SMT are minimal (per-thread program counters, subroutine return stacks, and physical registers)

SMT's performance reaches 6.1 instructions per cycle, compared with 4.3 for MP2, 4.2 for MP4, and 2.7 for a single-threaded superscalar. A sketch of exposing TLP to such hardware follows the diagram below.
[Diagram: issue slots per cycle, color-coded by thread (Threads 1-5), with unutilized slots marked.]
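SMT (and multi-core in general) pays off only when software exposes threads. A minimal, hypothetical POSIX-threads sketch (the function name and the two-way split are illustrative, not from the slides):

    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000
    static double x[N], sum_lo, sum_hi;

    /* Each worker sums half of the array; the OS may schedule the two
       threads onto two SMT contexts of one core, or onto two cores. */
    static void *sum_range(void *arg)
    {
        int hi_half = (arg != NULL);
        double s = 0.0;
        for (int i = hi_half ? N / 2 : 0; i < (hi_half ? N : N / 2); i++)
            s += x[i];
        if (hi_half) sum_hi = s; else sum_lo = s;
        return NULL;
    }

    int main(void)
    {
        for (int i = 0; i < N; i++) x[i] = 1.0;
        pthread_t t;
        pthread_create(&t, NULL, sum_range, (void *)1); /* upper half */
        sum_range(NULL);                                /* lower half here */
        pthread_join(t, NULL);
        printf("%f\n", sum_lo + sum_hi);                /* 1000000.000000 */
        return 0;
    }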
Scale: Vector-Thread Architecture

Krashinsky et al. proposed the Vector-Thread (VT) architecture to support both vector and multithreaded computation, flexibly and compactly encoding application parallelism and locality.

Goals:
  - achieve high performance with low energy and small area
  - take advantage of whatever parallelism and locality is available
  - allow intermixing of multiple levels of parallelism

Scale includes a RISC control processor and a four-lane vector-thread unit; it can execute 16 operations per cycle and supports up to 128 active threads.
[Diagrams: a vector architecture (control processor broadcasting vector control to PE0-PEN over memory), a multithreaded architecture (per-PE thread control), and the vector-thread architecture (control processor issuing vector-fetch commands to virtual processors VP0-VPN, each also capable of thread-fetch). Scale executes 3-11 operations per cycle.]
Maven: Many-Core Vector-Thread Architectures

Batten et al. proposed Maven, which would include tens to hundreds of simple control processors, each with its own single-lane vector-thread unit (VTU). A Maven single-lane VTU is potentially easier to implement and more efficient than Scale's multiple-lane VTU. Maven will extend VT to more general-purpose systems with larger dies and more execution resources.
Virtual Vector Architecture (ViVA)

Gebis et al. proposed ViVA to combine the memory semantics of vector computers with a software-controlled memory in order to hide memory latency. ViVA adds vector-style memory operations to existing microprocessors but does not include arithmetic datapaths; instead, memory instructions work with a new buffer placed between the core and the L2 cache.

ViVA achieved a 2x-13x improvement compared with the scalar version.
Multi-Core Hardware

1. Core Count and Complexity
2. Heterogeneity vs. Homogeneity
3. Memory Hierarchy
4. Interconnect
5. Memory Interface
Multi-Core Hardware

Currently, multi-core processors are the norm for servers as well as desktops and laptops, and for some embedded processors. There are two broad classes:
  - Processors containing a few very powerful cores, essentially the same core one would put in a single-core processor. Examples include the AMD Athlon, Intel Core 2, and IBM POWER5 and POWER6.
  - Processors that trade single-core performance for number of cores, limiting core area and power. Examples include the Tilera TILE64, the Intel Larrabee (64 cores), and the Sun UltraSPARC T1 (8 cores) and T2 (8 cores).
Relative Size and Power Dissipation of Different CPU Cores

Simpler processor cores require far less silicon area and power, with only a modest drop in clock frequency.

Core (market)                        Area       Power    Clock      Area relative to Power5
Power5 (server)                      389 mm²    120 W    1,900 MHz  100%
Intel Core2 sc (laptop)              130 mm²    15 W     1,000 MHz  33%
ARM Cortex A8 (automobiles)          5 mm²      0.8 W    800 MHz    1.3%
Tensilica DP (cell phones/printers)  0.8 mm²    0.09 W   600 MHz    0.2%
Tensilica Xtensa (Cisco router)      0.32 mm²   0.05 W   600 MHz    0.08%
1- Core Count and Complexity

For markets where a substantial fraction of the software is not parallelized, such as desktop systems, the speedup from extra cores is less than linear, so a few copies of the most powerful core that can reasonably be designed are preferred.

If the expected speedup from extra cores is assumed to be linear, core design should follow the KILL rule (Kill If Less than Linear): any architectural feature for performance improvement should be included if and only if it gives a relative speedup that is at least as big as the relative increase in size (or power, or whatever the limiting factor is) of the core. A small numeric illustration follows.
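For instance (illustrative numbers, not from the slides): if widening a core's issue logic grows the core by 30% but speeds it up by only 15%, the KILL rule says to drop the feature, because spending that 1.3x area on additional simple cores could yield up to 1.3x throughput on parallel work, versus only 1.15x from the wider core.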
Intel Multi-Core Processors

Intel Core 2 Duo and Core 2 Quad processors are based on the high-performance and power-efficient Intel Core microarchitecture. The processor core
  - exploits ILP by dynamically executing instructions independent of the program order
  - exploits DLP by executing multimedia (SIMD) instructions
  - exploits TLP by processing multiple threads on multiple cores
SIMD Extensions

  - MMX (MultiMedia eXtension): Pentium II
  - SSE (Streaming SIMD Extensions): Pentium III
  - SSE2, a major enhancement to SSE: Pentium 4
  - SSE3, an incremental upgrade to SSE2: Pentium D & Pentium Dual-Core
  - SSSE3 (Supplemental Streaming SIMD Extension 3): Xeon 5100 & Intel Core 2
  - SSE4, a major enhancement, adding a dot product instruction, …
  - AVX (Advanced Vector Extensions): extends the 128-bit XMM registers to 256-bit registers called YMM0-YMM15 (Sandy Bridge processor, 2010)

A sketch using these instructions appears after the list.
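As a minimal sketch of programming these extensions in C (assuming an x86 compiler, 16-byte-aligned arrays, and a length that is a multiple of 4; the function name is invented for the example), the SAXPY kernel below processes four floats per instruction with SSE intrinsics:

    #include <xmmintrin.h>  /* SSE intrinsics */

    /* SAXPY: y = a*x + y, four floats per iteration in 128-bit XMM registers. */
    void saxpy_sse(int n, float a, const float *x, float *y)
    {
        __m128 va = _mm_set1_ps(a);           /* broadcast a into all 4 lanes */
        for (int i = 0; i < n; i += 4) {
            __m128 vx = _mm_load_ps(&x[i]);   /* load 4 floats from x */
            __m128 vy = _mm_load_ps(&y[i]);   /* load 4 floats from y */
            vy = _mm_add_ps(_mm_mul_ps(va, vx), vy);
            _mm_store_ps(&y[i], vy);          /* store 4 results */
        }
    }

With AVX, the same kernel would use 256-bit YMM registers and process eight floats per instruction.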
Performance Evaluation of Intel Core 2 Quad Processor Using Intel MKL

[Charts: FLOPs/cycle vs. problem size on a 2.66 GHz Intel Core 2 Quad for Intel MKL kernels: Scal, SAXPY, Givens, VMmul, MVmul, AxB, Rank-1, A'xB, AxB', and A'xB'. Inset: block diagram of the quad-core package, in which each pair of cores (architecture state, execution engine, local APIC) shares a 4 MB L2 cache and a bus interface.]
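The kernels named above are standard BLAS operations. A minimal, hypothetical example of invoking one through MKL's CBLAS interface (assuming MKL, or any CBLAS library, is installed and linked):

    #include <stdio.h>
    #include <mkl_cblas.h>  /* with a non-MKL BLAS, include <cblas.h> instead */

    int main(void)
    {
        float x[4] = {1.0f, 2.0f, 3.0f, 4.0f};
        float y[4] = {10.0f, 20.0f, 30.0f, 40.0f};

        /* SAXPY: y = 2*x + y, one of the kernels measured above */
        cblas_saxpy(4, 2.0f, x, 1, y, 1);

        for (int i = 0; i < 4; i++)
            printf("%g ", y[i]);  /* prints: 12 24 36 48 */
        printf("\n");
        return 0;
    }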
The Sun UltraSPARC T1 Processor

T1 is a multi-core multiprocessor introduced by Sun as a server processor. It is almost totally focused on exploiting TLP rather than ILP.
  - Each T1 processor contains eight cores, each supporting four threads.
  - Each core consists of a simple six-stage, single-issue pipeline.
  - T1 uses fine-grained multithreading, switching to a new thread on each clock cycle; threads idle because of a pipeline delay or cache miss are bypassed in the scheduling.
  - A single set of floating-point functional units is shared by all eight cores, as floating-point performance was not a focus for T1.
Sun UltraSPARC T1

  - Average CPI per thread: 6.5
  - Average CPI per core: 1.6
  - Average CPI across 8 cores: 0.2 (average IPC: 5)
  - 62.5% of peak performance
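These figures are mutually consistent (a worked check, not from the slides): with four threads per core, core CPI is 6.5 / 4 ≈ 1.6; with eight single-issue cores, chip-level CPI is 1.6 / 8 = 0.2, i.e. IPC = 5. Since the peak is one instruction per core per cycle (IPC = 8), the chip runs at 5 / 8 = 62.5% of peak.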
Sun UltraSPARC T2 (8 cores x 8 threads)

T2 supports concurrent execution of 64 threads by utilizing eight SPARC cores. The cores communicate via a high-bandwidth crossbar and share a 4 MB, eight-bank L2 cache. Each SPARC core includes two integer execution units and a dedicated floating-point and graphics unit, giving a peak floating-point throughput of 11.2 GFLOPS at 1.4 GHz (8 cores × 1 FLOP/cycle × 1.4 GHz). T2 is a good choice for a range of applications including web servers, database and application servers, HPC, and secure networking. The chip has ~500M transistors on a 342 mm² die, with a power consumption of 84 W at 1.4 GHz.
2- Heterogeneity vs. Homogeneity

In a multi-core chip, the cores could be identical (homogeneous multi-core):
  - cores implementing the same instruction set are simpler to design and to allocate resources for
  - examples: Intel Core 2 and Tilera TILE64

or of more than one kind (heterogeneous):
  - single-ISA heterogeneous multi-core architectures: one instruction set across cores of different size and power
  - cores with different instruction sets, e.g., the Cell processor, where one core implements the PowerPC architecture and 6-8 synergistic processing elements implement a vector instruction set
  - specialized hardware is more area- and energy-efficient
One potentially useful kind of heterogeneity is to have
  - a small number of very fast cores for parts of the computation with limited parallelism, and
  - a large number of simpler cores to exploit abundant parallelism when it is available (apply the KILL rule).
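Amdahl's law (cited in the references) quantifies why the fast cores matter; the numbers here are illustrative, not from the slides. With serial fraction s and N cores, speedup = 1 / (s + (1 - s) / N). For s = 0.1 and N = 100, speedup = 1 / (0.1 + 0.009) ≈ 9.2, so even 100 simple cores are held to under 10x unless a faster core shrinks the serial part.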
TILE64 Processor

  - 8 x 8 grid of identical, general-purpose processor cores (tiles)
  - 3-way VLIW pipeline for instruction-level parallelism
  - 5 MB of total on-chip cache
  - 192 billion 32-bit operations per second at 1 GHz (64 tiles × 3 ops × 1 GHz); with subword arithmetic, 256 billion 16-bit ops, or 0.5 tera-ops for 8-bit
  - 27 Tbps of on-chip mesh interconnect enables linear application scaling
Cell System Architecture

Cell project = IBM + Sony + Toshiba

Goal: improve performance by an order of magnitude over desktop processors.
The Cell Architecture

Some Cell statistics:
  - Clock speed: 4 GHz
  - Peak performance: 256 GFlops (SP), 26 GFlops (DP)
  - Local storage size per SPU: 256 KB
  - Area: 221 mm²
  - Technology: 90 nm
  - Transistors: 234M
3- Memory Hierarchy

First-level caches are typically private to each core and split into instruction and data caches, as in the preceding generation of single-core processors. Early dual-core processors had private per-core second-level caches. Today the main design choice concerns the L2 cache:
  - Some designs continue with separate L2 caches, like the Tilera 64, where each core has a 64 KB L2 cache.
  - L2 caches can be shared between the cores on a chip: Sun T1 (3 MB) and Intel Core 2 Duo (2-6 MB).
Memory Hierarchy (Cont.)

  - Separate L2 caches backed by a shared L3 cache, as in the AMD Phenom processor (512 KB L2 per core, shared 2 MB L3) or the recent Intel Core i7 (256 KB L2 per core, shared 8 MB L3).
  - A hierarchy where L2 caches are shared by subsets of cores:
      Intel Core 2 Quad: each of its two chips has an L2 cache shared between its two cores.
      Rock ("A High-Performance Sparc CMT Processor"): four cores per cluster share a 2 MB L2 cache.
It will be increasingly difficult to maintain a shared L2 cache as the number of cores increases to tens and hundreds.
Intel Core i7 Processor

  - Quad-core supporting HT Technology
  - 3.20 GHz
  - 731M transistors
  - Memory: 25.6 GB/s bandwidth, 64 GB
  - L1: 64 KB; L2: 256 KB; L3: 8 MB
4- Interconnect

The cores on a die must be connected to each other, and there are several possibilities.
1. Classical buses
   - do not scale beyond a limited number of cores
   - long lines give high power consumption and low speed
2. Rings
   - are used in the Cell processor and in the Intel Larrabee
   - better than buses due to lower power, and higher frequencies due to shorter lines
   - latency is linear in the number of cores
3. Crossbars
   - are used in the Sun T1 and T2 processors
   - offer low-latency, high-bandwidth interconnects
   - scale as the square of the number of ports
Interconnect (Cont.)

4. Switched networks (typically 2-D meshes)
   - are used in the Tilera 64
5. Hierarchical interconnects
   - groups of cores are interconnected in some way, and groups of groups are interconnected in a possibly different way
   - e.g., cores could be interconnected in small groups using buses or rings, and those groups could communicate with each other over a mesh

Switched networks and hierarchical interconnects are the main contenders for the future; a rough latency comparison follows.
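To see why, compare average hop counts under uniform traffic for 64 cores (standard textbook estimates, not from the slides): a bidirectional ring averages about N / 4 = 16 hops, while an 8 x 8 mesh averages about 2k / 3 ≈ 5.3 hops for k = 8. A crossbar is a single hop, but it needs on the order of N² switch points, which is why it stops scaling first.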
Intel Larrabee Processor

[Images: Intel Larrabee processor; Sony PlayStation 4]
5- Memory Interface

The memory bandwidth per core will decrease as we move to chips with more cores.
  - Stacking memory chips on top of processor chips and spreading the connections over the area of the chips gives
      significant wire reduction
      better scaling
      improved memory latency
  - Multiple dies are placed on top of each other and connected using Through-Silicon Vias (TSVs), which give dense signal connections between chips and allow, for example, a 1024-bit bus in a small area.
  - Building many-core processor-to-DRAM networks with monolithic CMOS silicon photonics.
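A worked illustration of the trend, using the Core i7 figure quoted earlier (the per-core split is my arithmetic, not from the slides): a fixed 25.6 GB/s memory interface gives 25.6 / 4 = 6.4 GB/s per core on a quad-core, but only 25.6 / 16 = 1.6 GB/s per core if the same interface fed 16 cores. Hence the interest in stacked DRAM and photonic interconnects.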
[Figure: 3D die-stacking options, from the original 2D floorplan, to replacing the original cache with a stacked die of DRAM, to 3D-stacking a second die of SRAM, to stacking a layer of DRAM on top of the original processor.]
Conclusion
Scholarly output submitted by Dr. Mostafa Ibrahim Soliman for the academic title of Associate Professor in Computer Engineering.
Thank You
More Details

J. Shalf et al., The Many-Core Revolution: Will HPC Lead or Follow?, SciDAC Review, No. 14, pp. 40-49, Fall 2009.
J. Ahn et al., Future Scaling of Processor-Memory Interfaces, Proceedings of the 22nd International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), Portland, OR, November 2009.
Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 1: Basic Architecture, www.intel.com/products/processor/manuals, September 2009.
C. Batten et al., Building Many-core Processor-to-DRAM Networks with Monolithic CMOS Silicon Photonics, IEEE Micro, Vol. 29, No. 4, pp. 8-21, July/August 2009.
C. Batten, Many-Core Vector-Thread Architectures, presentation available at http://chess.eecs.berkeley.edu/pubs/538.html, March 2009.
J. Byun et al., Performance Analysis of Coarse-Grained Parallel Genetic Algorithms on the Multi-core Sun UltraSPARC T1, Proceedings of IEEE SoutheastCon 2009, pp. 301-306, March 2009.
More Details (Cont.)

T. Ziaja et al., Efficient Array Characterization in the UltraSPARC T2, Proceedings of the 27th IEEE VLSI Test Symposium (VTS '09), pp. 3-8, May 2009.
S. Chaudhry et al., Rock: A High-Performance Sparc CMT Processor, IEEE Micro, Vol. 29, No. 2, pp. 6-16, March/April 2009.
J. Gebis et al., Improving Memory Subsystem Performance using ViVA: Virtual Vector Architecture, Proceedings of the International Conference on Architecture of Computing Systems, Delft, Netherlands, March 2009.
L. Seiler et al., Larrabee: A Many-Core x86 Architecture for Visual Computing, IEEE Micro, Vol. 29, No. 1, pp. 10-21, January/February 2009.
B. Stackhouse et al., A 65nm 2-Billion-Transistor Quad-Core Itanium Processor, IEEE Journal of Solid-State Circuits, Vol. 44, No. 1, pp. 18-31, January 2009.
R. Krashinsky et al., Implementing the Scale Vector-Thread Processor, ACM Transactions on Design Automation of Electronic Systems (TODAES), Vol. 13, No. 3, Article No. 41, July 2008.
More Details (Cont.)

J. Gebis, Low-complexity Vector Microprocessor Extensions, PhD Thesis, University of California, Berkeley, Berkeley, CA, USA, May 2008.
F. Montenegro et al., Tile64 Processor: A 64-core SOC with Mesh Interconnect, Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC 2008) Digest of Technical Papers, Vol. 51, pp. 88-598, February 2008.
K. Faxén et al., Multicore Computing--The State of the Art, http://eprints.sics.se/3546/.
J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann, San Francisco, CA, 4th Edition, 2007.
M. Shah et al., UltraSPARC T2: A Highly-Threaded, Power-Efficient, SPARC SOC, Proceedings of the IEEE Asian Solid-State Circuits Conference (A-SSCC '07), pp. 22-25, November 2007.
H. Le et al., IBM POWER6 Microarchitecture, IBM Journal of Research and Development, Vol. 51, No. 6, pp. 639-662, November 2007.
R. Krashinsky, Vector-Thread Architecture and Implementation, Ph.D. Thesis, Massachusetts Institute of Technology, June 2007.
More Details (Cont.)

D. Wentzlaff et al., On-Chip Interconnection Architecture for the Tile Processor, IEEE Micro, Vol. 27, No. 5, pp. 15-31, September/October 2007.
A. Agarwal and M. Levy, The KILL Rule for Multicore, Proceedings of the Design Automation Conference (DAC), pp. 750-753, June 2007.
G. Loh et al., Processor Design in 3D Die-Stacking Technologies, IEEE Micro, Vol. 27, No. 3, pp. 31-48, May/June 2007.
A. Leon et al., The UltraSPARC T1: A Power-Efficient High-Throughput 32-Thread SPARC Processor, IEEE Journal of Solid-State Circuits, Vol. 42, No. 1, pp. 7-16, January 2007.
M. Gschwind et al., Synergistic Processing in Cell's Multi-core Architecture, IEEE Micro, Vol. 26, No. 2, pp. 10-24, March/April 2006.
J. Kahle et al., Introduction to the Cell Multiprocessor, IBM Journal of Research and Development, Vol. 49, No. 4/5, pp. 589-604, 2005.
R. Krashinsky et al., The Vector-Thread Architecture, IEEE Micro, Vol. 24, No. 6, pp. 84-90, November/December 2004.
More Details (Cont.)

D. Burger and J. Goodman, Billion-Transistor Architectures: There and Back Again, IEEE Computer, Vol. 37, No. 3, pp. 22-28, March 2004.
R. Kumar et al., Single-ISA Heterogeneous Multi-Core Architectures for Multithreaded Workload Performance, ACM SIGARCH Computer Architecture News, Vol. 32, No. 2, March 2004.
S. Eggers et al., Simultaneous Multithreading: A Platform for Next-Generation Processors, IEEE Micro, Vol. 17, No. 5, pp. 12-19, September/October 1997.
Y. Patt et al., One Billion Transistors, One Uniprocessor, One Chip, IEEE Computer, Vol. 30, No. 9, pp. 51-57, September 1997.
L. Hammond et al., The Stanford Hydra CMP, IEEE Micro, Vol. 20, No. 2, pp. 71-84, March/April 2000.
C. Kozyrakis et al., Scalable Processors in the Billion-Transistor Era: IRAM, IEEE Computer, Vol. 30, No. 9, pp. 75-78, September 1997.
G. Amdahl, Validity of the Single-Processor Approach to Achieving Large Scale Computing Capabilities, Proc. AFIPS 1967 Spring Joint Computer Conference, Atlantic City, New Jersey, AFIPS Press, Vol. 30, pp. 483-485, April 1967.
G. Moore, Cramming More Components onto Integrated Circuits, Electronics, Vol. 38, No. 8, 1965.