CSC 203 1.5
Computer System Architecture
By Budditha Hettige, Department of Statistics and Computer Science, University of Sri Jayewardenepura (2011)
Course Outline
• Course type: Core
• Credit value: 1.5
• Duration: 22 lecture hours
• Pre-requisites: CSC 106 2.0
Course Contents
• Introduction and Historical Developments
  – Historical system development
  – Processor families
• Computer Architecture and Organization
  – Instruction Set Architecture (ISA)
  – Microarchitecture
  – System architecture
  – Processor architecture
  – Processor structures
• Interfacing and I/O Strategies
  – I/O fundamentals, interrupt mechanisms, buses
Course Contents (contd.)
• Memory Architecture
  – Primary memory, cache memory, secondary memory
• Functional Organization
  – Instruction pipelining
  – Instruction-level parallelism (ILP)
  – Superscalar architectures
  – Processor and system performance
• Multiprocessing
  – Amdahl's law
  – Short vector processing
  – Multi-core multithreaded processors
Introduction
What is a Computer?
• A machine that can solve problems for people by carrying out the instructions given to it
• A sequence of instructions is called a program
• The language the machine can understand is called machine language
What is Machine Language?
• Machine language (ML) is a system of instructions and data executed directly by a computer's Central Processing Unit
• The codes are strings of 0s and 1s, or binary digits ("bits")
• Instructions typically use some bits to represent
  – the operation (e.g., addition)
  – the operands, or
  – the location of the next instruction
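The split into operation and operand bits can be made concrete with a small sketch. This assumes a hypothetical 8-bit instruction word whose high nibble is the opcode and whose low nibble is the operand; the encoding and mnemonics are invented for illustration, loosely modeled on the toy CPU used later in these slides.

```python
# Hypothetical 8-bit instruction word: high 4 bits = opcode, low 4 bits = operand.
# This format is an illustration, not any real machine's encoding.

OPCODES = {0b0001: "LOAD A", 0b0010: "LOAD B", 0b1000: "ADD"}

def decode(word):
    """Split an 8-bit word into its opcode and operand fields."""
    opcode = (word >> 4) & 0b1111   # high nibble selects the operation
    operand = word & 0b1111         # low nibble is the operand/address
    return OPCODES.get(opcode, "UNKNOWN"), operand

op, operand = decode(0b00010110)
print(op, operand)   # LOAD A 6
```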
Machine Language (contd.)
• Advantages
  – The machine can execute it directly (electronic circuits)
  – High speed
• Disadvantages
  – Hard for humans to read and write
  – Machine dependent (hardware dependent)
More on Machines
• A machine defines a language
  – The set of instructions carried out by the machine
• A language defines a machine
  – The machine that can execute every program written in that language
Two-Level Machine
• This machine supports only the new language (L1) and the machine language (L0)
• [Diagram: the virtual machine for L1 sits on top of the real machine for L0, connected by a translator or interpreter]
Translation (L1 → L0)
1. Each instruction written in L1 is replaced by an equivalent sequence of L0 instructions
2. The machine then executes the new L0 program
3. The program that performs this conversion is called a compiler (translator)
Interpretation
• Each instruction in L1 is executed directly by carrying out the equivalent L0 instructions
• The program that does this is called an interpreter
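The difference between the two strategies can be sketched in a few lines. This is a toy model only: the L1 instructions (`INC`, `DOUBLE`) and the L0 primitives are invented here; the point is that a translator converts the whole program up front, while an interpreter carries out each instruction as it is met.

```python
# Toy L1 language with two instructions; "L0" is a list of primitive ops.
L1_PROGRAM = ["INC", "INC", "DOUBLE"]

def translate(program):
    """Compiler: rewrite every L1 instruction into L0 steps before running."""
    table = {"INC": [("ADD", 1)], "DOUBLE": [("MUL", 2)]}
    l0 = []
    for instr in program:
        l0.extend(table[instr])     # whole program converted up front
    return l0

def run_l0(l0, acc=0):
    for op, n in l0:
        acc = acc + n if op == "ADD" else acc * n
    return acc

def interpret(program, acc=0):
    """Interpreter: execute each L1 instruction immediately, one at a time."""
    for instr in program:
        acc = run_l0(translate([instr]), acc)
    return acc

print(run_l0(translate(L1_PROGRAM)))  # translate-then-execute → 4
print(interpret(L1_PROGRAM))          # direct interpretation → 4
```

Both routes give the same result; they differ only in *when* the L1-to-L0 conversion happens.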
Multilevel Machine
• High-level language program (C, C++)
• Assembly language program
• Machine language
Multilevel Machine
• Virtual machine Ln
• Virtual machine Ln-1
• …
• Machine language L0
Six-Level Machine
• A computer designed with up to six levels of computer architecture
Digital Logic Level
• The interesting objects at this level are gates
• Each gate has one or more digital inputs (0 or 1)
• Each gate is built of at most a handful of transistors
• A small number of gates can be combined to form a 1-bit memory, which can store a 0 or 1
• The 1-bit memories can be combined in groups of, for example, 16, 32 or 64 to form registers
• Each register can hold a single binary number up to some maximum
• Gates can also be combined to form the main computing engine itself
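The step from gates to a 1-bit memory can be simulated directly. The sketch below models an SR latch built from two cross-coupled NOR gates (a standard construction, simplified here: the feedback loop is resolved by iterating until the outputs settle).

```python
# Gates as tiny functions; a 1-bit memory as two cross-coupled NOR gates.

def NOT(a):
    return 1 - a

def NOR(a, b):
    return NOT(a | b)

def sr_latch(s, r, q=0):
    """SR latch from cross-coupled NORs: iterate until the outputs settle."""
    qn = NOT(q)
    for _ in range(4):        # a few passes are enough to stabilize
        q = NOR(r, qn)
        qn = NOR(s, q)
    return q

q = sr_latch(s=1, r=0)        # set → the latch stores 1
q = sr_latch(s=0, r=0, q=q)   # hold → still 1 (this is the "memory")
print(q)                      # 1
q = sr_latch(s=0, r=1, q=q)   # reset → stores 0
print(q)                      # 0
```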
Microarchitecture Level
• A collection of 8-32 registers that form a local memory, and a circuit called an ALU (Arithmetic Logic Unit) capable of performing simple arithmetic operations
• The registers are connected to the ALU to form a data path over which the data flow
• The basic operation of the data path consists of selecting one or two registers and having the ALU operate on them
• On some machines the operation of the data path is controlled by a program called a microprogram; on other machines it is controlled by hardware
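A minimal model of that data path, under stated assumptions: a three-register file (the names `R0`-`R2` and the value 8-32 register range are illustrative, not any real machine), an ALU supporting a few operations, and one cycle = select registers, operate, write back.

```python
# Minimal data-path sketch: a small register file feeding an ALU.

registers = {"R0": 7, "R1": 5, "R2": 0}

def alu(op, a, b):
    """The ALU performs simple arithmetic/logical operations."""
    ops = {"ADD": a + b, "SUB": a - b, "AND": a & b}
    return ops[op]

def datapath_cycle(op, src1, src2, dst):
    """One basic data-path operation: select registers, run the ALU, write back."""
    a, b = registers[src1], registers[src2]   # register selection
    registers[dst] = alu(op, a, b)            # ALU result written back

datapath_cycle("ADD", "R0", "R1", "R2")
print(registers["R2"])   # 12
```

In a microprogrammed machine, the sequence of `datapath_cycle` calls would itself be driven by a microprogram; in a hardwired machine, by control logic.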
Data Path
Instruction Set Architecture Level
• The ISA level is defined by the machine's instruction set
• This is the set of instructions carried out interpretively by the microprogram or directly by the hardware
Operating System Level
• Provides a different memory organization, a new set of instructions, and the ability to run one or more programs concurrently
• Level 3 instructions identical to level 2's are carried out directly by the microprogram (or hardwired control), not by the OS
• In other words, some level 3 instructions are interpreted by the OS and some are interpreted directly by the microprogram
• This level is therefore a hybrid
Assembly Language Level
• This level is really a symbolic form for one of the underlying languages
• It provides a method for people to write programs for levels 1, 2 and 3 in a form that is not as unpleasant as the virtual machine languages themselves
• Programs in assembly language are first translated to level 1, 2 or 3 language and then interpreted by the appropriate virtual or actual machine
• The program that performs the translation is called an assembler
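An assembler in miniature: the sketch below translates symbolic mnemonics into numeric machine words for a hypothetical machine with 4-bit opcodes and 4-bit operands (the mnemonics and encoding are invented here, echoing the toy CPU used later in these slides).

```python
# Toy assembler: each source line "MNEMONIC operand" becomes one 8-bit word.
# Opcode table and encoding are illustrative, not a real ISA.

OPCODES = {"LOADA": 0b0001, "LOADB": 0b0010, "ADD": 0b1000, "STC": 0b0111}

def assemble(lines):
    """Translate symbolic assembly into machine words (opcode<<4 | operand)."""
    words = []
    for line in lines:
        mnemonic, _, operand = line.partition(" ")
        words.append((OPCODES[mnemonic] << 4) | int(operand or "0", 2))
    return words

program = ["LOADA 0110", "LOADB 0111", "ADD", "STC 1000"]
print([f"{w:08b}" for w in assemble(program)])
```

The output words could then be executed by the machine (or a simulator) directly; the assembler's only job is the symbolic-to-binary translation described above.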
Between Levels 3 and 4
• The lower three levels are not for the average programmer
  – Instead they are primarily for running the interpreters and translators needed to support the higher levels
• These are written by system programmers who specialise in developing new virtual machines
• Levels 4 and above are intended for the applications programmer
• Levels 2 and 3 are always interpreted; levels 4 and above are usually, but not always, supported by translation
Problem-Oriented Language Level
• This level usually consists of languages designed to be used by applications programmers
• These languages are generally called higher-level languages
• Some examples: Java, C, BASIC, LISP, Prolog
• Programs written in these languages are generally translated to level 3 or 4 by translators known as compilers, although occasionally they are interpreted instead
Multilevel Machines: Hardware
• Programs written in a computer's true machine language (level 1) can be directly executed by the computer's electronic circuits (level 0), without any intervening interpreters or translators
• These electronic circuits, along with the memory and input/output devices, form the computer's hardware
• Hardware consists of tangible objects:
  – integrated circuits
  – printed circuit boards
  – cables
  – power supplies
  – memories
  – printers
• Hardware is not abstract ideas, algorithms, or instructions
Multilevel Machines: Software
• Software consists of algorithms (detailed instructions telling how to do something) and their computer representations, namely programs
• Programs can be stored on hard disk, floppy disk, CD-ROM, or other media, but the essence of software is the set of instructions that makes up the programs, not the physical media on which they are recorded
• In the very first computers, the boundary between hardware and software was crystal clear
• Over time, however, it has blurred considerably, primarily due to the addition, removal, and merging of levels as computers have evolved
• Hardware and software are logically equivalent
The Hardware/Software Boundary
• Any operation performed by software can also be built directly into the hardware
• Any instruction executed by the hardware can also be simulated in software
• The decision to put certain functions in hardware and others in software is based on such factors as:
  – cost
  – speed
  – reliability
  – frequency of expected changes
Exercises
1. Explain each of the following terms in your own words:
   – machine language
   – instruction
2. What are the differences between interpretation and translation?
3. What are multilevel machines?
4. What are the differences between a two-level machine and a six-level machine?
Historical Developments
Computer Generations
• Zeroth generation – mechanical computers (1642-1940)
• First generation – vacuum tubes (1940-1955)
• Second generation – transistors (1956-1963)
• Third generation – integrated circuits (1964-1971)
• Fourth generation – Very Large Scale Integration (1971-present)
• Fifth generation – artificial intelligence (present and beyond)
The Zeroth Generation (1)
Year | Name | Made by | Comments
1834 | Analytical Engine | Babbage | First attempt to build a digital computer
1936 | Z1 | Zuse | First working relay calculating machine
1943 | COLOSSUS | British gov't | First electronic computer
1944 | Mark I | Aiken | First American general-purpose computer
1946 | ENIAC I | Eckert/Mauchly | Modern computer history starts here
1949 | EDSAC | Wilkes | First stored-program computer
1951 | Whirlwind I | M.I.T. | First real-time computer
1952 | IAS | Von Neumann | Most current machines use this design
1960 | PDP-1 | DEC | First minicomputer (50 sold)
1961 | 1401 | IBM | Enormously popular small business machine
1962 | 7094 | IBM | Dominated scientific computing in the early 1960s
The Zeroth Generation (2)
Year | Name | Made by | Comments
1963 | B5000 | Burroughs | First machine designed for a high-level language
1964 | 360 | IBM | First product line designed as a family
1964 | 6600 | CDC | First scientific supercomputer
1965 | PDP-8 | DEC | First mass-market minicomputer (50,000 sold)
1970 | PDP-11 | DEC | Dominated minicomputers in the 1970s
1974 | 8080 | Intel | First general-purpose 8-bit computer on a chip
1974 | CRAY-1 | Cray | First vector supercomputer
1978 | VAX | DEC | First 32-bit superminicomputer
1981 | IBM PC | IBM | Started the modern personal computer era
1985 | MIPS | MIPS | First commercial RISC machine
1987 | SPARC | Sun | First SPARC-based RISC workstation
1990 | RS6000 | IBM | First superscalar machine
The Zeroth Generation (3)
• Pascal's machine
  – Addition and subtraction
• Analytical Engine
  – Four components (store, mill, input, output)
Charles Babbage
• Difference Engine (1823)
• Analytical Engine (1833)
  – The forerunner of the modern digital computer
  – The first conception of a general-purpose computer
Von-Neumann machine
First Generation – Vacuum Tubes (1945-1955)
• First generation computers are characterized by the use of vacuum tube logic
• Developments
  – ABC
  – ENIAC
  – UNIVAC I
First Generation Timeline
Date | Name | Description | Arithmetic | Logic | Memory
1942 | ABC | Atanasoff-Berry Computer | binary | vacuum tubes | capacitors
1946 | ENIAC | Electronic Numerical Integrator And Computer | decimal | vacuum tubes | vacuum tubes
1947 | EDVAC | Electronic Discrete Variable Automatic Computer | binary | vacuum tubes | mercury delay lines
1948 | The Baby | Manchester Small-Scale Experimental Machine | binary | vacuum tubes | cathode ray tubes
1949 | UNIVAC I | Universal Automatic Computer | decimal | vacuum tubes | mercury delay lines
1949 | EDSAC | Electronic Delay Storage Automatic Computer | binary | vacuum tubes | mercury delay lines
1952 | IAS | Institute for Advanced Study machine | binary | vacuum tubes | cathode ray tubes
1953 | IBM 701 | | binary | vacuum tubes | mercury delay lines
ABC – Atanasoff-Berry Computer
• The world's first electronic digital computer
• The ABC used binary arithmetic
ENIAC – First General-Purpose Computer
• Electronic Numerical Integrator And Computer
• Designed and built by Eckert and Mauchly at the University of Pennsylvania during 1943-45
• Capable of being reprogrammed to solve a full range of computing problems
• The first completely electronic, operational, general-purpose analytical calculator
  – 30 tons, 72 square meters, 200 kW
• Performance
  – Read 120 cards per minute
  – Addition took 200 µs, division 6 ms
UNIVAC - UNIVersal Automatic Computer • The first commercial computer • UNIVAC was delivered in 1951 • designed at the outset for business and administrative use • The UNIVAC I had 5200 vacuum tubes, weighed 29,000 pounds, and consumed 125 kilowatts of electrical power • Originally priced at US$159,000
The Second Generation – Transistors (1955-1965)
• Second generation computers are characterized by the use of discrete transistor logic
• Magnetic core was used for primary storage
• Developments
  – IBM 1620 system
  – IBM 7030 system
  – IBM 7090 system
  – IBM 7094 system
IBM 7090
• The IBM 7090 system was announced in 1958.
• The 7090 included a multiplexor which supported up to 8 I/O channels.
• The 7090 supported both fixed point and floating point arithmetic.
• Two fixed point numbers could be added in 4.8 microseconds, and two floating point numbers could be added in 16.8 microseconds.
• The 7090 had 32,768 thirty-six bit words of core storage.
• In 1960, the American Airlines SABRE system used two 7090 systems.
• Cost of a 7090 system was in the $3,000,000 range.
IBM 1620 • The IBM 1620 system was announced in 1959. • The IBM 1620 system had up to 60,000 digits of core storage (6 bits each.) • Floating point hardware was optional. • The IBM 1620 system performed decimal arithmetic. • The system was digit oriented, not word oriented.
IBM 7030
• The IBM 7030 system was announced in 1960.
• The IBM 7030 system used magnetic core for main memory, and magnetic disks for secondary storage.
• The ALU could perform 1,000,000 operations per second.
• Up to 32 I/O channels were supported.
• The 7030 was also referred to as "Stretch."
• Cost of a 7030 system was in the $10,000,000 range.
IBM 7094 • The IBM 7094 system was announced in 1962. • The 7094 was an improved 7090. • The 7094 introduced double precision floating point arithmetic.
Third Generation
• Third generation computers are characterized by the use of integrated circuit logic
• Development
  – IBM System/360
IBM S/360
• The IBM S/360 family was announced in 1964.
• Included both multiplexor and selector I/O channels.
• Supported both fixed point and floating point arithmetic.
• Had a microprogrammed instruction set.
• Cost between $133,000 and $12,500,000.
Fourth Generation
• Very Large Scale Integration (VLSI) and Ultra Large Scale Integration (ULSI)
• Fourth generation computers are characterized by the use of microprocessors.
• Semiconductor memory was commonly used.
• Developments
  – Intel
  – AMD, etc.
Intel 4004
• The Intel 4004 microprocessor was announced in 1971.
• The Intel 4004 microprocessor had
  – 2,300 transistors
  – a clock speed of 108 kHz
  – a die size of 12 mm²
  – 4-bit memory access
  – 4-bit registers
• The Intel 4004 microprocessor supported
  – up to 32,768 bits of program storage
  – up to 5,120 bits of data storage
• The 4004 was used mainly in calculators.
Intel 4004 - 1971
MOS 6502
• The MOS 6502 microprocessor was announced in 1975.
• The MOS 6502 microprocessor had
  – a clock speed of 1 MHz
  – 8-bit memory access
  – 8-bit registers
• The MOS 6502 microprocessor supported up to 65,536 bytes of main memory.
• The MOS 6502 was used in
  – the Apple II personal computer
  – the Commodore PET personal computer
  – the KIM-1 computer kit
  – the Atari 2600 game system
  – the Nintendo Famicom game system
• Initial price of the 6502 was $25.00.
Intel Pentium IV – 2001
• "State of the art"
• 42 million transistors
• 2 GHz
• 0.13 µm process
• Could fit ~15,000 4004s on this chip!
Now – zEnterprise 196 Microprocessor
• 1.4 billion transistors, quad-core design
• Up to 96 cores (80 visible to the OS) in one multichip module
• 5.2 GHz, IBM 45 nm SOI CMOS technology
• 64-bit virtual addressing
  – the original 360 was 24-bit; the 370 was a 31-bit extension
• Superscalar, out-of-order
  – up to 72 instructions in flight
• Variable-length instruction pipeline: 15-17 stages
• Each core has 2 integer units, 2 load-store units and 2 floating-point units
• 8K-entry Branch Target Buffer
  – a very large buffer to support commercial workloads
• Four levels of caches:
  – 64 KB L1 I-cache, 128 KB L1 D-cache
  – 1.5 MB L2 cache per core
  – 24 MB shared on-chip L3 cache
  – 192 MB shared off-chip L4 cache
Fifth Generation
• Computing devices based on artificial intelligence
• Features
  – Voice recognition
  – Parallel processing
  – Quantum computation, molecular computing and nanotechnology will radically change the face of computers in years to come
• The goal of fifth-generation computing is to develop devices that respond to natural language input and are capable of learning and self-organization
Computer Architecture
What is Computer Architecture?
• A level's set of data types, operations, and features is called its architecture
• The architecture deals with those aspects that are visible to the user of that level
• The study of how to design these parts of a computer is called computer architecture
Why Computer Architecture?
• Maximize overall system performance while keeping within cost constraints
• Bridge the performance gap between the slowest and fastest components in a computer
• Architecture design
  – Search the space of possible designs
  – Evaluate the performance of the chosen design
  – Identify bottlenecks, redesign and repeat the process
Computer Organization
• A simple computer consists of
  – CPU
  – I/O devices
  – Memory
  – Bus (connection method)
Simple Computer
CPU – Central Processing Unit
• The "brain" of the computer
• It executes the programs stored in main memory
• Composed of several parts
  – Control unit
  – Arithmetic and logic unit (ALU)
  – Registers
Registers
• High-speed memory
• At the top of the memory hierarchy; provide the fastest way to access data
• Store temporary results
• Some useful registers
  – PC – program counter
    • points to the next instruction
  – IR – instruction register
    • holds the instruction currently being executed
Registers (contd.)
• Types
  – User-accessible registers
  – Data registers
  – Address registers
  – General-purpose registers
  – Special-purpose registers
  – etc.
Instructions
• Types
  – Data handling and memory operations
    • set, move, read, write
  – Arithmetic and logic
    • add, subtract, multiply, divide
    • compare
  – Control flow
• Complex instructions
  – Take many instructions on other computers
    • saving many registers on the stack at once
    • moving large blocks of memory
Parts of an Instruction
• Opcode
  – Specifies the operation to be performed
• Operands
  – Register values
  – Values in the stack
  – Other memory values
  – I/O ports
Types of Operations
• Register-register operations
  – add, subtract, compare, and logical operations
• Memory references
  – all loads from memory
• Multi-cycle instructions
  – integer multiply and divide and all floating-point operations
Fetch-Decode-Execute Cycle
• Instruction fetch
  – a 32-bit instruction is fetched from the cache
• Decode
• Execute
• Memory access
• Write back
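The stages above can be sketched as a straight-line loop. This is a teaching sketch only: the instruction format, mnemonics and memory layout are invented, and a real CPU overlaps these stages in a pipeline rather than finishing one instruction before fetching the next.

```python
# Fetch-decode-execute as a simple loop over a toy instruction memory.
memory = {0: ("LOAD", 9), 1: ("ADD", 9), 2: ("HALT", 0), 9: 21}
acc, pc, running = 0, 0, True

while running:
    instr = memory[pc]          # 1. fetch the instruction the PC points at
    op, addr = instr            # 2. decode into opcode and address fields
    pc += 1
    if op == "LOAD":            # 3. execute, with 4. memory access
        acc = memory[addr]
    elif op == "ADD":
        acc += memory[addr]     # 5. write back: result lands in `acc`
    elif op == "HALT":
        running = False

print(acc)   # 42
```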
Microprocessors
• Processors can be identified by two main parameters
  – Speed (MHz/GHz)
  – Processor width
    • Data bus
    • Address bus
    • Internal registers
Data Bus
• Also known as the front-side bus, CPU bus, or processor-side bus
• Used between the CPU and the main chipset
• Its width defines the size of a memory transfer
  – 32-bit, 64-bit, etc.
Data bus
I/O Ports with Data Transfer Rates
I/O buses are divided according to data transfer rate:
Controller | Port / Device | Typical Data Transfer Rate
Super I/O | PS/2 (keyboard / mouse) | 2 KB/s
Super I/O | Serial port | 25 KB/s
Super I/O | Floppy disk | 125 KB/s
Super I/O | Parallel port | 200 KB/s
Southbridge | Integrated audio | 1 MB/s
Southbridge | Integrated LAN | 12 MB/s
Southbridge | USB | 60 MB/s
Southbridge | Integrated video | 133 MB/s
Southbridge | IDE (HDD, DVD) | 133 MB/s
Southbridge | SATA (HDD, DVD) | 300 MB/s
Address Bus
• Carries addressing information
• Each wire carries a single bit
• Its width determines the maximum amount of RAM the processor can address
• The data bus and address bus are independent
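The relationship between bus width and addressable memory is just a power of two: n address wires can carry 2**n distinct addresses. A quick check, using bus widths mentioned elsewhere in these slides:

```python
# n address wires -> 2**n distinct addresses (one byte each, for byte-addressed
# machines like those discussed here).

def max_addressable(bits):
    """Number of addressable locations for an n-bit address bus."""
    return 2 ** bits

print(max_addressable(16))            # 65536 bytes (64 KB), e.g. the 6502
print(max_addressable(20) // 2**20)   # 1 (MB): the 8086's 20-bit bus
print(max_addressable(32) // 2**30)   # 4 (GB) for a 32-bit address bus
```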
How does the CPU work?
• A simple CPU
  – 4-bit address bus
  – Registers A, B and C (4-bit)
  – 8-bit program words (4-bit instruction, 4-bit data)
How does the CPU work? (contd.)
Instruction set:
0000 | SLEEP
0001 | LOAD M → A
0010 | LOAD M → B
0101 | SET A → M
0110 | SET B → M
0111 | SET C → M
1000 | ADD A + B → C
1111 | MOVE
1001 | RESET

[Slides 73-82 stepped through an animated register/memory trace that was largely lost in extraction: the program loads 0010 into register A and 0101 into register B, ADD A + B → C produces C = 0111, SET C → M writes the result back to memory, and RESET clears the registers before the instruction counter returns to the start.]
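The toy CPU from these slides can be made runnable. The sketch below implements a subset of the slides' instruction set (SLEEP, the two LOADs, ADD, and SET C → M); the memory layout and data values are illustrative, chosen to reproduce the 0010 + 0101 = 0111 trace described above.

```python
# Toy CPU: registers A, B, C, an instruction counter, and 8-bit program words
# (high nibble = opcode, low nibble = data/address), as in the slides.

def run(memory):
    reg = {"A": 0, "B": 0, "C": 0}
    ic = 0
    while ic < len(memory):
        word = memory[ic]
        opcode, data = word >> 4, word & 0b1111
        ic += 1
        if opcode == 0b0000:        # SLEEP: stop execution
            break
        elif opcode == 0b0001:      # LOAD M -> A
            reg["A"] = memory[data]
        elif opcode == 0b0010:      # LOAD M -> B
            reg["B"] = memory[data]
        elif opcode == 0b1000:      # ADD A + B -> C (4-bit wraparound)
            reg["C"] = (reg["A"] + reg["B"]) & 0b1111
        elif opcode == 0b0111:      # SET C -> M
            memory[data] = reg["C"]
    return reg, memory

# Program: load memory[6] into A, memory[7] into B, add, store C at memory[7].
mem = [0b00010110, 0b00100111, 0b10000000, 0b01110111, 0b00000000,
       0, 0b0010, 0b0101]
reg, mem = run(mem)
print(bin(reg["C"]), mem[7])   # 0b111 7
```

Note that program words and data share the same memory, exactly as in a von Neumann machine.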
How the Bus System Works
• The CPU and devices A, B and C are connected by three shared buses:
  – Data bus
  – Address bus
  – Control bus
How the Bus System Works (contd.)
• Bus widths in this example:
  – Address bus: 4 bits
  – Data bus: 4 bits
  – Control bus: 2 bits (01 = read, 10 = write)
How the Bus System Works (contd.)
• Each device decodes the 2-bit control field (01 = read, 10 = write) and its own address: device A = 0100, device B = 0010, device C = 0001
• [Slides 86-91 stepped through a bus transaction: with the buses idle (data 0000, address 0000, control 00), the CPU places address 0100 on the address bus and asserts control 10 (write) with data 1010 to write to device A, then places address 0010 with control 01 (read) to read from device B]
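That address-decoding behaviour can be sketched in code. The device addresses (0100, 0010, 0001) and control codes (01 = read, 10 = write) follow the slides; the `Device` class and its one-word latch are invented here as a minimal model of a device that responds only when its own address is on the bus.

```python
# Each device watches the address bus and reacts only when selected.
READ, WRITE = 0b01, 0b10

class Device:
    def __init__(self, address):
        self.address, self.latch = address, 0
    def tick(self, address_bus, control_bus, data_bus):
        if address_bus != self.address:
            return data_bus          # not selected: leave the data bus alone
        if control_bus == WRITE:
            self.latch = data_bus    # capture the value from the data bus
        elif control_bus == READ:
            return self.latch        # drive the stored value onto the bus
        return data_bus

bus_devices = [Device(0b0100), Device(0b0010), Device(0b0001)]

def cycle(address, control, data=0):
    """One bus cycle: every device sees the same address/control/data lines."""
    for dev in bus_devices:
        data = dev.tick(address, control, data)
    return data

cycle(0b0100, WRITE, 0b1010)      # CPU writes 1010 to device A
print(bin(cycle(0b0100, READ)))   # read it back: 0b1010
```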
Intel Microprocessor History
Microprocessor History
• Intel 4004 (1971)
  – 0.1 MHz, 4-bit
  – The world's first single-chip microprocessor
  – Instruction set contained 46 instructions
  – Register set contained 16 registers of 4 bits each
Microprocessor History
• Intel 8008 (1972)
  – Max. CPU clock rate 0.5 MHz to 0.8 MHz
  – 8-bit CPU with an external 14-bit address bus
  – Could address 16 KB of memory
  – Had 3,500 transistors
Microprocessor History
• Intel 8080 (1974)
  – Second 8-bit microprocessor
  – Max. CPU clock rate 2 MHz
  – Large 40-pin DIP packaging
  – 16-bit address bus and an 8-bit data bus
  – Easy access to 64 kilobytes of memory
  – The processor had seven 8-bit registers (A, B, C, D, E, H, and L)
Microprocessor History
• Intel 8086 (1978)
  – 16-bit microprocessor
  – Max. CPU clock rate 5 MHz to 10 MHz
  – 20-bit external address bus gave a 1 MB physical address space
  – 16-bit registers, including the stack pointer
Microprocessor History
• Intel 80286 (1982)
  – 16-bit x86 microprocessor
  – 134,000 transistors
  – Max. CPU clock rate 6 MHz to 25 MHz
  – Runs in two modes
    • Protected mode
    • Real mode
Microprocessor History
• Intel 80386 (1985)
  – 32-bit microprocessor
  – 275,000 transistors
  – 32-bit data bus (16-bit on the later 386SX)
  – Max. CPU clock rate 12 MHz to 40 MHz
  – Instruction set x86 (IA-32)
Microprocessor History
• Intel 80486 (1989)
  – Max. CPU clock rate 16 MHz to 100 MHz
  – FSB speeds 16 MHz to 50 MHz
  – Instruction set x86 (IA-32)
  – An 8 KB on-chip SRAM cache
  – A 32-bit data bus and a 32-bit address bus
  – Power management features and System Management Mode (SMM) became a standard feature
Microprocessor History • Intel Pentium I (1993) – Intel's 5th generation micro architecture – Operated at 60 MHz – powered at 5V and generated enough heat to require a CPU cooling fan – Level 1 CPU cache from 16 KB to 32 KB – Contained 4.5 million transistors – compatible with the common Socket 7 motherboard configuration
Microprocessor History • Intel Pentium II (1997) – Intel's sixth-generation microarchitecture – 296-pin Staggered Pin Grid Array (SPGA) package (Socket 7) – speeds from 233 MHz to 450 MHz – Instruction set IA-32, MMX – cache size was increased to 512 KB – better choice for consumer-level operating systems, such as Windows 9x, and multimedia applications
Microprocessor History
• Intel Pentium III (1999)
  – 400 MHz to 1.4 GHz
  – Instruction set IA-32, MMX, SSE
  – L1 cache: 16 + 16 KB (data + instructions)
  – L2 cache: 512 KB, external chips on the CPU module at 50% of CPU speed
  – The first x86 CPU to include a unique, retrievable identification number
Microprocessor History
• Intel Pentium IV (2000)
  – Max. CPU clock rate 1.3 GHz to 3.8 GHz
  – Instruction set x86 (i386), x86-64, MMX, SSE, SSE2, SSE3
  – Featured Hyper-Threading Technology (HTT)
  – 64-bit external data bus
  – More than 42 million transistors
  – Processor (front-side) bus runs at 400 MHz, 533 MHz, 800 MHz, or 1066 MHz
  – Can address up to 4 GB of RAM
  – 2 MB of full-speed L3 cache
Microprocessor History
• Intel Core Duo
  – 151 million transistors on the processing die
  – Consists of two cores
  – 2 MB L2 cache
  – All models support MMX, SSE, SSE2, SSE3, EIST, XD bit
  – FSB speed 533 MHz
  – Intel® Virtualization Technology (VT-x)
  – Execute Disable Bit
Microprocessor History • Pentium Dual-Core – Max. CPU clock rate 1.3 GHz to 2.6 GHz – based on either the 32-bit Yonah or (with quite different microarchitectures) 64-bit Merom-2M – Instruction set MMX, SSE, SSE2, SSE3, SSSE3, x86-64 – FSB speeds 533 MHz to 800 MHz – Cores 2
Microprocessor History
• Intel Core Duo
  – Clock speed 1.2 GHz
  – L2 cache 2 MB
  – FSB speed 533 MHz
  – 32-bit instruction set
  – 151 million transistors on the processing die
  – Advanced technologies
    • Intel® Virtualization Technology (VT-x)
    • Enhanced Intel SpeedStep® Technology
    • Execute Disable Bit
Microprocessor History
• Intel Core 2 Duo
  – 2 cores, 2 threads
  – Clock speed 3.33 GHz
  – L2 cache 6 MB
  – FSB speed 1333 MHz
  – 410 million transistors on the processing die
  – Advanced technologies
    • Intel® Virtualization Technology (VT-x)
    • Intel® Virtualization Technology for Directed I/O (VT-d)
    • Intel® Trusted Execution Technology
    • Intel® 64
    • Idle States
    • Enhanced Intel SpeedStep® Technology
    • Thermal Monitoring Technologies
    • Execute Disable Bit
Microprocessor History
• Intel Core 2 Quad
  – 4 cores, 4 threads
  – Clock speed 3.0 GHz
  – L2 cache 12 MB
  – FSB speed 1333 MHz
  – 410 million transistors on the processing die
  – Advanced technologies
    • Intel® Virtualization Technology (VT-x)
    • Intel® Virtualization Technology for Directed I/O (VT-d)
    • Intel® Trusted Execution Technology
    • Intel® 64
    • Idle States
    • Enhanced Intel SpeedStep® Technology
    • Thermal Monitoring Technologies
    • Execute Disable Bit
Microprocessor History • Core i3 – Cores 2, Threads 4 – Clock speed 2.13 GHz – Intel® Smart Cache 3 MB – Instruction set 64-bit – Instruction set extensions SSE4.1, SSE4.2 – Max memory size 8 GB – Processing die transistors: 382 million – Technologies: Intel® Trusted Execution Technology, Intel® Fast Memory Access, Intel® Flex Memory Access
Microprocessor History • Core i5 – Cores 2, Threads 4 – Clock speed 1.7 - 3.0 GHz – Max memory size 8 GB – Processing die transistors: 382 million – Technologies: Intel® Trusted Execution Technology, Intel® Fast Memory Access, Intel® Flex Memory Access, Intel® Anti-Theft Technology, Intel® My WiFi Technology, 4G WiMAX Wireless Technology, Idle States
Microprocessor History • Core i7 – Cores 4, Threads 8 – Clock speed 3.4 GHz – Max turbo frequency 3.8 GHz – Intel® Smart Cache 8 MB – Technologies: Intel® Turbo Boost Technology 2.0, Intel® vPro Technology, Intel® Hyper-Threading Technology, Intel® Virtualization Technology (VT-x), Intel® Virtualization Technology for Directed I/O (VT-d), Intel® Trusted Execution Technology, AES New Instructions, Intel® 64, Idle States, Enhanced Intel SpeedStep® Technology, Thermal Monitoring Technologies, Intel® Fast Memory Access, Intel® Flex Memory Access, Execute Disable Bit
Summary – Processor Family Vs Buses
Summary - Intel processors (1)
AMD processors (1)
AMD processors (2)
Microprocessors
Processor Instructions • Intel 80386 (1985) – x86 (IA-32)
• Intel 80486 (1989) – x86 (IA-32)
• Intel Pentium I (1993) – x86 (IA-32)
• Intel Pentium II (1997) – IA-32, MMX
Processor Instructions(2) • Intel Pentium III (1999) – IA-32, MMX, SSE
• Intel Pentium IV (2000) – x86 (i386), x86-64, MMX, SSE, SSE2, SSE3
• Intel Core Duo – MMX, SSE, SSE2, SSE3, EIST, XD bit
• Pentium Dual-Core – MMX, SSE, SSE2, SSE3, SSSE3, x86-64
Processor Modes
Processor modes • Intel and compatible processors run in several modes – Real mode – IA-32 mode • Protected mode • Virtual real mode – IA-32e 64-bit mode • 64-bit mode • Compatibility mode
8086 Real Mode (x86) • 80286 and later x86-compatible CPUs • Executes 16-bit instructions • Addresses only 1 MB of memory • Single task • MS-DOS programs run in this mode – Windows 1.x, 3.x – 16-bit instructions • No built-in protection to keep one program from overwriting another in memory
IA-32 - Protected Mode • First implemented in the Intel 80386 as a 32-bit extension of the x86 architecture • Can run 32-bit instructions • A 32-bit OS and 32-bit applications are required • Programs are protected to keep one program from overwriting another in memory
Virtual Real Mode (IA-32 Mode) • Backward compatibility (can run 16-bit apps) – used to execute DOS programs in Windows/386, Windows 3.x, Windows 9x/Me • 16-bit programs run on top of the 32-bit protected mode • Addresses only up to 1 MB • All Intel and Intel-compatible processors power up in real mode
IA-32e 64-bit Execution Mode • Originally designed by AMD, later adopted by Intel • The processor can run in – Real mode – IA-32 mode – IA-32e mode • IA-32e 64-bit mode runs a 64-bit OS and 64-bit apps • Needs a 64-bit OS and all-64-bit hardware
64-Bit Operating Systems • Windows XP 64-bit Edition for Itanium (IA-64 processors) • Windows XP Professional x64 (IA-32e, Athlon 64) • 32-bit applications can run without any problem • 16-bit and DOS applications do not run • Problem? – 64-bit versions of all drivers are required
Physical memory limit
Processors Features
Processors Features • System Management Mode (SMM) • MMX Technology • SSE, SSE2, SSE3, SSE4, etc. • 3DNow! Technology • Math coprocessor • Hyper-Threading • Dual-core technology • Quad-core technology • Intel Virtualization • Execute Disable Bit • Intel® Turbo Boost Technology
System Management Mode (SMM) • An operating mode in which normal execution (including the operating system) is suspended and special separate software is executed in a high-privilege mode • Available in all later microprocessors in the x86 architecture • Some uses of SMM: – Handle system events like memory or chipset errors – Manage system safety functions, such as shutdown on high CPU temperature and turning the fans on and off – Control power management operations, such as managing the voltage regulator modules
MMX Technology • Multimedia Extensions / Matrix Math Extensions • Improves audio/video compression • MMX defined eight registers, known as MM0 through MM7 • Each of the MMn registers holds 64 bits • MMX provides only integer operations • Used for both 2D and 3D calculations • 57 new instructions (SIMD - Single Instruction, Multiple Data)
SSE - Streaming SIMD Extensions • Used to accelerate floating-point and parallel calculations • A SIMD instruction set extension to the x86 architecture • Subsequently expanded by Intel to SSE2, SSE3, SSSE3, and SSE4 • Supports floating-point math • SSE originally added eight new 128-bit registers, known as XMM0 through XMM7 • SSE instructions: – Floating-point instructions – Integer instructions – Other instructions
SSE2 - Streaming SIMD Extensions 2 • Introduced in the Pentium IV • Adds 144 additional instructions • Also includes the MMX and SSE instructions • SSE2 is an extension of the IA-32 architecture
SSE3 - Streaming SIMD Extensions 3 • Introduced in the Pentium IV Prescott processor • Code name: Prescott New Instructions (PNI) • Contains 13 new instructions • Also includes MMX, SSE, SSE2
SSSE3 - Supplemental SSE3 • Introduced in Xeon and Core 2 processors • Adds 32 new SIMD instructions to SSE3
SSE4 (HD Boost) • Introduced by Intel in 2008 • Adds 54 new instructions • 47 of the SSE4 instructions are referred to as SSE4.1 • The 7 others as SSE4.2 • SSE4.1 is targeted at improving the performance of media, imaging, and 3D workloads • SSE4.2 improves string and text processing
SSE - Advantages • Higher-quality images at higher resolutions • High-quality audio and MPEG-2 video; better multimedia application support • Reduced CPU utilization for speech recognition software • SSEx instructions are useful for MPEG-2 decoding
3DNow! Technology • AMD's alternative to SSE • Uses 21 instructions based on SIMD technology • Enhanced 3DNow! adds 24 more instructions • Professional 3DNow! adds 51 SSE commands to Enhanced 3DNow!
Math Coprocessor • Provides hardware for floating-point math • Speeds up computer operations • All Intel processors since the 486DX include a built-in floating-point unit (FPU) • Can perform high-level mathematical operations • Its instruction set differs from the main CPU's
Hyper-Threading Technology • An Intel-proprietary technology used to improve parallelization of computations by doing multiple tasks at once • The operating system addresses two virtual processors and shares the workload between them when possible • Allows multiple threads to run simultaneously
Hyper-Threading Technology • Originally introduced in the Xeon processor for servers (2002) • Available in all Pentium IV processors with a bus speed of 800 MHz • HT-enabled processors have two sets of general-purpose registers and control registers • But only a single cache memory and a single set of buses
HT - Requirements • Processor with HT Technology • Compatible motherboard (chipset) • BIOS support • Compatible OS • Software written to support HT
Dual-Core Technology • Introduced in 2005 • Consists of 2 CPU cores (enables a single processor to work as 2 processors) • Multitasking performance is improved
Quad-Core Technology • Consists of 4 CPU cores (enables a single processor to work as 4 processors) • Lower power consumption • Designed to provide a better multimedia and multitasking experience
Intel Virtualization • Allows one hardware platform to run multiple operating systems • Available in Core 2 Quad processors
Execute Disable Bit • A hardware-based security feature • Can reduce exposure to viruses and malicious-code attacks, and prevent harmful software from executing and propagating on the server or network • Helps protect business assets and reduces the need for costly virus-related repairs when systems are built with the Intel Execute Disable Bit
Intel® Turbo Boost Technology • Provides more performance when needed • Automatically allows processor cores to run faster than the base operating frequency • Depends on the workload and operating environment • The processor frequency dynamically increases until the upper frequency limit is reached • Multiple algorithms operate in parallel to manage current, power, and temperature to maximize performance and energy efficiency
Bugs
Bugs • A processor can contain defects or errors • The only way to fix a hardware bug used to be to work around it or replace the processor with a bug-free one • Now: – Many bugs can be fixed by altering the microcode – Microcode is the low-level code that defines how the processor carries out its instructions – Processors incorporate reprogrammable microcode
Fixing the Bugs • Microcode updates reside in the ROM BIOS • Each time the system is rebooted, the fixed code is loaded • This microcode is provided by Intel to motherboard manufacturers, who incorporate it into the ROM BIOS • The most recent BIOS should therefore be installed
CPU Design Strategy CISC & RISC
What is CISC? • CISC is an acronym for Complex Instruction Set Computer • Most common microprocessor designs such as the Intel 80x86 and Motorola 68K series followed the CISC philosophy. • But recent changes in software and hardware technology have forced a re-examination of CISC and many modern CISC processors are hybrids, implementing many RISC principles. • CISC was developed to make compiler development simpler.
CISC Characteristics • 2-operand format • Variable-length instructions, where the length often varies according to the addressing mode • Instructions which require multiple clock cycles to execute (e.g., the Pentium is considered a modern CISC processor) • Complex instruction-decoding logic, driven by the need for a single instruction to support multiple addressing modes • A small number of general-purpose registers • Several special-purpose registers • A "condition code" register, which is set as a side effect of most instructions
CISC Advantages • Microprogramming is as easy as assembly language to implement • The ease of microcoding new instructions allowed designers to make CISC machines upwardly compatible: a new computer could run the same programs as earlier computers because it would contain a superset of their instructions • As each instruction became more capable, fewer instructions could be used to implement a given task. This made more efficient use of the relatively slow main memory.
CISC Disadvantages • Instruction set & chip hardware become more complex with each generation of computers. • Many specialized instructions aren't used frequently enough to justify their existence • CISC instructions typically set the condition codes as a side effect of the instruction.
What is RISC? • RISC - Reduced Instruction Set Computer. – is a type of microprocessor architecture – utilizes a small, highly-optimized set of instructions, rather than a more specialized set of instructions often found in other types of architectures.
• History – The first RISC projects came from IBM, Stanford, and UC-Berkeley in the late 70s and early 80s. – The IBM 801, Stanford MIPS, and Berkeley RISC 1 and 2 were all designed with a similar philosophy which has become known as RISC.
RISC - Characteristics • One-cycle execution time: RISC processors have a CPI (clock cycles per instruction) of one. This is due to the optimization of each instruction on the CPU and a technique called pipelining • Pipelining: a technique that allows the simultaneous execution of parts, or stages, of instructions, to process instructions more efficiently • Large number of registers: the RISC design philosophy generally incorporates a larger number of registers to prevent large amounts of interaction with memory
RISC Attributes • The main characteristics of CISC microprocessors are: – Extensive instructions – Complex and efficient machine instructions – Microencoding of the machine instructions – Extensive addressing capabilities for memory operations – Relatively few registers • In comparison, RISC processors are more or less the opposite: – Reduced instruction set – Less complex, simple instructions – Hardwired control unit and machine instructions – Few addressing schemes for memory operands, with only two basic instructions, LOAD and STORE
CISC Vs RISC
CISC: Emphasis on hardware | RISC: Emphasis on software
CISC: Includes multi-clock complex instructions | RISC: Single-clock, reduced instructions only
CISC: Memory-to-memory: "LOAD" and "STORE" incorporated in instructions | RISC: Register-to-register: "LOAD" and "STORE" are independent instructions
CISC: Small code sizes, high cycles per second | RISC: Low cycles per second, large code sizes
CISC: Transistors used for storing complex instructions | RISC: Spends more transistors on memory registers
Performance of Computers
Improving Performance of Computers • Increasing clock speed – Physical limitation (Need new hardware)
• Parallelism (Doing more things at once) – Instruction-level parallelism • Getting more instruction per second
– Processor-level parallelism • Having multiple CPUs working on the same problem
Instruction-level parallelism • Pipelining – Instruction execution speed is limited by the time taken to fetch instructions from memory – Early computers fetched instructions in advance and stored them in registers (a prefetch buffer) • Prefetching divides instruction execution into two parts: – Fetching – Actual execution • Pipelining divides an instruction into many parts, each handled by dedicated hardware, which can all run in parallel
Pipelining example • Packaging cakes – W1: Place an empty box on the belt every 10 seconds – W2: Place the cake in the empty box – W3: Close and seal the box – W4: Label the box – W5: Remove the box and place it in the large container
Computer Pipelines • S1: Fetch the instruction from memory and place it in a buffer until it is needed • S2: Decode the instruction; determine its type and the operands it needs • S3: Locate and fetch the operands from memory (or registers) • S4: Execute the instruction • S5: Write the result back into a register
Example T - cycle time (ns) N - number of stages in the pipeline
Latency: time taken to execute an instruction = N x T
Processor bandwidth: no. of MIPS the CPU has = 1000 / T
Processor - pipeline depth
Dual pipelines
• The instruction fetch unit fetches a pair of instructions and puts each one into its own pipeline • The Pentium has two five-stage pipelines – U pipeline (main) executes any Pentium instruction – V pipeline (second) executes only integer instructions and one simple floating-point instruction • If the instructions in a pair conflict, the instruction in the U pipeline is executed; the other instruction is held and paired with the next instruction
Superscalar architecture • A single pipeline with multiple functional units
Processor-level parallelism • Multiple CPUs sharing one common memory: high bus traffic • Giving each CPU some local memory of its own: low bus traffic
Measuring Performance
Moore’s law • Describes a long-term trend in the history of computing hardware • Defined by Dr. Gordon Moore during the sixties. • Predicts an exponential increase in component density over time, with a doubling time of 18 months. • Applicable to microprocessors, DRAMs , DSPs and other microelectronics.
Moore's Law and Performance • The performance of computers is determined by architecture and clock speed. • Clock speed doubles over a 3 year period due to the scaling laws on chip. • Processors using identical or similar architectures gain performance directly as a function of Moore's Law. • Improvements in internal architecture can yield better gains than predicted by Moore's Law.
Measuring Performance • Execution time: – Time between the start and completion of a task (including disk accesses, memory accesses) • Throughput: – Total amount of work done in a given time
Performance of a Computer
For two computers X and Y, Performance(X) > Performance(Y) if and only if Execution time(Y) > Execution time(X)
Performance difference: X is n times faster than Y when Execution time(Y) / Execution time(X) = n
CPU Time • Time CPU spends on a task • User CPU time – CPU time spent in the program
• System CPU time – CPU time spent in OS performing tasks on behalf of the program
CPU Time (Example) • User CPU time = 90.7 s • System CPU time = 12.9 s • Execution time = 2 min 39 s = 159 s • % of CPU time = (User CPU time + System CPU time) / Execution time x 100%
CPU Time % CPU time = (90.7 + 12.9) x 100 / 159 = 65%
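The calculation above can be checked in a few lines of Python (a sketch using the slide's numbers):

```python
# CPU time as a percentage of total execution time, using the slide's
# figures: user 90.7 s, system 12.9 s, elapsed 2 min 39 s = 159 s.

user_cpu = 90.7      # seconds spent in the program
system_cpu = 12.9    # seconds spent in the OS on the program's behalf
execution = 159.0    # elapsed (wall-clock) time

cpu_percent = (user_cpu + system_cpu) / execution * 100
print(round(cpu_percent))  # 65
```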
Clock Rate • The computer clock runs at a constant rate and determines when events take place in the hardware
Clock rate = 1 / Clock cycle time
Amdahl’s law • Performance improvement that can be gained from some faster mode of execution is limited by fraction of the time the faster mode can be used
Amdahl’s law • Speedup depends on – The fraction of the computation time in the original machine that can be converted to take advantage of the enhancement (FractionEnhanced) – The improvement gained by the enhanced execution mode (SpeedupEnhanced)
Example Total execution time of a program = 50 s Execution time that can be enhanced = 30 s FractionEnhanced = 30/50 = 0.6
Speedup = 1 / ((1 - FractionEnhanced) + FractionEnhanced / SpeedupEnhanced)
Example Normal-mode execution time for some portion of a program = 6 s Enhanced-mode execution time for the same portion = 2 s SpeedupEnhanced = 6/2 = 3
Execution time (new) = Execution time (old) x ((1 - FractionEnhanced) + FractionEnhanced / SpeedupEnhanced)
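Amdahl's law is easy to express as a function; plugging in the slide's two examples (fraction 0.6, speedup 3) gives the overall speedup (a minimal sketch, function name is illustrative):

```python
# Amdahl's law: overall speedup given the fraction of execution time
# that benefits from an enhancement and the speedup of that part.

def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    return 1 / ((1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# 30 s of a 50 s program (fraction 0.6) sped up 3x:
print(round(amdahl_speedup(0.6, 3), 3))  # 1.667
```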
Example • Suppose we consider an enhancement to the processor of a server system used for Web serving. New CPU is 10 times faster on computation in Web application than original CPU. Assume original CPU is busy with computation 40% of the time and is waiting for I/O 60% of time.
What is the overall speedup gained from enhancement?
Answer Only the computation (40% of the time) benefits; the I/O (60%) does not: Speedup = 1 / ((1 - 0.4) + 0.4/10) = 1 / 0.64 ≈ 1.56
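The web-serving example works out numerically as follows (a sketch; the helper name is illustrative):

```python
# Web-serving example: computation (40% of the time) is made 10x faster,
# while I/O waiting (60% of the time) is unchanged.

def amdahl_speedup(f: float, s: float) -> float:
    return 1 / ((1 - f) + f / s)

print(round(amdahl_speedup(0.4, 10), 4))  # 1.5625
```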
Remark • If an enhancement is only usable for a fraction of a task, we cannot speed up the task by more than 1 / (1 - FractionEnhanced)
Example • A common transformation required in graphics engines is the square root. Implementations of floating-point (FP) square root vary significantly in performance, especially among processors designed for graphics • Suppose FP square root (FPSQR) is responsible for 20% of the execution time of a critical graphics program • Design alternatives: 1. Enhance the FPSQR hardware and speed up this operation by a factor of 10 2. Make all FP instructions run faster by a factor of 1.6
Example • FP instructions are responsible for a total of 50% of the execution time. The design team believes they can make all FP instructions run 1.6 times faster with the same effort as required for the fast square root. Compare these two design alternatives.
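Comparing the two alternatives with Amdahl's law (a sketch; the helper name is illustrative):

```python
# Alternative 1: FPSQR (20% of time) made 10x faster.
# Alternative 2: all FP instructions (50% of time) made 1.6x faster.

def amdahl_speedup(f: float, s: float) -> float:
    return 1 / ((1 - f) + f / s)

alt1 = amdahl_speedup(0.20, 10)
alt2 = amdahl_speedup(0.50, 1.6)
print(round(alt1, 4))  # 1.2195
print(round(alt2, 4))  # 1.2308 -- speeding up all FP wins slightly
```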
CPU performance equation
CPU time = CPU clock cycles for a program x Clock cycle time
         = CPU clock cycles for a program / Clock rate
Example A program runs in 10s on computer A having 400 MHz clock. A new machine B, which could run the same program in 6s, has to be designed. Further, B should have 1.2 times as many clock cycles as A. What should be the clock rate of B?
Answer CPU clock cycles(A) = 10 s x 400 MHz = 4 x 10^9 cycles CPU clock cycles(B) = 1.2 x 4 x 10^9 = 4.8 x 10^9 cycles Clock rate(B) = 4.8 x 10^9 cycles / 6 s = 800 MHz
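The same calculation, as a quick Python check (variable names are illustrative):

```python
# Clock cycles = execution time x clock rate, so:
cycles_a = 10 * 400e6        # A: 10 s at 400 MHz = 4e9 cycles
cycles_b = 1.2 * cycles_a    # B needs 1.2x as many cycles
rate_b = cycles_b / 6        # clock rate needed to finish in 6 s

print(rate_b / 1e6)          # 800.0 (MHz)
```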
CPU Clock Cycles
CPI (clock cycles per instruction): average no. of clock cycles each instruction takes to execute
IC (instruction count): no. of instructions executed in the program
CPU clock cycles = CPI x IC
Note: CPI can be used to compare two different implementations of the same instruction set architecture (as the IC required for a program is the same)
Example • Consider two implementations of the same instruction set architecture. For a certain program, details of the time measurements on the two machines are given below
• Which machine is faster for this program, and by how much?
Answer
Measuring components of the CPU performance equation • CPU time: by running the program • Clock cycle time: published in the documentation • IC: by software tools / a simulator of the architecture (more difficult to obtain) • CPI: by simulation of an implementation (more difficult to obtain)
CPU clock cycles Suppose there are n different types of instruction. Let ICi = no. of times instruction type i is executed in the program, and CPIi = avg. no. of clock cycles for instruction type i. Then CPU clock cycles = Σ (i = 1 to n) CPIi x ICi
Example Suppose we have made the following measurements: – Frequency of FP operations (other than FPSQR) = 25% – Average CPI of FP operations = 4.0 – Average CPI of other instructions = 1.33 – Frequency of FPSQR = 2% – CPI of FPSQR = 20
Design alternatives: 1. Decrease the CPI of FPSQR to 2 2. Decrease the average CPI of all FP operations to 2.5
Compare these two design alternatives using the CPU performance equation.
Answers • Note that only the CPI changes; the clock rate and IC remain identical
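Since the clock rate and IC are fixed, relative CPU time follows relative CPI, which can be computed from the instruction mix (a sketch; alternative 2 is read here as applying the 2.5 CPI to all FP operations including FPSQR, 27% of the mix, which is an assumption about the slide's intent):

```python
# Average CPI = sum of (frequency_i x CPI_i) over the instruction mix.

def avg_cpi(mix):
    return sum(freq * cpi for freq, cpi in mix)

original = avg_cpi([(0.25, 4.0), (0.02, 20.0), (0.73, 1.33)])
alt1 = avg_cpi([(0.25, 4.0), (0.02, 2.0), (0.73, 1.33)])  # FPSQR CPI -> 2
alt2 = avg_cpi([(0.27, 2.5), (0.73, 1.33)])               # all FP CPI -> 2.5

print(round(original, 2), round(alt1, 2), round(alt2, 2))  # 2.37 2.01 1.65
```

On these numbers, alternative 2 gives the lower average CPI, and hence the larger speedup.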
MIPS as a performance measure
Problems with MIPS as a performance measure • MIPS is dependent on the instruction set – difficult to compare the MIPS of computers with different instruction sets
• MIPS can vary inversely with performance
MFLOPS as a performance measure
Problems with MFLOPS as a performance measure • MFLOPS is not dependable – the Cray C90 has no divide instruction, while the Pentium has one
• MFLOPS depends on the mixture of fast and slow floating point operations – add (fast) and divide (slow) operations
Instruction Set Architecture (ISA) Level
Introduction
Instruction Set Architecture • Positioned between the microarchitecture level and the operating system level • Important to system architects – the interface between software and hardware
Instruction Set Architecture
ISA contd.. • General approach of system designers: – Build programs in high-level languages – Translate to the ISA level – Build hardware that executes ISA-level programs directly
• Key challenge: – Build better machines subject to the backward-compatibility constraint
Features of a good ISA • Defines a set of instructions that can be implemented efficiently in current and future technologies, resulting in cost-effective designs over several generations • Provides a clean target for compiled code
Properties of the ISA level • ISA-level code is what a compiler outputs • To produce ISA code, the compiler writer has to know – What the memory model is – What registers there are – What data types and instructions are available
ISA-level memory models • Computers divide memory into cells (8 bits) that have consecutive addresses • Bytes are grouped into words (4-, 8-byte) with instructions available for manipulating entire words • Many architectures require words to be aligned on their natural boundaries – Memories operate more efficiently that way
ISA-level Memory Models • On the Pentium II (which fetches 8 bytes at a time from memory), ISA programs can make memory references to words starting at any address – Requires extra logic circuits on the chip – Intel allows it because of the backward-compatibility constraint (8088 programs made non-aligned memory references)
ISA-level registers • Main function of ISA-level registers: – provide rapid access to heavily used data
• Registers are divided into 2 categories – Special-purpose registers (program counter, stack pointer) – General-purpose registers (hold key local variables, intermediate results of calculations). These are interchangeable
Instructions • The main feature of the ISA level is its set of machine instructions • They control what the machine can do • Ex: – LOAD and STORE instructions move data between memory and registers – The MOVE instruction copies data among registers
Pentium II ISA level (Intel’s IA-32) • Maintains full support for execution of programs written for the 8086 and 8088 processors (16-bit) • The Pentium II has 3 operating modes (real mode, virtual 8086 mode, protected mode) • Address space: memory is divided into 16,384 segments, each going from address 0 to address 2^32 - 1 (Windows supports only one segment) • Every byte has its own address, with words being 32 bits long • Words are stored in little-endian format (the low-order byte has the lowest address)
Little endian and Big endian format
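The difference between the two byte orders can be seen directly with Python's `struct` module (a sketch; the value 0x01020304 is just an illustrative 32-bit word):

```python
import struct

# Pack the 32-bit value 0x01020304 in both byte orders.
value = 0x01020304
little = struct.pack('<I', value)  # little endian: low-order byte first
big = struct.pack('>I', value)     # big endian: high-order byte first

print(little.hex())  # 04030201 -- low-order byte at the lowest address
print(big.hex())     # 01020304 -- high-order byte at the lowest address
```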
Pentium II’s primary registers
Pentium II’s primary registers • EAX: main arithmetic register, 32-bit – a 16-bit register in the low-order 16 bits – an 8-bit register in the low-order 8 bits – easy to manipulate 16-bit (as in the 80286) and 8-bit (as in the 8088) quantities
• EBX: holds pointers • ECX: used in looping • EDX: used for multiplication and division, where, together with EAX, it holds 64-bit products and dividends
Pentium II’s primary registers • ESI, EDI: hold pointers into memory – especially for the hardware string-manipulation instructions (ESI points to the source string, EDI to the destination string)
• EBP: pointer register • ESP: stack pointer • CS through GS: segment registers • EIP: program counter • EFLAGS: flag register (holds various miscellaneous bits such as condition codes)
Pentium II data Types
Instruction Formats • An instruction consists of an opcode, plus additional information such as where operands come from and where results go • The opcode tells what the instruction does • On some machines, all instructions have the same length – Advantages: simple, easy to decode – Disadvantages: wastes space
Common Instruction Formats
(a) Zero-address instruction (b) One-address instruction (c) Two-address instruction (d) Three-address instruction
Instruction and Word length Relationships
Example • An instruction with a 4-bit opcode and three 4-bit addresses
Design of Instruction Formats • Factors: – Length of the instruction • short instructions are better than long ones (modern processors can execute multiple instructions per clock cycle) – Sufficient room in the instruction format to express all required operations – No. of bits in an address field
Intel® 64 and IA-32 Architectures • Intel 64 and IA-32 instruction groups: – General purpose – x87 FPU – x87 FPU and SIMD state management – Intel MMX technology – SSE extensions – SSE2 extensions – SSE3 extensions – SSSE3 extensions – SSE4 extensions – AESNI and PCLMULQDQ – Intel AVX extensions – F16C, RDRAND, FS/GS base access – System instructions – IA-32e mode: 64-bit mode instructions – VMX instructions – SMX instructions
Addressing
Addressing • The subject of specifying where the operands (addresses) are – An ADD instruction requires 2 or 3 operands, and the instruction must tell where to find the operands and where to put the result
• Addressing modes – Methods of interpreting the bits of an address field to find the operand: – Immediate addressing – Direct addressing – Register addressing – Register indirect addressing – Indexed addressing
Immediate Addressing • The simplest way to specify where the operand is • The address part of the instruction contains the operand itself (an immediate operand) • The operand is automatically fetched from memory at the same time the instruction itself is fetched – immediately available for use
• No additional memory references are required • Disadvantages – only a constant can be supplied – the value of the constant is limited by the size of the address field
• Good for specifying small integers
Example Immediate Addressing MOV R1, #8 ; Reg[R1] ← 8 ADD R2, #3 ; Reg[R2] ← Reg[R2] + 3
Direct Addressing • The operand is in memory and is specified by giving its full address (the memory address is hardwired into the instruction) • The instruction will always access exactly the same memory location, which cannot change • Can only be used for global variables whose address is known at compile time
• Example instruction: – ADD R1, (1001) ; Reg[R1] ← Reg[R1] + Mem[1001]
Direct Addressing Example
Register Addressing • Same as direct addressing, except that it specifies a register instead of a memory location • The most common addressing mode on most computers, since register accesses are very fast • Compilers try to put the most commonly accessed variables in registers • Cannot be the only mode used in LOAD and STORE instructions (one operand is always a memory address)
• Example instruction: – ADD R3, R4 ; Reg[R3] ← Reg[R3] + Reg[R4]
Register Indirect Addressing • The operand being specified comes from memory or goes to memory • Its address is not hardwired into the instruction, but is contained in a register (a pointer) • Can reference memory without having a full memory address in the instruction • Different memory words can be used on different executions of the instruction
• Example instruction: – ADD R1, (R2) ; Reg[R1] ← Reg[R1] + Mem[Reg[R2]]
Example • The following generic assembly program calculates the sum of the elements (1024 of them) of an array A of 4-byte integers, and stores the result in register R1
MOV R1, #0          ; sum in R1 (0 initially)
MOV R2, #A          ; Reg[R2] = address of array A
MOV R3, #A+4096     ; Reg[R3] = address of first word beyond A
LOOP: ADD R1, (R2)  ; register indirect via R2 to get operand
ADD R2, #4          ; increment R2 by one word
CMP R2, R3          ; is R2 < R3?
BLT LOOP            ; loop if R2 < R3
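The same loop can be mirrored in Python, with a variable playing the role of the pointer register R2 (a sketch; the array contents are illustrative, since the slide leaves them unspecified):

```python
# Python equivalent of the assembly loop: sum 1024 array elements by
# advancing a "pointer" one element at a time.
A = list(range(1024))   # hypothetical contents of array A

r1 = 0                  # sum accumulator (register R1)
r2 = 0                  # index of current element (pointer register R2)
r3 = len(A)             # first index beyond the array (R3 = A + 4096 bytes)
while r2 < r3:          # CMP R2, R3 / BLT LOOP
    r1 += A[r2]         # register-indirect load and add
    r2 += 1             # advance one word

print(r1)  # 523776 for this array
```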
Indexed Addressing • Memory is addressed by giving a register plus a constant offset • Used to access local variables
• Example instruction: – ADD R3, 100(R2) ; Reg[R3] ← Reg[R3] + Mem[100 + Reg[R2]]
Based-Indexed Addressing • The memory address is computed by adding up two registers plus an optional offset • Example instruction: ADD R3, (R1+R2) ; Reg[R3] ← Reg[R3] + Mem[Reg[R1] + Reg[R2]]
Instruction Types • ISA-level instructions are divided into a few categories – Data movement instructions • Copy data from one location to another
– Examples (Pentium II integer instructions): • MOV DST, SRC – copies SRC (source) to DST (destination) • PUSH SRC – pushes SRC onto the stack • XCHG DS1, DS2 – exchanges DS1 and DS2 • CMOV DST, SRC – conditional move
Instruction Types contd.. – Dyadic operations • Combine two operands to produce a result (arithmetic instructions, Boolean instructions)
– Examples (Pentium II integer instructions): • ADD DST, SRC – adds SRC to DST, puts the result in DST • SUB DST, SRC – subtracts SRC from DST • AND DST, SRC – Boolean AND of SRC into DST • OR DST, SRC – Boolean OR of SRC into DST • XOR DST, SRC – Boolean exclusive OR of SRC into DST
Instruction Types contd.. • Monadic operations – Have one operand and produce one result – Shorter than dyadic instructions
• Examples (Pentium II integer instructions): – INC DST – adds 1 to DST – DEC DST – subtracts 1 from DST – NOT DST – replaces DST with its 1’s complement
Instruction Types contd.. • Comparison and conditional branch instructions • Examples (Pentium II integer instructions): – TEST SRC1, SRC2 – Boolean ANDs the operands, sets the flags (EFLAGS) – CMP SRC1, SRC2 – sets the flags based on SRC1 - SRC2
Instruction Types contd.. • Procedure (subroutine) call instructions – When the procedure has finished its task, control is returned to the statement after the call
• Examples (Pentium II integer instructions): – CALL ADDR – calls the procedure at ADDR – RET – returns from the procedure
Instruction Types contd.. • Loop control instructions – LOOPxx – loops until a condition is met
• Input/output instructions – Several input/output schemes are currently used in personal computers: – Programmed I/O with busy waiting – Interrupt-driven I/O – DMA (Direct Memory Access) I/O
Programmed I/O with busy waiting • Simplest I/O method • Commonly used in low-end processors • Processors have a single input instruction and a single output instruction, and each of them selects one of the I/O devices • A single character is transferred between a fixed register in the processor and selected I/O device • Processor must execute an explicit sequence of instructions for each and every character read or written 247
DMA I/O • DMA controller is a chip that has direct access to the bus • It consists of at least four registers, each of which can be loaded by software. – Register 1 contains the memory address to be read/written – Register 2 contains the count of how many bytes/words are to be transferred – Register 3 specifies the device number or I/O space address to use – Register 4 indicates whether data are to be read from or written to the I/O device 248
Structure of a DMA
249
Registers in the DMA
• Status register: readable by the CPU to determine the status of the DMA device (idle, busy, etc)
• Command register: writable by the CPU to issue a command to the DMA
• Data register: readable and writable. It is the buffering place for data that is being transferred between the memory and the IO device.
• Address register: contains the starting location of memory where from or where to the data will be transferred. The Address register must be programmed by the CPU before issuing a "start" command to the DMA.
• Count register: contains the number of bytes that need to be transferred. The information in the address and the count register combined will specify exactly what information needs to be transferred. 250
Example • Writing a block of 32 bytes from memory address 100 to terminal device 4
251
Example contd.. • CPU writes numbers 32, 100, and 4 into first three DMA registers, and writes the code for WRITE (1, for example) in the fourth register • DMA controller makes a bus request to read byte 100 from memory • DMA controller makes an I/O request to device 4 to write the byte to it • DMA controller increments its address register by 1 and decrements its count register by 1 • If the count register is > 0, another byte is read from memory and then written to device • DMA controller stops transferring data when count = 0 252
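The transfer sequence above can be sketched in a few lines of Python (an illustrative model, not real driver code; `dma_write`, `memory`, and `device` are hypothetical names):

```python
def dma_write(memory, device, count, addr):
    """Model of the DMA WRITE sequence: copy `count` bytes starting at
    memory[addr] to the device, stepping the address/count registers."""
    while count > 0:
        byte = memory[addr]   # bus request: read one byte from memory
        device.append(byte)   # I/O request: write the byte to the device
        addr += 1             # increment the address register
        count -= 1            # decrement the count register
    # the controller stops transferring when count reaches 0

mem = list(range(200))        # toy "memory"
out = []                      # toy "terminal device 4"
dma_write(mem, out, 32, 100)  # the slide's example: 32 bytes from address 100
```

The point of the model is that the CPU only loads the registers once; the loop body runs in the controller, freeing the CPU for other work.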
Sample Questions Q1. 1. Explain the processor architecture of the 8086. 2. What are the differences between an Intel Pentium processor and a dual-core processor? 3. What are the advantages and disadvantages of multi-core processors?
253
Sample Questions Q2. 1. What is addressing? 2. Comparing their advantages, disadvantages, and features, briefly explain each addressing mode. 3. What is DMA and why is it useful for programming? Explain your answer
254
Computer Memory • Primary Memory • Secondary Memory • Virtual Memory
255
Levels in Memory Hierarchy (larger, slower, cheaper moving down)

Level     | Size       | Speed  | $/MB        | Line size | Transfer unit to level above
Registers | 32 B       | 0.3 ns | —           | 4 B       | —
Cache     | 32 KB–4 MB | 2 ns   | $75/MB      | 32 B      | 8 B
Memory    | 4096 MB    | 7.5 ns | $0.014/MB   | 4 KB      | 32 B
Disk      | 1 TB       | 8 ms   | $0.00012/MB | —         | 4 KB
Primary Memory
257
Primary memory • Memory is the workspace for the CPU • When a file is loaded into memory, it is a copy of the file that is actually loaded • Consists of a no. of cells, each having a number (address) • n cells → addresses: 0 to n−1 • Same no. of bits in each cell • Adjacent cells have consecutive addresses • m-bit address → 2^m addressable cells • A portion of the RAM address space is mapped into one or more ROM chips 258
Ways of organizing a 96-bit memory
259
SRAM (Static RAM)
• Constructed using flip flops
• 6 transistors for each bit of storage
• Very fast
• Contents are retained as long as power is kept on
• Expensive
• Used in level 2 cache
260
DRAM (Dynamic RAM) • No flip-flops • Array of cells, each consisting of a transistor and a capacitor • Capacitors can be charged or discharged, allowing 0s and 1s to be stored • Electric charge tends to leak out ⇒ each bit in a DRAM must be reloaded (refreshed) every few milliseconds (15 ms) to prevent data from leaking away • Refreshing takes several CPU cycles to complete (less than 1% of overall bandwidth) • High density (30 times smaller than SRAM) • Used in main memories • Slower than SRAM • Inexpensive (30 times cheaper than SRAM)
261
SDRAM (Synchronous DRAM)
• Hybrid of SRAM and DRAM
• Runs in synchronization with the system bus
• Driven by a single synchronous clock
• Used in large caches, main memories
262
DDR (Double Data Rate) SDRAM • An upgrade to standard SDRAM • Performs 2 transfers per clock cycle (one at falling edge, one at rising edge) without doubling actual clock rate
263
Dual channel DDR
• Technique in which 2 DDR DIMMs are installed at one time and function as a single bank, doubling the bandwidth of a single module
• DDR2 SDRAM
– A faster version of DDR SDRAM (doubles the data rate of DDR)
– Less power consumption than DDR
– Achieves higher throughput by using differential pairs of signal wires
– Additional signals add to the pin count
• DDR3 SDRAM
– An improved version of DDR2 SDRAM
– Same no. of pins as DDR2, but not compatible with DDR2
– Can transfer twice the data rate of DDR2
– DDR3 standard allows chip sizes of 512 megabits to 8 gigabits (max module size – 16 GB)
264
DRAM Memory module
265
DRAM Memory module
266
SDRAM and DDR DIMM versions • Buffered • Unbuffered • Registered
267
SDRAM and DDR DIMM • Buffered Module – Has additional buffer circuits between memory chips and the connector to buffer signals – New motherboards are not designed to use buffered modules
• Unbuffered Module – Allows memory controller signals to pass directly to memory chips with no interference – Fast and most efficient design – Most motherboards are designed to use unbuffered modules 268
SDRAM and DDR DIMM • Registered Module – Uses register chips on the module that act as an interface between RAM chip and chipset – Used in systems designed to accept extremely large amounts of RAM (server motherboards)
269
Memory Errors
270
Memory errors • Hard errors – Permanent failure – How to fix? (replace the chip)
• Soft errors – Non permanent failure – Occurs at infrequent intervals – How to fix? (restart the system)
• Best way to deal with soft errors is to increase system’s fault tolerance (implement ways of detecting and correcting errors) 271
Techniques used for fault tolerance • Parity • ECC (Error Correcting Code)
272
Parity Checking • 9 bits are used in the memory chip to store 1 byte of information • The extra bit (parity bit) keeps tabs on the other 8 bits • Parity can only detect errors, but cannot correct them
273
Odd parity standard for error checking • Parity generator/checker is a part of the CPU or located in a special chip on the motherboard • Parity checker evaluates the 8 data bits by adding the no. of 1s in the byte • If an even no. of 1s is found, the parity generator creates a 1 and stores it as the parity bit in the memory chip 274
Odd parity standard for error checking (contd.) • If the sum is odd, the parity bit would be 0 • If a (9 bit) byte has an even no. of 1s, that byte must have an error • System cannot tell which bit or bits have changed • If 2 bits changed, the bad byte could pass unnoticed • Multiple bit errors in a single byte are very rare • System halts when a parity check error is detected 275
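The odd-parity rule above can be sketched as follows (a minimal illustration; `odd_parity_bit` and `parity_error` are hypothetical helper names):

```python
def odd_parity_bit(byte):
    """Parity bit stored with `byte` under odd parity: chosen so that the
    9 bits together contain an odd number of 1s."""
    ones = bin(byte & 0xFF).count("1")
    return 1 if ones % 2 == 0 else 0   # even count of data 1s -> parity bit is 1

def parity_error(byte, parity):
    """True if the stored 9-bit pattern has an even number of 1s (an error)."""
    return (bin(byte & 0xFF).count("1") + parity) % 2 == 0
```

Flipping any single bit of the stored 9-bit pattern makes the count of 1s even, which the checker detects; flipping two bits restores an odd count, so the error passes unnoticed, exactly as the slide notes.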
ECC – Error Correcting Code • Successor to parity checking • Can detect and correct memory errors • Only a single-bit error can be corrected, though it can detect double-bit errors • This type of ECC is known as single-bit error correction, double-bit error detection (SEC-DED) • SEC-DED requires an additional 7 check bits over 32 bits in a 4-byte system, or 8 check bits over 64 bits in an 8-byte system 276
ECC – Error Correcting Code • ECC entails the memory controller calculating check bits on a memory write operation and comparing the stored and recalculated check bits on a memory read operation • Cost of the additional ECC logic in the memory controller is not significant • It affects memory performance on a write 277
Cache memory
278
Cache Memory • A high-speed, small memory • The most frequently used memory words are kept in it • When the CPU needs a word, it first checks the cache. If not found, it checks main memory
279
Cache and Main Memory
280
Cache memory Vs Main Memory
281
Cache Hit and Miss • Cache Hit: a request to read from memory which can be satisfied from the cache without using the main memory. • Cache Miss: a request to read from memory which cannot be satisfied from the cache, for which the main memory has to be consulted. 282
Locality Principle • PRINCIPLE OF LOCALITY is the tendency to reference data items that are near other recently referenced data items, or that were recently referenced themselves. • TEMPORAL LOCALITY: a memory location that is referenced once is likely to be referenced multiple times in the near future. • SPATIAL LOCALITY: if a memory location is referenced once, then the program is likely to reference a nearby memory location in the near future. 283
Locality Principle
Let c – cache access time m – main memory access time h – hit ratio (fraction of all references that can be satisfied out of the cache) miss ratio = 1 − h
Average memory access time = c + (1 − h) × m
h = 1: no main memory references; h = 0: every reference goes to main memory 284
Example: Suppose that a word is read k times in a short interval
First reference: memory; other k − 1 references: cache
h = (k − 1)/k
Average memory access time = c + m/k 285
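As a quick check of the formula, a sketch in Python (the function and variable names are illustrative):

```python
def avg_access_time(c, m, h):
    """Average access time: every reference pays the cache time c, and the
    fraction (1 - h) that miss also pay the main-memory time m."""
    return c + (1 - h) * m

# The slides' example: a word read k times; first from memory, rest from cache
k = 10
h = (k - 1) / k    # hit ratio
```

With h = (k − 1)/k the general formula collapses to c + m/k, so the memory cost is amortized over the k accesses.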
Cache Memory • Main memories and caches are divided into fixed-sized blocks • Cache lines – blocks inside the cache • On a cache miss, the entire cache line is loaded into the cache from memory • Example: – A 64K cache can be divided into 1K lines of 64 bytes, 2K lines of 32 bytes, etc.
• Unified cache – instruction and data use the same cache
• Split cache – Instructions in one cache and data in another 286
A system with three levels of cache
287
Pentium 4 Block Diagram
288
Replacement Algorithm • Optimal Replacement: replace the block which is no longer needed in the future. If all blocks currently in Cache Memory will be used again, replace the one which will not be used in the future for the longest time. • Random selection: replace a randomly selected block among all blocks currently in Cache Memory. 289
Replacement Algorithm • FIFO (first-in first-out): replace the block that has been in Cache Memory for the longest time. • LRU (Least recently used): replace the block in Cache Memory that has not been used for the longest time. • LFU (Least frequently used): replace the block in Cache Memory that has been used for the least number of times 290
Cache Memory Placement Policy • Three commonly used methods to translate main memory addresses to cache memory addresses. – Associative Mapped Cache – Direct-Mapped Cache – Set-Associative Mapped Cache
• The choice of cache mapping scheme affects cost and performance, and there is no single best method that is appropriate for all situations 291
Associative Mapping
292
Associative Mapping • A block in Main Memory can be mapped to any available (not already occupied) block in Cache Memory • Advantage: Flexibility. A Main Memory block can be mapped anywhere in Cache Memory. • Disadvantage: Slow or expensive. A search through all the Cache Memory blocks is needed to check whether the address can be matched to any of the tags. 293
Direct Mapping
294
Direct Mapping To avoid the search through all CM blocks needed by associative mapping, this method allows only (# blocks in main memory) / (# blocks in cache memory) main memory blocks to be mapped to each Cache Memory block. • Each entry (row) in the cache can hold exactly one cache line from main memory • With a 32-byte cache line size and 2K entries, the cache can hold 64 KB 295
Direct Mapping • Advantage: Direct mapping is faster than the associative mapping as it avoids searching through all the CM tags for a match. • Disadvantage: But it lacks mapping flexibility. For example, if two MM blocks mapped to same CM block are needed repeatedly (e.g., in a loop), they will keep replacing each other, even though all other CM blocks may be available. 296
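The address-to-line mapping behind direct mapping can be sketched as follows (a minimal model assuming the 64 KB cache with 32-byte lines used in the slides' example; the names are hypothetical):

```python
LINE_SIZE = 32     # bytes per cache line (the slides' example; assumption)
NUM_LINES = 2048   # 64 KB cache / 32-byte lines

def direct_map(addr):
    """Split a memory address into (tag, line, offset) for a direct-mapped cache."""
    offset = addr % LINE_SIZE                # byte within the line
    line = (addr // LINE_SIZE) % NUM_LINES   # each address maps to exactly one line
    tag = addr // (LINE_SIZE * NUM_LINES)    # distinguishes blocks sharing that line
    return tag, line, offset
```

Two addresses a whole cache-size (64 KB) apart land on the same line with different tags, which is exactly the repeated-replacement hazard the slide describes for loops.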
Set-Associative Mapping
297
Set-Associative Mapping • This is a trade-off between associative and direct mappings where each address is mapped to a certain set of cache locations. • The cache is broken into sets where each set contains "N" cache lines, let's say 4. Then, each memory address is assigned a set, and can be cached in any one of those 4 locations within the set that it is assigned to. In other words, within each set the cache is associative, and thus the name. 298
Set Associative cache • LRU (Least Recently Used) algorithm is used – keep an ordering of each set of locations that could be accessed from a given memory location – whenever any of present lines are accessed, it updates list, making that entry the most recently accessed – when it comes to replace an entry, one at the end of list is discarded 299
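The per-set LRU bookkeeping described above can be sketched with an ordered dictionary (an illustrative 4-way model; `LRUSet` is a hypothetical name):

```python
from collections import OrderedDict

class LRUSet:
    """One set of an N-way set-associative cache with LRU replacement."""
    def __init__(self, ways=4):
        self.ways = ways
        self.lines = OrderedDict()   # tag -> line data, ordered oldest first

    def access(self, tag):
        """Return True on a hit; on a miss, load the tag, evicting the LRU line."""
        if tag in self.lines:
            self.lines.move_to_end(tag)      # mark as most recently used
            return True
        if len(self.lines) >= self.ways:
            self.lines.popitem(last=False)   # discard the least recently used entry
        self.lines[tag] = None
        return False
```

Every access reorders the set so that the entry at the front of the list is always the one that has gone longest without being touched.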
Load-Through and Store-Through
• Load-Through: When the CPU needs to read a word from the memory, the block containing the word is brought from MM to CM, while at the same time the word is forwarded to the CPU.
• Store-Through: If store-through is used, a word to be stored from CPU to memory is written to both CM (if the word is in there) and MM. By doing so, a CM block to be replaced can be overwritten by an in-coming block without being saved to MM. 300
Cache Write Methods • Words in a cache have been viewed simply as copies of words from main memory that are read from the cache to provide faster access. However, this viewpoint changes once writes are considered. • There are 3 possible write actions: – Write the result into the main memory – Write the result into the cache – Write the result into both main memory and cache memory
301
Cache Write Methods • Write Through: A cache architecture in which data is written to main memory at the same time as it is cached. • Write Back / Copy Back: CPU performs write only to the cache in case of a cache hit. If there is a cache miss, CPU performs a write to main memory. • When the cache is missed : – Write Allocate: loads the memory block into cache and updates the cache block – No-Write allocation: this bypasses the cache and writes the word directly into the memory. 302
Cache Evolution

Problem | Solution | Processor on which feature first appears
External memory slower than the system bus | Add external cache using faster memory technology | 386
Increased processor speed results in external bus becoming a bottleneck for cache access | Move external cache on-chip, operating at the same speed as the processor | 486
Internal cache is rather small, due to limited space on chip | Add external L2 cache using faster technology than main memory | 486
303
Cache Evolution (contd.)

Problem | Solution | Processor on which feature first appears
Increased processor speed results in external bus becoming a bottleneck for L2 cache access | Create separate back-side bus that runs at higher speed than the main (front-side) external bus; the BSB is dedicated to the L2 cache | Pentium Pro
 | Move L2 cache on to the processor chip | Pentium II
Some applications deal with massive databases and must have rapid access to large amounts of data; the on-chip caches are too small | Add external L3 cache | Pentium III
 | Move L3 cache on-chip | Pentium 4
304
Comparison of Cache Sizes

Processor | Type | Year of Introduction | L1 cache | L2 cache | L3 cache
IBM 360/85 | Mainframe | 1968 | 16 to 32 KB | — | —
PDP-11/70 | Minicomputer | 1975 | 1 KB | — | —
VAX 11/780 | Minicomputer | 1978 | 16 KB | — | —
IBM 3033 | Mainframe | 1978 | 64 KB | — | —
IBM 3090 | Mainframe | 1985 | 128 to 256 KB | — | —
Intel 80486 | PC | 1989 | 8 KB | — | —
Pentium | PC | 1993 | 8 KB/8 KB | 256 to 512 KB | —
PowerPC 601 | PC | 1993 | 32 KB | — | —
PowerPC 620 | PC | 1996 | 32 KB/32 KB | — | —
PowerPC G4 | PC/server | 1999 | 32 KB/32 KB | 256 KB to 1 MB | 2 MB
IBM S/390 G4 | Mainframe | 1997 | 32 KB | 256 KB | 2 MB
IBM S/390 G6 | Mainframe | 1999 | 256 KB | 8 MB | —
Pentium 4 | PC/server | 2000 | 8 KB/8 KB | 256 KB | —
IBM SP | High-end server | 2000 | 64 KB/32 KB | 8 MB | —
CRAY MTAb | Supercomputer | 2000 | 8 KB | 2 MB | —
Itanium | PC/server | 2001 | 16 KB/16 KB | 96 KB | 4 MB
SGI Origin 2001 | High-end server | 2001 | 32 KB/32 KB | 4 MB | —
Itanium 2 | PC/server | 2002 | 32 KB | 256 KB | 6 MB
IBM POWER5 | High-end server | 2003 | 64 KB | 1.9 MB | 36 MB
CRAY XD-1 | Supercomputer | 2004 | 64 KB/64 KB | 1 MB | —
Memory stall cycles: No. of clock cycles during which the CPU is stalled waiting for a memory access
CPU time = (CPU clock cycles + Memory stall cycles) × Clock cycle time
Memory stall cycles = No. of misses x Miss penalty = IC x Misses per instruction x Miss penalty = IC x Memory accesses per instruction x Miss ratio x Miss penalty 306
Example Assume we have a machine where CPI is 2.0 when all memory accesses hit in the cache. Only data accesses are loads and stores, and these total 40% of instructions. If the miss penalty is 25 clock cycles and miss ratio is 2%, how much faster would the machine be if all instructions were cache hits?
307
Answer
308
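Using the memory-stall formula from the previous slides, the answer can be sketched as follows (illustrative Python; the variable names are mine):

```python
# Given values from the question
cpi_ideal = 2.0                    # CPI when every memory access hits
mem_accesses_per_instr = 1 + 0.4   # 1 instruction fetch + 0.4 data accesses
miss_ratio = 0.02
miss_penalty = 25                  # clock cycles

# Memory stall cycles per instruction = accesses/instr x miss ratio x penalty
stall_cpi = mem_accesses_per_instr * miss_ratio * miss_penalty
cpi_real = cpi_ideal + stall_cpi
speedup = cpi_real / cpi_ideal     # how much faster with all cache hits
```

The stall component works out to 1.4 × 0.02 × 25 = 0.7 cycles per instruction, so the all-hits machine is 2.7/2.0 = 1.35 times faster.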
Secondary Memory
309
Technologies • Magnetic storage – Floppy, Zip disk, Hard drives, Tapes
• Optical storage – CD, DVD, Blu-ray, HD-DVD
• Solid state memory – USB flash drive, Memory cards for mobile phones/digital cameras/MP3 players, Solid State Drives
310
Magnetic Disk
• Purpose:
– Long term, nonvolatile storage
– Large, inexpensive, and slow
– Lowest level in the memory hierarchy
• Two major types:
– Floppy disk
– Hard disk
• Both types of disks:
– Rely on a rotating platter coated with a magnetic surface
– Use a moveable read/write head to access the disk
• Advantages of hard disks over floppy disks:
– Platters are more rigid (metal or glass) so they can be larger
– Higher density because the head can be controlled more precisely
– Higher data rate because the disk spins faster
– Can incorporate more than one platter
Components of a Disk
• The arm assembly is moved in or out to position a head on a desired track.
• Tracks under heads make a cylinder (imaginary!). Only one head reads/writes at any one time.
• Block size is a multiple of sector size (which is often fixed).
(Figure: platters on a spindle, showing tracks, sectors, disk heads, and the moving arm assembly)
313
Internal Hard-Disk
Page 223
Magnetic Disk • A stack of platters, each surface with a magnetic coating • Typical numbers (depending on the disk size): – 500 to 2,000 tracks per surface – 32 to 128 sectors per track
• A sector is the smallest unit that can be read or written • Traditionally all tracks have the same number of sectors: • Constant bit density: record more sectors on the outer tracks
Magnetic Disk Characteristics
• Disk head: each side of a platter has a separate disk head
• Cylinder: all the tracks under the heads at a given point on all surfaces
• Read/write data is a three-stage process:
– Seek time: position the arm over the proper track
– Rotational latency: wait for the desired sector to rotate under the read/write head
– Transfer time: transfer a block of bits (sector) under the read-write head
• Average seek time as reported by the industry:
– Typically in the range of 8 ms to 15 ms
– (Sum of the time for all possible seeks) / (total # of possible seeks)
• Due to locality of disk references, the actual average seek time may:
– Only be 25% to 33% of the advertised number
Typical Numbers of a Magnetic Disk • Rotational Latency: – Most disks rotate at 3,600/5400/7200 RPM – Approximately 16 ms per revolution – An average latency to the desired information is halfway around the disk: 8 ms
• Transfer Time is a function of : – Transfer size (usually a sector): 1 KB / sector – Rotation speed: 3600 RPM to 5400 RPM to 7200 – Recording density: typical diameter ranges from 2 to 14 in – Typical values: 2 to 4 MB per second
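The rotational-latency and transfer-time figures above can be reproduced with two small helpers (the function names are illustrative):

```python
def rotational_latency_ms(rpm):
    """Average rotational latency: half a revolution, in milliseconds."""
    return 0.5 * 60_000 / rpm

def transfer_time_ms(nbytes, rate_mb_per_s):
    """Time to transfer nbytes at a sustained rate given in MB/s."""
    return nbytes / (rate_mb_per_s * 1e6) * 1000
```

At 3,600 RPM a full revolution takes about 16.7 ms, so the average latency is about 8.3 ms, matching the slide; at 7,200 RPM it drops to about 4.2 ms.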
Disk I/O Performance
Disk Access Time = Seek time + Rotational Latency + Transfer time + Controller Time + Queueing Delay
Disk I/O Performance • Disk Access Time = Seek time + Rotational Latency + Transfer time + Controller Time + Queueing Delay • Estimating Queue Length: – Utilization = U = Request Rate / Service Rate – Mean Queue Length = U / (1 − U) – As Request Rate → Service Rate, Mean Queue Length → Infinity
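The queue-length estimate can be sketched directly from the formula (`mean_queue_length` is an illustrative helper name):

```python
def mean_queue_length(request_rate, service_rate):
    """Mean queue length U / (1 - U), where U is the utilization."""
    u = request_rate / service_rate
    return u / (1 - u)
```

At 50% utilization the queue averages one request; at 90% it averages nine, and as the request rate approaches the service rate the queue length grows without bound, as the slide notes.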
Example
• Setup parameters:
– 16383 cylinders, 63 sectors per track, 3 platters, 6 heads
– Bytes per sector: 512
– RPM: 7200
– Transfer mode: 66.6 MB/s
– Average seek time: 9.0 ms (read), 9.5 ms (write)
– Average latency: 4.17 ms
– Physical dimensions: 1'' x 4'' x 5.75''
– Interleave: 1:1
Disk performance
• Preamble: allows head to be synchronized before read/write
• ECC (Error Correction Code): corrects errors
• Unformatted capacity: preambles, ECCs and inter-sector gaps are counted as data
• Disk performance depends on
– seek time: time to move the arm to the desired track
– rotational latency: time needed for the requested sector to rotate under the head (rotational speeds: 5400, 7200, 10000, 15000 rpm)
– transfer time: time needed to transfer a block of bits under the head (e.g., 40 MB/s) 321
Disk performance
• Disk controller – chip that controls the drive. Its tasks include accepting commands (READ, WRITE, FORMAT) from software, controlling arm motion, and detecting and correcting errors
• Controller time – overhead the disk controller imposes in performing an I/O access

Avg. disk access time = avg. seek time + avg. rotational delay + transfer time + controller overhead
322
Example • The advertised average seek time of a disk is 5 ms, its transfer rate is 40 MB per second, and it rotates at 10,000 rpm. Controller overhead is 0.1 ms. Calculate the average time to read a 512-byte sector.
323
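A sketch of the calculation for this example, applying the access-time formula from the previous slide (illustrative Python; the variable names are mine):

```python
# Parameters from the example
seek_ms = 5.0
rpm = 10_000
transfer_rate = 40e6   # bytes per second
sector_bytes = 512
controller_ms = 0.1

rotational_ms = 0.5 * 60_000 / rpm               # average: half a revolution
transfer_ms = sector_bytes / transfer_rate * 1000
access_ms = seek_ms + rotational_ms + transfer_ms + controller_ms
```

The rotational delay dominates the transfer time here: 3 ms versus about 0.013 ms, for a total of roughly 8.11 ms.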
RAID (Redundant Array of Inexpensive Disks) • A disk organization used to improve the performance of storage systems • An array of disks controlled by a controller (RAID controller) • Data are distributed over the disks (striping) to allow parallel operation
324
RAID 0 – No redundancy • No redundancy to tolerate disk failure • Each strip has k sectors (say) – Strip 0: sectors 0 to k−1 – Strip 1: sectors k to 2k−1 ... etc
• Works well with large accesses • Less reliable than having a single large disk
325
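The strip-to-disk mapping described above can be sketched as follows (`raid0_locate` is a hypothetical helper; strips are assumed to be spread round-robin across the disks):

```python
def raid0_locate(sector, k, num_disks):
    """Map a logical sector to (disk, strip row on that disk, offset in strip)
    for RAID 0 with k sectors per strip."""
    strip = sector // k              # strip 0: sectors 0..k-1, strip 1: k..2k-1, ...
    disk = strip % num_disks         # strips spread round-robin over the disks
    return disk, strip // num_disks, sector % k
```

A large sequential access touches consecutive strips and therefore all disks in parallel, which is why RAID 0 works well with large accesses.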
Example (RAID 0) • Suppose that a RAID consists of 4 disks with an MTTF (mean time to failure) of 20,000 hours. – On average, a drive will fail once in every 5,000 hours – A single large drive with an MTTF of 20,000 hours is 4 times more reliable
326
RAID 1 (Mirroring) • Uses twice as many disks as does RAID 0 (first half: primary, second half: backup) • Duplicates all disks
• On a write, every strip is written twice • Excellent fault tolerance (if a disk fails, backup copy is used) • Requires more disks 327
RAID 3 (Bit-Interleaved Parity) • Reads/writes go to all disks in the group, with one extra disk (parity disk) to hold check information in case of a failure
• Parity contains sum of all data in other disks • If a disk fails, subtract all data in good disks from parity disk 328
RAID 4 (Block-Interleaved Parity) • RAID 4 is much like RAID 3, with strip-for-strip parity written onto an extra disk – A write involves accessing 2 disks instead of all – The parity disk must be updated on every write
329
RAID 5 – Block-Interleaved Distributed Parity • In RAID 5, parity information is spread throughout all disks • In RAID 5, multiple writes can occur simultaneously as long as the stripe units are not located on the same disks; this is not possible in RAID 4
330
Secondary Storage Devices: CD-ROM
331
Physical Organization of CD-ROM
• Compact Disk – read only memory (write once)
• Data is encoded and read optically with a laser
• Can store around 600 MB of data
• Digital data is represented as a series of Pits and Lands:
– Pit = a little depression, forming a lower level in the track
– Land = the flat part between pits, or the upper levels in the track
• Reading a CD is done by shining a laser at the disc and detecting changing reflection patterns.
– 1 = change in height (land to pit or pit to land)
– 0 = a "fixed" amount of time between 1's
332
Organization of data

LAND      PIT       LAND      PIT      LAND
...------+ +-------------+ +---... |_____| |_______| ..0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 ..

• Cannot have two 1's in a row! => uses the Eight to Fourteen Modulation (EFM) encoding table
• 0's are represented by the length of time between transitions, so the head must travel at constant linear velocity (CLV) on the tracks
• Sectors are organized along a spiral
• Sectors have the same linear length
• Advantage: takes advantage of all storage space available
• Disadvantage: has to change rotational speed when seeking (slower towards the outside) 333
CD-ROM
• Addressing
– 1 second of play time is divided up into 75 sectors
– Each sector holds 2 KB
– 60 min CD: 60 min × 60 sec/min × 75 sectors/sec = 270,000 sectors = 540,000 KB ≈ 540 MB
– A sector is addressed by Minute:Second:Sector, e.g. 16:22:34
• Type of laser
– CD: 780 nm (infrared)
– DVD: 635 nm or 650 nm (visible red)
– HD-DVD/Blu-ray Disc: 405 nm (visible blue)
• Capacity
– CD: 650 MB, 700 MB
– DVD: 4.7 GB per layer, up to 2 layers
– HD-DVD: 15 GB per layer, up to 3 layers
– BD: 25 GB per layer, up to 2 layers 334
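The Minute:Second:Sector addressing above can be sketched as follows (the helper names are illustrative):

```python
SECTORS_PER_SECOND = 75
SECTOR_KB = 2

def msf_to_sector(minute, second, frame):
    """Convert a Minute:Second:Sector address to an absolute sector number."""
    return (minute * 60 + second) * SECTORS_PER_SECOND + frame

def cd_capacity_kb(minutes):
    """Capacity of a CD with the given play time, in KB."""
    return minutes * 60 * SECTORS_PER_SECOND * SECTOR_KB
```

A 60-minute disc works out to 270,000 sectors and 540,000 KB, matching the slide's arithmetic.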
Solid state storage
335
Solid state storage • Memory cards – For Digital cameras, mobile phones, MP3 players... – Many types: Compact flash, Smart Media, Memory Stick, Secure Digital card... • USB flash drives – Replace floppies/CD-RW • Solid State Drives – Replace traditional hard disks • Uses flash memory – Type of EEPROM • Electrically erasable programmable read only memory – Grid of cells (1 cell = 1 bit) – Write/erase cells by blocks 336
Solid state storage • Cell=two transistors – Bit 1: no electrons in between – Bit 0: many electrons in between
• Performance – Access time: 10X faster than a hard drive – Transfer rate • 1x = 150 kB/sec, up to 100X for memory cards • Similar to a normal hard drive for SSD (100-150 MB/sec)
– Limited write: 100k to 1,000k cycles 337
Solid state storage • Size – Very small: 1cm² for some memory cards
• Capacity – Memory cards: up to 32 GB – USB flash drives: up to 32 GB – Solid State Drives: up to 256 GB
338
Solid state storage • Reliability – Resists shocks – Silent! – Avoid extreme heat/cold – Limited number of erase/write cycles
• Challenges – Increasing size – Improving writing limits 339
Virtual Memory
340
Virtual Memory • Virtual memory is a memory management technique developed for multitasking kernels • Separation of user logical memory from physical memory. • Logical address space can therefore be much larger than physical address space
341
A System with Physical Memory Only
• Examples: Most Cray machines, early PCs, nearly all embedded systems, etc.
(Figure: CPU generates physical addresses 0 … N−1 that index directly into memory)
• Addresses generated by the CPU correspond directly to bytes in physical memory
A System with Virtual Memory
• Examples: Workstations, servers, modern PCs, etc.
(Figure: CPU generates virtual addresses 0 … N−1; a page table maps them to physical addresses 0 … P−1 in memory, or to locations on disk)
• Address Translation: hardware converts virtual addresses to physical ones via an OS-managed lookup table (the page table)
Page Tables
(Figure: a memory-resident page table indexed by virtual page number; each entry holds a valid bit and either a physical page number or a disk address, pointing into physical memory or disk storage — a swap file or regular file system file)
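The page-table lookup described above can be sketched as follows (a minimal model assuming 4 KB pages; the names are hypothetical, and a real MMU does this in hardware):

```python
PAGE_SIZE = 4096   # bytes per page (a common choice; assumption)

def translate(page_table, vaddr):
    """Translate a virtual address using a page table that maps
    virtual page numbers to physical page numbers (None = not resident)."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)   # split into page number and offset
    ppn = page_table.get(vpn)
    if ppn is None:
        # invalid entry: the page is on disk (or unmapped) -> page fault
        raise RuntimeError("page fault on virtual page %d" % vpn)
    return ppn * PAGE_SIZE + offset          # offset is unchanged by translation
```

Only the page number is translated; the byte offset within the page passes through untouched, which is why page sizes are powers of two.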
VM – Windows • Can change the paging file size • Can set up paging files on multiple different drives
345
Windows Memory management
346
IO Fundamentals
I/O Fundamentals • Computer System has three major functions – CPU – Memory – I/O
PC with PCI and ISA bus
Types and Characteristics of I/O Devices • Behavior: how does an I/O device behave? – Input – Read only – Output - write only, cannot read – Storage - can be reread and usually rewritten
• Partner: – Either a human or a machine is at the other end of the I/O device – Either feeding data on input or reading data on output
• Data rate: – The peak rate at which data can be transferred • between the I/O device and the main memory • Or between the I/O device and the CPU
Data Rate
Buses • A bus is a shared communication link • Multiple sources and multiple destinations • It uses one set of wires to connect multiple subsystems • Different uses: – Data – Address – Control
Motherboard
Advantages • Versatility: – New devices can be added easily – Peripherals can be moved between computer systems that use the same bus standard
• Low Cost: – A single set of wires is shared in multiple ways
Disadvantages • It creates a communication bottleneck – The bandwidth of that bus can limit the maximum I/O throughput
• The maximum bus speed is largely limited by: – The length of the bus – The number of devices on the bus – The need to support a range of devices with: • Widely varying latencies • Widely varying data transfer rates
The General Organization of a Bus • Control lines: – Signal requests and acknowledgments – Indicate what type of information is on the data lines
• Data lines carry information between the source and the destination: – Data and Addresses – Complex commands
• A bus transaction includes two parts: – Sending the address – Receiving or sending the data
Master Vs Slave • A bus transaction includes two parts: – Sending the address – Receiving or sending the data
• Master is the one who starts the bus transaction by: – Sending the address
• Slave is the one who responds to the address by: – Sending data to the master if the master asks for data – Receiving data from the master if the master wants to send data
Output Operation
Input Operation • Input is defined as the Processor receiving data from the I/O device
Type of Buses
• Processor-Memory Bus (design specific or proprietary)
– Short and high speed
– Only need to match the memory system
– Maximize memory-to-processor bandwidth
– Connects directly to the processor
• I/O Bus (industry standard)
– Usually is lengthy and slower
– Need to match a wide range of I/O devices
– Connects to the processor-memory bus or backplane bus
• Backplane Bus (industry standard)
– Backplane: an interconnection structure within the chassis
– Allows processors, memory, and I/O devices to coexist
– Cost advantage: one single bus for all components
Increasing the Bus Bandwidth
• Separate versus multiplexed address and data lines:
– Address and data can be transmitted in one bus cycle if separate address and data lines are available
– Cost: (a) more bus lines, (b) increased complexity
• Data bus width:
– By increasing the width of the data bus, transfers of multiple words require fewer bus cycles
– Example: SPARCstation 20's memory bus is 128 bits wide
– Cost: more bus lines
• Block transfers:
– Allow the bus to transfer multiple words in back-to-back bus cycles
– Only one address needs to be sent at the beginning
– The bus is not released until the last word is transferred
– Cost: (a) increased complexity, (b) decreased response time for requests
Operating System Requirements • Provide protection to shared I/O resources – Guarantees that a user’s program can only access the portions of an I/O device to which the user has rights
• Provides abstraction for accessing devices: – Supply routines that handle low-level device operation
• Handles the interrupts generated by I/O devices • Provide equitable access to the shared I/O resources – All user programs must have equal access to the I/O resources
• Schedule accesses in order to enhance system throughput
OS and I/O Systems Communication Requirements • The Operating System must be able to prevent: – The user program from communicating with the I/O device directly
• If user programs could perform I/O directly: – Protection to the shared I/O resources could not be provided
• Three types of communication are required: – The OS must be able to give commands to the I/O devices – The I/O device must be able to notify the OS when the I/O device has completed an operation or has encountered an error
• Data must be transferred between memory and an I/O device
Commands to I/O Devices
• Two methods are used to address the device:
– Special I/O instructions
– Memory-mapped I/O
• Special I/O instructions specify:
– Both the device number and the command word
– Device number: the processor communicates this via a set of wires normally included as part of the I/O bus
– Command word: this is usually sent on the bus's data lines
• Memory-mapped I/O:
– Portions of the address space are assigned to I/O devices
– Reads and writes to those addresses are interpreted as commands to the I/O devices
– User programs are prevented from issuing I/O operations directly:
• The I/O address space is protected by the address translation
I/O Device Notifying the OS • The OS needs to know when: – The I/O device has completed an operation – The I/O operation has encountered an error
• This can be accomplished in two different ways: – Polling: • The I/O device puts information in a status register • The OS periodically checks the status register
– I/O Interrupt: • Whenever an I/O device needs attention from the processor, it interrupts the processor from what it is currently doing.
Polling • Advantage: – Simple: the processor is totally in control and does all the work
• Disadvantage: – Polling overhead can consume a lot of CPU time
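A minimal polling loop can be sketched as follows; the status register is simulated here (a real driver would read a hardware register), and the poll count is arbitrary:

```python
# Polling sketch: the CPU repeatedly reads a (simulated) device status
# register until the device reports ready. The busy-wait loop is the
# CPU-time overhead that polling pays.
status_register = {"ready": False}
countdown = 5                      # device "completes" after 5 status reads

def read_status():
    global countdown
    countdown -= 1
    if countdown == 0:
        status_register["ready"] = True
    return status_register["ready"]

polls = 0
while not read_status():           # busy-wait: CPU does nothing else here
    polls += 1
print("device ready after", polls + 1, "status reads")
```

Every iteration of that `while` loop is a wasted CPU cycle from the user program's point of view, which is the overhead interrupts are designed to avoid.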
Interrupts • An interrupt is an asynchronous signal indicating the need for attention, or a synchronous software event indicating the need for a change in execution • Advantage: – User program progress is only halted during the actual transfer
• Disadvantage: special hardware is needed to: – Cause an interrupt (I/O device) – Detect an interrupt (processor) – Save the proper state to resume after the interrupt (processor)
Interrupt Driven Data Transfer • An I/O interrupt is just like an exception, except: – An I/O interrupt is asynchronous – Further information needs to be conveyed
• An I/O interrupt is asynchronous with respect to instruction execution: – I/O interrupt is not associated with any instruction – I/O interrupt does not prevent any instruction from completion – You can pick your own convenient point to take an interrupt
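The interrupt style of notification can be sketched with threads: instead of polling, the "CPU" thread blocks until the "device" signals completion via an `Event`, which stands in for the interrupt line. The device, timing, and data are all simulated:

```python
import threading
import time

# Interrupt-style sketch: the CPU thread does not poll; it is notified
# when the (simulated) device finishes its I/O operation.
done = threading.Event()
result = []

def device():                      # simulated I/O device
    time.sleep(0.05)               # the I/O operation takes a while
    result.append("data")          # data is now available
    done.set()                     # raise the "interrupt"

threading.Thread(target=device).start()
# The CPU is free to do other work here, then "take the interrupt":
done.wait()                        # blocks only until the device signals
print(result[0])
```

This also shows the asynchrony: the device decides when `done.set()` fires, and the CPU picks its own convenient point (`done.wait()`) to respond.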
I/O Interrupt • An I/O interrupt is more complicated than an exception: – It needs to convey the identity of the device generating the interrupt – Interrupt requests can have different urgencies, so requests need to be prioritized
• Interrupt Logic – Detects and synchronizes interrupt requests – Ignores interrupts that are disabled (masked off) – Ranks the pending interrupt requests – Creates the interrupt microsequence address – Provides select signals for the interrupt microsequence
Multi-core architectures
[Figure: a single computer with a single-core CPU]
Multi core architecture • Replicate multiple processor cores on a single die
Multi-core CPU chip • The cores fit on a single processor socket • Also called CMP (Chip Multi-Processor)
Why Multi-core • Difficult to make single-core clock frequencies even higher • Deeply pipelined circuits: – heat problems – speed-of-light problems – difficult design and verification – large design teams necessary – server farms need expensive air-conditioning
• Many new applications are multithreaded • General trend in computer architecture (shift towards more parallelism)
Instruction-level parallelism • Parallelism at the machine-instruction level • The processor can re-order, pipeline instructions, split them into microinstructions, do aggressive branch prediction, etc. • Instruction-level parallelism enabled rapid increases in processor speeds over the last 15 years
Thread-level parallelism (TLP) • This is parallelism on a coarser scale • A server can serve each client in a separate thread (Web server, database server) • A computer game can do AI, graphics, and physics in three separate threads • Single-core superscalar processors cannot fully exploit TLP • Multi-core architectures are the next step in processor evolution: explicitly exploiting TLP
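The coarse-grained nature of TLP can be sketched with two independent tasks in separate threads (the task names "ai" and "physics" are invented, echoing the game example); a multi-core CPU can schedule such threads onto different cores:

```python
import threading

# TLP sketch: two independent tasks run in separate threads. Each thread
# works on its own data, so the OS may place them on different cores.
results = {}

def task(name, n):
    results[name] = sum(range(n))  # independent work per thread

threads = [threading.Thread(target=task, args=("ai", 100)),
           threading.Thread(target=task, args=("physics", 50))]
for t in threads:
    t.start()
for t in threads:
    t.join()                       # wait for both tasks to finish
print(results["ai"], results["physics"])  # 4950 1225
```

Note that this parallelism is visible to the programmer (explicit threads), unlike instruction-level parallelism, which the hardware extracts invisibly.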
Multiprocessor memory types • Shared memory: In this model, there is one (large) common shared memory for all processors • Distributed memory: In this model, each processor has its own (small) local memory, and its content is not replicated anywhere else
A multi-core processor is a special kind of multiprocessor: all processors are on the same chip • Multi-core processors are MIMD: different cores execute different threads (Multiple Instructions), operating on different parts of memory (Multiple Data) • A multi-core chip is a shared-memory multiprocessor: all cores share the same memory
What applications benefit from multi-core? • Database servers • Web servers (Web commerce) • Compilers • Multimedia applications • Scientific applications, CAD/CAM • In general, applications with thread-level parallelism (as opposed to instruction-level parallelism)
Each can run on its own core
More examples • Editing a photo while recording a TV show through a digital video recorder • Downloading software while running an anti-virus program • “Anything that can be threaded today will map efficiently to multi-core” • BUT: some applications difficult to parallelize
A technique complementary to multi-core: Simultaneous multithreading
• Problem addressed: the processor pipeline can get stalled: – Waiting for the result of a long floating-point (or integer) operation – Waiting for data to arrive from memory
• While one thread is stalled, the other execution units wait unused
[Figure: superscalar pipeline with L1 D-cache/D-TLB, integer and floating-point units, schedulers, uop queues, rename/alloc, trace cache, uCode ROM, decoder, BTB/I-TLB, L2 cache and control, and bus. Source: Intel]
Simultaneous multithreading (SMT) • Permits multiple independent threads to execute SIMULTANEOUSLY on the SAME core
• Weaving together multiple “threads” on the same core • Example: if one thread is waiting for a floating point operation to complete, another thread can use the integer units
Without SMT, only a single thread can run at any given time
[Figure: Thread 1 (floating point) occupies the floating-point units while the integer units and the rest of the pipeline sit idle]
Without SMT, only a single thread can run at any given time
[Figure: Thread 2 (integer operation) occupies the integer units while the floating-point units sit idle]
SMT processor: both threads can run concurrently
[Figure: Thread 1 (floating point) and Thread 2 (integer operation) occupy different functional units of the same core at the same time]
But: threads can’t simultaneously use the same functional unit
[Figure: Thread 1 and Thread 2 both need the single integer unit at once: IMPOSSIBLE]
This scenario is impossible with SMT on a single core (assuming a single integer unit)
SMT not a “true” parallel processor • Enables better threading (e.g. up to 30% better throughput) • OS and applications perceive each simultaneous thread as a separate “virtual processor” • The chip has only a single copy of each resource • Compare to multi-core: each core has its own copy of resources
Multi-core: threads can run on separate cores
[Figure: two complete cores, each with its own pipeline and L1/L2 caches; Thread 1 runs on one core and Thread 2 on the other]
Multi-core: threads can run on separate cores
[Figure: two complete cores; Thread 3 runs on one core and Thread 4 on the other]
Combining Multi-core and SMT • Cores can be SMT-enabled (or not) • The different combinations: – Single-core, non-SMT: standard uniprocessor – Single-core, with SMT – Multi-core, non-SMT – Multi-core, with SMT: our fish machines
• The number of SMT threads: 2, 4, or sometimes 8 simultaneous threads • Intel calls them “hyper-threads”
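The OS's view of hyper-threads can be observed directly: it counts each SMT thread as a logical processor. A minimal sketch (the actual number printed depends on the machine it runs on):

```python
import os

# On an SMT (hyper-threaded) machine, the OS sees each hardware thread as
# a logical processor, so os.cpu_count() reports cores x threads-per-core
# (e.g. a dual-core chip with 2-way hyper-threading appears as 4 CPUs).
logical = os.cpu_count() or 1   # cpu_count() can return None on odd platforms
print("logical processors:", logical)
```

Distinguishing physical cores from hyper-threads requires platform-specific information (e.g. parsing /proc/cpuinfo on Linux); the standard library only exposes the logical count.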
SMT Dual-core: all four threads can run concurrently
[Figure: two SMT-enabled cores; Threads 1 and 3 share one core, Threads 2 and 4 share the other]
Comparison: multi-core vs SMT • Advantages/disadvantages?
Comparison: multi-core vs SMT • Multi-core: – Since there are several cores, each is smaller and not as powerful (but also easier to design and manufacture) – However, great with thread-level parallelism
• SMT – Can have one large and fast superscalar core – Great performance on a single thread – Mostly still only exploits instruction-level parallelism
The memory hierarchy • If simultaneous multithreading only: – all caches shared
• Multi-core chips: – L1 caches private – L2 caches private in some architectures and shared in others
• Memory is always shared
“Fish” machines
• Dual-core Intel Xeon processors
• Each core is hyper-threaded
• Private L1 caches
• Shared L2 cache
[Figure: CORE0 and CORE1, each with a private L1 cache, sharing a single L2 cache in front of main memory]
Designs with private L2 caches
[Figure, left: each core has private L1 and L2 caches in front of memory (both L1 and L2 are private). Examples: AMD Opteron, AMD Athlon, Intel Pentium D]
[Figure, right: each core has private L1, L2, and L3 caches (a design with L3 caches). Example: Intel Itanium 2]
Private vs shared caches? • Advantages/disadvantages?
Private vs shared caches • Advantages of private: – They are closer to the core, so access is faster – Reduces contention between cores
• Advantages of shared: – Threads on different cores can share the same cache data – More cache space available if a single (or a few) high-performance thread runs on the system
Windows Task Manager
[Figure: Windows Task Manager showing separate CPU usage graphs for core 1 and core 2]