Datorteknik (Computer Engineering), Tomas Nordström
Lecture 10: Acceleration Mechanisms


Lecture 10: Acceleration Mechanisms and Pipelining
• RISC
• Pipelining
• Superscalar, Superpipeline, and VLIW
• SIMD
• Threads and Multi-Threading

2

ARM Accelerations
• Memory
  • Cache; Harvard architecture; Thumb instructions; Memory pipeline
• Pipeline
  • Superscalar; out-of-order execution; register renaming; speculative execution; branch prediction
• Signal processing support
  • MAC support; DSP operations; SIMD and vector support; floating-point support
• Multiprocessor support
  • Accelerator Coherence Port; cache coherence (snooping)

[Figure: NVIDIA Tegra 2]

3

Acceleration Mechanisms
Two basic ways to make execution faster:
• Increase the rate of computation (the clock rate)
• Do several things at once (parallelism)

4

Clock Frequency
• In silicon, electrical signals travel at about 50% of the speed of light.
• At a clock frequency of 3 GHz (a 0.33 ns cycle), a signal travels s = 0.5 · 3·10^8 m/s · 0.33·10^-9 s ≈ 5 cm in one cycle.
• So if the distance is greater than that, different parts of the chip are effectively clocked on entirely different clock edges.
• "Clock skew" refers to different parts of a circuit being clocked at different times because the clock signal has paths of different length to reach them. This can happen when parallel wires have different lengths.

5

Parallelism?
• Microarchitectural techniques to exploit ILP (implicit, instruction-level parallelism):
  • Pipelining
  • Superscalar execution
  • Out-of-order execution (register renaming)
  • Branch prediction & speculative execution
• Explicit parallelism:
  • VLIW
  • SIMD
  • Multithreading

6

Parallelism through the Assembly-Line Principle
• The assembly-line principle (here: pipelining) lets us increase throughput.
• Assume it takes 5 steps to build a car. Instead of one person doing all the steps, several people do the steps simultaneously. At 1 day/step, one car takes 5 days; with an assembly line and 5 people working, a new car is finished every day!
• Applied to computer architecture:

1. Fetch -> 2. Decode -> 3. Fetch operands -> 4. Execute -> 5. Write-back

7

RISC
• John Cocke (IBM) analyzed which instructions a computer actually uses the most. Only about 10 instructions account for most of the use, regardless of the size of the instruction set. He designed the first RISC processor in 1974.
• A large share of the instructions are data-movement instructions (about 50%): load and store. Writing to memory all the time takes time!
• Few instructions => small processor area
• Less complexity => shorter development time
• Hardwired logic => higher performance: shorter delays, can be clocked faster
• Load-Store => operate only on data in registers, reducing writes to memory
• Every instruction the same size => ready for pipelining
• At best 1 instruction/clock cycle => high instruction throughput

8

RISC
RISC - Reduced Instruction Set Computer (alternatively Load-Store architecture, which is a more accurate name)
• Simplified instruction set
• Register-to-register arithmetic instructions
• Memory access only via load and store instructions
• Few (possibly only one) addressing modes
• Fixed instruction length
• Few, simple instruction formats
• Pipelining
• Optimizing compilers
• Simple circuit designs for high clock frequency and reduced design time, design cost, and chip area

(CISC, in contrast, has arithmetic instructions that also access memory, e.g. "ADD R1,R2,[MemAddr]".)

9

10

Instruction-Level Parallelism (ILP)

• In the following sequence of instructions, some instructions can execute at the same time:

1. ADD R2,R1,R0
2. ADD R3,R4,R5
3. MUL R6,R2,R3

• Instructions 1 & 2 can execute simultaneously (but 3 must wait for the results of 1 & 2).

How can we exploit this instruction-level parallelism?
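On a two-way superscalar machine (an illustration, not from the slides), instructions 1 and 2 could issue together in cycle 1 and the MUL in cycle 2, with the two ADD results forwarded to it, so the sequence issues in two cycles instead of three.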

11

How the Control Unit Works (cf. F2)
1. Fetch the instruction from memory
2. Decide what is to be done
3. Fetch data/operands from registers or memory
4. Compute/execute in the CPU/ALU
5. Write the result to a register or to memory

[Figure: block diagram with address bus, program memory, data memory, IR, instruction decoding, ALU, status flags, result register, and data bus]

12

Pipelining (the Assembly-Line Principle)
• By dividing execution into several stages, we can work on several instructions simultaneously, each in a different phase of its execution:

                 t1      t2      t3      t4      t5      -> Time
ADD R3,R5,#5     Fetch   Decode  Execute
SUB R1,R2,R4             Fetch   Decode  Execute
CMP R3,R4                        Fetch   Decode  Execute

13

ARM7 Block Diagram

14

Performance Metrics
• Latency: the time it takes for one instruction to pass through the pipeline
• Throughput: the number of instructions executed per unit of time
• Pipelining increases throughput without changing the latency
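In numbers (standard pipeline arithmetic, not from the slides): with k stages of cycle time t, N instructions finish after (k + N - 1)·t instead of N·k·t. For k = 5 and N = 100 that is 104 cycles instead of 500, a speedup of about 4.8, approaching k for large N, while the latency of each individual instruction remains k·t.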

15

Pipeline and more

[Sloss, Symes, Wright, "ARM System Developer's Guide", 2004]

16

Changes in the ARM Pipeline

17

18

How Long Should the Pipeline Be?
• In the late 1990s and early 2000s, microprocessors were marketed largely on clock frequency. This pushed designs toward very deep pipelines (20 to 31 stages on the Pentium 4) to maximize clock frequency, even though the benefits for overall performance were questionable.
• Power is proportional to clock frequency and also increases with the number of pipeline registers, so now that power consumption is so important, pipeline depths are decreasing.

19

ARM10 Block Diagram
• A separate stage for data access
• The stages are better balanced
• Harvard architecture
• But now data hazards appear

20

Hazards in a Naive Pipeline Implementation
• Structural hazards: can occur if the processor does not have enough resources to execute certain instructions in parallel (e.g. not enough adders to compute a relative branch target, PC+offset, and an arithmetic instruction at the same time).
• Data hazards: an instruction executed without regard to this hazard would use data before the result is available in the register.
• Control hazards: caused by unconditional and conditional branches.

21

Basic RISC Pipeline
• Basic five-stage pipeline in a RISC machine: IF = Instruction Fetch, ID = Instruction Decode, EX = Execute, MEM = Memory access, WB = Register write back.
• In the fourth clock cycle (the green column in the figure), the earliest instruction is in the MEM stage, and the latest instruction has not yet entered the pipeline.

[Wikipedia: Instruction pipeline]

22

Data Hazards: Data Dependencies
• A register is used before the result of the preceding instruction has been written back to the register file.
• Suppose we execute the following instructions:

SUB r10,r3,r4    ; Writes (r3 - r4) to r10
AND r11,r10,r3   ; Writes (r10 & r3) to r11

Pipeline stage   1     2     3     4     5     6
Fetch            SUB   AND   -     -     -     -
Decode           -     SUB   AND   -     -     -
Execute          -     -     SUB   AND   -     -
Access           -     -     -     SUB   AND   -
Write-Back       -     -     -     -     SUB   AND

• AND reads r10 in its Decode stage (cycle 3), but SUB does not write r10 until Write-Back (cycle 5): a data hazard.

23

Data Dependencies, Solution A: Bypass
• With a bypass we can take a shortcut with the data and feed the result of SUB directly to AND:

SUB r10,r3,r4    ; Writes (r3 - r4) to r10
AND r11,r10,r3   ; Writes (r10 & r3) to r11

Pipeline stage   1     2     3     4     5     6
Fetch            SUB   AND   -     -     -     -
Decode           -     SUB   AND   -     -     -
Execute          -     -     SUB   AND   -     -
Access           -     -     -     SUB   AND   -
Write-Back       -     -     -     -     SUB   AND

• The result of SUB is forwarded from its Execute stage (cycle 3) straight into AND's Execute stage (cycle 4), so no stall is needed.

24

Data Dependencies, Solution A: Bypass
• Schematic block diagram

[Wikipedia: Classic_RISC_pipeline]
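To make the selection logic concrete, here is a minimal C sketch of a forwarding unit (the struct layout and field names are my own illustration, not from the slides or any real design):

    /* Forwarding sketch for a classic 5-stage pipeline. */
    typedef struct { int writes_reg; int rd; int value; } StageReg;

    /* Choose the freshest value for source register `rs` as it enters Execute.
       The EX/MEM latch holds the younger result, so it is checked first. */
    int read_operand(int rs, const int regfile[16], StageReg ex_mem, StageReg mem_wb) {
        if (ex_mem.writes_reg && ex_mem.rd == rs) return ex_mem.value; /* bypass from EX/MEM */
        if (mem_wb.writes_reg && mem_wb.rd == rs) return mem_wb.value; /* bypass from MEM/WB */
        return regfile[rs];                                           /* no hazard: register file */
    }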

25

Data Dependencies, Solution B: Stall
• By stalling the pipeline, and thereby creating a bubble, we can postpone the execution of AND until SUB has had time to write back its result.
• For our example, bypass is much better!

SUB r10,r3,r4    ; Writes (r3 - r4) to r10
AND r11,r10,r3   ; Writes (r10 & r3) to r11

Pipeline stage   1     2     3      4       5       6
Fetch            SUB   AND   -      -       -       -
Decode           -     SUB   AND    AND     AND     -
Execute          -     -     SUB    bubble  bubble  AND
Access           -     -     -      SUB     -       -
Write-Back       -     -     -      -       SUB     -

• AND is held in Decode until cycle 5, when SUB writes r10 back (assuming the register file is written in the first half of a cycle and read in the second), and executes in cycle 6.

26

Data Dependencies: a Stall Is Unavoidable for LDR
• If the instructions instead are:

LDR r10,=Addr    ; Read from Addr to r10
AND r11,r10,r3   ; Writes (r10 & r3) to r11

• then memory (the cache) must be read before r10 is available!
• We must then stall the pipeline:

Pipeline stage   1     2     3      4       5     6
Fetch            LDR   AND   -      -       -     -
Decode           -     LDR   AND    AND     -     -
Execute          -     -     LDR    bubble  AND   -
Access           -     -     -      LDR     -     AND
Write-Back       -     -     -      -       LDR   -

• r10 becomes available only after LDR's memory access in cycle 4, so even with a bypass, AND cannot execute before cycle 5: one bubble is unavoidable.

27

Data Dependencies, Solution B: Stall
• Schematic block diagram

LDR r10,=Addr    ; Read from Addr to r10
AND r11,r10,r3   ; Writes (r10 & r3) to r11

This cannot be solved by a bypass alone! We must create a stall/bubble.

[Wikipedia: Classic_RISC_pipeline]
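A load-use hazard like this is what the pipeline's interlock logic detects. As a minimal C sketch (struct layout and field names are my own, purely illustrative):

    /* Stall (insert a bubble) when the instruction in the EX stage is a load
       whose destination register is a source operand of the instruction in ID. */
    typedef struct { int is_load; int rd; int rs1; int rs2; } Instr;

    int must_stall(Instr in_ex, Instr in_id) {
        return in_ex.is_load &&
               (in_ex.rd == in_id.rs1 || in_ex.rd == in_id.rs2);
    }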

28

Data Hazards
• Read-after-write (RAW): a read from a register or memory location must return the value placed there by the last write in program order, not some other write. This is referred to as a true dependency or flow dependency, and requires the instructions to execute in program order.
• Write-after-write (WAW): successive writes to a particular register or memory location must leave that location containing the result of the second write. This can be resolved by squashing (synonyms: cancelling, annulling, mooting) the first write if necessary. WAW dependencies are also known as output dependencies.
• Write-after-read (WAR): a read from a register or memory location must return the last prior value written to that location, and not one written programmatically after the read. This is the sort of false dependency that can be resolved by renaming. WAR dependencies are also known as anti-dependencies.

29
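All three kinds can be seen in a few lines of straight-line code (a C sketch; the variable names are mine):

    int a, b, c;

    void deps(int x) {
        a = x + 1;   /* (1) writes a                                    */
        b = a * 2;   /* (2) RAW on (1): reads the a written by (1)      */
        a = x - 3;   /* (3) WAR on (2) and WAW on (1): both false
                        dependencies that register renaming can remove  */
        c = a;       /* (4) RAW on (3)                                  */
    }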

Control Hazards: Branch Dependencies
• In a normal RISC processor a lot must happen on every branch (both unconditional and conditional), and it takes at least two clock cycles before the next instruction can start to be fetched.

30

Branch Dependencies
• Example code:

         BNE Out
         SUB r10,r3,r4   ; Writes (r3 - r4) to r10
         B   Loop
         ...
    Out  AND r11,r10,r3  ; Writes (r10 & r3) to r11

Pipeline stage   1     2     3     4     5     6
Fetch            BNE   SUB   B     AND   -     -
Decode           -     BNE   SUB   -     AND   -
Execute          -     -     BNE   -     -     AND
Access           -     -     -     BNE   -     -
Write-Back       -     -     -     -     BNE   -

• The branch is resolved in the Execute stage (cycle 3); if it is taken, the speculatively fetched SUB and B are flushed and AND is fetched in cycle 4, wasting two cycles.

31

Branch Dependencies: Solutions

There are four schemes to solve this performance problem with branches:
• Predict Not Taken: Always fetch the instruction after the branch from the instruction cache, but only execute it if the branch is not taken. If the branch is not taken, the pipeline stays full. If the branch is taken, the instruction is flushed (marked as if it were a NOP), and we lose 1 cycle's opportunity to finish an instruction.
• Branch Likely: Always fetch the instruction after the branch from the instruction cache, but only execute it if the branch was taken. The compiler can always fill the branch delay slot on such a branch, and since branches are more often taken than not, such branches have a smaller IPC penalty than the previous kind.
• Branch Delay Slot: Always fetch the instruction after the branch from the instruction cache, and always execute it, even if the branch is taken. Instead of taking an IPC penalty for some fraction of branches either taken (perhaps 60%) or not taken (perhaps 40%), branch delay slots take an IPC penalty only for those branches into which the compiler could not schedule a useful instruction. The SPARC, MIPS, and MC88K designers designed a branch delay slot into their ISAs.
• Branch Prediction: In parallel with fetching each instruction, guess if the instruction is a branch or jump, and if so, guess the target. On the cycle after a branch or jump, fetch the instruction at the guessed target. When the guess is wrong, flush the incorrectly fetched target.

[Wikipedia: Classic_RISC_pipeline]
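As a concrete illustration of the last scheme, here is a minimal 2-bit saturating-counter predictor in C (the table size and PC hashing are my own choices, not from any particular processor):

    #include <stdint.h>

    #define TABLE_SIZE 1024
    static uint8_t counters[TABLE_SIZE];  /* 0,1 = predict not taken; 2,3 = predict taken */

    /* Predict: look up the counter indexed by this branch's address. */
    int predict_taken(uint32_t pc) {
        return counters[(pc >> 2) % TABLE_SIZE] >= 2;
    }

    /* Train after the branch resolves: saturate toward the actual outcome. */
    void train(uint32_t pc, int taken) {
        uint8_t *c = &counters[(pc >> 2) % TABLE_SIZE];
        if (taken && *c < 3) (*c)++;
        else if (!taken && *c > 0) (*c)--;
    }

Two bits give hysteresis: a loop branch that is almost always taken survives its single not-taken exit without flipping the prediction.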

32

More on Pipelining
What should happen on an interrupt?
• Which instruction is executing / should be interrupted?
• How do we handle instructions that take several clock cycles to complete (e.g. PUSH/POP of multiple registers)?

33

What should happen on a cache miss?
• If a memory access cannot complete in one clock cycle (i.e. the data is not in a cache), what do we do with our pipeline?
• We must be able to suspend the LDx instruction and all subsequent instructions that depend on it, and then, once the cache has been filled, resume the LDx instruction.
• Two solutions:
  • A global stall
  • Take an exception to handle the cache miss

34

Superscalar
• A logical evolution of pipeline designs.
• Most executed operations are on scalar quantities, and the common instructions (arithmetic, load/store, conditional branch) can be initiated and executed independently.
• A superscalar CPU has more than one pipelined functional unit (e.g. ALU), and these can operate in parallel.

A five-stage pipelined superscalar processor, capable of issuing two instructions per cycle. It can have two instructions in each stage of the pipeline, for a total of up to 10 instructions (shown in green) being simultaneously executed.

[Wikipedia: Superscalar]

35

Superpipeline
• Results from the observation that a large number of pipeline operations do not require a full clock cycle to complete.
• Dividing the clock cycle into smaller subcycles and subdividing the "macro" pipeline stages into smaller (and faster) substages means that, although the time to complete an individual instruction does not change, the perceived throughput increases.

[Wikipedia: Superscalar]
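As a rough illustration (standard reasoning, not from the slides): if each of the five stages is split into two substages, the clock can ideally run about twice as fast. Each instruction still takes the same total time, but results complete at twice the rate, at the price of more pipeline registers and a longer refill after a mispredicted branch.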

36

Superscalar
• The performance improvement available from superscalar techniques is limited by three key factors:
  • The degree of intrinsic parallelism in the instruction stream, i.e. the limited amount of instruction-level parallelism.
  • The complexity and time cost of the dispatcher and the associated dependency-checking logic.
  • The processing of branch instructions.

[Wikipedia: Superscalar]

37

Superscalar
• IPC: instructions (completed) per cycle.
• Maximum number of parallel instructions (issue rate): typically 4-8.
• IPC is limited by dependencies:

  • Data dependency (flow dependence); here 2 has a data dependency on 1:
    1. A = 3
    2. B = A
    3. C = B

  • Anti-dependence; here 3 has an anti-dependence on 2:
    1. B = 3
    2. A = B + 1
    3. B = 7

  • Output dependence; here the result depends on the order of 1 and 3:
    1. A = 2 * X
    2. B = A / 3
    3. A = 9 * Y

• Several ways exist to reduce or eliminate dependencies, e.g. register renaming.

38

Instruction Issue / Completion Policy
• With superscalar architectures we have the potential to begin issuing and/or completing (retiring) instructions either in or out of order.
• To understand why these are attractive options, consider the following code:

    add r1, r3, r5
    and r4, r3, #0x7f
    sub r6, r12, r6
    ldr r9, =Fred

• There are no data dependencies, but if we only have two ALUs then we shall have to stall the pipeline when we fetch the sub instruction.
• Issuing instructions out of order would mean that the CPU could fetch and begin work on the load instruction, which does not involve the ALU.

39

Out-of-order Execution
• There are four possibilities:
  • In-order Issue, In-order Completion
  • In-order Issue, Out-of-order Completion
  • Out-of-order Issue, In-order Completion
  • Out-of-order Issue, Out-of-order Completion

• Doing everything in order is the simplest approach, but the slowest: we may need to stall the pipeline.
• As soon as we either issue or retire instructions out of order, the CPU takes on considerable book-keeping overhead in order to ensure correctness.

40

Interrupts and O-o-O Exec.
• If out-of-order completion is allowed, what PC value do we save at interrupt time, in order to ensure that:
  • instructions are not repeated on restarting the program, and
  • instructions are not missed on restarting the program?

41

Register Renaming
• The register-renaming scheme facilitates out-of-order execution in Write-after-Write (WAW) and Write-after-Read (WAR) situations for the general-purpose registers and the flag bits of the Current Program Status Register (CPSR).
• The scheme maps the 32 ARM architectural registers to a pool of 56 physical 32-bit registers, and renames the flags (N, Z, C, V, Q, and GE) of the CPSR using a dedicated pool of eight physical 9-bit registers.

Without renaming, instruction 4 cannot execute before 3 is done:
1. R1 = M[1024]
2. R1 = R1 + 2
3. M[1032] = R1
4. R1 = M[2048]
5. R1 = R1 + 4
6. M[2056] = R1

With renaming (R1 renamed to V1 in instructions 4-6), instructions 4-6 can execute in parallel with 1-3:
1. R1 = M[1024]
2. R1 = R1 + 2
3. M[1032] = R1
4. V1 = M[2048]
5. V1 = V1 + 4
6. M[2056] = V1
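A minimal C model of the mapping (the data structure is my own illustration; real hardware is far more involved, tracking in-flight readers and reclaiming physical registers at retirement):

    #define ARCH_REGS 32   /* architectural registers, as on ARM */
    #define PHYS_REGS 56   /* physical pool, per the slide above */

    static int rename_map[ARCH_REGS];  /* architectural -> physical     */
    static int free_list[PHYS_REGS];   /* physical registers not in use */
    static int free_top;

    void rename_init(void) {
        /* Identity-map the first 32 physical registers, free the rest. */
        for (int r = 0; r < ARCH_REGS; r++) rename_map[r] = r;
        free_top = 0;
        for (int p = ARCH_REGS; p < PHYS_REGS; p++) free_list[free_top++] = p;
    }

    /* A source operand reads whatever physical register currently holds it. */
    int rename_src(int rs) { return rename_map[rs]; }

    /* A destination gets a fresh physical register, so a later write to the
       same architectural register no longer conflicts (removes WAR/WAW).
       This sketch does not handle free-list exhaustion. */
    int rename_dst(int rd) {
        int phys = free_list[--free_top];
        rename_map[rd] = phys;
        return phys;
    }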

42

VLIW - Very Long Instruction Word Processor
• A VLIW processor uses long instructions.
• The idea is to merge several instructions into one. The processor can then fetch several operations at once and therefore work more efficiently. This places higher demands on the compiler, which must send the instructions to the processor in the right order.
• VLIW was used in Intel's Itanium processor family, where the technique is called EPIC (Explicitly Parallel Instruction Computing).
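Schematically (a hypothetical three-slot bundle, not a real ISA): the two independent ADDs from the ILP example earlier would be packed by the compiler into one long instruction word, and the dependent MUL into the next one, so the hardware needs no run-time dependency checking.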

43

VLIW: a "RISC way" of finding parallelism

Superscalar                                    VLIW
Issues several instructions at once            Starts several operations at once
(typically two to eight) from a                via one very long instruction
sequential program
Instructions are scheduled at run time         Instructions are scheduled by the compiler
Limited "window" of instructions that          The "window" is the whole program;
can be scheduled together                      can find more parallelism

44

The Latest Generations of ARM

45

ARM with Short Instructions
• Thumb
  • To improve compiled code density, processors since the ARM7TDMI have featured the Thumb instruction set state. When in this state, the processor executes the Thumb instruction set, a compact 16-bit encoding for a subset of the ARM instruction set. Most of the Thumb instructions are directly mapped to normal ARM instructions. The space saving comes from making some of the instruction operands implicit and limiting the number of possibilities compared to the ARM instructions executed in the ARM instruction set state.
• Thumb-2
  • Thumb-2 technology made its debut in the ARM1156 core, announced in 2003. Thumb-2 extends the limited 16-bit instruction set of Thumb with additional 32-bit instructions to give the instruction set more breadth, thus producing a variable-length instruction set. A stated aim for Thumb-2 is to achieve code density similar to Thumb with performance similar to the ARM instruction set on 32-bit memory.

46

ARM Extension for Floating-Point Performance
• VFP (Vector Floating Point)
  • VFP technology is a coprocessor extension to the ARM architecture. It provides low-cost single-precision and double-precision floating-point computation fully compliant with the ANSI/IEEE Std 754-1985 standard. VFP provides floating-point computation suitable for a wide spectrum of applications such as PDAs, smartphones, voice compression and decompression, three-dimensional graphics and digital audio, printers, set-top boxes, and automotive applications. The VFP architecture also supports execution of short vector instructions, but these operate on each vector element sequentially and thus do not offer the performance of true SIMD (Single Instruction Multiple Data) parallelism.

47

ARM Extension for Media Operations
• Advanced SIMD (NEON)
  • The Advanced SIMD extension, marketed as NEON technology, is a combined 64- and 128-bit SIMD instruction set that provides standardized acceleration for media and signal-processing applications.
  • It features a comprehensive instruction set, separate register files, and independent execution hardware. NEON supports 8-, 16-, 32-, and 64-bit integer and single-precision (32-bit) floating-point data, and performs SIMD operations for audio and video processing as well as graphics and gaming. NEON supports up to 16 operations at the same time. The NEON hardware shares the same floating-point registers as used in VFP.

48

Flynn's Taxonomy

                 Single Instruction   Multiple Instruction
Single Data      SISD                 MISD
Multiple Data    SIMD                 MIMD

49

SIMD
• SIMD is well suited for algorithms with a lot of parallel data
  • Examples: FIR filters, FFT, dot product, image processing, video processing
• The idea is to load multiple data items and perform the same operation across all the data at once
  • Examples: x86 MMX/SSE, ARM NEON
• To program for SIMD one needs to think "data parallelism" and vectorize the code
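For example, a minimal sketch with ARM NEON intrinsics (it assumes a NEON-capable toolchain and that n is a multiple of 4; a real routine would handle the tail elements):

    #include <arm_neon.h>
    #include <stdint.h>

    /* dst[i] = a[i] + b[i], four 32-bit lanes per iteration. */
    void add_arrays(int32_t *dst, const int32_t *a, const int32_t *b, int n) {
        for (int i = 0; i < n; i += 4) {
            int32x4_t va = vld1q_s32(a + i);        /* load 4 elements   */
            int32x4_t vb = vld1q_s32(b + i);
            vst1q_s32(dst + i, vaddq_s32(va, vb));  /* 4 adds in one op  */
        }
    }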

50

ARM SIMD Instructions

[Sloss, Symes, Wright, "ARM System Developer's Guide", 2004]

51

ARM Cortex-A9

[http://www.arm.com/products/processors/cortex-a/cortex-a9.php]

52

Multiprocessor ARM (covered in another course: EPC)

[http://www.arm.com/products/processors/cortex-a/cortex-a9.php]

53

ARM Accelerations
Today's systems have many processors on one chip.
• Memory
  • Cache; Harvard architecture; Thumb instructions; Memory pipeline
• Pipeline
  • Superscalar; out-of-order execution; register renaming; speculative execution; branch prediction
• Signal processing support
  • MAC support; DSP operations; SIMD and vector support; floating-point support
• Multiprocessor support
  • Accelerator Coherence Port; cache coherence (snooping)

[Figure: NVIDIA Tegra 2]

54

Threads
• In computer science, a thread of execution is the smallest unit of processing that can be scheduled by an operating system.
• Multiple threads can exist within the same process and share resources such as memory, while different processes do not share these resources.
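As a small illustration of threads sharing memory within one process, a POSIX-threads sketch in C (the counter example is my own):

    #include <pthread.h>
    #include <stdio.h>

    /* Both threads see the same globals: threads share the process's memory. */
    static int counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);   /* shared data needs synchronization */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("counter = %d\n", counter);  /* prints 200000 */
        return 0;
    }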

55

Multithreading
• On a single processor, multithreading generally occurs by time-division multiplexing (as in multitasking): the processor switches between different threads.
• On a multiprocessor (including multi-core systems), the threads or tasks actually run at the same time, with each processor or core running a particular thread or task.

56

Multithreading Computers
• Multithreading computers have hardware support to efficiently execute multiple threads.
  • These are distinguished from multiprocessing systems (such as multi-core systems) in that the threads have to share the resources of a single core: the computing units, the CPU caches, and the translation lookaside buffer (TLB).
  • Where multiprocessing systems include multiple complete processing units, multithreading aims to increase utilization of a single core by using thread-level as well as instruction-level parallelism. As the two techniques are complementary, they are sometimes combined in systems with multiple multithreading CPUs and in CPUs with multiple multithreading cores.
• Simultaneous multithreading (SMT) is a technique for improving the overall efficiency of superscalar CPUs with hardware multithreading. SMT permits multiple independent threads of execution to better utilize the resources provided by modern processor architectures. (Intel calls this hyper-threading.)

57

Questions
• Study-support questions:
  • What kinds of parallelism are used in a modern processor?
  • What "hazards" exist when using the various forms of parallelism?
  • Explain the difference between the instruction-level parallelism used in pipelined, superscalar, superpipelined, and VLIW architectures.
  • Explain what threads are, and how computer architectures can support them and thereby improve performance.

58

Links
• Microprocessor Design
• Reduced instruction set computing
• Pipeline
• "Techniques to Improve Performance Beyond Pipelining: Superpipelining, Superscalar, and VLIW", Gaudiot, Jung-Yup Kang, and Won Woo Ro, 2005
• Superscalar
• VLIW (Very long instruction word)
• Vector processor
• SIMD
• Register renaming
• Instruction-level parallelism
• Data dependency
• Threads
• Simultaneous multithreading / Hyper-threading
• Multithreading

59

Today's Lecture: Acceleration Mechanisms
• Instruction-level parallelism
  • Pipelining
  • Superpipelining
  • VLIW
• Hazards with pipelining
  • Data dependencies
  • Branch dependencies
  • (Resource shortages)
• Some solution concepts
  • Bypass ("take a shortcut with the data")
  • Stall
  • Branch prediction
  • Out-of-order execution
  • Register renaming
  • Multithreading

60