Embedded Parallel Computing
Lecture 2 - Parallelism in Microprocessors
Tomas Nordström
Course webpage:
Course responsible and examiner: Tomas Nordström, [email protected]; Room E313; Tel: +46 35 16 7334
1
Outline
Parallelism in microprocessors
• RISC and pipelining
• Superscalar, Superpipeline, and VLIW
• SIMD
• Threads and Multi-Threading
2
Pipeline
• Basic five-stage pipeline in a RISC machine: IF = Instruction Fetch, ID = Instruction Decode, EX = Execute, MEM = Memory access, WB = Register write back
• In the fourth clock cycle (the green column), the earliest instruction is in the MEM stage, and the latest instruction has not yet entered the pipeline.
[Wikipedia: Instruction pipeline] 6
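The overlap described on the slide can be sketched as a small simulation (pure Python; the instruction names are made up for illustration, and no hazards are modelled):

```python
# Sketch of an ideal five-stage RISC pipeline (IF, ID, EX, MEM, WB).
# Each cycle, every in-flight instruction advances one stage while a
# new instruction enters IF -- this only shows how execution overlaps.

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def pipeline_diagram(instructions):
    """Return {instruction: {cycle: stage}} for an ideal pipeline."""
    diagram = {}
    for i, instr in enumerate(instructions):
        # Instruction i occupies stage s during cycle i + s (0-based).
        diagram[instr] = {i + s: STAGES[s] for s in range(len(STAGES))}
    return diagram

diag = pipeline_diagram(["i1", "i2", "i3", "i4", "i5"])
# In cycle 3 (the fourth clock cycle) the earliest instruction, i1,
# is in MEM, while i5 has not yet entered the pipeline.
print(diag["i1"][3])    # MEM
print(3 in diag["i5"])  # False -- i5 enters IF only in cycle 4
```

With five instructions and five stages, the machine reaches one instruction completing per cycle from cycle 4 onward, which is the throughput gain pipelining buys.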
Pipeline and more [Sloss, Symes, Wright, "ARM System Developer's Guide", 2004]
8
• In the late 1990s and early 2000s, microprocessors were marketed largely based on clock frequency. This pushed microprocessors to use very deep pipelines (20 to 31 stages on the Pentium 4) to maximize the clock frequency, even if the benefits for overall performance were questionable.
• Power is proportional to clock frequency and also increases with the number of pipeline registers, so now that power consumption is so important, pipeline depths are decreasing.
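The frequency/power link quoted above follows from the standard first-order model of dynamic CMOS power, P ≈ a·C·V²·f. A quick check of the proportionality (the parameter values below are arbitrary, chosen only for illustration):

```python
# First-order dynamic power model: P = a * C * V^2 * f, where a is the
# switching activity factor, C the switched capacitance, V the supply
# voltage, and f the clock frequency. Deeper pipelines also add pipeline
# registers, which increases C on top of the higher f.

def dynamic_power(a, C, V, f):
    """Dynamic switching power in watts (first-order model)."""
    return a * C * V**2 * f

low = dynamic_power(a=0.5, C=1e-9, V=1.0, f=1e9)   # 1 GHz
high = dynamic_power(a=0.5, C=1e-9, V=1.0, f=2e9)  # 2 GHz
print(high / low)  # 2.0 -- doubling f alone doubles dynamic power
```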
9
• The ARM7 and earlier implementations have a three-stage pipeline, the stages being fetch, decode, and execute.
• Higher-performance designs have deeper pipelines: the ARM9 has five stages, and the Cortex-A8 has thirteen.
10
Hazards In Pipelines
• Control Hazards. They arise from the pipelining of branches and other instructions that change the PC.
• Structural Hazards. They arise from resource conflicts when the hardware cannot support all possible combinations of instructions in simultaneous overlapped execution.
• Data Hazards. They arise when an instruction depends on the result of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline.
add  r1, r3, r5
sub  r4, r6, r9
inc  r12
addc r8, r7, r10

add r1, r4, r5
sub r2, r5, r6
[Wikipedia: Data Hazard] 12
Data Hazards
Read-after-write (RAW)
• A read from a register or memory location must return the value placed there by the last write in program order, not some other write. This is referred to as a true dependency or flow dependency, and requires the instructions to execute in program order.
Write-after-write (WAW)
• Successive writes to a particular register or memory location must leave that location containing the result of the second write. This can be resolved by squashing (synonyms: cancelling, annulling, mooting) the first write if necessary. WAW dependencies are also known as output dependencies.
Write-after-read (WAR)
• A read from a register or memory location must return the last prior value written to that location, and not one written programmatically after the read. This is the sort of false dependency that can be resolved by renaming. WAR dependencies are also known as antidependencies.
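All three hazard classes can be detected mechanically from the read and write sets of an instruction pair. A minimal sketch (pure Python; the tuple encoding of instructions and the register names are assumptions for illustration):

```python
# Classify the dependencies between two instructions given their
# read/write register sets, with `second` following `first` in
# program order.

def classify_hazards(first, second):
    """first/second: (reads, writes) tuples of register-name sets."""
    reads1, writes1 = first
    reads2, writes2 = second
    hazards = set()
    if writes1 & reads2:
        hazards.add("RAW")  # true/flow dependency: must run in order
    if writes1 & writes2:
        hazards.add("WAW")  # output dependency: squash the first write
    if reads1 & writes2:
        hazards.add("WAR")  # antidependency: removable by renaming
    return hazards

# e.g. add r1, r3, r5 (writes r1) followed by sub r4, r1, r9 (reads r1):
print(classify_hazards(({"r3", "r5"}, {"r1"}),
                       ({"r1", "r9"}, {"r4"})))  # {'RAW'}
```

Note that only RAW forces in-order execution; the code tags WAR and WAW too, which the register-renaming slide later in the lecture shows how to eliminate.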
13
Superscalar
• Logical evolution of pipeline designs
• Most executed operations are on scalar quantities. Common instructions (arithmetic, load/store, conditional branch) can be initiated and executed independently
• The superscalar CPU has more than one pipelined functional unit (e.g. ALU) which can operate in parallel
A five-stage pipelined superscalar processor, capable of issuing two instructions per cycle. It can have two instructions in each stage of the pipeline, for a total of up to 10 instructions (shown in green) being simultaneously executed.
[Wikipedia: Superscalar] 14
Superpipeline
• Results from the observation that a large number of pipeline operations do not require a full clock cycle to complete.
• Dividing the clock cycle into smaller subcycles and subdividing the "macro" pipeline stages into smaller (and faster) substages means that, although the time to complete an individual instruction does not change, the perceived throughput increases.
[Wikipedia: Superscalar] 15
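The latency-versus-throughput trade-off in the superpipeline slide can be checked with a toy model (pure Python; the cycle times below are arbitrary example values):

```python
# Toy superpipeline model: splitting each of 5 stages into 2 substages
# halves the cycle time. The latency of one instruction (stages * cycle
# time) is unchanged, but steady-state throughput (one instruction per
# cycle) doubles.

base_cycle_ns, base_stages = 2.0, 5      # baseline pipeline
sp_cycle_ns, sp_stages = 1.0, 10         # superpipelined version

base_latency = base_stages * base_cycle_ns   # 10 ns per instruction
sp_latency = sp_stages * sp_cycle_ns         # still 10 ns

base_throughput = 1 / base_cycle_ns          # 0.5 instructions/ns
sp_throughput = 1 / sp_cycle_ns              # 1.0 instructions/ns

print(base_latency == sp_latency)            # True
print(sp_throughput / base_throughput)       # 2.0
```

In practice the speedup is less than 2x: substage latches add delay and deeper pipelines pay more on mispredicted branches, which connects back to the hazards discussed earlier.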
Instruction Issue/Completion Policy
• With superscalar architectures we have the potential for issuing and/or completing (retiring) instructions either in or out of order
• To understand why these are attractive options, consider the following code:

add  r1, r3, r5
and  r4, 0x7f, r3
sub  r6, r12, r6
load Fred,,r9
• There are no data dependencies, but if we only have two ALUs then we shall have to stall the pipeline when we fetch the sub instruction.
• Issuing instructions out of order would mean that the CPU could fetch and begin work on the load instruction, which would not involve the ALU.
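A toy model of that scheduling decision (pure Python; the two-ALU limit and the functional-unit tags on each instruction are assumptions for illustration):

```python
# Fill one issue cycle greedily: with only two ALUs, the third ALU op
# (sub) cannot issue this cycle, but an out-of-order issue policy can
# pull the later load forward because it uses the memory unit instead.

instrs = [("add", "alu"), ("and", "alu"), ("sub", "alu"), ("load", "mem")]

def issue_out_of_order(instructions, alus=2):
    """Return the instructions issued in one cycle, skipping past stalls."""
    cycle, alu_used = [], 0
    for name, unit in instructions:
        if unit == "alu" and alu_used < alus:
            cycle.append(name)
            alu_used += 1
        elif unit == "mem":
            cycle.append(name)  # load/store unit is still free
    return cycle

print(issue_out_of_order(instrs))  # ['add', 'and', 'load'] -- sub waits
```

An in-order policy would have stopped at sub and left the load unissued; the out-of-order policy keeps the memory unit busy in the same cycle.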
18
Out-of-order Execution
• There are four possibilities:
‣ In-order Issue, In-order Completion
‣ In-order Issue, Out-of-order Completion
‣ Out-of-order Issue, In-order Completion
‣ Out-of-order Issue, Out-of-order Completion
• Doing everything in order is the simplest approach, but the slowest: we may need to stall the pipeline.
• As soon as we either issue or retire instructions out of order, the CPU incurs considerable bookkeeping overhead in order to ensure correctness. 19
Interrupts and O-o-O Exec.
• If out-of-order completion is allowed, what PC value do we save at interrupt time, in order to ensure that:
‣ Instructions are not repeated on restarting the program
‣ Instructions are not missed on restarting the program
20
Register Renaming
‣ The register renaming scheme facilitates out-of-order execution in Write-after-Write (WAW) and Write-after-Read (WAR) situations for the general purpose registers and the flag bits of the Current Program Status Register (CPSR).
‣ The scheme maps the 32 ARM architectural registers to a pool of 56 physical 32-bit registers, and renames the flags (N, Z, C, V, Q, and GE) of the CPSR using a dedicated pool of eight physical 9-bit registers.

1. R1=M[1024]
2. R1=R1+2
3. M[1032]=R1
4. R1=M[2048]
5. R1=R1+4
6. M[2056]=R1

Here instruction 4 cannot execute before 3 is done:
1. R1=M[1024]
2. R1=R1+2
3. M[1032]=R1

With renamed registers, instructions 4-6 can execute in parallel with 1-3:
4. V1=M[2048]
5. V1=V1+4
6. M[2056]=V1
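The renaming step itself can be sketched in a few lines (pure Python; the instruction encoding and physical-register naming are illustrative, not the Cortex-A9 scheme):

```python
# Register renaming sketch: every write to an architectural register
# allocates a fresh physical register, so the two chains that both
# reuse R1 end up using disjoint physical registers and can run in
# parallel. Stores (dest=None) only read the current mapping.

def rename(program):
    """program: list of (dest, srcs) in architectural names.
    Returns the same program rewritten with physical register names."""
    mapping, next_phys, out = {}, 0, []
    for dest, srcs in program:
        phys_srcs = [mapping.get(s, s) for s in srcs]  # read current names
        if dest is not None:
            mapping[dest] = f"p{next_phys}"            # fresh reg per write
            next_phys += 1
            out.append((mapping[dest], phys_srcs))
        else:                                          # store: reads only
            out.append((None, phys_srcs))
    return out

# The slide's example: two chains, both using R1.
# 1. R1=M[1024]  2. R1=R1+2  3. M[1032]=R1
# 4. R1=M[2048]  5. R1=R1+4  6. M[2056]=R1
prog = [("R1", []), ("R1", ["R1"]), (None, ["R1"]),
        ("R1", []), ("R1", ["R1"]), (None, ["R1"])]
renamed = rename(prog)
print(renamed)
# Instructions 1-3 now use p0/p1 and 4-6 use p2/p3: no shared registers,
# so the WAR and WAW hazards on R1 are gone.
```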
21
Recent Generations of ARM
29
• Thumb
To improve compiled code-density, processors since the ARM7TDMI have featured the Thumb instruction set state. When in this state, the processor executes the Thumb instruction set, a compact 16-bit encoding for a subset of the ARM instruction set. Most of the Thumb instructions are directly mapped to normal ARM instructions. The space-saving comes from making some of the instruction operands implicit and limiting the number of possibilities compared to the ARM instructions executed in the ARM instruction set state.
• Thumb-2
Thumb-2 technology made its debut in the ARM1156 core, announced in 2003. Thumb-2 extends the limited 16-bit instruction set of Thumb with additional 32-bit instructions to give the instruction set more breadth, thus producing a variable-length instruction set. A stated aim for Thumb-2 is to achieve code density similar to Thumb with performance similar to the ARM instruction set on 32-bit memory.
30
• VFP (Vector Floating Point)
VFP technology is a coprocessor extension to the ARM architecture. It provides low-cost single-precision and double-precision floating-point computation fully compliant with ANSI/IEEE Std 754-1985. VFP provides floating-point computation suitable for a wide spectrum of applications such as PDAs, smartphones, voice compression and decompression, three-dimensional graphics and digital audio, printers, set-top boxes, and automotive applications. The VFP architecture also supports execution of short vector instructions, but these operate on each vector element sequentially and thus do not offer the performance of true SIMD (Single Instruction Multiple Data) parallelism.
31
• Advanced SIMD (NEON)
The Advanced SIMD extension, marketed as NEON technology, is a combined 64- and 128-bit SIMD instruction set that provides standardized acceleration for media and signal processing applications.
It features a comprehensive instruction set, separate register files and independent execution hardware. NEON supports 8-, 16-, 32- and 64-bit integer and single-precision (32-bit) floating-point data and operates in SIMD operations for handling audio and video processing as well as graphics and gaming processing. In NEON, the SIMD supports up to 16 operations at the same time. The NEON hardware shares the same floating-point registers as used in VFP.
32
SIMD
• SIMD is well suited for algorithms with a lot of parallel data
‣ Examples: FIR, FFT, dot product, image processing, video processing
• The idea is to load multiple data items and perform the same operation across all the data at once
‣ Examples: x86 MMX/SSE, ARM NEON
• To program for SIMD one needs to think "data parallelism" and vectorize the code
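The lane-wise idea can be sketched in plain Python (the lane count, the 8-bit wrap-around, and the function names are illustrative; real SIMD hardware performs all lanes in a single instruction rather than a loop):

```python
# SIMD sketch: one "instruction" applies the same operation to every
# lane of a packed vector. The modulo mimics a packed unsigned 8-bit
# add, where each lane wraps independently instead of carrying into
# its neighbour.

def simd_add_u8(a, b):
    """Lane-wise add of two 4-lane vectors of unsigned bytes (mod 256)."""
    return [(x + y) % 256 for x, y in zip(a, b)]

def simd_mac(acc, a, b):
    """Lane-wise multiply-accumulate -- the core of FIR and dot product."""
    return [s + x * y for s, x, y in zip(acc, a, b)]

print(simd_add_u8([250, 1, 2, 3], [10, 20, 30, 40]))  # [4, 21, 32, 43]
# A dot product is a lane-wise MAC followed by a horizontal sum:
print(sum(simd_mac([0, 0, 0, 0], [1, 2, 3, 4], [5, 6, 7, 8])))  # 70
```

Vectorizing code for SIMD means restructuring loops so they look like these lane-wise operations, which is the "think data parallelism" point above.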
33
ARM SIMD Instructions
[Sloss, Symes, Wright, "ARM System Developer's Guide", 2004]
34
ARM Cortex-A9
[http://www.arm.com/products/processors/cortex-a/cortex-a9.php]
35
Multi Processor ARM
[http://www.arm.com/products/processors/cortex-a/cortex-a9.php]
36
ARM Accelerations
• Memory
‣ Cache; Harvard architecture; Thumb instructions; Memory pipeline
• Pipeline
‣ Superscalar; out-of-order execution; Register renaming; Speculative execution; Branch prediction
NVIDIA Tegra 2
• Signal Processing Support
‣ MAC support; DSP operations; SIMD and Vector support; Floating-point support
• Multiprocessor support
‣ Accelerator Coherence Port; Cache Coherence (snooping)
37
Threads
• In computer science, a thread of execution is the smallest unit of processing that can be scheduled by an operating system
• Multiple threads can exist within the same process and share resources such as memory, while different processes do not share these resources.
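That shared memory is easy to demonstrate with the standard-library threading module (the counter and worker names are illustrative):

```python
# Four threads in one process all update the same `counter` variable,
# showing that threads share the process's memory. The lock is needed
# because `counter += 1` is a read-modify-write, not an atomic step.
import threading

counter = 0
lock = threading.Lock()

def worker(n):
    global counter
    for _ in range(n):
        with lock:          # serialize the shared update
            counter += 1

threads = [threading.Thread(target=worker, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000 -- every thread saw and updated the same memory
```

Two separate processes running the same code would each print 10000 from their own private copy of `counter`, which is exactly the thread/process distinction in the bullet above.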
38
Multithreading
• On a single processor, multithreading generally occurs by time-division multiplexing (as in multitasking): the processor switches between different threads.
• On a multiprocessor (including a multi-core system), the threads or tasks will actually run at the same time, with each processor or core running a particular thread or task.
39
Multithreading Computers
• Multithreading computers have hardware support to efficiently execute multiple threads.
‣ These are distinguished from multiprocessing systems (such as multi-core systems) in that the threads have to share the resources of a single core: the computing units, the CPU caches and the translation lookaside buffer (TLB).
‣ Where multiprocessing systems include multiple complete processing units, multithreading aims to increase utilization of a single core by using thread-level as well as instruction-level parallelism. As the two techniques are complementary, they are sometimes combined in systems with multiple multithreading CPUs and in CPUs with multiple multithreading cores.
• Simultaneous multithreading, often abbreviated as SMT, is a technique for improving the overall efficiency of superscalar CPUs with hardware multithreading. SMT permits multiple independent threads of execution to better utilize the resources provided by modern processor architectures. (Intel calls this hyper-threading) 40
Questions
• Study-support questions
‣ What kinds of parallelism are used in a modern processor?
‣ What "hazards" exist when using various forms of parallelism?
‣ Explain the difference between the instruction-level parallelism used in architectures with pipelining, superscalar, superpipeline, and VLIW.
‣ Explain what threads are, and how computer architectures can support them and thereby improve performance.
41
Links
• Microprocessor Design
• Reduced instruction set computing
• Pipeline
• "Techniques to Improve Performance Beyond Pipelining: Superpipelining, Superscalar, and VLIW", Gaudiot, Jung-Yup Kang, and Won Woo Ro, 2005
• Superscalar
• VLIW (Very long instruction word)
• Vector processor
• SIMD
• Register renaming
• Instruction-level parallelism
• Data dependency
• Threads
• Simultaneous multithreading / Hyper-threading
• Multithreading
42