Intel SIMD architecture

Computer Organization and Assembly Languages
Yung-Yu Chuang
2008/1/5

Overview

• SIMD
• MMX architectures
• MMX instructions
• examples
• SSE/SSE2

• SIMD instructions are probably the best place to use assembly, since compilers usually do not do a good job of using these instructions.

Performance boost

• Increasing clock rate is not fast enough for boosting performance.

In his 1965 paper, Intel co-founder Gordon Moore observed that “the number of transistors per square inch had doubled every 18 months.”

• Architecture improvements (such as pipeline/cache/SIMD) are more significant.
• Intel analyzed multimedia applications and found they share the following characteristics:
  – Small native data types (8-bit pixel, 16-bit audio)
  – Recurring operations
  – Inherent parallelism

SIMD

SISD/SIMD/Streaming

• SIMD (single instruction multiple data) architecture performs the same operation on multiple data elements in parallel
• PADDW MM0, MM1
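To make the PADDW bullet concrete, here is a minimal scalar sketch in C of what one packed-word add does. The type name mmx_words and the function name paddw are hypothetical helpers for illustration only; the real instruction performs all four lane additions in a single step.

```c
#include <stdint.h>

/* Hypothetical model of a 64-bit MMX register viewed as four 16-bit lanes. */
typedef struct { uint16_t lane[4]; } mmx_words;

/* Scalar model of PADDW: four independent wrap-around 16-bit additions. */
mmx_words paddw(mmx_words a, mmx_words b) {
    mmx_words r;
    for (int i = 0; i < 4; i++) {
        /* each lane wraps modulo 2^16; no carry crosses into the next lane */
        r.lane[i] = (uint16_t)(a.lane[i] + b.lane[i]);
    }
    return r;
}
```

The loop is only a model: on the hardware there is no loop, and the absence of carries between lanes is what distinguishes a packed add from a plain 64-bit add.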

IA-32 SIMD development

• MMX (Multimedia Extension) was introduced in 1996 (Pentium with MMX and Pentium II).
• SSE (Streaming SIMD Extension) was introduced with Pentium III.
• SSE2 was introduced with Pentium 4.
• SSE3 was introduced with Pentium 4 supporting hyper-threading technology. SSE3 adds 13 more instructions.

MMX

• After analyzing a lot of existing applications such as graphics, MPEG, music, speech recognition, games, and image processing, Intel found that many multimedia algorithms execute the same instructions on many pieces of data in a large data set.
• Typical elements are small: 8 bits for pixels, 16 bits for audio, 32 bits for graphics and general computing.
• New data type: 64-bit packed data type. Why 64 bits?
  – Good enough
  – Practical

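MMX data types

Even if MMX registers are 64-bit, they don't extend the Pentium to a 64-bit CPU, since only logic instructions are provided for 64-bit data.

MMX integration into IA

• The 8 MMX registers MM0~MM7 are mapped onto the 80-bit FPU registers.
• When an MMX value is written, bits 79-64 are set to all ones (11…11), so the FPU sees the register as NaN or infinity if it is read as a real.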

Compatibility

• To be fully compatible with existing IA, no new mode or state was created. Hence, for context switching, no extra state needs to be saved.
• To reach the goal, MMX is hidden behind the FPU. When floating-point state is saved or restored, MMX state is saved or restored.
• It allows an existing OS to perform context switching on processes executing MMX instructions without being aware of MMX.
• However, it means MMX and FPU cannot be used at the same time. There is a big overhead to switch.
• Although Intel defends its decision to alias MMX to the FPU for compatibility, it is actually a bad decision. The OS can just provide a service pack or get updated.
• This is why Intel later introduced SSE without any aliasing.

MMX instructions

• 57 MMX instructions are defined to perform parallel operations on multiple data elements packed into 64-bit data types.
• These include add, subtract, multiply, compare, shift, data conversion, 64-bit data move, 64-bit logical operations, and multiply-add for multiply-accumulate operations.
• All instructions except for data move use MMX registers as operands.
• Most complete support is for 16-bit operations.

Saturation arithmetic

• Useful in graphics applications.
• When an operation overflows or underflows, the result becomes the largest or smallest representable number.
• Two types: signed and unsigned saturation

(Figure: wrap-around vs. saturating results.)
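The difference between wrap-around and saturating arithmetic can be sketched per byte lane in C. The function names below are hypothetical; they model the per-element behaviour of PADDB (wrap-around), PADDUSB (unsigned saturation), and PADDSB (signed saturation) and are not a substitute for the instructions themselves.

```c
#include <stdint.h>

/* Wrap-around add of one unsigned byte lane (PADDB behaviour). */
uint8_t add_wrap_u8(uint8_t a, uint8_t b) {
    return (uint8_t)(a + b);              /* result is taken modulo 256 */
}

/* Unsigned saturating add (PADDUSB behaviour): clamp to 255. */
uint8_t add_sat_u8(uint8_t a, uint8_t b) {
    int sum = a + b;                      /* computed in int, cannot overflow */
    return (uint8_t)(sum > 255 ? 255 : sum);
}

/* Signed saturating add (PADDSB behaviour): clamp to [-128, 127]. */
int8_t add_sat_s8(int8_t a, int8_t b) {
    int sum = a + b;
    if (sum > 127)  return 127;
    if (sum < -128) return -128;
    return (int8_t)sum;
}
```

For a pixel value, adding 100 to 200 with wrap-around yields 44 (a bright pixel turns dark), while saturation yields 255, which is why saturation is the natural choice in graphics code.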

MMX instructions

• EMMS: call it before you switch to FPU from MMX. It is an expensive operation.

Arithmetic

• PADDB/PADDW/PADDD: add two packed numbers. No EFLAGS are set, so you must ensure yourself that overflow never occurs.
• Multiplication: two steps
• PMULLW: multiplies four words and stores the four low words of the four doubleword results.
• PMULHW/PMULHUW: multiplies four words and stores the four high words of the four doubleword results. PMULHUW is for unsigned.
• PMADDWD
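The two-step multiplication and the multiply-add can be sketched per lane in C. The function names are hypothetical; each models one element of PMULLW, PMULHW, or one output doubleword of PMADDWD (which multiplies signed words and sums adjacent pairs of products).

```c
#include <stdint.h>

/* PMULLW per lane: low 16 bits of the 32-bit signed product. */
uint16_t pmullw_lane(int16_t a, int16_t b) {
    int32_t p = (int32_t)a * (int32_t)b;
    return (uint16_t)((uint32_t)p & 0xFFFFu);
}

/* PMULHW per lane: high 16 bits of the 32-bit signed product. */
uint16_t pmulhw_lane(int16_t a, int16_t b) {
    int32_t p = (int32_t)a * (int32_t)b;
    return (uint16_t)((uint32_t)p >> 16);
}

/* PMADDWD per output doubleword: a0*b0 + a1*b1 on signed words.
   The sum is formed in 64 bits so the sketch has no signed overflow;
   the only input that does not fit in 32 bits is all four words = -32768. */
int32_t pmaddwd_pair(int16_t a0, int16_t b0, int16_t a1, int16_t b1) {
    int64_t s = (int64_t)a0 * b0 + (int64_t)a1 * b1;
    return (int32_t)s;
}
```

Combining the low and high halves as (high << 16) | low gives back the full 32-bit product, which is what the "two steps" bullet refers to.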

Detect MMX/SSE

mov  eax, 1           ; request version info
cpuid                 ; supported since Pentium
test edx, 00800000h   ; bit 23: MMX
                      ; 02000000h (bit 25) SSE
                      ; 04000000h (bit 26) SSE2
jnz  HasMMX
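The feature test can also be expressed in C once the EDX value is available. The sketch below only decodes the bits; it assumes the caller has already obtained EDX from CPUID with EAX=1 by some compiler-specific means, and the macro and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Feature masks for EDX after CPUID with EAX=1 (values from the slide). */
#define CPUID_EDX_MMX  0x00800000u   /* bit 23 */
#define CPUID_EDX_SSE  0x02000000u   /* bit 25 */
#define CPUID_EDX_SSE2 0x04000000u   /* bit 26 */

/* Returns true if the given feature bit is set in the EDX value. */
bool has_feature(uint32_t edx, uint32_t mask) {
    return (edx & mask) != 0;
}
```

A program would typically call has_feature(edx, CPUID_EDX_MMX) once at startup and select an MMX or a plain code path accordingly.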

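Example: add a constant to a vector

char d[]={5, 5, 5, 5, 5, 5, 5, 5};
char clr[]={65,66,68,...,87,88}; // 24 bytes
__asm{
    movq  mm1, d          ; mm1 = eight copies of the constant 5
    mov   ecx, 3          ; 3 iterations of 8 bytes; loop counts in ECX, not CX
    mov   esi, 0          ; byte offset into clr
L1: movq  mm0, clr[esi]   ; load 8 bytes
    paddb mm0, mm1        ; add 5 to each byte
    movq  clr[esi], mm0   ; store 8 bytes back
    add   esi, 8
    loop  L1
    emms                  ; leave MMX state before any FPU use
}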

Comparison

• No EFLAGS are set (how many flags would you need?). Results are stored in the destination as a per-element mask: all ones if true, all zeros if false.
• EQ/GT, no LT

Change data types

• Pack: converts a larger data type to the next smaller data type.
• Unpack: takes two operands and interleaves them. It can be used to expand a data type for intermediate calculation.
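A per-lane C sketch of the comparison results may help here. The function names are hypothetical; they model one byte element of PCMPEQB and PCMPGTB, and show how a less-than test can be obtained by swapping operands since there is no LT instruction.

```c
#include <stdint.h>

/* PCMPEQB per lane: 0xFF if the bytes are equal, else 0x00. */
uint8_t pcmpeqb_lane(uint8_t a, uint8_t b) {
    return a == b ? 0xFF : 0x00;
}

/* PCMPGTB per lane: signed greater-than, 0xFF if a > b, else 0x00. */
uint8_t pcmpgtb_lane(int8_t a, int8_t b) {
    return a > b ? 0xFF : 0x00;
}

/* No LT instruction exists: a < b is b > a with the operands swapped. */
uint8_t cmplt_lane(int8_t a, int8_t b) {
    return pcmpgtb_lane(b, a);
}
```

Because the result is a mask rather than a flag, it can be fed directly into PAND/PANDN/POR to select values without branching.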

Pack with signed saturation

Unpack low portion

Unpack high portion
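Since the diagrams for these operations are not reproduced here, the following C sketch models them on byte arrays. It assumes index 0 is the least significant byte of the register; the function names are hypothetical. The first function models one element of PACKSSWB, the other two model PUNPCKLBW and PUNPCKHBW.

```c
#include <stdint.h>

/* One element of PACKSSWB: narrow a signed word to a signed byte,
   clamping to [-128, 127] instead of truncating. */
int8_t pack_sat_s16_to_s8(int16_t w) {
    if (w > 127)  return 127;
    if (w < -128) return -128;
    return (int8_t)w;
}

/* PUNPCKLBW model: interleave the low four bytes of dst and src. */
void unpack_low_bytes(const uint8_t dst[8], const uint8_t src[8], uint8_t out[8]) {
    for (int i = 0; i < 4; i++) {
        out[2 * i]     = dst[i];
        out[2 * i + 1] = src[i];
    }
}

/* PUNPCKHBW model: interleave the high four bytes of dst and src. */
void unpack_high_bytes(const uint8_t dst[8], const uint8_t src[8], uint8_t out[8]) {
    for (int i = 0; i < 4; i++) {
        out[2 * i]     = dst[4 + i];
        out[2 * i + 1] = src[4 + i];
    }
}
```

When src is all zeros, each output byte pair reads as a zero-extended 16-bit word, which is the usual way to widen pixels before arithmetic.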

Keys to SIMD programming

• Efficient data layout
• Elimination of branches

Application: frame difference

(Figure: images A, B, and |A-B|, with intermediate results A-B, B-A, and (A-B) or (B-A).)

With unsigned saturating subtraction, A-B and B-A are each clamped to zero wherever they would be negative, so |A-B| = (A-B) or (B-A). This needs PSUBUSB; the signed form PSUBSB would not clamp at zero.

MOVQ    mm1, A     // move 8 pixels of image A
MOVQ    mm2, B     // move 8 pixels of image B
MOVQ    mm3, mm1   // mm3=A
PSUBUSB mm1, mm2   // mm1=A-B (0 where A<B)
PSUBUSB mm2, mm3   // mm2=B-A (0 where B<A)
POR     mm1, mm2   // mm1=|A-B|
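A scalar C sketch of the same idea, assuming unsigned 8-bit pixels and unsigned saturating subtraction (PSUBUSB semantics). The names sub_sat_u8 and frame_difference are hypothetical; the MMX version processes eight pixels per instruction instead of one per loop iteration.

```c
#include <stddef.h>
#include <stdint.h>

/* Unsigned saturating subtract of one byte: clamps at 0 instead of wrapping. */
static uint8_t sub_sat_u8(uint8_t a, uint8_t b) {
    return a > b ? (uint8_t)(a - b) : 0;
}

/* out[i] = |a[i] - b[i]| computed as (a - b) OR (b - a) with saturation. */
void frame_difference(const uint8_t *a, const uint8_t *b, uint8_t *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        uint8_t d1 = sub_sat_u8(a[i], b[i]);   /* nonzero only where a > b */
        uint8_t d2 = sub_sat_u8(b[i], a[i]);   /* nonzero only where b > a */
        out[i] = (uint8_t)(d1 | d2);           /* at most one side is nonzero */
    }
}
```

The OR works because one of the two differences is always zero, so no comparison or branch on the sign is needed in the packed version.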

Example: image fade-in-fade-out

A*α+B*(1-α) = B+α(A-B)

(Figures: images A and B, and the blended results for α=0.75, α=0.5, and α=0.25.)

• Two formats: planar and chunky
• In chunky format, 16 bits of 64 bits are wasted
• So, we use planar in the following example

(Figure: Image A and Image B, each with R, G, B, A components.)
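As a reference point for the blend formula, here is a scalar C sketch for one 8-bit colour component. It assumes α is supplied as a fixed-point weight alpha256 in the range 0 to 256 (a hypothetical parameter, not the scaling used by the MMX code), and it uses the A*α+B*(1-α) form so that every intermediate value stays non-negative.

```c
#include <stdint.h>

/* Blend one component: alpha256 = 0 gives all B, alpha256 = 256 gives all A. */
uint8_t blend_pixel(uint8_t a, uint8_t b, unsigned alpha256) {
    /* A*alpha + B*(1-alpha), with alpha scaled by 256; max value is 255*256 */
    unsigned sum = a * alpha256 + b * (256u - alpha256);
    return (uint8_t)(sum >> 8);   /* divide by 256 to undo the scaling */
}
```

Stepping alpha256 from 0 to 256 over successive frames produces the fade from image B to image A.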


Example: image fade-in-fade-out

MOVQ      mm0, alpha   // four 16-bit zero-padded copies of α
MOVD      mm1, A       // move 4 pixels of image A
MOVD      mm2, B       // move 4 pixels of image B
PXOR      mm3, mm3     // clear mm3 to all zeroes
// unpack 4 pixels to 4 words
PUNPCKLBW mm1, mm3     // Because A-B could be
PUNPCKLBW mm2, mm3     // negative, need 16 bits
PSUBW     mm1, mm2     // (A-B)
PMULHW    mm1, mm0     // (A-B)*α: high word of each product, so α must be pre-scaled as a 16-bit fixed-point value
PADDW     mm1, mm2     // (A-B)*α + B
// pack four words back to four bytes
PACKUSWB  mm1, mm3

Data-independent computation

• Each operation can execute without needing to know the results of a previous operation.
• Example, sprite overlay

  for i=1 to sprite_Size
    if sprite[i]=clr
    then out_color[i]=bg[i]
    else out_color[i]=sprite[i]

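Application: sprite overlay

• How to execute data-dependent calculations on several pixels in parallel.

MOVQ    mm0, sprite   // mm0 = 4 sprite pixels
MOVQ    mm2, mm0      // mm2 = copy of sprite pixels
MOVQ    mm4, bg       // mm4 = 4 background pixels
MOVQ    mm1, clr      // mm1 = transparent colour in every word
PCMPEQW mm0, mm1      // mm0 = mask, all ones where sprite = clr
PAND    mm4, mm0      // mm4 = background where sprite is transparent
PANDN   mm0, mm2      // mm0 = sprite where it is not transparent
POR     mm0, mm4      // mm0 = combined result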
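The mask-and-select technique can be sketched in scalar C as follows, assuming 16-bit pixels as implied by PCMPEQW. The function name sprite_overlay and its parameters are hypothetical; the point is that the if/else of the pseudocode is replaced by AND, AND-NOT, and OR on a comparison mask.

```c
#include <stddef.h>
#include <stdint.h>

/* Branch-free sprite overlay: where sprite[i] equals the transparent
   colour clr, take the background pixel; elsewhere take the sprite pixel. */
void sprite_overlay(const uint16_t *sprite, const uint16_t *bg,
                    uint16_t clr, uint16_t *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        /* 0xFFFF where the sprite pixel is transparent, 0x0000 otherwise */
        uint16_t mask = (uint16_t)-(sprite[i] == clr);
        out[i] = (uint16_t)((bg[i] & mask) | (sprite[i] & (uint16_t)~mask));
    }
}
```

Each statement in the loop body corresponds to one of the packed instructions, which apply it to four pixels at once.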

Application: matrix transpose

char M1[4][8]; // matrix to be transposed
char M2[8][4]; // transposed matrix
int n=0;
for (int i=0;i
