SystemC Implementation and Performance Evaluation of a Matrix Processor for Intensive Computing Applications By Abdulmajid F. Al-Junaid and Mostafa I. Soliman


1. Introduction
   • Motivation
   • Background
   • Related Work
2. Mat-Core ISA Architecture
   • Mat-Core Instruction Sets
   • An Assembler for Mat-Core ISA
3. Mat-Core Microarchitecture and SystemC Implementation
   • Mat-Core Microarchitecture
   • SystemC Implementation of Mat-Core
   • Extending the Scoreboard to Control the Matrix Core
   • Implementation of Vector/Matrix Operations on Mat-Core
4. Performance-Area Trade-offs of the Mat-Core Processor
5. Mat-Core Scalability
   • Hardware Scalability
   • Performance Scaling of Linear Algebra Kernels on Mat-Core
   • Performance Scaling of DCT/IDCT on Mat-Core
   • Performance Scaling of Image Registration on Mat-Core
6. Multi-Mat-Core Processor
7. Conclusion and Future Work


 The huge budget of transistors provided by technology.
 Moore's law is alive and well, but the traditional sources of performance improvement (ILP and clock frequencies) have all been flattening.
 The increasing demand for performance in data parallel applications.
These factors motivate many projects in the field of processor design. One of these projects is the research processor called Mat-Core, which is proposed to exploit this huge budget of transistors for accelerating data parallel applications.


Moore's law: the number of transistors per integrated circuit doubles roughly every 18 months. Moore's law is alive and well, but the traditional sources of performance improvement (ILP and clock frequencies) have all been flattening.


Data parallel applications are growing in importance and demanding increased performance from hardware. They include 3D graphics, image processing, signal processing, voice recognition, network processing, scientific and engineering applications, etc. They are based on data parallelism, where the same operations are carried out across arrays of operands. To satisfy this performance demand, specialized hardware is commonplace in data parallel applications. The domain of scientific and engineering applications has a limitless requirement for more computation.

Recently, major microprocessor vendors have announced SIMD extensions to their general-purpose (GP) microprocessors to improve the performance of data parallel applications. Examples:
• Intel extended IA-32 with MMX, SSE, SSE2, SSE3, and AVX,
• Sun enhanced SPARC with VIS,
• Hewlett-Packard added MAX to its PA-RISC architecture, and
• Motorola extended the PowerPC with AltiVec.

• These extensions are a good step; however, they suffer from:
• fixed vector length and stride,
• keeping the datapath busy for only a few clock cycles, and
• needing multiple instructions to load and align vector data.

 Computer architects have employed various forms of parallelism to provide increases in performance beyond those made possible just by improvements in underlying circuit technologies:
• Pipelining
• Instruction-level parallelism (ILP)
• Thread-level parallelism (TLP)
• Data-level parallelism (DLP)
 These various forms of machine parallelism are not mutually exclusive and can be combined to yield systems that exploit all forms of application parallelism.

Vector ISAs provide an efficient organization for controlling a large amount of computation resources. They can express large amounts of data parallelism in a very compact manner. A vector instruction can package multiple homogeneous and independent scalar operations:
- saves hardware in the fetch and decode stages,
- avoids additional hardware for dependency checking,
- keeps a functional unit busy for several clock cycles, and
- reduces loop overhead.

The concept of parallel lanes is fundamental to the vector microarchitecture, as it leads to advantages in performance, design complexity, and scalability.
• Note that parallel pipelines can process not only vector but also matrix data.

Recent trends in high-performance computing (HPC) systems have shown that future increases in performance will be achieved through chip-level multiprocessors (CMPs). Today, the number of cores approximately doubles every two years. Programmers are only required to parallelize their applications in order to make whole-application performance again start to track Moore's law. The advantages of CMP:
- exploiting TLP (the best way to address the issue of power),
- requiring only a fairly modest engineering effort, and
- the cores may be heterogeneous, which could address a variety of applications.

 The complexity of advanced chip designs is driving the progress of electronic system-level (ESL) methodologies and tools. Amongst several possible languages, SystemC (a system-level modeling language) is quickly becoming the favorite language of advanced chip designers for modeling and simulation. Advantages of SystemC:
1. modeling systems at different abstraction levels,
2. allowing the description and integration of complex hardware and software components in a unified environment,
3. fast and easy simulation and verification, and
4. taking advantage of the popular C++ language.


 Six projects are related to our work because they were developed to accelerate data parallel applications.
Torrent-0 (T0): a single-chip vector microprocessor (1998), to be used for speech recognition (Krste Asanovic).
VIRAM: a media-oriented vector processor (2002), addressing the processor-memory performance gap by using embedded DRAM (David Patterson, Christoforos Kozyrakis at UC Berkeley).
TRIPS: the Tera-op, Reliable, Intelligently adaptive Processing System (2003), aiming to produce a scalable and distributed architecture that can accelerate industrial, consumer, embedded, and scientific workloads, reaching trillions of calculations per second on a single chip; a new ISA exploiting ILP, TLP, and DLP (University of Texas at Austin).

 Trident: a technology-scalable matrix processor (2004), emphasizing local communication to overcome the limitations of future VLSI technology (Soliman).
ViVA: Virtual Vector Architecture (2008); ViVA adds vector-style memory operations to existing microprocessors but does not include arithmetic datapaths; instead, memory instructions work with a new buffer placed between the core and second-level cache, in parallel with the L1 cache (Gebis).
Maven: a flexible and efficient data-parallel accelerator (2010); a data-parallel accelerator based on simplifying the instruction set, microarchitecture, and programming methodology for a vector-thread architecture (Batten).

 Whereas VIRAM and T0 improve multimedia performance, TRIPS, Trident, ViVA, and Maven can improve both multimedia and scientific applications. Torrent-0, VIRAM, ViVA, and Maven are vector processors, whereas TRIPS and Trident are matrix processors.
Trident versus Mat-Core: a ring-connected versus a partitioned register file, local versus global communication and scalability, and large versus small vector/matrix strips/blocks (64 lanes or more versus 4-8 lanes).


Mat-Core: a general-purpose processor which extends a scalar ISA to process vector and matrix data. There are semantic and parallelism gaps between applications and the scalar ISAs of current microprocessors. The use of a scalar ISA results in the data parallelism being scattered by compilers and then gathered again using complex hardware. Three levels of ISA are proposed for the Mat-Core architecture.


[Figure: datapath for a scalar add — source registers r1 and r2 feed a single functional unit whose result is written to r3]

Scalar instruction: add r3, r2, r1

The scalar instruction set is the fundamental instruction set for any general-purpose processor, but it cannot express parallelism to the hardware. Thus, the performance of scalar processors can be improved only by processing more scalar instructions concurrently, which requires either complex hardware (superscalar architectures) or sophisticated compiler techniques (VLIW architectures).
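As a small illustration of the point above, here is how a 16-element addition looks to a scalar ISA: a loop of independent scalar adds whose parallelism the hardware must rediscover at run time (a plain C++ sketch, not Mat-Core code).

```cpp
#include <array>

// Illustration only: with a scalar-only ISA, a 16-element add compiles
// to a loop of independent scalar add instructions plus loop overhead.
// The data parallelism is implicit; superscalar hardware must rediscover
// it by checking dependences between the individual instructions.
std::array<int, 16> scalar_add(const std::array<int, 16>& x,
                               const std::array<int, 16>& y) {
    std::array<int, 16> z{};
    for (int i = 0; i < 16; ++i)
        z[i] = x[i] + y[i];   // one "add r3, r2, r1" per element
    return z;
}
```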

[Figure: a 16-element vector add on 4 parallel lanes — elements of V1 and V2 are striped across the lanes (Lane 0 holds x0, x4, x8, x12; Lane 3 holds x3, x7, x11, x15), crossbars connect the lanes, and results z0..z15 are written to V3]

Vector instruction: add.vv V3, V2, V1

Vector ISAs can express large amounts of data parallelism in a very compact manner. A vector instruction can package multiple homogeneous and independent scalar operations:
- saves hardware in the fetch and decode stages,
- avoids additional hardware for dependency checking,
- keeps a functional unit busy for several clock cycles, and
- reduces loop overhead.
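The lane organization in the figure can be modeled in plain C++ (an illustrative sketch, not the Mat-Core implementation; the 16-element vector length and modulo-4 element striping are taken from the figure).

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kLanes = 4;   // parallel lanes, as in the figure
constexpr std::size_t kVL = 16;     // vector length (assumed)

// Sketch of a single add.vv V3, V2, V1: element i is striped to lane
// i % 4, so each lane holds a quarter of the vector (lane 0: elements
// 0, 4, 8, 12). Each modeled "cycle" every lane produces one result,
// so the 16-element add takes kVL / kLanes = 4 cycles.
std::array<int, kVL> add_vv(const std::array<int, kVL>& v1,
                            const std::array<int, kVL>& v2) {
    std::array<int, kVL> v3{};
    for (std::size_t cycle = 0; cycle < kVL / kLanes; ++cycle)
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            std::size_t i = cycle * kLanes + lane; // element in this lane
            v3[i] = v1[i] + v2[i];
        }
    return v3;
}
```

One vector instruction thus keeps all four functional units busy for several cycles, which is exactly the compactness argument above.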

[Figure: a 4x4 matrix add on 4 parallel lanes — lane j holds column j of M1 and M2 (Lane 0: a00, a10, a20, a30 and b00, b10, b20, b30), crossbars connect the lanes, and the result columns z are written to M3]

Matrix instruction: add.mm M3, M2, M1

Matrix processing is the logical, direct extension of vector processing. A matrix ISA further reduces both the semantic and the parallelism gaps between high-level languages and hardware. Thus, high-level instructions, such as vector-scalar, vector-vector, matrix-scalar, matrix-vector, and matrix-matrix instructions, convey up to 3-D data parallelism to the Mat-Core processor, which reduces the complexity of both the hardware and the compiler.
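The matrix-level semantics can be sketched the same way (illustrative C++ model, not Mat-Core code; the 4x4 block size and column-per-lane layout follow the figure).

```cpp
#include <array>

// Sketch of add.mm M3, M2, M1 on a 4x4 block: lane j holds column j of
// each matrix register (as in the figure), so in each modeled cycle the
// four lanes together produce one full row of the result.
using Mat4 = std::array<std::array<int, 4>, 4>;

Mat4 add_mm(const Mat4& m1, const Mat4& m2) {
    Mat4 m3{};
    for (int row = 0; row < 4; ++row)        // one cycle per result row
        for (int lane = 0; lane < 4; ++lane) // lanes operate in parallel
            m3[row][lane] = m1[row][lane] + m2[row][lane];
    return m3;
}
```

A single matrix instruction thus replaces the doubly nested scalar loop a compiler would otherwise emit, conveying 2-D parallelism directly to the lanes.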

 Mat-Core is a load/store architecture. Mat-Core instructions are divided into several classes:
• scalar/vector/matrix load/store instructions;
• scalar/vector/matrix arithmetic/logical instructions;
• compare/branch/jump scalar instructions;
• control instructions; and
• move instructions.


Before executing high-level vector/matrix instructions, a CPL (control parallel lanes) instruction should be executed first to adjust the number of parallel lanes and the number of elements per lane.

[Figure: matrix register M5 distributed across Lane0..Lane3]

CPL instruction encoding:
• Matrix class: 6-bit
• CPL opcode: 6-bit
• Strps: 10-bit (number of strips)
• Wstrp: 5-bit (strip width)
• Dim: 5-bit (dimension)

address = rs; for(i=0; i
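The CPL field layout can be illustrated with a small bit-packing helper (plain C++ sketch; the field names and the 6/6/10/5/5-bit widths come from the slide, but the exact bit positions in the real Mat-Core encoding are assumptions).

```cpp
#include <cstdint>

// Hypothetical packing of a 32-bit CPL instruction word. Fields are
// placed most-significant first: matrix class (6), CPL opcode (6),
// strips (10), strip width (5), dimension (5) — 32 bits in total.
std::uint32_t encode_cpl(std::uint32_t mclass, std::uint32_t opcode,
                         std::uint32_t strips, std::uint32_t wstrp,
                         std::uint32_t dim) {
    return ((mclass & 0x3Fu)  << 26) |  // bits 31..26: matrix class
           ((opcode & 0x3Fu)  << 20) |  // bits 25..20: CPL opcode
           ((strips & 0x3FFu) << 10) |  // bits 19..10: number of strips
           ((wstrp  & 0x1Fu)  << 5)  |  // bits 9..5:   strip width
           (dim & 0x1Fu);               // bits 4..0:   dimension
}
```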