SystemC Implementation and Performance Evaluation of a Matrix Processor for Intensive Computing Applications By Abdulmajid F. Al-Junaid and Mostafa I. Soliman
1. Introduction
   • Motivation
   • Background
   • Related Work
2. Mat-Core ISA Architecture
   • Mat-Core Instruction Sets
   • An Assembler for Mat-Core ISA
3. Mat-Core Microarchitecture and SystemC Implementation
   • Mat-Core Microarchitecture
   • SystemC Implementation of Mat-Core
   • Extending the Scoreboard to Control the Matrix Core
   • Implementation of Vector/Matrix Operations on Mat-Core
4. Performance-Area Trade-offs of the Mat-Core Processor
5. Mat-Core Scalability
   • Hardware Scalability
   • Performance Scaling of Linear Algebra Kernels on Mat-Core
   • Performance Scaling of DCT/IDCT on Mat-Core
   • Performance Scaling of Image Registration on Mat-Core
6. Multi-Mat-Core Processor
7. Conclusion and Future Work
The huge budget of transistors provided by modern technology, together with the increasing performance demands of data-parallel applications, motivates many projects in the field of processor design. Moore's law is alive and well, but the traditional sources of performance improvement (ILP and clock frequencies) have been flattening. One of these projects is the research processor Mat-Core, which is proposed to exploit this huge budget of transistors to accelerate data-parallel applications.
Moore’s law: the number of transistors per integrated circuit doubles roughly every 18 months. Moore's law is alive and well, but the traditional sources of performance improvement (ILP and clock frequencies) have been flattening.
Data-parallel applications are growing in importance and demand increased performance from hardware. They include 3D graphics, image processing, signal processing, voice recognition, network processing, scientific and engineering applications, etc. They are based on data parallelism, where the same operation is carried out across arrays of operands. To satisfy this performance demand, specialized hardware is commonplace in data-parallel applications. The domain of scientific and engineering applications has a limitless requirement for more computation.
Recently, major microprocessor vendors have announced SIMD extensions to their general-purpose microprocessors to improve the performance of data-parallel applications. Examples: • Intel extended IA-32 with MMX, SSE, SSE2, SSE3, and AVX, • Sun enhanced SPARC with VIS, • Hewlett-Packard added MAX to its PA-RISC architecture, and • Motorola extended the PowerPC with AltiVec.
• A good step; however, these extensions suffer from • fixed vector length and stride, • a datapath kept busy for only a few clock cycles per instruction, and • the need for multiple instructions to load and align vector data.
Computer architects have employed various forms of parallelism to provide increases in performance beyond those made possible by improvements in the underlying circuit technology alone: • Pipelining • Instruction-level parallelism (ILP) • Thread-level parallelism (TLP) • Data-level parallelism (DLP). These forms of machine parallelism are not mutually exclusive and can be combined to yield systems that exploit all forms of application parallelism.
Vector ISAs provide an efficient organization for controlling a large amount of computation resources. They can express large amounts of data parallelism in a very compact manner. A vector instruction packages multiple homogeneous and independent scalar operations (as the sketch below illustrates): - saves hardware in the fetch and decode stages, - avoids additional hardware for dependency checking, - keeps a functional unit busy for several clock cycles, - reduces loop overhead.
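As a rough illustration (plain C++, not Mat-Core code): the first loop below is what a scalar ISA must fetch and decode element by element, while a single vector instruction such as add.vv expresses the same work as one instruction that the hardware iterates internally.

```cpp
#include <cstddef>

// Scalar ISA view: one add instruction per element, plus loop overhead
// (compare, branch, index update) fetched and decoded every iteration.
void scalar_add(const float* x, const float* y, float* z, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        z[i] = x[i] + y[i];          // one scalar instruction per element
}

// Vector ISA view: a single instruction like add.vv V3, V2, V1 covers
// all vl elements; the loop here only models the hardware's iteration.
void vector_add_semantics(const float* x, const float* y, float* z,
                          std::size_t vl) {
    for (std::size_t i = 0; i < vl; ++i)   // conceptually one instruction
        z[i] = x[i] + y[i];
}
```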
The concept of parallel lanes is fundamental to the vector microarchitecture, as it leads to advantages in performance, design complexity, and scalability. • Note that on parallel lanes, not only vector but also matrix data can be processed.
Recent trends in high-performance computing (HPC) systems have shown that future increases in performance will be achieved through chip-level multiprocessors (CMPs). Today, the number of cores approximately doubles every two years. All that is required of programmers is to parallelize their applications so that whole-application performance again starts to track Moore's law. The advantages of CMP: - exploiting TLP (the best way to address the issue of power), - requiring only a fairly modest engineering effort, and - the cores may be heterogeneous, which could address a variety of applications.
The complexity of advanced chip designs is driving the progress of electronic system-level (ESL) methodologies and tools. Among several possible languages, SystemC (a system-level modeling language) is quickly becoming the favorite language of advanced chip designers for modeling and simulation. Advantages of SystemC: 1. modeling systems at different abstraction levels, 2. allowing the description and integration of complex hardware and software components in a unified environment, 3. fast and easy simulation and verification, and 4. building on the popular C++ language.
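For readers unfamiliar with SystemC, a minimal sketch of a combinational adder module (a generic illustration, not part of the Mat-Core model):

```cpp
#include <systemc.h>
#include <iostream>

// A trivial combinational adder: SystemC models hardware as C++ modules
// with ports, processes, and a sensitivity list.
SC_MODULE(Adder) {
    sc_in<int>  a, b;   // input ports
    sc_out<int> sum;    // output port

    void compute() { sum.write(a.read() + b.read()); }

    SC_CTOR(Adder) {
        SC_METHOD(compute);   // re-evaluate whenever an input changes
        sensitive << a << b;
    }
};

int sc_main(int, char*[]) {
    sc_signal<int> a, b, sum;
    Adder add1("add1");
    add1.a(a); add1.b(b); add1.sum(sum);

    a = 2; b = 3;
    sc_start(1, SC_NS);               // let the method process run
    std::cout << sum.read() << "\n";  // prints 5
    return 0;
}
```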
Six projects are related to our work because they were developed to accelerate data-parallel applications.
Torrent-0 (T0): A Single-Chip Vector Microprocessor (1998), to be used for speech recognition (Krste Asanović).
VIRAM: A Media-Oriented Vector Processor (2002), addressing the processor-memory performance gap by using embedded DRAM (David Patterson and Christoforos Kozyrakis at UC Berkeley).
TRIPS: The Tera-op, Reliable, Intelligently adaptive Processing System (2003), aiming at a scalable, distributed architecture that can accelerate industrial, consumer, embedded, and scientific workloads, reaching trillions of calculations per second on a single chip; a new ISA; exploits ILP, TLP, and DLP (University of Texas at Austin).
Trident: A Technology-Scalable Matrix Processor (2004), which emphasizes local communication to overcome the limitations of future VLSI technology (Soliman).
ViVA: Virtual Vector Architecture (2008), which adds vector-style memory operations to existing microprocessors but does not include arithmetic datapaths; instead, memory instructions work with a new buffer placed between the core and the second-level cache, in parallel with the L1 cache (Gebis).
Maven: A Flexible and Efficient Data-Parallel Accelerator (2010), based on simplifying the instruction set, microarchitecture, and programming methodology of a vector-thread architecture (Batten).
Whereas VIRAM and T0 improve multimedia performance, TRIPS, Trident, ViVA, and Maven can improve both multimedia and scientific applications. Torrent-0, VIRAM, ViVA, and Maven are vector processors, whereas TRIPS and Trident are matrix processors. Trident versus Mat-Core: a ring-connected versus a partitioned register file, local versus global communication and scalability, and large versus small vector/matrix strips/blocks (64 lanes or more versus 4-8 lanes).
Mat-Core: a general-purpose processor that extends a scalar ISA to process vector and matrix data. There are semantic and parallelism gaps between applications and the scalar ISAs of current microprocessors. Using a scalar ISA means compilers scatter the data parallelism, and complex hardware must then gather that parallelism again. Three levels of ISA are proposed for the Mat-Core architecture.
Figure: a scalar instruction, add r3, r2, r1, operates on individual registers (r1, r2, r3) one element at a time.
The scalar instruction set is the fundamental instruction set for any general-purpose processor, but it cannot express parallelism to the hardware (processor). Thus the performance of scalar processors can be improved only by processing more scalar instructions concurrently, which - requires complex hardware (superscalar architectures), or - sophisticated compiler techniques (VLIW architectures).
Figure: a vector instruction, add.vv V3, V2, V1, executed on four parallel lanes. The elements of V1 and V2 are striped across the lanes (Lane 0 holds elements 0, 4, 8, 12; Lane 1 holds 1, 5, 9, 13; and so on), crossbars route the operands, and the results z0-z15 are written to V3 with the same striping.
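A minimal C++ sketch of the element striping shown in the figure, assuming elements are distributed round-robin across the lanes (illustrative only, not the Mat-Core implementation):

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t NUM_LANES = 4;   // lanes shown in the figure
constexpr std::size_t VLEN      = 16;  // elements per vector register

// Element i of a vector register lives in lane i % NUM_LANES, at
// position i / NUM_LANES within that lane's register file partition.
struct LaneMapping {
    std::size_t lane;   // which lane stores the element
    std::size_t slot;   // offset inside that lane
};

constexpr LaneMapping map_element(std::size_t i) {
    return { i % NUM_LANES, i / NUM_LANES };
}

// add.vv V3, V2, V1: every lane adds its own slice independently,
// so all four lanes can operate in parallel each cycle.
void add_vv(const std::array<float, VLEN>& v1,
            const std::array<float, VLEN>& v2,
            std::array<float, VLEN>& v3) {
    for (std::size_t lane = 0; lane < NUM_LANES; ++lane)      // parallel in HW
        for (std::size_t i = lane; i < VLEN; i += NUM_LANES)  // this lane's slice
            v3[i] = v1[i] + v2[i];
}
```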
Figure: a matrix instruction, add.mm M3, M2, M1, executed on four parallel lanes. Each lane holds one column of M1 and M2 (Lane 0 holds a00-a30 and b00-b30, Lane 1 the next column, and so on), crossbars route the operands, and the result columns are written to M3 with the same distribution.
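Analogously, a minimal sketch of the column distribution the figure implies, assuming a 4x4 matrix register whose column j resides in lane j (again illustrative, not the actual Mat-Core datapath):

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t N = 4;                       // 4x4 matrix register, 4 lanes
using Mat = std::array<std::array<float, N>, N>;   // m[row][col]

// add.mm M3, M2, M1: column j of each matrix register resides in lane j,
// so each lane adds one full column independently of the others.
void add_mm(const Mat& m1, const Mat& m2, Mat& m3) {
    for (std::size_t lane = 0; lane < N; ++lane)   // one column per lane, parallel in HW
        for (std::size_t row = 0; row < N; ++row)
            m3[row][lane] = m1[row][lane] + m2[row][lane];
}
```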
Matrix processing is the logical, direct extension of vector processing. A matrix ISA further reduces both the semantic and the parallelism gaps between high-level languages and hardware. Thus, high-level instructions, such as vector-scalar, vector-vector, matrix-scalar, matrix-vector, and matrix-matrix instructions, convey up to 3-D data parallelism to the Mat-Core processor, which reduces the complexity of both the hardware and the compiler.
Mat-Core is a load/store architecture. Mat-Core instructions are divided into several classes: • scalar/vector/matrix load/store instructions; • scalar/vector/matrix arithmetic/logical instructions; • compare/branch/jump scalar instructions; • control instructions; and • move instructions.
Before executing high-level vector/matrix instructions, a CPL (control parallel lanes) instruction should be executed first to adjust the number of parallel lanes and the number of elements per lane.
CPL instruction format (32 bits): Matrix-class opcode (6-bit) | CPL opcode (6-bit) | Strps (10-bit) | Wstrp (5-bit) | Dim (5-bit)
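A minimal sketch of how these five fields could be unpacked from a 32-bit instruction word, assuming they are packed in the order listed above from the most significant bit down (the exact bit positions are an assumption):

```cpp
#include <cstdint>

// Hypothetical field layout, MSB first: matrix-class opcode (6) |
// CPL opcode (6) | Strps (10) | Wstrp (5) | Dim (5) = 32 bits total.
struct CplFields {
    uint32_t matrix_class; // 6-bit major opcode (matrix class)
    uint32_t cpl_opcode;   // 6-bit minor opcode selecting CPL
    uint32_t strps;        // 10-bit field
    uint32_t wstrp;        // 5-bit field
    uint32_t dim;          // 5-bit field
};

CplFields decode_cpl(uint32_t insn) {
    return {
        (insn >> 26) & 0x3F,   // bits 31..26
        (insn >> 20) & 0x3F,   // bits 25..20
        (insn >> 10) & 0x3FF,  // bits 19..10
        (insn >>  5) & 0x1F,   // bits  9..5
        insn         & 0x1F    // bits  4..0
    };
}
```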
address = rs; for(i=0; i
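The pseudocode above is truncated in the source. A hedged guess at its shape, assuming it sketches address generation for a strided vector/matrix load; rs holds the base address, and stride, n, buffer, and memory are hypothetical names introduced only for illustration:

```cpp
#include <cstdint>
#include <vector>

// Assumed reconstruction (the source snippet is truncated): a strided
// load walks memory starting at the base address in rs, fetching one
// element per iteration. All names here are hypothetical.
std::vector<uint32_t> strided_load(const std::vector<uint32_t>& memory,
                                   uint32_t rs, uint32_t stride, int n) {
    std::vector<uint32_t> buffer(n);
    uint32_t address = rs;                 // address = rs;
    for (int i = 0; i < n; ++i) {          // for (i = 0; i < n; ++i)
        buffer[i] = memory[address];       // load one element
        address += stride;                 // advance to the next element
    }
    return buffer;
}
```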