FPGA Implementation and Performance Evaluation of a Simultaneous Multithreaded Matrix Processor

Mostafa I. Soliman and Elsayed A. Elsayed
Computer and System Section, Electrical Engineering Department, Faculty of Engineering, Aswan University, Aswan 81542, Egypt
[email protected] / [email protected]
Abstract-This paper proposes a simultaneous multithreaded matrix processor called SMMP to improve the performance of data-parallel applications by exploiting ILP, DLP, and TLP. In SMMP, the well-known 5-stage pipeline (baseline scalar processor) is extended to execute multi-scalar/vector/matrix instructions on unified parallel execution datapaths. SMMP can issue four scalar instructions from two threads each cycle or four vector/matrix operations from one thread, where the execution of vector/matrix instructions in threads is done in round-robin fashion. Moreover, this paper presents the implementation of our proposed SMMP using VHDL, targeting a Virtex-6 FPGA. In addition, the performance of SMMP is evaluated on some kernels from the basic linear algebra subprograms (BLAS). Our results show that the hardware complexity of SMMP is 5.68 times higher than the baseline scalar processor. However, speedups of 4.9, 6.09, 6.98, 8.2, 8.25, 8.72, 9.36, 11.84, and 21.57 are achieved on the BLAS kernels of applying Givens rotation, scalar times vector plus another (SAXPY), vector addition, vector scaling, setting up Givens rotation, dot-product, matrix-vector multiplication, Euclidean length, and matrix-matrix multiplication, respectively. In conclusion, the average speedup over the baseline is 9.55 and the average speedup over complexity is 1.68.

Keywords-data-parallel applications; simultaneous multithreading; vector/matrix processing; performance evaluation; FPGA/VHDL implementation

I. INTRODUCTION

The exponential growth of fabrication technology and the continuous improvements in transistor density have allowed tens of billions of transistors to be integrated into a single chip. Computer architects should propose new processor architectures that use this huge number of transistors to improve the performance of applications. Data-parallel applications, which include scientific and engineering, DSP, multimedia, network, security, etc., are growing in importance and demanding increased performance from hardware [1]. Taking advantage of parallelism is one of the most important methods for improving performance [2]. Computer architects have employed various forms of parallelism to provide performance increases above and beyond those made possible just by improvements in underlying circuit technologies [3]. Beyond simple pipelining, which overlaps the execution of multiple independent instructions, there are three major forms of parallelism: (1) instruction-level parallelism (ILP), (2) data-level parallelism (DLP), and (3) thread-level parallelism (TLP), which are not mutually exclusive. Exploiting all these forms of parallelism is the only way to significantly improve performance, as discussed in [4]. Superscalar processors [5] exploit ILP by dynamically extracting and dispatching multiple independent instructions in the same clock cycle. In contrast, very long instruction word (VLIW) processors [6] rely on compilers to find and exploit ILP. As a result of the low inherent ILP of commercial workloads, these techniques have reached a point of rapidly diminishing returns [7].

Instead of pursuing more ILP, architects are increasingly focusing on TLP implemented with single-chip multiprocessors [2]. However, placing multiple superscalar processors on a chip is also not an effective solution because, in addition to the low ILP, performance suffers when there is little TLP [4]. Patterson and Hennessy [8] showed that moving to a simpler core design results in modestly lower clock frequencies but has enormous benefits in power consumption and chip area. A many-core design would still be an order of magnitude more power- and area-efficient in terms of sustained performance, assuming that the simpler core offers only one-third the computational efficiency of the more complex out-of-order cores (see [8] for more details).
On the other hand, the cheapest and most prevalent form of parallelism available in data-parallel applications is DLP. DLP requires fetching and decoding only a single vector instruction to describe a whole array of parallel operations. This reduces control logic complexity, while the regular nature of vector instructions allows compact parallel datapath structures [3]. Thus, major vendors of general-purpose microprocessors have announced multimedia extensions to their architectures to enhance performance by exploiting DLP [9]. However, these extensions work well only on vector data using SIMD instructions that can be extracted by programmers or compilers. Like vector architectures, they are severely lacking in performance on a higher percentage of scalar data compared to superscalar and VLIW processors. As advocated in [4], a better solution for high performance is to design a processor that can exploit all types of parallelism well. Therefore, this paper proposes a simple matrix processor called SMMP (simultaneous multithreaded matrix processor) to improve the performance of data-parallel applications. SMMP uses a multi-level instruction set architecture (ISA) and exploits ILP, TLP, and DLP to accelerate data-parallel applications. Explicit DLP can be expressed by high-level instructions instead of being extracted dynamically by complicated logic or statically by sophisticated compilers. In addition to vector instructions, SMMP uses matrix-scalar, matrix-vector, and matrix-matrix instructions to express parallelism to hardware. Up to three levels of DLP (O(n) operations for vector processing, O(n²) operations for matrix-vector processing, and O(n³) operations for matrix processing) can be exploited using the multi-level ISA. This leads to high performance, a simple programming model, and compact executable code. ILP is exploited in SMMP by processing multiple scalar instructions in parallel using simple in-order superscalar techniques. Moreover, TLP is exploited through fetching, decoding, and executing multi-scalar/vector/matrix instructions from multiple threads simultaneously.

This paper is organized as follows. Section II presents some related work. Section III describes the architecture of our proposed SMMP. The FPGA implementation of SMMP is presented in Section IV. The performance of our proposed SMMP processor is evaluated on vector/matrix kernels in Section V. Finally, Section VI concludes this paper and gives directions for future work.

II. RELATED WORK

In the literature, researchers have combined several forms of parallelism to achieve higher performance. Eggers et al. [4] proposed simultaneous multithreading, which exploits both TLP and ILP by combining hardware features of wide-issue superscalars and multithreaded processors to use resources more efficiently. Quintana et al. [10] proposed adding a vector unit to a superscalar core to improve the performance of numeric and multimedia codes by exploiting both ILP and DLP. Espasa and Valero [11] showed that ILP and DLP can be merged in a single simultaneous vector multithreaded architecture to execute regular vectorizable code at a performance level that cannot be achieved using either paradigm on its own. Rivoire et al. [12] proposed VLT (vector lane threading), which allows idle vector lanes to run short-vector or scalar threads by partitioning the vector lanes across several threads. Krashinsky [13] proposed a vector-thread (VT) architecture, which unifies the vector and multithreaded compute models. Kumar et al. [14] proposed the Supervector processor, a technique that exploits the advantages of both superscalar and vector processing. Soliman and Al-Junaid [15] proposed extending a multi-core processor with a common matrix unit to maximize on-chip resource utilization and to leverage the advantages of the current multi-core revolution to improve the performance of data-parallel applications. Soliman and Elsayed proposed SMP [16] and SuperSMP [17] for executing scalar/vector/matrix and multi-scalar/vector/matrix instructions from a single thread, respectively.
III. THE ARCHITECTURE OF SMMP

SMMP improves the performance of data-parallel applications by exploiting superscalar, vector/matrix processing, and multithreading techniques. As shown in Fig. 1, SMMP extends the well-known 5-stage pipeline (baseline scalar processor [18]) to process multi-scalar/vector/matrix instructions from two threads. Thus, SMMP contains hardware state for two threads (two program counters (PCs), two scalar register files, and two matrix register files). It can issue four scalar/vector/matrix operations from two threads each cycle.
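As a rough illustration of this duplicated per-thread state, the following VHDL package sketches how the two PCs and the per-thread scalar/matrix register files could be declared. This is a minimal sketch under assumed register-file sizes; the paper's actual VHDL is not shown, and all names here are hypothetical.

```vhdl
-- Hypothetical sketch of the per-thread architectural state described above;
-- the array sizes are illustrative assumptions, not taken from the paper.
library ieee;
use ieee.std_logic_1164.all;

package smmp_state is
  type word_t is array (natural range <>) of std_logic_vector(31 downto 0);
  subtype reg_file_t is word_t(0 to 31);            -- assumed 32 x 32-bit scalar registers
  type mat_file_t is array (0 to 7) of reg_file_t;  -- assumed 8 vector rows per matrix file

  type thread_state_t is record
    pc  : std_logic_vector(31 downto 0);  -- program counter
    srf : reg_file_t;                     -- scalar register file
    mrf : mat_file_t;                     -- matrix register file
  end record;

  type smmp_state_t is array (1 to 2) of thread_state_t;  -- two hardware threads
end package;
```

Duplicating the state per thread is what lets both threads issue in the same cycle without port conflicts on a shared register file.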
Fig. 1. The block diagram of the SMMP processor (two program counters NPC1/NPC2, scalar register files SRF1 and SRF2, matrix register files MRF1 and MRF2, main, vector/matrix, and load/store control units, L1 data cache, and L2 cache).
SMMP fetches four instructions from the instruction cache (two consecutive instructions per thread) using two PCs. It selects up to four scalar instructions from the two threads for execution after checking dependencies. However, a single vector/matrix instruction from one thread can exploit all execution resources by issuing up to four vector/matrix operations. Since a vector/matrix instruction carries high DLP, the parallel datapaths can usually be satisfied by one thread; otherwise, up to two independent scalar instructions from each of the two threads can be executed together to compensate. The execution of vector/matrix instructions in threads is done in round-robin fashion, so a vector/matrix instruction from one thread can overlap with an instruction from another thread. Moreover, the two threads can write their results in parallel, since each instruction writes its result in its own matrix register file.
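The round-robin sharing of the vector/matrix datapaths between the two threads can be sketched as a simple two-way arbiter. The following VHDL is a minimal sketch assuming a single last-served toggle; the signal names are hypothetical and not taken from the paper.

```vhdl
-- Hypothetical sketch of a two-thread round-robin grant for the
-- vector/matrix datapaths; names are illustrative, not from the paper.
library ieee;
use ieee.std_logic_1164.all;

entity rr_arbiter is
  port (
    clk, rst : in  std_logic;
    req1     : in  std_logic;  -- thread 1 has a ready vector/matrix instruction
    req2     : in  std_logic;  -- thread 2 has a ready vector/matrix instruction
    gnt1     : out std_logic;  -- thread 1 owns the execution datapaths
    gnt2     : out std_logic
  );
end entity;

architecture rtl of rr_arbiter is
  signal last : std_logic := '0';  -- '0': thread 1 served last, '1': thread 2 served last
begin
  process(clk, rst)
  begin
    if rst = '1' then
      gnt1 <= '0'; gnt2 <= '0'; last <= '0';
    elsif rising_edge(clk) then
      gnt1 <= '0'; gnt2 <= '0';
      if req1 = '1' and (req2 = '0' or last = '1') then
        gnt1 <= '1'; last <= '0';  -- serve thread 1 and remember it went last
      elsif req2 = '1' then
        gnt2 <= '1'; last <= '1';  -- serve thread 2
      end if;
    end if;
  end process;
end architecture;
```

When both threads hold ready vector/matrix instructions, the grant alternates between them every cycle, which matches the round-robin behavior described above.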
A. Fetch Stage
The fetch stage of the SMMP pipeline accepts the branch/jump addresses for the two threads and control signals from the decode stage to generate a 32-bit NPC (next PC) and 2x32-bit instructions for each thread. The outputs of the fetch stage are sent to the decode stage through the IF/ID pipeline register. The two 32-bit program counters (PC1 and PC2) are sent to the instruction cache to fetch 2x32-bit instructions per thread when the read enable signal is asserted. Depending on the control signals generated by the main control unit, the input address of each PC is connected either to the next sequential address (PC + 8) or to the branch/jump address.
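To make the PC update rule concrete, here is a minimal VHDL sketch of one thread's PC register under the assumptions above (sequential address PC + 8 for two 32-bit instructions, or the branch/jump target). The entity and port names are hypothetical, and the reset target is assumed to be zero.

```vhdl
-- Hypothetical sketch of the per-thread PC update in the fetch stage;
-- one instance per thread (PC1, PC2) would be instantiated.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity fetch_pc is
  port (
    clk, rst     : in  std_logic;
    branch_taken : in  std_logic;                      -- control signal from decode
    branch_addr  : in  std_logic_vector(31 downto 0);  -- branch/jump target
    pc           : out std_logic_vector(31 downto 0)   -- current fetch address
  );
end entity;

architecture rtl of fetch_pc is
  signal pc_reg : unsigned(31 downto 0) := (others => '0');
begin
  process(clk, rst)
  begin
    if rst = '1' then
      pc_reg <= (others => '0');  -- assumed reset address
    elsif rising_edge(clk) then
      if branch_taken = '1' then
        pc_reg <= unsigned(branch_addr);
      else
        pc_reg <= pc_reg + 8;     -- two 32-bit instructions fetched per thread
      end if;
    end if;
  end process;
  pc <= std_logic_vector(pc_reg);
end architecture;
```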
B. Decode Stage

The decode stage receives the fetched instructions of the two threads through the IF/ID pipeline register, decodes them, checks dependencies, and reads the source operands from the scalar and matrix register files of each thread. The resulting control signals and the four pairs of operands are passed to the execute stage through the ID/EX pipeline register.
C. Execute Stage
The execute stage of SMMP accepts the control signals and operands of each thread from the decode stage through the ID/EX pipeline register. Four parallel execution units operate on the four pairs of operands prepared in the decode stage. They can perform addition, subtraction, multiplication, MAC (multiply-accumulate), AND, OR, NOR, XOR, shift left/right logical, and shift right arithmetic. For a scalar load/store operation, the 1st or 3rd execution unit generates the memory address for thread 1 or 2 by adding its source operand to the immediate value. When the i-th operation is register-register, the i-th execution unit performs the operation specified by the control unit on the i-th pair of operands fed from the scalar/matrix register files through the ID/EX pipeline register, where 1 ≤ i ≤ 4.
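The operation repertoire of one execution unit can be sketched as a VHDL case statement over an operation-select code. The encoding below is an assumption for illustration (the paper does not specify it), and the MAC form shown adds the running sum passed in on acc.

```vhdl
-- Hypothetical sketch of one of SMMP's four execution units;
-- the 4-bit opcode encoding is assumed, not taken from the paper.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity exec_unit is
  port (
    op   : in  std_logic_vector(3 downto 0);   -- operation select from control unit
    a, b : in  std_logic_vector(31 downto 0);  -- one operand pair from decode
    acc  : in  std_logic_vector(31 downto 0);  -- running sum for MAC
    y    : out std_logic_vector(31 downto 0)
  );
end entity;

architecture rtl of exec_unit is
  signal prod : unsigned(63 downto 0);
begin
  prod <= unsigned(a) * unsigned(b);
  process(op, a, b, acc, prod)
  begin
    case op is
      when "0000" => y <= std_logic_vector(unsigned(a) + unsigned(b));          -- add
      when "0001" => y <= std_logic_vector(unsigned(a) - unsigned(b));          -- subtract
      when "0010" => y <= std_logic_vector(prod(31 downto 0));                  -- multiply
      when "0011" => y <= std_logic_vector(unsigned(acc) + prod(31 downto 0));  -- MAC
      when "0100" => y <= a and b;
      when "0101" => y <= a or b;
      when "0110" => y <= not (a or b);                                         -- NOR
      when "0111" => y <= a xor b;
      when "1000" => y <= std_logic_vector(shift_left(unsigned(a),
                            to_integer(unsigned(b(4 downto 0)))));              -- shift left logical
      when "1001" => y <= std_logic_vector(shift_right(unsigned(a),
                            to_integer(unsigned(b(4 downto 0)))));              -- shift right logical
      when others => y <= std_logic_vector(shift_right(signed(a),
                            to_integer(unsigned(b(4 downto 0)))));              -- shift right arithmetic
    end case;
  end process;
end architecture;
```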
Fig. 3. Speedup of SMMP over the 5-stage pipeline on scalar-vector, vector-vector, matrix-vector, and matrix-matrix kernels (speedup versus vector length; curves for SMP (1L/S+1X), SMP (4L/S+1X), SuperSMP (4L/S+4X), and SMMP).
Fig. 3. Speedup of SMMP over the 5-stage pipeline on scalar-vector, vector-vector, matrix-vector, and matrix-matrix kernels (continued; the matrix kernels plot speedup versus matrix size).
The speedup of a kernel is limited when multiple vector load instructions must be completely finished in the decode stage before the dependent arithmetic instruction can proceed. For example, in the kernel "Add" there are two vector load instructions followed by a vector addition instruction in each thread; the four load instructions are fetched from the instruction cache and issued to the decode stage in parallel. Thus, as explained previously, they are executed in round-robin fashion, where only the last vector load instruction among them can be executed in parallel with the following vector addition instruction from the other thread. Therefore, the speedups of Norm2 and SVmul, which have only one vector load instruction before the vector arithmetic instruction, are higher.

Amdahl's law [19] governs the speedup of using parallel processing on a problem versus using sequential processing, and it can be used to estimate the speedup of applications based on scalar/vector/matrix operations. It states that the performance improvement to be gained from using some faster mode of execution is limited by the fraction of the time the faster mode can be used. The main advantage of our proposed simple matrix processor architectures is increasing the fraction of the faster mode by processing scalar/vector/matrix instructions in parallel on multiple execution units. Consider an application consisting of scalar code, which cannot be processed in parallel using vector/matrix instructions, and vector/matrix code, which spends a V/M fraction of time on the baseline scalar processor. The vector code can be decomposed further into fractions V1, V2, ..., Vm that can be sped up using vector instructions by factors of P1, P2, ..., Pm, respectively, where V = V1 + V2 + ... + Vm. Moreover, the matrix code can be decomposed further into fractions M1, M2, ..., Mn that can be sped up using matrix instructions by factors of Q1, Q2, ..., Qn, respectively, where M = M1 + M2 + ... + Mn. Thus, Amdahl's law predicts an overall speedup equal to

\[
\text{Speedup} = \frac{1}{(1 - V - M) + \sum_{i=1}^{m} V_i / P_i + \sum_{j=1}^{n} M_j / Q_j}.
\]

Table 4. Speedups of SMP/SuperSMP/SMMP over the baseline scalar processor.

Kernel                        | SMP (1L/S+1X), 64 MHz | SMP (4L/S+1X), 81 MHz | SuperSMP (4L/S+4X), 82 MHz | SMMP, 65 MHz
Apply Givens rotation         | 1.57 | 2.01 | 4.32  | 4.9
SAXPY                         | 2.09 | 3.05 | 4.92  | 6.09
Vector addition               | 2.53 | 4.08 | 5.48  | 6.98
Vector scaling                | 2.76 | 4.21 | 5.92  | 8.2
Set up Givens rotation        | 2.78 | 3.46 | 6.91  | 8.25
Dot-product                   | 3.31 | 4.80 | 7.1   | 8.72
Matrix-vector multiplication  | 3.76 | 4.57 | 7.38  | 9.36
Euclidean length              | 4    | 5.16 | 8.41  | 11.84
Matrix-matrix multiplication  | 6.33 | 7.44 | 18.23 | 21.57
Average                       | 3.24 | 4.31 | 7.63  | 9.55

Fig. 4. Speedup of the application as the percentage of parallel code increases (50% to 100% vector/matrix operations; curves for SMP (1L/S+1X), SMP (4L/S+1X), SuperSMP (4L/S+4X), and SMMP).

Since the average speedups on the vector/matrix kernels shown in Table 4 are 3.24, 4.31, 7.63, and 9.55 on SMP (1L/S+1X), SMP (4L/S+1X), SuperSMP, and SMMP, respectively, Fig. 4 shows that the overall speedup on data-parallel applications ranges over 1.53-3.24, 1.62-4.31, 1.77-7.63, and 1.81-9.55 as the percentage of parallel code increases from 50% to 100%. Moreover, Fig. 5 shows the average speedups and the complexities over the baseline scalar processor for SMP (1L/S+1X), SMP (4L/S+1X), SuperSMP (4L/S+4X), and SMMP. The speedups over complexities are about 1.62, 1.54, 2.02, and 1.68 on SMP (1L/S+1X), SMP (4L/S+1X), SuperSMP (4L/S+4X), and SMMP, respectively.
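As a quick sanity check of the lower end of these ranges (an illustration, not a computation from the paper), collapsing all vector/matrix code into a single fraction sped up by SMMP's average factor of 9.55, 50% parallel code gives

\[
\text{Speedup} = \frac{1}{0.5 + 0.5 / 9.55} \approx 1.81,
\]

which matches the reported lower bound of 1.81 for SMMP; at 100% parallel code the formula reduces to the full factor of 9.55.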
Fig. 5. Average speedups and complexities over the baseline processor.

VI. CONCLUSION AND FUTURE WORK

This paper proposes a simultaneous multithreaded matrix processor called SMMP to improve the performance of data-parallel applications by exploiting ILP, DLP, and TLP and by using a multi-level ISA to express DLP to the processor. SMMP modifies the well-known 5-stage pipeline to execute multi-scalar/vector/matrix instructions from multiple threads on unified parallel execution datapaths. SMMP can issue four scalar instructions from two threads each cycle or four vector/matrix operations from one thread, where the execution of vector/matrix instructions in threads is done in round-robin fashion. The implementation of our proposed SMMP using VHDL, targeting the FPGA Virtex-6 XC6VLX550T-2FF1760 device, shows that the hardware complexity of SMMP is 5.68 times higher than the baseline scalar processor. However, SMMP achieves an average speedup over the baseline scalar processor of 9.55 on some BLAS kernels. Finally, the average speedup over the complexity of SMMP is 1.68.

Although SMMP, which can execute a mixture of scalar, vector, and matrix instructions, has been proposed and evaluated, a lot of work remains to be done. The following are some key areas for future research:

• Improving the performance of SMMP by assigning the idle execution units of one thread to the other thread.
• Designing a multi-core processor based on our proposed SMMP architecture to exploit all forms of computer parallelism and to take advantage of multi-core processors.
• Improving the overall performance by using out-of-order issue for issuing instructions to the parallel execution units.
• Developing a compiler for our proposed SMMP by extending vectorization techniques, which are a mature research area for supercomputers.
• Evaluating the performance of the matrix processor architectures on whole applications instead of kernels.

REFERENCES

[1] Y. Lee, R. Avizienis, A. Bishara, R. Xia, D. Lockhart, C. Batten, and K. Asanović, "Exploring the Tradeoffs between Programmability and Efficiency in Data-parallel Accelerators," ACM Transactions on Computer Systems (TOCS), Vol. 31, No. 3, August 2013.
[2] J. Hennessy and D. Patterson, "Computer Architecture: A Quantitative Approach," 5th Edition, Morgan Kaufmann, September 2011.
[3] K. Asanovic, "Vector Microprocessors," Ph.D. thesis, Computer Science Division, University of California at Berkeley, 1998.
[4] S. Eggers et al., "Simultaneous Multithreading: A Platform for Next-Generation Processors," IEEE Micro, pp. 12-19, September/October 1997.
[5] J. Smith and G. Sohi, "The Microarchitecture of Superscalar Processors," Proceedings of the IEEE, Vol. 83, No. 12, pp. 1609-1624, 1995.
[6] J. Fisher, "VLIW Architectures and the ELI-512," Proc. 10th International Symposium on Computer Architecture, Stockholm, Sweden, pp. 140-150, June 1983.
[7] C. Batten, "Simplified Vector-thread Architectures for Flexible and Efficient Data-parallel Accelerators," Ph.D. thesis, Massachusetts Institute of Technology, 2010.
[8] J. Shalf et al., "The MANYCORE Revolution: Will HPC Lead or Follow?," SciDAC Review, No. 14, pp. 40-49, 2009.
[9] N. Slingerland and A. Smith, "Multimedia Extensions for General-Purpose Microprocessors: A Survey," Microprocessors and Microsystems, pp. 225-246, November 2004.
[10] F. Quintana, J. Corbal, R. Espasa, and M. Valero, "Adding a Vector Unit to a Superscalar Processor," Proc. ACM International Conference on Supercomputing, pp. 1-10, June 1999.
[11] R. Espasa and M. Valero, "Simultaneous Multithreaded Vector Architecture: Merging ILP and DLP for High Performance," Proc. 4th International Conference on High-Performance Computing, pp. 350-357, December 1997.
[12] S. Rivoire, R. Schultz, T. Okuda, and C. Kozyrakis, "Vector Lane Threading," Proc. International Conference on Parallel Processing (ICPP '06), 2006.
[13] R. Krashinsky, "Vector-thread Architecture and Implementation," Ph.D. thesis, Massachusetts Institute of Technology, 2007.
[14] D. Kumar, R. Behera, and K. Pandey, "Concept of a Supervector Processor: A Vector Approach to Superscalar Processor Design and Performance Analysis," International Journal of Engineering Research, ISSN: 2319-6890, Vol. 2, No. 3, pp. 224-227, 2013.
[15] M. Soliman and A. Al-Junaid, "A Shared Matrix Unit for a Chip Multi-core Processor," Journal of Parallel and Distributed Computing (JPDC), Elsevier, ISSN: 0743-7315, Vol. 73, No. 8, pp. 1146-1156, August 2013.
[16] M. Soliman and E. Elsayed, "Design and FPGA Implementation of a Simplified Matrix Processor," Proc. Seventh IEEE International Computer Engineering Conference (ICENCO), Faculty of Engineering, Cairo University, Egypt, pp. 31-36, December 2011.
[17] M. Soliman and E. Elsayed, "FPGA Implementation and Evaluation of a Simple Processor for Multi-scalar/Vector/Matrix Instructions," Proc. IEEE 2nd International Conference on Engineering and Technology (ICET 2014), German University in Cairo, Egypt, April 2014.
[18] D. Patterson and J. Hennessy, "Computer Organization and Design: The Hardware/Software Interface," 5th Edition, Morgan Kaufmann, San Francisco, CA, September 2013.
[19] G. Amdahl, "Validity of the Single-Processor Approach to Achieving Large Scale Computing Capabilities," Proc. AFIPS 1967 Spring Joint Computer Conference, Atlantic City, New Jersey, AFIPS Press, Vol. 30, pp. 483-485, April 1967.