Exploiting ILP, DLP, TLP, and MPI to Accelerate Matrix Multiplication on Xeon Processors

Mostafa I. Soliman
Computer Science and Information Department, Community College, Taibah University, Al-Madinah Al-Munawwarah, Saudi Arabia
[email protected] / [email protected]

Fatma S. Ahmed
Computers and Systems Section, Electrical Engineering Dept., Faculty of Engineering, Aswan University, Aswan 81542, Egypt
[email protected]
Abstract—Matrix multiplication is one of the most important kernels used in dense linear algebra codes. It is a computationally intensive kernel that demands exploiting all available forms of parallelism to improve its performance. In this paper, ILP, DLP, TLP, and MPI are exploited to accelerate the execution of matrix multiplication on a cluster of computers with Intel Xeon processors. In addition, Strassen's algorithm, which reduces the arithmetic operations of matrix multiplication from O(n^3) to O(n^2.807), is implemented using multi-threading, SIMD, blocking, and MPI techniques. Our results show that the average speedups of the traditional matrix multiplication algorithm on large matrices (from 6000×6000 to 9000×9000) are 14.9, 24.1, 24, 22.8, and 20.7 on 2, 4, 6, 8, and 10 computers, respectively. For Strassen's algorithm, the average speedups are 3.9, 6.8, 6.8, and 10.3 on 2, 4, 6, and 7 computers, respectively, using the same parallel processing techniques.
Keywords—Matrix multiplication, Strassen's algorithm, multi-threading, SIMD, blocking, MPI
I. INTRODUCTION
Taking advantage of parallelism is one of the most important methods for improving performance [1]. All processors since about 1985 have used pipelining to improve performance by overlapping the execution of instructions and reducing the total time needed to complete an instruction sequence. Beyond simple pipelining, there are three major forms of parallelism: instruction-level parallelism (ILP), thread-level parallelism (TLP), and data-level parallelism (DLP), which are not mutually exclusive [2]. Whereas the compiler and hardware conspire to exploit ILP implicitly without the programmer's attention, TLP and DLP are explicitly parallel, requiring the programmer to write parallel code to gain performance. A common approach to exploiting TLP is to decompose each parallel section into a set of tasks; at runtime, an underlying software layer distributes these tasks to different threads (see [3-5] for more detail).
Exploiting ILP was the primary focus of processor designs for about 20 years, starting in the mid-1980s. For the first 15 years, designers applied a progression of successively more sophisticated schemes for pipelining, multiple issue, dynamic scheduling, and speculation. Since 2000, designers have focused primarily on optimizing designs or trying to achieve higher clock rates without increasing the issue rate, which indicates that the era of advances in exploiting ILP appears to be coming to an end [1].
On the other hand, DLP is the cheapest form of parallelism, since it needs only the fetching and decoding of a single instruction to describe a whole array of parallel operations. The use of a single instruction to process multiple data (SIMD) reduces the control logic complexity and allows compact parallel datapath structures. Therefore, most modern general-purpose processors have included extensions to their architectures to improve the performance of multimedia applications by taking advantage of SIMD. Examples include Intel MMX/SSE/SSE2/SSE3/SSE4/AVX for IA-32/Intel-64 [6], Sun VIS1/VIS2/VIS3 for SPARC [7], Hewlett-Packard MAX1/MAX2 for PA-RISC [8], Silicon Graphics MDMX for MIPS [9], Motorola AltiVec for PowerPC [10], etc.
In this paper, the three forms of parallelism (ILP, DLP, and TLP) as well as the well-known MPI (message passing interface) are exploited to accelerate the multiplication of two matrices. Matrix multiplication is a computationally intensive kernel that involves O(n^3) floating-point operations (FLOPs) while creating only O(n^2) data movement, where n×n is the size of the matrices involved [11]. Thus, it permits efficient reuse of data residing in cache memory and creates what is often called the surface-to-volume effect for the ratio of data movement (load/store operations) to computation (arithmetic operations) [12]. The arithmetic complexity of matrix multiplication can be reduced from O(n^3) to O(n^2.807) by applying Strassen's algorithm [11]. However, although Strassen's algorithm achieves this lower complexity through a divide-and-conquer approach, it has two drawbacks. The first is the recursion overhead, which reduces performance when the division is applied to small submatrices; this overhead can be limited by stopping the recursion early and performing a conventional matrix multiplication on submatrices that are below the recursion truncation point. The second is that odd-sized matrices must be handled efficiently in the division step. Three methods can address this drawback: (1) embedding the matrix inside a larger one (static padding), (2) decomposing into submatrices that overlap by a single row or column (dynamic overlap), or (3) performing special-case computation for the boundary cases (dynamic peeling); see [13] for more details.
Our implementation handles these drawbacks by selecting an appropriate recursion truncation point and applying dynamic peeling to overcome the odd-sized matrices problem. In addition, several parallel processing techniques are applied, such as multi-threading, SIMD, blocking, and MPI, to achieve the best performance by exploiting parallelism on a cluster of computers. The target system is a cluster of Siemens computers, each with a quad-core Intel Xeon processor running at 2.33 GHz, 128 KB of L1 data cache per core, a shared 12 MB L2 cache, and 4 GB of main memory.
This paper is organized as follows. Section II describes Strassen's algorithm for multiplying two matrices. The parallel implementations of the traditional and Strassen algorithms are presented in Section III, and their performance is evaluated. Finally, Section IV concludes this paper.
II. STRASSEN'S ALGORITHM OF MATRIX MULTIPLICATION
Matrix multiplication combines two matrices to produce a third, where the number of columns of the first matrix must equal the number of rows of the second: Cn×m = An×p × Bp×m. It is known that there are six variants (ijk, ikj, jik, jki, kij, and kji) for multiplying two n×n matrices, corresponding to the orderings of the triply nested loops over i, j, and k [14]. Each of these six variants performs the same number of FLOPs (2n^3); however, their memory access patterns differ. For example, the following variant, called ijk, is the naive technique for a matrix product:
for i = 1 to n step 1
  for j = 1 to n step 1
    for k = 1 to n step 1
      C[i, j] = C[i, j] + A[i, k] × B[k, j]
By interchanging the order of the i, j, and k loops, the remaining five variants are obtained. The best variant for the conventional matrix product is ikj, because all the memory accesses are done with a unit stride. The stride of an array refers to the way in which its elements are referenced; it is equal to the difference between the addresses of successive elements divided by the element size. The use of unit stride leads to higher performance because all the elements in a loaded cache line are used [2].
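As an illustration only, the following is a minimal C sketch of the ikj variant (not the authors' code); the row-major layout, the function name, and the assumption that C is zero-initialized by the caller are choices made for this example.

/* ikj variant: the innermost loop walks one row of B and one row of C with
 * unit stride, so every element of a loaded cache line is used.
 * C is assumed to be zero-initialized by the caller. */
void matmul_ikj(int n, const float *A, const float *B, float *C)
{
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            float a = A[i * n + k];                 /* reused across the whole j loop */
            for (int j = 0; j < n; j++)
                C[i * n + j] += a * B[k * n + j];   /* unit-stride accesses to B and C */
        }
}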
The traditional method of multiplying two n×n matrices takes O(n^3) operations. However, Strassen's algorithm is a divide-and-conquer algorithm that is asymptotically faster, as it reduces the number of arithmetic operations to O(n^(log2 7)) ≈ O(n^2.807). The traditional block method of multiplying two n×n matrices takes 8 multiplications and 4 additions on submatrices; Strassen showed how two n×n matrices can be multiplied using only 7 multiplications and 18 additions on submatrices. Consider the formation of the matrix product C = A×B, where A, B, and C are n×n matrices, and let A, B, and C be divided into half-sized blocks as follows:
[C11 C12; C21 C22] = [A11 A12; A21 A22] × [B11 B12; B21 B22]
The computations of the intermediate matrices M1 to M7 are:
M1 = (A11 + A22) (B11 + B22)
M2 = (A21 + A22) B11
M3 = A11 (B12 – B22)
M4 = A22 (B21 – B11)
M5 = (A11 + A12) B22
M6 = (A21 – A11) (B11 + B12)
M7 = (A12 – A22) (B21 + B22)
Finally, the blocks of matrix C are constructed from the precomputed intermediate matrices as follows:
C11 = M1 + M4 – M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 – M2 + M3 + M6
Note that Strassen's algorithm can also be applied recursively to compute the intermediate matrices M1 to M7.
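To make the recursion and the truncation point concrete, the following is a hedged C sketch of Strassen's algorithm, not the authors' implementation: it assumes row-major storage, a power-of-two n, and a hypothetical CUTOFF constant below which the conventional ikj kernel above is called, so the dynamic-peeling treatment of odd sizes is omitted.

#include <stdlib.h>
#include <string.h>

#define CUTOFF 128   /* hypothetical recursion truncation point */

void matmul_ikj(int n, const float *A, const float *B, float *C); /* kernel above */

static void add(int n, const float *X, const float *Y, float *Z) {
    for (int i = 0; i < n * n; i++) Z[i] = X[i] + Y[i];
}
static void sub(int n, const float *X, const float *Y, float *Z) {
    for (int i = 0; i < n * n; i++) Z[i] = X[i] - Y[i];
}
/* Copy the h-by-h quadrant (r, c) out of / into the n-by-n row-major matrix. */
static void get_quad(int n, const float *X, int r, int c, float *Q) {
    int h = n / 2;
    for (int i = 0; i < h; i++)
        memcpy(Q + i * h, X + (r * h + i) * n + c * h, h * sizeof(float));
}
static void put_quad(int n, float *X, int r, int c, const float *Q) {
    int h = n / 2;
    for (int i = 0; i < h; i++)
        memcpy(X + (r * h + i) * n + c * h, Q + i * h, h * sizeof(float));
}

void strassen(int n, const float *A, const float *B, float *C)
{
    if (n <= CUTOFF) {                        /* stop the recursion early */
        memset(C, 0, (size_t)n * n * sizeof(float));
        matmul_ikj(n, A, B, C);
        return;
    }
    int h = n / 2;
    size_t q = (size_t)h * h * sizeof(float);
    float *A11 = malloc(q), *A12 = malloc(q), *A21 = malloc(q), *A22 = malloc(q);
    float *B11 = malloc(q), *B12 = malloc(q), *B21 = malloc(q), *B22 = malloc(q);
    float *M[7], *T1 = malloc(q), *T2 = malloc(q), *R = malloc(q);
    for (int i = 0; i < 7; i++) M[i] = malloc(q);

    get_quad(n, A, 0, 0, A11); get_quad(n, A, 0, 1, A12);
    get_quad(n, A, 1, 0, A21); get_quad(n, A, 1, 1, A22);
    get_quad(n, B, 0, 0, B11); get_quad(n, B, 0, 1, B12);
    get_quad(n, B, 1, 0, B21); get_quad(n, B, 1, 1, B22);

    add(h, A11, A22, T1); add(h, B11, B22, T2); strassen(h, T1, T2, M[0]);  /* M1 */
    add(h, A21, A22, T1);                       strassen(h, T1, B11, M[1]); /* M2 */
    sub(h, B12, B22, T2);                       strassen(h, A11, T2, M[2]); /* M3 */
    sub(h, B21, B11, T2);                       strassen(h, A22, T2, M[3]); /* M4 */
    add(h, A11, A12, T1);                       strassen(h, T1, B22, M[4]); /* M5 */
    sub(h, A21, A11, T1); add(h, B11, B12, T2); strassen(h, T1, T2, M[5]);  /* M6 */
    sub(h, A12, A22, T1); add(h, B21, B22, T2); strassen(h, T1, T2, M[6]);  /* M7 */

    add(h, M[0], M[3], R); sub(h, R, M[4], R); add(h, R, M[6], R); put_quad(n, C, 0, 0, R); /* C11 */
    add(h, M[2], M[4], R);                                         put_quad(n, C, 0, 1, R); /* C12 */
    add(h, M[1], M[3], R);                                         put_quad(n, C, 1, 0, R); /* C21 */
    sub(h, M[0], M[1], R); add(h, R, M[2], R); add(h, R, M[5], R); put_quad(n, C, 1, 1, R); /* C22 */

    free(A11); free(A12); free(A21); free(A22);
    free(B11); free(B12); free(B21); free(B22);
    free(T1); free(T2); free(R);
    for (int i = 0; i < 7; i++) free(M[i]);
}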
III. PARALLEL IMPLEMENTATION OF MATRIX MULTIPLICATION
Multi-threading is one of the most important techniques for parallel processing. On multi-core processors, it allows multiple tasks to run concurrently; see [5] for more details. Each thread runs on a core and acts like its own individual program, except that all the threads work in the same memory space (shared memory). This makes communication between threads fairly simple. Some important concepts should be taken into consideration when working with threads: synchronization, resource limitations, load balancing, and scalability [3]. Another technique for exploiting parallelism is SIMD, a form of DLP in which multiple elements are processed simultaneously by a single instruction. The Intel streaming SIMD extensions (SSE, SSE2, SSE3, and SSSE3 [6]) are SIMD instruction sets designed by Intel and introduced in the Pentium III, Pentium 4, and EM64T series processors. These instruction sets contain new instructions for processing packed data stored in 128-bit registers known as XMM. Processors with Intel EM64T, such as the Xeon, extend the SIMD register set from 8 to 16 registers (XMM0-XMM15) [6]. Intel Xeon processors support 8/16/32/64/128-bit integer data types and 32/64-bit single/double-precision floating-point data types. The last technique used in this paper is MPI (message passing interface), a message passing library standard based on the consensus of the MPI Forum [15]. The goal of MPI is to establish a portable, efficient, and flexible standard for message passing that will be widely used for writing message passing programs. MPI gives the ability to distribute a parallel task over many slave computers by using point-to-point or collective communication routines (see [15] for more details). After each computer finishes its task, it returns the result to the root computer.
Figure 1. Partitioning matrices among four threads
Figure 2. Dividing matrices into blocks
To further improve the performance of the traditional matrix multiplication, more than one computer connected through a network is used. Figure 4 shows, as an example, how the matrix multiplication is executed using MPI on four computers. First, the root node partitions the rows of input matrix A into four parts and then sends one part to each process on the other nodes. Matrix B, however, is sent to all processes.
Figure 5 shows the performance of the traditional matrix multiplication using MPI, multi-threading, SIMD, and blocking techniques on 2, 4, 6, 8, and 10 computers. On a large matrix size (4800×4800), the use of MPI alone results in speedups of 1.97, 3.73, 5.12, 6.2, and 6.85 on 2, 4, 6, 8, and 10 computers, respectively. Adding multi-threading to MPI improves the speedups to 3.14, 5.6, 7.52, 8.97, and 8.78 on 2, 4, 6, 8, and 10 computers, respectively; four threads are created per computer and assigned to the four cores. Note that larger matrices are needed to show the effect of multi-threading as the number of computers increases. The use of MPI and SIMD techniques improves the speedups to 2.2, 4.2, 5.76, 6.83, and 7.48 on 2, 4, 6, 8, and 10 computers, respectively; four single-precision elements are processed by a single SIMD instruction. Thus, more cache memory is needed to show the effect of SIMD, otherwise the rate of cache misses increases. To reduce the rate of cache misses, the data loaded into cache memory should be reused. The combination of MPI, multi-threading, and SIMD techniques improves the speedups to 3.2, 6.1, 7.5, 9, and 9 on 2, 4, 6, 8, and 10 computers, respectively. The use of the blocking technique besides MPI, multi-threading, and SIMD improves the overall performance further. The blocking technique reduces the rate of cache misses by processing blocks of data: since matrix multiplication involves O(n^3) FLOPs while creating only O(n^2) data movement, O(n) data are reused after being loaded into cache.
A. Performance of the Traditional Matrix Multiplication
On a single computer, Figure 3 shows that the performance of the traditional matrix multiplication varies from 0.62 to 0.49 floating-point operations per clock cycle (FLOPs/cc). The use of the multi-threading technique with four parallel threads improves the performance to vary from 2.4 to 0.84 FLOPs/cc as the size of the input matrices grows from 400×400 to 4800×4800 in steps of 400. The use of the SIMD technique alone, processing four 32-bit data elements with a single instruction, improves the performance to vary from 0.83 to 0.53 FLOPs/cc. Combining the multi-threading and SIMD techniques improves the performance to vary from 7.7 to 0.83 FLOPs/cc. Note that when the matrices fit in the cache memory the performance is high; otherwise, the performance is degraded due to the increased rate of cache misses. Exploiting the memory hierarchy with the blocking technique, besides multi-threading and SIMD, improves the performance to vary from 5.9 to 4.7 FLOPs/cc. This improvement on large matrices is due to reusing the data loaded into cache memory, which reduces the rate of cache misses.
After each process receives its data, it performs the matrix multiplication on its part and sends the results back to the root process. The root process receives the results and stores them in matrix C.
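A hedged MPI sketch of this row-partitioned scheme is shown below. It is not the authors' code: it uses collective routines (MPI_Bcast, MPI_Scatter, MPI_Gather) rather than the explicit point-to-point sends of Figure 4, the matrix size n is an arbitrary example value, and n is assumed to be divisible by the number of processes.

#include <mpi.h>
#include <stdlib.h>
#include <string.h>

/* Multiply a 'rows' x n strip of A by the full n x n matrix B (ikj order). */
static void strip_matmul(int rows, int n, const float *A, const float *B, float *C)
{
    memset(C, 0, (size_t)rows * n * sizeof(float));
    for (int i = 0; i < rows; i++)
        for (int k = 0; k < n; k++) {
            float a = A[i * n + k];
            for (int j = 0; j < n; j++)
                C[i * n + j] += a * B[k * n + j];
        }
}

int main(int argc, char **argv)
{
    int rank, size, n = 4800;                 /* n is an assumption for the example */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rows = n / size;                      /* assumes n is divisible by size */
    float *A = NULL, *C = NULL;
    float *B      = malloc((size_t)n * n * sizeof(float));
    float *A_part = malloc((size_t)rows * n * sizeof(float));
    float *C_part = malloc((size_t)rows * n * sizeof(float));

    if (rank == 0) {                          /* root owns the full A and C */
        A = malloc((size_t)n * n * sizeof(float));
        C = malloc((size_t)n * n * sizeof(float));
        /* initialize A and B on the root here */
    }

    /* Root broadcasts B to everyone and scatters row blocks of A. */
    MPI_Bcast(B, n * n, MPI_FLOAT, 0, MPI_COMM_WORLD);
    MPI_Scatter(A, rows * n, MPI_FLOAT, A_part, rows * n, MPI_FLOAT, 0, MPI_COMM_WORLD);

    strip_matmul(rows, n, A_part, B, C_part); /* local computation */

    /* Each process returns its block of C to the root. */
    MPI_Gather(C_part, rows * n, MPI_FLOAT, C, rows * n, MPI_FLOAT, 0, MPI_COMM_WORLD);

    free(B); free(A_part); free(C_part);
    if (rank == 0) { free(A); free(C); }
    MPI_Finalize();
    return 0;
}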
Secondly, SIMD is applied to perform the same operation on four floating-point numbers concurrently. Moreover, to exploit memory hierarchy, the blocking technique is used to reduce load/store operations from main memory by reusing the data hold in the cache memory many times. As shown in Figure 2, the input matrices are partitioned into blocks. The block size depends on the size of the cache memory. Thus, all calculations are done based on matrix operations instead of SIMD operations. On other words, the loaded b×b blocks are reused b times before leaving the cache memory in the worst case. The parameter b should be large enough to avoid many load/store operations but small enough to fit the required blocks in the cache memory.
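The following is a simplified C sketch of the combined blocking and SSE approach; it is only a sketch, not the tuned code used on the cluster. The block size, the assumption that n is a multiple of both the block size and four, and the assumption that C starts zeroed are illustrative choices.

#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_loadu_ps, ... */

#define BLK 64           /* illustrative block size, tuned in practice to the cache size */

/* Blocked kernel: C += A*B for n a multiple of BLK (and BLK a multiple of 4).
 * The b-by-b blocks of B and C stay in cache while they are being reused.
 * C is assumed to be zero-initialized by the caller. */
void matmul_blocked_sse(int n, const float *A, const float *B, float *C)
{
    for (int ii = 0; ii < n; ii += BLK)
        for (int kk = 0; kk < n; kk += BLK)
            for (int jj = 0; jj < n; jj += BLK)
                for (int i = ii; i < ii + BLK; i++)
                    for (int k = kk; k < kk + BLK; k++) {
                        __m128 a = _mm_set1_ps(A[i * n + k]);  /* broadcast A[i][k] */
                        for (int j = jj; j < jj + BLK; j += 4) {
                            __m128 b = _mm_loadu_ps(&B[k * n + j]);
                            __m128 c = _mm_loadu_ps(&C[i * n + j]);
                            c = _mm_add_ps(c, _mm_mul_ps(a, b));
                            _mm_storeu_ps(&C[i * n + j], c);   /* 4 floats per SSE op */
                        }
                    }
}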
Figure 3. The performance of the traditional matrix multiplication using multi-threading, SIMD, and blocking techniques
Firstly, the multi-threading technique is applied to the traditional algorithm for matrix multiplication (Cn×n = An×n × Bn×n) by creating four threads (one thread per core). The rows of matrix A are partitioned into four parts, and each part is assigned to a thread that computes the product of this part and matrix B and stores the results in the corresponding part of matrix C (see Figure 1).
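The sketch below illustrates this row partitioning with four threads. POSIX threads are used here as an assumption (the text does not specify the threading library beyond the reference to [3]), and C is assumed to be zero-initialized.

#include <pthread.h>

#define NTHREADS 4                    /* one thread per core of the quad-core Xeon */

struct task { int n, row_begin, row_end; const float *A, *B; float *C; };

/* Each thread multiplies its strip of rows of A by the whole matrix B. */
static void *worker(void *arg)
{
    struct task *t = arg;
    int n = t->n;
    for (int i = t->row_begin; i < t->row_end; i++)
        for (int k = 0; k < n; k++) {
            float a = t->A[i * n + k];
            for (int j = 0; j < n; j++)
                t->C[i * n + j] += a * t->B[k * n + j];
        }
    return NULL;
}

/* Partition the rows of A (and C) among four threads, as in Figure 1. */
void matmul_threads(int n, const float *A, const float *B, float *C)
{
    pthread_t tid[NTHREADS];
    struct task tasks[NTHREADS];
    for (int t = 0; t < NTHREADS; t++) {
        tasks[t] = (struct task){ n, t * n / NTHREADS, (t + 1) * n / NTHREADS, A, B, C };
        pthread_create(&tid[t], NULL, worker, &tasks[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
}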
Figure 4. The execution of matrix multiplication using MPI on four computers connected through a network
(a) Performance of serial implementation using MPI
(b) Performance of MPI and multi-threading techniques
(c) Performance of MPI and SIMD techniques
(d) Performance of MPI, multi-threading, and SIMD techniques
(e) Performance of MPI, multi-threading, SIMD, and blocking techniques on small matrix sizes
(f) Performance of MPI, multi-threading, SIMD, and blocking techniques on large matrix sizes
Figure 5. The performance of the traditional matrix multiplication using MPI, multi-threading, SIMD, and blocking techniques
On small matrix sizes such as 1000×1000, speedups of 8.3, 9.32, 6.25, 5.4, and 4.5 are achieved on 2, 4, 6, 8, and 10 computers, respectively, while on large matrices (from 6000×6000 to 9000×9000), average speedups of 14.9, 24.1, 24, 22.8, and 20.7 are achieved on 2, 4, 6, 8, and 10 computers, respectively.
B. Performance of Strassen's Algorithm
For Strassen's algorithm of matrix multiplication, the same parallel processing techniques are applied to improve its performance. Figure 6 shows the application of MPI on seven computers to compute matrix multiplication based on Strassen's algorithm. The root node sends to each slave the data required for computing one of the intermediate matrices M1 to M6; during the execution of the slaves, the root also computes M7.
After each slave finishes its task, it sends its results back to the root. In our multi-threaded implementation, four threads Thrd1, Thrd2, Thrd3, and Thrd4 are created, which first execute the multiplications for the intermediate matrices M1, M2, M3, and M4, respectively (see Section II). After calculating M1, M2, and M3, threads Thrd1, Thrd2, and Thrd3 are reused for calculating M5, M6, and M7, respectively. Note that synchronization between threads is required, since the threads are also used for calculating the blocks of matrix C. As shown in Figure 7, Thrd1, Thrd3, and Thrd4 are used for calculating M1, M4, M5, and M7 for C11; Thrd1 and Thrd3 for M3 and M5 for C12; Thrd2 and Thrd4 for M2 and M4 for C21; and Thrd1, Thrd2, and Thrd3 for M1, M2, M3, and M6 for C22. The required synchronization between threads is implemented using set/reset events (see [3] for more details).
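A skeleton of this thread/task assignment is sketched below, assuming POSIX threads. It is a simplification of the scheme just described: a single barrier stands in for the set/reset events of the original implementation, the combination step is reduced to one thread per block of C, and compute_M() and combine_C() are hypothetical placeholders for the block routines sketched in Section II.

#include <pthread.h>

/* Thread/task assignment following the text:
 * Thrd1 -> M1 then M5, Thrd2 -> M2 then M6, Thrd3 -> M3 then M7, Thrd4 -> M4. */
static const int schedule[4][2] = {
    { 1, 5 },   /* Thrd1 */
    { 2, 6 },   /* Thrd2 */
    { 3, 7 },   /* Thrd3 */
    { 4, -1 }   /* Thrd4: no second product */
};

/* Hypothetical placeholders: compute_M(k) forms the k-th Strassen product from
 * the quadrants of A and B; combine_C(q) assembles quadrant q of C from the
 * finished products (see the formulas in Section II). */
void compute_M(int k);
void combine_C(int quadrant);

static pthread_barrier_t products_done;        /* stands in for set/reset events */

static void *strassen_worker(void *arg)
{
    long id = (long)arg;                       /* 0..3 corresponds to Thrd1..Thrd4 */
    for (int t = 0; t < 2; t++)
        if (schedule[id][t] > 0)
            compute_M(schedule[id][t]);
    pthread_barrier_wait(&products_done);      /* wait until all of M1..M7 exist */
    combine_C((int)id);                        /* simplified: one thread per C block */
    return NULL;
}

void strassen_threads(void)
{
    pthread_t tid[4];
    pthread_barrier_init(&products_done, NULL, 4);
    for (long id = 0; id < 4; id++)
        pthread_create(&tid[id], NULL, strassen_worker, (void *)id);
    for (int id = 0; id < 4; id++)
        pthread_join(tid[id], NULL);
    pthread_barrier_destroy(&products_done);
}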
Figure 6. Execution of Strassen's algorithm using MPI on seven computers
Figure 7. Strassen's algorithm by multi-threading technique
Figure 8 presents the performance of Strassen's algorithm on a single computer using multi-threading, SIMD, and blocking techniques. Without any parallel processing techniques, the performance of Strassen's algorithm is about 0.43 FLOPs/cc. The use of the multi-threading technique, running four threads concurrently, enhances the performance to a range of 1.1 to 1.3 FLOPs/cc. Note that the parallelism in Strassen's algorithm is based on functional decomposition, where each thread executes a separate task on a specific portion of the data. The SIMD and blocking techniques do not improve the performance drastically: the performance of Strassen's algorithm on a single computer using multi-threading, SIMD, and blocking techniques ranges from 0.9 to 1.2 FLOPs/cc. Applying the MPI technique improves the performance of Strassen's algorithm further, since the computations of the intermediate matrices M1 through M7 are independent. Figure 9 shows the performance of Strassen's algorithm for matrix multiplication using MPI, multi-threading, SIMD, and blocking techniques on 2, 4, 6, and 7 computers. On large matrix sizes, the speedups of using MPI alone are 1.75, 3.14, 3.17, and 5.08 on 2, 4, 6, and 7 computers, respectively. Adding multi-threading to MPI improves the speedups to 4.62, 7.08, 7.11, and 9.31 on 2, 4, 6, and 7 computers, respectively. On 9000×9000 matrices, the combination of MPI, multi-threading, SIMD, and blocking techniques results in speedups of 4, 7.33, 7.38, and 11.33 on 2, 4, 6, and 7 computers, respectively. The average speedups of using all the parallel processing techniques on large matrices (from 6000×6000 to 9000×9000) are 3.9, 6.8, 6.8, and 10.3 on 2, 4, 6, and 7 computers, respectively.
IV. CONCLUSION
On Intel multi-core processors, three common forms of parallelism (ILP, TLP, and DLP) can be exploited to increase performance using superscalar execution, multi-threaded computation, and streaming SIMD extensions, respectively. Whereas ILP is exploited by the compiler and hardware, TLP and DLP are the responsibility of the programmer, who must write parallel code to speed up the execution. Moreover, exploiting the memory hierarchy by reusing the data loaded into higher-level caches results in further performance improvement. Besides, the use of MPI improves the performance by running parallel tasks on computers connected through a network. In this paper, all these forms of parallelism have been used to improve the performance of the traditional and Strassen algorithms for matrix multiplication on a cluster of computers, each with a quad-core Intel Xeon processor running at 2.33 GHz, 128 KB of L1 data cache per core, a shared 12 MB L2 cache, and 4 GB of main memory. On large matrices (from 6000×6000 to 9000×9000), average speedups of 14.9, 24.1, 24, 22.8, and 20.7 are achieved on 2, 4, 6, 8, and 10 computers, respectively, by applying the combination of MPI, multi-threading, SIMD, and blocking techniques to the traditional matrix multiplication algorithm. These techniques speed up the execution of Strassen's algorithm by 3.9, 6.8, 6.8, and 10.3 on 2, 4, 6, and 7 computers, respectively.
Figure 8. The performance of Strassen's algorithm using multi-threading, SIMD, and blocking techniques
(a) Performance of serial implementation using MPI
(b) Performance of MPI and multi-threading techniques
(c) Performance of MPI, multi-threading, SIMD, and blocking techniques on small matrix sizes
(d) Performance of MPI, multi-threading, SIMD, and blocking techniques on large matrix sizes
(e) Speedup due to MPI, multi-threading, SIMD, and blocking techniques on small matrix sizes
(f) Speedup due to MPI, multi-threading, SIMD, and blocking techniques on large matrix sizes
Figure 9. The performance of Strassen's algorithm for matrix multiplication using MPI, multi-threading, SIMD, and blocking techniques

REFERENCES
[1] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach, 5th Edition, Morgan Kaufmann, September 2011.
[2] M. Soliman, "Performance Evaluation of Multi-Core Intel Xeon Processors on Basic Linear Algebra Subprograms," Parallel Processing Letters (PPL), World Scientific Publishing Company, ISSN 0129-6264, Vol. 19, No. 1, pp. 159-174, March 2009.
[3] A. Binstock and R. Gerber, Programming with Hyper-Threading Technology: How to Write Multithreaded Software for Intel IA-32 Processors, Intel Press, ISBN 0970284691, 2003.
[4] R. Gerber, A. Bik, K. Smith, and X. Tian, The Software Optimization Cookbook: High-Performance Recipes for IA-32 Platforms, 2nd Edition, Intel Press, ISBN 0976483211, 2006.
[5] S. Akhter and J. Roberts, Multi-Core Programming: Increasing Performance through Software Multithreading, Intel Press, ISBN 0976483246, 2006.
[6] Intel® 64 and IA-32 Architectures Software Developer's Manual, Vol. 1: Basic Architecture, Order Number 253665-047US, available at: http://download.intel.com/products/processor/manual/253665.pdf, June 2013.
[7] M. Tremblay, J. O'Connor, V. Narayanan, and L. He, "VIS Speeds New Media Processing," IEEE Micro, Vol. 16, No. 4, pp. 10-20, August 1996.
[8] R. Lee, "Multimedia Extensions for General-Purpose Processors," Proc. IEEE Workshop on Signal Processing Systems (SiPS 97): Design and Implementation (formerly VLSI Signal Processing), pp. 9-23, 1997.
[9] L. Gwennap, "Digital, MIPS Add Multimedia Extensions," Microprocessor Forum 96, Vol. 10, No. 15, pp. 1-5, November 1996.
[10] K. Diefendorff, P. Dubey, R. Hochsprung, and H. Scales, "AltiVec Extension to PowerPC Accelerates Media Processing," IEEE Micro, Vol. 20, No. 2, pp. 85-95, March/April 2000.
[11] G. Golub and C. Van Loan, Matrix Computations, 2nd Edition, Johns Hopkins University Press, Baltimore and London, 1993.
[12] J. Dongarra, I. Foster, G. Fox, K. Kennedy, A. White, L. Torczon, and W. Gropp, The Sourcebook of Parallel Computing, Morgan Kaufmann, 2002.
[13] M. Thottethodi, S. Chatterjee, and A. Lebeck, "Tuning Strassen's Matrix Multiplication for Memory Efficiency," Proc. ACM/IEEE Conference on Supercomputing (SC98), pp. 1-14, November 1998.
[14] M. Soliman, "Multi-Threaded SIMD Implementation of the Back-Propagation Algorithm on Multi-Core Intel Xeon Processors," Neural, Parallel and Scientific Computations, Dynamic Publishers, Atlanta, USA, ISSN 1061-5369, Vol. 15, No. 2, pp. 253-268, June 2007.
[15] B. Barney, Message Passing Interface (MPI), Lawrence Livermore National Laboratory, https://computing.llnl.gov/tutorials/mpi/.