Parallel Matrix Multiplication for Various Implementations

Niyameddin Taghiyev
Computer Engineering Department
Anadolu University, Eskisehir, Turkey
[email protected]

Assist. Prof. M. Akcay
Computer Engineering Department
Dumlupinar University, Kutahya, Turkey
[email protected]

Abstract— It has become increasingly common to see supercomputing applications harness the massive parallelism of graphics cards to speed up computations. In this study, an analysis of the execution time of four different implementations of parallel matrix multiplication is presented. With our method, parallel matrix multiplication in the Compute Unified Device Architecture (CUDA) runs about 10 times faster than the Matlab implementation, 100 times faster than Java threads, 300 times faster than C++ on a dual-core Central Processing Unit (CPU), and 600 times faster than C++ on a single-core CPU, compared with using the fastest tools in the GPU-only or CPU-only case. The goal of this study is to show how to offload parallel computations to the graphics card when it is necessary, and to give some idea of how to think about code running in a massively parallel environment.

Index Terms— Matrix multiplication, parallel computing, CUDA, MPI, threads, C++, Java, Matlab, GPU and CPU computing.

I. INTRODUCTION

We implemented and evaluated the problem on both the Graphics Processing Unit (GPU) and the Central Processing Unit (CPU). Parallel software developers must contend with problems not encountered during sequential program development, so it is necessary to understand the characteristics of parallel designs of applications on the CPU and GPU. Parallel computing facilitates solving computational and data-intensive problems by using many-core processors. This approach has been used very successfully in a variety of data-parallel programs, that is, programs that obtain parallelism by partitioning the data among the processors.

Compute Unified Device Architecture (CUDA) is NVIDIA's GPU architecture featured in its GPU cards [4], positioning itself as a new means for general-purpose computing with GPUs. CUDA gives the advantage of massive computational power to the programmer; this massively parallel computational power is provided by NVIDIA's graphics cards. A GeForce GTS 450 [5], which provides 192 cooperating cores, is used in this study. For running multithreaded applications there is no need for streaming computing on the GPU, because the cores can communicate and exchange information with each other.

CUDA is well suited only for highly parallel algorithms. If you want to increase the performance of your algorithm while running on the GPU, you need to have many threads; normally, more threads give better performance. For most serial algorithms, CUDA is not that useful: if the problem cannot be broken down into at least a thousand threads, then using CUDA has no overall advantage. In that case the serial algorithm can sometimes be converted into a parallel one, but this is not always possible. As mentioned above, to get the best optimization you need to divide your problem into at least a thousand threads; then the performance of the algorithm increases rapidly. An introduction to programming GPUs using CUDA, NVIDIA's language for programming GPUs and a standard for programming heterogeneous systems including conventional CPUs and GPUs, is given in [6] and [8].

The main idea of CUDA is to have thousands of threads executing in parallel. All of these threads execute the very same function (code), known as a kernel [1]. All these threads are executed using the same instructions on different data. Each thread knows its own ID and, based on its ID, determines which pieces of data to work on; a minimal sketch of this pattern is given at the end of this section.

A CUDA program consists of one or more phases that are executed on either the host (CPU) or a device such as a GPU [12]. Phases with little or no data parallelism are carried out in host code; phases with a high amount of data parallelism are carried out in device code. A CUDA program is a unified source code encompassing both host and device code. The host code is straightforward C++ code, compiled with a standard C++ compiler, which we can regard as an ordinary CPU process [14]. The device code is written using CUDA keywords for labeling data-parallel functions, called kernels, and their associated data structures. In some cases one can also execute kernels on the CPU if no GPU device is available; this facility is provided by the emulation features of the CUDA software development kit.

One advantage is that there is no need to write the whole program using CUDA technology. If you are writing a large application, complete with a user interface and many other functions, then most of your code will be written in C++ or whatever your language of choice is. When you really need to do large mathematical computations, you can simply write a kernel call to invoke the CUDA functions you have written. In this way, instead of writing the complete program in CUDA, you can use the GPU only for those portions of the code that need heavy mathematical computation.
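As a minimal illustration of this ID-to-data mapping, consider the following sketch. It is our illustration, not code from the paper; the kernel name scaleKernel, the data size, and the block size are all hypothetical. Each thread derives a global index from its block and thread IDs and works on exactly one element:

    #include <cstdio>
    #include <cuda_runtime.h>

    // Each thread handles exactly one element, chosen by its global ID.
    __global__ void scaleKernel(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's own ID
        if (i < n)                  // guard: the grid may be larger than the data
            out[i] = 2.0f * in[i];
    }

    int main() {
        const int n = 1024;
        float h_in[n], h_out[n];
        for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

        float *d_in, *d_out;
        cudaMalloc(&d_in,  n * sizeof(float));
        cudaMalloc(&d_out, n * sizeof(float));
        cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

        // 256 threads per block; enough blocks to cover all n elements.
        scaleKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
        cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

        printf("out[0] = %f\n", h_out[0]);  // expect 2.0
        cudaFree(d_in); cudaFree(d_out);
        return 0;
    }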

II. MATRIX MULTIPLICATION PROBLEM

Matrix multiplication routines are widely used in the computational sciences in general, mostly for linear algebra, and are heavily applied in scientific modeling in particular [13]. Matrix multiplication is a basic computational problem. Two matrices A, of dimension m × n, and B, of dimension n × p, can be multiplied only if one condition is fulfilled: the width (number of columns) of the first matrix must equal the height (number of rows) of the second. The product, matrix C, has dimension m × p, and each of its elements is equal to

    C_{i,j} = \sum_{k=1}^{n} A_{i,k} \, B_{k,j}

where C_{i,j} is the element of C in row i and column j. If the width of the first matrix is not equal to the height of the second matrix, the product C is undefined. In this work, in order to simplify the algorithm and the conclusions drawn from the results, only square matrices were multiplied.

There are several parts to creating a complete program using CUDA. To compute matrix multiplication there are three main parts: the main file, multiplication on the host, and multiplication on the device. Just like in any programming language, there must be a main function [2]. When developing in CUDA, the main file must include certain things. The algorithm for the main portion of computing matrix multiplication on the CPU and GPU consists of these steps (a sketch following this list puts them together):
• Allocate host memory for matrices A and B
• Initialize host memory (put random values into the matrices)
• Allocate device memory
• Copy host memory to device
• Allocate device memory for the result
• Allocate host memory for the result
• Create and start a timer
• Execute the kernel
• Check whether kernel execution generated an error
• Copy the result from device to host
• Stop and destroy the timer
• Compute the reference solution (matrix multiplication on the CPU)
• Check the result (compare the results on GPU and CPU)
• Clean up memory

Another function that is needed is one that does the matrix multiplication on the CPU. The sequential algorithm performing matrix multiplication on the CPU is very simple. This function is usually stored in a C++ file by itself and is referenced in the main file using the "extern C" syntax. Lastly, on each GPU the multiplying function needs the whole B matrix but only a 1/n part of the A matrix, where n is the number of processing units. Similarly, when copying the C matrix back from the GPU, only a 1/n part of the usual transfer is required. The cudaMalloc function was used to speed up the communication between the GPU and the CPU [10]; functions launched on different GPUs were initiated from different threads.
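To make these steps concrete, here is a hedged sketch of such a main file together with a naive kernel and the CPU reference function. This is our illustration rather than the paper's actual source: the timer steps are omitted for brevity, and names such as matMulNaive and the size N = 512 are assumptions.

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Naive kernel: one thread per element of C.
    __global__ void matMulNaive(const float *A, const float *B, float *C, int N) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < N && col < N) {
            float sum = 0.0f;
            for (int k = 0; k < N; ++k)
                sum += A[row * N + k] * B[k * N + col];
            C[row * N + col] = sum;
        }
    }

    // Sequential reference solution on the CPU.
    void matMulCPU(const float *A, const float *B, float *C, int N) {
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j) {
                float sum = 0.0f;
                for (int k = 0; k < N; ++k)
                    sum += A[i * N + k] * B[k * N + j];
                C[i * N + j] = sum;
            }
    }

    int main() {
        const int N = 512;
        size_t bytes = (size_t)N * N * sizeof(float);

        // Allocate host memory and initialize it with random values.
        float *hA = (float *)malloc(bytes), *hB = (float *)malloc(bytes);
        float *hC = (float *)malloc(bytes), *ref = (float *)malloc(bytes);
        for (int i = 0; i < N * N; ++i) {
            hA[i] = rand() / (float)RAND_MAX;
            hB[i] = rand() / (float)RAND_MAX;
        }

        // Allocate device memory and copy the inputs over.
        float *dA, *dB, *dC;
        cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
        cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

        // Execute the kernel and check whether the launch generated an error.
        dim3 block(16, 16);
        dim3 grid((N + 15) / 16, (N + 15) / 16);
        matMulNaive<<<grid, block>>>(dA, dB, dC, N);
        if (cudaGetLastError() != cudaSuccess)
            fprintf(stderr, "kernel launch failed\n");

        // Copy the result back, compute the reference solution, and compare.
        cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
        matMulCPU(hA, hB, ref, N);
        float maxErr = 0.0f;
        for (int i = 0; i < N * N; ++i) {
            float d = hC[i] - ref[i];
            if (d < 0.0f) d = -d;
            if (d > maxErr) maxErr = d;
        }
        printf("max abs difference between GPU and CPU results: %g\n", maxErr);

        // Clean up memory.
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
        free(hA); free(hB); free(hC); free(ref);
        return 0;
    }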

The kernel makes up the computation of matrix multiplication on the device, that is, the GPU [7]. Along with the multiplication itself, other initializations are needed to prepare the GPU for this computation, including declaring the thread and block dimensions in which the values will be stored.

III. FOUR DIFFERENT IMPLEMENTATIONS

This paper proposes to solve the parallel matrix multiplication problem in a distributed environment using Matlab threads, C++ with MPI (Message Passing Interface) in the Visual Studio 2012 platform, Java threads, and the CUDA architecture in Microsoft Visual Studio 2012.

A. Matlab Threads [11]

There are two levels of parallelism present in Matlab:
• Implicit multi-threaded parallelism for certain built-in Matlab commands, such as matrix-matrix multiplication.
• Explicit parallelism provided by the Parallel Computing Toolbox.

We focus on the implicit multi-threaded parallelism first. The number of threads used is set automatically by Matlab at run time. You can type ">> maxNumCompThreads" to find out how many threads Matlab is using for computation. Note that maxNumCompThreads is currently deprecated and could be discontinued in a future release of Matlab.

B. C++ by Using MPI in Visual Studio Platform

Matrix multiplication is done using the Message Passing Interface (MPI) [3] in the Visual Studio 2012 platform, with C++ as the programming language. The aim is to decrease the computation time of matrix multiplication by using the MPI parallel computing approach. Here are some useful functions we used for solving the problem (a sketch following this list shows how they fit together):
• MPI_Init(): initializes the MPI execution environment
• MPI_Comm_rank(): determines the rank of the calling process in the communicator
• MPI_Comm_size(): determines the size of the group associated with a communicator
• MPI_Get_processor_name(): gets the name of the processor
• MPI_Bcast(): broadcasts a message from the process with rank "root" to all other processes of the group
• MPI_Scatter(): sends data from one task to all other tasks in a group
• MPI_Send(): performs a basic send
• MPI_Recv(): performs a basic receive
• MPI_Sendrecv_replace(): sends and receives using a single buffer. Note that this should be used when shifting large amounts of data; the normal send and receive functions can end in a deadlock situation.
• MPI_Gather(): gathers values from a group of processes
• MPI_Finalize(): terminates the MPI execution environment
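The following minimal sketch shows how these calls fit together. It is our reconstruction, not the paper's source: it broadcasts B to every process, scatters the row blocks of A, multiplies locally, and gathers the row blocks of C. The size N = 512 is an assumption, and N is assumed to be divisible by the number of processes.

    #include <cstdio>
    #include <cstdlib>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);                 // initialize the MPI environment

        int rank, size, len;
        char name[MPI_MAX_PROCESSOR_NAME];
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // this process's rank
        MPI_Comm_size(MPI_COMM_WORLD, &size);   // number of processes
        MPI_Get_processor_name(name, &len);     // name of the processor
        printf("process %d of %d on %s\n", rank, size, name);

        const int N = 512;                      // assumes N % size == 0
        int rows = N / size;                    // rows of A per process

        double *A = NULL, *C = NULL;
        double *B     = (double *)malloc(N * N * sizeof(double));
        double *Apart = (double *)malloc(rows * N * sizeof(double));
        double *Cpart = (double *)malloc(rows * N * sizeof(double));

        if (rank == 0) {                        // root initializes the full matrices
            A = (double *)malloc(N * N * sizeof(double));
            C = (double *)malloc(N * N * sizeof(double));
            for (int i = 0; i < N * N; ++i) {
                A[i] = rand() / (double)RAND_MAX;
                B[i] = rand() / (double)RAND_MAX;
            }
        }

        // Every process gets all of B and its own block of rows of A.
        MPI_Bcast(B, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
        MPI_Scatter(A, rows * N, MPI_DOUBLE,
                    Apart, rows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        // Local multiplication of the assigned rows.
        for (int i = 0; i < rows; ++i)
            for (int j = 0; j < N; ++j) {
                double sum = 0.0;
                for (int k = 0; k < N; ++k)
                    sum += Apart[i * N + k] * B[k * N + j];
                Cpart[i * N + j] = sum;
            }

        // Collect the row blocks of C back on the root process.
        MPI_Gather(Cpart, rows * N, MPI_DOUBLE,
                   C, rows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        if (rank == 0) printf("C[0] = %f\n", C[0]);

        free(B); free(Apart); free(Cpart);
        if (rank == 0) { free(A); free(C); }
        MPI_Finalize();                         // terminate the MPI environment
        return 0;
    }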

C. Java Threads [16]

The Java language, which is widely used today, supports developing parallel applications via built-in libraries for managing threads. In the presented study, the experimental performance of a parallel implementation of the matrix multiplication algorithm, which provides a basis for most matrix operations, is investigated. Throughout the experiments, the effects of the number of threads and the matrix size on parallel computation performance are measured.

Java has direct support for multithreading integrated into the language. The java.lang package contains a thread API consisting of the class Thread and the interface Runnable. There are two basic methods to create threads in Java. Threads can be generated by specifying a new class which inherits from Thread and by overriding the run() method in the new class with the code that should be executed by the new thread. A new thread is then created by generating an object of the new class and calling its start() method. An alternative way to generate threads is by using the interface Runnable, which contains only the abstract method run(). The Thread class actually implements the Runnable interface and, thus, a class inheriting from Thread also implements the Runnable interface. The creation of a thread without inheriting from the Thread class consists of two steps: first, a new class is specified which implements the Runnable interface and overrides the run() method with the code that should be executed by the thread. After that, an object of the new class is generated and passed as an argument to the constructor method of the Thread class. The new thread is then started by calling the start() method of the Thread object. A thread is terminated when the last statement of the run() method has been executed.

D. Utilizing CUDA Architecture in Microsoft Visual Studio [9]

The basics of CUDA code implementation are:
• Allocate CPU memory.
• Allocate the same amount of GPU memory using the library function cudaMalloc.
• Take the data input in CPU memory.
• Copy the data into GPU memory using the library function cudaMemcpy with the parameter cudaMemcpyHostToDevice.
• Perform the processing in GPU memory using kernel calls (kernel calls are a way to transfer control from the CPU to the GPU; they also specify the number of grids, blocks, and threads, i.e., the parallelism required for your program).
• Copy the final data back into CPU memory using the library function cudaMemcpy with the parameter cudaMemcpyDeviceToHost.
• Free the GPU memory using the library function cudaFree.

As we can see, setting up the environment and writing code in CUDA is a fairly easy task, but it requires that the programmer have good know-how of the architecture and experience in writing parallel code. The most important phase of programming in CUDA [15] is the kernel call, wherein the programmer must determine the parallelism that the program requires. The division of data into an appropriate number of threads is the major area that can make or break a code.
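The discussion below refers to a MatMulKernel() "given above," but the listing itself did not survive in this copy of the paper. The following is a hedged reconstruction in the style of [1], not the authors' original source: the tile width of 16 and the variable names are our assumptions, and N is assumed to be a multiple of the tile width.

    #define TILE 16  // assumed tile width: 16x16 = 256 threads per block

    // Tiled matrix multiplication, C = A * B, for square N x N matrices
    // stored in row-major order; N is assumed to be a multiple of TILE.
    __global__ void MatMulKernel(const float *A, const float *B,
                                 float *C, int N) {
        __shared__ float As[TILE][TILE];  // submatrix (tile) of A
        __shared__ float Bs[TILE][TILE];  // submatrix (tile) of B

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float sum = 0.0f;

        for (int t = 0; t < N / TILE; ++t) {
            // Each thread loads one element of each tile into shared memory.
            As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];

            // First barrier: the whole tile must be loaded before any
            // thread starts computing with it.
            __syncthreads();

            for (int k = 0; k < TILE; ++k)
                sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];

            // Second barrier: every thread must be done with this tile
            // before the next iteration overwrites shared memory.
            __syncthreads();
        }
        C[row * N + col] = sum;
    }

Such a kernel would be launched like the naive one in the earlier sketch, for example with dim3 block(TILE, TILE) and dim3 grid(N / TILE, N / TILE).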

The MatMulKernel() kernel given above makes use of the __syncthreads() call. Whenever a thread calls __syncthreads(), all threads in that thread's block must reach this point before any thread is allowed to pass it. With the first call to __syncthreads() we thus ensure that every entry of the submatrices of A and B has been loaded into shared memory before any thread begins its computations based on those values. The second call to __syncthreads() ensures that every element of the submatrix of C has been processed before we begin loading the next submatrix of A or B into shared memory. Note that while the __syncthreads() primitive enables this sort of inter-thread synchronization, its use can reduce parallelism and may degrade performance if not used wisely and sparingly.

IV. EXPERIMENTAL RESULTS

We present an analysis of the execution time of four different implementations of parallel matrix multiplication. We ran experiments on the CPU (Matlab threads, C++ using MPI in the Visual Studio 2012 platform, and Java threads) and on the GPU (utilizing the CUDA architecture in Microsoft Visual Studio). First, we experimented with parallel multiplication of N×N (N = 500, 1000, 1500, 2000) matrices, using one to ten threads, on two cores in Matlab 2012.

Fig. 4.1. Measured performance of parallel multiplication of N×N matrices with different numbers of threads in Matlab

The timing results of parallel matrix multiplication in Figure 4.1, taken over fifty runs, show that the two-thread configuration performs best. Because our computer had a maximum of two cores, two threads suffice to compute the matrix multiplication fully in parallel. From our Matlab results it is easily seen that threads perform well up to the number of available cores, but the results stay the same when the thread count is increased further. To increase performance, the number of cores would have to be increased.

In the second step, matrix multiplication was done using MPI on the Visual Studio 2012 platform, with C++ as the programming language. It was tested on single-core and dual-core machines with matrices of size 500×500, 1000×1000, and 2000×2000, five times each, for 1, 2, 4, 8, and 16 processes. The results are shown in Figure 4.2 and Figure 4.3.

Fig. 4.2. Measured performance of parallel multiplication of N×N matrices on a single-core machine in C++

Fig. 4.3. Measured performance of parallel multiplication of N×N matrices on a dual-core machine in C++

As the single-core and dual-core results show, the dual-core machine runs about two times more efficiently than the single-core machine.

Figure 4.4 shows the results of parallel matrix multiplication using Java threads. We experimented with parallel multiplication of N×N (N = 500, 1000, 2000) matrices with different numbers of threads in Java.

Fig. 4.4. Measured performance of parallel multiplication of N×N matrices with different numbers of threads in Java

We also ran our variant of the parallel matrix multiplication code on the GPU. Our device was an NVIDIA GeForce GTS 450 graphics card with 192 CUDA cores [5]. The results are shown in Figure 4.5.

Fig. 4.5. Measured performance of parallel multiplication of N×N matrices on a GeForce GTS 450 NVIDIA graphics card with CUDA

We compare all the figures for the 2000×2000 matrices. With our method, parallel matrix multiplication in CUDA runs about 10 times faster than Matlab, 100 times faster than Java threads, 300 times faster than C++ on the dual-core machine, and 600 times faster than C++ on the single-core machine, compared with using the fastest tools in the GPU-only (on our NVIDIA hardware) or CPU-only case. Once we execute the program on the GPU using CUDA, the running times as the matrix size increases are better than those on the CPU using Matlab, C++, and Java. In particular, we evaluated the differences in precision of the calculated results between the CPU and GPU.

V. CONCLUSION

Graphics Processing Units are highly useful in parallel computing, yielding orders of magnitude higher performance than a CPU. The general purpose of using GPUs is to facilitate applications beyond what development on CPUs alone provides. In order to show the performance difference between GPUs and CPUs, the present project executed parallel matrix multiplication on both the host and the device. The results clearly indicate the efficiency of GPU performance. Utilizing such a multiprocessor environment will become a new trend in GPU technology, to the benefit of many CPUs and GPUs. In such environments, new approaches for realizing optimal load balancing are required to achieve the maximal speedup in the high-performance computing field. A library of parallel programming must be developed for CPU and GPU research. Ideally, however, whether CPUs or GPUs are used should not be the user's concern; performance should be.

ACKNOWLEDGMENT

We would like to thank Berna Seref, Mehmet Sekercioglu, and Burak Benligiray for their helpful comments while taking the "Parallel Computing" course at Anadolu University.

REFERENCES

[1] Robert Hochberg, "Matrix Multiplication with CUDA: A Basic Introduction to the CUDA Programming Model," August 11, 2012.
[2] Jessica Brazelton, "Matrix Multiplication using Graphics Processing Unit," April 30, 2010, p. 13.
[3] Janko Strassburg, "Parallel Matrix by Matrix Multiplication using MPI," PRACE Training at BSC, November 2012.
[4] NVIDIA, CUDA Compute Unified Device Architecture Programming Guide.

[5] NVidia graphics card: http://www.bit-tech.net/hardware/
[6] Qin Wang, J. Ohmura, S. Axida, T. Miyoshi, H. Irie, and T. Yoshinaga, "Parallel Matrix-Matrix Multiplication Based on HPL with a GPU-Accelerated PC Cluster," First International Conference on Networking and Computing (ICNC), 2010. DOI: 10.1109/IC-NC.2010.39.
[7] José María Cecilia, José Manuel García, and Manuel Ujaldón, "The GPU on the Matrix-Matrix Multiply: Performance Study and Contributions," Int'l Conf. on Parallel Computing (ParCo'09), Lyon, France, 2009.
[8] Satoshi Ohshima, Kenji Kise, Takahiro Katagiri, and Toshitsugu Yuba, "Parallel Processing of Matrix Multiplication in a CPU and GPU Heterogeneous Environment," High Performance Computing for Computational Science – VECPAR 2006, 2007.
[9] Hui Li, Geoffrey Fox, Gregor Laszewski, Zhenhua Guo, and Judy Qiu, "Co-processing SPMD Computation on GPUs and CPUs on Shared Memory System," June 3, 2012.
[10] K. Fatahalian, J. Sugerman, and P. Hanrahan, "Understanding the Efficiency of GPU Algorithms for Matrix-Matrix Multiplication," Proceedings of the ACM SIGGRAPH/EUROGRAPHICS Conference on Graphics Hardware, 2004.
[11] Ying Chen and Suan Fong Tan, "MATLAB*G: A Grid-Based Parallel MATLAB," January 2004.
[12] Rafia Inam, "An Introduction to GPGPU Programming - CUDA Architecture," Technical Report, MRTC, December 2010.
[13] Junjie Li, Sanjay Ranka, and Sartaj Sahni, "Strassen's Matrix Multiplication on GPUs," IEEE 17th International Conference on Parallel and Distributed Systems (ICPADS), 2011.
[14] Jayshree Ghorpade, Jitendra Parande, Madhura Kulkarni, and Amit Bawaskar, "GPGPU Processing in CUDA Architecture," Advanced Computing: An International Journal (ACIJ), Vol. 3, No. 1, January 2012.
[15] M. Garland, S. Le Grand, J. Nickolls, J. Anderson, J. Hardwick, S. Morton, E. Phillips, Y. Zhang, and V. Volkov, "Parallel Computing Experiences with CUDA," IEEE Micro, Vol. 28, July-August 2008.
[16] Holger Blaar and Matthias Legeler, "Efficiency of Thread-parallel Java Programs from Scientific Computing," Proceedings of the 16th International Parallel and Distributed Processing Symposium, 2002.
