2012 16th Panhellenic Conference on Informatics

Implementing Basic Computational Kernels of Linear Algebra on Multicore

Panagiotis D. Michailidis
Department of Balkan Studies, University of Western Macedonia, Florina, Greece
Email: [email protected]

Konstantinos G. Margaritis
Department of Applied Informatics, University of Macedonia, Thessaloniki, Greece
Email: [email protected]


Abstract—This paper implements basic computational kernels of scientific computing, such as the matrix-vector product, the matrix product and Gaussian elimination, on multi-core platforms using several parallel programming tools. Specifically, these tools are Pthreads, OpenMP, Intel Cilk++, Intel TBB, Intel ArBB, SMPSs, SWARM and FastFlow. The aim of this paper is to present a unified quantitative and qualitative study of these tools for the parallel computation of scientific computing kernels on multicore. Based on this study, we conclude that the Intel ArBB and SWARM parallel programming tools are the most appropriate, because they combine good performance with simplicity of programming.

I. INTRODUCTION

Computational kernels such as the matrix-vector product, the matrix product and Gaussian elimination lie at the core of many scientific and computational economics applications, such as computational statistics, econometrics, linear programming and combinatorial optimization. These kernels are often computation-intensive, so their sequential execution time is quite large. It is therefore profitable to use a multi-core platform as a high performance computing system for their execution.

For the parallelization of computational kernels from scientific computing on multi-core platforms there are many representative parallel programming tools: Pthreads [13], OpenMP [5], Intel Cilk++ [3], Intel TBB [4], Intel ArBB [2], SMPSs [1], SWARM [9] and FastFlow [6], [7]. These tools are based on a small set of extensions to the C programming language and involve a relatively simple compilation phase and a potentially much more complex runtime system. The question posed by programmers is which parallel programming tool is appropriate for implementing computational kernels on multi-core, so that there is a balance between ease of programming and high performance.

We must note that many parallel implementations of matrix computations and other similar problems of scientific computing on multi-core platforms have been reported in the research literature, for example [22], [21], [14], [16], [18], [19], [20]. However, most of these implementations are based on different parallelization techniques using individual tools and libraries such as Pthreads, OpenMP, Intel MKL and PLAPACK. Recently, Buttari et al. [11], [12] developed the Parallel Linear Algebra for Multi-core Architectures (PLASMA) library, which implements several linear algebra operations on multi-core platforms using tile algorithms. Moreover, Kurzak et al. [17] implemented matrix factorization algorithms such as the Cholesky, QR and LU factorizations on multi-core processors using some of the above multi-thread programming tools, such as Cilk and SMPSs; these implementations are also based on tile algorithms.

Against this background, there is no systematic quantitative (performance) and qualitative (ease of programming effort) study of all multi-core programming tools for implementing scientific computing kernels based on simple parallelization techniques, i.e. no existing work answers the above research question. In this paper, a unified quantitative and qualitative study of tools for implementing basic kernels from linear algebra is presented. To the best of our knowledge, this is the first attempt to conduct a quantitative and qualitative evaluation of these libraries on multi-core systems.

An outline of the rest of the paper is as follows. In Section II, a general description of all the reviewed parallel programming tools is presented, and in Section III some parallelization issues for implementing the scientific computing kernels are discussed. In Section IV, a performance and qualitative evaluation of the reviewed parallel programming tools for parallelizing the basic computational kernels is described. Finally, in Section V the conclusions are presented.

II. MULTI-CORE PROGRAMMING TOOLS

In this section we present a short review of all the multi-core programming environments that are evaluated in this paper.

POSIX threads (in short, Pthreads) [13] is a portable API (Application Programming Interface) commonly used for programming shared-memory multiprocessors and multi-core processors. It is a low-level library; hence, it gives the programmer greater control over how parallelism is exploited, at the expense of being more difficult to use. With Pthreads the programmer must create all threads explicitly and insert all the necessary synchronization between threads. Pthreads provides a rich set of synchronization primitives such as locks, mutexes, semaphores, barriers and condition variables.
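To illustrate the explicit thread management that Pthreads requires, the following sketch (our own illustrative code, not taken from the paper) computes the matrix-vector product y = Ax with one contiguous block of rows per thread; the helper names and the row-major storage convention are assumptions.

#include <pthread.h>
#include <algorithm>
#include <vector>

struct Task {                  // arguments handed to each worker thread
    const double* A;           // m x n matrix, row-major
    const double* x;           // input vector of length n
    double*       y;           // output vector of length m
    int n, row_begin, row_end; // half-open row range owned by this thread
};

static void* worker(void* arg) {
    Task* t = static_cast<Task*>(arg);
    for (int i = t->row_begin; i < t->row_end; ++i) {
        double sum = 0.0;
        for (int j = 0; j < t->n; ++j)
            sum += t->A[i * t->n + j] * t->x[j];
        t->y[i] = sum;
    }
    return nullptr;
}

// y = A * x using p threads; the programmer creates, feeds and joins them explicitly.
void mat_vec_pthreads(const double* A, const double* x, double* y, int m, int n, int p) {
    std::vector<pthread_t> threads(p);
    std::vector<Task> tasks(p);
    int chunk = (m + p - 1) / p;                           // ceil(m / p) rows per thread
    for (int t = 0; t < p; ++t) {
        tasks[t] = {A, x, y, n, std::min(t * chunk, m), std::min((t + 1) * chunk, m)};
        pthread_create(&threads[t], nullptr, worker, &tasks[t]);
    }
    for (int t = 0; t < p; ++t)
        pthread_join(threads[t], nullptr);                 // joining is the only synchronization needed here
}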

OpenMP [5] is a popular and portable API for shared-memory parallel programming. Programming with OpenMP is based on compiler directives which tell the compiler which parts of the code can be parallelized and how. These directives allow the programmer to create parallel sections, mark parallelizable loops and define critical sections. When a parallel loop or parallel region is defined, the programmer must specify which variables are private to each thread, shared, or used in reductions. OpenMP also provides a set of scheduling clauses to control the way the iterations of a parallel loop are assigned to threads: the static, dynamic and guided clauses. If the schedule clause is not specified, static is assumed in most implementations. Finally, OpenMP provides some library functions to access the runtime environment. The most recent version of OpenMP adds a new way of parallelization, the task construct, which allows the programmer to declare tasks that can be executed by any thread, regardless of which thread encounters the construct.
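For comparison, a minimal sketch (again our own illustration, not the paper's code) of the same row-wise matrix-vector product with OpenMP; a single directive replaces the explicit thread management above, and the schedule clause shows where the static, dynamic or guided policies mentioned in the text are selected.

// y = A * x, with A stored row-major as an m x n array.
void mat_vec_openmp(const double* A, const double* x, double* y, int m, int n) {
    // The iterations (rows) are divided statically among the threads; the loop
    // variables are private, while A, x, y and the sizes are shared.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < m; ++i) {
        double sum = 0.0;
        for (int j = 0; j < n; ++j)
            sum += A[i * n + j] * x[j];
        y[i] = sum;
    }
}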

The Intel Cilk++ [3] language is based on technology from Cilk [10], a parallel programming model for the C language. Cilk++ is an extension of C++ that simplifies writing parallel applications which efficiently exploit multiple processors. More specifically, the Cilk++ language allows the programmer to insert keywords (cilk_spawn, cilk_sync and cilk_for) into sequential code to tell the compiler which parts of the code should be executed in parallel. Cilk++ also provides reducers, which eliminate contention for shared variables among tasks by automatically creating views of them for each task and reducing them back to a shared value after task completion. Moreover, the Cilk++ language is particularly well suited for, but not limited to, divide-and-conquer algorithms. This strategy solves problems by breaking them into sub-problems that can be solved independently and then combining the results. Recursive functions are often used for divide-and-conquer algorithms and are well supported by the Cilk++ language. Finally, Cilk++ provides some additional tools, such as a performance analyzer and the race-condition detector Cilkscreen.

Intel Threading Building Blocks (in short, TBB) [4] is an open source library that offers a rich methodology to express parallelism in C++ programs and take advantage of multi-core processor performance. In TBB, the programmer specifies tasks instead of threads, and the threads are completely hidden from the programmer. The idea of TBB is to extend C++ with higher-level, task-based abstractions for parallel programming. The runtime system automatically schedules tasks onto threads in a way that makes efficient use of a multi-core platform. TBB emphasizes a data-parallel programming model, enabling multiple threads to work on different parts of a data collection and thereby scale to a larger number of cores. Finally, TBB uses a runtime-based programming model and provides programmers with generic parallel algorithms based on a template library similar to the standard template library (STL). More specifically, TBB is based on template functions (parallel_for, parallel_reduce, etc.), where the programmer specifies the range of data to be accessed, how to partition the data, and the task to be executed on each chunk.
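As a small illustration of the parallel_for template (using its lambda-based overload; this is our sketch under assumed names, not code from the study), the row loop of the matrix-vector product can be handed to the TBB runtime as follows.

#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

// y = A * x with TBB: the runtime splits the row range into chunks (tasks)
// and schedules them onto its worker threads.
void mat_vec_tbb(const double* A, const double* x, double* y, int m, int n) {
    tbb::parallel_for(tbb::blocked_range<int>(0, m),
        [=](const tbb::blocked_range<int>& r) {
            for (int i = r.begin(); i != r.end(); ++i) {
                double sum = 0.0;
                for (int j = 0; j < n; ++j)
                    sum += A[i * n + j] * x[j];
                y[i] = sum;
            }
        });
}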

Intel Array Building Blocks (in short, ArBB) [2] is an open source, high-level API, backed by a library, that supports data-parallel programming and is designed to effectively utilize the power of existing and upcoming throughput-oriented features of modern processor architectures, including multi-core and many-core platforms. Intel ArBB extends C++ for complex data parallelism, including irregular and sparse matrices, and works with standard C++ compilers. ArBB is therefore best suited for compute-intensive, data-parallel applications (often involving vector and matrix math). ArBB allows programmers to express parallel computations with sequential semantics by expressing operations at the aggregate data collection level. For this reason, ArBB provides a rich set of data types for representing data collections (aggregations of data such as matrices and arrays). Fundamental elements of these data types are dense and nested containers, which are collections to which data-parallel operators may be applied. All vectorization and threading is managed internally by ArBB. Furthermore, the programmer uses collective operations with clear semantics, such as add_reduce, which computes the sum of the elements in a given array. ArBB also has language constructs for control flow, conditionals and loops. These operations have their usual sequential semantics and are not parallelized by the system; only the collective operations are executed in parallel.

SoftWare and Algorithms for Running on Multi-core (in short, SWARM) [9] is an open source parallel programming library that provides basic primitives for multithreaded programming. The SWARM library is a descendant of the symmetric multiprocessor (SMP) node library component of SIMPLE [8]. SWARM is built on POSIX threads, which allows the programmer to use either the already developed primitives or direct thread primitives. SWARM has constructs for parallelization and for restricting control of threads, routines for allocation and deallocation of shared memory, and communication primitives for synchronization, replication and broadcast.

FastFlow [6], [7] is an open source C++ parallel programming framework for the development of efficient applications for multi-core computers. FastFlow is conceptually designed as a stack of layers that progressively abstract the shared-memory parallelism at the level of cores, up to the definition of programming constructs that support structured parallel programming on shared-memory multi-core and many-core platforms. The core of the FastFlow framework is based on efficient Single-Producer-Single-Consumer (SPSC) and Multiple-Producer-Multiple-Consumer (MPMC) FIFO queues, which are implemented with lock-free and wait-free synchronization mechanisms. The upper level of the FastFlow framework provides high-level programming based on parallel patterns. More specifically, FastFlow provides the programmer with a set of patterns implemented as C++ templates: the farm, farm-with-feedback and pipeline patterns, as well as their arbitrary nesting and composition. A FastFlow farm is logically built out of three entities: an emitter, workers and a collector. The emitter dispatches stream items to a set of workers, which compute the output data; the results are then gathered by the collector back into a single stream.

SMP Superscalar (in short, SMPSs) [1] is a parallel programming framework for implementing applications for multi-core processors and symmetric multiprocessor systems. SMPSs is based on parallelization at the task level of sequential applications. SMPSs allows the programmer to specify the coarse-grain functions of an application (obtained, for example, by applying a blocking technique), which have to be side-effect-free (atomic) functions. These functions are identified by annotations, and the runtime system tries to parallelize the execution of the annotated functions, also called tasks. Moreover, the programmer needs to specify the directionality of each parameter (input, output, inout) of a function. If the size of a parameter is missing from the C declaration (i.e. the parameter is passed by pointer), the programmer also needs to specify the size of the memory region affected by the function. The runtime detects when tasks are data-independent of each other and is able to schedule the simultaneous execution of several of them on different cores. All of the above actions (data dependency analysis, scheduling and data transfer) are performed transparently to the programmer.
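To make the annotation idea concrete, the sketch below marks a block-row routine as a task. It is only illustrative: the #pragma css task spelling, the way parameter sizes and directions are declared, and the barrier pragma follow our reading of the SMPSs manual [1] and should be checked against it; the block size and matrix dimension are hypothetical.

#define BS 256   /* rows handled by one task (illustrative; assumes N % BS == 0) */
#define N  5120  /* matrix dimension (illustrative) */

/* One block of rows of y = A * x, written as a side-effect-free function so
   that every call can become an independent task at run time. */
#pragma css task input(A, x) output(y)
void block_mat_vec(double A[BS][N], double x[N], double y[BS])
{
    for (int i = 0; i < BS; i++) {
        double sum = 0.0;
        for (int j = 0; j < N; j++)
            sum += A[i][j] * x[j];
        y[i] = sum;
    }
}

/* The caller simply invokes the annotated function once per block of rows; the
   runtime analyses the parameter directions, builds the task graph and executes
   independent tasks on different cores. */
void mat_vec_smpss(double A[N][N], double x[N], double y[N])
{
    for (int b = 0; b < N; b += BS)
        block_mat_vec(&A[b], x, &y[b]);
    #pragma css barrier   /* wait for all outstanding tasks (illustrative) */
}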

III. PROBLEMS AND MULTI-THREADING USING A DATA PARTITIONING METHOD

In this section we present the problem definitions for the computational kernels and discuss the general data partitioning method used to parallelize them with the parallel tools and environments reviewed in Section II.

Consider the matrix-vector multiplication operation y = Ax, where A is an m × n matrix and x, y are vectors with n and m elements, respectively. Each element of y is produced by computing the inner product of a row of A with x. The time complexity of matrix-vector multiplication is O(mn).

Similarly, consider the matrix multiplication operation C = AB. If A is an m × n matrix, the operation AB can be performed only if B is an n × m matrix, and the result C is an m × m matrix. Each element of C is produced by computing the inner product of a row of A and a column of B, so the time complexity of the matrix product is O(m²n), i.e. O(n³) for square matrices.

Finally, Gaussian elimination can be used for solving dense, large-scale linear systems of equations Ax = b, where A is a known nonsingular n × n matrix with nonzero diagonal entries, b = (b_0, b_1, ..., b_{n-1})^T is the right-hand side and x = (x_0, x_1, ..., x_{n-1})^T is the vector of unknowns. Gaussian elimination transforms the matrix A into triangular form with an accompanying update of the right-hand side, so that the solution of the system Ax = b is obtained by a triangular solve. This triangular system is solved using the backward substitution algorithm. The time complexity of Gaussian elimination is O(n³) [15].
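For concreteness, compact sequential versions of the matrix-vector product and of the elimination phase are sketched below. This is our own illustrative code (row-major storage, no pivoting, relying on the nonzero diagonal assumed above), not the implementation measured in the paper.

// y = A * x for an m x n row-major matrix A: O(mn) operations.
void mat_vec(const double* A, const double* x, double* y, int m, int n) {
    for (int i = 0; i < m; ++i) {
        double sum = 0.0;
        for (int j = 0; j < n; ++j)
            sum += A[i * n + j] * x[j];
        y[i] = sum;
    }
}

// Forward elimination of A x = b without pivoting, leaving an upper-triangular
// system for the backward substitution step: O(n^3) operations.
void gaussian_elimination(double* A, double* b, int n) {
    for (int k = 0; k < n - 1; ++k) {               // elimination step k
        for (int i = k + 1; i < n; ++i) {           // update the rows below the pivot
            double factor = A[i * n + k] / A[k * n + k];
            for (int j = k; j < n; ++j)
                A[i * n + j] -= factor * A[k * n + j];
            b[i] -= factor * b[k];
        }
    }
}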

The multi-thread algorithms for these computational kernels solve the problem by having each core work on a smaller instance of the same problem. The distribution of the data (matrix or vector) is the major factor in determining the efficiency of a parallel algorithm, since it determines the number of necessary computation and communication operations; the more evenly the elements are distributed among the available cores, the better the performance and efficiency of the resulting algorithm. In order to obtain a good load balance we use a simple data partitioning technique in which each thread works concurrently on a local part of the data. More specifically, in all computational kernels we divide the matrix into blocks of rows of equal size, i.e. ⌈m/p⌉ rows per block, where m is the number of rows of the matrix and p is the number of cores. This data partitioning technique is applied to all computational kernels and tools. Finally, we note that we tried to achieve the maximum possible conformity between the various implementations of the computational kernels, so that the comparison is not affected by implementation differences.
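The ⌈m/p⌉ row-block decomposition can be written as a small tool-independent helper; the sketch below is ours (the name block_of_rows is hypothetical) and returns the half-open row range assigned to core t of p.

#include <algorithm>

struct RowRange { int begin, end; };   // half-open range [begin, end)

// Rows owned by core t (0 <= t < p) when m rows are split into ceil(m/p)-sized blocks.
RowRange block_of_rows(int m, int p, int t) {
    int chunk = (m + p - 1) / p;                 // ceil(m / p)
    int begin = std::min(t * chunk, m);
    int end   = std::min(begin + chunk, m);      // the last block may be shorter or empty
    return { begin, end };
}

Each tool then applies its own parallel construct (threads, a parallel loop, tasks or a farm) over these p ranges.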

IV. RESULTS

In order to gain insight into the practical behavior of each of the reviewed tools for implementing the computational kernels, we carried out a quantitative and a qualitative comparison.

A. Quantitative Comparison

For the quantitative (performance) comparison we performed a number of computational experiments. The experiments were run on a dual Opteron 6128 machine with eight cores per CPU (16 cores in total), a 2.0 GHz clock speed, 16 GB of memory and 512 KB of L2 cache per core, under Ubuntu Linux 10.04 LTS. During the experiments this machine was not performing other heavy tasks (or processes). All matrix operations were implemented in C/C++ using all of the reviewed multi-core programming tools. Four compilers were used to compile the multi-thread matrix computations. The Pthread, OpenMP and SWARM programs were compiled with the C compiler from the GNU Compiler Collection (GCC), since it is a very widely used compiler. The Cilk++ program was compiled with the Intel Cilk++ compiler, which is a wrapper around GCC and the only compiler available for Cilk++. The TBB, ArBB and FastFlow programs were compiled with the g++ compiler, which is part of the GCC collection. Finally, the SMPSs program was compiled with the smpss-cc compiler. It is necessary to mention that the programs were compiled without the "-O2" optimization flag. Several sets of test matrices and vectors were used to evaluate the performance of the multi-thread computational kernels: randomly generated input matrices and vectors with sizes ranging from 1024×1024 to 5120×5120.

To assess the performance of the multi-thread computational kernels for all programming tools, we used the practical execution time and a tool performance score as measures. The practical execution time is the total time that a multi-thread algorithm needs to complete the computation. The execution time is obtained by calling the C function gettimeofday() and is measured in seconds. To decrease random variation, the execution time was measured as the average of 40 runs. The performance score of tool i for each computational kernel has been calculated using the relation

    (BT(tool)_t / T(tool)_i) × 100,   i = 1, 2, ..., 8,   (1)

where BT(tool)_t is the best execution time for a specific computational kernel over all eight tools and T(tool)_i is the execution time of tool i for the same computational kernel. The best time thus corresponds to a score of 100%. To calculate the overall performance of each tool, for any combination of problem size and number of cores, we add the percentage values of every computational kernel and divide by the number of computational kernels, i.e. 3. The largest percentage therefore corresponds to the best overall performance.
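To make the measurement procedure and Eq. (1) concrete, the following sketch (our own illustration; the helper names are hypothetical) times a kernel with gettimeofday(), averages over 40 runs, and computes both the per-kernel percentage score and the overall average over the three kernels.

#include <sys/time.h>
#include <algorithm>
#include <cstddef>
#include <vector>

// Wall-clock time in seconds, as obtained from gettimeofday().
static double now_sec() {
    timeval tv;
    gettimeofday(&tv, nullptr);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

// Average execution time of a kernel over `runs` repetitions (40 in the paper).
template <class Kernel>
double average_time(Kernel kernel, int runs = 40) {
    double total = 0.0;
    for (int r = 0; r < runs; ++r) {
        double t0 = now_sec();
        kernel();
        total += now_sec() - t0;
    }
    return total / runs;
}

// Eq. (1): score of tool i on one kernel. times[i] is T(tool)_i for each of the
// eight tools; the fastest tool defines BT(tool)_t and therefore scores 100%.
double tool_score(const std::vector<double>& times, int i) {
    double best = *std::min_element(times.begin(), times.end());
    return best / times[i] * 100.0;
}

// Overall performance of tool i: the percentage averaged over the three kernels
// (matrix-vector product, matrix product, Gaussian elimination).
double overall_score(const std::vector<std::vector<double> >& times_per_kernel, int i) {
    double sum = 0.0;
    for (std::size_t k = 0; k < times_per_kernel.size(); ++k)
        sum += tool_score(times_per_kernel[k], i);
    return sum / times_per_kernel.size();
}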

The figures that follow were extracted from the experimental performance data. Three figures are presented: figures 1 and 2 show the performance graphs for all computational kernels, and figure 3 shows the average overall performance. The time axis in the figures is in logarithmic scale. Figure 1 presents the execution times of all computational kernels as a function of the number of cores for a constant problem size of 5120, whereas figure 2 presents the execution times of all computational kernels as a function of the problem size for 16 cores.

Based on the graphs of figure 1, we can say that the execution time of all reviewed tools for all computational kernels decreases as the number of cores increases; in particular, the execution time of the matrix-vector product, matrix product and Gaussian elimination operations decreases significantly. It is necessary to mention that for the Intel ArBB programming tool we did not observe a significant decrease of the execution time as a function of the number of cores, but we did observe a significant decrease compared to the execution time of the sequential C program. In other words, the relative speedup of the ArBB implementation over the sequential C implementation ranges from 1 to 6 for the matrix-vector operation and from 50 to 65 for the matrix product, whereas the relative speedup for Gaussian elimination is limited, ranging from 1 to 2. We can also observe that the ArBB and Cilk++ implementations of the matrix-vector operation and of Gaussian elimination do not show the expected behavior, i.e. their execution time increases as the number of cores increases, because of the overhead of the tools' complex runtime systems.

From the graphs of figure 1 we can make some specific performance remarks. For the matrix-vector product, the ArBB implementation has the best execution time for up to 8 cores, whereas the SWARM implementation is the best for 16 cores. On the other hand, the SMPSs implementation is the slowest. The performance of the remaining tools is close to that of the Pthread implementation, except for large numbers of cores. For the matrix product operation, the ArBB implementation is the winner of the comparison for any number of cores; the FastFlow implementation is the second best for any number of cores, and the performance of the remaining implementations is nearly identical. Finally, for Gaussian elimination, the ArBB implementation gives satisfactory results only for one core, the SWARM implementation is the best for 2 to 4 cores and the TBB implementation is the best for 8 to 16 cores; overall, the ArBB implementation is the slowest. The performance of the remaining implementations is close to the Pthread and OpenMP execution times, with the exception of the SWARM and Cilk++ implementations.

Based on the graphs of figure 2, the execution time of all reviewed tools for all computational kernels increases as the problem size increases. More specifically, the execution time of the matrix product and Gaussian elimination operations increases significantly, whereas for the matrix-vector product it increases only slightly. From the graphs of figure 2 we can make the following remarks. For the matrix-vector product, the ArBB and SWARM implementations have the best execution times for any problem size, the SMPSs implementation is the slowest, and the remaining tools have modest and very similar execution times. For the matrix product operation, the ArBB implementation presents the best performance by a large margin compared to the other implementations, whose execution times are nearly identical. Finally, for Gaussian elimination, the SWARM implementation gives satisfactory results for problem sizes up to 3072 and the TBB and FastFlow implementations are the best for problem sizes from 4096 to 5120; the ArBB implementation is the slowest, and the execution times of the remaining implementations are similar.

In figure 3 we present the overall performance of all the reviewed programming tools as a function of the number of cores and of the problem size. The average performance in the left part of figure 3, for each tool and number of cores, is the average value over all computational kernels and problem sizes. Similarly, the average performance in the right part of figure 3, for each tool and problem size, is the average value over all computational kernels and numbers of cores.


Fig. 1. Execution times (in secs) of the scientific computing kernels (matrix-vector product, matrix product, Gaussian elimination) as a function of the number of cores for a problem size of 5120. Curves are shown for Pthread, OpenMP, Cilk++, TBB, ArBB, SWARM, FastFlow and SMPSs; the time axis is logarithmic.

Fig. 2. Execution times (in secs) of the scientific computing kernels as a function of the problem size for 16 cores (same panels and tools as figure 1).

Fig. 3. Overall performance as a function of the number of cores (left) and overall performance as a function of the problem size (right).

Based on these results, we can observe that ArBB has the best performance for any number of cores (except 16) and any problem size, whereas SWARM is the second best tool for any number of cores and problem size. The next best tools are OpenMP, Cilk++ and TBB, with very small differences between them for any number of cores and problem size. Finally, in the last place come the Pthread, FastFlow and SMPSs tools, which have the lowest performance.

B. Qualitative Comparison

To assess the ease of programming effort with the multi-core programming tools, we counted the number of lines of code needed to solve each problem, excluding variable declarations. Table I presents the number of lines of code required by each tool for all computational kernels. Based on this table, we can make the following remarks.

The ArBB implementations were, in general, shorter (with the exception of Gaussian elimination), more concise and hence easier to understand. The ArBB code for the matrix operations is more concise because its statements are expressed at the aggregate data collection level, using dense containers as data types, without explicit for loops. The SWARM, OpenMP, Cilk++ and SMPSs implementations were similar in number of lines, and their codes were not significantly longer than the ArBB implementations. The SWARM implementation lets the programmer use constructs for parallelization: a for loop to be parallelized is expressed with the par_do construct, which implicitly partitions the loop among the cores without the need for coordination overheads such as synchronization or communication between the cores. Furthermore, SWARM provides the programmer with library functions for synchronization and reduction operations. In the OpenMP, Cilk++ and SMPSs implementations, the programmer inserts compiler directives (i.e. pragmas) or keywords into the sequential code to tell the compiler which parts of the code should be executed in parallel.


TABLE I
LINES OF CODE IN THE MULTI-CORE PROGRAMMING TOOLS

Computational kernel     Pthread  OpenMP  Cilk++  TBB  ArBB  SWARM  FastFlow  SMPSs
Matrix-vector product       8        5       5     10    1      4       13       5
Matrix multiplication       9        6       6     10    4      5       14       6
Gaussian elimination       14        9       8      9    9      6       17      14

Moreover, the codes of the SWARM, OpenMP, Cilk++ and SMPSs tools are easier to understand and do not require complex programming effort. The Pthread implementations require more lines of code and the programming effort is more complex: the Pthread tool provides the programmer with low-level library routines, and the parallelization of an algorithm is not automatic as in the other tools, i.e. the programmer is responsible for writing the parallel code and for distributing the data to each thread. Finally, the TBB and FastFlow implementations required many lines of code even though these tools provide C++ templates. We must note that the code obtained with the TBB and FastFlow programming tools is easy to understand, but because each algorithm is implemented as an object (class), additional statements are required beyond the core code of the algorithm. In other words, these tools require restructuring of the sequential code so that it becomes more object oriented.
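To illustrate the extra scaffolding referred to above, the TBB row loop from Section II can also be written in the classic body-object style, in which the kernel becomes a class with an operator() over a blocked_range. The sketch is ours, not code from the study.

#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>

// Body object: the row loop of y = A * x packaged as a class, as required by
// the non-lambda form of tbb::parallel_for.
class MatVecBody {
    const double* A_; const double* x_; double* y_; int n_;
public:
    MatVecBody(const double* A, const double* x, double* y, int n)
        : A_(A), x_(x), y_(y), n_(n) {}
    void operator()(const tbb::blocked_range<int>& r) const {
        for (int i = r.begin(); i != r.end(); ++i) {
            double sum = 0.0;
            for (int j = 0; j < n_; ++j)
                sum += A_[i * n_ + j] * x_[j];
            y_[i] = sum;
        }
    }
};

void mat_vec_tbb_class(const double* A, const double* x, double* y, int m, int n) {
    tbb::parallel_for(tbb::blocked_range<int>(0, m), MatVecBody(A, x, y, n));
}

The algorithmic content is unchanged; the class declaration, constructor and member bookkeeping are the kind of additional statements that the line counts in Table I reflect for TBB (and, similarly, for FastFlow's node classes).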


V. CONCLUSIONS

In this paper we performed a quantitative and qualitative comparison of multi-core programming tools in order to answer the question of which tool is appropriate for implementing basic computational kernels on a multi-core environment so that there is a balance between high performance and ease of programming. Based on the performance and qualitative comparison, we conclude that the Intel ArBB and SWARM programming environments are the most appropriate tools for parallelizing the computational kernels using the row-wise partitioning method.

As future work, it would be interesting to repeat the performance comparison of the computational kernels with compiler optimization enabled. Moreover, the qualitative comparison of the libraries could be extended to other software engineering parameters such as ease of code syntax, library popularity, support for profiling tools, and online help and documentation. Finally, we plan to extend the quantitative and qualitative comparison of the tools to other similar problems of linear algebra and computational statistics.

REFERENCES

[1] SMP Superscalar User's Manual, version 2.4, 2011. http://www.bsc.es/media/4783.pdf.
[2] Intel Array Building Blocks, 2012. http://software.intel.com/en-us/articles/intel-array-building-blocks/.
[3] Intel Cilk Plus, 2012. http://software.intel.com/en-us/articles/intel-cilk-plus/.
[4] Intel Threading Building Blocks, 2012. http://threadingbuildingblocks.org/.
[5] The OpenMP API specification for parallel programming, 2012. http://openmp.org/wp/.
[6] M. Aldinucci, M. Danelutto, P. Kilpatrick, M. Meneghin, and M. Torquati. Accelerating code on multi-cores with FastFlow. In Proceedings of the 17th International Conference on Parallel Processing - Volume Part II, Euro-Par'11, pages 170-181, Berlin, Heidelberg, 2011. Springer-Verlag.
[7] M. Aldinucci, M. Danelutto, P. Kilpatrick, and M. Torquati. Programming Multi-core and Many-core Computing Systems, chapter FastFlow: high-level and efficient streaming on multi-core. Wiley, 2011.
[8] D. A. Bader and J. JaJa. SIMPLE: A methodology for programming high performance algorithms on clusters of symmetric multiprocessors (SMPs). Journal of Parallel and Distributed Computing, 58:92-108, 1999.
[9] D. A. Bader, V. Kanade, and K. Madduri. SWARM: A parallel programming framework for multicore processors. In Parallel and Distributed Processing Symposium (IPDPS 2007), pages 1-8, 2007.
[10] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou. Cilk: An efficient multithreaded runtime system. In Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 207-216, Santa Barbara, California, July 1995.
[11] A. Buttari, J. Dongarra, P. Husbands, J. Kurzak, and K. Yelick. Multithreading for synchronization tolerance in matrix factorization. Journal of Physics: Conference Series, 78(1), 2007.
[12] A. Buttari, J. Langou, J. Kurzak, and J. Dongarra. A class of parallel tiled linear algebra algorithms for multicore architectures. Parallel Computing, 35(1):38-53, Jan. 2009.
[13] D. Buttlar, J. Farrell, and B. Nichols. PThreads Programming: A POSIX Standard for Better Multiprocessing. O'Reilly Media, 1996.
[14] E. Elmroth and F. Gustavson. High-performance library software for QR factorization. In Applied Parallel Computing: New Paradigms for HPC in Industry and Academia, T. Sorvik et al., Eds., Lecture Notes in Computer Science 1947, pages 53-63. Springer-Verlag, 2000.
[15] G. Golub and C. Van Loan. Matrix Computations. Johns Hopkins University Press, 2nd edition, 1989.
[16] B. C. Gunter and R. A. van de Geijn. Parallel out-of-core computation and updating of the QR factorization. ACM Transactions on Mathematical Software, 31(1):60-78, Mar. 2005.
[17] J. Kurzak, H. Ltaief, J. Dongarra, and R. M. Badia. Scheduling dense linear algebra operations on multicore processors. Concurrency and Computation: Practice and Experience, 22(1):15-44, Jan. 2010.
[18] M. Marqués, G. Quintana-Ortí, E. S. Quintana-Ortí, and R. van de Geijn. Out-of-core computation of the QR factorization on multi-core processors. In Proceedings of the 15th International Euro-Par Conference on Parallel Processing, Euro-Par '09, pages 809-820, Berlin, Heidelberg, 2009. Springer-Verlag.
[19] S. F. McGinn and R. E. Shaw. Parallel Gaussian elimination using OpenMP and MPI. In Proceedings of the 16th Annual International Symposium on High Performance Computing Systems and Applications, HPCS '02, pages 169-, Washington, DC, USA, 2002. IEEE Computer Society.
[20] P. D. Michailidis and K. G. Margaritis. Implementing parallel LU factorization with pipelining on a multicore using OpenMP. In Proceedings of the 2010 13th IEEE International Conference on Computational Science and Engineering, CSE '10, pages 253-260, Washington, DC, USA, 2010. IEEE Computer Society.
[21] G. Rünger and M. Schwind. Fast recursive matrix multiplication for multi-core architectures. Procedia Computer Science, 1(1):67-76, 2010.
[22] S. Zuckerman, M. Pérache, and W. Jalby. Fine tuning matrix multiplications on multicore. In Proceedings of the 15th International Conference on High Performance Computing, HiPC'08, pages 30-41, Berlin, Heidelberg, 2008. Springer-Verlag.
