Parallel Matrix Multiplication for Business Applications

Mais Haj Qasem and Mohammad Qatawneh
Computer Science Department, University of Jordan, Amman, Jordan
[email protected],
[email protected]
Abstract. Business applications, such as market shops, use matrix multiplication to calculate yearly, monthly, or even daily profits from price and quantity matrices. Matrices hold large volumes of data in computer applications and other fields, which makes the efficiency of matrix multiplication a popular research topic. Although computing matrix products is a central operation in many numerical algorithms, it is potentially time consuming, making it one of the most well-studied problems in this field. In this paper, Message Passing Interface (MPI), MapReduce, and multithreaded methods are implemented to demonstrate their effectiveness in expediting matrix multiplication on a multi-core system. Simulation results show that the efficiency rates of MPI and MapReduce on the market shop application reach 90.11% and 47.94%, respectively, on a multi-core processor, outperforming the multithreaded and sequential methods.

Keywords: Business application · Hadoop · MPI · MapReduce · Matrix multiplication
1 Introduction

Matrix multiplication is a fundamental operation in linear algebra with many real-life applications. Mathematicians and research scientists have found numerous applications of matrices since the advent of personal and large-scale computers, which increased the use of matrices in a wide variety of fields, such as economics, engineering, statistics, and other sciences [14].

Market shop is a business application that uses spreadsheets for budgeting, sales projections, and cost estimation, which makes matrix multiplication useful in such applications. Matrix multiplication can dramatically reduce the labor involved in modeling processes that deal with multiple categories of employees, customers, districts, products, or supplies [22]. Beyond these naturally related applications, the focus of matrix multiplication research is its computational cost, which must be investigated thoroughly to enhance the efficiency of the implemented algorithms. Hence, over the years, several parallel and distributed matrix multiplication methods have been proposed to reduce the cost and time of matrix multiplication over multiple processors [5, 15].
Parallel and distributed computing systems are high-performance computing systems that spread a single application over many multi-core and multi-processor computers to complete the task rapidly. They divide large problems into smaller sub-problems and assign each sub-problem to a different processor, typically in a distributed system whose parts run concurrently in parallel. MapReduce [17] and the Message Passing Interface (MPI) are among these computing systems [10].

MapReduce is an algorithm design and processing paradigm proposed by Dean and Ghemawat in 2004 [7]. MapReduce enables efficient parallel and distributed computing and consists of two serial tasks, namely, map and reduce, each of which is implemented with several parallel subtasks. Specific MapReduce applications include MapReduce with expectation maximization for text filtering [31], MapReduce with K-means for remote-sensing image clustering [16], and MapReduce with decision trees for classification [19]. MapReduce has also been used for job scheduling [20] and real-time systems [15].

MPI is a standardized means of exchanging messages among multiple computers running a parallel program across distributed memory. MPI is generally considered the industry standard, and it forms the basis of most communication interfaces adopted by parallel computing programmers. MPI is used to improve scalability, performance, multi-core and cluster support, and interoperation with other applications [26].

In the current study, we applied efficient MapReduce matrix multiplication with an optimized mapper set [13], together with MPI-based multiplication, to a real-life business application. We used these methods to evaluate the performance of parallel and distributed matrix multiplication in business applications against the multithreaded and sequential methods.

The rest of the paper is organized as follows. Section 2 reviews work closely related to the use of matrix multiplication in applications. Section 3 presents the business application that uses matrix multiplication. Section 4 presents the matrix multiplication methods used. Section 5 presents the experimental results, and Sect. 6 summarizes and concludes the paper.
2 Related Work

Mathematicians and research scientists have found many applications of matrices since the advent of personal and large-scale computers, which increased the use of matrices in a wide variety of fields, such as economics, engineering, statistics, and other sciences. Traditional sequential algorithms for matrix multiplication consume considerable space and time. To enhance efficiency, the Fox [10], Cannon [4], and DNS [9] algorithms have been proposed for parallelizing matrix multiplication. To maximize efficiency, these approaches balance inter-process communication, dependencies, and the level of parallelism. Parallel matrix multiplication relies on the independence of the underlying operations: the element-to-element multiplications are independent of one another, as are the aggregations of their results.

Zhang et al. [21] presented an outsourcing computation schema in an amortized model for the multiplication of two arbitrary matrices that meets the requirements of both security and efficiency. They compared their scheme's functionality with existing works, such as Fiore's schema [6], Li's schema [11], and Jia's schema [12], and proved that their schema is more efficient in terms of functionality as well as computation, storage, and communication overhead.

Kumar et al. [14] proposed a privacy-preserving, verifiable, and efficient algorithm for matrix multiplication in the outsourcing paradigm, addressing the lack of local computing resources: a client holding a large dataset can perform matrix multiplication using a cloud server. They evaluated their algorithm on security, efficiency, and verifiability parameters. With high efficiency and practical usability, their algorithm can largely replace costly cryptographic operations and securely solve matrix multiplication.

Ann et al. [2] proposed an approach for calculating the output equation of the hybrid multilayered perceptron (HMLP). They calculated the HMLP output with a matrix multiplication method and compared it, in simulation, with a looping method. Their results confirmed that the output of the HMLP calculated using the proposed matrix multiplication method is the same as that calculated using the looping method; the difference is that the former is faster for HMLPs with more nodes, whereas the looping method is faster for HMLPs with fewer nodes.

Afroz et al. [1] focused on the time analysis of different matrix multiplication algorithms, combining the Karatsuba and Strassen methods to reduce time complexity and analyzing matrix multiplication constructions. They remarked that, where the methods are well characterized, the analysis can be used to pick between the different algorithms, thus creating a hybrid algorithm.
3 Business Applications

Market shop is a business application that uses matrix multiplication to calculate yearly, monthly, or even daily profit. Through matrix multiplication, the market shop also determines the most purchased product in the market and the effect of a discount day on the quantities purchased. The different matrix multiplication methods have been implemented on a market shop application that uses matrices to store the cost of each product and the quantity of each product purchased.

In the proposed market shop schema, the first matrix collects the cost of each product in the market, and the second matrix stores the quantity of each product purchased during the discount day. Figure 1 illustrates the matrix architecture of the market shop, and a small worked example follows the list below.

• In the product cost per day matrix, a row represents the cost of each product across different days, whereas a column represents the product prices for one day, before and after the discount day.
• In the product quantity matrix, a row represents the quantity of one product purchased during the different shifts of the discount day, whereas a column represents the quantities of all products purchased during the different shifts of the day.
• In the cost matrix, a row represents the total cost of all products during the different shifts of the discount day, whereas a column represents the total cost of one product during the different shifts of the day. The summation of this matrix equals the total profit of the market.
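To make the schema concrete, consider a toy instance with two products and two shifts; the numbers are purely illustrative and are not taken from the paper's dataset. With the per-product prices in $P$ and the per-shift purchased quantities in $Q$, the total-cost matrix is the ordinary product $PQ$:

\[
P = \begin{pmatrix} 5 & 2 \\ 4 & 1.5 \end{pmatrix}, \qquad
Q = \begin{pmatrix} 10 & 30 \\ 20 & 60 \end{pmatrix}, \qquad
PQ = \begin{pmatrix} 5(10)+2(20) & 5(30)+2(60) \\ 4(10)+1.5(20) & 4(30)+1.5(60) \end{pmatrix}
   = \begin{pmatrix} 90 & 270 \\ 70 & 210 \end{pmatrix}.
\]

Summing all entries of $PQ$ (here $90 + 270 + 70 + 210 = 640$) gives the total takings of the market, as described for the cost matrix above.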
Fig. 1. Matrix architecture of the market shop
4 Matrix Multiplication Methods

4.1 MPI

MPI is a library of routines that can be used to create parallel programs in C or Fortran 77. It runs with standard C or Fortran programs, using commonly available operating system services to create parallel processes and exchange information among these processes.
MPI is designed to let users create programs that run efficiently on most parallel architectures. The design process included vendors (such as IBM, Intel, TMC, Cray, and Convex), parallel library authors (involved in the development of PVM, Linda, and others), and application specialists [27].

MPI can also support distributed program execution on heterogeneous hardware; that is, a program may start processes on multiple computer systems to work on the same problem, which is useful with a workstation farm. These processes cannot communicate with each other through shared memory variables; instead, they use MPI communication routines. The two basic routines are MPI_Send, to send a message to another process, and MPI_Recv, to receive a message from another process. The MPI code was run on IMAN1, Jordan's first and fastest high-performance computing resource, funded by JAEC and SESAME [28].
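As a concrete illustration of the programming model, the following is a minimal sketch of a row-partitioned MPI matrix multiplication in C. It is not the code benchmarked in this paper; it uses the collective routines MPI_Scatter, MPI_Bcast, and MPI_Gather (built on the same message-passing machinery as MPI_Send/MPI_Recv) and assumes the matrix dimension is divisible by the number of processes.

```c
/* Minimal MPI matrix-multiplication sketch (illustrative only).
 * Rows of A are scattered across processes, B is broadcast, and the
 * partial result rows are gathered on rank 0.
 * Build: mpicc mm.c -o mm    Run: mpirun -np 4 ./mm */
#include <mpi.h>
#include <stdio.h>

#define N 8 /* matrix dimension, small for illustration; assume N % size == 0 */

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    static double A[N][N], B[N][N], C[N][N];
    static double localA[N][N], localC[N][N]; /* over-allocated for brevity */
    int rows = N / size;                      /* rows owned by each process */

    if (rank == 0) /* rank 0 prepares the inputs */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                A[i][j] = i + j;
                B[i][j] = (i == j); /* identity, so C must equal A */
            }

    /* Distribute the work: each process gets 'rows' rows of A and all of B. */
    MPI_Scatter(A, rows * N, MPI_DOUBLE, localA, rows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(B, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Each process multiplies its block of rows independently. */
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < N; j++) {
            localC[i][j] = 0.0;
            for (int k = 0; k < N; k++)
                localC[i][j] += localA[i][k] * B[k][j];
        }

    /* Collect the partial results on rank 0. */
    MPI_Gather(localC, rows * N, MPI_DOUBLE, C, rows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("C[0][0]=%.1f, C[N-1][N-1]=%.1f\n", C[0][0], C[N - 1][N - 1]);

    MPI_Finalize();
    return 0;
}
```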
4.2 MapReduce
Traditional parallel matrix multiplication has recently been complemented by MapReduce, a parallel and distributed framework for large-scale data [17]. Typical MapReduce-based matrix multiplication requires two MapReduce jobs. The first job creates the pairs of elements to be multiplied by combining the input arrays during its map task; the reduce task of this job is inactive. In the second job, the map task independently performs the multiplication on each pair of elements, and the reduce task aggregates the results corresponding to each output element.

Hadoop is an open-source Java platform for developing MapReduce applications; it implements the MapReduce model introduced by Google [17]. Figure 2 illustrates the Hadoop architecture.

In this paper, we used the MapReduce-based matrix multiplication proposed in [13], which reduces both time and memory utilization compared with existing schemas. In this technique, matrix multiplication is implemented as an element-to-block schema, as illustrated in Fig. 3. In the first schema, the first array is decomposed into individual elements, whereas the second array is decomposed into sub-row-based blocks. In the second schema, the first array is decomposed into sub-row-based blocks, and the second array is decomposed into sub-column-based blocks.
Fig. 2. Hadoop MapReduce architecture
Fig. 3. Efficient MapReduce matrix multiplication techniques
The number of mappers is determined by the size of the blocks generated from the second array and is selected according to the capacity of the underlying mappers. A small block size therefore increases the number of blocks, requiring additional mappers, and vice versa.

The map task (see Table 1) is responsible for the multiplication operations, whereas the reduce task is responsible for the summations. A pre-processing step reads an element from the first array and a block from the second array and merges them into one file. In matrix multiplication, an entire row of the first array must be multiplied with an entire column of the second array to produce one element of the output; thus, the results of each mapper in the proposed schemas are aggregated with the other multiplication results in the reduce task. A small sequential simulation of this flow follows Table 1.

Table 1. Efficient MapReduce matrix multiplication operations

| Scheme | Step | Input | Output |
|---|---|---|---|
| Element-by-row-block | Pre-process | Files | ⟨a_ij; b_kj, b_kj, …⟩ |
| | Map | ⟨a_ij; b_kj, b_kj, …⟩ | ⟨key, [a_ij·b_kj, …, a_ij·b_kj]⟩ |
| | Reduce | ⟨key, c_ij, c_ij, …⟩ | ⟨key, c_ij + c_ij + …⟩ |
| Row-block-by-column-block | Pre-process | Files | ⟨a_ij, a_ij, …; b_kj, b_kj⟩ |
| | Map | ⟨a_ij, a_ij, …; b_kj, b_kj⟩ | ⟨key, [a_ij·b_kj, …, a_ij·b_kj]⟩ |
| | Reduce | ⟨key, c_ij, c_ij, …⟩ | ⟨key, c_ij + c_ij + …⟩ |
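The following is a small sequential C simulation of the element-by-row-block schema, written only to make the data flow of Table 1 concrete; an actual deployment would express the map and reduce functions as Hadoop tasks. For brevity, the reducer's summation is folded into a keyed accumulator instead of an explicit shuffle.

```c
/* Sequential simulation of the element-by-row-block MapReduce schema
 * (illustrative sketch, not Hadoop code). Each map call receives one
 * element a[i][k] of the first matrix together with row k of the second
 * matrix and emits the partial products a[i][k] * b[k][j] under key (i, j);
 * the reduce phase sums all partial products sharing a key, modeled here
 * by the += into the keyed accumulator C. */
#include <stdio.h>

#define N 3

static double C[N][N]; /* reduce-side accumulator, key = output index (i, j) */

/* Map task: one element of A paired with one row-block of B. */
static void map(int i, double a_ik, const double b_row[N]) {
    for (int j = 0; j < N; j++)
        C[i][j] += a_ik * b_row[j]; /* emit(key=(i,j), a_ik * b_kj), reduced by += */
}

int main(void) {
    double A[N][N] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
    double B[N][N] = {{1, 0, 0}, {0, 1, 0}, {0, 0, 1}}; /* identity: C = A */

    /* In Hadoop, each (a[i][k], row k of B) pair goes to an independent
     * mapper; here the pairs are simply processed in sequence. */
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            map(i, A[i][k], B[k]);

    for (int i = 0; i < N; i++, puts(""))
        for (int j = 0; j < N; j++)
            printf("%6.1f", C[i][j]);
    return 0;
}
```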
5 Experiments and Results

The different matrix multiplication methods were implemented to compare their efficiency and running time on large matrices. All methods were tested on sparse and dense matrices of different sizes. The results of the tested methods are discussed below.
MPI: The MPI code was run on IMAN1, Jordan's first and fastest high-performance computing resource, funded by JAEC and SESAME. We worked on the Zaina server, an Intel Xeon-based computing cluster with 1G Ethernet interconnection. The cluster is mainly used for code development, code porting, and synchrotron radiation applications, and it is composed of two Dell PowerEdge R710 and five HP ProLiant DL140 G3 servers; its technical details are given in Table 2.

Table 2. Zaina technical details

| Servers | 7 (two Dell PowerEdge R710, five HP ProLiant DL140 G3) |
|---|---|
| CPU per server | Dell: 2 × 8-core Intel Xeon; HP: 2 × 4-core Intel Xeon |
| RAM per server | Dell: 16 GB; HP: 6 GB |
| Total storage | 1 TB NFS share |
| OS | Scientific Linux 6.4 |
The running times for different numbers of cores are shown in Table 3. As the table shows, for matrices smaller than 500 × 500, increasing the number of cores beyond 8 takes more time, because such a small problem does not need a large number of cores; matrices around 500 × 500 likewise gain nothing beyond 8 cores, as the extra communication makes the run less efficient and slower. In contrast, matrices of size 1000 × 1000 and above become more effective and efficient as the number of cores increases to 32, because the larger problem admits a higher degree of parallelism.

Table 3. MPI run time results (seconds)

| Matrix | 2 cores | 4 cores | 8 cores | 16 cores | 32 cores |
|---|---|---|---|---|---|
| Dense matrix | | | | | |
| 250 × 250 | 1.245 | 1.189 | 1.214 | 1.284 | 1.529 |
| 500 × 500 | 1.992 | 1.456 | 1.390 | 1.501 | 1.681 |
| 1000 × 1000 | 7.491 | 3.435 | 2.968 | 2.768 | 2.655 |
| 2000 × 2000 | 62.207 | 22.473 | 17.569 | 14.390 | 10.955 |
| 4000 × 4000 | 540.819 | 185.790 | 135.654 | 90.827 | 88.593 |
| Sparse matrix | | | | | |
| 250 × 250 | 1.243 | 1.190 | 1.213 | 1.316 | 1.552 |
| 500 × 500 | 1.971 | 1.453 | 1.424 | 1.376 | 1.691 |
| 1000 × 1000 | 7.586 | 3.469 | 2.786 | 2.488 | 2.731 |
| 2000 × 2000 | 62.498 | 22.398 | 17.363 | 12.719 | 11.933 |
| 4000 × 4000 | 537.697 | 201.295 | 137.665 | 107.900 | 101.656 |
The speedup is the ratio between the sequential time and the parallel time.
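In symbols, with sequential running time $T_{\mathrm{seq}}$ and parallel running time $T_{\mathrm{par}}(p)$ on $p$ cores,

\[
S(p) = \frac{T_{\mathrm{seq}}}{T_{\mathrm{par}}(p)},
\]

so, for example, combining Tables 3 and 6, the 4000 × 4000 dense matrix reaches a speedup of $S(32) = 834.501 / 88.593 \approx 9.4$ under MPI.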
Fig. 4. MPI speedup plotting
The speedups for different numbers of cores on sparse and dense matrices of different sizes are illustrated in Fig. 4. The results show that MPI achieves the best speedup values, especially with large numbers of processors on dense matrices.

MapReduce: The MapReduce results of matrix multiplication using Hadoop are presented for inputs of various sizes. A simple matrix multiplication job was first run on the platform with various block sizes to determine the optimal block length to give each mapper before running the actual job; the block size with the minimum running time was 20. The proposed schemes cut the running time because less sorting is required during the shuffle phase, and the scheme scales almost linearly as the matrix size grows. Results are given in Table 4. The speedups for sparse and dense matrices of different sizes are illustrated in Fig. 5; MapReduce achieves lower speedup values than MPI, especially with large numbers of processors on dense matrices.
Table 4. MapReduce run time results (seconds)

| Matrix | Dense matrix | Sparse matrix |
|---|---|---|
| 250 × 250 | 6.255 | 4.021 |
| 500 × 500 | 12.996 | 11.581 |
| 1000 × 1000 | 98.324 | 95.452 |
| 2000 × 2000 | 116.014 | 112.450 |
| 4000 × 4000 | 321.457 | 185.478 |
Fig. 5. MapReduce speedup plotting
Multithreaded: Matrix multiplication with different numbers of threads was tested on dense and sparse matrices of various sizes. Results are given in Table 5. As the table shows, for matrices smaller than 500 × 500, increasing the thread count to 4 requires more time, because such a small problem does not need a large number of threads, and matrices around 500 × 500 do not benefit beyond 16 threads. In contrast, matrices of size 1000 × 1000 and above become more effective and efficient when the number of threads is increased to 32, because the larger problem admits a higher degree of parallelism. An illustrative sketch of this method follows Table 5.

Table 5. Multithreaded run time results (seconds)

| Matrix | 2 threads | 4 threads | 8 threads | 16 threads | 32 threads |
|---|---|---|---|---|---|
| Dense matrix | | | | | |
| 250 × 250 | 11.408 | 11.527 | 10.049 | 9.038 | 8.255 |
| 500 × 500 | 45.485 | 41.675 | 39.271 | 36.734 | 36.996 |
| 1000 × 1000 | 299.734 | 234.888 | 225.802 | 215.119 | 204.367 |
| 2000 × 2000 | 444.012 | 380.215 | 315.241 | 248.562 | 220.145 |
| 4000 × 4000 | 750.125 | 664.021 | 601.241 | 521.045 | 412.547 |
| Sparse matrix | | | | | |
| 250 × 250 | 6.534 | 6.625 | 6.447 | 6.193 | 6.021 |
| 500 × 500 | 29.079 | 27.810 | 26.829 | 25.369 | 23.981 |
| 1000 × 1000 | 117.060 | 111.784 | 106.850 | 103.137 | 99.352 |
| 2000 × 2000 | 287.179 | 235.122 | 201.252 | 162.547 | 155.842 |
| 4000 × 4000 | 349.346 | 299.734 | 234.888 | 225.802 | 215.347 |
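The following is a minimal pthreads sketch of row-partitioned multithreaded matrix multiplication in C; it is an illustrative reconstruction, not the benchmark code used above, and it assumes the matrix dimension is divisible by the thread count.

```c
/* Illustrative multithreaded matrix multiplication: the rows of the output
 * matrix are split evenly across worker threads. Since the threads write
 * disjoint rows of C, no locking is required.
 * Build: gcc mt_mm.c -o mt_mm -pthread */
#include <pthread.h>
#include <stdio.h>

#define N 512
#define NTHREADS 4 /* assume N % NTHREADS == 0 */

static double A[N][N], B[N][N], C[N][N];

typedef struct { int row_begin, row_end; } Slice;

static void *worker(void *arg) {
    Slice *s = (Slice *)arg;
    for (int i = s->row_begin; i < s->row_end; i++)
        for (int j = 0; j < N; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];
            C[i][j] = sum;
        }
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            A[i][j] = 1.0;
            B[i][j] = 2.0; /* every entry of C should equal 2.0 * N */
        }

    pthread_t tid[NTHREADS];
    Slice slice[NTHREADS];
    int rows = N / NTHREADS; /* rows handled by each thread */
    for (int t = 0; t < NTHREADS; t++) {
        slice[t].row_begin = t * rows;
        slice[t].row_end = (t + 1) * rows;
        pthread_create(&tid[t], NULL, worker, &slice[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);

    printf("C[0][0] = %.1f (expected %.1f)\n", C[0][0], 2.0 * N);
    return 0;
}
```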
The speedups for sparse and dense matrices of different sizes are illustrated in Fig. 6; the multithreaded method achieves lower speedup values than both MPI and MapReduce.

Sequential: Sequential matrix multiplication was tested on dense and sparse matrices of various sizes. Results are given in Table 6.

For all sizes of dense and sparse matrices, MPI and MapReduce are faster and more efficient than the multithreaded and sequential methods, as the efficiency table below shows, and MPI outperforms MapReduce; the research goal is thus achieved. Comparison results are given in Table 7 and illustrated in Figs. 7 and 8.
Fig. 6. Multithreaded speedup plotting
Table 6. Sequential run time results (seconds)

| Matrix | Dense matrix | Sparse matrix |
|---|---|---|
| 250 × 250 | 12.826 | 7.359 |
| 500 × 500 | 55.656 | 33.875 |
| 1000 × 1000 | 349.346 | 132.083 |
| 2000 × 2000 | 514.501 | 450.485 |
| 4000 × 4000 | 834.501 | 533.269 |
Table 7. Time efficiency results

| Matrix | MapReduce vs. sequential | MPI vs. sequential | MPI vs. multithreaded | MapReduce vs. multithreaded |
|---|---|---|---|---|
| Dense matrix | | | | |
| 250 × 250 | 51.23% | 80.24% | 85.60% | 24.23% |
| 500 × 500 | 76.65% | 94.26% | 96.24% | 64.87% |
| 1000 × 1000 | 71.85% | 97.50% | 98.70% | 51.89% |
| 2000 × 2000 | 77.45% | 92.34% | 95.02% | 47.30% |
| 4000 × 4000 | 61.48% | 52.79% | 78.53% | 22.08% |
| Sparse matrix | | | | |
| 250 × 250 | 45.36% | 83.83% | 80.24% | 33.22% |
| 500 × 500 | 65.81% | 95.94% | 94.26% | 51.71% |
| 1000 × 1000 | 27.73% | 98.12% | 97.50% | 3.93% |
| 2000 × 2000 | 75.04% | 97.35% | 92.34% | 27.84% |
| 4000 × 4000 | 65.22% | 80.94% | 52.79% | 13.87% |
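The efficiency percentages in Table 7 appear to be relative reductions in running time, $E = (T_{\mathrm{ref}} - T)/T_{\mathrm{ref}}$, where $T_{\mathrm{ref}}$ is the sequential or multithreaded baseline time; this reading is consistent with most of the table entries. For instance, for the 1000 × 1000 sparse matrix, MapReduce versus the sequential method gives

\[
\frac{132.083 - 95.452}{132.083} \approx 27.73\%,
\]

matching the table.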
Fig. 7. Dense comparison plotting
Fig. 8. Sparse comparison plotting
6 Conclusion

On the basis of the conducted experimental study, MPI and MapReduce matrix multiplication are consistently faster than the multithreaded and sequential methods, with 90.11% and 47.94% efficiency, respectively. Parallel and distributed matrix multiplication methods therefore achieve their aim of reducing the cost and time of matrix multiplication over multiple processors. MPI matrix multiplication is also the more efficient of the two, as its advantage over the multithreaded and sequential methods grows with the matrix size.
References

1. Afroz, S., Tahaseen, M., Ahmed, F., Farshee, K.S., Huda, M.N.: Survey on matrix multiplication algorithms. In: 2016 5th International Conference on Informatics, Electronics and Vision (ICIEV), pp. 151–155. IEEE, May 2016
2. Ann, L.Y., Ehkan, P., Mashor, M.Y., Sharun, S.M.: Calculation of hybrid multi-layered perceptron neural network output using matrix multiplication. In: 2016 3rd International Conference on Electronic Design (ICED), pp. 497–500. IEEE, August 2016
3. Catalyurek, U.V., Aykanat, C.: Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication. IEEE Trans. Parallel Distrib. Syst. 10(7), 673–693 (1999)
4. Cannon, L.E.: A Cellular Computer to Implement the Kalman Filter Algorithm. No. 603-Tl-0769. Montana State Univ Bozeman Engineering Research Labs (1969)
5. Coppersmith, D., Winograd, S.: Matrix multiplication via arithmetic progressions. In: Proceedings of the Nineteenth Annual ACM Symposium on Theory of Computing, pp. 1–6. ACM, January 1987
6. Fiore, D., Gennaro, R.: Publicly verifiable delegation of large polynomials and matrix computations, with applications. In: Proceedings of the 2012 ACM Conference on Computer and Communications Security, pp. 501–512. ACM (2012)
7. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: OSDI, p. 10. USENIX (2004)
8. Dean, J., Ghemawat, S.: MapReduce: a flexible data processing tool. Commun. ACM 53(1), 72–77 (2010)
9. Dekel, E., Nassimi, D., Sahni, S.: Parallel matrix and graph algorithms. SIAM J. Comput. 10(4), 657–675 (1981)
10. Fox, G.C., Otto, S.W., Hey, A.J.G.: Matrix algorithms on a hypercube I: matrix multiplication. Parallel Comput. 4(1), 17–31 (1987)
11. Li, H., Zhang, S., Luan, T.H., Ren, H., Dai, Y., Zhou, L.: Enabling efficient publicly verifiable outsourcing computation for matrix multiplication. In: 2015 International Telecommunication Networks and Applications Conference (ITNAC), pp. 44–50. IEEE (2015)
12. Jia, K., Li, H., Liu, D., Yu, S.: Enabling efficient and secure outsourcing of large matrix multiplications. In: 2015 IEEE Global Communications Conference (GLOBECOM), pp. 1–6. IEEE (2015)
13. Kadhum, M., Qasem, M.H., Sleit, A., Sharieh, A.: Efficient MapReduce matrix multiplication with optimized mapper set. In: Computer Science On-line Conference, pp. 186–196. Springer, Cham, April 2017
14. Kumar, M., Meena, J., Vardhan, M.: Privacy preserving, verifiable and efficient outsourcing algorithm for matrix multiplication to a malicious cloud server. Cogent Eng. 1295783 (2017, just accepted)
15. Liu, X., Iftikhar, N., Xie, X.: Survey of real-time processing systems for big data. In: Proceedings of the 18th International Database Engineering and Applications Symposium. ACM (2014)
16. Lv, Z., et al.: Parallel K-means clustering of remote sensing images based on MapReduce
17. Norstad, J.: A MapReduce algorithm for matrix multiplication (2009). http://www.norstad.org/matrix-multiply/index.html. Accessed 19 Feb 2013
18. Thabet, K., Al-Ghuribi, S.: Matrix multiplication algorithms. Int. J. Comput. Sci. Netw. Secur. (IJCSNS) 12(2), 74 (2012)
19. Wu, G., et al.: MReC4.5: C4.5 ensemble classification with MapReduce. In: 2009 Fourth ChinaGrid Annual Conference. IEEE (2009)
20. Zaharia, M., et al.: Job scheduling for multi-user MapReduce clusters. EECS Department, University of California, Berkeley, Technical report UCB/EECS-2009-55 (2009)
21. Zhang, S., Li, H., Jia, K., Dai, Y., Zhao, L.: Efficient secure outsourcing computation of matrix multiplication in cloud computing. In: 2016 IEEE Global Communications Conference (GLOBECOM), pp. 1–6. IEEE, December 2016
22. Saadeh, M., Saadeh, H., Qatawneh, M.: Performance evaluation of parallel sorting algorithms on IMAN1 supercomputer. Int. J. Adv. Sci. Technol. 95, 57–72 (2016)
23. Mohammed, Q.: Embedding linear array network into the tree-hypercube network. Eur. J. Sci. Res. 10(2), 72–76 (2005)
24. Qatawneh, M., Alamoush, A., Alqatawna, J.: Section based hex-cell routing algorithm (SBHCR). Int. J. Comput. Netw. Commun. (IJCNC) 7(1) (2015)
25. Qatawneh, M.: Multilayer hex-cells: a new class of hex-cell interconnection networks for massively parallel systems. Int. J. Commun. Netw. Syst. Sci. 4(11), 704–708 (2011)
26. Qatawneh, M.: Embedding binary tree and bus into hex-cell interconnection network. J. Am. Sci. 7(12) (2011)
27. Mohammad, Q., Khattab, H.: New routing algorithm for hex-cell network. Int. J. Future Gener. Commun. Netw. 8(2) (2015)
28. Qatawneh, M.: New efficient algorithm for mapping linear array into hex-cell network. Int. J. Adv. Sci. Technol. 90 (2016)
29. Qasem, M.H., Al Assaf, M.M., Rodan, A.: Data mining approach for commercial data classification and migration in hybrid storage systems. World Acad. Sci. Eng. Technol. Int. J. Comput. Electr. Autom. Control Inf. Eng. 10(3), 481–484 (2016)
30. Qasem, M.H., Faris, H., Rodan, A., Sheta, A.: Empirical evaluation of the cycle reservoir with regular jumps for time series forecasting: a comparison study. In: Computer Science On-line Conference, pp. 115–124. Springer, Cham, April 2017
31. Lin, J., Dyer, C.: Data-intensive text processing with MapReduce. Synth. Lect. Hum. Lang. Technol. 3(1), 1–177 (2010)