Performance-Based Parallel Loop Self-scheduling on Heterogeneous Multicore PC Clusters*

Chao-Tung Yang1,**, Jen-Hsiang Chang1, and Chao-Chin Wu2

1 High-Performance Computing Laboratory, Department of Computer Science, Tunghai University, Taichung 40704, Taiwan ROC
{ctyang,g95290009}@thu.edu.tw
2 Department of Computer Science and Information Engineering, National Changhua University of Education, Changhua 50074, Taiwan
[email protected]
Abstract. In recent years, multicore computers have been widely included in cluster systems; they adopt shared-memory architectures. However, previous research on parallel loop self-scheduling did not consider this feature of multicore computers. Shared-memory multiprocessors are better served by OpenMP for parallel programming. In this paper, we propose a performance-based approach that partitions loop iterations according to the performance weighting of cluster nodes. Because the iterations assigned to one MPI process are processed in parallel by OpenMP threads running on the processor cores of the same computational node, the number of loop iterations allocated to a computational node at each scheduling step also depends on the number of processor cores in that node. Experimental results show that the proposed approach performs better than previous schemes.

Keywords: Self-scheduling, Parallel loop scheduling, Multicore, Cluster, OpenMP, MPI.
1 Introduction

Recently, more and more cluster systems include multicore computers, because almost all commodity personal computers now have multicore architectures. The primary feature of multicore architectures is that multiple processors on the same chip can communicate with each other by directly accessing data in the shared memory. In this paper, we revise popular loop self-scheduling schemes to fit such cluster computing environments. The HPCC performance analyzer [1] is used to estimate the performance of all nodes rather accurately. The MPI library is usually used for parallel programming in cluster systems because it follows the message-passing programming model. However, MPI is not the best choice for multicore computers. Instead, OpenMP is very suitable for multicore computers because it is a shared-memory programming model.
* This work is supported in part by the National Science Council, Taiwan R.O.C., under grants no. NSC 96-2221-E-029-019-MY3 and NSC 98-2220-E-029-004.
** Corresponding author.
Therefore, in this paper we propose a hybrid MPI and OpenMP programming model to design the loop self-scheduling scheme for cluster systems with multicore computers.
2 Background Review

2.1 Multicore and Cluster Systems

In a multicore processor, two or more independent cores are combined in a single integrated circuit. A system with n cores is most effective when presented with n or more concurrent threads. The performance gain from a multicore processor depends on the problem being solved and the algorithms used, as well as on their implementation in software. Cluster systems, in turn, divide large-scale programs into several smaller programs for parallel execution on more than one computer in order to reduce processing time. Modern cluster systems communicate mainly through Local Area Networks (LANs) and can be broadly divided into homogeneous and heterogeneous systems.

2.2 Loop Self-scheduling Schemes

Self-scheduling schemes are mainly used to deal with load balancing [8], and they can be divided into two types: static and dynamic [6, 7]. Static scheduling schemes decide how many loop iterations are assigned to each processor at compile time. The number of processors available and the estimated performance of each machine are taken into consideration when implementing programs with static scheduling. The advantage of static scheduling schemes is that they incur no scheduling overhead at runtime. In contrast, dynamic scheduling is more suitable for load balancing because it makes scheduling decisions at runtime; no estimations or predictions are required. Self-scheduling is a large class of adaptive, dynamic, centralized loop scheduling schemes. Initially, a portion of the loop iterations is scheduled to all processors. As soon as a slave processor becomes idle after finishing its assigned workload, it requests more of the unscheduled iterations.
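As a concrete example, guided self-scheduling (GSS) [5], which the approach in Section 3 also builds on, assigns each requesting processor a chunk equal to the number of remaining iterations divided by the number of processors, so early chunks are large and later chunks shrink. A minimal C sketch of this chunk-size rule (our own naming, shown only for illustration):

    /* GSS chunk-size rule: each request receives ceil(remaining / P)
     * iterations, where P is the number of processors. */
    static long gss_chunk(long remaining, int num_procs)
    {
        if (remaining <= 0 || num_procs <= 0)
            return 0;
        return (remaining + num_procs - 1) / num_procs;   /* ceiling division */
    }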
3 Proposed Approach

Fig. 1 illustrates our approach. All the loop iterations are kept by the global scheduler. Slave cores are not allowed to request iterations directly from the global scheduler; instead, they have to request them from the local scheduler on the same computing node. To exploit the shared-memory architecture, each local-scheduler MPI process creates one OpenMP thread per processor core on its resident computing node. The messages between the global scheduler and a local scheduler are inter-node communications; they are MPI messages. In contrast, the messages between a local scheduler and its processor cores are intra-node communications.
Fig. 1. Multicore computing node communications with MPI processes and OpenMP threads
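To make the interaction in Fig. 1 concrete, the following sketch (our illustration only; the message tags, the [start, end) chunk encoding, and the function names are assumptions rather than the authors' code) shows how a local-scheduler MPI process on one multicore node could repeatedly request a chunk of iterations from the global scheduler via MPI and then execute the chunk with OpenMP threads, one per processor core:

    #include <mpi.h>
    #include <omp.h>

    #define TAG_REQUEST 1   /* hypothetical message tags */
    #define TAG_CHUNK   2

    /* Local scheduler running on one multicore node (an MPI rank other than
     * the global scheduler, which is assumed to be rank 0 here). */
    void local_scheduler(void (*work)(long iteration))
    {
        long chunk[2];   /* assigned iterations as [start, end) */
        for (;;) {
            long request = 1;
            MPI_Send(&request, 1, MPI_LONG, 0, TAG_REQUEST, MPI_COMM_WORLD);
            MPI_Recv(chunk, 2, MPI_LONG, 0, TAG_CHUNK, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            if (chunk[0] >= chunk[1])      /* empty chunk: nothing left to do */
                break;
            /* Intra-node level: OpenMP threads share the chunk among the cores. */
            #pragma omp parallel for schedule(dynamic)
            for (long i = chunk[0]; i < chunk[1]; i++)
                work(i);
        }
    }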
In this context, we propose to allocate α% of the workload according to a performance weight derived from the CPU clock speed and the HPCC [1] measurement of each node; the remaining workload is dispatched by a well-known self-scheduling scheme such as GSS [5]. With this approach, we need to know the real performance of each computer, obtained with the HPCC benchmarks. Then we can distribute an appropriate workload to each node, and load balancing can be achieved. The more accurate the estimation is, the better the load balance will be. The performance weight of node i is defined as

PW_i = β · CS_i / Σ_{j∈S} CS_j + (1 − β) · HPL_i / Σ_{j∈S} HPL_j    (1)

where S is the set of all cluster nodes, CS_i is the CPU clock speed of node i (a constant attribute), HPL_i is the HPL measurement of HPCC analyzed above, and β is the ratio between the two values.

This algorithm is based on the message-passing paradigm and consists of two modules: a master module and a slave module. The master module makes the scheduling decisions and dispatches workloads to the slaves; the slave modules process the assigned work. The algorithm is only a skeleton; detailed implementation issues, such as data preparation and parameter passing, may differ according to the requirements of various applications.
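A minimal sketch of how the performance weights of Eq. (1) and the statically dispatched α% portion could be computed by the master module (our illustration; the array names, the percentage convention for α, and the rounding are assumptions):

    /* Compute each node's share of the alpha% of iterations that is
     * dispatched statically; the weights follow Eq. (1), combining CPU
     * clock speed (cs) and the HPL measurement (hpl) with ratio beta. */
    void static_shares(int n_nodes, const double *cs, const double *hpl,
                       double alpha, double beta,
                       long total_iters, long *share)
    {
        double cs_sum = 0.0, hpl_sum = 0.0;
        for (int i = 0; i < n_nodes; i++) {
            cs_sum  += cs[i];
            hpl_sum += hpl[i];
        }
        long static_part = (long)(total_iters * alpha / 100.0);
        for (int i = 0; i < n_nodes; i++) {
            double pw = beta * cs[i] / cs_sum
                      + (1.0 - beta) * hpl[i] / hpl_sum;
            share[i] = (long)(static_part * pw);
        }
        /* The remaining (100 - alpha)% of the iterations are dispatched
         * dynamically at runtime, e.g. by GSS. */
    }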
4 Experimental Environment and Results

The performance of our scheme is compared with that of other static and dynamic schemes on a heterogeneous cluster. We implemented three classes of applications in C, using MPI and OpenMP directives to parallelize code segments for execution on our testbed: Matrix Multiplication, Mandelbrot Set Computation, and Circuit Satisfiability; a sketch of one such parallelized loop is given below.

4.1 Hardware Configuration and Specifications

We built a testbed consisting of fourteen nodes. The hardware and software configuration is specified in Table 1. The network topology is shown in Fig. 2.
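As an example of the kind of code segment the applications parallelize with OpenMP directives inside one assigned chunk (a sketch under our own naming, not the authors' implementation), the rows [start, end) of C = A × B handled by a node can be computed as:

    /* Multiply rows [start, end) of A by B into C (all n x n, row-major);
     * the OpenMP threads on this node split the assigned rows among the cores. */
    void matmul_rows(int n, const double *A, const double *B, double *C,
                     long start, long end)
    {
        #pragma omp parallel for
        for (long i = start; i < end; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += A[i * n + k] * B[k * n + j];
                C[i * n + j] = sum;
            }
    }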
Fig. 2. Network route state

Table 1. Our cluster system configuration

Host   Processor Model                      # of CPUs  # of Cores  RAM   NIC     OS (kernel version)
quad1  Intel Core 2 Quad Q6600 @ 2.40GHz    1          4           2GB   1 GbE   2.6.23.1-42.fc8
quad2  Intel Core 2 Quad Q6600 @ 2.40GHz    1          4           2GB   1 GbE   2.6.23.1-42.fc8
quad3  Intel Core 2 Quad Q6600 @ 2.40GHz    1          4           2GB   1 GbE   2.6.23.1-42.fc8
quad4  Intel Core 2 Quad Q6600 @ 2.40GHz    1          4           2GB   1 GbE   2.6.23.1-42.fc8
oct1   Intel Xeon E5410 @ 2.33GHz           2          8           8GB   1 GbE   2.6.21-xen_Xen
oct2   Intel Xeon E5410 @ 2.33GHz           2          8           8GB   1 GbE   2.6.21-xen_Xen
oct3   Intel Xeon E5410 @ 2.33GHz           2          8           8GB   1 GbE   2.6.21-xen_Xen
oct4   Intel Xeon E5410 @ 2.33GHz           2          8           8GB   1 GbE   2.6.21-xen_Xen
t1     Intel Xeon E5310 @ 1.60GHz           2          8           8GB   1 GbE   2.6.25.4-10.fc8
t2     Intel Xeon E5310 @ 1.60GHz           2          8           8GB   1 GbE   2.6.25.4-10.fc8
t3     Intel Xeon E5410 @ 2.33GHz           2          8           8GB   1 GbE   2.6.25.4-10.fc8
t4     Intel Xeon E5410 @ 2.33GHz           2          8           8GB   1 GbE   2.6.25.4-10.fc8
eta9   Intel Xeon E5420 @ 2.50GHz           2          8           4GB   1 GbE   2.6.25-14.fc9
s1     Intel Xeon E5310 @ 1.60GHz           2          8           4GB   1 GbE   2.6.18-128.1.6.el5
4.2 Experimental Results

In our experiments, the HPL measurements and CPU speeds of all nodes were collected first. Next, the impact of the parameters α and β on performance was investigated. With this approach, a faster node gets proportionally more workload than a slower one. In the matrix multiplication experiment, we find that the proposed schemes achieve better performance when α is 40 and β is 0.4. Fig. 3 compares the execution times of the traditional schemes, the dynamic hybrid schemes (matn*ss3), and the proposed scheme (matn*ss3_omp).
Fig. 3. Performance comparison of the dynamic hybrid (matn*ss3) and proposed OpenMP (matn*ss3_omp) schemes for Matrix Multiplication: execution time (sec) versus matrix size, 1024×1024 to 4096×4096
In the Mandelbrot set computation example, we found that the proposed schemes achieved better performance when α was 40 and β was about 0.4. Fig. 4 shows execution times for the dynamic hybrid scheme (mann*ss3) and the proposed OpenMP scheme (mann*ss3_omp) under the GSS, FSS, and TSS group approaches.
Fig. 4. Performance comparison of the (mann*ss3) and (mann*ss3_omp) schemes for Mandelbrot set computation: execution time (sec) versus image size, 512×512 to 2048×2048

Fig. 5. Performance comparison of the (satn*ss3) and (satn*ss3_omp) schemes for Circuit Satisfiability: execution time (sec) versus number of variables, 18 to 20
In the Circuit Satisfiability example, we found that the proposed schemes achieved better performance when α was 30 and β was about 0.5. Fig. 5 shows execution times for the dynamic hybrid scheme (satn*ss3) and the proposed OpenMP scheme (satn*ss3_omp).
5 Conclusion and Future Work

In this paper, we used the hybrid MPI and OpenMP programming model to design parallel loop self-scheduling schemes for heterogeneous cluster systems with multicore computers. We proposed a heuristic scheme that combines the advantages of static and dynamic loop scheduling and compared it experimentally with previous algorithms in this environment. In each case, our approach obtains a performance improvement over previous schemes. In future work, we hope to find better ways to model the performance functions, for example by incorporating the amount of available memory, memory access costs, network information, and CPU load. A theoretical analysis of the proposed method will also be addressed.
References

1. HPC Challenge Benchmark, http://icl.cs.utk.edu/hpcc/
2. Bennett, B.H., Davis, E., Kunau, T., Wren, W.: Beowulf Parallel Processing for Dynamic Load-balancing. In: Proceedings of the IEEE Aerospace Conference, vol. 4, pp. 389–395 (2000)
3. Chronopoulos, A.T., Andonie, R., Benche, M., Grosu, D.: A Class of Loop Self-Scheduling for Heterogeneous Clusters. In: Proceedings of the 2001 IEEE International Conference on Cluster Computing, pp. 282–291 (2001)
4. Hummel, S.F., Schonberg, E., Flynn, L.E.: Factoring: A Method for Scheduling Parallel Loops. Communications of the ACM 35, 90–101 (1992)
5. Polychronopoulos, C.D., Kuck, D.: Guided Self-Scheduling: A Practical Scheduling Scheme for Parallel Supercomputers. IEEE Transactions on Computers 36(12), 1425–1439 (1987)
6. Yang, C.-T., Cheng, K.-W., Shih, W.-C.: On Development of an Efficient Parallel Loop Self-Scheduling for Grid Computing Environments. Parallel Computing 33(7-8), 467–487 (2007)
7. Yang, C.-T., Cheng, K.-W., Li, K.-C.: An Enhanced Parallel Loop Self-Scheduling Scheme for Cluster Environments. The Journal of Supercomputing 34(3), 315–335 (2005)
8. Yagoubi, B., Slimani, Y.: Load Balancing Strategy in Grid Environment. Journal of Information Technology and Applications 1(4), 285–296 (2007)