Performance-Based Parallel Loop Self-scheduling on Heterogeneous Multicore PC Clusters*

Chao-Tung Yang1,**, Jen-Hsiang Chang1, and Chao-Chin Wu2

1 High-Performance Computing Laboratory, Department of Computer Science, Tunghai University, Taichung 40704, Taiwan ROC
{ctyang,g95290009}@thu.edu.tw
2 Department of Computer Science and Information Engineering, National Changhua University of Education, Changhua 50074, Taiwan
[email protected]
Abstract. In recent years, multicore computers have been widely included in cluster systems; they adopt shared-memory architectures. However, previous research on parallel loop self-scheduling did not consider this feature of multicore computers. Shared-memory multiprocessors are better served by OpenMP for parallel programming. In this paper, we propose a performance-based approach that partitions loop iterations according to the performance weighting of cluster nodes. Because the iterations assigned to one MPI process are processed in parallel by OpenMP threads running on the processor cores of the same computational node, the number of loop iterations allocated to a computational node at each scheduling step also depends on the number of processor cores in that node. Experimental results show that the proposed approach performs better than previous schemes.

Keywords: Self-scheduling, Parallel loop scheduling, Multicore, Cluster, OpenMP, MPI.
1 Introduction

Recently, more and more cluster systems include multicore computers, because almost all commodity personal computers now have multicore architectures. The primary feature of multicore architectures is that multiple processors on the same chip can communicate with each other by directly accessing data in the shared memory. In this paper, we revise popular loop self-scheduling schemes to fit such cluster computing environments. The HPCC performance analyzer [1] is used to estimate the performance of all nodes rather accurately. The MPI library is usually used for parallel programming in cluster systems because it follows the message-passing programming model. However, MPI is not the best choice for multicore computers. Instead, OpenMP is very suitable for multicore computers because it is a shared-memory programming model.
* This work is supported in part by the National Science Council, Taiwan R.O.C., under grants no. NSC 96-2221-E-029-019-MY3 and NSC 98-2220-E-029-004.
** Corresponding author.
Therefore, in this paper we propose a hybrid MPI and OpenMP programming model to design the loop self-scheduling scheme for cluster systems with multicore computers.
2 Background Review

2.1 Multicore and Cluster Systems

In a multicore processor, two or more independent cores are combined in a single integrated circuit. A system with n cores is most effective when presented with n or more concurrent threads. The performance gain from a multicore processor depends on the problem being solved and the algorithms used, as well as on their implementation in software. Cluster systems, in turn, divide large-scale programs into several smaller programs for parallel execution on more than one computer in order to reduce processing time. Modern cluster systems communicate mainly through Local Area Networks (LANs) and can be broadly divided into homogeneous and heterogeneous systems.

2.2 Loop Self-scheduling Schemes

Self-scheduling schemes are mainly used to deal with load balancing [8], and they can be divided into two types: static and dynamic [6, 7]. Static scheduling schemes decide how many loop iterations are assigned to each processor at compile time. The number of processors available and the estimated performance of each machine are taken into consideration when implementing programs with static scheduling. The advantage of static scheduling schemes is that they incur no scheduling overhead at runtime. In contrast, dynamic scheduling is more suitable for load balancing because it makes scheduling decisions at runtime; no estimations or predictions are required. Self-scheduling is a large class of adaptive, dynamic, centralized loop scheduling schemes. Initially, a portion of the loop iterations is scheduled to all processors. As soon as a slave processor becomes idle after finishing its assigned workload, it requests more of the unscheduled iterations.
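As a concrete example, guided self-scheduling (GSS) [5], which the approach in Section 3 also builds on, assigns each requesting processor a chunk equal to the number of remaining iterations divided by the number of processors, so early chunks are large and later chunks shrink. A minimal C sketch of this chunk-size rule (our own naming, shown only for illustration):

    /* GSS chunk-size rule: each request receives ceil(remaining / P)
     * iterations, where P is the number of processors. */
    static long gss_chunk(long remaining, int num_procs)
    {
        if (remaining <= 0 || num_procs <= 0)
            return 0;
        return (remaining + num_procs - 1) / num_procs;   /* ceiling division */
    }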
3 Proposed Approach

Fig. 1 illustrates our approach. All the loop iterations are kept by the global scheduler. Slave cores are not allowed to request iterations directly from the global scheduler; instead, they have to request them from the local scheduler on the same computing node. To exploit the shared-memory architecture, each local-scheduler MPI process creates one OpenMP thread per processor core on its resident computing node. The messages between the global scheduler and a local scheduler are inter-node communications; they are MPI messages. In contrast, the messages between a local scheduler and its processor cores are intra-node communications.
Fig. 1. Multicore computing node communications with MPI processes and OpenMP threads
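To make the interaction in Fig. 1 concrete, the following sketch (our illustration only; the message tags, the [start, end) chunk encoding, and the function names are assumptions rather than the authors' code) shows how a local-scheduler MPI process on one multicore node could repeatedly request a chunk of iterations from the global scheduler via MPI and then execute the chunk with OpenMP threads, one per processor core:

    #include <mpi.h>
    #include <omp.h>

    #define TAG_REQUEST 1   /* hypothetical message tags */
    #define TAG_CHUNK   2

    /* Local scheduler running on one multicore node (an MPI rank other than
     * the global scheduler, which is assumed to be rank 0 here). */
    void local_scheduler(void (*work)(long iteration))
    {
        long chunk[2];   /* assigned iterations as [start, end) */
        for (;;) {
            long request = 1;
            MPI_Send(&request, 1, MPI_LONG, 0, TAG_REQUEST, MPI_COMM_WORLD);
            MPI_Recv(chunk, 2, MPI_LONG, 0, TAG_CHUNK, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            if (chunk[0] >= chunk[1])      /* empty chunk: nothing left to do */
                break;
            /* Intra-node level: OpenMP threads share the chunk among the cores. */
            #pragma omp parallel for schedule(dynamic)
            for (long i = chunk[0]; i < chunk[1]; i++)
                work(i);
        }
    }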
In this context, we propose to allocate α% of the workload according to a performance weight derived from the CPU clock speed and the HPCC [1] measurement of each node; the remaining workload is dispatched by a well-known self-scheduling scheme such as GSS [5]. With this approach, we need to know the real performance of each computer, obtained with the HPCC benchmarks. Then we can distribute an appropriate workload to each node, and load balancing can be achieved. The more accurate the estimation is, the better the load balance will be. The performance weight of node i is defined as

PW_i = β · CS_i / Σ_{j∈S} CS_j + (1 − β) · HPL_i / Σ_{j∈S} HPL_j    (1)

where S is the set of all cluster nodes, CS_i is the CPU clock speed of node i (a constant attribute), HPL_i is the HPL measurement of HPCC analyzed above, and β is the ratio between the two values.

This algorithm is based on the message-passing paradigm and consists of two modules: a master module and a slave module. The master module makes the scheduling decisions and dispatches workloads to the slaves; the slave modules process the assigned work. The algorithm is only a skeleton; detailed implementation issues, such as data preparation and parameter passing, may differ according to the requirements of various applications.
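A minimal sketch of how the performance weights of Eq. (1) and the statically dispatched α% portion could be computed by the master module (our illustration; the array names, the percentage convention for α, and the rounding are assumptions):

    /* Compute each node's share of the alpha% of iterations that is
     * dispatched statically; the weights follow Eq. (1), combining CPU
     * clock speed (cs) and the HPL measurement (hpl) with ratio beta. */
    void static_shares(int n_nodes, const double *cs, const double *hpl,
                       double alpha, double beta,
                       long total_iters, long *share)
    {
        double cs_sum = 0.0, hpl_sum = 0.0;
        for (int i = 0; i < n_nodes; i++) {
            cs_sum  += cs[i];
            hpl_sum += hpl[i];
        }
        long static_part = (long)(total_iters * alpha / 100.0);
        for (int i = 0; i < n_nodes; i++) {
            double pw = beta * cs[i] / cs_sum
                      + (1.0 - beta) * hpl[i] / hpl_sum;
            share[i] = (long)(static_part * pw);
        }
        /* The remaining (100 - alpha)% of the iterations are dispatched
         * dynamically at runtime, e.g. by GSS. */
    }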
4 Experimental Environment and Results

The performance of our scheme is compared with that of other static and dynamic schemes on a heterogeneous cluster. We implemented three classes of applications in C, using MPI and OpenMP directives to parallelize code segments for execution on our testbed: Matrix Multiplication, Mandelbrot Set Computation, and Circuit Satisfiability; a sketch of one such parallelized loop is given below.

4.1 Hardware Configuration and Specifications

We built a testbed consisting of fourteen nodes. The hardware and software configuration is specified in Table 1. The network topology is shown in Fig. 2.
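As an example of the kind of code segment the applications parallelize with OpenMP directives inside one assigned chunk (a sketch under our own naming, not the authors' implementation), the rows [start, end) of C = A × B handled by a node can be computed as:

    /* Multiply rows [start, end) of A by B into C (all n x n, row-major);
     * the OpenMP threads on this node split the assigned rows among the cores. */
    void matmul_rows(int n, const double *A, const double *B, double *C,
                     long start, long end)
    {
        #pragma omp parallel for
        for (long i = start; i < end; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += A[i * n + k] * B[k * n + j];
                C[i * n + j] = sum;
            }
    }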
Fig. 2. Network route state

Table 1. Our cluster system configuration

Host   Processor Model                      # of CPUs  # of Cores  RAM   NIC     OS (kernel version)
quad1  Intel Core 2 Quad Q6600 @ 2.40GHz    1          4           2GB   1 GbE   2.6.23.1-42.fc8
quad2  Intel Core 2 Quad Q6600 @ 2.40GHz    1          4           2GB   1 GbE   2.6.23.1-42.fc8
quad3  Intel Core 2 Quad Q6600 @ 2.40GHz    1          4           2GB   1 GbE   2.6.23.1-42.fc8
quad4  Intel Core 2 Quad Q6600 @ 2.40GHz    1          4           2GB   1 GbE   2.6.23.1-42.fc8
oct1   Intel Xeon E5410 @ 2.33GHz           2          8           8GB   1 GbE   2.6.21-xen_Xen
oct2   Intel Xeon E5410 @ 2.33GHz           2          8           8GB   1 GbE   2.6.21-xen_Xen
oct3   Intel Xeon E5410 @ 2.33GHz           2          8           8GB   1 GbE   2.6.21-xen_Xen
oct4   Intel Xeon E5410 @ 2.33GHz           2          8           8GB   1 GbE   2.6.21-xen_Xen
t1     Intel Xeon E5310 @ 1.60GHz           2          8           8GB   1 GbE   2.6.25.4-10.fc8
t2     Intel Xeon E5310 @ 1.60GHz           2          8           8GB   1 GbE   2.6.25.4-10.fc8
t3     Intel Xeon E5410 @ 2.33GHz           2          8           8GB   1 GbE   2.6.25.4-10.fc8
t4     Intel Xeon E5410 @ 2.33GHz           2          8           8GB   1 GbE   2.6.25.4-10.fc8
eta9   Intel Xeon E5420 @ 2.50GHz           2          8           4GB   1 GbE   2.6.25-14.fc9
s1     Intel Xeon E5310 @ 1.60GHz           2          8           4GB   1 GbE   2.6.18-128.1.6.el5
4.2 Experimental Results

In our experiments, the HPL measurements and CPU speeds of all nodes were collected first. Next, the impact of the parameters α and β on performance was investigated. With this approach, a faster node gets proportionally more workload than a slower one. In the matrix multiplication experiment, we find that the proposed schemes achieve better performance when α is 40 and β is 0.4. Fig. 3 compares the execution times of the traditional schemes, the dynamic hybrid schemes (matn*ss3), and the proposed scheme (matn*ss3_omp).
Fig. 3. Performance comparison of the dynamic hybrid (matn*ss3) and proposed OpenMP (matn*ss3_omp) schemes for Matrix Multiplication: execution time (sec) versus matrix size, 1024×1024 to 4096×4096
In the Mandelbrot set computation example, we found that the proposed schemes achieved better performance when α was 40 and β was about 0.4. Fig. 4 shows execution times for the dynamic hybrid scheme (mann*ss3) and the proposed OpenMP scheme (mann*ss3_omp) under the GSS, FSS, and TSS group approaches.
Fig. 4. Performance comparison of the (mann*ss3) and (mann*ss3_omp) schemes for Mandelbrot set computation: execution time (sec) versus image size, 512×512 to 2048×2048

Fig. 5. Performance comparison of the (satn*ss3) and (satn*ss3_omp) schemes for Circuit Satisfiability: execution time (sec) versus number of variables, 18 to 20
In the Circuit Satisfiability example, we found that the proposed schemes achieved better performance when α was 30 and β was about 0.5. Fig. 5 shows execution times for the dynamic hybrid scheme (satn*ss3) and the proposed OpenMP scheme (satn*ss3_omp).
5 Conclusion and Future Work

In this paper, we used the hybrid MPI and OpenMP programming model to design parallel loop self-scheduling schemes for heterogeneous cluster systems with multicore computers. We proposed a heuristic scheme that combines the advantages of static and dynamic loop scheduling and compared it experimentally with previous algorithms in this environment. In each case, our approach obtains a performance improvement over previous schemes. In future work, we hope to find better ways to model the performance functions, for example by incorporating the amount of available memory, memory access costs, network information, and CPU load. A theoretical analysis of the proposed method will also be addressed.
References

1. HPC Challenge Benchmark, http://icl.cs.utk.edu/hpcc/
2. Bennett, B.H., Davis, E., Kunau, T., Wren, W.: Beowulf Parallel Processing for Dynamic Load-balancing. In: Proceedings of the IEEE Aerospace Conference, vol. 4, pp. 389–395 (2000)
3. Chronopoulos, A.T., Andonie, R., Benche, M., Grosu, D.: A Class of Loop Self-Scheduling for Heterogeneous Clusters. In: Proceedings of the 2001 IEEE International Conference on Cluster Computing, pp. 282–291 (2001)
4. Hummel, S.F., Schonberg, E., Flynn, L.E.: Factoring: A Method for Scheduling Parallel Loops. Communications of the ACM 35, 90–101 (1992)
5. Polychronopoulos, C.D., Kuck, D.: Guided Self-Scheduling: A Practical Scheduling Scheme for Parallel Supercomputers. IEEE Transactions on Computers 36(12), 1425–1439 (1987)
6. Yang, C.-T., Cheng, K.-W., Shih, W.-C.: On Development of an Efficient Parallel Loop Self-Scheduling for Grid Computing Environments. Parallel Computing 33(7-8), 467–487 (2007)
7. Yang, C.-T., Cheng, K.-W., Li, K.-C.: An Enhanced Parallel Loop Self-Scheduling Scheme for Cluster Environments. The Journal of Supercomputing 34(3), 315–335 (2005)
8. Yagoubi, B., Slimani, Y.: Load Balancing Strategy in Grid Environment. Journal of Information Technology and Applications 1(4), 285–296 (2007)