J Supercomput (2012) 60:31–61 DOI 10.1007/s11227-009-0271-z

Using hybrid MPI and OpenMP programming to optimize communications in parallel loop self-scheduling schemes for multicore PC clusters Chao-Chin Wu · Lien-Fu Lai · Chao-Tung Yang · Po-Hsun Chiu

Published online: 25 February 2009 © Springer Science+Business Media, LLC 2009

Abstract Recently, a series of parallel loop self-scheduling schemes have been proposed, especially for heterogeneous cluster systems. However, they employed the MPI programming model to construct the applications without considering whether the computing nodes have multicore architectures or not. As a result, every processor core has to communicate directly with the master node to request new tasks, despite the fact that the processor cores on the same node can communicate with each other through the underlying shared memory. To address this problem of higher communication overhead, in this paper we propose to adopt the hybrid MPI and OpenMP programming model to design two-level parallel loop self-scheduling schemes. In the first level, each computing node runs an MPI process for inter-node communications. In the second level, each processor core runs an OpenMP thread to execute the iterations assigned to its resident node. Experimental results show that our method outperforms the previous works.

Keywords Parallel loop scheduling · Cluster computing · Multicore architecture · MPI programming · OpenMP programming · Hybrid programming

C.-C. Wu () · L.-F. Lai · P.-H. Chiu
Department of Computer Science and Information Engineering, National Changhua University of Education, Changhua City, 500, Taiwan
e-mail: [email protected]

C.-T. Yang
High-Performance Computing Laboratory, Department of Computer Science and Information Engineering, Tunghai University, Taichung, 40704, Taiwan

1 Introduction

A cluster system is composed of loosely coupled computers that work together to solve a problem in parallel by dividing a job into several smaller jobs [1, 15]. As


more and more inexpensive personal computers (PCs) are available, clusters of PCs have become alternatives to supercomputers, which many research projects cannot afford. Cluster systems can be divided into two categories: homogeneous clusters and heterogeneous clusters. A homogeneous cluster consists of identical computers, while a heterogeneous cluster is comprised of diverse computers. For budget reasons, a cluster system is usually expanded gradually. It is typically homogeneous when first constructed, but it becomes heterogeneous after incorporating newly purchased computers that usually have different hardware configurations from the older ones. Consequently, the majority of cluster systems are heterogeneous. However, it is difficult to deal with the heterogeneity in a cluster [1–4, 6, 15–17, 19, 20, 22].
Loop scheduling and load balancing on parallel and distributed systems are critical problems, but they are difficult to cope with, especially on the emerging PC-based clusters. An important issue in this respect is how to assign tasks to nodes so that the nodes' loads are well balanced. Much of the prior work has focused on loop scheduling, where self-scheduling is especially designed for loops without cross-iteration data dependences. Static loop scheduling schemes make scheduling decisions at compile time [10], while dynamic schemes make decisions at run time [8, 11, 18]. Static schemes cause less runtime overhead but suffer from load imbalance. Dynamic scheduling schemes have the opposite characteristics. Therefore, dynamic scheduling is the better choice for a heterogeneous computing system if it does not cause too much runtime overhead.
Yang et al. have proposed several loop self-scheduling schemes for heterogeneous cluster systems [19, 20, 22] and grid systems [5, 14, 21]. The latest scheme applicable to cluster systems combines the advantages of both static and dynamic loop scheduling [22]. The scheduling consists of two phases. In the first phase, some portion of the workload is distributed statically among computational nodes based on the values of the performance function for all nodes. In the second phase, the remaining workload is scheduled by some well-known dynamic self-scheduling scheme, such as guided self-scheduling (GSS) [11].
Recently, more and more cluster systems include multicore computers because nowadays almost all commodity personal computers have multicore architectures. The primary feature of multicore architecture is that multiple processor cores on the same chip can communicate with each other by directly accessing data in shared memory. In contrast, each computer in a distributed system has its own memory system and thus relies on the message-passing mechanism to communicate with other computers. The MPI library is usually used for parallel programming in cluster systems because it follows the message-passing programming model. However, MPI is not the most appropriate programming model for multicore computers. Even if there are still many tasks left in the shared memory that are assigned to some overloaded slave process, other slave MPI processes on the same computing node cannot access the tasks. Instead, every slave process has to communicate directly with the master MPI process for new tasks.
In large cluster systems, the master process may become a bottleneck of the system performance because of the excessive amount of communication. To minimize the amount of communication between the slave processes and the master process, it is better to allow the processor cores on the same computing


node to communicate with each other directly through the underlying shared memory. OpenMP is employed for this purpose because it is a shared-memory multithreaded programming language. Therefore, in this paper we propose a two-level self-scheduling scheme based on the hybrid MPI and OpenMP programming model for the heterogeneous cluster system with multicore computers. In the first level, the system is comprised of one global scheduler and multiple local schedulers. Each computing node has exactly one local scheduler that will request tasks from the global scheduler whenever it becomes idle. Every scheduler is an MPI process, no matter whether it is global or local. In the second level, every local scheduler will create one OpenMP thread for each processor core on its resident computing node. The tasks assigned to the local scheduler will be dispatched to the parallel OpenMP threads. In our approach, only the local scheduler will issue requests to the global scheduler, resulting in a reduced amount of inter-node communication. To verify the proposed approach, a heterogeneous cluster is built, and three types of application programs, matrix multiplication, sparse matrix multiplication and Mandelbrot set computation, are implemented and executed on this testbed. Empirical results show that the proposed approach can obtain performance improvements over previous schemes, especially for applications that have irregular workload distribution and require a large amount of data communication at each scheduling step.
The rest of this paper is organized as follows. In Sect. 2, we introduce several typical and well-known self-scheduling schemes. In Sect. 3, we describe how to employ the hybrid MPI and OpenMP programming model to develop the two-level self-scheduling schemes. Next, our system configuration is specified and experimental results on three types of application programs are also presented in Sect. 4. Finally, the concluding remarks and future work are given in the last section.

2 Related work

A parallel loop is a loop having no cross-iteration data dependences. If a parallel loop has N iterations, it can be executed by at most N processors in parallel without any interaction among processors. However, because the number of available processors in a system is always much smaller than N, each processor has to execute more than one loop iteration. Static scheduling schemes decide how many loop iterations are assigned to each processor at compile time. The advantage of this kind of scheduling scheme is that there is no scheduling overhead at runtime. In addition, it is very applicable to homogeneous computing systems when each loop iteration takes roughly the same amount of time. However, it is hard to estimate the computation power of every processor in a heterogeneous computing system and to predict the amount of time each iteration takes for irregular programs, usually resulting in load imbalance.
Dynamic scheduling is more suitable for load balancing because it makes scheduling decisions at runtime. No estimations and predictions are required. Self-scheduling is a large class of adaptive and dynamic centralized loop scheduling schemes. Initially, a portion of the loop iterations is scheduled to all processors. As soon as a slave processor becomes idle after it has finished the assigned workload, it requests unscheduled iterations from the scheduler. The total number of iterations that a processor


will execute depends on both its speed and the execution time of every assigned iteration. Although the scheduling overhead is proportional to the total number of requests from slaves, allocating too many iterations for every request may lead to poor load balance. Various self-scheduling schemes have been proposed to achieve better load balance with less scheduling overhead.
Pure Self-Scheduling (PSS) is the first straightforward dynamic loop scheduling algorithm [10]. Whenever a processor becomes idle, the master will assign one loop iteration to it. This algorithm achieves good load balancing because the maximum waiting time for the last processor is the execution time of one loop iteration. However, it induces excessive runtime overhead because the master has to dispatch the iterations one by one, requiring N dispatches if there are N iterations in total.
Chunk Self-Scheduling (CSS) assigns k consecutive iterations each time [10]. The chunk size, k, is fixed and must be specified either by the programmer or by the compiler. A large chunk size will cause load imbalance because the maximum waiting time for the last processor is the execution time of k loop iterations. In contrast, a small chunk size is likely to result in too much scheduling overhead. If k is equal to 1, CSS degrades to PSS. Thus, it is important to choose a proper chunk size.
Guided Self-Scheduling (GSS) assigns decreasing-sized chunks to requests [11]. Initially, the master allocates large chunks to slaves and later uses smaller chunks to smooth out the unevenness of the execution times of the initial larger chunks. More specifically, the next chunk size is calculated by dividing the number of remaining iterations by the number of available processors. It aims at reducing the dispatch frequency to minimize the scheduling overhead and at reducing the number of iterations assigned to the last few processors to achieve better load balancing.
Factoring Self-Scheduling (FSS) assigns loop iterations to processors in phases [8]. It tries to address the following problem of GSS. Because GSS might assign too much work to the first few processors in some cases, the remaining iterations are not time-consuming enough to balance the workload. During each phase of FSS, only a subset of the remaining loop iterations (usually half) is equally distributed to the available processors. FSS thus avoids the excessive allocation to the first few processors that GSS may incur. As a result, it balances workloads better than GSS when loop iteration computation times vary substantially.
Trapezoid Self-Scheduling (TSS) reduces the scheduling frequency while still providing reasonable load balancing [18]. The chunk sizes decrease linearly in TSS, in contrast to the geometric decrease of the chunk sizes in GSS. A TSS is represented by TSS(Ns, Nf), where Ns is the number of iterations assigned to the processor starting the loop and Nf is the number of iterations assigned to the processor performing the last fetch. The two parameters, Ns and Nf, have to be specified in TSS either by the programmer or by the compiler. According to the values of Ns and Nf, the number of iterations assigned at each step decreases by a constant amount. Tzen and Ni [18] have proposed TSS(N/2p, 1) as a general selection, where N is the number of iterations and p is the number of processors.
Yang et al. have proposed several loop self-scheduling schemes for heterogeneous cluster systems [19, 20, 22].
In their first work [19], a heuristic was proposed to distribute the workload according to CPU clock speed when the loop is regular. It assigns


loop iterations in two phases. In the first phase, α% of workload is partitioned according to the performance weighted by CPU clock speed. In the second phase, the rest (100 − α)% of workload is distributed according to a traditional self-scheduling scheme. The first phase adopts static scheduling to reduce the scheduling overhead. In contrast, the second phase uses dynamic scheduling to achieve good load balancing. The success of the proposed heuristic heavily depends on the appropriate selection of the α value that is specified either by programmer or by the compiler. Unlike their first work using CPU clock speed for performance estimation in the first phase, they evaluated computer performance by using HINT Performance Analyzer Tool in their second work [20] because many attributes influence system performance, including CPU clock speed, available memory, communication cost, and so forth [12]. HINT (Hierarchical INTegration) is a computer benchmarking tool developed at the Ames Laboratory Scalable Computing Laboratory (SCL). Unlike conventional benchmarks, HINT neither fixes the problem size nor the calculation time. Instead, it measures “QUIPS” (QUality Improvement Per Second) as a function of time. HINT is used to determine whether target systems are relatively homogeneous or relatively heterogeneous. They also adaptively adjust the α value according to the heterogeneity of the cluster [6]. If the target system is relatively heterogeneous, the α value is set to be their defined Heterogeneous Ratio (HR); otherwise, the α value is set to be 100. As a consequence, the computing power can be more accurately estimated and the α value is determined automatically. In the latest scheme [22], they proposed a performance-based approach, which partitions loop iterations according to the performance ratio of cluster nodes. They defined a performance function for each computing node. However, no performance functions are explicitly defined in the paper. Instead, application execution time is used to estimate the values of performance function for all nodes. The reciprocal of the execution time of the target program on each computing node is recorded to form the performance function value. The ratios of these reciprocals are defined as performance ratio of cluster nodes. According to the performance ratio among all slave nodes, firstly α percent of the total workload is statically scheduled. Next, the remainder of the workload is dynamically scheduled by a known self-scheduling scheme. Although previous works have studied the hybrid MPI and OpenMP programming model, they used the self-scheduling scheme only in the OpenMP paradigm [7, 13]. In this paper, we apply the self-scheduling scheme to both the MPI processes and the OpenMP threads. To improve the performance for the proposed two-level self-scheduling scheme, we investigate how to redesign the allocation functions for various well-known self-scheduling schemes.
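To make the chunk-size rules reviewed in this section concrete, the following C sketch computes the next chunk size for CSS, GSS, FSS and TSS from the number of remaining iterations. The function names, the simplified FSS bookkeeping and the sample values are illustrative assumptions, not code taken from the cited papers.

#include <stdio.h>

/* Illustrative chunk-size rules for the classic self-scheduling schemes.
 * P is the number of processors, R the number of remaining iterations. */

static long css_chunk(long R, long k) {            /* CSS: fixed chunk size k          */
    return R < k ? R : k;
}

static long gss_chunk(long R, int P) {             /* GSS: ceil(R / P)                 */
    long c = (R + P - 1) / P;
    return c > 0 ? c : 1;
}

static long fss_chunk(long R, int P) {             /* FSS: half of R shared among P    */
    long c = (R + 2L * P - 1) / (2L * P);          /* ceil(R / 2P) per request         */
    return c > 0 ? c : 1;
}

static long tss_chunk(long s, long Ns, long Nf, long N) {
    /* TSS(Ns, Nf): chunk sizes shrink linearly from Ns down to Nf. */
    long Nc = (2 * N + Ns + Nf - 1) / (Ns + Nf);   /* ceil(2N / (Ns + Nf)) chunks      */
    long D  = Nc > 1 ? (Ns - Nf) / (Nc - 1) : 0;   /* constant decrement per step      */
    long c  = Ns - s * D;                          /* s = 0, 1, 2, ... scheduling step */
    return c > Nf ? c : Nf;
}

int main(void) {
    long N = 1000, R = N;
    int  P = 4;
    for (long s = 0; s < 5 && R > 0; ++s) {        /* first few GSS chunks             */
        long c = gss_chunk(R, P);
        printf("GSS step %ld: chunk %ld of %ld remaining\n", s, c, R);
        R -= c;
    }
    printf("CSS(k=32) chunk: %ld\n", css_chunk(N, 32));
    printf("FSS first chunk: %ld\n", fss_chunk(N, P));
    printf("TSS(125,1) first chunk: %ld\n", tss_chunk(0, 125, 1, N));
    return 0;
}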

3 Two-level loop self-scheduling

The feature of the heterogeneous cluster system with multicore computers is that it combines the main concepts of distributed-memory and shared-memory parallel machines in a single system. Accordingly, we propose a two-level self-scheduling scheme in this section, and then present how to apply our method to four well-known self-scheduling schemes by amending their allocation functions.


Fig. 1 Communications with MPI processes on the heterogeneous cluster system with multicore computing nodes

3.1 The main idea

A cluster system is comprised of multiple interconnected computing nodes. If multicore computing nodes are included, the cluster system can be regarded as a two-level hierarchical structure. The first level consists of computing nodes and the second level consists of processor cores. Because each computing node has its own memory system and address space, it has to communicate with other nodes by explicitly sending messages through the computer network. In contrast, because the processor cores on a multicore computing node all share the same memory, they can communicate with each other by accessing data in the shared memory. Accordingly, the communications can be divided into inter-node and intra-node communications. Because the inter-node communications have longer latencies than the intra-node communications, the former should be minimized for optimized communications. However, the previously proposed self-scheduling schemes simply ignored this communication issue [19, 20, 22]. They are all based on message-passing paradigms.
We give an example to explain how the previous schemes dispatch the work. The cluster system is composed of one master computing node and two slave computing nodes, as shown in Fig. 1. Slave computing node 1 has four processor cores and slave computing node 2 has two processor cores. There is a centralized scheduler responsible for the task assignment in the system. It is an MPI process running on the master computing node. On the other hand, each processor core on the slave nodes executes one slave MPI process. Therefore, there are six slave MPI processes in the system in total. Each slave MPI process will request new tasks directly from the master MPI process whenever it becomes idle. The tasks assigned to a slave process cannot be shared with other slave processes on the same computing node even though the received tasks reside in the shared memory. In consequence, one slave process has to issue another request to the master whenever it becomes idle, no matter whether another slave process on the same node has been assigned too many tasks due to workload irregularity. Each request from a slave to the master is a long-latency inter-node communication.
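As a concrete illustration of this single-level pattern, the minimal C/MPI sketch below lets every core run its own slave process that asks the master (rank 0) directly for work. The message tags, the dummy request payload and the GSS-like chunk rule are placeholders chosen for illustration, not the implementation of the cited schemes.

/* Single-level, MPI-only pattern: one slave process per processor core,
 * each requesting work directly from the master. */
#include <mpi.h>

#define TAG_REQ  1
#define TAG_WORK 2               /* msg[0] = first iteration, msg[1] = count (0 = stop) */

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    long N = 100000;             /* total loop iterations */
    if (rank == 0) {             /* master: one reply per incoming request */
        long next = 0;
        int  done = 0;
        while (done < size - 1) {
            MPI_Status st;
            char req;
            MPI_Recv(&req, 1, MPI_CHAR, MPI_ANY_SOURCE, TAG_REQ, MPI_COMM_WORLD, &st);
            long chunk = (N - next + size - 1) / size;    /* GSS-like rule */
            if (chunk <= 0) { chunk = 0; done++; }
            long msg[2] = { next, chunk };
            next += chunk;
            MPI_Send(msg, 2, MPI_LONG, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
        }
    } else {                     /* slave: one MPI process per processor core */
        for (;;) {
            char req = 0;
            long msg[2];
            MPI_Send(&req, 1, MPI_CHAR, 0, TAG_REQ, MPI_COMM_WORLD);
            MPI_Recv(msg, 2, MPI_LONG, 0, TAG_WORK, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (msg[1] == 0) break;      /* no iterations left */
            for (long i = msg[0]; i < msg[0] + msg[1]; ++i) {
                /* process iteration i here */
            }
        }
    }
    MPI_Finalize();
    return 0;
}

Every request/reply pair above crosses the network, even between two cores of the same node, which is exactly the overhead the two-level scheme proposed next tries to remove.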


Fig. 2 Communications with MPI processes and OpenMP threads on the heterogeneous cluster system with multicore computing nodes

Moreover, the master process may become a bottleneck of the system performance because there are many frequent requests from slave processes at runtime.
To exploit the hierarchical structure of the emerging cluster systems, we propose the following two-level scheduling scheme. The hybrid MPI and OpenMP programming model is adopted to develop parallel applications. As in most cluster systems, MPI, the de facto message-passing standard, is employed to design the parallel programs for inter-node communications. Each computing node runs one MPI process no matter how many processor cores it has. In contrast, OpenMP is used for intra-node communications. Each MPI process will fork OpenMP threads depending on the number of processor cores in its underlying computing node. Every processor core runs one OpenMP thread. OpenMP is a shared-memory multithreaded programming language, which matches well the features of a multicore computing node. Unlike MPI processes, which have completely separate program contexts with their own variables and memory allocations, OpenMP threads share the same memory space and global variables between routines. Consequently, OpenMP threads require less memory space than MPI processes. This property is very beneficial for applications requiring more memory space than the physical memory provides, since frequent memory swapping caused by the virtual memory technique will degrade the system performance significantly. Therefore, we can combine the advantages of message-passing programming and shared-memory programming by using the hybrid MPI and OpenMP programming model for the multicore PC cluster.
The scheduling scheme consists of one global scheduler and multiple local schedulers. Each slave computing node has one local scheduler. One processor core of each slave computing node is responsible for the execution of the local scheduler. The processor core running the local scheduler is called the master core and the others are called slave cores. The global scheduler and the local schedulers are all MPI processes. We give an example, as shown in Fig. 2, to explain our approach. Initially, three MPI processes are created, one for the global scheduler and two for the local schedulers. In addition, all the slave cores on these two slave computing nodes run nothing. All the loop iterations are kept by the global scheduler. No slave cores are allowed to request iterations directly from the global scheduler. Instead, they have to request them from the local scheduler in the same computing node. To utilize the feature of the shared-memory architecture, every MPI process of the local scheduler will create


one OpenMP thread for each processor core on its resident computing node. Therefore, the local scheduler in computing node 1 will fork four OpenMP threads, and the local scheduler in computing node 2 will fork two OpenMP threads. On each computing node, all the processor cores will work together to process the work assigned to the local scheduler by the global scheduler. Whenever a local scheduler has no iterations left in its work queue, it requests a chunk of iterations from the global scheduler by issuing an MPI message. The messages between the global scheduler and the local schedulers are inter-node communications; they are MPI messages. In contrast, the communications between the local scheduler and the processor cores are intra-node communications; they are realized through the OpenMP shared memory. Because the inter-node communications are more costly, the global scheduler should assign larger chunks of iterations to the local scheduler at each scheduling step to reduce the number of requests from the local schedulers. However, it has to reserve sufficient iterations for load balancing among heterogeneous computing nodes. Therefore, the global scheduler requires an optimization algorithm. Similarly, because the workload distribution of iterations may be irregular, the local scheduler needs a load-balancing algorithm for efficient parallel execution. In summary, two algorithms are required for the two-level scheduling scheme.
In the first-level scheduling, the global scheduler is responsible for deciding how many iterations will be assigned whenever a local scheduler issues a request. The number of processor cores in the computing node from which the request comes should be taken into consideration when the decision is made. In the second-level scheduling, because all the processor cores are homogeneous, the local scheduler dispatches the iterations assigned by the global scheduler to all the processor cores primarily based on whether the workload of the iterations is regular or not. Basically, static scheduling is preferred in the second level. However, dynamic scheduling is adopted if the iteration workload distribution is irregular. Based on the above idea, in the following subsection we propose a two-level scheduling approach, called Layered Self-Scheduling (LSS).

3.2 Layered self-scheduling

Layered self-scheduling is a two-level scheduling approach for multicore PC clusters. In the second-level scheduling, the local scheduler dispatches the iterations assigned by the global scheduler to the parallel OpenMP threads by invoking the OpenMP built-in scheduling routine. The scheduling scheme can be any one of the following: the static, guided or chunk scheme. Note that there is an implicit barrier synchronization at the end of every parallel OpenMP section, which will cause additional runtime overhead. Whenever all the assigned iterations have been processed by the OpenMP threads, the local scheduler issues another request to the global scheduler for the next chunk of iterations.
In the first-level scheduling, we propose to dispatch tasks based on any well-known self-scheduling scheme, such as CSS, GSS, FSS or TSS. However, because the global scheduler dispatches tasks to the local scheduler rather than to a single processor core, we have to modify the allocation function that decides the next chunk size at each scheduling step for every well-known self-scheduling scheme. Furthermore, the


global scheduler employs the performance-based scheduling approach. We describe the performance function used in this paper as follows.
Let M denote the number of computing nodes and P denote the total number of processor cores. Computing node i is represented by mi, and the total number of processor cores in computing node mi is represented by pi, where 1 ≤ i ≤ M. In consequence, $P = \sum_{i=1}^{M} p_i$. The jth processor core in computing node i is represented by cij, where 1 ≤ i ≤ M and 1 ≤ j ≤ pi. N denotes the total number of iterations in some application program, and f() is an allocation function that produces the chunk size at each step; the output of f is the chunk size for the next scheduling step. At the sth scheduling step, the global scheduler computes the chunk size Cs for computing node i and the remaining number of tasks Rs:

$$R_0 = N, \qquad C_s = f(s, i), \qquad R_s = R_{s-1} - C_s, \quad (1)$$

where f() possibly has more parameters than just s and i, such as R_{s-1}. To estimate the performance of each computing node, we define a performance function (PF) for a computing node i as

$$PF_i(V_1, V_2, \ldots, V_X), \quad (2)$$

where Vr, 1 ≤ r ≤ X, is a variable of the performance function. In this paper, our PF for a computing node i is defined as

$$PF_i = \frac{\sum_{k=1}^{p_i} 1/t_{ik}}{\sum_{q=1}^{M} \sum_{k=1}^{p_q} 1/t_{qk}}, \quad (3)$$

where tij is the execution time (in seconds) of processor core j on computing node i for some application program, such as matrix multiplication.
Our proposed two-level scheduling approach does not necessarily outperform the single-level scheduling approach proposed by Yang et al. [22] if the allocation functions for the self-scheduling schemes are not carefully designed. In the single-level scheduling approach, each MPI process requests iterations from the scheduler only for itself. By contrast, in the two-level scheduling approach, the iterations assigned to the local scheduler will be processed in parallel by OpenMP threads. The number of iterations allocated at each scheduling step in the two-level scheduling approach might therefore become much larger than that in the single-level scheduling approach. While the global scheduler transmits the required information and data to some local scheduler, no requests from other local schedulers can be processed, resulting in longer waiting times for the other local schedulers. If the amount of communication for one scheduling step is too large, the system performance will be degraded. For this reason, this paper also focuses on the design of allocation functions for various self-scheduling schemes.
According to the performance function, we propose four scheduling schemes in the following by modifying the allocation functions of four well-known scheduling schemes, where ρ denotes an application-dependent constant.
Layered Chunk Self-Scheduling (LCSS) is similar to CSS except that the allocation function f() is amended as follows: $C_s = f(s, i, P) = PF_i \times k$, where the constant k denotes the fixed chunk size.
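As a small worked illustration of Eq. (3), the following C sketch computes the performance ratios PF_i from measured per-core execution times; the node layout and the timing values are made-up examples used only to show the calculation.

#include <stdio.h>

/* Eq. (3): PF_i is the sum of 1/t_ik over the cores of node i,
 * normalized by the same sum over all nodes.  The layout and the
 * timings below are illustrative examples. */
#define M 3                                        /* number of computing nodes          */
static const int    cores[M] = { 4, 2, 2 };        /* p_i: cores per node                */
static const double t[M][4]  = {                   /* t[i][k]: seconds on core k, node i */
    { 10.0, 10.0, 10.0, 10.0 },
    { 14.0, 14.0,  0.0,  0.0 },
    { 20.0, 20.0,  0.0,  0.0 },
};

int main(void) {
    double pf[M], total = 0.0;
    for (int i = 0; i < M; ++i) {
        pf[i] = 0.0;
        for (int k = 0; k < cores[i]; ++k)
            pf[i] += 1.0 / t[i][k];                /* sum of reciprocal execution times  */
        total += pf[i];
    }
    for (int i = 0; i < M; ++i)
        printf("PF_%d = %.3f\n", i + 1, pf[i] / total);
    return 0;
}

With these sample timings the four-core node obtains the largest ratio, so the global scheduler would hand it the largest chunks.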


Layered Guided Self-Scheduling (LGSS) is similar to GSS except the following change:

$$C_s = f(s, i, R_{s-1}) = \rho \times PF_i \times R_{s-1}. \quad (4)$$

Layered Factoring Self-Scheduling (LFSS) is similar to FSS except the following amendments:

$$C_s = f(s, i, j, R_{s-1}) = \rho \times PF_i \times \frac{R_{s-1}}{2}, \quad \text{if } (s \bmod P) = 0; \quad (5)$$

$$C_s = f(s, i, j, R_{s-1}) = \frac{PF_i \times R_{s-1}}{PF_j}, \quad \text{otherwise}, \quad (6)$$

where j denotes the id of the computing node that issued a request at the (s − 1)th scheduling step.
Layered Trapezoid Self-Scheduling (LTSS) is similar to TSS except the following amendments. Let the upper bound be u and the lower bound be l. Accordingly, the number of chores is $\lceil 2 \times N / (u + l) \rceil$, denoted by Nc, and consecutive chunks differ in size by $(u - l)/(N_c - 1)$ iterations, denoted by D:

$$C_s = f(s, i, j, R_{s-1}) = \rho \times \max(PF_i \times (u - (s - 1) \times D),\ l), \quad \text{if } (s \bmod P) = 0; \quad (7)$$

$$C_s = f(s, i, j, R_{s-1}) = \frac{PF_i \times R_{s-1}}{PF_j}, \quad \text{otherwise}, \quad (8)$$

where j denotes the id of the computing node that issued a request at the (s − 1)th scheduling step.
Based on the information of workload distribution and node performance, we propose an algorithm for the two-level loop scheduling scheme. This algorithm is based on a hybrid MPI and OpenMP programming model, and consists of two modules: a master module and a slave module, as shown in Fig. 3 and Fig. 4, respectively. The master module makes the scheduling decisions and dispatches workloads to slaves by MPI messages. The computing node with better performance will get more data to process. Then, the slave module processes the assigned work with parallel OpenMP threads. The number of OpenMP threads to be created is equal to the number of processor cores on the computing node where the MPI process resides. This algorithm is just a skeleton, and the detailed implementation, such as data preparation, parameter passing, and so forth, might differ according to the requirements of various applications.
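Figures 3 and 4 are not reproduced in this text, so the following hedged C/MPI + OpenMP sketch only indicates how such a two-level scheme can be wired together: the global scheduler (rank 0) sizes chunks with an LGSS-style rule in the spirit of Eq. (4), and each local scheduler processes its chunk with OpenMP's built-in guided schedule. The message tags, the assumed PF values, the constant rho and the empty work() body are illustrative placeholders rather than the actual modules of Figs. 3 and 4.

/* Two-level (LSS) skeleton: one MPI process per computing node acts as the
 * local scheduler; rank 0 is the global scheduler.  Assumes it is launched
 * with 4 MPI processes (one global scheduler plus three slave nodes). */
#include <mpi.h>
#include <omp.h>
#include <math.h>

#define TAG_REQ  1
#define TAG_WORK 2

static void work(long i) { (void)i; /* one loop iteration; application-specific */ }

int main(int argc, char **argv) {
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    long   N   = 1000000;                          /* total loop iterations               */
    double rho = 0.5;                              /* application-dependent constant      */
    double PF[4] = { 0.0, 0.40, 0.35, 0.25 };      /* assumed per-node performance ratios */

    if (rank == 0) {                               /* first level: global scheduler       */
        long remaining = N, next = 0;
        int  finished = 0;
        while (finished < nprocs - 1) {
            MPI_Status st;
            int req;
            MPI_Recv(&req, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQ, MPI_COMM_WORLD, &st);
            long chunk = (long)ceil(rho * PF[st.MPI_SOURCE] * remaining);  /* LGSS-style */
            if (chunk > remaining) chunk = remaining;
            if (chunk == 0) finished++;
            long msg[2] = { next, chunk };
            next += chunk;
            remaining -= chunk;
            MPI_Send(msg, 2, MPI_LONG, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
        }
    } else {                                       /* local scheduler: one per node       */
        for (;;) {
            int  req = 0;
            long msg[2];
            MPI_Send(&req, 1, MPI_INT, 0, TAG_REQ, MPI_COMM_WORLD);
            MPI_Recv(msg, 2, MPI_LONG, 0, TAG_WORK, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            if (msg[1] == 0) break;                /* no iterations left                  */
            /* second level: all cores of this node share the chunk via OpenMP */
            #pragma omp parallel for schedule(guided)
            for (long i = msg[0]; i < msg[0] + msg[1]; ++i)
                work(i);
        }
    }
    MPI_Finalize();
    return 0;
}

A program of this shape would be compiled with an MPI wrapper plus OpenMP support (for example, mpicc -fopenmp) and launched with one MPI process per computing node; only the local schedulers ever contact rank 0, which is precisely the reduction of inter-node requests the scheme aims at.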

4 Performance evaluations

We have constructed a heterogeneous cluster system consisting of eight desktop PCs with a total of sixteen processor cores, as shown in Table 1. Three types of application programs are implemented to verify our approach on this testbed. The application of matrix multiplication has regular workload distribution and requires data communication at


Fig. 3 The algorithm of the global scheduler

each scheduling step. Sparse matrix multiplication has irregular workload distribution and requires data communication at each scheduling step. Mandelbrot set computation has irregular workload distribution and requires no data communication at each scheduling step. The heuristic self-scheduling (HSS) proposed by Yang et al. [22] is compared with our approach, layered self-scheduling (LSS).

4.1 Application 1: Matrix Multiplication

Matrix multiplication is a fundamental operation in many numerical linear algebra applications. Its efficient implementation on parallel computers is an issue of prime importance when providing such systems with scientific software libraries. Consequently, considerable effort has been devoted in the past to the development of efficient parallel matrix multiplication algorithms, and this will remain a task in the future as well.


Fig. 4 The algorithm of the local scheduler

Table 1 The configuration of our cluster system


Many parallel algorithms have been designed, implemented, and tested on different parallel computers or clusters of workstations for matrix multiplication. We have implemented the proposed scheme for matrix multiplication. The input matrix A is partitioned into a set of rows and kept by the global scheduler. At runtime, after the global scheduler decides which rows will be assigned at each scheduling step, the corresponding row data will be sent to the requesting slave process. On the other hand, every local scheduler has a copy of the input matrix B because it is needed to calculate every row of the matrix C. The global scheduler, viz. the master module, is responsible for the distribution of workloads to the local schedulers, viz. the slave nodes. When a local scheduler becomes idle, the global scheduler sends the local scheduler one integer indicating how many rows will be assigned. Next, the global scheduler sends the corresponding data to the local scheduler. Finally, the OpenMP threads will follow the specified scheduling scheme, such as guided self-scheduling, to calculate the assigned rows. The C/MPI + OpenMP code fragment of the slave module for matrix multiplication is listed in Fig. 5. As the source code shows, a row is the atomic unit of allocation.
We evaluate the performance of four kinds of scheduling schemes in the following. First, we compare chunk self-scheduling based schemes as shown in Fig. 6. The label HCSS(α = x) in the legend denotes the heuristic CSS proposed by Yang et al. [22] with the value of α equal to x. The label LCSS_openmp(T) in the legend represents our proposed layered CSS, where the self-scheduling scheme adopted by the local scheduler is T. The speedup is obtained by dividing the execution time of the HCSS with α equal to 50 by the execution time of some scheme with the same matrix size. In this experiment, a larger α value leads to a poorer performance in the HCSS approach. The performance difference between different α values is enlarged when the matrix size becomes larger. On the other hand, our scheme outperforms the HCSS no matter which kind of self-scheduling scheme the local scheduler adopts. The speedups of our scheme are almost all larger than 2. Because the workload distribution is regular, if the local scheduler adopts the static scheme, we can get the best performance. In this case, our scheme for input size 4096 × 4096 provides 49.2%, 40.2%, or 40.0% performance improvement over the HCSS if the static scheme, CSS, or GSS is adopted by the local scheduler, respectively.
Second, we compare guided self-scheduling based schemes as shown in Fig. 7. The label HGSS(α = x) in the legend denotes the heuristic GSS proposed by Yang et al. [22] with the value of α equal to x. The label LGSS_openmp(T) in the legend represents our proposed layered GSS, where the self-scheduling scheme adopted by the local scheduler is T. In this experiment, a larger α value leads to a better performance in the HGSS approach. Our scheme outperforms the HGSS no matter which kind of self-scheduling scheme the local scheduler adopts. The speedups of our schemes range from 1.4 to 1.58. Because the workload distribution is regular, if the static scheme is adopted in the second-level scheduling, we can get the best performance improvement. The performance differences between static and dynamic schemes become larger when the matrix size becomes larger.
In this case, our scheme for input size 4096 × 4096 provides 56.2%, 47.3% or 47.1% performance improvement over the HGSS if static scheme, CSS, or GSS is adopted by the local scheduler, respectively.


Fig. 5 The local scheduler algorithm of matrix multiplication
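Since Fig. 5 itself is not reproduced in this text, the fragment below is only a sketch of the shape such a second-level loop can take for matrix multiplication: a row is the atomic unit, and the rows received from the global scheduler are processed by OpenMP threads under the guided schedule. The MPI receive that would normally fill the row buffer is replaced by a small made-up buffer so that the fragment is self-contained; all names and sizes are assumptions.

/* Second-level work in the matrix-multiplication slave: the rows of A
 * received from the global scheduler are multiplied by the locally held B,
 * one row per OpenMP scheduling unit. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>

static void process_rows(const double *Arows, const double *B, double *Crows,
                         int nrows, int n) {
    #pragma omp parallel for schedule(guided)      /* a row is the atomic unit */
    for (int r = 0; r < nrows; ++r)
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k)
                sum += Arows[r * n + k] * B[k * n + j];
            Crows[r * n + j] = sum;
        }
}

int main(void) {
    int n = 256, nrows = 32;          /* a chunk of 32 rows "received" from the master */
    double *Arows = malloc(sizeof(double) * nrows * n);
    double *B     = malloc(sizeof(double) * n * n);
    double *Crows = malloc(sizeof(double) * nrows * n);
    for (int i = 0; i < nrows * n; ++i) Arows[i] = 1.0;
    for (int i = 0; i < n * n; ++i)     B[i]     = 2.0;

    process_rows(Arows, B, Crows, nrows, n);
    printf("C[0][0] = %.1f (expected %.1f)\n", Crows[0], 2.0 * n);
    free(Arows); free(B); free(Crows);
    return 0;
}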

Third, we compare factoring self-scheduling based schemes as shown in Fig. 8. The label HFSS(α = x) in the legend denotes the heuristic FSS proposed by Yang et al. [22] and the value of α is equal to x. The label LFSS_openmp(T ) in the legend


Fig. 6 Comparison of CSS based scheme for matrix multiplication. The chunk size is 128

Fig. 7 Comparison of GSS based scheme for matrix multiplication

represents our proposed layered FSS, where the self-scheduling scheme adopted by the local scheduler is T. In this experiment, no single α value guarantees better performance across different matrix sizes in the HFSS approach. Our scheme outperforms the HFSS only when the matrix size is equal to 4096 × 4096. The best speedup that our schemes can provide is 1.13 when the local scheduler employs static scheduling and the matrix size is equal to 4096 × 4096. In this case, our scheme for input size 4096 × 4096 provides 12.9%, 5.9%, or 6.8% performance improvement over the HFSS if the static scheme, CSS, or GSS is adopted by the local scheduler, respectively.
Fourth, we compare trapezoid self-scheduling based schemes as shown in Fig. 9. The label HTSS(α = x) in the legend denotes the heuristic TSS proposed by Yang et al. with the value of α equal to x. The label LTSS_openmp(T) in the legend


Fig. 8 Comparison of FSS based scheme for matrix multiplication

Fig. 9 Comparison of TSS based scheme for matrix multiplication

represents our proposed layered TSS, where the self-scheduling scheme adopted by the local scheduler is T. In this experiment, a larger α value provides better performance for the different matrix sizes in the HTSS approach. Our scheme outperforms the HTSS no matter which kind of scheduling scheme the local scheduler adopts. Furthermore, when the matrix size becomes larger, our method provides better performance. The best speedup that our schemes can provide is 1.41 when the local scheduler employs static scheduling and the matrix size is equal to 4096 × 4096. In this case, our scheme for input size 4096 × 4096 provides 41.3%, 34.3% or 34.1% performance improvement over the HTSS if the static scheme, CSS, or GSS is adopted by the local scheduler, respectively.


Fig. 10 The algorithm of sparse matrix multiplication
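Fig. 10 is not reproduced in this text; as Sect. 4.2 below explains, the only change with respect to dense matrix multiplication is that zero elements of A skip the corresponding multiply-add. A minimal sketch of such a row kernel, under that assumption and with made-up sizes and values, follows.

/* Sparse-matrix variant of the row kernel: whenever an element of A is
 * zero, the corresponding multiply-add over B is omitted entirely. */
#include <stdio.h>

static void sparse_row(const double *Arow, const double *B, double *Crow, int n) {
    for (int j = 0; j < n; ++j) Crow[j] = 0.0;
    for (int k = 0; k < n; ++k) {
        if (Arow[k] == 0.0)                        /* omit the calculation for zeros */
            continue;
        for (int j = 0; j < n; ++j)
            Crow[j] += Arow[k] * B[k * n + j];
    }
}

int main(void) {
    enum { N = 4 };
    double A[N] = { 1.0, 0.0, 0.0, 3.0 };          /* one row of A, half zeros */
    double B[N * N], C[N];
    for (int i = 0; i < N * N; ++i) B[i] = 1.0;
    sparse_row(A, B, C, N);
    printf("C[0] = %.1f (expected 4.0)\n", C[0]);
    return 0;
}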

Fig. 11 Comparison of CSS based scheme for sparse matrix multiplication. The chunk size is 128

4.2 Application 2: Sparse Matrix Multiplication

Sparse matrix multiplication is the same as matrix multiplication, as described in Sect. 4.1, except that the input A is a sparse matrix. Assume that 50% of the elements in matrix A are zero and all the zeros are in the lower triangular part. If an element in matrix A is zero, the corresponding calculation is omitted, as shown in Fig. 10. Therefore, the workload distribution of iterations in sparse matrix multiplication is irregular.
We evaluate the performance of four kinds of scheduling schemes in the following. First, we compare chunk self-scheduling based schemes as shown in Fig. 11. In this experiment, a larger α value does not necessarily lead to a better performance in the HCSS approach. On the other hand, our scheme outperforms the HCSS no matter which kind of self-scheduling scheme the local scheduler adopts. The speedups are all larger than 1.7. However, the improvement decreases when the matrix size is enlarged. For the input size 4096 × 4096, our scheme provides 84.2%, 69.6% or 69.3% performance improvement over the HCSS if the static scheme, CSS, or GSS is adopted by the local scheduler, respectively.
Though the workload distribution is irregular, the static scheme adopted in the second level provides better performance than the dynamic schemes. To explain the reason, we chose twenty chunks evenly from the 2048 chunks and depicted their execution times in Fig. 12. The workload decreases steadily and the execution time difference between any two consecutive sampled chunks is very small. Note that there are 105


Fig. 12 Comparison of CSS based scheme for sparse matrix multiplication

Fig. 13 Comparison of GSS based scheme for sparse matrix multiplication

original chunks between any two consecutive sampled chunks and they are not shown in this figure. Because the execution time difference is so small, approximately the workload of the 128 chunks, allocated at one scheduling step in our proposed LGSS scheme, is uniformly distributed. That is why we can obtain the best performance if the local scheduler adopts the static scheduling. Second, we compare guided self-scheduling based schemes as shown in Fig. 13. In this experiment, a larger α value leads to a better performance in the HGSS approach.


Fig. 14 Comparison of FSS based scheme for sparse matrix multiplication

Our scheme outperforms the HGSS no matter which kind of self-scheduling scheme is adopted in the second-level scheduling. The speedups are all larger than 1.6 and the static scheme always provides the best performance for the same matrix size. In the second-level scheduling, the two kinds of dynamic schemes provide similar performance improvements. In this case, our scheme for input size 4096 × 4096 provides 80.3%, 68.5% or 67.3% performance improvement over the HGSS if the static scheme, CSS, or GSS is adopted by the local scheduler, respectively.
Third, we compare factoring self-scheduling based schemes as shown in Fig. 14. In this experiment, a larger α value may lead to a poorer performance in the HFSS approach. Our scheme outperforms the HFSS no matter which kind of self-scheduling scheme is adopted in the second-level scheduling. Our schemes provide more performance improvement when the matrix size becomes larger. Static scheduling adopted in the second level still provides the best speedup. In this case, our scheme for input size 4096 × 4096 provides 48.5%, 37.3% or 37.8% performance improvement over the HFSS if the static scheme, CSS, or GSS is adopted by the local scheduler, respectively.
Fourth, we compare trapezoid self-scheduling based schemes as shown in Fig. 15. In this experiment, a larger α value will degrade performance in the HTSS approach. On the other hand, our scheme outperforms the HTSS no matter which kind of scheduling scheme is employed in the OpenMP parallel section. The largest matrix size provides the best performance improvement. In the second-level scheduling, the static scheme provides the best performance improvement. The guided and chunk self-scheduling schemes adopted by the local scheduler provide almost the same speedup. In this case, our scheme for input size 4096 × 4096 provides 52.1%, 37.3% or 37.8% performance improvement over the HTSS if the static scheme, CSS, or GSS is adopted by the local scheduler, respectively.

4.3 Application 3: Mandelbrot set computation

The Mandelbrot set is a problem involving the same computation on different data points which have different convergence rates [9]. The Mandelbrot set, named after


Fig. 15 Comparison of TSS based scheme for sparse matrix multiplication

Benoit Mandelbrot, is a fractal. Fractals are objects that display self-similarity at various scales. Magnifying a fractal reveals small-scale details similar to the large-scale characteristics. Although the Mandelbrot set is self-similar at magnified scales, the small-scale details are not identical to the whole. In fact, the Mandelbrot set is infinitely complex, yet the process of generating it is based on an extremely simple equation involving complex numbers. This operation derives a resultant image by processing an input matrix, A, where A is an image of m pixels by n pixels. The resultant image is likewise one of m pixels by n pixels.
The proposed scheme has been implemented for Mandelbrot set computation. The global scheduler is responsible for the distribution of the workload. When a local scheduler becomes idle, the global scheduler sends two integers to the local scheduler. The two numbers represent the beginning index and the size of the assigned chunk, respectively. The tasks assigned to the local scheduler are then dispatched to OpenMP threads based on a specified self-scheduling scheme. Unlike matrix multiplication, the communication cost between the global scheduler and the local scheduler is low, and the dominant cost is the computation of the Mandelbrot set. The C/MPI + OpenMP code fragment of the local scheduler for Mandelbrot set computation is listed in Fig. 16. In this application, the workload for each iteration of the outer loop is irregular because the number of iterations required for convergence is not fixed. Therefore, the performance of the workload distribution depends on the degree of variation among iterations.
First, we compare chunk self-scheduling based schemes as shown in Fig. 17. In this experiment, a larger α value will improve the performance in the HCSS approach. However, when the image size is enlarged, the improvement is decreased. As for our scheme, the two kinds of dynamic schemes are preferred for the local scheduler for better performance. For the larger image sizes, chunk self-scheduling provides the best performance improvement. In this case, our scheme for input size 1500 × 1500 provides 0.8%, 10.0% or 9.0% performance improvement over the HCSS if the static scheme, CSS, or GSS is adopted by the local scheduler, respectively.


Fig. 16 The local scheduler algorithm of Mandelbrot set computation
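Fig. 16 is likewise not reproduced here; the sketch below shows the kind of per-pixel escape-time loop a local scheduler parallelizes with an OpenMP dynamic schedule over its assigned rows. The image size, region bounds, iteration limit and the fixed chunk of rows are illustrative assumptions, not parameters taken from the original implementation.

/* Mandelbrot work for one chunk of image rows assigned to a local scheduler.
 * The escape-time loop makes the per-pixel cost irregular, which is why a
 * dynamic OpenMP schedule pays off for this application. */
#include <omp.h>
#include <stdio.h>

#define SIZE    1500                               /* illustrative 1500 x 1500 image */
#define MAXITER 1000

static int escape_time(double cr, double ci) {
    double zr = 0.0, zi = 0.0;
    int it = 0;
    while (zr * zr + zi * zi <= 4.0 && it < MAXITER) {
        double t = zr * zr - zi * zi + cr;
        zi = 2.0 * zr * zi + ci;
        zr = t;
        ++it;
    }
    return it;                                     /* iterations needed to diverge   */
}

int main(void) {
    int first_row = 0, nrows = 100;                /* chunk "assigned" by the global scheduler */
    static int image[100][SIZE];

    #pragma omp parallel for schedule(dynamic)
    for (int y = first_row; y < first_row + nrows; ++y)
        for (int x = 0; x < SIZE; ++x)
            image[y - first_row][x] =
                escape_time(-2.0 + 3.0 * x / SIZE, -1.5 + 3.0 * y / SIZE);

    printf("pixel(0,0) iterations: %d\n", image[0][0]);
    return 0;
}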

Second, we compare guided self-scheduling based schemes as shown in Fig. 18. In this experiment, increasing the α value may degrade performance in the HGSS approach.


Fig. 17 Comparison of CSS based scheme for Mandelbrot set computation

Fig. 18 Comparison of GSS based scheme for Mandelbrot set computation

Our scheme outperforms the HGSS no matter which kind of self-scheduling scheme is adopted by the local scheduler. If the static scheme is employed in the second level, we can obtain the best performance improvement. As for the dynamic schemes, guided self-scheduling is better than chunk self-scheduling. In this case, our scheme for input size 1500 × 1500 provides 24.3%, 19.1% or 21.5% performance improvement over the HGSS if the static scheme, CSS, or GSS is adopted by the local scheduler, respectively.
Third, we compare factoring self-scheduling based schemes as shown in Fig. 19. In this experiment, enlarging the α value will improve the performance in the HFSS approach. Our scheme outperforms the HFSS no matter which kind of self-scheduling scheme the second-level scheduler adopts. Chunk self-scheduling provides the best performance improvement. In this case, our scheme for input size 1500 × 1500


Fig. 19 Comparison of FSS based scheme for Mandelbrot set computation

Fig. 20 Comparison of TSS based scheme for Mandelbrot set computation

provides 14.3%, 24.6% or 17.0% performance improvement over the HFSS if the static scheme, CSS, or GSS is adopted by the local scheduler, respectively.
Fourth, we compare trapezoid self-scheduling based schemes as shown in Fig. 20. In this experiment, a larger α value does not necessarily improve performance in the HTSS approach. Our scheme outperforms the HTSS no matter which kind of self-scheduling scheme the local scheduler adopts. If the second-level scheduling adopts chunk self-scheduling, we can get the best performance. Dynamic scheduling is more suitable for Mandelbrot set computation because it has irregular workload distribution and requires little communication at every scheduling step. However, if guided self-scheduling is adopted by the local scheduler, the performance improvement is similar to that of static self-scheduling because of the excessive scheduling overhead. In this case, our scheme for input size 1500 × 1500 provides


0.3%, 7.3% or 2.1% performance improvement over the HTSS if the static scheme, CSS, or GSS is adopted by the local scheduler, respectively.

4.4 Summary

We summarize the performance improvements obtained by our scheme for the three applications in Fig. 21. The average speedup is calculated by $\sum_{i=1}^{3} \mathrm{Speedup}_i / 3$, where Speedupi denotes the speedup of the ith problem size for the same scheme derived in the previous subsections.
For the Mandelbrot set computation, it has irregular workload distribution and requires little data communication at each scheduling step. For the heuristic self-scheduling, choosing a larger α value can lead to a better performance. For our approach, if the LGSS is adopted in the first level scheduling, no matter

Fig. 21 Average performance improvement comparison for three different applications


which kind of scheme is adopted in the second level scheduling, our scheme can improve the performance by over 20%. If any one of LCSS, LFSS, or LTSS is adopted, dynamic scheduling is preferred in the second level scheduling, especially guided self-scheduling.
For the sparse matrix multiplication, it has irregular workload distribution and requires data communication at each scheduling step. Increasing the α value does not necessarily improve the performance of the different schemes in the heuristic self-scheduling. For our scheme, the second level scheduling prefers the static scheme no matter which scheme is adopted in the first level scheduling. The reason may be that the chunk size dispatched to the local scheduler at each scheduling step is not large enough. Consequently, it is hard to achieve load balancing by dynamic scheduling, and the scheduling overhead will degrade the performance. On the other hand, compared with the results of the Mandelbrot set computation, our approach can improve the performance to a higher degree for sparse matrix multiplication, especially if LCSS is adopted. Therefore, the amount of data communication at each scheduling step significantly affects the performance improvement provided by our approach.
For the matrix multiplication, it has regular workload distribution and requires data communication at each scheduling step. The results are similar to those of sparse matrix multiplication. However, its improvements are smaller than those of sparse matrix multiplication. Therefore, our approach is more beneficial to applications with irregular workload distribution if the amount of data communication required is fixed. According to the execution times, LGSS with static local scheduling is the best choice regardless of program characteristics. That is, combining the GSS dynamic scheduling in the first level and the static scheduling in the second level can provide more stable performance improvements. The first level scheduling aims at balancing the workload assignment dynamically among heterogeneous computational nodes by the GSS scheme, while the local scheduling emphasizes the reduction of the local scheduling overhead since the amount of workload per scheduling step is not very large.
In the following, the Mandelbrot set computation implemented by GSS based schemes is further analyzed to explain the source of performance improvement and degradation for the different approaches. The HGSS scheme has a much longer average idle time than the LGSS scheme, as shown in Fig. 22, where the average idle time denotes how long a process has to wait for the completion of the program. Furthermore, the maximum idle time in the HGSS schemes is much longer than that in the LGSS scheme, as shown in Fig. 23, where the maximum idle time is the longest idle time among all processes. According to the results shown in Fig. 22 and Fig. 23, the LGSS scheme has better load balancing than the HGSS scheme. Next, we investigate how many messages have to be transmitted from the global scheduler to the local schedulers for the various GSS based schemes. For the HGSS scheme, because the α value specifies how much of the workload will be scheduled statically, the larger the α value is, the less communication is required, as shown in Fig. 24. Regardless of the α value, our proposed LGSS requires a much smaller amount of expensive inter-node communication. Finally, we compare the scheduling overheads for MPI processes and OpenMP threads as follows.
We constructed a cluster system consisting of one master node


Fig. 22 Average idle time for various GSS based schemes

Fig. 23 Maximum idle time for various GSS based schemes

and one four-core slave node. The GSS scheme is employed by the global scheduler to measure the scheduling overhead in the pure MPI programming model. On the other hand, we created different numbers of threads to measure the scheduling overhead in the pure OpenMP model by using the GSS scheme. Assuming that there are either w slave MPI processes in the pure MPI model or w parallel OpenMP threads in the pure


Fig. 24 The number of inter-node messages required for dispatching tasks from the global scheduler to the local schedulers

OpenMP model, the scheduling overhead is calculated as follows:

$$ET_w - \frac{ET_1}{w}, \quad (9)$$

where ETw represents the execution time when either w slave MPI processes or w parallel OpenMP threads are created, and ET1 represents the execution time when there is only one slave MPI process or one OpenMP thread. The scheduling overhead in the pure MPI model is much higher than that in the pure OpenMP model, as shown in Fig. 25. Moreover, the scheduling overhead becomes larger when the image size is increased because the GSS scheme requires more steps to finish the task assignment. The lower overhead for local scheduling explains why our proposed method outperforms the HSS approach.

5 Conclusions and future work

Previous research did not consider the features of multicore computers when proposing enhanced self-scheduling methods. In addition, the applications were developed based on the MPI message-passing programming model. In this paper, a cluster system with multicore computing nodes is regarded as a two-level hierarchical structure. The first level consists of computing nodes and the second level is comprised of processor cores. Accordingly, we proposed a two-level loop self-scheduling scheme based on the hybrid MPI and OpenMP programming model. MPI processes are used for inter-node communications and OpenMP threads are employed for intra-node communications. MPI processes communicate with each other by transmitting messages


Fig. 25 Overhead comparisons between the global scheduling and the local scheduling

via the network, while OpenMP threads communicate with each other by accessing data in shared memory. The global scheduler and the local scheduler are responsible for the workload assignment in the first level and the second level, respectively. Both kinds of schedulers dispatch tasks based on self-scheduling schemes. Initially, the global scheduler keeps all the tasks. When dispatching tasks to a local scheduler, the global scheduler also considers the accumulated computational power of all processor cores on the destination computing node. Processor cores cannot communicate directly with the global scheduler. Instead, they request tasks from the local scheduler, which is implemented by the OpenMP built-in scheduling routine. In this way, the inter-node communications can be minimized.
We have verified our approach with three kinds of applications. According to the experimental results, our method outperforms the heuristic self-scheduling (HSS) scheme [22]. Moreover, the performance improvement is very stable for any combination of first-level and second-level scheduling schemes. Furthermore, our approach can provide more performance improvement if the application has irregular workload distribution and requires data communication at every scheduling step. Finally, unlike the scheme proposed by Yang et al., it is not necessary for our method to choose an appropriate α value.
In the future, we will implement more types of application programs to verify our approach. Furthermore, we hope to find better ways of modeling the performance function, incorporating the amount of memory available, memory access costs, network information, CPU loading, etc. Also, based on the two-level scheduling scheme, the allocation function of each well-known self-scheduling scheme will be modified for better performance. Finally, a theoretical analysis of the proposed method will be addressed.


References

1. Baker M, Buyya R (1999) Cluster computing: the commodity supercomputer. Int J Softw Pract Exp 29(6):551–575
2. Beaumont O, Casanova H, Legrand A, Robert Y, Yang Y (2005) Scheduling divisible loads on star and tree networks: results and open problems. IEEE Trans Parallel Distrib Syst 16:207–218
3. Bennett BH, Davis E, Kunau T, Wren W (2000) Beowulf parallel processing for dynamic load-balancing. In: Proceedings of IEEE aerospace conference, 2000, vol 4, pp 389–395
4. Bohn CA, Lamont GB (2002) Load balancing for heterogeneous clusters of PCs. Future Gener Comput Syst 18:389–400
5. Cheng K-W, Yang C-T, Lai C-L, Chang S-C (2004) A parallel loop self-scheduling on grid computing environments. In: Proceedings of the 2004 IEEE international symposium on parallel architectures, algorithms and networks, KH, China, May 2004, pp 409–414
6. Chronopoulos AT, Andonie R, Benche M, Grosu D (2001) A class of loop self-scheduling for heterogeneous clusters. In: Proceedings of the 2001 IEEE international conference on cluster computing, 2001, pp 282–291
7. He Y, Ding HQ (2002) MPI and OpenMP paradigms on cluster of SMP architectures: the vacancy tracking algorithm for multi-dimensional array transposition. In: Proceedings of the 2002 ACM/IEEE conference on supercomputing, 2002, pp 1–14
8. Hummel SF, Schonberg E, Flynn LE (1992) Factoring: a method for scheduling parallel loops. Commun ACM 35:90–101
9. Introduction to the Mandelbrot set (2008) http://www.ddewey.net/mandelbrot/
10. Li H, Tandri S, Stumm M, Sevcik KC (1993) Locality and loop scheduling on NUMA multiprocessors. In: Proceedings of the 1993 international conference on parallel processing, vol II, 1993, pp 140–147
11. Polychronopoulos CD, Kuck D (1987) Guided self-scheduling: a practical scheduling scheme for parallel supercomputers. IEEE Trans Comput 36(12):1425–1439
12. Post E, Goosen HA (2001) Evaluating the parallel performance of a heterogeneous system. In: Proceedings of 5th international conference and exhibition on high-performance computing in the Asia-Pacific region (HPC Asia 2001)
13. Rosenberg R, Norton G, Novarini JC, Anderson W, Lanzagorta M (2006) Modeling pulse propagation and scattering in a dispersive medium: performance of MPI/OpenMP hybrid code. In: Proceedings of the ACM/IEEE conference on supercomputing, 2006, pp 47–47
14. Shih W-C, Yang C-T, Tseng S-S (2007) A performance-based parallel loop scheduling on grid environments. J Supercomput 41(3):247–267
15. Sterling T, Bell G, Kowalik JS (2002) Beowulf cluster computing with Linux. MIT Press, Cambridge
16. Tang P, Yew PC (1986) Processor self-scheduling for multiple-nested parallel loops. In: Proceedings of the 1986 international conference on parallel processing, 1986, pp 528–535
17. The Scalable Computing Laboratory (SCL) (2008) http://www.scl.ameslab.gov/
18. Tzen TH, Ni LM (1993) Trapezoid self-scheduling: a practical scheduling scheme for parallel compilers. IEEE Trans Parallel Distrib Syst 4:87–98
19. Yang C-T, Chang S-C (2004) A parallel loop self-scheduling on extremely heterogeneous PC clusters. J Inf Sci Eng 20(2):263–273
20. Yang C-T, Cheng K-W, Li K-C (2005) An enhanced parallel loop self-scheduling scheme for cluster environments. J Supercomput 34(3):315–335
21. Yang C-T, Cheng K-W, Shih W-C (2007) On development of an efficient parallel loop self-scheduling for grid computing environments. Parallel Comput 33(7–8):467–487
22. Yang C-T, Shih W-C, Tseng S-S (2008) Dynamic partitioning of loop iterations on heterogeneous PC clusters. J Supercomput 44(1):1–23


Chao-Chin Wu is an Associate Professor and Chairman in Computer Science and Information Engineering, National Changhua University of Education, Taiwan. He received the B.Sc. degree in Computer Science and Engineering from Tatung Institute of Technology, Taiwan, in 1990, and the M.Sc. and Ph.D. degrees in Computer Science and Information Engineering from National Chiao Tung University, Taiwan, in 1992 and 1998, respectively. His research interests are in grid computing, parallel processing, and computer architecture. He is a member of the IEEE Computer Society.

Lien-Fu Lai is an Associate Professor in the Department of Computer Science and Information Engineering at National Changhua University of Education in Taiwan. His research interests include expert systems, knowledge management, software engineering, and parallel processing. He received his Ph.D. in 1999 and M.Sc. in 1995, both in Computer Science from National Central University in Taiwan. For more details, please visit http://www.csie.ncue.edu.tw/~lflai.

Chao-Tung Yang received a B.Sc. degree in Computer Science and Information Engineering from Tunghai University, Taichung, Taiwan in 1990, and the M.Sc. degree in Computer and Information Science from National Chiao Tung University, Hsinchu, Taiwan in 1992. He received the Ph.D. degree in Computer and Information Science from National Chiao Tung University in July 1996. He won the 1996 Acer Dragon Award for outstanding Ph.D. dissertation. He has worked as an associate researcher for ground operations in the ROCSAT Ground System Section (RGS) of the National Space Program Office (NSPO) in Hsinchu Science-based Industrial Park since 1996. In August 2001, he joined the faculty of the Department of Computer Science and Information Engineering at Tunghai University, where he is currently an Associate Professor. His research has been sponsored by the Taiwan agencies National Science Council (NSC), National Center for High Performance Computing (NCHC), and Ministry of Education. His present research interests are in grid and cluster computing, parallel and high-performance computing, and internet-based applications. He is a member of both the IEEE Computer Society and ACM.


Po-Hsun Chiu is an undergraduate in the Department of Computer Science and Information Engineering at National Changhua University of Education in Taiwan. After receiving the B.Sc. degree in June 2009, he will become a graduate student in the Department of Computer Science and Information Engineering at National Taiwan University in Taiwan. His research interests include parallel computing and multicore embedded systems.