2015 IEEE 17th International Conference on High Performance Computing and Communications (HPCC-ICESS-CSS 2015)
A Data-oriented Method for Scheduling Dependent Tasks on High-density Multi-GPU Systems
Peng Zhang
Yuxiang Gao
Meikang Qiu
Biomedical Engineering Department, Stony Brook University, Stony Brook, NY, United States
[email protected]
Cluster Solution Department, Cray Inc., San Jose, CA, United States
[email protected]
Computer Science Department, Pace University, New York, NY, United States
[email protected]
Abstract—Rapidly changing computer architectures, while improving computer performance, have been challenging the programming environments that must efficiently harness the potential of these novel architectures. In particular, although the high-density multi-GPU architecture enables the unparalleled performance advantage of dense GPUs in a single server, it increases the difficulty of scheduling diversified and dependent tasks. We therefore propose a data-oriented method for scheduling dependent tasks on this architecture and provide its implementation. In our method, we model a parallel program as a collection of data-dependent tasks whose data dependencies are managed by an expressive matrix. Accordingly, we develop a hierarchical scheduler infrastructure for this model: a top scheduler queries the data-dependency matrix; three downstream schedulers queue the computation tasks that are exclusively assigned to processors, to accelerators, or to either; and a multitude of bottom schedulers each provide a processing element with its assigned tasks. We test our scheduler on the Strassen matrix multiplication and Cholesky matrix inversion algorithms on a computer that has 8 Tesla K40 GPUs. The results show that our method offers efficient task parallelism while fulfilling the complex task dependencies. While advanced task-oriented schedulers have been widely designed for distributed systems, a lightweight data-driven scheduler can be an alternative and handy approach for handling the dependent yet diversified tasks of data-intensive applications on the novel high-density multi-accelerator systems.
Keywords—data scheduling; task scheduling; parallel computing; heterogeneous multi-GPU systems
I. INTRODUCTION
To accommodate the ever-increasing need for computing power, today's advanced parallel computers exploit complex network topologies, such as the torus and its variants, for coupling millions of processor cores, as well as co-processors or accelerators such as GPUs for boosting nodal performance [1-5]. Though novel technologies keep fueling the development of parallel computers, the rapidly escalating complexity of heterogeneous infrastructures, coupled with the need to handle data-dependent yet diversified tasks, has become a serious roadblock to efficiently harnessing the potential of these architectures [4, 6, 7]. Recently, the dense
multi-GPU architecture has emerged with an accelerator-optimized design, enabling the unparalleled performance advantage of high-density GPUs in a single-node server. This change in computer architecture naturally invites a reinvestigation of employed algorithms and methods. For example, matrix multiplication methods perform very differently on a multicore system and on low-density and high-density multi-GPU systems [8]. Similarly, the task-scheduling problem needs to be addressed for these accelerator-optimized systems [9, 10]. Nevertheless, the problem of scheduling dependent tasks on heterogeneous architectures is far from trivial and remains an area of intense development [2, 9-15]. The task graph is the routine formulation for most task-scheduling models, while data management is developed as a facility library for task execution. The multitasking program and the distributed computing architecture are each formulated as a weighted directed acyclic graph (DAG). Specifically, a vertex of the program DAG is a computation task associated with specified data sets; an edge of the program DAG indicates a task dependency and its weight is the inter-task communication load. A vertex of the computer DAG is a computer node that can process the computation tasks; an edge of the computer DAG shows the inter-node wiring and its weight can be the inter-node communication latency. The general objective of task-scheduling problems is to reduce the completion time, maximize task parallelism and balance compute loads. Critical-path, divide-and-conquer and heuristic algorithms are usually employed to seek such optimal solutions [16], for which NP-completeness is a common barrier. The intrinsic complexity of multicore computers and the dynamic attributes of multitasking programs collectively constrain the scheduler design. Departing from the design methodology of these task-oriented strategies, we reformulate the scheduling model with a data-oriented design and then address the problem of scheduling dependent and diversified tasks on a dense accelerator system. Briefly, we define an abstraction of every single piece of data and store the data dependencies in an expressive matrix that is referred to as the data-dependence matrix. Fulfillment of task dependencies is based on this matrix. By analyzing this matrix and the runtime data creation,
we produce new tasks. In support of the high-density multi-accelerator architecture, we adopt a hierarchical scheduler for assigning the diversified tasks to appropriate processing units. The rest of the paper is organized as follows: Sec. II presents the concepts and definitions; Sec. III presents a framework for the hierarchical scheduler infrastructure; Sec. IV describes the hardware specification and software libraries used to implement the scheduler; Sec. V applies the scheduler to the Strassen multiplication and Cholesky inversion algorithms on an 8-K40 server to demonstrate its applicability; related work is discussed in Sec. VI and conclusions are drawn in Sec. VII.
II. DATA-ORIENTED DESIGN
The data-oriented method has been used for the task-mapping problem on distributed computing architectures [15]. In the mapping problem on parallel computers, data movement for inter-task communication is the performance barrier, so the objective is to seek the assignment of tasks that optimizes these communication activities. For the scheduling problem on high-density multi-GPU computers, however, task parallelism and offloading are the common roadblocks that impede fully harnessing the potential of multiple accelerators. To this end, the data-oriented design is tailored for dense multi-accelerator computers. Obviously, the scheduler design for the multi-GPU architecture needs to take the architectural heterogeneity into account, and the variety of tasks needs to be scrutinized with respect to the hardware. For example, matrix addition is a relatively low-performing task for an accelerator but a high-performing task for a processor, whereas matrix multiplication is a high-performing task for an accelerator but a low-performing task for a processor. This shows that diversified tasks should be classified so that the performance advantage of accelerators can be exploited. Such task classification is not always necessary in the task-mapping problem for a homogeneous multicore architecture.
A. Definitions and Observations
Definition 1 (data module): An algorithm is assumed to consist of a group of data modules D = {ds}, in which the subscript s is the identity of a data module. S is referred to as the set of all data modules and N = |S| is the total number of elements in S, i.e., the total number of participating data modules. A data module is viewed as an atomic unit of data that can be an input or a result produced at runtime. Thus, there are two kinds of data modules: a module is referred to as an initial module if it already exists at program startup; otherwise, it is referred to as an intermediate module that is produced at runtime.
Definition 2 (task module): An algorithm is assumed to consist of a set of task modules T = {tk}, in which the subscript k is the identity of a task module. Task module tk is a method denoted as dr = tk(ds1, …, dsn), in which the subscripts r, s1, …, sn are elements of S. ds1, …,
dsn are the inputs and dr is the result of task tk. tk is executable as soon as the data ds1, …, dsn are ready.
Definition 3 (data dependence): Under the same assumption, when dr is the data produced by task tk(ds1, …, dsn), we say that dr depends on ds1, …, dsn. That is, dr is a dependent of dα and dα is an antecedent of dr, in which the subscript α is one of s1, …, sn. Data dependence is unidirectional for deadlock avoidance. Initial data have no antecedents. Any intermediate data module may have multiple dependents, but it has at least one antecedent.
Definition 4 (data dependence graph): The data dependence graph is defined as a directed acyclic graph G(V, E). Each vertex in V represents a data module (Def. 1) and each edge in E represents a data dependence (Def. 3). For example, if dβ is a dependent of dα, this dependency is indicated by a directed edge eαβ from dα to dβ.
Definition 5 (data dependence matrix): The adjacency matrix of the data dependence graph G(V, E) (Def. 4) is the data dependence matrix A = [aαβ], in which aαβ = 1 if there exists an edge eαβ from dα to dβ, and aαβ = 0 otherwise. |V| denotes the number of vertices, i.e., the number of data modules (Def. 1).
Definition 6 (data relevance): If both dk1 and dk2 have the same antecedent dk, then dk1 and dk2 are relevant to each other through dk. All antecedents of data dk form its relevance set, denoted Ω(dk).
Observation 1: ds is an initial data module if and only if ∑k aks = 0. Proof: ds is an initial data module as long as it has no antecedent, that is, aks = 0 for every k (Def. 3). Observation 1 is used to identify the initial data modules by analyzing the data dependence matrix.
Observation 2: Ω(dk) = { dm | amk = 1 }. This observation is a deduction from Def. 3 and Def. 6. It is used to identify the data set required for producing a given data module dk.
Observation 3: Given L = { di | ∑k aki = 0 }, then |T| = N − |L|. N is the number of data modules (Def. 1) and |T| is the number of tasks (Def. 2). L is the set of initial data modules (Observation 1), so |L| is the number of initial data modules.
Observation 4: Q(ds) = ∑k ask is the number of data modules that depend on ds. This observation offers a means to guide dynamic memory management. Here, Q(ds) is the number of functions that need ds as input, and it is calculated directly from the data dependence matrix. At runtime, we subtract one from Q(ds) for each execution of a task that takes ds as input. Memory is allocated for ds at its first use and it is freed as soon as Q(ds) reaches 0.
B. Motivational Examples
We show the data dependence matrices for the Strassen matrix multiplication and Cholesky matrix inversion algorithms. Both are tile-based algorithms in which the input matrices are partitioned into submatrices (tiles). For simplicity, each tile is a square submatrix and stored as a data module. The resulting matrices are produced
and stored in the same tile-based manner. Fig. 1 shows the tile-based partition for Strassen multiplication. Here, the input matrices A and B are stored in a row-major and a column-major way, respectively, and the resulting matrix C is stored in a row-major way. Fig. 2 shows the partition for Cholesky matrix inversion. Cholesky inversion consists of three successive steps: Cholesky factorization (S = L×L^T); lower triangular matrix inversion, yielding L^-1; and, last, the product of triangular matrices (S^-1 = L^-T×L^-1). The input S is a symmetric positive-definite matrix, L is a lower triangular matrix and L^-1 is the inverse of L. A naïve approach to Cholesky inversion is to perform the three steps sequentially, but it is poor at task parallelization. Thus, interleaving the three steps while adhering to the task dependencies is needed.
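To make the definitions and observations above concrete, the following is a minimal C++ sketch of a dense data-dependence matrix with the queries used later by the scheduler. It is our own illustration under assumed names (DependenceMatrix, register_task, and so on), not the paper's implementation; the actual scheduler stores the matrix in a Google hash table (Sec. IV) rather than a dense array.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of a data-dependence matrix (Def. 5).
// a_[s][r] == 1 means module r depends on module s (edge s -> r, Def. 3/4).
class DependenceMatrix {
public:
    explicit DependenceMatrix(std::size_t n) : n_(n), a_(n, std::vector<int>(n, 0)) {}

    // Register a task d_r = t_k(d_s1, ..., d_sn) (Def. 2): every input is an
    // antecedent of the result, so set a[s][r] = 1 for each input s.
    void register_task(std::size_t result, const std::vector<std::size_t>& inputs) {
        for (std::size_t s : inputs) { assert(s != result); a_[s][result] = 1; }
    }

    // Observation 1: d_s is an initial module iff its column sums to zero.
    bool is_initial(std::size_t s) const {
        for (std::size_t k = 0; k < n_; ++k) if (a_[k][s]) return false;
        return true;
    }

    // Observation 2: relevance set Omega(d_s) = { d_m | a[m][s] == 1 }.
    std::vector<std::size_t> relevance_set(std::size_t s) const {
        std::vector<std::size_t> omega;
        for (std::size_t m = 0; m < n_; ++m) if (a_[m][s]) omega.push_back(m);
        return omega;
    }

    // Observation 3: |T| = N - |L|, one task per intermediate module.
    std::size_t num_tasks() const {
        std::size_t initial = 0;
        for (std::size_t s = 0; s < n_; ++s) if (is_initial(s)) ++initial;
        return n_ - initial;
    }

    // Observation 4: Q(d_s) is the row sum of s, i.e., how many modules still
    // need d_s as input; the MMU (Sec. III) frees d_s once this count hits zero.
    std::size_t dependent_count(std::size_t s) const {
        std::size_t q = 0;
        for (std::size_t k = 0; k < n_; ++k) q += a_[s][k];
        return q;
    }

private:
    std::size_t n_;
    std::vector<std::vector<int>> a_;
};
```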
Figure 1. Tile-based partition for matrix multiplication
Figure 2. Tile-based partition for Cholesky inversion algorithm
The data dependence matrices for the Strassen matrix multiplication and Cholesky matrix inversion algorithms are shown in Fig. 3 and Fig. 4, respectively. The generation procedure is as follows: the algorithm is described using Def. 2 and Def. 3; Def. 4 is then employed to create the data dependence graph and, accordingly, Def. 5 to create the data dependence matrix. Moreover, we use different colors to mark different functions. For example, the Strassen algorithm needs two basic functions: matrix addition (blue) and matrix multiplication (red). These dependence matrices demonstrate that, as the number of tiles increases, the dependency complexity of the parallel program grows greatly.
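As a usage illustration of this generation procedure, the snippet below registers a tiny hypothetical four-module chain (not an actual Strassen or Cholesky tiling) with the DependenceMatrix sketched in Sec. II.A and queries the observations; the module indices and expected outputs are our own toy example.

```cpp
#include <cstdio>

int main() {
    // Hypothetical 4-module example: d2 = add(d0, d1), d3 = mul(d2, d0).
    DependenceMatrix A(4);
    A.register_task(2, {0, 1});   // d2 depends on d0 and d1
    A.register_task(3, {2, 0});   // d3 depends on d2 and d0

    std::printf("initial d0? %d, d2? %d\n", A.is_initial(0), A.is_initial(2)); // 1, 0
    std::printf("tasks = %zu\n", A.num_tasks());         // 2 (= N - |L| = 4 - 2)
    std::printf("Q(d0) = %zu\n", A.dependent_count(0));  // 2: both tasks read d0
    return 0;
}
```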
Figure 3. Data dependence matrices for Strassen algorithm under different partition sizes: 2×2 tiles (23 modules), 4×4 tiles (168 modules), 8×8 tiles (1,280 modules) and 16×16 tiles (9,984 modules)
Figure 4. Data dependence matrices for Cholesky inversion algorithm under different partition sizes: 2×2 tiles (16 modules), 4×4 tiles (80 modules), 8×8 tiles (480 modules) and 16×16 tiles (3,264 modules)
III. HIERARCHICAL SCHEDULER MIDDLEWARE
We propose a hierarchical scheduler middleware for the multi-GPU architecture. The multi-GPU architecture consists of two kinds of processing elements: CPUs and GPUs. As usual, CPUs have faster access to the host memory but lower computation capability than GPUs, which are connected via PCIe lanes. Considering these intrinsic differences, we show the schema of the scheduler in Fig. 5. It provides three abstractions: user, scheduler and device. The user layer has two components: (a) "Algorithms" means the user programs that partition the data and describe the task dependencies; for example, "Strassen multiplication" needs to specify the number of tiles and the dependency matrix (Fig. 3). (b) "Libraries" means established libraries such as MKL, CUBLAS and MAGMA. The scheduler is a middle layer separating the user algorithms from the heterogeneous hardware. The "data-oriented scheduler" (DoS) is the upper scheduler that stores the data dependence matrix. At program startup, it employs Observation 1 to find all of the initial modules, Observation 2 to build the relevance sets for the intermediate data modules, and Observation 3 to find the total number of computation tasks. Therefore, all of the needed information is derived from the data dependence matrix. At runtime, DoS keeps checking the real-time availability of data, and a new computation task is queued to one of the task stacks as soon as all data in its relevance set are ready, as sketched below.
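The runtime check just described could look roughly like the following. This is a minimal sketch under assumed names (Task, StackKind, classify and push_to_stack are hypothetical, and DependenceMatrix is the sketch from Sec. II); the actual DoS presumably updates readiness incrementally rather than rescanning all pending modules on every event.

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical task record routed to one of the three stacks (Sec. III).
enum class StackKind { Cpu, Gpu, Mixed };
struct Task {
    std::size_t result;                 // intermediate module produced by the task
    std::vector<std::size_t> inputs;    // its relevance set Omega(d_result)
    StackKind kind;                     // which task stack it belongs to
};

// Whenever a module becomes available, re-examine the not-yet-issued
// intermediate modules and issue a task once its whole relevance set is ready.
template <typename PushFn>
void on_module_ready(std::size_t ready_id,
                     const DependenceMatrix& A,
                     std::unordered_set<std::size_t>& available,
                     std::unordered_set<std::size_t>& pending,   // results not yet issued
                     StackKind (*classify)(std::size_t result),  // e.g. mul -> Gpu, add -> Cpu
                     PushFn push_to_stack) {
    available.insert(ready_id);
    for (auto it = pending.begin(); it != pending.end(); ) {
        std::vector<std::size_t> omega = A.relevance_set(*it);
        bool ready = true;
        for (std::size_t m : omega)
            if (!available.count(m)) { ready = false; break; }
        if (ready) {
            push_to_stack(Task{*it, omega, classify(*it)});  // FIFO task stack of Fig. 6
            it = pending.erase(it);
        } else {
            ++it;
        }
    }
}
```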
Three special-purpose task stacks are connected to DoS. A task stack is a FIFO (first-in first-out) container that buffers computation tasks (Def. 2). The GPU and CPU task stacks store tasks that must execute exclusively on GPUs or CPUs, respectively, while the mixed task stack stores tasks that could execute on either. Later we show that this separation of task stacks is useful. Fig. 6 shows the schematic structure of a task stack, for which a thread-safe mechanism is provided (a sketch follows below). Downstream of the task stacks, the GPU/CPU schedulers keep taking new tasks off the stack tops at runtime. A CPUProc is a thread running on CPU cores and a GPUProc is a thread bound to a specific GPU device. A multitude of GPUProcs and CPUProcs run concurrently and keep attempting to receive tasks from the GPU/CPU schedulers. A task starts execution as soon as it is deployed on a GPUProc or CPUProc. Besides, a memory management unit (MMU) is built. The MMU not only provides the data storage but also manages memory allocation and freeing: based on Observation 4, the memory for ds is freed once Q(ds) = 0. We benchmark the scheduler performance by measuring the wallclock time (in seconds) and computing the scheduling efficiency. The wallclock time of a program is the elapsed time between the start of the first task and the finish of the last task. The GPU/CPU scheduling efficiency is defined as the GPU/CPU busy ratio, i.e., the percentage of the total time that a device is busy. Per-thread GPU/CPU scheduling efficiencies are computed for the individual GPU/CPU threads and then averaged. A larger overall scheduling efficiency implies better parallelism.
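A thread-safe FIFO container of the kind shown in Fig. 6 could be sketched as below; this is our own minimal illustration using a mutex and condition variable (TaskStack, close and the Task record from the previous sketch are assumed names, not the paper's API).

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <utility>

class TaskStack {
public:
    void push(Task t) {
        { std::lock_guard<std::mutex> lk(m_); q_.push(std::move(t)); }
        cv_.notify_one();
    }
    // Blocks until a task is available; returns false once the stack has been
    // closed at program end and drained.
    bool pop(Task& out) {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this] { return closed_ || !q_.empty(); });
        if (q_.empty()) return false;
        out = std::move(q_.front());
        q_.pop();
        return true;
    }
    void close() {
        { std::lock_guard<std::mutex> lk(m_); closed_ = true; }
        cv_.notify_all();
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<Task> q_;
    bool closed_ = false;
};

// A CPUProc/GPUProc worker then simply drains its stack until it is closed,
// e.g. (execute_on_cpu is a hypothetical dispatch function):
//   void cpu_proc(TaskStack& s) { Task t; while (s.pop(t)) execute_on_cpu(t); }
```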
Figure 5. Scheduler structure for implementing DDP on heterogeneous multi-GPU architectures
Figure 6. Schematic structure for task stack
IV. HIGH-DENSITY MULTI-GPU SYSTEMS
A. Hardware
All experiments are performed on a Cray CS-Storm server¹. This 2U server is a high-density multi-GPU server that supports eight NVIDIA Tesla K40 accelerators and two Intel Xeon processors. It delivers up to 11.44 TFlops in double precision and 34.32 TFlops in single precision. 150 GB of ECC DDR3 SDRAM is installed on the host and each K40 card carries 12 GB of GDDR5 memory. The host processors (Intel Xeon E5-2670 v2) run at 2.5 GHz and have 10 cores each. The host memory speed is 1867 MHz. The memory bandwidth of a K40 can reach 288 GB/sec (ECC off). Four PCIe switches (Gen 3 x16) are enclosed, each switch hooking two GPU cards to the host processors. One 150 GB Intel SSD is installed for local storage.
B. Software
The system software includes RHEL 6.5 and NVIDIA driver 340.32. For the best performance of subroutines on CPUs and GPUs, we select three BLAS (basic linear algebra subprograms) libraries: (1) Intel Math Kernel Library (MKL v11.2) for CPUs; (2) CUBLAS (CUDA 6.5) for GPUs; and (3) MAGMA (matrix algebra on GPU and multicore architecture, v1.6) for GPUs. The compiler package is Intel Parallel Studio 2015. The scheduler uses the Google hash table library to manage the data dependence matrix and is written with the Boost C++ libraries.
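As an illustration of how a tile-level multiplication task might call these libraries, the sketch below runs one n×n tile product either through MKL on the host or through CUBLAS on a selected GPU. The function names (tile_gemm_cpu, tile_gemm_gpu) and the per-call allocation are our simplification, not the paper's code; a production scheduler would keep cuBLAS handles, device buffers and CUDA streams alive across tasks, and could call MAGMA routines instead.

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <mkl.h>

// C = A * B for one n-by-n tile on the host via MKL (column-major storage).
void tile_gemm_cpu(int n, const double* A, const double* B, double* C) {
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
}

// The same tile product on one GPU via cuBLAS; a GPUProc thread is bound
// to a single device, mirroring the Sec. III design.
void tile_gemm_gpu(int device, int n, const double* A, const double* B, double* C) {
    cudaSetDevice(device);
    size_t bytes = (size_t)n * n * sizeof(double);
    double *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t h;
    cublasCreate(&h);
    const double alpha = 1.0, beta = 0.0;
    cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);
    cublasDestroy(h);

    cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}
```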
V. RESULTS
We first demonstrate the performance of the Strassen algorithm. Though Strassen reduces the number of computing operations, it increases the complexity by introducing more task dependencies. This challenges the scheduler efficiency: low latency is needed for the scheduler to find the next available computation tasks. Second, we investigate the performance of Cholesky matrix inversion, where the interwoven dependencies place a roadblock to task parallelism.
A. Strassen Matrix Multiplication
Matrix addition and multiplication are the two basic operators, so their performance is first tested over a wide range of problem sizes (Fig. 7). Fig. 7 clearly shows that (1) the GPU completely outperforms the CPU in matrix multiplication; a larger input matrix results in a larger improvement of GPU over CPU, and the GPU can be 20~40 times faster than the CPU in matrix multiplication. However, (2) the CPU is better than the GPU in matrix addition, which can be attributed to the data transfer latency between the GPUs and the host. Thus, in our scheduler (Fig. 5), we direct matrix multiplication to GPUs and matrix addition to CPUs, as sketched below.
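This routing rule can be expressed as a small classification hook of the kind assumed in the Sec. III sketch; FunctionKind and classify_strassen are hypothetical names, and a real hook would first map a result module to the kind of function that produces it.

```cpp
// Routing rule observed in Fig. 7: multiplications go to the GPU stack,
// additions to the CPU stack (StackKind is from the Sec. III sketch).
enum class FunctionKind { MatrixAdd, MatrixMul };

StackKind classify_strassen(FunctionKind f) {
    return (f == FunctionKind::MatrixMul) ? StackKind::Gpu : StackKind::Cpu;
}

// Tasks that perform comparably on both devices could instead return
// StackKind::Mixed so that whichever device idles first picks them up.
```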
1 CS-Storm specification: http://www.cray.com/sites/default/files/CrayCSStorm.pdf
Figure 7. Performance of matrix addition and multiplication
Fig. 8 shows the wallclock time (in seconds) for different configurations and Fig. 9 analyzes the speedups of GPUs over CPUs. Fig. 10 illustrates the parallel activities traces (PAT) for the multi-GPU and multi-CPU architectures. A PAT is a 2D graph that shows the activities of concurrent processes and threads: the horizontal axis shows the wallclock time and the vertical axis indicates the device type (CPU/GPU) and the thread identities. Different colors indicate different types of tasks; for example, blue (red) bars mean matrix addition (multiplication) in Fig. 10. Naturally, the two ends of a color bar show the starting and ending times of a particular task and the bar length is the amount of time the underlying task takes. The PAT graphic helps describe vividly the parallel activities of parallel programs. From these results, we find that:
• The scheduler significantly improves parallel performance. Fig. 10 shows that the 2-GPU solution improved the multi-CPU performance by 80.2% and 8 GPUs boosted this improvement to 91.2%.
• The scheduler is efficient. In Fig. 10, the average GPU scheduling efficiency is 99%. This confirms that the latency caused by the scheduler is negligible.
• For the Strassen algorithm, the multi-GPU solutions are approximately 15 times faster than the multi-CPU solution (Fig. 9).
• With increasing problem size, the multi-GPU solutions are much faster than the multi-CPU solution, in both double and single precision (Fig. 9). However, when the problem size is relatively small, more GPUs do not always lead to better performance. For example, for the case Strassen (24K, 12K) in single precision, the best speedup and performance occur with the 4-GPU solution (Fig. 9 and Fig. 8).
Figure 8. Performance of Strassen algorithms using double-precision (left) and single-precision (right). Vertical axis is the number of GPUs. Horizontal axis is wallclock time measured in seconds. In the legend, the two numbers give the size of the input matrix and the size of a submatrix (tile). For example, "Strassen (72K, 6K)" means the input matrix is a 72000×72000 square matrix and a submatrix is a 6000×6000 square matrix, so the input matrix is partitioned into 12×12 submatrices (tiles).
Figure 9. Speedup of Strassen algorithms for GPU over CPU using double- (left) and single-precision (right). Horizontal axis is the number of GPU cards. Vertical axis is the speedup of GPU over CPU. The legend holds the same meaning as in Fig. 8.
Figure 10. Parallel activities trace (PAT) of Strassen algorithm (problem size = 72,000, submatrix size = 12,000, double-precision using 12 CPU cores and up to 8 GPUs)
B. Cholesky Matrix Inversion
The Cholesky matrix inversion algorithm requires more basic functions in addition to matrix addition and multiplication.
Fig. 11 shows the performance of these individual functions on CPUs and GPUs. The results imply that the GPU outperforms the CPU when the problem size is large enough. Thus, all these
new functions are directed to GPUs, while matrix addition remains directed to CPUs. Fig. 12 shows the wallclock time (in seconds) for programs using different problem sizes and precisions. Accordingly, Fig. 13 analyzes the speedups of GPUs over CPUs. Fig. 14 illustrates and compares the parallel activities traces (PAT) of the multi-GPU and multi-CPU solutions. From these results and analyses, we find that:
• The scheduler can address the challenge of tangled task dependencies such as those in the Cholesky inversion algorithm (Fig. 4), and it helps absorb the increased complexity of the numerical algorithm. Parallelizing the Cholesky inversion algorithm usually demands a thorough analysis of multiple critical paths for the Cholesky factorization, the lower triangular matrix inversion and the matrix multiplication, and then it needs to compact the substantial interwoven tasks at every opportunity. The challenges in optimizing such algorithms are prohibitive even on homogeneous systems, not to mention heterogeneous systems. Our scheduler, however, needs only a data dependence matrix, from which all the information needed for optimal task parallelism can be found.
• The efficiency of the hierarchical scheduler (Fig. 5) is affirmed again in Fig. 14. The 8-GPU solution improves the multi-CPU solution by 90.5%.
• Optimizing the multi-CPU and the multi-GPU solutions requires different partition sizes. Assume the problem size is N, the partition size is P and the submatrix size is S, so that N = S × P (e.g., N = 45,000 partitioned into 6×6 tiles gives S = 7,500, as in Fig. 14). For a given N, a larger S results in better performance for the multi-GPU solutions, since a larger submatrix increases the ratio of computation to communication. On the other hand, for the same N, a larger P results in better performance for the multi-CPU solution (Fig. 12), since more submatrices increase the maximal number of concurrent tasks. These facts affirm that the GPU-CPU data transfer is still the key roadblock to GPU performance, so large data chunks are preferred, and that task parallelism has a strong impact on the multi-CPU performance, so more submatrices better balance the workload among the processor cores. Thus, an architecture-dependent optimization strategy is often needed.
Figure 11. Performance of individual functions on CPU or GPU using double- or single-precision for Cholesky inversion algorithm (problem size is the input matrix size and the partition is 4×4 tiles)
Figure 12. Performance of Cholesky inversion algorithms using double-precision (upper row) and single-precision (lower row). Problem sizes are 45,000 (left column) and 30,000 (right column).
Figure 13. Speedup of Cholesky inversion algorithms for GPU over CPU using double- and single-precision. Horizontal axis is the number of GPUs (together with 12 CPU threads). Vertical axis is the speedup of GPU over CPU. In the legend, the two numbers show the problem size and the submatrix size.
Figure 14. Parallel activities trace (PAT) of Cholesky inversion algorithm (problem size = 45000, submatrix size = 7,500, double-precision using 12 CPU threads and up to 8 GPUs)
C. Discussions
Testing the Strassen and Cholesky algorithms demonstrates the potential of the data-oriented method applied to data-intensive algorithms and the applicability of the hierarchical scheduler on dense multi-GPU systems. In this approach, developers only need to describe the substantial tasks sequentially, based on user-defined data partitions; our scheduler analyzes the data dependencies and simultaneously manages the memory. Static scheduling is traditionally used to parallelize multitasking programs for multi-core and multi-accelerator architectures, preventing out-of-order issues. Solving a static scheduling problem based on the task DAG, however, is becoming computationally prohibitive due to the rapid growth of algorithmic complexity and architectural heterogeneity. Moreover, the varying features of devices are very difficult to model, so it is hard to predict the performance of static methods on real systems. In the literature, dynamic DAG-based task-oriented scheduling is a good alternative for parallelizing real-time tasks on heterogeneous systems, closer to the real behavior, but it still needs to be enhanced in terms of time complexity and scheduling efficiency. Here, our scheduler takes advantage of the data-oriented concept to simplify the complex dependency analysis for substantial tasks, and it can avoid out-of-order hazards. The third benefit is balanced computing loads: Fig. 10 and Fig. 14 present the well-balanced computing loads for the Strassen and Cholesky algorithms. No starving appears and the computations finish almost at the same time, which affirms the efficiency of the scheduler. Additionally, standard high-performance libraries for assorted platforms (such as CUBLAS, MAGMA and MKL) are integrated into the program, so the collective excellence of previous efforts benefits advanced programming. This integration allows scientists to effortlessly achieve good performance on hybrid architectures. Last, the results re-affirm the great potential of modern accelerator technologies for the development of HPC systems. GPU-assisted programs are much faster than CPU-only programs, and the high-density multi-GPU system delivers remarkable computing performance in space-efficient hardware.
VI. RELATED WORK
Heterogeneous architectures have been receiving more and more attention as the accelerator-optimized design is widely adopted in parallel computers. To harness the potential of these novel computers, researchers have devoted themselves to developing user-friendly libraries and new programming models with support for automated data transfers and implicit task-offloading technologies. StarPU is a runtime system that offers a unified interface for offloading computation onto accelerators [9, 11]. This framework enables the automation of
data transfers and task offloading on heterogeneous architectures, and it has been used in high-level middle-layer libraries such as MAGMA [17], SkePU [18] and PASTIX [19]. Similarly, we developed our scheduler by integrating such low-level libraries; in particular, we used MAGMA for matrix computations on GPUs. Different from the task-oriented approach, we made an attempt at a data-oriented approach, in which we declare an abstraction for every single piece of data and express the data dependencies so that we can manipulate a high-level description of the data. This differs from a task-oriented approach, in which the programmers usually express the complex task graphs and the data management is built only as a facility library for task execution. OpenMP is a parallel programming model in which applications are annotated with user-defined SIMD directives and the task parallelism is exploited by the compilers. OmpSs [13] is a model based on the OpenMP structure, in which data-sharing attributes are declared for fulfilling the data dependencies. Our approach is not based on OpenMP and it does not adopt a SIMD parallel architecture; instead, we assign different functions to different processing units to exploit the advantage of each device. In addition to the intra-node processes, schedulers have also been designed for inter-node processes on distributed computing architectures [10, 14, 15, 20, 21]. For an inter-node task-scheduling problem, not only the task execution time but also the message-passing latencies need to be modeled, which may involve other sophisticated factors such as network topologies and routing policies. There are many excellent models and tools for this kind of problem, such as Parallax, a heuristic task-scheduling runtime [14]; PaRSEC, a DAG task-scheduling runtime [21]; and DAGuE, a generic distributed DAG framework [12]. However, this kind of problem is out of the scope of this work.
VII. CONCLUSION
We presented a data-oriented approach for scheduling dependent tasks on high-density multi-GPU systems. In this approach, we formulated the data dependency and proposed the criteria for creating tasks upon the availability of real-time data modules. Accordingly, we developed a hierarchical scheduler to implement the approach. Using the scheduler, we tested the Strassen multiplication and Cholesky inversion algorithms on an 8-GPU server. The results show that our scheduler is effective and efficient at parallelizing the tasks of data-intensive programs. The efficiency of scheduling concurrent tasks scales up to 90% using eight GPUs, and through our scheduler the 8-GPU system is more than one order of magnitude faster than the CPU-only system. Task-driven scheduling strategies, static and dynamic, have been widely studied and employed, while data-driven scheduling could be an alternative approach that appears more efficient for data-intensive multitasking applications on hybrid systems.
REFERENCES
[1] Y. Deng, P. Zhang, C. Marques, R. Powell, and L. Zhang, "Analysis of Linpack and power efficiencies of the world's TOP500 supercomputers," Parallel Computing, vol. 39, pp. 271-279, 2013.
[2] S. Tomov, R. Nath, H. Ltaief, and J. Dongarra, "Dense linear algebra solvers for multicore with GPU accelerators," in Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on, 2010, pp. 1-8.
[3] P. Zhang, R. Powell, and Y. Deng, "Interlacing Bypass Rings to Torus Networks for More Efficient Networks," Parallel and Distributed Systems, IEEE Transactions on, vol. 22, pp. 287-295, 2011.
[4] J. Kurzak, D. A. Bader, and J. Dongarra, Scientific Computing with Multicore and Accelerators: CRC Press, Inc., 2010.
[5] H. Liu, S. Yu, Z. Chen, B. Hsieh, and L. Shao, "Sparse matrix-vector multiplication on nvidia gpu," Int. J. Numer. Anal. Model, vol. 3, pp. 185-191, 2012.
[6] R. Nath, S. Tomov, and J. Dongarra, "BLAS for GPUs," Scientific Computing with Multicore and Accelerators, Kurzak J, Bader DA, Dongarra J (eds). CRC Press: Boca Raton, FL, 2010.
[7] S. Yu, H. Liu, Z. J. Chen, B. Hsieh, and L. Shao, "GPU-based parallel reservoir simulation for large-scale simulation problems," in SPE Europec/EAGE Annual Conference, 2012.
[8] P. Zhang and Y. Gao, "Matrix Multiplication on High-Density Multi-GPU Architectures: Theoretical and Experimental Investigations," in ISC High Performance 2015, Frankfurt, Germany, 2015.
[9] C. Augonnet, J. Clet-Ortega, S. Thibault, and R. Namyst, "Data-Aware Task Scheduling on Multi-accelerator Based Platforms," in Parallel and Distributed Systems (ICPADS), 2010 IEEE 16th International Conference on, 2010, pp. 291-298.
[10] P. Zhang, Y. Gao, J. Fierson, and Y. Deng, "Eigenanalysis-Based Task Mapping on Parallel Computers with Cellular Networks," Mathematics of Computation, vol. 83, pp. 1727-1756, 2014.
[11] C. Augonnet, S. Thibault, R. Namyst, and P.-A. Wacrenier, "StarPU: A unified platform for task scheduling on heterogeneous multicore architectures," in Euro-Par 2009 Parallel Processing, ed: Springer, 2009, pp. 863-874.
[12] G. Bosilca, A. Bouteiller, A. Danalis, T. Herault, P. Lemarinier, and J. Dongarra, "DAGuE: A generic distributed DAG engine for high performance computing," Parallel Computing, vol. 38, pp. 37-51, 2012.
[13] A. Duran, E. Ayguadé, R. M. Badia, J. Labarta, L. Martinell, X. Martorell, and J. Planas, "Ompss: a proposal for programming heterogeneous multi-core architectures," Parallel Processing Letters, vol. 21, pp. 173-193, 2011.
[14] T. Lewis and H. El-Rewini, "Parallax: A Tool for Parallel Program Scheduling," IEEE Parallel Distrib. Technol., vol. 1, pp. 62-72, 1993.
[15] P. Zhang, L. Liu, and Y. Deng, "A data-driven paradigm for mapping problems," Parallel Computing, vol. 48, pp. 108-124, 2015.
[16] H. Bouwmeester and J. Langou, "A critical path approach to analyzing parallelism of algorithmic variants. Application to Cholesky inversion," arXiv preprint arXiv:1010.2000, 2010.
[17] E. Agullo, J. Demmel, J. Dongarra, B. Hadri, J. Kurzak, J. Langou, H. Ltaief, P. Luszczek, and S. Tomov, "Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects," in Journal of Physics: Conference Series, 2009, p. 012037.
[18] J. Enmyren and C. W. Kessler, "SkePU: a multi-backend skeleton programming library for multi-GPU systems," in Proceedings of the fourth international workshop on High-level parallel programming and applications, 2010, pp. 5-14.
[19] P. Hénon, P. Ramet, and J. Roman, "PASTIX: a high-performance parallel direct solver for sparse symmetric positive definite systems," Parallel Computing, vol. 28, pp. 301-321, 2002.
[20] S. H. Bokhari, "On the Mapping Problem," IEEE Transactions on Computers, vol. 30, pp. 207-214, 1981.
[21] W. Wu, A. Bouteiller, G. Bosilca, M. Faverge, and J. Dongarra, "Hierarchical DAG Scheduling for Hybrid Distributed Systems," in 29th IEEE International Parallel & Distributed Processing Symposium.