Towards Dynamic Reconfigurable Load-Balancing for Hybrid Desktop Platforms

Alécio P. D. Binotto*†, Carlos E. Pereira*, and Dieter W. Fellner†

*Informatics Institute, UFRGS - Federal University of Rio Grande do Sul, Porto Alegre, Brazil
Email: [email protected], [email protected]

†Fraunhofer IGD, Technische Universität Darmstadt, Darmstadt, Germany
Email: [email protected], [email protected]
Abstract—High-performance platforms are required by applications that perform massive calculations. Today, desktop accelerators (such as GPUs) form, together with multi-core CPUs, a powerful heterogeneous platform. To improve application performance on these hybrid platforms, load-balancing plays an important role in distributing the workload. However, this scheduling problem faces challenges, since the cost of a task on a Processing Unit (PU) is non-deterministic and depends on parameters that cannot be known a priori, such as input data, online creation of tasks, and scenario changes. Self-adaptive computing is therefore a promising paradigm, as it provides the flexibility to explore computational resources and improve performance across different execution scenarios. This paper presents ongoing PhD research focused on a dynamic and reconfigurable scheduling strategy based on timing profiling for desktop accelerators. Preliminary results analyze the performance of solvers for SLEs (Systems of Linear Equations) on a hybrid CPU and multi-GPU platform applied to a CFD (Computational Fluid Dynamics) application. The choice of the best solver, as well as its scheduling, must be made dynamically, taking online parameters into account, in order to achieve better application performance.

Keywords-graphics processors; parallel processing of solvers for systems of linear equations; heterogeneity; load-balancing.
Figure 1. Slices of the 3D real-time CFD simulation over a plane: velocity field and pressure visualization
I. INTRODUCTION

In addition to timing constraints, scientific applications usually require high-performance platforms to deal with massive calculations. The development of desktop-based accelerators (e.g., the Graphics Processing Unit) offers alternatives for application implementation, aiming at better performance. The resulting platform heterogeneity can be seen as an asymmetric multi-core cluster, and efficiently exploiting all available resources is a challenge. To benefit from this computing power, applications should use methods that distribute their tasks in a balanced way. This motivates new strategies for distributing the workload on the hybrid platform so as to better meet application requirements. Dynamic and reconfigurable load-balancing is a promising paradigm for such scenarios. Combined with timing profiling, scheduling can be performed in an application-, platform-, and scenario-aware manner to improve application performance.

978-1-4244-6534-7/10/$26.00 ©2010 IEEE

Supported by a benchmark study and by the parametrization of the variables that dynamically influence system performance, applied to SLEs on a hybrid CPU-GPU platform, the main contribution of this PhD research is an analytical two-phase scheduling strategy supported by a timing profiler and a scenario database. We compare the performance of three solvers for SLEs - Jacobi, Red-Black Gauss-Seidel (GS), and Conjugate Gradient (CG) - applied to the diffusion and projection phases [1] of a 3D real-time CFD application, illustrated in Fig. 1.

A. Research Methodology

The methodology comprises the framework design phase as well as the case study specification and the hardware characteristics. The experimental analysis is carried out by implementing the case study with the designed strategies and by performance benchmarks. Based on that, characteristics observed in the algorithms serve as a basis for generalizing the methods to other applications. In addition, the performance gain will be evaluated in order to quantify the overhead introduced by the proposed strategies.

II. SYSTEM OVERVIEW

The proposed approach abstracts the PUs using the OpenCL API as a platform-independent programming model. Starting from an initial balancing configuration set when the application starts, an online profiler monitors and stores tasks' execution times and platform conditions in a "scenario" database. During application execution, a reconfigurable dynamic scheduling is performed, taking changes in runtime conditions into account. Fig. 2 depicts the approach.

Figure 2. Overview of the proposed system

A. Platform-Independent Programming Model

The model is based on OpenCL, which encapsulates implementations of a task on different PUs, leveraging intrinsic hardware features and making the task platform independent. For GPUs, OpenCL code is currently translated to CUDA, a process that still carries an overhead. For this reason, we chose to publish preliminary results based on a CUDA implementation.

B. Profiler

The Profiler executes at runtime, focusing on task performance and platform conditions. Time profiling is a simple analysis that considers several parameters at execution time, such as domain size, data transfer between PUs, and processors' idle time. To accomplish it, the non-functional parameters that influence performance (we refer to our previous work [2]) are characterized and stored in a database that holds performance data for different execution scenarios. Based on that, the load-balancer performs further scheduling.

C. Dynamic Load-Balancer

This module is composed of two phases (as described in [2]): first, it establishes an initial scheduling guess over the PUs based on cost estimates; second, it analyzes possible changes in runtime conditions and proposes a new task scheduling if that can lead to a performance gain.

First Assignment: The first guess faces a multidimensional scheduling problem of NP-hard complexity, which becomes harder when dealing with more than two PUs and several tasks. As an optimization, we base this assignment on heuristics that take into account the performance benchmark described in Section III. The first scheduling is performed in a static-code fashion, but dynamically, just after the application starts, considering the domain size and the premise that the PUs are idle, defining a rule based on the break-even point values presented in Fig. 3. However, this strategy can lead to a PU overflow, i.e., tasks all being scheduled to the same PU, showing the need for context-aware adaptation.

Dynamic Reconfiguration: After the first assignment, the information provided by online profiling is considered. Based on estimated costs, dynamic parameters, and awareness of runtime conditions, a task is reconfigured to another PU only if its estimated execution time on the new PU is lower than on the current PU.

III. CASE STUDY: BENCHMARK OF SOLVERS FOR SLES
There are several approaches for computing or approximating the solution of SLEs. In particular, we analyze the following iterative methods, giving a brief overview ([3] for details).

Jacobi: The method iteratively improves an approximation $x^{(m)}$ by rearranging and isolating each equation of the SLE:

$$x_i^{(m+1)} = \frac{1}{A_{ii}} \Big( b_i - \sum_{j=1,\, j \neq i}^{n} A_{ij}\, x_j^{(m)} \Big), \quad i = 1, \ldots, n. \qquad (1)$$

As the system matrix $A$ has a regular pattern, the sum consists of only six values, but convergence is slow.

Red-Black GS: In contrast to Jacobi, GS uses all previously computed values for a new approximation:

$$x_i^{(m+1)} = \frac{1}{A_{ii}} \Big( b_i - \sum_{j=1}^{i-1} A_{ij}\, x_j^{(m+1)} - \sum_{j=i+1}^{n} A_{ij}\, x_j^{(m)} \Big), \qquad (2)$$

with $i = 1, \ldots, n$. The sum is split into components containing old and new approximations, which improves convergence. However, it induces a data dependency that makes the plain method unsuitable for parallelization. A slight modification, changing the order in which the equations are processed, removes the data dependency within an iteration: the unknowns are divided into a red and a black set such that all neighbors of a red cell are black and vice versa. As a consequence, a complete iteration is split into a red (2i) and a black (2i+1) iteration.

Conjugate Gradient: It combines the ideas of steepest descent and conjugate directions. Steepest descent iteratively minimizes the functional $E = \frac{1}{2} x^T A x - x^T b$ by using a search direction that reduces the error optimally; this amounts to solving $Ax = b$ when $A$ is symmetric and positive definite. The conjugate directions method ensures that each direction is perpendicular to all previous ones in order to optimally exploit the search space. The combination of these two approaches minimizes the distance to the solution in each iteration. The algorithm consists of dot products, vector additions, and a matrix-vector multiplication, $y_i = \sum_{j=1}^{n} A_{ij} x_j$ (the most time-consuming part [4]).

A. Implementation on GPUs using CUDA

For the 3D CFD application, the ordering $i = 1, \ldots, n$ is replaced by a component representation $(i, j, k)$ with $i = 1, \ldots, n_x$, $j = 1, \ldots, n_y$, $k = 1, \ldots, n_z$, and $n = n_x \cdot n_y \cdot n_z$.
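As a minimal CPU reference sketch (our own illustration, not the paper's CUDA kernels; the names Grid, lin, matvec, and jacobi_step are hypothetical), the linear indexing, the seven-band matrix-vector product, and one Jacobi sweep following Eq. (1) can be written as:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Grid { std::size_t nx, ny, nz; };

// Linear index of cell (i, j, k): ix = k*nx*ny + j*nx + i.
inline std::size_t lin(const Grid& g, std::size_t i, std::size_t j, std::size_t k) {
    return k * g.nx * g.ny + j * g.nx + i;
}

// A is stored as seven vectors of length nx*ny*nz: A[0] holds the diagonal,
// A[1..6] the six neighbor bands. Off-grid neighbors contribute zero.
std::vector<double> matvec(const Grid& g,
                           const std::vector<std::vector<double>>& A,
                           const std::vector<double>& x) {
    std::vector<double> y(x.size(), 0.0);
    for (std::size_t k = 0; k < g.nz; ++k)
        for (std::size_t j = 0; j < g.ny; ++j)
            for (std::size_t i = 0; i < g.nx; ++i) {
                const std::size_t ix = lin(g, i, j, k);
                double v = A[0][ix] * x[ix];                        // diagonal A_ijk
                if (i > 0)        v += A[1][ix] * x[lin(g, i - 1, j, k)];
                if (i + 1 < g.nx) v += A[2][ix] * x[lin(g, i + 1, j, k)];
                if (j > 0)        v += A[3][ix] * x[lin(g, i, j - 1, k)];
                if (j + 1 < g.ny) v += A[4][ix] * x[lin(g, i, j + 1, k)];
                if (k > 0)        v += A[5][ix] * x[lin(g, i, j, k - 1)];
                if (k + 1 < g.nz) v += A[6][ix] * x[lin(g, i, j, k + 1)];
                y[ix] = v;
            }
    return y;
}

// One Jacobi sweep, Eq. (1): x_i <- (b_i - sum_{j != i} A_ij x_j) / A_ii.
// The off-diagonal sum is recovered from the matvec: (A x)_i - A_ii x_i.
std::vector<double> jacobi_step(const Grid& g,
                                const std::vector<std::vector<double>>& A,
                                const std::vector<double>& b,
                                const std::vector<double>& x) {
    const std::vector<double> ax = matvec(g, A, x);
    std::vector<double> xn(x.size());
    for (std::size_t t = 0; t < x.size(); ++t)
        xn[t] = (b[t] - (ax[t] - A[0][t] * x[t])) / A[0][t];
    return xn;
}
```

A GPU version would assign one thread per equation ix; the sketch above only fixes the data layout and the arithmetic, not the memory access strategy.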
In that representation, the neighbors of one cell $(i, j, k)$ are $(i \pm 1, j, k)$, $(i, j \pm 1, k)$, and $(i, j, k \pm 1)$. In order to fully exploit the GPU, some requirements have to be met: global memory access has to be coalesced, data should be available independently of the thread, and multiple accesses to global memory should be buffered in shared memory. In CUDA, threads are numbered in a specific disjoint pattern, so we can construct consecutive indices ix. Here, one ix represents the ix-th equation of the SLE and simultaneously implies a position $(i, j, k)$ in the simulation domain: $ix = k \cdot n_x \cdot n_y + j \cdot n_x + i$. The vector of unknowns $x$ and the right-hand side $b$ are stored in that linear pattern. The system matrix $A$ is represented by seven vectors of length $n_x \cdot n_y \cdot n_z$ due to the implicit topology of the simple Cartesian grid.

The essential parts of Jacobi and GS are algorithmically equivalent to a matrix-vector product; therefore, we focus on the implementation of that product on the GPU. For computing one row $(i, j, k)$, access to the following data is needed: the memory of $y_{ijk}$ for writing the result, the corresponding right-hand side entry, the unknowns $x_{ijk}$, $x_{i\pm1jk}$, $x_{ij\pm1k}$, $x_{ijk\pm1}$, and the matrix entries $A_{ijk}$, $A_{i\pm1jk}$, $A_{ij\pm1k}$, $A_{ijk\pm1}$. These access patterns show that only data from adjacent cells is needed for the computation. Thus, the multiplication can be executed for one equation with a coalesced memory access pattern for all data except the values $x_{i\pm1jk}$, $x_{ij\pm1k}$, $x_{ijk\pm1}$. We use shared memory to buffer the accesses to $x_{i\pm1jk}$; for the remaining data, the access is not coalesced.

B. Benchmark analysis

Three heterogeneous PUs were used in the experiment: a 2.4GHz quad-core CPU with 8MB of L2 cache and 4GB of main memory with 6.4GB/s of bandwidth; a GPU 8800GT (112 cores at a 600MHz core clock, 512MB of memory with 57.6GB/s of bandwidth); and a GPU GTX285 (240 cores at a 1.5GHz core clock, 1GB of memory with 159.6GB/s of bandwidth). The PUs communicated via PCIe x16, which bounds the bandwidth of the CPU-GPU link to 4GB/s.

The experiment showed that the CG and Jacobi solvers achieved the best performance. For 8 million unknowns, both executed in 406 milliseconds (ms) on the GTX285. On the 8800GT, the CG computation time was 1198ms and Jacobi's was 2637ms. On the CPU, a Jacobi-preconditioned CG reached 39219ms. For the CG solver (Fig. 3), the CPU performed better up to 3K unknowns (compared to the GTX285) and up to 7K unknowns (compared to the 8800GT): in such cases, too few threads were launched to enable latency hiding. Beyond the break-even points, the GPUs' processing power was fully utilized.

Figure 3. Break-even point on the CPU and the GPUs

Fig. 4 depicts performance on the GTX285, where CG becomes faster than Jacobi and GS after reaching approximately 500K unknowns.

Figure 4. Performance of the solvers on the GTX285 PU

This gain indicates that many operations are 'naturally' coalesced, since a sequential strategy is used, i.e., vector-vector operations in which one block of threads can always load a sequential segment of data. The same holds for the reduction kernel used to sum the values of a vector. Jacobi (and the CG matrix-multiply kernel) also profits from this loading strategy.

Using multiple GPUs, Fig. 5 (top) shows that at least 2M unknowns are needed for CG to be faster than on a single GPU; with fewer elements, communication overhead dominates. The multi-GPU approach demonstrates that the speedup depends on the problem size (Fig. 5, bottom). In that case, each PU computed half of the domain elements plus the border elements. The achieved speedup was 1.7 for 8M unknowns. Our implementation obtained performance similar to the work of [1], although a direct comparison is difficult due to differences in system configuration.

IV. RELATED WORK

The authors of [4] compared linear algebra operations on CPUs and GPUs, performing benchmarks on vector-vector and matrix-vector operations. This was complemented by [5], showing that a hybrid CPU-GPU architecture is appropriate for scientific computing. Based on that, [6] presented a CPU-GPU performance comparison with a
static domain size partition, but applied to finite element solvers in mechanics, highlighting the need for dynamic load-balancing to distribute jobs.

Figure 5. CG performance comparison using two GTX285 GPUs

Load-balancing involving GPUs was investigated by [7], who compared dynamic scheduling methods based on lock and lock-free strategies for CUDA tasks over the GPU multiprocessors. Recently, [8] described a dynamic task scheduling approach for dense linear algebra algorithms on distributed-memory multicore systems, without including GPUs.

Significance of the Research: This PhD research follows the state of the art, as several scientific applications can now be executed on multi-core desktop platforms. To our knowledge, there is a need for research oriented to supporting load-balancing over a CPU-GPU (or CPU-accelerator) platform. The work shows its relevance by analyzing not just application characteristics but also context awareness, i.e., the platform execution scenarios.

V. CONCLUSION

Based on the performance evaluation and the analysis of the variables' parametrization, the need for such dynamic scheduling strategies was verified, improving on the current static programming and scheduling mode used by OpenCL or CUDA [6]. In this work, we propose a method for dynamic load-balancing over CPUs and GPUs, applied to solvers for SLEs in a CFD case study. Preliminary results indicate that there are scenarios in which the CPU provides better performance, partially depending on the domain size, confirming the need for this framework. The dynamic scheduling concentrated on the first guess, based on the measured performance break-even points. Our implementation also achieved a performance gain similar to related works.

Remaining Challenges and Objectives: The main future goals of this research are: to parameterize in the database all considered variables, from the application and from the execution platform; to incorporate the load-balance reconfiguration phase into the presented case study; to find common behavior (characteristics) in different algorithms, based on other scientific applications, in order to generalize the solution (an interesting approach is to dynamically analyze the matrix A in order to classify it and apply the right solver); and to analyze the performance of the applications without the proposed method in order to evaluate the strategy's overhead (predicting the number of times a task will be invoked can avoid unnecessary reconfigurations within an execution time-window, and predicting future allocation based on recent use - the timing database - can decrease the system overhead).

ACKNOWLEDGMENT

We would like to thank Daniel Weber and Christian Daniel for their support. A. Binotto thanks the support given by DAAD and Alβan, scholarship no. E07D402961BR.

REFERENCES
[1] J. Thibault and I. Senocak, "CUDA implementation of a Navier-Stokes solver on multi-GPU desktop platforms for incompressible flows," in 47th AIAA Aerospace Sciences Meeting. American Institute of Aeronautics and Astronautics, 2009, pp. 1-15.
[2] A. Binotto, E. Freitas, M. Wehrmeister, C. Pereira, A. Stork, and T. Larsson, "Towards task dynamic reconfiguration over asymmetric computing platforms for UAVs surveillance systems," Scalable Computing: Practice and Experience, vol. 10, no. 3, pp. 277-289, 2009.
[3] R. Barrett, M. Berry, T. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. van der Vorst, Templates for the Solution of Linear Systems. SIAM, 1994.
[4] L. Buatois, G. Caumon, and B. Lévy, "Concurrent number cruncher: An efficient sparse linear solver on the GPU," in High Performance Computation Conference, 2007, pp. 358-371.
[5] V. Volkov and J. W. Demmel, "Benchmarking GPUs to tune dense linear algebra," in SC '08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing. Piscataway, NJ, USA: IEEE Press, 2008, pp. 1-11.
[6] D. Göddeke, H. Wobker, R. Strzodka, J. Mohd-Yusof, P. McCormick, and S. Turek, "Co-processor acceleration of an unmodified parallel solid mechanics code with FEASTGPU," International Journal of Computational Science and Engineering, vol. 4, no. 4, pp. 254-269, 2009.
[7] D. Cederman and P. Tsigas, "On dynamic load balancing on graphics processors," in Proceedings of the 23rd ACM Symposium on Graphics Hardware, 2008, pp. 57-64.
[8] F. Song, A. YarKhan, and J. Dongarra, "Dynamic task scheduling for linear algebra algorithms on distributed-memory multicore systems," in SC '09: International Conference for High Performance Computing, Networking, Storage and Analysis, 2009, pp. 1-10.