Stratified Sampling for Even Workload Partitioning. Jeeva Paudel, José Nelson Amaral. University of Alberta, Edmonton, Canada. {jeeva, jamaral}@ualberta.ca.
ABSTRACT
This work presents a novel algorithm, Workload Partitioning and Scheduling (WPS), for evenly partitioning the computational workload of large, implicitly-defined, work-list-based applications on distributed/shared-memory systems. WPS uses stratified sampling to estimate the number of work items that will be processed in each step of an application, and uses this estimate to evenly partition and distribute the computational workload. An empirical evaluation on large applications (Iterative-Deepening A* (IDA*) applied to (4×4) Sliding-Tile Puzzles, Delaunay Mesh Generation, and Delaunay Mesh Refinement) shows that WPS is applicable to a range of problems and yields 28% to 49% speedups over existing work-stealing schedulers alone.

Categories and Subject Descriptors
D.1 [Programming Techniques]: Distributed Programming; H.3.4 [Systems and Software]: Distributed Systems

Keywords
Stratified sampling, Load balancing, X10, PGAS, APGAS

1. INTRODUCTION
Many algorithms in Artificial Intelligence and Combinatorial Optimization explore a state space to find a path to a goal state, or to find a specific state in the state space that minimizes an objective function. A state in a state space is herein also referred to as a work item, or simply an item. One way to achieve high performance in such algorithms is to process disjoint portions of the state space on different processing nodes of a cluster. However, it is usually difficult to foresee how many child work items will be generated when an item is processed, because state-space algorithms usually operate on an implicitly-defined state space that consists of either a single item or a list of items. The algorithms use a transition function to explore an item and recursively generate one or more child items, thereby forming a tree-structured collection of items to explore. Such algorithms also use several enhancements to reduce the effort of exploration by avoiding the processing of items deemed unfruitful (e.g., using heuristic functions and dead-end detection to guide the search). The inherent difficulty of predicting the number of items that will be explored, given an initial implicitly-defined state space, leads to load imbalance in a parallel-computing setting.

A popular technique to alleviate load imbalance is work stealing, in which a processor that runs out of work steals work from another processor. Work stealing can improve an application's execution time by making work available to idle nodes. However, it can also impose significant overheads: synchronization on shared data structures, and communication across the network. This work proposes a novel Workload Partitioning and Scheduling (WPS) algorithm that estimates the number of items generated when an item is processed, so as to create a well-balanced distribution of the items. WPS reduces the need for expensive work-stealing operations.

2. PROBLEM FORMULATION
Let S(n*) = (N, E) be a tree rooted at item n*, which we call the Work-Item Tree (WIT). The WIT represents the relation among the items reachable from n*, where N is a set of items and, for each n ∈ N, child(n) is the set of items generated when n is processed: child(n) = { ni | (n, ni) ∈ E }. In contrast with the Artificial Intelligence literature, we write node to refer to a processing node in a computer cluster, not to a vertex in the WIT; a vertex in the WIT is referred to as an item. Also, we write S(n*) as S whenever n* is clear from context. Given M processing nodes and an implicitly-defined WIT, the Work-Load Distribution Problem consists in partitioning the items in the WIT into M parts W1, W2, ..., WM of similar size. This problem is formulated as an optimization problem whose goal is to minimize Σ_{i,j ∈ {1,...,M}} ||Wi| − |Wj||, where |Wi| is the size of Wi. This work assumes that all items in the WIT take approximately the same amount of time to process; however, the formulation of the Work-Load Distribution Problem and the WPS algorithm could easily be adapted to deal with items with different processing times.
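As a concrete illustration of this objective, the following minimal sketch (all function names are ours, not part of WPS) computes the pairwise-imbalance objective and shows one simple way to keep it small: a greedy, largest-first assignment of items to the currently lightest part.

```python
from itertools import combinations

def imbalance(parts):
    """Objective to minimize: sum over pairs (i, j) of | |Wi| - |Wj| |."""
    sizes = [len(p) for p in parts]
    return sum(abs(a - b) for a, b in combinations(sizes, 2))

def greedy_partition(items, weights, M):
    """Assign items (heaviest estimated subtree first) to the lightest of M parts."""
    parts = [[] for _ in range(M)]
    loads = [0] * M
    for item in sorted(items, key=lambda x: -weights[x]):
        k = loads.index(min(loads))  # index of the lightest part so far
        parts[k].append(item)
        loads[k] += weights[item]
    return parts
```

With equal-cost items, as assumed above, the weights are all 1 and the greedy assignment degenerates to round-robin; the weighted form matches the adaptation to items with different processing times.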
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author. Copyright is held by the owner/author(s). PACT’14, August 24–27, 2014, Edmonton, AB, Canada. ACM 978-1-4503-2809-8/14/08. http://dx.doi.org/10.1145/2628071.2671422.
3. OVERVIEW OF THE APPROACH
Existing popular techniques for workload distribution typically process some items in the WIT to generate enough frontier items, and then distribute those items across the available processors. The proposed Workload Partitioning and Scheduling (WPS) algorithm provides a systematic way to do this without requiring the programmer to code the solution. WPS operates in the following four phases:

Sampling. WITs of practical interest are too large to examine exhaustively. Thus, in phase 1, WPS selectively processes only unique items in the WIT to create a sampled tree (ST). This technique, called stratified sampling (SS), was proposed by Chen [2]. WPS offers programmers a customizable labelling system to define the properties that make two items similar. For instance, in an Iterative-Deepening A* (IDA*) search tree, two items may be considered similar if their h-values, i.e., their estimated costs to the goal node, are equal. A label is represented as an integer value. The ST stores a set A[i] of representative-weight pairs ⟨n, w⟩, with one such pair for every unique label encountered at level i of the WIT during sampling, where w is an estimate of the number of items with that label at level i of the WIT rooted at n*.

Estimating. In phase 2, WPS estimates the size of the subtree rooted at each item n ∈ ST. Let Y_u^i be the estimated size of the subtree rooted at the item with label u at level i of the WIT. The values of Y_u^{i+1} are used to compute the values of Y_u^i; the size of a leaf item is its weight w. The traversal of the ST, represented by the structure A, starts at the deepest level and moves towards the root. Following this process, this phase of WPS produces a collection χ of Y_u^i values for every u and i encountered in the ST.

Partitioning. In the Sampling phase, WPS processes a small subset of the items in the WIT through SS. In phase 3, WPS partitions the remaining items in the WIT (the items not processed by SS) into M groups, where M is the number of available processing nodes.

Distributing. In phase 4, WPS processes the items in subset W1 locally, and asynchronously distributes the remaining subsets Wj, j = 2, ..., M, to the other M − 1 processing nodes. The applications used in this study are agnostic to the order in which work items are processed: all orders generate valid final results. In applications where work items must be expanded in an orderly fashion, processing parent items first may be necessary to ensure that their children do not wait for a prolonged time. Work items in such applications can be processed in a monotonically increasing order of A[i], because parent items are stored at higher levels of the ST than their children. Optimizing the scheduling and distribution strategy for applications that exhibit irregular dependencies among the work items is left as future work.

4. EVALUATION
WPS is general and can be applied to different problems. As a demonstration, we incorporate WPS into the runtime system of a popular high-performance programming language, X10 [6], to parallelize: (i) the IDA* [3] algorithm for finding a least-cost path in the 15-puzzle problem; (ii) an algorithm for generating Delaunay Meshes [1] (10M points); and (iii) an algorithm for refining Delaunay Meshes [4] (68M triangles). This evaluation uses a blade server with 16 nodes, each featuring two 2 GHz Quad-Core AMD Opteron processors, with 8 GB of RAM and 20 GB of swap space, running CentOS GNU/Linux version 6.2.

Figure 1: Results. X10WS is X10's default intra-node scheduler, Dist-WS [5] is our extension to X10WS for inter-node work-stealing, WPS* is WPS operating alone, and WPS is WPS* and X10WS operating together. (Panels report sequential execution time in seconds, speedup over sequential, and the percentage of total execution time spent on computation, communication, stealing, and partitioning with 128 workers, for the 15-Puzzle, DMG, and DMR.)
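The Sampling phase of Section 3 can be sketched as follows. This is a simplified, illustrative version of stratified sampling under our own naming conventions (not the paper's implementation): at each level, items that share a label are merged into a single representative whose weight accumulates, and the weights across all levels yield a size estimate for the WIT.

```python
import random

def stratified_sample(root, children, label, seed=0):
    """Chen-style stratified sampling (simplified sketch).

    A[i] maps each label seen at level i to one (representative, weight)
    pair; w estimates how many items share that label at that level.
    """
    rng = random.Random(seed)
    A = [{label(root): (root, 1)}]
    while True:
        nxt = {}
        for rep, w in A[-1].values():
            for child in children(rep):
                t = label(child)
                if t in nxt:
                    r, wt = nxt[t]
                    # keep one representative per label: replace it with
                    # probability w/(wt+w), and always accumulate the weight
                    if rng.random() < w / (wt + w):
                        r = child
                    nxt[t] = (r, wt + w)
                else:
                    nxt[t] = (child, w)
        if not nxt:
            break
        A.append(nxt)
    return A

def estimate_size(A):
    """Estimated number of items in the WIT: sum of all sampled weights."""
    return sum(w for level in A for (_, w) in level.values())
```

Only one representative per (level, label) stratum is ever expanded, so the cost of sampling grows with the number of distinct labels rather than with the size of the WIT; the per-level weights in A are what the Partitioning phase draws on when balancing the remaining items.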
5. RESULTS
An empirical evaluation using these applications indicates that operating WPS in tandem with an intra-node work-stealing scheduler yields 28% to 49% speedups over traditional work-stealing schemes alone (see Figure 1). The major performance benefits of WPS are: (i) WPS reduces the execution time spent on work stealing by 10% to 16% relative to X10WS and by 7% to 12% relative to Dist-WS; (ii) an even distribution of work items with WPS necessitates fewer steal operations, fewer synchronized accesses to remote workers, and fewer remote data accesses to process stolen tasks, so WPS transmits significantly fewer messages across the network than X10WS and Dist-WS; and (iii) WPS enables applications to spend a greater share of the execution time on useful computation. The applications used in this study showed limited scope for inter-node load balancing with WPS. Nonetheless, we recommend coordinating WPS with the existing load-balancing schemes in the runtimes of programming systems because: (i) there is no load-balancing overhead if there is no load imbalance in the system; and (ii) the infrequent load-balancing operations that may still be necessary are handled by the existing load-balancing techniques.
6. REFERENCES
[1] A. Bowyer. Computing Dirichlet Tessellations. The Computer Journal, 24(2):162–166, 1981.
[2] P.-C. Chen. Heuristic Sampling: A Method for Predicting the Performance of Tree Searching Programs. SIAM Journal on Computing, 21:295–315, 1992.
[3] R. E. Korf. Depth-First Iterative-Deepening: An Optimal Admissible Tree Search. Artificial Intelligence, 27(1):97–109, 1985.
[4] M. Kulkarni, K. Pingali, B. Walter, G. Ramanarayanan, K. Bala, and L. P. Chew. Optimistic Parallelism Requires Abstractions. In Programming Language Design and Implementation, pages 211–222, CA, USA, 2007.
[5] J. Paudel, O. Tardieu, and J. N. Amaral. On the Merits of Distributed Work-Stealing on Selective Locality-Aware Tasks. In International Conference on Parallel Processing, pages 100–109, Oct 2013.
[6] V. Saraswat, B. Bloom, I. Peshansky, O. Tardieu, and D. Grove. X10 Language Specification. http://x10.codehaus.org/x10/documentation.