Parallel Join Patterns with Guards and Propagation

Martin Sulzmann                          Edmund S. L. Lam
Intaris Software GmbH, Germany           NUS, Singapore
[email protected]                        [email protected]
Abstract

Join patterns are a powerful concurrency abstraction for coordinating multiple events. We extend join patterns with guards and propagation and argue that both features are essential in many programming situations. We develop a parallel execution scheme which we have fully implemented as a library in Haskell. Our results provide new insights on how to write parallel programs for multi-core architectures.

1. Introduction

In the age of multi-core programming, we look for high-level, powerful concurrency abstractions which can be executed efficiently in parallel on multiple cores. An attractive concurrency abstraction is offered by the join calculus (Fournet and Gonthier 1996), which provides a simple and intuitive way to coordinate concurrent events via reaction patterns, also known as join patterns. For example, the join pattern put(x) & get(y) reacts, i.e. fires, if we find matching events for put(x) and get(y).

We argue that for many practical programming tasks, guards such as k1 == k2 in

    item(k1,x) & set(k2,y) when k1 == k2

are essential. The guard must be satisfied in order to fire the join pattern. That is, we can only update data y with x if the key k2 we are looking for matches the data key k1.

The challenge for an implementation is that we now need to search for matching events which satisfy a guard condition. On a multi-core architecture, we wish to utilize as many cores as possible to increase performance (parallel search). Prior join pattern execution schemes (Itzstein and Jasiunas 2003; Benton et al. 2004; Mandel and Maranget 2008; Russo 2008) do not consider guards. As we argue in Section 3, the extension to guards is non-trivial.

In this paper, we develop a highly efficient parallel execution scheme for join patterns extended with guards and propagation. In item(k,x) \ get(k,y), the element item(k,x) is propagated, which means that multiple getters can execute in parallel on a shared, propagated item. Specifically, we make the following contributions:

• We provide numerous examples to motivate an extension of join patterns with guards and propagation (Section 4).

• We devise a parallel execution scheme for our join pattern extension which we have fully implemented as a library in Haskell. The key features are: a goal-based execution scheme to systematically and exhaustively execute join patterns (Section 6), and a composable, non-blocking, highly-concurrent data structure to efficiently execute join patterns in parallel (Section 7).

• Our experiments show that our implementation scales well to multi-core architectures (Section 8).

We discuss related work in Section 9 and conclude in Section 10.
For ease of presentation, all examples in this paper are slightly sugared compared to our current implementation, which is more verbose. Our implementation is available via hackageDB; see the join library under the Concurrency packages.

2. Coordination via Join Patterns

    event put(Async Int)
    event get(Sync Int)
    put(x) & get(y) = y := x

    t1 = do put(3)
            put(4)
            y1 ...
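To make the matching semantics of guarded join patterns concrete, here is a toy, purely sequential model (our own sketch, not the library's API; Event, Store, and findMatch are hypothetical names). It searches a store of events for a pair matching the earlier pattern item(k1,x) & set(k2,y) when k1 == k2 and removes the matched events:

    import Data.List (delete)

    data Event = Item Int Int          -- item(k,x)
               | Set  Int Int          -- set(k,y)
               deriving (Eq, Show)

    type Store = [Event]

    -- Find two stored events satisfying the pattern heads and the
    -- guard k1 == k2; on success, return them together with the
    -- store from which both have been removed.
    findMatch :: Store -> Maybe ((Event, Event), Store)
    findMatch store =
      case [ (i, s) | i@(Item k1 _) <- store
                    , s@(Set  k2 _) <- store
                    , k1 == k2 ] of                -- the guard
        []        -> Nothing                       -- no match: nothing fires
        ((i,s):_) -> Just ((i, s), delete s (delete i store))

For example, findMatch [Item 1 10, Set 2 20, Set 1 30] fires on item(1,10) and set(1,30). The point of Sections 6 and 7 is to perform exactly this kind of search concurrently and atomically over a shared store.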
[...]

Function search is written in continuation-passing style. Continuations are IO computations, and our formulation relies on Haskell's lazy evaluation scheme.

    search :: Store -> Cnt -> [Match] -> [Task] -> Cnt
    search store cnt match []        = atomicVerifyRemove store match
    search store cnt match (t:tasks) =
      case t of
        Guard g -> if g then search store cnt match tasks
                        else cnt
        Head h  -> do idx <- ...
                      ...
                      if h == h'
                        then search store (localSearch idx)
                                    (match ++ [(h', localSearch idx)]) tasks
                        else localSearch idx

    Table 12. Goal-Based Algorithm

Continuations are also attached to each match. They are used to prune the search space by employing a technique known as back-jumping in AI, which allows us to skip several levels during the backtracking search. Here is how back-jumping works in our setting. Function search accumulates the list of matches found so far. Once a complete match is found, we need to (a) check that the matches are still valid and then (b) remove the matches from the store. Both steps need to be done atomically, because the search interferes with other concurrent goal-based execution threads. The atomicVerifyRemove function guarantees atomic verification and removal. (In our implementation, we communicate this information in an IORef cell by employing unsafeIOToSTM.) Verification failure implies that one of the matches has been removed. Hence, there is no point in continuing the search for tasks which are 'below' the task belonging to the removed match: verification would simply fail again. Instead, we prune the search space by picking the continuation of the 'earliest' task which failed.
    Tasks: [A, B, C]

    localSearch A: A1 -> A2 -> ...
    localSearch B: B1 -> ...
    localSearch C: C1 -> C2 -> ...

    1. atomicVerifyRemove [A1,B1,C1] fails due to C1
       => continue with localSearch C
    2. atomicVerifyRemove [A1,B1,C2] fails due to A1
       => continue with localSearch A

    Figure 3. Back-jumping Example
Figure 3 contains an example to demonstrate back-jumping. We assume a three-element task list (all heads). We process the tasks from left to right. At each level we initiate a local search for candidates matching A, B, or C. We assume that the first candidate (match) list [A1, B1, C1] fails due to C1. Hence, we continue the local search for C. The next candidate list [A1, B1, C2] fails due to A1 (a concurrent thread fired a join pattern which involved A1). We therefore pick A1's continuation (dotted arrows in the figure indicate back-jumps), because continuing the search for B or C with candidate A1 will always yield failure: removal of events is a permanent operation. Hence, we continue with A's local search.

In a concurrent setting, new events are continuously added to the store. Hence, there is the potential danger that localSearch will not terminate because we keep receiving new input in the form of newly stored events. We avoid this trivial form of non-termination by, firstly, iterating via nextEvent from the head to the tail of the store (a linked list) and, secondly, inserting new events at the head of the list.
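The following self-contained sketch distills the continuation-based back-jumping scheme. It is a much-simplified model of Table 12, not our library code: per-level candidate lists stand in for localSearch over the store, and a fixed validity oracle stands in for atomicVerifyRemove.

    type Cand  = String                 -- a candidate event, e.g. "A1"
    type Cnt   = IO (Maybe [Cand])      -- a continuation of the search
    type Match = (Cand, Cnt)            -- candidate plus its resume point

    -- Try each task level in turn; the continuation recorded with a
    -- candidate resumes that level's local search with the remaining
    -- candidates (cf. localSearch idx).
    searchLevel :: (Cand -> Bool) -> [Match] -> [[Cand]] -> Cnt
    searchLevel valid match []            = verify valid match
    searchLevel valid match (cands:tasks) = local cands
      where
        local []     = return Nothing    -- this level is exhausted
        local (c:cs) = searchLevel valid (match ++ [(c, local cs)]) tasks

    -- Model of atomic verify-and-remove: on failure we back-jump by
    -- resuming the continuation of the earliest invalid candidate,
    -- skipping all levels below it.
    verify :: (Cand -> Bool) -> [Match] -> Cnt
    verify valid match =
      case [ cnt | (c, cnt) <- match, not (valid c) ] of
        []      -> return (Just (map fst match))   -- complete, valid match
        (cnt:_) -> cnt                             -- back-jump and resume

    main :: IO ()
    main = do
      let valid c = c `notElem` ["A1", "C1"]  -- pretend these were removed
      r <- searchLevel valid [] [["A1","A2"], ["B1"], ["C1","C2"]]
      print r                                 -- Just ["A2","B1","C2"]

Because the oracle here is static, the trace differs from Figure 3 (where validity changes under concurrent firings), but the pruning behavior is the same: resuming A1's continuation discards all search work at the levels below it.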
7.5 Propagated Goals

Once a join pattern has fired, we execute the associated body in the thread of the goal which triggered the firing. In case the goal is propagated, we must continue executing the goal. This is important to guarantee exhaustive application of join definitions. For concreteness, consider the earlier example

    item(k,x) \ get(k,y) = y := x

Suppose that the store contains a number of get requests. Then, the goal item(k,1) triggers the above join pattern by removing one of the get requests. Because item(k,1) is propagated, the join pattern needs to be applied exhaustively to all get requests present in the store.
There are a number of possible approaches to guarantee exhaustive application of propagated goals:

1. After execution of y := 1, we invoke the goal item(k,1) again. This approach saves us from creating a new goal thread and enforces a serial execution of the join bodies.

2. Create a new thread in which to execute y := 1, and execute the goal again using its existing thread.

3. First, (a) find all possible matchings, which are then (b) executed either sequentially or concurrently.

In our current implementation, we follow the first option, which is easy to implement and intuitive for the user to understand; a small sketch of this option follows below.
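As an illustration, the driver loop for the first option might look as follows. This is a sketch under our own naming: FireResult and the fireOnce action are hypothetical, standing for one round of search-and-fire.

    data FireResult = FiredPropagated   -- fired; the goal was propagated
                    | FiredSimplified   -- fired; the goal was consumed
                    | NoMatch           -- no partners (yet) in the store

    -- Option 1: after a body has run, a propagated goal is re-executed
    -- in the same thread until no further match exists; bodies fired
    -- by this goal therefore run serially.
    executeGoal :: IO FireResult -> IO ()
    executeGoal fireOnce = do
      r <- fireOnce                 -- one search; fires the body if matched
      case r of
        FiredPropagated -> executeGoal fireOnce
        _               -> return ()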
7.6 Implementation Summary

We summarize the gist of our join pattern execution algorithm. For a given goal, and for each join definition (considered from top to bottom)

    p1 & ... & pn \ s1 & ... & sm when g = body

we apply the following steps:

1. Search for matching copies of p1,...,pn,s1,...,sm based on the given goal and the list of match tasks.

2. Check that the guard g holds under the matching substitution.

3. Atomically, (a) verify and mark p1,...,pn,s1,...,sm; (b) if successful, on commit, remove the copies s1,...,sm and execute body; (c) otherwise, continue with step 1.
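Step 3 can be rendered compactly in STM. The following is a minimal sketch, not our actual implementation (which builds a comparable verify-and-remove protocol on top of a CAS-based concurrent list, Section 7.3, and reports the failing match for back-jumping via an IORef and unsafeIOToSTM); the Alive/Removed flags are our own simplification.

    import Control.Concurrent.STM

    data State = Alive | Removed deriving Eq
    type Event = TVar State

    -- Atomically verify that all matched events are still alive and,
    -- if so, remove the simplified copies (the propagated ones stay
    -- in the store).  STM makes verification and removal atomic with
    -- respect to competing goal threads.
    atomicVerifyRemove :: [Event] -> [Event] -> IO Bool
    atomicVerifyRemove propagated simplified = atomically $ do
      states <- mapM readTVar (propagated ++ simplified)
      if all (== Alive) states
        then do mapM_ (`writeTVar` Removed) simplified
                return True
        else return False    -- a partner was removed concurrently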
8. Experiments

We conducted experiments with our parallel join implementation on a quad-core Intel Xeon 1.86 GHz with 1GB of memory, using the Glasgow Haskell Compiler (GHC) 6.10.1. The results shown are the relative performance of running on 2-4 cores against running on a single core, averaged over several test runs. Due to space limitations, we only briefly describe each join program in the benchmark; details and implementations can be found at http://code.haskell.org/parallel-join
• UnionFind: Adopted from (Frühwirth 2005). Parallelizes a union-find implementation: a concurrent data structure which maintains the union relationship among disjoint sets. In our experiments, we test an instance where 8 parallel union operations attempt to unite 9 disjoint sets of size 200.

• BaggerProblem: The bagger problem simulates a packing problem where n bags are packed with objects of three sizes, and larger objects cannot be stacked on smaller ones. In our experiments, we test an instance where 1000 items of various sizes are packed into 40 bags.
• StackConc/StackSeq: Two implementations of a stack with our join patterns. StackConc is shown in Table 7, while StackSeq is a variant with the last join pattern removed. In our experiments, we test an instance of 500 parallel push and pop operations.

• GossipGirls: Shown in Table 6, the gossiping girls problem simulates concurrent processes (girls) communicating and exchanging information until all girls have the full set of information. In our experiments, we test an instance where 50 girls start with mutually disjoint sets of secrets to tell.

• SantaXn: Adopted from (Trono 1994), the Santa Claus problem is an exercise in concurrency where Santa must synchronize with either 3 of his 10 elves to discuss toy designs, or all 9 reindeer to deliver toys, with the reindeer having higher priority. In our experiments, we test an instance where Santa must make 80 deliveries or toy discussions (SantaX1). We also investigated a variant where we have 5 Santa Clauses (SantaX5).

• PotatoShackXn: A simulation of a fast-food restaurant serving fries or baked potatoes. The problem consists of concurrent processes, running customer, cook or kitchen helper routines, which must communicate and synchronize with each other. In our experiments, we test an instance where 24 customers are served by 1 cook and 1 kitchen helper (PotatoShackX1). We also investigated a variant where we have 5 cooks and 5 helpers (PotatoShackX5).

• MusicalChairs: A simulation of the game of musical chairs. The game starts with n + 1 players and n chairs and continues until only one player is left. In our experiments, we test an instance where n is 30.

    Table 13. Experiment Results (two parts: Main Scalability Tests and Propagation & Scalability)

Table 13 shows our main experimental results. The Main Scalability Tests part illustrates the relative speed-up and the scalability of each program on up to 4 processors. As shown, the test programs experience a consistent speed-up as we increase the number of processors. In some cases (SantaX5, PotatoShackX5), we see significant super-linear speed-ups. Experiments SantaX1 and PotatoShackX1 show that these super-linear speed-ups are largely attributable to running 'output' processes in parallel (the processes that produce the actual outputs measured, e.g. Santa Claus and the cooks). In these experiments (SantaX1 and PotatoShackX1) we only have one such 'output' process, and thus we see a significant drop in the super-linear speed-up behavior. As discussed in Section 6.3, StackConc shows high scalability because we allow the pairing of parallel push and pop operations. For StackSeq, since we disallow this (we remove the last join pattern of Table 7), all push and pop operations must synchronize on a single 'Stack' event, and hence we do not get much speed-up.

We also investigated the empirical repercussions of not using propagation where possible. In the Propagation & Scalability part, we compare the programs UnionFind, MusicalChairs and PotatoShack with three respective variants which do not use propagation (NoProp). These three examples were chosen because they rely heavily on propagated patterns. As the results show, the no-propagation variants scale worse in general, and in the case of MusicalChairs, propagation is critical for scalability.

To summarize, our experiments show that programs written with our join pattern implementation scale relatively well and that propagation is a useful feature for parallel programming.
9. Related Work

Join Language and Extensions. The join calculus (Fournet and Gonthier 2002) provides the basis for a number of language extensions (Itzstein and Jasiunas 2003; Benton et al. 2004; Mandel and Maranget 2008; Russo 2008) and libraries (Russo 2007; Haller and Cutsem 2008) to support expressive concurrency abstractions in the form of join patterns. Extensions of this idea propose a limited form of guards in the form of algebraic patterns (Ma and Maranget 2008), or consider further program abstractions built on top of the basic calculus (Singh 2006). Our earlier (unpublished) work (Sulzmann and Lam 2007; Lam and Sulzmann 2008) describes the gist of an extension of join patterns with a general form of guards and propagation. In this paper, we provide a systematic join pattern execution scheme including a parallel implementation, as well as examples showing that both features are essential in many programming situations.

Join Execution Strategies. In the absence of guards, the search for events matching a join pattern is fairly straightforward. Join patterns are compiled to finite automata (Fessant and Maranget 1998): states represent how many events have already been matched, and transitions represent the match of a new event. High-performance implementations (Russo 2007) use bit-masking techniques to trigger join patterns in constant time. This execution scheme still applies in case join patterns involve algebraic pattern matching (Ma and Maranget 2008). However, the existing execution scheme breaks down in the case of (general) guards as supported in our system: we require a more elaborate search for matching events. Our choice is a goal-based execution scheme, where events entering the store act as (active) goals which look for missing partners to build a complete join pattern match. Our experimental results show that this method works effectively in practice. An alternative execution strategy is that join patterns actively look for matching events. This is known as rule-based parallelism, and efficient implementations such as the RETE algorithm (Forgy 1982) exist. We leave the exploration of this alternative execution scheme for future work.

Parallel Execution Methods. We are not aware of any prior work which discusses the parallel execution of join patterns. One possible explanation is that without guards the test for events matching a join pattern can be performed very fast (almost instantly); hence, there is no need for parallel execution. In the presence of guards, parallel search and execution become important to increase the throughput of join pattern applications. The only closely related work we are aware of is our own work (Sulzmann and Lam 2008), where we study the parallel execution of multiset constraint rewrite rules. The (parallel) implementation challenges are similar, and our present implementation benefits from our prior multiset rewriting experiences. In this work, we introduce new, improved implementation methods. For example, our goal-based search method in combination with atomic verification and removal builds an STM-like protocol on top of a highly-concurrent CAS-based data structure (Section 7.3). Previously, in (Sulzmann and Lam 2008), we relied solely on STM, which incurred a high overhead. We also integrated pruning of the search space via back-jumping. In the join pattern setting, we need to take care of the subtle interplay between goal and program threads (Section 7.4). Such issues do not arise in the multiset rewrite setting.

Search Optimizations. We consider methods to optimize the search for events matching a join pattern. Our implementation aggressively prunes the search space in case of conflicts among parallel search threads (Section 7.4).
For each individual search thread, further optimizations apply which have been extensively studied in the related setting of multiset constraint rewriting.
For example, in the (contrived) case of the join pattern A&B&C, we may have domain-specific knowledge that for goal A it is more efficient to first search for matching partner C before searching for B. We refer to (De Koninck and Sneyers 2007) for more details on optimal search orders for matching multi-headed rewrite rules. In the presence of guards, we can prune the search space by checking guards as soon as the variables involved are bound. For example, consider A(x)&B(y)&C(z) when x>y. Suppose we have found matches for A(x) and B(y). Before we continue to find a match for C(z), we check that the guard x>y holds. We can also build index tables to speed up the search. For example, consider A(x)&B(x,y). Suppose we have found a match for A(x). In addition to storing each B call, we also maintain a hash table indexed by the first argument. For each matched A, we can then use A's argument to quickly locate all potential partners B to build a complete match for A(x)&B(x,y). We refer to (Duck 2005; Schrijvers 2005) for details on early guard scheduling and dynamic indexing. In our current implementation, we use a simple indexing method which we plan to improve in future work.
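A minimal sketch of such an index follows (our own illustration, not our implementation; we use Data.Map in place of a hash table, and the types are hypothetical):

    import qualified Data.Map as Map

    data B = B Int Int                 -- a stored event B(x,y)
    type Index = Map.Map Int [B]       -- first argument -> candidate partners

    -- Maintain the index alongside the store on each insertion of a B.
    insertB :: B -> Index -> Index
    insertB b@(B x _) = Map.insertWith (++) x [b]

    -- All potential partners B(x,_) for a matched A(x): one lookup
    -- instead of a scan over the whole store.
    partners :: Int -> Index -> [B]
    partners x = Map.findWithDefault [] x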
10. Conclusion

We have shown the usefulness of an extension of join patterns with guards and propagation. We have developed a systematic goal-based execution scheme and shown how to obtain an efficient parallel implementation. The implementation and all examples in this paper are freely available and are distributed as part of the join package available on hackageDB. Currently, examples must be written in a more verbose syntax, which is often the case for library-based language extensions. In our case, the issue is that in Haskell we cannot overload the pattern matching syntax (at least not easily, as required for our application domain). In future work, we plan to use Template Haskell to provide a better user interface.
Acknowledgments

We thank Matthias Neubauer and the ICFP'09 reviewers for their helpful comments. We thank (Ken) Chung-Chieh Shan for pointing out the connection between our continuation-based search method and back-jumping in AI.

References

Nick Benton, Luca Cardelli, and Cedric Fournet. Modern concurrency abstractions for C#. ACM Trans. Program. Lang. Syst., 26(5):769–804, 2004.

Leslie De Koninck and Jon Sneyers. Join ordering for Constraint Handling Rules. In Proc. 4th Workshop on Constraint Handling Rules, pages 107–121, 2007.

Gregory J. Duck. Compilation of Constraint Handling Rules. PhD thesis, The University of Melbourne, 2005.

Fabrice Le Fessant and Luc Maranget. Compiling join-patterns. Electr. Notes Theor. Comput. Sci., 16(3), 1998.

Charles Forgy. Rete: A fast algorithm for the many patterns/many objects match problem. Artif. Intell., 19(1):17–37, 1982.

Cedric Fournet and Georges Gonthier. The reflexive CHAM and the join-calculus. In Proc. of POPL'96, pages 372–385. ACM Press, 1996.

Cedric Fournet and Georges Gonthier. The join calculus: A language for distributed mobile programming. In Applied Semantics, International Summer School, APPSEM 2000, Caminha, Portugal, September 9-15, 2000, Advanced Lectures, pages 268–332. Springer-Verlag, 2002.

Thom Frühwirth. Parallelizing union-find in Constraint Handling Rules using confluence analysis. In Proc. of ICLP'05, volume 3668 of LNCS, pages 113–127. Springer-Verlag, 2005.

Philipp Haller and Tom Van Cutsem. Implementing joins using extensible pattern matching. In Proc. of COORDINATION'08, volume 5052 of LNCS, pages 135–152. Springer-Verlag, 2008.

G. Stewart Von Itzstein and Mark Jasiunas. On implementing high level concurrency in Java. In Asia-Pacific Computer Systems Architecture Conference, volume 2823 of LNCS, pages 151–165. Springer-Verlag, 2003.

Edmund S. L. Lam and Martin Sulzmann. Finally, a comparison between Constraint Handling Rules and Join-Calculus. In Proc. of CHR'08, Fifth Workshop on Constraint Handling Rules, 2008.

Qin Ma and Luc Maranget. Algebraic pattern matching in join calculus. Logical Methods in Computer Science, 4(1:7), 2008. doi:10.2168/LMCS-4(1:7)2008.

Louis Mandel and Luc Maranget. Programming in JoCaml (tool demonstration). In Proc. of ESOP'08, volume 4960 of LNCS, pages 108–111. Springer-Verlag, 2008.

Claudio Russo. The Joins concurrency library. In Proc. of PADL'07, volume 4354 of LNCS, pages 260–274. Springer-Verlag, 2007.

Claudio V. Russo. Join patterns for Visual Basic. SIGPLAN Not., 43(10):53–72, 2008.

Tom Schrijvers. Analyses, optimizations and extensions of Constraint Handling Rules: Ph.D. summary. In Proc. of ICLP'05, volume 3668 of LNCS, pages 435–436. Springer-Verlag, 2005.

Satnam Singh. Higher-order combinators for join patterns using STM. In Proc. of TRANSACT'06: First ACM SIGPLAN Workshop on Languages, Compilers, and Hardware Support for Transactional Computing, 2006.

Martin Sulzmann and Edmund S. L. Lam. Haskell – Join – Rules. In Draft Proc. of IFL'07, September 2007.

Martin Sulzmann and Edmund S. L. Lam. Parallel execution of multi-set constraint rewrite rules. In Proc. of PPDP'08, pages 20–31. ACM Press, 2008.

Martin Sulzmann and Edmund S. L. Lam. Haskell parallel joins library with guards and propagation. http://hackage.haskell.org/cgi-bin/hackage-scripts/package/join, 2009.

Martin Sulzmann, Edmund S. L. Lam, and Simon Marlow. Comparing the performance of concurrent linked-list implementations in Haskell. In Proc. of DAMP'09, 2009.

John A. Trono. A new exercise in concurrency. SIGCSE Bull., 26(3):8–10, 1994.