Programming with Sociable Resources

Roberto Lublinerman
Pennsylvania State University, University Park, PA 16802, USA
[email protected]

Swarat Chaudhuri
Pennsylvania State University, University Park, PA 16802, USA
[email protected]

Pavol Černý
University of Pennsylvania, Philadelphia, PA 19104, USA
[email protected]
Abstract

We present a model for shared-memory parallel programming that makes shared objects ("resources") the drivers of heap-manipulating parallel computations. The model aims to syntactically capture patterns of spatial locality in heap updates and to express the maximum amount of logical parallelism in computations. To achieve this, we take a "resources'-eye" view of parallel operations on the heap. Resources are now viewed as active entities arranged in a network that is the heap. While they actively change their data content and links to other resources, each change is local and restricted to a spatial "neighborhood". Global computations are phrased as massively parallel compositions of these local operations. Our programming abstractions include operations that merge neighboring resources into larger resources and split complex resources into simpler ones. These abstractions are composable and directly encode heap-allocated data structures and spatial separation among resources. The model is data-race free even though it does not explicitly use locks. We demonstrate that the model allows easy, and easily parallelizable, programming of several important applications exhibiting irregular data-parallelism. In particular, it faithfully expresses the parallelism inherent in many natural processes and thus seems ideal for scientific and multimedia applications modeling them.

Categories and Subject Descriptors D.3.2 [Programming Techniques]: Concurrent Programming; D.3.2 [Programming Languages]: Language Classifications—Concurrent, distributed, and parallel languages; F.1.2 [Computation by Abstract Devices]: Modes of Computation—Parallelism and concurrency

General Terms Languages, Design

Keywords Parallel programming, Programming abstractions, Irregular parallelism, Data-parallelism

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Copyright © ACM [to be supplied]. . . $5.00

1. Introduction

Calls for new programming models for parallelism have been heard often of late [19, 23]. There are at least two reasons for this. First, the demand for scalable parallel programming is perhaps higher than ever: multicore machines are now near-ubiquitous, and high-performance applications that would benefit from parallelism arise not just in scientific computing but also in animation and gaming.

Second, it is increasingly clear that writing and optimizing parallel programs that would actually be of use in these domains is hard, and that low-level concurrency primitives like locks make it harder than necessary. Not all concurrent programming, of course, is equally difficult—e.g., "embarrassingly parallel" applications pose few issues. Far more challenging is the problem of efficiently programming applications that combine parallelism with accesses to heap-allocated resources like lists, trees, and graphs. These are the applications that interest us in this paper. In such irregularly parallel applications, the typical instance exhibits data-parallelism, but the worst-case instance does not. This makes compile-time parallelization impossible; in fact, it has been noted [18] that most current implementations of optimistic parallelism (using software transactional memory) suffer in this setting as well. Vexingly, many applications where parallelism is most needed—epidemiological simulations, Delaunay mesh refinement—fall in this category. As a result, many of these applications are perfect challenge problems for designers of new models of shared-memory parallel programming. This understanding is reflected in the recently released Lonestar benchmarks [1], which comprise code and datasets for four such problems. In this paper, we present our response to this challenge: the programming model of Sociable Resources. The model is predicated on the following beliefs:

1. Parallelism as default: Parallelism should not be an artifact imposed after a program is constructed sequentially. Instead, parallelism should be the default state of affairs in a program, and sequentiality should be imposed only when necessary.

2. Programming as specifying parallelism: In fact, the programmer should be able to express the maximum amount of logical parallelism that exists in a problem instance. This parallelism may or may not correspond to the physical parallelism of what code runs on what core. Indeed, expressing this logical parallelism should be the preeminent task of the programmer.

3. Resource-orientation: Our challenge here is to design a programming model capturing data-parallelism. As a result, we would do well to think in terms of resources—or heap-allocated data—rather than control. The questions we ask should be closer to "When can two resources be processed in parallel?" rather than "What do we run when a command terminates?"

4. Spatial locality: Parallelism in our challenge problems is closely tied to spatial locality. For example, in Delaunay mesh refinement [7], updates to a heap-allocated mesh apply to contiguous "neighborhoods" of triangles; in social network simulations, a typical update concerns a "neighborhood" of nodes. Any programming model successful in our challenge problems needs to keep spatial locality in mind.

We note that thread-based parallel programming, at least in its typical form, violates each of these. As for (1), the typical multithreaded program is constructed as an afterthought, from a sequential program achieving the same purpose. This seems especially unfortunate as many irregular parallel applications—e.g., epidemiological simulations—model natural processes that are inherently parallel. This parallelism gets lost in translation to sequential code, and then becomes hard to recover. As for (2) and (3), threads, even in object-oriented languages like Java, are inherently tied to control. The programmer's goal in writing a thread is to execute a sequence of imperative commands. As for (4), no threaded language that we are aware of is designed with spatial locality among resources as a first principle—or, for that matter, can express spatial locality in a general way. In Java or C#, for example, the heap is a global pool: unless explicitly locked, an object can be accessed by any thread at any time. Instead of trying to "fix" threads, Sociable Resources tries to start from scratch, keeping as design goals the beliefs above.

1.1 Our solution
The key idea of Sociable Resources is to take a “resources’-eye” view of parallelism. We view each resource on the heap as an active entity carrying some private data along with a local program. Instead of asking, “How do threads manage ownership of resources?” we ask “How do resources organize themselves into units and update themselves?” The heap is viewed as a network of resources, with pointers viewed as “links.” The local program of a resource can use three types of actions. First, it can perform a private action—i.e., modify its own data. Second, it can merge with a neighboring resource to form a new resource (this is our command for synchronization). Third, a resource can split into two new resources. All interaction between resources is captured with these primitives; as a resource can only interact with its neighbors (i.e., resources it has pointers to), such interaction can only be spatially local. Also, the data in resources is isolated and cannot be modified by any other resource. A global computation is viewed as a massively parallel composition of local interactions between resources and their neighbors. For an application, consider Delaunay mesh refinement, a problem of great interest in graphics and scientific computing (and part of the Lonestar benchmarks). Here we are given an initial triangulation of a two-dimensional plane; however, some “bad” triangles may not meet desired quality constraints. The problem is to retriangulate the mesh so that there is no bad triangle. It is a property of the problem that such retriangulation affects a “cavity”: a contiguous region in the mesh (in the worst case, this encompasses the entire mesh, but in the vast majority of cases, it does not). Thus, a way to retriangulate would be to search the mesh starting from a bad triangle, create a “cavity”, and retriangulate it. In our approach, the initial triangulation is modeled as a heap whose nodes—triangles— are active resources. 
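The refinement loop just described can be illustrated with a toy sketch. This is not the paper's SRL code (Figure 5 gives that later); it is an illustrative Python model in which "merge" is set union, "retriangulation" is a stand-in private update, and all names are hypothetical.

```python
# Toy model of resource-centric Delaunay refinement: a bad triangle
# merges with its neighbors into a cavity, the cavity fixes itself via
# a private update, then splits back into triangles.

class Triangle:
    def __init__(self, tid, bad, neighbors=()):
        self.tid = tid
        self.bad = bad                 # True = violates quality constraints
        self.neighbors = set(neighbors)

def refine_step(mesh, bad_tid):
    """One local step: build a cavity around a bad triangle and fix it."""
    # merge: the bad triangle absorbs its immediate neighbors into a cavity
    cavity = {bad_tid} | mesh[bad_tid].neighbors
    # private update: "retriangulate" the cavity (toy: mark members good)
    for tid in cavity:
        mesh[tid].bad = False          # stand-in for real retriangulation
    # split: the cavity dissolves back into individual triangles
    return cavity

def refine(mesh):
    # triangles "work in parallel"; we simulate with a sequential worklist
    while any(t.bad for t in mesh.values()):
        bad_tid = next(t.tid for t in mesh.values() if t.bad)
        refine_step(mesh, bad_tid)
    return mesh
```

Because each step only touches a cavity, two bad triangles whose cavities do not overlap could run their steps in parallel; this is the source of the irregular parallelism discussed above.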
If a triangle discovers that it is bad, it forms a cavity via repeated merge calls to its neighbors. The cavity retriangulates itself via a private update before it splits into the new triangles. All triangles in the heap work in parallel. This lets us capture as much parallelism as the problem instance permits. It is easy to see that this design satisfies our four design criteria. First, the default state of affairs here is that resources are parallel and independent, and merge actions have to be explicitly taken to achieve synchronization. Second, it is designed to expose the maximum amount of parallelism in the problem, as it tries to specify the computation in the most local way possible. Note that an actual implementation of the model may or may not choose to implement such extreme parallelism—if communication and other overheads are too high, it may impose sequentiality. For example, in the implementation described in Section 5, we let a single thread simulate thousands of active resources. Third, it is obviously resource-oriented; fourth, the fact that all interactions between resources are local captures spatial locality. In addition to all this, the model gives strong guarantees of atomicity and isolation: programs here are data-race-free by construction. We define a core language for Sociable Resources called SRL that has commands for private actions, sending and receiving merge requests, and specifying resource splits. The command for receiving merge requests specifies a continuation—a program that will govern the newly created merged resource. A split command specifies continuations for both of the new resources. We evaluate our approach in two ways. First, we demonstrate via several case studies that Sociable Resources can capture real-life applications with irregular data-parallelism. In addition to Delaunay mesh refinement, we consider a problem of finding "communities" in a social network [14, 3] (also part of the Lonestar benchmarks), an epidemiological simulation [9], and the problem of finding the minimum spanning tree in a graph. Second, we show that our approach leads to programs that can run efficiently on a shared-memory multiprocessor or on a distributed system. This is done via an implementation strategy that simulates the active heap using a fixed number of threads (we consider five values between 4 and 1024), which may be seen as modeling cores in a multiprocessor. We use data for Delaunay mesh refinement provided by the Lonestar benchmarks [1] and perform a series of measurements, the most important of which is the ratio of local vs. remote memory accesses. We observe that the computations are mostly local—e.g., the ratio is above 50 in all cases for a computation with 64 threads. The paper is organized as follows. In Section 2, we present the main ideas of our model and give a core language for it. Section 3 offers some elementary programming "tropes". Section 4 presents a few case studies, and Section 5 an evaluation of our ideas on the Delaunay mesh refinement problem.
Related work is discussed in Section 6; we conclude with some discussion in Section 7.
2. Sociable resources
Now we present the main features of Sociable Resources and a core language for programming in it. First we define heaps.

2.1 Heaps
The central entity in Sociable Resources is the heap: a representation of the complete state of a parallel program at any given time. Let us assume an infinite universe R of resources, which correspond to abstract locations in the heap; let us also assume universes Σ of resource continuations and F of pointer labels. We define a heap to be a labeled directed graph H = (N ⊆ R, E ⊆ N × N, nl : N → Σ, el : E → F × F ), where the node set N comprises resources, E is an edge set, the node-labeling map nl assigns a continuation to every node, and the edge-labeling map el assigns a pair of pointer labels to each edge. The continuation of a resource in a heap is in fact a structure. As we will soon see, one component of it is a program; the other is a labeled record comprising the mutable data encapsulated by the resource. For brevity, we usually refer to the field x of this record simply as the "field x of r", and denote its value by r.x. We emphasize that these fields describe the internal data of r and are not references to other resources. This distinction between internal and external references is crucial to our model. Edges in the heap capture pointers between resources and are annotated with pointer labels (f1 , f2 ). These labels are analogous to field names in Java-like languages—if the edge (r1 , r2 ) is labeled (f1 , f2 ), then the edge is known to r1 by the name f1 and to r2 by the name f2 . We require that for any resource r and pointer label f : (1) r does not have a data field f , and (2) there is at most one edge that is known to r as f .
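This heap definition can be rendered concretely. The following sketch (illustrative names; continuations abstracted to plain data records) represents H = (N, E, nl, el) with dictionaries and enforces the two label constraints just stated.

```python
# A minimal model of a heap H = (N, E, nl, el): nodes map to
# continuations (here, just data records), and each edge carries a pair
# of pointer labels (f1, f2), one per endpoint.

class Heap:
    def __init__(self):
        self.nodes = {}      # resource -> data record (continuation, abstracted)
        self.edges = {}      # (r1, r2) -> (f1, f2) pointer labels

    def add_resource(self, r, data):
        self.nodes[r] = dict(data)

    def labels_known_to(self, r):
        """Pointer labels under which r knows its incident edges."""
        labels = []
        for (r1, r2), (f1, f2) in self.edges.items():
            if r1 == r:
                labels.append(f1)
            if r2 == r:
                labels.append(f2)
        return labels

    def add_edge(self, r1, r2, f1, f2):
        # Constraint (1): a pointer label must not clash with a data field.
        assert f1 not in self.nodes[r1] and f2 not in self.nodes[r2]
        # Constraint (2): at most one edge is known to a resource as f.
        assert f1 not in self.labels_known_to(r1)
        assert f2 not in self.labels_known_to(r2)
        self.edges[(r1, r2)] = (f1, f2)
```

Note that the same edge is seen under two different names, one per endpoint, mirroring the (f1, f2) annotation in the definition.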
Figure 1. (a) A social network (b) Adam and Chitra merge (c) After a few merges (d) Adam and Chitra split from the "group"
2.2 Computation-carrying resources
Our most significant departure from traditional shared-memory models is that we let resources—or locations in the heap—be active entities. This is because we see computations from the resources’ point of view. Instead of asking “How do threads of control manage ownership of the various pieces of the heap?” we ask “How do pieces of the heap organize themselves into units and update themselves?” Instead of seeing the heap as just a data structure, we view it as a continuously changing network of interacting resources. The reason this makes a difference is that computations on the heap often exhibit spatial locality. Consider a continuously updated social network (Figure 1-(a)). Here, each node maintains some information about other nodes—e.g., “Which of my friends and friends’ friends are single?”—which means that if we update its data, we must update some other nodes as well. Note that these updates are local—a node is affected by changes to its neighbors and their neighbors, but not by changes more than two hops away. Sadly, this locality gets obscured in the typical approach to modeling the problem in thread-based languages—even object-oriented ones—where the network is represented by a heap-allocated graph. (For instance, the social network application in the Lonestar benchmark suite [1] models its dataset this way.) Because any thread is permitted to modify any part of the heap unless it is explicitly locked, processing a large number of updates as above in parallel becomes a global coordination problem. In contrast, in our setting, a “neighborhood” in the heap (a node, its friends, its friends’ friends) that needs to be updated organizes itself into a unit and updates itself using its own computation. Not only does this only require local communication within a radius of two pointer hops, it arguably also captures the natural parallelism of real-life social networks, where information creation and propagation happen bottom-up rather than top-down. 
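The two-hop locality in the social-network example above is easy to make precise: the unit that must update together is a node, its friends, and its friends' friends. A small sketch (illustrative names, including the Figure 1 characters):

```python
# The set of nodes within `radius` pointer hops of `node`—exactly what
# a sequence of merge operations would gather into one resource before
# a private update in Sociable Resources.

def neighborhood(graph, node, radius=2):
    """All nodes within `radius` hops of `node` (including itself)."""
    frontier, seen = {node}, {node}
    for _ in range(radius):
        frontier = {m for n in frontier for m in graph[n]} - seen
        seen |= frontier
    return seen

friends = {
    "Adam":    {"Beth", "Chitra"},
    "Beth":    {"Adam", "David"},
    "Chitra":  {"Adam", "Eve"},
    "David":   {"Beth", "Freddie"},
    "Eve":     {"Chitra"},
    "Freddie": {"David"},
}
```

With radius 2, an update rooted at Adam touches Beth, Chitra, David, and Eve but not Freddie, who is three hops away; this is the locality that a global-heap, lock-based encoding obscures.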
The same pattern surfaces in many other scientific and multimedia applications (see Section 4). To capture the active nature of resources, we let each resource r in a heap H carry a computation P in addition to its data D. This computation, called a local program, can refer to r’s data fields as well as names of pointers in and out of it. Now we can properly define the continuation nl (r) of r in the heap H: it comprises the names, data, and local program available to r in H. Formally, nl (r) = (D, P, Fi , Fo )
where D is a labeled record, P is a local program, and Fi and Fo are respectively the sets of pointer labels of edges in and out of r in H (as they are known to r). The sets Fi and Fo are permitted to be unbounded—after all, in the worst case, a resource may have pointers to/from all other resources (although, in many applications with spatial locality, the typical resource has only a few neighbors). Neither do we impose any restrictions on the types of the fields of D—they can be integers, lists, etc (of course, they must be private). Note that in Sociable Resources, each resource is in essence a separate process. Are we, then, resorting to message-passing at a “micro-level”? The answer is yes and no. While the heap here can be seen to comprise “processes,” a process (i.e., resource) can only communicate with (i.e., modify) another resource by following a pointer. Thus, the network of the heap has a topology. This is not enough though: crucially, this topology is highly dynamic, as pointers on the heap get “rewired” all the time. Even this does not give a complete picture, as such rewiring only happens locally. Thus, while our model can be seen as multiprocessing on the heap, its dynamic nature sets it apart. We believe that the best metaphor for this dynamicity is that of a social network, where (1) processes are agents that can only communicate locally; (2) associations may be formed or destroyed dynamically. It is legitimate to ask whether such extreme parallelism is realistic to implement—will not the communication overhead be too much? The answer is that our model is geared towards expressing the maximal logical parallelism in the system—to what extent to exploit it is up to the runtime system. For example, in our experiments, we map programs in our model to 4 to 1024 parallel threads, each thread being responsible for thousands of resources. 
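The continuation structure nl(r) = (D, P, Fi, Fo) defined above can be rendered concretely. The sketch below is one possible encoding (all field names are illustrative): D is a record of private data, P a set of guard/command pairs, and Fi/Fo the label sets.

```python
# One rendering of a resource continuation nl(r) = (D, P, Fi, Fo).
from collections import namedtuple

Continuation = namedtuple("Continuation", ["D", "P", "Fi", "Fo"])

# A local program P as a list of guarded commands g => C, where the
# guard is a predicate over D's fields and the command (here, a private
# update) maps the old record to a new one.
single = Continuation(
    D={"status": "single", "updates": 0},
    P=[(lambda D: D["status"] == "single",              # guard g
        lambda D: {**D, "updates": D["updates"] + 1})], # command C
    Fi=frozenset({"friend_of"}),
    Fo=frozenset({"friend"}),
)
```

Since Fi and Fo may be unbounded in the worst case, an implementation would typically store them lazily; the frozensets here are just the simplest encoding.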
In general, it seems clear to us that it is easier for a runtime to ignore a specification of parallelism—out of cost considerations—than to extract parallelism from heap-manipulating, mostly sequential code. And one never knows: the day may come when molecular computing makes even resource-level parallelism implementable.

2.3 Programs
Now we present a core language, called SRL, for our model. Programs here are parallel compositions of local programs within resources. The latter follow spatial locality: while a local program can modify its resource's own data and interact with the resource's neighbors in the heap, it cannot change the heap globally. Also, local programs are potentially nonterminating and can change over time—i.e., the same resource can carry different programs at different points. The basic primitives we offer are: (1) Private updates, by which a resource modifies its local data; (2) Split, by which a resource splits into two other resources with some potential pointers between them; (3) Merge, by which two adjacent resources merge into one. To see the generality of these, consider our social network, and suppose "Adam" wants to update his relationship status. This requires updates to Adam's friends Beth and Chitra and their friends David and Eve (Figure 1), which may be coded as follows. First, Adam merges with either Beth or Chitra—let us say Chitra—to form a new resource (Figure 1-(b)). This resource now merges with Beth, David, and Eve in any order. The resultant resource r (Figure 1-(c)) is what we want to update; a private update is now applied to it. Now r splits into its component resources in some order. For example, in Figure 1-(d), Adam and Chitra have split from r. During this split, the pointers among the resources may be "rewired"—e.g., in the above, Adam and Chitra are not friends any longer. Note that if, say, Adam wants to merge with Chitra while she is merging with Eve, then Adam has to wait. This is where synchronization is required. In general, the larger the number of resources on the heap and the less tightly connected the network, the greater is the parallelism—if all available resources merge into one, then we have a sequential program. Thus, the extent to which we are able
to exploit parallelism depends on the input data, and it is possible to create instances where no parallelization is possible. This is the defining property of irregular parallelism, where static approaches do not work. Our merges and splits, on the other hand, try to dynamically capture the parallelism that is possible. This makes them applicable to a number of problems of interest in scientific and multimedia computing that are known to be difficult to parallelize (e.g., Delaunay mesh refinement). Importantly, commands executed by resources have transaction-like semantics of isolation and atomicity. When a resource r executes a command (perhaps together with another resource, in the case of a merge), the outcome is independent of actions taken by any resource not explicitly involved in the command. This ensures that our commands may be sequentially composed. What if Adam wants to read who he wants to merge with from an input file F ? This is done by modeling F as a resource as well. Adam merges with F , gets pointers to the person he is to merge with, splits from F , and merges with her. In fact, we have a general mechanism for I/O that we discuss in Section 3.

Figure 2. (a) Split and merge (b) Split, no merge

2.3.1 Syntax
A parallel program P in SRL is simply a heap H with no edges and a single initial resource In; its continuation is the initial continuation. Over time, this resource splits off other resources, potentially generating heaps of unbounded size. As for the syntax of local programs, consider a resource r in a heap H such that nl(r) = (D, P, Fi , Fo ). The local program P is a set of guarded commands, each of the form g ⇒ C, where g is a guard and C is a command. The guard is a predicate over data fields available in D or queries over Fi and Fo —e.g., (x > 4) (where x is a data field) or (f ∈ Fi ) are legitimate guards. Commands are of the form:

C ::= setd(d) | split(s, s′, b) | merge!(f ) | merge?(f, Φ)
where d is a side-effect-free expression over field names of D and pointer labels in Fi and Fo that evaluates to a record of the same type as D, f ∈ Fo is the pointer label of an edge out of r, s and s′ are resource continuations, b is a record used for bookkeeping, and Φ is a function that takes a resource continuation and returns another. Importantly, while d can refer to pointer labels, it cannot dereference them. We skip the details of the syntax of d. As for the meanings of commands, intuitively, setd(d) is a private update, split(s, s′, b) splits a new resource off r, merge!(f ) is a request by a resource to merge with another, and merge?(f, Φ) is the acceptance of such a request.

2.3.2 Semantics
Now we give an interleaving semantics for SRL. The semantics of a parallel program P is defined by transitions H −→ H′, each of which says that it is possible for the heap to change from H to H′ in one atomic step. Let us now define these transitions. Consider a heap H and any resource r in it. The continuation nl(r) = (D, P, Fi , Fo ) of r defines an environment EH,r assigning values to the names of the data fields of D. Let us call a guard g in P enabled, and write r, H |= g, if it evaluates to true under EH,r . There are three kinds of transitions:
Private updates. The command setd(d) performs an assignment to the private data of r. Let D′ be the value of d in the environment EH,r , let l be the continuation obtained by taking nl(r) and replacing its data by D′, and let H′ be the heap obtained by changing the continuation of the resource r to l. Then, if (g ⇒ setd(d)) ∈ P and r, H |= g, we have a transition t = H −→ H′.

Splitting. The command split(s, s′, b) splits a new resource r′ off r and gives it the continuation s′; r continues with the continuation s. During the split, edges to/from r may be rewired and new edges between r and r′ added. If r was the index node of the heap before the split, it continues to be the index node after it. While the edges in and out of r are modified by the split, other nodes in the heap are not touched. Suppose (g ⇒ split(s, s′, b)) ∈ P and r, H |= g. For the split to occur, the following must hold. First, r′ cannot be in H already. Second, the parameters to the split-operation must state how to rewire edges during the split. Specifically:

1. The record b must provide information about whether new edges between r and r′ are to be created, and if so, what their labels are (this information has a fixed-size encoding). The sets of pointer labels available in s and s′ must be consistent with this edge addition—e.g., if we create a new edge from r′ to r labeled (f1 , f2 ), then f1 must be recorded in s′ as a label for an outgoing edge, and f2 must be recorded in s as a label for an incoming edge. These new labels must not conflict with labels of pre-existing edges connecting other resources to r and r′.

2. The edges in and out of the original resource r must be partitioned between r and r′ during the split. See Figures 2-(a) and 2-(b). In both cases, r originally had incoming edges labeled i1 , i2 , etc., and outgoing edges labeled o1 , o2 , etc. Note how the original labels are distributed by the split.
Additionally, in the former case, a new edge (r′, r) labeled (f1 , f2 ) is created; in the latter, no new edge is formed. Let H′ be the heap obtained from H by changing the continuation of r to s and adding the new resource r′, with continuation s′ and incoming and outgoing edges as specified above. In that case, we have a transition t = H −→ H′.

Merging. Merging of resources requires synchronization between two resources r1 and r2 connected by an edge, which is viewed as a channel. Let Pi be the local program of ri and let the edge (r1 , r2 ) have the label (f1 , f2 ). For the merge to occur, we must have (g1 ⇒ merge!(f1 )) ∈ P1 and r1 , H |= g1 , as well as (g2 ⇒ merge?(f2 , Φ)) ∈ P2 and r2 , H |= g2 . Let the continuation of r1 in H be s1 . The following happen atomically during the merge:

1. r1 issues a merge request by executing merge!(f1 ). By doing so, it makes its continuation s1 available on the edge (r1 , r2 ), viewed as a channel.

2. Synchronously, r2 receives the request and executes merge?(f2 , Φ). Let the continuation of r2 at this point be s2 .

3. r2 merges its own continuation with the continuation s1 provided by r1 , producing a continuation s. This is done via the map Φ, which takes one resource continuation and returns another; we have s = Φ(s1 ). We require s to refer only to the names of data fields and pointer labels available in s1 and s2 . Note that since Φ is local to r2 , it has access to all of the latter's names and does not need s2 as a parameter.

4. Edges in and out of r1 are rewired so that every edge going out of (similarly, coming into) r1 or r2 now goes out of (similarly, comes into) r2 (with the exception of edges between r1 and r2 , which are deleted). This rewiring must be consistent with the set of available pointer labels specified by s—e.g., if an edge labeled (f1 , f2 ) now goes out of r2 , then f1 must be recorded in s as a label for an outgoing edge.

5. r1 is now made inactive, meaning its set of transitions is emptied. The continuation of r2 is replaced by s, so that r2 becomes the merged resource.

Let H′ be the heap obtained by making the above changes to r1 and r2 and the edges connecting them, leaving every other node and edge unchanged. In that case, we have a transition t = H −→ H′.

Executions. An execution of a parallel program P is a possibly infinite sequence of transitions η = t0 t1 t2 . . . such that t0 is of the form P −→ H for some H (recall that, syntactically, the program P is just the initial heap) and for all i, if ti = H −→ H′, then ti+1 is of the form H′ −→ H″ for some H″.

2.3.3 Concurrency, race-freedom, and composability
The extreme parallelism permitted by our semantics is better understood using an equivalence relation ≡ between executions, defined so that if η ≡ η′ for executions η and η′, then the runtime system can freely choose between the two while staying correct. Now we study this equivalence and its use in parallel implementation. First, consider private updates. Note that the data of a resource r1 cannot be modified by other resources. Also, if f is a pointer label in/out of a resource r1 , no other resource can make it illegal—even if there is a different resource at the other end of the edge, the edge still exists (note that this holds only because a resource cannot dereference such pointers without merging/splitting). As a result, a private update t1 by r1 is causally independent of any transition t2 taken by any other resource r2 . In that case, t1 and t2 can be commuted—i.e., for any two executions η and η′, we have η t1 t2 η′ ≡ η t2 t1 η′. Thus, t1 and t2 can always be run in parallel. Now take a transition t1 splitting a resource r′ off r. As t1 rewires the edges in and out of r, it affects the pointer labels of the neighbors of r in the heap, and cannot be commuted (i.e., run in parallel) with merge- or split-operations executed by them. This restriction can be implemented optimistically or pessimistically—e.g., r may lock the edges in and out of it, or the runtime may allow transitions to proceed in parallel and roll back in case of the above conflict. Either way, it only affects the typically small number of neighbors of r (and even then, not their private updates). As for a transition t merging r1 and r2 , once again, it cannot be commuted with merge and split transitions by neighbors of r1 and r2 . All other parallelism is permitted. There is, however, one subtlety. While our semantics only has transitions corresponding to the simultaneous execution of the commands merge! and merge?, in any real implementation, one of them will be executed before the other. Our semantics demands that in such a case, all other enabled transitions in r1 and r2 continue to stay enabled (as we have not committed to t yet). Thus, our semantics for SRL is nonblocking—in other words, a deadlock is a violation of the semantics. A faithful implementation of this nonblocking semantics requires some intervention on the part of the runtime. For example, a pessimistic implementation can make r1 and r2 agree on the merge by setting some flags or passing some messages; an optimistic implementation can make r1 go ahead and issue the merge request, canceling the request if it is blocked for too long. Here a particular implementation of a runtime can use randomization to break the symmetry between two resources issuing merge requests. After an unsuccessful merge request, a resource can wait for a randomly chosen (and increasing) time span.

Note that SRL guarantees that the data within each resource is private and can only change through an explicit action by the resource. Further, unless this action is a merge or split operation, this data is isolated from actions taken by other resources. This means that even though we do not use locks in our programs, they are data-race-free:

Theorem 1. The language SRL is data-race-free.

The above also means that we can easily perform sequential composition of commands within a resource. Suppose a resource r wants to execute the composition C1 ; C2 of commands C1 and C2 . Before it starts executing C1 , it uses a private data field to disable all its other guarded commands; once it has finished C2 , it resets the field so that all guards that were previously true become true again. Note that this would not be possible if the private data of resources were not isolated from the rest of the heap.
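The isolation argument can be exercised on a toy interpreter. The sketch below (illustrative names; not the paper's implementation) models private updates and merges on a tiny heap: each resource owns its data outright, so a private update touches exactly one resource and commutes with transitions of other resources, while a merge atomically combines data and rewires edges.

```python
# A toy model of the SRL transition kinds. Private updates only read
# and write the acting resource's own record; merge atomically makes
# one neighbor the merged resource and rewires the other's edges to it.

class Resource:
    def __init__(self, name, data):
        self.name, self.data, self.alive = name, dict(data), True

class MiniHeap:
    def __init__(self):
        self.resources = {}
        self.edges = set()             # pairs (r1, r2); undirected here

    def add(self, name, data):
        self.resources[name] = Resource(name, data)

    def setd(self, name, update):
        """Private update: modifies only the resource's own data."""
        r = self.resources[name]
        r.data = update(r.data)

    def merge(self, n1, n2, combine):
        """Atomic merge of two neighbors; n2 becomes the merged resource."""
        assert (n1, n2) in self.edges or (n2, n1) in self.edges
        r1, r2 = self.resources[n1], self.resources[n2]
        r2.data = combine(r1.data, r2.data)   # the map Phi, data part only
        r1.alive = False                      # r1 is made inactive
        # rewire: edges incident to n1 now attach to n2; (n1, n2) is deleted
        self.edges = {(n2 if a == n1 else a, n2 if b == n1 else b)
                      for (a, b) in self.edges if {a, b} != {n1, n2}}

h = MiniHeap()
h.add("adam", {"count": 0})
h.add("beth", {"count": 10})
h.add("eve", {"count": 100})
h.edges = {("adam", "beth"), ("beth", "eve")}

# Private updates by distinct resources commute: either order yields the
# same heap, so a runtime may execute them in parallel.
h.setd("adam", lambda d: {"count": d["count"] + 1})
h.setd("eve", lambda d: {"count": d["count"] + 1})

# Merge adam into beth; beth carries the combined data and adam's edges.
h.merge("adam", "beth", lambda d1, d2: {"count": d1["count"] + d2["count"]})
```

This is Theorem 1 in miniature: because every datum has exactly one owning resource at any time, no interleaving of `setd` calls by different resources can race.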
3. Tropes
The language in Section 2 is too low-level to program in. In this section, we present a set of constructs—or tropes—that make programming with resources easier.

3.1 Control constructs
While our core language does not have commands for sequential control flow—e.g., blocks, branching, local procedures—they can easily be implemented. The same data field used to ensure atomicity of sequential composition of commands can also be used to ensure atomicity of more complex blocks (we omit the details). From now on, we use C-like syntax for blocks, branches, etc. within the body of guarded commands.

3.2 Resource classes
While a heap in our setting may have unboundedly many resources, in typical use they fall into a few classes. Such classes are similar to classes in object-oriented programming, though they do not, in this paper, allow inheritance, polymorphism, etc. Resources in the same class run the same code and have private data of the same type, so that it is possible to declare them together. We avoid a formal presentation of resource classes, instead proceeding by example. Consider Figure 5, which presents pseudocode for Delaunay mesh refinement. Here, Cavity is a resource class: its (typed) data fields are listed under the header local data, and its guarded commands are listed under the header code. The list neighbors collects the pointer labels for incoming and outgoing edges (in this example, the heap is undirected, so we need just one such set—in general, we would use the keywords in-neighbors and out-neighbors). The function mergeCavity merges two instances of the Cavity class. This is a restricted version of what we called Φ in Section 2; while Φ could take an arbitrary continuation and create a new one, here the local program of the merged resource is fixed by the fact that it is an instance of Cavity. We omit further details for want of space. From now on, we consider resource classes to be part of SRL syntax.

3.3 Accepting and requesting on multiple edges
In some cases we want a resource r to be willing to merge with any neighbor, as opposed to one along a specific edge. The command for this, written merge?-any(Φ), is implemented by letting r enable merge?(Fi(j), Φ), where Fi(j) is the j-th element of the set of incoming edges to r, and update the cyclic counter j periodically.
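The cyclic-counter reduction of merge?-any to single-edge merge? can be sketched as follows. This is an illustrative model, not SRL: `in_edges` plays the role of the incoming-edge set, `requests` the set of edges with a pending merge!, and `rounds` bounds the scheduling rounds we simulate.

```python
# Sketch of merge?-any(Φ) via a cyclic counter: each round, the resource
# enables merge? on the j-th incoming edge only, then advances j.
def merge_any_rounds(in_edges, requests, rounds):
    """Return the first edge on which a pending merge! request meets the
    currently-enabled merge? guard; None if no match within 'rounds'."""
    j = 0
    for _ in range(rounds):
        edge = in_edges[j]
        if edge in requests:         # merge! pending on the enabled edge
            return edge
        j = (j + 1) % len(in_edges)  # advance the cyclic counter
    return None
```

With enough rounds, every incoming edge with a pending request is eventually considered, which is the fairness the counter provides.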
Similarly, a resource can specify a subset of edges on which it is willing to take merge requests. The command used for this is merge?-any(Φ, b), where b is a list of booleans whose length equals the number of incoming edges to r, and whose elements indicate whether r is willing to listen on the corresponding edge. Analogously, we have a command merge!-any that lets the resource request a merge on all outgoing edges, and merge!-any(b) for requesting a merge along selected edges.

3.4 Heap traversal by meme swarms
In many settings, we need a facility for traversing the heap. For example, a resource x may be interested in searching the heap for a resource carrying a special datum d. Such search problems can be solved using memes. Suppose we have a special resource (meme) Sx,d which is split off x, "stealing" a link from x to one of its neighbors x′ in the heap (see Figure 3-(a), with I substituted by x, Ir by Sx,d, and r by x′—this meme has edges to and from x). Now Sx,d merges with x′, in essence traversing the edge (x, x′) in the heap. If x′ possesses d, the meme stops and reports back to x (i.e., merges with it along the edge (Sx,d, x)). Otherwise, it has access to all the edges out of x′ and, after setting a data field Visited at x′, traverses them. It is easy to see that we can implement a depth-first or breadth-first graph search this way. If x wishes to "terminate" the search at any time, it recalls the meme by merging with Sx,d along the edge (x, Sx,d) (x does not split off Sx,d ever again, in effect "absorbing" it). In fact, we can use a swarm of memes to parallelize this search. The resource x now spawns a large number of memes that have edges to and from x, traverse the graph in parallel, and report to x when the resource carrying d is found. Once the sought-after resource is found by one meme, x recalls the rest using merges.
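A single meme's traversal can be sketched as an ordinary breadth-first search, where each hop models "split off a link, merge with the neighbor" and the Visited field prevents re-exploration. This is an illustrative model with our own names, not SRL code.

```python
from collections import deque

# Sketch of one search meme S_{x,d} walking the heap (adjacency dict).
def meme_search(heap, x, has_datum):
    """Return a resource reachable from x that carries the datum
    (per has_datum), or None. Models one meme doing BFS."""
    visited = {x}
    frontier = deque(heap[x])        # the meme steals links out of x
    while frontier:
        node = frontier.popleft()
        if node in visited:
            continue
        visited.add(node)            # set the Visited field at node
        if has_datum(node):
            return node              # the meme reports back to x
        frontier.extend(heap[node])  # traverse the edges out of node
    return None
```

A swarm would run several such searches over disjoint frontiers, sharing the Visited marks.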
3.5 Input/output

Writing realistic programs, of course, requires a facility for I/O. This is achieved via two resources—a special input server I and an output server O. The resource I contains in its data fields all the input data; for example, an input stream coming from a user or from the network can be connected to I. Any data placed in O is assumed to have been output. Every resource r on the heap that needs to do I/O has edges to and from these special resources. For example, Figure 3-(a) shows a part of a linked list where every node is linked to I. The I/O process proceeds as follows. Suppose we want r to read a datum d from I, or, similarly, suppose that there is an input addressed to r. One way to process the input would be for r to merge with I—however, this would impose a sequential bottleneck, as there is only one input server in the system. Consequently, we let I split off a special resource Ir—called a meme—that contains d as data and "steals" the link to r from I (Figure 3-(b)—note that I is not connected to r any longer). This meme now merges with r (Figure 3-(c)), and r is able to read d from it. Note that r is now connected to I again. Also note that many such memes can exist in the system simultaneously, allowing different heap elements to perform reads in parallel (Figure 3-(d)). The mechanism for output is similar.
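The meme-based input mechanism can be sketched as follows. This is an illustrative Python model with our own names (`InputServer`, `spawn_meme`, `inbox`): the input server splits off one meme per addressed datum, so different resources can read in parallel instead of serializing on I.

```python
# Sketch of meme-based input: I splits off the meme I_r carrying the
# datum addressed to r; r then merges with the meme to read the datum.
class InputServer:
    def __init__(self, pending):
        self.pending = dict(pending)   # resource name -> datum addressed to it

    def spawn_meme(self, r):
        """Split off I_r, which 'steals' I's link to r and carries r's datum."""
        return {"to": r, "datum": self.pending.pop(r)}

class Node:
    def __init__(self, name):
        self.name, self.inbox = name, None

    def merge_with_meme(self, meme):
        assert meme["to"] == self.name
        self.inbox = meme["datum"]     # r reads d from the merged meme
```

Since each meme is a separate resource, many of them can be in flight at once, which is exactly the parallel-read picture of Figure 3-(d).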
3.6 Iteration and global application
An important application of heap traversal is iteration through a list-structured heap, where we want to apply a function f to every element in the list in sequence. To implement this, the resource h at the head of the list creates a special meme It (known as an iterator). The iterator traverses the list using the previously described mechanism; on merging with the i-th element L(i) of the list, it applies f to it. Like memes for heap search, it retains its link back to h all the time. On reaching the end of the list, it gets "absorbed" into h by merging with it along the edge (It, h), and the iteration process terminates.

The above is, of course, an inherently sequential process. In many cases, it is desirable to apply a function f to every element in the heap in parallel (as opposed to in a specific order). In this case, we may use a swarm of memes that traverse the heap in parallel, each meme applying f whenever it locates a node r where f has not been applied yet (a special data field can keep track of whether this application has already taken place).

Figure 3. (a) A list and the input server (b) Spawning a meme (c) Meme coupling with resource (d) Parallel input

3.7 Garbage collection
Suppose we are interested in filtering the heap—i.e., we want all elements satisfying a predicate Pred to be removed from the heap. This can be captured using a special resource called GC (for “garbage collector”). Each node r in the heap has a link to GC . The garbage collector periodically traverses all nodes, merges with them, and checks whether the node satisfies Pred . If it does not, GC releases the node back and restores its edges. If it does satisfy Pred , the garbage collector does not let go of r ever again, thus “absorbing” it and removing it from the heap.
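The filtering behavior of GC can be sketched as a pure function on an adjacency-list heap. This is an illustrative model: absorbed nodes disappear, and the edges of released nodes are restored minus the links to absorbed nodes.

```python
# Sketch of heap filtering with a GC resource: nodes satisfying Pred are
# absorbed (never released); all other nodes are merged, inspected, and
# released with their surviving edges restored.
def gc_filter(heap, pred):
    """heap: adjacency dict (node -> neighbor list). Remove every node
    satisfying pred, dropping edges into and out of removed nodes."""
    absorbed = {n for n in heap if pred(n)}
    return {n: [m for m in nbrs if m not in absorbed]
            for n, nbrs in heap.items() if n not in absorbed}
```

In the actual model GC does this incrementally, node by node; the end state is the same as this one-shot filter.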
4. Case studies
In this section, we show how to apply Sociable Resources to two irregularly data-parallel examples from the Lonestar Benchmark Suite [1]: the Delaunay Mesh Refinement problem and the Focused Community Discovery problem. We also consider the problem of finding the minimum spanning tree of a graph and a problem of agent-based epidemiological simulation. We have also considered other applications, such as the Barnes-Hut algorithm for N-body simulation [5] (also part of the Lonestar suite) and an algorithm for edge detection [25] used in image processing. A description of these is postponed to a future version of this paper.

4.1 Delaunay mesh refinement
Mesh refinement is a computationally intensive problem widely used in several domains such as graphics and numerical analysis. The problem has recently been identified as a candidate for parallel implementation, and finding the right way to implement it is an active research area [18, 17, 22]. We briefly describe a refinement algorithm for a Delaunay triangulation of a set of points in a plane. Given a set of points M, a Delaunay triangulation partitions the convex hull of M into a set of triangles such that: (1) the vertices of the triangles, taken together, are M, and (2) no point in M lies in any triangle's circumcircle (the empty circle property). In many applications (see e.g. [7]), there are further qualitative constraints on the resulting
Mesh m = /* read input mesh */
Worklist wl = new Worklist(m.getBad());
foreach Triangle t in wl {
  Cavity c = new Cavity(t);
  c.expand();
  c.retriangulate();
  m.updateMesh(c);
  wl.add(c.getBad());
}

Figure 4. Delaunay Mesh Refinement
triangles. In order to meet these constraints, a Delaunay triangulation often needs to be refined. We black-box the requirements and suppose there is a function that identifies "bad" triangles. The pseudocode for a sequential algorithm for refining the mesh is in Figure 4. At the start, the worklist is populated by the bad triangles of the original mesh. The goal now is to remove each bad triangle by making the mesh finer. For each bad triangle t, the algorithm proceeds as follows¹:

• A point p at the center of the circumcircle of the triangle is inserted.

• All the triangles whose circumcircle contains p are collected. This is called the cavity of t. The cavity is guaranteed to be a contiguous region of the plane; therefore a breadth-first search touching only a local part of the heap containing the bad triangle can be used to find it. See Figure 9 for an example of cavities.

• The cavity is then retriangulated by connecting p with all the points at the boundary of the cavity. The Delaunay property (the empty circle property) is guaranteed to hold for the newly created triangles.

The qualitative constraint may not hold for all the new triangles, so the size of the worklist might increase in certain steps. Note, however, that the algorithm is guaranteed to terminate. In order to see how the sequential program presented above can be parallelized, it is important to note that retriangulating a cavity is a local operation, and that the order in which the cavities are retriangulated is not important. Our approach will exploit these two characteristics of the problem. The initial triangulation is modeled on the heap as a graph whose nodes are triangles, and whose edges connect neighboring triangles. The intuitive idea for the computation of refinement is that the "bad" triangles can merge with their neighbors to create cavities, which will then be retriangulated. Figure 5 contains SRL code for a resource class called Cavity. The Cavity resource represents a convex polygon that consists of either one triangle or several neighboring triangles. The resource has the coordinates of the vertices on its border, stored in the list borderNodes. It also stores C1, the center of the circumcircle of the original bad triangle, and the radius r of this circle. The resource has four guarded commands. For a bad triangle (tested by the function amIBad), the goal is to see whether its neighbors are in its cavity, and if so, to team up with them. The triangles are added to the cavity via a breadth-first search (BFS). It can stop after all the neighbors have been checked and it is known that they do not belong to the cavity (note that the resource Cavity also keeps the border triangles—the triangles which do not belong to the cavity but are just outside its border). The isComplete? function
Cavity:
  local data:
    borderNodes: point list
    original_triangles: triangle list
    C1: point
    r: int
    neighbors: edge list
  code:
    | triangle? & amIBad & not isComplete? : merge!-any()
    | composedCavity? & not isComplete? : merge!-any()
    | composedCavity? & isComplete? : retriangulate(); splitIntoTriangles()
    | true : merge?-any(mergeCavity)

  mergeCavity(inObject) {
    if isTriangle?(origObject) {
      if inCircle(inObject.C1, inObject.r, origObject) {
        // triangle origObject is added to cavity inObject
        addToCavity(origObject, inObject, true)
      } else {
        // triangle origObject is added as a border triangle to cavity inObject
        addToCavity(origObject, inObject, false)
      }
    } else {
      // it is a composed cavity
      ...
    }
  }

Figure 5. Delaunay Mesh Refinement in SRL
checks whether the expansion has stopped (note that the cavity is guaranteed to be a contiguous region). If a cavity has not yet finished its expansion, it nondeterministically chooses a (not yet checked) neighbor and tries to merge with it, or listens for a merge request². If a cavity C1 receives a merge request from cavity C2 and accepts it, the function mergeCavity takes over as the program for the new merged structure. This function takes the requesting object C2 (inObject in the code) as input and can refer to C1 (origObject). It works as follows. First, it tests whether origObject is a triangle. If so, it decides whether origObject belongs to the cavity represented by inObject. If this is the case, the new cavity can continue expanding. If not, the triangle origObject is kept as a border triangle. If origObject is a cavity, it can still accept a request for merging with another cavity (in order to avoid deadlocks which would otherwise occur for two neighboring cavities). In this case the function mergeCavity (in the part of the code elided by ...) splits the cavity origObject into the original triangles, and merges the appropriate ones into the cavity inObject. Finally, if a cavity discovers that it is complete, it can retriangulate by a private action. The cavity then splits into the new triangles.

¹ For ease of presentation, we suppose here that the bad triangle is not near the boundary of the whole mesh.

² Recall that SRL is a guarded command language, and the remarks in Section 2 on how the runtime can break the symmetry between resources to prevent deadlocks.
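The sequential worklist loop of Figure 4 can be sketched with the geometry abstracted away. The following Python skeleton is illustrative: `is_bad`, `make_cavity`, and `retriangulate` are hypothetical callables standing in for the geometric predicates and operations, and the toy instantiation in the usage note replaces them with trivial arithmetic.

```python
from collections import deque

# Worklist skeleton of Figure 4; geometry is abstracted behind callables.
def refine(mesh, is_bad, make_cavity, retriangulate):
    """mesh: a set of triangles (any hashable stand-ins). Repeatedly
    replace each bad triangle's cavity by its retriangulation."""
    wl = deque(t for t in mesh if is_bad(t))
    while wl:
        t = wl.popleft()
        if t not in mesh:            # t may have been consumed by an earlier cavity
            continue
        cavity = make_cavity(mesh, t)       # BFS around t in a real implementation
        new_triangles = retriangulate(cavity)
        mesh -= cavity
        mesh |= set(new_triangles)
        wl.extend(x for x in new_triangles if is_bad(x))
    return mesh
```

As a toy run, take integers for triangles, odd integers as "bad", singleton cavities, and a retriangulation that replaces n by n + 1; refining {1, 2, 3} then yields {2, 4}.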
4.2 Agent-based models in epidemiology

A typical question in epidemiology is what type of vaccination distribution is effective to prevent an outbreak from developing into a full-scale epidemic. It is well known that agent-based modeling, which enables different types of interactions between agents, has advantages over models that assume a uniform probability of any two agents meeting. A more detailed model allows capturing the fact that agents interact only with a certain number of people—those that they meet at home, at their workplace, etc. The survey [9] describes several such approaches used for modeling the spread of the smallpox virus. We consider the model of Burke et al. [6]. It simulates how a virus can spread from a single infected person throughout a small network of towns. Each town consists of one hospital, one school, one other workplace, and households of up to seven people. The model extends the interaction assumptions further and has every agent interacting with the same (up to eight) people in public places such as schools and hospitals. During a "day", an agent interacts with all of its immediate neighbors (a fixed number of times, different for each type of community). Transmission of the virus occurs only during these interactions. The computations necessary are thus again purely local, and can be naturally captured in SRL. The interactions are modeled in a similar way as for the social network in Section 2. The agents have a fixed number of neighbors, up to eight per environment (home, school) in which they interact. An interaction is modeled by a merge, an update (if one of the persons is infected, the virus is probabilistically transmitted), and a subsequent split. Modeling the interactions in this way captures the maximum amount of parallelism.

4.3 Focused community discovery

A typical problem in analyzing social networks is focused community discovery. Given a person p, the task is to discover the community to which the person belongs. The community around p is intended to capture information flows in the network; thus we are interested in finding a set of people that contains p and that is robust—i.e., the connections within it are stronger than its connections to the outside world. How to discover communities efficiently is a topic of current research (see [14, 3]). A data set connected to this problem is part of the Lonestar Benchmark Suite. We consider an algorithm for focused community discovery from [14]. Figure 6 has the pseudocode for the algorithm. The algorithm greedily optimizes an objective (given by the function obj). The algorithm keeps its current hypothesis for the community (the core set). The fringe set is the set of adjacent nodes, that is, nodes that are not in the core but are directly connected to it.

core := R;
changed := true;
while changed do {
  changed := false;
  fringe := neighbors(core);
  for each v in core {
    if obj(core - {v}) < obj(core) {
      core := core - {v};
      changed := true;
    }
  }
  for each v in fringe do {
    if obj(core union {v}) < obj(core) {
      core := core union {v};
      changed := true;
    }
  }
}

Figure 6. Focused Communities

Figure 7. Focused Communities: the core and the fringe.

Figure 7 has a picture of the algorithm in progress. At each step, the algorithm checks:

• For each node in the core, whether removing this node would improve the objective. If so, the node is removed.

• For each node in the fringe, whether including this node would improve the objective. If so, the node is added to the core.

The process continues until no change occurs or a cycle is detected. Let us suppose that we are given an input stream of requests, each consisting of an individual p and an update that needs to be performed on all the members of p's community, for example about an announcement p makes. These requests can be processed in parallel. We show how the algorithm for focused community discovery can be implemented in SRL. We will use the input resource I described in Section 3. There will be a resource class called Community. The code for Community closely follows the code in Figure 6; the only major difference is that the set union and set difference operations need to be implemented using merges and splits. We remark that in the paper [14], which was concerned with the algorithmic aspect of the problem rather than the programming language aspect, the pseudocode was presented in a fashion very similar to our approach. This hints that our approach is perhaps quite natural.
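The greedy loop of Figure 6 can be run directly. The sketch below is illustrative: following the pseudocode's `<` tests, we assume obj is minimized, and we use a toy objective (the number of edges leaving the core, i.e., the cut size) that is not from [14]. Keeping the focus person p in the core is also our own simplification.

```python
# Runnable sketch of Figure 6's greedy community search, assuming obj
# is minimized; cut_size is a toy objective, not the one from [14].
def discover_community(graph, p, obj):
    core = {p}
    changed = True
    while changed:
        changed = False
        fringe = {v for u in core for v in graph[u]} - core
        for v in list(core - {p}):            # never drop the focus person p
            if obj(graph, core - {v}) < obj(graph, core):
                core -= {v}; changed = True
        for v in fringe:
            if obj(graph, core | {v}) < obj(graph, core):
                core |= {v}; changed = True
    return core

def cut_size(graph, core):
    # Toy objective: edges from the core to the rest of the network.
    return sum(1 for u in core for v in graph[u] if v not in core)
```

On a small graph where p's only tightly-bound neighbor is a pendant node a, the loop grows the core to {p, a} and then stabilizes.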
4.4 Minimum-spanning tree

To demonstrate how it is possible to implement graph algorithms using our approach, we consider an algorithm for the minimum spanning tree, presented in [4]. Its main idea is to proceed by building larger and larger local minimum spanning trees. In order to model a weighted graph, each resource representing a node in the graph has a data field of type list of integers (outEdges) representing the weights of the outgoing edges. Initially each spanning tree contains only a single node. A spanning tree maintains a minimal outgoing edge, that is, an edge of minimal length among those that lead to a node outside the spanning tree. The spanning tree can either nondeterministically issue a request for merging along this minimal edge, or listen for merge requests along the other edges. When it receives a merge request, the two spanning trees (the requestor m1 and the receiver m2) join to form a new spanning tree m3—this is the task of the continuation mergeSpanningTree. The set of nodes of m3 is the union of the nodes of m1 and m2. The new spanning tree is composed of the edges of the two original spanning trees and the edge on which the merge request was received. The minimum outgoing edge from m3 needs to be found as well. It is easy to see that this algorithm finds a minimum spanning tree. Once again the important thing to note is that all the operations are purely local.

Spanning tree:
  local data:
    outEdges: int list
    minOutEdge: edge
    localSpanningTree: tree
    neighbors: edge list
  code:
    | true : merge!(minOutEdge)
    | true : merge?-any(mergeSpanningTree)

Figure 8. Minimum spanning tree
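The merge-along-minimum-outgoing-edge process can be simulated sequentially, in the style of Borůvka's algorithm. The sketch below is illustrative: a union-find structure stands in for the merged spanning-tree resources, and we assume a connected graph with distinct edge weights (which makes the MST unique).

```python
# Sketch of the merge-based MST computation: each component repeatedly
# merges along its minimum outgoing edge, simulated with union-find.
def min_spanning_tree(n, edges):
    """edges: list of (weight, u, v) over vertices 0..n-1. Returns the
    total MST weight, assuming connectivity and distinct weights."""
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    total, components = 0, n
    while components > 1:
        # each component's minimum outgoing edge (the merge! target)
        best = {}
        for w, u, v in edges:
            ru, rv = find(u), find(v)
            if ru != rv:
                if ru not in best or w < best[ru][0]: best[ru] = (w, u, v)
                if rv not in best or w < best[rv][0]: best[rv] = (w, u, v)
        for w, u, v in best.values():
            ru, rv = find(u), find(v)
            if ru != rv:                     # merge the two spanning trees
                parent[ru] = rv
                total += w
                components -= 1
    return total
```

In the SRL version these component merges happen in parallel; the sequential simulation computes the same tree.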
Figure 9. Delaunay Refinement: a mesh fragment, showing bad triangles, cavities, and divisions.
5. Experimental Evaluation
As stated earlier, the goal of Sociable Resources is to express the maximum logical parallelism possible in an application. This logical parallelism may or may not correspond to the physical parallelism possible in a parallel or distributed machine. It is one of our core beliefs that the programmer should not be unduly concerned with the mechanics of physical parallelism—instead, a runtime system should map resources (millions of which operate in parallel) to the parallel threads available in the system (which are far fewer in number). The implementation strategy for such a runtime may vary and will depend on the details of the underlying architecture. In this section, we discuss one such strategy. Let us first consider the possibilities we have. At one extreme is the naive strategy where there is no physical parallelism at all: the concurrent resources are simulated by one physical thread. At the other is the strategy where a parallel runtime system maps each resource to a native platform thread, so that merges and splits require synchronization with the threads running neighboring nodes. Such an approach might be a viable alternative on a hypothetical parallel machine permitting millions of parallel threads (e.g., a molecular computer), but it is clearly untenable on today's parallel computers. We choose a strategy in between—a division-based strategy—where each physical thread simulates a few thousand resources. Each thread is responsible for simulating a subset of the heap (known as a division). We evaluate five scenarios where the number of threads ranges from 4 to 1024. A division can be viewed as a thread's address space. A thread is permitted to follow pointers outside this address space, for example when implementing a merge? request, but such accesses qualify as remote accesses and require synchronization with the corresponding thread.
In typical use, a thread will avoid making remote accesses to the extent possible. Further, a thread is a worker responsible for performing the transitions of the nodes in its division, and it may consider its job done when no more transitions are enabled in its division. Divisions in the heap—along with their invariants—are maintained by the runtime system. The goal of the
runtime is to maintain a legitimate distribution of divisions while maximizing local (as opposed to remote) accesses. Of course, this strategy can be further refined using architecture-dependent features. In chip multiprocessors, divisions will presumably be mapped to caches; in distributed implementations, divisions will be mapped to the local memory of nodes, and edges that cross divisions will be replicated across sites and maintained accordingly. An attraction of our formulation is that while it exposes heap localization to the programmer, it pushes architecture-dependent optimizations into the runtime system. Obviously, this is not free: the runtime cannot infer all that a programmer knows about an application. However, we stress that Sociable Resources is meant to target irregular data-parallel applications rather than general parallelism, which makes it likelier that certain patterns of optimization will apply (of course, building optimized runtimes for these applications is a non-trivial question). We are even open to the possibility that in optimized implementations of such runtimes, the programmer may have the ability to provide lower-level "hints" to the runtime regarding optimizations. Now we discuss our division-based strategy in more detail—specifically, how we use physical threads to simulate the transitions in the abstract semantics defined in Section 2.3.2. Assume that at some point the heap is partitioned into a number of divisions; a thread can choose to execute any of the enabled transitions on nodes belonging to its division: (1) Private transitions can be executed with no need for any synchronization or lock acquisition. (2) split transitions require acquiring locks for remote edges; in our implementation the split nodes are created in the same division. (3) merge?/merge! transitions require synchronizing with the corresponding thread and deciding the division where the merged node will reside; in our implementation the merged node resides in the requesting node's division. Additionally, the runtime can migrate nodes between divisions in the context of a merge?/merge! transition. The requester thread first synchronizes with the corresponding thread, then moves the requested node into the requesting division, and finally performs the transition locally. The requested node is brought into the requester's division along with some of its neighbors. We call the group of nodes transferred a migration neighborhood. The size of the migration neighborhood is defined by a policy parameter—we measure the influence of this parameter on performance.

Figure 10. Delaunay Refinement: the initial mesh is shown on the left and after several thousand retriangulations on the right.
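The runtime's classification of a merge as local or remote, and the neighborhood migration on remote merges, can be sketched as follows. This is an illustrative model with our own names; in particular, the BFS-free "first few neighbors" migration policy is a simplification.

```python
# Sketch of division-based merge handling: a merge within one division
# is local; a remote merge migrates the requested node plus a bounded
# migration neighborhood into the requester's division.
def handle_merge(division_of, heap, requester, requested, neighborhood_size):
    """division_of: node -> division id; heap: adjacency dict.
    Returns 'local' or 'remote'; mutates division_of on migration."""
    if division_of[requester] == division_of[requested]:
        return "local"              # no cross-thread synchronization needed
    target = division_of[requester]
    division_of[requested] = target # merged node lives in requester's division
    migrated = 0
    for nbr in heap[requested]:     # migrate up to neighborhood_size neighbors
        if migrated == neighborhood_size:
            break
        if division_of[nbr] != target:
            division_of[nbr] = target
            migrated += 1
    return "remote"
```

After a remote merge, subsequent merges in the same region tend to be local, which is the effect the migration-neighborhood parameter is meant to amplify.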
5.1 Evaluation

We have not yet implemented a compiler for SRL. However, in a preliminary evaluation, we have estimated the performance of implementing SRL programs via the division-based strategy by hand-translating an SRL program for a typical irregular application—Delaunay Mesh Refinement—into a thread-based Java program where each of a fixed number of threads simulates the division assigned to it. The findings, presented below, are encouraging.
                         Neighborhood size
# threads      0      1      2      4      6      9     12
        4  330:1  739:1 1650:1 1920:1 2353:1 2952:1 2235:1
       16  140:1  272:1  553:1  697:1  850:1  899:1  773:1
       64   64:1  106:1  241:1  296:1  368:1  378:1  386:1
      256   30:1   53:1  122:1  148:1  185:1  188:1  147:1
     1024   17:1   30:1   65:1   76:1   84:1   67:1   49:1

Table 1. Local initial partition: local/remote ratio vs. number of threads and migration neighborhood size.
                       Neighborhood size
# threads     0     1     2     4     6     9    12
        4  16:1  33:1  66:1 147:1 238:1 357:1 445:1
       16  12:1  29:1  57:1 129:1 208:1 302:1 401:1
       64  12:1  27:1  54:1 122:1 196:1 284:1 324:1
      256  11:1  25:1  50:1 109:1 167:1 209:1 189:1
     1024  10:1  21:1  40:1  86:1 122:1 138:1 158:1

Table 2. Random initial partition: local/remote ratio vs. number of threads and neighborhood size.
                       Neighborhood size
# threads     0     1     2     4     6     9    12
        4     0     0     0     2     0     1     1
       16     9     8    15     3     4     4     7
       64    53    63    66    38    46    38    30
      256   456   466   458   410   313   179   315
     1024  3454  3028  3053  3002  2772  1148  1454

Table 3. Local initial partition: conflicts vs. number of threads and migration neighborhood size.
                       Neighborhood size
# threads     0     1     2     4     6     9    12
        4    16     8     4     2     3    17    10
       16    53    57    39    43    65    42    46
       64   245   196   174   149   226   137   149
      256  1665   753   778   766   970   969   920
     1024  4177  4435  3600  5339  4463  3542  6087

Table 4. Random initial partition: conflicts vs. number of threads and neighborhood size.

In our modeling of mesh refinement, nodes in the heap represent either triangles or cavities in formation—Figure 9 shows a fragment of the initial mesh with two cavities being built. An initial partition, like the one in Figure 10 on the left, is created, and one thread executes the code for the entities in its region; Figure 10 shows cavities being built in each region for an execution with 16 threads. We perform a series of measurements to show the suitability of this approach. We look at merge?/merge! transitions and classify them as local when the requested node is in the current division, and as remote otherwise; we also look at the number of merge?/merge! transitions that try to merge two cavities, and classify those as conflicts. We measure these quantities with respect to the number of threads, the size of the migration neighborhood, and whether the initial partition is random or local. We consider two types of initial partitions. In the first, the assignment of resources to divisions exhibits locality: the Euclidean (2D) space is divided recursively into squares, up to the number of threads, so that most of the neighbors of the triangles in a division lie inside it. In the second, the resources are assigned to threads randomly.
All experiments are performed on a mesh of 100,364 triangles, 47,768 of which are initially bad. Although the algorithm is guaranteed to terminate with a valid mesh, the particular solution depends on the order in which the bad triangles are processed. On average a run retriangulates 140,000 cavities, with a maximum cavity size of 12 and an average cavity size of 3.75 triangles. The ratio between local and remote accesses reflects to what extent spatial locality is exploited. Table 1 shows the results of these experiments starting from an initial partition that exhibits locality. The ratio improves as the number of threads decreases; this is explained by the fact that the size of the divisions increases, and so does the ratio between local and foreign edges. It seems that the optimal neighborhood size to migrate is between six and nine; this might be explained by the fact that cavities are relatively small (up to 12 triangles in this dataset). If the initial partition is assigned at random, as shown in Table 2, the qualitative results are similar, but the ratio is lower due to the fact that the initial partition exhibits little locality. It is worth noting that for 64 and 256 threads the ratios are comparable if a large enough neighborhood is used when migrating nodes. The results show that as the number of threads gets larger, more migrations are required; this is explained by the fact that divisions are smaller (a linear increase in the length of the boundary induces a quadratic increase in the area); still, the computation is mainly local. A good initial partition makes a difference when divisions are large. As divisions get smaller, that advantage is almost lost if one uses a proper neighborhood size for node migration.
The number of conflicts is larger for the random partition than for the local partition (Table 4 vs. Table 3), and increases as the number of threads increases. However, the total number of conflicts, i.e., unsuccessful cavities, is relatively small compared with the total number of cavities processed (below 1% when the number of threads is at most 256, and under 5% for 1024 threads).
6. Related work
As our model views the heap as an evolving network, it is not surprising that it relates to message-passing models of concurrency like CSP [16], the π-calculus [20], and the join calculus [8]. In fact, our merge!/merge? pairs are inspired by the rendezvous construct of CSP. However, our model is very much an imperative, shared-memory model. Existing message-passing languages are not quite suited to encoding irregular shared data structures (process-calculus encodings of buffers, mutable cells, etc. exist [20], but are mostly of theoretical interest), and our view of such structures as networks and of pointers as channels is, so far as we are aware, new. Also, while our heap is a network, it is a very special type of network, and operations on it are tied directly to its interpretation as a heap. Our goal of providing abstractions for spatial locality would be lost in encodings in more general message-passing formalisms. At the other end, in, say, Java multithreading, the heap is global, and any thread can access any heap-allocated resource unless the programmer explicitly locks it. In contrast, we offer a first-class treatment of spatial locality, guarantee data isolation, and are data-oriented as opposed to control-oriented. The idea of resource-oriented parallel programming goes back to the Actor model [13, 2] and early object-oriented languages. While the Actor model has a notion of locality, it has no analog of our heaps, merges, and splits. More closely related is the "chemical" language Γ [4], where computations involve parallel applications of rewrite rules (analogs of our splits/merges) to multisets of "molecules" (analogs of our resources). However, as Γ uses unordered multisets, it does not naturally capture spatial locality in a
dynamic graph, which is the essence of our model. Also, our resources are active, not just data items manipulated by rewriting. Among more recent related work are language-level transactions [11, 12], which allow composable parallelism and are often implemented optimistically using transactional memory. In almost all work on this topic, however, the view of parallel computation is control-oriented and does not address data locality as a first principle. As a result, most implementations of transactional memory track reads and writes to the entire memory to detect conflicts. As Kulkarni, Pingali, et al [18, 17] point out, this makes them behave inefficiently while many handling large, irregularly parallel heap-manipulating programs. Their solution—implemented in the Galois system [18, 17]—is to combine thread-based optimistic parallelism with data-types—e.g., unordered iterators—that give the runtime information about data dependencies [18], and to use heuristics for data partitioning [17]. Related proposals are to privatize transactions [22] and to combine transactions with data-centric synchronization [24]. While these systems share some design intentions with us, they stay within a thread-based, control-oriented framework. In contrast, our model aims to liberate programmers from thinking in terms of control at least in the kind of contexts studied here, having the slogan: “While writing data-parallel applications (irregular or not), think only about (locality of) data.” As for the idea of spatial separation, it was stressed often in the early days of concurrent programming by Dijkstra, Hoare, and Brinch Hansen [10, 15]. More recently, it has emerged as a vital tool in local reasoning about heap-manipulating programs [21]. However, we are not aware of any shared-memory programming model that is designed with separation as a first principle.
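The rendezvous analogy above can be sketched informally. The following Python fragment is a hypothetical illustration only (the function names and the data representation are our invention, not the model's actual syntax or semantics): the merge! offer and the merge? acceptance each block until the other side arrives, after which the two resources' contents are fused into one, as in a CSP rendezvous.

```python
import threading

# Hypothetical sketch of a merge!-merge? rendezvous between two
# neighboring resources. Neither side proceeds until both are ready;
# then the offered resource's data is fused into the accepting one.

def make_merge_pair():
    barrier = threading.Barrier(2)   # rendezvous: both parties must arrive
    slot = {}                        # exchange cell for the offered data

    def merge_offer(data):           # plays the role of merge!
        slot['offer'] = data
        barrier.wait()               # block until the accepting side arrives

    def merge_accept(own_data):      # plays the role of merge?
        barrier.wait()               # block until the offering side arrives
        return own_data + slot.pop('offer')  # data of the fused resource

    return merge_offer, merge_accept
```

For example, a resource holding [3] that accepts a merge from a neighbor offering [1, 2] ends up holding [3, 1, 2]; in the actual model the link between the two resources would also disappear, which this toy sketch does not capture.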
7. Conclusion
We have proposed a new, resource-centric programming model for shared-memory parallelism (in particular, irregular data-parallelism). The heap here is seen as a network of active resources, which behave "socially"—merging and splitting with their neighbors. As for future work, the goal of this paper was to present a programming model, not a programming system. In future work, we will expand it into a complete language for object-oriented, spatially local programming. This raises foundational questions about how inheritance, polymorphism, etc. should relate to spatial locality. Identifying "killer apps" for such a language from graphics, scientific computing, and the social sciences is another important goal. The other main objective is the use of this model in building optimized implementations of data-intensive applications for emerging multi-core platforms. Our model follows the philosophy that parallel programs should be specifications of parallelism; a targeted implementation of Sociable Resources can, however, implement this specification as sequentially as cost-benefit tradeoffs on the ground demand (e.g., our implementation simulates thousands of resources using a single thread). This creates a need for compilers that take such tradeoffs into account. Finally, exploring connections with the emerging field of molecular computing remains a long-term goal for us. Molecules also exhibit massive parallelism and spatially local interaction, and attachment and detachment in molecular assemblies [26] have parallels with the merges and splits in our model. Preliminary investigations on this are under way.
References
[1] The Lonestar Benchmark Suite. Available at: http://iss.ices.utexas.edu/lonestar/index.html.
[2] G. Agha, I. Mason, S. Smith, and C. Talcott. A foundation for actor computation. Journal of Functional Programming, 7(1):1–72, 1997.
[3] W. Aiello, Ch. Kalmanek, P. McDaniel, S. Sen, O. Spatscheck, and J. van der Merwe. Analysis of communities of interest in data networks. In PAM, pages 83–96, 2005.
[4] J. Banâtre and D. Le Métayer. The GAMMA model and its discipline of programming. Sci. Comput. Program., 15(1):55–77, 1990.
[5] J. Barnes and P. Hut. A hierarchical O(N log N) force-calculation algorithm. Nature, 324(4):446–449, December 1986.
[6] D. Burke, J. Epstein, D. Cummings, J. Parker, K. Cline, R. Singa, and S. Chakravarty. Individual-based computational modeling of smallpox epidemic control strategies. Academic Emergency Medicine, 13(11):1142–1149, 2006.
[7] P. Chew. Guaranteed-quality mesh generation for curved surfaces. In SCG, pages 274–280, 1993.
[8] C. Fournet and G. Gonthier. The reflexive CHAM and the join-calculus. In POPL, pages 372–385, 1996.
[9] T. Grüne-Yanoff. Agent-based models as policy decision tools: The case of smallpox vaccination. Technical report, Royal Institute of Technology, Sweden.
[10] P. B. Hansen. Structured multiprogramming. Commun. ACM, 15(7):574–578, 1972.
[11] T. Harris and K. Fraser. Language support for lightweight transactions. In OOPSLA, pages 388–402, 2003.
[12] T. Harris, S. Marlow, S. L. Peyton Jones, and M. Herlihy. Composable memory transactions. In PPOPP, pages 48–60, 2005.
[13] C. Hewitt, P. Bishop, and R. Steiger. A universal modular actor formalism for artificial intelligence. In IJCAI, pages 235–245, 1973.
[14] K. Hildrum and P. Yu. Focused community discovery. In ICDM, pages 641–644, 2005.
[15] C. A. R. Hoare. Towards a theory of parallel programming. 1972.
[16] C. A. R. Hoare. Communicating Sequential Processes. 1985.
[17] M. Kulkarni, K. Pingali, G. Ramanarayanan, B. Walter, K. Bala, and L. Chew. Optimistic parallelism benefits from data partitioning. In ASPLOS, pages 233–243, 2008.
[18] M. Kulkarni, K. Pingali, B. Walter, G. Ramanarayanan, K. Bala, and P. Chew. Optimistic parallelism requires abstractions. In PLDI, pages 211–222, 2007.
[19] E. A. Lee. Are new languages necessary for multicore?, 2007.
[20] R. Milner. Communicating and Mobile Systems: the π-calculus. 1999.
[21] J. Reynolds. Separation logic: A logic for shared mutable data structures. In LICS, pages 55–74, 2002.
[22] M. F. Spear, V. J. Marathe, L. Dalessandro, and M. L. Scott. Privatization techniques for software transactional memory. In PODC, 2007.
[23] H. Sutter and J. Larus. Software and the concurrency revolution. ACM Queue, 3(7):54–62, 2005.
[24] M. Vaziri, F. Tip, and J. Dolby. Associating synchronization constraints with data in an object-oriented language. In POPL, pages 334–345, 2006.
[25] J. Webb. Adapt: Global image processing with a split and merge model. Technical report, Carnegie Mellon University, 1991.
[26] P. Yin, H. Choi, C. Calvert, and N. Pierce. Programming biomolecular self-assembly pathways. Nature, 451:318–322, 2008.