Single Operation Multiple Data - Data Parallelism at Subroutine Level

Eduardo Marques
CITI / Departamento de Informática, Faculdade de Ciências e Tecnologia
Universidade Nova de Lisboa, 2829-516 Caparica, Portugal

Hervé Paulino
CITI / Departamento de Informática, Faculdade de Ciências e Tecnologia
Universidade Nova de Lisboa, 2829-516 Caparica, Portugal
Email: [email protected]

Abstract—The parallel nature of the multi-core architectural design can only be fully exploited by concurrent applications. This status quo has pushed productivity to the forefront of language design concerns. The community is demanding new solutions for the design, compilation, and implementation of concurrent languages, making this research area one of great importance and impact. To that end, this paper proposes the expression of data parallelism at subroutine level. The calling of a subroutine in this context spawns several execution flows, each operating on a distinct partition of the input dataset. Such computations can be expressed by simply annotating sequential subroutines with data distribution and reduction policies, delegating the management of the parallel execution to a dedicated runtime system. The paper overviews the key concepts of the model, illustrating them with small programming examples, and describes a Java implementation built on top of the X10 [1] runtime system. A performance evaluation attests that this approach can provide good performance gains without burdening the programmer with the writing of specialized code.

Keywords: Concurrent Programming, Data Parallelism, Single Operation - Multiple Data

I. INTRODUCTION

The shift from frequency to core scaling in CPU design ushered in a new era in computer organization. Concurrent programming has been under the spotlight ever since, considering that an increase in the number of cores does not imply faster sequential applications. Nonetheless, although multi-core architectures have been the standard processor architectural design for over six years now, the mainstream programming languages, compilers and runtime systems remain mostly unaltered. The community is demanding new solutions for the design and implementation of concurrent languages in general, making this research area one of great importance and impact. This has been recognized by both academia and industry, fostering the proposal of several new programming languages, such as X10 [1], Chapel [2], Fortress [3] and Axum [4], and libraries, such as Intel's Parallel Building Blocks suite [5] and Java's Fork/Join framework [6]. The ultimate objective is the transition of parallel computing from the scientific community to the mainstream of software development.


This requires the definition of linguistic constructions that, on the one hand, permit the simple and natural expression of parallelism, hence not degrading the level of productivity currently observed in general-purpose languages, while, on the other, do not compromise performance nor restrict the possibility of tuning the code when needed.

Task parallelism is often explored by providing the means for subroutines (functions, methods, service operations, and so on) to be offloaded to a pool of worker threads working on behalf of the application. Cilk [7], X10 and Java's executors, among others, build on this concept. Data parallelism, in turn, is usually explored at loop level, of which OpenMP [8], Intel TBB [9] and PGAS languages, such as UPC [10] and also X10, are examples.

This paper proposes the expression of data parallelism at subroutine level. The calling of a subroutine in this context spawns several tasks, each operating on a different partition of the input dataset. These tasks are offloaded for parallel execution by multiple worker threads, and run in conformity with a variation of the Single Program Multiple Data (SPMD) execution model, which we have baptised Single Operation Multiple Data (SOMD). The execution model is presented to the programmer as a Distribute-Map-Reduce paradigm: the input dataset is decomposed so that multiple instances of the subroutine can be executed in parallel (the map stage), and their partial results are reduced to compute the final result.

This approach provides a framework for the average programmer to express data-parallel computations by annotating unaltered sequential subroutines, hence taking advantage of the parallel nature of the target hardware without having to program specialized code. We envision that commonly used distributions and reductions will be available in the form of libraries. In such cases, the programmer's work can be confined to the selection of the distribution and reduction policies to be applied. This, however, does not prevent the definition of custom policies, or even the tailoring of the method's code to better adjust it to the SOMD model, either due to performance or algorithmic issues.

We instantiated the model in the Java programming language, using the X10 Java runtime system (X10JRT) as the backbone for our own runtime system. X10JRT provides us with the framework to execute Java methods complying with our specification (SOMD methods from now on) in shared and distributed memory architectures, being that our current focus is on multi-core architectures.

Figure 1. Transparency to the caller

The remainder of this paper is structured as follows: Section II overviews the SOMD model; Section III describes how it has been instantiated in the Java programming language; Section IV evaluates our prototype implementation from a performance perspective; Section V compares our work to others in the field; and, finally, Section VI presents our concluding remarks and prospective future work.

II. THE SOMD EXECUTION MODEL

The SOMD execution model consists in carrying out multiple instances of a given method¹ in parallel, over different partitions of the input data. The invocation is decoupled from the execution, a characteristic intrinsic to the active object [11] design pattern that makes the parallel nature of the execution model transparent to the invoker. As illustrated in Figure 1, the invocation is synchronous, and the execution is carried out by multiple concurrent flows. The figure also unveils the distribution and reduction stages. As will be further detailed in Section II-A, the first decomposes the input dataset into a pre-defined number of partitions, whilst the second computes the final result from the ones produced by each method execution.

Although we are mainly focused on multi-core architectures, the model can also be applied to distributed memory environments. The portability of the code from the former to the latter is subservient to the argument evaluation strategy of the hosting language. The passing of references to memory positions in distributed environments requires a distributed memory management layer that falls out of the scope of this paper. The absence of such a layer causes each method instance (MI) to hold a local copy of the argument. Figure 2 showcases the execution model in distributed memory environments, which follows a master-worker pattern. The node running the invoking thread assumes the master role, performing the following tasks: i) top-level application of the distribution stage; ii) dispatching of the computation to the worker nodes; iii) execution of a subset of the MIs; iv) collection of the partial results; and v) computation of the reduction stage.

¹ Our current research is applying the SOMD model to object-oriented languages; thus, from this point onward, we will use the method terminology rather than subroutine.

Figure 2. Execution in heterogeneous environments

The distribution stage is hierarchical: the input dataset is first decomposed among the target nodes and then among the workers running on each node. In turn, there is no native support for parallelism in the reduction stage. This is motivated by the facts that: 1) in single-machine environments the number of partial results to reduce is usually small, hence not heavy enough to justify a parallel execution, and 2) the hierarchical application of a reduction policy may compromise the correctness of the computed results. For example, given a set of values vi : i ∈ [1, n + m]:

average(average(v1, ..., vn), average(vn+1, ..., vn+m)) ≠ average(v1, ..., vn, vn+1, ..., vn+m)

For instance, with five values, average(average(1, 2), average(3, 4, 5)) = average(1.5, 4) = 2.75, whereas average(1, 2, 3, 4, 5) = 3. Therefore, instead of exposing these details to the programmer, we opt to relegate them to the reduction strategy implementation, which may itself resort to a SOMD method.

A. Distribute-Map-Reduce Paradigm

The model is presented to the programmer in the form of a Distribute-Map-Reduce (DMR) paradigm that borrows some ideas from the MapReduce framework [12]. A brief description of each stage follows:

Distribute - partitions the target value into a collection of values of the same type. It can be applied to multiple input arguments and local variables.
Map - applies an MI to each partition of the input dataset.
Reduce - combines the partial results of the previous stage to compute the method's final result.

Note that the method to apply does not have to be altered. Given an argument of type T, a distribution over such an argument must be a function of type T → List<T>. Moreover, a reduction applied to a method that returns a value of type R must be a function of type List<R> → R. Thus, the method's application complies with its original prototype, since it receives one of the elements of the distribution set (of type T) and produces an output of type R. Figure 3 depicts a simplified version of the paradigm, comprising a single distributed value.

Figure 3. The Distribute-Map-Reduce (DMR) paradigm: a data structure D (of type T) is split by the Distribute stage into partitions of type T; each partition is the input of an instance of the method, which outputs a partial result of type R; the Reduce stage combines the partial results into the final result (of type R).

B. Constructs

The model has been instantiated as an extension to the Java programming language. The adoption of the extension approach arises from the fact that this work integrates a larger research project that addresses a service-based concurrent programming model [13], which is being implemented as a Java extension. The concrete syntax here presented serves only as a vehicle to expose our fundamental research and is not the paper's main focus.

1) Distributions and Reductions: We begin by presenting the base distribution/reduction constructs. Distributions are declared by qualifying a given variable with the dist modifier as follows:

    dist C(e1, ..., en) variable_declaration

where C(e1, ..., en) denotes a constructor of the class implementing the distribution policy. If the variable's type is an array, the policy may be omitted in favour of a default behaviour, which partitions the array as evenly as possible. In turn, reductions are applied to methods. The reducewith modifier qualifies their declaration as follows:

    method_header reducewith C(e1, ..., en) {method_body}

We opt to place the modifier after the header declaration for readability reasons. Distribution and reduction policies must conform, respectively, to the Distribution and Reduction interfaces depicted in Listing 1. Note that, in Java, the use of distributed variables implies copying the contents of the original data structure to the newly created partitions. In Section IV we will address strategies to avoid such overhead.

public interface Distribution<T> {
    T[] distribute(T x, int partitions);
}

public interface Reduction<T> {
    T reduce(T[] partials);
}

Listing 1. The Distribution and Reduction interfaces

Listing 2 illustrates the use of the proposed constructs in a simple addition of integer arrays. The use of the default distribution splits the input arrays as evenly as possible among the available workers. The target method applies the sequential algorithm to the adjudicated partition and produces a partial result that is assembled by the ArrayAssembler reduction (Listing 3). Note that this is a general reduction that can be used to assemble any kind of array, and thus suitable to be included in a library. Optimized implementations can be developed, for instance, to receive the size of the final array as an argument and, with that, avoid the computation of the value from the size of the partial arrays.

int[] sum(dist int[] array1, dist int[] array2) reducewith ArrayAssembler<Integer>() {
    int[] array = new int[array1.length];
    for (int i = 0; i < array.length; i++)
        array[i] = array1[i] + array2[i];
    return array;
}

Listing 2. The sum of two arrays in parallel

public class ArrayAssembler<T> implements Reduction<T[]> {
    public T[] reduce(T[][] array) {
        int count = 0;
        // compute size of the final array
        for (int i = 0; i < array.length; i++)
            count += array[i].length;
        @SuppressWarnings("unchecked")
        T[] result = (T[]) new Object[count];
        // assemble partial arrays
        for (int i = 0, k = 0; i < array.length; i++)
            for (int j = 0; j < array[i].length; j++)
                result[k++] = array[i][j];
        return result;
    }
}

Listing 3. The ArrayAssembler reduction

Listing 4 illustrates the use of a custom distribution in the context of a method that computes the node count of a given tree. The example demonstrates that data parallelism in our model is not restricted to arrays. The original method applies a recursive solving strategy which, when ported to SOMD, would lead to successive applications of the distribution. To avoid such a pattern we resort to two methods, permitting the use of both the parallel and the sequential versions. The TreeDist distribution partitions the original tree as illustrated in Listing 5, and the SumReduction reduction sums up the values of the intermediate array to calculate the final result. Once more, these policies are general enough to be used in multiple settings.

public int countSizeParallel(dist TreeDist() Tree t) reducewith SumReduction() {
    return countSize(t);
}

public int countSize(Tree t) {
    int sum = 0;
    for (Tree s : (List<Tree>) t.sons)
        sum += countSize(s);
    return 1 + sum;
}

Listing 4. The count of nodes in a given Tree in parallel

class TreeDist implements Distribution<Tree> {
    public Tree[] distribute(Tree tree, int n) {
        ArrayList<Tree> a = new ArrayList<Tree>();
        ArrayList<Tree> b = new ArrayList<Tree>();
        Tree nil = new Nil();
        a.add(tree);
        for (int i = 0; i < n; i++) {
            ArrayList<Tree> u = a; a = b; b = u;
            a.clear();
            for (Tree t : b) {
                a.add(t.IsEmpty() ? nil : t.Left());
                a.add(t.IsEmpty() ? nil : t.Right());
            }
        }
        a.add(0, tree.Copy(n));
        return a.toArray(new Tree[a.size()]);
    }
}

Listing 5. The TreeDist distribution

2) Shared Variables and Synchronization Constructs: The shared variable modifier enables the sharing of local variables among the multiple MIs. Up to this point, only non-distributed object parameters were considered to be shared. To synchronize the access to shared variables we permit the delimitation of atomic blocks. Note that these only synchronize the MIs and do not try to acquire the object's monitor. For that purpose, synchronized blocks should be used instead. Furthermore, we allow for synchronization points in the form of barriers. Listing 6 exemplifies the use of all these constructs: line 2 declares a shared local variable, lines 7 to 9 modify the value of such variable in mutual exclusion, and line 10 guarantees that the variable holds its final value before it is used by all MIs.

 1  public int[] normalize(dist int[] array) reducewith ArrayAssembler<Integer>() {
 2      shared int globalMin = Integer.MAX_VALUE;
 3      int localMin = array[0];
 4      int[] result = new int[array.length];
 5      for (int i = 1; i < array.length; i++)
 6          if (array[i] < localMin) localMin = array[i];
 7      atomic {
 8          if (localMin < globalMin) globalMin = localMin;
 9      }
10      barrier;
11      for (int i = 0; i < result.length; i++)
12          result[i] = array[i] - globalMin;
13      return result;
14  }

Listing 6. Shared variables and synchronization constructs

3) Distributed Shared Arrays: Some problems greatly benefit from the overlapping of partitions, namely to enable the reading and writing of positions beyond the distribution's boundaries. When only reading operations are in order, a custom distribution strategy does the job. However, since distributed values are not shared between MIs, modifications performed by one are not seen by the remainder. To overcome this limitation we propose the notion of distributed shared array, which enables an MI to access elements adjacent to its partition. The syntax is as follows:

    distshared(e1, e2) variable_declaration

where e1 and e2 are expressions that denote the number of accessible adjacent positions to the left and to the right, respectively. Listing 7 applies the construct to the replacement of all occurrences of a string in a given text. In order to be suitable for such a distribution, the text must assume the form of an array of characters. The distribution allows each MI to access as many positions to the right as necessary to hold an occurrence of the string to replace (oldStr) that begins within the boundaries of the assigned partition.

public char[] replace(String oldStr, String newStr, distshared(0, oldStr.length()-1) char[] text)
        reducewith ArrayAssembler() {
    for (int i = 0; i < text.length - (oldStr.length() - 1); i++) {
        String s = "";
        for (int j = 0; j < oldStr.length(); j++)
            s = s + text[j + i];
        if (s.equals(oldStr))
            for (int j = 0; j < oldStr.length(); j++)
                text[j + i] = newStr.charAt(j);
    }
    return text;
}

Listing 7. Replacing all the occurrences of a string in a text

III. ARCHITECTURE AND IMPLEMENTATION

The model is being instantiated as an extension to the Java language. The new constructs are translated into Java by a dedicated compiler built on top of Polyglot [14], a Java-to-Java compiler that provides a framework to easily extend the language.

Figure 4. Mapping the SOMD execution model into X10: a Java SOMD method of the form method(dist A, dist B) reducewith R { dist C; shared D; method body } is translated into the distributions of A, B and C, a GlobalRef holding D, and an X10 finish block that spawns one activity per worker (for (t in (0..(Runtime.NTHREADS-1))) async { ... }), each activity running the method body; the collected results are then combined by the reduction R.

This compiler guarantees the program's type safety and generates Java code, which is in turn compiled by a standard Java compiler. The runtime system is based on the X10 Java runtime system (X10JRT). X10 [1] is a concurrent programming language that implements an Asynchronous Partitioned Global Address Space (APGAS) model [15]. Computation happens at places, abstractions for a portion of the partitioned address space plus a set of asynchronous computations (referred to as activities) that operate on such portion. The language can be compiled either to C++ or to Java. We leverage the latter's runtime system to be able to distribute work within and across multi-core architectures.

A. Compilation into the X10 Java Runtime

Distributions generate arrays of the type to distribute. We experimented with both Java native and X10 arrays. The fact that the remainder of the application operates on Java types led us to choose the former, since the use of X10 arrays would impose the overhead of runtime type conversion. Naturally, the MIs of a SOMD method are mapped into X10 activities. Their execution is delimited by a finish block, a synchronization barrier that ensures the termination of the spawned MIs before the execution of the reduction stage. Shared variables are mapped into X10 shared variables, GlobalRefs, whose scope may encompass several places. Both communication and consistency concerns are delegated to the X10JRT. Figure 4 illustrates the mapping of a method that distributes three arrays (two parameters and a local variable) and resorts to a shared variable.

Atomic blocks are mapped onto their X10 counterparts, while barriers are mapped into clocks, a construct that extends the notion of split-phase barrier to allow multiple activities to register and synchronize at multiple points. Activities can be associated with a clock through the clocked keyword as follows:

    async clocked(c1, ..., cn) { code }

where c1, ..., cn denote previously declared clocks. Each barrier construct is mapped into a c.advance() statement, which causes the activity to block until all the remaining activities registered with clock c have reached the same synchronization point. Finally, distributed shared arrays are mapped into X10 DistArrays, which are distributed along multiple places. Every array access triggers a verification that checks whether the given index is mapped onto the current place. If not, the computation is shifted to the place that holds that position, so that the read or write operation is performed locally (X10 activities cannot access data located at a place other than their own). This solution assures that all overlapping positions are shared between the MIs.
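Purely as an analogy, the execution skeleton of Figure 4 can be pictured in plain Java terms as follows. This is not the code generated by our compiler (which targets the X10JRT constructs just described), and the names SomdSkeleton and methodInstance are ours; it only illustrates the spawn-all, wait-for-all, reduce structure.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Rough plain-Java analogy of the finish/async skeleton: one task per
// partition ("async"), a global join ("finish"), then the reduction stage.
public class SomdSkeleton {
    public static <T, R> R run(T[] partitions, Function<T, R> methodInstance, Reduction<R> reduction)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(partitions.length);
        List<Future<R>> futures = new ArrayList<Future<R>>();
        for (T p : partitions)                        // spawn one MI per partition
            futures.add(pool.submit(() -> methodInstance.apply(p)));
        @SuppressWarnings("unchecked")
        R[] partials = (R[]) new Object[partitions.length];
        for (int i = 0; i < futures.size(); i++)      // wait for all MIs to terminate
            partials[i] = futures.get(i).get();
        pool.shutdown();
        return reduction.reduce(partials);            // compute the final result
    }
}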

Figure 5. Integration of Java and X10 runtime systems

B. The Runtime System

The runtime system resorts to the X10JRT to support the execution of SOMD methods within and across multi-core machines. A dedicated thread is spawned to launch the X10JRT, create a work-queue that establishes the communication bridge between both runtimes, and wait for execution requests. In turn, a SOMD method invocation from a Java application thread enqueues the request in the cited queue and blocks on a condition variable until the result has been computed, conforming to Java's synchronous call semantics. Request processing is serialized, i.e. only one SOMD method executes at a time. Each request generates as many MIs as there are workers in the X10 thread pool, thus exhausting the available resources. Future work will make the configuration of the system more flexible, so that multiple SOMD methods may execute concurrently. Figure 5 illustrates the integration of both runtime systems.
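The bridge between the two runtimes can be pictured, in very simplified form, as a blocking work-queue plus a per-request completion handle. The sketch below is our own illustration of the pattern described above, not the actual runtime code; the class and member names are assumptions.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;

// Simplified illustration of the Java <-> X10JRT bridge: application threads
// enqueue requests and block; the dedicated thread hosting the X10 runtime
// consumes them one at a time.
public class SomdBridge {
    static class Request {
        final Callable<Object> somdExecution;   // distribution + MIs + reduction
        final CompletableFuture<Object> result = new CompletableFuture<Object>();
        Request(Callable<Object> somdExecution) { this.somdExecution = somdExecution; }
    }

    private final BlockingQueue<Request> workQueue = new LinkedBlockingQueue<Request>();

    // Called by a Java application thread: enqueue the request and block
    // until the result has been computed (synchronous call semantics).
    public Object invoke(Callable<Object> somdExecution) throws Exception {
        Request r = new Request(somdExecution);
        workQueue.put(r);
        return r.result.get();
    }

    // Loop executed by the dedicated X10-side thread: one SOMD method at a time.
    public void serveForever() throws InterruptedException {
        while (true) {
            Request r = workQueue.take();
            try {
                r.result.complete(r.somdExecution.call());
            } catch (Exception e) {
                r.result.completeExceptionally(e);   // propagate failures to the caller
            }
        }
    }
}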

IV. EVALUATION

This section evaluates our implementation from a performance perspective. Our analysis has a twofold character: first, it is a speedup analysis of the application of SOMD to Java methods; secondly, it is a comparative analysis of the SOMD and X10 implementations of a set of benchmark applications. All the applications used in this evaluation are available at http://asc.di.fct.unl.pt/serco/somd. All measurements were performed on a system composed of two Quad-Core AMD Opteron 2376 CPUs at 2.3 GHz and 16 Gigabytes of main memory, running version 2.6.26-2 of the Linux operating system.

A. Speedup Analysis

This study comprised many SOMD method implementations, namely arithmetic and statistic operations, and sorting algorithms. This presentation focuses on the most representative: the calculation of a histogram and of the mean value of an array, the merge-sort algorithm, the K-means algorithm, and two implementations of matrix multiplication: lines distributes the lines of the first matrix, while cols distributes the columns of the second. We experimented with array sizes of different orders of magnitude. For a simpler reading of the results we classify these into classes A to D. Table I presents the configuration settings, the execution times of the original Java methods, and the partitioning overhead for each class.

The graphs depicted in Figure 6 present the speedups obtained for each application in relation to the original sequential implementation. These measurements account for both the execution of the method and the time spent in the work-queue. The latter is, however, negligible, since our benchmark infrastructure runs a single method at a time. We omit the graphs for class D, since the matrix multiplication examples are the only ones to obtain speedups, and even those are very timid.

Graphs (a) to (c), in particular, present the speedups for the direct application of the DMR paradigm. Direct application indicates that the original methods were not modified in order for the paradigm to be applied. We will refer to this approach as partition-based, since the input dataset is effectively partitioned and copied to a new location. Analysing the results, we can observe that the matrix multiplication implementations are the ones that perform best. This is naturally due to the computational weight of the original method. The lines implementation achieves a linear speedup for classes A and B, lowering its performance to a peak of 5.2 in class C, and a peak of 1.4 in class D, both at 8 cores. In turn, the cols implementation benefits from a nearly perfect mapping onto the computer's memory hierarchy to achieve super-linear speedups of 14.4 and 19.7 in classes A and B, respectively. For example, in class A the execution time drops to 15.4 s on the 8-core configuration. Of the remainder, K-means is the one that performs best, achieving an interesting speedup of 6 in the class A configuration. On the other end is Mean, whose short execution time does not justify the partitioning overhead. Finally, the performance of Merge-sort is bounded by the reduction stage, which has to sort the partial arrays. This application is a good candidate for the implementation of a parallel reduction strategy.

Graphs (d) to (f) present the speedups for a different implementation strategy, which we will refer to as range-based. The motivation is to reduce the partitioning overhead by distributing ranges of the target data structure(s) among the MIs, rather than the structure itself. This requires modifying the method's code to introduce the notion of partial computation, since the algorithm no longer has the illusion of operating over the whole structure. Listing 8 showcases the range-based implementation of the example of Listing 2. The performance gains are mostly notable in Mean, where a speedup is observable for some class A configurations, and in Histogram. The remainder have higher execution times, and thus the partitioning overhead has a low impact on the overall performance. Note that the cache effect observed in the partition-based version of Matrix Multiplication - cols is lost, since the matrix is no longer partitioned.

Class A
Application                      Configuration                   Execution time (s)   Partition overhead (s)
Histogram                        array size: 100.000.000         2.34                 0.5
Mean                             array size: 100.000.000         0.21                 0.5
Merge-sort                       array size: 100.000.000         23.98                0.5
K-means                          nr observations: 25.000.000     141.47               0.165
Matrix Multiplication - lines    matrix size: 2000               222.94               0.00007
Matrix Multiplication - cols     matrix size: 2000               222.94               0.05

Class B
Application                      Configuration                   Execution time (s)   Partition overhead (s)
Histogram                        array size: 10.000.000          0.22                 0.065
Mean                             array size: 10.000.000          0.02                 0.065
Merge-sort                       array size: 10.000.000          2.18                 0.065
K-means                          nr observations: 2.500.000      14.26                0.017
Matrix Multiplication - lines    matrix size: 1500               82.87                0.00005
Matrix Multiplication - cols     matrix size: 1500               82.87                0.04

Class C
Application                      Configuration                   Execution time (s)   Partition overhead (s)
Histogram                        array size: 1.000.000           0.024                0.0045
Mean                             array size: 1.000.000           0.005                0.0045
Merge-sort                       array size: 1.000.000           0.231                0.0045
K-means                          nr observations: 250.000        1.40                 0.0015
Matrix Multiplication - lines    matrix size: 750                3.35                 0.00004
Matrix Multiplication - cols     matrix size: 750                3.35                 0.033

Class D
Application                      Configuration                   Execution time (s)   Partition overhead (s)
Histogram                        array size: 100.000             0.006                0.00045
Mean                             array size: 100.000             0.003                0.00045
Merge-sort                       array size: 100.000             0.061                0.00045
K-means                          nr observations: 25.000         0.185                0.0002
Matrix Multiplication - lines    matrix size: 250                0.148                0.000035
Matrix Multiplication - cols     matrix size: 250                0.148                0.005

Table I. Reference table for the problem classes

void sum(int[] array1, int[] array2, dist int[] range) {
    int begin = range[0];
    int end = range[1] + 1;
    for (int i = 0; i < (end - begin); i++)
        array1[i + begin] = array1[i + begin] + array2[i + begin];
}

Listing 8. Range-based implementation of the sum method
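The range-based strategy relies on distributing ranges rather than data; a distribution producing per-MI [begin, end] pairs, consistent with how Listing 8 consumes its range parameter, could look as follows. The class name RangeDistribution is ours, and Listing 8 itself does not name the policy it uses.

// Hypothetical range distribution: splits an inclusive [begin, end] range
// into contiguous sub-ranges, one per partition, instead of copying data.
public class RangeDistribution implements Distribution<int[]> {
    public int[][] distribute(int[] range, int partitions) {
        int begin = range[0], end = range[1];
        int length = end - begin + 1;
        int base = length / partitions;
        int rest = length % partitions;
        int[][] result = new int[partitions][];
        int start = begin;
        for (int p = 0; p < partitions; p++) {
            int size = base + (p < rest ? 1 : 0);
            result[p] = new int[] { start, start + size - 1 };  // inclusive bounds, as in Listing 8
            start += size;
        }
        return result;
    }
}

Note that in Listing 8 the arrays themselves are ordinary (non-distributed) parameters and are therefore shared, so, in a shared-memory setting, each MI operates directly on them within its adjudicated range, which is what removes the partitioning overhead.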

B. X10 Comparative Analysis

This second analysis has the objective of assessing the quality of our prototype implementation. For that purpose we ported a number of X10 benchmark applications to the SOMD model, and compared the results obtained by both implementations. The applications included in the study are: Monte Carlo, the calculation of an estimate of the π constant through the Monte Carlo method; Mandelbrot, the calculation of the Mandelbrot set; Crypt, the encrypting and decrypting of a text; and Stream, the computation of the formula a ← b * α + c, where a, b and c are arrays of integers. All the applications but Monte Carlo distribute at least one array. In fact, using the X10 reference configurations, Crypt distributes an array of 50 Megabytes, Mandelbrot performs two distributions, over a matrix of 6000×750 bytes and over an array of 750 doubles, and Stream distributes three arrays of 33 Megabytes.

Figure 7 reports the overhead of the SOMD implementations in relation to their X10 counterparts. The Monte Carlo and Mandelbrot benchmarks break even, with X10 scaling a little better in the first, and SOMD in both implementations of the second. The partition-based implementation of Crypt loses to X10, which scales better than SOMD, up to a 38% gain in the 8-worker configuration. The range-based implementation, however, outperforms X10 by 75%. This can be explained by the number of Java statements generated by the X10 compiler (over 1000), whilst the code generated for the SOMD implementation uses only 60 such statements. Finally, the SOMD partition-based implementation of the Stream benchmark suffers from the distribution of the three big arrays. Its performance is around 18 times worse than X10's². The range-based version greatly reduces this overhead, scaling even better than the original X10 implementation.

Figure 7. SOMD vs X10

² We omit it from the graph because the enlargement of the scale affects the readability of the remaining data.

Application                      Distribution   Reduction   Total
Histogram                        default        8           8
Mean                             default        6           6
Merge-sort                       default        40          40
K-means                          default        8           8
Matrix Multiplication - lines    default        14          14
Matrix Multiplication - cols     19             12          31

Table II. Lines of code

C. Productivity Analysis

We performed a small productivity analysis by measuring the amount of code required to implement the distribution and reduction policies for each application. Table II presents these measurements, corroborating our statement that the proposed model is simple to use. This affirmation is further substantiated by the facts that: 1) most array manipulation methods can resort to the default distribution; 2) most of the distributions and reductions are general enough to be available in a library; and 3) the computations they perform are algorithmic problems that do not require special knowledge of parallel programming.

Figure 6. Performance graphs: (a) speedup, partition-based (class A); (b) speedup, partition-based (class B); (c) speedup, partition-based (class C); (d) speedup, range-based (class A); (e) speedup, range-based (class B); (f) speedup, range-based (class C). Each graph plots speedup against the number of workers (1 to 8) for the applications of Table I.

V. RELATED WORK

Concurrent and parallel computing have been important research areas for a long time, and over the years many programming models have been proposed. Nonetheless, when it comes to mainstream programming, concurrency is usually obtained by explicitly creating new execution flows supported by thread APIs. In the parallel computing field, MPI [16], a message-passing communication standard, has been the community's de-facto parallel programming model since 1994. MPI defines a library of low-level operations that provide high flexibility, allowing for the programming of both the SPMD (Single Program Multiple Data) and the MPMD (Multiple Program Multiple Data) paradigms. Both approaches provide a low level of abstraction that requires the explicit management of the execution flows and of their interaction, which is error-prone and results in hard-to-maintain code. Moreover, when it comes to parallel algorithms, these approaches require the implementation of per-processor versions of the sequential algorithms.

A popular framework is OpenMP [8], a mix of compiler directives, library calls and environment variables that provides an easy way to annotate parallelism in a shared memory model. Data parallelism in OpenMP is at loop level, which contrasts with our subroutine approach. Cilk [7] is still a reference when it comes to functional (or task) parallelism.

It extends the C language with primitives to spawn and synchronize concurrent tasks (C functions) in shared-memory architectures. dipSystem [17] applied a similar approach to distributed environments, providing a uniform interface for the programming of both shared and distributed memory architectures. Cilk has recently been incorporated by Intel in their PBB suite [5].

A data-parallel programming model that is quite popular is PGAS (Partitioned Global Address Space), instantiated by UPC (Unified Parallel C) [10] and Titanium [18]. PGAS extends array programming, allowing execution flows to transparently communicate by accessing array positions other than the ones they own, with all communication being compiler-generated. Like all array programming languages, PGAS languages are restricted to the SPMD paradigm, which is neither flexible nor supportive of nested parallelism. The X10 [1] and Chapel [2] languages overcome this limitation by extending PGAS with processing-unit abstractions (localities) to which code may be shipped and executed. Language constructs allow a) the asynchronous spawning of tasks in localities, b) the distribution of arrays across localities and the subsequent spawning of activities to operate over the distributed data, and c) the management of the results of such asynchronous activities. This model allows programming both the MPMD and SPMD paradigms, but data parallelism is still performed at loop level, by iterating over the distributed data structures.

Other recent developments are the Intel PBB suite [5] and the Axum [4] language by Microsoft. The work of Intel focuses only on shared-memory multi-core architectures. PBB incorporates Cilk with libraries that a) abstract threads by aggregating them into higher-level constructs named tasks, hence increasing productivity by hiding low-level details, and b) provide the seamless distribution of arrays among several cores, borrowing the concepts of array programming languages.

Axum is a .NET-based language that builds on the actor model to express parallelism.

[6] Oracle and/or its affiliates., “The Java Tutorials - Fork/Join,” http://docs.oracle.com/javase/tutorial/essential/concurrency/ forkjoin.html.

VI. CONCLUSIONS AND FUTURE WORK

[7] R. D. Blumofe, C. F. Joerg, B. C. Kuszmaul, C. E. Leiserson, K. H. Randall, and Y. Zhou, “Cilk: An efficient multithreaded runtime system,” in Proceedings of the Fifth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), Jul. 1995, pp. 207–216.

This paper proposes the expression of data parallelism at subroutine level. Subroutines can be annotated with distribution and reduction policies, in order to be executed according to a SOMD execution model. Such an approach allows non-experts in parallel programming to express data-parallel computations with a small learning curve. A Java implementation allowed us to annotate a wide set of methods, and with that assess the model's usefulness and our prototype's performance. The evaluation presented in Section IV demonstrated that the simple annotation of distribution and reduction policies obtained speedups in many scenarios. The temporal complexity of the remainder was too low to subsume the partitioning overhead. This situation was overcome with the introduction of slight alterations to the original code, which greatly reduced the cited overhead.

Although our initial results are good, the X10 runtime system often imposes an extra overhead. Its features are very useful for distributed memory environments, but we believe we can improve the performance on a single Java virtual machine with a dedicated runtime, and we are working on such an endeavour. Future work also includes the implementation of parallel reduction policies, the formalization of the proposed constructs, and usability tests.

ACKNOWLEDGEMENT

This work was partially funded by PEst-OE/EEI/UI0527/2011, Centro de Informática e Tecnologias da Informação (CITI/FCT/UNL) - 2011-2012.

REFERENCES

[8] L. Dagum and R. Menon, “OpenMP: An industry-standard API for shared-memory programming,” Computing in Science and Engineering, vol. 5, no. 1, pp. 46–55, 1998.

[9] J. Reinders, Intel Threading Building Blocks. O’Reilly & Associates, Inc., 2007.

[10] W. Carlson, J. M. Draper, D. E. Culler, K. Yelick, E. Brooks, and K. Warren, “Introduction to UPC and Language Specification,” IDA Center for Computing Sciences, Tech. Rep. CCS-TR-99-157, 1999.

[11] R. G. Lavender and D. C. Schmidt, “Active object: An object behavioral pattern for concurrent programming,” in Pattern Languages of Program Design 2. Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1996, pp. 483–499.

[12] J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters,” Commun. ACM, vol. 51, no. 1, pp. 107–113, 2008.

[13] H. Paulino, A. M. Dias, M. C. Gomes, and J. C. Cunha, “A service centred approach to concurrent and parallel computing,” Faculdade de Ciências e Tecnologia, Universidade Nova de Lisboa, Tech. Rep., 2012.

[14] N. Nystrom, M. R. Clarkson, and A. C. Myers, “Polyglot: An extensible compiler framework for Java,” in Compiler Construction, 12th International Conference, CC 2003, Warsaw, Poland, April 7-11, 2003, Proceedings, ser. Lecture Notes in Computer Science, G. Hedin, Ed., vol. 2622. Springer, 2003, pp. 138–152.

[1] P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, and V. Sarkar, “X10: an object-oriented approach to non-uniform cluster computing,” SIGPLAN Not., vol. 40, no. 10, pp. 519–538, 2005.

[15] V. Saraswat, G. Almasi, G. Bikshandi, C. Cascaval, and D. Cunningham, “The asynchronous partitioned global address space model,” in 1st ACM SIGPLAN Workshop on Advances in Message Passing (AMP’10), Toronto, Canada. ACM Press, 2010.

[2] D. Callahan, B. Chamberlain, and H. Zima, “The cascade high productivity language,” in High-Level Parallel Programming Models and Supportive Environments, 2004. Proceedings. Ninth International Workshop on, 2004, pp. 52–60.

[16] Message Passing Forum, “MPI: A Message-Passing Interface Standard,” University of Tennessee, Tech. Rep. UT-CS-94230, May 1994.

[3] E. Allen, D. Chase, J. Hallett, V. Luchangco, J.-W. Maessen, S. Ryu, G. L. Steele Jr., and S. Tobin-Hochstadt, “The Fortress Language Specification,” Sun Microsystems, Inc., Tech. Rep., 2007. [4] Microsoft Corporation, “Axum programmer’s guide,” http://download.microsoft.com/download/B/D/5/BD51FFB2C777-43B0-AC24-BDE3C88E231F/Axum [5] Intel Corporation, “Intel parallel building blocks,” http://software.intel.com/en-us/articles/intel-parallel-buildingblocks/.

[17] F. Silva, H. Paulino, and L. Lopes, “di_psystem: A parallel programming system for distributed memory architectures,” in Proceedings of the 6th European PVM/MPI Users’ Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface. Springer-Verlag, 1999, pp. 525– 532. [18] K. A. Yelick, L. Semenzato, G. Pike, C. Miyamoto, B. Liblit, A. Krishnamurthy, P. N. Hilfinger, S. L. Graham, D. Gay, P. Colella, and A. Aiken, “Titanium: A high-performance java dialect,” Concurrency - Practice and Experience, vol. 10, no. 11-13, pp. 825–836, 1998.