A Framework for Coarse Region Data Flow Parallelism

Dr. Paul J. Hinker and Brad Lewis
Sun Microsystems, Inc.
Contact: Brad Lewis
Sun Microsystems, Inc.
500 Eldorado Blvd., MS UBRM11-202
Broomfield, CO 80021
[email protected]
Abstract

As the number of processors and the number of cores per processor continue to increase, parallel processing on SMPs continues to grow in importance. Today programmers commonly use OpenMP on shared memory systems. While OpenMP has made parallel development in sequential languages such as Fortran and C easier, it remains a complex and difficult problem. It has long been recognized that data flow is a superior model for parallel programming. Despite much research into data flow machines and languages, no such machines have been commercially successful to date. This paper describes an implementation of the well-known LAPACK blocked LU factorization that doubles the scalability of the traditional approach of providing parallel kernels. This is achieved by using the CORD (COarse Region Data flow) Framework. A simple region-based paradigm is used to describe algorithms in terms of operations on data regions. From this definition, a dependency graph is derived and used to direct parallel execution. This allows for a combination of task and data parallelism that approaches the theoretical maximum as the granularity of the operations is made finer.
1 Introduction

Data flow programming has long been recognized as a superior model to control flow programming for parallel applications[7]. Despite all the research into data flow machines and languages, the majority of parallel scientific programs are still written in traditional control flow languages such as C and Fortran and run on von Neumann machines. The trend in hardware is toward increased shared memory parallelism. Increases in the number of processors per system and the number of cores per processor continue to push up the number of threads of execution needed to fully utilize SMP systems. A new approach in software is needed to exploit the capability of new hardware. This paper describes an implementation of the LAPACK blocked LU factorization using coarse grain data flow parallelism provided by the CORD
Framework. Coarse grained data flow parallelism has been shown to be effective on traditional control flow architectures[1,10]. The SCHEDULE package provides a framework for writing data flow code in standard Fortran[6]. CORD adds a concept of regions similar to that found in the language RL[3]. The CORD concept of a region is a subset of a data object. The data object of interest in the LU factorization is a matrix. For simplicity, regions are restricted to rectangular sub-matrices. The definition could be extended with different data objects to allow for arbitrary partitioning of data sets.

CORD programs consist of two phases: a composition phase and an execution phase. In the composition phase, the algorithm is described in terms of nodes. A node is defined as an operation performed on a set of regions. The dependencies between the nodes are deduced to create a dependency graph. In the execution phase, nodes are selected from the graph, within the constraints of the data dependencies, for execution. To execute a node, the operation for that node is applied to the associated regions.

The LAPACK LU blocked factorization uses BLAS kernels to improve performance over elemental algorithms. The kernels can be ported and tuned for different hardware. Parallel versions of the BLAS are commonly available from most hardware vendors. The scalability is limited by Amdahl's Law because not all the code is executed inside a parallel kernel. The CORD implementation takes advantage of BLAS kernels to provide serial performance in each thread of execution. The use of data flow parallelism exploits the inherent concurrency in the algorithm to achieve much higher scalability.
2 Matrix Regions and Blocks

LAPACK blocked algorithms work on submatrices, or blocks, of a matrix. The size of the blocks is determined by a block size parameter. CORD provides support for blocked algorithms through the blocked_matrix type. A blocked matrix is defined by three properties for each dimension: the size, the block size, and the base index. A blocked matrix m = (rs, rnb, rb, cs, cnb, cb), where rs is the number of rows, rnb the row block size, rb the base index for the rows, cs the number of columns, cnb the column block size, and cb the base index for the columns. This abstraction simplifies the process of programming a blocked algorithm. An rs by cs matrix is converted into an r by c blocked matrix, where r = (rs + rnb - 1) / rnb and c = (cs + cnb - 1) / cnb. A region reg can be defined in terms of blocks by giving the starting and ending block indices in both the row and column dimensions: reg = (r1, r2, c1, c2), where r1 and r2 represent the low and high bounds on the rows and c1 and c2 the bounds on the columns. The algorithm can be completely defined in terms of operations upon regions. During the composition phase, the programmer can ignore any cleanup code involving regions smaller than the block size; this is handled during the execution phase.
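As a concrete check of these formulas, consider the 4000 by 4000 matrix used in Section 5 with the block size NB = 96 from Appendix A. The standalone sketch below (the program name and output statement are illustrative, not part of CORD) computes the block counts: 42 by 42 blocks, with the final block in each dimension holding only 64 rows or columns.

*     Standalone sketch: block counts for a 4000 x 4000 matrix with
*     rnb = cnb = 96, using the formula from the text.
      PROGRAM BLOCK_COUNTS
      IMPLICIT NONE
      INTEGER RS, CS, RNB, CNB, R, C
      PARAMETER (RS = 4000, CS = 4000, RNB = 96, CNB = 96)
*     r = (rs + rnb - 1) / rnb, integer (ceiling) division
      R = (RS + RNB - 1) / RNB
      C = (CS + CNB - 1) / CNB
*     Prints 42 and 42; the final block in each dimension is a
*     partial block of 64 rows/columns, handled at execution time
      PRINT *, 'Blocked matrix: ', R, ' by ', C, ' blocks'
      END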
3 LU Factorization Composition

The LAPACK DGETRF routine computes the factors of a real double precision m by n matrix A such that A = P * L * U, where P is a permutation matrix, L is lower triangular with unit diagonal elements, and U is upper triangular[9]. The blocked algorithm has a main loop that starts in the upper left hand corner and works down the diagonal of the matrix until it reaches the lower right side of the matrix. For each iteration, five steps are performed as shown in Figure 1. First, the subroutine DGETF2 is called to factor the diagonal and sub-diagonal blocks in the ith column using partial pivoting. Then the subroutine DLASWP performs the pivoting on the trailing block columns and the forward columns. Next, the blocks to the right of the diagonal in the ith row are updated with DTRSM. Finally, an update on the blocks below the ith row and to the right of the ith column is performed with DGEMM.

The CORD implementation follows the LAPACK algorithm step by step with two exceptions. The first is based on the observation that regions updated by trailing swaps are not accessed again; these swaps can therefore be deferred until the end of the run, allowing one pass over these columns instead of many to perform the pivoting. Also, the forward looking DLASWP step has been combined with the DTRSM step, allowing more work on the same regions to be performed by a single operation. CORD supports a region-centric paradigm,
approaching an algorithm as a set of operations performed on data regions. This is a natural way for programmers to view algorithms. Each step in an algorithm is specified with the following methodology:

1. Create a node to perform the given step. The function CREATE_NODES allocates and initializes one or more nodes to perform a given operation on a given number of regions.
2. Describe the regions that are operated on. The routines ADD_REGION_READ and ADD_REGION_WRITE are used to insert into the node the information describing each region.
3. If the step exhibits data parallelism, it can be decomposed into independent steps with a call to DECOMPOSE.
4. Add the node to the graph at a given priority. The routine ADD_NODES_TO_GRAPH derives data dependencies to the nodes previously added to the graph based on the region information. In this way, a dependency graph is built as nodes are added.

CORD programmers must understand two things about the steps in an algorithm. First, they must be able to describe the regions of data that are accessed. Second, they must determine which steps are to be decomposed into independent operations to execute in parallel. A condensed sketch of this pattern is shown below.
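The fragment below condenses the combined DTRSM/DLASWP step from GEN_DGETRF_DAG in Appendix A to show this four-step pattern in place; the comments are annotations added here, and the surrounding loop variables (I, NROWS, NCOLS) are as declared in the Appendix.

*     The combined DTRSM/DLASWP step from GEN_DGETRF_DAG (Appendix A),
*     condensed to show the composition pattern.
      IF (I .LT. NCOLS) THEN
*        1. Create one node for the TRSM task
         CALL CREATE_NODES(NODES, DAG, 1, 2, TRSM_TASK)
*        2. Describe the regions: write the blocks right of the panel,
*           read the diagonal block
         CALL ADD_REGION_WRITE(NODES(1), A, 1, I, NROWS, I+1, NCOLS)
         CALL ADD_REGION_READ(NODES(1), A, 2, I, I, I, I)
*        3. Expose data parallelism by splitting along block columns
         NNODES = DECOMPOSE(DAG, NODES(1), 1, COLUMN_DECOMPOSITION)
*        4. Add the node(s) to the graph at the TRSM priority;
*           dependencies on earlier nodes are derived here
         CALL ADD_NODES_TO_GRAPH(DAG, NODES, 1, TRSM_PRIORITY)
      END IF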
[Figure 1 depicts the five blocked DGETRF steps applied to A and IPIV (Step 1: DGETF2; Step 2: DLASWP, trailing; Step 3: DLASWP, forward; Step 4: DTRSM; Step 5: DGEMM), with read/write and read-only regions indicated.]
Figure 1: Blocked DGETRF Steps

There are several factors to consider when decomposing an operation. The foremost consideration is whether the nature of the step itself allows it to be divided into independent operations. In the LU algorithm, each step except the DGETF2 step can easily be decomposed. For example, the DTRSM computes the solution to a triangular system of equations for several different right hand side vectors. Each vector corresponds to a column in the output matrix and can be operated on independently. The DLASWP and DGEMM can also operate on the output columns independently. For the LU, each step is decomposed to act on each block of columns independently. Figure 2 shows the columnar decomposition for the DGEMM.
Figure 2: DGEMM Decomposition

A second consideration in decomposing the steps is the memory accessed by each of the operations. Where possible, it is advantageous to maintain blocks of contiguous memory to be worked on. Therefore a column-wise decomposition is generally preferable, given the column major nature of Fortran arrays. These considerations must often be weighed against the benefits of exposing more parallelism. For instance, the DGEMM step performs a matrix multiply in which each element of the product is computed with a dot product that could be performed independently. Therefore DGEMM can be decomposed along rows as well. This increases the parallelism and reduces the idle time during execution. However, the individual operations run at a slower rate, making the overall effect on the runtime negligible. When applied to all DGEMM operations, the row-wise decomposition has a negative impact: it increases the memory footprint of the graph and adds complexity to the source code.

A specific case was found where row-wise decomposition proved beneficial. The first column of blocks for a DGEMM step operates on the same column as the DGETF2 from the next iteration. It is advantageous to further parallelize only this column by decomposing it along rows. Doing so causes the computation for this column to be spread over many threads, thereby providing the data necessary to start the next iteration earlier. This is critical because the serial DGETF2 steps act as bottlenecks during the run. Getting those steps done sooner satisfies more data dependencies and frees more nodes to be executed. The result is more work to keep the processors fed and far less idle time during the run.

The subroutine GEN_DGETRF_DAG, shown in Appendix A, handles the composition portion of the implementation by specifying all the operations in the algorithm and building a dependency graph for the operations. The number of blocks in the matrix is determined by querying the input BLOCKED_MATRIX type objects with the routine GET_NUMBER_OF_BLOCKS. A main loop steps down the diagonal, block by block, and specifies the bulk of the work with three operations for each iteration: DGETF2, DTRSM/DLASWP, and DGEMM. The remaining code specifies the trailing swaps to perform the pivoting on blocks below the diagonal. The trailing DLASWP operation works on a region with a blocked triangular shape, as shown in Figure 3. Each blocked column can be operated on independently. The triangular region cannot be represented with a rectangular blocked matrix
region. This prevents the use of the decomposition function. However, the data parallelism of the operation can still be expressed by creating a node for each column and adding a region containing the blocks below the diagonal.
Figure 3: Triangular DLASWP Region

The end result of the composition phase is a dependency graph. The nodes in this graph represent all the operations performed on the blocked matrix. For instance, Figure 4 shows the graph for a square matrix with 5 blocks per side. Each arrow represents a data dependency between two nodes; a data dependency in this case indicates that two nodes access the same block of memory. Each operation has a priority associated with it that is used to direct execution. At runtime, among the set of operations whose data dependencies are satisfied, those with higher priorities are executed first. The priorities used for the LU are listed in Figure 5.
[Figure 4 shows the LU dependency graph for a matrix of 5 by 5 blocks; node types are DGETF2, DTRSM, DGEMM, ROW DGEMM, and DLASWP, with arrows indicating data dependencies.]

Figure 4: LU Dependency Graph
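To make the dependency test concrete, the sketch below shows one way a block-range overlap check could be written. It is purely illustrative, is not taken from the CORD sources, and the function name REGIONS_OVERLAP is hypothetical; a real implementation would presumably also distinguish read from write access when deciding whether an overlap constitutes a dependency.

*     Illustrative sketch only (not from the CORD sources): test
*     whether two block regions (r1:r2, c1:c2) reference any of the
*     same blocks, the condition for a data dependency described above.
      LOGICAL FUNCTION REGIONS_OVERLAP(AR1, AR2, AC1, AC2,
     &                                 BR1, BR2, BC1, BC2)
      IMPLICIT NONE
      INTEGER AR1, AR2, AC1, AC2
      INTEGER BR1, BR2, BC1, BC2
*     Two block ranges intersect iff they overlap in both dimensions
      REGIONS_OVERLAP = (AR1 .LE. BR2) .AND. (BR1 .LE. AR2) .AND.
     &                  (AC1 .LE. BC2) .AND. (BC1 .LE. AC2)
      RETURN
      END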
Operation         Priority
DGETF2            1
DTRSM/DLASWP      3
DGEMM             2
DLASWP            4

Figure 5: LU Priorities
4 LU Factorization Execution

The composition and execution phases for the factorization can run concurrently on separate threads. As soon as the composition routine places a node in the dependency graph, the execution phase can begin to perform the operations. Typically one thread performs the composition while many threads perform the execution. The composition thread can join in the execution after it completes the composition. This can be achieved with a simple OpenMP construct, as shown in Appendix A.

The LU execution routine, EXE_DGETRF_DAG, is shown in Appendix A. The subroutine executes as long as there is work to perform, querying the graph with the GET_NEXT_NODE function to receive the next node to execute. If there is no work available, the thread busy waits inside GET_NEXT_NODE. When all the nodes in the graph have been executed, a nonzero value is returned and the loop exits. The selection of the next node to execute is guided by four factors:

1. Data dependencies. Only those nodes with all data dependencies satisfied can be executed.
2. Node priorities. The nodes are held in individual execution queues based on priorities. The higher priority queues are examined first.
3. Order of placement in the graph. A node added first will be selected for execution first if everything else is equal.
4. Search strategy. The default is to consider only the node at the front of each execution queue. For the LU this strategy is effective. The option exists to treat each queue as a pool and to examine the entire queue looking for nodes to execute.

Each node contains all the information necessary to perform an operation. A task index, which specifies the operation to be performed, is accessed with the GET_TASK_ID function. Also encapsulated in the nodes are the definitions of the data regions involved in the operation. The execution code for each operation is written with knowledge of the number and order of regions. The execution routine views the data in the traditional elemental manner; the matrix A is stored as a two-dimensional array. The
GET_ABSOLUTE_INDEX and GET_REGION_SIZE routines return the starting index and size of a region, respectively. This provides the information necessary to perform the computations using traditional code. No effort is required by the programmer to convert from the block matrix paradigm that was used to define the algorithm.

Each node contains a bit that records whether the specified operation has been completed. When the operation for a node has been completed, the MARK_AS_DONE routine is called. This sets the bit and indicates that the data computed by this node is available to the nodes that depend on it. On shared memory systems no special communication is needed between processors: when a thread performing an operation writes a block of the matrix, the block can be accessed by all other executing threads.
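For reference, the execution loop in EXE_DGETRF_DAG (Appendix A) reduces to the pattern below, shown here only for the trailing-swap task; the remaining tasks follow the same shape, and the comments are annotations added for this summary.

*     The shape of the execution loop, condensed from EXE_DGETRF_DAG
*     in Appendix A; only the trailing-swap task is shown.
      DO WHILE (GET_NEXT_NODE(CURNODE, DAG, CURNODE) .EQ. 0)
         IF (GET_TASK_ID(CURNODE) .EQ. SWAP_TASK) THEN
*           Map the block region back to elemental array indices
            I  = GET_ABSOLUTE_INDEX(CURNODE, 1, 1)
            IB = GET_REGION_SIZE(CURNODE, 1, 1)
            JB = GET_REGION_SIZE(CURNODE, 1, 2)
            J  = GET_ABSOLUTE_INDEX(CURNODE, 1, 2)
            CALL DLASWP(JB, A(1,J), LDA, I, I+IB-1, IPIV, 1)
         END IF
*        Release nodes that depend on the data written by this node
         CALL MARK_AS_DONE(DAG, CURNODE)
      END DO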
5 LU Results and Analysis

These results were observed on a Sun Fire E6900 Server with 24 dual-thread UltraSPARC IV processors. As shown in Figure 6, the CORD implementation provides a significant improvement in the scalability of the LU factorization of a 4000 by 4000 matrix over the traditional parallel kernel approach. In this case, the maximum speedup achieved more than doubled.

[Figure 6 plots LU speedup (4000 x 4000) against NCPUS from 0 to 32 for perfect scaling, CORD, Amdahl's maximum, and parallel kernels.]
Figure 6: Speedup on Sun Fire E6900 Server

Figure 6 also compares the observed results to those predicted by Amdahl's Law:
S = 1 / (F + (1 - F) / P)

where S is the maximum speedup achievable on P processors and F is the fraction of the runtime that is serial. All the steps in the LU were parallelizable except for the DGETF2, which was observed to account for 5.6% of the total runtime of a serial run. Though this is a small percentage of the total runtime, Amdahl's Law correctly predicts that it will limit the scalability of the parallel kernel version to 11.7 on 32 threads. At best, roughly one third of the processing power of the system will be utilized, and reaching even this upper limit requires perfect scaling of the parallel steps. The observed speedup for the parallel kernel version was indeed significantly lower than the predicted maximum.

The scalability of the LU increases as the matrix size increases because the percentage of time in the serial code decreases; as the problem size grows, the algorithm approaches perfect scaling. The 4000 by 4000 matrix was chosen as an example where the scalability of the algorithm is limited by Amdahl's Law to much less than the capability of the hardware. The CORD implementation was able to exceed the maximum speedup predicted by Amdahl's Law on runs from 2 threads up to 32 threads. The increased scalability is due to concurrency exposed by the dependency graph. The DGETF2 step cannot be directly parallelized, but the dependency graph shows that, with the exception of the first and the last DGETF2 invocations, there is an opportunity for this operation to run concurrently with the DGEMM and DTRSM operations. The coarse grained data flow approach therefore achieved better scalability than would be possible with parallel kernels, even if perfect scaling were obtained from everything but the DGETF2 routine.
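The quoted bound can be checked directly from the formula; the standalone sketch below (program name illustrative, not part of CORD) evaluates it for F = 0.056 and P = 32 and prints approximately 11.7.

*     Illustrative only: evaluating the Amdahl bound quoted above
*     for F = 0.056 and P = 32.
      PROGRAM AMDAHL_BOUND
      IMPLICIT NONE
      DOUBLE PRECISION F, P, S
      F = 0.056D0
      P = 32.0D0
*     S = 1 / (F + (1 - F) / P)
      S = 1.0D0 / (F + (1.0D0 - F) / P)
*     Prints approximately 11.7, the limit cited for the
*     parallel kernel version on 32 threads
      PRINT *, 'Amdahl speedup bound: ', S
      END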
6 Conclusion

The CORD implementation of the LU matrix factorization presented here doubles the scalability of the algorithm over a traditional control flow implementation and exceeds the performance limit that Amdahl's Law predicts for the parallel kernel approach. The roughly 5% of execution time spent in the serial DGETF2 kernel severely limits the scalability of the parallel kernel implementation. The CORD implementation features execution directed by a dependency graph that allows the DGETF2 to execute concurrently with other tasks. CORD shows promise as a method to increase parallel performance on industry standard hardware. CORD provides a framework to utilize coarse grain data flow for blocked matrix algorithms. CORD can be extended to support a wide range of applications by defining regions for additional data structures. The internal code is written in standard C and Fortran and should be portable, with the exception of some minor details involving the passing of parameters between these two languages.
7 References

1. Monica S. Lam, Martin C. Rinard, Coarse-grain parallel programming in Jade, Proceedings of the Third ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, p. 94-105, April 21-24, 1991, Williamsburg, Virginia, United States.
2. Saniya Ben Hassen, Henri Bal, Integrating task and data parallelism using shared objects, Proceedings of the 10th International Conference on Supercomputing, p. 317-324, May 25-28, 1996, Philadelphia, Pennsylvania, United States.
3. Bradford L. Chamberlain, E. Christopher Lewis, Calvin Lin, Lawrence Snyder, Regions: an abstraction for expressing array computation, Proceedings of the Conference on APL '99: On Track to the 21st Century, p. 41-49, August 1999, Scranton, Pennsylvania, United States.
4. Dianne P. O'Leary, G. W. Stewart, Data-flow algorithms for parallel matrix computation, Communications of the ACM, v. 28, n. 8, p. 840-853, Aug. 1985.
5. do Nascimento, L. T., Ferreira, R. A., Meira, W., Jr., Guedes, D., Scheduling data flow applications using linear programming, International Conference on Parallel Processing, p. 638-645, June 14-17, 2005.
6. J. J. Dongarra and D. C. Sorenson, A portable environment for developing parallel FORTRAN programs, Parallel Computing, (5):175-186, 1987.
7. John A. Sharp, Data Flow Computing: Theory and Practice.
8. Philip Cox, Simon Gauvin, Andrew Rau-Chaplin, Adding parallelism to visual data flow programs, Proceedings of the 2005 ACM Symposium on Software Visualization, p. 135-144, May 14-15, 2005, St. Louis, Missouri, United States.
9. E. Anderson, Z. Bai, C. Bischof, L. S. Blackford, J. Demmel, Jack J. Dongarra, J. Du Croz, S. Hammarling, A. Greenbaum, A. McKenney, D. Sorensen, LAPACK Users' Guide (third ed.), Society for Industrial and Applied Mathematics, Philadelphia, PA, 1999.
10. Suhler, P. A., Biswas, J., Korner, K. M., and Browne, J. C., TDFL: a task-level dataflow language, J. Parallel Distrib. Comput. 9, 2 (Jun. 1990), 103-115.
APPENDIX A – dgetrf.f

      MODULE LU
*
*     Parameters for task id's
*
      INTEGER GETF2_TASK
      PARAMETER (GETF2_TASK = 1)
      INTEGER GEMM_TASK
      PARAMETER (GEMM_TASK = 2)
      INTEGER TRSM_TASK
      PARAMETER (TRSM_TASK = 3)
      INTEGER SWAP_TASK
      PARAMETER (SWAP_TASK = 4)
*
*     Parameters for priorities
*
      INTEGER GETF2_PRIORITY
      PARAMETER (GETF2_PRIORITY = 1)
      INTEGER GEMM_PRIORITY
      PARAMETER (GEMM_PRIORITY = 3)
      INTEGER TRSM_PRIORITY
      PARAMETER (TRSM_PRIORITY = 2)
      INTEGER SWAP_PRIORITY
      PARAMETER (SWAP_PRIORITY = 1)
      INTEGER NUM_PRIORITY_QUEUES
      PARAMETER (NUM_PRIORITY_QUEUES = 3)
*
*     Other Parameters
*
      INTEGER NB
      PARAMETER (NB = 96)
      DOUBLE PRECISION ONE
      PARAMETER (ONE = 1.0D0)
      END MODULE

      SUBROUTINE DGETRF(M, N, A, LDA, IPIV, INFO)
      USE DATA_FLOW
      USE LU
      IMPLICIT NONE
*
*     Arguments
*
      INTEGER           M
      INTEGER           N
      DOUBLE PRECISION  A(LDA, N)
      INTEGER           LDA
      INTEGER           IPIV( * )
      INTEGER           INFO
*
*     Local variables
*
      TYPE (DIRECTED_GRAPH) DAG
      TYPE (BLOCKED_MATRIX) BA
      TYPE (BLOCKED_MATRIX) BIPIV
*
*     External subroutines
*
      EXTERNAL          GEN_DGETRF_DAG
      EXTERNAL          EXE_DGETRF_DAG
*
      CALL INITIALIZE_GRAPH(DAG, NUM_PRIORITY_QUEUES)
      CALL SET_MATRIX_VALUES(DAG, BA, M, NB, 1, N, NB, 1)
      CALL SET_MATRIX_VALUES(DAG, BIPIV, MIN(M,N), NB, 1, 1, 1, 1)
!$OMP PARALLEL
!$OMP MASTER
      CALL GEN_DGETRF_DAG(DAG, BA, BIPIV)
!$OMP END MASTER
      CALL EXE_DGETRF_DAG(DAG, M, N, A, LDA, IPIV, INFO)
!$OMP END PARALLEL
      CALL FREE_GRAPH(DAG)
      RETURN
      END
      SUBROUTINE GEN_DGETRF_DAG(DAG, A, IPIV)
      USE DATA_FLOW
      USE LU
      IMPLICIT NONE
*
*     Arguments
*
      TYPE (DIRECTED_GRAPH) DAG
      TYPE (BLOCKED_MATRIX) A
      TYPE (BLOCKED_MATRIX) IPIV
*
*     Local variables
*
      TYPE (GRAPH_NODE), POINTER :: CNODES(:)
      INTEGER I
      INTEGER J
      INTEGER NBLOCKS
      TYPE (GRAPH_NODE), POINTER :: NODES(:)
      INTEGER NCOLS
      INTEGER NNODES
      INTEGER NROWS
*
*     Executable Statements
*
      NROWS = GET_NUMBER_OF_BLOCKS(A, 1)
      NCOLS = GET_NUMBER_OF_BLOCKS(A, 2)
      NBLOCKS = MIN(NROWS, NCOLS)
      DO I = 1, NBLOCKS
*
*        Create a node to specify dgetf2
*
         CALL CREATE_NODES(NODES, DAG, 1, 2, GETF2_TASK)
         CALL ADD_REGION_WRITE(NODES(1), A, 1, I, NBLOCKS, I, I)
         CALL ADD_REGION_WRITE(NODES(1), IPIV, 2, I, I, 1, 1)
         CALL ADD_NODES_TO_GRAPH(DAG, NODES, 1, GETF2_PRIORITY)
*
*        Create nodes to specify dtrsm/dlaswp
*
         IF (I .LT. NCOLS) THEN
            CALL CREATE_NODES(NODES, DAG, 1, 2, TRSM_TASK)
            CALL ADD_REGION_WRITE(NODES(1), A, 1, I, NROWS, I+1, NCOLS)
            CALL ADD_REGION_READ(NODES(1), A, 2, I, I, I, I)
            NNODES = DECOMPOSE(DAG, NODES(1), 1, COLUMN_DECOMPOSITION)
            CALL ADD_NODES_TO_GRAPH(DAG, NODES, 1, TRSM_PRIORITY)
         END IF
*
*        Create nodes to do dgemm
*
         IF (I .LT. NBLOCKS) THEN
            CALL CREATE_NODES(NODES, DAG, 1, 3, GEMM_TASK)
            CALL ADD_REGION_WRITE(NODES(1), A, 1, I+1, NROWS,
     &                            I+1, NCOLS)
            CALL ADD_REGION_READ(NODES(1), A, 2, I+1, NROWS, I, I)
            CALL ADD_REGION_READ(NODES(1), A, 3, I, I, I+1, NCOLS)
            NNODES = DECOMPOSE(DAG, NODES(1), 1, COLUMN_DECOMPOSITION)
            CALL GET_NODE_CHILDREN(CNODES, NODES(1))
            DO J = 1, NNODES
               NNODES = DECOMPOSE(DAG, CNODES(J), 1,
     &                            ROW_DECOMPOSITION)
            END DO
            CALL ADD_NODES_TO_GRAPH(DAG, NODES, 1, TRSM_PRIORITY)
         END IF
      END DO
*
*     Create nodes to specify the trailing dlaswp
*
      IF (NCOLS .GE. NROWS) THEN
         NNODES = NROWS - 1
      ELSE
         NNODES = NCOLS
      END IF
      CALL CREATE_NODES(NODES, DAG, NNODES, 2, SWAP_TASK)
      DO J = 1, NNODES
         CALL ADD_REGION_WRITE(NODES(J), A, 1, J+1, NROWS, J, J)
         CALL ADD_REGION_READ(NODES(J), IPIV, 2, J+1, NROWS, 1, 1)
      END DO
      CALL ADD_NODES_TO_GRAPH(DAG, NODES, NNODES, SWAP_PRIORITY)
      CALL FINALIZE_GRAPH(DAG)
      RETURN
      END

      SUBROUTINE EXE_DGETRF_DAG(DAG, M, N, A, LDA, IPIV, INFO)
      USE DATA_FLOW
      USE LU
      IMPLICIT NONE
*
*     Arguments
*
      TYPE (DIRECTED_GRAPH) DAG
      INTEGER           M
      INTEGER           N
      DOUBLE PRECISION  A(LDA,*)
      INTEGER           LDA
      INTEGER           IPIV(*)
      INTEGER           INFO
*
*     Local Variables
*
      TYPE (GRAPH_NODE), POINTER :: CURNODE
      INTEGER I, IB, IINFO, J, JB, K, KB
*
      DO WHILE (GET_NEXT_NODE(CURNODE, DAG, CURNODE) .EQ. 0)
         IF (GET_TASK_ID(CURNODE) .EQ. GETF2_TASK) THEN
            I = GET_ABSOLUTE_INDEX(CURNODE, 1, 1)
            IB = GET_REGION_SIZE(CURNODE, 1, 1)
            JB = GET_REGION_SIZE(CURNODE, 1, 2)
            IINFO = 0
            CALL DGETF2(IB, JB, A(I,I), LDA, IPIV(I), IINFO)
            IF (INFO .EQ. 0 .AND. IINFO .GT. 0) INFO = IINFO + I - 1
            DO K = I, MIN( M, I+JB-1 )
               IPIV( K ) = I - 1 + IPIV( K )
            END DO
         ELSE IF (GET_TASK_ID(CURNODE) .EQ. TRSM_TASK) THEN
            I = GET_ABSOLUTE_INDEX(CURNODE, 1, 1)
            J = GET_ABSOLUTE_INDEX(CURNODE, 1, 2)
            JB = GET_REGION_SIZE(CURNODE, 1, 2)
            KB = GET_REGION_SIZE(CURNODE, 2, 1)
            CALL DLASWP(JB, A(1,J), LDA, I, I+KB-1, IPIV, 1)
            CALL ___PL_PP_DTRSM('Left', 'Lower', 'No transpose',
     &                          'unit', KB, JB, 1.0D0, A(I,I), LDA,
     &                          A(I,J), LDA)
         ELSE IF (GET_TASK_ID(CURNODE) .EQ. GEMM_TASK) THEN
            J = GET_ABSOLUTE_INDEX(CURNODE, 1, 1)
            JB = GET_REGION_SIZE(CURNODE, 1, 1)
            I = GET_ABSOLUTE_INDEX(CURNODE, 1, 2)
            IB = GET_REGION_SIZE(CURNODE, 1, 2)
            K = GET_ABSOLUTE_INDEX(CURNODE, 2, 2)
            KB = GET_REGION_SIZE(CURNODE, 2, 2)
            CALL ___PL_PP_DGEMM('N', 'N', JB, IB, KB, -ONE, A(J,K),
     &                          LDA, A(K,I), LDA, 1.0D0, A(J,I), LDA)
         ELSE IF (GET_TASK_ID(CURNODE) .EQ. SWAP_TASK) THEN
            I = GET_ABSOLUTE_INDEX(CURNODE, 1, 1)
            IB = GET_REGION_SIZE(CURNODE, 1, 1)
            JB = GET_REGION_SIZE(CURNODE, 1, 2)
            J = GET_ABSOLUTE_INDEX(CURNODE, 1, 2)
            CALL DLASWP(JB, A(1,J), LDA, I, I+IB-1, IPIV, 1)
         END IF
         CALL MARK_AS_DONE(DAG, CURNODE)
      END DO
      RETURN
      END
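For completeness, a minimal driver sketch follows; it is not part of the original appendix and assumes the DATA_FLOW module and the CORD library are available at link time. It simply fills a diagonally dominant test matrix and calls the DGETRF wrapper defined above.

*     Minimal driver sketch (not part of the original appendix).
      PROGRAM DRIVER
      IMPLICIT NONE
      INTEGER          N, LDA, INFO, I, J
      PARAMETER        (N = 1000, LDA = N)
      DOUBLE PRECISION A(LDA, N)
      INTEGER          IPIV(N)
*     Fill A with test data, diagonally dominant for stability
      DO J = 1, N
         DO I = 1, N
            A(I, J) = 1.0D0 / DBLE(I + J - 1)
         END DO
         A(J, J) = A(J, J) + DBLE(N)
      END DO
*     Factor A using the CORD-based DGETRF defined in this appendix
      CALL DGETRF(N, N, A, LDA, IPIV, INFO)
      PRINT *, 'DGETRF returned INFO = ', INFO
      END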