SPC-XML: A Structured Representation for Nested-Parallel Programming Languages*

Arturo González-Escribano¹, Arjan J.C. van Gemund², and Valentín Cardeñoso-Payo¹

¹ Dept. de Informática, Universidad de Valladolid. E.T.I.T. Campus Miguel Delibes, 47011 Valladolid, Spain. Phone: +34 983 423270, e-mail: {arturo,valen}@infor.uva.es
² Embedded Software Lab, Software Technology Department, Faculty of Electrical Engineering, Mathematics and Computer Science, P.O. Box 5031, NL-2600 GA Delft, The Netherlands. Phone: +31 15 2786144, e-mail: [email protected]

* This work has been partially supported by: JCyL under contract number VA-083/03, the EC (FEDER) and the Spanish MCyT (Plan Nacional de I+D+I, TIC2002-04498-C05-05 and TIC2002-04400-C03). A preliminary version of this paper was presented at the 11th Int. Workshop on Compilers for Parallel Computers.

Abstract. Nested-parallelism programming models, where the task graph associated with a computation is series-parallel, present good analysis properties that can be exploited for scheduling, cost estimation, or automatic mapping to different architectures. In this paper we present an XML intermediate representation for nested-parallel programming languages from which the application task-graph can be easily derived. We introduce some design principles oriented to allow the compiler to exploit information about the task synchronization structure, automatically determine implicit communication structures, apply different scheduling policies, and generate lower-level code using different models or communication tools. Results obtained for simple applications, using an extensible prototype compiler framework, show how this flexible approach can lead to portable and efficient implementations.

1 Introduction

A common practice in high-performance computing is to program applications in terms of the low-level concurrent programming model provided by the target machine, trying to exploit the maximum possible performance. Portable APIs, such as message-passing interfaces (e.g. MPI, PVM), propose an abstraction of the machine architecture while still obtaining good performance. However, programming in terms of these unrestricted coordination models can be extremely error-prone and inefficient, as the synchronization dependencies that a program can generate are complex and difficult to analyze by humans or compilers [14]. Important decisions in the implementation trajectory, such as scheduling or data-layout, become extremely difficult to optimize. Considering these problems, more abstract programming models, which restrict the possible synchronization and communication structures available to the programmer, have been proposed and studied (see e.g. [21]). These models, due to their restrictions, are easier to understand and program, and can provide tools and techniques that help in mapping decisions.

Nested-parallelism models present a middle point between expressiveness, complexity, and ease of programming [21]. They restrict the coordination structures and dependencies to those that can be represented by series-parallel (SP) task-graphs (DAGs). Thus, nested-parallelism is also called SP programming. Due to the inherent properties of SP structures [22], they provide clear semantics and analyzability characteristics [16], a simple compositional cost model [9, 19, 20], and efficient scheduling [1, 7]. These properties can lead to automatic compilation techniques that increase portability and performance. Examples of parallel programming models based on nested-parallelism include BSP [23], nested BSP (e.g. NestStep [15], PUB [2]), BMF [20], skeleton-based models (e.g. SCL [5], Frame [4], OTOSP/LLC [6]), SPC [9], and Cilk [1]. In previous work (see e.g. [10]) we have shown that many application classes, including some typical irregular scientific applications, may be efficiently mapped to nested-parallelism due to some inherent load or synchronization regularities, or using simple balancing techniques. In [12] we also discussed how using a structured XML representation to specify the coordination of a parallel program may lead to a flexible and extensible compiler framework, where scheduling or mapping plug-ins may be automatically selected according to the application structure properties (for a further discussion about extensible programming see also [24]).

SPC-XML is an evolving, highly abstract XML intermediate representation of nested-parallel programs. Although it is not intended to be used directly by the programmer, but through front-end translators, it is a complete parallel coordination language. It is designed as a portable, easy-to-parse, and extensible language to simplify experimentation with nested-parallel compiler technology. SPC-XML uses a common specification syntax for coarse-grain computations (as in BSP [23] models) and fine-grain computations (as in data-parallel models such as HPF [3]). By means of simple recursive specifications it delegates to the compiler the selection of the appropriate granularity for a given application and target machine. Since no specific memory model is forced in the language, computations may be adapted to a particular target architecture, supporting both distributed and shared-memory models. Due to its extensibility features it may support, but is not limited to, any specific set of compile-time scheduling and data-layout techniques (as in OpenMP [17]), or a given generic run-time scheduling system for dynamically spawned processes (as in Cilk [1]). New scheduling or mapping techniques may be applied, and the compilation strategy may be guided by simple and accurate cost models.

This XML specification is designed to simplify the reconstruction of the synchronization task-graph of the application, and to obtain dependency and data-flow information which helps the compiler in determining the implicit communication structure. It is possible to: (a) express the different types of collective or structured communication operations found in other languages (e.g. message-passing interfaces [14] or OTOSP/LLC [6]), and (b) inherently support structured programming or parallel skeletons (e.g. [4, 18]). However, SPC-XML achieves this using only one implicit synchronization mechanism (the end of a parallel section) in combination with the explicit data-flow information naturally exposed in the standard parameter substitution of process invocations. In this paper we also show how communication structures may be restructured and optimized at the low level, obtaining solutions similar to manually coded message-passing programs. We present preliminary results for simple applications, generated with a source-to-source translator prototype which exploits these techniques. Performance is comparable to that obtained with manually developed MPI codes.

The paper is organized as follows: Section 2 introduces our new proposal for a tag-based coordination language. In Section 3 we discuss how the features of this language help at the different stages of a completely automatic compilation path. Experiments and results with an example application are presented in Section 4. Finally, Section 5 draws our conclusions.

2 SPC-XML: A Tag-Based Coordination Language

In this section we introduce an intermediate, highly structured coordination language based on XML. It is named SPC-XML, after the SPC nested-parallel model [9] (Series-Parallel and Contention model). The full description of the language, including its DTD, can be found in [13]. This full parallel synchronization language is designed to support any feature to be found in a nested-parallel environment, such as recursion, critical sections, distributed or shared-memory models, and manual data-layout specifications. In Fig. 1 we show an excerpt of a simple cellular-automata program representation.

The language is highly verbose and it is not designed to be written directly by a programmer, but to serve as a convenient intermediate representation. Front-ends to translate legacy code written in any nested-parallel language to this representation would be a straightforward development effort. Nevertheless, standard XML tools may be used to edit, visualize, or check consistency of SPC-XML representations. Although its main functionalities and semantics are clearly defined, SPC-XML is still syntactically evolving toward a more mature level.

The design principles of SPC-XML are:

1. The SPC-XML model implements the same semantics as the SPC model and its underlying process algebra (see e.g. [9]). Processes are composed to form a program with only two possible operators (sequential and parallel), which may be freely nested.
2. It uses XML tags to directly represent the parallelism structure. Nested-parallelism is a hierarchical, structured form of expressing parallelism. Thus, a representation using an XML structured document is natural.
3. It is a coordination language [8]. Tags represent only the synchronization structure. Any classical sequential language may be used to code sequential sections. This separation simplifies the recognition of the parallel structure.
4. The program must specify explicit input/output interfaces for sequential sections and logical processes. The behavior of a data item in an interface (input and/or output) must be explicitly specified. This helps the compiler to compute data-flow and automatically derive communications for a given task decomposition and mapping.
5. Tag and attribute names are fully readable, to help humans recognize the main program structures easily.

2.1 Document Structure

An SPC-XML document contains a HEADER tag and a BODY tag. The first one is used mainly for documentation and to specify low-level sequential code blocks to be included at the beginning of the target program. The body of the document contains a collection of elemental units (functions and processes). If a process called main is found inside a document, the document represents a program whose execution begins at that main process. If the document does not include a main process, it is a library which may not be compiled alone. All elemental units may have a DOC tag containing extra documentation tags for automatic documenting tools. Programming comments are included with the usual XML comment tags (<!-- ... -->).

Fig. 1. SPC-XML example: Excerpt of a cellular-automata program representation
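As an orientation for the tag descriptions that follow, the sketch below shows the overall document shape described in this subsection. The tag names (HEADER, BODY, PROCESS, DOC) are those introduced in the text; the root element name, the name attribute, and the nesting details are illustrative assumptions, not verbatim SPC-XML(v0.4) syntax:

    <SPC-XML>
      <HEADER>
        <!-- Documentation and low-level sequential code blocks
             to be included at the beginning of the target program -->
      </HEADER>
      <BODY>
        <!-- A document containing a process named "main" is a
             program; a document without one is a library -->
        <PROCESS name="main">
          <DOC> Cellular-automata driver (illustrative) </DOC>
          ...
        </PROCESS>
      </BODY>
    </SPC-XML>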

2.2 Logical Processes and Functions

SPC-XML has a PROCESS tag to define the content of a logical process. Each process has a well-defined interface, whose formal parameters are defined with the IN, OUT, and INOUT tags, declaring the input/output intent of the data item in the interface. Processes may also contain a LOCAL tag to declare variables with process visibility and scope. The BODY tag of a process contains the tags which define its behavior. Serial composition is implied by the tag declaration order, as in many common procedural languages. For programmer convenience, other common control-flow statements are supported by the IF/THEN/ELSE, WHILE, REPEAT, and LOOP tags.

The parallel composition operator is implemented using the PARALLEL tag. This tag may only contain one or more PARBLOCK tags, whose contents are composed in parallel. The PARALLEL tag has an optional attribute named p="number", to specify a fixed number of parallel-blocks to spawn. The PARBLOCK tags may also have a p="number" attribute, to associate the tag content with a logical identifier inside the local subgroup of blocks. A default parallel-block (specified with p="*") may be used to fill up all the non-associated blocks in a parallel region with its content. The closing parallel tag implies a synchronization of all the blocks before proceeding. At this logical synchronization point, modifications done to the same variable in different blocks are made consistent. By default, the modifications done by only one block will persist. Nevertheless, optional attributes may be used in the PARALLEL tag to apply typical reduction operations to given data elements. As in any nested-parallel model, other PARALLEL tags may appear inside a block, or in other processes called inside a block.

Sequential codes are enclosed inside a FUNCTION tag, to distinguish them from the logical processes. However, their input/output interface is defined in exactly the same way. For symmetry, processes and functions are invoked with the same CALL tags, containing PARAMETER tags to specify the formal-to-real parameter substitutions. The language includes some more features to help in the mapping trajectory. For instance, the programmer, or a profiling tool, may provide the scheduling modules with a hint, adding a performance estimation to any sequential function using the optional workload="number" attribute. A sketch combining these constructs is shown below.
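A minimal sketch of how these constructs may combine, assuming the tag and attribute names described above; the function names "initData" and "work", the block bodies, and the workload value are hypothetical placeholders:

    <PROCESS name="main">
      <LOCAL> ... </LOCAL>
      <BODY>
        <!-- Spawn four parallel blocks; the closing PARALLEL tag
             is an implicit synchronization point -->
        <PARALLEL p="4">
          <!-- Block 0 runs a specific call -->
          <PARBLOCK p="0">
            <CALL name="initData">
              <PARAMETER> ... </PARAMETER>
            </CALL>
          </PARBLOCK>
          <!-- Default block: fills up the remaining blocks -->
          <PARBLOCK p="*">
            <CALL name="work"> ... </CALL>
          </PARBLOCK>
        </PARALLEL>
      </BODY>
    </PROCESS>

    <!-- Sequential code, declared with the same interface style;
         the workload attribute is a scheduling hint -->
    <FUNCTION name="work" workload="100">
      <INOUT> ... </INOUT>
      ...
    </FUNCTION>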

2.3 Data Representation and Memory Model

As a coordination model, SPC-XML tags do not imply data manipulation. The only purpose of defining data containers is to describe the data-flow between logical processes and through sequential codes. Thus, SPC-XML variables are generic multi-dimensional arrays of a given data type defined in the sequential language used. Variables are defined by VAR tags, always inside the LOCAL tag of a process. There are no expression or data-manipulation tags. The only tags which imply code execution are CALL tags referring to function names (sequential tasks).

Special variables called overlaps are supported to define logical data partitions over other SPC-XML array variables. Thus, simple data-layouts may be devised. Each overlap element becomes an alias for a subarray of an associated SPC-XML variable. Overlaps are defined in the local section with the VAR-OVERLAP tag, whose attributes specify the name of the associated variable and the layout of the overlap pieces. Subarray specifications may be expressed with a colon-notation similar to Fortran90 and some macro-definitions. Moreover, the language provides some special terms which represent typical block or stride layouts.

SPC-XML provides a generic distributed-memory model. A process or function works with local copies of the parameters obtained from the caller. However, when the programmer knows it is safe to work with shared memory instead of creating local copies of a variable, she may give a hint to the compiler, using the special shared="yes" attribute on that variable. Nevertheless, the programmer should not rely on shared memory for communication, as the underlying architecture or back-end may not support it. The only communications between parallel tasks should be driven through the process/function interfaces. A sketch of these declarations follows.
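A sketch of these declarations, assuming the VAR and VAR-OVERLAP tags described above; only the tag names, the shared="yes" hint, and the Fortran90-like colon-notation are taken from the text, while the attribute names (type, size, var, layout) and the layout term are illustrative assumptions:

    <LOCAL>
      <!-- A generic two-dimensional array; the element type comes
           from the sequential language used. The shared attribute
           is only a hint to the compiler -->
      <VAR name="data" type="double" size="960,960" shared="yes"/>

      <!-- An overlap: each element P[i] becomes an alias for a
           subarray of "data", here a hypothetical row-block layout -->
      <VAR-OVERLAP name="P" var="data" layout="BLOCK(:),*"/>
    </LOCAL>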

3 Task-Graph Reconstruction and Mapping

Task decomposition is directly derived from the tag structure of an SPC-XML specification. In the case of applications whose synchronization structure may be determined at compile time, the whole task graph may be expanded and classical graph-scheduling techniques may be applied. For more data-dependent or dynamic programs, the graph represents only the potential structure, and suitable run-time scheduling tools should be included and activated in the target code to obtain the appropriate program behavior. In this paper we focus on exploiting the language characteristics for applications in the first case.

3.1 Task Decomposition and Synchronization Structure Detection

The lexical/syntactical parsing of an SPC-XML document may be completely done by generic and portable XML parsers. Due to the clear semantics and highly structured form of the documents, an application graph is easily reconstructed from its high-level specification. When considering static programs, the number of tasks and the exact shape of the graph are completely deterministic. In this task-graph we only consider one type of nodes (tasks, which may contain computation functions) and edges (precedence dependencies, which may derive in data-flow at lower implementation levels). The structure of a static application is then reconstructed as a DAG (Directed Acyclic Graph) along the following guidelines: (1) a process invocation is always expanded, inlining its content where the CALL tag is found, also doing parameter substitution; (2) a task-node is delimited by consecutive PARALLEL ending and opening tags, and it contains the function calls found inside the code it represents; (3) the content of a PARBLOCK tag is processed as a subgraph; (4) loops with a deterministic number of iterations are expanded; the closing tag of the loop is in the same task as its opening tag in the next iteration. An example of the application of these guidelines to our cellular-automata program is shown in Fig. 2.

Fig. 2. Example of graph reconstruction for a cellular-automata program (4 processors). [The figure shows the inlined process parallelCell: an initData task, loop iterations n=0 to n=999 of parallel seqCellComp tasks with interfaces INOUT P[p], IN d[p-1], IN u[p+1], a final printData task, and the data overlaps P[0..3] with their boundary pieces d[i] and u[i].]
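The loop-expansion guideline (4) applies, for instance, to the main time-stepping loop of the cellular-automata program. A sketch of the shape such a loop may take, using the LOOP, PARALLEL, PARBLOCK, and CALL tags described in Section 2; the iterations attribute name is an illustrative assumption:

    <!-- A loop with a deterministic number of iterations is fully
         expanded in the task-graph; each iteration contributes one
         parallel section (n = 0 ... 999 in Fig. 2) -->
    <LOOP iterations="1000">
      <PARALLEL p="4">
        <PARBLOCK p="*">
          <CALL name="seqCellComp"> ... </CALL>
        </PARBLOCK>
      </PARALLEL>
    </LOOP>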

3.2 Mapping and Scheduling Strategies

The mapping module of an SPC-XML compiler should use a machine-model (itself specified in another XML application). The simplest machine-model may contain only one data item: the number of processors; more complex models may contain other resource information or allocation costs (e.g. communication parameters, or the different performances of the nodes in heterogeneous clusters). The module may check graph-structure characteristics to select the most suitable scheduling or mapping technique. The selected scheduling algorithm will supply the graph with annotations about task-to-processor bindings. Workload annotations, introduced at the specification level through the appropriate attributes, will be a key point in obtaining efficient schedules and accurate cost models. Mapping decisions are also guided by communication costs. Thus, mapping modules should compute the communication volume generated by a given mapping, using the information about the input/output interfaces of the functions contained inside the nodes. At this level, several implementation variants may be compared to decide the best one for the given target machine (see e.g. [11]). A possible shape of such a machine-model document is sketched below.
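The paper does not fix the machine-model format; the following is a purely hypothetical sketch of what such an XML machine-model might contain, with every tag and attribute name invented for illustration:

    <!-- Hypothetical machine-model: tag and attribute names are
         illustrative, not part of any published SPC-XML DTD -->
    <MACHINE name="beowulf8">
      <!-- The simplest model: only the processor count -->
      <PROCESSORS number="8"/>
      <!-- Optional extra resource information -->
      <NETWORK bandwidth="100Mbit/s"/>
      <NODE id="0" relative-speed="1.0"/>
      <NODE id="7" relative-speed="0.6"/>
    </MACHINE>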

Fig. 3. Communication information and a non-nested-parallel mapping solution. [Top diagram: per-iteration data-flow between the tasks holding P[0]..P[3] through the boundary overlaps u[i] and d[i]. Bottom diagram: the redirected mapping, distinguishing edges that carry single boundary rows (1 row) from those carrying whole blocks (4k rows); the synchronization node disappears.]

The top diagram in Fig. 3 represents the information obtained from the SPC-XML function interface tags for one iteration of the cellular-automata program. This information clearly determines the data-flow between tasks. However, many low-level parallel tools or communication layers are not restricted to nested-parallelism. Compilers based on highly abstract nested-parallel models may optimize communication interchange, especially at synchronization points, if global information is available (see e.g. [2]). SPC-XML may provide enough data about communication to allow very simple but efficient optimization techniques. For instance, the mapping module may detect task nodes where some received data is not used, but only redirected to other tasks. In this case, the data may be sent directly to the task where it is used. In the bottom diagram in Fig. 3 we represent such a mapping solution. The synchronization node completely disappears, because every communication is redirected to further target nodes. Moreover, the mapping module may use information about data sizes to reduce the total communication volume. In the example, the dashed lines carry 4096 times more data volume than the non-dashed lines. Thus, the tasks vertically aligned in the figure may be scheduled to the same processor to improve locality.

This final solution, automatically obtained from the high-level nested-parallel specification, is the typical best solution designed when programming a cellular-automata application with a low-level message-passing library. As applications become more complex, this approach may reduce development and debugging effort. Moreover, porting and optimizing an application to a different target machine (for instance, a heterogeneous cluster with different resource power per node) may not imply changes in the high-level code, as data partition sizes and communications are adapted when needed.
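The interface information that drives this analysis comes directly from the function declarations. A sketch of the seqCellComp interface that would expose the neighbor data-flow of Figs. 2 and 3; the tag names and the intents follow the figures, while the parameter-name syntax (index expressions such as p-1) is an illustrative assumption:

    <!-- Each parallel task updates its own block P[p] and reads
         the boundary rows of its neighbors; from these declared
         intents the compiler derives the row-exchange
         communications of Fig. 3 -->
    <FUNCTION name="seqCellComp">
      <INOUT name="P[p]"/>
      <IN name="d[p-1]"/>
      <IN name="u[p+1]"/>
      ...
    </FUNCTION>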

Fig. 4. Speed-up of the MPI reference codes and the SPC-XML generated codes. [Four plots of speed-up versus number of processors (2 to 8): the MPI reference code and the SPC-XML generated code for a 720 × 720 matrix with 40, 80, 120, and 160 iterations; and the same two codes for 120 iterations with matrices of 240 × 240, 480 × 480, 720 × 720, and 960 × 960 elements.]

4 Experiments and Results

We have built a source-to-source translator prototype implementing some of the techniques previously discussed. Using only the number of processors of the target machine, a simple scheduling is applied and communication redirections are computed. The mapped graph is then translated by a straightforward back-end to an MPI implementation. An approach to a flexible and extensible compiler framework exploiting more SPC-XML characteristics is also presented in [12].

We compare the results obtained when executing a manually developed and optimized cellular-automata MPI program with the solutions generated by our prototype for different matrix sizes. The target machine is a heterogeneous Beowulf cluster connected by a 100 Mbit/s Ethernet network, composed of 3 workstations with an 800 MHz AMD processor, 1 with a 750 MHz AMD, and 4 with a 500 MHz AMD. We present results of a configuration where processors with different speeds are evenly interleaved. This configuration shows the smoothest scalability effects. We have used input matrices of up to 960 × 960 elements, and a number of iterations in the range [40, 160].

In Fig. 4 we show the speed-ups obtained when executing the MPI reference codes and the corresponding codes automatically generated from SPC-XML specifications. The plots show how the scaling effects obtained with different matrix sizes and different numbers of iterations are similar. We remark that we have detected some inefficiencies in the data-buffering and sequential treatment introduced by our simple back-end in the automatically generated codes. They produce small performance losses, specifically for 4 or 6 processors, due to the heterogeneous nature of the cluster. However, the performance loss is always less than 3.5% compared with the corresponding reference code. This shows that sequential code optimization may have more impact on the performance of a parallel application than the choice of a high-level structured programming model.

The results presented in this paper are extensible to most regular applications programmed in a coarse-grain style, which result in static synchronization structures. In [12] we present similar experiments with a more complicated unstructured application, using a graph-partitioning technique to balance an irregular sparse-matrix computation.

5 Conclusion

In this paper we have discussed an XML intermediate representation for nested-parallel programming languages, named SPC-XML. The design principles of this representation allow the compiler to exploit information about the synchronization structure of an application, automatically reconstructing its task-graph. Different mapping or scheduling techniques may be automatically selected as a function of the structural details of the graph. Moreover, information about data-flow and implicit communication structures is also exposed and may be easily optimized, to generate low-level codes adapted to a specific target machine. Results obtained for some applications, using a prototype compiler, show how this flexible approach may reduce the development effort, leading to efficient implementations from portable and high-level nested-parallel specifications. SPC-XML is the base for a much more generic framework. Future work will include further development of mapping techniques, focusing on more powerful symbolic mapping or scheduling strategies for SP programming models.

References

[1] R.D. Blumofe and C.E. Leiserson. Scheduling multithreaded computations by work stealing. In Proc. Annual Symp. on FoCS, pages 356–368, Nov 1994.
[2] O. Bonorden, B. Juurlink, I. von Otte, and I. Rieping. The Paderborn University BSP (PUB) library: design, implementation, and performance. In Proc. IPPS/SPDP'99, San Juan, Puerto Rico, Apr 1999. IEEE Computer Society.
[3] P. Brinch Hansen. An evaluation of high performance Fortran. ACM SigPlan, 33(3):57–64, Mar 1998.
[4] M. Cole. Frame: An imperative coordination language for parallel programming. Technical Report EDI-INF-RR-0026, Div. Informatics, Univ. of Edinburgh, Sep 2000.
[5] J. Darlington, Y. Guo, H.W. To, and J. Yang. Functional skeletons for parallel coordination. In Europar'95, LNCS, pages 55–69, 1995.
[6] A.J. Dorta, J.A. González, C. Rodríguez, and F. de Sande. LLC: a parallel skeletal language. Parallel Processing Letters, 13(3):437–448, Sep 2003.
[7] L. Finta, Z. Liu, I. Milis, and E. Bampis. Scheduling UET–UCT series–parallel graphs on two processors. Theoretical Computer Science, 162:323–340, Aug 1996.
[8] D. Gelernter and N. Carriero. Coordination languages and their significance. Communications of the ACM, 35(2):97–107, Feb 1992.
[9] A.J.C. van Gemund. The importance of synchronization structure in parallel program optimization. In Proc. 11th ACM ICS, pages 164–171, Vienna, Jul 1997.
[10] A. González-Escribano. Synchronization Architecture in Parallel Programming Models. PhD thesis, Dpto. Informática, University of Valladolid, Jul 2003.
[11] A. González-Escribano, A.J.C. van Gemund, and V. Cardeñoso. Predicting the impact of implementation level aspects on parallel application performance. In Proc. CPC'2001, Ninth Int. Workshop on Compilers for Parallel Computing, pages 367–374, Edinburgh, Scotland, UK, Jun 2001.
[12] A. González-Escribano, A.J.C. van Gemund, V. Cardeñoso-Payo, R. Portales-Fernández, and J.A. Caminero-Granja. A preliminary nested-parallel framework to efficiently implement scientific applications. In M. Daydé et al., editors, VECPAR 2004, number 3402 in LNCS, pages 541–555. Springer, Apr 2005.
[13] A. González-Escribano, A.J.C. van Gemund, V. Cardeñoso-Payo, and R. Portales-Fernández. SPC-XML(v0.4): An intermediate structured language for nested-parallel programming environments. Technical Report IT-DI-2005-0001, Dept. Computer Science, Univ. of Valladolid, Jan 2005.
[14] S. Gorlatch. Send-Recv considered harmful? Myths and truths about parallel programming. In V. Malyshkin, editor, PaCT'2001, volume 2127 of LNCS, pages 243–257. Springer-Verlag, 2001.
[15] C.W. Kessler. NestStep: nested parallelism and virtual shared memory for the BSP model. In Int. Conf. on Parallel and Distributed Processing Techniques and Applications (PDPTA'99), Las Vegas (USA), Jun–Jul 1999.
[16] K. Lodaya and P. Weil. Series-parallel posets: Algebra, automata, and languages. In Proc. STACS'98, volume 1373 of LNCS, pages 555–565, Paris, France, 1998. Springer.
[17] OpenMP ARB. OpenMP version 2.5 specification. On http://www.openmp.org/ (last access May 2005).
[18] S. Pelagatti. Structured Development of Parallel Programs. Taylor & Francis, 1998.
[19] R.A. Sahner and K.S. Trivedi. Performance and reliability analysis using directed acyclic graphs. IEEE Trans. on Software Eng., 13(10):1105–1114, Oct 1987.
[20] D.B. Skillicorn. A cost calculus for parallel functional programming. Journal of Parallel and Distributed Computing, 28:65–83, 1995.
[21] D.B. Skillicorn and D. Talia. Models and languages for parallel computation. ACM Computing Surveys, 30(2):123–169, Jun 1998.
[22] J. Valdés, R.E. Tarjan, and E.L. Lawler. The recognition of series parallel digraphs. SIAM Journal of Computing, 11(2):298–313, May 1982.
[23] L.G. Valiant. A bridging model for parallel computation. Comm. ACM, 33(8):103–111, Aug 1990.
[24] G. Wilson. Extensible programming for the 21st century. ACM Queue, 2(9):48–57, December–January 2004–2005.
