Proceedings of the 28th Annual Hawaii International Conference on System Sciences - 1995
Portrayal of Parallel Applications for Performance Prediction

Kattamuri Ekanadham and Vijay K. Naik
IBM T. J. Watson Research Center
P.O. Box 218, Yorktown Heights, NY 10598

Extended Abstract
Performance analysis and estimation play a central role in the design and development of parallel application software. In parallel environments, the performance-related parameter space is much larger than in the sequential case. As a result, simple "paper and pencil" analysis is not always a viable option. Numerous efforts have gone into developing tools to help users understand the performance of their parallel applications. Typically these tools tend to be used at run time or as post-processors [5, 6]. Performance data is gathered by monitoring the program execution and/or by collecting execution traces. These tools serve a useful purpose as performance tuning aids after an application has been parallelized. As such, they are not directly useful in designing and developing parallel applications. In this paper, we introduce a new notion called the performance portrayal of a parallel application program. The portrayal captures the speed characteristics of the rate-determining segments, the control flow, and the dataflow of an application, as well as the parameters of the parallel environment in which it is executed. The specification is expressed as a program in a Portrayal Specification Language (PSL), which we introduce in this paper. A PSL program bears close resemblance to the parent application program: it has all the relevant application program variables and their structural information, but actual computations are not performed; only timing information is computed.

Language for Performance Specification and Evaluation

Because of space limitations, we only give a flavor of the language here through simple examples. The syntax of a PSL program, including scoping rules and block structure, is modeled on the C language, except for a handful of extensions which are summarized here. There are only three types of variables in PSL programs. The types int and float are standard and are used for variables that represent parameters, loop control, etc. Variables of type var correspond to the variables in the application program. (Their actual types in the application program are irrelevant here.) They are declared using HPF-style directives. For example, in the program of Figure 1, a and b are two vectors which are block-distributed onto 4 processors.

    foo(w,n) = {
        int w,n;
        var dimension (n), processors (4), distribute (block) :: a,b;
        (i=1:n) {
            a[i] = compute 5;
            b[i] = case {
                (i == 1 || i == n): compute 5;
                default: compute w+35 using a[i], b[i-1];
            }
        }
        return a,b;
    }

    Figure 1: A sample PSL program

In a PSL program there is no code for computing the values of these variables. Instead, special constructs specify their dependences and computation costs. In Figure 1, we use the loop construct (i=1:n), which repeats the enclosed block n times. The compute clause specifies that each element of a takes 5 units to compute and has no dependences. Hence all the elements can potentially be computed in parallel. Since the array is block-distributed, each processor computes n/4 elements, so the entire vector is ready after all 4 processors spend 5n/4 units of time.
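As a quick arithmetic check on the cost argument above, the ready time of a block-distributed vector is the per-element cost times the largest block a processor receives. The helper below is our own illustrative sketch in Python, not part of PSL or the PET tool.

```python
import math

def block_ready_time(n, procs, unit_cost):
    # With a block distribution, each of `procs` processors computes a
    # contiguous block of at most ceil(n / procs) elements. The blocks
    # proceed in parallel, so the vector is ready when the largest block
    # finishes.
    return math.ceil(n / procs) * unit_cost

# Vector a in Figure 1: each element costs 5 units, 4 processors.
print(block_ready_time(100, 4, 5))  # -> 125, i.e. 5n/4 for n = 100
```

For n not divisible by the processor count, the ceiling models the processor holding the largest block determining the completion time.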
Proceedings of the 28th Hawaii International Conference on System Sciences (HICSS'95) 1060-3425/95 $10.00 © 1995 IEEE
The interior and boundary elements of b are computed differently, which is expressed by the case statement that resembles the switch statement in C. Each boundary element takes 5 units of time; each interior element takes w+35 units. The using clause specifies that the computation of each interior element of b depends on the element to its left as well as on the corresponding element of a. In order to keep track of the dependences, we impose a single-assignment constraint on the application variables. In Figure 1, the two vectors are dynamically created in the routine foo and returned. Their component specifications must occur once in the routine, and it is considered a runtime error if an element computation is specified more than once. The following code illustrates the reuse of structures and also the concept of reduction.

    var :: a,b,x;
    var dimension (50) :: r;
    (i=1:50) {
        a,b = foo(21.54,100);
        x = relax(a,b);
        r[i] = reduce (2,20) using x[i=1:100];
    }

Each invocation of foo returns a new pair of vectors a and b. They are passed to relax, which returns another vector x. Some error reduction is done on x in each iteration. The using clause gives the variables to be accumulated. The accumulation is done in two phases: first, all the values on each processor are reduced locally, and the cost for this is specified as 2 units per element; then the local sums on all the processors are accumulated, and the cost for this is given as the second parameter. This is a very simple example, and there are also other, more sophisticated ways of specifying reductions. Communication costs can be specified using the channel directive. A channel is a named entity that characterizes the pattern and costs of communication. It can model packetized communication: as data elements are accessed across processors, they are collected and sent out in packets of a specified size. Fixed and variable costs incurred by the sending and receiving processors can be specified for the channel. All the relevant quantities (processor costs, bandwidth, packetization, etc.) are parameterized and concisely specified in a channel declaration. Because of space limitations we omit further details.
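The two-phase cost of the reduce (2,20) example above can be sketched as follows. This Python helper is our own illustration; in particular, charging the second parameter once for the whole cross-processor phase is our assumption, since the abstract does not say how that cost scales with the number of processors.

```python
import math

def reduce_time(n, procs, local_unit, phase2_cost):
    # Phase 1: each processor reduces its local block of ceil(n / procs)
    # elements in parallel, at `local_unit` per element.
    phase1 = math.ceil(n / procs) * local_unit
    # Phase 2: the per-processor partial results are accumulated; we charge
    # the second parameter of `reduce` once for this phase (an assumption).
    return phase1 + phase2_cost

# 100 elements block-distributed over 4 processors, reduce (2,20):
print(reduce_time(100, 4, 2, 20))  # -> 25*2 + 20 = 70
```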
Executing PSL Programs
With the single-assignment restriction, PSL programs are functional-style programs and lend themselves to parallel execution in a natural manner. In a straightforward implementation, one can represent each function as a dataflow graph in which the nodes represent the computation of values for application variables and the arcs represent the flow of information specified by the using clauses. Each application variable (or each of its components) is represented as an I-structure [1]. An event-driven simulation can enforce all the timing constraints for processing as well as for communication. Unrelated events are executed in an over-eager fashion; the language also has some provisions to sequence certain program segments to balance this. Based on this work, we have developed a performance estimation tool called PET. Some highlights of PET and preliminary results are reported in [2]. Approaches similar to ours are reported in [3] and [4]; however, our methodology differs in the level of abstraction and detail. A distinguishing feature of our approach is that the PSL specification bears close resemblance to a sequential program written for a shared-memory model.

References

[1] Arvind and K. Ekanadham. Future Scientific Programming on Parallel Machines. Journal of Parallel and Distributed Computing, vol. 5, pp. 460-493, 1988.

[2] K. Ekanadham, V. K. Naik, and M. S. Squillante. PET: Parallel Performance Estimation Tool. To appear in Proceedings of the 7th SIAM Conference on Parallel Processing for Scientific Computing, 1995.

[3] T. Fahringer, R. Blasko, and H. Zima. Automatic Performance Prediction to Support Parallelization of Fortran Programs for Massively Parallel Systems. In Proceedings of the International Conference on Supercomputing, 1992.

[4] P. Mehra, C. Schulbach, and J. Yan. A Comparison of Two Model-Based Performance-Prediction Techniques for Message-Passing Parallel Programs. In Proceedings of SIGMETRICS'94, 1994.

[5] D. Reed, et al. The Pablo Performance Analysis Environment. Technical Report, University of Illinois, 1992.

[6] S. Sarukkai and D. Gannon. SIEVE: A Performance Debugging Environment for Parallel Programs. Journal of Parallel and Distributed Computing, vol. 18, pp. 147-168, 1993.
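To make the execution model described under "Executing PSL Programs" concrete, the toy evaluator below treats each (variable, index) computation as a dataflow node whose finish time is its own cost plus the latest finish time of its using-clause inputs. It is our own illustrative sketch: it ignores processor placement, contention, and communication costs, and every name in it is hypothetical rather than part of PSL or PET.

```python
def evaluate(costs, deps):
    # `costs` maps each node to its compute cost; `deps` maps a node to the
    # nodes named in its `using` clause. A node finishes at its cost plus
    # the latest finish time among its inputs (memoized recursion).
    finish = {}
    def fin(node):
        if node not in finish:
            finish[node] = costs[node] + max(
                (fin(d) for d in deps.get(node, ())), default=0)
        return finish[node]
    return {node: fin(node) for node in costs}

# Figure 1 with n = 4 and w = 5: a[i] costs 5 with no dependences,
# boundary elements of b cost 5, interior elements cost w+35 = 40 and
# use a[i] and b[i-1], forming a serial chain through b.
n, w = 4, 5
costs = {("a", i): 5 for i in range(1, n + 1)}
costs[("b", 1)] = costs[("b", n)] = 5
for i in range(2, n):
    costs[("b", i)] = w + 35
deps = {("b", i): [("a", i), ("b", i - 1)] for i in range(2, n)}
times = evaluate(costs, deps)
print(times[("b", n - 1)])  # -> 85: last interior element of the chain
```

The serial chain through b shows why the using clauses matter: even with unlimited processors, b[3] cannot finish before 40 + (40 + 5) = 85 units.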