Programming in PCP

Brent Gorda, Karen Warren and Eugene D. Brooks III

June, 1991
Abstract
PCP is an implementation of the split-join parallel programming paradigm for the C programming language. In the split-join paradigm a team of processors executes the user program from main() to exit(). The model allows for exploitation of nested parallelism via a mechanism called team splitting. All of the features of PCP are block structured and allow for arbitrary nesting of parallel constructs. In this manual we document PCP and give examples of its use in writing portable parallel programs.
Work performed under the auspices of the U. S. Department of Energy by the Lawrence Livermore National Laboratory under Contract W-7405-ENG-48.
DISCLAIMER

This document was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor the University of California nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise, does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or the University of California. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government, and shall not be used for advertising or product endorsement purposes.
Contents

1  Introduction
2  The PCP Split-Join Model
3  Program Structure
4  Basic PCP Vocabulary
   4.1  master
   4.2  barrier
   4.3  forall
   4.4  lock, unlock
   4.5  private, shared
5  PCP Teams
   5.1  Team State
   5.2  Team Splitting
        5.2.1  split
        5.2.2  splitall
   5.3  Teamprivate
6  Memory Allocation
7  BBN Specific Features
8  I/O
9  How to Compile and Load PCP Codes
   9.1  Makefile
10 Examples
11 PCP Error Messages
12 Reference Manual
   12.1  Lexical Conventions
         12.1.1  Keywords
         12.1.2  Constants
         12.1.3  PCP Predefined Variables
   12.2  Declarations
         12.2.1  Storage Class Modifiers
         12.2.2  Type Specifiers
   12.3  Statements
         12.3.1  Scheduling
         12.3.2  Synchronization
   12.4  Preprocessing
         12.4.1  File Inclusion

A  PCP Man page
B  Future PCP

References
1 Introduction

In programming for a parallel machine there are many things to consider: allocation of processors, scheduling, communication, and synchronization. The parallel tasks must be identified and access to shared data items must be controlled. Currently there does not exist a compiler that will take a C language program, analyze it, tackle the issues of good memory management, minimization of processor communication, and synchronization, and compile it into the most effective, efficient use of your multiprocessor. It is the opinion of some researchers working on these problems that such automatic tools will not exist within a decade, if ever.

The PCP [1] programming language puts the issues of scheduling, communication and synchronization directly into the hands of the programmer so that efficient parallel code can be written for the multiprocessors that are available today. In PCP one programs for a virtual multiprocessor using the same notion of spin waiting that would be used on a real multiprocessor. In a spin wait the processor keeps polling the resource's availability and does not take up another task in the meantime. PCP allows the user to specify which portions of the program are to be executed in parallel, which are to be executed by subteams, and which are to be executed by one processor only.

A portable PCP program allows any number of processors to be allocated to the job. The number of processors is selected by the user of the program at run time and may be varied from one processor to any number of processors which still provides efficient execution. Through the use of a compile time flag, the PCP preprocessor can also produce serial code which does not contain the run time synchronization overhead required for parallel execution and provides an excellent check of parallel results.

PCP supports an extension of the Single-Program-Multiple-Data (SPMD) model that is available in H. Jordan's FORCE [2] and IBM EPEX/FORTRAN [3]. The key additional concept is the notion of team splitting that allows an arbitrary subdivision of the team of processors executing the code, and allows each sub-team to execute arbitrarily different code within the constraint of block structure. This is a powerful extension of the SPMD programming model which supports the exploitation of nested concurrency, for both subroutines and nested loops, in a flexible way.
By default, all statically allocated data is shared and thus accessible by all of the processors; on some implementations the default treatment of data is switchable between shared and private using compile time flags. Stack, or auto, data is private to a processor and is stored in a processor's local memory if it exists.

PCP, the Parallel C Preprocessor, was originally written to solve the problem of the portability of C language based parallel programs among several different shared memory multiprocessors. The preprocessor is machine independent; the amount of machine dependent run time support is small and can be easily implemented with fast inline code. PCP has been used successfully on the Alliant, Sequent, Cray, SGI and Stellar machines. The current development platform is the BBN TC2000 machine.

In the sections which follow, we will describe the PCP vocabulary and demonstrate with simple examples how it can be used.
2 The PCP Split-Join Model

A split-join parallel program is multi-threaded at the highest level. The premise of the split-join model is that all processors start at the beginning of the program execution, execute the same program, and stay in operation until the end of the program. One might think in terms of a team of processors executing the program. The advantage of the split-join model, compared to the fork-join model, is the low overhead associated with the exploitation of nested concurrency. Time is not spent adding a processor to the team or dropping a processor from the team.

In parallel programming parlance, there are threads, streams, tasks, processes, processors, etc. There is much confusion about what these terms mean because vendors and research institutions have used the terms to mean different things. The PCP user need only be concerned with processors. In PCP the parallel job runs on a virtual multiprocessor. From the user's point of view, each team member is a physical processor and can be programmed as such. A piece of work or task is a unit of coding that can be scheduled. A task could be a function call or a pass through a do-loop. Tasks may be divided up among different team members, perhaps assigning different iterations of a loop to different processors, or having one processor perform serial work while the others wait.
Using PCP, the user's program starts out with all of the team members (processors) executing the code. It is up to the user to specify serial sections and manage synchronization using the PCP extensions to C.

Live lock can occur when processes are busy waiting for an event that will not happen if the process that will post the event is not scheduled. The PCP runtime system will ensure that several parallel jobs which over-commit the available processor resources will be scheduled in such a way that live lock cannot occur. This frees the programmer and the PCP runtime support to use spin waiting for processor synchronization. Spin waiting is the highest performance solution to the synchronization problem that is available on all machine targets. Some multiprocessor systems support high level synchronization constructs which have implicitly attached scheduling support, but each vendor's synchronization primitives constrain the programming model in a unique way and most are too inefficient for our purposes.

Using the example shown in Figure 1, which calculates a simple sum, we will show how PCP is used to write an effective, efficient C language program to run on a parallel machine. Each section will define a new term and we will show a simple example of its use. Below is the usual C form of our sample program. It may be helpful to try out the examples as you read this document. The Makefile for building a PCP program is found in Section 9.1.

#include <stdio.h>

int sum = 0;

main()
{
    int i;
    for (i = 0; i < 24; i += 1) {
        sum += i;
    }
    printf("sum = %d \n", sum);
}

Figure 1: example in C
3 Program Structure

Given a well structured code, PCP requires no basic changes in the structure other than those which may make more effective use of parallelism. For those converting regular C programs to PCP, there are no changes required in the standard keywords and syntax of C. Every PCP module which makes explicit use of concurrency constructs must contain the #include statement. (This wart comes from the days of a dumb preprocessor, and macros, based on the standard C preprocessor. We should probably consider dropping the requirement and having the PCP preprocessor handle it invisibly.) A PCP main program has the form shown in Figure 2.

#include <pcp.h>

main(argc, argv)
int argc;       /* Argument count   */
char **argv;    /* Argument strings */
{
}

Figure 2: PCP program structure
4 Basic PCP Vocabulary

4.1 master

A block of code that is to be executed by only one processor in a team, called the team master (the other processors, if any, are called the team slaves), is set off using the syntax below. Within the context of a specific team, a specific processor will execute the code delimited by a master block. A master block is often used in the portion of the program that performs initialization. Input/output and memory allocation is also often performed within a master block. At a much smaller scale of granularity, master blocks are used to initialize shared data such as accumulators which all team members will access.
master {
}
Naive programmers might be tempted to regard a master block as a global serial section, but one should be very careful about this. If team splitting, discussed later in Section 5.2, has occurred, a master block is not a global serial section unless other means have been used to prevent two teams from executing a particular master block concurrently. A master block is a serial section only in the context of a specific team. If team splitting has taken place, an arbitrary number of teams may be independently executing your program.

Arbitrary PCP code may be enclosed by a master block. One might think that doing this would have little use as the other processors associated with the team do not participate in the execution of the enclosed concurrent work. There is one good use for this, however. Because a master block allows only the team master to execute the enclosed code, it can often be used to isolate a race condition that exists due to inadequate synchronization, or inadvertent data sharing.
4.2 barrier

The team of processors executing the code freely runs through it unless explicit synchronization primitives are encountered. One basic, and frequently used, form of synchronization is the barrier:

barrier;
A barrier requires all members of the team to arrive at the barrier before any are allowed to continue. Each team has its own distinct barrier. A barrier is often used after a master block, or a forall loop, to ensure that the preceding work is complete before any processor is allowed to continue on. A fast algorithm [4, 5] which has no hot spots (a hot spot is a shared memory variable for which all processors are vying) or critical regions has been implemented for the PCP runtime support.
In the example below, one team member performs the summing and printing of the result. (The examples shown explicitly declare shared variables to be shared. This is not necessary, as shared is the default for non-automatic variables, but it is strongly suggested for program portability.) The barrier statement guarantees that the master block is completed before the members of the team continue. Note that the variable, sum, is an external variable that is shared by all the processors.

The code example shown in Figure 3 is a trivial conversion of a serial C code to PCP code. A master block encloses the entire executable code of the program, ensuring that only one processor executes it, and the barrier statement prevents the other team members from exiting before the team master has completed its work. If any team member were to be allowed to exit earlier, the job could end before the sum is printed. A job ends when any team member exits.

#include <stdio.h>
#include <pcp.h>

shared int sum = 0;

main()
{
    master {
        for (int i = 0; i < 24; i+=1) {
            sum += i;
        }
        printf("sum = %d \n", sum);
    }
    barrier;
}

Figure 3: master - barrier example

4.3 forall
The forall loop is the PCP concurrent equivalent of the C language for loop. It achieves a fine-grained parallelism by dividing up the passes of the for loop among the members of the team:

forall (int var = <start>; <condition>; var += <increment>) {
}
The indices of the loop are interleaved among the members of the executing team. The loop index variable must be declared in the forall statement. (Should we relax this requirement?) We have borrowed this syntax from C++ to remind the programmer that the loop index is not defined after the closing brace of the loop body. The <start> and <increment> expressions are currently restricted to simple constants or variables. The <condition> expression is unrestricted but not checked for sanity. forall loops may be nested arbitrarily, although the likelihood of any concurrency being taken advantage of in a nested forall loop is quite low. We show the use of forall in the example in Figure 4.

#include <stdio.h>
#include <pcp.h>

shared int sum = 0;

main()
{
    forall (int i = 0; i < 24; i+=1) {
        sum += i;
    }
    barrier;
    /* NOTE: the sum obtained is not the same as the serial result! */
    master {
        printf("sum = %d \n", sum);
    }
    barrier;
}

Figure 4: forall example
The barrier following the forall loop ensures that each member of the team has executed its portion of the loop before continuing. However, there is now a problem with how the shared variable, sum, is being incremented. Since the team members are reading sum, adding their individual increments to it, and then storing the value back to sum in any order and at any time, there is no guarantee that the sum that is printed out will be correct. Obviously we need to synchronize access to the variable, sum. We shall show how this problem may be solved with the use of locks.
4.4 lock, unlock

Often there are critical sections of the code that must be executed by only one processor, or one team member, at a time. Concurrency must be inhibited in a statement that reads, modifies, and then writes a shared variable. To prevent team members from destructively interfering with each other, entrance to a critical section of a code must be restricted so that only one processor may execute it at a time. This is accomplished by using a lock.

PCP offers spin wait locks that are implemented by variables of the lock data type, which has the two states locked and unlocked. A lock variable is a statically allocated and initialized C data type:

lock var = unlocked;

The functions that change the state of a lock are lock() and unlock(), which take the pointer to the lock variable as an argument. lock(), when passed a pointer to a variable of the data type lock, waits until the lock is unlocked and then atomically sets it to locked. unlock(), when passed a pointer to a lock, sets it to unlocked. A lock is used to protect a critical section in the following way:

lock(&var);
unlock(&var);
If the lock variable is shared, then the critical section is global. If the lock is declared teamprivate, see Section 5.3, then the critical section is local to the team. If for some strange reason the lock is a private variable, see Section 4.5, then the lock function is essentially a no-op.

In Figure 5, we show an example of using a lock to safeguard a critical section.

#include <stdio.h>
#include <pcp.h>

shared lock lsum = unlocked;
shared int sum = 0;

main()
{
    forall (int i = 0; i < 24; i+=1) {
        lock(&lsum);
        sum += i;
        unlock(&lsum);
    }
    barrier;
    master {
        printf("sum = %d \n", sum);
    }
    barrier;
}

Figure 5: lock example

The example will now give the correct sum, although it will not perform as well as one might like. The undesired side effect of creating a critical region to protect the access to the shared variable, sum, is that we have restricted the concurrency which can be exploited in the program. In this case, each processor must wait for a turn to enter the critical region before it can contribute an element to sum, and we find that the parallel version, although it now produces the correct result, does not run much faster than the serial version. In this case, we can fix the problem by having each processor compute a partial sum and then add in its contribution to the global sum after completing its portion of the forall loop. See Figure 6. The advantages of using partial sums are much less contention for the critical region, as each processor need only enter it once, and the potential use of machine registers by an optimizing compiler to efficiently compute the partial sum.
4.5 private, shared

PCP assumes an interleaved shared memory and a way of causing all extern and statically allocated variables to default to the shared status, i.e. to be shared among all processors. On the BBN TC2000, it is best to use the storage class modifier, shared, for all shared variables:

shared type-specifier var[, var]...;
#include <stdio.h>
#include <pcp.h>

shared int sum = 0;
shared lock lsum = unlocked;

main()
{
    private int partial_sum = 0;  /* zero out each partial_sum */

    forall (int i = 0; i < 24; i+=1) {
        partial_sum += i;         /* increment each partial_sum */
    }
    barrier;                      /* wait until loop is done */

    lock(&lsum);
    sum += partial_sum;           /* sum up partial_sum's */
    unlock(&lsum);
    barrier;                      /* wait until all processors */
                                  /* contribute their partial sum */
    master {
        printf("sum = %d \n", sum);
    }
    barrier;
}

Figure 6: processor partial sum example

Local memory is optional; to have static variables unique to a processor, PCP offers the storage class modifier, private:

private type-specifier var[, var]...;
A processor has exclusive access to its private data. private data resides in the local memory associated with a given processor, or is implemented using arrays in shared memory on those machines lacking support for local memory. In the implementation on the BBN TC2000, memory may be defaulted to private or shared via a compile line flag. See the man page (Appendix A) for more information.
5 PCP Teams

In PCP a grouping of processors is called a team. Team members, or processors, are not created and destroyed within a PCP program. A team of processors, its size set by user and system controlled means, enters the main program at the start of the job. The team of processors entering the main program may dynamically split up into two or more sub-teams, however. In this operation, called team splitting, a processor only requires access to a small number of local variables which record its identity and team membership. Each processor selects membership in a new team in response to dynamical runtime conditions without communicating with other processors. The first team member to exit or encounter an unhandled exception causes the program to stop.
5.1 Team State

Each member of a team carries in its local memory a group of static variables which identifies it. The sophisticated PCP programmer may want to use the variables that describe the team state. These are:

NPROCS   A read only variable, the number of processors executing the program. It is the size of the team which enters main.

IPROC    A read only variable, the processor index. It has a value unique to each executing processor in the range from 0 to NPROCS-1.

TSIZE    The team size. If no team splitting has occurred, this will be equal to NPROCS.

TINDEX   The index of the member within the team. It has a unique value within the team in the range 0 to TSIZE-1.

TDESC    The team descriptor, a non-negative value unique to the team.
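As an illustrative sketch (the array a, the function scale(), and its arguments are hypothetical and not taken from the manual), the team state variables, declared in pcp.h as _TINDEX and _TSIZE (see Section 12.1.3), can be used to interleave the iterations of an ordinary for loop by hand, much as forall does automatically:

#include <pcp.h>

shared double a[1024];          /* shared data operated on by the team */

void scale(double factor, int n)
{
    /* Each team member takes every _TSIZE-th element, starting at its
       own index _TINDEX, so the team covers 0..n-1 exactly once. */
    for (int i = _TINDEX; i < n; i += _TSIZE) {
        a[i] *= factor;
    }
    barrier;    /* wait until every member has finished its share */
}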
5.2 Team Splitting

The key difference between PCP and the SPMD programming models which have preceded it is the support for the concept of team splitting. A team may disassociate into two or more sub-teams at run time, each processor requiring only a few operations on its locally stored team state to accomplish the feat. The sub-teams rejoin to become the parent team again later on, in a block structured way. Team splitting is handled with fast inline code and might be considered the efficient substitute for the fork-join operation. The cost of team splitting is independent of team size, but may depend on the number of sub-teams into which a parent team is split.
5.2.1 split

To divide up a number of tasks, which is known at compile time, among sub-teams which are split from the parent, one uses static team splitting:

split [weight1] {
} and [weight2] {
} ... and [weightn] {
}

The tasks may be executed in any order, including sequentially if the team encountering the split statement cannot be split for some reason. The progress of the algorithm must not depend upon incremental progress of each task running in parallel if the program is to be portable to multiprocessors with small processor counts. If one task is much greater than another, one may assign weights to the blocks of work to achieve load balancing.
The weights determine the fraction of the current team's processors which are split into each subteam. The weights must be an expression which evaluates to a number less than one. Regardless of the weights, each task will be given a subteam of at least 1 processor. The remaining processors are used for the last task. See Figure 7 for an example of a weighted split.

#include <stdio.h>
#include <pcp.h>

shared int sum1 = 0;
shared int sum2 = 0;
float wt1 = .3;
float wt2 = .7;

shared lock lock1 = unlocked;
shared lock lock2 = unlocked;

main()
{
    split wt1 {
        forall (int i = 0; i < 6; i+=1) {
            lock(&lock1);
            sum1 += i;
            unlock(&lock1);
        }
    } and wt2 {
        forall (int i = 0; i < 14; i+=1) {
            lock(&lock2);
            sum2 += i;
            unlock(&lock2);
        }
    }
    barrier;    /* wait until loop is done */
    master {
        printf("sum1 = %d sum2 = %d \n", sum1, sum2);
    }
    barrier;
}

Figure 7: split example
5.2.2 splitall

The dynamic version of team splitting is the splitall loop:

splitall (int var = <start>; <condition>; var += <increment> [; n1] [; n2]) {
}

When a team encounters a splitall loop, it splits into a number of subteams to which the indices of the loop are interleaved. If more concurrency is encountered in the enclosed code and the sizes of the subteams executing the indices of the splitall loop are not one, this concurrency will be exploited. As in the doall construct, the loop is executed by all encountering team members when there are not enough members to split into subteams.

The optional integer expression, n1, gives the desired number of teams; the optional integer expression, n2, gives the desired team size. All of the members of the incoming team are divided as evenly as possible into n1, the number of subteams desired. If n2, the new team size, is given, then the job will be assigned to as many teams as possible of the desired team size, the splitall indices being stripped out to these teams. The other team members will continue on with the code following the splitall block.

To aid in the use of the splitall construct, the PCP flag

-nt n

specifies the default number of subteams for splitall. If the -nt n flag is not used, the default number of subteams for splitall is 2. Thus if neither n1 nor n2 is specified, then the number of teams is 2 or the value of the -nt n flag. As in the static split, if the encountering team size is 1, then there is a new team descriptor, TDESC, but no subdivision of tasks.

To give a trivial application of team splitting, consider the parallel computation of a set of matrix vector products. If a team split would be profitable, the team encountering the splitall block is divided into subteams, each subteam handling a subset of the indices i. The library routine mvprod() is designed for team entry and contains parallel language constructs designed to efficiently exploit the parallelism of each matrix vector product.

double **result;
double ***matrices;
double **multplcnd;
int dim;
int number;

splitall(int i = 0; i < number; i += 1) {
    mvprod(result[i], matrices[i], multplcnd[i], dim);
}

Figure 8: splitall example

If the team which enters the splitall loop has 100 processors and the number and dimension of the matrix-vector products is 5 and 20, respectively, we see that the use of team splitting will have a substantial impact on program performance.
5.3 Teamprivate

To have static data shared among the members of a particular team, but treated as private to the team, one uses the storage class modifier, teamprivate:

teamprivate type-specifier var[, var]...;

Note that access to a teamprivate variable must be synchronized because it is shared among team members. Teams are dynamic associations of processors which are frequently created and destroyed. Each live team (existing association of processors) has a team descriptor which uniquely identifies it and which is used to index an array in shared memory to implement teamprivate data. Team descriptors are reused as teams are created and destroyed. Because of this, initializations for teamprivate data have no meaning, as the storage area for a specific team descriptor may have been corrupted by a previously existing team. Currently team descriptors have a range from 1 to 256, or 1 to n, n < 1024, if the flag

-mxtd n

has been used; see Section 5.2.2.

PCP quickly calculates the new team descriptor for team splits in local memory by shifting the current team descriptor left n times, where n is the log (base 2) of the number of new subteams, and "or-ing" in an integer from 0 to the (number of subteams - 1). Depending on the size of the split, this method may result in the use of non-sequential team descriptors. Some users prefer space to speed and therefore we have implemented a "compact" team descriptor method. It is activated by the runtime flag,

-cmptd n

where n is the starting integer for the compact method. Thus -cmptd 2 will result in the use of sequential team descriptors from 2 through 256, or 2 through n if the compiler flag,

-mxtd n

has been used. If the runtime flag, -cmptd 2, and the compile time flag, -mxtd 6, are used, then team descriptors 2, 3, 4, 5, and 6 will be used. An error exit will occur if the code needs more descriptors. If the user wishes to change the lower limit of the compact team descriptors during run time, the processors at the topmost split level must call the function, pxp_td_chg(n), where n is the new lower limit.
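As an illustrative sketch (the variable team_id and the function report() are invented for illustration, not part of the manual), a teamprivate variable is shared within a team but, since initializers have no meaning for it, is set at run time, for instance by the team master:

#include <stdio.h>
#include <pcp.h>

teamprivate int team_id;    /* one copy per live team; static initializers
                               have no meaning for teamprivate data        */

void report(void)
{
    master {
        team_id = _TDESC;   /* the team master records the team descriptor */
    }
    barrier;                /* make sure team_id is set before it is read  */
    printf("member %d of team %d\n", _TINDEX, team_id);
}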
6 Memory Allocation

PCP functions to allocate memory are used in the same way as the malloc function, which takes the requested size as an argument and returns a pointer. To allocate shared memory, one uses the function shmalloc. To allocate private memory, one uses the function prmalloc. To allocate default memory, the function malloc is available. Default memory may be controlled through the use of the PCP flags, "-ms" and "-mp". See Section 9.1 on how to compile and load PCP codes.
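A minimal sketch of these calls follows (the names table, scratch, and setup() are ours, chosen for illustration; shmalloc and prmalloc take a size and return a pointer, as described above):

#include <pcp.h>

shared double *table;       /* the pointer itself is shared by everyone */
private double *scratch;    /* each processor keeps its own pointer     */

void setup(int n)
{
    master {
        /* one shared block, visible to every processor */
        table = (double *) shmalloc(n * sizeof(double));
    }
    /* every processor allocates its own private scratch space */
    scratch = (double *) prmalloc(n * sizeof(double));
    barrier;    /* make sure table is allocated before any member uses it */
}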
7 BBN Specific Features

There are several features specific to the BBN TC2000 machine that may be used. It is to be noted, however, that using these features causes programs to be potentially non-portable to other machines. At best, a program which takes advantage of these specifics may only run slower on another implementation. To take advantage of special dynamic memory options present on the BBN TC2000, the heapmalloc function is available. On the BBN TC2000, see the man page on heap for specifics on the use of the heapmalloc routine.
8 I/O

There are two possible implementations of I/O support in a PCP execution environment. The first is one in which file descriptors and file position pointers are shared by all processors. In the second implementation, file descriptors are private to each processor. A given PCP implementation may provide one, the other, or both of these paradigms. The current implementation on the BBN TC2000 only supports private file descriptors and position pointers. The implication of private file information is that I/O is not guaranteed to produce correct results if done in parallel to the same file by multiple processes.
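As a hedged sketch of one way to live with private file descriptors (the file name progress.log and the function checkpoint() are invented for illustration), I/O to a shared file can be confined to the team master, in the spirit of the master block usage described in Section 4.1:

#include <stdio.h>
#include <pcp.h>

void checkpoint(double residual)
{
    /* With private file descriptors, concurrent writes to the same file
       from several processors may interleave unpredictably, so only the
       team master performs the I/O. */
    master {
        FILE *fp = fopen("progress.log", "a");
        if (fp != NULL) {
            fprintf(fp, "residual = %g\n", residual);
            fclose(fp);
        }
    }
    barrier;    /* the rest of the team waits for the write to finish */
}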
9 How to Compile and Load PCP Codes

9.1 Makefile

PCP codes are contained in files with the suffix .pcp. They are compiled and loaded in a manner similar to that of compiling and loading ordinary C codes, with the addition that files ending in .pcp are sent to the PCP preprocessor first. The compiler invokes the loader, ld. In addition to compiler and loader flags, there are special flags for the PCP preprocessor also. Invoke the PCP preprocessor/compiler with the command:

pcp [option ...] file ...

PCP accepts several types of files:

.pcp    PCP source files
.pfp    PFP source files
.PFP    PFP source files - to be passed through the C preprocessor
.c      C source files
.f      Fortran source files
.s      Assembly language source files
.o      Object files
.po     Object files
The following option is enabled by default:

-ms         default shared memory

PCP options available:

-v          verbose printout; all commands executed are displayed on the screen.
-n          similar to -v, but the commands are not executed.
-o output   the output file is called output instead of a.out. When used with the -c flag, names the output output rather than file.o, which allows generation of .po files.
-V          print the PCP version number.
-pcp        use PCP with the given version number appended; e.g., -pcp2.0b. The version may also be old or new; e.g., -pcpnew.
-pfp        same as above, for PFP; e.g., -pfp2.0d.
-E          run the preprocessor only; results in a .f or .c file.
-ms         default dynamic memory to shared.
-mp         default dynamic memory to private.
-serial     produce a serial code; helpful for checking results.
-nt n       default number of subteams for splitall.
-mxtd n     maximum team descriptor value.

PFP runtime options available:

-cmptd n    lower limit of team descriptors for compact method.

We show a typical Makefile for compiling and loading a PCP program, pgm:

PCP=pcp
PCPFLAGS=
SRCS = pgm.pcp
PROGS = pgm
SPROGS = spgm

all: $(PROGS)

pgm : pgm.pcp
	$(PCP) -o pgm pgm.pcp $(PCPFLAGS)

spgm : pgm.pcp
	$(PCP) -o spgm -serial pgm.pcp $(PCPFLAGS)

clean:
	rm -f *.o

clobber: clean
	rm -f $(PROGS) $(SPROGS)
This Makefile allows the user to conveniently obtain a binary which is compiled for serial execution only. This is especially convenient for debugging and performance comparisons. The serial version of the program has the same name as the normal parallel version, except that an s has been prefixed to the name. The user, however, can change this as desired.

The header file, pcp.h, is installed in the directory /usr/local/include on most systems. Many C compilers look in this directory automatically for header files for locally supported software. The standard PCP runtime library archive is installed in /usr/local/lib. The PCP preprocessor itself (actually a driver and a collection of filters) is installed in /usr/local/bin. On the BBN TC2000, each new version of PCP becomes pcpnew or a beta test version. As it is deemed reliable, it becomes the standard pcp. Older versions are also available to the user; see the man page for release info.
10 Examples

A more complex example:

/*
   This routine solves the linear system A.X = B using straightforward
   Gauss elimination without pivoting.  The routine mungs a, leaving the
   results of the reduction in it, and puts the solution X in the array B.
   For a description of the method used, see "Matrix Computation for
   Engineers and Scientists" by Alan Jennings, Section 4.1.  The comments
   refer to section numbers in the book.
*/
#include <stdio.h>
#include <pcp.h>

void dgauss(a, b, dim)
double **a;
double *b;
int dim;
{
    int i, k;

    /* We first do dim reduction steps. */
    for(k = 0; k < dim; k += 1) {
        if(a[k][k] == 0.0) {    /* check */
            fprintf(stderr, "dgauss: a[%d][%d] = 0\n", k, k);
            exit(1);
        }
        forall(int i = k+1; i < dim; i += 1) {
            double xtemp = a[i][k];
            if(xtemp == 0.0) continue;
            a[i][k] = 0.0;                    /* eq (4.1a) */
            xtemp /= a[k][k];
            for(int j = k + 1; j < dim; j += 1) {
                a[i][j] -= a[k][j] * xtemp;   /* eq (4.1b) */
            }
            b[i] -= b[k] * xtemp;             /* eq (4.1c) */
        }
        barrier;
    }

    /* Now we perform dim back substitutions. */
    for(int i = dim - 1; i >= 0; i -= 1) {
        master {
            b[i] /= a[i][i];
        }
        barrier;
        forall (int k = i-1; k >= 0; k -= 1) {
            b[k] -= a[k][i] * b[i];
        }
    }
    barrier;
}
11 PCP Error Messages

Below we list the error messages which are currently produced by the PCP preprocessor, and what they mean. In addition to these error messages, the C compiler may produce error messages of its own. The PCP preprocessor passes line directives to the C compiler so that error messages which it produces refer to the original .pcp source file. Sometimes things can still be a bit cryptic, possibly due to an error in the preprocessor itself or a strange interaction of the PCP runtime concurrency control code with the user's code. In this case, modifying the default rules of the Makefile, discussed earlier in Section 9.1, so that the .c files are not removed but are left for inspection, can be quite helpful in determining the problem. If you experience a problem with the error reporting built into PCP, please report this to the appropriate MPCI core staff member so that something can be done to prevent its recurrence.

"usage: pcp [-v|-n|-V|-E|-ms|-mp|-serial|-gnu] [compiler options] [-o file] file ..."
Explanation: Incorrect arguments to pcp. See the man page in Appendix A for flag meanings.

"nesting limit for includes exceeded at line nn file xxx"
Explanation: No more than 10 nested include files allowed.

"unterminated include on line nn of file xxx"
Explanation: An incorrectly formatted include directive with a missing closing " or > has been encountered.

"forall: too many nested forall loops"
Explanation: No more than 100 nested forall loops allowed.

"pcp forall: line nn: xxx operation not supported in forall"
Explanation: Method of incrementing the forall index not allowed.

"Bad "master" usage on line nn"
Explanation: Incorrect usage of master.

"Attempt to initialize xxx data on line nn of file xxx"
Explanation: Variables cannot be initialized in a teamprivate declaration.

"Bad xxx declaration, line nn, file xxx"
Explanation: Problem with a private or teamprivate declaration.

"split: non matching braces"
Explanation: Each task in a split must be surrounded by matching braces.

"split: too many split levels: nn"
Explanation: Maximum of 4 split levels allowed.
12 Reference Manual

12.1 Lexical Conventions

Below we list the translation units which have special meaning to the PCP lexical analyzer.

12.1.1 Keywords

The following identifiers are reserved for use as PCP keywords and may not be used otherwise:

barrier
forall
lock
locked
master
private
shared
split
splitall
teamprivate
unlock
unlocked

12.1.2 Constants

PCP lock variables (see Section 12.2.2) may have only two values:

locked
unlocked

12.1.3 PCP Predefined Variables

The variables listed below are declared in the file pcp.h and describe the team state. These variables are local to a processor; there are no hot spots associated with accessing them. Because the team state is in local storage, these variables may be freely cached in registers by the C compiler. See Section 5.1 for a more extended discussion of their uses.

extern int _NPROCS;
extern int _IPROC;
extern int _TSIZE;
extern int _TINDEX;
extern int _TDESC;
12.2 Declarations

12.2.1 Storage Class Modifiers

There are three modifiers for storage classes in PCP:

private
shared
teamprivate

private is used to indicate that extern or static data is local to a processor, with each team member having its own copy. On some implementations processor private memory is supported by the operating system and is used directly. On other systems all memory is shared between processors and private memory is implemented using arrays in shared memory, indexed by _IPROC.

shared is used to indicate that extern or static data is global to all the processors.

teamprivate is used to indicate that extern or static data is local to a team of processors, but shared among them. It is implemented using arrays in shared memory indexed by the team descriptor, _TDESC.
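A brief sketch of the three modifiers in use (the variable names are invented for illustration):

shared int histogram[256];    /* one copy, visible to all processors      */
private int seed;             /* one copy per processor, in local memory  */
teamprivate int chunk_base;   /* one copy per live team, shared within it */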
12.2.2 Type Specifiers

The PCP type,

lock

is a variable with two states, locked and unlocked, which is used in conjunction with the routines lock() and unlock() to implement critical regions. If the lock is in shared memory the critical region is global in scope; if the lock is in teamprivate memory the critical region is enforced only among members of the current team. If the lock is in private memory, you have made a mistake.
12.3 Statements

PCP introduces several new statements for parallel processing. These treat the scheduling of tasks and the synchronization of the processors performing them.
12.3.1 Scheduling

Because PCP is a split-join model, the entire team of processors is executing the code from start to finish. The following statements allow the user to schedule tasks for the members of the team:

master { }
forall (int var = <start>; <condition>; var += <increment>) { }

These statements allow the user to schedule tasks for subteams. The statements create the teams as well as scheduling the tasks for them:

split { } and { }
splitall (int var = <start>; <condition>; var += <increment>) { }

When a team encounters a master block, one team member, known as the master, executes the code within the block. When a team encounters a forall loop, the indices of the loop are interleaved among the processors. Static team splitting is specified by split and and. Concurrent execution of the code following the two keywords takes place. splitall parcels out the indices of the loop to a collection of sub-teams.
12.3.2 Synchronization

PCP offers the following synchronization operations which take the form of expression statements:

barrier;
lock(&var);
unlock(&var);

The barrier operation requires all team members to arrive at that point before continuing. The lock() operation spins until the given lock variable, whose pointer is the argument to the function, is unlocked and then locks it. unlock() unlocks the lock.
12.4 Preprocessing This section deals with instructions for the C compiler.
12.4.1 File Inclusion

A PCP program must contain the following #include statement:

#include <pcp.h>
A PCP Man page
B Future PCP

The current version of the Parallel C Preprocessor was developed by a small group of users who implemented features as requirements for them were found in existing parallel programs. By delaying the implementation of features until they are actually needed in existing codes, we have ensured immediate feedback from users and have ensured that the implementation fills the requirements of the users well. As more users make use of PCP and we push into the realm of hundreds of processors, we expect that new needs will arise and language extensions will have to be supported to fill these needs. Please contribute your thoughts and suggestions to the MPCI core staff so that further refinement of this parallel programming model can proceed. We hope to address the following issues in the future:

- portable support for concurrency safe I/O, which most target machines lack,
- relaxing restrictions on the control expressions in the forall statement,
References

[1] Eugene D. Brooks III, PCP: A Parallel Extension of C that is 99% Fat Free, UCRL-99673, Lawrence Livermore National Laboratory, 1988.

[2] Harry F. Jordan, The Force: A Highly Portable Parallel Programming Language, Proceedings of the International Conference on Parallel Processing, August, 1989.

[3] F. Darema, D. A. George, V. A. Norton and G. F. Pfister, A single-program-multiple-data computational model for EPEX/FORTRAN, Parallel Computing, April, 1988.

[4] Eugene D. Brooks III, The Butterfly Barrier, UCRL-95737, Lawrence Livermore National Laboratory, November, 1986.

[5] D. Hensgen, R. Finkel, U. Manber, Two Algorithms for Barrier Synchronization, International Journal of Parallel Programming, vol. 17(1):1-17, 1988.

[6] Brent C. Gorda and Eugene D. Brooks, The MPCI Gang Scheduler, The 1991 MPCI Yearly Report: Harnessing the Killer Micros, Draft, Lawrence Livermore National Laboratory, March, 1991.