
to appear in the Proceedings of the 1994 Transputer Research and Applications Conference (NATUG 7)

Unbalanced Computations onto a Transputer Grid

B. Bacci, M. Danelutto, S. Orlando†, S. Pelagatti, and M. Vanneschi
Department of Computer Science, University of Pisa – Corso Italia 40 – 56125 Pisa – Italy (Email: {bacci,marcod,susanna,[email protected]})
† Department of Applied Mathematics and Computer Science, University of Venezia – Via Torino 155 – 30170 Venezia Mestre – Italy (Email: [email protected])

Abstract

Many applications exist that are characterised by having some "core" function repeatedly applied over all the elements of a compound data structure or over all the elements of an input data stream. Here, a technique is discussed that allows parallel implementations of these applications to be derived that achieve high performance and efficiency in machine resource usage. The technique is especially effective when each application sub-task requires a variable amount of time to be computed. Some results concerning the usage of these techniques on a transputer based machine are discussed here, along with the technical details of the implementation schemas that have been used.

1 Introduction

Many scientific and non-scientific applications can be parallelised in an easy way. We refer to computations that are characterised by a large amount of time spent in repeatedly computing the same functions/procedures/statements over different input data sets, and such that there are no direct data dependencies between different computations of these functions/procedures/statements. These computations are often referred to as "embarrassingly parallel". However, as usual, many problems have to be solved when parallelising these applications [7, 8]:

- subcomputations must be mapped over the processing elements (PEs) (mapping problem);

- input data sets must be spread over the PEs (data distribution problem);

- result data sets must be gathered and, possibly, data structures holding the separate and "independent" result data items must be re-built in memory (data collection problem);

- inter-PE communications and process scheduling over each one of the PEs participating in the computation must be devised (communication and process scheduling problem);

- finally, load balancing strategies should be devised that allow an optimal exploitation of the machine resources to be achieved (load balancing problem).

The c.map skeleton (the figure also depicts the corresponding process structure: body processes applied to the decomposed data set, with a rebuild stage collecting the results):

    program
      init
      forall data_item in {data_set decomposition}
        body(data_item)
      rebuild results
    end program

The c.farm skeleton (here the body processes get items from the input data stream and put results onto the output data stream):

    program
      init
      while non_empty(input_stream)
        get data_item from input_stream
        body(data_item)
        put results onto output_stream
    end program

Figure 1: The application classes: the skeleton of a c.map application (left) and the skeleton of a c.farm application (right)

It is worthwhile noticing that many of the problems stated above are NP-complete in the general case. The goal of this work is to present a methodology that can be adopted to program a particular class of "almost embarrassingly parallel" applications on a distributed memory, MIMD machine. The machine we take into account is a transputer based 2D mesh, but the methodology presented applies to a large class of machines. In particular, we will show how, by recognising the form of parallelism inherent in the application and by (approximately) knowing some parameters depending on the target architecture and on the target application, an efficient implementation of the parallel application can be devised in a fully automatic way. This efficient implementation schema is embodied in a template that can be re-used either by the programmer of a parallel application or by a compiling tool recognising that a parallel application has a given structure.

2 The application class

The class of applications we are going to take into account includes the following two subclasses:

c.map: the applications whose central part is somehow equivalent to a forall loop, such that the statement(s) belonging to the forall body do not contain any inter-iteration data dependency except those related to the iteration variable;

c.farm: the applications that repeatedly operate over a stream of input data sets of the same type, computing the same program over all the items of the input stream, and producing a stream of results as output.

The two classes of applications are characterised as follows:

1. programs belonging to the c.map class have their code subdivided into three parts. In the first part, some initialisation is performed, along with some computation steps that are inherently sequential. At the end of this phase, some kind of "vector" data structure has been produced (and is held in memory) that will subsequently be processed in parallel. In the second part, a sort of forall loop is executed, which applies a given function to each one of the elements of the vector data structure produced at the end of the first phase. In the third phase, the single results coming from the application of the "core" function over all the elements of the vector data structure are "re-built" into a new vector data structure. This data structure either represents the result of the application, in which case it is simply put onto the output file, or it is submitted to a further, inherently sequential, processing phase. The "skeleton" of these applications is drawn in Figure 1 (left).

2. programs belonging to the c.farm class repeatedly apply their code to all the elements appearing on some kind of input data stream. Each element is taken from the input data stream, it is processed by applying the "core" function to it, and then the result is delivered onto the output stream. The "skeleton" of these applications is drawn in Figure 1 (right).

The skeletons of the applications belonging to the c.map and c.farm classes appear to be very similar. In fact, the only difference between the two kinds of applications lies in the fact that in c.map applications there exists a time at which the whole data set whose elements have to be "computed" in parallel exists in memory, while in c.farm applications those elements appear on the input data stream one at a time. In this paper, we will only take into account applications belonging to the classes just described that present the following additional feature:

- the computation of different body statements (i.e. body statements executed on different data_items) takes widely variable amounts of time. The distribution of the time required to compute a body(data_item) is not exactly known, but the average time required for that computation is known.

This feature, which is typical of many applications belonging to the c.map and c.farm classes, prevents us from using the classical "geometric" data decomposition schemas to achieve load balancing and efficiency in machine resource usage. The load imbalance feature stated above further differentiates the applications we take into account in the c.farm class from those considered in [11], and the applications we take into account in the c.map class from those considered in [3] and [9].
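To make the two skeletons concrete, here is a minimal sequential sketch of a c.map computation (the function name body, the squaring workload and the vector size are our own illustrative choices, not taken from the paper): the same core function is applied independently to every element of a vector, and the results are rebuilt into a new vector.

    /* Minimal sequential sketch of a c.map computation: an init phase,
       a forall phase applying the "core" function body() independently
       to each element, and a rebuild phase.  All names and sizes are
       illustrative. */
    #include <stdio.h>

    #define N 10

    static double body(double x)          /* the "core" function  */
    {
        return x * x;                     /* placeholder workload */
    }

    int main(void)
    {
        double data[N], result[N];
        int i;

        for (i = 0; i < N; i++)           /* init: build the vector     */
            data[i] = (double) i;
        for (i = 0; i < N; i++)           /* forall: no inter-iteration */
            result[i] = body(data[i]);    /* data dependencies          */
        for (i = 0; i < N; i++)           /* rebuild: emit the results  */
            printf("%f\n", result[i]);
        return 0;
    }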

3 An efficient implementation strategy

In order to efficiently parallelise the applications belonging to the classes described above, we devised a methodology that is well suited to automation. The methodology assumes that we are designing parallel applications that will eventually run on a distributed memory, MIMD machine. In this case, we take into account a transputer based machine: the methodology has actually been developed using a Meiko Computing Surface as the target machine. Indeed, the methodology can be "ported" to any other kind of parallel machine, provided that:

1. the machine has independent processing elements (MIMD);

2. there exists a "regular" interconnection network allowing data transfers to be performed between the different processing elements of the machine (here "regular" means that every processing element in the target architecture has an isotropic view of the interconnection network);

3. each processing element has full control over its local memory modules, or, in other words, there is no mechanism primitive to the machine supporting some kind of shared memory abstraction. Actually, this is a loose requirement, in that if shared memory is supported as a primitive on the parallel machine, the same methodology can be followed, but different implementation techniques will eventually be used.

This methodology is based on three distinct phases.

1. In the first phase, some process graphs (implementation templates, in our terminology) are designed that allow the parallel core of the applications to be implemented in an efficient way. These templates are designed in a parametric way, in that they can be targeted to the amount of resources (processing elements) available on the target machine. The implementation templates are designed as "mapped" process graphs. This means that an implementation template can be represented as a process graph with annotations telling where the different processes have to be mapped onto the physical architecture. Furthermore, each parametric implementation template is supplied with some analytical models computing its expected performance (completion time and service time) as a function of the number of resources, i.e. processing elements, that are assigned to the parametric template for the execution. Each implementation template is also built out of a set of process templates. Each process template is the code of the process that realises a part of the implementation template. In the general case, these codes are not complete: some contain macros that make calls to user specified code (such as that of the body function) or need to be instantiated with other user supplied information (such as the types of the data structures used by the application). The process templates are written in such a way that they can easily be re-used in the implementation of different applications, both by a programmer and by some kind of automatic program generator.

2. In the second phase, a way is provided to recognise c.map and c.farm applications, starting from their code. The "way" we follow to perform this analysis depends on the source language of the application that we want to parallelise. If the application is written in one of the usual imperative sequential languages, some kind of data-flow analysis can be applied to the code to understand how the application code can be matched with the application skeletons of Figure 1. If the application is already written in some kind of (classical) parallel programming language, the task can still be performed by using some analysis tool. However, all of the work presented here has been performed during the development of a new parallel programming language, named P3L [5], that directly includes, as primitive statements, the possibility to express c.map and c.farm computational patterns.

3. In the third phase, once the implementation templates have been designed and a way has been devised to recognise the application patterns, we provide a cost calculus that allows the optimal number of resources to be allocated to an implementation template, in such a way that optimal performance is achieved in the final application implementation. Here optimality has to be intended as follows: if we fix the amount of resources assigned to a parametric implementation template to a number larger than the "optimal" one, no further performance increase is achieved, while, on the other hand, if we fix the number to a value smaller than the optimal one, the achieved performance is lower than that obtained in the optimal case. Obviously, this optimality criterion depends upon the implementation template: by choosing a different implementation template, one can obtain the same performance values with different amounts of resources.
With respect to the cost calculus, we assume known the costs of some of the basic operations of the target machine, as well as some costs relative to the target application, such as the average time spent in the computation of the "core" function. It is worth noticing that the first and the second phases are performed just once, when the methodology is designed, while the third phase has to be performed every time we want to parallelise an application. However, in the third phase, implementation templates are re-used, each time instantiating the macros they hold with the proper user supplied code calls or parameters. In the following, we will discuss the implementation templates we have chosen, their details, some of their analytical models and, last, how the cost calculus can be used to devise optimal implementations of an application on a particular parallel machine. The implementation templates discussed here assume a two-dimensional mesh-connected architecture, but similar templates can be devised for other topologies without effort.
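As an illustration of how a process template can be instantiated with user supplied code, consider the following sketch (the TASK_TYPE and BODY macro names and this instantiation mechanism are our own assumptions, not the actual format of the P3L templates):

    /* Sketch of a process template parameterised by user code: the
       TASK_TYPE and BODY macros stand for the hooks that a programmer
       or a compiling tool instantiates with application specific types
       and code.  Hypothetical names, for illustration only. */
    #include <stdio.h>

    #ifndef TASK_TYPE
    #define TASK_TYPE double              /* user supplied data type */
    #endif
    #ifndef BODY
    #define BODY(x) ((x) * (x))           /* user supplied core code */
    #endif

    static TASK_TYPE compute_task(TASK_TYPE item)
    {
        return BODY(item);                /* call into user code */
    }

    int main(void)                        /* tiny instantiation test */
    {
        printf("%f\n", compute_task((TASK_TYPE) 3.0));
        return 0;
    }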

Figure 2: The implementation templates for a 2D-mesh, transputer based, MIMD machine: the double-bus template (3 PE per worker), the single bus template (2 PE per worker) and the linear template (1 PE per worker). The original figure distinguishes, for each PE, its function (emitter, collector, emitter ring, collector bus or worker) and shows the interprocessor links.

3.1 The implementation templates

In order to efficiently implement the parallel core of both the unbalanced c.map and c.farm applications, we designed three different implementation templates. All of them accommodate processes that distribute (sub)tasks, processes that compute the body code, and processes that collect the results of these body computations. Furthermore, each template accommodates a process sparking tasks to be computed in parallel, the emitter process, and another process collecting and re-building the final data structures that represent the application result, the collector process. In each one of the three templates, the body code is computed by a string of worker processing elements. Each worker processing element repeatedly computes the body code over the sub-tasks arriving on its input channel. However, the three templates differ in the number of PEs used and in the process placement. The three templates are depicted in Figure 2 and are used in the implementation of both c.map and c.farm applications. The double bus template of Figure 2 works as follows:

- the emitter processor (within this template, process and processor can be considered synonyms) performs different tasks depending on whether the template is used to implement a c.map application or a c.farm application. In the former case, it receives from the environment some initial, compound data set, splits it in chunks and delivers the chunks (i.e. task packets) to the string of processing elements acting as the distribution bus. Each chunk represents a task that can be computed independently of the other tasks. In the latter case, the emitter process just waits for a new task to arrive on its input channel and passes it out to the string of processors acting as the distribution bus.

- the string of emitter ring processors implements a distributed, load balancing task scheduling discipline. Each processor in this string of PEs holds two slots: one for tasks proceeding left-to-right and another one for tasks proceeding right-to-left. The slots circulate among the processing elements and are filled up by the first processor in the string, the one directly connected to the emitter processor. This PE receives tasks from the emitter and places them onto the slot ring as soon as a free slot passes through. Thus, tasks normally flow through the slotted ring. The processors belonging to the emitter ring have an input channel connected to the corresponding worker process. Along these channels, the worker processes dispatch requests for new tasks as soon as they have no more tasks to compute. The processors of the emitter ring pop tasks from their current slots to serve the task requests coming from the worker processes. If all the worker processes are busy, tasks may reach the last element of the emitter ring. In this case, they are moved from the left-to-right slot to the right-to-left slot and start travelling back along the emitter chain. This mechanism implements the load balancing strategy.

- the worker processes just perform a cycle in which: 1. they ask the corresponding emitter ring process for a task, 2. they compute the task received from the emitter ring process, 3. they deliver the result to the corresponding collector bus process. Care is taken to ensure that the worker processes are never left idle: they apply task prefetching, in that they always buffer a task to be computed in their local memory, so that the computation of a new task can always start while the link interfaces ask the emitter ring process for the next one (see the sketch after this list).

- the collector bus processes just receive computed results from the worker processes and deliver them to the collector process.

- finally, the collector process performs different actions depending on whether the implementation template is used for a c.map or a c.farm application. In the former case, it collects all the results, re-builds the compound data structure that represents the result of the application and eventually puts it onto its output channel. In the latter case, it re-orders the results in such a way that the input ordering of the tasks is respected and delivers them onto its output channel. This means that some buffering activity could be required at the collector node.

All of the processes involved in this template use double or triple buffering techniques, similar to those described in [4], to improve communication efficiency. The behaviour of the double bus template is graphically depicted in Figure 3. The other two implementation templates, the single bus and the linear one, behave similarly to the double bus template. The difference lies in the fact that, while in the double bus template all the PEs hold a single process (therefore process and processor are synonyms), in these other templates some processes are grouped onto a single processing element, to keep resource usage lower. In particular:

- in the single bus template, the upper string of emitter processors both distributes tasks to the worker processing elements and routes results to the collector process. Therefore, the ring implemented by the emitter ring processors holds "double" slots, with room for both tasks and results. These slots are allocated and routed with the same mechanism adopted for the emitter processes of the double bus template.

- in the linear template, the worker processes are also allocated onto the emitter ring processing elements, thus achieving further resource usage efficiency at the price of lower performance.
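The worker cycle with task prefetching mentioned above can be sketched as follows. This is a sketch under stated assumptions: chan_send/chan_recv are hypothetical blocking channel primitives standing in for the CS-Tools cread/cwrite calls used by the real process templates (see Section 3.1), and the message types are ours.

    /* Sketch of a worker process with single-task prefetching: the
       request for the next task is issued before the current one is
       computed, so the next task can travel along the links while the
       worker is busy.  chan_send/chan_recv, task_t and result_t are
       illustrative placeholders, not the actual template code. */
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct { int tag; double data; bool end_mark; } task_t;
    typedef struct { int tag; double value; } result_t;

    extern void chan_send(int chan, const void *buf, size_t size);
    extern void chan_recv(int chan, void *buf, size_t size);
    extern result_t body(const task_t *t);    /* user supplied core code */

    void worker(int req_chan, int task_chan, int res_chan)
    {
        task_t   task;
        result_t res;
        int      req = 1;

        chan_send(req_chan, &req, sizeof req);        /* first request  */
        chan_recv(task_chan, &task, sizeof task);
        while (!task.end_mark) {
            chan_send(req_chan, &req, sizeof req);    /* prefetch: ask  */
                                                      /* before compute */
            res = body(&task);                        /* compute task   */
            chan_send(res_chan, &res, sizeof res);    /* deliver result */
            chan_recv(task_chan, &task, sizeof task); /* next task is   */
        }                                             /* already nearby */
    }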
The principle under which all of these three templates work is that tasks must be distributed according to the availability of free worker processes (processors), in order to achieve load balancing. Any one of these templates can accommodate an arbitrary number of workers, i.e. an arbitrary number of PEs computing body code. Of course, the important point is that the communications required to distribute tasks and to collect results must not overwhelm the worker activity. Thus, the bandwidth of the three pipeline stages, the emitter ring, the worker string and the collection bus, determines the maximum number of worker processes that can be accommodated in the template. We will explain this concept further in subsection 3.2. For the moment, it is sufficient to notice that the maximum number of workers that can be inserted in each one of these templates depends on both the sizes of the data that have to be transmitted over the links (both the task data and the result data) and the amount of time spent in the computation of a single task on one of the worker processes. The global behaviour of the three templates is such that the double bus template is the one that delivers the best performance (in terms of application completion time) and the worst efficiency, while the linear template is the one that delivers the worst performance but presents the best efficiency with respect to resource optimisation.

Figure 3: The behaviour of the double bus implementation template

The implementation templates discussed here (actually, the related process templates) have been implemented on the Meiko Computing Surface using the C programming language plus some calls to the low level part of the Meiko CS-Tools. In particular, we did not use CS-Tools Transports at all. Instead, we used CHANNELs, along with the cread and cwrite library calls. An example of a process template belonging to the double bus implementation template is depicted in Figure 4.

3.2 The cost abstraction and the analytical model

Let us now outline the kind of analytical models associated with the implementation templates discussed in Section 3.1. We assume that the following costs (the cost of an operation is the time spent on the target architecture to perform that operation) are known:

1. the average time required to compute the body function on a subtask;

2. the time spent to emit a task from the emitter process ($t_{emit}$), to collect a result on the collector processor ($t_{coll}$) and to make a move of the slotted ring on the emitter ring processes ($t_p$);

/* Body of the collector bus process template: the enclosing function
   declaration (which receives the parameter block) is not shown in
   the published listing.  Comments are ours. */
{
    ro_coll_parms *parms = (ro_coll_parms *) block;
    channel_handle &coll_bus_in  = parms->coll_bus_in;   /* slot arriving   */
    channel_handle &coll_bus_out = parms->coll_bus_out;  /* slot leaving    */
    channel_handle &out          = parms->out;           /* from the worker */

    task_slotted_bus *t_bus     = new task_slotted_bus;  /* current slot    */
    task_worker      *t_w       = new task_worker;       /* worker result   */
    task_slotted_bus *t_bus_tmp = new task_slotted_bus;  /* swap buffer     */

    unsigned char worker_res = FALSE;   /* is a worker result pending? */
    t_bus->fe_mark = TRUE;              /* the slot starts out empty   */

    while (1) {
        /* poll the worker channel without blocking */
        if (worker_res == FALSE) {
            if (out.receive(t_w, NONBLOCKING) != C_ERROR_CHANNEL_EMPTY)
                worker_res = TRUE;
        }
        if (worker_res == TRUE) {
            if (t_bus->fe_mark == TRUE) {
                /* the slot is free: fill it with the worker result */
                t_bus->fe_mark = FALSE;
                memcpy(&t_bus->params, &t_w->params, sizeof(t_w->params));
                memcpy(&t_bus->tags[0], &t_w->tags[0], sizeof(t_w->tags));
                t_bus->end_mark = t_w->end_mark;
                worker_res = FALSE;
            } else if (t_w->tags[0] > t_bus->tags[0]) {
                /* the slot is full: the result with the larger tag
                   takes the slot, the other one is kept locally and
                   remains pending */
                *t_bus_tmp = *t_bus;
                memcpy(&t_bus->params, &t_w->params, sizeof(t_w->params));
                memcpy(&t_bus->tags[0], &t_w->tags[0], sizeof(t_w->tags));
                t_bus->end_mark = t_w->end_mark;
                memcpy(&t_w->params, &t_bus_tmp->params, sizeof(t_bus_tmp->params));
                memcpy(&t_w->tags[0], &t_bus_tmp->tags[0], sizeof(t_bus_tmp->tags));
                t_w->end_mark = t_bus_tmp->end_mark;
            }
        }
        coll_bus_out.send(t_bus);    /* push the slot towards the collector */
        coll_bus_in.receive(t_bus);  /* wait for the next slot              */
    }
}

Figure 4: The process template of one of the collection bus processes in the double bus implementation template

3. the costs of the basic communication primitives on the target architecture. E.g., on the Meiko machine, we assume known the costs $t_0$ and $t_1$, the startup time and the per-byte time of a point-to-point interprocessor communication, respectively. Overall, these two costs allow the completion time and the service time of the communication of a $d$ byte data structure between two adjacent processors to be computed as $t_c = t_0 + d \cdot t_1$ and $t_s = t_0$, respectively;

4. the costs of some basic machine operations, such as $t_{sched}$, the time spent to schedule a process, or $t_a$, the time spent in accessing the local memory.

We also assume that we know the number of sub-tasks that will eventually be computed in the overall application ($n_t$). This is the number of tasks that will appear on the input data stream, in the case of c.farm applications, and the number of elementary items belonging to the compound data structure, in the case of c.map applications. Most of the costs and parameters listed above can be measured once and for all when the target machine is known. The others are affected by the application features and must be re-computed every time a new application is to be parallelised using our templates. In particular, the average time spent in the computation of the body code has to be computed. As we will see in the next sections, it is not necessary to know this time exactly. Instead, it is sufficient to know a good approximation of the average value in order to make good use of the implementation templates illustrated in Section 3.1. The average times spent in the body computations can be derived by a moderate, initial profiling phase. For each one of the templates, we studied analytical formulas expressing its service and completion time as functions of the costs and parameters described above. Service times express the amount of time that one has to wait before a new task can be dispatched to the template, while completion times express the amount of time taken to compute the overall application; both of these times are functions of the parameters listed above and of the number of worker processes included in the implementation templates. The complete formulas can be found in [6, 10].
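To give the flavour of such a model, a much simplified farm formula (our own simplification, not one of the exact formulas of [6, 10]) might bound the completion time with $n_w$ workers by the slowest of the compute and communication stages:

\[
T_{compl}(n_w) \;\approx\; n_t \cdot \max\left( \frac{T_{body}}{n_w},\; t_{emit},\; t_{coll},\; t_c \right),
\qquad t_c = t_0 + d \cdot t_1 ,
\]

where $T_{body}$ is the average body time and $d$ the size in bytes of a task (or result) message: adding workers shrinks the first term until one of the per-task communication terms dominates, after which the template saturates.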

3.3 The cost calculus

By solving the formulas of service and completion time with respect to the number of worker processes, we can derive the "optimal" number of workers that have to be included in the implementation templates. In other words, by taking the completion time (service time) expressed as a function of the number of workers, differentiating the formula and analysing its zeroes, we can devise the number of worker processes that have to be included in the process templates in order to achieve the best completion (service) times. Now, two situations can arise:

1. the optimal number of workers leads to a resource (PE) count that is smaller than the number of PEs available on the target machine. This is the "happy" case: in order to get an efficient parallel application, we only have to use the parametric implementation template "instantiated" with this number. The formulas guarantee that if more PEs are used, no further performance is gained, while if fewer PEs are used, lower performance is achieved.

2. the optimal number of workers leads to a resource (PE) count which is larger than the number of PEs available on the target machine. In this case, we can instantiate the parametric implementation template with the maximum number of workers such that the overall count of PEs fits onto the target machine. Alternatively, we can move to a more efficient implementation template, i.e. we can consider the single bus or the linear implementation templates.
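Under the simplified model sketched above, the derivation reduces to a balance condition, as in the following sketch (the closed form and all the parameter names are our illustrative assumptions, not the actual formulas of [6, 10] nor the algorithm of [1, 2]):

    /* Sketch: optimal worker count under the simplified farm model
       T_compl(n_w) = n_t * max(T_body / n_w, t_emit, t_c).
       Completion time decreases with n_w until the workers saturate
       the slowest distribution/collection stage. */
    #include <math.h>
    #include <stdio.h>

    int optimal_workers(double t_body,   /* avg body time (profiled) */
                        double t_emit,   /* per-task emission time   */
                        double t0,       /* link startup time        */
                        double t1,       /* per-byte link time       */
                        int    d)        /* task size in bytes       */
    {
        double t_c = t0 + d * t1;               /* per-task comm time */
        double bottleneck = fmax(t_emit, t_c);  /* slowest pipe stage */
        /* adding workers beyond this point gives no further speedup */
        return (int) ceil(t_body / bottleneck);
    }

    int main(void)
    {
        /* purely illustrative numbers (microseconds and bytes) */
        printf("optimal workers: %d\n",
               optimal_workers(5000.0, 20.0, 10.0, 1.0, 100));
        return 0;
    }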

Figure 5: (A slice of the) time chart for the execution of a farm template (processor activity over time; the legend of the original chart distinguishes the Idle, Interrupt handler, User code, Channel send, Channel receive and Process startup/cleanup execution modes)

In the framework of our P3L project, where we made extensive use of implementation templates, including the ones discussed here, we developed a deterministic, polynomial time algorithm that always assigns to the templates used in the development of parallel applications a number of resources such that the "optimal" behaviour of the parallel application is achieved. This algorithm also takes into account the possibility of having hierarchically nested templates [1, 2]. Summarising, the analytical formulas determining the completion and service time of the different templates presented here have been developed, once and for all, in such a way that, when one has an application falling in either the c.map or the c.farm class, one can profile the execution times of the body code and then devise the right implementation template (in terms of which template to use and how many resources to assign to it) to efficiently implement that application on the target parallel machine. It is worth noticing that if we want to address a different target architecture, in general, new implementation templates have to be designed along with their analytical models, but the methodology described in this section can still be applied. Furthermore, the methodology described here allows a clear separation of tasks to be achieved:

- the "expert" in machine usage can design efficient implementation templates and provide the corresponding analytical models;

- the user (application programmer) can apply the cost calculus and derive efficient parallel applications without being involved in the analysis of the machine features and performance.

4 Experimental results

As already mentioned in this paper, we used the implementation templates discussed in Section 3.1 in a project where an entirely new, high level parallel programming language has been developed, named P3L (the Pisa Parallel Programming Language). In Skillicorn's terms, the language follows a "restricted computing model" [12], as it only allows parallel computations having a well defined pattern to be expressed. The pattern is to be chosen among a set of ready-to-use patterns, called parallel constructs, that represent primitive statements of the language. Sequential computations can be expressed in any existing sequential programming language; for our experiments, we used plain C. Within that framework, every time the programmer indicates that a given computation belongs either to the c.map class (and consequently indicates a decomposition strategy to obtain the data items to be processed) or to the c.farm class, an implementation template such as the ones presented here is used to implement the corresponding parallel constructs included in the application code. We measured the performance achieved on different applications and concluded that, whenever high load imbalance can be expected, the implementation templates, along with their analytical models, achieve optimal load balancing, i.e. adding further worker PEs does not increase the performance, while using a smaller number of PEs leads to worse performance values. In particular, this means not only that by using the implementation templates of Figure 2 we obtained optimal load balancing and, consequently, good resource usage, but also that the number of worker processes that must be included in the template to achieve this result is always correctly predicted by the analytical models provided with the template (even when the execution time of different (sub)tasks varies according to an exponential distribution law). In the following subsections, some of the experimental results we obtained are discussed in detail.

4.1 Load balancing

Load balancing was one of the main goals we wanted to achieve with the introduction of the implementation templates of Section 3.1. We experimented with c.map and c.farm applications whose body code required variable amounts of time to complete: applications having almost constant body times, as well as body times distributed according to both normal and exponential distribution laws around an average value. In all these cases, the time spent in the actual execution of the parallel application implemented using our implementation templates is very close (at most around 5% away) to the time predicted by the analytical models of the templates. This result has been obtained by using the templates with the "optimal" number of workers devised by the analytical models. We tried to insert further workers into the implementation templates, and in all cases (i.e. with all the time distributions mentioned above) we achieved an (almost) equal completion time. Furthermore, we tried to run the applications with a number of workers smaller than the computed optimal one, and the result was that the application completion time increased (proportionally to the number of workers taken away from the optimal template configuration). These results lead us to conclude that the load balancing strategy used in the templates is effective. A further indication that the load balancing strategy is correct can be found in Figure 5, which shows a time chart, obtained with our performance analysis tools, of how load balancing is achieved in a c.farm application. In this case, a double bus implementation template has been used. The application code (the body part) was just spending a variable amount of time (by using a timer process) between 0 and 10000 clock ticks; the average time spent in the computation of the body code can be taken to be 5000 ticks. The central line of each of the three groups of lines around PEs no. 20, 30 and 40 represents the lifeline of a worker PE, while the upper and lower lines represent the lifelines of the corresponding collector bus and emitter ring processes, respectively. It can be seen that, although the tasks they compute have very different lengths, the workers are almost always busy.
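The unbalanced synthetic body used in this experiment can be reconstructed along these lines (a sketch: the paper only states that a timer process was used to spend between 0 and 10000 clock ticks per task; rand() and the busy-wait on clock() are our own stand-ins):

    /* Sketch of the synthetic, unbalanced body: each task burns a
       pseudo-random amount of time, uniform in [0, 10000] clock ticks
       (average 5000).  The busy-wait stands in for the timer process
       mentioned in the paper. */
    #include <stdlib.h>
    #include <time.h>

    static void body(void)
    {
        clock_t budget = (clock_t)(rand() % 10001);  /* 0..10000 ticks */
        clock_t start  = clock();
        while (clock() - start < budget)
            ;                                        /* variable load  */
    }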

4.2 Relative performance of the different implementation templates

With a set of benchmark c.map and c.farm applications, we measured the relative performance of the implementation templates presented in Section 3.1. The results are summarised in Figure 6 and in Table 1.

Figure 6: Completion time of the same application implemented using the double bus (3PE/w) and the single bus (2PE/w) implementation templates, plotted against the number of workers

From Figure 6, which refers to a single application but presents a behaviour typical of all the applications we considered, it can be seen that the double bus implementation template achieves a better speedup than the single bus implementation template. However, for a large part of the curve, the completion times of the two templates are almost identical. This means that, below the emitter ring saturation condition, the single bus template can be conveniently used, as it uses fewer resources than the double bus template. Table 1 reports the results achieved with a simple application performing matrix multiplication.

The application was in the c.farm class and was fed with a stream of 2000 matrices (10 × 10 floating point elements each). The amount of resources considered available on the Meiko target architecture was 25 PEs. In that case, due to the small weight of the body code with respect to the weight of the communications required to move the matrices belonging to the input and output streams, the implementation templates do not scale very much.

              Template 2PE/w (Opt. = 4w)     Template 3PE/w (Opt. = 6w)
    Workers   Ex. Time   Nodes   Speedup     Ex. Time   Nodes   Speedup
       1       11.47       1        -         11.47       1        -
       2        5.75       7       1.99        5.76       9       1.99
       3        3.85       9       2.97        3.84      12       2.98
       4        3.46      11       3.31        2.89      15       3.96
       5        3.5       13       3.27        2.32      18       4.93
       6        3.65      15       3.14        1.98      21       5.79
       7        3.65      17       3.14        1.98      24       5.79
       8        3.7       19       3.1
       9        3.7       21       3.1

Table 1: Execution times, matrix multiplication (10 × 10), stream of 2000 matrices

Figure 7: Scalability vs. computation grain size (completion time as a function of the number of workers, for a coarse grain and a fine grain core computation)

The double bus template scales up to 6 workers and the single bus template scales up to 4 workers (both values correspond to the values computed by the analytical models). However, the speedup achieved with the single bus template is smaller, because the communications of tasks and results are performed on the same processing elements, thus leading to a serialisation of these communications and, as a consequence, to a slowdown of the process feeding tasks to the worker processors.

4.3 Scalability

All of the templates described in this work present good scalability when the amount of time spent in the computation of the body code is much larger than the amount of time spent in communicating tasks and results along the transputer links. The typical behaviour is depicted in Figure 7. This behaviour is reasonable: the larger the time spent in communicating the tasks and the results along the links, the coarser the task computation must be in order to actually take advantage of parallel task execution.

4.4 MIMD vs. Workstations

It is often claimed that workstations achieve such high performance that it is not worthwhile to parallelise applications unless they are very large (computationally expensive). We performed some tests on our class of parallel applications. A significant test was the following. We built a benchmark application that processed an input data stream made up of floats, performing a body computation that only spends time in a loop. The time spent in the loop was proportional to the float value, and the stream of floats was generated using a certified random number generator. The results are reported in Table 2. In this case, the optimal number of workers devised by using the template analytical models was 9 for the double bus template and 6 for the single bus template. The workstation almost outperforms the application obtained by using the single bus template, but it does not succeed in outperforming the application obtained by using the double bus template. However, if the relative performance of the two central processors is taken into account (the workstation sported a 70 MIPS HPPA chip), the time that we could have obtained with the double bus template using 8 workers is 0.67 seconds, which represents a speedup of 7.72 with 8 workers. It can be argued that the number of processing elements used is actually 27 instead of 8, but similar results can be achieved using the other, more efficient implementation templates we discussed in Section 3.1.

               double bus template           single bus template
    Workers   Compl. time   PE   Speedup    Compl. time   PE   Speedup
       1         30.34       2      -          30.34       2      -
       2         15.38       9     1.97        15.40       7     1.97
       3         10.26      12     2.95        10.27       9     2.95
       4          7.70      15     3.94         7.71      11     3.94
       5          6.16      18     4.92         6.20      13     4.92
       6          5.14      21     5.9          5.39      15     5.6
       7          4.42      24     6.86         5.16      17     5.8
       8          3.93      27     7.72
       9          3.74      30     8.11
      10          3.74      33     8.11

    HPPA 720 Workstation (70 MIPS): 5.2

Table 2: Workstation vs. Transputers

4.5 Communication grain size

Finally, some words have to be spent on a possible optimisation of the templates discussed here. It is well known that on transputer based machines further speedups can be obtained if the communication grain is optimised, i.e. if we try to group multiple communications into a single communication, in order to pay the link setup time once instead of paying it multiple times. We made some experiments on this topic and we also studied analytical models for the templates using communication optimisation. The results were quite surprising: if optimisation of the communication grain is used, the optimal number of workers that can be accommodated in our templates varies by at most one unit. This means that a template working at optimality with $n_w$ workers, once transformed to take communication optimisation into account, may accommodate something like $n_w + 1$ workers at most (under a reasonable set of conditions). Taking into account that, in order to optimise communication, we have to handle packets of tasks and results instead of single task and result communications, it turns out that communication grain optimisation is not worthwhile. However, when very fine grain computations are considered on machines having fast (but not that fast) communication support, the effect of communication grain optimisation could be considerable.
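For reference, grouping the communications amounts to something like the following sketch, where $k$ tasks are packed into one message so that the startup time is paid once per packet, i.e. $t_0 + k \cdot d \cdot t_1$ instead of $k \cdot (t_0 + d \cdot t_1)$ (the packet layout and chan_send are the same illustrative placeholders used earlier, not the actual template code):

    /* Sketch of communication grain optimisation: PACK tasks are sent
       in a single message, amortising the link startup time over the
       whole packet.  Illustrative placeholders, not the real code. */
    #include <stddef.h>

    #define PACK 8                             /* tasks per packet (k)  */

    typedef struct { int tag; double data; } ptask_t;
    typedef struct { int n; ptask_t t[PACK]; } packet_t;

    extern void chan_send(int chan, const void *buf, size_t size);

    void send_packet(int chan, const ptask_t *tasks, int k)
    {
        packet_t p;
        int i;

        p.n = (k < PACK) ? k : PACK;
        for (i = 0; i < p.n; i++)              /* pack k tasks together */
            p.t[i] = tasks[i];
        chan_send(chan, &p, sizeof p);         /* one startup per packet */
    }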

5 Conclusion

We showed how, by using well defined implementation templates, a class of applications can be efficiently and automatically parallelised on transputer-based machines. The strategy presented here, however, does not rely on the transputer: it can be applied to any MIMD machine based on a regular interconnection network. The strategy has been validated by a large number of experiments performed using a prototype tool we developed in Pisa.

References

[1] B. Bacci, M. Danelutto, and S. Pelagatti. Resource optimization via structured parallel programming. Technical Report TR-30/93, Department of Computer Science, University of Pisa (Italy), 1993. Available by anonymous ftp from ftp.di.unipi.it.

[2] B. Bacci, M. Danelutto, and S. Pelagatti. Resource Optimization via Structured Parallel Programming. In IFIP WG 10.3 Working Conference on Programming Environments for Massively Parallel Distributed Systems, April 1994. To appear.

[3] W. Cai and D. B. Skillicorn. Evaluation of a Set of Message-Passing Routines on Transputer Networks. In A. R. Allen, editor, Transputer Systems – Ongoing Research. IOS Press, 1992.

[4] R. S. Cok. Parallel Programs for the Transputer. Prentice-Hall, Englewood Cliffs, New Jersey, 1991.

[5] M. Danelutto, R. Di Meglio, S. Orlando, S. Pelagatti, and M. Vanneschi. A methodology for the development and support of massively parallel programs. Future Generation Computer Systems, 8(1–3), July 1992.

[6] M. Danelutto, S. Pelagatti, and M. Vanneschi. An analytical study of the processor farm. Technical Report HPL-PSC-91-22, Hewlett Packard Laboratories, Pisa Science Center (Italy), 1991.

[7] G. C. Fox, M. A. Johnson, G. A. Lyzenga, S. W. Otto, J. K. Salmon, and D. W. Walker. Solving Problems on Concurrent Processors. Prentice Hall International, 1988.

[8] A. J. G. Hey. Experiments in MIMD parallelism. In PARLE 1989, LNCS 365. Springer-Verlag, 1989.

[9] Yen-Chun Lin and Yu-Ho Cheng. Automatic Generation of Parallel Occam Programs for Transputer Rings. IEEE Transactions on Parallel and Distributed Systems, 3(16):121–133, 1992.

[10] S. Pelagatti. A methodology for the development and the support of massively parallel programs. Technical Report TD-11/93, Department of Computer Science, University of Pisa, 1993. PhD Thesis.

[11] D. J. Pritchard, C. R. Askew, D. B. Carpenter, A. J. G. Hey, and D. A. Nicole. Practical parallelism using transputer arrays. In PARLE 1987, volume 258 of Lecture Notes in Computer Science, pages 28–42. Springer-Verlag, 1987.

[12] D. B. Skillicorn. Models for Practical Parallel Computation. International Journal of Parallel Programming, 20(2), April 1991.
