Cost Models for Partitioning Parallel Computations in Two Tiered Architectures

Csaba Andras Moritz and Lars-Erik Thorelli
Department of Teleinformatics
Royal Institute of Technology
S-16440 Kista, Sweden

Peter Thanisch and George Chochia
Laboratory for Computer Science
University of Edinburgh
Edinburgh EH9 3JZ, Scotland

Abstract
The requirements to reduce the price/performance ratio and to provide scalability force multicomputer vendors to market a new type of multicomputer system, built with commodity components. The processors are grouped into hypernodes (or subsystems), which are connected through scalable high-speed interconnects. The system concept is somewhat hybrid: within a hypernode the machine is effectively a shared-memory system, while between hypernodes it is a distributed-memory system. Additionally, clusters of SMP (Symmetric Multiprocessor) workstations follow the same architecture model. This paper presents a parallel machine model, called LoGH{sSP}, an extension of LogP, reflecting the aforementioned architectures in terms of: (1) the latency associated with communication between hypernodes, L; (2) the processor overhead associated with sending and receiving messages between hypernodes, o; (3) the available inter-hypernode gap per byte for long messages, G; (4) the number of hypernodes, H; (5) the processor overhead associated with shared-memory accesses within a hypernode, s; (6) the available intra-hypernode gap per byte for long messages, S; and (7) the number of processors per hypernode, P. This model provides the performance characterization required for efficient partitioning of parallel computations in two-tiered architectures. We illustrate the chosen performance parameters by extracting the model parameters for MPI programs on the Convex Exemplar SPP1600 machine. Finally, we show the properties of the model by considering a favorite testbed, an acyclic directed graph called the Diamond DAG, for which we derive and analyse efficient schedules.
Keywords: two-tiered multicomputers, computer clusters, parallel machine models, performance models, partitioning.

1 Introduction
Most parallel computers today are based either on symmetric multiprocessing (SMP) with shared memory or on massively parallel processing (MPP) with distributed memory. The use of commodity processors and interconnects favors a server or multicomputer model with a modular architecture in which high-performance hypernodes or subsystems are connected to each other with high-speed communication links. In view of this, we refer to these architectures as two-tiered architectures, or TTAs. Besides the use of such systems in scientific and engineering applications, there is also an interest in designing database servers with this architecture. Tandem Computers and Sequent Computer Systems typify this trend [9]. Possible application areas are video on demand, large data warehouses and OLCP (Online Complex Transaction Processing) systems. Current SMP systems use at most 32 processors [23], as they face contention on memory and I/O subsystem accesses in larger configurations. One intuitive way to diminish the impact of contention in larger systems is to connect SMPs together into TTAs. The main advantage of the TTA is that it allows a relatively scalable system to be assembled from basic building blocks.
High-level models for distributed-memory parallel machines have been designed to provide a performance tool for algorithm designers. The input to these models comprises a small set of general parameters which ignore idiosyncratic machine details. However, none of these models explicitly addresses the performance characteristics of two-tiered systems. In this paper we present a high-level model for TTA systems. The model is an extension of LogP [3], which was designed for distributed-memory machines. Our intention with this model is to provide the performance characterization required for efficient partitioning of parallel computations in two-tiered architectures.

We present the practical motivations for the choice of parameters at each level. We derive the model parameters for the Convex Exemplar SPP1600 using MPI as the communication layer. We derive and analyze the parallel execution time of the synthetic Diamond DAG problem under stripe partitioning, and discuss different configurations in terms of the number of hypernodes and the number of processors per hypernode. Finally, we outline directions for possible strategies in scheduling heuristics.
This paper is organized as follows. Section 2 describes the architecture of TTA multicomputers, commercially available TTA machines and abstract models for multicomputers. Section 3 motivates the choice of parameters of the proposed LoGH{sSP} model and gives some scheduling hints based on these motivations. Section 4 analyses the Diamond DAG and the performance of the Convex Exemplar SPP1600 machine based on the LoGH{sSP} model. Section 5 presents guidelines for partitioning parallel computations on TTAs. Finally, Section 6 contains our conclusions.

2 Background
2.1 Architecture of Two Tiered Multicomputers
An MPP system uses a design in which processor elements have local dedicated memory and separate copies of the operating system kernel. The communication network is often a scalable high-speed interconnect, connecting hundreds of processors. A two-tiered architecture (TTA) substitutes the processor elements with subsystems containing a number of processors connected by commodity interconnects to the same bus and sharing local memory. The system design relies on a hierarchical memory structure to meet the simultaneous requirements of low latency and high bandwidth. Low-latency shared memory within a hypernode can efficiently support fine-grained parallelism within applications. This hybrid system model can support different programming environments, such as virtual shared memory programming models [14] (which give the user a shared-memory view of the system) and message-passing programming models.

2.1.1 Commercial Examples
Tandem ServerNet [7]. Tandem has presented a flexible architecture known as ServerNet.

Figure 1: Overview of the organization of two-tiered architectures: hypernodes containing processing elements (PE) and shared local memory, connected through network interfaces to the interconnection network.
With the help of three basic components it is possible to build SMP systems, MPP systems and hybrid two-tiered systems. The basic components of ServerNet are: (1) low-cost routers, which can be combined into arrays to build a packet-switched, point-to-point, interconnected mesh; (2) a processor interface, implemented as an ASIC, which provides the connections between the processors, the local memory and the router network; and (3) a peripheral-device interface which provides router network connections to I/O buses (e.g., PCI and SCSI) and external network interfaces (e.g., ATM, Ethernet, FDDI). If the processor-interface ASIC provides each processor with its own local memory, ServerNet builds an MPP system. If the processor-interface ASICs share local memory among several processors, ServerNet builds an SMP system.
Sequent Quad [8]. The system architecture used by Sequent is formed by connecting Quads together with high-speed communication links. Each Quad comprises 4 Pentium Pro CPUs on a multiprocessor system bus. Each Quad's multiprocessor bus can be operated independently. Sequent's IQ-Link communication logic manages each Quad's RAM so that the system's memory behaves like a global shared memory. Accesses to local memory inside a Quad are fast (250 ns); accessing memory outside a Quad takes 3 µs. The IQ-Link maintains data coherency among all the private memories of the Quads.
Convex Exemplar SPP1200/SPP1600 [10]. The processor used in the Exemplar SPP1200/XA and SPP1600/XA is the HP PA-RISC 7200. The Exemplar's scalable parallel architecture groups PA-RISC processors, memory, and I/O components into sub-units. The SPP1600/XA, the highest-performance member, can have up to 16 hypernodes, for a total of 128 processors, and delivers a peak performance of 30.7 GFLOPS. Features of the Exemplar include Global Shared Memory (GSM), which allows the use of both shared-memory programming models and explicit message-passing programming models.
NCR 5100, 5200 [11]. This family of parallel computers can be scaled up to 32 processors per subsystem, with up to 512 processors organized in 16 subsystems. The CPUs used are 4 to 32 133 MHz Intel Pentiums, each with a 4 MB second-level cache and 32 MB LARC (Limited Address Range Cache) per processor board. The system bus in each subsystem provides 400 MB/s bandwidth. The high-speed interconnect used is the BYNET network, based on a folded Banyan network topology. The BYNET network bandwidth scales linearly as processor subsystems are added. Each processor subsystem has two BYNET interface adapters.

2.2 Overview of Parallel Performance Models
Various abstract models of the performance of parallel machines have been introduced to serve as a bridge between parallel machines and parallel programming models (e.g., message passing, shared memory, data-parallel) and to support parallel algorithm designers. The choice of model reflects a trade-off between realism and ease of use. Two of the models that we examine, namely CLAUD [2] and ETF [12], are low-level, in the sense that they are used for designing scheduling heuristics for various interconnection network topologies. The LogP [3] performance model for distributed-memory machines is higher-level, as it neglects the interconnect topology.
LogP and LogGP. LogP [3] models parallel machine performance in terms of four parameters: (1) L = latency, an upper bound on the time to transmit a message from its source to its destination; (2) o = overhead, the time during which the processor is busy sending or receiving a message; (3) g = gap, the minimum time interval between consecutive communication operations; and (4) P = the number of processors. LogGP [1] extends LogP by taking into account the increased bandwidth for long messages supported by special hardware in multicomputers.
CLAUD. CLAUD ('Cost and Latency Augmented DAG') [2] models parallel machine performance in terms of five parameters: (1) the sending overhead; (2) the receiving overhead; (3) the network latency for a unit of information; (4) the amount of data to be passed, q (if q is the amount of data to be passed between two nodes, then q times the unit latency is the transmission time along one link); and (5) the number of processors used. A CLAUD schedule is a pair: a computational schedule ST and a communication schedule SC. CLAUD is intended for modeling distributed-memory machines.
ETF. ETF [12] is a low-level model embodying descriptions of the underlying architecture and the communication architecture. The parameters defining the model are: the cost of transferring a unit-length message between a pair of processors p and p' in the network; the amount of data which has to be transferred between two tasks t and t', the first of which is an immediate predecessor of the other; and the time required to compute task t. ETF considers the communication between each pair of tasks to have a separate cost. For further model references see [5], [19].

3 Proposed Model: LoGH{sSP}
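To make the LogP and LogGP parameters concrete, the short sketch below evaluates the usual point-to-point cost readings of the two models (o_s + L + o_r for a short message, and o_s + (k - 1)G + L + o_r for a k-byte message under LogGP). The numeric values are illustrative placeholders loosely based on Table 1, not measurements.

#include <stdio.h>

/* Illustrative LogP/LogGP parameters, in microseconds (G in microseconds per byte). */
typedef struct { double L, o_s, o_r, G; } loggp_t;

/* LogP: time for a short point-to-point message. */
static double logp_msg(const loggp_t *m) { return m->o_s + m->L + m->o_r; }

/* LogGP: time for a k-byte message using the long-message gap G. */
static double loggp_msg(const loggp_t *m, double k) {
    return m->o_s + (k - 1.0) * m->G + m->L + m->o_r;
}

int main(void) {
    loggp_t net = { 6.3, 1.4, 2.2, 0.05 };   /* Paragon-like values from Table 1 plus an assumed G */
    printf("short message: %.1f us\n", logp_msg(&net));
    printf("64 kB message: %.1f us\n", loggp_msg(&net, 65536.0));
    return 0;
}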

In the LoGH{sSP} model, the choice of parameters reflects the two kinds of inter-processor communication in TTAs:
- intra-hypernode, based on a messaging mechanism implemented with shared-memory accesses;
- inter-hypernode, based on messages and some kind of message-passing communication layer.
Message-passing layers use shared-memory operations within a hypernode, and virtual shared memory programming models rely on some message-passing mechanism (e.g., Active Messages [17], RPC, Berkeley Sockets) to implement remote memory access between hypernodes. Below, we discuss the facts that affected our choice of performance parameters for LoGH{sSP}.
Interconnect topology. At present many low-latency interconnection networks are available, and there is no general consensus on which network topology to prefer; commercially available multicomputers have chosen different topologies. Interconnection networks in commercial multicomputers can also have a fault-tolerant design, i.e. they can reconfigure in case of failures of physical links. Performance evaluations of fast network interfaces [4] also show that user-to-user communication is mainly limited by the network interface. In LoGH{sSP} we therefore neglect the influence of the inter-hypernode network topology on communication performance.
Message length. Our model for long messages is similar to the one in LogGP. We assume that there is some hardware support available for long message transfers. We use different parameters for inter-hypernode and intra-hypernode transfers.

Table 1: The LogP latency and overheads for Active Messages on distributed-memory machines and networks of workstations, according to [4].

Platform       L (µs)   o_s (µs)   o_r (µs)
Meiko          7.5      1.7        1.6
Myrinet NOW    11.1     2.0        2.6
Paragon        6.3      1.4        2.2

Table 2: Round-trip communication performance T based on MPI/PVM on the Cray T3D and an SGI COW [22].

Environment                                          T
Cray T3D with CRI PVM 3.3.4                          259
Cray T3D with EPCC MPI 1.4.a                         80
SGI workstation cluster with Oak Ridge PVM 3.3.10    2567
SGI workstation cluster with Argonne MPI 1.0.12      2858

We model the long-message gap per byte for inter-hypernode accesses with G, and we use the parameter S to capture the gap per byte for long messages in intra-hypernode communication. The inverses of these parameters capture the bandwidths of inter- and intra-hypernode communication for long messages.
Processor overheads. The overhead parameter value is determined by the communication software used and by the access to the network interface. The overhead of communication layers like MPI is dominated by the cost of executing tens or hundreds of operations. In optimized fast communication layers like Active Messages [4], however, the overheads are comparable with the network latency and the gap parameter. In LoGH{sSP} we distinguish between the send (or startup) overhead o_s and the receive overhead o_r. Typical values for these overheads are presented in Tables 1 and 2. Intra-hypernode inter-processor communication is achieved through shared-memory accesses, and we assume that message-passing layers use shared-memory operations. An exchange of data through shared memory may also involve semaphore synchronization and context switching. In typical shared-memory implementations of PVM [16], each task is given a shared-memory buffer previously created with a shmget() system call; during inter-task communication, one task maps the contents of the other task's shared buffer into its own memory. We include this overhead in the form of the performance parameter s in LoGH{sSP}.
Network latency between hypernodes. Latency is influenced by the network interface and the network link bandwidth. We use the LogP latency model for inter-hypernode communication: the latency L is the upper bound on the time to transmit a message from its source to its destination. In Tables 1 and 3 we show typical values for the network latency L [4].

Table 3: LogP latency L (in microseconds) on the Convex Exemplar, a NOW and the Cray T3D.

Platform           L (µs)
10 MB/s NOW        124
Cray T3D           3.1
Convex Exemplar    4-10
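Read together, the parameters above yield a simple per-message cost estimate for a k-byte transfer. This is our reading of the model (sender and receiver each pay the relevant overhead, long messages pay the per-byte gap, and only off-hypernode messages pay the network latency L), not a formula stated explicitly elsewhere in the paper:

T_{\mathrm{intra}}(k) \;\approx\; s_s + s_r + kS \;\approx\; 2s + kS,
\qquad
T_{\mathrm{inter}}(k) \;\approx\; o_s + o_r + L + kG \;\approx\; 2o + L + kG.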

Multithreading and context switching. Multithreading is a convenient way to write parallel programs. As threaded programs rely on shared memory, they can be implemented within a hypernode. Especially in the case of virtual shared memory programming models, the cost associated with accessing and synchronizing on a remote shared variable can be very high. Data-driven preemptive execution models are used to hide these latencies by suspending a thread blocked on a remote shared-variable access and continuing with a new thread [13]. Context switching of a thread requires a system call for saving the context information, resulting in processor overhead. The cost of context switching in a multiprocessor is very much implementation dependent (see [24] for references). Indirectly, the cost of context switching is included in the communication startup overheads.
Effect of caches and prefetching. The latency of accessing memory outside a hypernode is often reduced through a hierarchy of caches, which can transparently move data closer to the processor and/or dynamically copy the required data to memory elements that are closer to a particular processor. Different approaches (cache-coherence algorithms) [24] exist for providing consistency. Accessing remote data that is not present in the local cache of the hypernode results in considerable time lost, as complete cache lines or pages are transferred across the network. In systems with software-based cache-coherency schemes, e.g. the MIT MGS system [21], such costs are even more significant. Memory hierarchy optimization techniques such as prefetching may reduce inter-hypernode latencies. Although cache effects are important, they are specific to particular machines and thus difficult to incorporate in a model.
Network gap for short messages. The gap g, as defined in LogP, is the minimum time interval between successive sends or receives. In this paper we ignore the gap parameter and assume that messages can be injected into the network as soon as they are constructed; cf. [19].

Table 4: Typical context-switching (CSW) and thread spawn/join times [18][10], in µs.

Platform                 AD66XX with Alpha RISC 133   Convex Exemplar
Thread spawn/join time   15                           10
CSW                      5                            -

Figure 2: Comparison of the 1/G inter-hypernode and 1/S intra-hypernode bandwidth (in MB/s) on the Convex Exemplar SPP1600, as a function of message length in bytes.

Figure 3: Comparison of inter- and intra-hypernode standard non-blocking MPI send overheads (in microseconds) on the Convex Exemplar, as a function of message length in bytes (o = inter-hypernode overhead, o_s = intra-hypernode overhead).

Network interface implementation aspects. Implementation aspects of the hypernodes can influence the way communication performance should be modeled. One such aspect is the use of one of the processors in each hypernode for communication tasks. In some implementations of PVM a separate daemon, PVMD, handles all communication (e.g., on clusters of SGI Power Challenge machines or clusters of multiprocessor Sun 20 workstations). When PVMD is scheduled on a separate processor, the overhead of sending a message off-hypernode can be modeled exactly as the overhead of sending the message within the hypernode (since it is no different from sending a message to a processor inside the hypernode). The number of network interfaces or links used per hypernode can also vary and influences the message transmission times.

3.1 Summary of LoGH{sSP} Parameters
In summary, the main parameters of the LoGH{sSP} model are:
1. the maximum latency, or delay, in communicating a unit of information over the network between processors located in separate hypernodes: L;
2. the send and receive overheads for initiating off-hypernode communication: o_s, o_r; these overheads are of the same magnitude and can be substituted by o = max(o_s, o_r);
3. the long-message inter-hypernode gap: G;
4. the total number of hypernodes: H;
5. the send and receive overheads for initiating communication inside a hypernode: s_s, s_r; these overheads are of the same magnitude and can be substituted by s = max(s_s, s_r);
6. the long-message intra-hypernode transfer gap: S;
7. the number of processors per hypernode: P.
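These parameters translate directly into a small cost estimator. The sketch below assumes the per-message reading given earlier (about 2s + kS within a hypernode and 2o + L + kG between hypernodes); the values plugged in are placeholders, not measurements.

#include <stdio.h>

/* LoGH{sSP} parameters: times in microseconds, gaps in microseconds per byte. */
typedef struct {
    double L;   /* inter-hypernode latency               */
    double o;   /* inter-hypernode send/receive overhead */
    double G;   /* inter-hypernode gap per byte          */
    int    H;   /* number of hypernodes                  */
    double s;   /* intra-hypernode overhead              */
    double S;   /* intra-hypernode gap per byte          */
    int    P;   /* processors per hypernode              */
} loghssp_t;

static double intra_cost(const loghssp_t *m, double bytes) { return 2.0 * m->s + bytes * m->S; }
static double inter_cost(const loghssp_t *m, double bytes) { return 2.0 * m->o + m->L + bytes * m->G; }

int main(void) {
    /* Placeholder values only, chosen for illustration. */
    loghssp_t spp = { 7.0, 100.0, 0.05, 2, 10.0, 0.01, 4 };
    for (double k = 100.0; k <= 1e6; k *= 10.0)
        printf("%8.0f bytes: intra %10.1f us   inter %10.1f us\n",
               k, intra_cost(&spp, k), inter_cost(&spp, k));
    return 0;
}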

4 Applications of LoGH{sSP}
4.1 LoGH{sSP} Signature of the Convex Exemplar SPP1600 with MPI
In this section we extract the LoGH{sSP} model parameters for the Convex Exemplar SPP1600 machine. The Convex Exemplar SPP1600 can comprise up to 16 hypernodes, each of which has up to 8 processors, an I/O port, and up to 2 gigabytes of physical memory. Hypernodes are interconnected with 4 rings. The MPI layer used is Convex MPICH V1.0.12.1. The system we used is at the Swiss Center for Scientific Computing and has two hypernodes, with a total of 16 HP PA-RISC 7200 CPUs. From Figures 2 and 3 we observe that there is a factor-of-ten performance gap between inter- and intra-hypernode message transfers. Similarly, the software overheads for intra-hypernode communication are much lower than their inter-hypernode counterparts, because they are based on simple shared-memory primitives. The overheads also show some variation depending on message length. These results show that it is important to understand the costs involved with off-hypernode communication.
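One common way to estimate such parameters is a ping-pong microbenchmark: half the round-trip time of a very short message approximates o_s + L + o_r, and the growth of the one-way time with message size approximates the gap per byte. The sketch below is a generic MPI example of this idea, not the measurement code used for this paper; message sizes and repetition counts are arbitrary choices.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define REPS 1000

int main(int argc, char **argv) {
    int rank, bytes;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* Requires at least two ranks. Place ranks 0 and 1 in the same hypernode to
       estimate s and S, or in different hypernodes to estimate o, L and G. */
    for (bytes = 1; bytes <= 1 << 20; bytes <<= 2) {
        char *buf = malloc(bytes);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < REPS; i++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double half_rtt = (MPI_Wtime() - t0) / (2.0 * REPS);  /* one-way estimate */
        if (rank == 0)
            printf("%8d bytes  one-way %.2f us\n", bytes, half_rtt * 1e6);
        free(buf);
    }
    MPI_Finalize();
    return 0;
}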

Table 5: LoGH{sSP} signature of the Convex Exemplar with MPI, in µs. The overheads shown cover 100-byte and 10000-byte messages. The gap parameters G and S correspond to peak inter-hypernode and intra-hypernode communication bandwidths, respectively.

L      o        G      H   s      S      P
4-10   30-200   0.05   2   2-20   0.01   4

Figure 4: The Diamond DAG.

4.2 LoGH{sSP} Schedule for the Diamond DAG
In this section we derive and analyze efficient schedules for the Diamond DAG (directed acyclic graph) using the LoGH{sSP} model. The Diamond DAG with n × n vertices can be represented as an n × n grid, with vertices placed at the intersection points (see Figure 4). The vertices are the tasks, with dependencies represented by the edges; horizontal edges point to the right and vertical edges point to the top. We define a unit time as the time to compute a task, and all tasks require the same computational time. The LoGH{sSP} parameters are expressed in terms of this unit time. The Diamond DAG is important for a number of applications: dynamic programming algorithms are in many cases reducible to it, and these algorithms are extensively used in molecular biology, for instance in DNA chain comparison. The Diamond DAG has previously been analyzed in a delay model [6] and in CLAUD [2].
The evaluation of the Diamond DAG on a TTA implies the following activities: computation of the tasks, and intra- and inter-hypernode inter-processor communication (IPC). One way to represent these activities on a time scale is to use a Gantt chart defining a schedule of the parallel computation, i.e. the times when each activity starts and ends. The Gantt chart in Figure 5 represents a feasible schedule for the Diamond DAG in LoGH{sSP}. We describe the schedule in what follows. The DAG is partitioned into equal coarse horizontal stripes, each allocated to one hypernode. Each coarse stripe is partitioned into equal fine stripes allocated to the processors within a hypernode, one fine stripe per processor. Two processors within a hypernode are used for inter-hypernode communication, so the number of fine stripes is P − 2. Each fine stripe is split into m equal rectangles with c = n²/(H(P−2)m) tasks each. The intra-hypernode IPC occurs once a rectangle is computed. Let us enumerate the fine stripes from the bottom. A processor allocated to some stripe q > 1 starts computing a block once the n/m data items associated with the edges connecting two vertical blocks are available. These data are received from the processor allocated to stripe q − 1. The communication takes Sn/m + 2s units of time: 2s for inter-processor synchronization by both processors and Sn/m for passing the data. We denote the sum Sn/m + s by e. The inter-hypernode communication is handled by two processors, one receiving data and one sending. Once the tasks in the block belonging to the topmost fine stripe of a coarse stripe are computed, the n/m data items are passed to another hypernode. First, the data is copied into a separate region of the shared memory by the processor computing the top stripe. Then the sending processor accesses the shared data and initiates the data transfer, taking s + o units of time. The header of a packet arrives at the destination hypernode after L units of time, and the communication is complete d = Gn/m units of time after the header has arrived. Our schedule is valid if o + d ≤ c + s, that is,

o + G\,\frac{n}{m} \;\le\; \frac{n^2}{H(P-2)m} + s \qquad (1)

Note that in the case 2(o + d) + e + s ≤ c + s, one processor can handle both receiving and sending data, allowing the remaining P − 1 processors to do the computational work. Define the makespan M as the elapsed time for computing the Diamond DAG. We represent it in the following form:

M = (H - 1)\,w + u - y \qquad (2)

where w = (P − 2)(s + c + e) + L + d + o + e + s is the time between the arrival of the header of the first packet at a hypernode and its arrival at the next hypernode; u = (P − 2 + m − 1)(e + c + s) + d + o is the time from the moment when the header of the first packet arrives at the hypernode computing the last coarse stripe until the DAG is evaluated; and finally y = d + o + e + s must be subtracted from the makespan because the hypernode computing the first coarse stripe does not receive packets.
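The makespan expression (2) is straightforward to evaluate numerically. The sketch below implements the components c, e, d, w, u and y exactly as written above; it is our reading of the schedule, not code from the original paper, and the parameter values in the example are arbitrary.

#include <stdio.h>

/* Evaluate the stripe-partitioned Diamond DAG makespan of equation (2).
   All model parameters are expressed in task (unit-time) units, as in the text. */
static double makespan(double n, double H, double P, double m,
                       double L, double o, double G, double s, double S) {
    double c = n * n / (H * (P - 2.0) * m);  /* tasks per rectangle           */
    double e = S * n / m + s;                /* intra-hypernode data transfer */
    double d = G * n / m;                    /* inter-hypernode data transfer */
    double w = (P - 2.0) * (s + c + e) + L + d + o + e + s;
    double u = (P - 2.0 + m - 1.0) * (e + c + s) + d + o;
    double y = d + o + e + s;
    return (H - 1.0) * w + u - y;
}

int main(void) {
    /* Example: n = 1024 tasks per side, 2 hypernodes of 4 processors,
       blocking factor m = 64; parameter values chosen only for illustration. */
    printf("M = %.0f unit times\n",
           makespan(1024, 2, 4, 64, 7, 100, 0.05, 10, 0.01));
    return 0;
}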

Figure 5: A computational and communication schedule for the Diamond DAG in the LoGH{sSP} model. The rectangular block c is associated with the time to compute n²/(H(P−2)m) tasks, s with the inter-processor synchronization overhead, e with the overhead Sn/m + s for passing data between two processors within a hypernode, d with the overhead Gn/m for passing a packet of size n/m between hypernodes, and o with the startup overhead. The arrows represent the latency L associated with communication between the hypernodes. The Gantt chart shows a schedule for two hypernodes.

In our schedule we have a parameter m which we choose so as to minimize the makespan. The extremum of M in m, if it exists, can be found from the equation ∂M/∂m = 0, which gives

m_{opt} = \frac{n}{\sqrt{2s}}\,\sqrt{1 - \frac{1}{H(P-2)} + \frac{(P-1)HS + (H-1)G - 3S}{n}} \qquad (3)
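A small companion sketch computes the blocking factor from expression (3) and clamps it to the admissible range 1 ≤ m ≤ n, as chosen in the discussion that follows; the example values are arbitrary.

#include <math.h>
#include <stdio.h>

/* Optimal blocking factor per equation (3), clamped to 1 <= m <= n. */
static double blocking_factor(double n, double H, double P,
                              double G, double s, double S) {
    double inner = 1.0 - 1.0 / (H * (P - 2.0))
                 + ((P - 1.0) * H * S + (H - 1.0) * G - 3.0 * S) / n;
    double m_opt = n / sqrt(2.0 * s) * sqrt(inner);
    if (m_opt < 1.0) return 1.0;
    if (m_opt > n)   return n;
    return m_opt;
}

int main(void) {
    printf("m_opt = %.1f\n", blocking_factor(1024, 2, 4, 0.05, 10, 0.01));
    return 0;
}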

Checking that the second derivative is positive, we find that m_opt gives a minimum. As 1 ≤ m ≤ n, the best we can do is to choose m = min(n, max(1, m_opt)). Consider the asymptotics n → ∞ when H and P are fixed such that H(P − 1) ≫ 1. Observe that m_opt = Θ(n), i.e. for sufficiently large n it is more than 1. Suppose m_opt < n. Choosing m = n/√(2s) and substituting it into (2), we get M = n²/(H(P − 2)) + O(n), i.e. the makespan is asymptotically optimal. If m_opt > n, we set m = n, which also gives an optimal value for the makespan.
Suppose we can vary the number of hypernodes in the system, and let us find the number of hypernodes minimizing the makespan. Observe that inequality (1) expresses the fact that the inter-hypernode communication is less than or equal to a computation within a hypernode. Notice that the minimal makespan (with this schedule) is achieved when the inequality holds. Indeed, suppose that the communication overhead dominates the computation, i.e. o + d > c + s, for some numbers of hypernodes H and H′ such that H′ < H; then H′ hypernodes evaluate the DAG with fewer inter-hypernode communications and the same number of communications within a hypernode, and therefore M(H′) < M(H). Let H′ denote the number of hypernodes yielding equality in (1). Solving ∂M/∂H = 0, we find the formal minimum

H_{opt} = \frac{n\sqrt{(P-1)\,2s\,(1 - 1/m)}}{(P-2)\left(2o + L + ((P-1)S + G)\,n/m\right)} \qquad (4)

Thus the number of hypernodes minimizing the makespan is H = min(H′, H_opt). Equations (4) and (3) contain recursive dependences between H and m; an approximate solution can be found by breaking them at some iteration. Assume that the time to compute a task is less than s, i.e. s > 1. Choosing m = n/√(2s) and substituting it into (4), we get H_opt(m). Suppose H_opt(m) < H′, which may hold only for relatively small values of P. Substituting m and H = H_opt(m) into (2), we get

M = n\left(2\sqrt{\frac{\Delta}{P-2}} + 2\sqrt{2s} + S\right) + O(1) \qquad (5)

where \Delta = 2o + L + (P-1)\left(2s + \sqrt{2s}\,S\right) + G\sqrt{2s}. In the case H_opt > H′, we substitute H = H′ into (2), which gives

M = n\left(\sqrt{2s} + \frac{2s\left((P-1)\Phi + L\right)}{(P-2)\left(\Phi - 2s - S\sqrt{2s}\right)}\right) + O(1) \qquad (6)

where \Phi = o + s + \sqrt{2s}\,(S + G).
One interesting result of this section is that the optimal blocking factor m asymptotically approaches n/√(2s) for larger configurations. Other practical what-if scenarios could be investigated based on the analysis presented in this section; one could derive, for example, the optimal (H, P) configuration given a fixed total number of processors in the system.

5 Partitioning Hints
In the following we outline a few strategies for partitioning schemes on two-tiered systems. These strategies aim to reduce the number of off-hypernode communication events and the communication volume:
- grouping off-hypernode messages into larger messages (see the sketch after this list);
- minimizing the number of off-hypernode communication events;
- partitioning communication graphs in such a way that fine-grained parts are scheduled within hypernodes;
- scheduling the critical-path tasks within a hypernode;
- using task replication [6] between hypernodes.
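As an illustration of the first strategy, the sketch below batches several small payloads destined for the same hypernode into one staging buffer and ships them with a single send, so that the per-message startup cost (roughly 2o + L) is paid once rather than per payload. The buffer layout, sizes and MPI usage are illustrative assumptions, not a scheme prescribed by the paper.

#include <mpi.h>
#include <string.h>

/* Aggregate several small payloads for the same destination hypernode into one
   buffer and send them together. Assumes MPI is initialized and that each
   payload plus its length header fits in MAX_AGG. */
#define MAX_AGG (64 * 1024)

typedef struct {
    char buf[MAX_AGG];
    int  used;
    int  dest;   /* rank of the peer process in the destination hypernode */
} agg_buffer_t;

static void agg_flush(agg_buffer_t *a) {
    if (a->used > 0) {
        MPI_Send(a->buf, a->used, MPI_CHAR, a->dest, 0, MPI_COMM_WORLD);
        a->used = 0;
    }
}

static void agg_add(agg_buffer_t *a, const void *payload, int len) {
    if (a->used + (int)sizeof(int) + len > MAX_AGG)
        agg_flush(a);                                  /* one combined send */
    memcpy(a->buf + a->used, &len, sizeof(int));       /* length header     */
    a->used += sizeof(int);
    memcpy(a->buf + a->used, payload, len);            /* payload body      */
    a->used += len;
}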

6 Conclusion
We have outlined a model suitable for analysing the performance of TTAs. The LoGH{sSP} model is based on seven parameters capturing the relevant performance aspects at each level. It is motivated by the fact that current models do not reflect the important performance characteristics of TTAs. TTAs will become more and more common in the future because they have a competitive price/performance ratio, provide incremental scalability and can be used for a wide range of commercial applications. We discussed the model parameters and applied the model in two different ways. First, we extracted the model parameters for MPI on a Convex Exemplar SPP1600 machine. Second, we derived and analysed the optimal makespan in the case of stripe partitioning for the Diamond DAG. Based on the LoGH{sSP} model, we suggested guidelines for partitioning algorithms on such machines. Our hope is that this work will encourage the design of general-purpose partitioning heuristics for two-tiered architectures.

References
[1] A. Alexandrov, M. F. Ionescu, K. E. Schauser, and C. Scheiman. LogGP: Incorporating Long Messages into the LogP Model. Proc. of SPAA'95, Santa Barbara, July 1995.
[2] G. Chochia, C. Boeres, and P. Thanisch. Analysis of Multicomputer Schedules in a Cost and Latency Model of Communication. Abstract Machine Workshop, 1996.
[3] D. Culler, R. Karp, D. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. von Eicken. LogP: Towards a Realistic Model of Parallel Computation. In Proc. of the Fourth ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming, San Diego, CA, May 1993.
[4] D. Culler, L. Liu, R. Martin, and C. Yoshikawa. LogP Performance Assessment of Fast Network Interfaces. Technical Report, Computer Science Division, University of California, Berkeley, Nov. 1995.
[5] W. F. McColl. BSP Programming. In G. Blelloch, M. Chandy, and S. Jagannathan, editors, Proc. DIMACS Workshop on Specification of Parallel Algorithms, Princeton, May 9-11, 1994. American Math. Society, 1994.
[6] C. H. Papadimitriou and M. Yannakakis. Towards an Architecture-Independent Analysis of Parallel Algorithms. SIAM J. Comput., 19:322-328, 1990.
[7] Tandem Computers. www.tandem.com/
[8] Sequent Computer Systems. www.sequent.com/
[9] T. Thomson. The Network in the Server. BYTE, July 1996.
[10] Convex Technology Center of HP. www.convex.com/
[11] NCR. www.ncr.com/
[12] J.-J. Hwang, Y.-C. Chow, F. D. Anger, and C.-Y. Lee. Scheduling Precedence Graphs in Systems with Interprocessor Communication Times. SIAM J. Comput., 18(2):244-257, 1989.
[13] L.-E. Thorelli. The EDA Multiprocessing Model. Tech. Report TRITA-IT-R 94:28, CSLab, Dept. of Teleinformatics, KTH, 1994.
[14] K. L. Johnson, M. Frans Kaashoek, and D. A. Wallach. CRL: High-Performance All-Software Distributed Shared Memory. Proceedings of the Fifteenth Symposium on Operating Systems Principles, December 1995.
[15] MPI: A Message-Passing Interface Standard. June 12, 1995.
[16] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, and V. Sunderam. PVM: Parallel Virtual Machine - A Users' Guide and Tutorial for Networked Parallel Computing. MIT Press, 1994.
[17] T. von Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser. Active Messages: A Mechanism for Integrated Communication and Computation. Proc. of the 19th ISCA, May 1992.
[18] Alpha Data Ltd., Causewayside 82, Edinburgh, Scotland. Context Switching Evaluation on Alpha Boards. Product Specifications, 1996.
[19] M. I. Frank, A. Agarwal, and M. K. Vernon. LoPC: Modeling Contention in Parallel Algorithms. Proc. of the Sixth ACM SIGPLAN Symp. on Principles and Practice of Parallel Programming, Las Vegas, Nevada, June 1997.
[20] K. Keeton, T. Anderson, and D. Patterson. LogP Quantified: The Case for Low-Overhead Local Area Networks. Hot Interconnects III: A Symp. on High Performance Interconnects, Stanford Univ., Stanford, CA, Aug. 1995.
[21] D. Yeung, J. Kubiatowicz, and A. Agarwal. MGS: A Multigrain Shared Memory System. Proceedings of the 23rd ISCA, pages 44-55, May 1996.
[22] M. Ess, editor. ARSC Cray T3D User's Group Newsletter 83, April 1996. Available from http://www.arsc.edu/user/TUG.shtml
[23] O. Goerlich, IBM Informationssysteme GmbH. Parallel Databases: SMPs and Multi-Purpose Parallel Computers. IT Verlag für innovative Technologien GmbH, Munich, May 1996.
[24] K. Hwang. Advanced Computer Architecture. McGraw-Hill, 1993.
