
Future Generation Computer Systems 855 (2001) 1–13

OVM: Out-of-order execution parallel virtual machine


George Bosilca∗ , Gilles Fedak, Franck Cappello


Laboratoire de Recherche en Informatique (UMR 8623), Université Paris-Sud, 91405 Orsay Cedex, France


Abstract


High performance computing on parallel architectures currently uses different approaches depending on the hardware memory model of the architecture, the abstraction level of the programming environment and the nature of the application. In this article, we introduce an original client–server execution model based on RPCs, called the out-of-order parallel virtual machine (OVM). OVM aims to provide three main features: portability through a unique memory model, load balancing through a plug-in support and high performance through several optimizations. The main optimizations are: non-blocking RPCs, data-flow management, persistent and non-persistent data, static data set distribution, dynamic scheduling and asynchronous global operations. We present the general architecture of OVM and demonstrate high performance for regular parallel applications, a parallel application with load balancing needs and a parallel application with real-time constraints. We first compare the performance of OVM and MPI for three kernels of the NAS 2.3 benchmarks. We then illustrate the performance capability of OVM on a large real-life application with load balancing needs, called AIRES. Finally, we present the performance of a real-time version of the PovRay ray-tracer, demonstrating the reactiveness of OVM. © 2002 Published by Elsevier Science B.V.


Keywords: RPC; OVM; Beowulf; Cluster


1. Introduction


Today, parallel programmers face at least three decision parameters when choosing a parallel programming model. The first parameter concerns the fast evolution of parallel architectures. Ten years ago, the dominant high performance computers were vector machines. In the nineties, MPPs with message passing or shared memory systems became very popular. Today, parallel computers are clusters of multiprocessors gathering SMP nodes connected by a high speed network. The computers of the next generation, built around IBM Power4, Alpha 21364 or MIPS R14000 processors, will be clusters of NUMA nodes with a very deep memory hierarchy (four or five levels).




Corresponding author. E-mail addresses: [email protected] (G. Bosilca), [email protected] (G. Fedak), [email protected] (F. Cappello).


The programmer should therefore choose a programming model very carefully, so that it remains efficient and portable across several generations of parallel machines. The second parameter concerns the abstraction level of the programming model. The programmer has a choice between three major categories. Hardware-inspired approaches program the parallel computer using its native memory model: message passing, shared memory or even a mix of them. Other models use abstractions such as communicating threads, tuple spaces or communicating objects to provide a programming model portable across a large range of platforms. Finally, languages like HPF [1] offer a higher level of abstraction, allowing the programmer to manipulate application features (data parallelism) rather than architecture features (the memory system). Performance is crucial for a parallel architecture, and hardware-inspired models are often considered the most efficient way to program parallel computers.


2. OVM


2.1. General principles


OVM is a parallel programming environment based on a client–server architecture. It provides: (a) an API for programming applications, either from scratch or from an existing sequential version, and (b) a runtime support managing multiple computing resources.


Fig. 1. General organization of OVM parallel execution model.


However, portability may force the choice of a higher-level programming model. The nature of the application is the third parameter. The application may be regular, with predictable behavior (e.g. dense matrix computation), or irregular, without it (e.g. sparse matrix computation). Some applications need to dynamically expand and reduce the number of tasks during the execution; typical examples of such applications are branch and bound methods. Other applications require high responsiveness and a regular throughput. Depending on the nature of the application, the programmer may choose an existing programming model or develop his own execution model on top of an existing programming model. In addition to the time consumed by this latter approach, the programmer will still face the problems of performance and portability. The out-of-order execution parallel virtual machine (OVM) system is an original computing model designed to reach high performance with a client–server architecture. In this paper, we demonstrate that OVM provides high performance on regular parallel applications, on irregular parallel applications with load balancing needs and on a parallel application with some real-time constraints. Moreover, because OVM provides a higher abstraction level than hardware-inspired environments, it is portable across a large range of architectures. Section 2 presents the OVM principles and design. Section 3 reports the performance of the basic OVM operations. Section 4 compares OVM with a traditional programming approach (MPI) on high performance regular applications (the NAS NPB 2.3 benchmarks). Section 5 presents the parallelization and the performance of a large real-life irregular application with huge load balancing needs, called AIRES. Section 6 describes the performance of a real-time version of the PovRay ray-tracer parallelized with OVM.

The general architecture of OVM is based on two main features: an RPC-style programming interface and a built-in dynamic load balancing mechanism. Fig. 1 presents the general organization of the OVM parallel execution model. An OVM program involves three entities: the client program, a function library loaded by all servers and a scheduling plug-in that controls the load balancing strategy of the broker. The client executes a sequential control program; the currently supported languages are Fortran and C. The programmer inserts OVM RPC calls in the client program. During the execution, when the client reaches an OVM RPC statement, the OVM library sends the RPC execution request to the broker. The broker receives the RPC request, selects a server and schedules the RPC request on that server. Because OVM RPCs are non-blocking, the client can launch several outstanding RPCs in sequence. The broker manages these RPCs, launching them on different servers according to the scheduling policy. As a result, several outstanding RPCs run concurrently on different servers, leading to a parallel execution.
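As a concrete illustration of this flow, the following minimal client sketch launches several outstanding non-blocking RPCs. The paper does not publish the OVM API, so every name below (ovm_init, ovm_rpc_async, ovm_wait_all, ovm_finalize, SERVICE_EXAMPLE) is hypothetical, and the no-op stubs merely stand in for a real broker connection so that the sketch is self-contained:

/* Hypothetical client-side sketch; the real OVM API is not given in the paper. */
#include <stddef.h>
#include <stdio.h>

typedef int ovm_request_t;                    /* opaque RPC handle (assumed) */

/* Trivial stubs standing in for a real OVM runtime. */
static int ovm_init(void) { return 0; }       /* connect the client to the broker */
static ovm_request_t ovm_rpc_async(const char *service, const void *arg, size_t len)
{
    (void)arg; (void)len;
    printf("request %s handed to the broker\n", service);
    return 0;
}
static void ovm_wait_all(void) { }            /* block until outstanding RPCs finish */
static void ovm_finalize(void) { }

int main(void)
{
    ovm_init();

    /* Launch several outstanding non-blocking RPCs.  The broker receives each
     * request, selects a server through its scheduling plug-in and forwards
     * the request, so the eight calls below execute concurrently. */
    for (int task = 0; task < 8; task++)
        ovm_rpc_async("SERVICE_EXAMPLE", &task, sizeof task);

    ovm_wait_all();                           /* synchronize with the servers */
    ovm_finalize();
    return 0;
}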


2.2. OVM programming approach


The OVM programming model for regular applications is close to the SPMD programming style. The programmer has to distribute the data sets among the computational nodes. He also has the responsibility of decomposing the global workload into subtasks. The programmer may use global data movements for the communications between the client and the servers and between the servers. There are several significant differences between the OVM programming approach and the traditional SPMD message passing approach. In OVM, the programmer does not manage: (a) the workload distribution among the computational nodes, (b) the point-to-point communication between nodes and (c) the synchronization of the computational nodes.


As the first program interaction with OVM, the client requests the creation of the arrays colidx, rowstr and a on the server side. For this implementation, these arrays are duplicated on all servers; note that the original MPI implementation of CG also duplicates them on all MPI nodes. The first part of the code also creates params on the server side. This variable serves to distribute the work among the servers: it contains a parameter set (mainly loop boundaries) which is different on every server. The client transfers the actual content of colidx, rowstr and a using the OVM broadcast global operation. For params, the client makes several individual transfers using the put operation. This part of the code is not shown on the figure. CG continues on the client with a sequential section. Then the code reaches the main iteration block. This block calls the conjugate gradient subroutine and contains some normalization operations. This code section is also executed on the client. The next part of the code shows the conjugate gradient subroutine. Some parts of this subroutine remain executed on the client, such as the loop with label 101. The next part is the main computational loop nest. This is actually the only part of the whole CG program that uses OVM RPCs. The loop nest itself is not shown on the figure; it is part of the server function library. The P array is the only variable which is modified by the loop nest. It is also the only variable actually distributed among the servers, so it is broadcast and gathered, respectively, before and after each iteration of the loop with label 111. The OVM preprocessor provides OVM loops.


Load balancing is managed by the broker. Point-to-point communications are initiated by the broker according to the workload management and correspond to subtask migrations from one computational node to another. The synchronization is governed only by the dependencies among operations (computations and communications). The programmer makes requests to OVM by annotating a sequential program with directives. The directive annotations concern data management (creation, movement) between client and servers, global operations and the identification of the functions to execute remotely. A pre-processor translates the programmer directives into low level API function calls. In the translation process, the pre-processor prepares the requests for the broker. The programmer may also use some API functions directly, although this programming level is not recommended because real applications usually require a very large number of API function calls. As an example, Fig. 2 presents some parts of the NAS 2.3 CG benchmark. This code corresponds to the client part of the application. All OVM RPC calls are asynchronous. This parallel version directly derives from the sequential version of CG. The OVM version of CG starts on the client with some sequential and initialization sections (this part of the code is not shown on the figure).

Fig. 2. Sketch of the client part of CG NAS benchmark parallelized with OVM.


Fig. 3. Server part of the CG program.


2.3.1. Reduced RPC critical path

The performance of OVM programs relies on the capability of the system to launch remote executions with minimum latency, because all potentially concurrent executions are launched sequentially from the client. As for communication libraries, we define a critical path (Fig. 4) between the client and the server. The client, broker and server software architectures have been designed to minimize this critical path. As previously mentioned, a pre-processor translates the OVM directives in the client code into C statements and OVM API calls. There is no buffer copy between the client code and the API functions. The API functions directly call the communication library functions. There is no scheduling management on the client side. The client sends synchronous or asynchronous requests and waits for the result, which is provided by the broker.


The performance of OVM relies on several optimization principles. Some of them are related to the software architecture design and the communication support. The others are related to extensions of the RPC programming style.


2.3. OVM high performance RPC


The preprocessor translates an OVM FOR loop into several C and OVM API statements in order to construct an RPC launching loop. The name of the service (function) called on the server side is SERVICE_CG. In this part of the code, several outstanding non-blocking RPCs are sent from the client to the broker. The broker forwards the requests to different servers, leading to a concurrent execution of the RPCs. Fig. 3 presents the server part of the CG code. The code corresponding to the server part of CG is nearly the same as the main computational loop nest of the original sequential version. The params variable carries the loop boundaries. The OVM implementation of CG demonstrates that the program stays close to the sequential version except for the parallel part, and for this part only some simple modifications are applied to the original code. As in the SPMD programming style, the programmer has to distribute data and work among the processors. However, he does not deal with point-to-point communications and only cares about the global operations on distributed data sets. The programmer has the responsibility of selecting the parts of his program to execute remotely.
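To make the mechanism concrete, the sketch below shows, in C rather than the paper's Fortran, the kind of service such an RPC launching loop would invoke: each request carries a params structure with its loop boundaries, and the service computes the corresponding rows of the sparse matrix–vector product used by CG. The structure layout and names are illustrative, not taken from the actual OVM library:

#include <stdio.h>

struct cg_params { int row_lo, row_hi; };        /* loop boundaries for one server */

/* Partial sparse matrix-vector product q = A * p over rows [row_lo, row_hi),
 * with A stored in CSR form (rowstr/colidx/a), as in NAS CG. */
static void service_cg(const struct cg_params *prm,
                       const int *rowstr, const int *colidx,
                       const double *a, const double *p, double *q)
{
    for (int i = prm->row_lo; i < prm->row_hi; i++) {
        double sum = 0.0;
        for (int k = rowstr[i]; k < rowstr[i + 1]; k++)
            sum += a[k] * p[colidx[k]];
        q[i] = sum;
    }
}

int main(void)
{
    /* 3x3 toy matrix [[2,1,0],[0,3,0],[1,0,4]] in CSR form. */
    int    rowstr[] = {0, 2, 3, 5};
    int    colidx[] = {0, 1, 1, 0, 2};
    double a[]      = {2, 1, 3, 1, 4};
    double p[]      = {1, 1, 1};
    double q[3];

    /* Two "servers", each handling a different row range (different params). */
    struct cg_params s0 = {0, 2}, s1 = {2, 3};
    service_cg(&s0, rowstr, colidx, a, p, q);
    service_cg(&s1, rowstr, colidx, a, p, q);

    printf("q = %.0f %.0f %.0f\n", q[0], q[1], q[2]);   /* expected: 3 3 5 */
    return 0;
}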


Fig. 4. General architecture and critical path for remote executions.


2.3.2. Very fast communications

Of course, the round-trip latency of RPCs depends not only on the architecture of the broker and the servers but also on the communication support. To achieve very fast RPCs, OVM relies on high performance SANs or LANs (Myrinet, SCI, Gigabit Ethernet, ServerNet, Giganet, Quadrics) and user-level protocols (GM, PM, Active Messages, Fast Messages, VIA, BIP).

2.3.4. Decoupled RPCs and persistent data

OVM provides two ways of communicating arguments to RPCs. The first way is similar to traditional RPCs: arguments are passed to the remote function by value and the data are sent from the client to the server (crossing the broker) within the RPC message. This style of argument communication fits the cases where: (1) the size of the arguments is small (less than a threshold depending on the network performance) or (2) the transferred data cannot be reused by the server for subsequent RPCs. When the arguments are large and can be reused by the server, it is more profitable to store them on the server side. This feature corresponds to the notion of persistent data on the server side. RPCs using persistent data on the server side are called decoupled RPCs. In decoupled RPCs, arguments are passed by reference, so the argument values must already exist on the server side before the RPC call. OVM provides five main operations related to persistent data on the server side (moving data between the client and the servers and between servers): create, kill, put, get and move. The first two operations create (respectively destroy) a persistent variable on the server side and return references that the client can later use to store and read data sets in remote function calls. The put operation writes values into existing persistent variables on the server side using these references; get does the reverse (it reads a value from the server side and writes it on the client). The move operation allows the point-to-point transfer of a persistent variable between two servers. Fig. 5 presents the protocols of the put and get operations; the numbers on the figure describe the message order.
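A typical persistent-data life cycle, as described above, could look like the following sketch. The paper does not give the actual OVM function signatures, so ovm_create, ovm_put, ovm_get, ovm_kill and ovm_rpc_async_ref are hypothetical names and the stubs only make the sketch self-contained:

#include <stddef.h>

typedef int ovm_ref_t;                                    /* reference to persistent data */

/* Stubs standing in for a real OVM runtime (hypothetical API). */
static ovm_ref_t ovm_create(size_t bytes) { (void)bytes; return 1; }
static void ovm_put(ovm_ref_t r, const void *src, size_t n) { (void)r; (void)src; (void)n; }
static void ovm_get(ovm_ref_t r, void *dst, size_t n)       { (void)r; (void)dst; (void)n; }
static void ovm_kill(ovm_ref_t r) { (void)r; }
static void ovm_rpc_async_ref(const char *service, ovm_ref_t in, ovm_ref_t out)
{ (void)service; (void)in; (void)out; }

int main(void)
{
    double x[1024] = {0};

    /* Allocate two persistent variables on the server side and fill the input. */
    ovm_ref_t rx = ovm_create(sizeof x);
    ovm_ref_t ry = ovm_create(sizeof x);
    ovm_put(rx, x, sizeof x);

    /* Decoupled RPCs: arguments are passed by reference, so the data set is
     * transferred once and reused by consecutive calls (the broker tracks the
     * read-after-write dependency between the two services). */
    ovm_rpc_async_ref("STEP1", rx, ry);
    ovm_rpc_async_ref("STEP2", ry, rx);

    ovm_get(rx, x, sizeof x);        /* blocking get, also acts as a synchronization */

    ovm_kill(rx);
    ovm_kill(ry);
    return 0;
}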

2.3.3. Non-blocking RPCs

OVM provides blocking and non-blocking RPCs. Non-blocking RPCs are the cornerstone of concurrency in the RPC programming style. Typically, the client loops on non-blocking RPC calls to launch a large set of concurrent remote executions.

Fig. 5. Put and Get operation protocols.


The client's asynchronous RPCs and the corresponding server interactions are driven by data-flow dependencies. If the client needs to be informed of the termination of an asynchronous RPC, it requests the result of this RPC using a get operation. Since the get operation is blocking, the client waits until the result of the RPC comes back from the broker before continuing.


The basic roles of the broker are to schedule the RPC requests on the servers, to manage the server workloads and to orchestrate data allocation, deallocation, reuse and movement on the server side. Since scheduling and workload management may require substantial computation, the broker architecture uses a hierarchical design with two levels. The first selection level is dedicated to fast execution, so it only uses fast, simple scheduling algorithms for server selection (short-term decisions). Basically, the broker checks all arguments and launches the execution on a server if all arguments are available on this server. If any argument is missing on the server or is not available because of data dependencies, the request is stored in a list of pending requests. The second level deals with the execution order of the pending requests and with the server workload. The long-term scheduler may decide to migrate the workload of a server, partially or totally, to one or several other servers. The server software architecture uses a multi-threading approach (Fig. 4). It follows a classical scheme where one thread listens for requests while a pool of threads waits. When a request arrives, the listening thread unlocks one of the waiting threads, which becomes the new listener, and starts to process the request. The programmer may add some prefetch annotations (or API function calls) in the client program to request library prefetching on the server side.
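The first selection level can be pictured as a simple residency test, as in the toy model below. The data structures (a residency table indexed by server and data set) are invented for this illustration and do not come from the OVM sources:

#include <stdbool.h>
#include <stdio.h>

#define NSERVERS 4
#define NARGS    2

/* residency[s][d] is true when server s already holds persistent data set d. */
static bool residency[NSERVERS][8] = {
    { true,  true  },                 /* server 0 holds data sets 0 and 1 */
    { false, true  },                 /* server 1 holds data set 1 only   */
};

struct request { const char *service; int args[NARGS]; };

/* Short-term decision: return the selected server, or -1 to put the request
 * in the pending list handled by the long-term scheduler. */
static int short_term_select(const struct request *rq)
{
    for (int s = 0; s < NSERVERS; s++) {
        bool all_here = true;
        for (int i = 0; i < NARGS; i++)
            if (!residency[s][rq->args[i]]) { all_here = false; break; }
        if (all_here)
            return s;                 /* launch immediately on this server */
    }
    return -1;                        /* a required argument is missing somewhere */
}

int main(void)
{
    struct request r1 = { "SERVICE_A", {0, 1} };
    struct request r2 = { "SERVICE_B", {1, 2} };
    printf("%s -> server %d\n", r1.service, short_term_select(&r1));  /* 0 */
    printf("%s -> server %d\n", r2.service, short_term_select(&r2));  /* -1: pending */
    return 0;
}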


2.3.5. Data-flow execution

During the execution, a data-flow graph is built from the RPC requests. In a typical sequence, the client sends several asynchronous RPC requests to the broker. The broker examines these requests to check that the arguments of each RPC are ready on the server side. If one of the arguments is not ready, the broker stores the request in the pending list. When a server ends a service, the result can be transferred to the client, and the server informs the broker of all the arguments that have been modified. The broker then updates the pending list of delayed requests, and the requests that become ready to launch are sent by the broker to the servers. In principle, the data-flow mechanism and implementation of OVM are very close to the instruction management of an out-of-order superscalar microprocessor.
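The wakeup step of this mechanism, comparable to the wakeup/select logic of an out-of-order processor, can be sketched as follows; the structures are again invented for illustration only:

#include <stdbool.h>
#include <stdio.h>

#define NDATA  8
#define NARGS  2

static bool ready[NDATA];                       /* readiness of persistent data sets */

struct pending { const char *service; int args[NARGS]; bool launched; };

/* A server reports the data sets it has modified; the broker marks them ready
 * and releases every pending request whose arguments are now all available. */
static void wakeup(struct pending *list, int n, const int *produced, int nprod)
{
    for (int i = 0; i < nprod; i++)
        ready[produced[i]] = true;

    for (int i = 0; i < n; i++) {               /* rescan the pending list */
        if (list[i].launched) continue;
        bool ok = true;
        for (int a = 0; a < NARGS; a++)
            if (!ready[list[i].args[a]]) { ok = false; break; }
        if (ok) {
            list[i].launched = true;            /* broker forwards it to a server */
            printf("launch %s\n", list[i].service);
        }
    }
}

int main(void)
{
    struct pending list[] = { { "STEP2", {0, 1}, false },
                              { "STEP3", {1, 2}, false } };
    int produced1[] = {0, 1};                   /* a service finished, wrote 0 and 1 */
    wakeup(list, 2, produced1, 2);              /* releases STEP2 only */
    int produced2[] = {2};                      /* another service wrote data set 2 */
    wakeup(list, 2, produced2, 1);              /* now releases STEP3 */
    return 0;
}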


2.3.7. Asynchronous global operations

OVM provides a set of global operations on persistent data. These operations collect data spread among the servers (gather), distribute data among the servers (scatter), compute a global result from distributed data (reduce), copy data from the client to several servers (broadcast) and exchange data among the servers (alltoall). These operations are executed like asynchronous RPCs: they are launched by the client and managed by the broker. If there are several outstanding global operations, they are scheduled according to the data dependencies and the server workload, so the termination order of several consecutive global operations may not be related to their launch order. Moreover, the synchronous progress of the sub-operations belonging to a global operation is not guaranteed. Global operations work within a communication group. Before launching a global operation, the client describes the data set to use. The broker sets up a communication group including all the servers that hold a part of this data set. The client may request several global communications. Since each data set may be spread among a subset of the servers: (1) not all servers necessarily belong to a given communication group and (2) a server can belong to one or several communication groups at the same time.


The programmer decomposes a large data set into several subsets and allocates them on the server side using the OVM persistent data operations (create and put). The broker distributes them among the servers using a round-robin scheme. When the client launches RPCs, it describes which subset the function should be applied to, and the broker automatically selects the server that holds the required subset. For fine grain tasks and small argument sizes, it is preferable to let the broker itself select the server. Using persistent data would prevent this "on-line" server selection, so dynamic scheduling relies on non-persistent data and on dynamic server allocation by the broker. The programmer uses RPCs that encapsulate the arguments in the message. When the broker receives such RPCs, it selects a server based on its workload, without checking data dependencies. This fine grain scheduling is based on a first-come-first-served approach. The two control approaches can be used individually, in sequence or simultaneously within the same program.
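The dynamic path just described can be modeled as a first-come-first-served assignment to the least loaded server, as in the sketch below; the size threshold and the load metric are assumptions made only for the illustration:

#include <stddef.h>
#include <stdio.h>

#define NSERVERS 4
#define SMALL_ARG_THRESHOLD 4096          /* bytes; assumed to be network-dependent */

static int load[NSERVERS];                /* outstanding requests per server */

/* Small RPCs whose arguments travel inside the message are dispatched to the
 * least loaded server, with no data-dependency check; larger arguments are
 * left to the static, persistent-data path (signaled here by -1). */
static int dispatch_dynamic(size_t arg_bytes)
{
    if (arg_bytes > SMALL_ARG_THRESHOLD)
        return -1;

    int best = 0;
    for (int s = 1; s < NSERVERS; s++)
        if (load[s] < load[best])
            best = s;
    load[best]++;
    return best;
}

int main(void)
{
    for (int i = 0; i < 6; i++)
        printf("task %d -> server %d\n", i, dispatch_dynamic(256));
    return 0;
}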

2.3.6. Static data distribution and dynamic scheduling

OVM provides two ways of controlling the task distribution among the servers, according to the task complexity. We define task complexity as a two-parameter criterion: the first parameter concerns the task duration and the second the size of the task arguments. The programmer is responsible for selecting the appropriate control approach for his program. For coarse grain tasks and large data sets, it is often preferable to use the OVM features for static distribution: (a) persistent data and (b) task allocation on servers driven by the data distribution.


The values are actually transferred along the bold lines. The broker manages the create, kill, put, get and move operations. When the broker selects different servers for data-dependent functions, it has the responsibility of moving the persistent data sets from the producing servers to the consuming ones; the client is not aware of the data transfers issued by the broker. Decoupled RPCs are a key point for reaching high performance with the RPC programming style, because the execution of a real application leads to a sequence of remote function calls that may exhibit data dependencies and locality (read after write, read after read). Persistent data avoids multiple transfers between the client and a server when consecutive function calls have such locality properties.


The global performance of OVM depends strongly on the performance of basic operations such as remote procedure calls, global operations and data movements between the client and the servers and between servers.


3.1. Experimentation platform


All performance measurements presented in this paper use the same parallel experimentation platform. The platform connects 66 nodes, each based on a 200 MHz Pentium Pro processor, through a Myrinet network. 64 nodes are used as servers (64 processors); the two other nodes host the client and the broker. The software environment includes Linux 2.2.17, Score 3.2 [2] and the PGI compilers. The raw performance of Score/PM on Myrinet is a 5 µs latency and a 1 Gbit/s bandwidth.


Table 1
Remote execution overhead

Asynchronous remote void execution + get    50 µs
Synchronous remote void execution           32 µs
Client part of remote execution              7 µs
Client local void function call           0.16 µs

3.2. Remote execution overhead


Table 1 shows the client request latencies. Since the requests return immediately after executing a void function, the remote execution overhead equals the request latency.


3.3. Performance of remote global operations


Global operations only involve the server side. They correspond to MPI global operations (they have the same semantics). Fig. 6 presents the performance of the two versions (OVM and MPI) of the all-to-all data exchange on the server side for 4, 16 and 64 processors. Globally, the OVM global operations have a higher latency than the MPI ones but a similar bandwidth for messages longer than 1024 bytes. The OVM implementation of global operations is customized to achieve the best performance.


Fig. 6. Performance of the all-to-all global operations for OVM (server side) and MPI.


3. Performance of basic OVM operations


If several global operations are launched concurrently, the broker manages the communication groups and makes the global operations progress concurrently.
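The construction of such a communication group can be sketched as follows, using an invented distribution map recording which servers hold a part of each data set:

#include <stdbool.h>
#include <stdio.h>

#define NSERVERS 6

/* holds[s] is true when server s stores a subset of the given data set; the
 * communication group of a global operation is exactly that set of servers. */
static int build_group(const bool holds[NSERVERS], int group[NSERVERS])
{
    int n = 0;
    for (int s = 0; s < NSERVERS; s++)
        if (holds[s])
            group[n++] = s;
    return n;                                    /* group size */
}

int main(void)
{
    bool setA[NSERVERS] = { true, true, true, false, false, false };
    bool setB[NSERVERS] = { false, false, true, true, true, true };
    int  group[NSERVERS];

    int n = build_group(setA, group);            /* e.g. a gather on data set A   */
    printf("gather on A: %d servers:", n);
    for (int i = 0; i < n; i++) printf(" %d", group[i]);
    printf("\n");

    n = build_group(setB, group);                /* a concurrent reduce on data set B */
    printf("reduce on B: %d servers:", n);
    for (int i = 0; i < n; i++) printf(" %d", group[i]);
    printf("\n");
    return 0;
}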


The MPI implementation corresponds to the optimized MPICH version developed at RWCP [2]. Although the global operations are implemented on top of point-to-point communications in both versions (OVM and MPI), the algorithms used by the two versions may differ.

4. OVM versus MPI on NAS Benchmarks


To measure the performance level of OVM for regular applications, we compare it against MPI on three benchmarks (FT, CG, EP) of the NAS 2.3 suite. The MPI codes correspond to the original NAS NPB 2.3 implementation. To program the OVM versions, we start from the sequential versions of the NAS 2.3 and add RPC calls to the codes. Note that we optimized neither the MPI nor the OVM versions for the memory hierarchy, so the two implementations can be considered as using the same optimization level. FT, CG and EP exhibit different computation/communication ratios as well as different communication patterns. CG uses many point-to-point communications with large and small messages. FT uses all-to-all communication patterns. EP is mainly computation bound and requires only one reduction at the end of the program. Table 2 compares the execution times according to the number of nodes for the OVM and MPI implementations of FT, CG and EP (Class A). Table 2 demonstrates that the OVM versions can approach the performance of the MPI versions of the NAS benchmarks EP and FT. A different array distribution, leading to poorer cache utilization, is the main reason for the lower performance of CG with OVM. These results show that the overhead of the built-in load balancing mechanism of OVM is negligible up to 32 nodes.


The Pierre Auger Observatory project is an international effort to study high energy cosmic rays. Two giant detector arrays, each covering 3000 km2, will be constructed in the northern and southern hemispheres. Associated with this observatory is the air shower extended simulations (AIRES) application. AIRES simulates the particle showers produced after the incidence of high energy cosmic rays on the Earth's atmosphere. The complete AIRES 2.2.1 source code consists of more than 590 Fortran 77 routines, adding up to more than 80,000 source lines. Basically, the application manages a stack of particles for which it computes the incidence on the atmospheric particles that are closer to the Earth's surface. The simulation proceeds step by step, following an iterative process. When a high energy cosmic ray enters the atmosphere, it collides with a first set of particles, which are stored in the stack. Each of the collided particles will in turn collide with other atmospheric particles, and this collision generation is processed iteratively until the end of the simulation. There is no collision between particles belonging to different sub-showers, so the global shower can be viewed as a hierarchy of independent showers. The particle collision process follows a Monte-Carlo approach: collisions between particles depend on a probability law, and keeping a particle that has reached the threshold also depends on a probability law. As a consequence, the number of particles inside the stack evolves during the execution, decreasing or increasing at each iteration. Fig. 7 presents the evolution of the stack during a typical simulation. The OVM implementation of AIRES aims at providing an interactive use of AIRES, so it must speed up the computation of one air shower. Another approach would be to launch several independent air shower simulations on different nodes; this would improve the throughput but not the response time of a single air shower simulation. We decompose the application into two codes: the client program and the server library. The AIRES parallel execution follows three steps: initialization, computation and result gathering. During the initialization, the client starts the simulation from a parameter file. When the iteration count reaches a threshold, the client stops the simulation and sends its simulation context to the servers.


Table 2
Execution time according to the number of nodes (N) for OVM and MPI implementations of EP, CG and FT

        EP                  CG                 FT
N       MPI      OVM        MPI     OVM        MPI      OVM
2       466.96   466.90     38.6    38.7       144.2    144.3
4       233.42   233.38     20.95   21.3       73.01    73.2
8       116.74   116.6      11.94   13.6       38.7     40.3
16      58.39    58.37      7.1     12.12      19.1     20.4
32      29.2     29.2       4.4     7.1        11.82    12.08


5. OVM for a real-life irregular application




Fig. 7. Evolution of the particle stack during a typical simulation.


6. Real-time ray tracing with OVM


The Persistence Of Vision Ray-Tracer (PovRay) [3] is an open-source software package that provides high quality 3D image synthesis through ray tracing. Basically, PovRay consists of a scene description language and a ray tracing engine. In the original version of PovRay, a sequential numerical program computes the ray tracing. PVMPov and PovMPI are two parallel versions of the original PovRay program.


One should notice that the sequential part is not relevant here, because it encompasses initialization and check-pointing. To hide the large execution time variance related to the Monte-Carlo approach, the execution times of the sequential and OVM versions are computed from 20 executions. In order to feed each server, several outstanding non-blocking requests are sent to it. The load balancing strategy uses the following features: (a) an iso-energy partitioning of the stack (i.e. the partitions of the stack have the same energy), (b) an over-partitioning of the stack (i.e. the number of outstanding requests per server is set to 8) and (c) a dynamic load balancing algorithm with a first-come-first-served strategy. Fig. 8 shows that the OVM version provides a nearly linear speed-up on the parallel part of a very irregular, real-world parallel application. This performance strongly relies on the OVM high performance RPCs, dynamic scheduling and global operations.
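The iso-energy partitioning mentioned in point (a) can be obtained, for instance, with a simple greedy heuristic in which each particle goes to the partition with the lowest accumulated energy. The sketch below illustrates this idea; the greedy scheme and the numerical values are our own illustration, not the algorithm actually used in AIRES:

#include <stdio.h>

#define NPART 4                                  /* partitions (sub-shower tasks) */

int main(void)
{
    /* Particle energies, sorted in decreasing order for the greedy pass. */
    double energy[] = { 9.0, 7.5, 6.0, 4.0, 3.5, 2.0, 1.5, 1.0, 0.5 };
    int    nparticles = (int)(sizeof energy / sizeof energy[0]);
    double part_energy[NPART] = {0};

    for (int i = 0; i < nparticles; i++) {
        int lightest = 0;                        /* partition with the lowest energy */
        for (int p = 1; p < NPART; p++)
            if (part_energy[p] < part_energy[lightest])
                lightest = p;
        part_energy[lightest] += energy[i];
        printf("particle %d (E=%.1f) -> partition %d\n", i, energy[i], lightest);
    }

    for (int p = 0; p < NPART; p++)              /* roughly equal totals */
        printf("partition %d: total energy %.1f\n", p, part_energy[p]);
    return 0;
}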


The context mainly consists of the client particle stack at the moment it stopped the simulation. We call this part of the execution the sequential part; note that it includes the reading of the parameter file and the writing of intermediary computations (check-pointing). During the simulation, the client invokes a remote sub-shower simulation with a specific set of parameters for each sub-shower. These parameters describe the stack part that will be used for the sub-shower simulation. Because the client invokes non-blocking asynchronous remote executions, several sub-shower simulations are executed concurrently on the server side. Since the simulation duration depends on probability laws, it differs for every sub-shower. When a server ends a simulation, it sends a signal to the broker, which sends back another parameter set corresponding to another sub-shower simulation. We call this part of the computation the parallel part. Sub-shower placement follows a very simple round-robin scheme, with a first-ended, first-served load management policy. Fig. 8 presents the speed-up according to the number of servers for the simulation of a medium-size shower. The speed-up is computed from the global execution times of the sequential version and of the OVM implementation of AIRES. The figure presents the speed-up for: (1) the complete application, including the sequential part, and (2) the application without the sequential part.


The first implementation issue when parallelizing an application with OVM is to split the application into two parts: the client and the services (the functions to call on the server side). In our parallelization, the client starts with the initialization and then asks for the generation of several images (the view point turning around the scene). The servers compute each image in parallel, each of them being responsible for a part of the image. The client gathers the image and possibly displays it on a screen. The performance evaluation concerns two parameters: reactiveness and throughput. Fig. 9 presents the computation time of an image according to the number of nodes and the image size. Large images should be considered as benchmarks: since they require substantial computations, the overhead of the client and server coordination can be considered negligible. When the image size decreases, the overhead takes a greater part of the total execution time. Fig. 9 shows that the execution time decreases at the same rate whatever the image size. This test demonstrates the ability of this PovRay implementation to speed up the computation even of small images. The speed-up for the four images is nearly linear according to the number of nodes.


They mainly allow computing larger images or images of greater complexity. The OVM implementation of PovRay is intended to provide real-time image synthesis, so the aim is to speed up the computation compared to a uniprocessor for a problem (scene) of constant size and constant complexity. To test the reactiveness and the throughput of the OVM implementation, we use a simple scheme that periodically generates several synthesized images of the same scene: the view point simply rotates around the scene with a predefined shift angle. In principle, the parallelization of the image synthesis relies on distributing the computation of the image pixels among the computing nodes. Some mechanism should be used to balance the load of the nodes because: (1) the computation time for each pixel (or set of pixels) is not necessarily the same and (2) the total computation time should be reduced as much as possible. Two approaches could be used to load balance the executions: static and dynamic task allocation. One of our aims was to adopt a very simple strategy to parallelize PovRay with OVM, so the image is split into blocks, the mapping of the blocks among the processors uses a round-robin scheme and the load balancing mechanism uses a static approach, as sketched below.
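A minimal sketch of this static decomposition follows; the image and block sizes are illustrative values, not the ones used in the experiments:

#include <stdio.h>

#define WIDTH     640
#define HEIGHT    480
#define BLOCK_H   32                       /* block height, in pixel rows */
#define NSERVERS  8

int main(void)
{
    int nblocks = (HEIGHT + BLOCK_H - 1) / BLOCK_H;

    for (int b = 0; b < nblocks; b++) {
        int row_lo = b * BLOCK_H;
        int row_hi = row_lo + BLOCK_H;
        if (row_hi > HEIGHT) row_hi = HEIGHT;
        int server = b % NSERVERS;         /* round-robin, static mapping */
        printf("block %2d: rows [%3d,%3d) of %dx%d image -> server %d\n",
               b, row_lo, row_hi, WIDTH, HEIGHT, server);
    }
    return 0;
}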


Fig. 8. Speed-up of the OVM version of AIRES over the sequential version.


Fig. 9. Computation time depending on the scene and the rendering size.


The actual speed-up is 29 on 32 processors for all scenes. Note that some scenes, like galleon, may not scale as well because of some features (a lot of textured triangles) that do not fit the static load balancing algorithm used in our implementation. The main OVM features used for real-time ray tracing are persistent data on the server side, global operations and data-flow analysis. Globally, these results demonstrate the scalability of the OVM implementation of PovRay and its ability to sustain periodic rendering.

7. Related works


The RPC programming model has been used, since its introduction in 1984, mainly in the context of client–server systems across LAN, MAN and WAN. Most of the current extensions of the RPC model use object management mechanisms and object-oriented


programming style (Remote Method Invocation in Java, CORBA RPC). Few experiments have been done in the context of high performance computing with the original RPC programming style. Several research projects like Active Messages [4], SHRIMP [5] and Fast Messages [6] provide either a fast communication system with an RPC-like API or a performance-optimized implementation of traditional RPC. None of them addresses the issue of the performance of the RPC programming style in the context of high performance parallel computing. Problem solving environments such as NetSolve [7] and Ninf [8] provide support for the remote execution of library functions. The user calls the remote execution of a sequential or parallel computation through a programming language or an interactive environment like Matlab or Scilab. These environments are typically used to call large computations: they are worthwhile for remote executions with a duration of several seconds on a LAN and several minutes on a MAN or WAN.


8. Concluding remarks


In this paper, we have addressed the issue of high performance parallel computing with OVM, an RPC-based execution environment, applied to regular and irregular parallel applications and to parallel applications with some real-time constraints.


Acknowledgements


Regular applications are considered very difficult tests for an environment addressing irregular applications, because the performance results take into account all the overheads related to the dynamic load-balancing mechanism. In this context, we have demonstrated that OVM can approach the performance of other parallel programming paradigms (designed for regular applications) on three kernels of the NAS benchmarks. Irregular parallel applications are the other targets of OVM. We have presented the parallelization approach and the performance of a real-world application called AIRES. Although the code is very large, we have proposed a simple parallelization scheme based on the OVM RPC programming style. The performance evaluation shows a nearly optimal speed-up compared to the original sequential version. The last performance evaluation concerned a real-time version of the PovRay ray-tracer. A parallel real-time version of PovRay needs to face two problems: the real-time demand (frames per second) and a linear speed-up according to the number of processors. Our implementation of PovRay with OVM meets these two severe constraints. The next issues for OVM are: (1) the performance evaluation on a large range of platforms, from SMP machines to clusters of SMPs, (2) the scalability evaluation and the improvement of the OVM runtime for large parallel machines and (3) the study of the scheduling strategies used by the broker to improve data locality.


They are implemented on top of the TCP-UDP/IP protocols and do not provide the responsiveness and regular throughput required by real-time applications. In the past 10 years, particular attention has been paid to studying alternative parallel execution models (languages and runtimes) addressing, sometimes simultaneously, two major issues: (1) allowing the same code to run on shared memory and distributed memory systems and (2) providing some intrinsic load balancing features (sometimes configurable) for irregular applications. Most of this research was done in the context of object-oriented languages and/or multi-threaded programming [9–13]. None of these systems has become a standard for parallel programming. Moreover, there are very few performance comparisons with the message passing and shared memory programming approaches. Multi-threaded environments like Cilk [14], PM2 [15] and Athapascan [16] use RPCs for the communication between threads. The RPC mechanism invokes remote executions and, in some environments, implements remote write and remote read operations. These environments are mainly dedicated to irregular parallel applications and have not been designed to compete in performance with hardware-inspired programming models on regular applications. The Manta [17] project focuses on meta-computing. It uses Java for programming parallel applications, and the communications between parallel tasks rely on the RMI protocol; Manta RMI on top of Myrinet reaches a latency of 39.9 µs. SMARTS [18] uses a data-flow approach to schedule the horizontal and vertical parallelism of an application. During the execution, SMARTS schedules the iteration blocks within and across loops on several processors using a dynamically generated dependence graph. SMARTS is based on a master–slave approach with a control thread that manages the workload of the slaves. Currently, SMARTS only works on shared memory multiprocessors (i.e. the SGI Origin) and provides an object-oriented programming interface.


We thank Olivier Richard for his participation in OVM. Most of the results presented in this paper were obtained at the Distributed System Software Laboratory of the RWCP. We wish to deeply thank Prof. Ishikawa, A. Hori and H. Tezuka for their help and support.


References


[1] High Performance Fortran Language Specification, 1.1 Edition, Rice University, Houston, TX, May 1993.


D. Trystram (Eds.), Parallel Computing: State-of-the-art and Perspectives, Proceedings of the Conference ParCo’95, Ghent, Belgium, September 19–22, 1995, Advances in Parallel Computing, Vol. 11, Elsevier/North-Holland, Amsterdam, 1996, pp. 279–285.
[16] J. Briat, I. Ginzburg, M. Pasin, B. Plateau, Athapascan runtime: efficiency for irregular problems, in: Proceedings of the Europar’97 Conference, 1997, pp. 590–599.
[17] J. Maassen, R. van Nieuwpoort, R. Veldema, H. Bal, A. Plaat, An efficient implementation of Java’s remote method invocation, in: Proceedings of the ACM Symposium on Principles and Practice of Parallel Programming (PPoPP), 1999.
[18] S. Vajracharya, S. Karmesin, P. Beckman, J. Crotinger, A. Malony, S. Shende, R. Oldehoeft, S. Smith, SMARTS: exploiting temporal locality and parallelism through vertical execution, in: Proceedings of the 1999 ACM SIGARCH International Conference on Supercomputing (ICS’99), June 1999, pp. 302–310.


Gilles Fedak has been a PhD student in computer science at Paris-Sud University, Orsay, France, since 1999. His PhD work focuses on high performance computing using remote idle computers over the Internet. He is the main contributor to the XtremWeb project, a peer-to-peer global computing system (www.xtremweb.net).

Franck Cappello is a CNRS researcher leading the Cluster and GRID Group at LRI (Computer Research Laboratory) at Paris-Sud University, Orsay, France. His research interests include parallel programming models, parallel runtime environments and performance evaluation of clusters, clusters of multiprocessors and large scale distributed systems. He is working on hybrid programming models (MPI + OpenMP), multi-purpose runtime environments for parallel applications (OVM) and peer-to-peer global computing using federations of idle computers over the Internet (XtremWeb).


George Bosilca has been a PhD student in computer science at Paris-Sud University, Orsay, France, since 1999. His PhD work focuses on quality of service and programming models for parallel and distributed architectures. His research interests also include security in cluster computing and the use of remote idle computers over the Internet.


[2] MPICH-PM/CLUMP: an MPI library based on MPICH 1.0 implemented on top of SCore, Report, Real World Computing Partnership, Parallel and Distributed System Software Laboratory, 1998. http://www.rwcp.or.jp/lab/pdslab/mpichpm/.
[3] Persistence of Vision Ray Tracer, PovRay Open Source Project. http://www.povray.org/.
[4] T. von Eicken, D.E. Culler, S.C. Goldstein, K.E. Schauser, Active messages: a mechanism for integrated communication and computation, in: Proceedings of the ISCA’92, 1992, pp. 256–266.
[5] M.A. Blumrich, K. Li, R. Alpert, C. Dubnicki, E.W. Felten, J. Sandberg, Virtual memory mapped network interface for the SHRIMP multicomputer, in: Proceedings of the 21st Annual International Symposium on Computer Architecture, 1994, pp. 142–153.
[6] S. Pakin, V. Karamcheti, A.A. Chien, Fast Messages: efficient, portable communication for workstation clusters and MPPs, IEEE Concurr. 5 (2) (1997) 60–73.
[7] H. Casanova, J. Dongarra, NetSolve: a network-enabled server for solving computational science problems, Int. J. Supercomputer Appl. High Performance Comput. 11 (3) (1997) 212–223.
[8] H. Nakada, H. Takagi, S. Matsuoka, U. Nagashima, Utilizing the metaserver architecture in the Ninf global computing system, Lect. Notes Comput. Sci. 1401 (1998) 607.
[9] A. Grimshaw, W. Strayer, P. Narayan, Dynamic object-oriented parallel processing, IEEE Parallel Distributed Technol. (1993) 33–47. http://www.cs.virginia.edu/∼mentat/.
[10] A. Chien, J. Dolby, B. Ganguly, V. Karamcheti, X. Zhang, Supporting high level programming with high performance: the Illinois Concert system, in: Proceedings of the Second International Workshop on High-level Parallel Programming Models and Supportive Environments, April 1997. http://www-csag.ucsd.edu/projects/concert.html.
[11] M. Rinard, The design, implementation, and evaluation of Jade, a portable, implicitly parallel programming language, Ph.D. Thesis, Stanford, CA, 1994.
[12] J.S. Chase, F.G. Amador, E.D. Lazowska, H.M. Levy, R.J. Littlefield, The Amber system: parallel programming on a network of multiprocessors, in: Proceedings of the SOSP-12, December 1989, pp. 147–158.
[13] L. Kale, Parallel programming with Charm: an overview, Technical Report 93-8, Department of Computer Science, University of Illinois at Urbana-Champaign, 1993.
[14] R.D. Blumofe, M. Frigo, C.F. Joerg, C.E. Leiserson, K.H. Randall, Dag-consistent distributed shared memory, in: Proceedings of the IPPS’96, April 1996, pp. 132–141.
[15] R. Namyst, J.-F. Méhaut, PM2: parallel multithreaded machine. A computing environment for distributed architectures, in: E.H. D’Hollander, G.R. Joubert, F.J. Peters,
