Distributing Numerical Algorithms: Some Experiences with Network Computing System (NCS) and Parallel Virtual Machine (PVM)
M. Alfano - Dipartimento di Ingegneria Elettrica, Universita' degli Studi, Viale delle Scienze - 90128 PALERMO - [email protected]
G. Lo Re - Centro Universitario di Calcolo, Universita' degli Studi, Viale delle Scienze - 90128 PALERMO - [email protected]
Abstract

Nowadays distributed systems have become widely used. Consequently, several methods and techniques effective for distributed applications have been developed. Remote Procedure Calls (RPCs) allow users to execute pieces of a program on different computers. The Network Computing System (NCS) is a development environment for distributed software based on RPCs, acting as a high-level tool for their management. The Parallel Virtual Machine (PVM) uses only message-passing primitives to allow interprocess communication and, consequently, to distribute tasks among different machines. Using these software systems we present some experiences made in distributing numerical algorithms and the consequent implementation methods.
1. Introduction
Nowadays distributed systems have become very popular and widely used. These are environments where the existence of different autonomous machines is hidden from the user, who deals with a single computing system. Consequently there is great interest in such systems and in the new related programming techniques. These are mainly concerned with the following aspects:

- parallel execution of programs;
- communication and synchronization between the different parts of a parallel program;
- management of emergency conditions, such as a partial system failure.
Many tools allowing distributed programming have already been built and others are currently under development. For distributed systems, where different CPUs are connected with high-speed links, the reference model is a "local environment model" and the communication between processes is done only by means of message exchange. In such an environment, among others, two well-defined tool classes can be found:

- message passing primitives;
- RPC (Remote Procedure Call).
In the former case two processes exchange messages using Send and Receive primitives, following different rules for synchronous and asynchronous message passing: the sending process can wait for the receiver's acknowledgment or continue without waiting.
In the latter case, the basic philosophy of RPCs [5] is very similar to that of traditional programming methods and simplifies the building of distributed applications, since the communication management is transparent to the user.
In this work we consider the main factors to be taken into account when distributing numerical algorithms. Then some methods based on these factors are presented and, as a practical example, a distributed algorithm for matrix inverse computation is illustrated.
2. Distributing numerical algorithms

When distributing a numerical algorithm some main factors should be considered:

- the complexity order of the algorithm: distributing an algorithm is meaningful only for high-complexity computations;
- the intrinsic possibility to distribute the algorithm: the related dependency graph shows the relationships among the different steps of the algorithm;
- the computers' power: the algorithm can be distributed in different ways depending on the computers used. With computers of comparable power it is useful to distribute the heaviest steps among the different CPUs, parallelizing as much as possible; in environments with computers of different levels it is more convenient to execute the heaviest steps on the most powerful CPUs, at the expense of parallelism;
- the operating system: in order to distribute an algorithm on a network, for implementation reasons it is better to work with operating systems allowing dynamic creation of processes (UNIX, VMS, etc.);
- the distributing software: it plays a prominent role because its structure must be followed. In our case the Network Computing System and the Parallel Virtual Machine have been used for our implementations.

2.1 NCS
NCS [10] is an implementation of the Network Computing Architecture (NCA) [15] for distributing applications in a heterogeneous environment. It is based on three main components:

1) The RPC run-time library provides the routines that enable programs to execute procedures on remote hosts.

2) The Location Broker is a server that provides information about the resources available in the local network.
3) The Network Interface Definition Language (NIDL) allows the specification of a communication interface between user and service provider. The NIDL Compiler takes this interface definition as input and generates the modules (stubs) necessary for binding the different parts of a program.
RPC appears as an extension of the traditional procedure call construct existing in most programming languages, and uses very similar syntax and semantics. The only difference is that the called procedure can run on any node of the distributed environment.
fig. 1: RPC execution scheme; the client issues a Call, the server receives the Request, executes, and sends the Reply back as the Return
In fig. 1 an RPC execution scheme is shown. A system component on the invoking node (client) marshals the parameters into a request message and sends it to the execution node (server). A corresponding component on the arrival node unmarshals the parameters and executes the called procedure. Any output values are handled similarly and follow the opposite path.
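To make the stub's job concrete, the toy C fragment below sketches what a client stub essentially does under this scheme: it packs the parameters into a request buffer, sends it over a connection, blocks on the reply and unpacks the result. The function name, the wire format and the already connected socket are our assumptions for illustration; real NCS stubs are generated by the NIDL Compiler and use a portable data representation.

    #include <string.h>
    #include <unistd.h>

    /* Toy client stub (illustrative only): marshal two doubles into a
     * request, send it on an already connected socket 'fd', block on
     * the reply and unmarshal the result.  The one-byte opcode followed
     * by raw doubles is an invented wire format; a real stub would use
     * a portable representation and loop on partial reads/writes. */
    double rpc_add(int fd, double a, double b)
    {
        char req[1 + 2 * sizeof(double)], rep[sizeof(double)];
        double result;

        req[0] = 0x01;                            /* opcode: "add"         */
        memcpy(req + 1, &a, sizeof a);            /* marshal the arguments */
        memcpy(req + 1 + sizeof a, &b, sizeof b);

        write(fd, req, sizeof req);               /* Request message       */
        read(fd, rep, sizeof rep);                /* block on the Reply    */

        memcpy(&result, rep, sizeof result);      /* unmarshal the output  */
        return result;
    }

Note how the caller sees an ordinary function: all the message traffic is hidden inside the stub, which is exactly what makes the call block until the server replies.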
2.2 PVM
The PVM system [11] is a framework allowing the distribution of an application over a set of loosely coupled heterogeneous processing elements, viewed as a single concurrent computation resource (fig. 2).
fig. 2: Applications 1 and 2 running on the PVM system, which spans LAN 1 and LAN 2
Application programs are built as a set of components, each representing a subtask, at a moderately large level of granularity. These components can be mapped to the physical resources in three different ways, depending on the user requirements:

- transparent mode: the subtasks are automatically mapped by PVM to the processing elements;
- architecture-dependent mode: the user indicates the computer architectures on which to execute the subtasks;
- low-level mode: the user specifies a particular machine for each single subtask.
PVM is based on the message-passing paradigm, where any component of the distributed program is able to communicate with any other using primitives such as "send" and "receive". Synchronization primitives are also available to the user via barriers and rendezvous. All these PVM primitives are provided as library routines which can be called from C and Fortran. PVM uses the XDR (eXternal Data Representation) standard to convert data between different architectures.
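As an illustration of this programming style, the C fragment below is a minimal sketch written against the later, widely documented pvm3 library (the PVM release used in this work predates it, and its routine names differ): a task enrolls, packs and sends data with XDR encoding, and synchronizes with its peers at a barrier. The group name and message tag are our assumptions.

    #include "pvm3.h"

    int main(void)
    {
        double x[4] = {1.0, 2.0, 3.0, 4.0};
        int ptid;

        pvm_mytid();                    /* enroll this task in PVM           */
        ptid = pvm_parent();            /* task id of the spawning task      */

        pvm_initsend(PvmDataDefault);   /* XDR encoding for heterogeneity    */
        pvm_pkdouble(x, 4, 1);          /* pack 4 doubles, stride 1          */
        pvm_send(ptid, 99);             /* non-blocking send, message tag 99 */

        pvm_joingroup("workers");       /* synchronization via a barrier     */
        pvm_barrier("workers", 4);      /* wait until 4 group members arrive */

        pvm_exit();                     /* leave the virtual machine         */
        return 0;
    }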
3. Some distribution methods
When distributing a numerical algorithm, the factors previously analyzed should be considered; by varying them, different implementation methods can be found. In this section we describe some of these methods, developed following a natural evolution; even if they do not exhaust the innumerable implementation possibilities, they are among the most representative. We use only one client and one server, but with more than one server the approach does not change.

The easiest way of distributing a numerical algorithm is to find the heaviest computational components and execute them on the most powerful host available. With RPCs this is realized by means of a client process implementing the main body of the algorithm and requiring a remote server to execute one or more tasks. In this case the client forwards an RPC towards a server and remains in a waiting state until it receives a reply from the server (fig. 3).
fig. 3: Synchronous RPC; the client blocks between Call and Return while the server serves the Request and sends back the Reply
This natural method, very convenient in the presence of a powerful computer, has the disadvantage of not parallelizing the algorithm steps, because the client must wait for the RPC completion before resuming the execution of the next operations.
This is because RPCs, among the different methods for inter-process communication, are intrinsically synchronous: a process invoking an RPC blocks until the execution of the routine is terminated and any results are returned.
Parallelism is implemented by forking a process into two distinct subprocesses. Such duplication can be done either on the client side or on the server side, if the operating system allows it. In this way one process is in charge of making an RPC (or receiving one, in the server case) while the other is immediately free to continue and execute the next instructions.
Supposing the client forks, it will create a child process which invokes the RPC. This child must wait for the RPC results and then pass the obtained data to the father process (fig. 4).

fig. 4: Client-side fork; the child performs the Call/Return with the server while the father continues, and the results come back through a pipe
This can be achieved either by means of shared variables or, as in the UNIX case (the most common), by means of pipes, which are the only tools UNIX provides for inter-process communication on the same computer. In the latter case the blocking nature of pipes avoids the use of other synchronization routines to put the father in a waiting state until the arrival of the data from the child. However, pipes do not allow the passage of numerical data, so a two-way conversion into character data is necessary. This, together with the usage of the pipes themselves, is computationally relevant and constitutes an undesirable overhead.
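A minimal C sketch of this fork-and-pipe scheme, assuming a synchronous stub remote_product() (here replaced by a local stand-in) and our own character-conversion format:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/wait.h>

    /* Stand-in for the synchronous RPC stub, which would block until
     * the server returns the product. */
    static double remote_product(double x) { return 2.0 * x; }

    int main(void)
    {
        int p[2];
        char buf[64];
        double c12;

        pipe(p);                      /* channel from child to father       */
        if (fork() == 0) {            /* child: invoke the RPC and wait     */
            double r = remote_product(21.0);
            snprintf(buf, sizeof buf, "%.17g", r);  /* numeric -> character */
            write(p[1], buf, strlen(buf) + 1);
            _exit(0);
        }

        /* father: free to execute the next operations in parallel ...     */

        read(p[0], buf, sizeof buf);  /* blocking read: waits for the child */
        c12 = strtod(buf, NULL);      /* character -> numeric               */
        wait(NULL);                   /* reap the child                     */
        printf("result = %g\n", c12);
        return 0;
    }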
To overcome this inconvenience, the RPC can be split into two distinct phases:
1 - the child deals only with invoking the remote subroutine and transmitting the input data;

2 - the father, after checking that the remote execution has terminated (by verifying the child's termination), makes a call to a second remote procedure which gives it back the output data.
This is realized (fig. 5) by means of an always-active server process which receives the input data from the child, executes the computation, keeps the output data and gives them back to the father during the second RPC. In this way pipes are no longer necessary, because the father receives the output directly from the remote server.
fig. 5: Two-phase RPC; the forked child sends the input data (first Request/Reply), and the father later retrieves the output data (second Request/Reply)
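A sketch of the server side of this scheme (the procedure names and the single outstanding request are our simplifications): the first remote procedure receives the input and stores the computed result; the second hands it back to the father.

    #define NMAX 100

    /* State kept by the always-active server between the two RPCs
     * (one outstanding request only; a real server would key it per
     * client). */
    static double stored_c12[NMAX * NMAX];
    static int    output_ready = 0;

    /* First RPC, invoked by the child: receive the input, compute the
     * product, store the result. */
    void rpc_start_product(const double x11[], const double x12[], int n)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double s = 0.0;
                for (int k = 0; k < n; k++)
                    s += x11[i * n + k] * x12[k * n + j];
                stored_c12[i * n + j] = s;
            }
        output_ready = 1;
    }

    /* Second RPC, invoked by the father: give back the output data. */
    int rpc_get_product(double c12[], int n)
    {
        if (!output_ready)
            return -1;               /* remote execution not finished yet */
        for (int i = 0; i < n * n; i++)
            c12[i] = stored_c12[i];
        output_ready = 0;
        return 0;
    }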
A variation of the previous method makes the client process fully asynchronous. Now this process invokes an RPC in which it only passes the input data to the server process. The latter does not actually perform the request but creates a child process to execute the operation (fig. 6). In this way the server returns control to the client almost immediately, since the waiting time is limited only by the data transmission. The client thus obtains a non-blocking send without needing to create a child process, as previously discussed.
fig. 6: Server-side fork; the server forks a child to execute the operation and replies immediately, and a second Request/Reply later returns the output data
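A sketch of this server-side fork (the handler name and the helper are hypothetical): the procedure invoked by the client's first RPC hands the work to a child and returns at once, so the client's call behaves as a non-blocking send.

    #include <unistd.h>

    /* Hypothetical helper that performs the heavy operation and stores
     * the result for the later "get output" RPC (declaration only). */
    extern void compute_and_store(const double input[], int n);

    /* Invoked by the client's first RPC: fork a child to do the work
     * and return immediately, so the client waits only for the data
     * transfer. */
    void rpc_submit(const double input[], int n)
    {
        if (fork() == 0) {
            compute_and_store(input, n);  /* heavy work runs in the child */
            _exit(0);
        }
        /* the father sends the Reply back to the client right away */
    }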
Using the simple message-passing paradigm of PVM, all the blocking problems seen before no longer arise. The client process does not have any procedure to call; it must only transmit the input data to the server process, possibly with some information regarding the type of the required operation. The data transmission is non-blocking, thus the sending process can continue its execution until it needs the output data. The computing scheme becomes the following (fig. 7):
fig. 7: PVM scheme; the client's send is non-blocking, the server receives, computes and sends back, and the client receives only when it needs the output
4. A case study: an inverse matrix computation
As a practical application of the methods described before, we have considered the distribution of the BDM (Block Decomposition Method) algorithm, a re-elaboration of the traditional bordering method for computing a matrix inverse. This method is described in [7]. What is interesting here is that, given the X11, X12, X21 and X22 sub-matrices of step k (where X11 is the sub-matrix obtained at step k-1 and X12, X21 and X22 are the bordering sub-matrices drawn from the initial matrix, fig. 8),
fig. 8: Block partitioning of the initial matrix A at indices p and p+r into the sub-matrix X11 and the bordering sub-matrices X12, X21 and X22
it is possible to calculate the sub-matrix Z (Z11, Z12, Z21, Z22) of step k+1 by means of the following computing sequence:
a) C21 = X21 ⋅ X11
b) D = X22 - C21 ⋅ X12
c) Z22 = D^(-1)
d) C12 = X11 ⋅ X12
e) Z12 = -C12 ⋅ Z22
f) Z11 = X11 - Z12 ⋅ C21
g) Z21 = -Z22 ⋅ C21

fig. 9: Dependency graph of the operations a-g at the generic step k

The Z sub-matrix will then be the sub-matrix X11 of step k+1.
Note that the operations of the generic step k follow the dependency graph shown above (fig. 9). Specifically, for any step k the d operation can be executed in parallel with the b and c ones, and the g operation does not depend on the e and f ones.
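For reference, the sequence a)-g) is an instance of the standard block (bordered) inverse identity, written here in LaTeX; this restatement is ours, not taken verbatim from [7]:

\[
\begin{pmatrix} A_{11} & X_{12} \\ X_{21} & X_{22} \end{pmatrix}^{-1}
=
\begin{pmatrix}
X_{11} + C_{12} Z_{22} C_{21} & -\,C_{12} Z_{22} \\
-\,Z_{22} C_{21} & Z_{22}
\end{pmatrix},
\]
with
\[
X_{11} = A_{11}^{-1},\quad
C_{21} = X_{21} X_{11},\quad
C_{12} = X_{11} X_{12},\quad
D = X_{22} - C_{21} X_{12},\quad
Z_{22} = D^{-1},
\]
where $D$ is the Schur complement of the leading block $A_{11}$.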
As pointed out in [7], the steps a, d and f are the heaviest from the computing point of view and must be given the most consideration when distributing the algorithm. The computing environment used for our implementations is made up of:

RISC - IBM RISC/6000 mod. 520 with AIX v.3.1.5
3090 - IBM 3090 200J VF with VM/ESA (Virtual Machine/Enterprise System Architecture) v.1.0
VAX  - Digital VAX 6510 with VMS (Virtual Memory System) v.5.4
DEC  - Digital DECstation 5000 mod. 125 with ULTRIX v.4.2
These computers are connected via Ethernet and communicate by means of the TCP/IP protocols, which constitute the basis on which NCS and PVM operate.
Following the logical path previously outlined, we have implemented the following methods:
4.1 Method 1 (sequential)

Computing environment: client RISC or DEC; server 3090
As previously discussed, the computing power of the server, much bigger than that of the client, suggests using the former for the computationally heaviest steps of the algorithm, renouncing parallelism.
Thus the client process invokes an RPC delegating the server to perform the matrix products of the a and d steps (fig. 3).
Since both the client and the server use parts of the initial matrix, instead of passing them to the server at each step it is more convenient for the server to own a copy of the initial matrix from the beginning. This expedient will be used in the next methods too.
The algorithm structure is the following (the remote operations are marked with asterisks):

0. Transmission channel initialization;
1. Inverse computation of the matrix X11(0) = (xij), i,j = 1,...,4;
2. If det(X11(0)) = 0 then
     if x11 ≠ 0 then 1/x11 and IND = 1;
     else IER = 1 and exit;
   else IND = 4.
3. While N-IND > 4:
   3.1. *X11 ⋅ X12 = C12 and X21 ⋅ X11 = C21*;
   3.2. C21 ⋅ X12 = Q;
   3.3. D = X22 - Q (Schur complement);
   3.4. Z22 = D^(-1);
        3.4.1. If det(D) = 0 then standard step of the bordering method;
               if IER = 1 then exit;
               else IND = IND + 1 and go to 3;
        else IND = IND + 4;
   3.5. C12 ⋅ Z22 = -Z12;
   3.6. -Z12 ⋅ C21 = E;
   3.7. Z11 = X11 + E;
4. If IND < N then LL = N - IND;
5. For J = 1, LL:
   5.1. Standard application of the bordering method.
Note that at step 3.1, for the remote execution of both products, it is necessary to send only X11, since X12 and X21 are the bordering sub-matrices, directly available from the initial matrix already held by the server.
4.2 Method 2 (parallel synchronous)

Computing environment: client RISC or DEC; server RISC or DEC or VAX or 3090
The client process creates a child sub-process which calls the remote procedure. This child waits for the RPC results and then passes them to the father process by means of a pipe (fig. 4). The algorithm structure becomes the following:

0. Transmission channel initialization and pipe creation;
1. ...
2. ...
3. While N-IND > 4:
   3.0. Child creation:
        child.1. X11 ⋅ X12 = C12;
        child.2. writing C12 -> pipe;
   3.1. C21 = X21 ⋅ X11;
   3.2. Q = C21 ⋅ X12;
   3.3. D = X22 - Q (Schur complement);
   3.4. Z22 = D^(-1);
        3.4.1. if det(D) = 0 then standard step of the bordering method;
               if IER = 1 then exit;
               else IND = IND + 1 and go to 3;
        else IND = IND + 4;
   3.5. reading from the pipe -> C12;
   3.6. C12 ⋅ Z22 = -Z12;
   3.7. -Z12 ⋅ C21 = E;
   3.8. Z11 = X11 + E;
4. ...
5. ...
4.3 Method 3 (parallel pseudo-asynchronous)

Computing environment: client RISC or DEC; server RISC or DEC or VAX or 3090
The remote procedure call is divided into two distinct phases (fig. 5):

1 - the task of the child is only to invoke the remote subroutine and to transmit the input data;

2 - the father, after checking that the remote execution has terminated (by verifying the child's termination), makes a call to a second remote procedure which gives it back the output data.
The algorithm structure becomes the following:

0. Transmission channel initialization;
1. ...
2. ...
3. While N-IND > 4:
   3.0. Child process creation:
        child.1. remote computation of X11 ⋅ X12 = C12;
   3.1. X21 ⋅ X11 = C21;
   3.2. C21 ⋅ X12 = Q;
   3.3. D = X22 - Q (Schur complement);
   3.4. Z22 = D^(-1);
        3.4.1. if det(D) = 0 then standard step of the bordering method;
               if IER = 1 then exit;
               else IND = IND + 1 and go to 3;
        else IND = IND + 4;
   3.5. waiting for child termination;
   3.6. RPC for output data -> C12;
   3.7. C12 ⋅ Z22 = -Z12;
   3.8. -Z12 ⋅ C21 = E;
   3.9. Z11 = X11 + E;
4. ...
5. ...
This method is called pseudo-asynchronous because the child process is blocked until the RPC completion.
4.4 Method 4 (parallel asynchronous)

Computing environment: client RISC or DEC; server RISC or DEC or VAX
Now the client process invokes an RPC in which it only passes the input data to the server process. The latter does not actually perform the request but creates a child process to execute the operation (fig. 6). In this way the server returns control to the client almost immediately, since the waiting time is limited only by the data transmission. The client part of the algorithm becomes:
0. Transmission channel initialization;
1. ...
2. ...
3. While N-IND > 4:
   3.0. data send for the remote computation X11 ⋅ X12 = C12;
   3.1. ...
   3.2. ...
   3.3. ...
   3.4. ...
   3.5. RPC for receiving the output data -> C12;
   3.6. C12 ⋅ Z22 = -Z12;
   3.7. -Z12 ⋅ C21 = E;
   3.8. Z11 = X11 + E;
4. ...
5. ...
4.5 Method 5 (PVM)

Computing environment: client RISC or DEC; server RISC or DEC
As previously described, the client process only passes the input data to the server process (fig. 7). The latter performs the operation and returns the results to the client. In this way the client does not have to wait for the server reply but can continue executing the next instructions until it decides to receive the output data of the remote operation. The client part of the algorithm becomes:
0. Enrolling of the program in the PVM environment;
1. ...
2. ...
3. While N-IND > 4:
   3.0. send of the input data X11;
   3.1. ...
   3.2. ...
   3.3. ...
   3.4. ...
   3.5. receive of the output data -> C12;
   3.6. C12 ⋅ Z22 = -Z12;
   3.7. -Z12 ⋅ C21 = E;
   3.8. Z11 = X11 + E;
4. ...
5. ...
On the server side we have a symmetric situation, with a receive of the input data, the computation of C12 = X11 ⋅ X12, and a send of the output data.
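A sketch of the client side of this exchange, again written against the later pvm3 interface (the implementation in this work used the earlier PVM library, whose routine names differ); the message tags are our assumptions:

    #include "pvm3.h"

    #define TAG_IN  1                   /* client -> server: n and X11  */
    #define TAG_OUT 2                   /* server -> client: C12        */

    /* One iteration of step 3: send X11, overlap local work, then
     * block only when C12 is actually needed (steps 3.5-3.8). */
    void client_step(int server_tid, double x11[], double c12[], int n)
    {
        pvm_initsend(PvmDataDefault);   /* XDR-encoded send buffer      */
        pvm_pkint(&n, 1, 1);
        pvm_pkdouble(x11, n * n, 1);
        pvm_send(server_tid, TAG_IN);   /* non-blocking send (step 3.0) */

        /* ... local computations 3.1-3.4 run here, overlapped with
         * the remote product ... */

        pvm_recv(server_tid, TAG_OUT);  /* blocking receive (step 3.5)  */
        pvm_upkdouble(c12, n * n, 1);
    }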
5. Final remarks and performance evaluation
A comparison between NCS and PVM is outside the scope of this work. Here we only want to make some considerations about the methods shown before with regard to the tools used.

The first method appears convenient when a very powerful server is available (IBM 3090). The disadvantage of such an approach is that it does not parallelize the algorithm steps, since the client must wait for the completion of the remote procedure call before continuing the execution of the next operations. This method has been described as a first approach to distributed computing, but it is not very significant for environments where the main goal is the parallel execution of program parts on different CPUs.

Among the parallelizing methods, the synchronous one is the most natural, since only one RPC is made, passing the input parameters and receiving the output data. This writing linearity does not correspond to computational ease: the overhead for reading/writing the data from and to the pipe (with the related conversion) notably lowers the performance. On the opposite side, the asynchronous methods, though more complex (the synchronism, an intrinsic characteristic of the RPCs, must be overcome in some way), show notable advantages from the computational point of view.

With regard to the last three methods, it must be considered that they do not allow multi-process clients or servers running on computers with operating systems that cannot create processes dynamically (e.g. IBM 3090 with VM and CMS).

The PVM method does not present the synchronization problems of the previous ones. However, this advantage implies a heavier job for the programmer, who has the responsibility for communication coordination, message switching and data integrity.

The experiments have been carried out in an environment where an Ethernet LAN was the transmission medium. So for the performance evaluation we have considered only the CPU time of the client process, disregarding the bottleneck due to the transmission channel, which a high-speed network would eliminate.
Fig. 10 shows the time trends for the parallel methods vs. the dimension of the matrix being inverted. The client and the server ran on the RISC/6000 and the VAX 6510 for the NCS cases, and on the RISC/6000 and the DECstation for the PVM case.

Matrix Dim.   Synchronous   Pseudo-Asynchr.   Asynchronous      PVM
    100           6,53            2,30             1,83         1,34
    200          35,00           16,10            13,08         9,84
    300          96,46           52,93            42,77        33,69
    400         258,98          127,44           107,09        86,45
    500         477,05          248,06           213,38       182,24
    600         722,37          431,47           377,53       332,18

fig. 10: Client CPU times
Fig. 11 shows the graphical representation of the time trends of the previous table.

fig. 11: Client CPU times vs. matrix dimension for the synchronous, pseudo-asynchronous, asynchronous and PVM methods
References

[1] G.R. Andrews, F.B. Schneider, "Concepts and Notations for Concurrent Programming", ACM Computing Surveys, v.15 n.1, March 1983
[2] F. Baiardi, M. Vanneschi, "Linguaggi per la programmazione concorrente", Ed. F. Angeli
[3] H.E. Bal, J.G. Steiner, A.S. Tanenbaum, "Programming Languages for Distributed Computing Systems", ACM Computing Surveys, v.21 n.3, September 1989
[4] A. Beguelin, J. Dongarra, A. Geist, R. Manchek, V. Sunderam, "A User's Guide to PVM Parallel Virtual Machine", ORNL/TM-11826
[5] A.D. Birrell, B.J. Nelson, "Implementing Remote Procedure Calls", ACM Transactions on Computer Systems, v.2 n.1, February 1984
[6] G. Coulouris, J. Dollimore, "Distributed Systems: Concepts and Design", Addison Wesley
[7] E. Francomano, A. Pecorella, A. Tortorici Macaluso, "Parallel Experience on the Inverse Matrix Computation", Parallel Computing v.17, North Holland, 1991, pp. 907-912
[8] G.A. Geist, V.S. Sunderam, "Network Based Concurrent Computing on the PVM System", public domain
[9] P. Gibbons, "A stub generator for multilanguage RPC in heterogeneous environments", IEEE Transactions on Software Engineering, v.13 n.1, January 1987
[10] M. Kong, "Network Computing System Reference Manual", Prentice Hall
[11] V. Sunderam, "PVM: A Framework for Parallel Distributed Computing", Concurrency: Practice & Experience, v.2 n.4, December 1990
[12] A.S. Tanenbaum, R. van Renesse, "Distributed Operating Systems", ACM Computing Surveys, v.17 n.4, December 1985
[13] E. Walker, R. Floyd, P. Neves, "Asynchronous remote operation execution in distributed systems", IEEE Computer 1990, pp. 253-259
[14] G. Welling, B. Badrinath, "An architecture of a threaded many-to-many RPC", IEEE Computer 1992, pp. 504-511
[15] L. Zahn, "Network Computing Architecture", Prentice Hall