Porting the PVM Distributed Computing Environment to the Fujitsu AP1000

C. W. Johnson and D. Walsh
Department of Computer Science, Australian National University

October 28, 1993
Abstract
The PVM system (Parallel Virtual Machine) is a programming environment for distributed parallel C- and Fortran-based programs [1]. Originally developed for distributed heterogeneous workstations, it has recently been ported to multiprocessors such as the Intel iPSC/2 and Thinking Machines Corporation's CM5 [2,3]. We have ported the PVM system to the Fujitsu AP1000 kiloprocessor. We describe the process and communications model used for PVM on the AP1000 and consider further work to improve functionality and performance, including changes mooted to the Cell operating system and the structure of the host controlling processes.
1 Introduction: PVM

PVM (Parallel Virtual Machine) is a software system that permits a collection of heterogeneous computers to be programmed as, and to appear to the user as, a single computer. PVM presents the user with a message-passing, heterogeneous model: a program consists of a variable number of dynamically created, heterogeneous processes known as tasks. Tasks communicate with and control each other by message passing (non-blocking send, blocking receive, non-blocking probe, barrier synchronisation, and UNIX-like signals). Multicasting is an additional form of message sending. Messages are "packed" from heterogeneous data types before sending and unpacked into typed data on receipt. The PVM functions are provided to the user as C or Fortran routine calls. The important communication and control calls are:

pvm_send      immediately sends the data in the active message buffer
pvm_sendsig   sends a signal to another PVM process
pvm_recv      blocks until a message with the specified message tag (or ANY) has arrived from the specified source (or ANY), and places it in a new active receive buffer
pvm_nrecv     non-blocking receive
pvm_mcast     multicasts the data in the active message buffer to a set of tasks
pvm_initsend  clears the default send buffer
pvm_pkbyte|int|float   packs values from an array of the specified type (byte, int, float) into the active message buffer
pvm_upkbyte|int|float  unpacks an array of values of the specified type from the received active message buffer

Tasks can be spawned on any host where the user has sufficient privileges. Tasks are addressed by a PVM-specific addressing scheme that is mapped onto socket connections by the user's pvmd daemon process on each participating processor. Process groups are an optional feature; barrier synchronisation is tied to process group membership.
Task management and process group calls include:

pvm_spawn      starts new PVM processes
pvm_tasks      returns information about the tasks running on the virtual machine
pvm_joingroup  enrolls the calling process in a named group
pvm_barrier    blocks the calling process until all processes in a group have called it

There are no global reduction operations in PVM.

The architecture and distribution of the processors on which tasks are run is transparent to the user program. The PVM system includes management functions (also embodied in specialised processor management tools) to add or delete host computers from the set included in the virtual machine, and to start or kill tasks running on them. Tasks may run on any host processor known to a particular PVM configuration: the system expects a correctly configured file system on each host in order to find locally compiled versions of the executables, and uses the UNIX rsh command by default to start the correct executable file on the chosen processor(s). PVM allows tasks to communicate with other tasks on the same host, and with PVM tasks on other hosts, with no difference in program source code. The packing and unpacking of messages allows for differences in the representation of data such as integers or floats on different processor architectures. The XDR functions are used to effect machine-independent data transfer if necessary, but if two machines have the same architecture, data can be transferred in raw form.

User programs can include management functions such as the following (a short example of the basic calls is given below):

pvm_addhosts   adds one or more hosts to the virtual machine (the hosts are named by strings, and pvmd must already be running on the hosts)
pvm_config     returns information about the present virtual machine configuration
pvm_delhosts   deletes one or more hosts from the virtual machine
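As a brief illustration of this interface, the following minimal C program uses the standard PVM 3 calls listed above to spawn one task, pack and send an integer array, and wait for a reply. The executable name "worker" and the message tags are placeholders for this sketch.

    /* Minimal PVM 3 example: spawn one task, pack and send an integer array,
     * and block for a reply.  "worker" and the tags 99/100 are placeholders. */
    #include <stdio.h>
    #include "pvm3.h"

    int main(void)
    {
        int mytid = pvm_mytid();            /* enrol this process in PVM        */
        int tid;                            /* task id of the spawned worker    */
        int data[4] = { 1, 2, 3, 4 };
        int reply;

        if (pvm_spawn("worker", (char **)0, PvmTaskDefault, "", 1, &tid) != 1) {
            fprintf(stderr, "spawn failed\n");
            pvm_exit();
            return 1;
        }

        pvm_initsend(PvmDataDefault);       /* clear and re-initialise the send buffer */
        pvm_pkint(data, 4, 1);              /* pack 4 ints, stride 1                   */
        pvm_send(tid, 99);                  /* send with message tag 99                */

        pvm_recv(tid, 100);                 /* block for the reply, tag 100            */
        pvm_upkint(&reply, 1, 1);           /* unpack one int                          */
        printf("task %x (my tid %x) replied %d\n", tid, mytid, reply);

        pvm_exit();                         /* leave the virtual machine               */
        return 0;
    }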
1.1 Implementing PVM Processes
The workstation implementation of PVM uses a UNIX process to represent each task. Tasks on the one host communicate with each other and with tasks on other hosts via a single daemon process known as pvmd. Local pipes are used to communicate between a task and its local pvmd, and TCP/IP sockets are used to communicate between pvmd processes on separate hosts. The pvmd processes mediate all communication between tasks on different hosts. The widely distributed PVM version 2.4 has recently been superseded by PVM 3. The previous reliance on TCP/IP sockets and UNIX process structures has been weakened sufficiently that implementing PVM on a multiprocessor machine such as the AP1000 is relatively easy.
The PVM 3 distribution includes an MPP variant, which provides source code for multiprocessor implementations of PVM for the CM5, Paragon and i860 [3,4]. The PVM 3 MPP code has been used as the base for porting PVM 3 to the AP1000. To the PVM user the multiprocessor appears as a single PVM host with a number of user task processes; to the multiprocessor these are processes that run on the processing elements of the machine. In addition, the multiprocessor is a fully participating host in the PVM virtual machine, and its processes (cell tasks) can communicate with those on other nodes (workstation processes). This allows the multiprocessor to be used as a set of computational nodes in one virtual machine together with external nodes, i.e. networked workstations. The characteristics of the MPP code are: "tasks are always spawned on the nodes; only one pvmd runs on the front-end and it has to support all the tasks running on the nodes." "Messages between two nodes should go directly between them [using a native message-passing routine] while messages destined for another host on the Internet should go to the user's PVM daemon on the multiprocessor for further routing." The multiprocessor PVM daemon running on the front-end "listens for messages coming from both the outside network and the nodes of the multiprocessor". Multicast is mediated by the host (as it is by the conventional workstation implementation's pvmd daemon). Consequently, "node-to-node messages are sent directly across and very fast, but packets going to another machine must be relayed by pvmd and that link is slow, especially for large messages" [1]. Our AP1000 implementation is based on the MPP code modules of the PVM distribution. The match of abilities between PVM and the AP1000 is not complete: we need to restrict the use of some PVM operations (dynamic process spawning and process groups, in particular), and PVM cannot exploit some powerful AP1000 operations such as global reductions (xy sum, etc.), scatter and gather. The question of how to extend PVM to include such operations has not been addressed here.
1.2 Performance
The performance of PVM is constrained by the cost of spawning processes and by its communications. The PVM system (version 2.4) does not scale well beyond 60 processors [5] because of UNIX system limits on file descriptors. Serialisation of process spawning may be serious in some programs with a highly dynamic set of processes, but this model is less common than a static set of processes established before any significant computation phase, and several multiprocessor systems have no dynamic spawning at all.

Communications speed is the main constraint on PVM programs. The need for conversions in data packing and unpacking depends on the architectures of the source and destination processors; in a homogeneous set of processors it requires no more overhead than copying the data into and out of message buffers. The data transfer rate (TCP/IP software overheads and network transfer performance) is the main constraint. Programmers therefore commonly use PVM for loosely-coupled parallel programs, which can be run with acceptable efficiency with the communications performance of LAN-connected workstations. Point-to-point communications are expected to be slow and will constrain the choice of algorithms employed.

Porting PVM to multiprocessors means that PVM now also becomes a portable parallel programming system: applications can be prototyped on workstation networks and run on multiprocessors with no change to the code. In gaining this ability, programmers' expectations of scalability and performance will change: they will expect PVM implementations to scale as well as the multiprocessor architecture itself, and to perform similarly to the machine's native operations, despite the history of PVM and of their programs.

The measure of performance for PVM on multiprocessors is twofold: process spawning cost, and point-to-point and multicast communications overhead and effective data transfer rate. Process spawning cost is not significant in most programs, and will be ignored for a first comparison. Point-to-point communications can be compared to the machine's native-mode communications operations, since large improvements over networked TCP/IP communications can be taken for granted.

For multiprocessors, native communications speed is not the only factor in PVM message-passing performance: a multiprocessor PVM's efficiency depends on how well the PVM model maps onto the native system processes, addressing, and operations. For the AP1000, the correspondence between PVM and the properties of the AP1000 CellOS operating system and the front-end caren process is reasonably close, but there are some problems in matching the low-level model with the AP1000 CellOS, and in managing multicast. There will be very large differences in multicast performance between the case where all cells are involved and cases where even one fewer cell is involved: in the first case a single AP1000 cbroad operation can be used, whereas in the second a sequence of separate message sends is needed, perhaps 1000 sends compared to a single broadcast. Improvements that we believe can be obtained by modifications to CellOS are described in the Future Work section below.
2 AP1000 implementation of PVM
2.1 Implementation Issues
The issues to be faced in implementing PVM on a multiprocessor start with questions of restrictions of the basic PVM system:

Heavyweight or lightweight processes for PVM user tasks: we represent each PVM user task by a (heavyweight) AP1000 cell task.

Closed or open PVM system: a closed system is one where the set of PVM nodes is contained within the multiprocessor and has no communication with the outside world. An open system is one that attaches the multiprocessor to an existing PVM system of workstations (and other multiprocessors) as a set of processor nodes, with communication both between themselves and with external workstation processes. This is also the choice between internal-only and extended communications. For the AP1000 we choose an open system. Cell processes have PVM system addresses, and can communicate with external processes using PVM message calls and PVM addresses.

Static or dynamic process creation: the standard Fujitsu sys.c7 version of the AP1000 CellOS operating system allows only static process creation, where all cell tasks are created simultaneously.

Heterogeneous or homogeneous processes: the AP1000 system allows heterogeneous tasks to run on cells, but the PVM pvm_spawn call allows only one executable file to be named. Following the restriction to static process creation we support only one pvm_spawn call, and hence homogeneous processes.
More detailed issues in a multiprocessor implementation are:

- host/cells model: daemon(s) and user tasks
- addresses
- number of tasks per cell
- cell startup
- message types
- messages and fragments
- point-to-point communications
- multicast communications
- signals
- compatibility with the native operating system

We consider these points for the AP1000 below.
3 Mapping the PVM task model to the AP1000

3.1 Host/cells model: daemon(s) and user tasks
We follow the PVM 3 MPP distribution by representing a PVM task as an AP1000 cell task, and having a host-processor process act as a pvmd daemon for those tasks. We divide the responsibilities of the PVM daemon between two host processes. A host pvmd process is started by the normal PVM management operations and runs as a normal UNIX process. This process does not allocate the AP1000 cells, and hence it can run indefinitely, like other PVM daemons, and support many PVM programs in succession. The host pvmd allocates the AP1000 cells only when an external PVM process calls pvm_spawn, specifying the AP1000 as the location for executing the spawned tasks.

On the AP1000 under the current sys.c7 system, the cells are a single resource allocated to one user at a time, and they remain allocated for the life of a host program process. A process called caren manages this resource allocation. In normal AP1000 programs a host task is created that is responsible for communicating with the AP1000 device after initial configuration (the cconfxy call) and cell task creation (the ccreat call). This host task forks the caren process, which retains the machine resource for the life of the process. To allow the PVM daemon to persist on the host (in the normal way of pvmd processes) it cannot be allowed to retain the AP1000 machine resource. For this reason it was decided not to make the PVM daemon a normal AP1000 host process, but to have it fork two processes when performing a spawn. One is the caren process, and the other is an appvmdriver process that acts as a conduit for messages between cell tasks and the PVM daemon on the host. This process model is illustrated in Figure 1.

In this way, the AP1000 is tied up only for the life of one PVM application program, and not for the life of the configured Parallel Virtual Machine. Other users can use the AP1000 between separate PVM program runs. The PVM runs are controlled by the normal PVM task management tools.
Figure 1: Caren Process model for PVM
Communication between the host pvmd and the cells uses a socket connection between pvmd and appvmdriver. A strict half-duplex client-server protocol is followed. The pvmd is the client; it makes requests that are equivalent to the subset of CellOS host library calls needed to support pvmd operations: cconfxy(), ccreat(), cbroad(), l_asend(), crecv(), cprobe(), host_exit(). Data for messages sent and received are copied down the socket connection. This is a source of inefficiency that could be eliminated in future by sharing memory between the two processes.
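To make the half-duplex protocol concrete, the sketch below shows one plausible request framing between pvmd (client) and appvmdriver (server). The opcode names, header layout and helper function are illustrative assumptions only, not the actual implementation, and the dispatch into the CellOS host calls on the appvmdriver side is elided.

    /* Hypothetical request framing for the pvmd <-> appvmdriver socket.
     * The opcodes, header layout and error handling are assumptions made for
     * illustration; partial reads/writes are ignored for brevity. */
    #include <stdint.h>
    #include <unistd.h>

    enum drv_op {                 /* one opcode per CellOS host call pvmd needs */
        DRV_CONFXY, DRV_CREAT, DRV_BROAD, DRV_ASEND,
        DRV_RECV, DRV_PROBE, DRV_EXIT
    };

    struct drv_req {              /* fixed header, followed by 'len' data bytes */
        uint32_t op;              /* one of enum drv_op                         */
        uint32_t cell;            /* destination cell id, where relevant        */
        uint32_t len;             /* number of payload bytes that follow        */
    };

    /* Client side (pvmd): write one request, then wait for the reply header.
     * Half-duplex: no second request is issued before the reply arrives. */
    static int drv_request(int sock, const struct drv_req *req, const void *data,
                           struct drv_req *reply)
    {
        if (write(sock, req, sizeof *req) != (ssize_t)sizeof *req)
            return -1;
        if (req->len > 0 && write(sock, data, req->len) != (ssize_t)req->len)
            return -1;
        /* appvmdriver performs the corresponding CellOS host call here
         * (cconfxy, ccreat, cbroad, l_asend, crecv, cprobe or host_exit)
         * and writes back a reply header plus any received message data. */
        if (read(sock, reply, sizeof *reply) != (ssize_t)sizeof *reply)
            return -1;
        return 0;
    }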
3.2 Addresses
The PVM system defines a 32-bit task address, formatted as s and g fields (which together specify the process class), an h field (the PVM host number), and a p field. The first three have fixed meanings. The 18-bit p field is local-implementation dependent: on a workstation it would normally refer to the process id. For the AP1000 the p field identifies (cellid, taskid). To allow for the maximum configuration of 1024 cells, 10 bits are used for the cellid, 7 bits for the taskid, and 1 bit is used to distinguish the host from its cells; this host-address bit identifies processes on the host processor. Note that this structure restricts the range of PVM user taskids per cell to 7 bits' worth (0..127).
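A sketch of how the p field can be packed and unpacked follows. Only the field widths (1 + 10 + 7 = 18 bits) come from the text above; the particular bit ordering chosen here is an assumption for illustration.

    /* Illustrative packing of the AP1000-specific 18-bit p field.
     * The ordering (host bit above the cellid, cellid above the taskid) is an
     * assumption; only the widths 1 + 10 + 7 = 18 are taken from the text. */
    #include <assert.h>

    #define P_HOST_BIT   (1u << 17)      /* distinguishes the host from the cells */
    #define P_CELL_SHIFT 7               /* 10-bit cellid sits above the taskid   */
    #define P_CELL_MASK  0x3ffu          /* cellid range 0..1023                  */
    #define P_TASK_MASK  0x7fu           /* taskid range 0..127                   */

    static unsigned make_p_field(int on_host, unsigned cellid, unsigned taskid)
    {
        assert(cellid <= P_CELL_MASK && taskid <= P_TASK_MASK);
        return (on_host ? P_HOST_BIT : 0u) | (cellid << P_CELL_SHIFT) | taskid;
    }

    static unsigned p_cellid(unsigned p) { return (p >> P_CELL_SHIFT) & P_CELL_MASK; }
    static unsigned p_taskid(unsigned p) { return p & P_TASK_MASK; }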
3.3 Number of tasks per cell; static or dynamic
The major incompatibility between the PVM 3 MPP model and AP1000 CellOS is in dynamic process creation: dynamic task creation is not available in the basic CellOS. All tasks must be created in a single ccreat call from the host. This implies that only a single call of pvm_spawn is allowed, and hence only homogeneous tasks. Our initial AP1000 version imposes a restriction of one process per cell. This limits the classes of PVM program that can be run to those with a single pvm_spawn, and hence to multiple copies of one process executable. The minimum number of processes spawned is one; the maximum is the number of cells in the particular AP1000. Further work can extend this to multiple (homogeneous) tasks per cell relatively easily, as long as the number is a simple multiple of the number of cells. The ANU extensions to CellOS permit dynamic task creation, with some restrictions. These variants are discussed further under Future Work below.
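For example, a controlling PVM program on a workstation could start all of its AP1000 worker tasks with the single permitted pvm_spawn call. The host name "ap1000" and the executable name "worker" below are placeholders, and under the one-task-per-cell restriction NTASK must not exceed the number of cells.

    /* Single pvm_spawn onto the AP1000: one copy of the same executable per cell.
     * "worker", "ap1000" and NTASK are placeholder values for this sketch. */
    #include <stdio.h>
    #include "pvm3.h"

    #define NTASK 128          /* must not exceed the number of AP1000 cells */

    int spawn_on_ap1000(int *tids)
    {
        int started = pvm_spawn("worker", (char **)0, PvmTaskHost,
                                "ap1000", NTASK, tids);
        if (started < NTASK)
            fprintf(stderr, "only %d of %d tasks started\n", started, NTASK);
        return started;
    }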
3.4 Cell task startup
In the user-task lpvmmimd library the function beatask is called to initialise a process (cell task) as a PVM task. It is called at the start of every pvm_* library function, so that there is no explicit PVM enrolment call. System-specific information (such as the host PVM address and the parent process id) is passed to each task when it starts, by an AP1000 broadcast (cbroad) from appvmdriver. The beatask routine receives this information packet as its first action and is then able to calculate its own PVM address (from its cellid and the host PVM address).
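A minimal sketch of this implicit enrolment follows. The two helper functions are stand-ins for the real CellOS mechanisms (receiving the startup broadcast from appvmdriver and querying the local cell id), and the address arithmetic assumes the bit layout sketched in Section 3.2; the real beatask differs in detail.

    /* Sketch of implicit PVM enrolment on a cell: every pvm_* entry point calls
     * the enrol routine, which only does work on the first call.  The helper
     * stubs below are stand-ins for the real CellOS/startup mechanisms. */
    static int receive_startup_host_tid(void) { return 1 << 18; }  /* stand-in */
    static int my_cellid(void)                { return 5;       }  /* stand-in */

    static int enrolled = 0;
    static int mytid = -1;

    static void enrol_if_needed(void)
    {
        if (enrolled)
            return;                          /* fast path after the first call  */
        int host_tid = receive_startup_host_tid();  /* broadcast by appvmdriver */
        /* Replace the 18-bit p field of the host address with this cell's id
         * (host bit clear, taskid 0), giving this task's own PVM address.      */
        mytid = (host_tid & ~0x3ffff) | (my_cellid() << 7);
        enrolled = 1;
    }

    int sketch_pvm_mytid(void)               /* example library entry point     */
    {
        enrol_if_needed();
        return mytid;
    }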
3.5 Message types
PVM has a concept of message type similar to that of the AP1000: messages are tagged by type when sent, and a user task may block until a message of a specific type is received, or specify a wildcard (ANY) type value. The obvious choice would seem to be to implement PVM types directly by mapping them onto AP1000 message types. However, the structure of the PVM distribution software encourages another method: since the MPP distribution places message-type handling in its machine-independent layer, it is much easier to implement using none of the AP1000 message-type functionality. It is not known what effect this may have on performance.
3.6 Messages and fragments
Similarly, the message fragmentation and buffer handling in the distribution are untouched. Messages are sent and received as fragments, for which the AP1000 type is ignored by our low-level routines. The PVM type and message identifiers are carried within the PVM headers of these messages. This maintains a clean separation of the PVM library layer from the native machine communications layer, which is a desirable property in communications software.
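As a rough illustration of the idea (not the distribution's actual code), a PVM message can be split into fixed-size fragments whose header carries the PVM identity, so the native transport never needs to interpret the payload. FRAG_SIZE, the header layout and the transport callback are assumptions.

    /* Illustrative fragmentation of a PVM message.  The header fields and
     * FRAG_SIZE are assumptions, not the real PVM fragment format. */
    #define FRAG_SIZE 4096                /* payload bytes per fragment (assumed) */

    struct frag_header {
        int src_tid;                      /* PVM address of the sender            */
        int msg_tag;                      /* PVM message tag                      */
        int seq;                          /* fragment sequence number             */
        int last;                         /* non-zero on the final fragment       */
    };

    /* Send 'len' bytes as a sequence of fragments through a caller-supplied
     * transport routine (which would wrap the native AP1000 send). */
    static void send_fragments(const char *data, int len, int src_tid, int msg_tag,
                               void (*transport)(const struct frag_header *,
                                                 const char *, int))
    {
        int seq = 0, off = 0;
        do {
            int chunk = len - off;
            if (chunk > FRAG_SIZE)
                chunk = FRAG_SIZE;
            struct frag_header h = { src_tid, msg_tag, seq++, off + chunk >= len };
            transport(&h, data + off, chunk);
            off += chunk;
        } while (off < len);
    }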
3.7 Point-to-point communications
For implementing point-to-point communications (task to daemon, task to task, and task to external task), the PVM MPP distribution code uses asynchronous sends and receives of fragments: it avoids copying data on sending, and generally receives directly into the final buffer, also avoiding copying. This must be modified substantially for the AP1000. For sending, the AP1000 has a non-blocking, synchronous l_asend routine, which makes a copy of the data (except when running in linesend mode). In linesend mode this operation blocks at the sender until the message is sent. Neither form is blocking in the sense of waiting for receipt by a user task. Either form of l_asend will suffice for sending. The ANU extensions to CellOS include asynchronous sending: see Future Work, below. For receiving, the AP1000 has no asynchronous receive. Messages are received into a system buffer, from where they must be copied into the user program. Non-blocking receives are possible via a cprobe() call. The ANU extended CellOS allows a pre-receive call to register a fragment buffer as the direct destination of the next incoming fragment, and this may improve performance (see Future Work, below).
3.8 Multicast communications
The PVM multicast operation sends a copy of a single message buffer to each of a list of destination addresses. A multicast may originate from a cell or from an external PVM process. A cell task that executes a multicast operation sends the message to the host pvmd. The daemon dispatches the message to all members of the address list. Where there are external processes in the address list, the message is sent to their host(s) for redistribution. Where there are cell tasks in the address list, the daemon sends the message to them all (via appvmdriver): it broadcasts to all cells (if the address list includes them all), or sends copies of the message sequentially to each cell addressed. This has two sources of inefficiency:

- the need to go through the pvmd in every case (communication with the host is a bottleneck to be avoided where possible);
- the use of broadcast in only one case (where all cells, including the sender, are addressed).
It is particularly inefficient in the case where a cell initiates a broadcast to all cells other than itself. This operation is available directly as a cell library routine on the AP1000. This form of multicast is probably common enough in user programs to justify further work in making the cell library more intelligent.
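The daemon's dispatch decision for the cell portion of an address list can be pictured as follows; the function names and the two send wrappers are hypothetical stand-ins for the real pvmd/appvmdriver code.

    /* Sketch of the host daemon's multicast dispatch to cells.  The two
     * callbacks are hypothetical wrappers for the cbroad path and the
     * sequential per-cell send path described above. */
    static void mcast_to_cells(const int *cellids, int ndest, int ncells,
                               const char *msg, int len,
                               void (*broadcast_all)(const char *, int),
                               void (*send_to_cell)(int, const char *, int))
    {
        if (ndest == ncells) {
            /* every cell (including the sender) is addressed: one cbroad suffices */
            broadcast_all(msg, len);
        } else {
            /* otherwise one send per addressed cell; on a 1024-cell machine a
             * list that omits a single cell costs up to 1023 separate sends   */
            for (int i = 0; i < ndest; i++)
                send_to_cell(cellids[i], msg, len);
        }
    }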
3.9 Signals
The AP1000 does not support UNIX-like signals for user tasks. PVM signals are not supported in this implementation.
3.10 Compatibility with native CellOS operating system
To avoid interference with our implementation of PVM messages and controls, no CellOS message sending or receiving calls are allowed in user tasks (including any sends, receives, or readmsg). CellOS or Acacia file input/output is safe, being handled in a different message domain. A more general use of message domains [6] for the PVM implementation's communications would make the PVM message passing transparent to the user; this has not been implemented yet.
4 Performance

A benchmark program in the distribution is used to measure communications bandwidth for workstation PVM user tasks. For two workstations on the same local Ethernet segment (the best situation that can be had for ordinary workstations), a measured average throughput of 234 kByte/sec is obtained for 10 kByte messages, and 58 kByte/sec for 1 kByte messages. If the performance of our implementation achieved nearly 50% of the measured AP1000 transfer rate, i.e. 2.5 Mbyte/sec (half of 5.5 Mbyte/sec) [7], we would be satisfied. Measured performance figures are not yet available, but we hope to report them soon.
5 Future Work

This is a first attempt at an implementation of PVM on the AP1000. It shows how the PVM model can be ported to the AP1000, and indicates several areas for efficiency improvements.
5.1 Asynchronous sending and receiving
Using asynchronous message receiving (as implemented in the ANU sys.anu pre_recv extensions to CellOS) may improve performance by reducing internal message buffering and copying.
5.2 Multiple tasks per cell
Multiple homogeneous tasks per cell
Where the number of spawned tasks is less than the number of cells, there is no problem for this implementation. One task is placed on each cell, and the AP1000 system supports using a reduced number of cells via the cconfxy function. Simple broadcast operations then send to each active cell. Where the number of tasks is greater than the number of cells, the AP1000 ccreat function can still be used to create a number of copies of the task executable on each cell, with a range of CellOS task-ids. However, the AP1000 broadcast operation cannot be used in a simple way to send to all tasks, because it sends only once to each cell, i.e. to all tasks with a single task-id. If the number of tasks per cell is equal, we may broadcast to all tasks by using a sequence of cbroad calls, one per cell task-id: that is, multiple broadcast copies will be sent by the host, one for each task on a cell (see the sketch below). It seems unlikely that the number of tasks per cell will be large, so this is acceptably efficient. If there is no such even number of tasks per cell, then the extra tasks must also get a copy of the message. It appears wasteful to send copies from the host sequentially; the cell tasks must be made more intelligent about incoming messages, and must be able to re-send messages locally to other tasks on the same cell.
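The multi-broadcast scheme just described amounts to a loop of the following shape; broadcast_to_taskid is a hypothetical wrapper around the cbroad path in appvmdriver, since we do not reproduce the CellOS call itself here.

    /* Sketch: with 'tasks_per_cell' identical tasks on every cell, reach them
     * all with one broadcast per cell task-id.  broadcast_to_taskid() is a
     * hypothetical wrapper, not a real CellOS call. */
    static void broadcast_to_all_tasks(const char *msg, int len, int tasks_per_cell,
                                       void (*broadcast_to_taskid)(int, const char *, int))
    {
        for (int taskid = 0; taskid < tasks_per_cell; taskid++)
            broadcast_to_taskid(taskid, msg, len);   /* one cbroad per task-id */
    }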
Heterogeneous tasks on cells

PVM is able to spawn heterogeneous tasks only by multiple calls of pvm_spawn. To implement this we require dynamic process creation on the cells; see below.
5.3 Dynamic cell process creation
ANU extensions to the cell operating system to allow dynamic process creation will extend the functionality of AP1000 PVM. A restricted form of dynamic task loading has been used in other applications software for the AP1000 [8], and experiments will be done with this form if possible, to provide as much functionality as possible under the standard sys.c7 system. Further extensions to the sys.anu version of the operating system are likely to provide full dynamic task creation: the absence of this feature in sys.c7 is an operating system design choice, not a machine architecture restriction.
5.4 Host process structure
The use of two processes in the host (pvmd and appvmdriver) has a noticeable effect: PVM host-to-cell message-passing efficiency and task startup latency are worse than desirable. It is planned to add to the caren model the facility for a normal, persistently running UNIX process to connect to and open, then release, the AP1000 multiple times during its life. To do so we will implement new library functions connect_cap and disconnect_cap. After calling connect_cap and blocking until the /dev/cap device can be successfully opened, the calling process would not only hold the AP1000 cells resource, but would also share access to the AP1000 CBIF memory (communication buffer interface) with caren. Messages could then be sent without additional copying within the caren process, resulting in the simpler process model of Figure 2, with more efficient broadcast and inter-host communication. After releasing the AP1000, the pvmd process could later re-connect, without the overhead of forking a new appvmdriver process.
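A sketch of how pvmd might use the proposed interface is shown below. Since connect_cap and disconnect_cap are only proposed here, their prototypes and this usage pattern are our assumptions, not an existing API.

    /* Prospective use of the proposed connect_cap/disconnect_cap calls from
     * within pvmd.  The prototypes are assumptions; the functions do not yet
     * exist, and run_one_pvm_program() stands for the spawn/relay/reap cycle. */
    int connect_cap(void);     /* block until /dev/cap opens; acquire the cells  */
    int disconnect_cap(void);  /* release the AP1000 for other users             */
    int run_one_pvm_program(void);

    static void pvmd_serve(void)
    {
        for (;;) {
            if (connect_cap() < 0)   /* connect_cap blocks until the device opens */
                break;               /* give up on unexpected failure             */
            run_one_pvm_program();   /* cells and shared CBIF memory in use       */
            disconnect_cap();        /* give the AP1000 back between runs         */
        }
    }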
5.5 Barrier synchronisation and process groups
Barrier synchronisation has not been implemented. In the PVM model it depends on defining a process group, and a barrier synchronisation operation applies within that group. For efficient use of the AP1000 sync operations it would be necessary for the process group to consist of one task on every cell in the configuration.
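For reference, the group and barrier interface that such an implementation would have to support is used like this in workstation PVM 3; the group name "cells" and the member count are arbitrary values for this example.

    /* Standard PVM 3 group/barrier usage.  "cells" and NMEMBERS are arbitrary. */
    #include <stdio.h>
    #include "pvm3.h"

    #define NMEMBERS 64       /* number of tasks expected at the barrier */

    int barrier_example(void)
    {
        int inum = pvm_joingroup("cells");       /* enrol in the named group    */
        if (inum < 0) {
            fprintf(stderr, "pvm_joingroup failed: %d\n", inum);
            return -1;
        }
        /* ... computation phase ... */
        if (pvm_barrier("cells", NMEMBERS) < 0)  /* wait for NMEMBERS members   */
            return -1;
        pvm_lvgroup("cells");                    /* leave the group afterwards  */
        return 0;
    }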
Figure 2: Proposed Caren Process model
6 Conclusions

The PVM system has been implemented on the AP1000, under restrictions that are relatively easily borne by user programs. It enables users to debug programs on networks of workstations and then port them to an AP1000 to use a larger number of processors and obtain greatly improved performance. Achievable extensions to the CellOS operating system will reduce the restrictions and improve performance further.
References

[1] Al Geist, Adam Beguelin, Jack Dongarra, Weicheng Jiang, Robert Manchek & Vaidy Sunderam, "PVM 3.0 User's Guide and Reference Manual," ORNL/TM-12187, May 1993.

[2] V. S. Sunderam, "PVM: A Framework for Parallel Distributed Computing," Concurrency: Practice and Experience 2 (December 1990), 315-339.

[3] A. L. Beguelin, J. J. Dongarra, G. A. Geist, W. C. Jiang, R. J. Manchek, B. K. Moore & V. S. Sunderam, "PVM 3.2: Parallel Virtual Machine System 3.2," University of Tennessee and Oak Ridge National Laboratory, software distribution (netlib.cs.utk.edu), 1992.

[4] W. A. Shelton, Jr. & G. A. Geist, "Developing Large Scale Applications by Integrating PVM and the Intel iPSC/860," Proceedings of the Intel Supercomputer Users' Group 1992 Annual Users' Conference, Dallas, TX, October 1992.

[5] Brian K. Grant & Anthony Skjellum, "The PVM System: An In-Depth Analysis and Documenting Study - Concise Edition," LLNL, TR UCRL-JC-112016, Livermore, CA, 1992.

[6] David Walsh, Peter Bailey & Bradley M. Broom, "Message Domains: Efficient Support for Layered Message Passing Software," in Proceedings of the First Annual Users' Meeting of Fujitsu Parallel Computing Research Facilities, Fujitsu Ltd. and Fujitsu Laboratories Ltd., November 1992.

[7] Toshiyuki Shimizu, Takeshi Horie & Hiroaki Ishihata, "Performance Evaluation of the AP1000," in Proceedings of the First Annual Users' Meeting of Fujitsu Parallel Computing Research Facilities, Fujitsu Ltd. and Fujitsu Laboratories Ltd., November 1992.

[8] Brian Corrie & Paul Mackerras, "Data Shaders," in Proceedings of Visualization '93, San Jose, California, October 1993.