A Survey of Basic Issues of Parallel Execution on a Distributed System
D.J.V. Evans and A.M. Goscinski
School of Computing and Mathematics
Deakin University
Geelong, VIC, 3217, Australia
Abstract

This report examines the basic issues involved in implementing parallel execution in a distributed computational environment. The study was carried out by considering our claim that a compiler should be directly involved in detecting the processes of a program that can run in parallel on a distributed system, and that a distributed operating system, in particular its global scheduling, should provide support for such parallel execution. For this purpose, the issues of parallelism of a program’s processes are examined first. Issues such as the granularity of parallelism, the use of parallelism, the generality of parallelism and the programmer’s actions when writing parallel programs are addressed. Possible future developments are then presented. Following this, support for parallel execution in distributed systems is discussed. In particular, the issues of interprocess communication, process synchronisation, memory management, process management including global scheduling, reliability and distributed applications are addressed. The report concludes with the major issues that need to be studied and implemented in future compilers and distributed operating systems to allow transparent distributed processing to occur, and with the benefits of such systems.
1 Introduction
Distributed systems will become a reality in the near future only if the problems of distributed processing are solved. These problems all involve the issues of how to:
a. have the processes of one program running at different locations on the network,
b. use the physically distributed memory,
c. implement the accessing of required remote computational resources, and
d. achieve the above three points while it appears to the user that the program is running locally on a single, virtual, centralised machine.
Many of the features required to support distributed processing have been achieved in isolation, mainly in the field of parallel computing or in research into distributed processing or distributed operating systems [Bal, 1990; Goscinski, 1991]. Parallel computation has been achieved in various ways, as in Occam [Inmos Ltd., 1983] or in the current and next proposed version of Ada [Ada 9X Mapping/Revision Team, 1993]. Having physically distributed memory appear as a single global memory has been achieved in Linda [Ahuja et al., 1986]. Transparent access to some remote resources, such as files, has been implemented in a limited way in most LANs that use file servers and in some experimental distributed operating systems such as Clouds [Dasgupta et al., 1991]. Process migration and remote process execution have been used in Sprite and RHODOS [Douglis and Ousterhout, 1991; Gerrity et al., 1990]. A result that includes all the required features has, so far, evaded researchers.

Distributed computing can improve performance and reliability [Bal, 1990; Baker and Ousterhout, 1991]. The increase in reliability will occur either by several processors doing the same task (multiple copies) [Goscinski, 1991] or by using the information about processors and program states that is available in the system [Baker and Ousterhout, 1991]. Improved performance comes through the ability to have programs run as parallel processes rather than as sequential processes [Goscinski, 1991; Inmos Ltd., 1983; Kumar et al., 1994].

To some degree parallel execution in physically distributed systems has been achieved. Linda [Carriero and Gelernter, 1988], Orca [Bal, 1990], Sprite [Ousterhout et al., 1988], PVM [Sunderam, 1990] and Distributed C [Pleier, 1993], to name a few, have all achieved some degree of parallel execution of processes but have not fully supported distributed processing. The majority of these systems and languages require special notations or actions by the programmer, which force the parallelism to be explicitly identified by the programmer.

The development of distributed operating systems has been much slower than the evolution shown for networks and workstations/PCs. Distributed operating systems under development include Amoeba, LOCUS, Mach, RHODOS and V, all of which are examined and contrasted in [Goscinski, 1991]. Languages which support distributed processing, including Linda [Ciancarini and Guerrini, 1993], CSP, NIL, Ada, Distributed Processes and Emerald, are all examined in [Bal et al., 1989]. As yet there is no operating system or language that has gained widespread acceptance.

The management of resources which are distributed over a geographical area is not an easy task. To manage or control a distributed computing environment for maximum benefit will require the development of a distributed operating system that supports distributed processing [Gerrity et al., 1990; Goscinski and Zhou, 1994]. A distributed operating system manages the interaction of multiple processors in the control and management of resources over a network and involves the nodes on the network in such activity. Distributed processing is the use of multiple
processors to run a single program or process. Distributed processing cannot be achieved without the support of a distributed operating system. Programming for distributed applications raises several issues [Goscinski and Zhou, 1994]:
a. parallel execution and granularity of parallel execution,
b. communication between concurrently executing processes,
c. synchronisation of concurrent processes with the program,
d. mapping processes to workstations,
e. detecting the failure of a process or processor, and
f. action to be taken when a failure of a process or processor has occurred.
Communication between processes in a distributed system will use some form of message passing and these messages must be handled by the operating system [Bal, 1990; Douglis and Ousterhout, 1991]. The synchronisation of concurrent processes requires that processes that are executing concurrently are allowed to finish prior to the rest of the program continuing. The operating system is the logical place for this [Ben-Ari, 1990; Deitel, 1990; Goscinski, 1991]. It is the operating system that is responsible for process control [Goscinski, 1991; Goscinski and Zhou, 1994] and it is logical for the operating system to make decisions in regard to monitoring processes and what to do upon failure of processes.

It has been proposed that a compiler can be used to address and solve the first issue listed above [Goscinski and Zhou, 1994]. A compiler can be used to detect parts of code that can be executed in parallel. This is already done when optimising loops for the Cray Y-MP [Deitel, 1990] and can be shown to work for the problems listed in Section 2.4 [Kumar et al., 1994]. A compiler can also be used to detect when program synchronisation should occur. This information can be passed from the compiler to the operating system to help control the flow of the program.

The rest of this report details some of the major issues in parallel and distributed computing and how they relate to distributed processing. This will cover:
a. a brief description of concurrent execution and the levels of parallelism,
b. the difference between parallel and distributed paradigms,
c. some of the features that support distributed processing which are currently available, access mechanisms to these features and possible improvements to the access mechanisms,
d. examples of distributed applications and programs, and suggested improvements to current techniques.
A conclusion at the end reiterates the major points and summarises the arguments of this paper.
2 Parallel Execution
Most programs during their execution are conceptualised as a series of sequential processes. One process does not start until its predecessor is finished, and the program is considered to be running from the start of its first process until the conclusion of its last process [Deitel, 1990; Lilja, 1994]. Sequential execution is the most common form of program execution. A large sequential program may have many processes between start and end while a small program may consist of only one process.
While a sequential program has its processes run after each other, it is not true that all processes rely upon the results obtained from a preceding process. This allows some processes to be run in parallel with each other. Processes from the same program that are running at the same time are said to be running in parallel [Deitel, 1990; Kumar et al., 1994; Tanenbaum, 1990].

There are two types of parallel execution, logical and physical. Logical parallelism is when a parallel program is running on a uniprocessor time-sharing system. The parallel processes have their own time slice, process states and other data associated with a process. It is impossible to determine which parallel process is using the CPU at any given time. The program behaves as if it has parallel processes but is running in a pseudo-parallel manner [Bal, 1990; Gough and Mohay, 1988; Tanenbaum, 1990]. Physical parallelism is when the processes execute at the same time as each other, each process having access to a separate CPU. This is the concurrent execution of processes [Ben-Ari, 1990; Goscinski, 1991; Gough and Mohay, 1988; Tanenbaum, 1990]. It is well documented that parallel execution of a program’s processes can reduce the time required for its execution [Ahuja et al., 1986; Ben-Ari, 1990; Camp et al., 1994; Ellis, 1990; Kumar et al., 1994; Ligon and Ramachandran, 1994; Lilja, 1994; Sterling and Shapiro, 1986; Tanenbaum, 1990; Zielinski et al., 1994].

Any discussion about concurrent execution is meaningless without describing the conceptual level at which the parallelism occurs. The terms most often used are fine-grained parallelism and coarse-grained parallelism [Deitel, 1990; Tanenbaum, 1990], and these terms are related to the level and frequency of the communication and synchronisation that occurs within a program when it is running as parallel processes.

Fine-grained parallelism is generally based at a low level. Low level in this context means that the level of parallelism is based around a source code instruction that has a close correspondence to machine instructions [Inmos Ltd., 1983; Tanenbaum, 1990]. The source code instructions can be executed in parallel. Each instruction in the source code takes very few machine instructions to carry out. The groups of machine instructions that correspond to the source code are executed on separate processors.

Coarse-grained parallelism is based at a conceptually higher level than fine-grained parallelism. This parallelism is based upon the logical structure of the program. It is implemented around procedures, functions or some other grouping of code that exhibits the structure inherent in the program [Ahuja et al., 1986; Pleier, 1993; Tanenbaum, 1990].

Fine-grained and coarse-grained parallelism have equivalents in hardware, being tightly-coupled machines and loosely-coupled machines, respectively [Deitel, 1990; Tanenbaum, 1990]. Fine-grained parallelism normally runs on tightly-coupled machines while coarse-grained parallelism is normally found on loosely-coupled machines [Tanenbaum, 1990].
2.1 Fine-grained Parallelism
Fine-grained parallelism is conceptually low level with regard to computing and is considered as being close to the machine instruction level. This means there is a close correspondence between a command in the programming language and the machine instructions used to carry out the command. Languages such as Occam [Inmos Ltd., 1983], Aeolus, ParAlfl, ParLOG and Concurrent Prolog [Bal et al., 1989] are considered to support fine-grained parallelism. Languages that support fine-grained parallelism can be very cryptic and may not natively support the rich set of data structures used by programmers working with high level
languages. An example of this is Occam [Inmos Ltd., 1983], where the programmer must build such constructs before using them. Not all fine-grained languages are as restrictive as Occam. Languages such as Aeolus, ParAlfl, ParLOG and Concurrent Prolog [Bal et al., 1989] also support fine-grained parallelism but provide a range of data structures and have a well defined syntax. These languages were designed to allow a program to execute over several processors and to hide the details of interprocess communication from the programmer, but they require the programmer to specify the parallel components.

Fine-grained parallelism is most often seen in machines that allow high-speed access to memory and have multiple processors. These are usually specialised machines and are often classified as vector-processors, array-processors, dataflow computers, etc., which require specialised languages and techniques to access their features [Kumar et al., 1994]. In these machines there is frequent synchronisation, usually every few clock cycles or instructions. All memory locations are usually accessible to all processors that require access. When processes need to communicate with each other or get results from other processes the mechanisms are very fast and the physical distance small [Tanenbaum, 1990].

Fine-grained parallelism is normally implemented in loops [Bal et al., 1989; Ellis, 1990; Inmos Ltd., 1983; Kumar et al., 1994; Lilja, 1994; Tanenbaum, 1990] although there are alternative implementations [Ahuja et al., 1986; Carriero and Gelernter, 1988; Dasgupta et al., 1991; Jul et al., 1988; Tanenbaum, 1990]. Each pass of a program’s loop is executed concurrently, rather than sequentially. Some compilers, such as the Fortran compiler for Cray’s X-MP, adjust nested loops to gain maximum efficiency from the machine [Deitel, 1990].

In summary, the definition of what constitutes fine-grained parallelism varies with the philosophical view and background of an author. This contrast of views can be seen by comparing the views of Lilja (1994) against Kumar et al. (1994). Both concentrate on the use of loops for concurrent processing. Lilja, whose background is in electrical engineering, describes fine-grained parallelism as having each instruction of a single loop iteration executed on separate processors. Kumar et al. have a computing science/mathematical background and describe fine-grained parallelism as all the instructions of a single loop iteration being executed on a separate processor.
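To make the notion of iteration-level parallelism concrete, the following minimal sketch in C hands each pass of an independent loop to its own POSIX thread. The threads are purely illustrative: the fine-grained systems cited above would distribute the corresponding machine instructions across their own processors, with far cheaper synchronisation than a general-purpose thread library provides.

    /* A minimal sketch (not from the cited systems) of iteration-level parallelism:
     * each pass of c[i] = a[i] + b[i] is independent, so every iteration could,
     * in principle, run on its own processor.  Here each iteration is handed to
     * its own POSIX thread purely for illustration. */
    #include <pthread.h>
    #include <stdio.h>

    #define N 8
    static double a[N], b[N], c[N];

    static void *one_iteration(void *arg)
    {
        long i = (long)arg;          /* which loop index this thread owns */
        c[i] = a[i] + b[i];          /* the body of a single loop pass    */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[N];
        for (long i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

        for (long i = 0; i < N; i++)             /* one worker per loop pass */
            pthread_create(&t[i], NULL, one_iteration, (void *)i);
        for (long i = 0; i < N; i++)
            pthread_join(t[i], NULL);

        for (long i = 0; i < N; i++)
            printf("c[%ld] = %.1f\n", i, c[i]);
        return 0;
    }

The key property is that no iteration reads a value written by another, so the passes may complete in any order.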
2.2 Coarse-grained Parallelism
Fine-grained parallelism is generally described as being low-level in nature as there is tight synchronisation and a close correspondence between the source code and the machine instructions used. Coarse-grained parallelism is conceptually at a higher level of abstraction than fine-grained parallelism. It is based on the logical constructs and interactions of the program. Synchronisation occurs when the program requires data to be exchanged or when a program requires such action to ensure correct operation. The rendezvous mechanism in Ada [Ada 9X Mapping/Revision Team, 1993] is a classic example of such interaction.

Examples of concurrent programming languages that support forms of coarse-grained parallelism are Linda [Carriero and Gelernter, 1988], Distributed Processes, Orca and Argus [Bal et al., 1989]. In these languages a program may be running over several machines and the timing on the machines cannot be guaranteed to be the same, so synchronisation and communication of the processes within a program occur in a different manner than with fine-grained parallelism. Coarse-grained parallelism is generally characterised by less frequent and less regular synchronisation and communication of processes than is the case with fine-grained parallelism.
Hardware support for coarse-grained parallelism is normally found on loosely-coupled machines. In a loosely-coupled system the processors do not physically share memory. This requires co-operating processes to communicate via some form of messages, rather than have direct access to memory locations, if the processes of the one program are executing on different processors [Goscinski, 1991; McBryan, 1994; Tanenbaum, 1990].

In summary, definitions of coarse-grained parallelism differ from person to person, as did the descriptions of fine-grained parallelism. Again, contrasting the views of Lilja (1994) and Kumar et al. (1994) is illustrative of the issue. Lilja has coarse-grained parallelism as one or more complete iterations of a loop on a processor. Kumar et al. define coarse-grained parallelism as more than one complete iteration on a processor. It can now be seen that Lilja’s view of coarse-grained parallelism is equivalent to Kumar et al.’s view of fine-grained parallelism. It should be noted that there is an overlap of views for coarse-grained parallelism that does not exist for fine-grained parallelism.
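The contrast with fine-grained execution can be illustrated with a small sketch in C for a Unix-like system (an illustration only, not taken from the languages cited above): each process performs a whole procedure's worth of work, and the single message sent back through a pipe is the only point of synchronisation and communication.

    /* A minimal sketch of coarse-grained, process-level parallelism: a child
     * process carries out one sizeable unit of work and the only synchronisation
     * point is the single result message it sends back through a pipe. */
    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* A stand-in for a sizeable unit of work (a procedure or function). */
    static long long partial_sum(long long lo, long long hi)
    {
        long long s = 0;
        for (long long i = lo; i < hi; i++) s += i;
        return s;
    }

    int main(void)
    {
        int fd[2];
        if (pipe(fd) != 0) { perror("pipe"); return 1; }

        pid_t pid = fork();
        if (pid == 0) {                          /* child: one coarse-grained unit */
            long long s = partial_sum(0, 500000);
            write(fd[1], &s, sizeof s);          /* the only communication point   */
            _exit(0);
        }

        long long mine = partial_sum(500000, 1000000); /* parent works in parallel */
        long long theirs = 0;
        read(fd[0], &theirs, sizeof theirs);           /* synchronise on the result */
        waitpid(pid, NULL, 0);

        printf("total = %lld\n", mine + theirs);
        return 0;
    }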
2.3 Repetition of Calculations (Loop Structures)
In scientific or engineering computations, which often use iteration to find a solution, a programmer will frequently use a loop structure. If the problem is to be solved in a concurrent manner the computing environment will dictate whether fine-grained or coarse-grained parallelism is to be used. Using either granularity of parallelism gives rise to two problems: the domain-mapping problem and the problem of dependencies.

A problem with using parallelism based on loops is how to divide the data so there is an even load across all processes and processors. This is called the domain-mapping problem [Camp et al., 1994; Kumar et al., 1994]. The domain-mapping problem addresses how to break the data into discrete groups that can be assigned to the processes in the most effective and efficient manner possible. For the classes of problems listed in Section 2.4, the division of data is well understood.

Another issue of concern with loops is the dependency of the results upon previous non-local results. Solutions to mathematical or engineering problems based on loops often require results from a previous iteration. With fine-grained parallelism each iteration may be done on a different processor. If a result from one iteration is required to complete a calculation in another iteration, as in Newton’s method for finding roots of equations, and the earlier result is not at the same processor as the current iteration, then some form of communication must occur to obtain the earlier result and allow the computation to finish. If there is a large number of iterations then a lot of communication between processors is needed. This again shows how the network may be tied up in servicing a fine-grained parallel computation and indicates the problems arising from a poor domain mapping [Kumar et al., 1994; Lilja, 1994].
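A minimal sketch of one common answer to the domain-mapping problem, assuming a simple block partitioning: N iterations are divided into P contiguous, nearly equal blocks, with the remainder spread over the first few workers. Only the partitioning arithmetic is shown; it says nothing about dependencies between iterations, which must still be analysed separately.

    /* A minimal sketch of block domain mapping: divide N loop iterations into P
     * contiguous blocks so that each worker receives an almost equal share. */
    #include <stdio.h>

    int main(void)
    {
        int N = 1000;                 /* total number of iterations (illustrative) */
        int P = 7;                    /* number of worker processes (illustrative) */

        for (int p = 0; p < P; p++) {
            /* spread the remainder over the first N % P workers */
            int base = N / P, extra = N % P;
            int lo = p * base + (p < extra ? p : extra);
            int hi = lo + base + (p < extra ? 1 : 0);
            printf("worker %d handles iterations [%d, %d)\n", p, lo, hi);
        }
        return 0;
    }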
2.4 Generality of Parallelism
Parallelism based on iteration restricts concurrent execution to a narrow range of problems [Kumar et al., 1994; Lilja, 1994; Ponnusamy et al., 1993] such as:
a. matrix algorithms,
b. data ordering (sorting) algorithms,
c. some graph theory algorithms,
d. search algorithms for discrete optimisation,
e. fast Fourier transformations, and
f. systolic algorithms.
The problem types listed above can all be represented and solved by iteration. In many cases the problems were originally described and solved by such methods. The best example of this is the solving of a system of linear equations. The system can be stored in a matrix and each row or column of the matrix is assigned to be solved in parallel [Kumar et al., 1994]. However, not all classes of problems can be solved by iteration. For effective and widespread use the parallelism must be more general in the problems it can handle than the few areas listed above.

Languages such as Ada [Ada 9X Mapping/Revision Team, 1993], Emerald [Black et al., 1987], Concurrent C, NIL, Argus, etc. [Bal et al., 1989] and Distributed C [Pleier, 1993] base the parallelism on the process or procedure level of abstraction. This approach is also taken by some operating systems such as Clouds [Dasgupta and LeBlanc, Jr., 1991], RHODOS [Gerrity et al., 1990] or Sprite [Douglis and Ousterhout, 1991]. By basing the parallelism at the process or procedural level a much wider range of problems can be, and has been, addressed.

The level of parallelism is expressed through the language used and its supporting constructs. Tasks and packages used in Ada show parallelism at the procedure level [Ada 9X Mapping/Revision Team, 1993]. Occam [Inmos Ltd., 1983] shows fine-grained parallelism with the unit of parallelism being based around each instruction. In object-oriented environments parallelism is expressed through objects, typified by Emerald [Jul et al., 1988] and Aeolus [Dasgupta et al., 1991].

Programming languages with a low level of abstraction tend to be difficult to program in, as in Occam [Inmos Ltd., 1983]. Low level languages often do not have built-in support for control constructs that programmers have come to expect in high level languages. The need to build such control constructs, such as if... then... else..., from low level primitives makes programming in low level languages harder than programming in high level languages. Languages with a higher level of abstraction allow a wider range of problems to be solved more easily than languages with a low level of abstraction. This can be seen by programming the same problem in various languages, as is shown in Ben-Ari (1990).
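As an illustration of the row-wise parallelism described above for systems stored as matrices, the following sketch in C computes a matrix-vector product with each worker thread owning a block of rows. The sizes, thread count and use of POSIX threads are assumptions made for the illustration, not a prescription from the cited work.

    /* A minimal sketch of row-wise parallelism for y = A x: each worker thread
     * is given a block of rows, mirroring the way rows of a linear system can
     * be assigned to separate processes. */
    #include <pthread.h>
    #include <stdio.h>

    #define N 8                 /* matrix dimension  (illustrative) */
    #define W 4                 /* number of workers (illustrative) */

    static double A[N][N], x[N], y[N];

    static void *rows(void *arg)
    {
        long w = (long)arg;
        for (long i = w * (N / W); i < (w + 1) * (N / W); i++) {
            y[i] = 0.0;
            for (long j = 0; j < N; j++)
                y[i] += A[i][j] * x[j];      /* row i is owned by worker w */
        }
        return NULL;
    }

    int main(void)
    {
        for (long i = 0; i < N; i++) {
            x[i] = 1.0;
            for (long j = 0; j < N; j++) A[i][j] = (double)(i + j);
        }

        pthread_t t[W];
        for (long w = 0; w < W; w++) pthread_create(&t[w], NULL, rows, (void *)w);
        for (long w = 0; w < W; w++) pthread_join(t[w], NULL);

        for (long i = 0; i < N; i++) printf("y[%ld] = %.1f\n", i, y[i]);
        return 0;
    }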
2.5 Parallel Programming and Programmers
Programming is most often taught as a sequential skill, because people have a natural tendency to reason in sequential rather than parallel terms [Jonston and Heinz, 1978; Miller, 1956; Nielsen and Smith, 1973; Reed, 1992]. This section examines the different implementations of parallelism and the problems with how people program.

Most methods of concurrent programming require the explicit expression of which parts of a program are to be executed in parallel. Ada [Ada 9X Mapping/Revision Team, 1993] requires the use of tasks and packages, which use a rendezvous mechanism for communication and synchronisation. Occam [Inmos Ltd., 1983] requires the keywords PAR or SEQ to specify a change to parallel or sequential processing respectively. Modula-2 [Gough and Mohay, 1988] uses the concept of co-routines, identified by the keywords COBEGIN and COEND, to specify concurrent execution.

Many programmers have problems coming to terms with concurrent programming. Writing a parallel program, even using well known parallel algorithms, is a hard task for all but the most experienced programmers due to the requirement of explicitly specifying parallel
components. The programmer must have a deep understanding of the problem being addressed, the language being used and, in many cases, the machine the program is to be executed on [Ben-Ari, 1990; Lilja, 1994; Kumar et al., 1994; Ponnusamy et al., 1993], otherwise errors will occur in the resulting computation. The programmer’s problems are compounded if they also have to specify which processor is to run a process, as in ParAlfl [Bal et al., 1989].

There are some examples of implicit parallelism that have been implemented [Grey, 1993; Douglis and Ousterhout, 1991]. Grey has implemented distributed compilations for C over a Sun network while Douglis and Ousterhout have a distributed version of ‘make’ for the Sprite operating system. Both execute their code in a concurrent manner. The user is not aware of the parallel operations occurring, only that they get the expected results in less time than before. These examples are very coarse-grained; there are no data dependencies in the way they are executed and they run in homogeneous environments.

Both of the above examples have had high user acceptance [Grey, 1993; Douglis and Ousterhout, 1991]. This appears to stem from the fact that the parallelism was implicit, and therefore hidden from the users. The users did not have to concern themselves with the many problems associated with implementing the concurrency of the execution. Realising concurrent execution implicitly seems to have gained rapid acceptance for these simple cases.

There are some issues that need to be addressed to make implicit concurrent execution useful for wider use. The first is to implement the parallelism at a finer grain than in Sprite. The current implementation of implicit coarse-grained parallelism is too coarse to be useful in a general programming environment. The second concern, which stems directly from the first, is that data dependencies must be able to be detected. In the above examples, the type of activity used has no data dependencies and so the problem is not relevant, but if implemented at a lower level of abstraction the problem needs to be resolved. The third item is that repeated executions of the same program with the same data must yield the same results. This covers the twin areas of race conditions [Ben-Ari, 1990; Deitel, 1990; Tanenbaum, 1990] and synchronisation [Bal, 1990; Bal et al., 1989; Ben-Ari, 1990; Deitel, 1990; Goscinski, 1991; Gough and Mohay, 1988; Kumar et al., 1994; McBryan, 1994].
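The burden of explicit specification discussed at the start of this section can be seen in even a trivial example. The sketch below, using POSIX threads as a stand-in for constructs such as PAR/SEQ or COBEGIN/COEND, shows that the programmer must personally name the components that are to run in parallel and insert the point at which sequential execution resumes.

    /* A minimal sketch of explicit parallelism in the spirit of
     * COBEGIN ... COEND or Occam's PAR: the programmer names, by hand, the two
     * procedures to run concurrently and then waits for both before continuing. */
    #include <pthread.h>
    #include <stdio.h>

    static void *task_a(void *arg) { (void)arg; puts("task A running"); return NULL; }
    static void *task_b(void *arg) { (void)arg; puts("task B running"); return NULL; }

    int main(void)
    {
        pthread_t ta, tb;

        /* "COBEGIN": explicitly start the parallel components */
        pthread_create(&ta, NULL, task_a, NULL);
        pthread_create(&tb, NULL, task_b, NULL);

        /* "COEND": the program does not continue until both have finished */
        pthread_join(ta, NULL);
        pthread_join(tb, NULL);

        puts("sequential execution resumes here");
        return 0;
    }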
2.6 Parallelism for Distributed Processing
In the majority of uniprocessor systems all memory is managed by a single operating system. Memory is located on the machine and is accessed by a single bus. All memory references are in relation to a single memory that is shared by all processes. When cooperating processes need to share data they place the data in a memory location that is known to both processes. As the memory is on the same machine as the processors this is an easy and efficient way of sharing data. This method is known as memory sharing, as two or more processes share the same memory location [Bal et al., 1989; Deitel, 1990; McBryan, 1994; Tanenbaum, 1990].

In a distributed system the computing hardware is located in physically separate locations. The nodes are joined by some form of network and most nodes have their own dedicated memory. There is no access to a single, shared, memory in distributed systems. Interprocess communication, and the sharing of data, must be achieved over the network, generally via some form of message passing or remote procedure call [Bal, 1990; Bal et al., 1989; Deitel, 1990; Goscinski, 1991; McBryan, 1994; Tanenbaum, 1990].

To ensure that a program produces the correct results, the processes that are executing concurrently must interact at the correct time. If processes are to work together and share data
there will be an imposed order of which process will read or write the data’s value and when it will occur. This process of ensuring the correct operation of a program’s processes is called synchronisation [Bal, 1990; Bal et al., 1989; Goscinski, 1991; Kumar et al., 1994; McBryan, 1994]. In a distributed system, synchronisation of the concurrent processes, along with data sharing, must be achieved over the connecting network, thus adding to network traffic.

The network capacity to support parallelism must also be considered. A network’s communication channel has a finite amount of data it can carry [Agrawal and Malpani, 1991; Khan, 1992; Young, 1990]. The encoding of data has allowed higher effective communication rates while still keeping the amount of signal traffic within the limits defined by Shannon’s Law [Shannon, 1948]. In a distributed system there are overheads imposed by:
a. administration of the network (ensuring timing, integrity of the links, etc.),
b. the requirements to synchronise processes,
c. the characteristics of the communication media,
d. the support for sharing of data between processes, and
e. the need to support user demands for activities such as sharing files and inter-user communication.
Fine-grained parallelism will consume too much capacity due to its demand for frequent synchronisation. This will result in unacceptable performance overheads in the network. Fine-grained parallelism also shares results between processes frequently. This overhead of recurrent data sharing will also consume vital bandwidth of the communication channel. Moreover, these two factors will have an even bigger impact on the speed of the computation, as processes must wait to receive the data they require. This implies that:
a. the communication overheads of fine-grained parallelism will inflict unacceptable performance penalties on the supporting network, and
b. the execution time of processes will be much slower due to the need to frequently send data to communicating processes if a fine-grained model is used.
For these reasons fine-grained parallelism is not fully suitable for distributed processing. The level of parallelism in a distributed system should be coarse-grained because:
a. coarse-grained parallelism has less frequent synchronisation needs, and
b. communication via remote procedure calls or message passing will not unduly load the network or slow the processes’ execution time due to the lesser frequency of such actions.
Due to market forces there will always be a requirement for systems that support the varying levels of granularity from fine-grained to coarse-grained. However for effective and efficient concurrent execution on a distributed system coarse-grained parallelism should be supported.
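A back-of-envelope sketch of the argument above, using invented figures rather than measurements: if every message costs roughly the same fixed overhead, the decisive factor is simply how many messages each granularity generates.

    /* Hypothetical figures only: compare the time spent on communication by a
     * fine-grained and a coarse-grained decomposition of the same job. */
    #include <stdio.h>

    int main(void)
    {
        double per_msg_cost = 1.0e-3;   /* assumed 1 ms per message (latency + handling) */
        long   fine_msgs    = 1000000;  /* e.g. one exchange per loop iteration          */
        long   coarse_msgs  = 100;      /* e.g. one exchange per block of iterations     */

        printf("fine-grained communication time:   %.1f s\n", fine_msgs  * per_msg_cost);
        printf("coarse-grained communication time: %.3f s\n", coarse_msgs * per_msg_cost);
        return 0;
    }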
3 Future Developments
This section indicates the possible improvements and future developments in supporting the operations required for distributed processing. A global view of all memory would allow larger programs to be run or the same size
program to be run more effectively. Accessing memory on a symbolic basis, rather than by content or by a known address, frees the programmer from issues that can be automated and allows the binding of variables to be delayed until the program is run, at which time the binding is handled by the operating system. Using the same operations for local and remote memory locations would free the programmer from needing to consider where the program is running and allows an operating system to standardise access.

The second issue is one of the most important: how to make concurrent processing implicit. By automating the detection of what can be done in a concurrent manner, the programmer need not consider the problems of how to implement a parallel algorithm or program, what resources are required or how to split data to balance the load. Whenever a procedure has been automated in computing, such as allowing memory to be addressed symbolically, it has resulted in more productivity from the programmer and the production of more reliable code.

Making concurrent processing implicit involves detecting units of parallelism. For a distributed computing environment parallelism should be coarse-grained. This should minimise communication overheads and determine units of parallelism at places that minimise data exchanges. If possible, the determination of the units of parallelism should be formulated in a generic manner, allowing the size of the parallel units to be adjusted to find an optimal granularity for such systems. This also allows objective measurements to be taken to justify the granularity of parallelism that is finally implemented.

From the above it follows that any changes should be based on a well known language. There is a large base of software in existence. Companies will not change from existing methods unless there is an overriding commercial reason to do so. A company cannot afford the time to swap from one system to another. Any additions or changes must support current software. This means that a well accepted and well known language such as C, Cobol, Fortran, Pascal, or similar, should be used. Any changes to the language should be kept to a minimum, as in Concurrent Pascal or Distributed C, using extensions to the current language.

Consequent on the above items, the compiler should be used to support and automate the detection of sequential and concurrent processes. This allows a current language to be used and allows parallel computations to be achieved without the programmer being aware of what is occurring. Time in training programmers is saved and current applications only need to be recompiled to take advantage of concurrent processing. The compiler, through analysis of the code, should be able to determine what parts of a program must be run in sequence and therefore identify the components that could be run in parallel.

The last issue to be addressed is the integration of a distributed operating system and the compiler for distributed processing. A compiler must be constructed for a known operating system so that it produces code that is compatible. The distributed operating system will have the information to tell how many spare processors there are and where they are located. A compiler cannot know at compile time what system resources are available at run time. This implies the compiler should also generate code that, if need be, can be run as a sequential program.
This means using the compiler to analyse the program and pass the data about the sequential and parallel components of a program to the distributed operating system. Based on this information, the global scheduler of a distributed operating system can map a process to a workstation and thus create the process on the designated processor.
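Purely as a hypothetical illustration of the kind of information a compiler might pass to the global scheduler, the sketch below describes each detected unit of parallelism by its dependencies, estimated work and estimated communication. None of the field or structure names come from this report or from any existing system.

    /* A hypothetical record a compiler could emit for each unit of parallelism
     * and hand to the global scheduler; all names are invented for this sketch. */
    #include <stdio.h>

    struct parallel_unit {
        const char *name;          /* procedure or code block identified by the compiler */
        int         depends_on;    /* index of a unit whose results it needs, -1 if none  */
        long        est_work;      /* estimated computation (e.g. instruction count)      */
        long        est_comm;      /* estimated data exchanged with other units, in bytes */
        int         can_run_seq;   /* fallback: the unit can also be run sequentially     */
    };

    int main(void)
    {
        /* Two units detected in a hypothetical program: the second needs the
         * first's results, so the scheduler must place and order them accordingly. */
        struct parallel_unit units[] = {
            { "build_table",  -1, 500000, 4096, 1 },
            { "search_table",  0, 200000, 4096, 1 },
        };

        for (int i = 0; i < 2; i++)
            printf("%-12s work=%ld comm=%ld depends_on=%d\n",
                   units[i].name, units[i].est_work, units[i].est_comm,
                   units[i].depends_on);
        return 0;
    }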
4 Support for Parallel Execution in Distributed Systems
In a distributed system, which consists of multiple workstations or personal computers, there may be idle or lightly loaded computational units which can be used to support parallel execution. This section examines some of the facilities that support parallel execution in a distributed environment and are currently available or are the subject of research and development. It also looks at their implementation and identifies problems with the implementation used. The features covered are:
a. interprocess communication,
b. process synchronisation,
c. memory management models,
d. process management, and
e. system reliability and availability.
This discussion will concentrate on the facilities of a distributed operating system which are critical for parallel execution, both by making it feasible and by strongly influencing its performance. The programming interface, in particular shared memory, is not discussed in detail.
4.1 Interprocess Communication
In a uniprocessor computer system a variable’s value is stored at a memory location. Any process that needs to access the variable will use the relevant memory location. As only one process at a time is active in the processor there is no race condition [Ben-Ari, 1990; Tanenbaum, 1990]. With concurrent processing on a computer system that uses shared memory on a single machine, race conditions must be addressed, that is, the parallel processes must be correctly synchronised. The problem of race conditions is resolved, depending on the system, by using semaphores, monitors, system interrupts, etc. [Ben-Ari, 1990; Gough and Mohay, 1988; Tanenbaum, 1990]. This ensures the integrity of data for processes that are sharing a variable and are running concurrently.

In distributed processing a program may be running on several different processors with their own memory, all at different locations around the network. Using shared memory for interprocess communication is not possible in a distributed system as each processor does not have direct access to the memory on the other machines on the network [Goscinski, 1991]. Access to shared variables used by concurrently executing processes must be done via the network in distributed processing. Methods currently used are based on message passing or remote procedure calls [Birrell and Nelson, 1984]. These methods eliminate the race condition problem as access to the shared variable is handled as an atomic action.

Using remote procedure calls or message passing for interprocess communication still entails using the connecting network. If the level of parallelism is fine-grained or the communication mechanisms used are inefficient then a large portion of the network’s capacity will be used in supporting interprocess communication [Ligon and Ramachandran, 1994; Zielinski et al., 1994]. This again reinforces the need for coarse-grained parallelism for concurrently executing processes in a distributed system. Remote procedure calls or message passing methods are many times slower than using shared memory due to network delays [Tanenbaum, 1990]. However, message passing represents a higher level of abstraction for interprocess communication than does shared
memory [Gough and Mohay, 1988; McBryan, 1994; Pleier, 1993]. Either method can be implemented on shared memory machines to ensure data consistency, but access to shared variables via a common shared memory cannot be achieved directly in a distributed environment [Deitel, 1990; Goscinski, 1991; Tanenbaum, 1990; Zielinski et al., 1994]. Thus, to implement interprocess communication in a distributed system, message passing or remote procedure calls must be used as they are the only mechanisms available. The possibility of using distributed shared memory to support interprocess communication in a distributed system is still the subject of research.
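A minimal sketch of message-passing interprocess communication on a single Unix-like host (an illustration only; in a distributed system the same exchange would travel over the network): the shared value is never accessed through common memory, it is delivered as one discrete message, so the exchange behaves as an atomic action.

    /* A minimal sketch of message-passing IPC: the value is sent as a discrete
     * datagram over a socket pair, never touched through shared memory. */
    #include <stdio.h>
    #include <sys/socket.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        int sv[2];
        if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) != 0) { perror("socketpair"); return 1; }

        pid_t pid = fork();
        if (pid == 0) {                               /* child: produces a value   */
            long result = 42;
            send(sv[1], &result, sizeof result, 0);   /* one atomic message        */
            _exit(0);
        }

        long received = 0;                            /* parent: consumes the value */
        recv(sv[0], &received, sizeof received, 0);
        waitpid(pid, NULL, 0);
        printf("received %ld from child\n", received);
        return 0;
    }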
4.2 Synchronisation
Synchronisation, as described by Goscinski (1991), covers the related problems of process control and the use of shared resources. These problems are solved in a variety of ways. Some environments have the programmer specify in the language when synchronisation occurs, as in Distributed C [Pleier, 1993] or Ada [Ada 9X Mapping/Revision Team, 1993]. Distributed systems based on the object-oriented paradigm have it implicit in their constructs, as in Emerald [Jul et al., 1988] or Clouds [Dasgupta et al., 1991]. Synchronisation may be supported in the operating system, as in Sprite [Douglis and Ousterhout, 1991] and RHODOS [Gerrity et al., 1990], or in many other ways (i.e., point-to-point message passing, rendezvous, remote procedure calls, one-to-many message passing, distributed data structures or shared logical variables) as described in the article by Bal et al. (1989).

The issue of shared resources concerns the way in which concurrent processes access a resource that they both require. A resource may be anything from a file, terminal screen or keyboard to a memory location. Most resources allow only one process to access the resource at any given time and the access of the resource must be performed as an atomic action. The term atomic action refers to the requirement that the whole operation is either carried out in its entirety or the resource is left in its original state. To enforce atomic actions a variety of methods are used, depending upon the computing environment. These methods include critical sections, semaphores, guarded horn clauses or select statements [Bal et al., 1989; Ben-Ari, 1990; Goscinski, 1991; Gough and Mohay, 1988; Kumar et al., 1994; Tanenbaum, 1990].
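A minimal sketch of a critical section used to enforce an atomic action, with a POSIX mutex standing in for the semaphores, monitors and guarded commands mentioned above: two threads update a shared counter and the lock guarantees each update completes in its entirety.

    /* A minimal sketch of a critical section: the mutex makes each increment of
     * the shared counter an atomic action. */
    #include <pthread.h>
    #include <stdio.h>

    static long shared_counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);     /* enter the critical section  */
            shared_counter++;              /* the whole update is atomic  */
            pthread_mutex_unlock(&lock);   /* leave the critical section  */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("final value: %ld (expected 200000)\n", shared_counter);
        return 0;
    }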
4.3 Memory Management
How memory is managed by the operating system is one of the important issues in computing [Deitel, 1990; Goscinski, 1991; Tanenbaum, 1990]. The model used will affect the method of interprocess communication and the efficiency of computation. One common method is to consider memory on each machine to be distinct and controlled by the processor with which it is associated. This is the approach taken by network operating systems such as Sprite [Douglis et al., 1991; Douglis and Ousterhout, 1991], Sun’s network file system Sun-NFS [Sun Microsystems, 1991], Emerald [Black et al., 1987; Jul et al., 1988] and many others. The access and control of the memory is managed locally, by the controlling CPU, with no global view of memory usage. The maximum size of a process or a program is limited by this ‘local only’ use of memory. The limiting factor is the amount of free memory at the node on which the process is executing [Deitel, 1990]. For processes that are executing concurrently on different nodes and require interprocess communication, the communication mechanism needs to support the identification of the machine and memory address of the variable that is being used to exchange data.
Memory can also be viewed as a common global pool, and this is the approach used by RHODOS [Gerrity et al., 1990] and Linda [Carriero and Gelernter, 1988; Ciancarini and Guerrini, 1993]. All memory from all the connected machines is managed with a global view and is accessible to all processes and processors. This allows much larger programs and data to be executed, and the communication mechanism only needs to know the global address of the variable. The use of global memory has some penalties. There needs to be some global strategy for the memory management and the strategy needs to be implemented. The implementation of the management may itself be distributed or centralised [Goscinski, 1991]. Resolution of a global address to a single machine’s local address is handled by the operating system.

Linda uses global memory through the concept of tuple-space [Ahuja et al., 1986]. Tuple-space is based on finding a memory location by content, not address. This can be complex and can require the programmer to know some of the results of a computation prior to a program being executed. Requiring a programmer to know the location or details of a variable is not a common practice. This is a step backwards towards needing to know memory addresses, a style of programming that has long been abandoned due to the possibility of errors. Linda’s use of global memory is valuable in that it takes a global view of memory usage, but its high level of abstraction for memory management (the global use of memory) contrasts with its low level implementation (the use of tuples in tuple-space). Languages such as C [Kernighan & Ritchie, 1988], Pascal [Findlay and Watt, 1988], Modula-2 [Gough and Mohay, 1988] or Clips [Giarratano, 1991], to name but a few, all use symbolic references to variables. One of the goals of distributed computing is the transparent addressing and access of the memory of all machines. An ideal memory management system in a distributed system would be able to view all the distributed memory as a global pool, and allow the programmer to address memory on a symbolic basis [Goscinski, 1991; Gerrity et al., 1990].

Distributed shared memory (DSM) is another method of using memory in a distributed computing environment. DSM manages the physically separate memories and makes the memory pool appear as a single common memory. DSM can be considered a virtual global memory as the memory is physically distributed and managed as separate memories but appears to the program or user as a large single memory. DSM evolved from attempts to build scalable parallel machines which overcame the problems of [Systems Architecture Research Centre, 1993]:
a. saturation of buses by data, and
b. long latency periods of secondary storage.
DSM is not without its own problems. These problems include false sharing, ownership of memory pages, memory synchronisation, level of implementation, dissemination of results and memory coherence [Ben-Ari, 1990; Goscinski, 1991; Systems Architecture Research Centre, 1993]. As DSM simulates a global memory there are also problems with the maximum memory that can be addressed [Ben-Ari, 1990; Goscinski, 1991] and whether DSM should be supported with specialised hardware [Systems Architecture Research Centre, 1993]. DSM has been implemented in some distributed environments such as DASH [Lenoski et al., 1994] at Stanford University and is beneficial in that it addresses many of the lower level problems of distributed computing.
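To illustrate the content-addressed style of tuple-space discussed in this section, the following toy, single-process sketch deposits a datum and retrieves it by matching part of its content rather than by address. The functions ts_out and ts_in are invented for this sketch; they are not Linda's operations and carry none of Linda's distribution machinery.

    /* A toy, in-process imitation of the tuple-space idea: data is retrieved by
     * matching content (here, a key string) rather than by memory address. */
    #include <stdio.h>
    #include <string.h>

    struct tuple { char key[16]; int value; int used; };
    static struct tuple space[32];          /* the "tuple space" */

    static void ts_out(const char *key, int value)       /* deposit a tuple     */
    {
        for (int i = 0; i < 32; i++)
            if (!space[i].used) {
                snprintf(space[i].key, sizeof space[i].key, "%s", key);
                space[i].value = value;
                space[i].used  = 1;
                return;
            }
    }

    static int ts_in(const char *key, int *value)         /* withdraw by content */
    {
        for (int i = 0; i < 32; i++)
            if (space[i].used && strcmp(space[i].key, key) == 0) {
                *value = space[i].value;
                space[i].used = 0;
                return 1;
            }
        return 0;                                          /* no matching tuple  */
    }

    int main(void)
    {
        ts_out("partial-sum", 123);          /* a worker publishes its result   */
        int v;
        if (ts_in("partial-sum", &v))        /* another retrieves it by content */
            printf("matched tuple: %d\n", v);
        return 0;
    }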
4.4 Process Management

For distributed processing a program's processes can be executed on different
processors in a concurrent manner. In a distributed computational environment, the questions of where to run a process, how a process can be moved around the system and how to control a process are problems that do not arise in other computing environments.

4.4.1 Global Scheduling
Global scheduling is the issue of determining which process is assigned to which processor and when; i.e., how a process should be run in a distributed environment. This is done with regard to the total load of the distributed system. To achieve this, global scheduling encompasses two sub-categories:
a. static allocation, used when computational loads are steady, and
b. load balancing, used when computational loads continuously fluctuate.
Static allocation is the determination of where a process is to be executed prior to its invocation, while load balancing is the dynamic movement of processes during their execution. Static allocation addresses the issue of where to create and run a process. Load balancing is concerned with when and where to move an active process in order to distribute computational load evenly.

Static Allocation

The first issue is where to run a process in a distributed environment. The assignment of a process to a processor before the process starts is the issue of static allocation. The program needs to know the processor for each of its processes before the processes start [Goscinski, 1991]. This is also known as task placement. Static allocation assumes a process will remain on the processor it is assigned to. The major advantage of static allocation is its simplicity [Goscinski, 1991]. There are several drawbacks to static allocation that should be mentioned:
a. it does not adjust to changes in system loads, and
b. it normally exhibits poor resource utilisation.
Load Balancing

Load balancing is the issue of where to run one or more of the currently executing processes in a distributed environment [Deitel, 1990; Goscinski, 1991]. If some of the processors in a distributed system are idle and some are heavily loaded, where should a program run a concurrent process? The goal is to have a balanced load over the whole system, and by inference each node in a distributed network should carry an equal share of the load. There are several methods to solve this problem.

The first and conceptually the simplest solution is to have the programmer specify where a process is to run. The programmer explicitly states which parts of a program are to be run in parallel and which processor is to be used for the concurrently executing components. Several languages allow this, such as Occam [Inmos Ltd., 1983], Concurrent PROLOG [Bal et al., 1989] and Mach 1000 [Deitel, 1990], as can operating systems such as V [Cheriton, 1988]. A process is mapped to a processor as specified by the programmer. The problem with this method is that in a distributed system the load on a given processor is unlikely to be known by a programmer when writing a program. This can lead to a processor having a very high load while other nodes on the connecting network have very low loads. Equal loads on the processors cannot be guaranteed and therefore this method should not be used, as it is not dynamic and does
not spread the load in an efficient manner.

Dynamic allocation of processes to processors by the operating system has been successfully implemented in Sprite [Douglis and Ousterhout, 1991] and Linda [Ciancarini and Guerrini, 1993]. In these environments it is the operating system that is responsible for where processes are to be run. This allows processors to be tasked dynamically by the operating system and ensures a more even distribution of loads on each processor. The problems associated with a dynamic load balancing approach are which algorithms and management policies to use. The tasks that dynamic load balancing must be concerned with include:
a. how to keep track of:
   i. loads on each processor, and
   ii. communication patterns between processes,
b. when processors should not be given a concurrent process,
c. when processes should move, and
d. how concurrent migrating processes communicate with each other.
By keeping track of loads on each processor the operating system is adding to the network traffic. If, instead of keeping load data for each node, an operating system polls for a lightly loaded processor, it may take too long for a site to be located and the program would have been better run sequentially. The question of when a processor is free to run a concurrent process for a remote program has no simple solution as yet [Goscinski, 1991]. Sprite [Douglis et al., 1991] monitors a node for user activity and if there is no activity the node is a candidate for a process to be run on it. An alternative method may be the use of system-wide processor parameters: a processor is a candidate if its load is less than some value, and if the load on the processor is higher than that value it is not a target until the load drops below the given value (a minimal sketch of this policy follows). Despite the problems it appears that dynamic load balancing is the better and more adaptable method. It can handle load fluctuations and determine what resources are available without overloading a processor [Goscinski, 1991].
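A minimal sketch of the threshold policy just described, with invented load figures: a node is a candidate only while its load is below the system-wide value, and the lightest candidate is chosen; if none qualifies, the process is simply run locally.

    /* A minimal sketch of threshold-based candidate selection for load balancing.
     * The load figures and threshold are invented for illustration. */
    #include <stdio.h>

    #define NODES 5

    int main(void)
    {
        double load[NODES] = { 0.20, 0.85, 0.05, 0.60, 0.35 };  /* hypothetical loads    */
        double threshold   = 0.50;                              /* system-wide parameter */
        int    chosen      = -1;

        for (int n = 0; n < NODES; n++) {
            if (load[n] >= threshold)
                continue;                        /* too busy: not a candidate       */
            if (chosen < 0 || load[n] < load[chosen])
                chosen = n;                      /* remember the lightest candidate */
        }

        if (chosen >= 0)
            printf("create the process on node %d (load %.2f)\n", chosen, load[chosen]);
        else
            printf("no candidate node: run the process locally/sequentially\n");
        return 0;
    }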
4.4.2 Global Scheduling Mechanisms
Static allocation and load balancing elaborate the policy of global scheduling, that is, which process should be created on a remote node, which process should be moved, and where the process should be moved to. Remote process creation and remote process migration are their respective mechanisms.

Remote Process Control

For distributed processing the question arises of how to control a remote process. Whether or not the processes are concurrent, the problem is how to enable the usual operations associated with process control, such as creation, suspension, destruction, etc. [Bach, 1986; Bal, 1990; Deitel, 1990; Goscinski, 1991], on a remote processor. All operations associated with remote process control need to be transparent, and these problems are compounded when the issue of process migration is added. How these operations can be implemented depends upon the system design philosophies and the programming paradigm used.

Distributed operating systems such as Clouds [Dasgupta and LeBlanc, Jr., 1991], Emerald [Jul et al., 1988], RHODOS [Gerrity et al., 1990] and Sprite [Ousterhout et al., 1988]
allow for remote process control except for the creation of a process on a remote node. V-system [Cheriton, 1988] allows a remote process to execute a Unix style fork on the same processor, thus spawning a new process, but this is not the same as creating the new process on a different node. The ability to natively support the creation of a remote process seems to be elusive, with all the above systems creating a process at the program’s original node and then migrating the process if required. This slows down computation, as process migration requires transferring the whole state of the process over the network and inserting the process into the remote processor’s run queue prior to the process starting to run. For distributed operating systems that support load balancing, true remote process creation must be supported and must be transparent to the programmer and the user. This will lead to a reduction in network load and faster computation of the concurrent processes.

Process Migration

In a distributed system supported by dynamic load balancing there is the need for process migration. This is the ability of the operating system to move a process that is currently executing from one node on the system to another. Process migration is a mandatory feature in a distributed system if dynamic load balancing is to be achieved. The problems that have been overcome in implementing process migration include [Black et al., 1987; Dasgupta and LeBlanc, Jr., 1991; Douglis and Ousterhout, 1991; Goscinski, 1991]:
a. determining a process’ state,
b. detaching a process from its current environment,
c. transferring the process state with all relevant information,
d. inserting the process into its new environment,
e. how to handle communications with the process while the process is being moved from one processor to another,
f. how to handle future communications after the process has moved,
g. how to make the new process location transparent, and
h. what sort of migration policy should be used.
Each implementation of process migration has answered these issues in different ways and there is no clear indication of the best way to solve them [Ahuja et al., 1986; Bal et al., 1989; Dasgupta et al., 1991; Douglis and Ousterhout, 1991; Goscinski, 1991; Jul et al., 1988]. What is clear is that for distributed processing the activity of process migration should be transparent to the user and the programmer and should allow concurrent execution of parallel processes while maintaining dynamic load balancing [Douglis and Ousterhout, 1991; Goscinski, 1991]. This facility is available in Sprite [Douglis and Ousterhout, 1991], Emerald [Black et al., 1987], Chorus [Chorus Systems, 1990] and RHODOS [De Paoli, 1993], amongst others.
4.5 Reliability and Availability
The possibility of increased reliability is a promised feature of distributed processing. Due to the nature of distributed processing there is also the possibility of increased availability. This section addresses these issues and their implication for concurrent execution of processes. Reliability is how much the computer system can be trusted in its performance and results [Goscinski, 1991]. This may be thought of as getting the correct result from the system or informing the user that an error has occurred. Users will not trust and subsequently will not
use systems that are unreliable. Results must be replicable, that is, the same data run with the same program should give the same results. The results must also be correct, that is, they should be verifiable.

Availability, in a distributed system, is the ability to get a non-operative node up and working again. If a node stops processing because of hardware faults then the node is off-line until repairs are effected. However, nodes often fail because of software errors that cause the processor to stop until it is reinitialised and the network is again aware of the node's presence. For file-servers on current network operating systems this can take considerable time, fifteen minutes or more, depending on the operating system and the task of the node. In a distributed environment there is data about the state of files and computations at any time and this data can be used to reduce the time that a node requires to become operable again [Welch et al., 1989]. This was implemented for the file server in the Sprite network operating system and reduced the time it took to be available, after a system crash, from over fifteen minutes to under 90 seconds [Baker and Ousterhout, 1991]. This same type of information could be used in distributed processing to allow a program to continue from the point it was at prior to the node failing.

Availability is also the ability to continue processing with reduced resources. If five percent of the computing environment is disabled it would be preferable for all users to continue being able to process data with an impaired response time rather than have some users lose their access to the computing resources entirely while others continue unaffected [Deitel, 1990; Goscinski, 1991]. For distributed processing, if a processor connected to the network is not operable the node cannot process data and users cannot connect via that node. It may be possible in the future that if a node becomes inoperable it may still allow a user access to the system via its communication ports and allow remote addressing of its screen. This would allow all users to remain connected and allow a graceful degradation of computing services in the face of equipment failure.

A problem arises when a concurrent process is being executed and the node it is running on becomes inoperable during the running of the process. If the node is the program's home machine a decision needs to be made to either continue the concurrent processes or destroy them and rerun the program. In most systems the current practice is to allow the processes to finish and then have the operating system not return any results to the non-operating home processor, or, if the concurrent process requires communication or synchronisation, to destroy the concurrent process [Ada 9X Mapping/Revision Team, 1993; Deitel, 1990; Douglis et al., 1991; Goscinski, 1991]. If the availability of a distributed system is increased it may be preferable to suspend the concurrent processes, rebuild the program's state on the processor that went down and have the program continue [Welch et al., 1989].
5
Distributed Applications
Many computer applications do not benefit from parallel execution: while some of these programs can be transformed into parallel equivalents, there is no gain in performing the transformation. Other programs can be considered naturally distributed, that is, they divide into naturally independent sections and there are gains to be made from executing them concurrently. This section examines some of the applications that are suited to concurrent execution of processes.
Several types of algorithms lend themselves to coarse-grained parallel execution. Discrete mathematical modelling, such as weather prediction, is one major area. Each part of the model has distinct starting values, called boundary conditions, and these parts can be solved in parallel. Currently this is normally done on parallel machines, but as each equation is worked on independently there is no reason why each equation cannot be solved on a different machine and the results gathered as each equation finishes [Kumar et al., 1994]. This is, as mentioned earlier, an adaptation of a fine-grained algorithm to work as a coarse-grained algorithm, and the problem suffers from the domain-mapping problem [Camp et al., 1994].

Fast Fourier transforms and matrix and vector operations are examples of algorithms that are normally implemented as fine-grained programs. With some manipulation, these algorithms can be implemented with coarse-grained parallelism [Kumar et al., 1994; Lilja, 1994]. The transformation from fine to coarse grain is normally achieved by holding the data in some form of array so that it can be split into discrete blocks. This is simple for matrices, which are normally stored in array form. Most other methods for changing a fine-grained algorithm into a coarse-grained one involve manipulations that leave the data in some form of matrix or array. Systems of linear equations are solved this way, with the equations transformed into a coefficient matrix which can be solved using Gaussian elimination [Kumar et al., 1994; Kreyszig, 1993].

Program compilation can be done in a distributed manner, but only at an extremely coarse-grained level [Douglis and Ousterhout, 1991; Grey, 1993]. This is not an application that would appear to be a likely candidate for parallel execution: there are no loops used for iteration and no large data set to operate on. The similarity to discrete mathematical modelling is that the parallel units are discrete and have no interdependencies. The parallel components are the concurrent compilations of modules and libraries. This is coarse-grained parallelism at the extreme end and is fairly simple to implement. The compilations are automated by a replacement program for the compiler that finds nodes with low usage and sends the source code to be compiled on the remote nodes. When the compilations have finished, the results are sent back to the machine the compilation started on. While this is transparent to the user and results in faster compile times and more even load balancing on the system, it is really just an automation of remote execution commands that are already available.

Linda's use of tuple space, allowing memory to be referenced by content rather than by address, implements parallelism in another way. If the set of data to be used is large, Linda creates multiple processes to work on the data and divides the data into sets that each process is responsible for processing [Ahuja et al., 1986; Carriero and Gelernter, 1988; Ciancarini and Guerrini, 1993]. Each of these replicated processes may be on a different machine. This replication may also be compatible with distributed processing, allowing many workers to access the data. If this approach is used for large data sets, the domain-mapping problem again needs to be considered [Camp et al., 1994; Kumar et al., 1994].
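The distributed compilation described above is, as noted, essentially an automation of remote execution commands that already exist. A minimal sketch of such a wrapper is given below; it is not taken from the tool described by Grey [1993], and the host names, the shared source directory, and the assumptions of password-free rsh access and a shared file system are illustrative only.

/* A sketch of a distributed-compilation wrapper: each source module is
 * compiled on a different machine with rsh, and the object files appear
 * on the originating machine via an assumed shared file system.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

static const char *hosts[]   = { "node1", "node2", "node3" };   /* hypothetical   */
static const char *modules[] = { "parser.c", "scanner.c", "codegen.c" };

#define NHOSTS   3
#define NMODULES 3

int main(void)
{
    char command[256];
    int  m, status, failed = 0;

    /* Start one remote compilation per module, round-robin over hosts. */
    for (m = 0; m < NMODULES; m++) {
        sprintf(command, "cd %s && cc -c %s",
                "/home/shared/project",           /* assumed shared dir */
                modules[m]);
        if (fork() == 0) {
            execlp("rsh", "rsh", hosts[m % NHOSTS], command, (char *)0);
            perror("rsh");                        /* exec failed        */
            _exit(1);
        }
    }

    /* Wait for all remote compilations and report any failures. */
    for (m = 0; m < NMODULES; m++) {
        wait(&status);
        if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
            failed = 1;
    }

    if (failed) {
        fprintf(stderr, "one or more remote compilations failed\n");
        return 1;
    }

    /* With a shared file system the object files are already visible;
       perform the final link on the home machine.                     */
    return system("cd /home/shared/project && cc -o project parser.o scanner.o codegen.o");
}

A practical tool would choose hosts by measured load, cope with hosts that are unreachable, and copy sources and objects explicitly where no shared file system exists.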
For applications to take advantage of parallelism through concurrent execution in a distributed computational environment it is necessary to produce distributed applications. This requires current programs to be changed so that parallel units are identified at a level of granularity finer than for distributed compilation but coarse enough to be useful in a distributed environment. This will require an analysis of current programs and compilation techniques.
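A simple illustration of this intermediate level of granularity is the row-block decomposition of a matrix-vector product mentioned earlier. In the sketch below each block is computed by a local child process purely for convenience; in a distributed system each block would instead be dispatched to a separate node. The matrix size and the number of blocks are illustrative assumptions.

/* Coarse-grained parallelism obtained from a fine-grained problem:
 * a matrix-vector product y = A*x, with the rows of A split into
 * contiguous blocks.  Each block is computed by a separate child
 * process and the partial results are gathered through pipes.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

#define N      8       /* matrix dimension (illustrative)             */
#define BLOCKS 4       /* number of row blocks = parallel units       */

int main(void)
{
    double A[N][N], x[N], y[N];
    int    fd[BLOCKS][2];          /* one pipe per block               */
    int    rows = N / BLOCKS;      /* rows handled by each block       */
    int    b, i, j;

    /* Fill A and x with some test values. */
    for (i = 0; i < N; i++) {
        x[i] = 1.0;
        for (j = 0; j < N; j++)
            A[i][j] = (double)(i + j);
    }

    /* One child per row block: this is the coarse-grained unit. */
    for (b = 0; b < BLOCKS; b++) {
        if (pipe(fd[b]) < 0) { perror("pipe"); exit(1); }
        switch (fork()) {
        case -1:
            perror("fork"); exit(1);
        case 0: {                            /* child: one block       */
            double part[N];                  /* partial result         */
            int    first = b * rows;
            for (i = first; i < first + rows; i++) {
                part[i - first] = 0.0;
                for (j = 0; j < N; j++)
                    part[i - first] += A[i][j] * x[j];
            }
            write(fd[b][1], part, rows * sizeof(double));
            _exit(0);
        }
        default:                             /* parent continues       */
            break;
        }
    }

    /* Gather the partial results in block order. */
    for (b = 0; b < BLOCKS; b++) {
        read(fd[b][0], &y[b * rows], rows * sizeof(double));
        close(fd[b][0]);
        close(fd[b][1]);
    }
    while (wait(NULL) > 0)
        ;

    for (i = 0; i < N; i++)
        printf("y[%d] = %g\n", i, y[i]);
    return 0;
}

The same pattern, dividing the data into blocks and giving each worker responsibility for one block, underlies the Linda examples above; the choice of block size is the granularity adjustment referred to earlier.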
6
Conclusions
The aim of this report was to identify some of the key areas for future work in distributed processing. This was done by evaluating the current literature on parallel processing from the distributed system perspective and discussing the issues relevant to distributed processing. From this it has been determined that future work on distributed processing should include:
a. the memory model used should be based on the usage of global memory,
b. the parallelism to be implemented should be coarse-grained,
c. the detection of parallel units should be implicit,
d. parallelism should be implemented to minimise communication overheads,
e. the parallelism should be generic and allow the granularity to be adjusted so that measurements can be taken,
f. parallelism should be generic to allow more programs to take advantage of concurrent execution of parallel components,
g. the changes should be implemented using a well-known language,
h. if possible there should be no changes to the language that a programmer would be aware of,
i. the operating system to compiler interface needs to be formalised, and
j. global scheduling, comprising both static allocation (supported by remote process control) and load balancing (supported by process migration), should be employed to support parallel execution in a distributed system.
The benefits of implementing these items are:
a. the programmer is not required to state the parallelism in a program,
b. the programmer does not have to know what resources are available,
c. the program can make use of distributed resources,
d. users can take advantage of concurrent processing without the need to understand the underlying architecture, and
e. parallelism is implemented in an implicit manner.
Distributed computing will become the main computing paradigm of the future but not until it can be applied to a wide range of problems, is easily accessible and can be relied on for accurate results. For this to occur the issues listed above must be addressed.
References
Ada 9X Revision/Mapping Team (1993) Programming Language Ada, Cambridge, Massachusetts: Intermetrics, Inc.
Agrawal, D. and Malpani, A. (1991) Efficient Dissemination of Information in Computer Networks, The Computer Journal, Vol. 34, No. 6, pp 534-541.
Ahuja, S., Carriero, N. and Gelernter, D. (1986) Linda and Friends, IEEE Computer, August, pp 26-34.
Bach, M.J. (1986) The Design of the Unix Operating System, Englewood Cliffs, New Jersey: Prentice-Hall, Inc.
Baker, M. and Ousterhout, J. (1991) Availability in the Sprite Distributed File System, ACM Operating Systems Review, Vol. 25, No. 2.
Bal, H.E. (1990) Programming Distributed Systems, Hertfordshire: Prentice Hall.
Bal, H.E., Steiner, J.G. and Tanenbaum, A.S. (1989) Programming Languages for Distributed Computing Systems, ACM Computing Surveys, Vol. 21, No. 3, pp 261-322.
Ben-Ari, M. (1990) Principles of Concurrent and Distributed Programming, Hemel Hempstead: Prentice-Hall International.
Birrel, A.D. and Nelson, B.J. (1984) Implementing Remote Procedure Call, ACM Transactions on Computer Systems, Vol. 2, No. 1, pp 39-59.
Black, A., Hutchinson, N., Jul, E., Levy, H. and Carter, L. (1987) Distribution and Abstract Types in Emerald, IEEE Transactions on Software Engineering, Vol. SE-13, No. 1, pp 65-76.
Carriero, N. and Gelernter, D. (1988) Applications Experience with Linda, Proceedings of the ACM SIGPLAN Parallel Programming: Experience with Applications, Languages and Systems 1988, New Haven, Connecticut, USA, pp 173-187.
Camp, W.J., Plimpton, S.J., Hendrickson, B.A. and Leland, R.W. (1994) Massively Parallel Methods for Engineering and Scientific Problems, Communications of the ACM, Vol. 37, No. 4, pp 31-41.
Cheriton, D.R. (1988) The V Distributed System, Communications of the ACM, March 1988, pp 314-331.
Chorus Systems (1990) Overview of the Chorus Distributed Operating System, Technical Report CS/TR-90-25, Chorus Systems.
Ciancarini, P. and Guerrini, N. (1993) Linda meets Minix, ACM Operating Systems Review, Vol. 27, No. 4, pp 76-92.
Dasgupta, P. and LeBlanc, R.J. Jr. (1991) The Structure of the Clouds Distributed Operating System, in: Agrawala, A.K., Gordon, C.D. and Hwang, P., Mission Critical Operating Systems, Amsterdam, Washington: IOS Press.
Dasgupta, P., LeBlanc, R.J. Jr., Ahamad, M. and Ramachandran, U. (1991) The Clouds Distributed Operating System, IEEE Computer, November, pp 34-44.
Deitel, H.M. (1990) Operating Systems, 2nd Ed., Reading, Massachusetts: Addison-Wesley Publishing Company.
De Paoli, D. (1993) The Multiple Strategy Process Migration Manager for RHODOS: The Logical Design, Technical Report C93/37, School of Computing and Mathematics, Deakin University, Geelong, Australia.
Douglis, F., Kaashoek, M.F., Ousterhout, J.K. and Tanenbaum, A.S. (1991) A Comparison of Two Distributed Systems: Amoeba and Sprite, Computing Systems, Vol. 4, No. 4, pp 353-384.
Douglis, F. and Ousterhout, J. (1991) Transparent Process Migration: Design Alternatives and the Sprite Implementation, Software Practice and Experience, Vol. 21, No. 8.
Ellis, G.K. (1990) Parallel Extensions to C, Dr. Dobb's Journal, August, pp 70-79.
Findlay, W. and Watt, D.A. (1988) Pascal, 3rd Ed., London: Pitman Publishing.
Gerrity, G., Goscinski, A., Indulska, J., Toomey, W. and Zhou, W. (1990) The RHODOS Distributed Operating System, Technical Report CS 90/4, Department of Computer Science, University College, University of New South Wales, Canberra, 6th February 1990.
Giarrantano, J.C. (1991) Clips User's Guide, Software Technology Branch, Information Systems Directorate, Lyndon B. Johnson Space Center: NASA.
Goscinski, A. (1991) Distributed Operating Systems: The Logical Design, Sydney: Addison-Wesley Publishing Company.
Goscinski, A. and Zhou, W. (1994) Towards a Global Computer: Improving the Overall Distributed System Performance and the Computational Services Provided to Users by Employing Global Scheduling and Parallel Execution, ARC Large Grant Application, Deakin University.
Gough, K.J. and Mohay, G.M. (1988) Modula-2: A Second Course in Programming, Englewood Cliffs, New Jersey: Prentice-Hall, Inc.
Grey, P. (1993) Distributing C Compiles, Australian Unix Users Group Newsletter, Vol. 14, No. 3, pp 60-69.
Inmos Ltd. (1983) Occam Programming Manual, Englewood Cliffs, New Jersey: Prentice-Hall, Inc.
Johnston, W.A. and Heinz, S.P. (1978) Flexibility and Capacity Demands of Attention, Journal of Experimental Psychology: General, Vol. 107, pp 420-435.
Jul, E., Levy, H., Hutchinson, N. and Black, A. (1988) Fine-grained Mobility in the Emerald System, ACM Transactions on Computer Systems, Vol. 6, No. 1, pp 109-133.
Kernighan, B.W. and Ritchie, D.M. (1988) The C Programming Language, 2nd Ed., Englewood Cliffs, New Jersey: Prentice-Hall, Inc.
Khan, A.S. (1992) The Telecommunications Fact Book, Albany, New York: Delmar Publishers Inc.
Kreyszig, E. (1993) Advanced Engineering Mathematics, 7th Ed., New York: John Wiley and Sons, Inc.
Kumar, V., Grama, A., Gupta, A. and Karypis, G. (1994) Introduction to Parallel Computing: Design and Analysis, Redwood City, California: The Benjamin/Cummings Publishing Company, Inc.
Lenoski, D.E., Laudon, J.P., Gharachorloo, K., Weber, W-D., Gupta, A., Hennessy, J.L., Horowitz, M. and Lam, M.S. (1992) The Stanford DASH Multiprocessor, IEEE Computer, March 1992, pp 63-79.
Ligon III, W.B. and Ramachandran, U. (1994) Evaluating Multigauge Architectures for Computer Vision, Journal of Parallel and Distributed Computing, Vol. 21, No. 3, pp 323-333.
Lilja, D.J. (1994) Exploiting the Parallelism Available in Loops, IEEE Computer, Vol. 27, No. 2, pp 13-26.
Miller, G.A. (1956) The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information, Psychological Review, Vol. 63, pp 81-97.
McBryan, O.A. (1994) An Overview of Message Passing Environments, Parallel Computing, Vol. 20, No. 1-4, pp 417-444.
Nielsen, G.D. and Smith, E.E. (1973) Imaginal and Verbal Representations in Short-term Recognition of Visual Forms, Journal of Experimental Psychology, Vol. 101, pp 375-378.
Oracle (1993) Oracle 7 White Paper, Redwood Shores, California: Oracle Corporation.
Ousterhout, J.K., Cherenson, A.R., Douglis, F.K., Nelson, M.N. and Welch, B.B. (1988) The Sprite Network Operating System, IEEE Computer, Vol. 21, No. 2, pp 23-30.
Pleier, C. (1993) The Distributed C Development Environment, Germany: Technische Universität München.
Ponnusamy, R., Saltz, J., Choudhary, A., Hwang, Y-S. and Fox, G. (1993) Runtime Support and Compilation Methods for User Specified Data Distributions, Proceedings of the Sixth SIAM Conference on Parallel Processing, pp 187-192.
Reed, S.K. (1992) Cognition: Theory and Application, 3rd Ed., Pacific Grove, California: Brooks/Cole Publishing Company.
Shannon, C.E. (1949) Communication in the Presence of Noise, Proceedings of the I.R.E., January, pp 10-21.
Sterling, L. and Shapiro, E. (1986) The Art of Prolog, Cambridge, Massachusetts: The MIT Press.
Sun Microsystems (1991) Sun System and Network Managers Guide, California: Sun Microsystems, Inc.
Sunderam, V.S. (1990) PVM: A Framework for Parallel Distributed Computing, Concurrency: Practice & Experience, Vol. 2, No. 4, December 1990, pp 315-339.
Systems Architecture Research Center (1993) Experiences with Distributed Shared Memory, Technical Report TCU/SARC/1993/3, Department of Computer Science, City University, Northampton Square, London, EC1V 0HB.
Tanenbaum, A.S. (1990) Structured Computer Organization, London: Prentice-Hall International (UK) Limited.
Young, P.H. (1990) Electronic Communication Techniques, 2nd Ed., New York: Macmillan Publishing Company.
Welch, B., Baker, M., Douglis, F., Rosenblum, M. and Ousterhout, J. (1989) Sprite Position Statement: Use Distributed State for Failure Recovery, Proceedings of the Second Workshop on Workstation Operating Systems, pp 130-133.
Zielinski, K., Gajecki, M. and Czajkowski, G. (1994) Parallel Programming Systems for LAN Distributed Computing, Proceedings of the 14th International Conference on Distributed Computing Systems, pp 600-607.