TransCom: a Communication Microkernel for Transputers

J.C. Moure, Daniel Franco, Elisa Heymann, Emilio Luque

Unidad de Arquitectura de Ordenadores
Departamento de Informática
Universidad Autónoma de Barcelona
08193 - Bellaterra, Barcelona, Spain
E-mail: [email protected]

If parallel computers are to become general-purpose tools, it is necessary to develop services that make their internal characteristics transparent and make parallel programming easier. Trying to fulfil this goal, and to have a platform for the test and evaluation of mechanisms for a parallel architecture, a microkernel called TransCom has been designed for a distributed-memory multiprocessor. The microkernel includes services to virtualise the communication network and provides a parallel programming model based on message passing. It has been designed in two steps: a tiny system core called TransRouter, providing simple functions for data transport and routing but without protocol services, and a basic set of communication primitives, built on top of the TransRouter and including its own communication protocols, which makes up the TransCom. Concurrently with the development of the microkernel, some studies and simulations have been undertaken in order to virtualise the "processor resource" through load distribution among the processors of the parallel computer.

1. Introduction

The properties an application programmer demands from a computer system are basically: fast and easy development of programs, reliable execution, help for error debugging, easy portability and fast execution both on present and on upcoming computers. Parallel computers are not an exception. Nowadays, they look very much like early sequential computers as regards the lack of facilities to use them. The challenge for a parallel computer designer is to provide a high-level and simple view of the computer, making the internal characteristics of its architecture transparent. This new view of the computer, a mixture of physical components (hardware) and system programs (software), is known as a virtual computer.

The computer designer faces different and conflicting goals. On the one hand, the virtual computer must provide general services increasing usability, reliability, productivity and portability for a wide range of applications. On the other hand, execution will not be as fast as if application and architecture had been directly tuned to each other. In fact, some services can be dead weight for the many applications that do not use them. This overhead imposed by the system software could be avoided by allowing the programmer to access the 'actual' architecture instead of the virtual architecture. Nevertheless, two factors decrease the importance of obtaining the highest performance from a particular application-architecture pair: any strategy a programmer uses to tune application and computer can later be embodied into the virtual computer, and any software service can be implemented in hardware, improving its performance. To sum up, in the same way that address translation mechanisms (virtual memory) and the automatic movement of data through the memory hierarchy (cache memory) have prevailed as essential characteristics of any sequential computer, in spite of some loss of performance in specific situations, it will be necessary to find general mechanisms for parallel computers whose benefits surpass their drawbacks.

The two basic goals of our work are the study of improvements to the internal architecture and the development of a high-level environment for parallel programming. In this framework, a microkernel for a distributed-memory multiprocessor, called TransCom, has been designed. It includes services to virtualise the communication network among processors. The microkernel combines the flexibility to change, modify and include services with the possibility of executing real applications to obtain precise results that validate the mechanisms and measure their performance.

The final results can be extrapolated to estimate the impact of implementing the services in hardware. Additionally, the microkernel is the foundation for building a complete environment for parallel programming that allows experimentation at all levels, from the internals of the architecture to the development of applications, including the programming model.

Concurrently with the development of the microkernel, some studies and simulations have been undertaken in order to virtualise the processor resource through load distribution among the processors of the parallel computer. Simulation is another interesting option when studying the impact of architectural improvements, especially when little is known about the advantages and disadvantages of a particular mechanism, because it allows fast prototyping to generate early results. It is also useful to test mechanisms that, given their high implementation cost or physical limitations, cannot be realised.

The rest of the paper is organised as follows. First, there is a description of how the microkernel virtualises communications. Sections 3 and 4 present the two layers making up the microkernel: the inner layer, called TransRouter [MOU95], and the upper layer, built on top of the TransRouter, which constitutes the whole microkernel, or TransCom [FRA95]. The studies on load balancing and its integration into the microkernel are explained in Section 5. Finally, we discuss related work and draw some conclusions.

2. The virtualisation of the Communications

The microkernel, or TransCom, represents the first step towards a general-purpose parallel computer, because it makes the topological characteristics of the communication network transparent, providing a virtual architecture where all the processors are directly connected to each other and all the (logical) connections provide the same communication capability (effective bandwidth and latency).

The microkernel has been designed in two steps. Firstly, a tiny system core called TransRouter [MOU95] has been developed to provide simple functions for data transport and routing, but without protocol services.

It provides the foundation for modular extensions of more complex communication models. The second step was the design and development of a basic set of communication primitives, including its own communication protocols, on top of the TransRouter, to make up what we call the TransCom [FRA95]. TransCom provides a parallel programming model based on message passing (CSP [HOA85]) and described in [LUQ94]. The software is implemented on a Tandem Supernode [PAR89] from Parsys, with 32 T8 transputers [INM92a] connected through a statically reconfigurable network.

The TransRouter software is a data-transport kernel. It is not intended for general use; instead, it is designed as a foundation for modular extensions of more complex communication primitives. It provides services to move packets of data between processors, but lacks protocol services to synchronise the sending and receiving of the packets. Protocols have to be provided by an upper layer and implemented on top of the TransRouter to support a particular programming model, for example synchronous channels, asynchronous buffers like PVM [PVM94], active messages [EIC92], shared virtual memory, etc. Also, the communication primitives address processors, which are the basic entities of computation, and not processes, which can be embedded into the system by an upper layer.

The TransCom offers message passing between any number of processes. Variable-length messages, communication channels and processes are the new entities introduced by the TransCom. Processes invoke explicit primitives upon logical channels. There are different types of channels that allow communication among two or more processes, offering high-level communication mechanisms to the programmer. The basic channel protocol is synchronous, as shown in Section 4.

3. TransRouter

The basic goals in the TransRouter's design are:

• Provide a virtual communication system where all processors exchange data as if they were directly connected to each other (to make the topology of the network transparent to the user).

• The virtualisation of the underlying interconnection hardware has to be achieved not only at the functional or logical level but also at the performance level: the data transfer capability has to be the same for any pair of processors in the system.

(Actual Architecture / Virtual Architecture: processors 0-9 and Host)
Figure 1. Virtualisation of the Communication Network

Figure 1 presents the actual architecture, where each processor is connected to up to four other processors, and the virtual architecture, where all processors are connected to each other. In the first case, each connection has the same bandwidth and latency (or distance), but between non-neighbour processors the bandwidth and latency consumed by a packet transmission are multiplied by the number of links the data have to traverse. In the second case, all the logical links have the same bandwidth and latency, and all processors are at the same distance.

3.1. TransRouter Characteristics

The unit of information transmitted among processors is a fixed-length packet, made up of an information part and a header identifying the destination processor and containing routing information. The communication functions reference processor identifiers, numbered from zero upwards, and not topological characteristics of the network. The identifier 0 is reserved for the processor directly connected to the host. It is possible to send data to a given processor or to broadcast data to all the processors simultaneously. This characteristic allows the simple and efficient implementation of all kinds of primitives for the global transport of messages and for global synchronisation. The interconnection topology selected for the current implementation is called midimew [BEI87], but both the number of processors and the topology can be changed easily.

The performance of the communication system, both bandwidth and latency, is uniform and does not depend on the particular processors that communicate, but on the average load of the network. The performance also depends on the connectivity degree of the topology. Given the characteristics of the current system (transputers), store-and-forward flow control is used. The data transport service is reliable, deadlock-free and starvation-free. It uses the virtual-channel approach described by Dally [DAL90], which consists of breaking the cycles of the static routing algorithm with virtual channels. In any case, packets delivered by the routing system must be immediately read by the processor or discarded, because they will be overwritten. Protocols cannot hold back packets from being received, or else they must ensure that cyclic dependencies do not arise. There is no control over incorrect use of the services. Also, order is not guaranteed among packets sent from the same source processor to the same destination processor; a protocol can be built on top of the TransRouter to provide packet reordering when necessary.

In the current implementation, the efficient use of the communication resources (physical links and DMAs) has higher priority than the efficient use of the CPU. The concurrent use of all the links is preferred even if that means more CPU time is spent serving communication. Furthermore, efficient use of bandwidth is preferred over low latency, because the impact of latency is expected to be reduced by multithreading, overlapping communication with computation.
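As an illustration, the following C sketch shows one possible packet layout consistent with this description. The field names, the header contents and the value chosen for Packet.Size are assumptions; the text only fixes the split into a fixed-length information part and a header carrying the destination identifier and routing information.

    /* Sketch of a possible TransRouter packet layout (illustrative only). */
    #define PACKET_SIZE 64               /* assumed value of Packet.Size, in bytes */

    typedef struct {
        int dest;                        /* destination processor identifier (0..N-1) */
        int route;                       /* routing information, e.g. remaining hops  */
    } PacketHeader;

    typedef struct {
        PacketHeader header;
        char         data[PACKET_SIZE];  /* fixed-length information part */
    } Packet;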

3.2. Interface Specification

The TransRouter primitives are shown in Table 1. When an upper layer uses the TransRouter to send a packet to some processor, it first has to obtain a free buffer to store the packet using the TrRout.Buffer.Request function, which can block the requesting process. After copying the desired information, the upper layer invokes the TrRout.Send or TrRout.Broadcast function to send the packet to a given processor or to all the processors in the system (including the source processor).

Constants:
  Packet.Size                 /* Maximum size (in bytes) of a packet of data */
Send Calls:
  TrRout.Buffer.Request (B)   /* Request a buffer B. Blocking primitive */
  TrRout.Send (P)             /* Send the last requested buffer to processor P */
  TrRout.Broadcast ()         /* Send the last requested buffer to all the processors */
Receive Calls:
  TrRout.Receive (B)          /* Wait and receive a buffer. Blocking primitive */
  TrRout.Receive.Release ()   /* Release received buffer */
Information Calls:
  TrRout.WhoAmI ()            /* Returns current processor identifier */
  TrRout.N.Proc ()            /* Returns the number of processors in the system */
  TrRout.MemStart ()          /* Returns the start address of the free memory area */
  TrRout.MemEnd ()            /* Returns the end address of the free memory area */

Table 1. TransRouter Functions

The upper layer has to declare a server process to manage the incoming packets and implement the suitable protocol; this process is invoked by the TransRouter when a data packet arrives at its destination. The server process defines its entry point by calling the TrRout.Receive function and blocks until the arrival of a packet. It then completes the actions corresponding to the protocol associated with the packet, waking other internal processes if necessary, and returns control to the TransRouter as soon as possible by calling the TrRout.Receive.Release function. The buffer will be reused by the system, so the server process must copy the information to another location if it has to be preserved. Finally, there are four functions to obtain information about the only aspects of the actual architecture that an upper layer can access: the number of processors, the current processor's identifier, and the start and end of the free memory area.
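The following C sketch illustrates this calling pattern. The prototypes are assumptions derived from Table 1 (the dots in the primitive names are written here as underscores), and handle_protocol stands for whatever protocol code the upper layer implements; it is not part of the TransRouter interface.

    /* Sketch of how an upper layer could use the TransRouter from C. */
    #include <string.h>

    extern void TrRout_Buffer_Request(void **buffer);  /* blocks until a buffer is free      */
    extern void TrRout_Send(int processor);            /* sends the last requested buffer    */
    extern void TrRout_Receive(void **buffer);         /* blocks until a packet is delivered */
    extern void TrRout_Receive_Release(void);          /* returns the buffer to the router   */
    extern void handle_protocol(void *packet);         /* upper-layer protocol actions       */

    /* Sending side: request a buffer, copy the information, hand it to the router. */
    void send_packet(int dest, const void *info, int len)
    {
        void *buf;
        TrRout_Buffer_Request(&buf);     /* may block the requesting process          */
        memcpy(buf, info, len);          /* len must not exceed Packet.Size           */
        TrRout_Send(dest);               /* or TrRout_Broadcast() for all processors  */
    }

    /* Receiving side: the server process declared by the upper layer. It must
     * copy the data elsewhere and return the buffer as soon as possible. */
    void server_process(void)
    {
        void *buf;
        for (;;) {
            TrRout_Receive(&buf);        /* entry point: wait for a packet       */
            handle_protocol(buf);        /* protocol actions; copy the data out  */
            TrRout_Receive_Release();    /* the buffer will be reused afterwards */
        }
    }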

3.3. Obtaining Homogeneous Performance

With the current static routing algorithm, communication latency is bounded provided that collisions never occur between packets using the same physical link. Unfortunately, in any sparse communication network some communication patterns cannot be realised without collisions, and these collisions can reduce system performance drastically. The problem is not a lack of resources but an inappropriate use of them, because some links will always be busy while others will be under-used.

A solution to this problem was presented by M.D. May and P.W. Thompson in [MAY93], based on work by L.G. Valiant [VAL82]. That work shows that it is possible to design a universal routing system that can implement all communication patterns with efficient bandwidth and bounded message latency (proportional to the logarithm of the number of processors). To eliminate network hot-spots, two-phase routing is employed: every packet is first dispatched to a randomly chosen intermediate destination and then to its final destination. The static routing algorithm is used in both phases. The average values of latency and bandwidth are doubled but, with a very high probability, the load is evenly distributed over the links for all communication patterns and the resources are efficiently used.

Moreover, the actual values of latency and bandwidth barely differ from the theoretical values.
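A minimal sketch of the two-phase idea is given below; the function names and structure are illustrative assumptions, not the actual TransRouter code.

    /* Two-phase (universal) routing: each packet is first routed to a random
     * intermediate processor and then to its real destination, using the
     * static routing algorithm in both phases. */
    #include <stdlib.h>

    extern void static_route(int from, int to);   /* assumed static routing step */

    void two_phase_route(int source, int dest, int n_processors)
    {
        int mid = rand() % n_processors;   /* uniformly chosen intermediate node  */
        static_route(source, mid);         /* phase 1: spread the traffic         */
        static_route(mid, dest);           /* phase 2: deliver to the destination */
    }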

4. TransCom

A new layer is implemented over the TransRouter packet service described above, making up the complete TransCom microkernel. It provides a message-passing mechanism based on the CSP programming model. A programming model has the advantage of including a programming methodology, which gives a set of rules that help the programmer write the program. Next, the programming model, fully introduced in [LUQ94], is briefly outlined.

The programmer designs an application by expressing parallelism as a collection of processes. A process is a piece of sequential code with its own internal data (variables). Processes are executed concurrently and collaborate to solve a common problem by exchanging information through message passing. Channels are used to carry out the message passing. A channel is a program entity that links two or more processes and transports information. The data type of the information, the protocol used for transmission and the buffering capability are defined in the channel structure. Explicit communication primitives are called by processes to perform a communication action. These primitives are supplied by the kernel. The source and destination are specified indirectly, through the identifier of the channel that links the processes. Therefore, the creation of internal process code and the definition of the links through channels are managed independently. This feature helps the programmer because the application is written in two steps: first, the process graph, which defines the links and the structure of the parallel application, and, second, the internal code of each process.

As a first stage, we can consider channels that are unidirectional and point-to-point, with a synchronous protocol. This means that the processes involved in the communication are blocked until each one has executed its corresponding communication primitive. From this moment on, the information is transferred and both processes can continue their execution. Because each datum sent is known to be consumed by the receiver, a synchronous protocol guarantees transparent and safe semantics.

4.1. Interface Specification

Basic primitives are send, receive, and detect message:

  Send (int Channel, void *P)      /* Sends the message in P to Channel */
  Receive (int Channel, void *P)   /* Receive a message from Channel on P */
  Test (int Channel)               /* Test if there is a message waiting on Channel */
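As a small usage example, the sketch below shows a sender and a receiver communicating over one synchronous channel; the channel identifier CHAN and the message type are illustrative assumptions, only the primitive signatures follow the listing above.

    extern void Send(int Channel, void *P);
    extern void Receive(int Channel, void *P);
    extern int  Test(int Channel);

    #define CHAN 1                    /* hypothetical channel identifier */

    typedef struct { int x, y; } Point;

    void producer(void)
    {
        Point p = { 3, 4 };
        Send(CHAN, &p);               /* blocks until the partner executes Receive */
    }

    void consumer(void)
    {
        Point p;
        if (Test(CHAN)) {
            /* a message is already waiting, so Receive will not block */
        }
        Receive(CHAN, &p);            /* blocks until the partner executes Send */
    }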

The synchronous, point-to-point communication is the simplest one. In addition, higher-level communication types are defined through different channel types. These types offer the programmer useful programming structures. The different channel types for communication among several processes are the following:

• Many-to-one communication: each sender process sends a different message to a single receiver process. Each receive primitive receives only one message from one of the senders.
• One-of-many communication: one process sends messages to a set of processes. Each send primitive sends one message to one of the receivers. There are two possibilities: the destination process is specified, or it is the first ready process.
• Many-to-many communication: a set of sender processes send messages to a set of receiver processes. Again, the destination process can be specified or can be the first ready process.
• One-to-many communication (multicast): the same message is sent to all the members of a process group.

(point-to-point, many-to-one, one-of-many, many-to-one-of-many, multicast)
Figure 2. Channel types

This set of channel types gives the programmer higher-level programming structures, which make the programming language more powerful. For instance, MPI-like data-movement primitives [MPI94], the "worker-farmer" paradigm or global broadcast operations can be easily implemented. Furthermore, it is possible to change the channel protocol by adding buffering capability to the channels. This is a way to decouple the sending and receiving of messages, allowing different rates for the sending and receiving operations. It provides a certain level of asynchronous operation, since the sending process does not block while the data buffer is not full. In this sense, semi-synchronous or asynchronous PVM-like protocols can be implemented [PVM94].
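For instance, the "worker-farmer" structure can be written directly on a many-to-one channel, as in the sketch below; the channel identifier RESULTS, the Result type and the loop bound are assumptions for illustration.

    extern void Send(int Channel, void *P);
    extern void Receive(int Channel, void *P);

    #define RESULTS 2                        /* hypothetical many-to-one channel */

    typedef struct { int worker_id; double value; } Result;

    /* Every worker sends its result on the shared many-to-one channel. */
    void worker(int id)
    {
        Result r;
        r.worker_id = id;
        r.value = 0.0;                       /* ... compute the real value ... */
        Send(RESULTS, &r);                   /* one of many senders */
    }

    /* The farmer collects one result per Receive, from whichever worker is ready. */
    void farmer(int n_workers)
    {
        Result r;
        int i;
        for (i = 0; i < n_workers; i++) {
            Receive(RESULTS, &r);            /* one message from one of the senders */
            /* ... accumulate r.value ... */
        }
    }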

4.2. Implementation

The message service has been implemented by means of concurrent processes. Figure 3 shows the TransCom internal process structure and the relationships among the processes. Each user process (User) has two associated handler processes (HandSend and HandRecv) whose function is to interface with the User process in order to send and receive messages. This interface uses transputer internal software channels (UserHandSend and UserHandRecv) without sharing common data. Furthermore, there is a SendMess process, which sends messages packet by packet to the network through the TransRouter, and a RecvMess process, which collects packets delivered by the TransRouter. These two processes communicate with the others using internal software channels and also by accessing common data structures (ChannOutput and ChannInput).

(User, HandSend, HandRecv, SendMess, RecvMess, the ChannOutput and ChannInput data structures, and the TransRouter)
Figure 3. TransCom: Process Diagram
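As a rough illustration of the SendMess side, the sketch below fragments a message into fixed-size packets and hands them to the TransRouter one by one. The names and the PACKET_SIZE constant follow the earlier TransRouter sketch and are assumptions; the channel protocol and the ChannOutput bookkeeping of the real SendMess process are omitted.

    #include <string.h>

    #define PACKET_SIZE 64                      /* assumed value of Packet.Size */

    extern void TrRout_Buffer_Request(void **buffer);
    extern void TrRout_Send(int processor);

    /* Split a variable-length message into fixed-size TransRouter packets. */
    void send_message(int dest, const char *msg, int length)
    {
        void *buf;
        int offset = 0;
        int chunk;

        while (offset < length) {
            chunk = length - offset;
            if (chunk > PACKET_SIZE)
                chunk = PACKET_SIZE;            /* never exceed the packet size    */

            TrRout_Buffer_Request(&buf);        /* obtain a free packet buffer     */
            memcpy(buf, msg + offset, chunk);   /* copy the next fragment          */
            TrRout_Send(dest);                  /* hand the fragment to the router */
            offset += chunk;
        }
    }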

5. Load Balancing and Process Migration

In addition to the virtualisation of the processor interconnection network, another parallel computer feature that seems suitable for virtualisation is the processor resource. Processor management has a great impact on computer performance. Process distribution among processors should achieve a high rate of processor utilisation during execution. This is difficult to obtain with a static process distribution, because the computing load and the communication patterns are not known at compile time. Therefore, due to the computing volume of each process and the dependencies among processes, some processors may have no ready process while other processors have ready processes waiting to be executed. The goal is a system able to dynamically distribute processes in order to balance the load of each processor. In this way, the programmer would see the parallel computer as a "virtual processor" resource made up of all the physical processors in the system, that is, a virtual processor where the load is automatically distributed among physical processors, flowing from the most heavily loaded processors to the less loaded ones.

To achieve this goal, dynamic process migration mechanisms must be provided. Two problems have to be solved: on the one hand, the system must be able to decide which process must be migrated, and when and where it must be moved (migration policies); on the other hand, a migration technique must be developed for the particular system (migration mechanisms). Migration policies can be classified according to several criteria, as described in [LUQ95]: whether they are centralised or not, whether they use independent or overlapped domains, whether they are local or global, and whether they are based on heuristic approximations or on iterative methods that can be mathematically analysed, like those based on diffusion schemes.

Migration mechanisms involve two tasks: one related to code migration and another related to process communication. The first task implies interrupting the process, moving its code and data to a different position in the memory of a different processor and, finally, resuming its execution on the new processor. Process interruption and resumption is not difficult on transputers thanks to their microprogrammed scheduler. However, relocating code to a new memory location causes many problems, because the code generated by the compiler (INMOS C Toolset [INM92b]) is position-dependent. The problems are absolute memory references generated at execution time when calling a function, when using pointers or global variables, or when calling library functions. Two kinds of solutions have been adopted: disallowing language features that use absolute references, or creating a server process that monitors the creation of absolute references and performs the necessary translation (see [HEY95]).

The second task consists of ensuring that the migrated process continues receiving messages at its new destination. At this point, three different situations can appear: messages addressed to the process that are already in the network when the process is about to migrate, messages generated while the process is migrating, and messages generated after the process has migrated. Several techniques can be distinguished which have a different impact on global performance; each technique adopts a different solution for each of the above cases. A comparison has been carried out in [HEY95], where six techniques are analysed (full protocol, message rejection, message forwarding, use of an initial address, centralised scheme and mailbox) and a simulation shows the impact of each technique on communication performance.

6. Related Work and Discussion

There are some works closely related to our approach in the design of a communication service. The Universal Packet Routing Interface [DEB92] is an "interface to a packet-routing facility for global communications in a locally-connected message-passing machine". It has also been implemented on a network of T8-series transputers and provides a view where all processors are connected to each other, but without considering homogeneous performance. Active Messages [EIC92] also takes a minimalist approach, but it is a complete message-passing model available to the programmer, and its goal is to reduce the cost that many operating systems impose on communication latency. Finally, the T9000 processor and the C104 router, [INM93] and [MAY93], represent a hardware approach to building general-purpose interconnection networks. Their goal is very similar to ours, but they do not address the implementation of a broadcast service.

To conclude, a communication microkernel called TransCom has been designed and implemented. It offers a complete virtualisation of the communication network in a parallel computer and has two main goals:

• To provide a platform for testing and evaluating different solutions in a parallel architecture design. An open question is which services have to be included in a general-purpose parallel computer. Future work consists of modelling the current parallel environment and obtaining system parameters in order to evaluate its performance and extrapolate the results to study the appropriate level of integration (hardware, firmware or software) of each service provided by the microkernel.
• To be the foundation of a user-friendly parallel programming environment. The microkernel supports message passing between processes and makes the machine characteristics transparent to the user. In the future, it is expected to include load balancing services based on process migration.

This work is part of a general project called SEPP (Software Engineering for Parallel Processing) [WIN94], whose major interests comprise tools for program design, simulation, run-time support and behaviour analysis in parallel systems.

This work was supported by the Spanish Comisión Interministerial de Ciencia y Tecnología (CICYT) under contract number TIC 92/0547.

7. References

[BEI87] R. Beivide, E. Herrada, J.L. Balcázar and J. Labarta, "Optimized mesh-connected networks for SIMD and MIMD architectures", Proc. 14th Int. Symp. on Computer Architecture, June 1987, pp. 163-170.
[DAL90] W.J. Dally, "Virtual Channel Flow Control", Proc. 17th Int. Symp. on Computer Architecture, 1990, pp. 60-68.
[DEB92] M. Debbage and M.B. Hill, "ParaPET: A Parallel Programming Environment Toolkit", Technical Report, ESPRIT Project P5404, University of Southampton, November 1992.
[EIC92] T. von Eicken, D. Culler, S. Goldstein and K. Schauser, "Active Messages: A mechanism for integrated communication and computation", Proc. 19th Int. Symp. on Computer Architecture, ACM Press, New York, 1992.
[FRA95] D. Franco, J.C. Moure, E. Heymann and E. Luque, "TransCom: Communication services for a transputer-based multicomputer", Technical Report, Computer Science Department, Universitat Autonoma de Barcelona, 1995.
[HEY95] E. Heymann, "Soporte para la migración de procesos en computadores paralelos de memoria distribuida" (Support for process migration in distributed-memory parallel computers), Master Thesis, Computer Science Department, Universitat Autonoma de Barcelona, 1995.
[HOA85] C.A.R. Hoare, "Communicating Sequential Processes", Prentice-Hall International, 1985.
[INM92a] "The Transputer Reference Book", Third Edition, Inmos Ltd., 1992.
[INM92b] INMOS, "ANSI C Toolset", 1992.
[INM93] "The T9000 Transputer Hardware Reference Manual", Inmos Ltd., 1993.
[LUQ94] E. Luque et al., "A Parallel Programming Model for Distributed Systems", Technical Report, Computer Science Department, Universitat Autonoma de Barcelona, 1994.
[LUQ95] E. Luque, A. Ripoll, A. Cortés and T. Margalef, "A distributed diffusion method for dynamic load balancing on parallel computers", Proc. Euromicro Workshop on Parallel and Distributed Processing, IEEE Press, 1995, pp. 43-50.
[MAY93] M.D. May, P.W. Thompson and P.H. Welch, "Networks, Routers and Transputers: Function, Performance and Applications", IOS Press, 1993.
[MOU95] J.C. Moure, D. Franco, E. Heymann and E. Luque, "TransRouter: a data transport microkernel for transputers", Technical Report, Computer Science Department, Universitat Autonoma de Barcelona, 1995.
[MPI94] Message Passing Interface Forum, "MPI: A Message-Passing Interface Standard", 1994.
[PAR89] "Hardware Reference for the Parsys SN1000 Series", Technical Reference, Parsys Ltd., September 1989.
[PVM94] A. Geist et al., "PVM: User's Guide and Reference Manual", Oak Ridge National Laboratory, 1994.
[VAL82] L.G. Valiant, "A scheme for fast parallel communication", SIAM Journal on Computing 11 (1982), pp. 350-361.
[WIN94] S.C. Winter and P. Kacsuk, "Software Engineering for Parallel Processing", Proc. 8th Symposium on Microcomputer and Microprocessor Applications, Vol. 1, pp. 285-303, Budapest, 1994.