Technical Report DHPC-028
The DISCWorld Peer-To-Peer Architecture
Andrew Silis and K.A. Hawick
Computer Science Department, University of Adelaide, Adelaide SA 5005, Australia
[email protected], [email protected]
January 1998
1 Introduction
A critical aspect of a robust, scalable architecture for distributed computing is the relationship between the participating platform nodes. Other software systems such as DCE [4] have been limited by their fixed hierarchical server relationships and the difficulty of dynamically and automatically adding platforms to, or removing platforms from, a running system. We discuss the implications of a serverless architecture, a peer-to-peer relationship between all participating nodes, for the DISCWorld system, and describe the issues that arise in implementing this using Java and object-oriented software technologies.

We want a way to coordinate, control and access various distributed computing resources for our DISCWorld distributed computing environment [6]. One assumption we make about these resources is that they are geographically separated by large distances, making latency a significant concern. The user of the system may ambiguously choose a service, requiring further intervention by the user to differentiate the correct service; this means that the system must be interactive. Both servers and services will need to be able to be added and removed at runtime. The model of computation used on these distributed resources will be large grained, requiring minimal message passing. Fine grained problems will also be able to be solved using a farm of workstations running MPI [3] or PVM [1] as an embedded resource to be managed. Due to the large grained nature of this system the number of servers used is expected to remain relatively small, while the number of services may be arbitrarily large. Services may therefore require bulk, out-of-memory storage facilities, with memory acting as a cache of the services. The code for these services may be stored in a database.

To enable servers to be added and removed from the system at any time, a peer-to-peer communication model is to be used, meaning that the system will comprise a set of peer-level servers. Because of the heterogeneous resources we are undertaking to manage, a portable software approach is necessary. Implementation is being carried out in Java [2], although the services provided by servers may not be completely portable, as Java code may contain native methods. A single server may be responsible for a variety of computing resources, since some resources, such as a CM5, cannot execute a Java Virtual Machine.
2 Serverless Architectures
We have decided on a peer-to-peer, or serverless, approach: a system of linked and equal servers. Peer-to-peer is defined as: "Peer-to-peer is a communications model in which each party has the same capabilities and either party can initiate a communication session." [5] Every peer will initially be identical, but services may be added dynamically that then distinguish the peers, so the peers will evolve over time. The initially symmetric peers will retain some of their symmetry with respect to the operations that can be performed upon each other. The reasons for using a peer-to-peer communication model are the possibility of dynamically reconfiguring the network of peers and the robustness of the system. New servers may be added and old ones removed at runtime with minimal effect on the system. The peer-to-peer approach helps to increase robustness, as the failure of one node will not detrimentally affect the whole system, although it may affect the availability of some services and the execution of currently executing services.
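The symmetric operations that each peer exposes to every other peer might be sketched as follows. This is an illustrative assumption, not the actual DISCWorld prototype API: the interface, method names and the bootstrapping helper are all hypothetical, chosen to show that any peer can perform the same operations on any other.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical interface for the symmetric peer operations. Every peer
// implements the same interface, so either party can initiate a session.
interface Peer {
    List<String> knownPeers();            // for bootstrapping new peers
    void addPeer(String peerAddress);     // dynamic reconfiguration
    List<String> knownServices();         // service discovery
    void addService(String serviceName);  // dynamic service addition
}

// Minimal in-memory implementation, enough to show the symmetry: a new
// peer joins by copying an existing peer's view and announcing itself.
class SimplePeer implements Peer {
    private final String address;
    private final List<String> peers = new ArrayList<>();
    private final List<String> services = new ArrayList<>();

    SimplePeer(String address) { this.address = address; }

    public List<String> knownPeers() { return new ArrayList<>(peers); }
    public void addPeer(String peerAddress) {
        if (!peers.contains(peerAddress)) peers.add(peerAddress);
    }
    public List<String> knownServices() { return new ArrayList<>(services); }
    public void addService(String serviceName) {
        if (!services.contains(serviceName)) services.add(serviceName);
    }

    // Copy an existing peer's list of peers, then tell it about ourselves.
    void bootstrapFrom(Peer existing) {
        for (String p : existing.knownPeers()) addPeer(p);
        existing.addPeer(address);
    }
}
```

Because the interface is the same on every node, adding or removing a server reduces to updating peer lists rather than rewiring any fixed hierarchy.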
Naming of the services, servers and data is also a concern, in order to allow the caching of results and out-of-memory storage for some services. For the time being we assume that each server, each service and each piece of data is given a unique identifier.
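One simple way to mint such identifiers is sketched below, under the assumption (ours, not the report's) that a host-qualified name plus a per-host serial number is unique enough. The class and field names are illustrative.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical unique identifier for servers, services and data items:
// the originating host name plus a serial number local to that host.
public final class DiscId {
    private static final AtomicLong COUNTER = new AtomicLong();
    private final String hostName;  // originating peer
    private final long serial;      // unique within that peer

    private DiscId(String hostName, long serial) {
        this.hostName = hostName;
        this.serial = serial;
    }

    // Mint a fresh identifier on the local peer.
    public static DiscId next(String hostName) {
        return new DiscId(hostName, COUNTER.getAndIncrement());
    }

    @Override public String toString() { return hostName + ":" + serial; }

    @Override public boolean equals(Object o) {
        if (!(o instanceof DiscId)) return false;
        DiscId other = (DiscId) o;
        return serial == other.serial && hostName.equals(other.hostName);
    }

    @Override public int hashCode() {
        return hostName.hashCode() * 31 + Long.hashCode(serial);
    }
}
```

Identifiers of this shape can serve as stable keys for cached results and for code or data held in out-of-memory storage.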
3 Implementation Choices and Design
A discussion of the DISCWorld architecture, based upon a peer-to-peer system for large grained applications, follows. Two related questions arise during the peer-to-peer implementation and design: how do the peers know of each other's existence, and how do the peers know what services exist? Two solutions to the first question are to have every peer know about every other peer, or to have each peer know about only some of the other peers. The second approach must satisfy the constraint that each peer can somehow find out about every other peer by traversing through other peers. The problem remains how to implement this efficiently while remembering the effects of latency and the need for interactivity.
[Figure 1: Each peer knows of some peers. The original figure showed five hosts, a to e, in a partially connected graph.]

Possible solutions for services are similar to the solutions for servers. The number of services offered is expected to usually far exceed the number of hosts. It may be sufficient in some cases to allow every peer to know about every service, although in general this list of services may be prohibitively long.

In the case where each peer may not know every other peer we have several problems. The ability for every peer to find every other peer can easily be lost when a peer fails and fragments the system, as can be seen from figure 1. If, for instance, host b fails, then the system becomes fragmented and host a cannot reach host c.

[Figure 2: Every peer knows every peer. The original figure showed the same five hosts in a fully connected graph.]

Searching for services may also pose a problem. If the user needs to find out whether a particular service exists, or what different kinds of services carry that name, then the graph formed by the peers needs to be traversed. The user will want as much interactivity as possible and will not tolerate long waiting times. With this system large latencies are possible when traversing the peers, as they may not be optimally laid out and can be spread across the network. It may be possible to spread enough information around the system to stop a single peer failure from fragmenting the graph, and also to order the system so that searches are carried out as fast as possible, perhaps by passing service information partially in advance or by keeping track of all the services in a shared piece of memory. Distributed shared memory, perhaps replicated and implemented in software, could be used for this, but care must still be taken so that the information is not lost due to a server failure. Bottlenecks may also occur in such a system, with large amounts of information having to be passed through a host like host b in figure 1. This could be avoided by keeping references to other servers and services as they are found, but eventually every host may know about every other host, at which point this approach comes to resemble the second approach. That second approach is similar to the one we are planning to use.

When every peer has knowledge of every other peer, the list of peers can grow very large, so scalability will be a problem. Updating this list consistently is also a problem. The updating may be done using a piece of shared memory or some kind of atomic broadcast or multicast [7]; atomicity is needed to keep the host lists consistent between peers. Systems already exist to do this, although often only in the network layer, which would reduce portability; many of the concepts can still be used in software. If extremely large systems are needed it may be necessary to change the peer-to-peer system to another structure, such as a tree of peer-to-peer groups, to increase its scalability. To locate services in this system it is possible to flood each peer with a request about its services. This may flood the network, but will provide good interactive response times for the user if the network can handle it; otherwise the response times may be extremely poor.

A mixture of the two methods above will be used for DISCWorld. Each peer will have its own policy that determines how it gains knowledge of other peers and of services: which peers and services it tries to locate, and what information it stores in its database. Extra information about the services can also be stored in the database. When a peer needs to know about a service it may simply check the database, although that information may be out of date; alternatively it could check all hosts for the service. Its actual action will depend on the policy used, and this policy can vary between peers. This is not a perfect solution, but we hope it will work in most cases for our needs. One possible policy for a peer is to periodically update a database of the peers and services available, polling the known peers to discover the information needed for the update; when a service is requested, the information can then be retrieved from the database.

To add a peer to the system, the peer can be bootstrapped from another peer or from a database. We plan to make each peer capable of having its own bootstrapping policy. On removal, a peer simply becomes unavailable. What happens when a host disappears needs to be determined: first, detection of the disappearance is necessary, and will initially be performed using a timeout based on the time the service usually takes to complete. This timeout information may be kept in the database with the service information.

Problems may be encountered when adding services as class files, due to naming difficulties. Grouping class files into Java archive (JAR) files, which allows class files with the same name to coexist, is a possible solution. Data can also be incorporated into the JAR files.
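The periodic-update policy described above might be sketched as follows. The class and the shape of the polling callback are our illustrative assumptions, not the prototype's code: the directory polls the peers it knows for their service lists and caches the result, so a lookup normally hits only the local (possibly stale) cache rather than flooding the network.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical sketch of a peer's discovery policy: periodically poll
// known peers for their service lists and cache the results locally.
class ServiceDirectory {
    private final List<String> knownPeers = new ArrayList<>();
    // service name -> peers believed to offer it
    private final Map<String, Set<String>> cache = new HashMap<>();

    void addPeer(String peer) { knownPeers.add(peer); }

    // Refresh the cache; 'poll' stands in for the remote call that asks
    // a peer which services it currently offers.
    void refresh(Function<String, List<String>> poll) {
        cache.clear();
        for (String peer : knownPeers) {
            for (String service : poll.apply(peer)) {
                cache.computeIfAbsent(service, s -> new HashSet<>()).add(peer);
            }
        }
    }

    // Cheap, interactive lookup against the local cache only; the answer
    // may be out of date until the next refresh.
    Set<String> peersOffering(String service) {
        return cache.getOrDefault(service, Collections.emptySet());
    }
}
```

A different policy, for example flooding all hosts on every lookup, could replace `refresh` without changing the lookup interface, which is the kind of per-peer policy variation the text envisages.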
Extra storage may sometimes be required when using JAR files, because classes in the JAR file may be loaded that are not actually required. More than one service can be packaged in the same JAR file, and JAR files can be compressed and therefore downloaded faster. Each JAR file should be self-contained, so that it includes every Java class file needed to comprise the service. Execution speed of services may also be increased: because all the class files exist in the JAR file, it is not necessary to download class files across the network when the classes are dynamically bound. This approach also avoids the problems of a server adding a service and dynamically downloading class files as needed, where the source of the class files may disappear, leaving the service incomplete. To check whether storing code in a database was feasible, a codeserver was implemented; it showed that we could load the required Java class files from a database.

The current prototype implementation is a system where, usually, every peer knows about every other peer. Consistency between the peers is currently not guaranteed, so if two peers join the system by bootstrapping from two separate peers, not all peers may know about each other. This is because the current bootstrapping policy is to copy the list of peers from other peers and then tell each peer about itself: two peers joining at the same time will not have each other in the lists they obtain, so they will not inform each other about themselves. Services are not currently bootstrapped.

We need to be able to add services dynamically. In the current implementation, Java services are added in two ways: either by sending code to all peers, or by sending code to one peer and informing some subset of the other peers where that service is located. Currently we do not handle the removal of peers, the death of peers, the revoking of services or the removal of services. What we have done so far is to implement a minimal number of initial peer services. This will enable us to add all other required services on top of the system, including services used by the peers themselves. The exact set of initial services needed must be examined very closely, especially with respect to thread scheduling, to determine its final composition.
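The self-contained packaging described above can be sketched with the standard `java.util.jar` API. This is our illustrative assumption of what such packaging might look like, treating a service as a set of named class files held as byte arrays; the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.jar.JarEntry;
import java.util.jar.JarInputStream;
import java.util.jar.JarOutputStream;

// Hypothetical sketch: bundle a service's class files into one JAR so
// the whole service moves between peers as a single, compressible unit.
public final class ServiceJar {
    public static byte[] pack(Map<String, byte[]> classFiles) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (JarOutputStream jar = new JarOutputStream(buf)) {
            for (Map.Entry<String, byte[]> e : classFiles.entrySet()) {
                jar.putNextEntry(new JarEntry(e.getKey()));
                jar.write(e.getValue());
                jar.closeEntry();
            }
        }
        return buf.toByteArray();
    }

    // Unpack on the receiving peer: every class the service needs is
    // already inside, so no further network fetches are required when
    // the classes are later dynamically bound.
    public static Map<String, byte[]> unpack(byte[] jarBytes) throws IOException {
        Map<String, byte[]> out = new HashMap<>();
        try (JarInputStream jar = new JarInputStream(new ByteArrayInputStream(jarBytes))) {
            JarEntry entry;
            byte[] chunk = new byte[4096];
            while ((entry = jar.getNextJarEntry()) != null) {
                ByteArrayOutputStream b = new ByteArrayOutputStream();
                int n;
                while ((n = jar.read(chunk)) != -1) b.write(chunk, 0, n);
                out.put(entry.getName(), b.toByteArray());
            }
        }
        return out;
    }
}
```

Because the JAR entry names are scoped to the archive, two services can each ship a class file of the same name without colliding, which is the naming benefit noted above.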
The removal of services may be necessary due to an explicit command, or may be required when there is not enough available storage space to store all the services. We are planning to implement this using a policy mechanism similar to that used for determining peers and services. Extra information relating to services can be stored in each peer's database; this is an example of storing auxiliary metadata in the database. Using this extra information, the policy can determine when to remove services. One possible policy is, when the amount of available space falls below a certain point, to remove services based on their number of accesses or their time of last access. The policy may also specify the constraint that a service must currently be available on another peer before it is removed, to avoid the service becoming unavailable or being lost completely. It may be necessary to transfer the code to another peer in order to satisfy this constraint.

No way currently exists to start a server with services. Services are used via an initial invoke service. This is currently performed using a Java remote method invocation (RMI), but will need to be performed in a way that allows pipelines and other more complex structures of services to be invoked.
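The initial invoke service might look something like the following sketch. The interface name and signature are our assumptions, not the prototype's; the remote interface is what a peer would export over RMI, while the implementation shown dispatches locally by service name so the idea can be demonstrated without an RMI registry.

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical remote interface for the initial invoke service: the
// one entry point through which all other services are reached.
interface InvokeService extends Remote {
    Object invoke(String serviceName, Object argument) throws RemoteException;
}

// Local, registry-free implementation for illustration only.
class LocalInvokeService implements InvokeService {
    private final Map<String, Function<Object, Object>> services = new HashMap<>();

    // Dynamic service addition: register runnable service code by name.
    void register(String name, Function<Object, Object> service) {
        services.put(name, service);
    }

    public Object invoke(String serviceName, Object argument) throws RemoteException {
        Function<Object, Object> service = services.get(serviceName);
        if (service == null) {
            throw new RemoteException("unknown service: " + serviceName);
        }
        return service.apply(argument);
    }
}
```

A single-argument, single-result signature like this is exactly what makes pipelining awkward; chaining or fanning out services would require a richer invocation structure, as the text anticipates.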
4 Conclusions
Although this is a hard problem, we believe we can arrive at compromises that will work most of the time and can be refined, provided we can set up appropriate frameworks for encoding policies. It is important to realise that the current system is not fully reliable, although it shows that the concept of adding and distributing code can be accomplished and that a peer-to-peer model is viable for our needs. The current implementation is simply a proof-of-concept prototype.
5 Future Directions
A detailed survey of the peer-to-peer architecture ideas needs to be undertaken to see how they fit in with the rest of the DISCWorld system and to make sure they will be compatible. The core services that a peer provides to enable its operation, including the addition of services, need to be decided upon and implemented; the scheduling of threads in the Java Virtual Machine may influence this considerably. Once a new basic framework is implemented, different policies for the peers need to be implemented, and the performance of these policies, the situations in which each is most appropriate, and the interactions between them need to be evaluated. The storage of code in databases on each peer can then be investigated, with services added to determine whether a service should be performed on another peer or its code downloaded. The peer-to-peer communication policies might store code as well as service information, so that the receiving peer then provides the service itself.
6 Acknowledgements
This work is being carried out as part of the Distributed High Performance Computing Infrastructure (DHPC-I) project of the Research Data Networks Cooperative Research Center (RDN CRC) and is managed under the On-Line Data Archives Program of the Advanced Computational Systems CRC. RDN and ACSys are established under the Australian Government's CRC Program.
References
[1] A. Geist, A. Beguelin, J. Dongarra, W. Jiang, R. Manchek, V. Sunderam, "PVM: Parallel Virtual Machine - A User's Guide and Tutorial for Networked Parallel Computing", MIT Press, 1994.
[2] James Gosling, Bill Joy, Guy Steele, "The Java Language Reference", Addison-Wesley, 1996, ISBN 0-201-63451-1.
[3] W. Gropp, E. Lusk, A. Skjellum, "Using MPI - Portable Parallel Programming with the Message-Passing Interface", MIT Press, 1994, ISBN 0-262-57104-8.
[4] The Open Software Foundation, "Introduction to OSF DCE", Prentice Hall, 1995, ISBN 0-13-185810-6.
[5] George McDaniel, "IBM Dictionary of Computing", Tenth Edition, McGraw-Hill, 1993; quoted at http://whatis.com/peertope.htm
[6] K.A. Hawick et al, "DISCWorld: An Integrated Data Environment for Distributed High-Performance Computing", Technical Note DHPC-027, January 1998.
[7] "Totem Group Communication System", http://alpha.ece.ucsb.edu/pub/publications.html#totem