A Mobile Multi-Agent System for Distributed Computing
Stefan Kleijkers, Floris Wiesman, and Nico Roos
International Institute of Infonomics / Universiteit Maastricht, Department of Computer Science
P.O. Box 616, 6200 MD Maastricht, The Netherlands
Abstract. The paper describes a peer-to-peer distributed-computing platform based on mobile agents. The platform, called yaca, has been built on top of Aglets, a mobile-agent platform written in Java. Next to the client agents, which seek computational resources in a cluster, yaca consists of four agents, which manage the computers within a cluster. (1) The Directory Agent keeps track of the computers belonging to the cluster. These computers are called nodes. (2) The Weather Agent monitors the status of a node and (3) the Account Agent keeps track of the resources used by client agents. (4) The Controller Agent manages a single node. It controls access to the node and migrates client agents to other nodes in the cluster if the node becomes overloaded. The Controller Agent receives help from the Weather Agent and the Account Agent in managing the node. Experiments showed that yaca introduces little overhead; however, its load-balancing algorithm has room for improvement.
1 Introduction

Despite the exponential growth of CPU speed, modern computers cannot cope with the growing demand for computational resources that occurs in areas such as Bio-Informatics. Computer models used in physics, biology, and technology are getting more refined and often show an exponential time complexity. Moreover, the amount of data to be processed is growing rapidly. Due to these factors the demand for computer power outweighs its growth. At the same time, there is a huge amount of unused computer power available on the desks of people, especially after office hours. The seti@home project has demonstrated convincingly the scalability and feasibility of distributed computing with spare CPU cycles. In Spring 2002, there were 500,000 active users; a total of almost 1 million CPU years had been donated. Gnutella, KaZaA, and Morpheus have generated ample interest in peer-to-peer (P2P) networks and have shown their utility. Combining these two approaches is an obvious step. Mobile agents seem the ideal tools for such a unification, because (a) they can react flexibly to the availability of resources, and (b) they allow new computational tasks to be created without the need for
installing new software on multiple hosts. This study investigates the possibility of using mobile agents for P2P distributed computing. The remainder of this paper is organized as follows: Sect. 2 outlines existing approaches to distributed computing, Sect. 3 describes our platform based on mobile agents, and Sect. 4 presents the results of experiments conducted with our platform. Finally, Sect. 5 presents our conclusions.
2 Distributed Computing

The best-known conventional tools for distributed computing are PVM and MOSIX. We describe them below, as well as POPCORN and PaCMAn, two distributed-computing systems with a P2P approach.

PVM consists of a set of libraries for distributing tasks and for communication between tasks located on different computers. The programmer has to use functions from the PVM library to set up a cluster of computers, distribute the task equally over the different computers in the cluster, and start the computers working on the task. After all computers have finished their part, the programmer can use library functions to collect the results from the different computers. PVM is inflexible: it is not possible to redistribute tasks dynamically while the cluster is working. While it is possible for a node to join or leave the cluster when execution is in progress, the programmer must make the program aware of such situations and take proper action.

More advanced than PVM is MOSIX. With this system an application can fork itself and distribute its job to its children, computing the job as if it were running on one computer, even while some of the children are already running on another computer. MOSIX monitors the resources of the system and migrates processes to other nodes in the cluster when the current node becomes overloaded. The MOSIX approach has two advantages. Firstly, the programmer can write an application as if it were to be run on a single computer with multiple processors. The programmer can fork the program multiple times to obtain new instances of the application, and every instance will receive a part of the job. All the instances then work on the job until they finish, and the parent instance can collect the results from its children. Secondly, the system is scalable. Every node in the system sends its load information to a small random set of other nodes, and every node builds a list of load information about the nodes it received information from. If a node becomes overloaded it will choose a node from the list to migrate its processes to. Hence, the nodes only have to send little information over the network to keep each other informed. Experiments indicate that MOSIX performs significantly better than PVM [1].

POPCORN [2] is a P2P system for selling and buying processor resources on the Internet. The main purpose of POPCORN is to investigate market mechanisms for selling processor resources on the Internet. Buyers have to write their programs using the POPCORN library. A program or a program subtask can be sent to the market to find a seller. Upon agreement between buyer and seller,
the program is executed on the selling host and the results are returned to the buyer.

PaCMAn [3] is a system for distributed computing based on the Aglets mobile-agent platform [4]. Each host runs an Aglets server. A broker keeps track of the system configuration, resources, and network load by dispatching special agents that gather the required information. Client agents can migrate between hosts while retaining their state. In most distributed-computing systems the processes carry the data with them, or the data is located at a fixed node and the processes have to access that node if they need the data. PaCMAn uses a different approach: data from a single host is divided into more or less equal pieces and is distributed over the different hosts within the PaCMAn cluster. The client agents move to the hosts with the data and start processing it.
3 The Yaca Platform

At the start of the project six design goals were set. None of the existing platforms met all of these objectives.

- The platform has to be operating system independent. The platform is to be used in an open environment, and hence should be able to run on multiple operating systems. This goal is satisfied by using the Java programming language.
- The platform will be based on mobile agents. A mobile agent, as opposed to a static agent, can move to another host and bring its data along. This property facilitates the migration of an agent from a host that has become overloaded to a host with a lower load. Moreover, hosts do not need to have the computational tasks in advance: they are embedded in the mobile agents. After reviewing the various available mobile-agent platforms, we chose the Aglets platform.
- The load must be well balanced over the nodes. The greater the differences in CPU load between nodes, the greater the number of CPU cycles wasted.
- The overhead has to be low. If coordinating the agents takes too much overhead, it outweighs the effort of wrapping computational tasks in mobile agents.
- The possible use of markets for computational resources. It should be possible to extend the platform with markets for computational resources.¹
In the remainder of this section we describe our platform, yaca. Section 3.1 provides an overview of the platform, Sections 3.2–3.6 describe the various yaca agents, Section 3.7 discusses how resources are monitored, and Sect. 3.8 describes the yaca security system.
¹ Yaca is an abbreviation of Yacatecutli, the Aztec god of merchants and travelers. The merchants are connected to the market extensions of yaca and the travelers to the mobility of the agents within yaca.
3.1 Overview
The yaca platform consists of four types of management agents running on the Aglets platform. Besides the management agents, there are client agents. These are the agents performing tasks for users, but strictly speaking they are not part of the yaca platform. The first agent is the Directory Agent. It keeps a list of all the hosts that are part of the cluster. Hosts that belong to a cluster are called nodes. The second agent is the Controller Agent. Every node has a Controller Agent running; it controls the node. The third and fourth agents are the Weather Agent and the Account Agent. The Weather Agent monitors the load of the system, while the Account Agent keeps track of the resources being used by the clients. Figure 1 shows two nodes in a cluster and illustrates which agents communicate with each other. Host A contains the Directory Agent, and the two Controller Agents communicate with this Directory Agent. Furthermore, the Weather Agent, Account Agent, and client agents only communicate with the Controller Agent.
Fig. 1. Interaction between yaca agents on two hosts.
3.2 The Directory Agent
The first agent of the yaca platform is the Directory Agent. The main task of the Directory Agent is managing a list of all the computers that take part in the cluster. Only one Directory Agent is required per cluster. When a computer wants to join the cluster it has to register itself with the Directory Agent. The Directory Agent checks whether the host has already been added; if not, it adds the host. If a host wants to leave the cluster it has to deregister itself from the Directory Agent.

The main purpose of maintaining this list is to provide nodes with a way of knowing which other hosts take part in the cluster. Individual nodes can ask the Directory Agent for a list of other hosts in the cluster. A node uses this list to decide to which host it will migrate a client when the load gets too high. The Directory Agent does not return the complete list of hosts but only a part of it. The Directory Agent constructs this list by selecting hosts from the complete list at random, thereby improving scalability. If the Directory Agent returned the whole list, the node would have to query all the available nodes (see Sect. 3.3). In a small cluster this would not be a problem, but in a large cluster it would cause too much overhead.

An additional purpose of holding a list of nodes is control over the cluster. It gives the user the possibility to exclude hosts from the cluster. This way a security system can be built for controlling the number of nodes, which hosts are allowed, and the migration process.
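A minimal sketch of this bookkeeping is given below, written in present-day Java rather than the JDK 1.1 used in the paper. The class, method, and host names are illustrative, not taken from the yaca sources; in yaca the calls would arrive as Aglets messages.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    // Sketch of the Directory Agent's bookkeeping: idempotent registration
    // and a random partial view of the node list for scalability.
    public class DirectorySketch {
        private final List<String> nodes = new ArrayList<>();

        // Add a host only if it is not already part of the cluster.
        public synchronized void register(String host) {
            if (!nodes.contains(host)) {
                nodes.add(host);
            }
        }

        public synchronized void deregister(String host) {
            nodes.remove(host);
        }

        // Return a random subset instead of the full list, so that a
        // querying node never has to probe the whole cluster.
        public synchronized List<String> randomSubset(int size) {
            List<String> copy = new ArrayList<>(nodes);
            Collections.shuffle(copy);
            return copy.subList(0, Math.min(size, copy.size()));
        }

        public static void main(String[] args) {
            DirectorySketch dir = new DirectorySketch();
            dir.register("atp://hostA:4434");   // hypothetical node addresses
            dir.register("atp://hostB:4434");
            dir.register("atp://hostA:4434");   // ignored: already registered
            System.out.println(dir.randomSubset(1));
        }
    }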
3.3 The Controller Agent
The second yaca agent is the Controller Agent. Every node in the cluster has one Controller Agent. The Controller Agent is the ruler of a node: it decides if and when agents are migrated and where they are migrated to. Furthermore it decides which agents are permitted on the node and what their restrictions are.

After a Controller Agent is started it registers itself with the Directory Agent. As a result, the Directory Agent adds the host of the Controller Agent to the list of nodes. After registering, the Controller Agent starts its job. First it waits for a Weather Agent and an Account Agent to register themselves with the Controller Agent; only then does it accept client agents. A client agent has to be registered with the Controller Agent, otherwise the Controller Agent will not accept the client on the host and will remove it. The Directory Agent controls the access of hosts to the cluster; the Controller Agent controls the access of clients to a host.
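This admission sequence can be summarized in a small sketch, again with illustrative names; the real agents communicate through Aglets messages rather than direct method calls.

    import java.util.HashSet;
    import java.util.Set;

    // Sketch of the Controller Agent's admission rule: clients are only
    // accepted once a Weather Agent and an Account Agent have registered.
    public class ControllerSketch {
        private boolean weatherRegistered = false;
        private boolean accountRegistered = false;
        private final Set<String> clients = new HashSet<>();

        public synchronized void registerWeatherAgent() { weatherRegistered = true; }
        public synchronized void registerAccountAgent() { accountRegistered = true; }

        // Returns false when the client must be refused (and later removed).
        public synchronized boolean registerClient(String clientId) {
            if (!weatherRegistered || !accountRegistered) {
                return false;   // node not fully initialized yet
            }
            return clients.add(clientId);
        }

        public synchronized void deregisterClient(String clientId) {
            clients.remove(clientId);
        }
    }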
3.4 The Weather Agent
The third yaca agent is the Weather Agent. Each node also needs one Weather Agent. This agent keeps track of the system resources. If the node runs out of resources it informs the Controller Agent, and the Controller Agent decides what
to do. The Weather Agent monitors the load of the system by determining the load of the processor. It checks the load at regular intervals. The level of processor load at which the Weather Agent informs the Controller Agent can be set at startup. Furthermore, it is possible to change this value at runtime by sending a message to the Weather Agent. This is especially useful for the Controller Agent, since in this way it can control the load of the node: it can decide to permit clients when the system is idle and refuse clients if the host is occupied by a user.

To keep the decision whether to inform the Controller Agent as simple as possible, the Weather Agent uses only a single value to represent the load of the different system resources. If the Weather Agent monitored more than one resource, it would need a function to combine multiple values into a single one. Currently the Weather Agent only uses the load of the system processor, hence there is no need for such a combining function.

The Weather Agent does not inform the Controller Agent every time it checks the load of the system. It first checks the load several times and takes the average of those samples; the average is then sent to the Controller Agent. This way the system can handle fast and short fluctuations in the load.
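A sketch of this sampling scheme is shown below. The threshold is adjustable at runtime, several samples are averaged before the Controller Agent is notified, and readLoad() stands in for the platform-specific probe of Sect. 3.7; all names are illustrative.

    // Sketch of the Weather Agent's sampling loop: average several load
    // samples and notify the Controller Agent only when the average
    // exceeds a runtime-adjustable threshold.
    public class WeatherSketch implements Runnable {
        private volatile double threshold;      // changeable by message at runtime
        private final int samples;
        private final long intervalMillis;

        public WeatherSketch(double threshold, int samples, long intervalMillis) {
            this.threshold = threshold;
            this.samples = samples;
            this.intervalMillis = intervalMillis;
        }

        public void setThreshold(double t) { threshold = t; }

        public void run() {
            while (!Thread.currentThread().isInterrupted()) {
                double sum = 0;
                for (int i = 0; i < samples; i++) {
                    sum += readLoad();
                    try {
                        Thread.sleep(intervalMillis);
                    } catch (InterruptedException e) {
                        return;                 // stop sampling when interrupted
                    }
                }
                double average = sum / samples; // smooths short fluctuations
                if (average > threshold) {
                    notifyController(average);  // in yaca: an Aglets message
                }
            }
        }

        // Placeholder for the operating-system probe described in Sect. 3.7.
        private double readLoad() { return 0.0; }
        private void notifyController(double load) { /* send message */ }
    }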
3.5 The Account Agent
The last of the four management agents is the Account Agent. The Account Agent is responsible for keeping track of the resources used by a client agent. This is necessary for payment: if a client has paid for some resources, yaca has to prevent the client from using more resources than it has paid for.

When a client agent enters a node it has to register itself with the Controller Agent, which then registers the client with the Account Agent. The Account Agent now monitors the client agent. If the client agent wants to migrate to another node, it deregisters itself from the Controller Agent. The Controller Agent then deregisters the client from the Account Agent, asks the Account Agent for the resources used by the client, and writes this information to a log file.

The Controller Agent may also migrate the client to another node in the cluster because of low system resources at the current node. In this case the Controller Agent deregisters the client from itself and from the Account Agent. After that it registers the client agent at the new node. The Controller Agent also asks the Account Agent for the resources used by the client and informs the Account Agent at the new node about the resources the client has used until that time. When the client finishes its job or migrates by itself, the total resource usage of the client is written to the log file by the last Controller Agent. Collecting all these logs at a central place could help to improve the level of control over the resources used by different clients, but this is not part of yaca at the moment.
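The hand-off on a forced migration can be sketched as follows; the interface abstracts the Account Agent's role and all names are illustrative, not yaca API.

    // Sketch of the bookkeeping when the Controller Agent migrates a client:
    // read the usage from the local Account Agent, log it, and pass it on so
    // the Account Agent at the destination can continue the tally.
    public class MigrationHandoffSketch {
        interface AccountAgent {                        // illustrative interface
            long usage(String clientId);                // resources used so far
            void deregister(String clientId);
            void register(String clientId, long usedSoFar);
        }

        public void migrate(String clientId, AccountAgent local,
                            AccountAgent remote, java.io.PrintWriter log) {
            long used = local.usage(clientId);          // query before deregistering
            local.deregister(clientId);
            log.println(clientId + " used " + used);    // the log file of Sect. 3.5
            remote.register(clientId, used);            // destination continues tally
        }
    }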
3.6 The Client Agents
The agents that carry out the real work for the user are the client agents. All clients have to register themselves with the Controller Agent on the node they have just entered. When a client migrates by itself, it has to deregister itself from the Controller Agent and register itself with the Controller Agent on the next node. When a client is migrated by the Controller Agent, it is deregistered from the current node and registered at the new node by the Controller Agent. Without registration, clients are removed from the node by the Controller Agent.

A useful design pattern for yaca clients is to build one main agent. This agent holds all program code and data. After it is started, the main agent clones itself several times, thus creating child agents. Because of the cloning, the main agent knows the identifiers of the child agents. These identifiers are required for communication with other agents. The main agent can now delegate subtasks to the different child agents. The child agents then begin to work on their subtasks. After they have finished, the child agents return their results to the main agent.
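A plain-Java sketch of this pattern is given below. Threads stand in for the child agents only to keep the sketch self-contained; in yaca the children are Aglets clones that may be dispatched to other nodes and report back by message.

    import java.util.ArrayList;
    import java.util.List;

    // Sketch of the main-agent pattern: split a task over n children,
    // let them work, and collect the partial results.
    public class MainAgentSketch {
        interface Task { long work(long from, long to); }

        public static long delegate(Task task, long from, long to, int children)
                throws InterruptedException {
            long chunk = (to - from) / children;
            long[] results = new long[children];
            List<Thread> workers = new ArrayList<>();
            for (int i = 0; i < children; i++) {
                final int id = i;
                final long lo = from + i * chunk;
                final long hi = (i == children - 1) ? to : lo + chunk;
                Thread t = new Thread(() -> results[id] = task.work(lo, hi));
                workers.add(t);
                t.start();               // in yaca: clone and dispatch a child
            }
            long total = 0;
            for (int i = 0; i < children; i++) {
                workers.get(i).join();   // in yaca: await the child's result message
                total += results[i];
            }
            return total;
        }
    }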
3.7 Resource Monitoring
The Weather Agent is responsible for keeping track of the system resources. The current version only takes the processor load into account. To do so, yaca has to give up part of its operating-system independence: the Weather Agent has to monitor the processor by reading some value that the operating system delivers. Java itself does not provide a good way of monitoring the load of the processor, and it does not suffice to monitor the load of the Java virtual machine, because we are interested in the load of the whole system. Therefore the Weather Agent calls a program outside the Java virtual machine. This way the Weather Agent can query the operating system directly. This program is a small C program for the Windows operating system and a small shell script for Unix.

How does the Weather Agent quantify the processor load? Different operating systems provide different methods for measuring processor load. Yaca uses a method that all modern operating systems support: monitoring the length of the queue of processes in the ready state. Every operating system has a list of all processes (i.e., programs) that are running on the machine. This list is subdivided into smaller lists. Two lists are important for this discussion, namely the list of sleeping processes and the list of ready processes. Not all processes need the processor all the time. Some processes are waiting for some event, for instance input from a user. Those processes are put in the list of sleeping processes; when the event occurs, the operating system puts them in the list of ready processes. In the list of ready processes we find all processes that want to use the processor because they have something to do, for instance reacting to an event from the user.

The operating system decides which process from the ready queue may use the processor. It uses an algorithm for dividing the processor time fairly over
the different processes. If the list of ready processes grows, new work is being added while the processor has not finished the old work yet. This indicates a growing load. Because all operating systems provide a number that relates to the number of processes in the ready queue, this number can be used as a good estimate of the load.
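As an illustration of the Unix side of this scheme (the paper's actual shell script and Windows C program are not reproduced here), on Linux the file /proc/loadavg contains a field of the form running/total, whose first number approximates the length of the ready queue:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    // Illustrative Linux-only probe: read the number of runnable processes
    // from /proc/loadavg, e.g. the "1" in "0.42 0.30 0.24 1/345 6789".
    public class ReadyQueueProbe {
        public static int readyProcesses() throws IOException {
            try (BufferedReader in = new BufferedReader(new FileReader("/proc/loadavg"))) {
                String[] fields = in.readLine().split("\\s+");
                return Integer.parseInt(fields[3].split("/")[0]);
            }
        }

        public static void main(String[] args) throws IOException {
            System.out.println("ready processes: " + readyProcesses());
        }
    }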
3.8 Security
Without any security measures the Controller Agent cannot force clients to leave the node and cannot deny clients access to the node. Furthermore, clients could dispose of one of the yaca agents or take over their function. Of course, this would disrupt the proper working of the system. Therefore, there is a need for a security system that grants the Controller Agent full control of the node. No other agent can dispose of or migrate the Controller Agent. Also the other three management agents (Directory Agent, Weather Agent, and Account Agent) and the clients have to be protected against dispose or migrate messages from other client agents.

At startup of yaca, Aglets has to secure the three agents of yaca (or four if the Directory Agent also runs on that node). From then on no other agent can dispose of or migrate one of these agents. Subsequently the Controller Agent registers itself with the Directory Agent, and if the registration is successful the Weather Agent and Account Agent register themselves with the Controller Agent. Only after all these steps have finished successfully are client agents allowed to enter. If yaca allowed clients to register themselves with the Controller Agent before the Weather Agent and Account Agent are registered, a client agent could fool the Controller Agent and act as the Weather Agent or Account Agent.

When all management agents have properly registered themselves, yaca allows client agents to enter the node. Client agents could enter the node without registering with the Controller Agent, but this is not allowed, because otherwise a client could use the resources of a node without the possibility of migration and without paying for them (if there is a payment facility). To prevent this, the Controller Agent checks regularly for clients on the node that are not registered. If the Controller Agent finds unregistered clients, it disposes of them. An alternative would be to use the security system of Aglets to deny such clients access to the node.
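The periodic sweep can be sketched as follows; the Node interface abstracts the Aglets operations for listing and disposing of agents, and all names are illustrative.

    import java.util.HashSet;
    import java.util.Set;

    // Sketch of the Controller Agent's periodic sweep: any client present on
    // the node but absent from the registration table is disposed of.
    public class SweepSketch {
        interface Node {                      // illustrative abstraction
            Set<String> agentsOnNode();       // client agents currently hosted
            void dispose(String agentId);     // forcibly remove an agent
        }

        private final Set<String> registered = new HashSet<>();

        public synchronized void register(String agentId) {
            registered.add(agentId);
        }

        // Run at a regular interval by the Controller Agent.
        public synchronized void sweep(Node node) {
            for (String id : node.agentsOnNode()) {
                if (!registered.contains(id)) {
                    node.dispose(id);         // unregistered clients are removed
                }
            }
        }
    }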
4 Experiments

This section describes the experiments conducted with yaca and their results. The experiments focused on the performance of yaca in comparison with conventional programming. Section 4.1 describes the methods used for the experiments, Sect. 4.2 lists the materials used for the experiments, and the results are presented in Sect. 4.3. Finally, the conclusions of the experiments can be found in Sect. 4.4.
4.1 Methods
The experiments conducted on yaca focused on performance. On no parallel computing system does the running time of a task decrease linearly with the number of nodes in the cluster. There are three potential causes. First, the bookkeeping of the cluster and the migration of tasks (agents) require computational overhead. Second, it takes some time to distribute tasks evenly over the nodes; during this time some nodes become overloaded while others are still idle. Third, not every task can be split up into subtasks. In our research we do not focus on this last problem; therefore our experiments use a task that can be split up easily.

To investigate to what extent the first two causes influence the performance, a program has been written that searches for prime numbers. It counts the number of primes in a given interval. The program was implemented in two ways: as a pure Java program, running outside yaca, and as an agent, running inside yaca. The pure Java program uses no threads, hence it tests all numbers sequentially. The agent clones itself n times and assigns each child 1/n of the task. The division into subtasks is straightforward. For instance, if the agent has to search for primes between 0 and 20,000,000 and divides the task into 10 subtasks, the first child agent searches for primes between 0 and 2,000,000, the second between 2,000,000 and 4,000,000, and so forth.

The agent program was tested in two different settings: on a single node cluster and on a dual node cluster. The agents on the single node cluster show the overhead of yaca; the agents on the dual node cluster show the overhead of the migration process and the balancing of the cluster. On a single node cluster cloning has no effect, because no child agent can migrate to another node. On the dual node cluster, child agents can migrate to the other node.
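The benchmark task and its division can be sketched as follows; the version below is sequential, whereas in the agent version each subrange would go to one child, as in the pattern of Sect. 3.6. Class and method names are illustrative.

    // Sketch of the benchmark: cut [0, n) into equal subranges, one per
    // child, and count the primes in each subrange by trial division.
    public class PrimeCountSketch {

        static boolean isPrime(long x) {
            if (x < 2) return false;
            for (long d = 2; d * d <= x; d++) {
                if (x % d == 0) return false;
            }
            return true;
        }

        // Count primes in the half-open interval [from, to).
        static long countPrimes(long from, long to) {
            long count = 0;
            for (long x = from; x < to; x++) {
                if (isPrime(x)) count++;
            }
            return count;
        }

        public static void main(String[] args) {
            long n = 20_000_000L;
            int children = 10;
            long chunk = n / children;        // each child receives 1/10 of the interval
            long total = 0;
            for (int i = 0; i < children; i++) {
                long lo = i * chunk;          // 0, 2,000,000, 4,000,000, ...
                long hi = (i == children - 1) ? n : lo + chunk;
                total += countPrimes(lo, hi); // in yaca: delegated to a child agent
            }
            System.out.println("primes below " + n + ": " + total);
        }
    }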
4.2 Materials
The pure Java program and the single node cluster experiment were both run on a system with an AMD Athlon 900 MHz processor and 256 MB of RAM. The operating system used was Windows 2000 Pro, running Aglets 1.1.0 with Java JDK 1.1.8. No other programs were running while conducting the experiments. The dual node cluster experiment was conducted on two of the above systems. The two systems were connected to each other via a 100 Mbit Ethernet LAN. This ensured that the results were not influenced by network delay.
4.3 Results
Table 1 shows the results of the experiments. All experiments searched the interval from 0 up to the number in the first column. The lowest number is 20,000,000, because below this number no differences could be measured between the different experiments. The second, third, and fourth columns of the table display the time in seconds the algorithm needed to finish the task: the second column displays the pure Java program, the third column the single node cluster, and the fourth column the dual node cluster.

Table 1. Time (in seconds) for finishing the task.

    Number        Pure Java   Single node   Dual node
     20,000,000      152          154          155
     40,000,000      410          411          405
     60,000,000      731          732          655
     80,000,000     1103         1108          870
    100,000,000     1518         1518         1215

As can be seen, there is no significant difference between the native Java program and the single node cluster. In the beginning there is also no real difference between the single node cluster and the dual node cluster. Figure 2 depicts the numbers from Table 1.
Fig. 2. Time (in seconds) for finishing the task, by number of investigated integers.
As the search interval grows and the time increases, the balancing algorithm of yaca starts working and a difference can be seen between the single node cluster and the dual node cluster. Table 2 shows the relative gain in time of the dual node cluster over the single node cluster. The theoretical maximum is 50 percent. As can be seen, the gain increases for larger tasks, which take longer to complete; for small intervals the overhead of migration is bigger than
the gain in time. After some period of time the gain stays the same; this is the point where the balancing is at its maximum.

Table 2. Relative gain for the dual node cluster.

    Number        Gain
     20,000,000   -0.65%
     40,000,000    1.46%
     60,000,000   10.52%
     80,000,000   21.48%
    100,000,000   19.96%
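These percentages are consistent with Table 1 when the gain is taken relative to the single node time:

    gain = (t_single - t_dual) / t_single × 100%,

so that, for instance, the 80,000,000 task yields (1108 - 870) / 1108 ≈ 21.48%.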
4.4 Conclusions on the Experiments
Some conclusions can be drawn from the results presented in the previous section. The first conclusion is that the yaca platform itself has little overhead, as shown by the small differences in time between the pure Java program and the agent on the single node cluster. This indicates that yaca consumes few resources for itself.

A second conclusion is that yaca needs large tasks to be effective with multi-node clusters. With small tasks, which take little time to complete, no gain can be achieved by using multi-node clusters. With very small tasks there is even a loss, and it is better to run such tasks on a single node cluster or just as a pure Java program.

As can be seen from the results, the gain with large tasks does not come close to the 50 percent maximum, so the balancing algorithm is not efficient enough and needs more tuning. Especially the Weather Agent and the Controller Agent have to be tuned to obtain a more efficient balancing algorithm. The Weather Agent should inform the Controller Agent more often, or when sudden increases in load occur, instead of only at regular intervals. The Controller Agent should take better actions when a large increase in load occurs; at the moment it only moves one agent at a time. This is open for more experimentation.

From the current results it can be seen that the balancing of the cluster starts to pay off for tasks taking more than approximately 5 minutes, i.e., between the 20,000,000 and 40,000,000 prime search tasks. The maximum efficiency is reached for tasks taking more than about 15 minutes, i.e., between the 60,000,000 and 80,000,000 prime search tasks.
5 Conclusions

We have described yaca, a platform for P2P distributed computing based on mobile agents. There is a Directory Agent for the management of the cluster
and a Controller Agent for the management of each node. The Weather Agent monitors the load of the node and the Account Agent monitors the resources used by the client agents. Furthermore, a simple load-balancing algorithm has been implemented for migrating client agents to other nodes if a node becomes overloaded.

The experiments that were carried out with the implementation showed that yaca introduces little overhead. However, yaca's load-balancing algorithm turned out not to be satisfactory. Fortunately, the algorithm has ample room for improvement. In future research on yaca we will extend the Weather Agent with the monitoring of memory usage, disk usage, and other resources. Moreover, we plan to investigate how the introduction of a market mechanism can lead to more dynamic load balancing. The introduction of payment for computational tasks will also increase the flexibility for the client agents: a task with a high priority can be given more resources by offering a higher price. In [5] we provide more details on the possibilities of market mechanisms for distributed computing.
References

1. Barak, A., Braverman, A., Gilderman, I., La'adan, O.: Performance of PVM with the MOSIX preemptive process migration. In: Proc. 7th Israeli Conf. on Computer Systems and Software Engineering, Herzliya (1996)
2. Nisan, N., London, S., Regev, O., Camiel, N.: Globally distributed computation over the Internet - the POPCORN project. In: Sixth International World Wide Web Conference, Santa Clara, California, USA (1997)
3. Evripidou, P., Samaras, G., Panayiotou, C., Pitoura, E.: The PaCMAn metacomputer: Parallel computing with Java mobile agents. In: Proceedings of the 25th Euromicro Conference, Milan, Italy (1999)
4. Lange, D.B., Oshima, M.: Programming and Deploying Java Mobile Agents with Aglets. Addison-Wesley, Reading, Massachusetts (1998)
5. Kleijkers, S.: Distributed Computing with Multi-Agent Systems. M.Sc. thesis, Universiteit Maastricht, Dept. of Computer Science, CS-01-04 (2001)