Grid Computing Distribution Using Network Processors
Björn Liljeqvist and Lars Bengtsson
Department of Computer Engineering, Chalmers University of Technology, SE-412 96 Göteborg, Sweden
{e6liljan,labe}@ce.chalmers.se
Abstract
This paper suggests a new Grid computing architecture using programmable routers with network processors to distribute computing tasks at wire speed in a grid. Task Message datagrams containing code and data, but without explicit destination addresses, are sent across the network, e.g. the Internet, and are immediately routed to the most appropriate computer by the network processors. Only a return IP address specifying where to send the resulting data is supplied. Ideally, the routers communicate with each other and with the computers according to our Grid Protocol, but a solution also exists for building a similar architecture on top of TCP/IP. Routers keep information on the workload and/or availability of the connected computers, and on the aggregated capacity of the rest of the Grid as seen through the nearest routers. The purpose is to efficiently utilize and load balance all computers on the Grid in a truly distributed fashion, creating a highly scalable Grid and enhanced performance.

1.0 Introduction
The vision at hand is the Great Global Grid, "GGG", where Internet computers around the world cooperate to form one gigantic virtual computer providing processing power in a manner similar to the power grid's electrical supply. If the GGG could become reality, the compute time of unused desktops and servers would no longer be wasted. In fact, proposals [1] exist for a "Grid economy" where compute power is regarded as a resource similar to oil or electricity, sold and purchased over the Internet. In this paper, a Grid architecture is presented based on a new idea of how routing and deployment of computing tasks can be achieved. At the center of the proposed model is the possibility of using Network Processors, NPs, capable of analysing data running through a network at "wire speed".
NPs are a very recent area of research, but the main point of interest in the context of this paper is the programmability that NPs offer to a network, and accordingly to network services. Network services and protocols can become less rigid and more complex as the in-network intelligence of NPs is employed. This proposal suggests that the new possibilities of NPs open the door to building a Grid on the Internet that is fast, simple, and highly utilised. The paper starts by giving some brief background on cluster and grid computing, as well as on NPs. The proposal for a new grid architecture is presented subsequently. A suggestion is made for a Grid Protocol intended for the network level as a complement to IP, but serving the computing role of the Internet as opposed to mere communication. However, the proposal does not necessarily depend on new protocols to be put into practice, as a fully functional model can be placed on top of TCP/IP; this will also be demonstrated. The concept of Task Messages, TMs, which form the basic unit of programs that execute in the Grid, is presented along with the mechanisms for making routing decisions. The final part gives the arguments for the possible benefits of the new architecture, as compared to existing approaches.
2.0 A Background on Grid Computing and Network Processors
The proposed Grid architecture is a merge of ideas from both traditional Grid computing and the new field of Network Processors. The following sections give a brief background on the two.

2.1 Cluster Computing and the Grid
A computational cluster signifies a set of proprietary computers in a private network of any configuration, used to solve problems that require exceptional (numerical) processing. The applications tend to be exclusively scientific problems requiring numerically heavy computations, such as meteorological forecasts, genomic searches, analyses of data produced by particle accelerators, etc. The typical applications share the common property of being easily parallelizable; each computer in the cluster executes its program on its share of the entire data. The term grid is more commonly used when referring to a cluster where the computers, the computational nodes, are geographically distributed over a large area, possibly on a global scale, and the Internet is used as the connecting network. The famous SETI@home project [3] is a good example, in which desktops around the world download a share of the entire data set and report the results of their computations back to the master server. The Parallel Virtual Machine (PVM) is another way of organising a grid of heterogeneous computers [7]. In this case, a Message Passing Interface (MPI) is used, which means that different programs running on different machines communicate through message passing over the network. The goal of PVM is to have the cluster of computers, running different operating systems on different hardware, appear as one single computer to the user. The user may write programs in high-level languages like C or Fortran which are then compiled specifically for the PVM. However, in all grid approaches, the decision of which computer should do what job is left to one server keeping track of the available resources.
This server may be software running on one computer, or distributed over several computers in the network. Further, since the overhead, i.e. the time required for transfer of data and code and particularly for resource allocation, may be large, grids are used for tasks that take plenty of time to complete.

2.2 NPs and Active Networking
Network processors, NPs, are aimed at networking equipment such as routers and switches. The main requirement on network equipment is speed; in a router connected to an OC-192 data carrier, packets enter and leave at 10 Gbit/s, and the NP must classify them, perform necessary modifications such as computing checksums and decrementing the time-to-live field, and forward the packets without creating a bottleneck within the device. The NP is therefore said to operate at "wire speed": no packets are dropped because the NP cannot process them fast enough. Practically all commercial NPs are variants on the same theme: multiple simple RISC cores, "packet processing elements", that perform fast but simple operations on the packet stream; a packet buffer for storing packets; an address look-up table built on CAM (Content Addressable Memory) technology or fast RAM with special hardware for search algorithms; and a switch fabric. The switch fabric, which provides the connection between the NP and the different egress ports, may be a separate device not on the same chip as the NP itself. As network services become increasingly complex and demanding, the need for more flexible, programmable devices increases accordingly. "Network services" is a term generally interpreted as a synonym for Quality-of-Service, but this may not hold in the near future. A lot of current research centers on ways to let the network be more than mere transport of data containers [4]. The idea is to have so-called "programmable routers" processing not only IP packet headers, but the data itself.
That way, if a file is being downloaded from a web server, the format or even the content of the file may be different when it arrives compared to what it looked like when it left its server. Another similar idea is to move away from the rigid process of establishing data communication protocols and let the packets themselves carry a description, a piece of code, of how they want to be processed in the routers. The common term for all these efforts is Active Networking. AN researchers admit that any commercial use is still some years into the future [5], but the idea of making use of in-network intelligence to take over tasks that have until now been the exclusive right of the application layer rests on solid ground.

3.0 Protocols for the Grid
The proposed grid architecture is best implemented using a specifically designed Grid Protocol, GP, that would run in parallel with IP. That way, computers would not send IP packets between one another; nor would they run an application at OSI layer 7 that needs the other layers to perform. However, it is also possible to build the same grid architecture on top of IP. In the latter case, the sending and forwarding of computing tasks would be done one step at a time: from computer to nearest router, and onwards from router to router. TCP connections could be established that would let the tasks move one hop at a time. At every stage, a decision would be made on where to forward a task, and the forwarding would be done by establishing a new TCP connection. This process would be more flexible, and possible to experiment with on today's infrastructure, but it would not be as fast as having a special protocol just for the Grid, one that stays at the network level at all stages of task forwarding.

3.1 The Grid Protocol at the Network Level
The Grid Protocol provides a fully decentralised platform for letting computers share in the execution of programs. There is never a master computer, or any special server, that keeps track of which computer is to do what part of a computing task: this is done by the network itself, at wire speed.
As a comparison, consider a traditional cluster or grid. In a regular grid, the control computers (grid masters, main servers, or whatever term is used) must at all times have knowledge of the grid topology and of the availability and capacity of the computers. Either the master server is responsible for preparing suitable pieces of work to give to the computers, or a computer requests assistance from the master server, which finds an appropriate computer to assist. There is also the peer-to-peer approach, in which no single master server is present; the computers instead communicate in a decentralised way and share their workload with each other. In all cases, though, the computers communicate at the application layer, using FTP or specialised application-layer protocols on top of TCP/IP. If a computer wants assistance, then finding the node to which a task, a piece of data and a program, should be sent requires several steps of communication: the requester contacts the master, the master checks its tables for suitable targets, connects to the target machines, and returns their addresses to the requester, which in turn sends the tasks to these target machines. The process can vary in its details, but the point is that several steps are involved that are not really related to the execution of the tasks themselves. Instead, consider the following model: a computer that needs extra compute power sends out the program along with the data on the network in the form of a Task Message, TM, consisting of one or more GP datagrams. The TM does not have a destination address. It is the job of the router to identify the TM as belonging to the Grid Protocol, and to route it to the local machine that is most suited for the task, or to forward the TM further into the Grid. Once the TM has been received and processed by a machine somewhere in the Grid, the resulting data is sent back to the computer from which the TM originated.
This is the basic concept. The process is similar to IP routing, where routers forward packets along the route that is least heavily loaded with traffic. In the GP, routers route TMs to the place in the Grid where they are best served.
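As a rough sketch of this routing decision, consider the following Python fragment. All names are illustrative, and the per-target performance estimates anticipate the Effective Performance measure described in section 6.0; the paper itself does not prescribe this code.

```python
def route_tm(local_eps, grid_eps):
    """Decide where an incoming, destination-less TM should go.

    local_eps maps LAN computers to estimated performance; grid_eps
    maps egress ports (sub-Grids) to estimated aggregate performance.
    Both maps are hypothetical inputs, not part of the paper's spec.
    """
    best_local = max(local_eps, key=local_eps.get, default=None)
    best_grid = max(grid_eps, key=grid_eps.get, default=None)
    if best_local is not None and (
            best_grid is None or local_eps[best_local] >= grid_eps[best_grid]):
        return ("deliver", best_local)   # hand the TM to a local machine
    return ("forward", best_grid)        # push the TM deeper into the Grid
```

The router never needs a destination address: it only compares what it knows about its own LAN with what it knows about the rest of the Grid.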
3.2 The Grid Protocol at the Application Layer over TCP/IP
In the Wire Speed Grid Project at Chalmers, the Grid Protocol is modelled at the application layer. The behaviour is much the same, but the TMs travel through TCP connections established between each pair of nodes. Whenever a TM leaves a computer, a connection is established with the closest router using port 2741. The TM is transferred in its entirety to the router, which examines its routing table for the best choice of next hop. When a computer (which may be another router, representing "the rest of the Grid") has been selected, a new connection is set up and the process is repeated. This approach of course suffers from larger latencies and overhead, compared to a pure Grid Protocol implementation, because of the additional involvement of TCP/IP. With routers supporting GP, TMs can follow the so-called "fast path" in the routers, meaning that they are forwarded at wire speed, hence the name of this proposal. The application-layer approach requires TMs to follow the "slow path", waiting longer at each hop. Nonetheless, it is an approach that allows for simulations of the basic properties of the architecture.

4.0 Task Messages
Program tasks and data travel in the Grid as Task Messages. A TM is to the Grid what an IP packet is to the Internet. The computer that makes the decision to invoke the Grid sends the program and the data in TMs. The proper dividing, the parallelisation, of the program into TMs could be done at the original computer; another possibility is to equip the TMs with some data on how they can be further divided. That way, a router could split a TM into pieces in linear time and send them along for parallel processing. This option would off-load from the original computer the job of making the optimal division of a task. A schematic picture of a TM is seen in figure 1.
It contains the code to be executed, the data to be operated on, and the return IP address to which the result of the operations will eventually be sent. There is also support for TM fragmentation due to size limits in the data-link layer, and possibly even for reasons of execution: a big task can be divided, in the router at wire speed, into smaller units that are processed in parallel in the Grid.
[Figure 1: Task Message layout: Grid Protocol Identifier | Return address | Size | Task ID | Frag Nr | TTL | Code | Data]
Figure 1. Schematic description of a task message.
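As an illustration of how the fields in figure 1 might be serialised, the sketch below packs and unpacks a TM. The paper specifies neither field widths nor a protocol identifier value, so the 24-byte header, the IPv4 return address, and the magic constant are all assumptions.

```python
import socket
import struct

# Hypothetical on-the-wire layout (all field widths are assumptions):
# 4-byte GP identifier, 4-byte IPv4 return address, 4-byte total size,
# 4-byte task ID, 2-byte fragment number, 1-byte TTL, 1 pad byte,
# 4-byte code length, then the code followed by the data.
GP_MAGIC = 0x47504731  # "GPG1", a made-up protocol identifier
HEADER = struct.Struct("!I4sIIHBxI")

def pack_tm(return_addr, task_id, frag_nr, ttl, code, data):
    """Serialise a Task Message with the figure-1 fields."""
    header = HEADER.pack(GP_MAGIC, socket.inet_aton(return_addr),
                         HEADER.size + len(code) + len(data),
                         task_id, frag_nr, ttl, len(code))
    return header + code + data

def unpack_tm(tm):
    """Recover the figure-1 fields from a serialised Task Message."""
    magic, raddr, size, task_id, frag, ttl, code_len = HEADER.unpack_from(tm)
    assert magic == GP_MAGIC  # identify the datagram as belonging to GP
    code = tm[HEADER.size:HEADER.size + code_len]
    data = tm[HEADER.size + code_len:]
    return socket.inet_ntoa(raddr), task_id, frag, ttl, code, data
```

Note that, as in the paper, the header carries no destination address: the return address is the only location information in the message.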
If the code to be executed is already present at the target machine, as may be the case with the Quicksort or FFT algorithms, the TM can contain a "pointer" to these programs rather than having the code transported over and over again.
What form of code should a TM contain? Since the Grid consists of heterogeneous computers, one idea is to transmit Java bytecode that is executed on virtual machines. The biggest problem with this approach is performance, since Java can never be as fast as regular compiled machine code. Just-in-time compilation, however, has made Java a much faster language than before, and it could be a viable choice when all aspects are weighed. It should be safe, due to the "sand-box" environment of program execution, and it is also easy to implement, which makes it suitable at the initial stages of development. It is a central issue that, from the point of view of the computer sending the TM to the Grid, it is of no importance where the TM is processed; what matters is that the result is delivered without unacceptable delay. Using IP addresses for specifying return addresses is only logical, since the return of resulting data is a matter of communication, as opposed to the distribution of tasks in the GP, which is not dependent on location.
6.0 Routing in the Grid
Grid routing decisions are made by software running on the NPs in the routers, which have to decide where each incoming TM should be sent. To make that decision, a router needs information on the effective performance of the individual computers to which it is connected via a LAN, as well as on the aggregated effective performance of the rest of the Grid, as seen via the other routers to which it is connected. Consider the network in figure 2.

[Figure 2: two LANs of computers (C) connected via Ethernet to routers (R) equipped with network processors (NP); the routers connect the LANs to each other and to the rest of the Grid.]
Figure 2. The Grid architecture.

From the perspective of any single participating computer, the Grid is the sum of all other computers in the network. From the perspective of any router, there are the local area network computers, and then there is the rest of the Grid. That is all the components need to know.
The computers in a LAN are set up for participation in the Grid through the operating system, which informs the router that the computer in question may accept TMs. The routers that "speak" GP, i.e. that form part of the Grid, inform their nearest neighbouring routers of this fact. Each router has a GP routing table, similar to the ordinary IP routing table, but storing performance data for each destination within reach. A network processor has special hardware support for look-up tables such as IP routing tables, and it is therefore simply a matter of software to implement a second routing table based on processing power in the Grid. Fortunately, this GP routing table can be kept
very small in comparison with the IP table, since it only needs one entry for each computer in the LAN, and one entry for each egress port connecting the router to other routers. Each GP route table entry stores the IP address of the previously mentioned computers and routers and their corresponding egress port, along with the Effective Performance, EP. EP is a measure, in e.g. Mflops, of how much compute power a computer can offer. EP is obtained in an indirect way: apart from setup time, when a computer informs the router of its existence and possibly tells of its available capacity, there is no communication of performance data between the computer and the router. The router can tell the EP anyway, by measuring how fast the result of a forwarded TM returns. If the router knows the size and the complexity of the TM, it makes a rough estimate of the EP of a specific target machine by dividing the estimated number of operations (size times complexity) by the time required for the result to be returned. Any machine that receives a TM is marked in the GP routing table as "busy", and it will not receive more TMs until it has finished its work. To calculate the EP of the rest of the Grid, as seen from one router to another, a similar algorithm is used. Each router egress port that connects the router to another network represents a sub-Grid. The EP of the sub-Grid is calculated by dividing the task size by the time it takes to complete. Note that the completion time includes the total travel time for the TM. There may always be plenty of processing power somewhere else, but it is not always optimal simply to send a TM further away in the Grid, since the routing decision must account for total travel time.
When a router makes the division to calculate the EP of a sub-Grid, however, it does not know how big a part of the time required to deliver a result is due to data transfer and how much is actual CPU time; nor does it need to know, for the reasons stated previously. In the GP table, each EP number is updated with every TM result that arrives. Result data arriving ahead of the estimated time causes an increase in EP for the route table entry corresponding to the computer or sub-Grid the result came from; data requiring more time to process than estimated causes a corresponding decrease in EP. It follows that resulting data must travel back to the original computer along the same path as the original TM. This is achieved by adding the IP address of each hop to the TM header along the way. The resulting data is sent back using source routing, i.e. defining from the start which path the data should take. For an incoming TM, the routing strategy is:
1) Select from the GP route table the target having the highest EP.
2) Forward the TM to the selected target.
If support for TM division is included, the TM can be divided and the parts forwarded along different paths. It is worth mentioning that another possible routing approach is a "proximity policy", which makes sure that local resources are always exploited before TMs are forwarded further into the Grid. The proximity policy would be insensitive to EP, and thus easier to implement. However, an intelligent routing policy exploiting the EP measures would yield a computational Grid that behaves much like an electric circuit. Currents always follow the path of least impedance, and they split at the nodes according to Kirchhoff's laws, yielding perfect proportions and load balance. If another network is attached, it is reflected throughout the entire Grid, much like the attachment of one circuit to another.
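The EP bookkeeping described above can be summarised in a small sketch. The class and method names are invented for illustration; only the EP formula (estimated operations divided by round-trip time), the busy flag, and the highest-EP selection rule come from the text.

```python
import time

class GPRouteTable:
    """Hypothetical GP routing table: one entry per LAN computer and
    one per egress port (sub-Grid). EP is in estimated Mflops."""

    def __init__(self):
        self.entries = {}    # target -> {"ep": ..., "busy": ...}
        self.in_flight = {}  # task_id -> (target, estimated_ops, start)

    def add_target(self, target, initial_ep):
        self.entries[target] = {"ep": initial_ep, "busy": False}

    def select_target(self):
        # Routing strategy: the idle target with the highest EP.
        idle = [t for t, e in self.entries.items() if not e["busy"]]
        return max(idle, key=lambda t: self.entries[t]["ep"], default=None)

    def dispatch(self, task_id, target, size, complexity, now=None):
        # Mark the target busy; remember the estimated operation count
        # (size times complexity) and the departure time of the TM.
        self.entries[target]["busy"] = True
        start = now if now is not None else time.time()
        self.in_flight[task_id] = (target, size * complexity, start)

    def result_arrived(self, task_id, now=None):
        # EP is inferred indirectly: estimated operations divided by
        # the time it took for the result to come back.
        target, est_ops, start = self.in_flight.pop(task_id)
        end = now if now is not None else time.time()
        self.entries[target]["ep"] = est_ops / max(end - start, 1e-9)
        self.entries[target]["busy"] = False
```

A fast result automatically raises a target's EP and makes it more attractive for the next TM, which is how the load-balancing feedback loop described above emerges without any explicit performance reporting.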
7.0 The Wire Speed Grid Project
The objective of the "Wire Speed Grid Project", launched at the Computer Engineering Department at Chalmers, is to explore the possibilities offered by the new architecture. A test system is currently being built, which is to be used as the primary vehicle for performance evaluations and further architecture development. The test system currently consists of a number of networked PCs but will in the next step be expanded to utilize computers on the Internet.
7.1 Scalability
To scale a regular grid means administrating hundreds, thousands, or millions of computers. This could be done in a distributed way, though, with a large number of administrators monitoring their local participating computers. Scaling the proposed architecture depends not only on the participating computers, but on the routers as well. The routers have to be programmable network processors running GP software that lets them operate in the Grid, or at least, in the case of overlaying TCP/IP, have host computers that understand the Grid control software. Programming the network is of course another story, but it is expected that a market and an industry will arise for network processor software. The Internet community will look for new ways to exploit the programmable network. On the other hand, if the routers understand the Grid Protocol, nothing more has to be done when building on the Grid, e.g. attaching another network to the original one, since the administration of the Grid is done in the network itself. Accounting for the used computing resources, in economic terms, is a different problem that has to be solved within GP. This and related issues are acknowledged but left for future work.

7.2 Performance
How is performance compared between grids? Computing performance is not an absolute measure; depending on the benchmark applied, performance will be described differently. The Wire Speed Grid project aims at investigating the performance of programs that are presently both common and uncommon in Grid environments. If the total overhead is lower with the proposed scheme, then it is reasonable that the decision to divide a job into TMs and have several computers assist with it could be made for smaller jobs than in a regular Grid. That would certainly yield greater performance for those benchmarks. Another parameter to look at is total Grid utilisation in per cent.
A successful grid architecture should keep the participating computers saturated with computing tasks. To keep utilisation high, common grid applications tend to be heavy mathematical computations that take hours to complete. The initial cost of letting the grid work on a problem is fairly high, so every computer that receives a job should have to work on it for quite some time. The proposal suggests that tasks could be allocated and processed with significantly lower overhead. As said earlier, this makes it worth invoking the Grid even for smaller tasks. It is possible that saturating the Grid with both smaller and larger tasks will increase total utilisation.

7.3 Applicability
Performance is not all; a new Grid architecture may broaden the space of problems that benefit from the Grid. The participating computers need only install a piece of software that lets them accept and execute TM code, and this is done only once. It is not necessary to manually install another program simply because the Grid is to be used for another application. Indeed, the Grid is meant to be used simultaneously for a wide variety of programs, and no adjustments should have to be made between programs. Still, not all programs are parallelizable. It is possible that some restrictions must be placed on the programs that are to be executed in parallel in the Grid, such as prohibiting the use of global variables.

7.4 Fault-Tolerance
The Grid has to be tolerant to different classes of faults. A participating computer may go down in the middle of a job; a router may be dysfunctional or contain bad data in its routing table; resulting data may be corrupted or intentionally sabotaged. Most of these issues are not exclusive to this proposal, and there exists a lot of research on how to deal with them. If a computer goes down in the middle of a
job, the original computer will notice that it does not receive its result, and will retransmit the TM. Alternatively, a policy can be introduced to let TMs send back "I'm alive!" messages to the original computer. If a dysfunctional router cuts off one part of the network, it will affect utilisation, but not generate computing errors. If a router has a faulty routing table that makes cycles appear, there is indeed a problem, but it can be solved by introducing a time-to-live counter in the TMs.

8.0 Conclusion
This paper presents a new Grid architecture based on fully distributed administration residing in the network layer. Programmable routers built on Network Processors make the decisions to route computing tasks, in the form of Task Message datagrams, to the most appropriate destination. The analysis and routing are performed at wire speed, as the data travels through the router. A Task Message contains no explicit destination address; it is the job of the routers to direct it to a computer where it will be attended to. Instead, a Task Message contains the return address to which the resulting data will be sent. The main content of a Task Message is a piece of code and the data to operate on. The most appropriate destination, as seen from a router, is either a computer in a directly connected LAN or a sub-Grid; in the latter case, the Task Message is forwarded to another router, which will make a routing decision of its own. The Wire Speed Grid project at Chalmers University of Technology is currently setting up a test environment for the proposed architecture. The purpose is to see if this form of Grid has higher performance for typical Grid programs, such as massively parallel computations of partial differential equations, and for programs that are generally considered not big enough to be worth executing in a Grid.
The issue of scalability is also addressed, as it seems logical that by employing the very network infrastructure and having it participate in the Grid Protocol, the Grid can grow very easily, simply by attaching more networks.

References:
[1] Buyya, Rajkumar and Vazhkudai, Sudharshan: "Compute Power Market: Towards a Market-Oriented Grid", Proceedings of the First IEEE/ACM International Symposium on Cluster Computing and the Grid, 2001, pp. 574-581.
[2] Sarmenta, Luis F.G.: "Sabotage-Tolerance Mechanisms for Volunteer Computing Systems", Proceedings of the First IEEE/ACM International Symposium on Cluster Computing and the Grid, 2001, pp. 337-346.
[3] http://setiathome.ssl.berkeley.edu/faq.html
[4] Patel, Amit: "Active Network Technology", IEEE Potentials, 2001, 20 (1), pp. 5-10.
[5] Wolf, Tilman and Turner, Jonathan S.: "Design Issues for High-Performance Active Routers", IEEE Journal on Selected Areas in Communications, Vol. 19, No. 3, March 2001, pp. 404-409.
[6] http://www.ibm.com, http://www.intel.com, http://www.motorola.com
[7] http://www.mpiforum.org